Data Scientist Allison Baker and Development Manager of Data Products Cody Hall work with a talented team of data scientists, software engineers, and web developers to build the framework and infrastructure for a real-time prediction application that can scale across the entire company. Paramount to these efforts has been integrating the production software architecture with the predictive models generated by H2O. This talk reviews how HCA is building a pipeline to predict patient outcomes in real time, relying heavily on H2O's POJO scoring API and implemented as a Clojure data-processing pipeline. #h2ony
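The HCA pipeline itself runs in Clojure on the JVM and its internals are not spelled out in the abstract, but the POJO piece can be illustrated from the model-building side. Below is a minimal, hypothetical sketch using the h2o Python client: train a model and export it as a POJO (a plain Java class) that a JVM scoring service can compile and embed. The dataset, columns, and paths are made up for illustration.

```python
# Minimal sketch: train a model in H2O and export its POJO for JVM-side scoring.
# The dataset, column names, and output path are hypothetical; the real HCA
# pipeline embeds the generated POJO class in a Clojure/JVM service.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Hypothetical training frame with a binary patient-outcome target.
train = h2o.import_file("patients.csv")
train["outcome"] = train["outcome"].asfactor()

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=["age", "length_of_stay", "lab_score"], y="outcome", training_frame=train)

# Export the model as a plain Java class (POJO) that a JVM service can compile
# and call directly, with no H2O cluster needed at scoring time.
h2o.download_pojo(model, path="./pojo_out")
```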
With data as a valuable currency and the architecture of reliable, scalable data lakes and lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep pace in order to realize that value. Reproducibility, efficiency, and governance in training and production environments rest on both point-in-time snapshots of the data and a governing mechanism to regulate, track, and make the best use of the associated metadata. This talk outlines the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions, and proposes solutions built on open source technologies, namely Delta Lake for data versioning and MLflow for efficiency and governance.
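As a concrete illustration of the pattern this abstract describes, here is a minimal sketch, assuming a Spark session with Delta Lake configured (for example on Databricks) and hypothetical table paths, metrics, and model: pin the Delta table version used for training and record it alongside the run in MLflow, so the experiment can be reproduced against the same snapshot.

```python
# Sketch of the reproducibility pattern described above: read a point-in-time
# snapshot of a Delta table and record the exact version with the MLflow run.
# Paths, columns, and the model itself are placeholders.
import mlflow
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_path = "/mnt/lake/features"  # hypothetical Delta table

# Latest version of the table at training time.
version = DeltaTable.forPath(spark, table_path).history(1).collect()[0]["version"]

# Train against that pinned snapshot so the run can be reproduced exactly.
features = spark.read.format("delta").option("versionAsOf", version).load(table_path)

with mlflow.start_run(run_name="churn_model"):
    mlflow.log_param("delta_table", table_path)
    mlflow.log_param("delta_version", version)
    # ... fit a model on `features`, then log metrics and the model artifact ...
    mlflow.log_metric("auc", 0.91)  # placeholder value
```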
https://www.bigdataspain.org/2016/program/fri-vp-ww-partners.html https://www.youtube.com/watch?v=LweVVm9n4y4&t=55s&index=8&list=PL6O3g23-p8Tr5eqnIIPdBD_8eE5JBDBik
1) The document discusses how Apache Spark is enabling enterprises to analyze large amounts of data from a variety of sources in real-time to gain insights. 2) It provides examples of how companies are using Spark for applications like online ad personalization, web log analysis, and predictive analytics. 3) The document also outlines trends in Spark adoption in enterprises and strategies for Hortonworks to help further Spark's capabilities and make it easier for enterprises to implement agile analytics and data science.
Systems architecture at TravelBird to support big data analytics, machine learning, and personalization with minimal overhead.
Sarah: The CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. This issue has not been seen in the last 90 days.
Jeeves: Adding 5 more nodes to the cluster is recommended for CEO-Finance-Report to finish within its 99th percentile time of 5.2 hours.

Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. The chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.

We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:

Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. The run of pipeline cmp_incremental_daily failed on 2/28/2021.

This talk will give an overview of the chatbot's newer capabilities and how it now fits in a modern data stack, alongside the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.
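To make the interaction pattern concrete, here is a deliberately toy sketch of the request-to-telemetry-lookup-to-answer loop such a chatbot performs. The pipeline name, telemetry values, and matching logic are hypothetical and hard-coded; the real Jeeves relies on learned models over streaming telemetry rather than a lookup table.

```python
# Toy illustration of the request -> telemetry lookup -> answer loop a data-ops
# chatbot performs. The real Jeeves uses learned models over streaming telemetry;
# this hard-coded version only shows the shape of the interaction.
import re

# Hypothetical telemetry store keyed by pipeline name.
TELEMETRY = {
    "CEO-Finance-Report": {
        "slow_stage": "dbt_fin_model",
        "slowdown_pct": 53,
        "p99_hours": 5.2,
        "suggested_extra_nodes": 5,
    }
}

def answer(question: str) -> str:
    match = re.search(r"([\w-]+) pipeline", question)
    if not match or match.group(1) not in TELEMETRY:
        return "I could not find telemetry for that pipeline."
    name = match.group(1)
    t = TELEMETRY[name]
    return (
        f"{t['slow_stage']} in {name} is running {t['slowdown_pct']}% slower than usual. "
        f"Adding {t['suggested_extra_nodes']} nodes should let it finish within its "
        f"99th percentile time of {t['p99_hours']} hours."
    )

print(answer("CEO-Finance-Report pipeline seems to be slow today. Why?"))
```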
This document discusses how Caserta Concepts used Apache Spark to help a customer master their customer data by cleaning, standardizing, matching, and linking over 6 million customer records and hundreds of millions of data points. Traditional customer data integration approaches were prohibitively expensive and slow for this volume of data. Spark enabled the data to be processed 10x faster by parallelizing data cleansing and transformation. GraphX was also used to model the data as a graph and identify linked customer records, reducing survivorship processing from 2 hours to under 5 minutes.
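The talk's graph step used GraphX, which is a Scala API; as an illustration of the same idea in Python, the sketch below uses the GraphFrames package (an assumption, not the tool named in the abstract). Record pairs judged to match become edges, and connected components group every record that refers to the same customer. Table contents and column names are hypothetical, and connectedComponents requires a checkpoint directory.

```python
# Record-linkage sketch in the spirit of the talk: the original work used GraphX
# (Scala); this uses the GraphFrames Python package to show the same idea.
# Matched record pairs become edges, and connected components group all records
# that refer to the same customer. Column names are hypothetical.
from graphframes import GraphFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # required by connectedComponents

# Vertices: one row per raw customer record.
vertices = spark.createDataFrame(
    [("r1", "Jon Smith"), ("r2", "John Smith"), ("r3", "Jane Doe")],
    ["id", "name"],
)
# Edges: pairs that the matching rules judged to be the same person.
edges = spark.createDataFrame([("r1", "r2")], ["src", "dst"])

g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # adds a 'component' column grouping linked records
components.show()
```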
Users are constantly searching for new content, and to stay competitive, organizations must act immediately on up-to-date data. Outdated recommendations decrease the likelihood of presenting the right offer and make it harder to maintain customer loyalty. In order to provide the most relevant recommendations and increase engagement, organizations must track customer interactions and re-score recommendations on the fly. Data sources have expanded dramatically to include a wealth of historical data and a constant influx of behavioral data. The key to moving from predictive models applied in batch to models that provide responses in real time is to focus on the efficiency of model application. The speed at which recommendations can be served is influenced by the architecture of the recommendation serving platform, the choice of recommendation algorithm, and datastore access patterns. In this presentation, we’ll discuss how developers can use open source components like HBase and Kiji to develop low-latency recommendation models that can be easily deployed by e-commerce companies. We will give practical advice on how to choose models and design data stores that make use of the architecture and quickly serve new recommendations.
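As a rough illustration of the serving-side access pattern, the sketch below assumes recommendations are precomputed and stored one row per user in HBase, then fetched with a single row get at request time. The abstract's stack uses Kiji's Java API on top of HBase; this example substitutes the happybase Python client, and the table, column family, and encoding are hypothetical.

```python
# Sketch of the low-latency serving pattern: recommendations are precomputed and
# stored per user in HBase, then fetched with a single row lookup at request time.
# The talk's stack used Kiji's Java API on top of HBase; this uses the happybase
# Python client instead. Table and column names are hypothetical.
import happybase

connection = happybase.Connection("hbase-host")          # Thrift gateway
table = connection.table("user_recommendations")

def get_recommendations(user_id: str, limit: int = 10) -> list[str]:
    # Single-row get: one network round trip, suitable for low-latency serving.
    row = table.row(user_id.encode("utf-8"))
    items = row.get(b"recs:items", b"")
    if not items:
        return []
    return items.decode("utf-8").split(",")[:limit]

print(get_recommendations("user-42"))
```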
At Intuit, we have a lot of data – and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. It takes experimentation and iteration to get deduplication just right for 100s of millions of records, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, etc., all without requiring end-user action. In this talk, we’ll walk through our use case and explain why MLflow is useful outside its traditional MLOps scenarios.
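A minimal sketch of the kind of automatic capture described above, with hypothetical rule names, weights, and validation metrics: each deduplication experiment is wrapped in an MLflow run so its settings and results are logged without spreadsheet bookkeeping. (Intuit's tool records these without end-user action; this sketch only shows the MLflow calls involved.)

```python
# Sketch of using MLflow as an experiment log for rule-based deduplication runs:
# rule settings, weights, and validation metrics are captured per run, so no
# spreadsheet tracking is needed. Rule names, weights, and metrics are placeholders.
import mlflow

rule_config = {
    "match_on_email": True,
    "name_similarity_threshold": 0.92,
    "address_weight": 0.3,
}

with mlflow.start_run(run_name="dedup-experiment-17"):
    mlflow.log_params(rule_config)                      # rule settings and weights
    mlflow.set_tag("notes", "tightened name threshold after false merges")

    # ... run the dedup job, then validate against a labeled sample ...
    mlflow.log_metric("duplicate_pairs_found", 1_250_000)  # placeholder
    mlflow.log_metric("false_merge_rate", 0.004)            # placeholder
```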
In this presentation, drawing upon Thorogood’s experience as the MLOps delivery partner for a customer’s global Data & Analytics division, we share important learnings and takeaways from delivering productionized ML solutions and from shaping the MLOps best practices and organizational standards needed to be successful. We open by providing high-level context and answering key questions such as “What is MLOps exactly?” and “What are the benefits of establishing MLOps standards?” The remainder of the presentation focuses on our learnings and best practices. We start by discussing common challenges when refactoring experimentation use cases and how best to get ahead of these issues in a global organization. We then outline an engagement model for MLOps addressing people, processes, and tools. ‘Processes’ highlights how to manage the often siloed data science use-case demand pipeline for MLOps and the documentation needed to facilitate seamless integration with an MLOps framework. ‘People’ provides context around the appropriate team structures and roles to involve in an MLOps initiative. ‘Tools’ addresses the key requirements of MLOps tooling, considering the match of services to use cases.
Fraud is prevalent in every industry and growing at an increasing rate as the volume of transactions increases with automation. The National Healthcare Anti-Fraud Association estimates $350B of fraudulent spending. Forbes estimates $25B of spending by US banks on anti-money laundering compliance. At the same time as fraud and anomaly detection use cases are booming, the skills gap of expert data scientists available to perform fraud detection is widening. The Kavi Global team will present a cloud-native, wizard-driven AI anomaly detection solution that enables citizen data scientists to easily create anomaly detection models to automatically flag collective, contextual, and point anomalies at the transaction level, as well as collusion between actors. Unsupervised methods (distribution, clustering, association, sequencing, historical occurrence, custom rules) and supervised models (random forest, neural network) are executed in Apache Spark on Databricks. An innovative aggregation framework converts probabilistic fraud scores into a meaningful, actionable, prioritized list of suspicious (statistically outlying) and potentially fraudulent transactions to be investigated from a business point of view. The AI anomaly detection models improve over time using human-in-the-loop feedback to label data for supervised modeling. Finally, the Kavi team reviews the anomaly lifecycle: from statistical outlier, to validated business fraud eligible for reclaim, to business process changes and long-term prevention strategies that use proactive audits upstream at the time of estimate to prevent revenue leakage. Two client success stories will be presented, from the pharmaceutical (Rx) and transportation industries.
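To illustrate the aggregation idea in the abstract, here is a small PySpark sketch with made-up method names, scores, and weights: several detection methods each emit a per-transaction anomaly score, and a weighted combination yields a single suspicion score used to rank transactions into a prioritized investigation queue.

```python
# Sketch of the score-aggregation idea: several detection methods each emit a
# per-transaction anomaly score, and a weighted combination produces one ranked,
# actionable list for investigators. Column names and weights are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

scores = spark.createDataFrame(
    [("tx1", 0.95, 0.20, 0.80), ("tx2", 0.10, 0.05, 0.15), ("tx3", 0.60, 0.90, 0.70)],
    ["transaction_id", "clustering_score", "sequence_score", "rules_score"],
)

weights = {"clustering_score": 0.4, "sequence_score": 0.3, "rules_score": 0.3}

# Weighted sum of the individual method scores.
suspicion = sum(F.col(c) * w for c, w in weights.items())
ranked = (
    scores.withColumn("suspicion_score", suspicion)
          .orderBy(F.col("suspicion_score").desc())
)
ranked.show()  # top rows form the prioritized investigation queue
```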
Michal Malohlava's presentation on Building Your Own Recommendation Engine, 03.17.16.
1) eBay's enterprise data platform uses Apache Spark and Hadoop to process large amounts of structured and unstructured data from various sources to power applications and analytics. 2) Key aspects of the platform include an agile data warehouse, data streams platform using Apache Kafka, and data services to simplify access to data and enable collaborative analytics. 3) eBay leverages this platform to power applications such as search, personalization, fraud prevention, and business intelligence through pipelines that ingest behavioral and transactional data.
1) Initially, the data science and engineering teams at Overstock worked independently and were not regularly delivering business value or solving problems in real-time. 2) They came together to solve problems like real-time bidding, where they needed to score users and bid on ads within 10 milliseconds. 3) Over the next 6 months, they improved from scoring users daily to hourly to within minutes by streamlining processes and moving from batch to micro-batch processing. However, they still needed to get faster to enable real-time personalization on the site.
Presented at KDD, August 11, 2015. Abstract of the paper: Machine learning techniques have proved effective in recommender systems and other applications, yet teams working to deploy them lack many of the advantages that those in more established software disciplines today take for granted. The well-known Agile methodology advances projects in a chain of rapid development cycles, with subsequent steps often informed by production experiments. Support for such workflow in machine learning applications remains primitive. The platform developed at if(we) embodies a specific machine learning approach and a rigorous data architecture constraint, so allowing teams to work in rapid iterative cycles. We require models to consume data from a time-ordered event history, and we focus on facilitating creative feature engineering. We make it practical for data scientists to use the same model code in development and in production deployment, and make it practical for them to collaborate on complex models. We deliver real-time recommendations at scale, returning top results from among 10,000,000 candidates with sub-second response times and incorporating new updates in just a few seconds. Using the approach and architecture described here, our team can routinely go from ideas for new models to production-validated results within two weeks.
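The following sketch, with a hypothetical event schema, illustrates the paper's central constraint: features are computed purely from a time-ordered event history, so the identical feature code can be replayed against historical cut-off times for training and against the live event stream for serving.

```python
# Illustration of the paper's core constraint: features are functions of a
# time-ordered event history, so the exact same code can compute them during
# offline training (replaying history up to a label time) and during online
# scoring (replaying history up to "now"). The event schema is hypothetical.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    user_id: str
    kind: str          # e.g. "message_sent", "profile_view"
    timestamp: datetime

def messages_sent_last_7_days(events: list[Event], as_of: datetime) -> int:
    """Feature value for one user, computed only from events at or before `as_of`."""
    return sum(
        1
        for e in events
        if e.kind == "message_sent"
        and e.timestamp <= as_of
        and (as_of - e.timestamp).days < 7
    )

# Training uses historical cut-off times for `as_of`; serving calls the same
# function with as_of=datetime.utcnow(), so there is no train/serve skew in
# the feature logic.
```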
NoSQL and SQL databases can work together to handle real-time big data needs. Apache Drill is an open source tool that allows interactive analysis of big data using standard SQL queries across NoSQL, Hadoop, and relational data sources. It provides low-latency queries, full ANSI SQL support, and flexibility to handle rapidly evolving schemas and data in different systems. By enabling analysis of all data together using a common interface, it helps tackle challenges of combining operational and decision support systems on big, diverse datasets.
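As a rough illustration, the sketch below sends a single ANSI SQL query to Drill's REST API that joins a raw JSON file (via the dfs storage plugin) with a Hive table. The host, storage plugin names, file path, and columns are examples that depend entirely on how a given Drill installation is configured.

```python
# Sketch of the "one SQL interface over many stores" idea: a single ANSI SQL
# query joining a raw JSON file (dfs storage plugin) with a Hive table, sent to
# Drill's REST API. Host, plugin names, paths, and columns are examples that
# depend on how Drill is configured.
import requests

query = """
SELECT o.order_id, o.total, c.segment
FROM dfs.`/data/raw/orders.json` o
JOIN hive.customers c ON o.customer_id = c.customer_id
WHERE o.total > 100
"""

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
    timeout=60,
)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```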