Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and it enables stacking from all the H2O APIs: Python, R, Scala, etc. Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from the University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence-curve-based variance estimation, and statistical computing.
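As a rough sketch of what this looks like from the Python API (the dataset, columns, and base learners below are placeholders, not taken from the talk):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator, H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

h2o.init()
train = h2o.import_file("train.csv")        # placeholder dataset
x, y = train.columns[:-1], train.columns[-1]

# Base learners must be cross-validated on the same folds and must keep
# their cross-validation predictions so the metalearner can be trained.
gbm = H2OGradientBoostingEstimator(nfolds=5, fold_assignment="Modulo",
                                   keep_cross_validation_predictions=True)
gbm.train(x=x, y=y, training_frame=train)

rf = H2ORandomForestEstimator(nfolds=5, fold_assignment="Modulo",
                              keep_cross_validation_predictions=True)
rf.train(x=x, y=y, training_frame=train)

# The Stacked Ensemble learns the optimal combination of the base models.
ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, rf])
ensemble.train(x=x, y=y, training_frame=train)
```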
How Deep Learning Will Make Us More Human Again: while deep learning is taking over the AI space, most of us are struggling to keep up with the pace of innovation. Arno Candel shares success stories and challenges in training and deploying state-of-the-art machine learning models on real-world datasets. He also shares his insights into what the future of machine learning and deep learning might look like, and how best to prepare for it.
Sarah Guido gave a presentation on analyzing data with Python. She discussed several Python tools for preprocessing, analysis, and visualization including Pandas for data wrangling, scikit-learn for machine learning, NLTK for natural language processing, MRjob for processing large datasets in parallel, and ggplot for visualization. For each tool, she provided examples and use cases. She emphasized that the best tools depend on the type of data and analysis needs.
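As a rough illustration of how two of these pieces fit together (the file and column names are made up for the example):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Data wrangling with pandas: load, clean, and select features.
df = pd.read_csv("events.csv")                      # hypothetical dataset
df = df.dropna(subset=["label"])
X = df[["feature_a", "feature_b"]].fillna(0)
y = df["label"]

# Modeling with scikit-learn.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))
```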
Introduction to Analytics with Azure Notebooks and Python for Data Science and Business Intelligence. This is one part of a full-day workshop on moving from BI to Analytics.
Ted Willke, Principal Engineer/GM, Intel Labs: "Avoiding Cluster-Scale Headaches with Better Tools for Data Quality and Feature Engineering"
How data science pipelines have to evolve and how they can be made accessible with the right technologies, from Scala to the Spark Notebook.
This document summarizes an introduction to data analysis in Python using Wakari. It discusses why Python is a good language for data analysis, highlighting key Python packages like NumPy, Pandas, Matplotlib and IPython. It also introduces Wakari, a browser-based Python environment for collaborative data analysis and reproducible research. Wakari allows sharing of code, notebooks and data through a web link. The document recommends several talks at the PyData conference on efficient computing, machine learning and interactive plotting.
We’ve all heard that AI is going to become as ubiquitous in the enterprise as the telephone, but what does that mean exactly? Everyone at IBM has a telephone, and everyone knows how to use their telephone, and yet IBM isn’t a phone company. How do we bring AI to the same standard of ubiquity, where everyone in a company has access to AI and knows how to use AI, and yet the company is not an AI company? In this talk, we’ll break down the challenges a domain expert faces today in applying AI to real-world problems. We’ll talk about the challenges a domain expert needs to overcome in order to go from “I know a model of this type exists” to “I can tell an application developer how to apply this model to my domain.” We’ll conclude the talk with a live demo that showcases how a domain expert can cut through the five stages of model deployment in minutes instead of days using IBM and other open source tools.
This document discusses pandas, a popular Python library for data analysis, and its limitations. It introduces Badger, a new project from DataPad that aims to address some of pandas' shortcomings like slow performance on large datasets and lack of tight database integration. The creator describes Badger as using compressed columnar storage, immutable data structures, and C kernels to perform analytics queries much faster than pandas or databases on benchmark tests of a multi-million row dataset. He envisions Badger becoming a distributed, multicore analytics platform that can also be used for ETL jobs.
Wes McKinney discusses challenges in building better analytics workflows. He notes the increasing scale of data and need for more advanced analytics has led more people to learn programming. However, current tools have issues with inefficient workflows, lack of collaboration, and friction between different parts of the analytics process. McKinney advocates for more integrated environments that enhance collaboration and make data science more accessible to address these problems.
This slide deck is from the webinar "Everything you always wanted to know about Machine Learning but did not know where to ask."
This document discusses scalable ensemble learning using the H2O platform. It provides an overview of ensemble methods like bagging, boosting, and stacking. The stacking or Super Learner algorithm trains a "metalearner" to optimally combine the predictions from multiple "base learners". The H2O platform and its Ensemble package implement Super Learner and other ensemble methods for tasks like regression and classification. An R code demo is presented on training ensembles with H2O.
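The demo in the talk uses H2O from R; purely to illustrate the Super Learner idea itself, here is a small scikit-learn sketch (not the H2O API) in which a metalearner is fit on the out-of-fold predictions of the base learners:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=0)

base_learners = [RandomForestClassifier(random_state=0),
                 GradientBoostingClassifier(random_state=0)]

# Level-one data: cross-validated (out-of-fold) predictions of each base learner.
Z = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# The metalearner learns how to optimally combine the base learner predictions.
metalearner = LogisticRegression().fit(Z, y)

# Refit the base learners on all of the data for use at prediction time.
for m in base_learners:
    m.fit(X, y)
```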
These slides are from my talk at the NYC Python Meetup at the ODSC Office NYC on February 17, 2016. The talk discusses Python's architectural challenges in interoperating with the Hadoop ecosystem and how a new project, Apache Arrow, will help.
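A minimal example of the columnar hand-off Arrow enables on the Python side, using the pyarrow API (which matured after this talk; the data here is invented):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"user": ["a", "b", "c"], "clicks": [3, 1, 7]})

# Convert the pandas DataFrame to an Arrow table: a columnar,
# language-independent in-memory format that JVM and native tools can share.
table = pa.Table.from_pandas(df)
print(table.schema)

# ...and back again without re-serializing through CSV or pickle.
round_tripped = table.to_pandas()
```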
This document provides an introduction and agenda for a tutorial on machine learning with H2O and Python. The introduction discusses the presenter's background and qualifications. The agenda outlines topics to be covered including an overview of H2O.ai as a company and machine learning platform, tutorials on using the H2O Python module to import data, build regression and classification models, and improve model performance through techniques like cross-validation, grid search, and stacking. Case study notebooks and examples will be used to demonstrate key machine learning concepts and the H2O framework in Python.
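For instance, a hedged sketch of what cross-validated grid search looks like with the H2O Python module (the dataset and hyperparameter values are placeholders):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")        # placeholder binary classification data
x, y = train.columns[:-1], train.columns[-1]

# Candidate hyperparameters to search over.
hyper_params = {"max_depth": [3, 5, 7],
                "learn_rate": [0.05, 0.1]}

grid = H2OGridSearch(model=H2OGradientBoostingEstimator(nfolds=5),
                     hyper_params=hyper_params)
grid.train(x=x, y=y, training_frame=train)

# Models ranked by cross-validated AUC.
print(grid.get_grid(sort_by="auc", decreasing=True))
```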
H2O Deep Water is a tool that integrates distributed deep learning with H2O's machine learning platform. It allows users to build, stack, and deploy deep learning models from libraries like TensorFlow, MXNet, and Caffe through a unified interface. Deep Water inherits properties from H2O like scalability, ease of use, and deployment capabilities. It also makes deep learning more accessible by supporting popular network architectures and allowing easy ensemble of deep models with other H2O algorithms.
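Deep Water has since been retired, but at the time a model could be built through the same training API as any other H2O algorithm; a rough sketch, assuming the Deep Water Python estimator of that era (dataset and parameter values are illustrative only):

```python
import h2o
from h2o.estimators.deepwater import H2ODeepWaterEstimator

h2o.init()
frame = h2o.import_file("images.csv")   # placeholder: image paths plus labels

# Train a convolutional network via the MXNet backend; other backends
# ("tensorflow", "caffe") and networks were selected the same way.
model = H2ODeepWaterEstimator(backend="mxnet", network="lenet", epochs=10)
model.train(x=[0], y=1, training_frame=frame)
```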
This document summarizes machine learning scalability from single machine to distributed systems. It discusses how true scalability is about how long it takes to reach a target accuracy level using any available hardware resources. It introduces GraphLab Create and SFrame/SGraph for scalable machine learning and graph processing. Key points include distributed optimization techniques, graph partitioning strategies, and benchmarks showing GraphLab Create can solve problems faster than other systems by using fewer machines.
Slides for a presentation I gave at the Machine Learning with Spark Tokyo meetup. Introduction to Spark, H2O, and Sparkling Water, with live demos of GBM and DL.
Spark & GraphX for recommendation algorithms - presented at a Netflix-hosted Spark Meetup, on 05/19/2015
The document discusses using machine learning to solve time series problems. It outlines typical steps, which include converting irregular time series data to regularly sampled data, preprocessing the data with techniques such as tapped delay lines and dynamic filters so that static models can capture dynamic behavior, and then applying machine learning. Both static models combined with dynamic preprocessing and inherently dynamic models such as recurrent neural networks are suitable. Example applications include time series forecasting, classification, and control/decision-making problems.
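For example, a tapped delay line simply turns a regularly sampled series into fixed-length feature vectors of past values that any static model can consume; a small NumPy sketch:

```python
import numpy as np

def tapped_delay_line(series, n_lags):
    """Return (X, y) where each row of X holds the previous n_lags values
    of the series and y is the value that immediately follows them."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

series = np.sin(np.linspace(0, 20, 200))      # toy, regularly sampled signal
X, y = tapped_delay_line(series, n_lags=5)
# X and y can now be fed to any static model (linear regression, GBM, ...).
```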
The document traces the history and development of artificial intelligence, machine learning, and deep learning from the 1950s to the present. It introduces H2O.ai's Deep Water platform for deep learning, which leverages popular open source tools like TensorFlow, MXNet, and Caffe to enable large-scale deep learning on CPUs and GPUs. Deep Water allows users to easily train, compare, and deploy deep learning models through H2O's APIs in R, Python, and Flow for big data use cases such as image, video, text, and time series analysis.
Dmitry will show the audience how to get started with MXNet and how to build deep learning models to classify images, sound, and text.
Ray Peck from H2O.ai talks about the roadmap for the upcoming AutoML product in H2O.