Massive distributed processing with H2O
Codemotion,
Milan, 10 November 2017
Gabriele Nocco, Senior Data Scientist
● H2O Introduction
● GBM
● Demo
AGENDA
H2O INTRODUCTION
H2O is an open-source, in-memory Machine Learning engine. Written in Java, it exposes convenient APIs in Java, Scala, Python and R, and also provides a notebook-like user interface called Flow.
This breadth of languages makes the framework accessible to many different professional roles, from analysts to programmers, up to more “academic” data scientists. H2O can therefore serve as a complete infrastructure, from the prototype model to the engineered production solution.
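As a quick illustration of the Python API, here is a minimal sketch of starting a cluster and loading data; the file path is a placeholder and not taken from the slides.

```python
# Sketch: starting H2O and loading a dataset from the Python API.
# The file path is a placeholder, not from the slides.
import h2o

# Starts (or connects to) a local H2O cluster; the Flow web UI is served
# at http://localhost:54321 by default.
h2o.init()

# Data is parsed into the distributed in-memory key-value store as an H2OFrame.
frame = h2o.import_file("data/train.csv")
frame.describe()   # column types, ranges, missing counts
```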
H2O INTRODUCTION - GARTNER
In 2017, H2O.ai became a Visionary in
the Magic Quadrant for Data Science
Platforms:
STRENGTHS
● Market awareness
● Customer satisfaction
● Flexibility and scalability
CAUTIONS
● Data access and preparation
● High technical bar for use
● Visualization and data exploration
● Sales execution
https://www.gartner.com/doc/reprints?id=1-3TKPVG1&ct=170215&st=sb
H2O INTRODUCTION - FEATURES
● H2O Eco-System Benefits:
○ Scalable to massive datasets on large clusters, fully parallelized
○ Low-latency Java (“POJO”) scoring code is auto-generated
○ Easy to deploy on Laptop, Server, Hadoop cluster, Spark cluster, HPC
○ APIs include R, Python, Flow, Scala, Java, JavaScript, REST
● Regularization techniques: Dropout, L1/L2
● Early stopping, N-fold cross-validation, Grid search
● Handling of categorical, missing and sparse data
● Gaussian/Laplace/Poisson/Gamma/Tweedie regression with offsets, observation weights,
various loss functions
● Unsupervised mode for nonlinear dimensionality reduction, outlier detection
● File types supported: CSV, ORC, SVMLight, ARFF, XLS, XLSX, Avro, Parquet
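A hedged sketch of how a few of these bullets (grid search, early stopping, cross-validation and POJO export) look in the Python API; the dataset, column names and output path are hypothetical.

```python
# Sketch: grid search + early stopping + cross-validation + POJO export.
# Dataset, column names and output path are hypothetical.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
frame = h2o.import_file("data/train.parquet")     # Parquet is among the supported formats
frame["label"] = frame["label"].asfactor()
predictors = [c for c in frame.columns if c != "label"]

grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(
        nfolds=5,               # N-fold cross-validation
        stopping_rounds=3,      # early stopping on the chosen metric
        stopping_metric="AUC",
        seed=42),
    hyper_params={"max_depth": [3, 5, 7],
                  "learn_rate": [0.05, 0.1]})
grid.train(x=predictors, y="label", training_frame=frame)

best = grid.get_grid(sort_by="auc", decreasing=True).models[0]
h2o.download_pojo(best, path="/tmp/")   # auto-generated low-latency Java scoring code
```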
H2O INTRODUCTION - ALGORITHMS
H2O INTRODUCTION - ENSEMBLES
In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
Even if your set of base learners does not contain the true prediction function, an ensemble can give a good approximation of that function, and ensembles typically perform better than the individual base algorithms.
You can use an ensemble of weak learners, or combine the predictions of multiple models (Generalized Model Stacking).
Ensembles
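The following is a sketch of model stacking with H2O's Stacked Ensemble in Python. It assumes `train` and `test` H2OFrames, a `predictors` list and a `label` column, all hypothetical here; the base models must be cross-validated on the same folds and keep their fold predictions.

```python
# Sketch: stacking a GBM and a Random Forest with H2O's Stacked Ensemble.
# Frame and column names are hypothetical.
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

# Base models must share fold assignments and keep their CV predictions.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)

gbm = H2OGradientBoostingEstimator(ntrees=50, **common)
gbm.train(x=predictors, y="label", training_frame=train)

rf = H2ORandomForestEstimator(ntrees=50, **common)
rf.train(x=predictors, y="label", training_frame=train)

# The metalearner combines the base models' cross-validated predictions.
stack = H2OStackedEnsembleEstimator(base_models=[gbm, rf])
stack.train(x=predictors, y="label", training_frame=train)
print(stack.model_performance(test).auc())
```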
H2O INTRODUCTION - DRIVERLESS AI
At the research level, machine learning problems are complex and unpredictable, but the reality is that many companies today use machine learning for relatively predictable problems.
Driverless AI is the latest product from H2O.ai, aimed at lowering the barrier to making data science work in a corporate context.
Driverless AI
H2O INTRODUCTION - ARCHITECTURE
H2O INTRODUCTION - ARCHITECTURE
H2O can build Deep Neural Networks natively, or through integration with TensorFlow. It is now possible to build very deep networks (from 5 to 1000 layers!) and to handle huge amounts of data, in the order of GBs or TBs.
Another great advantage is the ability to exploit GPUs to perform the computations.
H2O INTRODUCTION - H2O + TENSORFLOW
With the release of TensorFlow, H2O embraced the wave of enthusiasm around the growth of Deep Learning.
Thanks to Deep Water, H2O lets us interact in a direct and simple way with Deep Learning tools such as TensorFlow, MXNet and Caffe.
H2O INTRODUCTION - H2O + TENSORFLOW
H2O INTRODUCTION - ARCHITECTURE
H2O INTRODUCTION - H2O + SPARK
One of the first plugins developed for H2O was the one for Apache Spark, named Sparkling Water.
Binding to a rising open-source project such as Spark, with the computing power that distributed processing allows, has been a great driving force for the growth of H2O.
A Sparkling Water application runs as a regular Spark job that can be started with spark-submit.
The Spark master then builds the DAG and distributes the execution across the workers, each of which loads the H2O libraries inside its Java process.
H2O INTRODUCTION - H2O + SPARK
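For illustration, here is a minimal PySparkling sketch of starting H2O inside a Spark application, as described above; exact method names vary across Sparkling Water versions, so treat this as an assumption rather than an exact recipe.

```python
# Sketch: starting an H2O cluster inside a Spark application with PySparkling.
# Exact method names vary across Sparkling Water versions.
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sparkling-water-demo").getOrCreate()

# Launches H2O nodes inside the Spark executors' JVM processes.
hc = H2OContext.getOrCreate(spark)

# Move a Spark DataFrame into the H2O key-value store and back.
spark_df = spark.read.csv("data/train.csv", header=True, inferSchema=True)
h2o_frame = hc.asH2OFrame(spark_df)
back_to_spark = hc.asSparkFrame(h2o_frame)
```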
The Sparkling Water solution is of course certified for all the major Spark distributions: Hortonworks, Cloudera, MapR.
Databricks provides a Spark cluster in the cloud, and H2O works perfectly in this environment. H2O Rains with Databricks Cloud!
H2O INTRODUCTION - H2O + SPARK
● H2O Introduction
● GBM
● Demo
AGENDA
Gradient Boosting Machine is one of the most powerful techniques for building predictive models. It can be applied to classification or regression, so it is a supervised algorithm.
It is one of the most widespread and widely used algorithms in the Kaggle community, performing better than SVMs, Decision Trees and Neural Networks in a large number of cases.
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
GBM can be an optimal solution when the size of the dataset or the available computing power does not allow training a Deep Neural Network.
GBM
Gradient Boosting Machine
Kaggle is the biggest platform for Machine Learning competitions in the world.
https://www.kaggle.com/
In early March 2017, Google announced the acquisition of the Kaggle community.
GBM - KAGGLE
GBM - GRADIENT BOOSTING
Summarizing, GBM requires specifying three different components:
● The loss function to be optimized with respect to the new weak learners.
● The specific form of the weak learner (e.g., short decision trees).
● A technique for adding the weak learners together so as to minimize the loss function.
How Gradient Boosting Works
GBM - GRADIENT BOOSTING
The loss function determines the behavior of the algorithm.
The only requirement is differentiability, so that gradient descent can be applied to it. Although you can define arbitrary losses, in practice only a handful are used: for example, regression may use squared error and classification may use logarithmic loss.
Loss Function
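To make this concrete, the two losses mentioned above and their negative gradients with respect to the current model output F(x) can be written as follows (a standard formulation, not taken from the slides):

```latex
% Squared error (regression): the negative gradient is just the residual.
L(y, F(x)) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2,
\qquad
-\frac{\partial L}{\partial F(x)} = y - F(x)

% Logarithmic loss (binary classification, label y \in \{0,1\},
% predicted probability p = \sigma(F(x))):
L(y, F(x)) = -\,y \log p - (1 - y)\log(1 - p),
\qquad
-\frac{\partial L}{\partial F(x)} = y - p
```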
GBM - GRADIENT BOOSTING
In H2O, the weak learners are implemented as decision trees. To allow their outputs to be added together, regression trees (which output real values) are used.
When building each decision tree, the algorithm iteratively selects split points so as to minimize the loss. The depth of the trees can be increased to handle more complex problems.
Conversely, to limit overfitting we can constrain the topology of each tree, e.g. by limiting its depth, the number of splits, or the number of leaf nodes.
Weak Learner
GBM - GRADIENT BOOSTING
In a GBM with squared loss, the resulting algorithm is extremely simple: at each step we train a new tree on the “residual errors” with respect to the previous weak learners.
This can be seen as a gradient descent step with respect to our loss, where all previous weak learners are kept fixed and the gradient is approximated by the new tree (it can be interpreted as optimization in a functional space). This generalizes easily to different losses.
Additive Model
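A compact sketch of this idea with squared loss, fitting each new tree to the current residuals; it uses scikit-learn regression trees purely for illustration and is not H2O's implementation.

```python
# Toy gradient boosting with squared loss: each new tree is fit to the
# residuals (the negative gradient) of the current ensemble.
# Illustrative only -- not H2O's implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learn_rate, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())   # F_0: a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                       # negative gradient of 1/2*(y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    prediction += learn_rate * tree.predict(X)       # gradient step in function space

print("training MSE:", np.mean((y - prediction) ** 2))
```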
GBM - GRADIENT BOOSTING
The output of the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model. In particular, a different weighting parameter is associated with each decision region of the newly constructed tree.
A fixed number of trees is added, or training stops once the loss reaches an acceptable level or no longer improves on an external validation dataset.
Output and Stop Condition
GBM - GRADIENT BOOSTING
Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.
It can benefit from regularization methods that penalize various parts of the algorithm and generally improve its performance by reducing overfitting.
There are four common enhancements to basic gradient boosting:
● Tree Constraints
● Learning Rate (shrinkage)
● Stochastic Gradient Boosting
● Penalized Learning (L1 or L2 regularization of the regression trees' output)
Improvements to Basic Gradient Boosting
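As an illustration, here is a sketch of how the first three of these enhancements map onto H2O's GBM parameters; the L1/L2 penalty on leaf outputs is typical of XGBoost-style implementations rather than a standard H2O GBM parameter. The frames and column names (`train`, `valid`, `predictors`, `label`) are again hypothetical.

```python
# Sketch: the regularization knobs discussed above, expressed as H2O GBM
# parameters. Frame and column names are hypothetical.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=500,
    max_depth=4,              # tree constraints: shallow trees
    min_rows=10,              # tree constraints: minimum observations per leaf
    learn_rate=0.05,          # shrinkage: each tree contributes only a small step
    sample_rate=0.8,          # stochastic gradient boosting: row subsampling per tree
    col_sample_rate=0.8,      # stochastic gradient boosting: column subsampling per split
    stopping_rounds=5,        # stop adding trees when the validation metric stalls
    stopping_metric="logloss",
    seed=42)
gbm.train(x=predictors, y="label",
          training_frame=train, validation_frame=valid)
```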
● H2O Introduction
● GBM
● Demo
AGENDA
Q&A
mail: gabriele.nocco@gmail.com
meetup: https://www.meetup.com/it-IT/Machine-Learning-Data-Science-Meetup/
IAML - Italian Association for Machine Learning: https://www.iaml.it/