Sparkling Water Meetup
@h2oai & @mmalohlava presents
Memory efficient
High-performance computation
Machine learning algorithms
Parser, GUI, R interface
User-friendly API for data transformation
Large and active community
Platform components - SQL
Multitenancy
Sparkling Water
[Architecture diagram: Spark and H2O running side by side on HDFS, combined into a single Sparkling Water stack]
RDD = the immutable world
DataFrame = the mutable world
[Architecture diagram: the combined stack unifies Spark RDDs and H2O DataFrames on top of HDFS]
Sparkling Water Provides
Transparent integration into the Spark ecosystem
A pure H2ORDD encapsulating an H2O DataFrame
Transparent use of H2O data structures and algorithms with the Spark API
Excels in Spark workflows requiring advanced machine learning algorithms
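What this transparent integration looks like in code, as a minimal sketch built from the calls that appear later in this deck (asRDD and the implicit conversions brought into scope by import h2oContext._); an illustration, not a complete program:

import org.apache.spark.h2o._

// Start H2O inside the running Spark cluster (sc is the Spark context)
val h2oContext = new H2OContext(sc).start()
import h2oContext._   // implicit conversions between the Spark and H2O worlds

// H2O -> Spark: wrap an H2O DataFrame as a typed Spark RDD
//   val airlinesTable = asRDD[Airlines](airlinesData)
// Spark -> H2O: an RDD is converted implicitly wherever an H2O
// DataFrame is expected, e.g. dlParams._train = trainTable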
Sparkling Water Design
[Deployment diagram: spark-submit starts a Spark Master JVM and several Spark Worker JVMs; inside the Sparkling Water cluster, each Spark Executor JVM embeds an H2O instance]
The submitted Sparkling App contains the application and Sparkling Water classes.
Data Distribution
[Diagram: each Spark Executor JVM in the Sparkling Water cluster holds both an H2O node and Spark RDD partitions; the H2O RDD is loaded from a data source such as HDFS]
RDDs and DataFrames share the same memory space.
Devel Internals
[Diagram: the Sparkling Water Assembly bundles H2O Core, H2O Algos, the H2O Scala API, H2O Flow, and Sparkling Water Core on top of the Spark platform (Spark Core, Spark SQL), together with the application code]
The assembly is deployed to the Spark cluster as a regular Spark application.
Hands-On #1: Sparkling Shell
Sparkling Water Requirements
Linux or Mac OS X
Oracle Java 1.7+
Spark 1.1.0
Download
http://h2o.ai/download/
Where is the code?
https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/
Flight delays prediction
“Build a model using weather and flight data to predict delays of flights arriving at Chicago O’Hare International Airport”
Example Outline
Load & parse CSV data from 2 data sources
Use the Spark API to filter data and a SQL query for the join
Create regression models
Use the models to predict delays
Graph the residual plot from R
Install and Launch
1. Unpack the zip file
2. Point SPARK_HOME to your Spark 1.1.0 installation
3. Launch bin/sparkling-shell
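The same steps as shell commands, sketched under the assumption that the archive unpacks into a directory named after the release you downloaded from http://h2o.ai/download/:

unzip sparkling-water.zip                  # illustrative archive name
cd sparkling-water                         # illustrative directory name
export SPARK_HOME=/path/to/spark-1.1.0     # your Spark 1.1.0 installation
bin/sparkling-shell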
What is Sparkling Shell?
A standard spark-shell with additional Sparkling Water classes:

export MASTER="local-cluster[3,2,1024]"   # Spark Master address (3 workers, 2 cores and 1024 MB each)
spark-shell --jars sparkling-water.jar    # JAR containing the Sparkling Water classes
Let’s play with the Sparkling shell…
Create H2O Client
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._   // demo-specific classes

// sc is the regular Spark context provided by the Spark shell;
// the argument is the size of the demanded H2O cloud
val h2oContext = new H2OContext(sc).start(3)

import h2oContext._   // brings implicit utility functions into scope
Is Spark Running?
Go to http://localhost:4040
Is H2O running?
http://localhost:54321/flow/index.html
Load Data #1
Load weather data into an RDD:

val weatherDataFile = "examples/smalldata/weather.csv"

// Regular Spark API
val wrawdata = sc.textFile(weatherDataFile, 3).cache()

// Ad-hoc parser for the weather rows
val weatherTable = wrawdata
  .map(_.split(","))
  .map(row => WeatherParse(row))
  .filter(!_.isWrongRow())
Weather Data
case class Weather( val Year   : Option[Int],
                    val Month  : Option[Int],
                    val Day    : Option[Int],
                    val TmaxF  : Option[Int],   // Max temperature in F
                    val TminF  : Option[Int],   // Min temperature in F
                    val TmeanF : Option[Float], // Mean temperature in F
                    val PrcpIn : Option[Float], // Precipitation (inches)
                    val SnowIn : Option[Float], // Snow (inches)
                    val CDD    : Option[Float], // Cooling Degree Day
                    val HDD    : Option[Float], // Heating Degree Day
                    val GDD    : Option[Float]) // Growing Degree Day

A simple POJO to hold one row of weather data.
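WeatherParse is used in the loading code above but never shown in the deck; here is a minimal hypothetical sketch of such an ad-hoc parser (the column positions and the isWrongRow convention are assumptions for illustration):

object WeatherParse {
  // Hypothetical sketch: parse one CSV row into a Weather instance,
  // turning unparsable cells into None
  def apply(row: Array[String]): Weather = {
    def int(i: Int)   = scala.util.Try(row(i).trim.toInt).toOption
    def float(i: Int) = scala.util.Try(row(i).trim.toFloat).toOption
    Weather(int(0), int(1), int(2),          // Year, Month, Day
            int(3), int(4), float(5),        // TmaxF, TminF, TmeanF
            float(6), float(7),              // PrcpIn, SnowIn
            float(8), float(9), float(10))   // CDD, HDD, GDD
  }
}
// The Weather class used in the deck also exposes isWrongRow(),
// e.g. returning true when Year, Month, or Day failed to parse.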
Load Data #2
Load flights data into an H2O frame:

import java.io.File

val dataFile = "examples/smalldata/year2005.csv.gz"

// Shortcut for data load and parse
val airlinesData = new DataFrame(new File(dataFile))
Where is the data?
Go to http://localhost:54321/flow/index.html
Use Spark API for Data Filtering
// Create a cheap RDD wrapper around the H2O DataFrame
val airlinesTable : RDD[Airlines] = asRDD[Airlines](airlinesData)

// And use the Spark RDD API directly - a regular Spark RDD call
val flightsToORD = airlinesTable.filter(f => f.Dest == Some("ORD"))
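A quick sanity check is to run a Spark action on the filtered RDD, e.g. counting the matching flights (standard RDD API; the number depends on your data):

flightsToORD.count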
Use Spark SQL to Join Data
import org.apache.spark.sql.SQLContext

// Create a SQL context; making it implicit shares it with h2oContext
implicit val sqlContext = new SQLContext(sc)
import sqlContext._

flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")
Join Data Based on Flight Date
val joinedTable = sql(
  """SELECT
    | f.Year, f.Month, f.DayofMonth,
    | f.CRSDepTime, f.CRSArrTime, f.CRSElapsedTime,
    | f.UniqueCarrier, f.FlightNum, f.TailNum,
    | f.Origin, f.Distance,
    | w.TmaxF, w.TminF, w.TmeanF,
    | w.PrcpIn, w.SnowIn, w.CDD, w.HDD, w.GDD,
    | f.ArrDelay
    | FROM FlightsToORD f
    | JOIN WeatherORD w
    | ON f.Year=w.Year AND f.Month=w.Month
    | AND f.DayofMonth=w.Day""".stripMargin)
Split Data
import hex.splitframe.SplitFrame
import hex.splitframe.SplitFrameModel.SplitFrameParameters

val sfParams = new SplitFrameParameters()
// The result of the SQL query is implicitly converted into an H2O DataFrame
sfParams._train = joinedTable
sfParams._ratios = Array(0.7, 0.2)
val sf = new SplitFrame(sfParams)

val splits = sf.trainModel().get._output._splits
val trainTable = splits(0)
val validTable = splits(1)
val testTable  = splits(2)
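A quick sanity check is to print the split sizes; this sketch assumes the splits behave as regular H2O frames exposing numRows():

println(s"train: ${trainTable.numRows()}, valid: ${validTable.numRows()}, test: ${testTable.numRows()}")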
Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._train = trainTable
dlParams._response_column = 'ArrDelay
dlParams._valid = validTable
dlParams._epochs = 100
dlParams._reproducible = true
dlParams._force_load_balance = false

// Create a new model builder and train; trainModel.get is a blocking call
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get
Make a Prediction
// Use the model to score data
val dlPredictTable = dlModel.score(testTable)('predict)

// Collect predicted values via the RDD API
val predictionValues = toShemaRDD(dlPredictTable)
  .collect
  .map(row =>
    if (row.isNullAt(0)) Double.NaN
    else row(0))
Hands-On #2: Can I access results from R?
YES!
Requirements
R 3.1.2+
RStudio
H2O R package
Install R Package
You can find the R package on the USB stick:
1. Open RStudio
2. Click on “Install Packages”
3. Select the h2o_0.1.20.99999.tar.gz file from the USB stick
Generate R Code
In the Sparkling Shell:

import org.apache.spark.examples.h2o.DemoUtils.residualPlotRCode

// Utility generating R code to show a residuals plot
// for predicted and actual values
residualPlotRCode(predictionH2OFrame, 'predict, testFrame, 'ArrDelay)
Residuals Plot in R
# Import the H2O library and initialize the H2O client
library(h2o)
h = h2o.init()
# Fetch prediction and actual data via the remembered keys,
# which are references to data living in the H2O cloud
pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")
act = h2o.getFrame(h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select the right columns
predDelay = pred$predict
actDelay = act$ArrDelay
# Make sure the number of rows is the same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals against actual delays
compare = cbind(
  as.data.frame(actDelay$ArrDelay),
  as.data.frame(residuals$predict))
plot(compare[,1:2])
Warning!
If you are running R v3.1.0 you will see a different plot. Why? Floating-point number handling changed in that version. Our recommendation is to upgrade R to the newest version.
Try GBM Algo
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

val gbmParams = new GBMParameters()
gbmParams._train = trainTable
gbmParams._response_column = 'ArrDelay
gbmParams._valid = validTable
gbmParams._ntrees = 100

val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get

// Print R code for the residual plot
val gbmPredictTable = gbmModel.score(testTable)('predict)
printf(residualPlotRCode(gbmPredictTable, 'predict, testTable, 'ArrDelay))
Residuals Plot for GBM
[Plot: residuals vs. prediction]
Hands-On #3: How Can I Develop and Run a Standalone App?
Requirements
IntelliJ IDEA or Eclipse
Git
Use the Sparkling Water Droplet
Clone the H2O Droplets repository:

git clone https://github.com/h2oai/h2o-droplets.git
cd h2o-droplets/sparkling-water-droplet/
Generate an IDE Project
For IDEA:    ./gradlew idea
For Eclipse: ./gradlew eclipse
… then import the generated project into your IDE.
Create an Application
object AirlinesWeatherAnalysis {

  /** Entry point */
  def main(args: Array[String]) {
    // Configure this application
    val conf: SparkConf = new SparkConf().setAppName("Flights Water")
    conf.setIfMissing("spark.master", sys.env.getOrElse("spark.master", "local"))

    // Create the Spark context to execute the application on a Spark cluster
    val sc = new SparkContext(conf)

    // Create the H2O context and start H2O on top of Spark
    new H2OContext(sc).start()

    // User code
    // . . .
  }
}
Build the Application
./gradlew build shadowJar

Builds and tests the code, and creates an assembly (fat JAR) that can be submitted to a Spark cluster.
Run Code on Spark
#!/usr/bin/env bash
APP_CLASS=water.droplets.AirlinesWeatherAnalysis
FAT_JAR_FILE="build/libs/sparkling-water-droplet-app.jar"
MASTER=${MASTER:-"local-cluster[3,2,1024]"}
DRIVER_MEMORY=2g

$SPARK_HOME/bin/spark-submit "$@" \
  --driver-memory $DRIVER_MEMORY \
  --master $MASTER \
  --class "$APP_CLASS" $FAT_JAR_FILE
It is Open Source!
You can participate in:
H2O Scala API
Sparkling Water testing
Mesos, YARN, workflows (PUBDEV-23, 26, 27, 31-33)
Spark integration - MLlib, Pipelines

Check out our JIRA at http://jira.h2o.ai
Come to Meetup
http://www.meetup.com/Silicon-Valley-Big-Data-Science/
More Info
Check out the H2O.ai Training Books: http://learn.h2o.ai/
Check out the H2O.ai Blog for Sparkling Water tutorials: http://h2o.ai/blog/
Check out the H2O.ai YouTube Channel: https://www.youtube.com/user/0xdata
Check out GitHub: https://github.com/h2oai/sparkling-water
Learn more about H2O at h2o.ai, or:

> for r in sparkling-water; do
    git clone "git@github.com:h2oai/$r.git"
  done

Thank you!
Follow us at @h2oai
And the winner is
…
