Free Code Friday - Spark Streaming with HBase

© 2015 MapR Technologies 1© 2014 MapR Technologies
Overview of Apache Spark Streaming

© 2015 MapR Technologies 2
Agenda
• Why Apache Spark Streaming ?
• What is Apache Spark Streaming?
– Key Concepts and Architecture
• How it works by Example

Why Spark Streaming?
• Process Time Series data :
– Results in near-real-time
• Use Cases
– Social network trends
– Website statistics, monitoring
– Fraud detection
– Advertising click monetization
put
put
put
put
Time stamped data
data
• Sensor, System Metrics, Events, log files
• Stock Ticker, User Activity
• Hi Volume, Velocity
Data for real-time
monitoring

What is time series data?
• Stuff with timestamps
– Sensor data
– log files
– Phones..
Credit Card Transactions Web user behaviour
Social media
Log files
Geodata
Sensors

Why Spark Streaming ?
What If?
• You want to analyze data as it arrives?
For Example Time Series Data: Sensors, Clicks, Logs, Stats

Batch Processing
It's 6:01 and 72 degrees
It was hot at 6:05
yesterday!
Batch processing may be too late for some events

Event Processing
It's 6:05 and
90 degrees
Someone should
open a window!
Streaming
Its becoming important to process events as they arrive

What is Spark Streaming?
• extension of the core Spark AP
• enables scalable, high-throughput, fault-tolerant stream
processing of live data
Data Sources Data Sinks

Stream Processing Architecture
Streaming
Sources/Apps
MapR-FS
Data Ingest
Topics
MapR-DB
Data Storage
MapR-FS
Apps
Stream
Processing

Key Concepts
• Data Sources:
– File Based: HDFS
– Network Based: TCP sockets,
Twitter, Kafka, Flume, ZeroMQ, Akka Actor
• Transformations
• Output Operations
MapR-FS
Topics

Spark Streaming Architecture
• Divide data stream into batches of X seconds
– Called DStream = sequence of RDDs
Spark
Streaming
input data
stream
DStream RDD batches
Batch
interval
data from
time 0 to 1
data from
time 1 to 2
RDD @ time 2
data from
time 2 to 3
RDD @ time 3RDD @ time 1

Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• read only collection of
elements

Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• read only collection of
elements
• operated on in parallel
• Cached in memory
– Or on disk
• Fault tolerant

Working With RDDs
RDD
RDD
RDD
RDD
Transformations
Action Value
linesWithErrorRDD.count()
6
linesWithErrorRDD.first()
# Error line
textFile = sc.textFile(”SomeFile.txt”)
linesWithErrorRDD = linesRDD.filter(lambda line: “ERROR” in line)

Process DStream
transform
Transform
map
reduceByValue
count
DStream
RDDs
Dstream
RDDs
transformtransform
• Process using transformations
– creates new RDDs
data from
time 0 to 1
data from
time 1 to 2
RDD @ time 2
data from
time 2 to 3
RDD @ time 1 RDD @ time 2 RDD @ time 3

Key Concepts
• Data Sources
• Transformations: create new DStream
– Standard RDD operations: map, filter, union, reduce, join, …
– Stateful operations: UpdateStateByKey(function),
countByValueAndWindow, …
• Output Operations

Spark Streaming Architecture
• processed results are pushed out in batches
Spark
batches of processed
results
Spark
Streaming
input data
stream
DStream RDD batches
data from
time 0 to 1
data from
time 1 to 2
RDD @ time 2
data from
time 2 to 3

Key Concepts
• Data Sources
• Transformations
• Output Operations: trigger Computation
– saveAsHadoopFiles – save to HDFS
– saveAsHadoopDataset – save to Hbase
– saveAsTextFiles
– foreach – do anything with each batch of RDDs
MapR-DB
MapR-FS

Learning Goals
• How it works by example

Use Case: Time Series Data
Data for
real-time monitoring
read
Spark Processing
Spark
Streaming
Oil Pump Sensor data

Convert Line of CSV data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
hz: Double, disp: Double, flo: Double, sedPPM: Double,
psi: Double, chlPPM: Double)
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
p(6).toDouble, p(7).toDouble, p(8).toDouble)
}

Schema
• All events stored, data CF could be set to expire data
• Filtered alerts put in alerts CF
• Daily summaries put in Stats CF
Row key
CF data CF alerts CF stats
hz … psi psi … hz_avg … psi_min
COHUTTA_3/10/14_1:01 10.37 84 0
COHUTTA_3/10/14 10 0

Basic Steps for Spark Streaming code
These are the basic steps for Spark Streaming code:
1. create a Dstream
1. Apply transformations
2. Apply output operations
2. Start receiving data and processing it
– using streamingContext.start().
3. Wait for the processing to be stopped
– using streamingContext.awaitTermination().

Create a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream(“/mapr/stream")
batch
time 0-1
linesDStream
batch
time 1-2
batch
time 1-2
DStream: a sequence of RDDs representing a
stream of data
stored in memory as an
RDD

Process DStream
val linesDStream = ssc.textFileStream(”directory path")
val sensorDStream = linesDStream.map(parseSensor)
map
new RDDs created
for every batch
batch
time 0-1
linesDStream
RDDs
sensorDstream
RDDs
batch
time 1-2
mapmap
batch
time 1-2

Process DStream
// for Each RDD
sensorDStream.foreachRDD { rdd =>
// filter sensor data for low psi
val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
. . .
}

DataFrame and SQL Operations
// for Each RDD parse into a sensor object filter
. . .
alertRdd.toDF().registerTempTable(”alert”)
// join alert data with pump maintenance info
val alertViewDF = sqlContext.sql(
"select s.resid,s.psi, p.pumpType
from alert s join pump p on s.resid = p.resid
join maint m on p.resid=m.resid")
. . .
}

Save to HBase
// for Each RDD parse into a sensor object filter
. . .
// convert alert to put object write to HBase alerts
rdd.map(Sensor.convertToPutAlert)
.saveAsHadoopDataset(jobConfig)
}

Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
map
Put objects written
To HBase
batch
time 0-1
linesRDD
DStream
sensorRDD
Dstream
batch
time 1-2
mapmap
batch
time 1-2
HBase
save save save
output operation: persist data to external storage

Start Receiving Data
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()

Using HBase as a Source and Sink
read
write
Spark applicationHBase database
EXAMPLE: calculate and store summaries,
Pre-Computed, Materialized View

HBase
HBase Read and Write
val hBaseRDD = sc.newAPIHadoopRDD(
conf,classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
keyStatsRDD.map { case (k, v) => convertToPut(k, v)
}.saveAsHadoopDataset(jobConfig)
newAPIHadoopRDD
Row key Result
saveAsHadoopDataset
Key Put
HBase
Scan Result

Read HBase
// Load an RDD of (rowkey, Result) tuples from HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
// get Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// transform into an RDD of (RowKey, ColumnValue)s
val keyValueRDD = resultRDD.map(
result => (Bytes.toString(result.getRow()).split(" ")(0),
Bytes.toDouble(result.value)))
// group by rowkey , get statistics for column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list =>
StatCounter(list))

Write HBase
// save to HBase table CF data
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// convert psi stats to put and write to hbase table stats column family
keyStatsRDD.map { case (k, v) =>
convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/spark-streaming-hbase

Free HBase On Demand Training
(includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-
online-training

Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/

References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark

Q&A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies

Free Code Friday - Spark Streaming with HBase

Related slideshows

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (11)

Similar to Free Code Friday - Spark Streaming with HBase

Similar to Free Code Friday - Spark Streaming with HBase (20)

More from MapR Technologies

More from MapR Technologies (20)

Recently uploaded

Recently uploaded (20)

Free Code Friday - Spark Streaming with HBase

Editor's Notes