Overview of Apache Spark Streaming
Agenda
• Why Apache Spark Streaming?
• What is Apache Spark Streaming?
– Key Concepts and Architecture
• How it works by Example
Why Spark Streaming?
• Process time series data:
– Results in near-real-time
• Use Cases
– Social network trends
– Website statistics, monitoring
– Fraud detection
– Advertising click monetization
Time-stamped data
• Sensor, system metrics, events, log files
• Stock ticker, user activity
• High volume, high velocity
Data for real-time monitoring
What is time series data?
• Stuff with timestamps
– Sensor data
– log files
– Phones, ...
Examples: credit card transactions, web user behaviour, social media, log files, geodata, sensors

Why Spark Streaming?
What if?
• You want to analyze data as it arrives?
For example, time series data: sensors, clicks, logs, stats
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
Event Processing
It's 6:05 and 90 degrees
Someone should open a window!
It's becoming important to process events as they arrive
What is Spark Streaming?
• Extension of the core Spark API
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
[Figure: data sources feed Spark Streaming, which writes to data sinks]

Stream Processing Architecture
Data Ingest
Data Storage
Key Concepts
• Data Sources:
– File Based: HDFS
– Network Based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor
• Transformations
• Output Operations
Spark Streaming Architecture
• Divide data stream into batches of X seconds
– Called DStream = sequence of RDDs
[Figure: input data stream → Spark Streaming → DStream of RDD batches: RDD @ time 1 (data from time 0 to 1), RDD @ time 2 (data from time 1 to 2), RDD @ time 3 (data from time 2 to 3)]
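The batching idea can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark's implementation: it assigns timestamped events to fixed batch intervals, much as a DStream groups incoming data into one RDD per interval. `Event`, `MicroBatch` and `toBatches` are hypothetical names introduced only for this illustration.

```scala
// Plain-Scala illustration of micro-batching (no Spark required).
// Event, MicroBatch and toBatches are hypothetical names, not part of any Spark API.
final case class Event(timeSec: Double, payload: String)

object MicroBatch {
  // Group events into batches of `intervalSec` seconds.
  // Batch n holds events with n * intervalSec <= timeSec < (n + 1) * intervalSec.
  def toBatches(events: Seq[Event], intervalSec: Double): Map[Int, Seq[Event]] =
    events.groupBy(e => math.floor(e.timeSec / intervalSec).toInt)
}
```

With a one-second interval, events at 0.2 s and 0.9 s land in batch 0 and an event at 1.5 s lands in batch 1, which mirrors "data from time 0 to 1" and "data from time 1 to 2" in the figure.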
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
• operated on in parallel
• Cached in memory
– Or on disk
• Fault tolerant
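Working With RDDs
• Transformations create new RDDs; actions return a value
linesRDD = sc.textFile("SomeFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # Error line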
Process DStream
• Process using transformations
– creates new RDDs
[Figure: each RDD of the input DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3) is transformed into the corresponding RDD of a new DStream]
Key Concepts
• Data Sources
• Transformations: create new DStream
– Standard RDD operations: map, filter, union, reduce, join, …
– Stateful operations: updateStateByKey(function), countByValueAndWindow, …
• Output Operations
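To make the stateful case concrete, here is a minimal plain-Scala sketch of the kind of update function that `updateStateByKey` expects. Only the function shape `(Seq[V], Option[S]) => Option[S]` mirrors the Spark Streaming API; `RunningCount`, `update` and `applyBatch` are hypothetical names, and `applyBatch` merely simulates what happens to the state from one batch to the next.

```scala
// Plain-Scala illustration of the update function used with updateStateByKey.
// RunningCount, update and applyBatch are hypothetical names for this sketch.
object RunningCount {
  // The new values seen for a key in the current batch, combined with the
  // previous state for that key, give the new state.
  def update(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  // Simulate one batch: apply `update` to every key that has state or new data.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }
}
```

In a Spark Streaming job a function of this shape would be passed to `updateStateByKey` on a DStream of key-value pairs, and checkpointing has to be enabled for stateful operations.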

Spark Streaming Architecture
• Processed results are pushed out in batches
[Figure: input data stream → Spark Streaming → DStream RDD batches → Spark → batches of processed results]
Key Concepts
• Data Sources
• Transformations
• Output Operations: trigger computation
– saveAsHadoopFiles – save to HDFS
– saveAsHadoopDataset – save to HBase
– saveAsTextFiles
– foreachRDD – do anything with each batch of RDDs
Learning Goals
• How it works by example
Use Case: Time Series Data
[Figure: oil pump sensor data is read by Spark Streaming and processed by Spark to produce data for real-time monitoring]

Convert Line of CSV data to Sensor Object
// one record per CSV line: resource id, date, time, then six numeric readings
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)
// split a CSV line on commas and build a Sensor from its nine fields
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
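A streaming job usually has to tolerate malformed input lines. The sketch below repeats `Sensor` and `parseSensor` from the slide so that it is self-contained, and adds a hypothetical helper, `parseSensorOpt`, which is not in the original deck and returns `None` instead of throwing. The sample CSV values in the usage note are made up for illustration.

```scala
import scala.util.Try

// Sensor and parseSensor are repeated from the slide so the sketch compiles on its own.
final case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

object SensorParser {
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }

  // Hypothetical helper (not in the original slides): returns None for a line
  // with too few fields or a non-numeric reading, instead of throwing.
  def parseSensorOpt(str: String): Option[Sensor] =
    Try(parseSensor(str)).toOption
}
```

For example, `SensorParser.parseSensorOpt("COHUTTA,3/10/14,1:01,10.37,1.5,2.5,0.5,84.0,1.25")` yields a `Sensor` whose `psi` is 84.0, while a truncated line such as `"COHUTTA,3/10/14"` yields `None`.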
Schema
• All events are stored; the data column family (CF) could be set to expire data
• Filtered alerts are put in the alerts CF
• Daily summaries are put in the stats CF
Row key               | CF data        | CF alerts | CF stats
                      | hz    …  psi   | psi  …    | hz_avg  …  psi_min
COHUTTA_3/10/14_1:01  | 10.37 …  84    | 0         |
COHUTTA_3/10/14       |                |           | 10      …  0
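The two example row keys suggest a composite key: resource id, date and time for individual events, and resource id and date for daily summaries. A minimal sketch of how such keys could be built follows; `RowKeys`, `eventKey` and `dailyKey` are hypothetical names, and the underscore-separated format is inferred from the examples above rather than stated in the slides.

```scala
// Hypothetical helpers showing how the row keys in the table could be composed.
object RowKeys {
  // Per-event key, e.g. COHUTTA_3/10/14_1:01
  def eventKey(resid: String, date: String, time: String): String =
    s"${resid}_${date}_${time}"

  // Per-day key for summary rows, e.g. COHUTTA_3/10/14
  def dailyKey(resid: String, date: String): String =
    s"${resid}_${date}"
}
```

Because the daily key is a prefix of every event key for the same pump and day, rows for one pump and day sort next to each other in HBase.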
© 2015 MapR Technologies 23
Basic Steps for Spark Streaming code
These are the basic steps for Spark Streaming code:
1. Create a DStream
– Apply transformations
– Apply output operations
2. Start receiving data and processing it
– using streamingContext.start()
3. Wait for the processing to be stopped
– using streamingContext.awaitTermination()
Create a DStream
// batch interval of 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")
[Figure: linesDStream as a series of batches (batch time 0-1, batch time 1-2, ...), each stored in memory as an RDD]
DStream: a sequence of RDDs representing a stream of data

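Process DStream
val linesDStream = ssc.textFileStream("directory path")
// parse each line into a Sensor object
val sensorDStream = linesDStream.map(parseSensor)
[Figure: map is applied to each batch of linesDStream; new RDDs are created for every batch of sensorDStream]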
Process DStream
// for each RDD
sensorDStream.foreachRDD { rdd =>
  // filter sensor data for low psi
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}
DataFrame and SQL Operations
// for each RDD: filter alerts, then query them with SQL
sensorDStream.foreachRDD { rdd =>
  . . .
  // register the alert RDD as a temporary table
  alertRDD.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val alertViewDF = sqlContext.sql(
    """select s.resid, s.psi, p.pumpType
       from alert s join pump p on s.resid = p.resid
       join maint m on p.resid = m.resid""")
  . . .
}
Save to HBase
// for each RDD: convert alerts to Put objects and write them to HBase
sensorDStream.foreachRDD { rdd =>
  . . .
  // convert alert to Put object, write to HBase alerts CF
  . . . .saveAsHadoopDataset(jobConfig)
}

Save to HBase
[Figure: for each batch, map converts the sensor RDD into Put objects, which are saved to HBase]
Output operation: persist data to external storage
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
  . . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Using HBase as a Source and Sink
[Figure: a Spark application reads from and writes to an HBase database]
Example: calculate and store summaries (a pre-computed, materialized view)
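HBase Read and Write
// read: newAPIHadoopRDD scans HBase and returns (row key, Result) pairs
val hBaseRDD = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// write: saveAsHadoopDataset writes (key, Put) pairs to HBase
. . . { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)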

Read HBase
// load an RDD of (row key, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// get Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// transform into an RDD of (RowKey, ColumnValue)s
val keyValueRDD = resultRDD.map(
  result => (Bytes.toString(result.getRow()).split(" ")(0),
    Bytes.toDouble(result.value)))
// group by row key, get statistics for column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list =>
  StatCounter(list))
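The group-and-summarize step can be tried out without Spark or HBase. The following plain-Scala sketch applies the same idea to an in-memory list of (row key, value) pairs. `Stats` and `KeyStats` are hypothetical stand-ins for Spark's `StatCounter` and compute only count, min, max and mean.

```scala
// Plain-Scala illustration of "group by key, then compute statistics per key".
// Stats and KeyStats are hypothetical stand-ins, not Spark classes.
final case class Stats(count: Long, min: Double, max: Double, mean: Double)

object KeyStats {
  // Summarize a non-empty sequence of values.
  def summarize(values: Seq[Double]): Stats =
    Stats(values.size.toLong, values.min, values.max, values.sum / values.size)

  // Group (rowKey, value) pairs by key and summarize each group.
  // groupBy never produces an empty group, so summarize is safe here.
  def byKey(pairs: Seq[(String, Double)]): Map[String, Stats] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> summarize(kvs.map(_._2)) }
}
```

On a real cluster the equivalent of `byKey` is the `groupByKey().mapValues(...)` chain shown above, with the grouping distributed across partitions.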
Write HBase
// save to HBase table CF data
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// convert psi stats to Put objects and write to the stats column family of the HBase table
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

