Spark Streaming: Pushing the throughput limits, by François Garillot and Gerard Maas
Who Are We?
Gerard Maas, Data Processing Team Lead
François Garillot, work on Spark Streaming
- 6. @maasg @huitseeker
Spark Streaming (Refresher)
[Diagram: a DStream[T] is a sequence of RDD[T]s, one per batch interval (t0, t1, t2, ..., ti, ti+1). Transformations turn each RDD[T] into an RDD[U]; actions are then applied to every resulting RDD.]
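The per-batch model can be sketched without Spark by modeling a DStream as a sequence of micro-batches. All names below are illustrative stand-ins, not the Spark API:

```scala
object DStreamModel {
  // Stand-in for RDD[T]: one micro-batch of elements.
  type MicroBatch[T] = Seq[T]
  // Stand-in for DStream[T]: the sequence of micro-batches over time.
  type DStreamLike[T] = Seq[MicroBatch[T]]

  // A DStream transformation applies the same RDD transformation to every batch.
  def transform[T, U](stream: DStreamLike[T])(f: MicroBatch[T] => MicroBatch[U]): DStreamLike[U] =
    stream.map(f)

  def main(args: Array[String]): Unit = {
    val batches: DStreamLike[Int] = Seq(Seq(1, 2), Seq(3, 4), Seq(5))
    // map(_ * 2) on the DStream is map(_ * 2) on each underlying RDD.
    println(transform(batches)(_.map(_ * 2))) // List(List(2, 4), List(6, 8), List(10))
  }
}
```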
From Streams to μbatches

[Diagram: a receiver (consumer) cuts the incoming stream into blocks every blockInterval; the blocks produced during one batchInterval form the RDD for that batch. Each block becomes one partition, so:]

#partitions = receivers x batchInterval / blockInterval
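With the default blockInterval of 200 ms, the arithmetic works out as follows (a sketch with illustrative numbers; the function name is made up):

```scala
object PartitionCount {
  // #partitions per batch = receivers * batchInterval / blockInterval
  def partitions(receivers: Int, batchIntervalMs: Long, blockIntervalMs: Long): Long =
    receivers * batchIntervalMs / blockIntervalMs

  def main(args: Array[String]): Unit = {
    // 2 receivers, 2 s batches, default 200 ms blocks: 20 partitions per batch.
    println(partitions(2, 2000, 200)) // 20
  }
}
```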
From Streams to μbatches

[Same diagram: solving the partition formula for the block interval, given a target number of partitions per core (partitionFactor) and the total cores given to Spark (sparkCores):]

spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
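Solving for the block interval with concrete numbers (a sketch; the values are illustrative, and partitionFactor means the desired number of partitions per core):

```scala
object BlockIntervalTuning {
  // blockInterval = batchInterval * receivers / (partitionFactor * sparkCores)
  def blockIntervalMs(batchIntervalMs: Long, receivers: Int,
                      partitionFactor: Int, sparkCores: Int): Long =
    batchIntervalMs * receivers / (partitionFactor.toLong * sparkCores)

  def main(args: Array[String]): Unit = {
    // 2 s batches, 2 receivers, 2 partitions per core, 10 cores: set blockInterval to 200 ms.
    println(blockIntervalMs(2000, 2, 2, 10)) // 200
  }
}
```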
The Importance of Caching
dstream.foreachRDD { rdd =>
  rdd.cache() // cache the RDD once before the per-key iteration
  keys.foreach { key =>
    // keyOf stands in for whatever extracts an element's key
    rdd.filter(elem => keyOf(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist() // drop the cached blocks once all keys are written
}
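Why the cache() matters can be sketched without Spark: model the un-cached RDD as a lineage that is recomputed on every traversal, and the cached RDD as a materialized copy (all names here are hypothetical):

```scala
object CachingSketch {
  var lineageRuns = 0
  // The "lineage": an expensive computation that re-runs on every traversal.
  def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3, 2, 1) }

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 2, 3)

    // Without caching: each per-key pass re-runs the whole lineage.
    keys.foreach(k => compute().filter(_ == k))
    println(lineageRuns) // 3

    // With caching: materialize once, then filter the cached copy per key.
    lineageRuns = 0
    val cached = compute()
    keys.foreach(k => cached.filter(_ == k))
    println(lineageRuns) // 1
  }
}
```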
Kafka: The Receiver-less model
Simplified Parallelism
Efficiency
Exactly-once semantics
Fewer degrees of freedom
val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
spark.streaming.kafka.maxRatePerPartition
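maxRatePerPartition caps ingestion in records per second per Kafka partition, so the largest batch the direct stream can produce is bounded as below (a sketch with illustrative numbers; the function name is made up):

```scala
object MaxRateBound {
  // Upper bound on records in one batch under spark.streaming.kafka.maxRatePerPartition.
  def maxRecordsPerBatch(maxRatePerPartition: Long, kafkaPartitions: Int,
                         batchIntervalSec: Long): Long =
    maxRatePerPartition * kafkaPartitions * batchIntervalSec

  def main(args: Array[String]): Unit = {
    // 10 000 rec/s/partition, 8 partitions, 2 s batches: at most 160 000 records per batch.
    println(maxRecordsPerBatch(10000, 8, 2)) // 160000
  }
}
```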
Pain point: Data Locality
- Where is your job getting executed?
spark.locality.wait & spark.streaming.blockInterval
- On Mesos, it’s worse (SPARK-4940)
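A common mitigation (a sketch, not a recommendation; the values are illustrative) is to lower the scheduler's locality wait, so tasks fall back from node-local to any available executor quickly instead of stalling a short batch:

```
spark.locality.wait=100ms          # default is 3s; too long for short micro-batches
spark.streaming.blockInterval=200ms
```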
Resources
Backpressure in Spark Streaming:
http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
Virdata’s Spark Streaming tuning guide:
http://www.virdata.com/tuning-spark/
TD’s paper on dynamic batch sizing:
http://dl.acm.org/citation.cfm?id=2670995
Diving into Spark Streaming Execution Model:
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
Spark Streaming / Storm Trident comparison:
https://www.cs.utoronto.ca/~patricio/docs/Analysis_of_Real_Time_Stream_Processing_Systems_Considering_Latency.pdf
Kafka direct approach:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md