Spark Streaming: Pushing the throughput limits, by François Garillot and Gerard Maas
Who Are We?
Gerard Maas, Data Processing Team Lead
François Garillot, work on Spark Streaming
- 6. @maasg @huitseeker
Spark Streaming (Refresher)
[Diagram: a DStream[T] is a sequence of RDD[T]s, one per batch interval (t0, t1, t2, ..., ti, ti+1). Transformations turn each RDD[T] into an RDD[U]; actions are then applied to every resulting RDD.]
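The per-batch model can be sketched without Spark by modeling a DStream as a sequence of micro-batches. All names below are illustrative stand-ins, not the Spark API:

```scala
object DStreamModel {
  // Stand-in for RDD[T]: one micro-batch of elements.
  type MicroBatch[T] = Seq[T]
  // Stand-in for DStream[T]: the sequence of micro-batches over time.
  type DStreamLike[T] = Seq[MicroBatch[T]]

  // A DStream transformation applies the same RDD transformation to every batch.
  def transform[T, U](stream: DStreamLike[T])(f: MicroBatch[T] => MicroBatch[U]): DStreamLike[U] =
    stream.map(f)

  def main(args: Array[String]): Unit = {
    val batches: DStreamLike[Int] = Seq(Seq(1, 2), Seq(3, 4), Seq(5))
    // map(_ * 2) on the DStream is map(_ * 2) on each underlying RDD.
    println(transform(batches)(_.map(_ * 2))) // List(List(2, 4), List(6, 8), List(10))
  }
}
```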
From Streams to μbatches

[Diagram: a receiver (consumer) cuts the incoming stream into blocks every blockInterval; the blocks produced during one batchInterval form the RDD for that batch. Each block becomes one partition, so:]

#partitions = receivers x batchInterval / blockInterval
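With the default blockInterval of 200 ms, the arithmetic works out as follows (a sketch with illustrative numbers; the function name is made up):

```scala
object PartitionCount {
  // #partitions per batch = receivers * batchInterval / blockInterval
  def partitions(receivers: Int, batchIntervalMs: Long, blockIntervalMs: Long): Long =
    receivers * batchIntervalMs / blockIntervalMs

  def main(args: Array[String]): Unit = {
    // 2 receivers, 2 s batches, default 200 ms blocks: 20 partitions per batch.
    println(partitions(2, 2000, 200)) // 20
  }
}
```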
From Streams to μbatches

[Same diagram: solving the partition formula for the block interval, given a target number of partitions per core (partitionFactor) and the total cores given to Spark (sparkCores):]

spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
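Solving for the block interval with concrete numbers (a sketch; the values are illustrative, and partitionFactor means the desired number of partitions per core):

```scala
object BlockIntervalTuning {
  // blockInterval = batchInterval * receivers / (partitionFactor * sparkCores)
  def blockIntervalMs(batchIntervalMs: Long, receivers: Int,
                      partitionFactor: Int, sparkCores: Int): Long =
    batchIntervalMs * receivers / (partitionFactor.toLong * sparkCores)

  def main(args: Array[String]): Unit = {
    // 2 s batches, 2 receivers, 2 partitions per core, 10 cores: set blockInterval to 200 ms.
    println(blockIntervalMs(2000, 2, 2, 10)) // 200
  }
}
```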
The Importance of Caching
dstream.foreachRDD { rdd =>
  rdd.cache() // cache the RDD once before the per-key iteration
  keys.foreach { key =>
    // keyOf stands in for whatever extracts an element's key
    rdd.filter(elem => keyOf(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist() // drop the cached blocks once all keys are written
}
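Why the cache() matters can be sketched without Spark: model the un-cached RDD as a lineage that is recomputed on every traversal, and the cached RDD as a materialized copy (all names here are hypothetical):

```scala
object CachingSketch {
  var lineageRuns = 0
  // The "lineage": an expensive computation that re-runs on every traversal.
  def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3, 2, 1) }

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 2, 3)

    // Without caching: each per-key pass re-runs the whole lineage.
    keys.foreach(k => compute().filter(_ == k))
    println(lineageRuns) // 3

    // With caching: materialize once, then filter the cached copy per key.
    lineageRuns = 0
    val cached = compute()
    keys.foreach(k => cached.filter(_ == k))
    println(lineageRuns) // 1
  }
}
```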
Kafka: The Receiver-less model
Simplified Parallelism
Efficiency
Exactly-once semantics
Fewer degrees of freedom
val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
spark.streaming.kafka.maxRatePerPartition
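maxRatePerPartition caps ingestion in records per second per Kafka partition, so the largest batch the direct stream can produce is bounded as below (a sketch with illustrative numbers; the function name is made up):

```scala
object MaxRateBound {
  // Upper bound on records in one batch under spark.streaming.kafka.maxRatePerPartition.
  def maxRecordsPerBatch(maxRatePerPartition: Long, kafkaPartitions: Int,
                         batchIntervalSec: Long): Long =
    maxRatePerPartition * kafkaPartitions * batchIntervalSec

  def main(args: Array[String]): Unit = {
    // 10 000 rec/s/partition, 8 partitions, 2 s batches: at most 160 000 records per batch.
    println(maxRecordsPerBatch(10000, 8, 2)) // 160000
  }
}
```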
Pain point: Data Locality
- Where is your job getting executed?
spark.locality.wait & spark.streaming.blockInterval
- On Mesos, it’s worse (SPARK-4940)
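A common mitigation (a sketch, not a recommendation; the values are illustrative) is to lower the scheduler's locality wait, so tasks fall back from node-local to any available executor quickly instead of stalling a short batch:

```
spark.locality.wait=100ms          # default is 3s; too long for short micro-batches
spark.streaming.blockInterval=200ms
```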
Resources
Backpressure in Spark Streaming:
http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
Virdata’s Spark Streaming tuning guide:
http://www.virdata.com/tuning-spark/
TD’s paper on dynamic batch sizing:
http://dl.acm.org/citation.cfm?id=2670995
Diving into Spark Streaming Execution Model:
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
Spark Streaming / Storm Trident comparison:
https://www.cs.utoronto.ca/~patricio/docs/Analysis_of_Real_Time_Stream_Processing_Systems_Considering_Latency.pdf
Kafka direct approach:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md