Strata NYC 2015: What's new in Spark Streaming
- 2. Who am I?
Project Management Committee (PMC) member of Spark
Started Spark Streaming in AMPLab, UC Berkeley
Current technical lead of Spark Streaming
Software engineer at Databricks
- 3. What is Databricks?
Founded by the creators of Spark and remains the largest contributor
Offers a hosted service
• Spark on EC2
• Notebooks
• Plot visualizations
• Cluster management
• Scheduled jobs
- 4. Spark Streaming
Scalable, fault-tolerant stream processing system
Ingests data from Kafka, Flume, Kinesis, Twitter, HDFS/S3
Pushes results out to file systems, databases, and dashboards
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
- 5. What can you use it for?
Real-time fraud detection in transactions
React to anomalies in sensors in real-time
Find cat videos in tweets as soon as they go viral
- 6. Spark Streaming
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
- 7. Word Count with Kafka
// entry point of streaming functionality
val context = new StreamingContext(conf, Seconds(1))

// create DStream from Kafka data
val lines = KafkaUtils.createStream(context, ...)
- 8. Word Count with Kafka
// context and lines defined as on the previous slide

// split lines into words
val words = lines.flatMap(_.split(" "))
- 9. Word Count with Kafka
// context, lines, and words defined as on the previous slides

// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// print some counts on screen
wordCounts.print()

// start receiving and transforming the data
context.start()
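Pulled together, a minimal runnable version of this word count might look like the sketch below; the ZooKeeper address, consumer group, and topic map are placeholders, and a map(_._2) step extracts the message from the (key, message) pairs that createStream returns.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val context = new StreamingContext(conf, Seconds(1))

    // Placeholder ZooKeeper quorum, consumer group, and topic map
    val kafkaStream = KafkaUtils.createStream(context, "zk-host:2181", "wordcount-group", Map("events" -> 1))
    val lines = kafkaStream.map(_._2)   // createStream yields (key, message) pairs; keep the message

    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    context.start()
    context.awaitTermination()          // keep the streaming application running
  }
}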
- 11. Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
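Since RDD.join operates on key-value pairs, an expanded sketch of this pattern could key both sides first; the comma-separated record layout and the choice of the first field as the key are illustrative, not from the talk.

// Key the static data and each streamed batch the same way, then join them
val dataset = sparkContext.textFile("file")
  .map(line => (line.split(",")(0), line))                  // key static records by their first field

val joined = kafkaStream
  .map { case (_, value) => (value.split(",")(0), value) }  // key each streamed record by its first field
  .transform { batchRDD =>
    batchRDD.join(dataset)                                  // (key, (streamed record, static record))
      .filter { case (_, (event, _)) => event.nonEmpty }    // placeholder filter condition
  }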
- 12. Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event => model.predict(event.feature) }
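A slightly fuller sketch of the same offline-train, online-predict pattern, assuming each event carries space-separated numeric features; the file name, k = 10, and 20 iterations are illustrative.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Learn the model offline on a static RDD of feature vectors
val trainingData = sparkContext.textFile("features.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
val model = KMeans.train(trainingData, 10, 20)

// Apply the model online: assign each incoming event to a cluster
val clusterIds = kafkaStream.map { case (_, value) =>
  model.predict(Vectors.dense(value.split(" ").map(_.toDouble)))
}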
- 13. Combine SQL with streaming
Interactively query streaming data with SQL and DataFrames
// Register each batch in stream as a table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.toDF.registerTempTable("events")
}

// Interactively query the table
sqlContext.sql("select * from events")
- 15. Spark Survey by Databricks
Survey of over 1,417 individuals from 842 organizations
56% increase in Spark Streaming users since 2014
Fastest-rising component in Spark
https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html
- 16. Feedback from community
We have learned a lot from our rapidly growing user base
Most of the development in the last few releases has been driven by community demands
- 19. Streaming MLlib algorithms
Continuous learning and prediction on streaming data
StreamingLinearRegression [Spark 1.1]
StreamingKMeans [Spark 1.2]
StreamingLogisticRegression [Spark 1.3]

val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Train on one DStream
model.trainOn(trainingDStream)

// Predict on another DStream
model.predictOnValues(
  testDStream.map { lp => (lp.label, lp.features) }
).print()

https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
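For the linear regression variant listed above, a minimal sketch might look like the following; the feature dimensionality is illustrative, and trainingDStream and testDStream are assumed to be DStreams of LabeledPoint.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Continuously refine a linear model as labeled data arrives
val numFeatures = 4                                      // illustrative dimensionality
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingDStream)                           // DStream[LabeledPoint]
model.predictOnValues(testDStream.map { lp => (lp.label, lp.features) }).print()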
- 20. Python API Improvements
Added Python API for Streaming ML algos [Spark 1.5]
Added Python API for various data sources
Kafka [Spark 1.3 - 1.5]
Flume, Kinesis, MQTT [Spark 1.5]
lines = KinesisUtils.createStream(streamingContext, appName, streamName,
    endpointUrl, regionName, InitialPositionInStream.LATEST, 2)

counts = lines.flatMap(lambda line: line.split(" "))
- 22. New Visualizations [Spark 1.4-1.5]
Stats over the last 1000 batches
For stability:
Scheduling delay should be approximately 0
Processing time should stay below the batch interval
- 23. New Visualizations [Spark 1.4-1.5]
Details of individual batches
Kafka offsets processed in each batch (can help in debugging bad data)
List of Spark jobs in each batch
- 25. New Visualizations [Spark 1.4-1.5]
Memory usage of received data
Can be used to understand memory
consumption across executors
- 28. Zero data loss: Two cases
Non-replayable sources: sources that do not support replay from any position (e.g. Flume). Spark Streaming saves received data to a Write Ahead Log (WAL) and replays data from the WAL on failure.
Replayable sources: sources that allow data to be replayed from any position (e.g. Kafka, Kinesis). Spark Streaming saves only the record identifiers and replays the data directly from the source.
- 29. Write Ahead Log (WAL) [Spark 1.3]
Save received data in a WAL in a fault-tolerant file system
[diagram: the driver runs the user code and launches tasks to process the received data; the receiver on an executor buffers incoming data in memory and writes it to a WAL in HDFS]
- 30. Write Ahead Log (WAL) [Spark 1.3]
Replay unprocessed data from the WAL if the driver fails and restarts
[diagram: after a driver failure, the restarted driver reruns the failed tasks on restarted executors, and the tasks read the data back from the WAL in HDFS]
- 31. Write Ahead Log (WAL) [Spark 1.3]
WAL can be enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true
Should use a reliable receiver, which ensures that data is written to the WAL before acknowledging the source
Reliable receiver + WAL gives an at-least-once guarantee
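A minimal sketch of enabling the WAL; the checkpoint path is a placeholder, and since the log files are stored under the checkpoint directory, it should point at a fault-tolerant file system such as HDFS or S3.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALExample")
  // Durably log received data before it is processed
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
// The WAL lives under the checkpoint directory (placeholder path)
ssc.checkpoint("hdfs:///checkpoints/app")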
- 32. Kinesis [Spark 1.5]
Save the Kinesis sequence numbers instead of the raw data
[diagram: the receiver on the executor reads records from Kinesis using the KCL; sequence number ranges are sent to the driver and saved to HDFS]
- 33. Kinesis [Spark 1.5]
Recover unprocessed data directly from Kinesis using the recovered sequence numbers
[diagram: the restarted driver recovers the sequence number ranges from HDFS and reruns tasks on the restarted executor with the recovered ranges, reading the records back from Kinesis using the AWS SDK]
- 34. Kinesis [Spark 1.5]
After any failure, records are either recovered from the saved sequence numbers or replayed via the KCL
No need to replicate received data in Spark Streaming
Provides an end-to-end at-least-once guarantee
- 35. Kafka [Spark 1.3, graduated in 1.5]
A priori, decide the offset ranges to consume in the next batch
[diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition, decides the offset ranges for the next batch, and saves them to HDFS]
- 36. Kafka [Spark 1.3, graduated in 1.5]
A priori, decide the offset ranges to consume in the next batch
[diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition; executors then run tasks that read each offset range from the Kafka brokers in parallel]
- 37. Direct Kafka API [Spark 1.5]
Does not use receivers, so there is no need for Spark Streaming to replicate received data
Can provide up to 10x higher throughput than the earlier receiver-based approach
https://spark-summit.org/2015/events/towards-benchmarking-modern-distributed-streaming-systems/
Can provide exactly-once semantics
Output operations to external storage should be idempotent or transactional
Can run Spark batch jobs directly on Kafka
# RDD partitions = # Kafka partitions, easy to reason about
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
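A sketch of creating a direct stream; the broker addresses and topic name are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct (receiver-less) stream: one RDD partition per Kafka partition
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)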
- 38. System stability
Streaming applications may have to deal with variations in data rates and processing rates
For stability, a streaming application must receive data only as fast as it can process it
Since 1.1, Spark Streaming has allowed static limits on receiver ingestion rates to guard against spikes
- 39. Backpressure [Spark 1.5]
The system automatically and dynamically adapts rate limits to ensure stability under any processing conditions
If sinks slow down, the system automatically pushes back on the sources to slow down receiving
- 40. Backpressure [Spark 1.5]
The system uses batch processing times and scheduling delays to set rate limits
Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits
Contributed by Typesafe
- 41. Backpressure [Spark 1.5]
The system uses batch processing times and scheduling delays to set rate limits
The dynamic rate limit prevents receivers from receiving too fast
The scheduling delay is kept in check by the rate limits
- 42. Backpressure [Spark 1.5]
Experimental, so disabled by default in Spark 1.5
Enable it by setting the Spark configuration spark.streaming.backpressure.enabled to true
Will be enabled by default in future releases
https://issues.apache.org/jira/browse/SPARK-7398
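A configuration sketch combining the backpressure flag with static rate limits like those mentioned on the earlier stability slide; the static-limit keys come from the Spark configuration documentation rather than the talk, and the numeric caps are illustrative.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BackpressureExample")
  // Dynamically adapt the ingestion rate (Spark 1.5+)
  .set("spark.streaming.backpressure.enabled", "true")
  // Static caps can still serve as an upper bound
  .set("spark.streaming.receiver.maxRate", "10000")            // records/sec per receiver
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")   // records/sec per partition (direct Kafka)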
- 44. API and Libraries
Support for operations on event time and out-of-order data
The most demanded feature from the community
Tighter integration between Streaming and SQL + DataFrames
Helps leverage Project Tungsten
- 45. Infrastructure
Add native support for Dynamic Allocation for Streaming
Dynamically scale the cluster resources based on processing load
Will work in collaboration with backpressure to scale up/down while maintaining stability
Note: as of 1.5, the existing Dynamic Allocation is not optimized for streaming
But users can build their own scaling logic using the developer API, as sketched below:
sparkContext.requestExecutors(), sparkContext.killExecutors()
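As one illustration of such custom scaling logic (not code from the talk), a StreamingListener could watch the scheduling delay and request more executors when batches start queuing up; the threshold is arbitrary.

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Illustrative scaling logic: ask for an extra executor when batches queue up
streamingContext.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val delayMs = batchCompleted.batchInfo.schedulingDelay.getOrElse(0L)
    if (delayMs > 5000) {                      // arbitrary threshold in milliseconds
      sparkContext.requestExecutors(1)         // developer API; killExecutors(...) can scale back down
    }
  }
})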
- 48. Fastest-growing component in the Spark ecosystem
Significant improvements in fault-tolerance, stability, visualizations, and the Python API
More community-requested features to come
@tathadas