Recipes for Running Spark Streaming Apps in Production
Tathagata “TD” Das
Spark Summit 2015
@tathadas
Spark Streaming
Scalable, fault-tolerant stream processing system

Sources: Kafka, Flume, Kinesis, HDFS/S3, Twitter → Sinks: file systems, databases, dashboards

High-level API – joins, windows, …, often 5x less code
Fault-tolerant – exactly-once semantics, even for stateful ops
Integration – integrates with MLlib, SQL, DataFrames, GraphX
Spark Streaming
Receivers receive data streams and chop them up into batches.
Spark processes the batches and pushes out the results.
[Diagram: data streams → receivers → batches → results]
Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))  // entry point of streaming functionality
val lines = KafkaUtils.createStream(context, ...)     // create DStream from Kafka data
Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))               // split lines into words
Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
                      .reduceByKey(_ + _)             // count the words
wordCounts.print()                                    // print some counts on screen
context.start()                                       // start receiving and transforming the data
Word Count with Kafka

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WordCount {
  def main(args: Array[String]) {
    val context = new StreamingContext(new SparkConf(), Seconds(1))
    val lines = KafkaUtils.createStream(context, ...)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    context.start()
    context.awaitTermination()
  }
}

Got it working on a small Spark cluster on little data. What’s next??
How to get it production-ready?
Fault-tolerance and Semantics
Performance and Stability
Monitoring and Upgrading
Deeper View of Spark Streaming
Any Spark Application
User code runs in the driver process.
The driver launches executors in the cluster (YARN / Mesos / Spark Standalone).
Tasks are sent to the executors for processing data.
[Diagram: driver plus three executors running in the cluster]
Spark Streaming Application: Receive data
The driver runs receivers as long-running tasks.
A receiver divides its data stream into blocks and keeps them in memory.
Blocks are also replicated to another executor.
[Diagram: the driver, running the WordCount program shown earlier, with two executors; a receiver on one executor turns the incoming data stream into data blocks, which are replicated to the second executor]
Spark Streaming Application: Process data
Every batch interval, the driver launches tasks to process the blocks,
and results are pushed out to a data store.
[Diagram: the same driver and executors; tasks process the data blocks on both executors and write results to a data store]
Fault-tolerance and Semantics
Performance and Stability
Monitoring and Upgrading
Failures? Why care?
Many streaming applications need zero-data-loss guarantees despite any kind of failure in the system.
At-least-once guarantee – every record processed at least once
Exactly-once guarantee – every record processed exactly once
Different kinds of failures – executor and driver
Some failures and guarantee requirements need additional configuration and setup.
What if an executor fails?
If an executor fails, its receiver is lost and all its blocks are lost.
Tasks and receivers are restarted by Spark automatically, no config needed:
the receiver is restarted on another executor, and tasks are restarted on the block replicas.
[Diagram: failed executor; the receiver restarted on a healthy executor, tasks rerun on replicated blocks]
What if the driver fails?
When the driver fails, all the executors fail.
All computation and all received blocks are lost.
How do we recover?
Recovering Driver with Checkpointing
DStream checkpointing: periodically save the DAG of DStreams to fault-tolerant storage.
[Diagram: active driver checkpoints info to HDFS / S3]
Recovering Driver w/ DStream Checkpointing
A failed driver can be restarted from the checkpoint information:
new executors are launched and the receivers are restarted.
Recovering Driver w/ DStream Checkpointing
1. Configure automatic driver restart
   All cluster managers support this
2. Set a checkpoint directory in an HDFS-compatible file system
   streamingContext.checkpoint(hdfsDirectory)
3. Slightly restructure the code to use checkpoints for recovery
Configuring Automatic Driver Restart
Spark Standalone – use spark-submit in “cluster” mode with “--supervise” (see the example below)
See http://spark.apache.org/docs/latest/spark-standalone.html
YARN – use spark-submit in “cluster” mode
See YARN config “yarn.resourcemanager.am.max-attempts”
Mesos – Marathon can restart Mesos applications
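For instance, a supervised cluster-mode submission on Spark Standalone might look like the sketch below; the master URL, main class, and jar name are placeholders:

spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.WordCount \
  wordcount-assembly.jar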
Restructuring code for Checkpointing

Before – create + setup, then start:

val context = new StreamingContext(...)
val lines = KafkaUtils.createStream(...)
val words = lines.flatMap(...)
...
context.start()

Put all setup code into a function that returns a new StreamingContext:

def creatingFunc(): StreamingContext = {
  val context = new StreamingContext(...)
  val lines = KafkaUtils.createStream(...)
  val words = lines.flatMap(...)
  ...
  context.checkpoint(hdfsDir)
  context  // return the fully set-up context
}

Get the context set up from the HDFS dir OR create a new one with the function:

val context = StreamingContext.getOrCreate(hdfsDir, creatingFunc)
context.start()
Restructuring code for Checkpointing

StreamingContext.getOrCreate():
  if the HDFS directory has checkpoint info:
    recover the context from that info
  else:
    call creatingFunc() to create and set up a new context

The restarted process can figure out whether to recover using the checkpoint info or not.
(The creatingFunc / getOrCreate code is the same as on the previous slide.)
Received blocks lost on Restart!
In-memory blocks of buffered data are lost on driver restart.
[Diagram: restarted driver with new executors; the restarted receiver has no blocks]
Recovering data with Write Ahead Logs
Write Ahead Log (WAL): synchronously save received data to fault-tolerant storage.
[Diagram: the receiver saves incoming blocks to HDFS as well as keeping them in memory]
Recovering data with Write Ahead Logs
After restart, blocks are recovered from the Write Ahead Log.
[Diagram: restarted driver and receiver; blocks recovered from the WAL]
Recovering data with Write Ahead Logs
1. Enable checkpointing; logs are written in the checkpoint directory
2. Enable the WAL in the SparkConf configuration
   sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3. The receiver should also be reliable
   Acknowledge the source only after data is saved to the WAL
   Unacked data will be replayed from the source by the restarted receiver
4. Disable in-memory replication (already replicated by HDFS)
   Use StorageLevel.MEMORY_AND_DISK_SER for input DStreams
(A combined sketch follows this list.)
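Putting these steps together, a minimal sketch — the checkpoint path, ZooKeeper quorum, consumer group, and topic map are placeholders, and step 3 (receiver reliability) depends on the receiver implementation:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
  .setAppName("WALWordCount")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // step 2: enable the WAL

val context = new StreamingContext(sparkConf, Seconds(1))
context.checkpoint("hdfs:///checkpoints/wordcount")             // step 1: the WAL is written here too

// Step 4: single-copy storage level, since the WAL in HDFS already replicates the data
val lines = KafkaUtils.createStream(
  context, "zk-host:2181", "wordcount-group", Map("words" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)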
RDD Checkpointing
Stateful stream processing can lead to long RDD lineages.
Long lineage = bad for fault-tolerance, too much recomputation.
RDD checkpointing saves RDD data to fault-tolerant storage to limit lineage and recomputation.
More: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
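For context, a typical stateful operation that grows lineage is updateStateByKey; a minimal sketch of a running word count, assuming wordCounts from the earlier example and a checkpoint directory already set:

// Each batch folds new counts into per-key state; periodic checkpointing
// truncates the lineage of the state RDDs.
val runningCounts = wordCounts.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0))
}
runningCounts.print()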
Fault-tolerance Semantics
Zero data loss = every stage processes each event at least once despite any failure.
[Pipeline: Sources → Receiving → Transforming → Outputting → Sinks]
Fault-tolerance Semantics
Receiving: at least once, w/ checkpointing + WAL + reliable receivers
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
End-to-end semantics: at-least once
Fault-tolerance Semantics
Exactly-once receiving with the new Kafka Direct approach:
Treats Kafka like a replicated log, reads it like a file
Does not use receivers
No need to create multiple DStreams and union them
No need to enable Write Ahead Logs

val directKafkaStream = KafkaUtils.createDirectStream(...)

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
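Expanded slightly, a sketch of the direct stream under the Spark 1.3+ Kafka integration; the broker list and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  context, kafkaParams, Set("words"))  // each Kafka partition becomes an RDD partition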
Fault-tolerance Semantics
Exactly-once receiving with the new Kafka Direct approach:
Receiving: exactly once
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
End-to-end semantics: exactly once!
See also: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Fault-tolerance and Semantics
Performance and Stability
Monitoring and Upgrading
Achieving High Throughput
High throughput is achieved by sufficient parallelism at all stages of the pipeline.
Scaling the Receivers
Sources must be configured with parallel data streams:
#partitions in Kafka topics, #shards in Kinesis streams, …
The streaming app should have multiple receivers that receive the data streams in parallel:
multiple input DStreams, each running a receiver, can be unioned together to create one DStream.

val kafkaStream1 = KafkaUtils.createStream(...)
val kafkaStream2 = KafkaUtils.createStream(...)
val unionedStream = kafkaStream1.union(kafkaStream2)
Scaling the Receivers
There must be a sufficient number of executors to run all the receivers:
Absolute necessity: #cores > #receivers
Good rule of thumb: #executors > #receivers, so that there is no more than 1 receiver per executor and the network is not shared between receivers (see the sketch after this slide)
The Kafka Direct approach does not use receivers:
it automatically parallelizes data reading across executors; parallelism = #Kafka partitions
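Following the rule of thumb above, a common generalization of the two-stream union — numReceivers is a hypothetical knob, and the createStream arguments are elided as in the original:

// One receiver per input DStream; union into a single DStream for processing.
val numReceivers = 4  // hypothetical; keep #executors > #receivers
val kafkaStreams = (1 to numReceivers).map(_ => KafkaUtils.createStream(context, ...))
val unionedStream = context.union(kafkaStreams)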
Stability in Processing
For stability, the app must process data as fast as it is received.
Must ensure avg batch processing time < batch interval:
the previous batch is done by the time the next batch is received.
Otherwise, new batches keep queueing up waiting for previous batches to finish, and the scheduling delay goes up.
Reducing Batch Processing Times
More receivers!
Executors running receivers do a lot of the processing.
Repartition the received data to explicitly distribute the load:
unionedStream.repartition(40)
Set #partitions in shuffles, and make sure it’s large enough:
transformedStream.reduceByKey(reduceFunc, 40)
Get more executors and cores!
Reducing Batch Processing Times
Use Kryo serialization to reduce serialization costs:
register classes for best performance; see configurations spark.kryo.*
http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
Larger batch durations improve stability:
more data aggregated together amortizes the cost of shuffles
Limit the ingestion rate to handle data surges:
see configurations spark.streaming.*maxRate*
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
(A configuration sketch follows this list.)
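As one way to wire these up — the rate values and the MyEvent class are purely illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent]))              // MyEvent: hypothetical app class
  .set("spark.streaming.receiver.maxRate", "10000")          // max records/sec per receiver
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // for the Kafka Direct approach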
Speeding up Output Operations
Write to data stores efficiently.

foreach: inefficient

dataRDD.foreach { event =>
  // open connection
  // insert single event
  // close connection
}

foreachPartition: efficient

dataRDD.foreachPartition { partition =>
  // open connection
  // insert all events in partition
  // close connection
}

foreachPartition + connection pool: more efficient

dataRDD.foreachPartition { partition =>
  // initialize pool or get open connection from pool in executor
  // insert all events in partition
  // return connection to pool
}
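Fleshed out, the pooled variant might look like this sketch; ConnectionPool and its methods are hypothetical stand-ins for whatever client library you use:

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // ConnectionPool: a hypothetical lazily initialized, per-executor singleton
    val connection = ConnectionPool.getConnection()
    partition.foreach(event => connection.insert(event))  // hypothetical insert API
    ConnectionPool.returnConnection(connection)           // return for reuse instead of closing
  }
}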
Fault-tolerance and Semantics
Performance and Stability
Monitoring and Upgrading
Streaming in Spark Web UI
Stats over the last 1000 batches (new in Spark 1.4)
For stability:
scheduling delay should be approximately 0
processing time should be approximately < batch interval
Streaming in Spark Web UI
Details of individual batches, and details of the Spark jobs run in each batch
Operational Monitoring
Streaming app stats are published through Codahale metrics:
Ganglia sink, Graphite sink, custom Codahale metrics sinks
Can see long-term trends, across hours and days
Configure the metrics using $SPARK_HOME/conf/metrics.properties (sketch below)
Need to compile Spark with the Ganglia LGPL profile for Ganglia support
(see http://spark.apache.org/docs/latest/monitoring.html#metrics)
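For example, a metrics.properties sketch for a Graphite sink; the host, port, and period values are placeholders:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds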
Programmatic Monitoring
StreamingListener – developer interface to get internal events:
onBatchSubmitted, onBatchStarted, onBatchCompleted, onReceiverStarted, onReceiverStopped, onReceiverError
Take a look at StreamingJobProgressListener (private class) for inspiration
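A minimal sketch of a custom listener that logs per-batch delays, following the Spark 1.x StreamingListener API:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // Both delays are Options, since a batch may not have been scheduled or processed yet
    println(s"Batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms")
  }
}

context.addStreamingListener(new BatchStatsListener)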
Upgrading Apps
1. Shut down your current streaming app gracefully
   Will process all received data before shutting down cleanly
   streamingContext.stop(stopSparkContext = true, stopGracefully = true)
2. Update the app code and start it again
   Cannot upgrade from previous checkpoints if the code or the Spark version changes
Much to say I have ... but time I have not
Memory and GC tuning
Using SQLContext
DStream.transform operation
…
Refer to the online guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Thank you
May the stream be with you
