Exactly-Once Streaming from Kafka
Kafka is a circular buffer, not a message queue
• Fixed size, based on disk space or time
• Oldest messages deleted to maintain size
• Messages are otherwise immutable
• Split into topic/partition
• Indexed only by offset
• Client tracks read offset, not server
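As a rough illustration of the bullets above, here is a toy model of a single topic/partition as a bounded, offset-indexed log. `PartitionLog`, `capacity`, and the method names are hypothetical and only mimic the behaviour described; they are not Kafka APIs. Note that the log holds no reader position: a client has to remember which offset to ask for next.

```scala
// Toy model of one Kafka topic/partition: a bounded, offset-indexed log.
// PartitionLog and its members are illustrative names, not Kafka APIs.
final case class PartitionLog(capacity: Int,
                              firstOffset: Long = 0L,
                              messages: Vector[String] = Vector.empty) {
  // Offset that the next appended message will receive.
  def nextOffset: Long = firstOffset + messages.size

  // Append a message; once over capacity, the oldest messages are deleted.
  def append(msg: String): PartitionLog = {
    val all = messages :+ msg
    val overflow = math.max(0, all.size - capacity)
    copy(firstOffset = firstOffset + overflow, messages = all.drop(overflow))
  }

  // Read by offset only; None if the offset was deleted or not yet written.
  def read(offset: Long): Option[String] =
    if (offset < firstOffset || offset >= nextOffset) None
    else Some(messages((offset - firstOffset).toInt))
}

object PartitionLogDemo {
  // A log that retains only the 3 newest of 5 appended messages.
  val log: PartitionLog =
    Seq("a", "b", "c", "d", "e").foldLeft(PartitionLog(capacity = 3))(_ append _)
}
```

In this sketch, a client that still holds offset 0 after five appends gets nothing back, which is the toy equivalent of falling behind retention.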
Delivery semantics are your responsibility
At-most-once
1. Save offsets
2. !! Possible failure !!
3. Save results
On failure, restart at saved offset; messages are lost
At-least-once
1. Save results
2. !! Possible failure !!
3. Save offsets
On failure, messages are repeated
No magic config option can do better than this
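The two orderings can be compared with a small in-memory simulation. This is only a sketch: `State`, `run`, and `runWithRestart` are hypothetical names, the two "stores" are fields of one case class, and the failure is simulated by stopping between the two saves.

```scala
object DeliveryOrdering {
  // In-memory stand-ins for the offset store and the result store.
  final case class State(savedOffset: Int = 0, results: Vector[String] = Vector.empty)

  // Process messages one at a time, starting from the saved offset.
  // crashAt simulates a failure between the two saves for that offset.
  @annotation.tailrec
  def run(messages: Vector[String], state: State,
          offsetsFirst: Boolean, crashAt: Option[Int]): State = {
    val i = state.savedOffset
    if (i >= messages.size) state
    else {
      val withOffset = state.copy(savedOffset = i + 1)
      val withResult = state.copy(results = state.results :+ messages(i))
      if (crashAt.contains(i)) {
        // Only the first of the two saves happened before the failure.
        if (offsetsFirst) withOffset else withResult
      } else {
        // Both saves happened; move on to the next message.
        val both = State(i + 1, state.results :+ messages(i))
        run(messages, both, offsetsFirst, crashAt)
      }
    }
  }

  // Crash once at offset 1, then restart from whatever was saved.
  def runWithRestart(messages: Vector[String], offsetsFirst: Boolean): Vector[String] = {
    val afterCrash = run(messages, State(), offsetsFirst, crashAt = Some(1))
    run(messages, afterCrash, offsetsFirst, crashAt = None).results
  }
}
```

With messages a, b, c, saving offsets first drops b after the restart, while saving results first writes b twice.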

Idempotent exactly-once
1. Save results with a natural unique key
2. !! Possible failure !!
3. Save offsets
On failure, messages are repeated, but we don't care
Immutable messages, pure transformation, same results
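A minimal sketch of why repeats stop mattering when results are saved under a unique key. It assumes, purely for illustration, that the key is the message's (topic, partition, offset) and that the store is an in-memory `Map`; `Key`, `transform`, and `save` are hypothetical names.

```scala
object IdempotentSink {
  // Hypothetical unique key: the message's position in Kafka.
  final case class Key(topic: String, partition: Int, offset: Long)

  // Pure transformation of an immutable message: same input, same output.
  def transform(msg: String): String = msg.toUpperCase

  // Upsert by key: saving the same message twice still leaves one row.
  def save(store: Map[Key, String], batch: Seq[(Key, String)]): Map[Key, String] =
    batch.foldLeft(store) { case (acc, (key, msg)) => acc.updated(key, transform(msg)) }

  // Example batch, replayed after a simulated failure.
  val batch: Seq[(Key, String)] =
    Seq(Key("sometopic", 0, 0L) -> "a", Key("sometopic", 0, 1L) -> "b")
  val savedOnce: Map[Key, String] = save(Map.empty, batch)
  val savedTwice: Map[Key, String] = save(savedOnce, batch)
}
```

Here `savedOnce` and `savedTwice` are equal, so a replayed batch leaves the store unchanged.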
Idempotent pros / cons
Pro:
• Simple
• Works well for shape-preserving transformations (map)
Con:
• May be hard to identify natural unique key
• Especially hard for aggregate transformations (fold)
• Won't work for destructive updates
Note:
• Results and offsets may be in different data stores
Transactional exactly-once
1. Begin transaction
2. Save results
3. Save offsets
4. Ensure offsets are ok (increasing without gaps)
5. Commit transaction
On failure, roll back; results and offsets remain in sync
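The steps above can be sketched with an immutable in-memory store, where "commit" returns a new store and "rollback" means keeping the old one. `Store`, `Batch`, and `commit` are hypothetical names, and the running total stands in for any destructive update; a real implementation would use the transaction support of an actual data store.

```scala
object TransactionalSink {
  // Results and offsets live in the same store so they commit together.
  final case class Store(total: Long = 0L, nextOffset: Long = 0L)

  // One batch of values covering offsets [fromOffset, untilOffset).
  final case class Batch(fromOffset: Long, untilOffset: Long, values: Seq[Long])

  // Right(new store) on commit, Left(reason) on rollback.
  // Store is immutable, so on rollback the caller simply keeps the old value.
  def commit(store: Store, batch: Batch): Either[String, Store] =
    if (batch.fromOffset != store.nextOffset)
      Left(s"offset gap or repeat: expected ${store.nextOffset}, got ${batch.fromOffset}")
    else if (batch.untilOffset < batch.fromOffset)
      Left("offsets must be increasing")
    else
      Right(Store(store.total + batch.values.sum, batch.untilOffset))
}
```

In this sketch a replayed batch is rejected by the offset check, so the total is never incremented twice for the same messages.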
Transactional pros / cons
Pro:
• Works easily for any transformation
• Destructive updates ok
Con:
• More complex
• Requires a transactional data store
Note:
• Results and offsets must be in same data store

Receiver-based stream pros / cons
Pro:
• WAL design could work with non-Kafka data sources
Con:
• Long running receivers = parallelism awkward and costly
• Duplication of write operations
• Dependent on HDFS
• Must use idempotence for exactly-once
• No access to offsets, can't use transactional approach
Direct stream pros / cons
Pro:
• Spark partition 1:1 Kafka topic/partition, cheap parallelism
• No duplicate writes
• No dependency on HDFS
• Access to offsets, can use idempotent or transactional
Con:
• Specific to Kafka
• Need enough Kafka disk (OffsetOutOfRange is your fault)

Don’t care about semantics?
How about server cost?
Basic direct stream API
val stream: InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    streamingContext,
    // kafka config parameters
    Map(
      "" -> "localhost:9092,anotherhost:9092",
      "auto.offset.reset" -> "largest"
    ),
    // set of topics to consume
    Set("sometopic", "anothertopic")
  )
Basic direct stream API semantics
auto.offset.reset -> largest:
• Starts at latest offset, thus losing data
• Not at-most-once (need to set maxFailures as well)
auto.offset.reset -> smallest:
• Starts at earliest offset
• At-least-once, but replays whole log
If you want finer-grained control, you must store offsets
Where to store offsets
Easy - Spark Checkpoint:
• No need to access offsets, automatically used on restart
• Must use idempotent, not transactional
• Checkpoints may not be recoverable
Complex - Your own data store:
• Must access offsets, save them, and provide on (re)start
• Idempotent or transactional
• Offsets are just as recoverable as your results

Spark checkpoint
// Direct copy-paste from the docs, same as any other spark checkpoint
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...) // new context
  val stream = KafkaUtils.createDirectStream(...) // setup DStream
  ...
  ssc.checkpoint(checkpointDirectory) // set checkpoint directory
  ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(
  checkpointDirectory, functionToCreateContext _)
Your own data store
val stream: InputDStream[Int] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Int](
    streamingContext,
    Map("" -> "localhost:9092,anotherhost:9092"),
    // map of starting offsets, would be read from your storage in real code
    Map(TopicAndPartition("someTopic", 0) -> someStartingOffset,
        TopicAndPartition("anotherTopic", 0) -> anotherStartingOffset),
    // message handler to get desired value from each message and metadata
    (mmd: MessageAndMetadata[String, String]) => mmd.message.length
  )
Accessing offsets, per message
val messageHandler =
  (mmd: MessageAndMetadata[String, String]) =>
    (mmd.topic, mmd.partition, mmd.offset, mmd.key, mmd.message)
Your message handler has full access to all of the metadata.
Saving offsets per message may not be efficient, though.
Accessing offsets, per batch
stream.foreachRDD { rdd =>
  // Cast the rdd to an interface that lets us get an array of OffsetRange
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.someTransformationUsingSparkMethods
  ...
  // Your save method. Note that this runs on the driver
  mySaveBatch(offsetRanges, results)
}

Accessing offsets, per partition
stream.foreachRDD { rdd =>
  // Cast the rdd to an interface that lets us get an array of OffsetRange
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // index to get the correct offset range for the rdd partition we're working on
    val offsetRange: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    val perPartitionResult = iter.someTransformationUsingScalaMethods
    // Your save method. Note this runs on the executors.
    mySavePartition(offsetRange, perPartitionResult)
  }
}
Be aware of partitioning
Safe because KafkaRDD partition is 1:1 with Kafka partition:
rdd.foreachPartition { iter =>
  val offsetRange: OffsetRange = offsetRanges(TaskContext.get.partitionId)
  ...
}
Not safe because there is a shuffle, so no longer 1:1:
rdd.reduceByKey(...).foreachPartition { iter =>
  val offsetRange: OffsetRange = offsetRanges(TaskContext.get.partitionId)
  ...
}
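To see why a shuffle breaks the index lookup, here is a toy model in plain Scala collections rather than Spark. `direct`, `shuffle`, and `oneToOne` are hypothetical names, and the "shuffle" just routes records by key hash, roughly as a hash partitioner would. Each record is tagged with the Kafka partition it was read from.

```scala
object ShuffleDemo {
  type Partitions = Vector[Vector[(Int, String)]]

  // (kafkaPartition, key) pairs. Before any shuffle, partition i of the data
  // holds exactly the records read from Kafka partition i.
  val direct: Partitions = Vector(
    Vector((0, "x"), (0, "y")),
    Vector((1, "x"), (1, "y")))

  // Toy shuffle: route each record to a partition chosen by its key's hash.
  def shuffle(parts: Partitions): Partitions = {
    val n = parts.size
    val flat = parts.flatten
    Vector.tabulate(n) { p =>
      flat.filter { case (_, key) => math.abs(key.hashCode % n) == p }
    }
  }

  // True when every partition i contains only records from Kafka partition i,
  // which is what indexing offsetRanges by partition id relies on.
  def oneToOne(parts: Partitions): Boolean =
    parts.zipWithIndex.forall { case (records, i) => records.forall(_._1 == i) }
}
```

After the toy shuffle each partition mixes records from both Kafka partitions, so an offset range picked by partition id no longer describes the data being saved.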
