A Deep Dive into
Structured Streaming
Tathagata “TD” Das
@tathadas
Spark Summit 2016
Who am I?
Project Mgmt. Committee (PMC) member of Apache Spark
Started Spark Streaming in grad school - AMPLab, UC Berkeley
Software engineer at Databricks and involved with all things streaming in Spark
Streaming in Apache Spark
Spark Streaming changed how people write streaming apps
[Spark stack: SQL, Streaming, MLlib, GraphX on top of Spark Core]
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 50% of users consider it the most important part of Apache Spark
Streaming apps are
growing more complex
Streaming computations
don’t run in isolation
Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
[Diagram: an IoT event stream from Kafka feeds four downstream uses]
ETL into long-term storage
- Prevent data loss
- Prevent duplicates
Status monitoring
- Handle late data
- Aggregate on windows on event time
Interactively debug issues
- consistency
Anomaly detection
- Learn models offline
- Use online + continuous learning
Continuous Applications
Not just streaming any more
Pain points with DStreams
1. Processing with event-time, dealing with late data
- DStream API exposes batch time, hard to incorporate event-time
2. Interoperate streaming with batch AND interactive
- RDD/DStream have similar APIs, but still require translation
3. Reasoning about end-to-end guarantees
- Requires carefully constructing sinks that handle failures correctly
- Data consistency in the storage while being updated
Structured Streaming
The simplest way to perform streaming analytics
is not having to reason about streaming at all
New Model
[Diagram: with a trigger of every 1 sec, the Input table grows to "data up to 1", "data up to 2", "data up to 3", and the Query runs over it at each trigger]
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
- usual map/filter/reduce
- new window, session ops
New Model
[Diagram: at each trigger, the Query turns the growing Input table into an updated Result table, and output for data up to 1, 2, 3 is written to the sink]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows that changed in the result since the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
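To make the mapping concrete, here is a minimal sketch of a complete-mode query in Scala, written against the Spark 2.x API as it finally shipped (readStream / writeStream / start); the preview API shown later in these slides spells the same thing read…stream() and write…startStream(). The schema, paths and console sink are illustrative assumptions, not part of the talk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("new-model-sketch").getOrCreate()

// Input: an append-only table of JSON records arriving under "source-path"
// (streaming file sources need the schema up front)
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)
val input = spark.readStream.schema(schema).json("source-path")

// Query: ordinary DataFrame operations define the Result table
val counts = input.groupBy("device").count()

// Output: complete mode rewrites the whole Result table at every trigger
val query = counts.writeStream
  .outputMode("complete")
  .format("console")   // print each version of the Result table
  .start()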
API - Dataset/DataFrame
A single API for static, bounded data and streaming, unbounded data!
Batch ETL with DataFrames
input = spark.read
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.save("dest-path")
Read from JSON file
Select some devices
Write to Parquet file
Streaming ETL with DataFrames
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.startStream("dest-path")
Read from JSON file stream
Replace load() with stream()
Select some devices
Code does not change
Write to Parquet file stream
Replace save() with startStream()
Streaming ETL with DataFrames
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.startStream("dest-path")
read…stream() creates a streaming DataFrame, does not start any of the computation
write…startStream() defines where & how to output the data and starts the processing
Streaming ETL with DataFrames
[Diagram: the Input is an append-only table; the Result is recomputed at triggers 1, 2, 3; in append output mode only the new result rows of 2 and of 3 are written]
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.startStream("dest-path")
Continuous Aggregations
Continuously compute average signal across all devices:
input.avg("signal")
Continuously compute average signal of each type of device:
input.groupBy("device-type")
.avg("signal")
Continuous Windowed Aggregations
input.groupBy(
$"device-type",
window($"event-time-col", "10 min"))
.avg("signal")
Continuously compute average signal of each type of device in the last 10 minutes, using event-time
Simplifies event-time stream processing (not possible in DStreams)
Works on both streaming and batch jobs
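A minimal sketch of "works on both streaming and batch": the identical windowed aggregation applied first to a static DataFrame and then to a streaming one, using the released readStream spelling. The schema, column names and paths are assumptions.

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types._
import spark.implicits._   // assumes the `spark` session from the earlier sketch

val deviceSchema = new StructType()
  .add("device-type", StringType)
  .add("signal", DoubleType)
  .add("event-time-col", TimestampType)

// Batch: historical files
val batchAvg = spark.read.schema(deviceSchema).json("historic-path")
  .groupBy($"device-type", window($"event-time-col", "10 minutes"))
  .avg("signal")

// Streaming: the same query, unchanged, on a streaming source
val streamAvg = spark.readStream.schema(deviceSchema).json("source-path")
  .groupBy($"device-type", window($"event-time-col", "10 minutes"))
  .avg("signal")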
Joining streams with static data
kafkaDataset = spark.read
.kafka("iot-updates")
.stream()
staticDataset = spark.read
.jdbc("jdbc://", "iot-device-info")
joinedDataset =
kafkaDataset.join(
staticDataset, "device-type")
Join streaming data from Kafka with static data via JDBC to enrich the streaming data …
… without having to think that you are joining streaming data
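The Kafka reader above is the planned, pre-release API. As a hedged sketch with APIs that did ship, here is the same kind of enrichment: a static table read over JDBC and joined to the streaming input DataFrame from the earlier ETL example (assumed to carry a "device-type" column). The JDBC URL and table name are hypothetical.

// Static dimension table, read once over JDBC
val deviceInfo = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/iot")   // hypothetical connection
  .option("dbtable", "iot_device_info")
  .load()

// A stream-static join: written exactly like a batch-batch join
val enriched = input.join(deviceInfo, "device-type")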
Output Modes
Defines what is output every time there is a trigger
Different output modes make sense for different queries
input.select("device", "signal")
.write
.outputMode("append")
.format("parquet")
.startStream("dest-path")
Append mode with
non-aggregation queries
input.agg(count("*"))
.write
.outputMode("complete")
.format("parquet")
.startStream("dest-path")
Complete mode with
aggregation queries
Query Management
query = result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()
query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Each query has a unique name for keeping track
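A hedged sketch of running and tracking several named queries at once, in the released API spelling (queryName(), writeStream/start and spark.streams are not part of the preview API shown on this slide). result and avgSignals are assumed to be the DataFrames from the earlier examples.

val etlQuery = result.writeStream
  .queryName("iot_etl")
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "/checkpoints/iot_etl")  // required by the file sink
  .start("dest-path")

val monitorQuery = avgSignals.writeStream
  .queryName("iot_monitor")
  .format("memory")
  .outputMode("complete")
  .start()

// Both run concurrently; the manager tracks them by name
spark.streams.active.foreach(q => println(s"${q.name}: isActive=${q.isActive}"))

etlQuery.awaitTermination()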
Query Execution
Logically: Dataset operations on a table (i.e. as easy to understand as batch)
Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
Structured Streaming
High-level streaming API built on Datasets/DataFrames
Event time, windowing, sessions, sources & sinks
End-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models
What can you do with this that’s hard with other engines?
True unification
Same code + same super-optimized engine for everything
Flexible API tightly integrated with the engine
Choose your own tool - Dataset/DataFrame/SQL
Greater debuggability and performance
Benefits of Spark
in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …
Underneath the Hood
Batch Execution on Spark SQL
[Diagram: DataFrame/Dataset → Logical Plan, an abstract representation of the query]
Batch Execution on Spark SQL
[Diagram: the Planner takes a SQL AST / DataFrame / Dataset as an Unresolved Logical Plan, then Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
Helluva lot of magic!
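One concrete way to watch these stages for a query from the earlier examples: Dataset.explain(true) prints the parsed, analyzed, optimized and physical plans that Catalyst produced. Nothing here is specific to streaming; result is assumed to be the select/where DataFrame from the ETL example.

// `result` is the select/where DataFrame defined earlier
result.explain(true)   // extended = true: parsed, analyzed, optimized and physical plans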
Batch Execution on Spark SQL
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Execution Plan]
Run super-optimized Spark jobs to compute results
Project Tungsten - Phase 1 and 2
Code optimizations
- Bytecode generation
- JVM intrinsics, vectorization
- Operations on serialized data
Memory optimizations
- Compact and fast encoding
- Off-heap memory
Continuous Incremental Execution
The Planner knows how to convert streaming logical plans into a continuous series of incremental execution plans, each processing the next chunk of streaming data
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plans 1, 2, 3, 4, …]
Continuous Incremental Execution
The Planner polls the sources for new data, then incrementally executes it and writes the result to the sink
[Diagram: Incremental Execution 1 covers offsets [19-105], count 87; Incremental Execution 2 covers offsets [106-197], count 92]
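A hedged sketch of how this polling cadence is controlled in code: the trigger interval tells the planner how often to check the sources for new offsets and plan the next incremental execution. Trigger.ProcessingTime is the spelling from later releases (earlier versions pass ProcessingTime("10 seconds") directly); the console sink is just for illustration, and result is assumed from the earlier examples.

import org.apache.spark.sql.streaming.Trigger

val query = result.writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))  // poll sources / plan a new execution every 10s
  .start()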
Continuous Aggregations
Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance
State data is generated and used across incremental executions
[Diagram: Incremental Execution 1 (offsets [19-105]) leaves a running count of 87 in in-memory state; Incremental Execution 2 (offsets [106-179]) updates it to 87 + 92 = 179]
Fault-tolerance
All data and metadata in the system needs to be recoverable / replayable
[Diagram: Planner, source, sink, state, and the incremental executions]
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
[Diagram: offsets are written to the fault-tolerant WAL before each incremental execution]
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
[Diagram: if the Planner fails, the currently running incremental execution fails with it]
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
Reads the log to recover from failures, and re-executes the exact range of offsets
[Diagram: the restarted Planner reads the offsets back from the WAL and regenerates the same incremental executions]
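In the released API this WAL lives under a per-query checkpoint location. A hedged sketch, with an illustrative path: restarting the same query with the same checkpointLocation makes the planner re-read its offset log and regenerate the pending executions exactly as described above.

val query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "hdfs:///checkpoints/iot-etl")  // offset WAL + state store
  .start("dest-path")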
Fault-tolerance
Fault-tolerant Sources
Structured Streaming sources are by design replayable (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the Planner
[Diagram: the replayable source feeds the same offset ranges back into the incremental executions]
Fault-tolerance
Fault-tolerant State
Intermediate "state data" is maintained in versioned, key-value maps in Spark workers, backed by HDFS
The Planner makes sure the "correct version" of state is used to re-execute after a failure
[Diagram: state is fault-tolerant, backed by the WAL]
Fault-tolerance
Fault-tolerant Sink
Sinks are by design idempotent, and handle re-executions to avoid double-committing the output
[Diagram: the sink is idempotent by design]
offset tracking in WAL
+
state management
+
fault-tolerant sources and sinks
=
end-to-end
exactly-once
guarantees
Fast, fault-tolerant, exactly-once
stateful stream processing
without having to reason about streaming
Release Plan: Spark 2.0 [June 2016]
Basic infrastructure and API
- Event time, windows, aggregations
- Append and Complete output modes
- Support for a subset of batch queries
Source and sink
- Sources: Files (*Kafka coming soon after the 2.0 release)
- Sinks: Files and in-memory table
Experimental release to set the future direction
Not ready for production, but good to experiment with and provide feedback
Release Plan: Spark 2.1+
Stability and scalability
Support for more queries
- Multiple aggregations
- Sessionization
- More output modes
- Watermarks and late data
Sources and sinks
- Public APIs
ML integrations
Make Structured Streaming ready for production workloads as soon as possible
Stay tuned on our Databricks blogs for more information and examples on Structured Streaming
Try the latest version of Apache Spark and the preview of Spark 2.0
Try Apache Spark with Databricks
http://databricks.com/try
Structured Streaming
Making Continuous Applications
easier, faster, and smarter
Follow me @tathadas
AMA @
Databricks Booth
Today: Now - 2:00 PM
Tomorrow: 12:15 PM - 1:00 PM