A Deep Dive into
Structured Streaming
Tathagata “TD” Das
@tathadas
Spark Summit 2016
Who am I?
Project Mgmt. Committee (PMC) member of Apache Spark
Started Spark Streaming in grad school - AMPLab, UC Berkeley
Software engineer at Databricks and involved with all things streaming in Spark
Streaming in Apache Spark
Spark Streaming changed how people write streaming apps
[Spark stack: SQL, Streaming, MLlib, GraphX on top of Spark Core]
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 50% of users consider it the most important part of Apache Spark
Streaming apps are
growing more complex
Streaming computations
don’t run in isolation
Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
[Diagram: an IoT event stream from Kafka feeds four downstream uses]
ETL into long-term storage
- Prevent data loss
- Prevent duplicates
Status monitoring
- Handle late data
- Aggregate on windows on event time
Interactively debug issues
- consistency
Anomaly detection
- Learn models offline
- Use online + continuous learning
Continuous Applications
Not just streaming any more
Pain points with DStreams
1. Processing with event-time, dealing with late data
- DStream API exposes batch time, hard to incorporate event-time
2. Interoperate streaming with batch AND interactive
- RDD/DStream have similar APIs, but still require translation
3. Reasoning about end-to-end guarantees
- Requires carefully constructing sinks that handle failures correctly
- Data consistency in the storage while being updated
Structured Streaming
The simplest way to perform streaming analytics
is not having to reason about streaming at all
New Model
[Diagram: with a trigger of every 1 sec, the Input table grows to "data up to 1", "data up to 2", "data up to 3", and the Query runs over it at each trigger]
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
- usual map/filter/reduce
- new window, session ops
New Model
[Diagram: at each trigger, the Query turns the growing Input table into an updated Result table, and output for data up to 1, 2, 3 is written to the sink]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows that changed in the result since the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
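To make the mapping concrete, here is a minimal sketch of a complete-mode query in Scala, written against the Spark 2.x API as it finally shipped (readStream / writeStream / start); the preview API shown later in these slides spells the same thing read…stream() and write…startStream(). The schema, paths and console sink are illustrative assumptions, not part of the talk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("new-model-sketch").getOrCreate()

// Input: an append-only table of JSON records arriving under "source-path"
// (streaming file sources need the schema up front)
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)
val input = spark.readStream.schema(schema).json("source-path")

// Query: ordinary DataFrame operations define the Result table
val counts = input.groupBy("device").count()

// Output: complete mode rewrites the whole Result table at every trigger
val query = counts.writeStream
  .outputMode("complete")
  .format("console")   // print each version of the Result table
  .start()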
API - Dataset/DataFrame
A single API for static, bounded data and streaming, unbounded data!
Batch ETL with DataFrames
input = spark.read
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.save("dest-path")
Read from JSON file
Select some devices
Write to Parquet file
Streaming ETL with DataFrames
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.startStream("dest-path")
Read from JSON file stream
Replace load() with stream()
Select some devices
Code does not change
Write to Parquet file stream
Replace save() with startStream()
Streaming ETL with DataFrames
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.startStream("dest-path")
read…stream() creates a streaming DataFrame, does not start any of the computation
write…startStream() defines where & how to output the data and starts the processing
Streaming ETL with DataFrames
[Diagram: the Input is an append-only table; the Result is recomputed at triggers 1, 2, 3; in append output mode only the new result rows of 2 and of 3 are written]
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.startStream("dest-path")
Continuous Aggregations
Continuously compute average signal across all devices:
input.avg("signal")
Continuously compute average signal of each type of device:
input.groupBy("device-type")
.avg("signal")
Continuous Windowed Aggregations
input.groupBy(
$"device-type",
window($"event-time-col", "10 min"))
.avg("signal")
Continuously compute average signal of each type of device in the last 10 minutes, using event-time
Simplifies event-time stream processing (not possible in DStreams)
Works on both streaming and batch jobs
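A minimal sketch of "works on both streaming and batch": the identical windowed aggregation applied first to a static DataFrame and then to a streaming one, using the released readStream spelling. The schema, column names and paths are assumptions.

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types._
import spark.implicits._   // assumes the `spark` session from the earlier sketch

val deviceSchema = new StructType()
  .add("device-type", StringType)
  .add("signal", DoubleType)
  .add("event-time-col", TimestampType)

// Batch: historical files
val batchAvg = spark.read.schema(deviceSchema).json("historic-path")
  .groupBy($"device-type", window($"event-time-col", "10 minutes"))
  .avg("signal")

// Streaming: the same query, unchanged, on a streaming source
val streamAvg = spark.readStream.schema(deviceSchema).json("source-path")
  .groupBy($"device-type", window($"event-time-col", "10 minutes"))
  .avg("signal")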
Joining streams with static data
kafkaDataset = spark.read
.kafka("iot-updates")
.stream()
staticDataset = spark.read
.jdbc("jdbc://", "iot-device-info")
joinedDataset =
kafkaDataset.join(
staticDataset, "device-type")
Join streaming data from Kafka with static data via JDBC to enrich the streaming data …
… without having to think that you are joining streaming data
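The Kafka reader above is the planned, pre-release API. As a hedged sketch with APIs that did ship, here is the same kind of enrichment: a static table read over JDBC and joined to the streaming input DataFrame from the earlier ETL example (assumed to carry a "device-type" column). The JDBC URL and table name are hypothetical.

// Static dimension table, read once over JDBC
val deviceInfo = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/iot")   // hypothetical connection
  .option("dbtable", "iot_device_info")
  .load()

// A stream-static join: written exactly like a batch-batch join
val enriched = input.join(deviceInfo, "device-type")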
Output Modes
Defines what is output every time there is a trigger
Different output modes make sense for different queries
input.select("device", "signal")
.write
.outputMode("append")
.format("parquet")
.startStream("dest-path")
Append mode with
non-aggregation queries
input.agg(count("*"))
.write
.outputMode("complete")
.format("parquet")
.startStream("dest-path")
Complete mode with
aggregation queries
Query Management
query = result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()
query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Each query has a unique name for keeping track
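A hedged sketch of running and tracking several named queries at once, in the released API spelling (queryName(), writeStream/start and spark.streams are not part of the preview API shown on this slide). result and avgSignals are assumed to be the DataFrames from the earlier examples.

val etlQuery = result.writeStream
  .queryName("iot_etl")
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "/checkpoints/iot_etl")  // required by the file sink
  .start("dest-path")

val monitorQuery = avgSignals.writeStream
  .queryName("iot_monitor")
  .format("memory")
  .outputMode("complete")
  .start()

// Both run concurrently; the manager tracks them by name
spark.streams.active.foreach(q => println(s"${q.name}: isActive=${q.isActive}"))

etlQuery.awaitTermination()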
Query Execution
Logically: Dataset operations on a table (i.e. as easy to understand as batch)
Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
Structured Streaming
High-level streaming API built on Datasets/DataFrames
Event time, windowing, sessions, sources & sinks
End-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models
What can you do with this that’s hard with other engines?
True unification
Same code + same super-optimized engine for everything
Flexible API tightly integrated with the engine
Choose your own tool - Dataset/DataFrame/SQL
Greater debuggability and performance
Benefits of Spark
in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …
Underneath the Hood
Batch Execution on Spark SQL
[Diagram: DataFrame/Dataset → Logical Plan, an abstract representation of the query]
Batch Execution on Spark SQL
[Diagram: the Planner takes a SQL AST / DataFrame / Dataset as an Unresolved Logical Plan, then Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
Helluva lot of magic!
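One concrete way to watch these stages for a query from the earlier examples: Dataset.explain(true) prints the parsed, analyzed, optimized and physical plans that Catalyst produced. Nothing here is specific to streaming; result is assumed to be the select/where DataFrame from the ETL example.

// `result` is the select/where DataFrame defined earlier
result.explain(true)   // extended = true: parsed, analyzed, optimized and physical plans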
Batch Execution on Spark SQL
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Execution Plan]
Run super-optimized Spark jobs to compute results
Project Tungsten - Phase 1 and 2
Code optimizations
- Bytecode generation
- JVM intrinsics, vectorization
- Operations on serialized data
Memory optimizations
- Compact and fast encoding
- Off-heap memory
Continuous Incremental Execution
The Planner knows how to convert streaming logical plans into a continuous series of incremental execution plans, each processing the next chunk of streaming data
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plans 1, 2, 3, 4, …]
Continuous Incremental Execution
The Planner polls the sources for new data, then incrementally executes it and writes the result to the sink
[Diagram: Incremental Execution 1 covers offsets [19-105], count 87; Incremental Execution 2 covers offsets [106-197], count 92]
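A hedged sketch of how this polling cadence is controlled in code: the trigger interval tells the planner how often to check the sources for new offsets and plan the next incremental execution. Trigger.ProcessingTime is the spelling from later releases (earlier versions pass ProcessingTime("10 seconds") directly); the console sink is just for illustration, and result is assumed from the earlier examples.

import org.apache.spark.sql.streaming.Trigger

val query = result.writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))  // poll sources / plan a new execution every 10s
  .start()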
Continuous Aggregations
Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance
State data is generated and used across incremental executions
[Diagram: Incremental Execution 1 (offsets [19-105]) leaves a running count of 87 in in-memory state; Incremental Execution 2 (offsets [106-179]) updates it to 87 + 92 = 179]
Fault-tolerance
All data and metadata in the system needs to be recoverable / replayable
[Diagram: Planner, source, sink, state, and the incremental executions]
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
[Diagram: offsets are written to the fault-tolerant WAL before each incremental execution]
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
[Diagram: if the Planner fails, the currently running incremental execution fails with it]
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
Reads the log to recover from failures, and re-executes the exact range of offsets
[Diagram: the restarted Planner reads the offsets back from the WAL and regenerates the same incremental executions]
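In the released API this WAL lives under a per-query checkpoint location. A hedged sketch, with an illustrative path: restarting the same query with the same checkpointLocation makes the planner re-read its offset log and regenerate the pending executions exactly as described above.

val query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "hdfs:///checkpoints/iot-etl")  // offset WAL + state store
  .start("dest-path")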
Fault-tolerance
Fault-tolerant Sources
Structured Streaming sources are by design replayable (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the Planner
[Diagram: the replayable source feeds the same offset ranges back into the incremental executions]
Fault-tolerance
Fault-tolerant State
Intermediate "state data" is maintained in versioned, key-value maps in Spark workers, backed by HDFS
The Planner makes sure the "correct version" of state is used to re-execute after a failure
[Diagram: state is fault-tolerant, backed by the WAL]
Fault-tolerance
Fault-tolerant Sink
Sinks are by design idempotent, and handle re-executions to avoid double-committing the output
[Diagram: the sink is idempotent by design]
offset tracking in WAL
+
state management
+
fault-tolerant sources and sinks
=
end-to-end
exactly-once
guarantees
Fast, fault-tolerant, exactly-once
stateful stream processing
without having to reason about streaming
Release Plan: Spark 2.0 [June 2016]
Basic infrastructure and API
- Event time, windows, aggregations
- Append and Complete output modes
- Support for a subset of batch queries
Source and sink
- Sources: Files (*Kafka coming soon after the 2.0 release)
- Sinks: Files and in-memory table
Experimental release to set the future direction
Not ready for production, but good to experiment with and provide feedback
Release Plan: Spark 2.1+
Stability and scalability
Support for more queries
- Multiple aggregations
- Sessionization
- More output modes
- Watermarks and late data
Sources and sinks
- Public APIs
ML integrations
Make Structured Streaming ready for production workloads as soon as possible
Stay tuned on our Databricks blogs for more information and examples on Structured Streaming
Try the latest version of Apache Spark and the preview of Spark 2.0
Try Apache Spark with Databricks
http://databricks.com/try
Structured Streaming
Making Continuous Applications
easier, faster, and smarter
Follow me @tathadas
AMA @
Databricks Booth
Today: Now - 2:00 PM
Tomorrow: 12:15 PM - 1:00 PM