Stream Processing Overview
- 8. • Architecture for Stream and CEP processing
• Input from buses and SCATS sensors
• Use of crowdsourcing to resolve data source unreliability
• Dataset of 13GB from Dublin city
- 12. • 6 Billion records per day
• 160 Million customers
• Detect duplicates in a 15 day window
• Records can’t be lost
• Solution: InfoSphere Streams
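The duplicate-detection requirement above can be illustrated with a minimal sketch (illustrative names only, not the InfoSphere Streams implementation): remember the last-seen timestamp per record id and flag any record that reappears inside the window, evicting aged-out entries to bound memory.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of duplicate detection over a sliding time window,
// e.g. a 15-day CDR deduplication window. Names are illustrative.
public class DedupWindow {
    private final long windowMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public DedupWindow(long windowMillis) { this.windowMillis = windowMillis; }

    // True if this record id was already seen inside the window.
    public boolean isDuplicate(String id, long timestamp) {
        Long prev = lastSeen.get(id);
        lastSeen.put(id, timestamp);               // remember most recent sighting
        return prev != null && timestamp - prev <= windowMillis;
    }

    // Periodically drop entries that fell out of the window to bound memory.
    public void evictOlderThan(long now) {
        lastSeen.values().removeIf(t -> now - t > windowMillis);
    }
}
```

At 6 billion records per day a plain hash map would not fit in memory; a production system would shard the key space across nodes or use a probabilistic structure, but the windowing logic is the same.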
- 14. Dashboards:
Number of terminated calls by category in the last hour
Call termination reason for enterprise customers in the last hour
- 16. • 1.4 Million consumers
• Demand Response Optimization
1. Peak demand forecasting
2. Effective response selection
• Data source: AMIs (Advanced Metering Infrastructure)
• 3TB of data per day
- 19. • Detection of events: earthquakes, typhoons, etc.
• Twitter users as sensors
• Location estimation: Kalman and particle filtering
• Detects 96% of earthquakes reported by the Japan Meteorological Agency
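Kalman filtering for location estimation can be sketched in one dimension. This is a generic scalar Kalman filter, not the paper's model; the noise parameters are illustrative assumptions.

```java
// Minimal 1-D Kalman filter: smooths a series of noisy scalar
// measurements (e.g. one coordinate of a reported event location).
public class Kalman1D {
    private double x;       // state estimate (e.g. latitude)
    private double p;       // variance of the estimate
    private final double q; // process noise (illustrative)
    private final double r; // measurement noise (illustrative)

    public Kalman1D(double init, double q, double r) {
        x = init; p = 1.0; this.q = q; this.r = r;
    }

    // Incorporate one noisy measurement z and return the new estimate.
    public double update(double z) {
        p += q;                  // predict: uncertainty grows over time
        double k = p / (p + r);  // Kalman gain: trust in the measurement
        x += k * (z - x);        // correct the estimate toward z
        p *= (1 - k);            // uncertainty shrinks after the correction
        return x;
    }
}
```

Feeding it a stream of location reports converges on a stable estimate even when individual reports are noisy, which is the role location estimation plays in the event-detection pipeline above.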
- 21. Other Applications
• Fraud detection
• Process control in manufacturing
• Surveillance systems
• CDR processing
• Healthcare monitoring
- 26. They need to process…
large volumes of data
in real-time
continuously
producing actionable information
- 63. Parsing/Filtering/ETL
Aggregation: collection and summarization of tuples
Merging: combining of streams with different schemas
Splitting: partitioning of a stream into multiple ones for data/task parallelism or some logical reason
Data mining/Machine Learning/NLP: spam filtering, fraud detection, recommendation systems, data stream clustering, sentiment analysis
Others: relational algebra, artificial intelligence, and other custom operations
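Two of the operator types above, splitting and aggregation, can be sketched in a few lines (plain Java, illustrative names): hash-partition tuples by key for data parallelism, then summarize each partition with counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a splitting operator (hash partitioning for data
// parallelism) and an aggregation operator (per-key counts).
public class SplitAggregate {
    // Split a stream of tuples into `partitions` sub-streams by key hash,
    // so that all tuples with the same key land in the same partition.
    public static List<List<String>> split(List<String> stream, int partitions) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) out.add(new ArrayList<>());
        for (String t : stream)
            out.get(Math.floorMod(t.hashCode(), partitions)).add(t);
        return out;
    }

    // Aggregate: collect and summarize tuples as occurrence counts.
    public static Map<String, Integer> aggregate(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : partition) counts.merge(t, 1, Integer::sum);
        return counts;
    }
}
```

Because the split is key-based, each partition can be aggregated independently and the partial results merged, which is exactly why splitting enables data parallelism.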
- 65. Traditional | Data Stream
Distributed: No | Yes
Type of Result: Accurate | Approximate
Memory Usage: Unlimited | Restricted
Processing Time: Unlimited | Restricted
No. of Passes: Multiple | Single
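The single-pass, restricted-memory column has a classic concrete instance: a running mean, computed incrementally in O(1) memory with one pass over the stream. (The mean happens to be exact; harder queries under the same constraints yield the approximate results the table mentions.)

```java
// Single-pass, constant-memory aggregate: the incremental mean.
// Contrast with the traditional model, which could buffer all data
// and take multiple passes.
public class RunningMean {
    private long n;
    private double mean;

    public void add(double x) {
        n++;
        mean += (x - mean) / n;  // incremental update, no buffering
    }

    public double mean() { return mean; }
}
```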
- 67. Sampling: classification, query estimation, order statistics estimation, distinct value queries
Wavelets: hierarchical decomposition and summarization
Clustering: knowledge discovery
Sketches: distinct count, heavy hitters, quantiles, change detection
Histograms: range queries, selectivity estimation
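The simplest of these synopsis techniques to sketch is sampling. Reservoir sampling maintains a uniform random sample of fixed size k over an unbounded stream in one pass (the fixed seed below is only for reproducibility):

```java
import java.util.Random;

// Reservoir sampling: after n items, each item is in the k-slot
// sample with probability k/n, using one pass and O(k) memory.
public class Reservoir {
    private final double[] sample;
    private long seen;
    private final Random rnd = new Random(42); // fixed seed for reproducibility

    public Reservoir(int k) { sample = new double[k]; }

    public void add(double x) {
        seen++;
        if (seen <= sample.length) {
            sample[(int) seen - 1] = x;          // fill phase: keep first k items
        } else {
            long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
            if (j < sample.length) sample[(int) j] = x; // replace w.p. k/seen
        }
    }

    public double[] sample() { return sample; }
}
```

Queries such as order-statistics or selectivity estimation are then answered on the small sample instead of the unbounded stream.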
- 73. The physical view displays the component instances and their location in the cluster
- 78. one or more data streams
among different operators.
- 89. If an ack is not received within a certain amount of time, the event is considered lost
- 104. sends the messages in the output queue and asks the upstream nodes for the messages it has not seen.
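The upstream-backup idea can be sketched with an output queue keyed by sequence number (illustrative names, not any system's actual API): the upstream node buffers every sent message, a recovering downstream node reports the last sequence number it saw, and the upstream replays everything after it; acknowledged prefixes are trimmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of upstream backup: buffer sent messages until the
// downstream acknowledges them; replay on request after a failure.
public class UpstreamBackup {
    private final NavigableMap<Long, String> outputQueue = new TreeMap<>();
    private long nextSeq = 0;

    // Buffer the message under its sequence number and "send" it.
    public long send(String msg) { outputQueue.put(nextSeq, msg); return nextSeq++; }

    // Recovery: replay every buffered message after the last one seen.
    public List<String> replayAfter(long lastSeenSeq) {
        return new ArrayList<>(outputQueue.tailMap(lastSeenSeq, false).values());
    }

    // Ack: the downstream has durably processed everything up to seq.
    public void ack(long seq) { outputQueue.headMap(seq, true).clear(); }
}
```

The trade-off is memory on the upstream node: the queue grows until acks arrive, which is why upstream backup pairs naturally with the ack mechanism described earlier.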
- 107. Upon a failure, the component is restored to the previous consistent state
- 110. It is simple to implement, but hard to guarantee the consistency of the whole system
- 111. Coordinated protocols, on the other hand, organize the checkpoint moments between components
- 112. This ensures the consistency of the whole system, at the cost of a more complex and costly protocol
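The checkpoint/rollback mechanism behind both variants can be sketched for a single stateful operator (illustrative, ignoring the inter-component coordination that distinguishes the two protocol families): snapshot the state periodically, and on failure restore the last snapshot and replay the tuples processed since.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of periodic checkpointing for one stateful operator
// (a per-key counter). Coordination between operators is omitted.
public class CheckpointingCounter {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> checkpoint = new HashMap<>();

    public void process(String key) { state.merge(key, 1, Integer::sum); }

    // Snapshot the current state (would be persisted in practice).
    public void takeCheckpoint() { checkpoint = new HashMap<>(state); }

    // Rollback: restore the previous consistent state after a failure.
    public void restore() { state = new HashMap<>(checkpoint); }

    public int get(String key) { return state.getOrDefault(key, 0); }
}
```

Tuples processed after the snapshot are lost on restore, so the recovery protocol must replay them, e.g. from an upstream backup queue.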
- 119. [Timeline of stream processing systems, 2000–2014:]
NiagaraCQ, Cougar, TelegraphCQ, STREAM, Aurora/Medusa, StreamBase, Borealis, BusinessEvents, Oracle CEP, InfoSphere Streams, Stream Mill, Granules, S4, Storm, Samza, Spark Streaming, MillWheel, TimeStream, Flink Streaming
- 121. Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
Finally, the processed results of the RDD operations are returned in batches
- 122. Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Batch sizes as low as ½ second, latency ~ 1 second
Potential for combining batch processing and stream processing in the same system
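The discretization idea can be sketched without Spark (plain Java, illustrative): group timestamped events into fixed-length batches, then run an ordinary batch computation on each batch in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.Function;

// Sketch of discretized stream processing: chop a timestamped
// stream into intervals and apply a batch job per interval.
public class Discretizer {
    // events: {timestampMillis, value} pairs; batchJob: any batch computation.
    public static <R> List<R> run(List<long[]> events, long intervalMillis,
                                  Function<List<Long>, R> batchJob) {
        TreeMap<Long, List<Long>> batches = new TreeMap<>();
        for (long[] e : events)  // assign each event to its interval's batch
            batches.computeIfAbsent(e[0] / intervalMillis, k -> new ArrayList<>()).add(e[1]);
        List<R> results = new ArrayList<>();
        for (List<Long> batch : batches.values()) // one small batch job per interval
            results.add(batchJob.apply(batch));
        return results;
    }
}
```

Because each interval is an ordinary batch, any batch operation (sum, count, join) can be reused unchanged, which is the "combining batch and stream processing" point above.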
- 123. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the tweets DStream as batches at t, t+1, t+2, each stored in memory as an immutable, distributed RDD, fed by the Twitter Streaming API]
- 124. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied batch-by-batch to the tweets DStream yields the hashTags DStream; new RDDs (e.g. [#cat, #dog, …]) are created for every batch]
- 125. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: push data to external storage
[Diagram: flatMap then save on each batch; every batch of the hashTags DStream is saved to HDFS]
- 126. Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")
Function object to define the transformation
- 127. Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data
Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
Data lost due to worker failure can be recomputed from the input data
- 128. Key concepts
DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
- 129. Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.countByValue()
- 130. Example 3 – Count the hashtags over the last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
sliding window operation: Minutes(10) is the window length, Seconds(1) the sliding interval
- 131. Example 3 – Counting the hashtags over the last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
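What `window(...).countByValue()` computes can be sketched in plain Java (illustrative, not Spark's implementation): count each tag among the timestamped events that fall inside the window ending at `now`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a sliding-window countByValue over timestamped events.
public class WindowCount {
    // events: (timestampMillis, tag) pairs; counts tags in (now - window, now].
    public static Map<String, Long> countByValue(List<Map.Entry<Long, String>> events,
                                                 long now, long windowMillis) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<Long, String> e : events)
            if (now - e.getKey() < windowMillis)   // keep only events inside the window
                counts.merge(e.getValue(), 1L, Long::sum);
        return counts;
    }
}
```

Re-evaluating this once per sliding interval (here, every second) reproduces the behavior of the DStream window operation, albeit naively; Spark instead reuses per-batch partial counts.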
- 139. Storm cluster
[Diagram: master node running Nimbus (process) with the scheduler; worker nodes 1…n each run a Supervisor with worker slots, where each worker is a process containing executors]
Nimbus assigns work to supervisors, manages failures, and monitors resource usage.
- 140. Storm cluster
[Diagram: master node with Nimbus and scheduler; worker nodes with supervisors and slots]
The number of slots of a supervisor is the maximum number of workers it can execute.
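The scheduling idea can be illustrated with a round-robin sketch (illustrative only; Storm's actual scheduler is more involved): enumerate the free slots reported by the supervisors and distribute a topology's executors over them, wrapping around when there are more executors than slots.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Round-robin assignment of executors to supervisor slots,
// as an illustration of what a Nimbus-style scheduler decides.
public class RoundRobinScheduler {
    // slotsPerSupervisor[i] = free slots on supervisor i.
    // Returns executor index -> "supervisor:slot".
    public static Map<Integer, String> assign(int executors, int[] slotsPerSupervisor) {
        List<String> slots = new ArrayList<>();
        for (int s = 0; s < slotsPerSupervisor.length; s++)
            for (int k = 0; k < slotsPerSupervisor[s]; k++)
                slots.add(s + ":" + k);
        Map<Integer, String> assignment = new LinkedHashMap<>();
        for (int e = 0; e < executors; e++)
            assignment.put(e, slots.get(e % slots.size())); // wrap: slots host many executors
        return assignment;
    }
}
```

When executors outnumber slots, several executors share one worker process, which mirrors the executor/worker/slot hierarchy in the diagram.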
- 144. Platform comparison: Storm | Storm Trident | Spark Streaming | Samza | S4
Processing Model: Record-at-a-time | Micro-batches | Micro-batches | Record-at-a-time | Record-at-a-time
Programming Model: DAG | DAG | Monad | DAG | Actors
Stream Partitioning: Yes | Yes | Yes | Yes | Yes
Rebalancing: Yes | Yes | No | No | Yes
Dynamic Cluster: Yes | Yes | Yes | Yes | No
Resource Management: Standalone, YARN, Mesos | Standalone, YARN, Mesos | Standalone, YARN, Mesos | YARN, Mesos | Standalone
Coordination: Zookeeper | Zookeeper | Built-in | Built-in | Zookeeper
Programming Language: Java, any (via Thrift) | Java, any (via Thrift) | Java, Scala, Python | JVM languages | Java
- 145. Platform comparison (continued): Storm | Storm Trident | Spark Streaming | Samza | S4
Implementation Language: Java, Clojure | Java | Scala, Java | Scala, Java | Java, Groovy
Built-in Operators: No | Yes | Yes | No | No
Deterministic: - | - | Yes | - | -
Message System: Netty | Netty | Netty, Akka | Kafka | Netty
Data Mobility: Pull | Pull | - | Pull | Push
Delivery Guarantees: At-most-once, At-least-once | At-most-once, At-least-once, Exactly-once | Exactly-once | Exactly-once | At-most-once
Fault Tolerance: Rollback recovery using upstream backup | - | Coordinated periodic checkpoint, replication, parallel recovery | Rollback recovery | Uncoordinated periodic checkpoint
Dynamic Graph: No | No | No | Yes | Yes
Persistent State: No | Yes | Yes | Yes | Yes
- 160. Datasets
Number of Nodes
Application 1 2 4 8
word-count 4GB 8GB 16GB 26GB
log-processing 15GB 30GB 60GB 120GB
traffic-monitoring 4GB 8GB* 16GB* 32GB*
machine-outlier 4GB 9GB 18GB 36GB
spam-filter 4GB* 8GB* 16GB* 32GB*
sentiment-analysis 7GB 15GB 30GB 60GB
trending-topics 7GB 15GB 30GB 60GB
click-analytics 15GB 30GB 60GB 120GB
fraud-detection 4GB† 8GB† 16GB† 32GB†
spike-detection 4GB* 8GB* 16GB* 32GB*
*replicated †generated
- 161. Parallelism
Each operator row lists a (base, multipliers) pair for four configurations: 1:1, Best, Best (only source), Best (max mem)
Application / Operator:
word-count
source 1 1...6 1 1...3 1 2, 4, 8 3 1
splitter 1 1...6 5 1...3 5 1 5 1
counter 1 1...6 6 1...3 6 1 6 1, 2
sink 1 1...6 3 1...3 3 1 3 1
log-processing
source 1 1...6 4 1...3 1 1, 2, 8 4 1
status-counter 1 1...6 1 1...3 1 1 1 1
volume-counter 1 1...6 2 1...3 2 1 2 1
geo-locator 1 1...6 4 1...3 4 1 4 1, 2
geo-summarizer 1 1...6 2 1...3 2 1 2 1
sink 1 1...6 4 1...3 4 1 4 1
traffic-monitoring
source 1 1...6 1 1...3 1 2, 4, 8 1 1
map-matcher 1 1...6 2 1...3 2 1 2 1, 2
speed-calculator 1 1...6 2 1...3 2 1 2 1, 2
sink 1 1...6 1 1...3 1 1 1 1
machine-outlier
source 1 1...6 6 1...3 1 1, 2, 4, 8 - -
scorer 1 1...6 1 1...3 1 1 - -
anomaly-scorer 1 1...6 1 1...3 1 1 - -
alert-trigger 1 1...6 4 1...3 4 1 - -
sink 1 1...6 1 1...3 1 1 - -
spam-filter
source 1 1...6 1 1...3 1 2, 4, 8 1 1
tokenizer 1 1...6 10 1...3 10 1 10 1, 2
word-probability 1 1...6 1 1...3 1 1 1 1
bayes-rule 1 1...6 1 1...3 1 1 1 1
sink 1 1...6 1 1...3 1 1 1 1
- 162. Parallelism (continued); each operator row lists (base, multipliers) pairs for: 1:1, Best, Best (only source), Best (max mem)
Application / Operator:
sentiment-analysis
source 1 1...6 1 2, 4, 8
tweet-filter 1 1...6 1 1
text-filter 1 1...6 1 1
stemmer 1 1...6 1 1
positive-scorer 1 1...6 1 1
negative-scorer 1 1...6 1 1
joiner 1 1...6 1 1
scorer 1 1...6 1 1
sink 1 1...6 1 1
trending-topics
source 1 1...6 9 1...3 1 1, 2, 4, 8 9 1
topic-extractor 1 1...6 2 1...3 2 1 2 1
counter 1 1...6 1 1...3 1 1 1 2, 4
intermediate-ranker 1 1...6 1 1...3 1 1 1 1
total-ranker 1 1...6 1 1...3 1 1 1 1
sink 1 1...6 1 1...3 1 1 1 1
click-analytics
source 1 1...6 2 1...3 2 2, 4, 8 2 1
repeat-visits 1 1...6 2 1...3 2 1 2 1
total-visits 1 1...6 2 1...3 2 1 2 1
geo-locator 1 1...6 5 1...3 5 1 5 2, 4
geo-summarizer 1 1...6 1 1...3 1 1 1 1
sink-visits 1 1...6 1 1...3 1 1 1 1
sink-locations 1 1...6 1 1...3 1 1 1 1
fraud-detection
source 1 1...6 8 1...3 1 1, 2, 4 8 1
predictor 1 1...6 3 1...3 3 1 3 2, 4
sink 1 1...6 2 1...3 2 1 2 1
spike-detection
source 1 1...6 7 1...3 1 1, 2, 4, 8 7 1
moving-average 1 1...6 3 1...3 3 1 3 2, 4
spike-detector 1 1...6 2 1...3 2 1 2 1
sink 1 1...6 1 1...3 1 1 1 1
- 171. HDD Read/Write – Kafka Broker
[Chart: throughput in MBytes/sec (0–90) for SDD_READ, SDD_WRITE, SDB_READ, SDB_WRITE]
Heinze, Thomas, et al. "Tutorial: Cloud-based Data Stream Processing." 2014.
Artikis, Alexander, Matthias Weidlich, Francois Schnitzler, Ioannis Boutsis, Thomas Liebig, Nico Piatkowski, Christian Bockermann, et al. "Heterogeneous Stream Processing and Crowdsourcing for Urban Traffic Management." In EDBT, pp. 712-723, 2014.
Bouillet, Eric, et al. "Processing 6 billion CDRs/day: from research to production (experience report)." In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, ACM, 2012.
Lakshmanan, G. T., Li, Y., and Strom, R. "Placement strategies for internet-scale data stream systems." IEEE Internet Computing 12(6), pp. 50-60, 2008.
Simmhan, Yogesh, et al. "An informatics approach to demand response optimization in smart grids." 2011.
Sakaki, Takeshi, Makoto Okazaki, and Yutaka Matsuo. "Earthquake shakes Twitter users: real-time event detection by social sensors." In Proceedings of the 19th International Conference on World Wide Web, ACM, 2010.