Stream Processing Overview
- 8. • Architecture for Stream and CEP processing
• Input from buses and SCATS sensors
• Use of crowdsourcing to resolve data source unreliability
• Dataset of 13GB from Dublin city
- 12. • 6 Billion records per day
• 160 Million customers
• Detect duplicates in a 15 day window
• Records can’t be lost
• Solution: InfoSphere Streams
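The duplicate-detection requirement above can be illustrated with a minimal sketch (illustrative names only, not the InfoSphere Streams implementation): remember the last-seen timestamp per record id and flag any record that reappears inside the window, evicting aged-out entries to bound memory.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of duplicate detection over a sliding time window,
// e.g. a 15-day CDR deduplication window. Names are illustrative.
public class DedupWindow {
    private final long windowMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public DedupWindow(long windowMillis) { this.windowMillis = windowMillis; }

    // True if this record id was already seen inside the window.
    public boolean isDuplicate(String id, long timestamp) {
        Long prev = lastSeen.get(id);
        lastSeen.put(id, timestamp);               // remember most recent sighting
        return prev != null && timestamp - prev <= windowMillis;
    }

    // Periodically drop entries that fell out of the window to bound memory.
    public void evictOlderThan(long now) {
        lastSeen.values().removeIf(t -> now - t > windowMillis);
    }
}
```

At 6 billion records per day a plain hash map would not fit in memory; a production system would shard the key space across nodes or use a probabilistic structure, but the windowing logic is the same.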
- 14. Dashboards:
Number of terminated calls by category in the last hour
Call termination reason for enterprise customers in the last hour
- 16. • 1.4 Million consumers
• Demand Response Optimization
1. Peak demand forecasting
2. Effective response selection
• Data source: AMIs (Advanced Metering Infrastructure)
• 3TB of data per day
- 19. • Detection of events: earthquakes, typhoons, etc.
• Twitter users as sensors
• Location estimation: Kalman and particle filtering
• Detects 96% of earthquakes reported by the Japan Meteorological Agency
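Kalman filtering for location estimation can be sketched in one dimension. This is a generic scalar Kalman filter, not the paper's model; the noise parameters are illustrative assumptions.

```java
// Minimal 1-D Kalman filter: smooths a series of noisy scalar
// measurements (e.g. one coordinate of a reported event location).
public class Kalman1D {
    private double x;       // state estimate (e.g. latitude)
    private double p;       // variance of the estimate
    private final double q; // process noise (illustrative)
    private final double r; // measurement noise (illustrative)

    public Kalman1D(double init, double q, double r) {
        x = init; p = 1.0; this.q = q; this.r = r;
    }

    // Incorporate one noisy measurement z and return the new estimate.
    public double update(double z) {
        p += q;                  // predict: uncertainty grows over time
        double k = p / (p + r);  // Kalman gain: trust in the measurement
        x += k * (z - x);        // correct the estimate toward z
        p *= (1 - k);            // uncertainty shrinks after the correction
        return x;
    }
}
```

Feeding it a stream of location reports converges on a stable estimate even when individual reports are noisy, which is the role location estimation plays in the event-detection pipeline above.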
- 21. Other Applications
• Fraud detection
• Process control in manufacturing
• Surveillance systems
• CDR processing
• Healthcare monitoring
- 26. They need to process…
large volumes of data
in real-time
continuously
producing actionable information
- 63. Parsing/Filtering/ETL
Aggregation: collection and summarization of tuples
Merging: combining of streams with different schemas
Splitting: partitioning of a stream into multiple ones for data/task parallelism or some logical reason
Data mining/Machine Learning/NLP: spam filtering, fraud detection, recommendation systems, data stream clustering, sentiment analysis
Others: relational algebra, artificial intelligence, and other custom operations
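Two of the operator types above, splitting and aggregation, can be sketched in a few lines (plain Java, illustrative names): hash-partition tuples by key for data parallelism, then summarize each partition with counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a splitting operator (hash partitioning for data
// parallelism) and an aggregation operator (per-key counts).
public class SplitAggregate {
    // Split a stream of tuples into `partitions` sub-streams by key hash,
    // so that all tuples with the same key land in the same partition.
    public static List<List<String>> split(List<String> stream, int partitions) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) out.add(new ArrayList<>());
        for (String t : stream)
            out.get(Math.floorMod(t.hashCode(), partitions)).add(t);
        return out;
    }

    // Aggregate: collect and summarize tuples as occurrence counts.
    public static Map<String, Integer> aggregate(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : partition) counts.merge(t, 1, Integer::sum);
        return counts;
    }
}
```

Because the split is key-based, each partition can be aggregated independently and the partial results merged, which is exactly why splitting enables data parallelism.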
- 65. Traditional | Data Stream
Distributed: No | Yes
Type of Result: Accurate | Approximate
Memory Usage: Unlimited | Restricted
Processing Time: Unlimited | Restricted
No. of Passes: Multiple | Single
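The single-pass, restricted-memory column has a classic concrete instance: a running mean, computed incrementally in O(1) memory with one pass over the stream. (The mean happens to be exact; harder queries under the same constraints yield the approximate results the table mentions.)

```java
// Single-pass, constant-memory aggregate: the incremental mean.
// Contrast with the traditional model, which could buffer all data
// and take multiple passes.
public class RunningMean {
    private long n;
    private double mean;

    public void add(double x) {
        n++;
        mean += (x - mean) / n;  // incremental update, no buffering
    }

    public double mean() { return mean; }
}
```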
- 67. Sampling: classification, query estimation, order statistics estimation, distinct value queries
Wavelets: hierarchical decomposition and summarization
Clustering: knowledge discovery
Sketches: distinct count, heavy hitters, quantiles, change detection
Histograms: range queries, selectivity estimation
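The simplest of these synopsis techniques to sketch is sampling. Reservoir sampling maintains a uniform random sample of fixed size k over an unbounded stream in one pass (the fixed seed below is only for reproducibility):

```java
import java.util.Random;

// Reservoir sampling: after n items, each item is in the k-slot
// sample with probability k/n, using one pass and O(k) memory.
public class Reservoir {
    private final double[] sample;
    private long seen;
    private final Random rnd = new Random(42); // fixed seed for reproducibility

    public Reservoir(int k) { sample = new double[k]; }

    public void add(double x) {
        seen++;
        if (seen <= sample.length) {
            sample[(int) seen - 1] = x;          // fill phase: keep first k items
        } else {
            long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
            if (j < sample.length) sample[(int) j] = x; // replace w.p. k/seen
        }
    }

    public double[] sample() { return sample; }
}
```

Queries such as order-statistics or selectivity estimation are then answered on the small sample instead of the unbounded stream.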
- 73. The physical view displays the component instances and their location in the cluster
- 78. one or more data streams
among different operators.
- 89. If an ack is not received within a certain amount of time, the event is considered lost
- 104. sends the messages in the output queue and asks the upstream nodes for the messages it has not seen.
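The upstream-backup idea can be sketched with an output queue keyed by sequence number (illustrative names, not any system's actual API): the upstream node buffers every sent message, a recovering downstream node reports the last sequence number it saw, and the upstream replays everything after it; acknowledged prefixes are trimmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of upstream backup: buffer sent messages until the
// downstream acknowledges them; replay on request after a failure.
public class UpstreamBackup {
    private final NavigableMap<Long, String> outputQueue = new TreeMap<>();
    private long nextSeq = 0;

    // Buffer the message under its sequence number and "send" it.
    public long send(String msg) { outputQueue.put(nextSeq, msg); return nextSeq++; }

    // Recovery: replay every buffered message after the last one seen.
    public List<String> replayAfter(long lastSeenSeq) {
        return new ArrayList<>(outputQueue.tailMap(lastSeenSeq, false).values());
    }

    // Ack: the downstream has durably processed everything up to seq.
    public void ack(long seq) { outputQueue.headMap(seq, true).clear(); }
}
```

The trade-off is memory on the upstream node: the queue grows until acks arrive, which is why upstream backup pairs naturally with the ack mechanism described earlier.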
- 107. Upon a failure, the component is restored to the previous consistent state
- 110. It is simple to implement, but hard to guarantee the consistency of the whole system
- 111. Coordinated protocols, on the other hand, organize the checkpoint moments between components
- 112. This ensures the consistency of the whole system, at the cost of a more complex and costly protocol
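The checkpoint/rollback mechanism behind both variants can be sketched for a single stateful operator (illustrative, ignoring the inter-component coordination that distinguishes the two protocol families): snapshot the state periodically, and on failure restore the last snapshot and replay the tuples processed since.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of periodic checkpointing for one stateful operator
// (a per-key counter). Coordination between operators is omitted.
public class CheckpointingCounter {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> checkpoint = new HashMap<>();

    public void process(String key) { state.merge(key, 1, Integer::sum); }

    // Snapshot the current state (would be persisted in practice).
    public void takeCheckpoint() { checkpoint = new HashMap<>(state); }

    // Rollback: restore the previous consistent state after a failure.
    public void restore() { state = new HashMap<>(checkpoint); }

    public int get(String key) { return state.getOrDefault(key, 0); }
}
```

Tuples processed after the snapshot are lost on restore, so the recovery protocol must replay them, e.g. from an upstream backup queue.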
- 119. [Timeline of stream processing systems, 2000–2014:]
NiagaraCQ, Cougar, TelegraphCQ, STREAM, Aurora/Medusa, StreamBase, Borealis, BusinessEvents, Oracle CEP, InfoSphere Streams, Stream Mill, Granules, S4, Storm, Samza, Spark Streaming, MillWheel, TimeStream, Flink Streaming
- 121. Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
Finally, the processed results of the RDD operations are returned in batches
- 122. Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Batch sizes as low as ½ second, latency ~ 1 second
Potential for combining batch processing and stream processing in the same system
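The discretization idea can be sketched without Spark (plain Java, illustrative): group timestamped events into fixed-length batches, then run an ordinary batch computation on each batch in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.Function;

// Sketch of discretized stream processing: chop a timestamped
// stream into intervals and apply a batch job per interval.
public class Discretizer {
    // events: {timestampMillis, value} pairs; batchJob: any batch computation.
    public static <R> List<R> run(List<long[]> events, long intervalMillis,
                                  Function<List<Long>, R> batchJob) {
        TreeMap<Long, List<Long>> batches = new TreeMap<>();
        for (long[] e : events)  // assign each event to its interval's batch
            batches.computeIfAbsent(e[0] / intervalMillis, k -> new ArrayList<>()).add(e[1]);
        List<R> results = new ArrayList<>();
        for (List<Long> batch : batches.values()) // one small batch job per interval
            results.add(batchJob.apply(batch));
        return results;
    }
}
```

Because each interval is an ordinary batch, any batch operation (sum, count, join) can be reused unchanged, which is the "combining batch and stream processing" point above.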
- 123. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the tweets DStream as batches at t, t+1, t+2, each stored in memory as an immutable, distributed RDD, fed by the Twitter Streaming API]
- 124. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied batch-by-batch to the tweets DStream yields the hashTags DStream; new RDDs (e.g. [#cat, #dog, …]) are created for every batch]
- 125. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: push data to external storage
[Diagram: flatMap then save on each batch; every batch of the hashTags DStream is saved to HDFS]
- 126. Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")
Function object to define the transformation
- 127. Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data
Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
Data lost due to worker failure can be recomputed from the input data
- 128. Key concepts
DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
- 129. Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.countByValue()
- 130. Example 3 – Count the hashtags over the last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
sliding window operation: Minutes(10) is the window length, Seconds(1) the sliding interval
- 131. Example 3 – Counting the hashtags over the last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
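What `window(...).countByValue()` computes can be sketched in plain Java (illustrative, not Spark's implementation): count each tag among the timestamped events that fall inside the window ending at `now`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a sliding-window countByValue over timestamped events.
public class WindowCount {
    // events: (timestampMillis, tag) pairs; counts tags in (now - window, now].
    public static Map<String, Long> countByValue(List<Map.Entry<Long, String>> events,
                                                 long now, long windowMillis) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<Long, String> e : events)
            if (now - e.getKey() < windowMillis)   // keep only events inside the window
                counts.merge(e.getValue(), 1L, Long::sum);
        return counts;
    }
}
```

Re-evaluating this once per sliding interval (here, every second) reproduces the behavior of the DStream window operation, albeit naively; Spark instead reuses per-batch partial counts.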
- 139. Storm cluster
[Diagram: master node running Nimbus (process) with the scheduler; worker nodes 1…n each run a Supervisor with worker slots, where each worker is a process containing executors]
Nimbus assigns work to supervisors, manages failures, and monitors resource usage.
- 140. Storm cluster
[Diagram: master node with Nimbus and scheduler; worker nodes with supervisors and slots]
The number of slots of a supervisor is the maximum number of workers it can execute.
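The scheduling idea can be illustrated with a round-robin sketch (illustrative only; Storm's actual scheduler is more involved): enumerate the free slots reported by the supervisors and distribute a topology's executors over them, wrapping around when there are more executors than slots.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Round-robin assignment of executors to supervisor slots,
// as an illustration of what a Nimbus-style scheduler decides.
public class RoundRobinScheduler {
    // slotsPerSupervisor[i] = free slots on supervisor i.
    // Returns executor index -> "supervisor:slot".
    public static Map<Integer, String> assign(int executors, int[] slotsPerSupervisor) {
        List<String> slots = new ArrayList<>();
        for (int s = 0; s < slotsPerSupervisor.length; s++)
            for (int k = 0; k < slotsPerSupervisor[s]; k++)
                slots.add(s + ":" + k);
        Map<Integer, String> assignment = new LinkedHashMap<>();
        for (int e = 0; e < executors; e++)
            assignment.put(e, slots.get(e % slots.size())); // wrap: slots host many executors
        return assignment;
    }
}
```

When executors outnumber slots, several executors share one worker process, which mirrors the executor/worker/slot hierarchy in the diagram.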
- 144. Platform comparison: Storm | Storm Trident | Spark Streaming | Samza | S4
Processing Model: Record-at-a-time | Micro-batches | Micro-batches | Record-at-a-time | Record-at-a-time
Programming Model: DAG | DAG | Monad | DAG | Actors
Stream Partitioning: Yes | Yes | Yes | Yes | Yes
Rebalancing: Yes | Yes | No | No | Yes
Dynamic Cluster: Yes | Yes | Yes | Yes | No
Resource Management: Standalone, YARN, Mesos | Standalone, YARN, Mesos | Standalone, YARN, Mesos | YARN, Mesos | Standalone
Coordination: Zookeeper | Zookeeper | Built-in | Built-in | Zookeeper
Programming Language: Java, any (via Thrift) | Java, any (via Thrift) | Java, Scala, Python | JVM languages | Java
- 145. Platform comparison (continued): Storm | Storm Trident | Spark Streaming | Samza | S4
Implementation Language: Java, Clojure | Java | Scala, Java | Scala, Java | Java, Groovy
Built-in Operators: No | Yes | Yes | No | No
Deterministic: - | - | Yes | - | -
Message System: Netty | Netty | Netty, Akka | Kafka | Netty
Data Mobility: Pull | Pull | - | Pull | Push
Delivery Guarantees: At-most-once, At-least-once | At-most-once, At-least-once, Exactly-once | Exactly-once | Exactly-once | At-most-once
Fault Tolerance: Rollback recovery using upstream backup | - | Coordinated periodic checkpoint, replication, parallel recovery | Rollback recovery | Uncoordinated periodic checkpoint
Dynamic Graph: No | No | No | Yes | Yes
Persistent State: No | Yes | Yes | Yes | Yes
- 160. Datasets
Number of Nodes
Application 1 2 4 8
word-count 4GB 8GB 16GB 26GB
log-processing 15GB 30GB 60GB 120GB
traffic-monitoring 4GB 8GB* 16GB* 32GB*
machine-outlier 4GB 9GB 18GB 36GB
spam-filter 4GB* 8GB* 16GB* 32GB*
sentiment-analysis 7GB 15GB 30GB 60GB
trending-topics 7GB 15GB 30GB 60GB
click-analytics 15GB 30GB 60GB 120GB
fraud-detection 4GB† 8GB† 16GB† 32GB†
spike-detection 4GB* 8GB* 16GB* 32GB*
*replicated †generated
- 161. Parallelism
Each operator row lists a (base, multipliers) pair for four configurations: 1:1, Best, Best (only source), Best (max mem)
Application / Operator:
word-count
source 1 1...6 1 1...3 1 2, 4, 8 3 1
splitter 1 1...6 5 1...3 5 1 5 1
counter 1 1...6 6 1...3 6 1 6 1, 2
sink 1 1...6 3 1...3 3 1 3 1
log-processing
source 1 1...6 4 1...3 1 1, 2, 8 4 1
status-counter 1 1...6 1 1...3 1 1 1 1
volume-counter 1 1...6 2 1...3 2 1 2 1
geo-locator 1 1...6 4 1...3 4 1 4 1, 2
geo-summarizer 1 1...6 2 1...3 2 1 2 1
sink 1 1...6 4 1...3 4 1 4 1
traffic-monitoring
source 1 1...6 1 1...3 1 2, 4, 8 1 1
map-matcher 1 1...6 2 1...3 2 1 2 1, 2
speed-calculator 1 1...6 2 1...3 2 1 2 1, 2
sink 1 1...6 1 1...3 1 1 1 1
machine-outlier
source 1 1...6 6 1...3 1 1, 2, 4, 8 - -
scorer 1 1...6 1 1...3 1 1 - -
anomaly-scorer 1 1...6 1 1...3 1 1 - -
alert-trigger 1 1...6 4 1...3 4 1 - -
sink 1 1...6 1 1...3 1 1 - -
spam-filter
source 1 1...6 1 1...3 1 2, 4, 8 1 1
tokenizer 1 1...6 10 1...3 10 1 10 1, 2
word-probability 1 1...6 1 1...3 1 1 1 1
bayes-rule 1 1...6 1 1...3 1 1 1 1
sink 1 1...6 1 1...3 1 1 1 1
- 162. Parallelism (continued); each operator row lists (base, multipliers) pairs for: 1:1, Best, Best (only source), Best (max mem)
Application / Operator:
sentiment-analysis
source 1 1...6 1 2, 4, 8
tweet-filter 1 1...6 1 1
text-filter 1 1...6 1 1
stemmer 1 1...6 1 1
positive-scorer 1 1...6 1 1
negative-scorer 1 1...6 1 1
joiner 1 1...6 1 1
scorer 1 1...6 1 1
sink 1 1...6 1 1
trending-topics
source 1 1...6 9 1...3 1 1, 2, 4, 8 9 1
topic-extractor 1 1...6 2 1...3 2 1 2 1
counter 1 1...6 1 1...3 1 1 1 2, 4
intermediate-ranker 1 1...6 1 1...3 1 1 1 1
total-ranker 1 1...6 1 1...3 1 1 1 1
sink 1 1...6 1 1...3 1 1 1 1
click-analytics
source 1 1...6 2 1...3 2 2, 4, 8 2 1
repeat-visits 1 1...6 2 1...3 2 1 2 1
total-visits 1 1...6 2 1...3 2 1 2 1
geo-locator 1 1...6 5 1...3 5 1 5 2, 4
geo-summarizer 1 1...6 1 1...3 1 1 1 1
sink-visits 1 1...6 1 1...3 1 1 1 1
sink-locations 1 1...6 1 1...3 1 1 1 1
fraud-detection
source 1 1...6 8 1...3 1 1, 2, 4 8 1
predictor 1 1...6 3 1...3 3 1 3 2, 4
sink 1 1...6 2 1...3 2 1 2 1
spike-detection
source 1 1...6 7 1...3 1 1, 2, 4, 8 7 1
moving-average 1 1...6 3 1...3 3 1 3 2, 4
spike-detector 1 1...6 2 1...3 2 1 2 1
sink 1 1...6 1 1...3 1 1 1 1
- 171. HDD Read/Write – Kafka Broker
[Chart: throughput in MBytes/sec (0–90) for SDD_READ, SDD_WRITE, SDB_READ, SDB_WRITE]
Heinze, Thomas, et al. "Tutorial: Cloud-based Data Stream Processing." 2014.
Artikis, Alexander, Matthias Weidlich, Francois Schnitzler, Ioannis Boutsis, Thomas Liebig, Nico Piatkowski, Christian Bockermann, et al. "Heterogeneous Stream Processing and Crowdsourcing for Urban Traffic Management." In EDBT, pp. 712-723, 2014.
Bouillet, Eric, et al. "Processing 6 billion CDRs/day: from research to production (experience report)." In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, ACM, 2012.
Lakshmanan, G. T., Li, Y., and Strom, R. "Placement strategies for internet-scale data stream systems." IEEE Internet Computing 12(6), pp. 50-60, 2008.
Simmhan, Yogesh, et al. "An informatics approach to demand response optimization in smart grids." 2011.
Sakaki, Takeshi, Makoto Okazaki, and Yutaka Matsuo. "Earthquake shakes Twitter users: real-time event detection by social sensors." In Proceedings of the 19th International Conference on World Wide Web, ACM, 2010.