Building streaming
architectures
David Martinez Rego
BigD@ta Coruña 25-26 April 2016
Index
• Why?
• When?
• What?
• How?
Why?
The world does not wait
• Big data applications are built with the sole
purpose of serving a business case: gathering an
understanding about the world that gives an
advantage.
• The necessity of building streaming applications
arises from the fact that in many applications, the
value of the information gathered drops
dramatically with time.
Batch/streaming duality
• Streaming applications can bring value by giving
an approximate answer just in time. If timing is not
an issue (e.g. daily), batch pipelines can provide a
good solution.
[Chart: value of information vs. time — a streaming answer captures value that a later batch answer has already lost]
When?
Start big, grow small
• Despite vendor advertising, jumping into a
streaming application is not always advisable
• It is harder to get right and you run into limitations:
probabilistic data structures, weaker guarantees, …
• The value of the data you are about to gather is not
clear in a discovery phase.
• Some new libraries provide the same set of primitives
for both batch and streaming. It is possible to develop
the core of the idea in batch and translate it to a
streaming pipeline later.
Not always practical
• As a developer, you can
face any of the following
situations
• It is mandatory
• It is doubtful
• It will never be necessary
What?
[Diagram: a generic streaming pipeline — Gathering → Brokering → Processing/Analysis → Sink, backed by an execution engine and a coordination service; gathering and brokering are persistent, processing is ~persistent, and sinks feed external systems]
https://www.mapr.com/developercentral/lambda-architecture
Lambda architecture
• Batch layer (e.g. Spark, HDFS): process the master
dataset (append only) to precompute batch views
(the views the front end will query)
• Speed layer (streaming): calculate ephemeral
views based only on recent data
• Motto: take reprocessing and recovery into
account
Lambda architecture
• Problems:
• Maintaining two code bases in sync (often different
because the speed layer cannot reproduce the
same computation)
• Synchronising the two layers in the query
layer is an additional problem
[Diagram: Kappa reprocessing — the production pipeline keeps serving while a new pipeline replays the reservoir of master data through Gathering/Brokering ("catching up…"); once it is done, it becomes the production pipeline and the old pipeline is retired]
Kappa approach
• Maintain only one code base and avoid the
accidental complexity of using too many
technologies.
• Can roll back if something goes wrong
• Not a silver bullet and not a prescription of
technologies, just a framework.
How?
Concepts are basic
• There are multiple frameworks
available nowadays that
change terminology trying to
differentiate themselves.
• It makes starting on streaming
a bit confusing…
• Actually there are many
concepts which are shared
between them and they are
quite logical.
Step 1: data structure
• The basic data structure is made of 4 elements
• Sink: where is this thing going?
• Partition key: to which shard?
• Sequence id: when was this produced?
• Data: anything that can be serialised (JSON, Avro, photo, …)
( Sink, Partition key, Sequence id, Data )
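The four fields can be sketched as a plain record type — a minimal Python sketch; the field and example names are illustrative, not any framework's API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Record:
    """The basic unit a streaming system moves around."""
    sink: str           # where is this thing going? (topic/stream name)
    partition_key: str  # to which shard?
    sequence_id: int    # when was this produced?
    data: Any           # anything serialisable (JSON, Avro, photo, ...)

r = Record(sink="locations", partition_key="user-42",
           sequence_id=1, data={"lat": 43.36})
```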
Step 2: hashing
• The holy grail trick of big data for splitting work, and
also a major building block of streaming
• We use hashing in reverse of the classical use: we
force collisions between the things that are of interest to us
( Sink, Partition key, Sequence id, Data ) → h(k) mod N
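A minimal sketch of h(k) mod N routing, assuming MD5 as the hash; the point is that every record with the same key "clashes" onto the same shard, so per-key state can live on a single worker:

```python
import hashlib

N = 4  # number of partitions/shards

def partition(key: str, n: int = N) -> int:
    """h(k) mod N: records with the same key always land on the same shard."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

# Deterministic: the same user is always routed to the same partition.
assert partition("user-42") == partition("user-42")
```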
Step 3: fault tolerance
“Distributed computing is parallel computing when
you cannot trust anything or anyone”
Step 3: fault tolerance
• At any point any node producing the data in the
source can stop working
• Non persistent: data is lost
• Persistent: data is replicated so it can always be
recovered from another node
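The persistent case can be sketched as a toy replicated log — a hypothetical in-memory model, not any real system's API: each record is written to `replication` nodes, and a read succeeds as long as at least one replica is alive:

```python
class ReplicatedLog:
    """Toy model: each record lives on `replication` consecutive nodes."""

    def __init__(self, nodes: int, replication: int):
        self.logs = [[] for _ in range(nodes)]
        self.replication = replication

    def append(self, offset: int, record):
        # Write the record to `replication` nodes (wrap around the ring).
        for node in range(offset, offset + self.replication):
            self.logs[node % len(self.logs)].append((offset, record))

    def read(self, offset: int, dead=frozenset()):
        # Any surviving replica can serve the read.
        for node, log in enumerate(self.logs):
            if node in dead:
                continue
            for off, record in log:
                if off == offset:
                    return record
        raise KeyError(offset)

log = ReplicatedLog(nodes=3, replication=2)
log.append(0, "a")
recovered = log.read(0, dead={0})  # node 0 died: node 1 still has the record
```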
Step 3: fault tolerance
• At any point any node computing our pipeline can
go down
• at most once: we let data be lost; once delivered,
it is not reprocessed.
• at least once: we ensure delivery; messages can be
reprocessed.
• exactly once: we ensure delivery and no
reprocessing
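The difference can be sketched with a toy consumer under at-least-once delivery: retries can redeliver a message, and tracking already-seen sequence ids turns reprocessing into a no-op (effectively exactly-once processing). A hedged sketch, not any framework's mechanism:

```python
class AtLeastOnceConsumer:
    """Deduplicates redeliveries by sequence id."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def process(self, seq_id: int, value: int):
        if seq_id in self.seen:   # a redelivery: skip reprocessing
            return
        self.seen.add(seq_id)
        self.total += value

c = AtLeastOnceConsumer()
# (1, 10) is delivered twice, as at-least-once allows.
for seq_id, value in [(1, 10), (2, 5), (1, 10)]:
    c.process(seq_id, value)
```

Without the `seen` check the total would be 25, not 15 — that is exactly the at-least-once vs. exactly-once gap.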
Step 3: fault tolerance
• At any point any node computing our pipeline can go
down
• checkpointing: If we have been running the pipeline for
hours and something goes wrong, do I have to start
from the beginning?
• Streaming systems put in place mechanisms to
checkpoint progress so the new worker knows
previous state and where to start from.
• Usually involves other systems to save checkpoints
and synchronise.
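The mechanism can be sketched as saving progress to stable storage after each message — a toy model (file names and state layout are illustrative), where a restarted worker resumes from the checkpointed offset instead of offset 0:

```python
import json
import os
import tempfile

def run(messages, checkpoint_path):
    """Process messages, checkpointing (offset, running count) after each one."""
    state = {"offset": 0, "count": 0}
    if os.path.exists(checkpoint_path):          # resume from last checkpoint
        with open(checkpoint_path) as f:
            state = json.load(f)
    for offset in range(state["offset"], len(messages)):
        state["count"] += messages[offset]
        state["offset"] = offset + 1
        with open(checkpoint_path, "w") as f:    # persist progress
            json.dump(state, f)
    return state

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
run([1, 2], path)                 # worker "crashes" after two messages
state = run([1, 2, 3, 4], path)   # restarted worker resumes at offset 2
```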
Step 4: delivery
• One at a time: we process each message
individually, which keeps latency low but increases
the processing cost per message.
• Micro-batch: we always process data in batches
gathered at specified time intervals or sizes. Makes
it impossible to reduce per-message latency below
a limit.
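The micro-batch policy can be sketched as "flush on size or deadline, whichever comes first" — a minimal generator sketch with illustrative parameter names:

```python
import time

def micro_batches(stream, max_size=3, max_wait=1.0, clock=time.monotonic):
    """Group a message stream into batches of at most `max_size` messages,
    flushing early if `max_wait` seconds elapse."""
    batch, deadline = [], clock() + max_wait
    for msg in stream:
        batch.append(msg)
        if len(batch) >= max_size or clock() >= deadline:
            yield batch
            batch, deadline = [], clock() + max_wait
    if batch:
        yield batch  # flush the tail

batches = list(micro_batches(range(7), max_size=3, max_wait=3600))
```

The `max_wait` floor is why micro-batching cannot push per-message latency below a limit: a message may sit in the buffer for up to a full interval.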
Gathering
[Diagram: producers emit (Topic, Partition key, Data) records; h(k) mod N routes each record to one partition of its topic; consumers (Consumer 1, Consumer 2, …) each own a subset of partitions, with Zookeeper coordinating the assignment]
Produce to
Kafka, consume
from Kafka
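The Kafka-style flow above can be sketched with an in-memory toy broker (class and method names are illustrative, not Kafka's API): a topic is a set of partitioned append-only logs, producers route by h(k) mod N, and each consumer in a group owns a disjoint subset of partitions — the assignment Zookeeper coordinates in older Kafka deployments:

```python
import hashlib

class Topic:
    """Toy broker: one append-only log per partition."""

    def __init__(self, partitions: int):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key: str, value):
        # h(k) mod N routing: same key -> same partition, so per-key order holds.
        p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def assign(self, consumers: int):
        """Round-robin partition ownership for a consumer group."""
        owned = {c: [] for c in range(consumers)}
        for p in range(len(self.partitions)):
            owned[p % consumers].append(p)
        return owned

topic = Topic(partitions=4)
p1 = topic.produce("user-42", "login")
p2 = topic.produce("user-42", "click")   # same key -> same partition as p1
assignment = topic.assign(2)             # 2 consumers split 4 partitions
```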
|          | one at-a-time | mini batch | exactly once | Deploy       | Windowing | Functional | Catch                |
|----------|---------------|------------|--------------|--------------|-----------|------------|----------------------|
| Storm    | Yes           | Yes *      | Yes *        | Custom, YARN | Yes *     | ~          | DRPC                 |
| Spark    | No            | Yes        | Yes          | YARN, Mesos  | Yes       | Yes        | MLlib, ecosystem     |
| Flink    | Yes           | Yes        | Yes          | YARN         | Yes       | Yes        | Flexible windowing   |
| Samza    | Yes           | ~          | No           | YARN         | ~         | No         | DB update log plugin |
| Dataflow | Yes           | Yes        | Yes          | Google       | Yes       | ~          | Google ecosystem     |
| Kinesis  | Yes           | you        | No           | AWS          | you       | No         | AWS ecosystem        |

* with Trident
Flink basic concepts
• Stream: source of data that feeds computations (a
batch dataset is a bounded stream)
• Transformations: operations that take one or more
streams as input and compute an output stream.
They can be stateless or stateful (exactly once).
• Sink: endpoint that receives the output stream of a
transformation
• Dataflow: DAG of streams, transformations and sinks.
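These concepts can be sketched as a tiny per-element dataflow — a toy Python model of the idea, not Flink's actual API: a (possibly bounded) stream feeds a chain of transformations whose output lands in a sink:

```python
def run_dataflow(source, transformations, sink):
    """Evaluate a linear dataflow: stream -> transformations -> sink."""
    for element in source:                 # a bounded stream = a batch dataset
        for transform in transformations:  # chained one-input operators
            element = transform(element)
        sink.append(element)

out = []
# A two-operator pipeline: double, then increment.
run_dataflow(range(5), [lambda x: x * 2, lambda x: x + 1], out)
```

A real dataflow is a DAG rather than a chain (operators may have several inputs and outputs), but the evaluation idea is the same.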
Samza basic concepts
• Streams: persistent set of immutable messages of similar type
and category with transactional nature.
• Jobs: code that performs logical transformations on a set of
input streams to append to a set of output streams.
• Partitions: each stream breaks into partitions, each a
totally ordered sequence of messages.
• Tasks: Each task consumes data from one partition.
• Dataflow: composition of jobs that connects a set of streams.
• Containers: physical unit of parallelism.
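The partition/task pairing can be sketched as "one task per partition" — a toy Python model (names illustrative, not Samza's API): each task consumes the totally ordered messages of exactly one partition and appends results to its own output stream:

```python
class Task:
    """Consumes exactly one partition of the input stream."""

    def __init__(self, partition: int):
        self.partition = partition
        self.output = []

    def process(self, message: str):
        self.output.append(message.upper())  # a logical transformation

partitions = [["a", "b"], ["c"]]             # stream broken into 2 partitions
tasks = [Task(p) for p in range(len(partitions))]
for task, messages in zip(tasks, partitions):
    for message in messages:                 # in-partition order is preserved
        task.process(message)
```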
Storm basic concepts
• Spout: source of data from any external system.
• Bolts: transformations of one or more streams into another
set of output streams.
• Stream grouping: shuffling of streaming data between bolts.
• Topology: set of spouts and bolts that process a stream of
data.
• Tasks and Workers: units of work deployable into one
container. A worker can process one or more tasks; each
task deploys to one worker.
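A toy model of these pieces (illustrative names, not Storm's API): a spout emits tuples, a bolt transforms them, and a fields grouping on the word decides which bolt task receives each tuple, so per-word counts stay consistent:

```python
def spout():
    """Source of data from some external system."""
    yield from ["sherlock", "watson", "sherlock"]

def count_bolt(state, word):
    """A stateful bolt task: word counting."""
    state[word] = state.get(word, 0) + 1

states = [{}, {}]  # two parallel tasks of the counting bolt
for word in spout():
    # Fields grouping on the word: the same word always reaches the same task.
    task = sum(word.encode()) % len(states)
    count_bolt(states[task], word)

counts = {**states[0], **states[1]}  # merge per-task counts
```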
[Diagram: a Trident topology compiles into a Storm topology]
Spark basic concepts
• DStream: continuous stream of data represented by a
series of RDDs. Each RDD contains data for a specific
time interval.
• Input DStream and Receiver: source of data that feeds a
DStream.
• Transformations: operations that transform one DStream
into another DStream (stateless and stateful, with exactly-
once semantics).
• Output operations: operations that periodically push data
of a DStream to a specific output system.
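The DStream idea — a continuous stream chopped into a series of per-interval batches — can be sketched in plain Python (the RDD stand-ins are just lists; names are illustrative, not Spark's API):

```python
def dstream(events, interval=1.0):
    """Group (timestamp, value) events into consecutive time intervals,
    yielding one batch (an RDD stand-in) per interval."""
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

def transform(stream, fn):
    """A DStream transformation is just the same function on every batch."""
    return [[fn(v) for v in batch] for batch in stream]

events = [(0.1, 1), (0.7, 2), (1.2, 3)]
stream = dstream(events)                     # two intervals: [1, 2] and [3]
scaled = transform(stream, lambda v: v * 10)
```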
Conclusions…
• Think of streaming when there is a hard constraint on time-to-information
• Use a queue system as your place of orchestration
• Select the processing system that best suits your use case
• Samza: early stage, more to come in the near future.
• Spark: a good option if mini batch will always work for you.
• Storm: a good option if you can set up the infrastructure. DRPC provides an interesting pattern
for some use cases.
• Flink: reduced ecosystem because it has a shorter history. Its design learnt from all past
frameworks and is the most flexible.
• Datastream: original inspiration for Flink. A good and flexible model if you want to go the
managed route and make use of the Google toolbox (Bigtable, etc.)
• Kinesis: only if you have some legacy. You are probably better off using the Spark connector in AWS
EMR.
Where to go…
• All code examples are available on GitHub
• Kafka https://github.com/torito1984/kafka-playground.git,
https://github.com/torito1984/kafka-doyle-generator.git
• Spark https://github.com/torito1984/spark-doyle.git
• Storm https://github.com/torito1984/trident-doyle.git
• Flink https://github.com/torito1984/flink-sherlock.git
• Samza https://github.com/torito1984/samza-locations.git