Welcome to
The first conference on Apache Flink
Sponsored by
Some practical info
§  Registration, cloakroom, and meals are in the Palais
§  Information point always staffed
§  WiFi is FlinkForward
§  Twitter hashtag is #ff15
§  Follow @FlinkForward
Some practical info
§  Need help? Look for a volunteer (pink badges)
§  All sessions are recorded and will be made
available online
§  This includes the training sessions
3
Getting around
4
Please go around while
talks are in progress
Our speaker organizations
5
Kostas Tzoumas and Stephan Ewen
@kostas_tzoumas | @StephanEwen
Apache Flink™: From Incubation to Flink 1.0
7
1.  A bit of history
2.  The streaming era and Flink
3.  Inside Flink 0.10
4.  Towards Flink 1.0 and beyond
A bit of history
From incubation until now
8
9
[Figure: Flink then and now. The incubating stack: DataSet API (Java/Scala) on the Flink core, running Local, Remote, or on YARN. Release timeline: 0.5 (Apr 2014), 0.6, 0.7, top-level project and 0.8 (Dec 2014), 0.9-m1, 0.9 (Jun 2015), 0.10 (Oct 2015). The current stack: libraries (Gelly, Table, FlinkML, SAMOA, Cascading, Hadoop M/R, Storm compatibility, Dataflow) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs, all running on the Flink dataflow engine, deployable Local, Remote, on YARN, on Tez, or Embedded.]
Community growth
Flink is one of the largest and most active Apache
big data projects with well over 120 contributors
10
Flink meetups around the globe
11
Featured in
12
Welcome to
The streaming era
13
14
[Figure: batch workloads are well served by existing systems; event-based workloads need new systems.]
15
Streaming is the biggest change in
data infrastructure since Hadoop
16
1.  Radically simplified infrastructure
2.  Internet of Things, on-demand services
3.  Can completely subsume batch
17
In a world of events and isolated apps, the stream processor is the
backbone of the data infrastructure
[Figure: isolated apps each keep their own local view vs. apps sharing a global view over a consistent store, with consistent data movement and analytics provided by the stream processor.]
18
§  Until now, stream processors were less
mature than batch processors
§  This led to
•  in-house solutions
•  abuse of batch processors
•  Lambda architectures
§  This is no longer the case
19
Flink 0.10
With the upcoming 0.10 release, Flink
significantly surpasses the state of the art in
open source stream processing systems.
And we are heading to Flink 1.0 after that.
20
§  Streaming technology has matured
•  e.g., Flink, Kafka, Dataflow
§  Flink and Dataflow duality
•  a Google technology
•  an open source Apache project
21
§  Streaming is happening
§  Better adapt now
§  Flink 0.10: a ready-to-use open source stream processor
Flink 0.10
Flink for the streaming era
22
Improved DataStream API
§  Stream data analysis differs from batch data
analysis by introducing time
§  Streams are unbounded and produce data
over time
§  As simple as the batch API if you handle time in a simple way
§  Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc.); see the sketch below
23
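To make advanced time handling concrete, here is a minimal sketch of how a job opts into event time. It assumes a recent Flink Scala API (TimeCharacteristic, assignAscendingTimestamps may be named differently or deprecated in other versions), and the Event fields and values are made up for illustration.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

case class Event(location: String, numVehicles: Long, eventTimestamp: Long)

object EventTimeSetup {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Use the timestamp carried in each record ("event time") instead of
    // the wall-clock time at which the record happens to arrive.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream: DataStream[Event] = env.fromElements(
      Event("intersection-42", 3, 1000L),
      Event("intersection-42", 5, 2000L))

    // Extract the event timestamp from each record; the resulting watermarks
    // tell Flink how far event time has progressed, so windows can be closed
    // correctly even when records arrive out of order.
    stream
      .assignAscendingTimestamps(_.eventTimestamp)
      .print()

    env.execute("event-time setup sketch")
  }
}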
Improved DataStream API
24
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
Improved DataStream API
25
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .sum("numVehicles")
Improved DataStream API
26
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
Improved DataStream API
27
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")

  .keyBy( evt => evt.location.grid )
  .mapWithState { (evt, state: Option[Model]) => {
      val model = state.getOrElse(new Model())
      (model.classify(evt), Some(model.update(evt)))
  }}
IoT / Mobile Applications
28
[Figure: events occur on devices, are stored in a log (queue), and are analyzed in a data streaming system.]
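As a sketch of the devices → log → stream processor pipeline, the following reads device events from a Kafka topic into a DataStream. The connector class name varies between Flink releases (FlinkKafkaConsumer082 in the 0.10 era, FlinkKafkaConsumer later), and the topic name and broker address here are hypothetical.

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object DeviceEventsFromKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Connection details for the log (Kafka) that buffers the device events.
    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka:9092") // hypothetical broker
    props.setProperty("group.id", "device-events")

    // Each record is consumed as a raw string here; a real job would plug in
    // a deserialization schema that produces an Event case class.
    val events: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer[String]("device-events", new SimpleStringSchema(), props))

    events.print()
    env.execute("device events from Kafka (sketch)")
  }
}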
IoT / Mobile Applications
29
IoT / Mobile Applications
30
IoT / Mobile Applications
31
IoT / Mobile Applications
32
[Figure: the first and second bursts of events reach the stream processor out of order.]
IoT / Mobile Applications
33
[Figure: the same bursts grouped into event time windows, arrival time windows, or handled instantly event-at-a-time.]
Flink supports out-of-order (event time) windows, arrival time windows (and mixtures of the two), plus low-latency event-at-a-time processing.
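A minimal sketch of the difference, assuming a recent Flink Scala API: the same 15-minute tumbling window either groups records by the timestamp they carry (event time) or by when they reach the operator (arrival/processing time). The Event fields are made up.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(location: String, numVehicles: Long, eventTimestamp: Long)

object WindowTimeSemantics {
  def apply(events: DataStream[Event]): Unit = {
    // Event-time windows: records are grouped by the timestamp they carry,
    // so out-of-order arrivals still land in the "right" window.
    // (Older Flink versions also require
    //  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime).)
    events
      .assignAscendingTimestamps(_.eventTimestamp)
      .keyBy(_.location)
      .window(TumblingEventTimeWindows.of(Time.minutes(15)))
      .sum("numVehicles")

    // Arrival-time (processing-time) windows: records are grouped by the
    // wall-clock time at which they reach the window operator.
    events
      .keyBy(_.location)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(15)))
      .sum("numVehicles")
  }
}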
High Availability and Consistency
34
No single point of failure any more
Exactly-once processing semantics across the pipeline
Checkpointing/fault tolerance is decoupled from windows, which allows for highly flexible window implementations
[Figure: multiple masters with failover, coordinated by a ZooKeeper ensemble.]
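The exactly-once guarantee rests on periodic checkpoints. A minimal sketch of turning them on, assuming a recent Flink Scala API; the interval and the toy pipeline are illustrative only.

import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Snapshot all operator state every 10 seconds. On failure, Flink
    // restores the latest snapshot and replays the sources from that point,
    // which gives exactly-once semantics for operator state.
    env.enableCheckpointing(10000L)

    env.fromElements(("sensor-1", 1L), ("sensor-1", 2L), ("sensor-2", 5L))
      .keyBy(_._1)   // key by sensor id
      .sum(1)        // running sum of the second tuple field
      .print()

    env.execute("checkpointing sketch")
  }
}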
Performance
35
High throughput and low latency, with a configurable throughput/latency tradeoff
[Figure: continuous streaming, latency-bound buffering, and distributed snapshots combine to deliver high throughput and low latency.]
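The throughput/latency tradeoff is exposed as a buffer timeout: a network buffer is shipped either when it fills up or when the timeout fires, whichever comes first. A sketch, assuming a recent Flink Scala API; the 5 ms value is illustrative.

import org.apache.flink.streaming.api.scala._

object LatencyTuning {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Flush partially filled network buffers after at most 5 ms.
    // Larger values favor throughput (fuller buffers per transfer),
    // smaller values favor latency; 0 flushes after every record.
    env.setBufferTimeout(5L)

    env.fromElements("a", "b", "c").print()
    env.execute("buffer timeout sketch")
  }
}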
Batch and Streaming
36
case class WordCount(word: String, count: Int)

val text: DataStream[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")
Batch Word Count in the DataStream API
Batch and Streaming
37
Batch Word Count in the DataSet API
case class WordCount(word: String, count: Int)

val text: DataStream[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")

val text: DataSet[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .groupBy("word")
  .sum("count")
Batch and Streaming
38
One streaming dataflow runtime underlies both APIs:
DataSet (batch): relational optimizer; pipelined and blocking operators; batch parameters: schedule lazily, recompute whole operators, fully buffered streams.
DataStream (streaming): window optimization; pipelined and windowed operators; streaming parameters: schedule eagerly, periodic checkpoints, streaming data movement.
Shared runtime: stateful operations, DAG recovery, DAG resource management.
Batch and Streaming
39
A full-fledged batch processor as well
[Figure: the Flink stack; libraries (Gelly, Table, FlinkML, SAMOA, Cascading, Hadoop M/R, Storm compatibility, Dataflow) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs, all running on the Flink dataflow engine, deployable Local, Remote, on YARN, on Tez, or Embedded.]
Batch and Streaming
40
A full-fledged batch processor as well
[Figure: the same Flink stack as on the previous slide.]
More details in Dongwon Kim's talk, "A comparative performance evaluation of Flink"
Integration (picture not complete)
41
[Figure: integration points, including POSIX and Java/Scala collections.]
Monitoring
42
Live system metrics and
user-defined accumulators/statistics
GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators
Monitoring REST API for
custom monitoring tools
{
  "id": "dceafe2df1f57a1206fcb907cb38ad97",
  "user-accumulators": [
    { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
    { "name": "genwords", "type": "LongCounter", "value": "75000000" }
  ]
}
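As a usage sketch, a custom monitoring tool could simply poll that endpoint and consume the JSON. This assumes the JobManager host, web port, and job id shown above, and uses only the Scala standard library.

import scala.io.Source

object PollAccumulators {
  def main(args: Array[String]): Unit = {
    // JobManager REST endpoint for the job's accumulators (substitute your
    // own host, port, and job id).
    val url =
      "http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators"

    // Fetch the JSON document; a real tool would parse it and export the
    // accumulator values to its own metrics system.
    val json = Source.fromURL(url).mkString
    println(json)
  }
}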
  
Flink 0.10 Summary
§  Focus on operational readiness
•  high availability
•  monitoring
•  integration with other systems
§  First-class support for event time
§  Refined DataStream API: easy and
powerful
43
Towards Flink 1.0 and beyond
Where we see the project going
44
Towards Flink 1.0
§  Flink 1.0 is around the corner
§  Focus on defining public APIs and
automatic API compatibility checks
§  Guarantee backwards compatibility in all
Flink 1.X versions
45
Beyond Flink 1.0
§  The Flink engine has most features in place
§  Focus on usability features on top of
DataStream API
•  e.g., SQL, ML, more connectors
§  Continue work on elasticity and memory
management
46
47
Enjoy the rest of
The first conference on Apache Flink
