K. Tzoumas & S. Ewen – Flink Forward Keynote
- 2. Some practical info
§ Registration, cloakroom, and meals are in the Palais
§ Information point always staffed
§ WiFi is FlinkForward
§ Twitter hashtag is #ff15
§ Follow @FlinkForward
- 3. Some practical info
§ Need help? Look for a volunteer (pink badges)
§ All sessions are recorded and will be made available online
§ This includes the training sessions
- 6. Kostas Tzoumas and Stephan Ewen
@kostas_tzoumas | @StephanEwen
Apache Flink™: From Incubation to Flink 1.0
- 7.
1. A bit of history
2. The streaming era and Flink
3. Inside Flink 0.10
4. Towards Flink 1.0 and beyond
- 8. A bit of history
From incubation until now
- 9. [Timeline, Apr 2014 – Oct 2015: releases 0.5 (Apr 2014), 0.6, 0.7, 0.8 (Dec 2014, graduation to Apache top-level project), 0.9-m1, 0.9 (Jun 2015), 0.10 (Oct 2015)]
At incubation: DataSet API (Java/Scala) on the Flink core, running Local, Remote, or on Yarn.
Today: the Gelly, Table, FlinkML, and SAMOA libraries on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs; Hadoop M/R, Cascading, Storm, and Dataflow compatibility layers; the Flink dataflow engine; running Local, Remote, on Yarn, on Tez, or Embedded.
- 17. In a world of events and isolated apps, the stream processor is the backbone of the data infrastructure
[Diagram: isolated apps, each with only a local view, vs. apps sharing a global view through consistent data movement, analytics, and a consistent store]
- 18.
§ Until now, stream processors were less mature than batch processors
§ This led to
• in-house solutions
• abuse of batch processors
• Lambda architectures
§ This is no longer the case
- 19. Flink 0.10
With the upcoming 0.10 release, Flink significantly surpasses the state of the art in open source stream processing systems.
And we are heading to Flink 1.0 after that.
- 20.
§ Streaming technology has matured
• e.g., Flink, Kafka, Dataflow
§ Flink and Dataflow duality
• a Google technology
• an open source Apache project
- 21.
§ Streaming is happening
§ Better adapt now
§ Flink 0.10: a ready-to-use open source stream processor
- 23. Improved DataStream API
§ Stream data analysis differs from batch analysis by introducing time
§ Streams are unbounded and produce data over time
§ As simple as the batch API if time is handled in a simple way
§ Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc.)
- 24. Improved DataStream API

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
- 25. Improved DataStream API

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .sum("numVehicles")
- 26. Improved DataStream API

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
- 27. Improved DataStream API

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
  .keyBy(evt => evt.location.grid)
  .mapWithState { (evt, state: Option[Model]) =>
    val model = state.getOrElse(new Model())
    (model.classify(evt), Some(model.update(evt)))
  }
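The `mapWithState` step above threads per-key state through the stream: each event sees the previous state and emits a result plus the next state. Its semantics can be sketched in plain Scala (Scala 3, no Flink dependency; the `Model`, `classify`, and `update` names below are simplified stand-ins for the slide's hypothetical model, not Flink API):

```scala
// Plain-Scala sketch of mapWithState semantics: fold an optional state
// value through a sequence of events, collecting one output per event.
def mapWithState[E, S, O](events: Seq[E], initial: Option[S])
                         (f: (E, Option[S]) => (O, Option[S])): Seq[O] =
  events.foldLeft((Vector.empty[O], initial)) { case ((out, state), evt) =>
    val (o, next) = f(evt, state)
    (out :+ o, next)
  }._1

// Toy "model": counts events seen so far; classify = "count so far is even".
case class Model(seen: Long) {
  def classify(evt: Int): Boolean = seen % 2 == 0
  def update(evt: Int): Model = Model(seen + 1)
}

val out = mapWithState(Seq(10, 20, 30), Option.empty[Model]) { (evt, state) =>
  val model = state.getOrElse(Model(0))          // first event: no state yet
  (model.classify(evt), Some(model.update(evt))) // emit result, carry new state
}
// out == Vector(true, false, true)
```

The real operator does the same per key and keeps the state fault tolerant; the fold above only models the single-key contract.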
- 28. IoT / Mobile Applications
[Diagram: events occur on devices → events are stored in a log (queue/log) → events are analyzed in a data streaming system (stream analysis)]
- 32. IoT / Mobile Applications
[Diagram: a first and a second burst of events reach the system out of order!]
- 33. IoT / Mobile Applications
[Diagram: the two bursts of events grouped into event time windows, into arrival time windows, and processed instantly event-at-a-time]
Flink supports out-of-order (event time) windows, arrival time windows (and mixtures), plus low-latency event-at-a-time processing.
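The distinction in the diagram can be illustrated by windowing the same records twice, once on each timestamp. This is a plain-Scala sketch of the idea, not Flink's windowing API; the `Record` type and the sample timestamps are made up for illustration:

```scala
// Each record carries the time it occurred (event time) and the time it
// reached the system (arrival time). Out-of-order delivery means the two
// window assignments can disagree.
case class Record(eventTime: Long, arrivalTime: Long)

// Assign records to tumbling windows of the given size, keyed by window start.
def windows(records: Seq[Record], size: Long,
            ts: Record => Long): Map[Long, Seq[Record]] =
  records.groupBy(r => (ts(r) / size) * size)

// A late record: it occurred at t=9 but arrived at t=21, after the t=14 record.
val records = Seq(Record(3, 4), Record(9, 21), Record(14, 15))

val byEvent   = windows(records, 10, _.eventTime)   // windows [0,10), [10,20)
val byArrival = windows(records, 10, _.arrivalTime) // windows [0,10), [10,20), [20,30)
// byEvent(0) holds both early records; byArrival(20) holds only the late one
```

Event time windows put the late record where it logically belongs; arrival time windows put it where it happened to show up.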
- 34. High Availability and Consistency
§ No single point of failure any more (multiple masters with failover, coordinated by a ZooKeeper ensemble)
§ Exactly-once processing semantics across the pipeline
§ Checkpoints/fault tolerance are decoupled from windows → allows for highly flexible window implementations
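Exactly-once recovery rests on a simple contract: periodically snapshot operator state together with an input position, and after a failure restore the snapshot and replay from that position. A toy sketch of that contract in plain Scala (not Flink's checkpointing mechanism, which snapshots a distributed dataflow via barriers):

```scala
// Toy checkpointing: a summing operator snapshots (state, position) every
// `interval` records; recovery restores the snapshot and replays the rest.
case class Checkpoint(state: Long, position: Int)

def run(input: Seq[Long], interval: Int): (Long, Checkpoint) = {
  var state = 0L
  var last = Checkpoint(0L, 0)
  for ((x, i) <- input.zipWithIndex) {
    state += x
    if ((i + 1) % interval == 0) last = Checkpoint(state, i + 1)
  }
  (state, last)
}

// Recover: restore checkpointed state, replay the records after it.
def recover(input: Seq[Long], cp: Checkpoint): Long =
  input.drop(cp.position).foldLeft(cp.state)(_ + _)

val input = Seq(1L, 2L, 3L, 4L, 5L)
val (finalState, cp) = run(input, interval = 2)
// Replaying from the last checkpoint reproduces the un-failed result,
// which is the exactly-once effect: finalState == recover(input, cp) == 15
```

Because the checkpoint only needs (state, position), it is independent of how windows are implemented on top, which is what "decoupled from windows" buys.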
- 36. Batch and Streaming
Batch Word Count in the DataStream API

case class WordCount(word: String, count: Int)

val text: DataStream[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")
- 37. Batch and Streaming
Batch Word Count in the DataSet API

case class WordCount(word: String, count: Int)

DataStream API:

val text: DataStream[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")

DataSet API:

val text: DataSet[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .groupBy("word")
  .sum("count")
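Both versions share the same flatMap / map / group / sum shape. That shape can be checked in plain Scala collections (a sketch of the semantics only, not the Flink APIs):

```scala
// Plain-Scala word count with the same shape as the slides:
// split lines, tag each word with a count of 1, group by word, sum counts.
case class WordCount(word: String, count: Int)

def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" "))
    .map(w => WordCount(w, 1))
    .groupBy(_.word)                                    // like groupBy("word") / keyBy("word")
    .map { case (w, wcs) => w -> wcs.map(_.count).sum } // like sum("count")

val counts = wordCount(Seq("to be or", "not to be"))
// counts == Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)
```

In the DataStream version the global window plus an end-of-input trigger is what turns the unbounded-stream machinery back into a single batch result.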
- 38. Batch and Streaming
[Diagram: one streaming dataflow runtime, parameterized differently by the two APIs]
§ Batch parameters (DataSet): pipelined and blocking operators, relational optimizer, schedule lazily, recompute whole operators, fully buffered streams
§ Streaming parameters (DataStream): pipelined and windowed operators, window optimization, schedule eagerly, periodic checkpoints, streaming data movement
§ Shared runtime: stateful operations, DAG recovery, DAG resource management
- 39. Batch and Streaming
A full-fledged batch processor as well
[Flink stack diagram, as on slide 9]
- 40. Batch and Streaming
A full-fledged batch processor as well
[Flink stack diagram, as on slide 9]
More details in Dongwon Kim's talk, "A comparative performance evaluation of Flink"
- 42. Monitoring
Live system metrics and user-defined accumulators/statistics
Monitoring REST API for custom monitoring tools

GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators

{
  "id": "dceafe2df1f57a1206fcb907cb38ad97",
  "user-accumulators": [
    { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
    { "name": "genwords", "type": "LongCounter", "value": "75000000" }
  ]
}
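A custom monitoring tool only needs to fetch that endpoint and pull out the accumulator names and values. A minimal sketch of the extraction step, run against the sample body from the slide (for brevity it uses a regex on the known shape; a real client would fetch over HTTP and use a proper JSON parser):

```scala
// Extract (name, value) pairs from a /jobs/<jobid>/accumulators response body.
// The body below is the sample from the slide.
val body = """{
  "id": "dceafe2df1f57a1206fcb907cb38ad97",
  "user-accumulators": [
    { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
    { "name": "genwords", "type": "LongCounter", "value": "75000000" }
  ]
}"""

// Match each accumulator object's "name" and "value" fields.
val acc = """"name":\s*"([^"]+)"[^}]*"value":\s*"([^"]+)"""".r

val metrics: Map[String, Double] =
  acc.findAllMatchIn(body).map(m => m.group(1) -> m.group(2).toDouble).toMap
// metrics("genwords") == 75000000.0
```

From here a tool could push the values to a dashboard or alert when a counter stalls between polls.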
- 43. Flink 0.10 Summary
§ Focus on operational readiness
• high availability
• monitoring
• integration with other systems
§ First-class support for event time
§ Refined DataStream API: easy and
powerful
- 45. Towards Flink 1.0
§ Flink 1.0 is around the corner
§ Focus on defining public APIs and
automatic API compatibility checks
§ Guarantee backwards compatibility in all
Flink 1.X versions
- 46. Beyond Flink 1.0
§ Flink engine has most features in place
§ Focus on usability features on top of
DataStream API
• e.g., SQL, ML, more connectors
§ Continue work on elasticity and memory
management
- 47. Enjoy the rest of Flink Forward, the first conference on Apache Flink!