Welcome to
The first conference on Apache Flink
Sponsored by
Some practical info
§  Registration, cloakroom, and meals are in the Palais
§  Information point always staffed
§  WiFi is FlinkForward
§  Twitter hashtag is #ff15
§  Follow @FlinkForward
Some practical info
§  Need help? Look for a volunteer (pink badges)
§  All sessions are recorded and will be made
available online
§  This includes the training sessions
3
Getting around
4
Please go around while
talks are in progress
Our speaker organizations
5
Kostas Tzoumas and Stephan Ewen
@kostas_tzoumas | @StephanEwen
Apache Flink™: From Incubation to Flink 1.0
7
1.  A bit of history
2.  The streaming era and Flink
3.  Inside Flink 0.10
4.  Towards Flink 1.0 and beyond
A bit of history
From incubation until now
8
9
[Figure: Flink then and now. The incubating stack: DataSet API (Java/Scala) on the Flink core, running Local, Remote, or on YARN. Release timeline: 0.5 (Apr 2014), 0.6, 0.7, top-level project and 0.8 (Dec 2014), 0.9-m1, 0.9 (Jun 2015), 0.10 (Oct 2015). The current stack: libraries (Gelly, Table, FlinkML, SAMOA, Cascading, Hadoop M/R, Storm compatibility, Dataflow) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs, all running on the Flink dataflow engine, deployable Local, Remote, on YARN, on Tez, or Embedded.]
Community growth
Flink is one of the largest and most active Apache
big data projects with well over 120 contributors
10
Flink meetups around the globe
11
Featured in
12
Welcome to
The streaming era
13
14
[Figure: batch workloads are well served by existing systems; event-based workloads need new systems.]
15
Streaming is the biggest change in
data infrastructure since Hadoop
16
1.  Radically simplified infrastructure
2.  Internet of Things, on-demand services
3.  Can completely subsume batch
17
In a world of events and isolated apps, the stream processor is the
backbone of the data infrastructure
[Figure: isolated apps each keep their own local view vs. apps sharing a global view over a consistent store, with consistent data movement and analytics provided by the stream processor.]
18
§  Until now, stream processors were less
mature than batch processors
§  This led to
•  in-house solutions
•  abuse of batch processors
•  Lambda architectures
§  This is no longer the case
19
Flink 0.10
With the upcoming 0.10 release, Flink
significantly surpasses the state of the art in
open source stream processing systems.
And we are heading to Flink 1.0 after that.
20
§  Streaming technology has matured
•  e.g., Flink, Kafka, Dataflow
§  Flink and Dataflow duality
•  a Google technology
•  an open source Apache project
21
§  Streaming is happening
§  Better adapt now
§  Flink 0.10: a ready-to-use open source stream processor
Flink 0.10
Flink for the streaming era
22
Improved DataStream API
§  Stream data analysis differs from batch data
analysis by introducing time
§  Streams are unbounded and produce data
over time
§  As simple as the batch API if you handle time in a simple way
§  Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc.); see the sketch below
23
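To make advanced time handling concrete, here is a minimal sketch of how a job opts into event time. It assumes a recent Flink Scala API (TimeCharacteristic, assignAscendingTimestamps may be named differently or deprecated in other versions), and the Event fields and values are made up for illustration.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

case class Event(location: String, numVehicles: Long, eventTimestamp: Long)

object EventTimeSetup {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Use the timestamp carried in each record ("event time") instead of
    // the wall-clock time at which the record happens to arrive.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream: DataStream[Event] = env.fromElements(
      Event("intersection-42", 3, 1000L),
      Event("intersection-42", 5, 2000L))

    // Extract the event timestamp from each record; the resulting watermarks
    // tell Flink how far event time has progressed, so windows can be closed
    // correctly even when records arrive out of order.
    stream
      .assignAscendingTimestamps(_.eventTimestamp)
      .print()

    env.execute("event-time setup sketch")
  }
}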
Improved DataStream API
24
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
Improved DataStream API
25
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .sum("numVehicles")
Improved DataStream API
26
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
Improved DataStream API
27
case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")

  .keyBy( evt => evt.location.grid )
  .mapWithState { (evt, state: Option[Model]) => {
      val model = state.getOrElse(new Model())
      (model.classify(evt), Some(model.update(evt)))
  }}
IoT / Mobile Applications
28
[Figure: events occur on devices, are stored in a log (queue), and are analyzed in a data streaming system.]
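As a sketch of the devices → log → stream processor pipeline, the following reads device events from a Kafka topic into a DataStream. The connector class name varies between Flink releases (FlinkKafkaConsumer082 in the 0.10 era, FlinkKafkaConsumer later), and the topic name and broker address here are hypothetical.

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object DeviceEventsFromKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Connection details for the log (Kafka) that buffers the device events.
    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka:9092") // hypothetical broker
    props.setProperty("group.id", "device-events")

    // Each record is consumed as a raw string here; a real job would plug in
    // a deserialization schema that produces an Event case class.
    val events: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer[String]("device-events", new SimpleStringSchema(), props))

    events.print()
    env.execute("device events from Kafka (sketch)")
  }
}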
IoT / Mobile Applications
29
IoT / Mobile Applications
30
IoT / Mobile Applications
31
IoT / Mobile Applications
32
[Figure: the first and second bursts of events reach the stream processor out of order.]
IoT / Mobile Applications
33
[Figure: the same bursts grouped into event time windows, arrival time windows, or handled instantly event-at-a-time.]
Flink supports out-of-order (event time) windows, arrival time windows (and mixtures of the two), plus low-latency event-at-a-time processing.
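A minimal sketch of the difference, assuming a recent Flink Scala API: the same 15-minute tumbling window either groups records by the timestamp they carry (event time) or by when they reach the operator (arrival/processing time). The Event fields are made up.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(location: String, numVehicles: Long, eventTimestamp: Long)

object WindowTimeSemantics {
  def apply(events: DataStream[Event]): Unit = {
    // Event-time windows: records are grouped by the timestamp they carry,
    // so out-of-order arrivals still land in the "right" window.
    // (Older Flink versions also require
    //  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime).)
    events
      .assignAscendingTimestamps(_.eventTimestamp)
      .keyBy(_.location)
      .window(TumblingEventTimeWindows.of(Time.minutes(15)))
      .sum("numVehicles")

    // Arrival-time (processing-time) windows: records are grouped by the
    // wall-clock time at which they reach the window operator.
    events
      .keyBy(_.location)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(15)))
      .sum("numVehicles")
  }
}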
High Availability and Consistency
34
No single point of failure any more
Exactly-once processing semantics across the pipeline
Checkpointing/fault tolerance is decoupled from windows, which allows for highly flexible window implementations
[Figure: multiple masters with failover, coordinated by a ZooKeeper ensemble.]
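The exactly-once guarantee rests on periodic checkpoints. A minimal sketch of turning them on, assuming a recent Flink Scala API; the interval and the toy pipeline are illustrative only.

import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Snapshot all operator state every 10 seconds. On failure, Flink
    // restores the latest snapshot and replays the sources from that point,
    // which gives exactly-once semantics for operator state.
    env.enableCheckpointing(10000L)

    env.fromElements(("sensor-1", 1L), ("sensor-1", 2L), ("sensor-2", 5L))
      .keyBy(_._1)   // key by sensor id
      .sum(1)        // running sum of the second tuple field
      .print()

    env.execute("checkpointing sketch")
  }
}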
Performance
35
High throughput and low latency, with a configurable throughput/latency tradeoff
[Figure: continuous streaming, latency-bound buffering, and distributed snapshots combine to deliver high throughput and low latency.]
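The throughput/latency tradeoff is exposed as a buffer timeout: a network buffer is shipped either when it fills up or when the timeout fires, whichever comes first. A sketch, assuming a recent Flink Scala API; the 5 ms value is illustrative.

import org.apache.flink.streaming.api.scala._

object LatencyTuning {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Flush partially filled network buffers after at most 5 ms.
    // Larger values favor throughput (fuller buffers per transfer),
    // smaller values favor latency; 0 flushes after every record.
    env.setBufferTimeout(5L)

    env.fromElements("a", "b", "c").print()
    env.execute("buffer timeout sketch")
  }
}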
Batch and Streaming
36
case class WordCount(word: String, count: Int)

val text: DataStream[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")
Batch Word Count in the DataStream API
Batch and Streaming
37
Batch Word Count in the DataSet API
case class WordCount(word: String, count: Int)

val text: DataStream[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")

val text: DataSet[String] = …

text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .groupBy("word")
  .sum("count")
Batch and Streaming
38
One streaming dataflow runtime underlies both APIs:
DataSet (batch): relational optimizer; pipelined and blocking operators; batch parameters: schedule lazily, recompute whole operators, fully buffered streams.
DataStream (streaming): window optimization; pipelined and windowed operators; streaming parameters: schedule eagerly, periodic checkpoints, streaming data movement.
Shared runtime: stateful operations, DAG recovery, DAG resource management.
Batch and Streaming
39
A full-fledged batch processor as well
[Figure: the Flink stack; libraries (Gelly, Table, FlinkML, SAMOA, Cascading, Hadoop M/R, Storm compatibility, Dataflow) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs, all running on the Flink dataflow engine, deployable Local, Remote, on YARN, on Tez, or Embedded.]
Batch and Streaming
40
A full-fledged batch processor as well
[Figure: the same Flink stack as on the previous slide.]
More details in Dongwon Kim's talk, "A comparative performance evaluation of Flink"
Integration (picture not complete)
41
[Figure: integration points, including POSIX and Java/Scala collections.]
Monitoring
42
Live system metrics and
user-defined accumulators/statistics
GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators
Monitoring REST API for
custom monitoring tools
{
  "id": "dceafe2df1f57a1206fcb907cb38ad97",
  "user-accumulators": [
    { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
    { "name": "genwords", "type": "LongCounter", "value": "75000000" }
  ]
}
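As a usage sketch, a custom monitoring tool could simply poll that endpoint and consume the JSON. This assumes the JobManager host, web port, and job id shown above, and uses only the Scala standard library.

import scala.io.Source

object PollAccumulators {
  def main(args: Array[String]): Unit = {
    // JobManager REST endpoint for the job's accumulators (substitute your
    // own host, port, and job id).
    val url =
      "http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators"

    // Fetch the JSON document; a real tool would parse it and export the
    // accumulator values to its own metrics system.
    val json = Source.fromURL(url).mkString
    println(json)
  }
}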
  
Flink 0.10 Summary
§  Focus on operational readiness
•  high availability
•  monitoring
•  integration with other systems
§  First-class support for event time
§  Refined DataStream API: easy and
powerful
43
Towards Flink 1.0 and beyond
Where we see the project going
44
Towards Flink 1.0
§  Flink 1.0 is around the corner
§  Focus on defining public APIs and
automatic API compatibility checks
§  Guarantee backwards compatibility in all
Flink 1.X versions
45
Beyond Flink 1.0
§  The Flink engine has most features in place
§  Focus on usability features on top of
DataStream API
•  e.g., SQL, ML, more connectors
§  Continue work on elasticity and memory
management
46
47
Enjoy the rest of
The first conference on Apache Flink
