Apache Flink Stream Processing
Suneel Marthi
@suneelmarthi
Washington DC Apache Flink Meetup,
Capital One, Vienna, VA
November 19, 2015
Source Code
2
https://github.com/smarthi/DC-FlinkMeetup
Flink Stack
3
Streaming dataflow runtime
Specialized Abstractions / APIs
Core APIs
Flink Core Runtime
Deployment
The Full Flink Stack
Libraries: Gelly, Table, ML, SAMOA, Dataflow (WiP), MRQL, Cascading, Storm (WiP), Zeppelin
Core APIs: DataSet (Java/Scala), DataStream, Hadoop M/R
Runtime: streaming dataflow runtime
Deployment: Local, Cluster (YARN, Tez), Embedded
Stream Processing?
▪ Real-world data doesn’t originate in micro-
batches; it is pushed through systems as a
continuous stream.
▪ Stream analysis today is largely an extension
of the batch paradigm.
▪ Recent frameworks like Apache Flink and
Confluent’s platform are built to handle
streaming data natively.
5
Web server → Kafka topic
Requirements for a Stream Processor
▪ Low latency
▪ quick results (milliseconds)
▪ High throughput
▪ able to handle millions of events/sec
▪ Exactly-once guarantees
▪ correct results even in failure scenarios
6
Fault Tolerance in Streaming
▪ at least once: all operators see all events
▪ Storm: re-processes the entire stream in
failure scenarios
▪ exactly once: operators do not perform
duplicate updates to their state
▪ Flink: Distributed Snapshots
▪ Spark: Micro-batches
7
Batch is an extension of Streaming
▪ Batch: process a bounded
stream (DataSet) on a stream
processor
▪ Form a Global Window over
the entire DataSet for join or
grouping operations
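The "batch is a bounded stream" idea can be sketched in plain Java (an illustration only, not Flink's implementation): a finite input is consumed element by element into a single global window, whose aggregate is emitted when the stream ends.

```java
import java.util.Arrays;
import java.util.Iterator;

public class BatchAsStream {
    // Batch viewed as streaming: push a bounded stream through one global
    // window and emit the aggregate when the (finite) input ends.
    static int sumInGlobalWindow(Iterator<Integer> boundedStream) {
        int windowState = 0;                 // state of the single global window
        while (boundedStream.hasNext()) {    // terminates because input is bounded
            windowState += boundedStream.next();
        }
        return windowState;                  // the window "fires" at end of input
    }

    public static void main(String[] args) {
        System.out.println(sumInGlobalWindow(Arrays.asList(1, 2, 3, 4).iterator())); // 10
    }
}
```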
Flink Window Processing
9
Courtesy: Data Artisans
What is a Window?
▪ Grouping of elements into finite buckets
▪ by timestamps
▪ by record counts
▪ Have a maximum timestamp, which means, at
some point, all elements that need to be
assigned to a window would have arrived.
10
Why Window?
▪ Process subsets of Streams
▪ based on timestamps
▪ or by record counts
▪ Have a maximum timestamp, which means, at
some point, all elements that need to be
assigned to a window will have arrived.
11
Different Window Schemes
▪ Global Windows: All incoming elements are assigned to the same
window
stream.window(GlobalWindows.create());
▪ Tumbling time Windows: elements are assigned to a window of
a fixed size (5 sec below) based on their timestamp; each element
is assigned to exactly one window
keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS));
▪ Sliding time Windows: elements are assigned to a window of
certain size based on their timestamp, windows “slide” by the
provided value and hence overlap
stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1,
TimeUnit.SECONDS)));
12
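The bucketing rule behind tumbling time windows can be sketched in plain Java (an illustration only, not Flink's implementation): each timestamp belongs to the window starting at timestamp minus (timestamp modulo window size).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingTimeBuckets {
    // A timestamp belongs to the tumbling window starting at
    // timestamp - (timestamp mod windowSize).
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    public static void main(String[] args) {
        long size = 5_000; // 5-second tumbling windows, as in the snippet above
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long t : new long[]{1_200, 4_999, 5_000, 9_800, 12_345}) {
            windows.computeIfAbsent(windowStart(t, size), k -> new ArrayList<>()).add(t);
        }
        System.out.println(windows); // {0=[1200, 4999], 5000=[5000, 9800], 10000=[12345]}
    }
}
```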
Different Window Schemes
▪ Tumbling count Windows: defines a window of 1000
elements that “tumbles”: elements are grouped, in arrival
order, into groups of 1000, and each element belongs to
exactly one window
stream.countWindow(1000);
▪ Sliding count Windows: defines a window of 1000
elements that slides every 100 elements; elements
can belong to multiple windows.
stream.countWindow(1000, 100)
13
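A plain-Java sketch (illustrative only, not Flink's implementation) of how count windows group elements; with slide equal to size, the sliding variant degenerates to tumbling count windows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CountWindows {
    // A window closes every 'slide' elements and holds the last 'size'
    // elements seen so far; slide == size gives tumbling count windows.
    // Early windows may hold fewer than 'size' elements.
    static <T> List<List<T>> slidingCountWindows(List<T> elements, int size, int slide) {
        List<List<T>> windows = new ArrayList<>();
        for (int end = slide; end <= elements.size(); end += slide) {
            int start = Math.max(0, end - size);
            windows.add(new ArrayList<>(elements.subList(start, end)));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Integer> in = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(slidingCountWindows(in, 3, 3)); // [[1, 2, 3], [4, 5, 6]]
        System.out.println(slidingCountWindows(in, 3, 2)); // [[1, 2], [2, 3, 4], [4, 5, 6]]
    }
}
```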
Tumbling Count Windows Animation
14
Courtesy: Data Artisans
Count Windows
15
Tumbling Count Window, Size = 3
Count Windows
22
Sliding Count Window, Size = 3,
sliding every 2 elements
Flink Streaming API
26
Flink DataStream API
27
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        // Converts DataStream -> KeyedStream
        .keyBy(0) // Group by first element of the Tuple
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/StreamingWordCount.java
Streaming WordCount (Explained)
▪ Obtain a StreamExecutionEnvironment
▪ Connect to a DataSource
▪ Specify transformations on the
DataStreams
▪ Specify output for the processed data
▪ Execute the program
28
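What the pipeline computes can be checked with a plain-Java equivalent of flatMap → keyBy → sum (no Flink dependency; an illustration of the dataflow, not the actual API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // flatMap: split each line into words; keyBy + sum: group words, add counts.
    static Map<String, Integer> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("to be or", "not to be")).get("to")); // 2
    }
}
```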
Flink Window API
30
Keyed Windows (Grouped by Key)
31
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0) // Group by first element of the Tuple
        // Create a window of 'windowSize' records that slides
        // by 'slideSize' records
        .countWindow(windowSize, slideSize)
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Keyed Windows
32
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0) // Group by first element of the Tuple
        // Converts KeyedStream -> WindowedStream
        .timeWindow(Time.of(1, TimeUnit.SECONDS))
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Global Windows
33
All incoming elements of a given key are assigned to
the same window.
lines.flatMap(new LineSplitter())
    // group by the tuple field "0"
    .keyBy(0)
    // all records for a given key are assigned to the same window
    .window(GlobalWindows.create())
    // and sum up tuple field "1"
    .sum(1)
    // consider only word counts > 1
    .filter(new WordCountFilter())
Flink Streaming API (Tumbling Windows)
34
• All incoming elements are assigned to a window of
a certain size based on their timestamp
• Each element is assigned to exactly one window
Flink Streaming API (Tumbling Window)
35
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0) // Group by first element of the Tuple
        // Tumbling window
        .timeWindow(Time.of(1, TimeUnit.SECONDS))
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Demos
36
Twitter + Flink Streaming
37
• Create a Flink DataStream from live Twitter feed
• Split the Stream into multiple DataStreams based
on some criterion
• Persist the respective streams to Storage
https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/twitter
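The split step can be sketched in plain Java (illustration only; the `isEnglish` predicate is hypothetical, and the actual demo uses Flink's stream-splitting operators):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class StreamSplitSketch {
    // Route each tweet to one of two output streams based on a predicate.
    static Map<String, List<String>> split(List<String> tweets, Predicate<String> isEnglish) {
        Map<String, List<String>> routed = new HashMap<>();
        routed.put("english", new ArrayList<>());
        routed.put("other", new ArrayList<>());
        for (String t : tweets) {
            routed.get(isEnglish.test(t) ? "english" : "other").add(t);
        }
        return routed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> out =
                split(Arrays.asList("hello world", "hola mundo"), t -> t.startsWith("hello"));
        System.out.println(out.get("english")); // [hello world]
    }
}
```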
Flink Event Processing: Animation
38
Courtesy: Ufuk Celebi and Stephan Ewen, Data Artisans
39
Tumbling Windows of 4 Seconds
[Animation frames: event timestamps assigned to 4-second windows 0-3, 4-7, 8-11, 20-23, 24-27, 32-35; figure not reproducible in text]
tl;dr
40
• Event Time Processing is unique to Apache Flink
• Flink provides exactly-once guarantees
• With Release 0.10.0, Flink supports Streaming
windows, sessions, triggers, multi-triggers, deltas
and event-time.
References
41
• Data Streaming Fault Tolerance in Flink
• Lightweight Asynchronous Snapshots for
Distributed Dataflows: http://arxiv.org/pdf/1506.08603.pdf
• Google Dataflow paper
Acknowledgements
42
Thanks to following folks from Data Artisans for their
help and feedback:
• Ufuk Celebi
• Till Rohrmann
• Stephan Ewen
• Marton Balassi
• Robert Metzger
• Fabian Hueske
• Kostas Tzoumas
Questions?
43