Using Location Data to
Showcase Keys, Windows, and Joins
in Kafka Streams DSL and KSQL
Where are my Keys?
Neil Buesing
Kafka Summit
London, 2019
!1
Introduction
• Object Partners, Inc
• Located : Minneapolis, Minnesota & Omaha, Nebraska, USA
• http://www.objectpartners.com
• Software Development Consulting
• JVM Technologies, Mobile/Web, DevOps, Real-Time Data
• Neil Buesing
• Director, Real-Time Data
• 19 Years with Object Partners, Inc.
• Find me at:
• https://www.linkedin.com/in/neilbuesing
• https://twitter.com/nbuesing
• https://github.com/nbuesing
!2
https://github.com/nbuesing/kafka-summit-london-2019
Source Code
• Fully Contained GitHub Repository
• Java
• Spring Boot
• Gradle Build Files
• Docker Container (Kafka Cluster)
!3
The “Pre” Projects
• Common
• Avro Data Model
• KdTree
• Bucket / Bucket Factory
• Geolocation
• RESTful Location to Airport Lookup Service
• Connector
• OpenSky Apache Kafka Source Connector
• Web Application
• D3 / Spring Boot / Spring / Spring MVC
• Docker
• 3 brokers, 1 ZooKeeper, 1 Schema Registry (Docker Compose file)
!4
Avro Data Model
Record (OpenSky)
Aircraft
Location
Nearest Airport
Count
Distance
!5
Airport Lookup Service
• Get Location of An Airport
• /airport/{code}
• Closest Airport
• /airport?latitude={latitude}&longitude={longitude}
!6
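The closest-airport endpoint above can be illustrated with a minimal sketch. The deck's project lists a KdTree for this lookup; the version below swaps in a plain linear scan with the haversine formula, purely for illustration. The class name, the two airports, and their approximate coordinates are assumptions, not taken from the repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a linear-scan stand-in for a KdTree-backed airport lookup.
final class ClosestAirportSketch {

    // Mean Earth radius in meters.
    private static final double EARTH_RADIUS_M = 6_371_000.0;

    // Airport code -> {latitude, longitude}; coordinates are approximate (assumption).
    static final Map<String, double[]> AIRPORTS = new LinkedHashMap<>();
    static {
        AIRPORTS.put("MSP", new double[] {44.88, -93.22});
        AIRPORTS.put("OMA", new double[] {41.30, -95.89});
    }

    // Great-circle distance in meters using the haversine formula.
    static double distance(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // Returns the code of the airport nearest to the given location.
    static String closestAirport(double latitude, double longitude) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : AIRPORTS.entrySet()) {
            double d = distance(latitude, longitude, e.getValue()[0], e.getValue()[1]);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A linear scan is fine for a handful of airports; a spatial index such as the KdTree becomes worthwhile once every incoming position has to be matched against thousands of airports.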
OpenSky Source Connector
• Pulls current data from OpenSky API
• Offset - a timestamp
• Polling - 30 seconds
{
  "time": 1535739820,
  "states": [
    [
      "a12345",
      "N0000 ",
      "United States",
      1535739649,
      1535739649,
      -122.5351,
      38.1321,
      167.64,
      false,
      31.29,
      226.33,
      -2.93,
      null,
      160.02,
      null,
      false,
      0
    ]
  ]
}
Highlighted fields: transponder, callsign, geolocation, position update
api.getStates(0, null, new OpenSkyApi.BoundingBox(24.39, 49.38, -124.84, -66.88));
api.getStates(0, null, new OpenSkyApi.BoundingBox(-80.0, 80.0, -180.0, 180.0));
The OpenSky Network, http://www.opensky-network.org
!7
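As a rough illustration of the highlighted fields, the sketch below reads them out of one state vector by position. The index mapping (0 = transponder, 1 = callsign, 3 = position update time, 5 = longitude, 6 = latitude) follows the ordering of the sample above; the class and method names are hypothetical and not part of the connector.

```java
// Hypothetical sketch: pick the highlighted fields out of one OpenSky state vector.
final class StateVectorSketch {

    // Positions of the highlighted fields within a state vector.
    static final int TRANSPONDER = 0;
    static final int CALLSIGN = 1;
    static final int POSITION_UPDATE = 3;
    static final int LONGITUDE = 5;
    static final int LATITUDE = 6;

    // The sample state vector from the slide.
    static final Object[] SAMPLE = {
        "a12345", "N0000 ", "United States", 1535739649L, 1535739649L,
        -122.5351, 38.1321, 167.64, false, 31.29, 226.33, -2.93,
        null, 160.02, null, false, 0
    };

    static String transponder(Object[] state) {
        return (String) state[TRANSPONDER];
    }

    // Callsigns arrive padded with spaces, so trim them.
    static String callsign(Object[] state) {
        String raw = (String) state[CALLSIGN];
        return raw == null ? null : raw.trim();
    }

    // Seconds since the epoch of the last position update.
    static long positionUpdate(Object[] state) {
        return ((Number) state[POSITION_UPDATE]).longValue();
    }

    static double longitude(Object[] state) {
        return ((Number) state[LONGITUDE]).doubleValue();
    }

    static double latitude(Object[] state) {
        return ((Number) state[LATITUDE]).doubleValue();
    }
}
```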
OpenSky Source Connector
• Kafka Connect
• Not a lot of code needed to write
• Considering making this an Open Source Connector
!8
Nearest Airport
• How many aircraft are closer to a given airport than any other airport?
• What is my stateful window/duration I want to measure?
• How do I access the data?
• How do I count my data?
• How do I find my data?
• How do I provide a visual of this data to the user?
Callouts: Kafka Streams, D3
!9
Nearest Airport
• Airport Lookup is based on distance
• First thought - Airport as a KTable.
• How do I find my keys?
• Window - 5 minute tumbling window
• What to do with late arriving data?
• Source - Airports
• RESTful endpoint
• Source - OpenSky
• Kafka Source Connector
• Frequency of updates?
• Source Offsets?
!10
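To make the 5 minute tumbling window and the late-arrival question concrete, here is a minimal sketch of the arithmetic involved. It assumes epoch-millisecond timestamps and a 1 minute grace period; the class and method names are hypothetical, and the acceptance rule is a simplification of what Kafka Streams does internally.

```java
// Hypothetical sketch: tumbling-window arithmetic with a grace period.
final class TumblingWindowSketch {

    static final long SIZE_MS = 5 * 60 * 1000L;   // 5 minute window
    static final long GRACE_MS = 60 * 1000L;      // 1 minute grace (assumption)

    // Tumbling windows are aligned to the epoch, so the start is the
    // timestamp rounded down to a multiple of the window size.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % SIZE_MS);
    }

    // The window end is exclusive.
    static long windowEnd(long timestampMs) {
        return windowStart(timestampMs) + SIZE_MS;
    }

    // A record is still accepted while stream time has not yet reached
    // the end of its window plus the grace period.
    static boolean accepted(long recordTimestampMs, long streamTimeMs) {
        return streamTimeMs < windowEnd(recordTimestampMs) + GRACE_MS;
    }
}
```

With these numbers, a reading stamped 1535739820000 falls in the window starting at 1535739600000 and is dropped once stream time reaches 1535739960000.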
Nearest Airport - Lookup
• Create a windowed KTable of all flights
• 5 minute window, 1 minute grace period
• On update, keep the most recent reading
• Materialize the data so it is programmatically accessible
• Repeat for other items: materialize for all state stores
• What about multiple instances of my Kafka Streams application?
!11
Nearest Airport - Lookup
public KTable<Windowed<String>, Record> flightsStore() {
    return flights()
        .groupByKey()
        // 5 minute tumbling window with a 1 minute grace period
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
        // keep latest; materialize as "flights" to make it programmatically accessible
        .reduce((current, v) -> v, Materialized.as("flights"));
}
!12
Nearest Airport - Lookup
private ReadOnlyWindowStore<String, Record> flights() {
    // access by name
    return kafkaStreams().store(
        "flights", QueryableStoreTypes.windowStore()
    );
}
• Provide access to the state store as a queryable read-only window store.
!13
Nearest Airport - Lookup
public List<AircraftJson> flights(Long start, Long end) {
    // fetchAll for the given time window
    KeyValueIterator<Windowed<String>, Record> iterator =
        flights().fetchAll(Instant.ofEpochMilli(start), Instant.ofEpochMilli(end));
    …
}
• Provide access to the state store as a queryable read-only window store.
• State stores for KTables only store data for the partitions being processed by a given streams instance.
!14
Nearest Airport - Lookup
Aircraft
Pulled from flightsStore()
Materialized Stores
streams.allMetadataForStore("nearest_airport").forEach(app -> {
    list.add(app.host() + ":" + app.hostInfo().port());
});
!15
Nearest Airport - Count
!16
• Algorithm #1 - Count the Aircraft
Nearest Airport - Count
.selectKey((key, value) -> value.getAirport())
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
.aggregate(
    () -> 0,
    (key, value, aggregate) -> aggregate + 1,
    Materialized.as(NEAREST_AIRPORT_STORE)
)
.toStream((wk, v) -> wk.key())
What's wrong with this approach?
!17
Nearest Airport - Count - KSQL
• Step 1 : create specialized User Defined Function written in Java
@UdfDescription(name = "closestAirport", description = "return airport")
public class ClosestAirport {
    private Geolocation geolocation = Feign.builder()
        .options(new Request.Options(200, 200))
        .encoder(new JacksonEncoder())
        .decoder(new JacksonDecoder())
        .target(Geolocation.class, "http://geolocation:9080");

    @Udf(description = "find closest airport to given location.")
    public String closestAirport(final Double latitude, final Double longitude) {
        return geolocation.closestAirport(latitude, longitude).getCode();
    }
}
!18
Nearest Airport - Count - KSQL
!19
• Step 2 : deploy specialized function compiled as uber jar
Nearest Airport - Count - KSQL
• Step 3 : use specialized function to enrich the streaming data
create stream
  ksql_nearest_airport
as select
  aircraft->transponder transponder,
  closestAirport(location->latitude, location->longitude) as airport,
  location
from flights
partition by transponder;
!20
Nearest Airport - Count - KSQL
• Step 4 : aggregate the data using standard KSQL syntax
create table
  ksql_nearest_airport_count
as select
  airport,
  count(*) as count
from ksql_nearest_airport window tumbling (size 5 minutes)
group by airport;
!21
Nearest Airport - Aggregate
!22
• Algorithm #2 - Collect the Aircraft
Nearest Airport - Aggregate
flights()
    .map((key, value) -> {
        Airport airport = geolocation.closestAirport(value.getLocation());
        return KeyValue.pair(key, createNearestAirport(airport, value));
    })
    .groupBy((k, v) -> v.getAirport())
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(() -> null,
        (key, value, aggregate) -> {
            if (aggregate == null) {
                aggregate = createAgg(value.getAirport(), value.getAirportLocation());
            }
            // only add an aircraft the first time it is seen in this window
            if (!aggregate.getAircrafts().contains(value.getCallsign())) {
                aggregate.getAircrafts().add(value.getCallsign());
            }
            return aggregate;
        }, Materialized.as(NEAREST_AIRPORT_AGG_STORE));
Double Counting?
!23
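The double-counting concern can be shown without Kafka at all. The sketch below compares counting every reading with collecting distinct aircraft per airport; it assumes the same aircraft reports several times within one window, as happens when the connector polls every 30 seconds. The class name, method names, and the `{transponder, airport}` array layout are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: counting readings vs. collecting distinct aircraft.
final class DistinctAircraftSketch {

    // Each reading is {transponder, airport}. Counts every reading,
    // so an aircraft that reports three times is counted three times.
    static Map<String, Integer> countReadings(List<String[]> readings) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] reading : readings) {
            counts.merge(reading[1], 1, Integer::sum);
        }
        return counts;
    }

    // Collects the set of transponders per airport, so repeated
    // readings from the same aircraft collapse into one entry.
    static Map<String, Set<String>> collectAircraft(List<String[]> readings) {
        Map<String, Set<String>> aircraft = new HashMap<>();
        for (String[] reading : readings) {
            aircraft.computeIfAbsent(reading[1], k -> new TreeSet<>()).add(reading[0]);
        }
        return aircraft;
    }
}
```

The set-based version trades a larger aggregate value (every identifier is stored) for a count that does not grow with the polling frequency.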
Nearest Airport - Aggregate - KSQL
create table
  ksql_nearest_airport_count_agg
as select
  airport,
  collect_set(transponder) as count
from ksql_nearest_airport window tumbling (size 5 minutes)
group by airport;
create table
  ksql_nearest_airport_count_agg_count
as select
  airport,
  countList(count)
from ksql_nearest_airport window tumbling (size 5 minutes)
group by airport;
!24
Nearest Airport - Suppression
!25
• Algorithm #3 - Suppress Streaming of the Topology
• Count With Suppression (KIP-328: Ability to suppress updates for KTables)
• Kafka Streams 2.1
Nearest Airport - Suppression
public KStream<String, Record> flightsSuppressed() {
    return flightsStore()
        .suppress(Suppressed.untilWindowCloses(
            Suppressed.BufferConfig.unbounded()))
        .toStream()
        .selectKey((k, v) -> k.key());
}

KStream<String, NearestAirport> stream =
    flightsSuppressed()
        .map((key, value) -> {
            Airport airport = geolocation.closestAirport(value.getLocation());
            return KeyValue.pair(key, createNearestAirport(airport, value));
        });
!26
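The effect of suppressing until the window closes can be approximated with a small simulation. The sketch below buffers the latest value per window for a single key and releases it only once stream time passes the window end plus the grace period. It is a simplified model under stated assumptions, not how Kafka Streams implements suppression; the class and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: emit only the final value of each window for one key.
final class SuppressSketch {

    // Window end (ms) -> latest value seen for that window.
    private final TreeMap<Long, String> buffer = new TreeMap<>();
    private final long graceMs;

    SuppressSketch(long graceMs) {
        this.graceMs = graceMs;
    }

    // Buffers the latest value for a window, then returns the values of
    // all windows that have closed as of the given stream time.
    List<String> update(long windowEndMs, String value, long streamTimeMs) {
        buffer.put(windowEndMs, value);
        List<String> emitted = new ArrayList<>();
        Iterator<Map.Entry<Long, String>> it = buffer.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, String> entry = it.next();
            if (entry.getKey() + graceMs > streamTimeMs) {
                break; // this window and all later ones are still open
            }
            emitted.add(entry.getValue());
            it.remove();
        }
        return emitted;
    }
}
```

Nothing is emitted for a window until a later record advances stream time, and every open window stays in the buffer until then.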
Nearest Airport - Suppression
Algorithm Concerns?
• Information Delay
• Memory
• Not Available in KSQL
!27
Nearest Airport - Retrospective
• Kafka Streams
• Windows are not magic
• treating it like magic means you will get it wrong
• Window state-stores are powerful
• late arriving messages
• retention (default 24 hours)
• materialization
• change-log topics
• suppression
• keeps evolving (read the KIPs)
!28
Nearest Neighbor
!29
Nearest Neighbor
!30
• Find the nearest Blue Team aircraft for every given Red Team aircraft.
• Make sure the algorithm can be properly sharded so the work can be distributed.
• Selected a five minute time window.
Nearest Neighbor
For this red aircraft,
find the closest aircraft.
!31
Nearest Neighbor
Create a 3°x3° region that I call
the “bucket”
this becomes the topic key
!32
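One way to turn a location into a 3°x3° bucket key is to round latitude and longitude down to the nearest multiple of the bucket size. The sketch below assumes that approach and a simple "lat,lon" string key; the deck's own Bucket and Bucket Factory classes are not shown, so the class name and key format here are hypothetical.

```java
// Hypothetical sketch: derive a 3 degree x 3 degree bucket key from a location.
final class BucketKeySketch {

    static final double SIZE_DEGREES = 3.0;

    // Rounds a coordinate down to the lower edge of its bucket.
    static long lowerEdge(double degrees) {
        return (long) (Math.floor(degrees / SIZE_DEGREES) * SIZE_DEGREES);
    }

    // The key is the south-west corner of the bucket, e.g. "51,-3".
    static String bucketKey(double latitude, double longitude) {
        return lowerEdge(latitude) + "," + lowerEdge(longitude);
    }
}
```

Using `Math.floor` rather than integer truncation matters for negative coordinates: a longitude of -0.1 belongs to the bucket starting at -3, not 0.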
Nearest Neighbor
Aircraft 1
!33
Nearest Neighbor
!34
Bucket overlap
distance calculation performed
Nearest Neighbor
!35
Distance Object
Distance, Red Aircraft, & Blue Aircraft
Keep all the information needed for next operation
What did I do wrong?
Nearest Neighbor
Aircraft 2
!36
Nearest Neighbor
Place blue aircraft
into 9 “buckets”
(replicate the data)
!37
DSL : flatMap()
KSQL : “insert into”
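Replicating a blue aircraft into 9 buckets amounts to emitting its own bucket plus the eight neighbors, which is the list a `flatMap()` would fan out. The sketch below assumes "lat,lon" string keys on a 3 degree grid and ignores wrap-around at the antimeridian and the poles; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the 9 bucket keys a blue aircraft is replicated into.
final class SurroundingBucketsSketch {

    static final int SIZE = 3; // bucket size in degrees

    // Rounds a coordinate down to the lower edge of its bucket.
    static int lowerEdge(double degrees) {
        return ((int) Math.floor(degrees / SIZE)) * SIZE;
    }

    // Returns the aircraft's own bucket and its eight neighbors.
    // No wrap-around handling at +/-180 longitude or the poles.
    static List<String> surrounding(double latitude, double longitude) {
        int lat = lowerEdge(latitude);
        int lon = lowerEdge(longitude);
        List<String> keys = new ArrayList<>();
        for (int dLat = -SIZE; dLat <= SIZE; dLat += SIZE) {
            for (int dLon = -SIZE; dLon <= SIZE; dLon += SIZE) {
                keys.add((lat + dLat) + "," + (lon + dLon));
            }
        }
        return keys;
    }
}
```

Only one side needs replicating: red aircraft stay in a single bucket, so each red/blue pair meets in exactly one bucket.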
Nearest Neighbor
Bucket overlap
!38
Nearest Neighbor
Calculate distance
!39
Nearest Neighbor
Aircraft 3
!40
Nearest Neighbor
Place (replicate)
aircraft into 9 “buckets”
!41
Nearest Neighbor
No Bucket overlap
no distance calculated
Aircraft 3 not sharded
with red aircraft
!42
Nearest Neighbor
Aircraft 4
!43
Nearest Neighbor
Place (replicate)
aircraft into 9 “buckets”
!44
Nearest Neighbor
Bucket overlap
distance calculation performed
!45
[Figure: aircraft records (A, B, a, b, c) shown grouped by bucket key: 51° N, 6° W; 51° N, 3° E; 51° N, 3° W; 51° N, 0° W; 51° N, 9° W]
!46
Nearest Neighbor
!47
Stream Concepts
Key on bucket
Re-Key on red aircraft
Aggregate (reduce) on red aircraft
(keeping smallest distance)
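The last step, re-keying on the red aircraft and keeping the smallest distance, is an ordinary keyed reduce. The sketch below shows the same logic over an in-memory list; the `Distance` holder here is a hypothetical simplification with plain string identifiers, not the Avro record from the project.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: per red aircraft, keep the pairing with the smallest distance.
final class ClosestPerRedSketch {

    // Simplified stand-in for the Distance object (red, blue, distance).
    static final class Distance {
        final String red;
        final String blue;
        final double meters;

        Distance(String red, String blue, double meters) {
            this.red = red;
            this.blue = blue;
            this.meters = meters;
        }
    }

    // Groups by red aircraft and reduces each group to its minimum distance.
    static Map<String, Distance> closest(List<Distance> distances) {
        Map<String, Distance> result = new HashMap<>();
        for (Distance d : distances) {
            result.merge(d.red, d, (agg, v) -> v.meters < agg.meters ? v : agg);
        }
        return result;
    }
}
```

Because the Distance object carries both aircraft, the reduce needs no further lookups once the data has been re-keyed.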
Nearest Neighbor
!48
Algorithm Limitations?
Bucket size selection
Sparse Data Location
missing result
Sparse Data Location
wrong result
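One way a wrong result can arise with sparse data: buckets are measured in degrees, but a degree of longitude shrinks toward the poles. The sketch below, using made-up positions near 61° N, shows a blue aircraft two buckets away in longitude that is nearer in meters than one in an adjacent bucket to the north. The class, the positions, and the "lat,lon" grid are assumptions for illustration only.

```java
// Hypothetical sketch: the nearest aircraft can lie outside the 3x3 bucket neighborhood.
final class SparseBucketSketch {

    static final int SIZE = 3;                      // bucket size in degrees
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Rounds a coordinate down to the lower edge of its bucket.
    static int lowerEdge(double degrees) {
        return ((int) Math.floor(degrees / SIZE)) * SIZE;
    }

    // True when the blue aircraft's bucket is the red aircraft's bucket
    // or one of its eight neighbors, i.e. when the join would pair them.
    static boolean inNeighborhood(double redLat, double redLon, double blueLat, double blueLon) {
        return Math.abs(lowerEdge(redLat) - lowerEdge(blueLat)) <= SIZE
            && Math.abs(lowerEdge(redLon) - lowerEdge(blueLon)) <= SIZE;
    }

    // Great-circle distance in meters using the haversine formula.
    static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}
```

With a red aircraft at (61.5, 1.5), a blue aircraft at (65.9, 1.5) is paired with it, while a nearer one at (61.5, 7.0) is never compared, so the reported nearest neighbor is wrong. If no blue aircraft falls inside the neighborhood at all, the result is missing instead.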
Nearest Neighbor
!49
Performance Limitations?
Partitioning and Key Hash
Uniformity of the Data
Replication of Data
Nearest Neighbor
!50
Nearest Neighbor
!51
Nearest Neighbor
!52
Nearest Neighbor
!53
Nearest Neighbor
red()
    // key each red aircraft by the bucket it falls in
    .map((key, value) -> KeyValue.pair(bucketFactory.create(value.getLocation()), value))
    .join(
        blue()
            // replicate each blue aircraft into its surrounding buckets
            .flatMap((key, value) ->
                bucketFactory.createSurronding(value.getLocation())
                    .stream()
                    .map((b) -> KeyValue.pair(b.toString(), value))
                    .collect(Collectors.toList()))
            .selectKey((key, value) -> key),
        // compute the distance for each red/blue pair sharing a bucket
        (value1, value2) -> {
            double d = DistanceUtil.distance(value1.getLocation(), value2.getLocation());
            return new Distance(value1, value2, d);
        },
        JoinWindows.of(WINDOW),
        Joined.with(Serdes.String(), recordSerde, recordSerde))
    .to("distance", Produced.with(Serdes.String(), distaneSerde));
!54
Nearest Neighbor
KTable<Windowed<String>, Distance> result = distance()
    // re-key on the red aircraft
    .selectKey((k, v) -> v.getRed().getAircraft().getTransponder())
    .groupByKey()
    .windowedBy(TimeWindows.of(WINDOW))
    // keep the smallest distance seen in the window
    .aggregate(
        () -> new Distance(null, null, Double.MAX_VALUE),
        (k, v, agg) -> (v.getDistance() < agg.getDistance()) ? v : agg);

result.toStream()
    .map((k, v) -> KeyValue.pair(k.key().toString(), v))
    .to("closest");
!55
Same as Join Window?
Nearest Neighbor - KSQL
!56
create stream blueBucket
as select
  bucketLocation(location->latitude, location->longitude, 3.0) as bucket,
  aircraft->transponder as transponder, aircraft->callsign as callsign,
  location->latitude as latitude, location->longitude as longitude
from blue partition by bucket;

create stream blueBucket_w
as select
  bucketLocation(location->latitude, location->longitude, 3.0, 'w') as bucket,
  aircraft->transponder as transponder, aircraft->callsign as callsign,
  location->latitude as latitude, location->longitude as longitude
from blue partition by bucket;

insert into blueBucket select * from blueBucket_w;
Nearest Neighbor - Retrospective
• Kafka Streams
• understand your keys
• flatMap / insert into
• maintain the state you need within the domain
• understand intermediate topics
!57
Retrospective
• How do these examples help me / apply to me?

• Do I really need to write a distributed application?
• Should I be programmatically accessing the Kafka Stream State Stores? 

• Kafka Streams or KSQL
• how do I choose?
• do I have to choose?

• Some settings to investigate
• group.initial.rebalance.delay.ms
• segment size on change-log topic
• num.standby.replicas
!58
Credits
• Object Partners, Inc.
• Apache Kafka
• Confluent Platform
• OpenSky - https://opensky-network.org/
• D3
• D3 v4
• http://techslides.com/d3-map-starter-i
• Apache Avro
• KdTree Author Justin Wetherell
• https://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/KdTree.java
• Apache 2.0 License
• Distance Formula
• https://stackoverflow.com/questions/837872/calculate-distance-in-meters-when-you-know-longitude-and-latitude-in-java
• Additional open source libraries referenced in repository
!59
Questions
!60

More Related Content

Using Location Data to Showcase Keys, Windows, and Joins in Kafka Streams DSL and KSQL (Neil Buesing, Object Partners, Inc) Kafka Summit London 2019

  • 1. Using Location Data to Showcase Keys, Windows, and Joins in Kafka Streams DSL and KSQL Where are my Keys? Neil Buesing Kafka Summit London, 2019 !1
  • 2. Introduction • Object Partners, Inc • Located : Minneapolis, Minnesota & Omaha, Nebraska, USA • http://www.objectpartners.com • Software Development Consulting • JVM Technologies, Mobile/Web, DevOps, Real-Time Data • Neil Buesing • Director, Real-Time Data • 19 Years with Object Partners, Inc. • Find me at: • https://www.linkedin.com/in/neilbuesing • https://twitter.com/nbuesing • https://github.com/nbuesing !2
  • 3. https://github.com/nbuesing/kafka-summit-london-2019 Source Code • Fully Contained GitHub Repository • Java • Spring Boot • Gradle Build Files • Docker Container (Kafka Cluster) !3
  • 4. The “Pre” Projects • Common • Avro Data Model • KdTree • Bucket / Bucket Factory • Geolocation • RESTful Location to Airport Lookup Service • Connector • OpenSky Apache Kafka Source Connector • Web Application • D3 / Spring Boot / Spring / Spring MVC • Docker • 3 broker, 1 zookeeper, 1 SR docker compose file !4
  • 5. Avro Data Model Record (OpenSky) Aircraft Location Nearest Airport Count Distance !5
  • 6. Airport Lookup Service • Get Location of An Airport • /airport/{code} • Closest Airport • /airport?latitude={latitude}&longitude={longitude} !6
  • 7. OpenSky Source Connector • Pulls current data from OpenSky API • Offset — a timestamp • Polling - 30 seconds The OpenSky Network, http://www.opensky-network.org !7
  • 8. OpenSky Source Connector • Pulls current data from OpenSky API • Offset — a timestamp • Polling - 30 seconds { "time": 1535739820, "states": [ [ "a12345", “N0000 ", "United States", 1535739649, 1535739649, -122.5351, 38.1321, 167.64, false, 31.29, 226.33, -2.93, null, 160.02, null, false, 0 ] ] } The OpenSky Network, http://www.opensky-network.org !7
  • 9. OpenSky Source Connector • Pulls current data from OpenSky API • Offset — a timestamp • Polling - 30 seconds { "time": 1535739820, "states": [ [ "a12345", “N0000 ", "United States", 1535739649, 1535739649, -122.5351, 38.1321, 167.64, false, 31.29, 226.33, -2.93, null, 160.02, null, false, 0 ] ] } transponder callsign geolocation position update The OpenSky Network, http://www.opensky-network.org !7
  • 10. OpenSky Source Connector • Pulls current data from OpenSky API • Offset — a timestamp • Polling - 30 seconds { "time": 1535739820, "states": [ [ "a12345", “N0000 ", "United States", 1535739649, 1535739649, -122.5351, 38.1321, 167.64, false, 31.29, 226.33, -2.93, null, 160.02, null, false, 0 ] ] } transponder callsign geolocation position update api.getStates(0, null, new OpenSkyApi.BoundingBox(24.39, 49.38, -124.84, -66.88)); The OpenSky Network, http://www.opensky-network.org !7
  • 11. OpenSky Source Connector • Pulls current data from OpenSky API • Offset — a timestamp • Polling - 30 seconds { "time": 1535739820, "states": [ [ "a12345", “N0000 ", "United States", 1535739649, 1535739649, -122.5351, 38.1321, 167.64, false, 31.29, 226.33, -2.93, null, 160.02, null, false, 0 ] ] } transponder callsign geolocation position update api.getStates(0, null, new OpenSkyApi.BoundingBox(24.39, 49.38, -124.84, -66.88)); api.getStates(0, null, new OpenSkyApi.BoundingBox(-80.0, 80.0, -180.0, 180.0)); The OpenSky Network, http://www.opensky-network.org !7
  • 12. OpenSky Source Connector The OpenSky Network, http://www.opensky-network.org { "time": 1535739820, "states": [ [ "a12345", “N0000 ", "United States", 1535739649, 1535739649, -122.5351, 38.1321, 167.64, false, 31.29, 226.33, -2.93, null, 160.02, null, false, 0 ] ] } • Kafka Connect • Not a lot of code needed to write • Considering making this an Open
 Source Connector !8
  • 13. Nearest Airport • How many aircrafts are closer to a given airport than any other airport? • What is my stateful window/duration I want to measure? • How do I access the data? • How do I count my data? • How do I find my data? • How do I provide a visual of this data to the user? !9
  • 14. Kafka Streams Nearest Airport • How many aircrafts are closer to a given airport than any other airport? • What is my stateful window/duration I want to measure? • How do I access the data? • How do I count my data? • How do I find my data? • How do I provide a visual of this data to the user? !9
  • 15. D3 Kafka Streams Nearest Airport • How many aircrafts are closer to a given airport than any other airport? • What is my stateful window/duration I want to measure? • How do I access the data? • How do I count my data? • How do I find my data? • How do I provide a visual of this data to the user? !9
  • 16. Nearest Airport • Airport Lookup is based distance • First thought - Airport as a KTable. • How do I find my keys? • Window - 5 minute tumbling window • What to do with late arriving data? • Source - Airports • RESTful endpoint • Source - OpenSky • Kafka Source Connector • Frequency of updates? • Source Offsets? !10
  • 17. • Create A Windowed KTable of all flights • 5 minute window, 1 minute grace period • On update, keep the most recent reading • Materialize the data so it is programmatically accessible • Repeat for other items—Materialize for all state-stores • What about multiple instances of my Kafka Streams Applications? Nearest Airport - Lookup !11
  • 18. Nearest Airport - Lookup public KTable<Windowed<String>, Record> flightsStore() { return flights() .groupByKey() .windowedBy(TimeWindows.of("5m").grace("1m")) .reduce((current, v) -> { return v; }, Materialized.as(“flights”)); } !12 Keep latest Make programmatically accessible
  • 19. Nearest Airport - Lookup private ReadOnlyWindowStore<String, Record> flights() { return kafkaStreams().store( "flights", QueryableStoreTypes.windowStore() ); } • Provide access to the state store as a queryable read-only window store. !13 Access by Name
  • 20. Nearest Airport - Lookup public List<AircraftJson> flights(Long start, Long end) { KeyValueIterator<Windowed<String>, Record> iterator = flights().fetchAll(Instant.ofEpochMilli(start), Instant.ofEpochMilli(end)); … } • Provide access to the state store as a queryable read-only window store. !14 fetchAll for the given time window state-store for kTables only store data in the partitions being processed by a given streams
  • 21. Nearest Airport - Lookup Aircrafts Pulled from flightStore() Materialized Stores streams.allMetadataForStore("nearest_airport").forEach(app -> { list.add(app.host() + ":" + app.hostInfo().port()); }); !15
  • 22. Nearest Airport - Count !16 • Algorithm #1 - Count the Aircrafts
  • 23. Nearest Airport - Count .selectKey((key, value) -> value.getAirport()) .groupByKey() .windowedBy(TimeWindows.of(“5m").grace(Duration.of("1m")) .aggregate( () -> 0, (key, value, aggregate) -> { return aggregate + 1; }, Materialized.as(NEAREST_AIRPORT_STORE) ) .toStream((wk, v) -> wk.key()) !17
  • 24. Nearest Airport - Count .selectKey((key, value) -> value.getAirport()) .groupByKey() .windowedBy(TimeWindows.of(“5m").grace(Duration.of("1m")) .aggregate( () -> 0, (key, value, aggregate) -> { return aggregate + 1; }, Materialized.as(NEAREST_AIRPORT_STORE) ) .toStream((wk, v) -> wk.key()) What’s wrong with this approach? !17
  • 25. Nearest Airport - Count - KSQL @UdfDescription(name = "closestAirport", description = "return airport") public class ClosestAirport { private Geolocation geolocation = Feign.builder() .options(new Request.Options(200, 200)) .encoder(new JacksonEncoder()) .decoder(new JacksonDecoder()) .target(Geolocation.class, "http://geolocation:9080"); @Udf(description = "find closest airport to given location.") public String closestAirport(final Double latitude, final Double longitude) { return geolocation.closestAirport(latitude, longitude).getCode(); } } !18 • Step 1 : create specialized User Defined Function Written in Java
  • 26. Nearest Airport - Count - KSQL !19 • Step 2 : deploy specialized function compiled as uber jar
  • 27. Nearest Airport - Count - KSQL create stream ksql_nearest_airport as select aircraft->transponder transponder, closestAirport(location->latitude, location->longitude) as airport, location from flights partition by transponder; !20 • Step 3 : use specialized function to enrich the streaming data
  • 28. Nearest Airport - Count - KSQL create table ksql_nearest_airport_count as select airport, count(*) as count from ksql_nearest_airport window tumbling (size 5 minutes) group by airport; !21 • Step 4 : aggregate the data using standard KSQL syntax
  • 29. Nearest Airport - Aggregate !22 • Algorithm #2 - Collect the Aircrafts
  • 30. Nearest Airport - Aggregate flights() .map((key, value) -> { Airport airport = geolocation.closestAirport(value.getLocation()); return KeyValue.pair(key, createNearestAirport(airport, value)); }) .groupBy((k, v) -> v.getAirport()) .windowedBy(TimeWindows.of("5m")) .aggregate(() -> null, (key, value, aggregate) -> { if (aggregate == null) { aggregate = createAgg(value.getAirport(), value.getAirportLocation()); } if (!aggregate.getAircrafts().contains(value.getCallsign())) { aggregate.getAircrafts().add(value.getCallsign()); } return aggregate; }, Materialized.as(NEAREST_AIRPORT_AGG_STORE)); !23
  • 31. Nearest Airport - Aggregate flights() .map((key, value) -> { Airport airport = geolocation.closestAirport(value.getLocation()); return KeyValue.pair(key, createNearestAirport(airport, value)); }) .groupBy((k, v) -> v.getAirport()) .windowedBy(TimeWindows.of("5m")) .aggregate(() -> null, (key, value, aggregate) -> { if (aggregate == null) { aggregate = createAgg(value.getAirport(), value.getAirportLocation()); } if (!aggregate.getAircrafts().contains(value.getCallsign())) { aggregate.getAircrafts().add(value.getCallsign()); } return aggregate; }, Materialized.as(NEAREST_AIRPORT_AGG_STORE)); Double Counting? !23
  • 32. create table ksql_nearest_airport_count_agg_count as select airport, countList(count) from ksql_nearest_airport window tumbling (size 5 minutes) group by airport; Nearest Airport - Aggregate - KSQL create table ksql_nearest_airport_count_agg as select airport, collect_set(transponder) as count from ksql_nearest_airport window tumbling (size 5 minutes) group by airport; !24
  • 33. Nearest Airport - Suppression !25 • Algorithm #3 - Suppress Streaming of the Topology
 • Count With Suppression 
 (KIP-328: Ability to suppress updates for KTables)
 • Kafka Streams 2.1
  • 34. Nearest Airport - Suppression
    public KStream<String, Record> flightsSuppressed() {
        return flightsStore()
            .suppress(Suppressed.untilWindowCloses(
                Suppressed.BufferConfig.unbounded()))
            .toStream()
            .selectKey((k, v) -> k.key());
    }

    KStream<String, NearestAirport> stream = flightsSuppressed()
        .map((key, value) -> {
            Airport airport = geolocation.closestAirport(value.getLocation());
            return KeyValue.pair(key, createNearestAirport(airport, value));
        });
  • 35. Nearest Airport - Suppression • Algorithm Concerns? • Information Delay • Memory • Not Available in KSQL
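The delay and memory concerns can be seen in a small plain-Java simulation of window suppression. This is a simplified sketch, not the Kafka Streams implementation: it assumes a zero grace period, a single key, and an in-memory buffer, and the class `SuppressSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified simulation of "emit only when the window closes" (hypothetical, zero grace period). */
final class SuppressSketch {
    private final long windowMs;
    // buffered, not-yet-emitted results: window start -> running count
    private final TreeMap<Long, Long> openWindows = new TreeMap<>();
    // stream time only moves forward, driven by the largest record timestamp seen
    private long streamTime = Long.MIN_VALUE;

    SuppressSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Processes one record timestamp; returns "windowStart=count" for each window that just closed. */
    List<String> onRecord(long timestampMs) {
        streamTime = Math.max(streamTime, timestampMs);
        long start = timestampMs - (timestampMs % windowMs);
        if (start + windowMs > streamTime) {
            openWindows.merge(start, 1L, Long::sum);
        } // else: the record is late for an already closed window and is dropped

        List<String> closed = new ArrayList<>();
        Iterator<Map.Entry<Long, Long>> it = openWindows.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> e = it.next();
            if (e.getKey() + windowMs <= streamTime) {
                closed.add(e.getKey() + "=" + e.getValue());
                it.remove();
            }
        }
        return closed;
    }
}
```

Nothing is emitted for a window until a later record advances stream time past the window end (the information delay), and every open window has to be held in the buffer until then (the memory cost).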
  • 40. Nearest Airport - Retrospective • Kafka Streams • Windows are not magic • treating it like magic means you will get it wrong • Window state-stores are powerful • late-arriving messages • retention (default 24 hours) • materialization • change-log topics • suppression • keeps evolving (read the KIPs)
  • 42. Nearest Neighbor • Find the nearest Blue Team aircraft for every given Red Team aircraft.
 • Make sure the algorithm can be properly sharded so the work can be distributed.
 • Selected a five-minute time window.
  • 43. Nearest Neighbor • For this red aircraft, find the closest aircraft.
  • 44. Nearest Neighbor • Create a 3°x3° region that I call the “bucket”; this becomes the topic key.
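A minimal sketch of turning a location into a 3°x3° bucket key. The talk's own bucket factory is not shown in the slides, so the class `BucketKeySketch` and the `"lat,lon"` key format (the bucket's south-west corner) are assumptions for illustration only.

```java
/** Hypothetical bucket-key calculation; the key is the bucket's south-west corner. */
final class BucketKeySketch {
    static final double SIZE_DEGREES = 3.0;

    /** Lower edge of the bucket containing the coordinate, along one axis. */
    static long corner(double degrees) {
        // Math.floor (not a plain cast) so negative coordinates round down, e.g. -0.1 -> -3
        return (long) (Math.floor(degrees / SIZE_DEGREES) * SIZE_DEGREES);
    }

    /** Bucket key for a latitude/longitude pair in decimal degrees. */
    static String bucketKey(double latitude, double longitude) {
        return corner(latitude) + "," + corner(longitude);
    }
}
```

Every aircraft inside the same 3°x3° square gets the same key, so those records land on the same partition.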
  • 47. Nearest Neighbor • Distance Object: Distance, Red Aircraft, & Blue Aircraft • Keep all the information needed for next operation • What did I do wrong?
  • 50. Nearest Neighbor • Place blue aircraft into 9 “buckets” (replicate the data) • DSL : flatMap() • KSQL : “insert into”
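A minimal sketch of the replication step: one blue aircraft location expands into the keys of its own bucket plus the eight neighbours, which is the list a flatMap() would emit. The class name and `"lat,lon"` key format are hypothetical, and the sketch deliberately ignores wrap-around at the poles and the antimeridian.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical expansion of one location into its 3x3 block of bucket keys. */
final class SurroundingBucketsSketch {
    static final int SIZE = 3; // bucket size in degrees

    /** Lower edge of the bucket containing the coordinate, along one axis. */
    static int corner(double degrees) {
        return (int) Math.floor(degrees / SIZE) * SIZE;
    }

    /** Keys of the bucket containing the location and its eight neighbours. */
    static List<String> surrounding(double latitude, double longitude) {
        int lat = corner(latitude);
        int lon = corner(longitude);
        List<String> keys = new ArrayList<>(9);
        for (int dLat = -SIZE; dLat <= SIZE; dLat += SIZE) {
            for (int dLon = -SIZE; dLon <= SIZE; dLon += SIZE) {
                keys.add((lat + dLat) + "," + (lon + dLon));
            }
        }
        return keys;
    }
}
```

A blue aircraft at 51.5° N, 0.1° W sits in bucket `51,-3` but is also published under `51,0`, so a red aircraft just across the boundary at 0.1° E can still be joined with it.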
  • 55. Nearest Neighbor • No bucket overlap: no distance calculated • Aircraft 3 not sharded with red aircraft
  • 58. Nearest Neighbor • Bucket overlap: distance calculation performed
  • 59. [Figure: buckets along 51° N labeled 9° W, 6° W, 3° W, 0° W and 3° E, with aircraft markers A, B and a, b, c placed in them]
  • 66. Nearest Neighbor • Stream Concepts • Key on bucket • Re-Key on red aircraft • Aggregate (reduce) on red aircraft (keeping smallest distance)
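The final reduce step can be sketched in plain Java: once every distance result is keyed by its red aircraft, keeping the smallest distance per key gives the nearest blue aircraft. `ClosestBlueSketch` and its string identifiers are hypothetical, and windowing is left out to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical "keep the smallest distance per red aircraft" reduce, without windowing. */
final class ClosestBlueSketch {
    private final Map<String, String> closestBlue = new HashMap<>();
    private final Map<String, Double> closestDistance = new HashMap<>();

    /** Offers one red/blue distance; it is kept only if it beats the current best for that red aircraft. */
    void onDistance(String red, String blue, double meters) {
        Double best = closestDistance.get(red);
        if (best == null || meters < best) {
            closestDistance.put(red, meters);
            closestBlue.put(red, blue);
        }
    }

    /** Nearest blue aircraft seen so far for the red aircraft, or null if none. */
    String closestTo(String red) {
        return closestBlue.get(red);
    }
}
```

Re-keying before this step matters because the join output is keyed by bucket; grouping by red aircraft brings all of its candidate distances together regardless of which bucket produced them.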
  • 71. Nearest Neighbor • Algorithm Limitations? • Bucket size selection • Sparse Data Location: missing result • Sparse Data Location: wrong result
  • 76. Nearest Neighbor • Performance Limitations? • Partitioning and Key Hash • Uniformity of the Data • Replication of Data
  • 82. Nearest Neighbor
    red()
        // key each red aircraft by its own bucket
        .map((key, value) -> KeyValue.pair(bucketFactory.create(value.getLocation()), value))
        .join(
            blue()
                // replicate each blue aircraft into its surrounding buckets
                .flatMap((key, value) ->
                    bucketFactory.createSurronding(value.getLocation())
                        .stream()
                        .map((b) -> KeyValue.pair(b.toString(), value))
                        .collect(Collectors.toList()))
                .selectKey((key, value) -> key),
            (value1, value2) -> {
                double d = DistanceUtil.distance(value1.getLocation(), value2.getLocation());
                return new Distance(value1, value2, d);
            },
            JoinWindows.of(WINDOW),
            Joined.with(Serdes.String(), recordSerde, recordSerde))
        .to("distance", Produced.with(Serdes.String(), distaneSerde));
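The join's value joiner computes a distance between two locations. The slides do not show `DistanceUtil.distance`, so the following is only a sketch of one common choice, the haversine great-circle formula; the class `HaversineSketch` and the mean Earth radius constant are assumptions.

```java
/** Hypothetical great-circle distance using the haversine formula. */
final class HaversineSketch {
    static final double EARTH_RADIUS_METERS = 6_371_000.0; // mean Earth radius, an approximation

    /** Distance in meters between two points given in decimal degrees. */
    static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_METERS * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}
```

With this radius, one degree of longitude along the equator comes out at roughly 111 km; altitude is ignored.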
  • 87. Nearest Neighbor
    KTable<Windowed<String>, Distance> result = distance()
        // re-key on the red aircraft's transponder
        .selectKey((k, v) -> v.getRed().getAircraft().getTransponder())
        .groupByKey()
        .windowedBy(TimeWindows.of(WINDOW))
        .aggregate(
            () -> new Distance(null, null, Double.MAX_VALUE),
            // keep whichever distance is smaller
            (k, v, agg) -> (v.getDistance() < agg.getDistance()) ? v : agg);

    result.toStream()
        .map((k, v) -> KeyValue.pair(k.key().toString(), v))
        .to("closest");
  • 92. Nearest Neighbor • Same as Join Window?
  • 93. Nearest Neighbor - KSQL
    create stream blueBucket as
      select bucketLocation(location->latitude, location->longitude, 3.0) as bucket,
             aircraft->transponder as transponder,
             aircraft->callsign as callsign,
             location->latitude as latitude,
             location->longitude as longitude
      from blue
      partition by bucket;
    create stream blueBucket_w as
      select bucketLocation(location->latitude, location->longitude, 3.0, 'w') as bucket,
             aircraft->transponder as transponder,
             aircraft->callsign as callsign,
             location->latitude as latitude,
             location->longitude as longitude
      from blue
      partition by bucket;
    insert into blueBucket select * from blueBucket_w;
  • 94. Nearest Neighbor - Retrospective • Kafka Streams • understand your keys • flatMap / insert into • maintain the state you need within the domain • understand intermediate topics
  • 95. Retrospective • How do these examples help me / apply to me?
 • Do I really need to write a distributed application? • Should I be programmatically accessing the Kafka Stream State Stores? 
 • Kafka Streams or KSQL • how do I choose? • do I have to choose?
 • Some settings to investigate • group.initial.rebalance.delay.ms • segment size on change-log topic • num.standby.replicas
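As a starting point for investigating these settings, here is a minimal sketch of where two of them would be set on the Kafka Streams side. The values are illustrative assumptions, not recommendations, and `StreamsSettingsSketch` is a hypothetical helper that only builds a `Properties` object.

```java
import java.util.Properties;

/** Hypothetical helper that collects Kafka Streams settings; values are illustrative only. */
final class StreamsSettingsSketch {
    static Properties streamsProperties() {
        Properties props = new Properties();
        // Kafka Streams: keep one standby copy of each state store on another instance
        props.put("num.standby.replicas", "1");
        // The "topic." prefix passes a topic-level config to the internal topics
        // Kafka Streams creates, which includes the change-log topics (64 MiB here)
        props.put("topic.segment.bytes", String.valueOf(64 * 1024 * 1024));
        return props;
    }
    // group.initial.rebalance.delay.ms is a broker setting (server.properties),
    // so it is configured on the brokers rather than in the application.
}
```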
  • 96. Credits • Object Partners, Inc. • Apache Kafka • Confluent Platform • OpenSky - https://opensky-network.org/ • D3 • D3 v4 • http://techslides.com/d3-map-starter-i • Apache Avro • KdTree Author Justin Wetherell • https://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/KdTree.java • Apache 2.0 License • Distance Formula • https://stackoverflow.com/questions/837872/calculate-distance-in-meters-when-you-know-longitude-and-latitude-in-java • Additional open source libraries referenced in repository