Using Location Data to Showcase Keys, Windows, and Joins in Kafka Streams DSL and KSQL (Neil Buesing, Object Partners, Inc) Kafka Summit London 2019
- 1. Using Location Data to Showcase Keys, Windows, and Joins in Kafka Streams DSL and KSQL
Where are my Keys?
Neil Buesing
Kafka Summit London, 2019
- 2. Introduction
• Object Partners, Inc
  • Located: Minneapolis, Minnesota & Omaha, Nebraska, USA
  • http://www.objectpartners.com
  • Software Development Consulting
  • JVM Technologies, Mobile/Web, DevOps, Real-Time Data
• Neil Buesing
  • Director, Real-Time Data
  • 19 Years with Object Partners, Inc.
  • Find me at:
    • https://www.linkedin.com/in/neilbuesing
    • https://twitter.com/nbuesing
    • https://github.com/nbuesing
- 4. The “Pre” Projects
• Common
  • Avro Data Model
  • KdTree
  • Bucket / Bucket Factory
• Geolocation
  • RESTful Location to Airport Lookup Service
• Connector
  • OpenSky Apache Kafka Source Connector
• Web Application
  • D3 / Spring Boot / Spring / Spring MVC
• Docker
  • 3 brokers, 1 ZooKeeper, 1 Schema Registry (SR) docker compose file
- 6. Airport Lookup Service
• Get Location of an Airport
  • /airport/{code}
• Closest Airport
  • /airport?latitude={latitude}&longitude={longitude}
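To make the "Closest Airport" lookup concrete, here is a minimal sketch under stated assumptions. It is not the talk's service: the class name, the three airports, and their approximate coordinates are made up for illustration, and a linear scan with the haversine formula stands in for the KdTree listed among the "Pre" projects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the closest-airport lookup; not the talk's implementation. */
class ClosestAirportSketch {

    // Illustrative airports with approximate (latitude, longitude) values.
    static final Map<String, double[]> AIRPORTS = new LinkedHashMap<>();
    static {
        AIRPORTS.put("MSP", new double[] {44.88, -93.22});
        AIRPORTS.put("OMA", new double[] {41.30, -95.89});
        AIRPORTS.put("LHR", new double[] {51.47, -0.45});
    }

    /** Great-circle distance in kilometers using the haversine formula. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
    }

    /** Linear scan over all airports; a k-d tree would avoid visiting every entry. */
    static String closest(double latitude, double longitude) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : AIRPORTS.entrySet()) {
            double d = distanceKm(latitude, longitude, e.getValue()[0], e.getValue()[1]);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A point in Minneapolis; MSP is the nearest of the three sample airports.
        System.out.println(closest(44.98, -93.27));
    }
}
```

A linear scan is fine for a handful of airports; with thousands of airports and a lookup per position update, a spatial index is the more plausible choice.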
- 11. OpenSky Source Connector
• Pulls current data from OpenSky API
• Offset: a timestamp
• Polling: every 30 seconds
{
  "time": 1535739820,
  "states": [
    [
      "a12345",
      "N0000 ",
      "United States",
      1535739649,
      1535739649,
      -122.5351,
      38.1321,
      167.64,
      false,
      31.29,
      226.33,
      -2.93,
      null,
      160.02,
      null,
      false,
      0
    ]
  ]
}
Callouts: transponder ("a12345"), callsign ("N0000 "), position update (1535739649), geolocation (longitude -122.5351, latitude 38.1321)
api.getStates(0, null, new OpenSkyApi.BoundingBox(24.39, 49.38, -124.84, -66.88));
api.getStates(0, null, new OpenSkyApi.BoundingBox(-80.0, 80.0, -180.0, 180.0));
The OpenSky Network, http://www.opensky-network.org
- 12. OpenSky Source Connector
• Kafka Connect
  • Not much code needed
  • Considering making this an open-source connector
The OpenSky Network, http://www.opensky-network.org
- 15. Nearest Airport
• How many aircraft are closer to a given airport than any other airport?
• What stateful window/duration do I want to measure?
• How do I access the data?
• How do I count my data?
• How do I find my data?
• How do I provide a visual of this data to the user?
Callouts: Kafka Streams, D3
- 16. Nearest Airport
• Airport lookup is based on distance
  • First thought: Airport as a KTable.
  • How do I find my keys?
• Window: 5-minute tumbling window
  • What to do with late-arriving data?
• Source: Airports
  • RESTful endpoint
• Source: OpenSky
  • Kafka Source Connector
  • Frequency of updates?
  • Source Offsets?
- 17. Nearest Airport - Lookup
• Create a windowed KTable of all flights
  • 5-minute window, 1-minute grace period
  • On update, keep the most recent reading
  • Materialize the data so it is programmatically accessible
• Repeat for other items: materialize all state stores
• What about multiple instances of my Kafka Streams application?
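The window and grace-period arithmetic can be sketched in plain Java. This is an illustration only, not Kafka Streams code: the class and method names are hypothetical, timestamps are assumed to be non-negative epoch milliseconds, and the acceptance rule reflects my understanding that a late record is still applied while stream time is before the window end plus the grace period.

```java
import java.time.Duration;

/** Hypothetical helper illustrating tumbling-window arithmetic; not Kafka Streams code. */
class TumblingWindowSketch {

    static final long SIZE_MS = Duration.ofMinutes(5).toMillis();
    static final long GRACE_MS = Duration.ofMinutes(1).toMillis();

    /** Start of the epoch-aligned tumbling window that contains the timestamp. */
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % SIZE_MS);
    }

    /** True while stream time has not yet reached window end plus grace. */
    static boolean accepted(long recordTimestampMs, long streamTimeMs) {
        long windowEnd = windowStart(recordTimestampMs) + SIZE_MS;
        return streamTimeMs < windowEnd + GRACE_MS;
    }

    public static void main(String[] args) {
        long reading = 1535739820000L;            // "time" from the sample payload, in ms
        long start = windowStart(reading);        // 1535739600000
        long end = start + SIZE_MS;
        System.out.println(start);
        System.out.println(accepted(reading, end + 30_000L)); // 30 s after window end
        System.out.println(accepted(reading, end + 60_000L)); // grace has elapsed
    }
}
```

With a 5-minute window and a 1-minute grace period, a reading can arrive up to a minute after its window has ended and still update that window's value.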
- 18. Nearest Airport - Lookup
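public KTable<Windowed<String>, Record> flightsStore() {
    return flights()
        .groupByKey()
        // 5-minute tumbling window with a 1-minute grace period
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
        // keep latest: the reducer always returns the newest reading;
        // Materialized.as(...) names the store so it is programmatically accessible
        .reduce((current, v) -> v, Materialized.as("flights"));
}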
- 19. Nearest Airport - Lookup
// access the state store by name
private ReadOnlyWindowStore<String, Record> flights() {
    return kafkaStreams().store(
        "flights", QueryableStoreTypes.windowStore()
    );
}
• Provide access to the state store as a queryable read-only window store.
- 20. Nearest Airport - Lookup
public List<AircraftJson> flights(Long start, Long end) {
    // fetchAll for the given time window
    KeyValueIterator<Windowed<String>, Record> iterator =
        flights().fetchAll(Instant.ofEpochMilli(start), Instant.ofEpochMilli(end));
    …
}
• Provide access to the state store as a queryable read-only window store.
• State stores for KTables only hold data for the partitions being processed by a given streams instance.
- 21. Nearest Airport - Lookup
Aircraft
Pulled from flightsStore()
Materialized Stores
streams.allMetadataForStore("nearest_airport").forEach(app -> {
    list.add(app.host() + ":" + app.hostInfo().port());
});
- 24. Nearest Airport - Count
.selectKey((key, value) -> value.getAirport())
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
.aggregate(
    () -> 0,
    (key, value, aggregate) -> aggregate + 1,
    Materialized.as(NEAREST_AIRPORT_STORE)
)
.toStream((wk, v) -> wk.key())
What’s wrong with this approach?
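One reading of the question on this slide: the connector polls every 30 seconds, so the same aircraft produces many updates per 5-minute window, and `aggregate + 1` counts updates rather than aircraft. The plain-Java sketch below illustrates the difference. The class, the `Update` type, and the sample data are hypothetical, and ordinary collections stand in for one window's state.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; plain collections stand in for one window's state. */
class CountVersusDistinctSketch {

    /** One position update: the aircraft and the airport it was nearest to. */
    static final class Update {
        final String transponder;
        final String airport;

        Update(String transponder, String airport) {
            this.transponder = transponder;
            this.airport = airport;
        }
    }

    /** Mirrors aggregate + 1: every update increments the airport's count. */
    static Map<String, Integer> countUpdates(List<Update> updates) {
        Map<String, Integer> counts = new HashMap<>();
        for (Update u : updates) {
            counts.merge(u.airport, 1, Integer::sum);
        }
        return counts;
    }

    /** Counts each transponder at most once per airport. */
    static Map<String, Integer> countDistinct(List<Update> updates) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (Update u : updates) {
            seen.computeIfAbsent(u.airport, k -> new HashSet<>()).add(u.transponder);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    /** Made-up data: aircraft a1 reports three times, a2 once, all nearest to MSP. */
    static List<Update> sample() {
        return Arrays.asList(
            new Update("a1", "MSP"), new Update("a1", "MSP"),
            new Update("a1", "MSP"), new Update("a2", "MSP"));
    }

    public static void main(String[] args) {
        System.out.println(countUpdates(sample()));  // prints {MSP=4}
        System.out.println(countDistinct(sample())); // prints {MSP=2}
    }
}
```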
- 25. Nearest Airport - Count - KSQL
@UdfDescription(name = "closestAirport", description = "return airport")
public class ClosestAirport {
    private Geolocation geolocation = Feign.builder()
        .options(new Request.Options(200, 200))
        .encoder(new JacksonEncoder())
        .decoder(new JacksonDecoder())
        .target(Geolocation.class, "http://geolocation:9080");

    @Udf(description = "find closest airport to given location.")
    public String closestAirport(final Double latitude, final Double longitude) {
        return geolocation.closestAirport(latitude, longitude).getCode();
    }
}
• Step 1: Create a specialized user-defined function (UDF) written in Java
- 26. Nearest Airport - Count - KSQL
• Step 2: Deploy the specialized function, compiled as an uber JAR
- 27. Nearest Airport - Count - KSQL
create stream
    ksql_nearest_airport
as select
    aircraft->transponder transponder,
    closestAirport(location->latitude, location->longitude) as airport,
    location
from flights
partition by transponder;
• Step 3: Use the specialized function to enrich the streaming data
- 28. Nearest Airport - Count - KSQL
create table
    ksql_nearest_airport_count
as select
    airport,
    count(*) as count
from ksql_nearest_airport window tumbling (size 5 minutes)
group by airport;
• Step 4: Aggregate the data using standard KSQL syntax
- 31. Nearest Airport - Aggregate
flights()
    .map((key, value) -> {
        Airport airport = geolocation.closestAirport(value.getLocation());
        return KeyValue.pair(key, createNearestAirport(airport, value));
    })
    .groupBy((k, v) -> v.getAirport())
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(() -> null,
        (key, value, aggregate) -> {
            if (aggregate == null) {
                aggregate = createAgg(value.getAirport(), value.getAirportLocation());
            }
            // add each callsign only once per airport and window
            if (!aggregate.getAircrafts().contains(value.getCallsign())) {
                aggregate.getAircrafts().add(value.getCallsign());
            }
            return aggregate;
        }, Materialized.as(NEAREST_AIRPORT_AGG_STORE));
Double Counting?
- 32. Nearest Airport - Aggregate - KSQL
create table
    ksql_nearest_airport_count_agg_count
as select
    airport,
    countList(count)
from ksql_nearest_airport window tumbling (size 5 minutes)
group by airport;
create table
    ksql_nearest_airport_count_agg
as select
    airport,
    collect_set(transponder) as count
from ksql_nearest_airport window tumbling (size 5 minutes)
group by airport;
- 33. Nearest Airport - Suppression
• Algorithm #3: Suppress Streaming of the Topology
  • Count With Suppression (KIP-328: Ability to suppress updates for KTables)
  • Kafka Streams 2.1
- 34. Nearest Airport - Suppression
public KStream<String, Record> flightsSuppressed() {
    return flightsStore()
        .suppress(Suppressed.untilWindowCloses(
            Suppressed.BufferConfig.unbounded()))
        .toStream()
        .selectKey((k, v) -> k.key());
}

KStream<String, NearestAirport> stream =
    flightsSuppressed()
        .map((key, value) -> {
            Airport airport = geolocation.closestAirport(value.getLocation());
            return KeyValue.pair(key, createNearestAirport(airport, value));
        });
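One way to read the suppression step: only the final reading per aircraft per window is forwarded, so an aircraft whose nearest airport changes mid-window is attributed to a single airport. The plain-Java sketch below imitates that effect with collections. It does not use the Kafka Streams API, and the class, the `Reading` type, and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of "keep the latest reading, then count"; one window only. */
class LatestReadingSketch {

    /** One reading: the aircraft, the airport it was nearest to, and when. */
    static final class Reading {
        final String transponder;
        final String airport;
        final long timestampMs;

        Reading(String transponder, String airport, long timestampMs) {
            this.transponder = transponder;
            this.airport = airport;
            this.timestampMs = timestampMs;
        }
    }

    /** Keeps the newest reading per transponder, then counts aircraft per airport. */
    static Map<String, Integer> countByLatest(List<Reading> readings) {
        Map<String, Reading> latest = new HashMap<>();
        for (Reading r : readings) {
            Reading previous = latest.get(r.transponder);
            if (previous == null || r.timestampMs >= previous.timestampMs) {
                latest.put(r.transponder, r);
            }
        }
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (Reading r : latest.values()) {
            counts.merge(r.airport, 1, Integer::sum);
        }
        return counts;
    }

    /** Made-up data: a1 moves from MSP to OMA within the window; a2 stays near MSP. */
    static List<Reading> sample() {
        return Arrays.asList(
            new Reading("a1", "MSP", 1000L),
            new Reading("a1", "OMA", 2000L),
            new Reading("a2", "MSP", 1500L));
    }

    public static void main(String[] args) {
        System.out.println(countByLatest(sample())); // prints {MSP=1, OMA=1}
    }
}
```

The trade-off is that nothing can be emitted for a window until it closes, which is where the information delay and memory concerns come from.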
- 39. Nearest Airport - Suppression
• Algorithm Concerns?
  • Information Delay
  • Memory
  • Not Available in KSQL
- 40. Nearest Airport - Retrospective
• Kafka Streams
  • Windows are not magic
    • treating them like magic means you will get it wrong
  • Window state stores are powerful
    • late-arriving messages
    • retention (default 24 hours)
    • materialization
    • change-log topics
    • suppression
  • keeps evolving (read the KIPs)
- 42. Nearest Neighbor
• Find the nearest Blue Team aircraft for every given Red Team aircraft.
• Make sure the algorithm can be properly sharded so the work can be distributed.
• Selected a five-minute time window.
- 59. [Figure, slides 59-62: aircraft labeled A, B and a, b, c shown across buckets labeled 51° N, 9° W; 51° N, 6° W; 51° N, 3° W; 51° N, 0° W; and 51° N, 3° E. The four slides are animation steps of the same diagram.]
- 82. Nearest Neighbor
red()
    // key each red aircraft by the bucket that contains it
    .map((key, value) -> KeyValue.pair(bucketFactory.create(value.getLocation()), value))
    .join(blue()
            // emit each blue aircraft once per surrounding bucket
            .flatMap((key, value) ->
                bucketFactory.createSurronding(value.getLocation())
                    .stream()
                    .map((b) -> KeyValue.pair(b.toString(), value))
                    .collect(Collectors.toList()))
            .selectKey((key, value) -> key),
        (value1, value2) -> {
            double d = DistanceUtil.distance(value1.getLocation(), value2.getLocation());
            return new Distance(value1, value2, d);
        },
        JoinWindows.of(WINDOW),
        Joined.with(Serdes.String(), recordSerde, recordSerde))
    .to("distance", Produced.with(Serdes.String(), distaneSerde));
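The join above depends on both sides agreeing on a bucket key. The sketch below shows one way such a bucket could be computed. It is a guess at the idea, not the talk's `bucketFactory`: the class, the method names, and the `latIndex:lonIndex` key format are hypothetical, and wrap-around at the antimeridian and the poles is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical grid bucketing; ignores wrap-around at +/-180 degrees and the poles. */
class BucketSketch {

    /** Key of the grid cell that contains the location. */
    static String create(double latitude, double longitude, double sizeDegrees) {
        long latIndex = (long) Math.floor(latitude / sizeDegrees);
        long lonIndex = (long) Math.floor(longitude / sizeDegrees);
        return latIndex + ":" + lonIndex;
    }

    /** Keys of the containing cell plus its eight neighbors. */
    static List<String> createSurrounding(double latitude, double longitude, double sizeDegrees) {
        long latIndex = (long) Math.floor(latitude / sizeDegrees);
        long lonIndex = (long) Math.floor(longitude / sizeDegrees);
        List<String> keys = new ArrayList<>(9);
        for (long dLat = -1; dLat <= 1; dLat++) {
            for (long dLon = -1; dLon <= 1; dLon++) {
                keys.add((latIndex + dLat) + ":" + (lonIndex + dLon));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Two aircraft on either side of the 0-degree meridian land in different cells,
        // but the surrounding cells of one include the cell of the other.
        String red = create(51.5, 0.1, 3.0);
        List<String> blue = createSurrounding(51.5, -0.1, 3.0);
        System.out.println(red + " in " + blue + ": " + blue.contains(red));
    }
}
```

Fanning out only one side keeps the other side at one record per update, at the cost of nine records per update on the fanned-out side.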
- 92. Nearest Neighbor
KTable<Windowed<String>, Distance> result = distance()
    .selectKey((k, v) -> v.getRed().getAircraft().getTransponder())
    .groupByKey()
    .windowedBy(TimeWindows.of(WINDOW))
    .aggregate(
        () -> new Distance(null, null, Double.MAX_VALUE),
        // keep whichever pairing has the smaller distance
        (k, v, agg) -> (v.getDistance() < agg.getDistance()) ? v : agg);
result.toStream()
    .map((k, v) -> KeyValue.pair(k.key().toString(), v))
    .to("closest");
Same as Join Window?
- 93. Nearest Neighbor - KSQL
create stream blueBucket
as select
    bucketLocation(location->latitude, location->longitude, 3.0) as bucket,
    aircraft->transponder as transponder, aircraft->callsign as callsign,
    location->latitude as latitude, location->longitude as longitude
from blue partition by bucket;
create stream blueBucket_w
as select
    bucketLocation(location->latitude, location->longitude, 3.0, 'w') as bucket,
    aircraft->transponder as transponder, aircraft->callsign as callsign,
    location->latitude as latitude, location->longitude as longitude
from blue partition by bucket;
insert into blueBucket select * from blueBucket_w;
- 94. Nearest Neighbor - Retrospective
• Kafka Streams
  • understand your keys
  • flatMap / insert into
  • maintain the state you need within the domain
  • understand intermediate topics
- 95. Retrospective
• How do these examples help me / apply to me?
• Do I really need to write a distributed application?
• Should I be programmatically accessing the Kafka Streams state stores?
• Kafka Streams or KSQL
  • how do I choose?
  • do I have to choose?
• Some settings to investigate
  • group.initial.rebalance.delay.ms
  • segment size on change-log topic
  • num.standby.replicas
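As a starting point for the settings listed above, here is a minimal sketch of where two of them might be set on the Kafka Streams side. The application id, bootstrap address, and values are illustrative assumptions, not recommendations from the talk. `group.initial.rebalance.delay.ms` is a broker setting, so it belongs in the broker's configuration and does not appear here.

```java
import java.util.Properties;

/** Hypothetical Kafka Streams client settings; values are illustrative only. */
class StreamsSettingsSketch {

    static Properties streamsProperties() {
        Properties props = new Properties();
        props.setProperty("application.id", "nearest-airport");    // assumed name
        props.setProperty("bootstrap.servers", "localhost:9092");  // assumed address
        // Keep one warm copy of each state store on another instance.
        props.setProperty("num.standby.replicas", "1");
        // The "topic." prefix applies a topic-level config to internal topics,
        // including change-log topics; 64 MiB segments here.
        props.setProperty("topic.segment.bytes", String.valueOf(64 * 1024 * 1024));
        return props;
    }

    public static void main(String[] args) {
        streamsProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```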
- 96. Credits
• Object Partners, Inc.
• Apache Kafka
• Confluent Platform
• OpenSky - https://opensky-network.org/
• D3
  • D3 v4
  • http://techslides.com/d3-map-starter-i
• Apache Avro
• KdTree Author Justin Wetherell
  • https://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/KdTree.java
  • Apache 2.0 License
• Distance Formula
  • https://stackoverflow.com/questions/837872/calculate-distance-in-meters-when-you-know-longitude-and-latitude-in-java
• Additional open source libraries referenced in repository