Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB
- 2. Bio
Hojjat Jafarpour
● Software Engineer @ Confluent
○ Creator of KSQL
● Previously at NEC Labs, Informatica, Quantcast and Tidemark
○ Worked on various big data projects
● Ph.D. in Computer Science from UC Irvine
○ Scalable stream processing and Publish/Subscribe systems
● @Hojjat
- 5. Real time customer support
[Diagram: Ecommerce Site → Event Stream → Hadoop → ScyllaDB → Customer Support]
- 6. Real time customer support
[Same diagram, with the path through Hadoop flagged "TOO SLOW!!!"]
- 7. Real time customer support
[Diagram repeated: Ecommerce Site → Event Stream → Hadoop → ScyllaDB → Customer Support]
- 10. KSQL: the Streaming SQL Engine for Apache Kafka® from Confluent
● Enables stream processing with zero coding required
● The simplest way to process streams of data in real time
● Powered by Kafka: scalable, distributed, battle-tested
● All you need is Kafka; no complex deployments of bespoke systems for stream processing
- 11. What is it for?
Streaming ETL
● Kafka is popular for data pipelines.
● KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
- 12. What is it for?
Anomaly Detection
● Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
- 13. What is it for?
Real Time Monitoring
● Log data monitoring, tracking and alerting
● Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
- 17. [Diagram: stream–table duality. The same sequence of records ("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1) is shown two ways: read as a STREAM, every record is kept (alice 1, charlie 1, alice 2, bob 1); read as a TABLE, only the latest value per key remains (alice 2, charlie 1, bob 1).]
- 18. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
- 19. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
● STREAM: data in motion
- 20. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
● STREAM: data in motion
● TABLE: collected state of a stream
○ One record per key (per window)
○ Current values (compacted topic)
○ Changelog
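The same topic can be registered under either interpretation. A minimal sketch, assuming a hypothetical JSON topic `user_scores` keyed by username:

```sql
-- STREAM view: every record in the topic is an independent event.
CREATE STREAM score_events (username varchar, score int)
WITH (kafka_topic='user_scores', value_format='JSON');

-- TABLE view: only the latest record per key is the current state.
CREATE TABLE score_state (username varchar, score int)
WITH (kafka_topic='user_scores', value_format='JSON', key='username');
```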
- 22. Features
● Aggregation
● Window
○ Tumbling
○ Hopping
○ Session
● Join
○ Stream-Stream
○ Stream-Table
○ Table-Table
● Nested data
○ STRUCT
● UDF/UDAF
● AVRO, JSON, CSV
○ More to come
● And many more...
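As one example from the list, a SESSION window groups a key's records into bursts of activity separated by gaps of inactivity. A sketch against the clickstream stream used earlier (the 60-second gap is an arbitrary choice):

```sql
-- One row per user session; a session closes after 60 seconds of inactivity.
SELECT userid, count(*) AS clicks
FROM clickstream
WINDOW SESSION (60 SECONDS)
GROUP BY userid;
```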
- 25. Define a STREAM
CREATE STREAM ratings (
rating_id bigint,
user_id int,
stars int,
route_id int,
rating_time bigint,
channel varchar,
message varchar
) WITH (
value_format='JSON',
kafka_topic='ratings');
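Once the stream is registered, its definition can be checked from the KSQL CLI; SHOW STREAMS and DESCRIBE are standard KSQL statements:

```sql
SHOW STREAMS;       -- list all registered streams
DESCRIBE ratings;   -- show the fields of the new ratings stream
```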
- 26. SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
- 27. SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%'
- 28. SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%'
LIMIT 10;
- 29. SELECTing from the Stream
And set this to run as a continuous transformation,
with results being saved into a new topic
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%';
- 30. SELECTing from the Stream
And set this to run as a continuous transformation, with results being saved into a new topic
CREATE STREAM poor_ratings AS
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%';
- 32. Define reference tables
CREATE TABLE users (
uid int,
name varchar,
elite varchar)
WITH (
key='uid',
value_format='JSON',
kafka_topic='mysql-users');
- 33. Joins for Enrichment
Enrich the 'poor_ratings' stream with data about each user, and derive a stream of low-quality ratings posted only by our Platinum Elite users
- 34. Joins for Enrichment
Enrich the 'poor_ratings' stream with data about each user, and derive a stream of low-quality ratings posted only by our Platinum Elite users
CREATE STREAM vip_poor_ratings AS
SELECT uid, name, elite,
stars, route_id, rating_time,
message
FROM poor_ratings r
LEFT JOIN users u ON r.user_id = u.uid
WHERE u.elite = 'P';
- 35. Aggregates and Windowing
● COUNT, SUM, MIN, MAX
● Windowing - Not strictly ANSI SQL
● Three window types supported:
○ TUMBLING
○ HOPPING (aka 'sliding')
○ SESSION
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name;
- 36. Continuous Aggregates
Save the results of our aggregation to a TABLE
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name
HAVING count(*) > 2;
- 37. Continuous Aggregates
Save the results of our aggregation to a TABLE
CREATE TABLE sad_vips AS
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name
HAVING count(*) > 2;
- 38. Where to go from here?
Time to get involved!
● Download Confluent Platform
● Step through the QuickStart
● Play with the examples and demos
http://confluent.io/ksql
https://github.com/confluentinc/ksql
https://slackpass.io/confluentcommunity #ksql
Editor's Notes
- A quick intro to KSQL in case they missed Niel’s talk.
- STREAM and TABLE are both first-class citizens in KSQL
Both of these are interpretations of topic content. Topics are what actually exists in Kafka; streams and tables are KSQL abstractions over them.
- STREAM - data in motion. An unbounded sequence of facts (aka events, messages).
TABLE - collected state of a stream. An evolving collection of facts.
One record per key (per window)
Changelog