Scalable Stream
Processing with KSQL,
Kafka and ScyllaDB
Hojjat Jafarpour
Software Engineer
hojjat@confluent.io
@Hojjat
Bio
Hojjat Jafarpour
● Software Engineer @ Confluent
○ Creator of KSQL
● Previously at NEC Labs, Informatica, Quantcast and
Tidemark
○ Worked on various big data projects
● Ph.D. in Computer Science from UC Irvine
○ Scalable stream processing and Publish/Subscribe systems
● @Hojjat
Outline
● Stream Processing Use Case
● Introduction to KSQL
● Concepts and Features
● Where to go
Real time customer support
[Build slides 4–8, diagram: an ecommerce site feeds an event stream into the pipeline. The original path, Event Stream → Hadoop → ScyllaDB → Customer Support, is labeled "TOO SLOW!!!"; the revised pipeline drops Hadoop and streams events directly into ScyllaDB for customer support.]
Outline
● Stream Processing Use Case
● Introduction to KSQL
● Concepts and Features
● Where to go
KSQL: the Streaming SQL Engine for Apache Kafka® from Confluent
● Enables stream processing with zero coding required
● The simplest way to process streams of data in real
time
● Powered by Kafka: scalable, distributed, battle-tested
● All you need is Kafka; no complex deployments of
bespoke systems for stream processing
What is it for?
Streaming ETL
● Kafka is popular for data pipelines.
● KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
What is it for?
Anomaly Detection
● Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
What is it for?
Real Time Monitoring
● Log data monitoring, tracking and alerting
● Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
Outline
● Stream Processing Use Case
● Introduction to KSQL
● Concepts and Features
● Where to go
Do you think that’s a table you are querying?
Stream/Table Duality
[Animated slides 16–17, diagram: TABLE → STREAM → TABLE. Replaying the stream of updates ("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1) and keeping the latest value per key rebuilds the table alice → 2, charlie → 1, bob → 1; conversely, every update to the table appends a record to the stream.]
Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
● STREAM: data in motion
● TABLE: collected state of a stream
○ One record per key (per window)
○ Current values (compacted topic)
○ Changelog
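The table-as-collected-state idea above can be sketched in a few lines of plain Python (illustrative only, not KSQL): replaying the changelog stream and keeping the last value per key materializes the table.

```python
# Minimal sketch of stream/table duality (plain Python, no Kafka):
# a TABLE is what you get by replaying a changelog STREAM and
# keeping only the latest value per key.

def materialize(changelog):
    """Fold a changelog stream into a table: one current value per key."""
    table = {}
    for key, value in changelog:
        table[key] = value  # later records for a key overwrite earlier ones
    return table

stream = [("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1)]
print(materialize(stream))  # {'alice': 2, 'charlie': 1, 'bob': 1}
```

This is exactly what a compacted Kafka topic retains: the most recent record per key.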
Features
● Aggregation
● Window
○ Tumbling
○ Hopping
○ Session
● Join
○ Stream-Stream
○ Stream-Table
○ Table-Table
● Nested data
○ STRUCT
● UDF/UDAF
● AVRO, JSON, CSV
○ More to come
● And many more...
Outline
● Stream Processing Use Case
● Introduction to KSQL
● Concepts and Features
● Where to go
Real time customer support
Ecommerce Site Event Stream ScyllaDB Customer Support
Define a STREAM
CREATE STREAM ratings (
rating_id bigint,
user_id int,
stars int,
route_id int,
rating_time bigint,
channel varchar,
message varchar
) WITH (
value_format='JSON',
kafka_topic='ratings');
SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%'
LIMIT 10;
SELECTing from the Stream
And set this to run as a continuous transformation,
with results being saved into a new topic
CREATE STREAM poor_ratings AS
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%';
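As a rough analogue of this continuous query (plain Python with illustrative record shapes, no Kafka involved), think of a generator that filters an unbounded stream of rating records and emits matches into a new stream:

```python
# Sketch of CREATE STREAM poor_ratings AS SELECT ... as a generator
# pipeline (illustrative only; field names mirror the slides).

def poor_ratings(ratings):
    """Yield low-scoring iOS ratings from an (unbounded) rating stream."""
    for r in ratings:
        if r["stars"] <= 2 and "ios" in r["channel"].lower():
            yield r

events = [
    {"stars": 1, "channel": "iOS-app", "message": "late"},
    {"stars": 5, "channel": "web", "message": "great"},
    {"stars": 2, "channel": "ios", "message": "meh"},
]
print(list(poor_ratings(events)))  # the two iOS records with stars <= 2
```

Unlike this toy list, the real query never terminates: it keeps consuming the source topic and writing results to the `poor_ratings` topic.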
Bring in Tables!
http://www.projectluangwa.org
Define reference tables
CREATE TABLE users (
uid int,
name varchar,
elite varchar)
WITH (
value_format='JSON',
kafka_topic='mysql-users',
key='uid');
Joins for Enrichment
Enrich the 'poor_ratings' stream with data about each user, and derive a
stream of low quality ratings posted only by our Platinum Elite users
CREATE STREAM vip_poor_ratings AS
SELECT uid, name, elite,
stars, route_id, rating_time,
message
FROM poor_ratings r
LEFT JOIN users u ON r.user_id = u.uid
WHERE u.elite = 'P';
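Conceptually (a plain-Python sketch with hypothetical field names mirroring the slides, not the actual KSQL runtime), a stream-table join looks up each stream record's key in the table's current state and emits the enriched record when the predicate matches:

```python
# Sketch of a stream-table join: per stream record, look up the current
# row for its key in the table, enrich, and filter (illustrative only).

users = {  # TABLE: current state keyed by uid
    1: {"name": "Ann", "elite": "P"},
    2: {"name": "Bob", "elite": "G"},
}

def vip_poor_ratings(poor_ratings, users_table):
    for r in poor_ratings:
        u = users_table.get(r["user_id"])  # LEFT JOIN lookup by key
        if u and u["elite"] == "P":        # WHERE u.elite = 'P'
            yield {**r, **u}

stream = [{"user_id": 1, "stars": 1}, {"user_id": 2, "stars": 2}]
print(list(vip_poor_ratings(stream, users)))  # only Ann's rating survives
```

In KSQL the table side is continuously updated from its topic, so later stream records see the latest user row for their key.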
Aggregates and Windowing
● COUNT, SUM, MIN, MAX
● Windowing - Not strictly ANSI SQL
● Three window types supported:
○ TUMBLING
○ HOPPING (aka ‘sliding’)
○ SESSION
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name;
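A tumbling window assigns every event to exactly one non-overlapping window based on its timestamp. A minimal plain-Python sketch of the idea (not KSQL's implementation):

```python
# Sketch of a tumbling-window count: each event lands in exactly one
# window, whose start is the timestamp rounded down to the window size.

from collections import Counter

def tumbling_counts(events, size_ms):
    """Count events per (key, window_start) for tumbling windows."""
    counts = Counter()
    for key, ts in events:
        window_start = (ts // size_ms) * size_ms  # windows never overlap
        counts[(key, window_start)] += 1
    return counts

events = [("alice", 1_000), ("alice", 4_000), ("alice", 6_000)]
print(tumbling_counts(events, 5_000))
# the first two events share window 0; the third falls in window 5000
```

Hopping windows differ only in that a window starts every `advance` interval, so one event can land in several overlapping windows.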
Continuous Aggregates
Save the results of our aggregation to a TABLE
CREATE TABLE sad_vips AS
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name
HAVING count(*) > 2;
Where to go from here?
Time to get involved!
● Download Confluent Platform
● Step through the QuickStart
● Play with the examples and demos
http://confluent.io/ksql
https://github.com/confluentinc/ksql
https://slackpass.io/confluentcommunity #ksql

Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB


Editor's Notes

  1. A quick intro to KSQL in case they missed Niel’s talk.
  2. STREAM and TABLE are both first-class citizens in KSQL. Both are interpretations of topic content: topics are what actually exists in Kafka; streams and tables are KSQL abstractions over them.
  3. STREAM: data in motion, an unbounded sequence of facts (aka events, messages). TABLE: collected state of a stream, an evolving collection of facts with one record per key (per window), maintained as a changelog.