Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB
- 2. Bio
Hojjat Jafarpour
● Software Engineer @ Confluent
○ Creator of KSQL
● Previously at NEC Labs, Informatica, Quantcast and Tidemark
○ Worked on various big data projects
● Ph.D. in Computer Science from UC Irvine
○ Scalable stream processing and Publish/Subscribe systems
● @Hojjat
- 5. Real time customer support
[Diagram: Ecommerce Site → Event Stream → Hadoop → ScyllaDB → Customer Support]
- 6. Real time customer support
[Same diagram, with the path through Hadoop flagged "TOO SLOW!!!"]
- 7. Real time customer support
[Diagram repeated: Ecommerce Site → Event Stream → Hadoop → ScyllaDB → Customer Support]
- 10. KSQL: the Streaming SQL Engine for Apache Kafka® from Confluent
● Enables stream processing with zero coding required
● The simplest way to process streams of data in real time
● Powered by Kafka: scalable, distributed, battle-tested
● All you need is Kafka; no complex deployments of bespoke systems for stream processing
- 11. What is it for?
Streaming ETL
● Kafka is popular for data pipelines.
● KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
- 12. What is it for?
Anomaly Detection
● Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
- 13. What is it for?
Real Time Monitoring
● Log data monitoring, tracking and alerting
● Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
- 17. [Diagram: stream–table duality. The same sequence of records ("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1) is shown two ways: read as a STREAM, every record is kept (alice 1, charlie 1, alice 2, bob 1); read as a TABLE, only the latest value per key remains (alice 2, charlie 1, bob 1).]
- 18. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
- 19. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
● STREAM: data in motion
- 20. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
● STREAM: data in motion
● TABLE: collected state of a stream
○ One record per key (per window)
○ Current values (compacted topic)
○ Changelog
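The same topic can be registered under either interpretation. A minimal sketch, assuming a hypothetical JSON topic `user_scores` keyed by username:

```sql
-- STREAM view: every record in the topic is an independent event.
CREATE STREAM score_events (username varchar, score int)
WITH (kafka_topic='user_scores', value_format='JSON');

-- TABLE view: only the latest record per key is the current state.
CREATE TABLE score_state (username varchar, score int)
WITH (kafka_topic='user_scores', value_format='JSON', key='username');
```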
- 22. Features
● Aggregation
● Window
○ Tumbling
○ Hopping
○ Session
● Join
○ Stream-Stream
○ Stream-Table
○ Table-Table
● Nested data
○ STRUCT
● UDF/UDAF
● AVRO, JSON, CSV
○ More to come
● And many more...
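As one example from the list, a SESSION window groups a key's records into bursts of activity separated by gaps of inactivity. A sketch against the clickstream stream used earlier (the 60-second gap is an arbitrary choice):

```sql
-- One row per user session; a session closes after 60 seconds of inactivity.
SELECT userid, count(*) AS clicks
FROM clickstream
WINDOW SESSION (60 SECONDS)
GROUP BY userid;
```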
- 25. Define a STREAM
CREATE STREAM ratings (
rating_id bigint,
user_id int,
stars int,
route_id int,
rating_time bigint,
channel varchar,
message varchar
) WITH (
value_format='JSON',
kafka_topic='ratings');
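Once the stream is registered, its definition can be checked from the KSQL CLI; SHOW STREAMS and DESCRIBE are standard KSQL statements:

```sql
SHOW STREAMS;       -- list all registered streams
DESCRIBE ratings;   -- show the fields of the new ratings stream
```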
- 26. SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
- 27. SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%'
- 28. SELECTing from the Stream
Let’s test our new stream definition by finding all the
low-scoring ratings from our iPhone app
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%'
LIMIT 10;
- 29. SELECTing from the Stream
And set this to run as a continuous transformation,
with results being saved into a new topic
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%';
- 30. SELECTing from the Stream
And set this to run as a continuous transformation, with results being saved into a new topic
CREATE STREAM poor_ratings AS
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%';
- 32. Define reference tables
CREATE TABLE users (
uid int,
name varchar,
elite varchar)
WITH (
key='uid',
value_format='JSON',
kafka_topic='mysql-users');
- 33. Joins for Enrichment
Enrich the 'poor_ratings' stream with data about each user, and derive a stream of low-quality ratings posted only by our Platinum Elite users
- 34. Joins for Enrichment
Enrich the 'poor_ratings' stream with data about each user, and derive a stream of low-quality ratings posted only by our Platinum Elite users
CREATE STREAM vip_poor_ratings AS
SELECT uid, name, elite,
stars, route_id, rating_time,
message
FROM poor_ratings r
LEFT JOIN users u ON r.user_id = u.uid
WHERE u.elite = 'P';
- 35. Aggregates and Windowing
● COUNT, SUM, MIN, MAX
● Windowing - Not strictly ANSI SQL
● Three window types supported:
○ TUMBLING
○ HOPPING (aka 'sliding')
○ SESSION
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name;
- 36. Continuous Aggregates
Save the results of our aggregation to a TABLE
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name
HAVING count(*) > 2;
- 37. Continuous Aggregates
Save the results of our aggregation to a TABLE
CREATE TABLE sad_vips AS
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name
HAVING count(*) > 2;
- 38. Where to go from here?
Time to get involved!
● Download Confluent Platform
● Step through the QuickStart
● Play with the examples and demos
http://confluent.io/ksql
https://github.com/confluentinc/ksql
https://slackpass.io/confluentcommunity #ksql
Editor's Notes
- A quick intro to KSQL in case they missed Niel’s talk.
- STREAM and TABLE are both first-class citizens in KSQL
Both of these are interpretations of topic content. Topics are what actually exists in Kafka; streams and tables are KSQL abstractions over them.
- STREAM - data in motion. An unbounded sequence of facts (aka events, messages).
TABLE - collected state of a stream. An evolving collection of facts.
One record per key (per window)
Changelog