Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Hojjat Jafarpour, Software Engineer at Confluent
Maheedhar Gunturu, Solutions Architect at ScyllaDB
Presenters
Hojjat Jafarpour, Software Engineer at Confluent
Hojjat is a software engineer and the creator of KSQL, the Streaming SQL engine for Apache
Kafka, at Confluent. Before joining Confluent he worked at NEC Labs, Informatica, Quantcast
and Tidemark on various big data management projects. He has a Ph.D. in computer
science from UC Irvine, where he worked on scalable stream processing and
publish/subscribe systems.
Maheedhar Gunturu, Solutions Architect at ScyllaDB
Maheedhar has held senior roles in both engineering and sales organizations. He has over a decade
of experience designing & developing server-side applications in the cloud and working on big
data and ETL frameworks in companies such as Samsung, MapR, Apple, VoltDB, Zscaler and
Qualcomm.
Agenda
+ Overview of ScyllaDB
+ Apache Kafka and The Confluent Platform
+ Example Use Cases
+ Q&A
About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
and Amazon DynamoDB
+ 10X the performance & low tail latency
+ Open Source, Enterprise and Cloud options
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA, USA; Herzelia, Israel;
Warsaw, Poland
Scylla Design Principles
+ C++ instead of Java
+ Shard per Core
+ All Things Async
+ Unified Cache
+ I/O Scheduler
+ Self-Optimizing
Seastar Framework
Compatibility
+ CQL native protocol
+ JMX management protocol
+ Management command line / REST
+ SSTable file format
+ Configuration file format
+ CQL language
Change Data Capture (CDC) from Scylla
+ Helps with
  + Database mirroring/replication/state propagation
  + Direct data into a Kafka stream
+ Configurable subscription options to the change log (per table; CQL sketch below)
  + Post-image (changed state)
  + Delta (changes per column)
  + Pre-image (previous state)
+ Scylla CDC-Kafka source connector coming out soon!
Ref: https://www.scylladb.com/tech-talk/change-data-capture-in-scylla/
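To make the per-table configuration concrete, here is a minimal CQL sketch; the keyspace, table, and column names are illustrative, and the exact option set may vary by Scylla version:

CREATE TABLE ks.device_events (
    dev_id uuid,
    ts timestamp,
    state text,
    PRIMARY KEY (dev_id, ts)
) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': true};

-- CDC can also be switched on for an existing table:
ALTER TABLE ks.device_events WITH cdc = {'enabled': true};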
Apache Kafka and Confluent Platform
Pre-Streaming
New World: Streaming First
Apache Kafka
A distributed commit log: publish and subscribe to streams of records. Highly scalable, high throughput. Supports transactions. Persisted data. Writes are append only; reads are a single seek & scan.
Apache Kafka
Kafka Connect API
Reliable and scalable integration of Kafka with other systems –
no coding required.
Apache Kafka
Kafka Streams API
Write standard Java applications & microservices to
process your data in real-time
(Diagram: Orders, Customers table, Kafka Streams API)
Stream Processing by Analogy
Kafka Cluster
Connect API Stream Processing Connect API
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
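A rough ksqlDB equivalent of that pipeline, sketched over hypothetical streams named in_lines and out_lines (the stream and column names are illustrative, not from the deck):

CREATE STREAM in_lines (line VARCHAR)
  WITH (KAFKA_TOPIC='in_lines', VALUE_FORMAT='JSON');

-- grep "ksql" | tr a-z A-Z, expressed as a persistent query
CREATE STREAM out_lines AS
  SELECT UCASE(line) AS line
  FROM in_lines
  WHERE line LIKE '%ksql%'
  EMIT CHANGES;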
Stream Processing in Kafka: a spectrum from flexibility to simplicity
+ Consumer / Producer APIs: subscribe(), poll(), send(), flush()
+ Kafka Streams: map(), filter(), aggregate(), join()
+ ksqlDB: SELECT … FROM … JOIN … GROUP BY …
ksqlDB
+ The event streaming database purpose-built for stream
processing applications
+ Enables stream processing with zero coding required
+ The simplest way to process streams of data in real
time
+ Powered by Kafka: scalable, distributed, battle-tested
+ All you need is Kafka–no complex deployments of
bespoke systems for stream processing
ksqlDB
+ Streaming ETL
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
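For context, a minimal sketch of the source stream and lookup table this query assumes; the topic names are illustrative and the key-declaration syntax varies slightly across ksqlDB versions:

CREATE STREAM clickstream (userid VARCHAR, page VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC='clickstream', VALUE_FORMAT='AVRO');

CREATE TABLE users (user_id VARCHAR PRIMARY KEY, level VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='AVRO');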
ksqlDB
+ Real-Time Monitoring
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
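As above, the monitoring stream would first be declared over an existing topic, and the resulting table can then be watched continuously; a hypothetical sketch (topic and column names are illustrative):

CREATE STREAM monitoring_stream (error_code INT, type VARCHAR)
  WITH (KAFKA_TOPIC='monitoring', VALUE_FORMAT='JSON');

-- Push query: watch the per-minute error counts update as new events arrive
SELECT * FROM error_counts EMIT CHANGES;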
ksqlDB
+ Features (combined example below)
  + Aggregation
  + Window
    + Tumbling
    + Hopping
    + Session
  + Join
    + Stream-Stream
    + Stream-Table
    + Table-Table
  + Nested data
    + STRUCT
  + UDF/UDAF/UDTF
  + AVRO, JSON, CSV
    + Protobuf to come soon
  + And many more...
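To illustrate a few of these features together, a hypothetical hopping-window aggregation over nested (STRUCT) data; the stream, column names, and window sizes are illustrative:

CREATE TABLE clicks_per_region AS
  SELECT payload->region AS region, COUNT(*) AS clicks
  FROM pageviews
  WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE)
  GROUP BY payload->region
  EMIT CHANGES;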
Example Use Cases
Using Syslog to Detect SSH Attacks
(Diagram: syslog data flows into Kafka via a source connector, is processed with KSQL, and is written out through sink connectors.)
ksql> CREATE SINK CONNECTOR SINK_SCYLLA_SYSLOG WITH (
'connector.class' = 'io.connect.scylladb.ScyllaDbSinkConnector',
'connection.url' = 'localhost:9092',
'type.name' = '',
'behavior.on.malformed.documents' = 'warn',
'errors.tolerance' = 'all',
'errors.log.enable' = 'true',
'errors.log.include.messages' = 'true',
'topics' = 'syslog',
'key.ignore' = 'true',
'schema.ignore' = 'true',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter'
);
ksql> CREATE STREAM SYSLOG WITH (KAFKA_TOPIC='syslog', VALUE_FORMAT='AVRO');
ksql> SELECT TIMESTAMPTOSTRING(S.DATE, 'yyyy-MM-dd HH:mm:ss') AS SYSLOG_TS, S.HOST,
F.DESCRIPTION AS FACILITY, S.MESSAGE, S.REMOTEADDRESS FROM SYSLOG S
LEFT OUTER JOIN FACILITY F ON S.FACILITY=F.ROWKEY WHERE S.HOST='demo' EMIT CHANGES;
ksql> CREATE STREAM SYSLOG_INVALID_USERS AS SELECT * FROM SYSLOG WHERE MESSAGE LIKE
'Invalid user%';
ksql> CREATE STREAM SSH_ATTACKS AS SELECT TIMESTAMPTOSTRING(DATE, 'yyyy-MM-dd HH:mm:ss')
AS SYSLOG_TS, HOST, SPLIT(REPLACE(MESSAGE,'Invalid user ',''),' from ')[0] AS ATTACK_USER,
SPLIT(REPLACE(MESSAGE,'Invalid user ',''),' from ')[1] AS ATTACK_IP FROM
SYSLOG_INVALID_USERS EMIT CHANGES;
ksql> CREATE TABLE SSH_ATTACKS_BY_USER AS SELECT ATTACK_USER, COUNT(*) AS ATTEMPTS FROM
SSH_ATTACKS GROUP BY ATTACK_USER;
ksql> SELECT ATTACK_USER, ATTEMPTS FROM SSH_ATTACKS_BY_USER EMIT CHANGES; (push)
ksql> SELECT ATTACK_USER, ATTEMPTS FROM SSH_ATTACKS_BY_USER WHERE ROWKEY='oracle'; (pull)
ksql> CREATE SOURCE CONNECTOR SOURCE_SYSLOG_UDP_01 WITH (
'tasks.max' = '1',
'connector.class' = 'io.confluent.connect.syslog.SyslogSourceConnector',
'topic' = 'syslog',
'syslog.port' = '42514',
'syslog.listener' = 'UDP',
'syslog.reverse.dns.remote.ip' = 'true',
'confluent.license' = '',
'confluent.topic.bootstrap.servers' = 'kafka:29092',
'confluent.topic.replication.factor' = '1'
);
ksql> CREATE SINK CONNECTOR SINK_ELASTIC_SYSLOG WITH (
'connector.class' =
'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
'connection.url' = 'http://elasticsearch:9200',
'type.name' = '',
'behavior.on.malformed.documents' = 'warn',
'errors.tolerance' = 'all',
'errors.log.enable' = 'true',
'errors.log.include.messages' = 'true',
'topics' = 'SYSLOG_INVALID_USERS',
'key.ignore' = 'true',
'schema.ignore' = 'true',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter'
);
IoT - Smart Home
(Diagram: smart-home hub and devices stream real-time events through an MQTT proxy into Kafka; device state, device health, hub info, and lookup info feed device-data management, apps, and services, with CDC from the database.)
ksql> CREATE STREAM device_stream_mode WITH (KAFKA_TOPIC='syslog', VALUE_FORMAT='AVRO');
ksql> CREATE STREAM device_change_mode AS
        SELECT D.dev_id, D.dev_type, H.hub_mode AS device_mode
        FROM hub_mode H
        LEFT OUTER JOIN device_data D ON H.hub_id = D.hub_id
        EMIT CHANGES;
ksql> CREATE STREAM device_stream_mode AS
        SELECT DS.dev_id, DS.dev_type, DS.mode, F.state AS dev_state
        FROM device_change_mode DS
        LEFT OUTER JOIN FACILITY F ON DS.dev_type = F.dev_type
        WHERE DS.mode = <DEVICE_MODE>
        EMIT CHANGES;
### CONFIGURE THE MQTT SINK
ksql> INSERT INTO hub_mode SELECT * FROM /mqttTopicA/+/sensors [WITHCONVERTER=`myclass.AvroConverter`]
ksql> CREATE STREAM hub_mode WITH (KAFKA_TOPIC='hub_mode', VALUE_FORMAT='AVRO');
### Create the necessary sink and CDC source connector to Scylla
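The join above assumes a lookup table of target device states keyed by device type (the slide reuses the name FACILITY from the syslog example). A hypothetical definition, with the topic name illustrative and the key syntax depending on the ksqlDB version:

ksql> CREATE TABLE FACILITY (dev_type VARCHAR, state VARCHAR)
        WITH (KAFKA_TOPIC='device_target_state', VALUE_FORMAT='AVRO', KEY='dev_type');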
Customer Satisfaction - CES Score
(Diagram: customer interaction and customer log events flow via CDC into Kafka and feed segmentation, churn, customer loyalty, and support analytics, e.g. number of attempts per issue and SLA violations, building toward a Customer 360 view.)
ksql> CREATE STREAM cust_interactions (incident_Id VARCHAR, timestamp BIGINT)
        WITH (VALUE_FORMAT='JSON', PARTITIONS=1, KAFKA_TOPIC='cust_interaction');
ksql> CREATE TABLE cust_log_aggregate AS
        SELECT ROWKEY AS customer_id, COUNT(*) AS touch_points
        FROM cust_interactions
        GROUP BY customer_id;
ksql> CREATE TABLE cust_log_by_issue AS
        SELECT ROWKEY AS incident_Id, customer_id,
               COUNT_DISTINCT(touch_points) AS UNIQUE_TOUCH_POINTS
        FROM cust_interactions
        GROUP BY ROWKEY
        EMIT CHANGES;
ksql> SELECT C.incident_id,
             (C.incident_id_first_touch_point_TS - CL.incident_id_last_touchpoint_TS)/1000/60/60 AS current_SLA_hours
        FROM customer_log C
        INNER JOIN call_log CL ON C.incident_id = CL.incident_id
        WHERE current_SLA_hours > 24;
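As in the SSH-attack example, the materialized aggregate can also be queried on demand; a hypothetical pull query (the customer ID is illustrative):

ksql> SELECT customer_id, touch_points
        FROM cust_log_aggregate
        WHERE ROWKEY = 'customer-123';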
Security - Endpoint Security
(Diagram: streams of real-time events from syslog, netflow, DNS, and firewall logs.)
# Join the various streams of data using the source connector from Cassandra.
# Build and deploy the custom UDF.
ksql> CREATE STREAM entity_risk_score AS
        SELECT source_IP, mac_ID,
               derived_risk_score(priority_errors, DNS_burstiness, reputation, firewall_intrusion_attempts) AS risk_score
        FROM endpoint_profile
        WHERE derived_risk_score(priority_errors, DNS_burstiness, reputation, firewall_intrusion_attempts) > <THRESHOLD>;
Ref: https://www.confluent.io/blog/build-udf-udaf-ksql-5-0/
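One way to operationalize the threshold, sketched with illustrative names and window sizes: count how often each endpoint crosses it per hour, so the sysadmin's monitoring tool can alert on repeat offenders.

ksql> CREATE TABLE high_risk_events_per_endpoint AS
        SELECT source_IP, COUNT(*) AS high_risk_events
        FROM entity_risk_score
        WINDOW TUMBLING (SIZE 1 HOUR)
        GROUP BY source_IP
        EMIT CHANGES;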
Takeaways
+ ScyllaDB now Supports Change Data Capture (CDC)
+ ksqlDB provides a SQL interface for Streaming Applications
+ ksqlDB is easily extensible with Custom UDFs
+ Scylla has a new SINK connector (CDC source connector is coming soon!)
Resources
+ Scylla Sink Connector
+ Source Connector (Cassandra)
+ Scylla CDC Presentation
+ Debezium
+ Scylla’s 7 Design Principles
+ Scylla Benchmarks
Confluent Resources
+ Useful Links:
  + Stream Processing Book Bundle
  + Kafka tutorials
  + ksqlDB
  + Confluent - Scylla Partnership Overview
+ Kafka Summits 2020
  + Kafka Summit London: April 27 - 28
  + Kafka Summit Austin: August 24 - 25
Q&A
Stay in touch:
maheedhar@scylladb.com / @vanguard_space
hojjat@confluent.io / @Hojjat
United States
545 Faber Place
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank you


Editor's Notes

  1. Here is a brief agenda for today's talk. First, we will begin with a high-level overview of ScyllaDB and the open source framework, Seastar, on which Scylla is built. We will also briefly touch on some new features and how customers can take advantage of them. Next, we will dive into Apache Kafka and the various Confluent Platform components such as Kafka Connect and ksqlDB. We will then explore some joint use cases based on customers and prospects who use Kafka and various NoSQL solutions across domains. Towards the end we will take questions; please post them in the Q&A window as they come up, so we can prioritize and answer them appropriately.
  2. Apache Kafka and the Confluent Platform
  3. A little bit about ScyllaDB and our product Scylla, the real-time big data database. Scylla is a highly performant, low-latency, scalable, and autonomous NoSQL database that supports both the Apache Cassandra and DynamoDB APIs, so the same client drivers and application code written on top of Cassandra or DynamoDB are compatible with Scylla without any code changes. Scylla has three offerings: Scylla Open Source, Scylla Enterprise, and Scylla Cloud, which is a fully managed service. We also support running Scylla Cloud in the customer's own cloud account. We were founded by the creators of the KVM hypervisor, which, as some of you might know, is an open source virtualization technology built into Linux. ScyllaDB, the company, has about 100 employees located in more than 15 countries across the world. We are expanding and currently hiring C++ and Go developers and DevOps engineers.
  4. Now let's spend some time understanding the basic design principles that differentiate Scylla from other NoSQL databases. Scylla is written in C++, which results in faster performance, lower latencies, and efficient use of the hardware. Scylla has a shard-per-core isolated architecture where each Scylla shard is associated with a CPU core, and the necessary RAM is allocated per shard as well; this ensures we are never starved of resources. Everything on the platform is async, so there are no locks. Scylla also runs on XFS, so we get the performance benefits of an asynchronous file system as well. Scylla comes with a cache that is highly optimized and auto-tuned to your workload, so there is no need to worry about key caching, row caching, Linux caching, on-heap, off-heap; everything is auto-tuned. Scylla is built on top of an open source framework called Seastar, which orchestrates both task and disk I/O scheduling; we will go into a bit more detail on the next slide. Scylla is an autonomous database, which means your administrators benefit from minimal tuning.
  5. As I mentioned before, Scylla is built on an open source framework called Seastar. The framework understands that there are essentially five different kinds of back-end operations: commitlogs, memtables, compactions, queries, and repairs. These operations are scheduled through the Seastar scheduler. Effectively, the Seastar scheduler moves scheduling out of kernel space and into user space, where every request is parallelized and given its own priority. Also, when you install Scylla, it tunes itself according to the disk, RAM, CPU, and network made available to the server. This in effect makes the database autonomous.
  6. ScyllaDB's users benefit from Scylla's full compatibility with Cassandra and the Cassandra ecosystem. All the ecosystem components listed here, and many more, just work out of the box with Scylla. In addition, there are many client drivers available in a variety of programming languages that can be reused as well. Scylla has created optimized drivers for developers using Java and Go; these drivers can better take advantage of the internals of Scylla Enterprise and Scylla Cloud.
  7. We recently released a new feature called CDC, or Change Data Capture. Change Data Capture logging captures and keeps track of data that has changed. It can also help with mirroring your database, replication, and state propagation across various microservices. It is configured per table, with limits on the amount of disk space consumed for storing the CDC logs. We co-locate CDC log partitions with the base table partitions, which greatly improves performance. There are a variety of configurable subscription options that let you shape the change-log stream: you can choose the post-image (the changed state), just the delta (changes made to the columns), or the pre-image (the previous state). Feel free to refer to the link below for more detailed information. We are planning to release a Scylla CDC-Kafka source connector, which makes it easier for developers to build globally synchronized services; it also significantly reduces the amount of data to be moved and provides a reliable way to stream data between systems.
  8. Now let's dive into Kafka and the Confluent Platform. Over to you, Hojjat.
  9. Kafka is highly available, resilient to node failures, and supports automatic recovery. This makes Apache Kafka ideal for communication and integration between components of large-scale, real-world data systems.
  14. Based on a number of customers and prospects who use Kafka and various NoSQL systems, I would like to walk you through a few use cases that are relevant for ksqlDB and CDC. The examples we are going to talk about are detecting SSH attacks, IoT smart home, Customer 360, and network and endpoint security.
  15. Here is a simple use case where we have access to certain logs, say syslogs, and we need to detect whether there were any brute-force SSH attacks. You want to see the results in Scylla and point your monitoring to it, and also send them to Elasticsearch for some ad hoc analysis on top of this data. <CLICK> Syslog is built into Linux, and it is also common for networking and IoT devices to stream log messages, along with metadata such as the source host, severity, message payload, and tags, to either a local logfile or a centralized syslog server. In this case, let's stream it into Kafka using the Connect framework, <CLICK> which can be done with a simple one-liner; ksqlDB now includes the ability to define connectors from within it, which makes setting things up a lot easier. Next, let's set up the Elasticsearch sink, which you can do with another one-liner, and then the Scylla sink <CLICK> (currently under development; it should be available soon). Now that we have established all the necessary connections, let's write a few simple ksql queries. To begin, we create a stream SYSLOG that reads from the topic syslog; you can browse through the data, which already has the necessary schema. Then we join it with the FACILITY table, which contains the information about the various message levels (say, 0 is a kernel message, 1 is a user-level message, and so on). After you have joined the data, filter out the messages that contain "Invalid user", as this indicates a failed SSH attempt. Then create a stream of such messages; you can do a bit of reorganization of the data using the SPLIT and REPLACE primitives. If you want to persist the query, you can simply create a table based off the stream; this is the stream-table duality Hojjat explained before. There are two ways to interact with this data: either via a push query or a pull query.
  16. Smart-home and IoT ecosystems are becoming very popular; for example, even in my home I have about 15-20 connected devices active at any point in time. There are typically three parts to this connected ecosystem: the smart hub to which all the devices are connected, the mobile app which can control all the devices, and the cloud which keeps track of the state of the hub, the various devices connected to it directly or via the partner network, the various services enabled, etc. The mobile app typically connects to and controls all the devices, but the orchestration of the various automations is initiated via the cloud. Let's go through an example. <CLICK> If you were to put your smart home into "away" mode, then all your lights need to be turned off, doors need to be locked, the air-conditioning needs to go into "eco" mode, cameras need to detect motion, and so on. The state of all the devices connected to the hub is typically monitored in the cloud. The devices attached to the hub are automatically moved to the corresponding state, and all of this is communicated from the cloud to the hub. This communication can happen via multiple protocols: CoAP, MQTT, or TCP. For this example, let's assume it is communicated via MQTT. Here is some sample code which can orchestrate all of this via KSQL. <CLICK> You could use the MQTT proxy available as part of Confluent Enterprise, and once the data comes into Kafka it can be automatically transformed into an Avro schema. You should set up the necessary sink and CDC connectors to Scylla or Scylla Cloud. Now the data about the change of the hub's mode gets streamed into Scylla; this change is detected by CDC on the Scylla side, which propagates the state to a few custom topics. KSQL picks up this change and runs the stream-table join operations listed here, which combine the data of all the devices connected to the hub with a lookup table containing the final state those devices need to be in. That state can then simply be communicated via the MQTT proxy.
  17. Let's dive into another use case. Say you are pushing out a new release of your product and you would like to track various customer issues and assess your support. There are a number of ways to do this, but calculating CES, a customer engagement score, is one easy way, and eventually you want all of this information populated into your Customer 360 solution. You could start by logging all your customer interaction data into Kafka from your application, or you could persist it directly into a MySQL instance; if you enable CDC on it, the necessary changes from the customer log tables are streamed out into Kafka. Now, by simply running the ksql queries on this data, you can find the number of issues organized per customer, and you can list them by the number of touch points per incident ID. You can also set a quality-of-service level wherein you track issues that have received repeated calls from a customer on the same incident over the last 24 hours. All of this can be achieved with these three simple queries.
  18. In this use case, let's assume we are collecting syslog, netflow, DNS, and firewall logs from all the devices connected to your internal corporate network; these could be mobile phones, laptops, routers, servers, or switches. Using these logs, we need to identify the higher-risk endpoints that could possibly be compromised. This is typically a very high-throughput use case, and this type of risk analysis can be done either in real time or as post-mortem analysis for incident response. After identifying the bad actors within the network, the sysadmin would remove the compromised endpoints from the network. Implementing this usually involves a lot of complexity and IT infrastructure. Let's explore how we can simplify it: the data from all the various endpoints is streamed into Kafka and pushed into Scylla using this sink connector. Now, using the KSQL stream-stream join, you can combine these various streams to derive a risk score. <CLICK> KSQL provides a rich set of SQL-like primitives, but sometimes you might need a custom ML algorithm, often very specific to your solution, and you want to make it available to business analysts or data scientists who want to do ad hoc analysis on this data. In this case, we derive the risk score by combining the data from different streams; as you can see, derived_risk_score is a UDF I have defined that takes the data from multiple sources and gives out a risk score, and you can simply use this UDF as part of your regular KSQL. Once we calculate the risk scores, we can order them in descending order and pick the endpoints that cross a given threshold. This triggers a monitoring alert for the sysadmin, who would then look into remediating the risk.
  19. So, to summarize, here are some takeaways from today's webinar.
  20. Here are some Scylla resources that you can use to further your understanding. These slides will be sent out as a follow-up to the webinar; feel free to go over them.