Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Hojjat Jafarpour, Software Engineer at Confluent
Maheedhar Gunturu, Solutions Architect at ScyllaDB
Presenters
Hojjat Jafarpour, Software Engineer at Confluent
Hojjat is a software engineer and the creator of KSQL, the Streaming SQL engine for Apache
Kafka, at Confluent. Before joining Confluent he worked at NEC Labs, Informatica, Quantcast
and Tidemark on various big data management projects. He has a Ph.D. in computer
science from UC Irvine, where he worked on scalable stream processing and
publish/subscribe systems.
Maheedhar Gunturu, Solutions Architect at ScyllaDB
Maheedhar has held senior roles in both engineering and sales organizations. He has over a decade
of experience designing & developing server-side applications in the cloud and working on big
data and ETL frameworks in companies such as Samsung, MapR, Apple, VoltDB, Zscaler and
Qualcomm.
Agenda
+ Overview of ScyllaDB
+ Apache Kafka and The Confluent Platform
+ Example Use Cases
+ Q&A
About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
and Amazon DynamoDB
+ 10X the performance & low tail latency
+ Open Source, Enterprise and Cloud options
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA, USA; Herzelia, Israel;
Warsaw, Poland
Scylla Design Principles
+ C++ instead of Java
+ Shard per Core
+ All Things Async
+ Unified Cache
+ I/O Scheduler
+ Self-Optimizing
Seastar Framework
Compatibility
+ CQL native protocol
+ JMX management protocol
+ Management command line / REST
+ SSTable file format
+ Configuration file format
+ CQL language
Change Data Capture (CDC) from Scylla
+ Helps with
  + Database mirroring/replication/state propagation
  + Direct data into a Kafka stream
+ Configurable subscription options to the change log (per table; CQL sketch below)
  + Post-image (changed state)
  + Delta (changes per column)
  + Pre-image (previous state)
+ Scylla CDC-Kafka source connector coming out soon!
Ref: https://www.scylladb.com/tech-talk/change-data-capture-in-scylla/
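To make the per-table configuration concrete, here is a minimal CQL sketch; the keyspace, table, and column names are illustrative, and the exact option set may vary by Scylla version:

CREATE TABLE ks.device_events (
    dev_id uuid,
    ts timestamp,
    state text,
    PRIMARY KEY (dev_id, ts)
) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': true};

-- CDC can also be switched on for an existing table:
ALTER TABLE ks.device_events WITH cdc = {'enabled': true};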
Apache Kafka and Confluent Platform
Pre-Streaming
New World: Streaming First
Apache Kafka
A distributed commit log: publish and subscribe to streams of records. Highly scalable, high throughput. Supports transactions. Persisted data. Writes are append only; reads are a single seek & scan.
Apache Kafka
Kafka Connect API
Reliable and scalable integration of Kafka with other systems –
no coding required.
Apache Kafka
Kafka Streams API
Write standard Java applications & microservices to
process your data in real-time
(Diagram: Orders, Customers table, Kafka Streams API)
Stream Processing by Analogy
Kafka Cluster
Connect API Stream Processing Connect API
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
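A rough ksqlDB equivalent of that pipeline, sketched over hypothetical streams named in_lines and out_lines (the stream and column names are illustrative, not from the deck):

CREATE STREAM in_lines (line VARCHAR)
  WITH (KAFKA_TOPIC='in_lines', VALUE_FORMAT='JSON');

-- grep "ksql" | tr a-z A-Z, expressed as a persistent query
CREATE STREAM out_lines AS
  SELECT UCASE(line) AS line
  FROM in_lines
  WHERE line LIKE '%ksql%'
  EMIT CHANGES;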
Stream Processing in Kafka: a spectrum from flexibility to simplicity
+ Consumer / Producer APIs: subscribe(), poll(), send(), flush()
+ Kafka Streams: map(), filter(), aggregate(), join()
+ ksqlDB: SELECT … FROM … JOIN … GROUP BY …
ksqlDB
+ The event streaming database purpose-built for stream
processing applications
+ Enables stream processing with zero coding required
+ The simplest way to process streams of data in real
time
+ Powered by Kafka: scalable, distributed, battle-tested
+ All you need is Kafka–no complex deployments of
bespoke systems for stream processing
ksqlDB
+ Streaming ETL
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
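For context, a minimal sketch of the source stream and lookup table this query assumes; the topic names are illustrative and the key-declaration syntax varies slightly across ksqlDB versions:

CREATE STREAM clickstream (userid VARCHAR, page VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC='clickstream', VALUE_FORMAT='AVRO');

CREATE TABLE users (user_id VARCHAR PRIMARY KEY, level VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='AVRO');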
ksqlDB
+ Real-Time Monitoring
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
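As above, the monitoring stream would first be declared over an existing topic, and the resulting table can then be watched continuously; a hypothetical sketch (topic and column names are illustrative):

CREATE STREAM monitoring_stream (error_code INT, type VARCHAR)
  WITH (KAFKA_TOPIC='monitoring', VALUE_FORMAT='JSON');

-- Push query: watch the per-minute error counts update as new events arrive
SELECT * FROM error_counts EMIT CHANGES;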
ksqlDB
+ Features (combined example below)
  + Aggregation
  + Window
    + Tumbling
    + Hopping
    + Session
  + Join
    + Stream-Stream
    + Stream-Table
    + Table-Table
  + Nested data
    + STRUCT
  + UDF/UDAF/UDTF
  + AVRO, JSON, CSV
    + Protobuf to come soon
  + And many more...
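To illustrate a few of these features together, a hypothetical hopping-window aggregation over nested (STRUCT) data; the stream, column names, and window sizes are illustrative:

CREATE TABLE clicks_per_region AS
  SELECT payload->region AS region, COUNT(*) AS clicks
  FROM pageviews
  WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE)
  GROUP BY payload->region
  EMIT CHANGES;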
Example Use Cases
Using Syslog to Detect SSH Attacks
(Diagram: syslog data flows into Kafka via a source connector, is processed with KSQL, and is written out through sink connectors.)
ksql> CREATE SINK CONNECTOR SINK_SCYLLA_SYSLOG WITH (
'connector.class' = 'io.connect.scylladb.ScyllaDbSinkConnector',
'connection.url' = 'localhost:9092',
'type.name' = '',
'behavior.on.malformed.documents' = 'warn',
'errors.tolerance' = 'all',
'errors.log.enable' = 'true',
'errors.log.include.messages' = 'true',
'topics' = 'syslog',
'key.ignore' = 'true',
'schema.ignore' = 'true',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter'
);
ksql> CREATE STREAM SYSLOG WITH (KAFKA_TOPIC='syslog', VALUE_FORMAT='AVRO');
ksql> SELECT TIMESTAMPTOSTRING(S.DATE, 'yyyy-MM-dd HH:mm:ss') AS SYSLOG_TS, S.HOST,
F.DESCRIPTION AS FACILITY, S.MESSAGE, S.REMOTEADDRESS FROM SYSLOG S
LEFT OUTER JOIN FACILITY F ON S.FACILITY=F.ROWKEY WHERE S.HOST='demo' EMIT CHANGES;
ksql> CREATE STREAM SYSLOG_INVALID_USERS AS SELECT * FROM SYSLOG WHERE MESSAGE LIKE
'Invalid user%';
ksql> CREATE STREAM SSH_ATTACKS AS SELECT TIMESTAMPTOSTRING(DATE, 'yyyy-MM-dd HH:mm:ss')
AS SYSLOG_TS, HOST, SPLIT(REPLACE(MESSAGE,'Invalid user ',''),' from ')[0] AS ATTACK_USER,
SPLIT(REPLACE(MESSAGE,'Invalid user ',''),' from ')[1] AS ATTACK_IP FROM
SYSLOG_INVALID_USERS EMIT CHANGES;
ksql> CREATE TABLE SSH_ATTACKS_BY_USER AS SELECT ATTACK_USER, COUNT(*) AS ATTEMPTS FROM
SSH_ATTACKS GROUP BY ATTACK_USER;
ksql> SELECT ATTACK_USER, ATTEMPTS FROM SSH_ATTACKS_BY_USER EMIT CHANGES; (push)
ksql> SELECT ATTACK_USER, ATTEMPTS FROM SSH_ATTACKS_BY_USER WHERE ROWKEY='oracle'; (pull)
ksql> CREATE SOURCE CONNECTOR SOURCE_SYSLOG_UDP_01 WITH (
'tasks.max' = '1',
'connector.class' = 'io.confluent.connect.syslog.SyslogSourceConnector',
'topic' = 'syslog',
'syslog.port' = '42514',
'syslog.listener' = 'UDP',
'syslog.reverse.dns.remote.ip' = 'true',
'confluent.license' = '',
'confluent.topic.bootstrap.servers' = 'kafka:29092',
'confluent.topic.replication.factor' = '1'
);
ksql> CREATE SINK CONNECTOR SINK_ELASTIC_SYSLOG WITH (
'connector.class' =
'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
'connection.url' = 'http://elasticsearch:9200',
'type.name' = '',
'behavior.on.malformed.documents' = 'warn',
'errors.tolerance' = 'all',
'errors.log.enable' = 'true',
'errors.log.include.messages' = 'true',
'topics' = 'SYSLOG_INVALID_USERS',
'key.ignore' = 'true',
'schema.ignore' = 'true',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter'
);
IoT - Smart Home
(Diagram: smart-home hub and devices stream real-time events through an MQTT proxy into Kafka; device state, device health, hub info, and lookup info feed device-data management, apps, and services, with CDC from the database.)
ksql> CREATE STREAM device_stream_mode WITH (KAFKA_TOPIC='syslog', VALUE_FORMAT='AVRO');
ksql> CREATE STREAM device_change_mode AS
        SELECT D.dev_id, D.dev_type, H.hub_mode AS device_mode
        FROM hub_mode H
        LEFT OUTER JOIN device_data D ON H.hub_id = D.hub_id
        EMIT CHANGES;
ksql> CREATE STREAM device_stream_mode AS
        SELECT DS.dev_id, DS.dev_type, DS.mode, F.state AS dev_state
        FROM device_change_mode DS
        LEFT OUTER JOIN FACILITY F ON DS.dev_type = F.dev_type
        WHERE DS.mode = <DEVICE_MODE>
        EMIT CHANGES;
### CONFIGURE THE MQTT SINK
ksql> INSERT INTO hub_mode SELECT * FROM /mqttTopicA/+/sensors [WITHCONVERTER=`myclass.AvroConverter`]
ksql> CREATE STREAM hub_mode WITH (KAFKA_TOPIC='hub_mode', VALUE_FORMAT='AVRO');
### Create the necessary sink and CDC source connector to Scylla
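The join above assumes a lookup table of target device states keyed by device type (the slide reuses the name FACILITY from the syslog example). A hypothetical definition, with the topic name illustrative and the key syntax depending on the ksqlDB version:

ksql> CREATE TABLE FACILITY (dev_type VARCHAR, state VARCHAR)
        WITH (KAFKA_TOPIC='device_target_state', VALUE_FORMAT='AVRO', KEY='dev_type');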
Customer Satisfaction - CES Score
(Diagram: customer interaction and customer log events flow via CDC into Kafka and feed segmentation, churn, customer loyalty, and support analytics, e.g. number of attempts per issue and SLA violations, building toward a Customer 360 view.)
ksql> CREATE STREAM cust_interactions (incident_Id VARCHAR, timestamp BIGINT)
        WITH (VALUE_FORMAT='JSON', PARTITIONS=1, KAFKA_TOPIC='cust_interaction');
ksql> CREATE TABLE cust_log_aggregate AS
        SELECT ROWKEY AS customer_id, COUNT(*) AS touch_points
        FROM cust_interactions
        GROUP BY customer_id;
ksql> CREATE TABLE cust_log_by_issue AS
        SELECT ROWKEY AS incident_Id, customer_id,
               COUNT_DISTINCT(touch_points) AS UNIQUE_TOUCH_POINTS
        FROM cust_interactions
        GROUP BY ROWKEY
        EMIT CHANGES;
ksql> SELECT C.incident_id,
             (C.incident_id_first_touch_point_TS - CL.incident_id_last_touchpoint_TS)/1000/60/60 AS current_SLA_hours
        FROM customer_log C
        INNER JOIN call_log CL ON C.incident_id = CL.incident_id
        WHERE current_SLA_hours > 24;
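As in the SSH-attack example, the materialized aggregate can also be queried on demand; a hypothetical pull query (the customer ID is illustrative):

ksql> SELECT customer_id, touch_points
        FROM cust_log_aggregate
        WHERE ROWKEY = 'customer-123';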
Security - Endpoint Security
(Diagram: streams of real-time events from syslog, netflow, DNS, and firewall logs.)
# Join the various streams of data using the source connector from Cassandra.
# Build and deploy the custom UDF.
ksql> CREATE STREAM entity_risk_score AS
        SELECT source_IP, mac_ID,
               derived_risk_score(priority_errors, DNS_burstiness, reputation, firewall_intrusion_attempts) AS risk_score
        FROM endpoint_profile
        WHERE derived_risk_score(priority_errors, DNS_burstiness, reputation, firewall_intrusion_attempts) > <THRESHOLD>;
Ref: https://www.confluent.io/blog/build-udf-udaf-ksql-5-0/
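One way to operationalize the threshold, sketched with illustrative names and window sizes: count how often each endpoint crosses it per hour, so the sysadmin's monitoring tool can alert on repeat offenders.

ksql> CREATE TABLE high_risk_events_per_endpoint AS
        SELECT source_IP, COUNT(*) AS high_risk_events
        FROM entity_risk_score
        WINDOW TUMBLING (SIZE 1 HOUR)
        GROUP BY source_IP
        EMIT CHANGES;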
Takeaways
+ ScyllaDB now Supports Change Data Capture (CDC)
+ ksqlDB provides a SQL interface for Streaming Applications
+ ksqlDB is easily extensible with Custom UDFs
+ Scylla has a new SINK connector (CDC source connector is coming soon!)
Resources
+ Scylla Sink Connector
+ Source Connector (Cassandra)
+ Scylla CDC Presentation
+ Debezium
+ Scylla’s 7 Design Principles
+ Scylla Benchmarks
Confluent Resources
+ Useful Links:
  + Stream Processing Book Bundle
  + Kafka tutorials
  + ksqlDB
  + Confluent - Scylla Partnership Overview
+ Kafka Summits 2020
  + Kafka Summit London: April 27 - 28
  + Kafka Summit Austin: August 24 - 25
Q&A
Stay in touch:
maheedhar@scylladb.com / @vanguard_space
hojjat@confluent.io / @Hojjat
United States
545 Faber Place
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank you


Editor's Notes

  1. Here is a brief agenda for today's talk. First, we will begin with a high-level overview of ScyllaDB and the open source framework, Seastar, on which Scylla is built. We will also briefly touch on some new features and how customers can take advantage of them. Next, we will dive into Apache Kafka and the various Confluent Platform components such as Kafka Connect and ksqlDB. We will then explore some joint use cases based on customers and prospects who use Kafka and various NoSQL solutions across domains. Towards the end we will take questions; please post them in the Q&A window as they come up, so we can prioritize and answer them appropriately.
  2. Apache Kafka and the Confluent Platform
  3. A little bit about ScyllaDB and our product Scylla, the real-time big data database. Scylla is a highly performant, low-latency, scalable, and autonomous NoSQL database that supports both the Apache Cassandra and DynamoDB APIs, so the same client drivers and application code written on top of Cassandra or DynamoDB are compatible with Scylla without any code changes. Scylla has three offerings: Scylla Open Source, Scylla Enterprise, and Scylla Cloud, which is a fully managed service. We also support running Scylla Cloud in the customer's own cloud account. We were founded by the creators of the KVM hypervisor, which, as some of you might know, is an open source virtualization technology built into Linux. ScyllaDB, the company, has about 100 employees located in more than 15 countries across the world. We are expanding and currently hiring C++ and Go developers and DevOps engineers.
  4. Now let's spend some time understanding the basic design principles that differentiate Scylla from other NoSQL databases. Scylla is written in C++, which results in faster performance, lower latencies, and efficient use of the hardware. Scylla has a shard-per-core isolated architecture where each Scylla shard is associated with a CPU core, and the necessary RAM is allocated per shard as well; this ensures we are never starved of resources. Everything on the platform is async, so there are no locks. Scylla also runs on XFS, so we get the performance benefits of an asynchronous file system as well. Scylla comes with a cache that is highly optimized and auto-tuned to your workload, so there is no need to worry about key caching, row caching, Linux caching, on-heap, off-heap; everything is auto-tuned. Scylla is built on top of an open source framework called Seastar, which orchestrates both task and disk I/O scheduling; we will go into a bit more detail on the next slide. Scylla is an autonomous database, which means your administrators benefit from minimal tuning.
  5. As I mentioned before, Scylla is built on an open source framework called Seastar. The framework understands that there are essentially five different kinds of back-end operations: commitlogs, memtables, compactions, queries, and repairs. These operations are scheduled through the Seastar scheduler. Effectively, the Seastar scheduler moves scheduling out of kernel space and into user space, where every request is parallelized and given its own priority. Also, when you install Scylla, it tunes itself according to the disk, RAM, CPU, and network made available to the server. This in effect makes the database autonomous.
  6. ScyllaDB's users benefit from Scylla's full compatibility with Cassandra and the Cassandra ecosystem. All the ecosystem components listed here, and many more, just work out of the box with Scylla. In addition, there are many client drivers available in a variety of programming languages that can be reused as well. Scylla has created optimized drivers for developers using Java and Go; these drivers can better take advantage of the internals of Scylla Enterprise and Scylla Cloud.
  7. We recently released a new feature called CDC, or Change Data Capture. Change Data Capture logging captures and keeps track of data that has changed. It can also help with mirroring your database, replication, and state propagation across various microservices. It is configured per table, with limits on the amount of disk space consumed for storing the CDC logs. We co-locate CDC log partitions with the base table partitions, which greatly improves performance. There are a variety of configurable subscription options that let you shape the change-log stream: you can choose the post-image (the changed state), just the delta (changes made to the columns), or the pre-image (the previous state). Feel free to refer to the link below for more detailed information. We are planning to release a Scylla CDC-Kafka source connector, which makes it easier for developers to build globally synchronized services; it also significantly reduces the amount of data to be moved and provides a reliable way to stream data between systems.
  8. Now let's dive into Kafka and the Confluent Platform. Over to you, Hojjat.
  9. Kafka is highly available, resilient to node failures, and supports automatic recovery. This makes Apache Kafka ideal for communication and integration between components of large-scale, real-world data systems.
  14. Based on a number of customers and prospects who use Kafka and various NoSQL systems, I would like to walk you through a few use cases that are relevant for ksqlDB and CDC. The examples we are going to talk about are detecting SSH attacks, IoT smart home, Customer 360, and network and endpoint security.
  15. Here is a simple use case where we have access to certain logs, say syslogs, and we need to detect whether there were any brute-force SSH attacks. You want to see the results in Scylla and point your monitoring to it, and also send them to Elasticsearch for some ad hoc analysis on top of this data. <CLICK> Syslog is built into Linux, and it is also common for networking and IoT devices to stream log messages, along with metadata such as the source host, severity, message payload, and tags, to either a local logfile or a centralized syslog server. In this case, let's stream it into Kafka using the Connect framework, <CLICK> which can be done with a simple one-liner; ksqlDB now includes the ability to define connectors from within it, which makes setting things up a lot easier. Next, let's set up the Elasticsearch sink, which you can do with another one-liner, and then the Scylla sink <CLICK> (currently under development; it should be available soon). Now that we have established all the necessary connections, let's write a few simple ksql queries. To begin, we create a stream SYSLOG that reads from the topic syslog; you can browse through the data, which already has the necessary schema. Then we join it with the FACILITY table, which contains the information about the various message levels (say, 0 is a kernel message, 1 is a user-level message, and so on). After you have joined the data, filter out the messages that contain "Invalid user", as this indicates a failed SSH attempt. Then create a stream of such messages; you can do a bit of reorganization of the data using the SPLIT and REPLACE primitives. If you want to persist the query, you can simply create a table based off the stream; this is the stream-table duality Hojjat explained before. There are two ways to interact with this data: either via a push query or a pull query.
  16. Smart-home and IoT ecosystems are becoming very popular; for example, even in my home I have about 15-20 connected devices active at any point in time. There are typically three parts to this connected ecosystem: the smart hub to which all the devices are connected, the mobile app which can control all the devices, and the cloud which keeps track of the state of the hub, the various devices connected to it directly or via the partner network, the various services enabled, etc. The mobile app typically connects to and controls all the devices, but the orchestration of the various automations is initiated via the cloud. Let's go through an example. <CLICK> If you were to put your smart home into "away" mode, then all your lights need to be turned off, doors need to be locked, the air-conditioning needs to go into "eco" mode, cameras need to detect motion, and so on. The state of all the devices connected to the hub is typically monitored in the cloud. The devices attached to the hub are automatically moved to the corresponding state, and all of this is communicated from the cloud to the hub. This communication can happen via multiple protocols: CoAP, MQTT, or TCP. For this example, let's assume it is communicated via MQTT. Here is some sample code which can orchestrate all of this via KSQL. <CLICK> You could use the MQTT proxy available as part of Confluent Enterprise, and once the data comes into Kafka it can be automatically transformed into an Avro schema. You should set up the necessary sink and CDC connectors to Scylla or Scylla Cloud. Now the data about the change of the hub's mode gets streamed into Scylla; this change is detected by CDC on the Scylla side, which propagates the state to a few custom topics. KSQL picks up this change and runs the stream-table join operations listed here, which combine the data of all the devices connected to the hub with a lookup table containing the final state those devices need to be in. That state can then simply be communicated via the MQTT proxy.
  17. Let's dive into another use case. Say you are pushing out a new release of your product and you would like to track various customer issues and assess your support. There are a number of ways to do this, but calculating CES, a customer engagement score, is one easy way, and eventually you want all of this information populated into your Customer 360 solution. You could start by logging all your customer interaction data into Kafka from your application, or you could persist it directly into a MySQL instance; if you enable CDC on it, the necessary changes from the customer log tables are streamed out into Kafka. Now, by simply running the ksql queries on this data, you can find the number of issues organized per customer, and you can list them by the number of touch points per incident ID. You can also set a quality-of-service level wherein you track issues that have received repeated calls from a customer on the same incident over the last 24 hours. All of this can be achieved with these three simple queries.
  18. In this use case, let's assume we are collecting syslog, netflow, DNS, and firewall logs from all the devices connected to your internal corporate network; these could be mobile phones, laptops, routers, servers, or switches. Using these logs, we need to identify the higher-risk endpoints that could possibly be compromised. This is typically a very high-throughput use case, and this type of risk analysis can be done either in real time or as post-mortem analysis for incident response. After identifying the bad actors within the network, the sysadmin would remove the compromised endpoints from the network. Implementing this usually involves a lot of complexity and IT infrastructure. Let's explore how we can simplify it: the data from all the various endpoints is streamed into Kafka and pushed into Scylla using this sink connector. Now, using the KSQL stream-stream join, you can combine these various streams to derive a risk score. <CLICK> KSQL provides a rich set of SQL-like primitives, but sometimes you might need a custom ML algorithm, often very specific to your solution, and you want to make it available to business analysts or data scientists who want to do ad hoc analysis on this data. In this case, we derive the risk score by combining the data from different streams; as you can see, derived_risk_score is a UDF I have defined that takes the data from multiple sources and gives out a risk score, and you can simply use this UDF as part of your regular KSQL. Once we calculate the risk scores, we can order them in descending order and pick the endpoints that cross a given threshold. This triggers a monitoring alert for the sysadmin, who would then look into remediating the risk.
  19. So, to summarize, here are some takeaways from today's webinar.
  20. Here are some Scylla resources that you can use to further your understanding. These slides will be sent out as a follow-up to the webinar; feel free to go over them.