guidoschmutz.wordpress.com | @gschmutz
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
Kafka Summit 2020
Guido
Working at Trivadis for more than 23 years
Consultant, Trainer, Platform Architect for Java,
Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE
Director
@gschmutz guidoschmutz.wordpress.com
195th edition
Agenda
1. What is a Data Lake?
2. Four Architecture Blueprints for “treating Kafka as a Data Lake”
3. Summary
Demo environment and code samples available here: https://github.com/gschmutz/kafka-as-your-datalake-demo
What is a Data Lake?
4
What is a Data Lake? Traditional Data Lake Architecture
[Diagram: bulk sources (DB extracts, files) are loaded via file import / SQL import in "native" raw format into a Big Data Platform (Hadoop cluster) with Raw and Refined/Usage-Optimized storage zones, serving data consumers.]
Initial Idea of Data Lake
• Single store of all data (incl. raw data) in the enterprise
• Put an end to data silos
• Reporting, Visualization, Analytics and Machine
Learning
• Focus on Schema-on-Read
Tech for 1st Gen Data Lake
• HDFS, MapReduce, Pig, Hive, Impala, Flume,
Sqoop
Tech for 2nd Gen Data Lake (Cloud native)
• Object Store (S3, Azure Blob Storage, …), Spark,
Flink, Presto, StreamSets, …
[Diagram, continued: BI apps and data science workbenches access the platform through SQL/search, parallel processing and query engines; the overall pipeline has high latency.]
Traditional Data Lake Zones
8
”Streaming Data Lake” – aka. Kappa Architecture
[Diagram: bulk and event sources (DB extracts, files, location, weather, IoT data, mobile apps, social) feed an Event Hub via change data capture and event streams; stream processors (V1.0/V2.0, each with state) publish result streams that are served to data consumers (BI apps, dashboards) through an API switcher, while a bulk data flow (with replay) still loads a (Big) Data Platform with Raw and Refined/Usage-Optimized storage, accessed via SQL/search, parallel processing and query engines by a data science workbench.]
[8] – Questioning the Lambda Architecture – by Jay Kreps
“Streaming Data Lake” Zones
12
Moving the Source of Truth to Event Hub – turning the database inside-out!
[Diagram: the same streaming architecture, with the Event Hub now acting as the source of truth; all downstream result streams and stores are derived from it, and the bulk data flow into the (Big) Data Platform can be replayed from the Event Hub.]
[1] Turning the database inside out with Apache Samza – by Martin Kleppmann
Moving the Source of Truth to Event Hub – is it feasible?
[Diagram: the same architecture with the Event Hub as the source of truth; BI apps, dashboards and the data science workbench are served from result streams and the derived storage zones.]
[2] It’s Okay To Store Data In Apache Kafka – by Jay Kreps
Confluent Enterprise Tiered Storage
Data Retention (a topic-configuration sketch follows after this slide)
• Never
• Time (TTL) or size-based
• Log-compaction based
Tiered Storage uses two tiers of storage
• Local (same local disks on brokers)
• Remote (Object storage, currently AWS S3 only)
Enables Kafka to be a long-term storage
solution
• Transparent (no ETL pipelines needed)
• Cheaper storage for cold data
• Better scalability and less complex operations
[Diagram: brokers 1–3 keep hot data on local disks and offload cold data to object storage.]
[3] Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
[4] KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
15
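To make the retention options above concrete, here is a minimal sketch that creates a topic which never deletes data, using the confluent-kafka Python AdminClient. Broker addresses, topic name and partition/replication counts are illustrative, and the tiered-storage behaviour itself is broker-side Confluent Platform configuration that is not shown here.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka-1:19092,kafka-2:19093"})

topic = NewTopic(
    "truck_position",
    num_partitions=8,
    replication_factor=3,
    config={
        "retention.ms": "-1",   # 'never' - keep the full history
        # time/size-based alternative: "retention.ms": "604800000", "retention.bytes": "1073741824"
        # log-compacted alternative:   "cleanup.policy": "compact"
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()   # raises if the topic could not be created
    print("created topic", name)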
Four Architecture Blueprints
for “treating Kafka as a Data
Lake”
20
How can you access a Kafka topic?
Streaming Queries
• Latest - start at end and continuously consume new data
• Earliest – start at beginning and consume history and then continuously consume new data
• Seek to offset – start at a given offset, consume history, and continuously consume new data
• Seek to timestamp – start at given timestamp, consume history and continuously consume new data
Batch Queries
• From start offset to end offset – start at a given offset and consume until another offset
• From start timestamp to end timestamp – start at a given timestamp and consume until another timestamp
• Full scan – Scan the complete topic from start to end
All of the above access options can be applied to a whole topic or to a set of partitions, as sketched in the consumer example below.
21
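A minimal sketch of the "seek to timestamp" pattern with the confluent-kafka Python client; broker addresses, consumer group, topic name and the assumption of 8 partitions are illustrative.

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka-1:19092,kafka-2:19093",
    "group.id": "datalake-batch-reader",
    "enable.auto.commit": False,
})

# ask the brokers for the earliest offset at/after the given timestamp (ms since epoch)
start_ts_ms = 1591106396000
partitions = [TopicPartition("truck_position", p, start_ts_ms) for p in range(8)]
offsets = consumer.offsets_for_times(partitions, timeout=10)

# start consuming history from exactly those offsets
consumer.assign(offsets)
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break   # crude batch-style stop once no more data arrives within the timeout
    if msg.error():
        continue
    print(msg.partition(), msg.offset(), msg.value())

consumer.close()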
BP-1: ”Streaming” Data Lake
• Using Stream Processing tools to
perform processing on ”data in
motion” instead of in batch
• Can consume from multiple sources
• Works well if no or limited history is
needed
• Queryable State Stores, aka.
Interactive Queries or Pull Queries
[5] Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
BP-1_1: ”Streaming” Data Lake with ksqlDB /
Kafka Streams
• Kafka Streams or ksqlDB fit perfectly
• Using ksqlDB pull queries to retrieve the current state of materialized views (see the REST sketch below)
• Store results in another Kafka topic to
persist state store information
• Can be combined with BP-4 to store
results/state in a database
(Capability checklist – see the Architecture Blueprints Overview table for this blueprint’s ratings.)
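A minimal sketch of issuing such a pull query against the ksqlDB REST API from Python; the server address, the materialized table name (problematic_driving_agg from the demo) and its key column are assumptions.

import requests

KSQLDB_URL = "http://ksqldb-server-1:8088/query"

payload = {
    # pull query against a materialized table backed by a state store
    "ksql": "SELECT * FROM problematic_driving_agg WHERE eventType = 'Overspeed';",
    "streamsProperties": {},
}

resp = requests.post(
    KSQLDB_URL,
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())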
Demo Use Case – Vehicle Tracking
[Diagram: Trucks 1..n send position events, e.g.
2020-06-02 14:39:56.605,21,19,803014427,Wichita to Little Rock Route3,Overspeed,32.35,91.21,5187297736652502632
to the truck_position topic (Raw zone). A refinement step produces truck_position_avro, and detect_problematic_driving filters it into problematic_driving (Refined zone). Driver master data, e.g.
27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00
is ingested with a jdbc-source connector into truck_driver; join_problematic_driving_driver joins both into problematic_driving_driver, which a console consumer reads as JSON, e.g.
{"id":19,"firstName":"Walter","lastName":"Ward","available":"Y","birthdate":"24-JUL-85","last_update":1506923052012}
Finally, an aggregation by eventType over a time window produces problematic_driving_agg (Usage-Optimized zone), which is accessed with pull queries, e.g. Overspeed,10,10:00:00,10:00:05.]
Demo
25
BP-2: Batch Processing with Event Hub as Source
• Using a Batch Processing framework
to process Event Hub data
retrospectively (full history available)
• Write back results to Event Hub
• Read and join multiple sources
• Can be combined with Advanced
Analytics capabilities (i.e. machine
learning / AI)
26
BP-2_1: Apache Spark with Kafka as Source
• Apache Spark is a unified analytics
engine for large-scale data processing
• Provides complex analytics through
MLlib and GraphX
• Can consume from/produce to Kafka
both in Streaming as well as Batch
Mode
• Use Data Frame / Dataset abstraction
as you would with other data sources
(Capability checklist – see the Architecture Blueprints Overview table for this blueprint’s ratings.)
28
BP-2_1: Apache Spark with Kafka as Source
from pyspark.sql.functions import from_json
from pyspark.sql.types import (StructType, TimestampType, LongType,
                               StringType, DoubleType)

truckPositionSchema = (StructType()
    .add("timestamp", TimestampType())
    .add("truckId", LongType())
    .add("driverId", LongType())
    .add("routeId", LongType())
    .add("eventType", StringType())
    .add("latitude", DoubleType())
    .add("longitude", DoubleType())
    .add("correlationId", StringType()))

rawDf = (spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
    .option("subscribe", "truck_position")
    .load())

jsonDf = rawDf.selectExpr("CAST(value AS string)")
jsonDf = (jsonDf
    .select(from_json(jsonDf.value, truckPositionSchema).alias("json"))
    .selectExpr("json.*",
                "cast(cast(json.timestamp as double) / 1000 as timestamp) as eventTime"))
BP-2_1: Apache Spark with Kafka as Source
30
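BP-2 also foresees writing results back to the Event Hub; a minimal sketch of producing a DataFrame to Kafka in batch mode, continuing the example above (the target topic name is illustrative).

# write refined records back to Kafka; key and value must be string/binary columns
(jsonDf
    .selectExpr("CAST(truckId AS STRING) AS key",
                "to_json(struct(*)) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
    .option("topic", "truck_position_refined")
    .save())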
BP-3: Batch Query with Event Hub as Source
• Using a Query Virtualization
framework to consume (query) Event
Hub data retrospectively (full history
available)
• Optionally produce (insert) data into
Event Hub
• Read and join multiple sources
• Based on SQL and with the full power
of SQL at hand (functions and
optionally UDF/UDAF/UDTF)
• Batch SQL not Streaming SQL
31
BP-3_1: Presto with Kafka as Source
• Presto is a distributed SQL query
engine for big data
• Supports accessing data from multiple
systems within a single query
• Supports Kafka as a source (query)
and as a target (insert for raw & json)
• Does not yet support pushdown of
timestamp queries
• Starburst Enterprise Presto provides fine-grained access control
(Capability checklist – see the Architecture Blueprints Overview table for this blueprint’s ratings.)
33
BP-3_1: Presto with Kafka as Source
kafka.properties:
kafka.nodes=kafka-1:9092
kafka.table-names=truck_position, truck_driver
kafka.default-schema=logistics
kafka.hide-internal-columns=false
kafka.table-description-dir=etc/kafka
select * from truck_position;
34
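The same catalog can also be queried programmatically; a minimal sketch using the presto-python-client (prestodb) package, whose use and the coordinator host name are assumptions rather than part of the demo setup.

import prestodb

conn = prestodb.dbapi.connect(
    host="presto-1",      # illustrative coordinator host
    port=8080,
    user="demo",
    catalog="kafka",
    schema="logistics",
)

cur = conn.cursor()
cur.execute("SELECT truck_id, driver_id, event_type FROM truck_position LIMIT 10")
for row in cur.fetchall():
    print(row)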
BP-3_1: Presto with Kafka as Source
etc/kafka/truck_position.json:
{
  "tableName": "truck_position",
  "schemaName": "logistics",
  "topicName": "truck_position",
  "key": {
    "dataFormat": "raw",
    "fields": [
      { "name": "kafka_key", "dataFormat": "BYTE", "type": "VARCHAR", "hidden": "false" }
    ]
  },
  "message": {
    "dataFormat": "json",
    "fields": [
      { "name": "timestamp", "mapping": "timestamp", "type": "BIGINT" },
      { "name": "truck_id", "mapping": "truckId", "type": "BIGINT" },
      ...
etc/kafka/truck_driver.json: analogous table description for the truck_driver topic.
35
BP-3_1: Presto with Kafka as Source
select * from truck_position
select * from truck_driver
36
BP-3_1: Presto with Kafka as Source
Join truck_position with truck_driver (removing non-compacted entries using Presto
WINDOW Function)
SELECT d.id, d.first_name, d.last_name, t.*
FROM truck_position t
LEFT JOIN (
SELECT *
FROM truck_driver
WHERE (last_update) IN
(SELECT LAST_VALUE(last_update)
OVER (PARTITION BY id
ORDER BY last_update
RANGE BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING) AS last_update
FROM truck_driver) ) d
ON t.driver_id = d.id
WHERE t.event_type != 'Normal';
37
BP-3_2: Apache Drill with Kafka as Source
• Apache Drill is a schema-free SQL
Query Engine for Hadoop, NoSQL and
Cloud Storage
• Supports accessing data from multiple
systems within a single query
• Can push down filters on partitions, timestamp and offset (see the query sketch below)
(Capability checklist – see the Architecture Blueprints Overview table for this blueprint’s ratings.)
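A minimal sketch of running such a pushdown-friendly query through Drill’s REST API from Python; the storage-plugin name (kafka), the metadata column kafkaMsgTimestamp and the Drillbit host are assumptions based on the Drill Kafka plugin, not taken from the slides.

import requests

DRILL_URL = "http://drill-1:8047/query.json"

sql = """
SELECT *
FROM kafka.`truck_position`
WHERE kafkaMsgTimestamp > 1591106396000   -- timestamp filter pushed down to the Kafka scan
LIMIT 10
"""

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)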
BP-3_3: Hive/Spark SQL with Kafka as Source
• Apache Hive facilitates reading,
writing, and managing large datasets
residing in distributed storage using
SQL
• Part of any Hadoop distribution
• A special storage handler allows access to Kafka topics via Hive external tables (sketched below)
• Spark SQL on data frame as shown in
BP-2_1 or by integrating Hive
Metastore
(Capability checklist – see the Architecture Blueprints Overview table for this blueprint’s ratings.)
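A minimal sketch of such an external table over the truck_position topic, created through PyHive; the storage-handler class is the one shipped with the Hive-Kafka integration (Hive 3.x), while host names, table name and the PyHive client itself are illustrative assumptions.

from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, username="hive")
cur = conn.cursor()

# external table mapped onto the Kafka topic via the Kafka storage handler
cur.execute("""
CREATE EXTERNAL TABLE IF NOT EXISTS truck_position_kafka (
  `timestamp` BIGINT, truckId BIGINT, driverId BIGINT, routeId BIGINT,
  eventType STRING, latitude DOUBLE, longitude DOUBLE, correlationId STRING)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  'kafka.topic' = 'truck_position',
  'kafka.bootstrap.servers' = 'kafka-1:19092,kafka-2:19093')
""")

cur.execute("SELECT eventType, COUNT(*) FROM truck_position_kafka GROUP BY eventType")
print(cur.fetchall())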
BP-3_4: Oracle Access to Kafka with Kafka as
Source
• Oracle SQL Access to Kafka is a PL/SQL
package that enables Oracle SQL to
query Kafka topics via DB views and
underlying external tables [6]
• Runs in the Oracle database
• Supports Kafka as a source (query) but
not (yet) as a target
• Use Oracle SQL to access the Kafka
topics and optionally join to RDBMS
tables
(Capability checklist – see the Architecture Blueprints Overview table for this blueprint’s ratings.)
BP-4: Use any storage as “materialized view”
• Use any persistence
technology to provide a
“Materialized View” to the
Data Consumers
• Can be provided in
retrospective, run-once,
batch or streaming update
(in sync) mode
• “On Demand” use cases
• Provide a sandbox
environment for data
scientists
• Provide part of the Kafka topics materialized in object storage (see the Kafka Connect sketch below)
41
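One common way to materialize a Kafka topic into object storage is a Kafka Connect S3 sink; a minimal sketch that registers such a connector via the Connect REST API. Connector name, bucket, region and host names are illustrative, and the property set follows the commonly documented Confluent S3 sink options rather than the demo configuration.

import requests

connect_url = "http://connect-1:8083/connectors"

connector = {
    "name": "truck-position-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "truck_position_avro",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "s3.bucket.name": "datalake-materialized-views",
        "s3.region": "eu-central-1",
        "flush.size": "1000",
        "tasks.max": "1",
    },
}

resp = requests.post(connect_url, json=connector)
resp.raise_for_status()
print(resp.json())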
Architecture Blueprints Overview
Capability | BP1_1 (Streaming) | BP2_1 (Batch Processing) | BP3_1 (Query) | BP3_2 (Query) | BP3_3 (Query) | BP3_4 (Query)
Supports JSON 🟢 🟢 🟢 🟢 🟢 🟢
Supports Avro 🟢 🟢 🟢 🔴 🟢 🔴
Supports Protobuf 🟢 🔴 🔴 🔴 🔴 🔴
Schema Registry Integration 🟢 🟢 🔴 🔴 🔴 🔴
Timestamp Filter Pushdown ⚪ 🟢 🔴 🟢 🟢 🟢
Offset Filter Pushdown ⚪ 🟢 🔴 🟢 🟢 🟢
Partition Filter Pushdown ⚪ 🔴 🔴 🟢 🟢 🟢
Supports Produce Operation 🟢 🟢 🟢 🔴 🟢 🔴
Supports Exactly Once 🟢 🔴 🔴 🔴 🟢 🔴
• BP-1_1: Streaming Data Lake using Kafka Streams / ksqlDB
• BP-2_1: Apache Spark with Kafka as Source
• BP-3_1: Presto with Kafka as Source
• BP-3_2: Apache Drill with Kafka as Source
• BP-3_3: Hive/Spark SQL with Kafka as Source
• BP-3_4: Oracle Access to Kafka with Kafka as Source
42
Summary
43
Summary
• Move processing / analytics from batch to stream processing pipelines
• Event Hub (Kafka) as the single source of truth => turning the database inside out!
• Everything else is just a “materialized view” of the Event Hub topic data
• Can still be HDFS or an object store (S3, …), but also Kudu or Parquet
• NoSQL databases & relational databases
• In-memory databases
• Confluent Platform Tiered Storage makes long-term storage feasible
• Does not apply to large, unstructured data (images, videos, …) => a separate path around the Event Hub is necessary, with metadata still sent through the Event Hub
• This is the result of a proof of concept: only functional tests have been done so far; performance tests will follow
44
References
1. Turning the database inside out with Apache Samza – by Martin Kleppmann
2. It’s Okay To Store Data In Apache Kafka – by Jay Kreps
3. Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
4. KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
5. Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
6. Read data from Kafka topic using Oracle SQL Access to Kafka (OSAK) - by Mohammad H.
AbdelQader
7. Demo environment and code samples - by Guido Schmutz (on GitHub)
8. Questioning the Lambda Architecture - by Jay Kreps
45
Updates to Slides
24.8.2020 – Presto supports Avro
24.8.2020 – Presto supports Insert for raw and json
20.7.2020 – initial version
46