This document provides an overview of Apache Kafka including its main components, architecture, and ecosystem. It describes how LinkedIn used Kafka to solve their data pipeline problem by decoupling systems and allowing for horizontal scaling. The key elements of Kafka are producers that publish data to topics, the Kafka cluster that stores streams of records in a distributed, replicated commit log, and consumers that subscribe to topics. Kafka Connect and the Schema Registry are also introduced as part of the Kafka ecosystem.
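As a minimal sketch of the publish side of the model described above, here is a hedged Java producer example; the broker address and topic name are placeholders, not anything from the original deck:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition, preserving per-key order
            // in the distributed, replicated commit log.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```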
Kafka Streams: Revisiting the decisions of the past (How I could have made it better), Jason Bell, Kafka DevOps Engineer @ Digitalis.io https://www.meetup.com/Cleveland-Kafka/events/272339276/
While Kafka has guarantees around the number of server failures a cluster can tolerate, it is prudent to have infrastructure in place to avoid service interruptions, or even data loss, when an environment becomes unavailable during a planned or unplanned outage. This talk describes the architectures available to you when planning for an outage. We will examine configurations including active/passive and active/active as well as availability zones and debate the benefits and limitations of each. We will also cover how to set up each configuration using the tools in Kafka. Whether downtime while you fail over clients to a backup is acceptable or you require your Kafka clusters to be highly available, this talk will give you an understanding of the options available to mitigate the impact of the loss of an environment.
In this talk, we'll discuss how VillageMD is able to use Kafka topic compaction for rapidly scaling our reprocessing pipelines to encompass hundreds of feeds. Within healthcare data ecosystems, privacy and data minimalism are key design priorities. Being able to handle data deletion in a reliable, timely manner within event-driven architectures is becoming more and more necessary with key governance frameworks like the GDPR and HIPAA. We'll be giving an overview of the building and governance of dead-letter queues for streaming data processing. We'll discuss: 1. How to architect a data sink for failed records. 2. How topic compaction can reduce duplicate data and enable idempotency. 3. Building a tombstoning system for removing successfully reprocessed records from the queues. 4. Considerations for monitoring a reprocessing system in production -- what metrics, dataops, and SLAs are useful?
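For context on the tombstoning mechanism: on a log-compacted topic, producing a record with a non-null key and a null value is a tombstone, and compaction eventually removes every earlier record with that key. A hedged sketch, with the topic and key names invented for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Assumes an already-configured producer and a log-compacted dead-letter topic
// (both names here are hypothetical, not VillageMD's actual topology).
void tombstone(KafkaProducer<String, byte[]> producer, String recordKey) {
    // A null value marks the key for deletion during compaction, so a
    // successfully reprocessed record disappears from the queue.
    producer.send(new ProducerRecord<>("dlq.claims.compacted", recordKey, null));
}
```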
Kafka is a vital part of data infrastructure in many organizations. When the Kafka cluster grows and more data is stored in Kafka for a longer duration, several issues related to scalability, efficiency, and operations become important to address. Kafka cluster storage is typically scaled by adding more broker nodes to the cluster, but this also adds needless memory and CPUs, making overall storage cost less efficient compared to storing the older data in external storage. Tiered storage extends Kafka's storage beyond the local storage available on the Kafka cluster by retaining the older data in cheaper stores, such as HDFS, S3, Azure Blob Storage, or GCS, with minimal impact on the internals of Kafka. We will talk about:
- how tiered storage addresses the above problems and also brings several other advantages,
- the high-level architecture of tiered storage, and
- future work planned as part of tiered storage.
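For illustration only: in Kafka releases that ship KIP-405 tiered storage (3.6+), opting a topic in is largely a matter of topic configuration. A hedged Admin API sketch, assuming brokers are already configured with a remote storage plugin (remote.log.storage.system.enable=true); the topic name and retention value are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clickstream");
            // Keep only recent segments on broker disks; older segments are served
            // from the cheaper remote tier.
            List<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                                  AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("local.retention.ms", "86400000"),  // 1 day locally
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```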
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
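To give a flavor of such hand-tuning, here is a hedged RocksDBConfigSetter sketch; the specific values are placeholders rather than recommendations — the talk's point is to derive them from Kafka Streams and RocksDB metrics:

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        tableConfig.setBlockSize(16 * 1024L);   // larger blocks favor sequential scans
        options.setTableFormatConfig(tableConfig);
        options.setMaxWriteBufferNumber(4);     // extra memtables absorb write bursts
    }

    @Override
    public void close(String storeName, Options options) { }
}
```

It would be registered with the application via props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class).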
Confluent Cloud runs a modified version of Apache Kafka - redesigned to be cloud-native and deliver a serverless user experience. In this talk, we will discuss key improvements we've made to Kafka and how they contribute to Confluent Cloud availability, elasticity, and multi-tenancy. You'll learn about innovations that you can use on-prem, and everything you need to make the most of Confluent Cloud.
Apache Pulsar was developed to address several shortcomings of existing messaging systems, including geo-replication, message durability, and lower message latency. We will implement a multi-currency quoting application that feeds pricing information to a cryptocurrency trading platform deployed around the globe. Given the volatility of cryptocurrency prices, sub-second message latency is critical to traders. Equally important is ensuring consistent quotes across all geographical locations, i.e., the price of Bitcoin shown to a user in the USA should be the same as the one shown to a trader in Hong Kong. We will highlight the advantages of Apache Pulsar over traditional messaging systems and show how its low latency and replication across multiple geographies make it ideally suited for globally distributed, real-time applications.
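A hedged sketch of what the latency-sensitive producer side might look like with the Pulsar Java client; the service URL and topic are placeholders, and geo-replication itself would be configured at the namespace level (e.g. via pulsar-admin), not in client code:

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class QuotePublisher {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder service URL
                .build();
        // Disabling batching trades some throughput for the lowest per-message latency,
        // which matters for sub-second quote delivery.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/btc-usd-quotes")  // hypothetical topic
                .enableBatching(false)
                .create();
        producer.send("BTC-USD,43201.55");   // illustrative quote payload
        producer.close();
        client.close();
    }
}
```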
This document provides an overview and summary of Apache Pulsar, a distributed streaming and messaging platform. It discusses Pulsar's benefits like data durability, scalability, geo-replication, and multi-tenancy. It outlines key use cases like message queuing and data streaming. The document also summarizes Pulsar's architecture, subscription modes, connectors, and integration with other technologies like Apache Flink, Apache NiFi, and MQTT. It highlights real-world customer implementations and provides demos of ingesting IoT data via Pulsar.
1. The document discusses implementing a new data structure, Cuckoo filters, as a Redis module. It describes the presenter's experience level in C programming and goals for the project. 2. Probabilistic data structures like Bloom filters and Cuckoo filters are described as space-efficient alternatives to sets for membership testing with some error. The presenter explains why Cuckoo filters were chosen over Bloom filters for the ability to delete items. 3. Advice is given on starting a new Redis module, including using existing code as examples and writing tests at different speeds. Good API design, leveraging existing work, and thorough testing are emphasized.
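The module discussed is written in C against the Redis module API; purely to illustrate the data structure itself, here is a minimal cuckoo-filter sketch in Java (single-slot buckets, toy hashing, all invented for illustration) showing the insert/evict cycle and why deletion — the reason cuckoo filters were chosen over Bloom filters — is possible:

```java
import java.util.Random;

// Illustrative only: one fingerprint per bucket and weak hashing. A production
// filter (like the Redis module discussed) uses multi-slot buckets and a real hash.
public class CuckooFilterSketch {
    private final byte[] table;            // 0 marks an empty slot
    private final int mask;                // numBuckets - 1 (power of two)
    private final Random rng = new Random();
    private static final int MAX_KICKS = 500;

    public CuckooFilterSketch(int numBuckets) {
        if (Integer.bitCount(numBuckets) != 1)
            throw new IllegalArgumentException("numBuckets must be a power of two");
        this.table = new byte[numBuckets];
        this.mask = numBuckets - 1;
    }

    private byte fingerprint(Object item) {
        return (byte) ((item.hashCode() * 0x9E3779B9 >>> 24) | 1);  // never zero
    }

    private int bucket(Object item) {
        return item.hashCode() & mask;
    }

    // Partial-key cuckoo hashing: the alternate bucket depends only on the
    // fingerprint, and XOR under a power-of-two mask is its own inverse.
    private int altBucket(int i, byte fp) {
        return (i ^ ((fp & 0xFF) * 0x5bd1e995)) & mask;
    }

    public boolean insert(Object item) {
        byte fp = fingerprint(item);
        int i1 = bucket(item), i2 = altBucket(i1, fp);
        if (table[i1] == 0) { table[i1] = fp; return true; }
        if (table[i2] == 0) { table[i2] = fp; return true; }
        int i = rng.nextBoolean() ? i1 : i2;
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            byte evicted = table[i];   // kick a resident fingerprint out...
            table[i] = fp;
            fp = evicted;
            i = altBucket(i, fp);      // ...and try its alternate bucket
            if (table[i] == 0) { table[i] = fp; return true; }
        }
        return false;                  // filter too full
    }

    public boolean mightContain(Object item) {
        byte fp = fingerprint(item);
        int i1 = bucket(item);
        return table[i1] == fp || table[altBucket(i1, fp)] == fp;
    }

    // Deletion is what Bloom filters cannot offer: remove the fingerprint
    // from whichever of the two candidate buckets holds it.
    public boolean delete(Object item) {
        byte fp = fingerprint(item);
        int i1 = bucket(item), i2 = altBucket(i1, fp);
        if (table[i1] == fp) { table[i1] = 0; return true; }
        if (table[i2] == fp) { table[i2] = 0; return true; }
        return false;
    }
}
```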
Data Pipeline with Kafka. These slides cover a Kafka introduction, topics/partitions, producers/consumers, a quick start, offset monitoring, example code, and Camus.
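To complement the producer sketch earlier in this collection, a minimal consumer with a poll loop; the broker address, group id, and topic are again placeholders, and the printed offsets are exactly the quantities an offset monitor tracks:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "quickstart-group");         // group for offset tracking
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));         // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Consumer lag = log end offset minus the group's committed offset.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```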
Having started with classic monolith applications in the late 90s and adopting a new microservice architecture in 2015, our organization needed a convenient, reliable, and low-cost way to push changes back and forth between them. One that preferably utilized technology already on hand and could exchange information between multiple data stores. In this session we will explore how Kafka Connect and its various connectors satisfied this need. We will review the two disparate tech stacks we needed to integrate, and the strategies and connectors we used to exchange information. Finally, we will cover some enhancements we made to our own processes including integrating Kafka Connect and its connectors into our CI/CD pipeline and writing tools to monitor connectors in our production environment.
Interactive Queries in Apache Kafka's Streams API allow users to query the local state of a Kafka Streams application without accessing an external database, treating the application as an embedded, lightweight database. The local state is fault-tolerant and can be sharded across tasks to scale horizontally. Users can discover other application instances and their state in order to query remote state if needed. Interactive Queries simplify stateful stream processing by reducing moving parts compared to using an external database.
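A minimal sketch of a local interactive query, assuming a running KafkaStreams instance with a store materialized under the hypothetical name `counts`:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Look up a value from this instance's local shard of a queryable store.
// For keys hosted on another instance, streams.queryMetadataForKey(...) can
// locate the owner so a thin REST layer can proxy the request there.
static Long localCount(KafkaStreams streams, String key) {
    ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType("counts", QueryableStoreTypes.keyValueStore()));
    return store.get(key);   // null if the key is absent from the local shard
}
```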
1. Zillow transitioned from using multiple messaging systems and data pipelines to using Kafka as their single streaming platform to unify their data infrastructure. 2. They took a bottom-up approach to gain trust from teams by publishing service level objectives, onboarding non-critical streams quickly, and meeting developers where they were with tools like Terraform. 3. An important lesson was to treat the platform as a product by providing documentation, libraries, and blog posts to make it easy for developers to use.
This document summarizes a presentation about Kafka multi-tenancy at LINE Corporation. The presentation discusses how LINE runs a single shared Kafka cluster to handle over 100 billion messages per day from many independent services. It describes the hardware used, requirements for multi-tenancy like protecting against abusive workloads and providing isolation. It then discusses specific issues identified like slow response times caused by disk reads, and the solutions implemented like request quotas, metrics, and pre-loading data into memory to avoid blocking. The presentation concludes that after addressing these issues, their shared Kafka hosting model works efficiently while maintaining a single data hub.
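As one concrete example of the protections described, per-client quotas can be applied through Kafka's Admin API (available since Kafka 2.6; quotas themselves predate it). The client id and limits below are invented for illustration — LINE's actual mechanism may differ:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class ApplyQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        try (Admin admin = Admin.create(props)) {
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "service-a"));  // hypothetical tenant
            // Cap produce bandwidth and broker-thread usage for this client so one
            // abusive workload cannot starve the shared cluster.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, List.of(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 10_485_760.0),
                    new ClientQuotaAlteration.Op("request_percentage", 50.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```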
Presented by Michael Noll, Product Manager, Confluent. Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all. Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. Particularly, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
(Mike Graham + Dan Carroll, Comcast) Kafka Summit SF 2018 Comcast manages over 2 million miles of fiber and coax, and over 40 million in home devices. This “outside plant” is subject to adverse conditions from severe weather to power grid outages to construction-related disruptions. Maintaining the health of this large and important infrastructure requires a distributed, scalable, reliable and fast information system capable of real-time processing and rapid analysis and response. Using Apache Kafka and the Kafka Streams Processor API, Comcast built an innovative new system for monitoring, problem analysis, metrics reporting and action response for the outside plant. In this talk, you’ll learn how topic partitions, state stores, key mapping, source and sink topics and processors from the Kafka Streams Processor API work together to build a powerful dynamic system. We will dive into the details about the inner workings of the state store—how it is backed by a Kafka “changelog” topic, how it is scaled horizontally by partition and how the instances are rebuilt on startup or on processor failure. We will discuss how these state stores essentially become like materialized views in a SQL database but are updated incrementally as data flows through the system, and how this allows the developers to maintain the data in the optimal structures for performing the processing. The best part is that the data is readily available when needed by the processors. You will see how a REST API using Kafka Streams “interactive queries” can be used to retrieve the data in the state stores. We will explore the deployment and monitoring mechanisms used to deliver this system as a set of independently deployed components.
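A hedged sketch of the Processor API shapes the talk describes — a source topic, a processor, a persistent state store backed by a changelog topic, and a sink — wired into a Topology. All names are hypothetical and String default serdes are assumed; this is not Comcast's actual code:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class PlantTopology {
    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("telemetry", "plant-telemetry");           // hypothetical source topic
        topology.addProcessor("health-check", HealthProcessor::new, "telemetry");
        // The store is backed by a Kafka changelog topic, scaled horizontally by
        // partition, and rebuilt from that topic on startup or processor failure.
        topology.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("device-health"),
                Serdes.String(), Serdes.String()), "health-check");
        topology.addSink("alerts", "plant-alerts", "health-check");   // hypothetical sink topic
        return topology;
    }

    static class HealthProcessor implements Processor<String, String, String, String> {
        private KeyValueStore<String, String> store;
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.store = context.getStateStore("device-health");
        }

        @Override
        public void process(Record<String, String> record) {
            String previous = store.get(record.key());
            store.put(record.key(), record.value());  // incremental, materialized-view-style update
            if (previous != null && !previous.equals(record.value())) {
                context.forward(record);               // emit a change event downstream
            }
        }
    }
}
```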
Matías Cascallares, Confluent, Customer Success Architect. Streams is one of the buzzwords of the moment! In this presentation, we will see how to implement stream processing with Kafka Streams, which considerations to keep in mind, and take a short tour of ksqlDB as a tool. https://www.meetup.com/Mexico-Kafka/events/276717476/
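As a taste of the kind of topology such a presentation typically walks through, a hedged word-count example in the Kafka Streams DSL; the topic names are placeholders:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("text-lines")               // hypothetical input topic
               .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
               .groupBy((key, word) -> word)                       // repartition by word
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("word-counts"))
               .toStream()
               .to("word-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The rough ksqlDB counterpart of the aggregation would be a CREATE TABLE ... AS SELECT word, COUNT(*) ... GROUP BY word statement over a stream.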
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses the evolution of Netflix's data pipeline to incorporate Kafka to handle 400 billion events per day. It describes how Netflix uses Kafka clusters with different priorities and configurations. It also outlines some of the challenges of using Kafka at Netflix's scale, such as Zookeeper client issues and cluster scaling, and the solutions Netflix developed to address these challenges.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
Kafka Connect allows data ingestion into Kafka from external systems by using connectors. It provides scalability, fault tolerance, and, for supported connectors, exactly-once semantics. Connectors run as tasks within workers, which can operate in either standalone or distributed mode. The Schema Registry works with Kafka Connect to handle schema validation and evolution.
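In distributed mode, connectors are created through the Connect REST API. A hedged sketch that registers the stock FileStreamSource connector; the worker URL, file path, and topic are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector name, file, and topic are illustrative placeholders.
        String config = """
                {
                  "name": "file-source-demo",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/tmp/input.txt",
                    "topic": "file-lines"
                  }
                }""";
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```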
Confluent Platform is supporting the London Metal Exchange's Kafka Centre of Excellence across a number of projects, with the main objective of providing a reliable, resilient, scalable, and overall efficient Kafka-as-a-Service model to teams across the entire London Metal Exchange estate.
Timothy will introduce Apache Pulsar, an open-source distributed messaging and streaming platform. He will discuss how to build real-time applications using Pulsar with various libraries, schemas, languages, frameworks and tools. The presentation will cover what Pulsar is, its functions and components, how it compares to other technologies like Apache Kafka, its advantages, and how to integrate it with tools like Apache Flink, Apache Spark, Apache NiFi and more. A demo and Q&A will follow.
DevFest UK & Ireland 2022: Using Apache NiFi with Apache Pulsar for fast data on-ramp. As the Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation, and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed. I will walk through how to get started, cover some use cases and demos, and answer questions. https://www.devfest-uki.com/schedule https://linktr.ee/tspannhw
Timothy Spann: Apache Pulsar for ML Data Science Online Camp 2023 Winter Website: https://dscamp.org Youtube: https://www.youtube.com/channel/UCeHtPZ_ZLZ-nHFMUCXY81RQ FB: https://www.facebook.com/people/Data-Science-Camp/100064240830422/
Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One. Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operations teams and bank tellers to assist with assessing risk and protecting customers in a myriad of ways. Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on their visions faster and in a more scalable way through the Kafka solution, helping establish Capital One as an innovator in the banking space. Join us for this online talk on lessons learned, best practices, and technical patterns of Capital One's deployment of Apache Kafka.
- Find out how Kafka delivers on a 5-second service-level agreement (SLA) for in-branch tellers.
- Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions.
- Understand how Capital One manages Kafka Docker containers using Kubernetes.
Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69
This document provides an overview of stream data processing and common stream processing tools. It discusses streaming basics like stateful and stateless operations. It also covers microbatch vs realtime streaming and compositional vs declarative stream processing engines. Typical stream processing architectures and use cases are presented. Main considerations for projects using stream processing are outlined. An overview of popular stream processing tools like Apache Spark, Storm, Flink, and cloud services from AWS, GCP and Azure is provided. The document concludes with a case study example and questions for discussion.
Watch the webcast here: https://videos.confluent.io/watch/bfDAKeLTiq751YyiZaCgHG Speaker: Brett Randall
JConf.dev 2022 - Apache Pulsar Development 101 with Java https://2022.jconf.dev/ In this session I will get you started with real-time, cloud-native streaming programming with Java. We will start off with a gentle introduction to Apache Pulsar and set up your first standalone cluster. We will then show you how to produce and consume messages with Pulsar using several different Java libraries, including the native Java client, AMQP/RabbitMQ, MQTT, and even Kafka. After this session you will be building real-time streaming and messaging applications with Java. We will also touch on Apache Spark and Apache Flink. Timothy Spann Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/speaker/185963
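In the spirit of that 101, a hedged produce-and-consume round trip with the native Pulsar Java client against a standalone cluster; the topic and subscription names are invented for illustration:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class PulsarHello {
    public static void main(String[] args) throws Exception {
        // Assumes a local cluster started with `bin/pulsar standalone`.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("jconf-demo")                        // hypothetical topic
                .subscriptionName("jconf-sub")
                .subscriptionType(SubscriptionType.Shared)  // multiple consumers share the work
                .subscribe();

        client.newProducer(Schema.STRING).topic("jconf-demo").create()
              .send("hello, Pulsar");

        Message<String> msg = consumer.receive();
        System.out.println("received: " + msg.getValue());
        consumer.acknowledge(msg);                          // acked messages are not redelivered

        consumer.close();
        client.close();
    }
}
```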
Music City Data: Hail Hydrate! From stream to lake with Apache Pulsar, Apache Flink, Apache NiFi, data lakes, cloud, streaming data, and IoT.
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable message broker, built for exchanging huge amounts of messages between a source and a target. This session will start with an introduction to Apache Kafka and present the role of Apache Kafka in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka in the Oracle stack, with products such as GoldenGate, Service Bus, and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.