Apache Kafka is a distributed message broker built to handle large volumes of real-time data
efficiently.
It is used as a Pub/Sub messaging system.
A Kafka cluster is highly scalable and fault tolerant.
It achieves much higher throughput than other message brokers such as ActiveMQ or
RabbitMQ.
Latency is under 10 ms, which makes it effectively real time.
It integrates with Spark, Flink, Storm, Hadoop, and many other Big Data technologies.
Three key capabilities:
1. Publish and subscribe to streams of records, similar to a message queue or enterprise
messaging system.
2. Store streams of records in a fault-tolerant, durable way.
3. Process streams of records as they occur.
Topics:
A topic is a feed name to which records are published.
A topic can have zero, one, or many consumers that subscribe to the data written to it.
Partitions:
For each topic, the data stream is split into partitions.
Each partition is an ordered, immutable sequence of records.
The records in a partition are each assigned a sequential ID number called the offset, which
uniquely identifies each record within the partition.
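As a concrete sketch, topics and their partition counts can be created programmatically with Kafka's Java AdminClient. The broker address, topic name, partition count, and replication factor below are illustrative assumptions, not values from this deck:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Hypothetical topic "events": 3 partitions, replication factor 2.
            NewTopic topic = new NewTopic("events", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```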
The Producer API allows an application to publish a stream of records to one or more Kafka topics.
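A minimal producer sketch using the Java client; the broker address and topic name are assumptions carried over from the example above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one keyed record to the hypothetical "events" topic.
            producer.send(new ProducerRecord<>("events", "key-1", "hello kafka"));
        } // close() flushes any buffered records
    }
}
```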
The Consumer API allows an application to subscribe to one or more topics and process the stream
of records produced to them.
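A matching consumer sketch, again with an assumed broker address and a hypothetical group ID:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                // Poll the broker for new records and print each one with its offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```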
The Streams API allows an application to act as a stream processor, consuming an input stream
from one or more topics and producing an output stream to one or more output topics, effectively
transforming the input streams to output streams.
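A small Streams sketch of that topic-to-topic transformation, uppercasing each value; the application ID, broker address, and topic names are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume an input topic, transform each value, produce to an output topic.
        KStream<String, String> source = builder.stream("events");
        source.mapValues(value -> value.toUpperCase()).to("events-uppercase");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```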
The Connector API allows building and running reusable producers or consumers that connect Kafka
topics to existing applications or data systems.
Order is guaranteed only within a partition.
Once data is written to a partition, it can't be changed.
Data is assigned randomly to a partition – unless a key is provided.
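To make the keying rule concrete, here is a simplified sketch of the idea behind keyed partition assignment. Kafka's real DefaultPartitioner applies murmur2 to the serialized key bytes; String.hashCode() stands in here purely for illustration:

```java
// Simplified sketch of keyed partition assignment.
// Kafka's DefaultPartitioner hashes serialized key bytes with murmur2;
// hashCode() is used here only to keep the example readable.
static int partitionFor(String key, int numPartitions) {
    // Mask off the sign bit so the result is a valid partition index.
    return (key.hashCode() & 0x7fffffff) % numPartitions;
}
```

Because the hash of a given key is stable, every record with that key lands in the same partition, which is what makes per-key ordering possible.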
Brokers:
A Kafka cluster is composed of multiple brokers (servers).
Each broker has its own unique ID.
[Diagram: partitions of Topic 1 (partitions 0, 1, 2) and Topic 2 (partitions 0, 1) distributed across Broker1, Broker2, and Broker3]
Topic Replication Factor:
[Diagram: with a replication factor of 2, partitions 0 and 1 of Topic 1 are each stored on two of Broker1, Broker2, and Broker3]
If Broker2 is down, Broker1 and Broker3 can still serve the data.
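One way to observe this layout is the AdminClient's describeTopics call, which reports each partition's current leader and its replicas; a sketch, with the broker address and topic name assumed as before:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (Admin admin = Admin.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singleton("events"))
                    .all().get().get("events");
            for (TopicPartitionInfo partition : description.partitions()) {
                // leader() is the broker currently serving the partition;
                // replicas() lists every broker holding a copy.
                System.out.printf("partition=%d leader=%d replicas=%s%n",
                        partition.partition(), partition.leader().id(), partition.replicas());
            }
        }
    }
}
```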
Zookeeper:
[Diagram: a Kafka cluster of Broker1, Broker2, and Broker3 coordinated by Zookeeper]
Zookeeper keeps a list of Kafka brokers.
Zookeeper sends notifications to Kafka when changes occur (a new topic is created, a broker
dies, a broker comes up, a topic is deleted, etc.).
Kafka can't work without Zookeeper.
Zookeeper usually operates as a quorum (cluster) with an odd number of servers (1, 3, 5, 7, ...).
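The odd sizes follow from majority voting: an ensemble of 2n+1 servers keeps a working majority as long as at most n servers fail. A one-line sketch of that arithmetic:

```java
// Majority-quorum arithmetic: an ensemble of size 2n+1 tolerates n failures,
// e.g. 3 servers tolerate 1 failure, 5 tolerate 2, 7 tolerate 3.
static int toleratedFailures(int ensembleSize) {
    return (ensembleSize - 1) / 2;
}
```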
[Diagram: a three-server Zookeeper quorum, with Zookeeper2 as leader and Zookeeper1 and Zookeeper3 as followers, managing Kafka Broker1 through Broker5]