Apache Kafka is a distributed message broker built to handle large volumes of real-time data
efficiently.
It is used as a Pub/Sub messaging system.
A Kafka cluster is highly scalable and fault tolerant.
It achieves much higher throughput than other message brokers such as ActiveMQ or
RabbitMQ.
Latency is under 10 ms, which makes it effectively real time.
It integrates with Spark, Flink, Storm, Hadoop, and many other Big Data technologies.
Three key capabilities:
1. Publish and subscribe to streams of records, similar to a message queue or enterprise
messaging system.
2. Store streams of records in a fault-tolerant, durable way.
3. Process streams of records as they occur.
Topics:
A topic is a feed name to which records are published.
A topic can have zero, one, or many consumers that subscribe to the data written to it.
Partitions:
For each topic, the data stream is split into partitions.
Each partition is an ordered, immutable sequence of records.
The records in a partition are each assigned a sequential ID number called the offset, which
uniquely identifies each record within the partition.
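As a concrete sketch, topics and their partition counts can be created programmatically with Kafka's Java AdminClient. The broker address, topic name, partition count, and replication factor below are illustrative assumptions, not values from this deck:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Hypothetical topic "events": 3 partitions, replication factor 2.
            NewTopic topic = new NewTopic("events", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```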
The Producer API allows an application to publish a stream of records to one or more Kafka topics.
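A minimal producer sketch using the Java client; the broker address and topic name are assumptions carried over from the example above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one keyed record to the hypothetical "events" topic.
            producer.send(new ProducerRecord<>("events", "key-1", "hello kafka"));
        } // close() flushes any buffered records
    }
}
```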
The Consumer API allows an application to subscribe to one or more topics and process the stream
of records produced to them.
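A matching consumer sketch, again with an assumed broker address and a hypothetical group ID:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                // Poll the broker for new records and print each one with its offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```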
The Streams API allows an application to act as a stream processor, consuming an input stream
from one or more topics and producing an output stream to one or more output topics, effectively
transforming the input streams to output streams.
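A small Streams sketch of that topic-to-topic transformation, uppercasing each value; the application ID, broker address, and topic names are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume an input topic, transform each value, produce to an output topic.
        KStream<String, String> source = builder.stream("events");
        source.mapValues(value -> value.toUpperCase()).to("events-uppercase");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```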
The Connector API allows building and running reusable producers or consumers that connect Kafka
topics to existing applications or data systems.
Order is guaranteed only within a partition.
Once data is written to a partition, it can't be changed.
Data is assigned randomly to a partition – unless a key is provided.
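To make the keying rule concrete, here is a simplified sketch of the idea behind keyed partition assignment. Kafka's real DefaultPartitioner applies murmur2 to the serialized key bytes; String.hashCode() stands in here purely for illustration:

```java
// Simplified sketch of keyed partition assignment.
// Kafka's DefaultPartitioner hashes serialized key bytes with murmur2;
// hashCode() is used here only to keep the example readable.
static int partitionFor(String key, int numPartitions) {
    // Mask off the sign bit so the result is a valid partition index.
    return (key.hashCode() & 0x7fffffff) % numPartitions;
}
```

Because the hash of a given key is stable, every record with that key lands in the same partition, which is what makes per-key ordering possible.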
Brokers:
A Kafka cluster is composed of multiple brokers (servers).
Each broker has its own unique ID.
[Diagram: partitions of Topic 1 (partitions 0, 1, 2) and Topic 2 (partitions 0, 1) distributed across Broker1, Broker2, and Broker3]
Topic Replication Factor:
[Diagram: with a replication factor of 2, partitions 0 and 1 of Topic 1 are each stored on two of Broker1, Broker2, and Broker3]
If Broker2 is down, Broker1 and Broker3 can still serve the data.
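One way to observe this layout is the AdminClient's describeTopics call, which reports each partition's current leader and its replicas; a sketch, with the broker address and topic name assumed as before:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (Admin admin = Admin.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singleton("events"))
                    .all().get().get("events");
            for (TopicPartitionInfo partition : description.partitions()) {
                // leader() is the broker currently serving the partition;
                // replicas() lists every broker holding a copy.
                System.out.printf("partition=%d leader=%d replicas=%s%n",
                        partition.partition(), partition.leader().id(), partition.replicas());
            }
        }
    }
}
```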
Zookeeper:
[Diagram: a Kafka cluster of Broker1, Broker2, and Broker3 coordinated by Zookeeper]
Zookeeper keeps a list of Kafka brokers.
Zookeeper sends notifications to Kafka when changes occur (a new topic is created, a broker
dies, a broker comes up, a topic is deleted, etc.).
Kafka can't work without Zookeeper.
Zookeeper usually operates as a quorum (cluster) with an odd number of servers (1, 3, 5, 7, ...).
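The odd sizes follow from majority voting: an ensemble of 2n+1 servers keeps a working majority as long as at most n servers fail. A one-line sketch of that arithmetic:

```java
// Majority-quorum arithmetic: an ensemble of size 2n+1 tolerates n failures,
// e.g. 3 servers tolerate 1 failure, 5 tolerate 2, 7 tolerate 3.
static int toleratedFailures(int ensembleSize) {
    return (ensembleSize - 1) / 2;
}
```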
[Diagram: a three-server Zookeeper quorum, with Zookeeper2 as leader and Zookeeper1 and Zookeeper3 as followers, managing Kafka Broker1 through Broker5]