Introduction to Apache Kafka &
LinkedIn Camus
Deep Shah
Software Engineer Intern
Twitter : @dsshah22
LinkedIn : https://www.linkedin.com/in/deepshah22
Architecture…!!!
A distributed system consists of multiple computers that communicate and coordinate their actions by passing messages. The
components interact with each other to achieve a common goal. ZooKeeper provides the coordination primitives such a system needs:
● Configuration Management
○ Cluster member nodes bootstrap their configuration from a central source
● Distributed Cluster Management
○ Node Join/Leave
○ Node Status in real time
● Naming Service – e.g. DNS
● Distributed Synchronization – locks, barriers
● Leader election
● Centralized and highly reliable registry
2
Apache Zookeeper
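As a concrete sketch of the registry role: Kafka brokers register themselves as ephemeral znodes under /brokers/ids, and any ZooKeeper client can read that registry. The following minimal Java example is illustrative only (the connect string, timeout, and class name are assumptions, not from the deck):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        // Connect to a local ZooKeeper ensemble (address and timeout are assumed).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });
        // Brokers register ephemeral znodes under /brokers/ids; the znodes
        // vanish automatically when a broker leaves, giving real-time status.
        List<String> brokerIds = zk.getChildren("/brokers/ids", false);
        System.out.println("Live brokers: " + brokerIds);
        zk.close();
    }
}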
Architecture…!!!
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
● Broker: A server in the Kafka cluster; a cluster consists of one or more brokers.
● Topics: The categories in which Kafka maintains its feeds of messages.
● Producers: The processes that publish messages to a topic.
● Consumers: The processes that subscribe to topics to fetch the published messages.
● Uses ZooKeeper to form a cluster of nodes (producers/consumers/brokers)
3
Apache Kafka
[Diagram: Producer → Broker (Topic) → Consumer]
4
Apache Kafka
Broker
● A Kafka cluster consists of one or more servers, each called a broker.
[Diagram: three brokers (Broker-1, Broker-2, Broker-3) spanning Server-1 through Server-8]
Architecture…!!!
● Brokers receive messages from producers (push) and deliver them to consumers (pull).
● Broker: A server in the Kafka cluster; a cluster consists of one or more brokers.
● Topics: The categories in which Kafka maintains its feeds of messages.
● Producers: The processes that publish messages to a topic.
● Consumers: The processes that subscribe to topics to fetch the published messages.
5
Apache Kafka
[Diagram: Producer → Broker (Topic) → Consumer]
● A topic is a category or feed name to which messages are published.
● Each partition is an ordered, immutable sequence of messages that is
continually appended to—a commit log.
● Each message in a partition is assigned a sequential id number called the
offset, which uniquely identifies it within the partition. The offsets are
kept in Kafka's data directory.
● The offset is controlled by the consumer: normally a consumer advances
its offset linearly as it reads messages, but since the position is
controlled by the consumer, it can consume messages in any order it
likes.
● The partitions of the log are distributed over the servers.
● Each server handles data and requests for a share of the partitions.
● Each partition is replicated across a configurable number of servers for
fault tolerance.
Topics and Partitions
6
Apache Kafka
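Since the consumer controls the offset, it can rewind or skip at will. A minimal sketch using the modern Java consumer client (newer than this deck; the topic name, partition, and offset are made up for illustration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign one partition of a hypothetical "events" topic...
            TopicPartition tp = new TopicPartition("events", 0);
            consumer.assign(Collections.singletonList(tp));
            // ...and jump straight to offset 42: the position is the consumer's to choose.
            consumer.seek(tp, 42L);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records)
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
        }
    }
}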
● Each partition has one server which acts as the leader and zero or more servers which act as followers.
● The leader handles all read and write requests for the partition, while followers passively replicate the leader.
● This replication helps to retain messages on leader’s failure. If the leader fails, one of the followers automatically becomes the new leader.
● Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
7
Apache Kafka
Leader and Follower
[Diagram: Partition-1 replicated across Server-1 through Server-4, with one leader and three followers; when the leader's server fails, a follower on another server becomes the new leader]
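To see which broker leads each partition and which replicas follow, the modern AdminClient (added to Kafka well after this deck) can describe a topic. A hedged sketch, assuming a topic named "events":

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("events"))
                    .all().get().get("events");
            for (TopicPartitionInfo p : desc.partitions()) {
                // One broker leads each partition; the remaining replicas follow it.
                System.out.printf("partition=%d leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}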
● Brokers receive messages from producers (push) and deliver them to consumers (pull).
● Broker: A server in the Kafka cluster; a cluster consists of one or more brokers.
● Topics: The categories in which Kafka maintains its feeds of messages.
● Producers: The processes that publish messages to a topic.
● Consumers: The processes that subscribe to topics to fetch the published messages.
Architecture…!!!
8
Apache Kafka
[Diagram: Producer → Broker (Topic) → Consumer]
● Brokers receive messages from producers (push) and deliver them to consumers (pull).
● Producers
○ Producers publish data to the topics of their choice.
○ The producer is responsible for choosing which message to assign to which
partition within the topic, either in a round-robin fashion or via a simple
partition function (e.g., hashing the message key).
● Consumers
○ Consumers request a range of messages from a Broker.
○ Messaging Models: queuing and publish-subscribe.
○ Consumers label themselves with a consumer group name and each message
published to a topic is delivered to one consumer instance within each
subscribing consumer group.
○ Consumer instances can be on separate processes or on separate machines.
Producers and Consumers
9
Apache Kafka
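A minimal producer sketch with the Java client, showing both partitioning modes described above: records sent without a key are spread across partitions by the client, while records with the same key always hash to the same partition (the topic name and broker address are assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProduceExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the client spreads these messages over the topic's partitions.
            producer.send(new ProducerRecord<>("events", "no-key message"));
            // With a key: every message keyed "user-42" lands in the same partition,
            // so its relative order is preserved for consumers.
            producer.send(new ProducerRecord<>("events", "user-42", "keyed message"));
        }
    }
}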
Offset Management
10
● All consumer offset commit requests are sent as produce requests to a
special topic named "__consumer_offsets", referred to as the "offsets topic"
from here on.
● Offset commit messages are partitioned by the consumer group in the key.
This means all messages for a given consumer group end up on a single
broker, so offset fetch requests can be served without scatter-gathering
from several brokers.
Apache Kafka
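From the application side an offset commit is a single call; under the hood it becomes a produce request to the offsets topic, keyed by the group. A sketch with the Java consumer (the group name and topic are illustrative):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");      // commits are keyed by this group
        props.put("enable.auto.commit", "false"); // commit explicitly instead
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            consumer.poll(Duration.ofSeconds(1));
            // Synchronously records this group's positions in the internal offsets topic.
            consumer.commitSync();
        }
    }
}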
● Camus is LinkedIn's Kafka → HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka.
● A single execution of Camus consists of three stages:
a. The setup stage fetches available topics and partitions from ZooKeeper and the latest offsets from the Kafka nodes.
b. The Hadoop job stage allocates topic pulls among a set number of tasks.
c. The cleanup stage reads counts from all tasks, aggregates the values, and submits the result to Kafka for consumption by Kafka audits.
Architecture…!!!
11
LinkedIn Camus
● The setup stage fetches Kafka broker URLs and topics from ZooKeeper (under /brokers/ids and /brokers/topics). This data is transient and is gone
once the Kafka server is down.
● Topic offsets are stored in HDFS. Camus maintains its own state by storing the offset for each topic in HDFS. This data is persistent.
● The setup stage allocates all topics and partitions among a fixed number of tasks.
● Setup stage allocates all topics and partitions among a fixed number of tasks.
1. Setup Stage
12
LinkedIn Camus
I. Pulling the Data
Each Hadoop task takes as input a list of topic partitions with offsets generated by the setup stage. It uses them to initialize Kafka requests
and fetch events from Kafka brokers. Each task generates four types of output (via a custom MultipleOutputFormat): data files, count
statistics files, updated offset files, and error files.
II. Committing the Data
Once a task has successfully completed, all topics pulled are committed to their final output directories. If a task doesn't complete
successfully, none of its output is committed. When a task appears to be running slowly, the job tracker schedules a speculative copy of the
task on a different node and runs both in parallel; once one of the tasks completes, the other is killed.
III. Producing Audit Counts
Successful tasks also write audit counts to HDFS.
IV. Storing the Offsets
Updated offsets are written back to HDFS, where the next run's setup stage picks them up.
2. Hadoop Job
13
LinkedIn Camus
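The slow-task handling described above is Hadoop's standard speculative execution, which a job can toggle through its configuration. A minimal sketch (the property name is standard Hadoop 2.x, not Camus-specific; the job name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow backup attempts for slow map tasks; whichever attempt
        // finishes first wins and the other attempt is killed.
        conf.setBoolean("mapreduce.map.speculative", true);
        Job job = Job.getInstance(conf, "camus-style-pull");
        // ... set input/output formats, mapper, etc., then submit the job.
        System.out.println("speculative maps: "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", false));
    }
}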
● Once the Hadoop job has completed, the main client reads all the written audit counts and aggregates them. The aggregated results are then
submitted to Kafka.
3. Job Cleanup
14
LinkedIn Camus
1. http://kafka.apache.org/
2. https://github.com/linkedin/camus
3. http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
References
15