Kafka - Quick Intro
STRAKIN Copyright © 2019, strakin and/or its affiliates. All rights reserved
Problem
• An organisation can have multiple servers at the front end, such as a web or application server hosting the website or application, a chat server providing chat facilities to customers, a separate server for payments, etc.
• An organisation can also have multiple servers at the back end, each receiving messages from different servers based on its requirements: security systems for user authentication and authorization, a real-time monitoring system, a data warehouse, etc.
• As you can see, the data pipelines grow more complex as the number of systems increases. Adding a new system or server requires more data pipelines, which again makes the data flow more complicated.
• Managing these data pipelines also becomes very difficult, since each data pipeline has its own set of requirements, and adding or removing pipelines in such a complex system is harder still.
• LinkedIn (a social network designed for career and business professionals to connect) ran into exactly this problem.
LinkedIn - Use Case
• Some of the LinkedIn activity data captured includes:
• Page visits and clicks
• User activities
• Events corresponding to logins
• Social networking activities such as likes, shares, and comments
• Application-specific metrics (e.g. logs, page load time, performance, etc.)
• This data can be used to run analytics in real time, serving various purposes such as:
• Delivering advertisements
• Tracking abnormal user behaviour
• Ranking search results by relevance
• Showing recommendations based on previous activities
Problem: Collecting all of this data is not easy, as it is generated from various sources in different formats.
Solution: One way to solve this problem is to use a messaging system. Messaging systems provide seamless integration between distributed applications with the help of messages.
Solution - Apache Kafka
• Apache Kafka is an open-source stream-processing software platform originally developed at LinkedIn, which later became an Apache project.
• The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
• Its storage layer is essentially a "massively scalable pub/sub message queue designed as a distributed transaction log", making it highly valuable for enterprise infrastructures that process streaming data.
• Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream-processing library (a brief sketch follows).
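To give a feel for what Kafka Streams looks like, here is a minimal, hedged sketch (not from the original deck): it assumes an additional org.apache.kafka:kafka-streams dependency on the classpath and two pre-created, hypothetical topics, input-topic and output-topic. It copies each record, upper-casing the value.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class StreamsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-demo");      // app id, also used as the consumer group
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumes a local broker on the default port

        StreamsBuilder builder = new StreamsBuilder();
        // Read from "input-topic", transform each value, and write to "output-topic"
        // (both topic names are hypothetical and must be created beforehand).
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(value -> value.toUpperCase())
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close)); // close cleanly on Ctrl-C
    }
}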
Kafka - Architecture
[Architecture diagram: multiple producers publish to the Kafka cluster and multiple consumers read from it; inside the cluster, a primary broker is accompanied by replica brokers, with Apache ZooKeeper coordinating the cluster.]
Topic: A stream of messages belonging to a particular category is called a topic
Producer: A producer is any application that can publish messages to a topic
Consumer: A consumer is any application that subscribes to topics and consumes their messages
Broker: A Kafka cluster is a set of servers, each of which is called a broker
Kafka - Inside Broker
Partition: Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting its data across multiple brokers; each partition can be placed on a separate machine, allowing multiple consumers to read from a topic in parallel
Leader: A single partition can be replicated across many brokers, but only one broker at a time is considered the owner of the partition, known as its leader
Follower: The other brokers holding the same data (messages) as the leader are called in-sync replicas, or followers
Example: a topic with 4 partitions and a replication factor of 3, spread across 3 brokers (each partition has exactly one leader):

              Broker 1    Broker 2    Broker 3
Partition 0   Follower    Leader      Follower
Partition 1   Leader      Follower    Follower
Partition 2   Follower    Follower    Leader
Partition 3   Follower    Leader      Follower
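Once a topic exists (see the setup slides that follow), you can inspect exactly this leader/follower layout yourself. With the Kafka 2.1.x tooling used in this deck, the --describe option prints each partition's leader, replicas, and in-sync replicas (ISR):

./bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test-topic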
Kafka - Cluster Types
Kafka is scalable and allows creation of multiple types of clusters:
• Single Node, Single Broker: one Kafka broker runs on a single machine, coordinated by Apache ZooKeeper; all producers and consumers talk to that one broker
• Single Node, Multiple Brokers: several brokers (Broker 1, Broker 2, Broker 3) run on the same machine, still coordinated by a single ZooKeeper (see the sketch after this list)
• Multiple Nodes, Multiple Brokers: the brokers are spread across machines (in the diagram, two brokers on each of Node 1 and Node 2), all coordinated by ZooKeeper

What's the role of ZooKeeper? Each Kafka broker coordinates with the other Kafka brokers using ZooKeeper. Producers and consumers are notified by the ZooKeeper service about the presence of new brokers, or the failure of a broker, in the Kafka system
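As a concrete sketch of the Single Node, Multiple Brokers setup (this follows the pattern in the official Kafka quickstart; the ports and paths below are illustrative, not from the original deck): give each broker its own copy of the config with a unique id, listener port, and log directory, then start each broker in its own terminal.

cd $KAFKA_HOME
cp config/server.properties config/server-1.properties
cp config/server.properties config/server-2.properties

# In config/server-1.properties set:
#   broker.id=1
#   listeners=PLAINTEXT://:9093
#   log.dirs=/tmp/kafka-logs-1
# In config/server-2.properties set:
#   broker.id=2
#   listeners=PLAINTEXT://:9094
#   log.dirs=/tmp/kafka-logs-2

./bin/kafka-server-start.sh config/server-1.properties
./bin/kafka-server-start.sh config/server-2.properties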
Getting started with Kafka
Prerequisite components and download links:
• Java - https://www.oracle.com/technetwork/java/javase/downloads/index.html
• Apache Zookeeper - https://zookeeper.apache.org/releases.html
• Apache Kafka - https://kafka.apache.org/downloads
Kafka - Installation and setup
After a successful download, add Kafka to your $PATH variable as follows:
1) Open .bash_profile (on Mac) by typing vi ~/.bash_profile
2) Add the following lines to that file (replace the path with your own folder location):
export KAFKA_HOME=/Users/jothibasu/Downloads/kafka_2.12-2.1.1
export PATH=$PATH:$KAFKA_HOME/bin
3) Restart your terminal after saving the file
Start the Apache ZooKeeper server with the following commands:
1) Type cd $KAFKA_HOME in your terminal (navigates to the Kafka home directory)
2) Start the server:
./bin/zookeeper-server-start.sh config/zookeeper.properties
Open a new terminal and start the Apache Kafka server:
1) Type cd $KAFKA_HOME in your terminal (navigates to the Kafka home directory)
2) Start the server:
./bin/kafka-server-start.sh config/server.properties
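As an optional sanity check (using the zookeeper-shell tool that ships with the Kafka distribution), you can list the broker ids that have registered with ZooKeeper; a single-broker setup should print something like [0]:

./bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids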
Kafka - Create Topic & Publish Messages
Next, we’ll create a topic to send messages to:
1) Type cd $KAFKA_HOME in your terminal (navigates to the Kafka home directory)
2) ./bin/kafka-topics.sh --create --zookeeper <<ipaddress:port>> --replication-factor <<n>> --partitions <<n>> --topic <<topic-name>>
Eg: ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test-topic
3) To check that the topic was created, list the topics:
./bin/kafka-topics.sh --list --zookeeper localhost:2181
Next, start a consumer (note that the broker listens on port 9092 by default):
1) Type cd $KAFKA_HOME in your terminal (navigates to the Kafka home directory)
2) ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning
Next, start a producer in a new tab:
1) Type cd $KAFKA_HOME in your terminal (navigates to the Kafka home directory)
2) ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
Note
1) ’N’ number of tabs - ’N’ number of consumers/producers
2) All of the above is only for testing purposes
Producer / Consumer - Java API
Get started by creating a Maven project:
1) Add the following dependencies to your pom.xml
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.12</version>
</dependency>
2) Create separate classes for producer and consumer
1) Producer.java
2) Consumer.java
Producer API
package com.strakin.kafkalearning;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

public class Producer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "my-transactional-id"); // enables the transactional API below

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props, new StringSerializer(),
                new StringSerializer());
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // Send five records to "test-topic" atomically, using the loop index as both key and value
            for (int i = 0; i < 5; i++)
                producer.send(new ProducerRecord<String, String>("test-topic", Integer.toString(i), Integer.toString(i)));
            producer.commitTransaction();
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // We can't recover from these exceptions, so our only option is to close the
            // producer and exit.
            producer.close();
        } catch (KafkaException e) {
            // For all other exceptions, just abort the transaction and try again.
            producer.abortTransaction();
        }
        producer.close();
    }
}
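One caveat worth checking against your broker config (an assumption, not from the original deck): the transactional API relies on an internal transaction-state topic whose broker-side replication defaults are 3 and 2. The config/server.properties shipped with recent Kafka versions already lowers these for local testing, but if yours does not, a single-broker setup needs the following in config/server.properties before starting the broker:

# Required for transactions on a single-broker test cluster
# (the shipped default config may already contain these lines)
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1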
Consumer API
package com.strakin.kafkalearning;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test-topic", "abc")); // subscribe to one or more topics
        while (true) {
            // poll(long) is deprecated since kafka-clients 2.0; use the Duration overload
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records)
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
        }
    }
}
Note
1) All of the above is only for testing purposes
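To try the pair out (assuming the standard Maven project layout; the exec-maven-plugin is resolved by prefix, so no extra pom configuration is needed), run the consumer and the producer in separate terminals:

mvn compile exec:java -Dexec.mainClass=com.strakin.kafkalearning.Consumer
mvn compile exec:java -Dexec.mainClass=com.strakin.kafkalearning.Producer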
“Thank You”