A Short Presentation on Kafka
Rayapati Praveen
&
Mostafa Jubayer Khan
Contents
● Definitions
● History
● Kafka Architecture
● Capabilities & Core API
● Advantages
● Limitations
● Usage
● References
● Future challenges
Apache Kafka® is a distributed streaming platform. What exactly does that mean? A stream is a pipeline through which your applications receive data continuously.
Kafka is an open-source distributed streaming platform that simplifies data integration between systems.
Created and open sourced by LinkedIn in 2011, it is written in Scala and Java.
Kafka has quickly evolved from a messaging queue into a full-fledged streaming platform.
A streaming platform has three key capabilities:
● Publish & subscribe to streams of records, similar to a message queue or enterprise messaging system.
● Store streams of records in a fault-tolerant durable way.
● Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
● Data Integration: Building real-time streaming data pipelines that reliably get data between systems or applications
● Stream Processing: Building real-time streaming applications that transform or react to the streams of data
Architecture, Capabilities & Core API
Kafka system has three main components:
A Producer: the service that emits the source data.
A Broker: acts as an intermediary between producers and consumers,
using APIs to receive and broadcast data.
A Consumer: the service that uses the data the broker broadcasts.
Kafka, in general:
● Runs as a cluster on one or more servers that can span multiple datacenters.
● Stores streams of records in categories called topics.
● Each record consists of a key, a value, and a timestamp.
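The points above can be sketched as a toy in-memory model (illustrative only, not the real Kafka client; all names here are invented for the example): a broker holds named topics, each an ordered log of key/value/timestamp records.

```python
import time
from dataclasses import dataclass, field

# A Kafka-style record: key, value, and timestamp (toy model).
@dataclass
class Record:
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)

# A toy broker: topics are named, append-only lists of records.
class ToyBroker:
    def __init__(self):
        self.topics = {}

    def publish(self, topic, record):
        # Records for a topic are appended in order, like a log.
        self.topics.setdefault(topic, []).append(record)

    def read(self, topic):
        # Consumers read the stored records for a topic.
        return list(self.topics.get(topic, []))

broker = ToyBroker()
broker.publish("page-views", Record(key="user-1", value="/home"))
broker.publish("page-views", Record(key="user-2", value="/cart"))

records = broker.read("page-views")
print([(r.key, r.value) for r in records])  # [('user-1', '/home'), ('user-2', '/cart')]
```

The real Kafka partitions each topic and replicates partitions across brokers; this sketch only shows the record and topic shape.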
Kafka has four core Application Programming Interfaces (APIs):
● The Producer API publishes a stream of records to one or more Kafka topics.
● The Consumer API subscribes to one or more topics and processes the stream of records.
● The Streams API acts as a stream processor, transforming input streams into output streams.
● The Connector API builds and runs reusable producers or consumers that import and export large volumes of data between Kafka and other systems, such as databases.
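As a rough illustration of the Streams idea (a hand-rolled sketch, not the actual Streams API), a stream processor consumes records from an input topic, transforms them, and produces the results to an output topic:

```python
# Toy topics: plain lists of (key, value) pairs standing in for Kafka topics.
input_topic = [("user-1", "hello world"), ("user-2", "kafka streams")]
output_topic = []

def process(stream):
    # Transform each input record into an output record
    # (here: upper-case the value), mimicking a map step.
    for key, value in stream:
        yield key, value.upper()

# "Produce" the transformed stream to the output topic.
output_topic.extend(process(input_topic))
print(output_topic)  # [('user-1', 'HELLO WORLD'), ('user-2', 'KAFKA STREAMS')]
```

The real Streams API additionally handles partitioning, state stores, and fault tolerance; only the transform step is modelled here.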
Advantages
● Handles complex and heavy data-pipeline loads for data integration better than alternatives such as Redis, RabbitMQ, AMQP-based brokers, or Microsoft Azure Service Bus.
● Can apply a series of validations and transformations to the data
● Keeps a record of the information, called a commit log, for later consumption
● Fault-tolerant, replayable, real-time & reliable to use
● Works with external stream processing systems, e.g. Apache Apex, Apache Flink, Apache Spark, and Apache Storm.
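The commit-log and replay idea above can be illustrated with a minimal sketch (toy code, not Kafka's implementation): each consumer keeps an offset into an append-only log and can rewind it to re-consume earlier records.

```python
# An append-only log standing in for a topic's commit log.
log = ["event-0", "event-1", "event-2", "event-3"]

class ToyConsumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0  # index of the next record to read

    def poll(self, max_records=2):
        # Read up to max_records starting at the current offset.
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Replay: rewind to an earlier offset.
        self.offset = offset

consumer = ToyConsumer(log)
print(consumer.poll())   # ['event-0', 'event-1']
print(consumer.poll())   # ['event-2', 'event-3']
consumer.seek(0)         # rewind and replay from the beginning
print(consumer.poll(4))  # ['event-0', 'event-1', 'event-2', 'event-3']
```

Because the log is retained rather than deleted on read, multiple consumers can read at their own pace, which is what makes Kafka "replayable".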
Limitations
● It is NOT plug & play.
● You need to write a fair amount of application code.
● Experts usually do not prefer it for streaming small volumes of data.
● You need to understand its configuration parameters to customize or tune Kafka's behaviour to your requirements.
● Streaming data between older and newer versions of Kafka can be problematic.
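To give a flavour of that tuning surface, a few commonly adjusted producer settings are shown below as a plain dictionary (parameter names come from Kafka's producer configuration; the values are arbitrary examples, and no client is actually created here):

```python
# Commonly tuned Kafka producer settings (names per Kafka's
# producer configuration; values are illustrative only).
producer_config = {
    "bootstrap.servers": "localhost:9092",  # brokers to contact first
    "acks": "all",              # wait for all in-sync replicas to acknowledge
    "retries": 3,               # retry transient send failures
    "batch.size": 16384,        # max bytes to batch per partition
    "linger.ms": 5,             # wait up to 5 ms to fill a batch
    "compression.type": "lz4",  # compress batches on the wire
}

print(sorted(producer_config))
```

Brokers and consumers each have their own comparable set of knobs (retention, partitions, fetch sizes), which is why tuning Kafka requires reading its configuration documentation.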
Users:
Apple Inc., Netflix, Walmart, Cisco Systems,
eBay, PayPal, The New York Times, etc.
References
1. http://kafka.apache.org/intro
2. https://www.youtube.com/watch?v=udnX21__SuU&t=57s
3. https://www.youtube.com/watch?v=dq-ZACSt_gA
4. https://en.wikipedia.org/wiki/Apache_Kafka
5. https://scotch.io/tutorials/build-a-distributed-streaming-system-with-apache-kafka-and-pythons
Any Questions?

Editor's Notes

  1. Hi everyone. We, Mostafa and Praveen, welcome you all to our presentation on Apache Kafka, the topic we chose for our project. Briefly, it is a distributed streaming platform for data integration.
  2. Here are the contents of our discussion throughout the paper: a brief overview of Apache Kafka.
  3. Kafka started its journey about 8 years ago, in 2011, at LinkedIn. Since then it has steadily evolved into a large-scale enterprise queuing and messaging system.
  4. Streams API: consumes an input stream from one or more topics and produces an output stream to one or more output topics. Connector API: connects Kafka topics to existing applications or data systems; for example, a connector to a relational database might capture every change to a table.
  5. Collecting data from mobile devices, sensors, and machine-learning pipelines in real time; the data are immutable. In the paper we cover all topics in detail to give a clear picture of Apache Kafka and how it works.
  6. How do you make data available to applications across a wide area network? How do you serve data efficiently from closer geographies? How do you implement data sovereignty rules?