Kafka basics
What’s Kafka
• It’s an open-source message broker written in Scala
and Java.
• Which aims to provide a unified, high-throughput,
low-latency platform for handling real-time data
feeds.
• Whose design is heavily influenced by transaction
logs.
Kafka is also…
• A distributed, partitioned, replicated commit log
service.
• A streaming process platform.
• A system supporting both the queue and the
publish/subscribe paradigms.
Use Cases
Kafka concepts
• Maintains feeds of messages in categories called
topics.
• Processes that publish messages to Kafka are
called producers.
• Processes that subscribe to topics and process the
feed of published messages are called consumers.
• Kafka runs as a cluster of one or more servers,
each of which is called a broker.
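The four concepts above can be sketched as a tiny in-memory model (hypothetical names, not the real Kafka API):

```scala
import scala.collection.mutable

// Minimal sketch: a broker holds topics; a topic is a feed of messages.
class MiniBroker {
  private val topics = mutable.Map.empty[String, mutable.ArrayBuffer[String]]

  // A producer publishes messages to a named topic.
  def publish(topic: String, message: String): Unit =
    topics.getOrElseUpdate(topic, mutable.ArrayBuffer.empty) += message

  // A consumer subscribes to a topic and processes its feed.
  def feed(topic: String): Seq[String] =
    topics.getOrElse(topic, mutable.ArrayBuffer.empty).toSeq
}

val broker = new MiniBroker
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")
println(broker.feed("orders")) // the feed preserves publish order
```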
Data Retention
• Kafka retains all published messages for a
configurable period of time.
• Retaining large amounts of data is not a problem:
Kafka’s performance is effectively constant with
respect to data size.
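Time-based retention can be sketched like this (a simplified model, not Kafka’s actual log cleaner):

```scala
// Messages older than the configured retention period become
// eligible for deletion.
final case class TimedMessage(value: String, timestampMs: Long)

def retain(log: Seq[TimedMessage], nowMs: Long, retentionMs: Long): Seq[TimedMessage] =
  log.filter(m => nowMs - m.timestampMs <= retentionMs)

val log = Seq(
  TimedMessage("old", timestampMs = 0L),
  TimedMessage("recent", timestampMs = 9000L)
)
// With a 5-second retention window evaluated at t = 10s,
// only "recent" survives.
println(retain(log, nowMs = 10000L, retentionMs = 5000L).map(_.value))
```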
Producers and Consumers
Producers send messages over the network to the
Kafka cluster, which in turn serves them up to
consumers.
The Topic
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
The Partition
• Each partition is an ordered, immutable sequence
of messages that is continually appended to.
• The messages in the partitions are each assigned a
sequential number called the offset.
• The offset uniquely identifies each message within
the partition.
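The three properties above can be sketched with an immutable sequence, where the offset is simply a message’s position in the log (an illustrative model, not Kafka’s on-disk format):

```scala
// A partition as an ordered, immutable, append-only sequence of messages.
case class MiniPartition(messages: Vector[String]) {
  // Continually appended to; each append returns a new log value.
  def append(msg: String): MiniPartition = MiniPartition(messages :+ msg)
  // The offset uniquely identifies one message within the partition.
  def read(offset: Int): String = messages(offset)
}

val p = MiniPartition(Vector.empty).append("m0").append("m1").append("m2")
println(p.read(0)) // m0 -- offsets are assigned sequentially from 0
println(p.read(2)) // m2
```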
Partitions and Consumers
More on partitions
• Partitions in the log allow it to scale beyond a size
that would fit on a single server.
• A topic may have many partitions.
• Partitions also act as the unit of parallelism.
Partitions… again…
• Partitions are distributed over the servers in the
Kafka cluster.
• Each partition is replicated across servers for fault
tolerance.
Guess what… Yep,
partitions…
• Each partition has one server which acts as the
“leader".
• Each partition has zero or more servers which act
as “followers".
• If the leader fails, one of the followers will become
the leader.
…
• The leader handles all requests for the partition
while the followers replicate the leader.
• Each server/node/broker acts as a leader for some
of its partitions and a follower for others.
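The failover rule above can be sketched as follows (heavily simplified: real Kafka elects the new leader from the in-sync replica set via the controller):

```scala
// One broker leads a partition; the others follow. If the leader
// fails, a follower is promoted.
case class Replicas(leader: Int, followers: List[Int]) {
  def onLeaderFailure: Replicas = followers match {
    case next :: rest => Replicas(next, rest) // promote a follower
    case Nil          => sys.error("no follower left to promote")
  }
}

val partition0 = Replicas(leader = 1, followers = List(2, 3))
println(partition0.onLeaderFailure) // broker 2 takes over
```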
Data Replication
Producers
• Producers publish data to the topics of their choice.
• The producer is responsible for choosing which
message to assign to which partition within the
topic.
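A common choice is to assign by message key, so that all messages with the same key land in the same partition. A sketch (Kafka’s default partitioner actually uses murmur2 on the serialized key):

```scala
// Key-based partition assignment: hash(key) mod numPartitions.
def partitionFor(key: String, numPartitions: Int): Int =
  java.lang.Math.floorMod(key.hashCode, numPartitions)

// The property that matters: a given key always maps to one partition,
// which preserves per-key ordering.
val p1 = partitionFor("customer-42", 4)
val p2 = partitionFor("customer-42", 4)
println(s"customer-42 -> partition $p1 (stable: ${p1 == p2})")
```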
Consumers
• Kafka offers a single consumer abstraction called
the consumer group.
• Consumers label themselves with a consumer
group name.
• Each message published to a topic is delivered to
one consumer within each consumer group.
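That delivery rule can be sketched like this (round-robin over group members for illustration; real Kafka assigns whole partitions to members):

```scala
// Every group sees every message, but within a group each message
// goes to exactly one member.
def deliver(
    messages: Seq[String],
    groups: Map[String, Seq[String]]
): Map[String, Seq[(String, String)]] =
  groups.map { case (group, members) =>
    group -> messages.zipWithIndex.map { case (msg, i) =>
      (members(i % members.size), msg) // one member per message, per group
    }
  }

val out = deliver(
  messages = Seq("m0", "m1", "m2"),
  groups   = Map("billing" -> Seq("c1", "c2"), "audit" -> Seq("a1"))
)
println(out("billing")) // work is spread across c1 and c2
println(out("audit"))   // the audit group still sees every message
```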
Consumer Groups
Guarantees
• Messages sent by a producer to a particular topic
partition will be appended in the order they are
sent.
• A consumer instance sees messages in the order
they are stored in the log.
• For a topic with replication factor N, Kafka will
tolerate up to N-1 server failures without losing
any messages committed to the log.
Zookeeper
• Kafka uses Zookeeper to store metadata about
the Kafka cluster, as well as consumer client
details.
AVRO
• AVRO is the preferred serialization format for
Kafka messages.
• It’s platform- and language-independent.
• It allows schemas to evolve over time.
• Schemas are defined in a JSON-like format.
AVRO
{"namespace": "customerManagement.avro",
 "type": "record",
 "name": "Customer",
 "fields": [
   {"name": "id", "type": "int"},
   {"name": "name", "type": "string"},
   {"name": "faxNumber", "type": ["null", "string"], "default": null}
 ]
}
Schema Registry
• It’s a REST service.
• Allows an AVRO schema to be registered to one or
more topics.
• Stores multiple versions of a schema.
• Validates schema compatibility.
Schema Registry
Schema Registry API
https://docs.confluent.io/current/schema-registry/docs/api.html
http://ae34acbe5ed9b11e8810a0a4e9b68c10-2021023861.us-east-1.elb.amazonaws.com:8081/subjects
http://ae34acbe5ed9b11e8810a0a4e9b68c10-2021023861.us-east-1.elb.amazonaws.com:8081/subjects/orders-avro-value/versions
http://ae34acbe5ed9b11e8810a0a4e9b68c10-2021023861.us-east-1.elb.amazonaws.com:8081/subjects/orders-avro-value/versions/1
http://ae34acbe5ed9b11e8810a0a4e9b68c10-2021023861.us-east-1.elb.amazonaws.com:8081/subjects/orders-avro-value/versions/1/schema
http://ae34acbe5ed9b11e8810a0a4e9b68c10-2021023861.us-east-1.elb.amazonaws.com:8081/schemas/ids/41
Monitoring
http://aebb8cb14eec211e8810a0a4e9b68c10-1296651079.us-east-1.elb.amazonaws.com:9000/
Kafka Streams
• Is a client library for building applications and microservices, where the input
and output data are stored in Kafka clusters.
Kafka Streams
• A stream is the most important abstraction provided by Kafka Streams. It
represents an unbounded, continuously updating data set.
• A stream processing application is any program that makes use of the Kafka
Streams library.
• A stream processor is a node in the processor topology.
• There are two special processors in the topology:
• Source Processor: A source processor is a special type of stream
processor that does not have any upstream processors.
• Sink Processor: A sink processor is a special type of stream processor that
does not have down-stream processors.
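The topology above can be sketched with plain functions over a collection (the real Kafka Streams DSL builds it with StreamsBuilder / KStream):

```scala
// Source processor (no upstream) -> stream processor -> sink (no downstream).
val source: List[String] = List("a", "bb", "ccc")          // source: emits records
val processor: String => Int = _.length                     // intermediate: transforms
val sink = scala.collection.mutable.ListBuffer.empty[Int]   // sink: consumes only

source.map(processor).foreach(sink += _)
println(sink.toList) // List(1, 2, 3)
```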
Spark &amp; Kafka
• Dataframe Schema (Reading)
• Dataframe Schema (Writing)
• Required configurations (Reading)
• Required configurations (Writing)
Spark &amp; Kafka
• Reading
val ordersStreamDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
Spark &amp; Kafka
• Writing
val ordersDFQueryKafka = ordersWithItemsAndProductsDF
  // the Kafka sink expects string/binary "key" and "value" columns
  .selectExpr("CAST(Timestamp AS STRING) AS key", "CAST(Discount AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("topic", topic + "-out")
  .option("checkpointLocation", checkpointBucketKafka)
  .start()
Ecosystem
• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
HANDS ON
THANK YOU!