Tokyo AK Meetup Speedtest - Share.pdf

Speedtest: Benchmark Your Apache Kafka

Where to begin?
Understand and tune
• Producers
• Consumers
• Brokers
Producer tuning is key
• Efficient batching is essential for overall performance
Focus on fundamentals
• Large impact & gains
• Advanced topics are covered elsewhere, e.g. in Tail Latency at Scale with Apache Kafka

Service goals and tradeoffs
Non-performance objectives
• Business requirements take priority
• Durability, availability and ordering?
Performance objectives
• Trade-off between throughput and latency
Example approach
• Set configuration to ensure data durability
• Optimize for throughput
[Figure: the four service goals (Throughput, Latency, Availability, Durability) with example use cases: payments, logging, Next Best Offer, Centralized Kafka]
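The example approach above (fix durability first, then optimize for throughput) can be written down as two layers of producer settings. This is only a sketch: the setting names are standard Kafka producer configs, the throughput values are the ones used in the benchmark command in this deck, and the helper `build_producer_config` is a hypothetical name introduced for illustration.

```python
# Layer 1: settings fixed by the durability requirement (not up for tuning).
DURABILITY = {
    "acks": "all",                 # wait for all in-sync replicas
    "enable.idempotence": "true",  # avoid duplicates caused by retries
}

# Layer 2: settings tuned for throughput (illustrative values).
THROUGHPUT = {
    "batch.size": "300000",
    "linger.ms": "100",
    "compression.type": "lz4",
}


def build_producer_config(durability, tuning):
    """Merge both layers; durability settings win on conflict."""
    config = dict(tuning)
    config.update(durability)
    return config


config = build_producer_config(DURABILITY, THROUGHPUT)
for key in sorted(config):
    print(f"{key}={config[key]}")
```

Keeping the two layers separate makes it harder for a tuning experiment to weaken a business requirement by accident.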

Agenda
01. Introduction
Setting the scene & review of relevant terminology
02. Producers
Deep dive into producer internals.
Why is producer behavior key for cluster performance?
03. Consumers
Understand fetching and consumer group behavior.
04. Brokers, Zookeepers and Topics
How are requests handled? Why does Zookeeper matter?
05. Optimizing and Tuning Client Applications
Key parameters to consider for different service goals.
06. Summary
Summary and outlook.

It is a journey...
● Identify your service goal: throughput, latency, durability, or availability
● Understand Kafka internals: producer, consumer and broker behavior
● Configure cluster and clients: ensure service goals are met
● Benchmark, monitor, and tune: iterative procedure to drive performance

Producer
acks=1
enable.idempotence=false
max.request.size=1MB
retries=MAX_INT
delivery.timeout.ms=2min
max.in.flight.requests.per.connection=5
Serializer
● Retrieves and caches schemas from Schema Registry
Partitioner
● Java client uses murmur2 for hashing
● If no key is provided, it performs round robin
● If keys are unbalanced, it will overload one leader
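To illustrate the partitioner bullets, here is a minimal sketch of key-based partitioning. It is a hand-written Python port of 32-bit MurmurHash2 using the seed and constants of the Kafka Java client as I understand them; it has not been verified against the Java client here, so treat the exact partition numbers as an assumption. The function names and the key set are made up for the example.

```python
from collections import Counter


def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2 (assumed to mirror the Kafka Java client)."""
    mask = 0xFFFFFFFF
    m, r = 0x5BD1E995, 24
    length = len(data)
    h = (0x9747B28C ^ length) & mask
    body_end = length - length % 4
    # Body: consume four bytes at a time, little-endian.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Tail: mix in the remaining one to three bytes.
    tail = data[body_end:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def partition_for(key: bytes, num_partitions: int) -> int:
    """Clear the sign bit, then take the remainder by the partition count."""
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions


# 90% of the records share one key: that key's partition (and its leader)
# receives at least 900 of the 1000 records.
keys = [b"hot-key"] * 900 + [f"key-{i}".encode() for i in range(100)]
skewed = Counter(partition_for(k, 6) for k in keys)
print(sorted(skewed.items()))
```

The same key always maps to the same partition, which preserves ordering per key but means a skewed key distribution turns directly into a skewed load.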
Sender thread
● Batches are grouped by destination broker into requests
● Multiple batches to different partitions can be in the same producer request
Record accumulator
● Buffer per partition; seldom-used partitions may not achieve high batching
● If many producers are in the same JVM, memory and GC could become important
● Sticky partitioner could be used to increase batch sizes in the case of round robin (KIP-480/KIP-794)
Compression
● At batch level
● Allows faster transfer to the broker
● Reduces the inter-broker replication load
● Reduces page cache & disk space utilization on brokers
● Gzip is more CPU intensive, Snappy is lighter, LZ4/ZStd are a good balance*
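Why compression at batch level pays off can be shown with a small experiment. The sketch below uses Python's `zlib` (DEFLATE, the algorithm behind gzip) as a stand-in for Kafka's codecs, and the JSON records are invented sample data, so the absolute numbers say nothing about a real workload.

```python
import json
import zlib

# 200 small, similar records (hypothetical sample data).
records = [
    json.dumps({"user_id": i, "event": "page_view", "country": "JP"}).encode()
    for i in range(200)
]

raw_size = sum(len(r) for r in records)

# Compressing every record on its own: little redundancy to exploit.
per_record_size = sum(len(zlib.compress(r)) for r in records)

# Compressing the whole batch at once: repeated field names compress away.
batch_size = len(zlib.compress(b"".join(records)))

print(f"raw bytes:              {raw_size}")
print(f"compressed per record:  {per_record_size}")
print(f"compressed as a batch:  {batch_size}")
```

Larger batches give the codec more repetition to work with, which is one reason batching and compression settings should be tuned together.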
[Figure labels: record, batch, request]
compression.type=none
batch.size=16KB
linger.ms=0
buffer.memory=32MB
max.block.ms=60s

Batching is key to overall performance
Benefits of batching
● Reduced network bandwidth
○ producer to broker
○ broker to broker (replication)
○ broker to consumer
● Lower storage requirements on broker disks
● Reduced CPU requirement due to fewer requests
From Tail Latency at Scale with Apache Kafka:
“Batching reduces the cost of each record by amortizing costs on both the clients and brokers. Generally, bigger batches reduce processing overhead and reduce network and disk IO, which improves network and disk utilization.”
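The amortization argument can be made concrete with back-of-the-envelope arithmetic. The sketch below is a simplified model, not how the producer actually sizes batches: it assumes a single partition, one batch per request, batches that always fill up, and no per-record overhead. The record count and sizes are the ones used in the benchmark command in this deck.

```python
import math


def requests_needed(num_records: int, record_size: int, batch_size: int) -> int:
    """Number of produce requests under the simplified model above."""
    # A record larger than batch.size is still sent, alone in its batch.
    records_per_batch = max(1, batch_size // record_size)
    return math.ceil(num_records / records_per_batch)


NUM_RECORDS = 1_000_000
RECORD_SIZE = 1000  # bytes

for label, batch_size in [("no batching", 0),
                          ("batch.size=16384 (default)", 16384),
                          ("batch.size=300000", 300000)]:
    count = requests_needed(NUM_RECORDS, RECORD_SIZE, batch_size)
    print(f"{label:28s} -> {count:>9,d} requests")
```

Every request carries fixed costs (network round trip, request handling on the broker), so going from one record per request to a few hundred cuts those costs by the same factor.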

Start the demo environment
in docker-compose (on my mac)
1 * zookeeper
5 * brokers
1 * Squid proxy (sends JMX metrics to Health+)
Not starting:
schema registry
connect
ksqlDB
REST Proxy
Confluent Control Center

Kafka performance test tools
kafka-producer-perf-test \
  --num-records 1000000 \
  --record-size 1000 \
  --topic demo-perf-topic \
  --throughput 10000 \
  --print-metrics \
  --producer-props bootstrap.servers=kafka:9092 \
    acks=all batch.size=300000 linger.ms=100 \
    compression.type=lz4
Overview
● CLI tools to write & read sample data to/from topics
● Helpful to enhance understanding of parameters & their impact
Disclaimer
● Performance numbers are not representative of specific customer use cases!
○ Random test data is reused
● Use-case-specific performance testing is required
kafka-consumer-perf-test
kafka-producer-perf-test
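When several runs with different settings are compared, it helps to generate the command line instead of editing it by hand. The sketch below only uses the flags shown on this slide; the helper name `producer_perf_command` is hypothetical, and the command is printed rather than executed so it can be pasted into the demo environment.

```python
import shlex


def producer_perf_command(topic, num_records, record_size, throughput, props):
    """Build the argument list for kafka-producer-perf-test."""
    args = [
        "kafka-producer-perf-test",
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size),
        "--throughput", str(throughput),
        "--print-metrics",
        "--producer-props",
    ]
    # Producer properties follow --producer-props as key=value pairs.
    args += [f"{key}={value}" for key, value in props.items()]
    return args


cmd = producer_perf_command(
    topic="demo-perf-topic",
    num_records=1_000_000,
    record_size=1000,
    throughput=10_000,
    props={
        "bootstrap.servers": "kafka:9092",
        "acks": "all",
        "batch.size": 300000,
        "linger.ms": 100,
        "compression.type": "lz4",
    },
)
print(shlex.join(cmd))
```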

Most significant producer performance metrics
Metric | Meaning | MBean
record-size-avg | Avg record size | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
batch-size-avg | Avg number of bytes sent per partition per request | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
bufferpool-wait-ratio | Fraction of time an appender waits for space allocation | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
compression-rate-avg | Avg compression rate for a topic (compressed / uncompressed batch size) | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
record-queue-time-avg | Avg time (ms) record batches spent in the send buffer | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-avg | Avg request latency (ms) | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
produce-throttle-time-avg | Avg time (ms) a request was throttled by a broker | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-retry-rate | Avg per-second number of retried record sends for a topic | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
Overview Java metrics & librdkafka statistics
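The MBean names in the table are patterns rather than literal names. Assuming the client-id and topic groups are the regular expression `[-.\w]+` as in the Kafka documentation, a monitoring script could pick the client and topic out of a concrete MBean name as sketched below. The client id `perf-producer-1` is made up for the example.

```python
import re

PRODUCER_METRICS = re.compile(
    r"kafka\.producer:type=producer-metrics,client-id=([-.\w]+)"
)
PRODUCER_TOPIC_METRICS = re.compile(
    r"kafka\.producer:type=producer-topic-metrics,"
    r"client-id=([-.\w]+),topic=([-.\w]+)"
)


def parse_mbean(name: str):
    """Return (client_id, topic); topic is None for client-level metrics."""
    match = PRODUCER_TOPIC_METRICS.fullmatch(name)
    if match:
        return match.group(1), match.group(2)
    match = PRODUCER_METRICS.fullmatch(name)
    if match:
        return match.group(1), None
    raise ValueError(f"not a producer metrics MBean: {name}")


print(parse_mbean("kafka.producer:type=producer-metrics,client-id=perf-producer-1"))
print(parse_mbean(
    "kafka.producer:type=producer-topic-metrics,"
    "client-id=perf-producer-1,topic=demo-perf-topic"
))
```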

Consumer application
Kafka consumers fetch batches of events!
Embrace at-least-once semantics!
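What at-least-once means in practice can be seen in a toy simulation with no Kafka involved. The sketch assumes a consumer that processes a fetched batch first and commits its offset only afterwards; the list `log`, the function `run_consumer` and the crash point are all invented for the illustration.

```python
def run_consumer(log, committed, batch_size, crash_after=None):
    """Process records starting at `committed`; commit after each batch.

    Returns (processed_records, last_committed_offset). If `crash_after`
    is set, the consumer dies once it has processed that many records,
    before the current batch is committed.
    """
    processed = []
    position = committed
    while position < len(log):
        batch = log[position:position + batch_size]
        for record in batch:
            if crash_after is not None and len(processed) == crash_after:
                return processed, committed  # crash: batch not committed
            processed.append(record)
        position += len(batch)
        committed = position  # commit only after the batch is processed
    return processed, committed


log = list(range(10))

# First run crashes in the middle of the second batch.
first_run, committed_after_crash = run_consumer(log, 0, batch_size=4, crash_after=6)
# The restarted consumer resumes from the last committed offset.
second_run, final_committed = run_consumer(log, committed_after_crash, batch_size=4)

duplicates = sorted(set(first_run) & set(second_run))
print("processed twice:", duplicates)
```

Nothing is lost, but records 4 and 5 are seen twice, so processing should be idempotent or able to detect duplicates.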

Consumers
Partitions
● Basis for scalability
● No partition will be assigned to more than one consumer in the same group
Key parameters
# of partitions
fetch.min.bytes=1
fetch.max.wait.ms=500ms
max.partition.fetch.bytes=1MB
fetch.max.bytes=50MB
max.poll.records=500
max.poll.interval.ms=5min
auto.commit.interval.ms=5s (if being used)
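Two of these parameters interact: if processing the records returned by one poll takes longer than `max.poll.interval.ms`, the consumer is considered failed and the group rebalances. A quick calculation gives the average time budget per record, shown here with the values listed above; the function name is hypothetical.

```python
def per_record_budget_ms(max_poll_interval_ms: int, max_poll_records: int) -> float:
    """Average processing time per record before the poll interval is exceeded."""
    return max_poll_interval_ms / max_poll_records


MAX_POLL_INTERVAL_MS = 5 * 60 * 1000  # max.poll.interval.ms=5min
MAX_POLL_RECORDS = 500                # max.poll.records=500

budget = per_record_budget_ms(MAX_POLL_INTERVAL_MS, MAX_POLL_RECORDS)
print(f"average budget per record: {budget:.0f} ms")
```

If a single record can take longer than this, for example because of a slow external call, lowering `max.poll.records` or raising `max.poll.interval.ms` are the usual levers.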

Key positions in each partition
Log end offset
• Latest data added to the partition
• Position of the producer
• Not yet accessible to consumers (only offsets up to the high watermark are)
High watermark
• Offsets up to the watermark can be consumed
• Data has been replicated to all in-sync replicas
Current position
• Specific to consumer instances
• Current message being processed in the poll loop
Last committed offset
• Last position persisted in the __consumer_offsets topic
[Figure: partition with offsets 0 to 12, marking from left to right the last committed offset, the current position of the consumer, the high watermark, and the log end offset]
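The four positions give three useful gaps. The sketch below uses a small data class with invented offset values (they are not read off the figure), and it measures lag against the high watermark, which is one common convention rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class PartitionOffsets:
    last_committed: int
    current_position: int
    high_watermark: int
    log_end_offset: int

    def __post_init__(self):
        # The positions are expected to be ordered as described above.
        ordered = (self.last_committed <= self.current_position
                   <= self.high_watermark <= self.log_end_offset)
        if not ordered:
            raise ValueError("offsets are not in the expected order")

    def lag(self) -> int:
        """Consumable records not yet covered by a commit."""
        return self.high_watermark - self.last_committed

    def uncommitted(self) -> int:
        """Records read since the last commit (redelivered after a crash)."""
        return self.current_position - self.last_committed

    def not_yet_consumable(self) -> int:
        """Records written but not yet replicated to all in-sync replicas."""
        return self.log_end_offset - self.high_watermark


offsets = PartitionOffsets(last_committed=3, current_position=6,
                           high_watermark=10, log_end_offset=12)
print(offsets.lag(), offsets.uncommitted(), offsets.not_yet_consumable())
```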

Consumer groups
[Figure: sequence diagram between Consumer, Any Broker (bootstrap) and Coordinator Broker. Messages in order: Find coordinator, Coordinator details, Join consumer group, Leader details, Sync group, Partition assignment]
Rebalances
● Every time a consumer joins or leaves (fails) the group
● Until Kafka 2.4 a “stop the world” event (solved in KIP-429)
● Consider setting group.instance.id to minimize rebalances (KIP-345)
Partition assignment
● Based on partition.assignment.strategy
● Options: Range (default), round robin, sticky, cooperative sticky
● Is customizable
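The difference between the range and round-robin strategies is easiest to see on a small example. The sketch below is a simplified model for a single topic, with consumers sorted by name; the real assignors also handle multiple topics and differing subscriptions, which is left out here. The consumer names are made up.

```python
def range_assign(consumers, num_partitions):
    """Contiguous ranges; the first consumers get one extra partition each."""
    consumers = sorted(consumers)
    per_consumer, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        count = per_consumer + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


def round_robin_assign(consumers, num_partitions):
    """Partitions are dealt out one by one, like cards."""
    consumers = sorted(consumers)
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment


group = ["c1", "c2", "c3"]
print("range:      ", range_assign(group, 7))
print("round robin:", round_robin_assign(group, 7))
```

Both strategies hand every partition to exactly one consumer in the group; they differ only in which consumer gets which partitions.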
Heartbeat
heartbeat.interval.ms=3s
session.timeout.ms=10s
group.initial.rebalance.delay.ms=3s

Selected consumer performance metrics
Metric | Meaning | MBean
fetch-latency-avg | Avg time taken for a fetch request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-size-avg | Avg number of bytes fetched per request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
commit-latency-avg | Avg time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
rebalance-latency-total | Total time taken for group rebalances | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
fetch-throttle-time-avg | Avg throttle time (ms) | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
Overview Java metrics and librdkafka statistics

Consumer Benchmarking
(1) Start with the simplest test: without any tuning, we already get extremely good results
Highlights:
● 10M messages in less than 30 seconds
● 1Gb data retrieved
● 325 Mb/s
Conclusion:
● Tuning the producer is key: if it is correctly tuned, almost no tuning may be required on the consumer side

04. Brokers, Zookeepers and Topics

Overview
Brokers and Zookeeper
Request lifecycle in broker
● How are produce & fetch requests handled?
● How can inefficient batching impact performance?
● How to identify where time is spent during request handling?
Controller, leaders, and Zookeeper
● How is the Controller elected?
● How are broker failures detected?
● Why does the partition count matter for the recovery time after a controller failure?
(Next 8 slides skipped)

05. Optimizing and Tuning Client Applications
https://docs.confluent.io/cloud/current/client-apps/optimizing/index.html#optimizing-and-tuning

06. Recommendations & Conclusions

Recommendations
Benchmarking
● Benchmark all applications with a significant & representative load
● Consider a test cluster with
○ the application's requirements configured (durability, availability, or any other)
○ real data (size, schema, serialization format, ...)
● Test different configurations (batch size, compression, linger, ...) to see their impact on the test results (throughput, latency, ...)
● Evaluate the traffic and leave room for growth when determining the number of partitions
● Low-volume applications may need care too
● Re-evaluate after major changes in the application, message content (JSON size, ...), or volume
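Testing different configurations systematically usually means running the same benchmark over a grid of settings. The sketch below only enumerates such a grid; the values are placeholders chosen for illustration, not recommendations, and each resulting dictionary would be handed to whatever benchmark is used.

```python
from itertools import product

# Hypothetical parameter grid for a producer benchmark.
GRID = {
    "batch.size": [16384, 131072, 300000],
    "linger.ms": [0, 10, 100],
    "compression.type": ["none", "lz4", "zstd"],
}


def configurations(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(configurations(GRID))
print(f"{len(configs)} configurations to benchmark")
for config in configs[:3]:
    print(config)
```

The number of runs grows multiplicatively with every added parameter, so it is worth starting with the few settings expected to matter most.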
Monitoring
● Should be used to identify bottlenecks in running clusters
● Client monitoring is as important as broker monitoring

Conclusion
Resources
● Optimizing Your Apache Kafka® Deployment
● Optimizing and Tuning
● White paper
Optimization approach
● Determine service goals
● Understand Kafka’s internals
● Configure clients & cluster
● Benchmark, monitor & tune
Continue the conversation
● How to monitor the cluster & clients?
● Integration with external systems
● Tuning of Kafka Streams & ksqlDB applications?

https://www.confluent.io/get-started/
