Speedtest: Benchmark Your Apache Kafka
01. Introduction
Understand and tune
• Producers
• Consumers
• Brokers
Producer tuning is key
• Efficient batching is essential
for overall performance
Focus on fundamentals
• Large impact & gains
• Advanced topics e.g. in
• Tail Latency at Scale with
Apache Kafka
Where to begin?
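Service goals and tradeoffs
Non-performance objectives
• Business requirements take priority
• Durability, availability and ordering?
Performance objectives
• Trade off between throughput and latency
Example approach
• Set configuration to ensure data durability
• Optimize for throughput
[Figure: service goals (throughput, latency, availability, durability) with example workloads: payments, logging, Next Best Offer, Centralized Kafka]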
01. Introduction
Setting the scene & review of relevant terminology
02. Producers
Deep dive into producer internals.
Why is producer behavior key for cluster performance?
03. Consumers
Understand fetching and consumer group behavior.
04. Brokers, Zookeepers and Topics
How are requests handled? Why does Zookeeper matter?
05. Optimising and Tuning Client Applications
Key parameters to consider for different service goals.
06. Summary
Summary and outlook.
Identify your service goal
Throughput, latency, durability, or availability
Understand Kafka internals
Producer, Consumer and Broker behavior
Configure cluster and clients
Ensure service goals are met
Benchmark, monitor, and tune
Iterative procedure to drive performance
It is a journey...
02. Producers
Serializer
● Retrieves and caches schemas from Schema Registry
Partitioner
● Java client uses murmur2 for hashing
● If key not provided performs round robin
● If keys unbalanced it will overload one leader
Sender thread
● Batches grouped by destination broker into requests
● Multiple batches to different partitions potentially in the same producer request
Record accumulator
● Buffer per partition, seldom used partitions may not achieve high batching
● If many producers are in the same JVM, memory and GC could become important
● Sticky partitioner could be used to increase batches in the case of round robin (KIP-480/KIP-794)
Compression
● At batch level
● Allows faster transfer to the broker
● Reduces the inter broker replication load
● Reduces page cache & disk space utilization on brokers
● Gzip is more CPU intensive, Snappy is lighter, LZ4/ZStd are a good balance*
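The benefit of compressing at batch level can be illustrated with a small sketch. It uses Python's standard `zlib` module only as a stand-in for the producer codecs, and the JSON records are invented for illustration; real ratios depend on the payload and the codec chosen.

```python
import json
import zlib

# Hypothetical records: similar JSON events, as is typical for a single topic.
records = [
    json.dumps({"user_id": i, "event": "page_view", "country": "JP"}).encode()
    for i in range(200)
]

raw_bytes = sum(len(r) for r in records)

# Compressing each record on its own cannot exploit repetition across records.
per_record_bytes = sum(len(zlib.compress(r)) for r in records)

# Compressing the whole batch at once lets the codec reuse repeated field names.
batch_bytes = len(zlib.compress(b"".join(records)))

print(f"raw={raw_bytes} per_record={per_record_bytes} batch={batch_bytes}")
```

The larger and more homogeneous the batch, the more repetition the codec can exploit, which is one reason batching and compression are usually tuned together.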
Batching is key
to overall performance
Benefits to batching
● Reduced network bandwidth
○ producer to broker
○ broker to broker (replication)
○ broker to consumer
● Less storage requirements on broker disks
● Reduced CPU requirement due to fewer requests
From Tail Latency at Scale with Apache Kafka
“Batching reduces the cost of each record by amortizing costs on both the clients and brokers. Generally, bigger batches reduce processing overhead and reduce network and disk IO, which improves network and disk utilization.”
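A rough, idealized calculation shows how batch size changes the number of batches the producer has to send. The sketch assumes 1,000,000 records of 1,000 bytes and that every batch is filled completely; it ignores batch headers, per-record overhead, compression and `linger.ms`, so real numbers will differ.

```python
import math


def batches_needed(num_records: int, record_size: int, batch_size: int) -> int:
    """Idealized number of batches if every batch is filled completely."""
    records_per_batch = max(1, batch_size // record_size)
    return math.ceil(num_records / records_per_batch)


# 16 KB batches versus 300,000-byte batches (both values are assumptions).
small_batches = batches_needed(1_000_000, 1_000, 16_384)
large_batches = batches_needed(1_000_000, 1_000, 300_000)

print(small_batches, large_batches)
```

Under these assumptions the larger batch size cuts the batch count by more than an order of magnitude, which is the amortization effect the quote describes.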
Start the demo environment
in docker-compose (on my mac)
1 * zookeeper
5 * brokers
1 * Squid proxy (sends JMX metrics to Health+)
Not starting:
schema registry
connect
ksqlDB
REST Proxy
Confluent Control Center
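Kafka performance test tools
kafka-producer-perf-test \
  --num-records 1000000 \
  --record-size 1000 \
  --topic demo-perf-topic \
  --throughput 10000 \
  --print-metrics \
  --producer-props bootstrap.servers=kafka:9092 \
    acks=all batch.size=300000 compression.type=lz4
Overview
● CLI tools (kafka-producer-perf-test, kafka-consumer-perf-test) to write & read sample data to/from topics
● Helpful to enhance understanding of parameters & impact
Disclaimer
● Performance numbers are not representative for specific customer use cases!
○ Random test data is reused
● Use case specific performance testing is required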
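Before running a throttled producer test it helps to know what the chosen numbers imply. The sketch below assumes `--throughput` caps the send rate in records per second and uses decimal units (1 MB = 1,000,000 bytes); the function name is made up for illustration.

```python
def perf_test_plan(num_records: int, record_size: int, throughput: int) -> dict:
    """Derive the target data rate and minimum run time of a throttled test."""
    return {
        "target_mb_per_sec": throughput * record_size / 1_000_000,
        "min_duration_sec": num_records / throughput,
        "total_gb": num_records * record_size / 1_000_000_000,
    }


# Values taken from the demo command: 1,000,000 records, 1,000 bytes, 10,000 records/s.
plan = perf_test_plan(num_records=1_000_000, record_size=1_000, throughput=10_000)
print(plan)
```

If the measured rate stays well below the target, the producer or cluster is the bottleneck rather than the throttle.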
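Most significant producer performance metrics
Metric | Meaning | MBean
record-size-avg | Avg record size | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
batch-size-avg | Avg number of bytes sent per partition per-request | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
bufferpool-wait-ratio | Fraction of time an appender waits for space allocation | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
compression-rate-avg | Avg compression rate for a topic. Compressed / uncompressed batch size | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
record-queue-time-avg | Avg time (ms) record batches spent in the send buffer | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-avg | Avg request latency (ms) | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
produce-throttle-time-avg | Avg time (ms) a request was throttled by a broker | kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-retry-rate | Avg per-second number of retried record sends for a topic | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)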
Overview Java metrics & librdkafka statistics
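The placeholders in the MBean names read as regular expressions for the client id and topic, assuming the intended pattern is `([-.\w]+)`. A minimal sketch of matching concrete MBean names against them follows; the client id and the helper `parse_mbean` are hypothetical.

```python
import re

PRODUCER_METRICS = re.compile(
    r"kafka\.producer:type=producer-metrics,client-id=([-.\w]+)"
)
PRODUCER_TOPIC_METRICS = re.compile(
    r"kafka\.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)"
)


def parse_mbean(name: str):
    """Return client id (and topic, if present) for a producer MBean name."""
    match = PRODUCER_TOPIC_METRICS.fullmatch(name)
    if match:
        return {"client_id": match.group(1), "topic": match.group(2)}
    match = PRODUCER_METRICS.fullmatch(name)
    if match:
        return {"client_id": match.group(1), "topic": None}
    return None  # not a producer MBean covered by this sketch


print(parse_mbean("kafka.producer:type=producer-metrics,client-id=perf-producer-1"))
```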
03. Consumers
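Consumer application
Kafka consumers fetch batches of events!
Embrace at-least-once semantics!
Consumers
Partitions
● Basis for scalability
● No partition will be assigned to more than one consumer in the same group
Key parameters
# of partitions
fetch.min.bytes=1
max.partition.fetch.bytes=10MB
fetch.max.bytes=50MB
max.poll.records=500 (if being used)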
Key positions in each partition
Log end offset
• Latest data added to the partition
• Position of the producer
• Not accessible to consumers
High watermark
• Offsets up to the watermark can be consumed
• Data has been replicated to all in-sync replicas
Current position
• Specific to consumer instances
• Current message being processed in poll-loop
Last committed offset
• Last position persisted in the __consumer_offsets topic
[Figure: partition offsets 0 to 12, marked with last committed offset, current position of consumer, high watermark, and log end offset]
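The relationship between the four positions can be made concrete with a small model. The class and the offsets below are hypothetical, and the ordering check assumes a consumer in a normal poll loop (no manual seek backwards).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionPositions:
    last_committed: int    # persisted in __consumer_offsets
    current_position: int  # next record this consumer instance will process
    high_watermark: int    # replicated to all in-sync replicas
    log_end_offset: int    # latest data written by producers

    def __post_init__(self):
        ordered = (0 <= self.last_committed <= self.current_position
                   <= self.high_watermark <= self.log_end_offset)
        if not ordered:
            raise ValueError("expected committed <= position <= high watermark <= log end")

    def uncommitted(self) -> int:
        """Records processed but not committed; re-read after a restart."""
        return self.current_position - self.last_committed

    def consumable(self) -> int:
        """Records the consumer can still fetch right now."""
        return self.high_watermark - self.current_position

    def awaiting_replication(self) -> int:
        """Records written but not yet visible to consumers."""
        return self.log_end_offset - self.high_watermark


positions = PartitionPositions(last_committed=3, current_position=5,
                               high_watermark=9, log_end_offset=12)
print(positions.uncommitted(), positions.consumable(), positions.awaiting_replication())
```

The `uncommitted()` gap is why at-least-once semantics matter: after a crash the consumer resumes from the last committed offset and sees those records again.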
Consumer groups
[Figure: sequence between Consumer, Any Broker (bootstrap) and Coordinator Broker: find coordinator, coordinator details, join consumer group, leader details, sync group, partition assignment, heartbeat]
Rebalances
● Every time a new consumer joins or leaves (fails) the group
● Until Kafka 2.4 “stop the world” event (solved in KIP-429)
● Consider setting
to minimize rebalances (KIP-345)
Partition assignment
● Based on partition.assignment.strategy
● Options: Range (default), round robin,
sticky, cooperative sticky
● Is customizable
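To make the difference between two of the strategies tangible, here is a simplified single-topic sketch of range and round robin assignment. It is an illustration of the idea only, not the client's assignor code, and the consumer names are invented.

```python
def range_assign(consumers, num_partitions):
    """Give each consumer a contiguous range; the first ones get one extra."""
    members = sorted(consumers)
    base, extra = divmod(num_partitions, len(members))
    assignment, start = {}, 0
    for index, member in enumerate(members):
        count = base + (1 if index < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


def round_robin_assign(consumers, num_partitions):
    """Deal partitions out one at a time across the sorted consumers."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for partition in range(num_partitions):
        assignment[members[partition % len(members)]].append(partition)
    return assignment


print(range_assign(["c0", "c1", "c2"], 7))
print(round_robin_assign(["c0", "c1", "c2"], 7))
print(range_assign(["a", "b", "c", "d"], 3))  # more consumers than partitions
```

In both sketches every partition goes to exactly one consumer of the group, and with more consumers than partitions the surplus consumer stays idle.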
Selected consumer performance metrics
Metric | Meaning | MBean
fetch-latency-avg | Avg time taken for a fetch request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-size-avg | Avg number of bytes fetched per request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
commit-latency-avg | Avg time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
rebalance-latency-total | Total time taken for group rebalances | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
fetch-throttle-time-avg | Avg throttle time (ms) | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
Overview Java metrics and librdkafka statistics
Consumer Benchmarking
(1) Start with the simplest test: without any tuning, we get extremely good results
Highlights:
● 10M messages in less than 30 seconds
● 1Gb data retrieved
● 325 Mb/s
Conclusion:
● Tuning the producer is key: if it is correctly tuned, there (can be) almost no tuning required on the consumer side
04. Brokers, Zookeepers
and Topics
Overview Brokers and Zookeeper
Request lifecycle in broker
● How are produce & fetch requests handled?
● How can inefficient batching impact performance?
● How to identify where time is spent during request handling?
Controller, leaders, and Zookeeper
● How is the Controller elected?
● How are broker failures detected?
● Why does the partition count matter for
the recovery time after a controller failure?
04. Optimizing and Tuning
Client Applications
04. Recommendations &
Benchmarking
● Benchmark all applications with a significant & representative load
● Consider a test cluster with
○ the application's requirements configured (durability, availability or any other)
○ real data (size, schema, serialization format, ...)
● Test the different parameters to see the impact on the test results (throughput, latency, ...) considering different configurations (batch size, compression, linger, ...)
● Evaluate the traffic and leave space for growth when determining the number of partitions
● Low volume applications may need care too
● Re-evaluate after major changes in application or message content (JSON size, ...) and volume
Monitoring
● Should be used to identify bottlenecks in running clusters
● Client monitoring is as important as broker monitoring
Conclusion
Resources
● Optimizing Your Apache Kafka® Deployment
● Optimizing and Tuning
● White paper
Optimization approach
● Determine service goals
● Understand Kafka’s internals
● Configure clients & cluster
● Benchmark, monitor & tune
Continue the conversation
● How to monitor the cluster & clients?
● Integration with external systems
● Tuning of Kafka Streams & ksqlDB applications?
Tokyo AK Meetup Speedtest - Share.pdf


  • 3. Understand and tune • Producers • Consumers • Brokers Producer tuning is key • Efficient batching is essential for overall performance Focus on fundamentals • Large impact & gains • Advanced topics e.g. in • Tail Latency at Scale with Apache Kafka Where to begin?
  • 4. Service goals and tradeoffs Non-performance objectives • Business requirements take priority • Durability, availability and ordering? Performance objectives • Trade off between throughput and latency Example approach • Set configuration to ensure data durability • Optimize for throughput Throughput Latency Availability Durability payments logging Next Best Offer Centralized Kafka
  • 5. Agenda 01. Introduction Setting the scene & review of relevant terminology 02. Producers Deep dive into producer internals. Why is producer behavior key for cluster performance? 03. Consumers Understand fetching and consumer group behavior. 04. Brokers, Zookeepers and Topics How are requests handled? Why does Zookeeper matter? 05. Optimising and Tuning Client Applications Key parameters to consider for different service goals. 06. Summary Summary and outlook.
  • 6. Identify your service goal Throughput, latency, durability, or availability Understand Kafka internals Producer, Consumer and Broker behavior Configure cluster and clients Ensure service goals are met Benchmark, monitor, and tune Iterative procedure to drive performance It is a journey...
  • 8. Producer acks=1 enable.idempotence=false max.request.size=1MB retries=MAX_INT per.connection=5 Serializer ● Retrieves and caches schemas from Schema Registry Partitioner ● Java client uses murmur2 for hashing ● If key not provided performs round robin ● If keys unbalanced it will overload one leader Sender thread ● Batches grouped by destination broker into requests ● Multiple batches to different partitions potentially in the same producer request Record accumulator ● Buffer per partition, seldom used partitions may not achieve high batching ● If many producers are in the same JVM, memory and GC could become important ● Sticky partitioner could be used to increase batches in the case of round robin (KIP-480/KIP-794) Compression ● At batch level ● Allows faster transfer to the broker ● Reduces the inter broker replication load ● Reduces page cache & disk space utilization on brokers ● Gzip is more CPU intensive, Snappy is lighter, LZ4/ZStd are a good balance* compression.type=none batch.size=16KB buffer.memory=32MB record batch request
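The partitioner behavior on this slide (hash the key, round robin without a key, one hot key overloading one partition leader) can be sketched as follows. Note the assumption: `zlib.crc32` is used purely as a standard-library stand-in for murmur2, so the partition numbers will not match a real client, and the class name and keys are hypothetical.

```python
import itertools
import zlib
from collections import Counter


class SketchPartitioner:
    """Illustrative only: hash-mod for keyed records, round robin otherwise."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key):
        if key is None:
            return next(self._round_robin)
        # Stand-in hash; the Java client uses murmur2 here.
        return zlib.crc32(key) % self.num_partitions


def load_per_partition(keys, num_partitions):
    partitioner = SketchPartitioner(num_partitions)
    counts = Counter(partitioner.partition(key) for key in keys)
    return [counts[p] for p in range(num_partitions)]


keyless_load = load_per_partition([None] * 6000, 6)
skewed_keys = [b"hot-customer"] * 5000 + [f"user-{i}".encode() for i in range(1000)]
skewed_load = load_per_partition(skewed_keys, 6)

print("keyless:", keyless_load)
print("skewed: ", skewed_load)
```

With the skewed key set, one partition receives at least 5,000 of the 6,000 records, so its leader handles most of the traffic regardless of how many partitions exist.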
  • 9. Batching is key to overall performance Benefits to batching ● Reduced network bandwidth ○ producer to broker ○ broker to broker (replication) ○ broker to consumer ● Less storage requirements on broker disks ● Reduced CPU requirement due to fewer requests From Tail Latency at Scale with Apache Kafka “Batching reduces the cost of each record by amortizing costs on both the clients and brokers. Generally, bigger batches reduce processing overhead and reduce network and disk IO, which improves network and disk utilization.”
  • 10. Start the demo environment in docker-compose (on my mac) 1 * zookeeper 5 * brokers 1 * Squid proxy (sends JMX metrics to Health+) Not starting: schema registry connect ksqlDB REST Proxy Confluent Control Center
  • 12. Kafka performance test tools kafka-producer-perf-test --num-records 1000000 --record-size 1000 --topic demo-perf-topic --throughput 10000 --print-metrics --producer-props bootstrap.servers=kafka:9092 acks=all batch.size=300000 compression.type=lz4 Overview ● CLI tools to write & read sample data to/from topics ● Helpful to enhance understanding of parameters & impact Disclaimer ● Performance numbers are not representative for specific customer use cases! ○ Random test data is reused ● Use case specific performance testing is required Tools: kafka-consumer-perf-test, kafka-producer-perf-test
  • 13. Most significant producer performance metrics Metric Meaning MBean record-size-avg Avg record size kafka.producer:type=producer-metrics,client-id=([-.\w]+) batch-size-avg Avg number of bytes sent per partition per-request kafka.producer:type=producer-metrics,client-id=([-.\w]+) bufferpool-wait-ratio Fraction of time an appender waits for space allocation kafka.producer:type=producer-metrics,client-id=([-.\w]+) compression-rate-avg Avg compression rate for a topic. Compressed / uncompressed batch size kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) record-queue-time-avg Avg time (ms) record batches spent in the send buffer kafka.producer:type=producer-metrics,client-id=([-.\w]+) request-latency-avg Avg request latency (ms) kafka.producer:type=producer-metrics,client-id=([-.\w]+) produce-throttle-time-avg Avg time (ms) a request was throttled by a broker kafka.producer:type=producer-metrics,client-id=([-.\w]+) record-retry-rate Avg per-second number of retried record sends for a topic kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) Overview Java metrics & librdkafka statistics
  • 15. Consumer application Kafka consumers fetch batches of events! Embrace at-least-once semantics!
  • 16. Consumers Partitions ● Basis for scalability ● No partition will be assigned to more than one consumer in the same group Key parameters # of partitions fetch.min.bytes=1 max.partition.fetch.bytes=10MB fetch.max.bytes=50MB max.poll.records=500 (if being used)
  • 17. Key positions in each partition Log end offset • Latest data added to the partition • Position of the producer • Not accessible to consumers High watermark • Offsets up to the watermark can be consumed • Data has been replicated to all in-sync replicas Current position • Specific to consumer instances • Current message being processed in poll-loop Last committed offset • Last position persisted in the __consumer_offsets topic 0 1 2 3 4 5 6 7 8 9 10 11 12 Last committed offset Current position of consumer High watermark Log end offset
  • 18. Consumer groups Consumer Any Broker (bootstrap) Coordinator Broker Find coordinator Coordinator details Join consumer group Leader details Sync group Partition assignment Rebalances ● Every time a new consumer joins or leaves (fails) the group ● Until Kafka 2.4 “stop the world” event (solved in KIP-429) ● Consider setting to minimize rebalances (KIP-345) Partition assignment ● Based on partition.assignment.strategy ● Options: Range (default), round robin, sticky, cooperative sticky ● Is customizable Heartbeat group.initial.
  • 19. Selected consumer performance metrics Metric Meaning MBean fetch-latency-avg Avg time taken for a fetch request kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) fetch-size-avg Avg number of bytes fetched per request kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) commit-latency-avg Avg time taken for a commit request kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) rebalance-latency-total Total time taken for group rebalances kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) fetch-throttle-time-avg Avg throttle time (ms) kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) Overview Java metrics and librdkafka statistics
  • 20. Consumer Benchmarking (1) Start with most simple test: Without any tuning, we get extremely good results Highlights: ● 10M messages in less than 30 seconds ● 1Gb data retrieved ● 325 Mb/s Conclusion: ● Tuning producer is key, if it is correctly tuned, there (can be) almost no tuning required on consumer side
  • 24. Overview Brokers and Zookeeper Request lifecycle in broker ● How are produce & fetch requests handled? ● How can inefficient batching impact performance? ● How to identify where time is spent during request handling? Controller, leaders, and Zookeeper ● How is the Controller elected? ● How are broker failures detected? ● Why does the partition count matter for the recovery time after a controller failure? (Next 8 slides skipped)
  • 25. 04. Optimizing and Tuning Client Applications
  • 27. Recommendations Benchmarking ● Benchmark all applications with a significant & representative load ● Consider a test cluster with ○ the application's requirements configured (durability, availability or any other) ○ real data (size, schema, serialization format, ...) ● Test the different parameters to see the impact on the test results (throughput, latency, ...) considering different configurations (batch size, compression, linger, ...) ● Evaluate the traffic and leave space for growth when determining the number of partitions ● Low volume applications may need care too ● Re-evaluate after major changes in application or message content (JSON size, ...) and volume Monitoring ● Should be used to identify bottlenecks in running clusters ● Client monitoring is as important as broker monitoring
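Testing different configurations systematically is easier with a generated parameter matrix. The sketch below builds one `--producer-props` string per combination of batch size, linger and compression; the value lists and the function name are assumptions chosen for illustration, not recommended settings.

```python
from itertools import product

# Hypothetical sweep values; pick ranges that fit the workload under test.
GRID = {
    "batch.size": [16384, 100000, 300000],
    "linger.ms": [0, 10, 100],
    "compression.type": ["none", "lz4", "zstd"],
}


def producer_props_matrix(grid, fixed):
    """Yield one space-separated producer-props string per combination."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        props = dict(fixed)
        props.update(zip(names, values))
        yield " ".join(f"{key}={value}" for key, value in props.items())


runs = list(producer_props_matrix(
    GRID, {"bootstrap.servers": "kafka:9092", "acks": "all"}
))
print(len(runs))
print(runs[0])
```

Each string can be passed to a producer benchmark run, and the resulting throughput and latency figures compared against the service goal.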
  • 28. Conclusion Resources ● Optimizing Your Apache Kafka® Deployment ● Optimizing and Tuning ● White paper Optimization approach ● Determine service goals ● Understand Kafka’s internals ● Configure clients & cluster ● Benchmark, monitor & tune Continue the conversation ● How to monitor the cluster & clients? ● Integration with external systems ● Tuning of Kafka Streams & ksqlDB applications?