©2015 LinkedIn Corporation. All Rights Reserved.
Aditya Auradkar & Dong Lin
Motivation: Why is this important?
● Shared resources in a multi-tenant environment
● Bad clients can hurt others
– Bootstrapping consumers
– Buggy clients
● Better QoS for well-behaved clients
● Preserve throughput and latency for everyone else
● API Limits/Billing
Clients and Client-Ids
● Quotas are enforced per client-id
● Why client-id?
● No quotas per topic
● No quotas per topic * client-id combination
● Blanket produce and fetch quota for all clients
Quota Overrides
● Certain clients justify higher quotas
● Rolling bounces take too long and require too much effort
● Store overrides in ZooKeeper
● Brokers parse config change notifications
● Apply new quota immediately
Quota Overrides
{
  "version": 1,
  "config": {
    "producer_byte_rate": "1048576",
    "consumer_byte_rate": "1048576"
  }
}
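As a rough illustration of how an override document like the one above might be combined with the broker-wide default, here is a minimal sketch. The function name `effective_quota` and the 10 MB/s default values are assumptions made for this example; they are not Kafka's internal API.

```python
import json

# Hypothetical broker-wide defaults in bytes/sec (illustrative values only).
DEFAULT_QUOTAS = {
    "producer_byte_rate": 10 * 1024 * 1024,
    "consumer_byte_rate": 10 * 1024 * 1024,
}


def effective_quota(override_json, key, defaults=DEFAULT_QUOTAS):
    """Return the per-client override for `key` if present, else the default."""
    if override_json is None:
        return defaults[key]
    config = json.loads(override_json).get("config", {})
    # Override values are stored as strings in the JSON document.
    return int(config[key]) if key in config else defaults[key]


# A client with only a producer override falls back to the default for fetches.
override = '{"version": 1, "config": {"producer_byte_rate": "1048576"}}'
print(effective_quota(override, "producer_byte_rate"))  # 1048576
print(effective_quota(override, "consumer_byte_rate"))  # 10485760
print(effective_quota(None, "producer_byte_rate"))      # 10485760
```

A client with no override at all simply gets the blanket default, which matches the "blanket produce and fetch quota for all clients" behaviour described earlier.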
Broker Metrics
● Metrics created for each client
● Clients can come and go
● Don’t need to retain client metrics forever
● GC metrics if inactive for longer than 1 hr
● Recreate if client reconnects
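The lifecycle above (create per client, expire after an hour of inactivity, recreate on reconnect) can be sketched as follows. This is a simplified illustration, not the broker's metrics code; the class `ClientMetrics` and its method names are hypothetical, and the clock is injectable only so the behaviour can be exercised without waiting.

```python
import time


class ClientMetrics:
    """Per-client-id byte counters that expire after a period of inactivity."""

    def __init__(self, idle_timeout_s=3600.0, clock=time.monotonic):
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._metrics = {}  # client_id -> [total_bytes, last_seen]

    def record(self, client_id, num_bytes):
        # The entry is created on first use, so a reconnecting client
        # simply gets a fresh metric.
        entry = self._metrics.setdefault(client_id, [0, 0.0])
        entry[0] += num_bytes
        entry[1] = self._clock()

    def expire_idle(self):
        """Drop metrics of clients idle for longer than the timeout."""
        now = self._clock()
        idle = [client for client, (_, seen) in self._metrics.items()
                if now - seen > self._idle_timeout_s]
        for client in idle:
            del self._metrics[client]
        return idle

    def total_bytes(self, client_id):
        entry = self._metrics.get(client_id)
        return entry[0] if entry else None
```

Calling `expire_idle()` periodically keeps the number of tracked clients bounded even when many short-lived clients connect.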
Enforcement
● Reduce client throughput to desired rate
● Compute delay based on current throughput
● Small violations result in small delays
● Use smaller measurement windows to avoid long pauses
● Client side metrics available to detect throttling
Delay Calculation
● Delay = W * (μ - Q) / μ
● W = window size, μ = observed rate, Q = desired rate
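One way to read the formula, stated here as an interpretation rather than something the slides spell out: if the client transmits at rate μ for W - Delay seconds and is held back for Delay seconds, the average over the window is μ * (W - Delay) / W, and setting that equal to Q gives Delay = W * (μ - Q) / μ. A small sketch with illustrative numbers (the function name `throttle_delay` is hypothetical):

```python
def throttle_delay(window_s, observed_rate, quota_rate):
    """Delay = W * (mu - Q) / mu, or 0 when the client is within its quota."""
    if observed_rate <= quota_rate:
        return 0.0
    return window_s * (observed_rate - quota_rate) / observed_rate


# Client observed at 2 MB/s against a 1 MB/s quota, 1-second window.
print(throttle_delay(1.0, 2_000_000, 1_000_000))    # 0.5

# A small violation (5% over quota) gives a small delay, just under 0.05 s.
print(throttle_delay(1.0, 1_050_000, 1_000_000))

# The same 2x violation measured over a 5-minute window gives a long pause.
print(throttle_delay(300.0, 2_000_000, 1_000_000))  # 150.0
```

The last call illustrates why smaller measurement windows are preferred: the delay scales linearly with W.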
Enforcement: producer request path
[Diagram: producer, request channel, replica manager, log, quota manager, delay queue]
1. request
2. process
3. append
4. record metric
5. delay
6. dequeue
7. response
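Steps 5 and 6 of this flow (delay, then dequeue) can be sketched with a priority queue ordered by release time. This is a minimal, single-threaded illustration; the class `DelayQueue` and its methods are hypothetical names, and the caller passes in the current time instead of a background thread polling a clock.

```python
import heapq
import itertools


class DelayQueue:
    """Holds responses until their delay has elapsed."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal release times

    def put(self, response, now_s, delay_s):
        # Step 5: park the response until now + delay.
        heapq.heappush(self._heap, (now_s + delay_s, next(self._seq), response))

    def pop_due(self, now_s):
        # Step 6: release every response whose delay has expired by now_s.
        due = []
        while self._heap and self._heap[0][0] <= now_s:
            due.append(heapq.heappop(self._heap)[2])
        return due


q = DelayQueue()
q.put("response-A", now_s=0.0, delay_s=0.5)  # client over quota
q.put("response-B", now_s=0.0, delay_s=0.0)  # client within quota
print(q.pop_due(0.0))  # ['response-B']
print(q.pop_due(0.5))  # ['response-A']
```

Because the request has already been processed before the response is parked, the client only sees added latency, not an error.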
Enforcement: consumer request path
[Diagram: consumer, request channel, replica manager, log, quota manager, delay queue]
1. request
2. process
3. fetch offsets
4. record metric
5. delay
6. dequeue
7. response (zero copy)
Slowdown vs Error
● Error handling is hard
● Tricky to implement backoff and retries
● All client implementations need to handle quota errors
● Need something easier
Getting Started
● Important Broker configs
– quota.producer.default (in bytes/sec)
– quota.consumer.default (in bytes/sec)
● Apply overrides
./bin/kafka-configs.sh --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=1048576' \
  --entity-type clients \
  --entity-name TestTopic \
  --zookeeper localhost:2181
● Read overrides
./bin/kafka-configs.sh --describe \
  --entity-type clients \
  --entity-name TestTopic \
  --zookeeper localhost:2181
Monitoring
● Producer metrics
– throttle-time avg and max
● Consumer metrics
– throttle-time avg and max
● Broker metrics
– byte-rate and avg throttle-time per client-id
– byte-rate is used for enforcement
● ZookeeperConsumerConnector and SimpleConsumer metrics also available
Rollout Strategy
● Deploy without enforcement
● Monitor metrics to track throughput for all clients
● Identify candidates for overrides
● Start with high thresholds
Evaluation
● Validate quota functionality
– broker-throughput <= sum(quota_of_clientId)
– sum(client-throughput) <= quota_of_clientId
● Evaluate performance improvement for clients
– Throughput and latency
– Clients with different throughput demand
Evaluation – Validate Quota Functionality
● Three configurations, each shown as a figure with a producer panel and a consumer panel
– Unlimited quota
– quota.producer.default = quota.consumer.default = 50 MBps
– quota.producer.default = quota.consumer.default = 10 MBps
Evaluation – Client Performance Improvement
[Diagram: four scenarios]
● Small client running alone
● Clients join together
● Clients join in presence of quota
● Comparison
Evaluation – Producer Performance Improvement
● Producer runs at 2 MBps alone (alone)
● Producer runs with other producers without quota (together)
● Producer runs with other producers with 10 MBps quota (quota)
[Figure: producer latency (ms, axis 0 to 35) over time (0 to 600 sec) for the alone, together, and quota runs]
Scenario       Alone   Together   Quota
Latency (ms)   1.5     23.6       2.5
Evaluation – Consumer Performance Improvement
● Consumer runs alone (alone)
● Consumer runs alone with 50 MBps quota (alone-quota)
● Consumer runs with other consumers without quota (together)
● Consumer runs with other consumers with 50 MBps quota (quota)
[Figure: consumer throughput (MBps, axis 20 to 90) over time (0 to 600 sec) for the alone, alone-quota, together, and quota runs]
Scenario            alone   alone-quota   together   quota
Throughput (MBps)   87      45            31         40
Evaluation - Summary
● Quota functionality is enforced
● Quotas improve performance for existing clients when large clients join
Future Work
● Throttle replica traffic (e.g. during bootstrap)
● Throttle more request types (OffsetCommitRequest etc.)
● Client-id authentication for use in multi-tenancy environment
Acknowledgements
● LinkedIn Kafka Engineering team
● Confluent Inc
● John McClean (formerly at LI)

Kafka Quotas Talk at LinkedIn

Editor's Notes

  1. Good evening, and welcome to LinkedIn. We work on the Kafka engineering team at LinkedIn and are here to talk about a brand-new feature in 0.9: quotas, the ability to define throughput thresholds for a client.
  2. When Kafka runs as a service, all resources are shared: CPU, disk, network, and so on. A single bad client, such as a buggy one, can degrade the experience for others. In some cases the client isn’t even bad, for example a bootstrapping consumer. We need a way to offer better QoS for well-behaved clients.
  3. What is the entity we throttle? The client-id. A client-id logically identifies an application, which is why we chose it. Topics inherently have no notion of ownership, and a significant number of topics are public data. It is hard to add quotas per topic because everyone using that topic would get throttled, which is not desirable. For example, a well-behaved consumer should not be throttled because of a different bootstrapping consumer, and a well-behaved producer instance shouldn’t be throttled because of a buggy client. Quotas per topic * client-id combination are also tricky to get right: a wildcard consumer would receive infinite quota, and a producer producing to 1000 topics could also bypass the quota system. Instead, we have a reasonable threshold for everyone.
  4. Many clients can justify larger quotas; the default doesn’t work for everybody. Quota changes can happen frequently, and SREs would hate having to bounce clusters to change quotas for custom clients. As with topic configs, we store overrides in ZooKeeper.
  5. To track quotas, we keep metrics for each client that has connected. This number can be significant, and many clients are short-lived: console consumers, console producers, and so on.
  6. As mentioned, we have metrics to track the per-client byte rate. The goal is to reduce client throughput to the desired rate. The delay is computed from the current throughput: if the throughput violation is small, small delays are added to the responses. We use small measurement windows to detect violations early. For example, with a 5-minute window we would have a long pause towards the end. The window size is configurable. Metrics are available on the client side; no error is returned.
  7. After the delay, the measuring window should have throughput equal to Q.
  8. The request is sent to the request channel and then to the replica manager, which appends it to the log. The metric is updated with the number of bytes appended. The delay is computed and the response is inserted into a queue. A reaper thread sends the response asynchronously.
  9. There are a number of nuances to client-side errors, and we cannot trust client implementations to do the right thing. Why not just send errors to clients? Dozens of client implementations would need to build backoffs and retries on error. We wanted something that just works.
  10. Let’s talk specifics. Tooling is available to change quotas per client-id.
  11. These are consistent with the other metrics available on each of these clients.
  12. Observe traffic patterns: monitor metrics to track throughput for all clients, which lets you pick a reasonable threshold. Start high. You don’t want to configure too low a quota and have most people end up throttled on a stable cluster.