Multi-Tier, Multi-Tenant, Multi-Problem Kafka
Todd Palino
Who Am I?
What Will We Talk About?
 Multi-Tenant Pipelines
 Multi-Tier Architecture
 Why I Drink Interesting Problems
 Conclusion

Multi-Tenant Pipelines
Tracking and Data Deployment
 Tracking – Data going to HDFS
 Data Deployment – Hadoop job results going to online applications
 Many shared topics
 Schemas require a common header
 All message counts are audited
 Special Problems
– Hard to tell what application is dropping messages
– Some of these messages are copied 42 times!
Metrics
 Application and OS metrics
 Deployment and build system events
 Service calls – sampling of timing information for individual application calls
 Some application logs
 Special Problems
– Every server in the datacenter produces to this cluster at least twice
– Graphing/Alerting system consumes the metrics 20 times
Logging
 Application logging messages destined for ELK clusters
 Lower retention than other clusters
 Loosest restrictions on message schema and encoding
 Special Problems
– Not many – it’s still overprovisioned
– Customers starting to ask about aggregation
Queuing
 Everything else
 Primarily messages internal to applications
 Also emails and user messaging
 Messages are Avro encoded, but do not require headers
 Special Problems:
– Many messages which use unregistered schemas
– Clusters can have very high message rates (but not large data)
Special Case Clusters
 Not all use cases fit multi-tenancy
– Custom configurations that are needed
– Tighter performance guarantees
– Use of topic deletion
 Espresso (KV store) internal replication
 Brooklin – Change capture
 Replication from Hadoop to Voldemort
Tiered Cluster Architecture
One Kafka Cluster

Kafka is an open-source message broker that provides high-throughput, low-latency data processing. It stores messages in a distributed commit log, organized into categories called topics, and each topic is split into partitions. Processes that publish messages are producers, and processes that subscribe to topics are consumers. Consumers can belong to consumer groups for parallel processing. Kafka preserves message order within a partition, and whether messages can be lost depends on replication and producer acknowledgement settings. It uses ZooKeeper for metadata and coordination.

Multiple Clusters – Message Aggregation
Why Not Direct?
 Network Concerns
– Bandwidth
– Network partitioning
– Latency
 Security Concerns
– Firewalls and ACLs
– Encrypting data in transit
 Resource Concerns
– A misbehaving application can swamp production resources
What Do We Lose?
 You may lose message ordering
– Mirror maker breaks apart message batches and redistributes them
 You may lose key to partition affinity
– Mirror maker will partition based on the key
– Differing partition counts in source and target will result in differing distribution
– Mirror maker does not (without work) honor custom partitioning
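The loss of key-to-partition affinity can be illustrated with a minimal sketch. It assumes a keyed partitioner of the form hash(key) mod partition count; `zlib.crc32` is only a stand-in for the hash a real client uses, and `partition_for` and the `member-N` keys are hypothetical names for this example.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition number (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key) % num_partitions


# Hypothetical keys; the source cluster has 8 partitions, the target has 12.
keys = [f"member-{i}".encode() for i in range(1000)]
source = {k: partition_for(k, 8) for k in keys}
target = {k: partition_for(k, 12) for k in keys}

# Count keys whose partition number differs between the two clusters.
moved = sum(1 for k in keys if source[k] != target[k])
print(f"{moved} of {len(keys)} keys land on a different partition number")
```

With equal partition counts and the same hash the mapping is identical on both sides, which is why differing counts are the case to watch for.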
Aggregation Rules
 Aggregate clusters are only for consuming messages
– Producing to an aggregate cluster is not allowed
– This assures all aggregate clusters have the same content
 Not every topic appears in PROD aggregate-tracking clusters
– Trying to discourage aggregate cluster usage in PROD
– All topics are available in CORP
 Aggregate-queuing is whitelist only and very restricted
– Please discuss your use case with us before developing

Interesting Problems
Buy The Book!
Early Access available now.
Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting.
Also discusses stream processing and other use cases.
Monitoring Using Kafka
 Monitoring and alerting are self-service
– No gatekeeper on what metrics are collected and stored
 Applications use a common container
– EventBus Kafka producer
– Simple annotation of metrics to collect
– Sampled service calls
– Application logs
 Everything is produced to Kafka and consumed by the monitoring infrastructure
Monitoring Kafka
 Kafka is great for monitoring your applications

KMon and EnlightIN
 Developed a separate monitoring and notification system
– Metrics are only retained long enough to alert on them
– One rule: we can’t use Kafka
 Alerting is simplified from our self-service system
– Nothing complex like regular expressions or RPNs
– Only used for critical Kafka and Zookeeper alerts
– Faster and more reliable
 Notifications are cleaner
– Alerts are grouped into incidents for fewer notifications when things break
– Notification system is generic and subscribable so we can use it for other things
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
Is Kafka Working?
 Knowing that the cluster is up isn’t always enough
– Network problems
– Metrics can lie
 Customers still ask us first if something breaks
– Part of the solution is educating them as to what to monitor
– Need to be absolutely sure of the answer “There’s nothing wrong with Kafka”
Kafka Monitoring Framework
 Producer to consumer testing of a Kafka cluster
– Assures that producers and consumers actually work
– Measures how long messages take to get through
 We have a SLO of 99.99% availability for all clusters
 Working on multi-tier support
– Answers the question of how long messages take to get to Hadoop
 LinkedIn Kafka Open Source
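As a rough sketch of producer-to-consumer testing, the summary step might look like the following. It assumes each probe message is recorded as a send time plus a receive time (or `None` if it never arrived); the `Probe` type and both helper functions are hypothetical and not part of any published framework.

```python
from typing import List, Optional, Tuple

# (send_time, receive_time); receive_time is None if the probe was never consumed.
Probe = Tuple[float, Optional[float]]


def availability(probes: List[Probe]) -> float:
    """Fraction of probe messages that made it from producer to consumer."""
    if not probes:
        return 1.0
    delivered = sum(1 for _, received in probes if received is not None)
    return delivered / len(probes)


def max_latency(probes: List[Probe]) -> Optional[float]:
    """Longest observed produce-to-consume time, in the same unit as the timestamps."""
    latencies = [received - sent for sent, received in probes if received is not None]
    return max(latencies) if latencies else None


# 9,999 delivered probes and one lost probe: exactly at a 99.99% target.
probes = [(float(i), float(i) + 0.25) for i in range(9999)] + [(9999.0, None)]
print(availability(probes) >= 0.9999, max_latency(probes))
```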

Is Mirroring Working?
 Most critical data flows through Kafka
– Most of that depends on mirror makers
– How do we make sure it all gets where it’s going?
 Mirror maker pipelines can have over a thousand topics
– Different message rates
– Some are more important than others
 Lag threshold monitoring doesn’t work
– Traffic spikes cause false alerts
– What should the threshold be?
– No easy way to monitor 1000 topics and over 10k partitions
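One alternative to a fixed threshold is to judge a consumer by how its offsets and lag move over a window of recent commits. The sketch below is illustrative only: the three rules and the status names are assumptions made for this example, not the rules of any particular tool.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (committed_offset, lag) observed at one commit


def threshold_alert(samples: List[Sample], max_lag: int) -> bool:
    """Classic check: alert whenever the latest lag exceeds a fixed number."""
    return samples[-1][1] > max_lag


def window_status(samples: List[Sample]) -> str:
    """Evaluate the whole window instead of a single lag value."""
    offsets = [offset for offset, _ in samples]
    lags = [lag for _, lag in samples]
    if any(lag == 0 for lag in lags):
        return "OK"  # the consumer caught up at some point in the window
    if offsets[-1] == offsets[0]:
        return "STALLED"  # lag is present but no offsets were committed
    if all(later > earlier for earlier, later in zip(lags, lags[1:])):
        return "FALLING_BEHIND"  # consuming, but lag grows at every commit
    return "OK"


spike = [(100, 0), (200, 5000), (900, 4000), (2000, 1000)]
stalled = [(100, 10), (100, 20), (100, 30)]
print(threshold_alert(spike, 500), window_status(spike))      # threshold fires, window says OK
print(threshold_alert(stalled, 500), window_status(stalled))  # threshold silent, window says STALLED
```

The traffic spike trips the fixed threshold even though the consumer is recovering, while the stalled consumer never crosses it; no per-topic threshold has to be chosen.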
Kafka Audit
 Audit tracks topic completeness across all clusters in the pipeline
– Primarily tracking messages
– Schema must have a valid header
– Alerts for DWH topics are set for 0.1% message loss
 Provided as an integrated part of the internal Kafka libraries
 Used for data completeness checks before Hadoop jobs run
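A completeness check of this kind can be sketched as counting messages per topic and time bucket at each tier and comparing the counts. The 0.1% limit comes from the slide; the 10-minute bucket, the use of a header timestamp, and the function names are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Bucket = Tuple[str, int]  # (topic, bucket start time in seconds)


def count_by_bucket(events: Iterable[Tuple[str, int]], bucket_seconds: int = 600) -> Counter:
    """Count messages per (topic, time bucket) using the header timestamp."""
    counts: Counter = Counter()
    for topic, timestamp in events:
        counts[(topic, timestamp - timestamp % bucket_seconds)] += 1
    return counts


def loss_report(produced: Counter, consumed: Counter, max_loss: float = 0.001) -> Dict[Bucket, float]:
    """Return the buckets whose loss ratio exceeds max_loss (0.1% by default)."""
    report = {}
    for key, sent in produced.items():
        loss = (sent - consumed.get(key, 0)) / sent
        if loss > max_loss:
            report[key] = loss
    return report


# Hypothetical topic: 1000 messages produced, one of them missing downstream.
produced = count_by_bucket(("PageView", t) for t in range(1000))
consumed = count_by_bucket(("PageView", t) for t in range(1000) if t != 5)
print(loss_report(produced, consumed))
```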
Auditing Message Flows
Burrow
 Burrow is an advanced Kafka consumer monitoring system
– Provides an objective view of consumer status
– Much more powerful than threshold-based lag monitoring
 Burrow is Open Source!
– Used by many other companies, including Wikimedia and Blizzard
– Used internally to assure all Mirror Makers and Audit are running correctly
 Exports metrics for all consumers to self-service monitoring

MTTF Is Not Your Friend
 We have over 1800 Kafka brokers
– All have at least 12 drives, most have 16
– Dual CPUs, at least 64 GB of memory
– Really lousy MegaRAID controllers
 This means hardware fails daily
– We don’t always know when it happens, if it doesn’t take the system down
– It can’t always be fixed immediately
– We can take one broker down, but not two
Moving Partitions
 Prior to Kafka 0.8, moving partitions was basically impossible
– It’s still not easy – you have to be explicit about what you are moving
– There’s no good way to balance partitions in a cluster
 We developed kafka-assigner to solve the problem
– A single command to remove a broker and distribute its partitions
– Chainable modules for balancing partitions
– Open source!
 Also working on “Cruise Control” for Kafka
– An add-on service that will handle redistributing partitions automatically
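To make the "remove a broker" operation concrete, here is a minimal sketch of one possible strategy: each replica on the departing broker is handed to the least-loaded remaining broker that does not already hold that partition. This is an assumed strategy for illustration, not a description of how kafka-assigner works; `remove_broker` and the assignment layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Assignment = Dict[str, List[int]]  # "topic-partition" -> replica broker ids


def remove_broker(assignment: Assignment, broker: int, brokers: List[int]) -> Assignment:
    """Reassign every replica on `broker` to the least-loaded eligible broker."""
    remaining = [b for b in brokers if b != broker]
    load = Counter({b: 0 for b in remaining})
    for replicas in assignment.values():
        for b in replicas:
            if b != broker:
                load[b] += 1

    result: Assignment = {}
    for tp in sorted(assignment):
        replicas = list(assignment[tp])
        if broker in replicas:
            # A broker may hold at most one replica of a given partition.
            candidates = [b for b in remaining if b not in replicas]
            # min() raises ValueError if no eligible broker is left.
            replacement = min(candidates, key=lambda b: (load[b], b))
            replicas[replicas.index(broker)] = replacement
            load[replacement] += 1
        result[tp] = replicas
    return result


before = {"t-0": [1, 2], "t-1": [2, 3], "t-2": [3, 1]}
print(remove_broker(before, broker=3, brokers=[1, 2, 3, 4]))
```

The output is a new replica list per partition, which is the explicit form a partition move has to be stated in.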
Pushing Data from Hadoop
 To help Hadoop jobs, we maintain a KafkaPushJob
– A mapper that produces messages to Kafka
– Pushes to data-deployment, which then gets mirrored to production
 Hadoop jobs tend to push a lot of data all at once
– Some jobs spin up hundreds of mappers
– Pushing many gigabytes of data in a very short period of time
 This overwhelms a Kafka cluster
– Spurious alerts for under replicated partitions
– Problems with mirroring the messages out
Kafka Quotas
 Quotas limit traffic based on client ID
– Specified in bytes/sec on a per-broker basis
– Not per-topic or per-partition
 Should be transparent to clients
– Accomplished by delaying the response to requests
– Newer clients have metrics specific to quotas for clarity
 We use it to protect the replication of the cluster
– Set it as high as possible while protecting against a single bad client
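Throttling by delaying responses can be sketched with a little arithmetic: measure the client's byte rate over a window and, if it exceeds the quota, hold the response long enough to bring the average back down. The formula and the `throttle_delay_ms` name are assumptions for this example and may differ from what the broker actually computes.

```python
def throttle_delay_ms(bytes_in_window: int, window_ms: int, quota_bytes_per_sec: float) -> float:
    """Delay to apply to a response so the client's average rate returns to the quota."""
    observed_rate = bytes_in_window / (window_ms / 1000.0)
    if observed_rate <= quota_bytes_per_sec:
        return 0.0  # under quota: respond immediately
    # Delay proportionally to how far over the quota the client is.
    return (observed_rate - quota_bytes_per_sec) / quota_bytes_per_sec * window_ms


# A client sent 20 MB in a 1 second window against a 10 MB/sec quota on this broker.
print(throttle_delay_ms(20_000_000, 1000, 10_000_000))
# A well-behaved client at half the quota is not delayed.
print(throttle_delay_ms(5_000_000, 1000, 10_000_000))
```

Because the client only sees a slower response, nothing fails on its side; it simply cannot send faster than the quota allows.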

Delete Topic
 Feature has been under development for almost 3 years
– Only recently has it even worked a little bit
– We’re still not sure about it (from SRE’s point of view)
 Recently performed additional testing so we can use it
– Found that even when disabled for a cluster, something was happening
– Some brokers claimed the topic was gone, some didn’t
– Mirror makers broke for the topic
 One of the code paths in the controller was not blocked
– Metadata change went out, but it was hard to diagnose
Brokers are Independent
 When there’s a problem in the cluster, brokers might have bad information
– The controller should tell them what the topic metadata is
– Brokers get out of sync due to connection issues or bugs
 There’s no good tool for just sending a request to a broker and reading the response
– We had to write a Java application just to send a metadata request
 Coming soon – kafka-protocol
– Simple CLI tool for sending individual requests to Kafka brokers
– Will be part of the repository
Conclusion
Broker Improvement - JBOD
 We use RAID-10 on all brokers
– Trade off a lot of performance for a little resiliency
– Lose half of our disk space
 Current JBOD implementation isn’t great
– No admin tools for moving partitions
– Assignment is round-robin
– Broker shuts down if a single disk fails
 Looking at options
– Might try to fix the JBOD implementation in Kafka
– Testing running multiple brokers on a single server

Mirror Maker Improvements
 Mirror Maker has performance issues
– Has to decompress and recompress every message
– Loses information about partition affinity and strict ordering
 Developed an Identity message handler
– Messages in source partition 0 get produced directly to partition 0
– Requires mirror maker to maintain downstream partition counts
 Working on the next steps
– No decompression of message batches
– Looking at other options on how to run mirror makers
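The identity mapping can be sketched as follows. This is an in-memory illustration with hypothetical function names, not the handler's actual code; it shows why the target needs at least as many partitions as the source.

```python
from typing import Dict, List


def identity_partition(source_partition: int, target_partition_count: int) -> int:
    """Send a message to the same partition number it was read from."""
    if not 0 <= source_partition < target_partition_count:
        raise ValueError(
            f"target has {target_partition_count} partitions, "
            f"cannot mirror source partition {source_partition}"
        )
    return source_partition


def mirror(source_log: Dict[int, List[str]], target_partition_count: int) -> Dict[int, List[str]]:
    """Copy each source partition, in order, to the matching target partition."""
    target: Dict[int, List[str]] = {p: [] for p in range(target_partition_count)}
    for partition in sorted(source_log):
        for message in source_log[partition]:
            target[identity_partition(partition, target_partition_count)].append(message)
    return target


source_log = {0: ["a", "b"], 1: ["c"]}
print(mirror(source_log, 2))
```

Each target partition receives exactly the messages of one source partition in their original order, so both ordering and partition affinity survive the copy.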
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
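Getting Involved With Kafka
 Join the mailing lists, on either the development or the user side
 IRC – #apache-kafka on Freenode
 Meetups
– Bay Area, for both Kafka and Samza, with streaming if you are not local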
 Contribute code

SITE RELIABILITY ENGINEERING ©2016 LinkedIn Corporation. All Rights Reserved.

So how can you get more involved in the Kafka community? The most obvious answer is to go to the Apache Kafka website. From there you can join the mailing lists, on either the development or the user side. You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions. We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming if you are not local. You can also dive into the source repository, and work on and contribute your own tools back. Kafka may be young, but it’s a critical piece of data infrastructure for many of us.