SITE RELIABILITY ENGINEERING ©2016 LinkedIn Corporation. All Rights Reserved.
Multi-Tier, Multi-Tenant, Multi-Problem Kafka
Todd Palino
Who Am I?
What Will We Talk About?
 Multi-Tenant Pipelines
 Multi-Tier Architecture
 Interesting Problems (Why I Drink)
 Conclusion
Multi-Tenant Pipelines
Tracking and Data Deployment
 Tracking – Data going to HDFS
 Data Deployment – Hadoop job results going to online applications
 Many shared topics
 Schemas require a common header
 All message counts are audited
 Special Problems
– Hard to tell what application is dropping messages
– Some of these messages are copied 42 times!
Metrics
 Application and OS metrics
 Deployment and build system events
 Service calls – sampling of timing information for individual application calls
 Some application logs
 Special Problems
– Every server in the datacenter produces to this cluster at least twice
– Graphing/Alerting system consumes the metrics 20 times
Logging
 Application logging messages destined for ELK clusters
 Lower retention than other clusters
 Loosest restrictions on message schema and encoding
 Special Problems
– Not many – it’s still overprovisioned
– Customers starting to ask about aggregation
Queuing
 Everything else
 Primarily messages internal to applications
 Also emails and user messaging
 Messages are Avro encoded, but do not require headers
 Special Problems:
– Many messages which use unregistered schemas
– Clusters can have very high message rates (but not large data)
Special Case Clusters
 Not all use cases fit multi-tenancy
– Custom configurations are needed
– Tighter performance guarantees
– Use of topic deletion
 Espresso (KV store) internal replication
 Brooklin – Change capture
 Replication from Hadoop to Voldemort
Tiered Cluster Architecture
One Kafka Cluster
Multiple Clusters – Message Aggregation
Why Not Direct?
 Network Concerns
– Bandwidth
– Network partitioning
– Latency
 Security Concerns
– Firewalls and ACLs
– Encrypting data in transit
 Resource Concerns
– A misbehaving application can swamp production resources
What Do We Lose?
 You may lose message ordering
– Mirror maker breaks apart message batches and redistributes them
 You may lose key to partition affinity
– Mirror maker will partition based on the key
– Differing partition counts in source and target will result in differing distribution
– Mirror maker does not (without work) honor custom partitioning
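
The loss of key-to-partition affinity can be illustrated with a small sketch. It assumes a generic hash-modulo partitioner; `zlib.crc32` is only a stand-in hash chosen for illustration (Kafka's default partitioner uses a different hash function), and `partition_for` and `affinity_report` are hypothetical names.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing it modulo the partition count.

    zlib.crc32 is a stand-in hash for illustration only; it is not the
    hash used by Kafka's default partitioner.
    """
    return zlib.crc32(key) % num_partitions


def affinity_report(keys, source_partitions, target_partitions):
    """Return (key, source partition, target partition) for every key
    that lands on a different partition in the target cluster."""
    moved = []
    for key in keys:
        src = partition_for(key, source_partitions)
        dst = partition_for(key, target_partitions)
        if src != dst:
            moved.append((key, src, dst))
    return moved


if __name__ == "__main__":
    # Hypothetical keys; source topic has 8 partitions, target has 12.
    keys = [f"member-{i}".encode() for i in range(1000)]
    moved = affinity_report(keys, source_partitions=8, target_partitions=12)
    print(f"{len(moved)} of {len(keys)} keys land on a different partition")
```

With equal partition counts and the same hash, every key keeps its partition number; once the counts differ, a consumer that relies on "key X is always in partition N" sees a different layout downstream.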
Aggregation Rules
 Aggregate clusters are only for consuming messages
– Producing to an aggregate cluster is not allowed
– This assures all aggregate clusters have the same content
 Not every topic appears in PROD aggregate-tracking clusters
– Trying to discourage aggregate cluster usage in PROD
– All topics are available in CORP
 Aggregate-queuing is whitelist only and very restricted
– Please discuss your use case with us before developing
Interesting Problems
Buy The Book!
Early Access available now.
Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting.
Also discusses stream processing and other use cases.
Monitoring Using Kafka
 Monitoring and alerting are self-service
– No gatekeeper on what metrics are collected and stored
 Applications use a common container
– EventBus Kafka producer
– Simple annotation of metrics to collect
– Sampled service calls
– Application logs
 Everything is produced to Kafka and consumed by the monitoring infrastructure
Monitoring Kafka
 Kafka is great for monitoring your applications
KMon and EnlightIN
 Developed a separate monitoring and notification system
– Metrics are only retained long enough to alert on them
– One rule: we can’t use Kafka
 Alerting is simplified from our self-service system
– Nothing complex like regular expressions or RPNs
– Only used for critical Kafka and Zookeeper alerts
– Faster and more reliable
 Notifications are cleaner
– Alerts are grouped into incidents for fewer notifications when things break
– Notification system is generic and subscribable so we can use it for other things
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
Is Kafka Working?
 Knowing that the cluster is up isn’t always enough
– Network problems
– Metrics can lie
 Customers still ask us first if something breaks
– Part of the solution is educating them as to what to monitor
– Need to be absolutely sure of the answer “There’s nothing wrong with Kafka”
Kafka Monitoring Framework
 Producer to consumer testing of a Kafka cluster
– Assures that producers and consumers actually work
– Measures how long messages take to get through
 We have a SLO of 99.99% availability for all clusters
 Working on multi-tier support
– Answers the question of how long messages take to get to Hadoop
 LinkedIn Kafka Open Source
– https://github.com/linkedin/streaming
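
The producer-to-consumer test described above can be summarized roughly as follows. This is a minimal sketch under stated assumptions, not the monitoring framework's code: the `Probe` tuple, the `summarize` function, and the nearest-rank percentile are all choices made for illustration.

```python
import math
from typing import List, Optional, Tuple

# (sent_ms, received_ms); received_ms is None if the probe was never consumed.
Probe = Tuple[float, Optional[float]]


def summarize(probes: List[Probe], slo: float = 0.9999) -> dict:
    """Compute availability and 99th percentile latency from probe messages.

    Availability here is the fraction of produced probes that a consumer
    actually read back. The 0.9999 default mirrors the SLO on the slide.
    """
    delivered = [(s, r) for s, r in probes if r is not None]
    availability = len(delivered) / len(probes) if probes else 1.0

    latencies = sorted(r - s for s, r in delivered)
    p99 = None
    if latencies:
        # Nearest-rank percentile (an assumed, simple definition).
        rank = max(1, math.ceil(0.99 * len(latencies)))
        p99 = latencies[rank - 1]

    return {
        "availability": availability,
        "p99_latency_ms": p99,
        "meets_slo": availability >= slo,
    }


if __name__ == "__main__":
    probes = [(0.0, 12.0), (1000.0, 1008.0), (2000.0, None), (3000.0, 3020.0)]
    print(summarize(probes))
```

Measuring from the client side is what lets the test catch cases where the brokers report healthy metrics but messages do not actually get through.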
Is Mirroring Working?
 Most critical data flows through Kafka
– Most of that depends on mirror makers
– How do we make sure it all gets where it’s going?
 Mirror maker pipelines can have over a thousand topics
– Different message rates
– Some are more important than others
 Lag threshold monitoring doesn’t work
– Traffic spikes cause false alerts
– What should the threshold be?
– No easy way to monitor 1000 topics and over 10k partitions
Kafka Audit
 Audit tracks topic completeness across all clusters in the pipeline
– Primarily tracking messages
– Schema must have a valid header
– Alerts for DWH topics are set for 0.1% message loss
 Provided as an integrated part of the internal Kafka libraries
 Used for data completeness checks before Hadoop jobs run
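
A completeness check of this kind might look like the sketch below. It assumes that per-topic message counts for one time window have already been collected from each tier; the tier names, topic names, and the `audit` function are hypothetical, and only the 0.1% threshold comes from the slide.

```python
from typing import Dict, List, Tuple

LOSS_ALERT_THRESHOLD = 0.001  # 0.1% message loss, as stated on the slide


def audit(
    counts_by_tier: Dict[str, Dict[str, int]], source_tier: str
) -> List[Tuple[str, str, float]]:
    """Compare every downstream tier against the source tier, per topic.

    Returns (topic, tier, loss_fraction) for each topic whose loss in a
    downstream tier exceeds the alert threshold.
    """
    alerts = []
    source = counts_by_tier[source_tier]
    for tier, counts in counts_by_tier.items():
        if tier == source_tier:
            continue
        for topic, produced in source.items():
            if produced == 0:
                continue  # nothing was produced, so nothing can be lost
            received = counts.get(topic, 0)
            loss = 1.0 - received / produced
            if loss > LOSS_ALERT_THRESHOLD:
                alerts.append((topic, tier, loss))
    return alerts


if __name__ == "__main__":
    # Hypothetical counts for one audit window.
    counts = {
        "local": {"PageViewEvent": 100000, "LoginEvent": 5000},
        "aggregate": {"PageViewEvent": 99950, "LoginEvent": 4980},
    }
    for topic, tier, loss in audit(counts, source_tier="local"):
        print(f"{topic} in {tier}: {loss:.2%} of messages missing")
```

In this example `PageViewEvent` is missing 0.05% of its messages and stays under the threshold, while `LoginEvent` is missing 0.4% and is flagged.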
Auditing Message Flows
Burrow
 Burrow is an advanced Kafka consumer monitoring system
– Provides an objective view of consumer status
– Much more powerful than threshold-based lag monitoring
 Burrow is Open Source!
– Used by many other companies, including Wikimedia and Blizzard
– Used internally to assure all Mirror Makers and Audit are running correctly
 Exports metrics for all consumers to self-service monitoring
 https://github.com/linkedin/Burrow
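
To show why evaluating a window of samples is more robust than a fixed lag threshold, here is a simplified sketch. It is loosely modeled on the idea of judging consumer status from offset and lag trends; the rules, status names, and `evaluate_partition` function are assumptions for illustration and are not Burrow's actual implementation.

```python
from typing import List, Tuple

# Each sample is (committed_offset, lag), taken at a regular interval.
Sample = Tuple[int, int]


def evaluate_partition(window: List[Sample]) -> str:
    """Classify one consumer partition from a sliding window of samples.

    Simplified rules (assumed for this sketch):
      OK       if lag reached zero at any point in the window
      STALLED  if the committed offset never moved and lag is nonzero
      WARNING  if offsets advance but lag grew at every step
      OK       otherwise (the consumer is keeping up or catching up)
    """
    if len(window) < 2:
        return "OK"  # not enough data to judge a trend

    offsets = [offset for offset, _ in window]
    lags = [lag for _, lag in window]

    if any(lag == 0 for lag in lags):
        return "OK"
    if offsets[0] == offsets[-1]:
        return "STALLED"
    if all(later > earlier for earlier, later in zip(lags, lags[1:])):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(evaluate_partition([(100, 50), (200, 5000), (5200, 0)]))  # traffic spike
    print(evaluate_partition([(100, 10), (100, 20), (100, 30)]))    # no commits
    print(evaluate_partition([(100, 10), (150, 20), (200, 30)]))    # falling behind
```

A traffic spike that the consumer works through is reported as healthy, with no threshold to tune per topic, while a consumer that has stopped committing is flagged even when its absolute lag is small.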
MTTF Is Not Your Friend
 We have over 1800 Kafka brokers
– All have at least 12 drives, most have 16
– Dual CPUs, at least 64 GB of memory
– Really lousy Megaraid controllers
 This means hardware fails daily
– We don’t always know when it happens, if it doesn’t take the system down
– It can’t always be fixed immediately
– We can take one broker down, but not two
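
A back-of-the-envelope calculation shows why failures become a daily event at this scale. The broker and drive counts come from the slide; the 2% annualized drive failure rate is purely an assumed figure for illustration, and `expected_daily_failures` is a hypothetical helper.

```python
def expected_daily_failures(
    brokers: int, drives_per_broker: int, annual_failure_rate: float
) -> float:
    """Expected number of drive failures per day across the fleet."""
    total_drives = brokers * drives_per_broker
    return total_drives * annual_failure_rate / 365


if __name__ == "__main__":
    # 1800 brokers with at least 12 drives each (from the slide).
    # The 2% annual failure rate is an assumption, not a measured value.
    per_day = expected_daily_failures(1800, 12, 0.02)
    print(f"about {per_day:.2f} drive failures per day")
```

Even with the minimum of 12 drives per broker, the fleet holds 21,600 drives, so a low per-drive failure rate still works out to roughly one failed drive every day, before counting controllers, memory, or power supplies.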
Moving Partitions
 Prior to Kafka 0.8, moving partitions was basically impossible
– It’s still not easy – you have to be explicit about what you are moving
– There’s no good way to balance partitions in a cluster
 We developed kafka-assigner to solve the problem
– A single command to remove a broker and distribute its partitions
– Chainable modules for balancing partitions
– Open source! https://github.com/linkedin/kafka-tools
 Also working on “Cruise Control” for Kafka
– An add-on service that will handle redistributing partitions automatically
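
The idea of removing a broker and spreading its partitions can be sketched as below. This is not kafka-assigner's algorithm; it is a minimal greedy illustration, and the `remove_broker` function and the assignment layout (a dict from topic and partition to a replica list) are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Assignment = Dict[Tuple[str, int], List[int]]


def remove_broker(
    assignment: Assignment, broker_to_remove: int, brokers: List[int]
) -> Assignment:
    """Replace every replica on broker_to_remove with the least-loaded
    remaining broker that does not already hold that partition."""
    remaining = [b for b in brokers if b != broker_to_remove]

    # Count replicas per remaining broker to steer the greedy choice.
    load = Counter({b: 0 for b in remaining})
    for replicas in assignment.values():
        for broker in replicas:
            if broker != broker_to_remove:
                load[broker] += 1

    new_assignment = {}
    for tp, replicas in sorted(assignment.items()):
        new_replicas = list(replicas)
        if broker_to_remove in new_replicas:
            candidates = [b for b in remaining if b not in new_replicas]
            if not candidates:
                raise ValueError(f"no broker available for {tp}")
            # Break ties on broker id so the result is deterministic.
            target = min(candidates, key=lambda b: (load[b], b))
            # Keep the replica's position so leader preference is preserved.
            new_replicas[new_replicas.index(broker_to_remove)] = target
            load[target] += 1
        new_assignment[tp] = new_replicas
    return new_assignment


if __name__ == "__main__":
    current = {("t", 0): [1, 2], ("t", 1): [2, 3], ("t", 2): [3, 1], ("t", 3): [1, 4]}
    print(remove_broker(current, broker_to_remove=1, brokers=[1, 2, 3, 4]))
```

The output is only a proposed assignment; applying it still means handing an explicit reassignment to the cluster, which is the manual step the tooling is meant to hide.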
Pushing Data from Hadoop
 To help Hadoop jobs, we maintain a KafkaPushJob
– A mapper that produces messages to Kafka
– Pushes to data-deployment, which then gets mirrored to production
 Hadoop jobs tend to push a lot of data all at once
– Some jobs spin up hundreds of mappers
– Pushing many gigabytes of data in a very short period of time
 This overwhelms a Kafka cluster
– Spurious alerts for under-replicated partitions
– Problems with mirroring the messages out
Kafka Quotas
 Quotas limit traffic based on client ID
– Specified in bytes/sec on a per-broker basis
– Not per-topic or per-partition
 Should be transparent to clients
– Accomplished by delaying the response to requests
– Newer clients have metrics specific to quotas for clarity
 We use it to protect the replication of the cluster
– Set it as high as possible while protecting against a single bad client
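
Throttling by delaying responses can be illustrated with a short calculation. This is a simplified sketch of the idea, not the broker's code: the `throttle_delay_ms` function and the single fixed measurement window are assumptions.

```python
def throttle_delay_ms(
    observed_bytes: int, window_seconds: float, quota_bytes_per_sec: float
) -> float:
    """Delay needed so the client's average rate falls back to its quota.

    The delay stretches the measurement window until
    observed_bytes / (window_seconds + delay) equals the quota.
    """
    observed_rate = observed_bytes / window_seconds
    if observed_rate <= quota_bytes_per_sec:
        return 0.0  # under quota: respond immediately
    delay_seconds = observed_bytes / quota_bytes_per_sec - window_seconds
    return delay_seconds * 1000.0


if __name__ == "__main__":
    # A client sent 30 MB in a 10 second window against a 2 MB/sec quota.
    print(throttle_delay_ms(30_000_000, 10.0, 2_000_000))
```

Because the broker only holds the response back, a well-behaved client needs no code changes: it simply sees slower requests, and its own request rate drops as a result.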
Delete Topic
 Feature has been under development for almost 3 years
– Only recently has it even worked a little bit
– We’re still not sure about it (from SRE’s point of view)
 Recently performed additional testing so we can use it
– Found that even when disabled for a cluster, something was happening
– Some brokers claimed the topic was gone, some didn’t
– Mirror makers broke for the topic
 One of the code paths in the controller was not blocked
– Metadata change went out, but it was hard to diagnose
Brokers are Independent
 When there’s a problem in the cluster, brokers might have bad information
– The controller should tell them what the topic metadata is
– Brokers get out of sync due to connection issues or bugs
 There’s no good tool for just sending a request to a broker and reading the response
– We had to write a Java application just to send a metadata request
 Coming soon – kafka-protocol
– Simple CLI tool for sending individual requests to Kafka brokers
– Will be part of the https://github.com/linkedin/kafka-tools repository
Conclusion
Broker Improvement - JBOD
 We use RAID-10 on all brokers
– Trade off a lot of performance for a little resiliency
– Lose half of our disk space
 Current JBOD implementation isn’t great
– No admin tools for moving partitions
– Assignment is round-robin
– Broker shuts down if a single disk fails
 Looking at options
– Might try to fix the JBOD implementation in Kafka
– Testing running multiple brokers on a single server
Mirror Maker Improvements
 Mirror Maker has performance issues
– Has to decompress and recompress every message
– Loses information about partition affinity and strict ordering
 Developed an Identity message handler
– Messages in source partition 0 get produced directly to partition 0
– Requires mirror maker to maintain downstream partition counts
 Working on the next steps
– No decompression of message batches
– Looking at other options on how to run mirror makers
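
The identity mapping can be sketched as follows. The `Record` type and `identity_handler` function are hypothetical and do not reflect Mirror Maker's actual handler interface; the sketch only shows the partition-preserving rule and why the target partition count matters.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    topic: str
    partition: int
    key: Optional[bytes]
    value: bytes


def identity_handler(
    record: Record, target_partition_counts: Dict[str, int]
) -> Tuple[str, int]:
    """Return the (topic, partition) to produce to in the target cluster.

    The source partition number is reused as-is, so ordering within a
    partition and key-to-partition affinity carry over. This only works
    if the target topic has at least as many partitions as the source.
    """
    target_count = target_partition_counts.get(record.topic)
    if target_count is None or record.partition >= target_count:
        raise ValueError(
            f"target topic {record.topic!r} has no partition {record.partition}"
        )
    return record.topic, record.partition


if __name__ == "__main__":
    rec = Record("PageViewEvent", 3, b"member-42", b"payload")
    print(identity_handler(rec, {"PageViewEvent": 8}))
```

The error case is the reason the partition counts have to be maintained downstream: if someone expands the source topic and the target is not expanded to match, the identity mapping has nowhere to put the new partitions.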
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Bay Area – https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/
 Contribute code
Multi tier, multi-tenant, multi-problem kafka

  • 1. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Multi-Tier, Multi-Tenant, Multi-Problem Kafka
  • 2. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Todd Palino
  • 3. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Who Am I? 3
  • 4. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. What Will We Talk About?  Multi-Tenant Pipelines  Multi-Tier Architecture  Why I Drink Interesting Problems  Conclusion 4
  • 5. Multi-Tenant Pipelines
  • 6. Tracking and Data Deployment
      - Tracking: data going to HDFS
      - Data Deployment: Hadoop job results going to online applications
      - Many shared topics
      - Schemas require a common header
      - All message counts are audited
      - Special Problems
          - Hard to tell what application is dropping messages
          - Some of these messages are copied 42 times!
  • 7. Metrics
      - Application and OS metrics
      - Deployment and build system events
      - Service calls: sampling of timing information for individual application calls
      - Some application logs
      - Special Problems
          - Every server in the datacenter produces to this cluster at least twice
          - Graphing/Alerting system consumes the metrics 20 times
  • 8. Logging
      - Application logging messages destined for ELK clusters
      - Lower retention than other clusters
      - Loosest restrictions on message schema and encoding
      - Special Problems
          - Not many; it’s still overprovisioned
          - Customers starting to ask about aggregation
  • 9. Queuing
      - Everything else
      - Primarily messages internal to applications
      - Also emails and user messaging
      - Messages are Avro encoded, but do not require headers
      - Special Problems
          - Many messages which use unregistered schemas
          - Clusters can have very high message rates (but not large data)
  • 10. Special Case Clusters
      - Not all use cases fit multi-tenancy
          - Custom configurations that are needed
          - Tighter performance guarantees
          - Use of topic deletion
      - Espresso (KV store) internal replication
      - Brooklin: change capture
      - Replication from Hadoop to Voldemort
  • 11. Tiered Cluster Architecture
  • 12. One Kafka Cluster
  • 13. Multiple Clusters: Message Aggregation
  • 14. Why Not Direct?
      - Network Concerns
          - Bandwidth
          - Network partitioning
          - Latency
      - Security Concerns
          - Firewalls and ACLs
          - Encrypting data in transit
      - Resource Concerns
          - A misbehaving application can swamp production resources
  • 15. What Do We Lose?
      - You may lose message ordering
          - Mirror maker breaks apart message batches and redistributes them
      - You may lose key to partition affinity
          - Mirror maker will partition based on the key
          - Differing partition counts in source and target will result in differing distribution
          - Mirror maker does not (without work) honor custom partitioning
  • 16. Aggregation Rules
      - Aggregate clusters are only for consuming messages
          - Producing to an aggregate cluster is not allowed
          - This assures all aggregate clusters have the same content
      - Not every topic appears in PROD aggregate-tracking clusters
          - Trying to discourage aggregate cluster usage in PROD
          - All topics are available in CORP
      - Aggregate-queuing is whitelist only and very restricted
          - Please discuss your use case with us before developing
  • 17. Interesting Problems
  • 18. Buy The Book!
      - Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
  • 19. Monitoring Using Kafka
      - Monitoring and alerting are self-service
          - No gatekeeper on what metrics are collected and stored
      - Applications use a common container
          - EventBus Kafka producer
          - Simple annotation of metrics to collect
          - Sampled service calls
          - Application logs
      - Everything is produced to Kafka and consumed by the monitoring infrastructure
  • 20. Monitoring Kafka
      - Kafka is great for monitoring your applications
  • 21. KMon and EnlightIN
      - Developed a separate monitoring and notification system
          - Metrics are only retained long enough to alert on them
          - One rule: we can’t use Kafka
      - Alerting is simplified from our self-service system
          - Nothing complex like regular expressions or RPNs
          - Only used for critical Kafka and Zookeeper alerts
          - Faster and more reliable
      - Notifications are cleaner
          - Alerts are grouped into incidents for fewer notifications when things break
          - Notification system is generic and subscribable so we can use it for other things
  • 22. Broker Monitoring
      - Bytes In and Out, Messages In
          - Why not messages out?
      - Partitions
          - Count and Leader Count
          - Under Replicated and Offline
      - Threads
          - Network pool, Request pool
          - Max Dirty Percent
      - Requests
          - Rates and times: total, queue, local, and send
  • 23. Is Kafka Working?
      - Knowing that the cluster is up isn’t always enough
          - Network problems
          - Metrics can lie
      - Customers still ask us first if something breaks
          - Part of the solution is educating them as to what to monitor
          - Need to be absolutely sure of the answer “There’s nothing wrong with Kafka”
  • 24. Kafka Monitoring Framework
      - Producer to consumer testing of a Kafka cluster
          - Assures that producers and consumers actually work
          - Measures how long messages take to get through
      - We have an SLO of 99.99% availability for all clusters
      - Working on multi-tier support
          - Answers the question of how long messages take to get to Hadoop
      - LinkedIn Kafka Open Source
          - https://github.com/linkedin/streaming
  • 25. Is Mirroring Working?
      - Most critical data flows through Kafka
          - Most of that depends on mirror makers
          - How do we make sure it all gets where it’s going?
      - Mirror maker pipelines can have over a thousand topics
          - Different message rates
          - Some are more important than others
      - Lag threshold monitoring doesn’t work
          - Traffic spikes cause false alerts
          - What should the threshold be?
          - No easy way to monitor 1000 topics and over 10k partitions
  • 26. Kafka Audit
      - Audit tracks topic completeness across all clusters in the pipeline
          - Primarily tracking messages
          - Schema must have a valid header
          - Alerts for DWH topics are set for 0.1% message loss
      - Provided as an integrated part of the internal Kafka libraries
      - Used for data completeness checks before Hadoop jobs run
  • 27. Auditing Message Flows
  • 28. Burrow
      - Burrow is an advanced Kafka consumer monitoring system
          - Provides an objective view of consumer status
          - Much more powerful than threshold-based lag monitoring
      - Burrow is Open Source!
          - Used by many other companies, including Wikimedia and Blizzard
          - Used internally to assure all Mirror Makers and Audit are running correctly
      - Exports metrics for all consumers to self-service monitoring
      - https://github.com/linkedin/Burrow
  • 29. MTTF Is Not Your Friend
      - We have over 1800 Kafka brokers
          - All have at least 12 drives, most have 16
          - Dual CPUs, at least 64 GB of memory
          - Really lousy Megaraid controllers
      - This means hardware fails daily
          - We don’t always know when it happens, if it doesn’t take the system down
          - It can’t always be fixed immediately
          - We can take one broker down, but not two
  • 30. Moving Partitions
      - Prior to Kafka 0.8, moving partitions was basically impossible
          - It’s still not easy: you have to be explicit about what you are moving
          - There’s no good way to balance partitions in a cluster
      - We developed kafka-assigner to solve the problem
          - A single command to remove a broker and distribute its partitions
          - Chainable modules for balancing partitions
          - Open source! https://github.com/linkedin/kafka-tools
      - Also working on “Cruise Control” for Kafka
          - An add-on service that will handle redistributing partitions automatically
  • 31. Pushing Data from Hadoop
      - To help Hadoop jobs, we maintain a KafkaPushJob
          - A mapper that produces messages to Kafka
          - Pushes to data-deployment, which then gets mirrored to production
      - Hadoop jobs tend to push a lot of data all at once
          - Some jobs spin up hundreds of mappers
          - Pushing many gigabytes of data in a very short period of time
      - This overwhelms a Kafka cluster
          - Spurious alerts for under replicated partitions
          - Problems with mirroring the messages out
  • 32. Kafka Quotas
      - Quotas limit traffic based on client ID
          - Specified in bytes/sec on a per-broker basis
          - Not per-topic or per-partition
      - Should be transparent to clients
          - Accomplished by delaying the response to requests
          - Newer clients have metrics specific to quotas for clarity
      - We use it to protect the replication of the cluster
          - Set it as high as possible while protecting against a single bad client
  • 33. Delete Topic
      - Feature has been under development for almost 3 years
          - Only recently has it even worked a little bit
          - We’re still not sure about it (from SRE’s point of view)
      - Recently performed additional testing so we can use it
          - Found that even when disabled for a cluster, something was happening
          - Some brokers claimed the topic was gone, some didn’t
          - Mirror makers broke for the topic
      - One of the code paths in the controller was not blocked
          - Metadata change went out, but it was hard to diagnose
  • 34. Brokers are Independent
      - When there’s a problem in the cluster, brokers might have bad information
          - The controller should tell them what the topic metadata is
          - Brokers get out of sync due to connection issues or bugs
      - There’s no good tool for just sending a request to a broker and reading the response
          - We had to write a Java application just to send a metadata request
      - Coming soon: kafka-protocol
          - Simple CLI tool for sending individual requests to Kafka brokers
          - Will be part of the https://github.com/linkedin/kafka-tools repository
  • 35. Conclusion
  • 36. Broker Improvement - JBOD
      - We use RAID-10 on all brokers
          - Trade off a lot of performance for a little resiliency
          - Lose half of our disk space
      - Current JBOD implementation isn’t great
          - No admin tools for moving partitions
          - Assignment is round-robin
          - Broker shuts down if a single disk fails
      - Looking at options
          - Might try to fix the JBOD implementation in Kafka
          - Testing running multiple brokers on a single server
  • 37. Mirror Maker Improvements
      - Mirror Maker has performance issues
          - Has to decompress and recompress every message
          - Loses information about partition affinity and strict ordering
      - Developed an Identity message handler
          - Messages in source partition 0 get produced directly to partition 0
          - Requires mirror maker to maintain downstream partition counts
      - Working on the next steps
          - No decompression of message batches
          - Looking at other options on how to run mirror makers
  • 38. Administrative Improvements
      - Multiple cluster management
          - Topic management across clusters
          - Visualization of mirror maker paths
      - Better client monitoring
          - Burrow for consumer monitoring
          - No open source solution for producer monitoring (audit)
      - End-to-end availability monitoring
  • 39. Getting Involved With Kafka
      - http://kafka.apache.org
      - Join the mailing lists
          - users@kafka.apache.org
          - dev@kafka.apache.org
      - irc.freenode.net - #apache-kafka
      - Meetups
          - Bay Area: https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/
      - Contribute code
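
Slides 15 and 37 describe how key-to-partition affinity can be lost when source and target clusters have different partition counts, and how an identity handler avoids that by sending source partition N to target partition N. The sketch below illustrates both ideas under stated assumptions: `partition_for` uses CRC32 purely as a stand-in hash (it is not the hash Kafka clients use), and `identity_partition` and the `member-N` keys are hypothetical names chosen for illustration, not mirror maker code.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in key partitioner: stable hash of the key modulo the partition count.

    Assumption: CRC32 is used only for illustration.
    """
    return zlib.crc32(key) % num_partitions


def identity_partition(source_partition: int, target_partitions: int) -> int:
    """Identity mapping: a message from source partition N goes to target partition N."""
    if source_partition >= target_partitions:
        # The target must have at least as many partitions as the source,
        # which is why the partition counts have to be maintained downstream.
        raise ValueError("target cluster has too few partitions for identity mapping")
    return source_partition


# Hypothetical keys; the same key is re-partitioned in a target with a different count.
keys = [f"member-{i}".encode() for i in range(8)]
source = {k: partition_for(k, 8) for k in keys}
target = {k: partition_for(k, 12) for k in keys}
moved = [k for k in keys if source[k] != target[k]]
print(f"{len(moved)} of {len(keys)} keys land on a different partition number")
```

With key-based re-partitioning, a consumer that relied on "key X is always in partition P" in the source cluster cannot assume the same in the target. The identity mapping keeps that property, at the cost of keeping partition counts aligned.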

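The audit idea on slide 26 (compare message counts between tiers and alert at 0.1% loss) can be sketched roughly as follows. This is a minimal illustration, not LinkedIn's audit system: the tier names, the `TierCount` type, and the idea of comparing one topic's counts for a single time bucket are all assumptions made for the example. Only the 0.1% threshold comes from the slide.

```python
from dataclasses import dataclass

LOSS_ALERT_THRESHOLD = 0.001  # 0.1% message loss, the figure quoted for DWH topics


@dataclass
class TierCount:
    tier: str      # hypothetical tier name, e.g. "local", "aggregate", "hadoop"
    messages: int  # messages counted for one topic in one time bucket


def completeness(produced: int, seen: int) -> float:
    """Fraction of produced messages that were seen at a tier."""
    if produced == 0:
        return 1.0  # nothing was produced, so nothing can be missing
    return seen / produced


def audit(produced: int, tiers: list[TierCount]) -> list[str]:
    """Return one alert string per tier whose loss exceeds the threshold."""
    alerts = []
    for t in tiers:
        loss = 1.0 - completeness(produced, t.messages)
        if loss > LOSS_ALERT_THRESHOLD:
            alerts.append(f"{t.tier}: {loss:.3%} loss")
    return alerts


counts = [
    TierCount("local", 1_000_000),
    TierCount("aggregate", 999_500),  # 0.05% loss, below the threshold
    TierCount("hadoop", 998_000),     # 0.2% loss, above the threshold
]
print(audit(1_000_000, counts))
```

Note that duplicates (a tier seeing more messages than were produced) show up as negative loss here and do not alert. A real audit would likely want to report those separately.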
Editor's Notes

  1. So who am I, and why am I qualified to stand up here? I am a member of the Data Infrastructure Streaming SRE team at LinkedIn. We’re responsible for Kafka and Zookeeper operations, as well as Samza and a couple iterations of our change capture systems. SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one Operations position. Foremost, we are administrators. We manage all of the systems in our area. We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together. And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them. At the end of the day, our job is to keep the site running, always.
  2. What are the things we are going to cover in this talk? I’m going to assume some basic knowledge of what Kafka is and how it works, so I won’t be covering the basics. I’ll start by describing the Kafka pipelines we have set up at LinkedIn in our multi-tenant environment. This will transition into the tier architecture that many of those pipelines use. But I’ll spend most of our time on the interesting problems that we’ve run into in running Kafka at such a large scale. We’ll wrap up talking about a couple of the things that we’re working on now, and hopefully have some time for Q&A.
  3. I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck. Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems. Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At the present time, our consumers talk to Zookeeper as well and everything works well. In LinkedIn’s environment, all of these components live in the same datacenter, in the same network. What happens when you have two sites to deal with?
  4. Now we iterate on the architecture. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenter local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic – that can’t be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and assure works properly. This is a better situation than needing to have each consumer worry about it for themselves. We’ve definitely added complexity here, but it serves a purpose. By having the infrastructure be a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, it will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurances that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
  5. We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns. The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring. There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now. The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
  6. More components means that we have more places to poke and prod to get the most efficiency out of our system. With multiple tiers, most of this revolves around making sure the sizes of everything are correct.
  7. This is as good a time as any for a little self-promotion. Many of the questions around how to set up and lay out Kafka clusters, including specific performance concerns and tuning, are covered in this fine book that I am co-authoring. You’ll also find a trove of information about client development, stream processing, and a variety of use cases for Kafka. We currently have 4 chapters complete, and it’s available from O’Reilly under their early access program. We expect to have the book completed late this year, or early next, with chapters being released as soon as we can write them.
  8. Many of us use Kafka for monitoring applications. At LinkedIn, every application and server writes metrics and logs to Kafka. We have central applications that read out these metrics and provide pretty graphs, thresholding, and other tools to make sure that everything is running properly within LinkedIn. Kafka itself is no exception, which leads to this… As soon as I say “monitoring Kafka with Kafka”, we know this is not a good thing.
  9. For the broker, what are the critical metrics that I’m keeping an eye on every day? Bytes in, bytes out, and messages in are all critical metrics for us from a growth point of view. While we don’t alert on these, we do keep an eye on them because they help us to understand how the usage of the cluster is growing over time, and they let us plan for the next expansion. You may ask why I don’t have messages out on this list. It’s because there is no messages out metric. Kafka consumers read batches of messages, not single messages, and it’s not easy for Kafka to count messages on the outbound side. There’s a metric on the number of fetches, but it’s less interesting to me. For partitions, we start with the number of partitions per broker, and the number of leader partitions per broker. As we know, there is a single broker responsible for leadership for a given partition. In a healthy cluster, I want to make sure that each broker has approximately the same number of partitions, and that each broker is leading about 50% of those because we have a replication factor of 2 for most things. We can also see this reflected in the bytes rates, because if the partitions are imbalanced, the bytes rates will be as well. This gives us uneven load and that can cause a lot of problems. More importantly though, we monitor the number of under replicated partitions that each broker is reporting. I’m going to get into this in much more detail in a few slides, but this indicates the number of partitions that the broker is leader for where at least one of the replicas has fallen behind. This is the single most important metric to monitor and alert on. It indicates a number of problems and a single alert here will provide coverage of most Kafka issues. Lastly, there are metrics on the thread pool usage, both network and request pools, as well as rate and time metrics on the different types of requests. 
These are all examples of metrics that are good to have, but they’re difficult to alert on. If you are able to establish a good baseline on some of the request time metrics, I do recommend doing it, however, as rising request times can indicate a problem that is building up, and you may be able to see it before it becomes under replicated partitions. Buried in the middle there is the “max dirty percent” metric. This is a measurement of how many log segments are able to be compacted that are not currently compacted. Right now, this is the only way to monitor the health of log compaction within Kafka, which is critical for the consumer offsets topic at the very least. If the thread doing log compaction dies (which it can do frequently), the only way you will know is by this metric increasing and staying high. Normal behavior is for the metric to spike up and immediately drop back down again.
  10. There are a number of things that can be improved upon, both in the brokers and in the mirror maker, to make it easier to set up and manage multiple datacenters. Another big problem is that we are using RAID and providing a single mount point to the Kafka brokers for a log dir. This is because there are some issues with the way JBOD is handled in the broker. Specifically, the brokers assign partitions to log dirs by round robin, not taking into account current size. In addition, there are no administrative functions to move partitions from one directory to another. And if a single disk fails, the entire broker fails. If JBOD was more robust, we could have replication factors of 3 or 4 without an increase in hardware cost, which would allow us to have “no data loss” configurations.
  11. The big improvement to mirror maker is the creation of an identity mirror maker, which would keep message batches together in the exact same partition from source to target cluster. This would completely eliminate the compression overhead from the mirror maker, making it much faster and more efficient. Of course, this requires maintaining the partition counts in the clusters properly, and allowing the mirror maker to increase partition counts in a target cluster if needed.
  12. That leads into the idea of multi-cluster management. While there are a couple people making some headway on this in the open source world, we still lack a solid interface for managing Kafka clusters as part of an overall infrastructure. This would include maintaining topic configurations across multiple clusters and easily configuring and visualizing the mirror maker links between them. Another piece needed is better client monitoring overall. Burrow provides us with a good view of what the consumers are doing, but there’s nothing available yet for producer client monitoring. We, of course, have our internal audit system for this. And other companies have their own versions as well. It would be nice to have an open source solution that anyone can use for assuring that the producers are working properly. We could also use better end-to-end monitoring of our Kafka clusters, so we can know that they are available. We have a lot of metrics that can track information about the individual components, but without a client view of the cluster, we don’t know if the cluster is actually available. We also have a hard time making sure that the entire pipeline is working properly. There’s not a lot available for this right now, but watch this space…
  13. So how can you get more involved in the Kafka community? The most obvious answer is to go to kafka.apache.org. From there you can join the mailing lists, either on the development or the user side. You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions. We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming if you are not local. You can also dive into the source repository, and work on and contribute your own tools back. Kafka may be young, but it’s a critical piece of data infrastructure for many of us.
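
The broker health checks described in note 9 (roughly equal partition counts per broker, each broker leading about 50% of its partitions at a replication factor of 2, and no under replicated partitions) can be sketched as a simple check. This is an illustrative sketch only: `BrokerStats`, `check_cluster`, and the 10% tolerance are assumptions made for the example, and the numbers would in practice come from whatever system collects the broker metrics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BrokerStats:
    broker_id: int
    partitions: int        # partition replicas hosted on this broker
    leaders: int           # partitions this broker is leader for
    under_replicated: int  # under replicated partitions reported by this broker


def check_cluster(stats, replication_factor=2, tolerance=0.10):
    """Return (broker_id, finding) pairs for brokers that look unhealthy.

    Assumption: a 10% tolerance is an arbitrary illustrative value.
    """
    findings = []
    avg_partitions = mean(s.partitions for s in stats)
    expected_leader_ratio = 1 / replication_factor  # 50% when the factor is 2
    for s in stats:
        if s.under_replicated > 0:
            findings.append((s.broker_id, "under-replicated partitions"))
        if abs(s.partitions - avg_partitions) > tolerance * avg_partitions:
            findings.append((s.broker_id, "partition count imbalance"))
        ratio = s.leaders / s.partitions if s.partitions else 0.0
        if abs(ratio - expected_leader_ratio) > tolerance:
            findings.append((s.broker_id, "leader ratio imbalance"))
    return findings


cluster = [
    BrokerStats(1, 100, 50, 0),
    BrokerStats(2, 100, 52, 0),
    BrokerStats(3, 100, 48, 3),  # reports under replicated partitions
]
print(check_cluster(cluster))
```

Following the note, the under replicated check is the one worth alerting on. The balance checks are better suited to periodic review than to paging.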