Netflix Data Pipeline
with Kafka
Allen Wang & Steven Wu
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What is Netflix?
Netflix is a logging company

that occasionally streams video
Numbers
● 400 billion events per day
● 8 million events & 17 GB per second during peak
● Hundreds of event types
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
Mission of Data Pipeline
Publish, Collect, Aggregate, Move Data @ Cloud Scale

In the old days ...
(diagram: Event Producer → S3 → EMR)
Nowadays ...
(diagram of the existing data pipeline: Event Producer, S3, Router, Druid, EMR, Stream Consumers)

In to the Future ...
New Data Pipeline
(diagram: Event Producer → Fronting Kafka → Router → S3, EMR, Druid, Consumer Kafka → Stream Consumers)
Serving Consumers off Diff Clusters
(diagram: Event Producer → Fronting Kafka → Router → S3, EMR, Druid; Stream Consumers read from Consumer Kafka)
Split Fronting Kafka Clusters
● Low-priority (error log, request trace, etc.)
o 2 copies, 1-2 hour retention
● Medium-priority (majority)
o 2 copies, 4 hour retention
● High-priority (streaming activities etc.)
o 3 copies, 12-24 hour retention
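As a rough, hypothetical illustration of the retention side of these tiers expressed as per-topic Kafka configs (not Netflix's actual tooling): "retention.ms" is a real topic-level setting, while the replica counts are fixed at topic-creation time and so appear only as comments.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

// Illustrative only: per-topic retention for the three fronting-cluster tiers.
public class FrontingTierConfigs {
    static Properties retention(long hours) {
        Properties p = new Properties();
        p.put("retention.ms", String.valueOf(TimeUnit.HOURS.toMillis(hours)));
        return p;
    }

    public static void main(String[] args) {
        Properties lowPriority    = retention(2);   // 2 copies, 1-2 hour retention
        Properties mediumPriority = retention(4);   // 2 copies, 4 hour retention
        Properties highPriority   = retention(24);  // 3 copies, 12-24 hour retention
        System.out.printf("low=%s medium=%s high=%s%n", lowPriority, mediumPriority, highPriority);
    }
}
```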

Producer Resilience
● Kafka outage should never disrupt existing instances from serving their business purpose
● Kafka outage should never prevent new instances from starting up
● After the Kafka cluster is restored, event producing should resume automatically
Fail but Never Block
● block.on.buffer.full=false
● Handle potential blocking of the first metadata request
● Periodically check whether the KafkaProducer was created successfully
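A minimal sketch of what this non-blocking setup around the (0.8.2-era) Java producer could look like; the holder class, executor, and drop-on-failure policy are illustrative assumptions, not Netflix's actual client.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative only: fail fast instead of ever blocking the caller.
public class NonBlockingProducerHolder {
    private final ExecutorService init = Executors.newSingleThreadExecutor();
    private volatile KafkaProducer<byte[], byte[]> producer;  // null until opened

    public NonBlockingProducerHolder(Properties props) {
        // Drop messages rather than block the app thread when the buffer is full
        // (the config from this slide; later Kafka versions replaced it with max.block.ms).
        props.put("block.on.buffer.full", "false");
        // Open the producer off the caller's thread so a blocking first metadata
        // request cannot stall instance startup; a periodic check can verify
        // that "producer" eventually became non-null.
        init.submit(() -> { producer = new KafkaProducer<>(props); });
    }

    /** Returns false (and drops the event) if Kafka is unavailable. */
    public boolean trySend(String topic, byte[] value) {
        KafkaProducer<byte[], byte[]> p = producer;
        if (p == null) {
            return false;  // producer not opened yet: never block the business path
        }
        try {
            p.send(new ProducerRecord<>(topic, value));  // asynchronous send
            return true;
        } catch (RuntimeException e) {
            return false;  // fail, but never block
        }
    }
}
```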
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What Does It Take to Run in the Cloud
● Support elasticity
● Respond to scaling events
● Resilience to failures
o Favor an architecture without a single point of failure
o Retries, smart routing, fallback ...

Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Netflix Kafka Container
(diagram: Kafka JVM containing Kafka, Metric reporting, Health check service, Bootstrap)
Bootstrap
● Broker ID assignment
o Instances obtain sequential numeric IDs using Curator’s locks recipe, persisted in ZK
o Clean up entries for terminated instances and reuse their IDs
o Same ID upon restart
● Bootstrap Kafka properties from Archaius
o Files
o System properties/Environment variables
o Persisted properties service
● Service registration
o Register with Eureka for internal service discovery
o Register with AWS Route53 DNS service
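A rough sketch of how sequential broker ID assignment with Curator's lock recipe might look; the ZooKeeper paths and the reuse policy here are assumptions, not Netflix's actual bootstrap code, and cleanup of terminated instances is omitted.

```java
import java.util.Collections;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative sketch: claim the lowest free numeric broker ID under a lock.
// Paths ("/netflix/kafka/broker-ids", "/locks/broker-id") are hypothetical.
public class BrokerIdAssigner {
    public static int claimBrokerId(String zkConnect, String instanceId) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/broker-id");
        lock.acquire();
        try {
            String base = "/netflix/kafka/broker-ids";
            List<String> taken = client.checkExists().forPath(base) == null
                    ? Collections.emptyList()
                    : client.getChildren().forPath(base);
            int id = 0;
            while (taken.contains(String.valueOf(id))) {
                byte[] owner = client.getData().forPath(base + "/" + id);
                if (new String(owner).equals(instanceId)) {
                    return id;  // same ID upon restart
                }
                id++;
            }
            // Persist the claimed ID so the same instance gets it back after a restart.
            client.create().creatingParentsIfNeeded()
                  .forPath(base + "/" + id, instanceId.getBytes());
            return id;
        } finally {
            lock.release();
        }
    }
}
```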
Metric Reporting
● We use Servo and Atlas from NetflixOSS
(diagram: Kafka → JMX → MetricReporter (Yammer → Servo adaptor) → Atlas Service)
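As a simplified illustration of the JMX hop in the diagram (not the actual Yammer-to-Servo adaptor), a poller could read the broker's Kafka MBeans and forward the values; the MBean pattern and the reporting hook are assumptions.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;

import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustrative JMX poller: reads Kafka broker metrics in-process and hands the
// values to whatever reporting hook you use (Servo/Atlas at Netflix).
public class JmxMetricPoller {
    public static void pollOnce() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Kafka 0.8 exposes its Yammer metrics over JMX under "kafka.*" domains;
        // the exact pattern to poll is an assumption.
        Set<ObjectName> names = server.queryNames(new ObjectName("kafka.server:*"), null);
        for (ObjectName name : names) {
            for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                try {
                    Object value = server.getAttribute(name, attr.getName());
                    report(name + "." + attr.getName(), value);
                } catch (Exception ignored) {
                    // some attributes are not readable; skip them
                }
            }
        }
    }

    private static void report(String metric, Object value) {
        System.out.println(metric + " = " + value);  // placeholder for the Servo adaptor
    }
}
```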

Kafka Atlas Dashboard
Health check service
● Use Curator to periodically read ZooKeeper data to find signs of unhealthiness
● Export metrics to Servo/Atlas
● Expose the service via embedded Jetty
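A minimal sketch of this shape (a Curator read exposed over embedded Jetty); the ZooKeeper check shown here, the expected-broker count, and the port are assumptions about what "signs of unhealthiness" means.

```java
import java.io.IOException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.curator.framework.CuratorFramework;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

// Illustrative sketch: report unhealthy if fewer brokers are registered in ZK
// than expected ("/brokers/ids" is Kafka's broker registry path).
public class KafkaHealthCheckService extends AbstractHandler {
    private final CuratorFramework zk;
    private final int expectedBrokers;

    public KafkaHealthCheckService(CuratorFramework zk, int expectedBrokers) {
        this.zk = zk;
        this.expectedBrokers = expectedBrokers;
    }

    @Override
    public void handle(String target, Request baseRequest,
                       HttpServletRequest request, HttpServletResponse response) throws IOException {
        try {
            int live = zk.getChildren().forPath("/brokers/ids").size();
            response.setStatus(live >= expectedBrokers ? 200 : 500);
            response.getWriter().println("live brokers: " + live);  // also exported to Servo/Atlas
        } catch (Exception e) {
            response.setStatus(500);
        }
        baseRequest.setHandled(true);
    }

    public static void startServer(KafkaHealthCheckService handler) throws Exception {
        Server jetty = new Server(8077);  // port is arbitrary
        jetty.setHandler(handler);
        jetty.start();
    }
}
```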
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
ZooKeeper
● Dedicated 5-node cluster for our data pipeline services
● EIP based
● SSD instances

Auditor
● Highly configurable producers and consumers with their own set of topics and metadata in messages
● Built as a service deployable on single or multiple instances
● Runs as producer, consumer or both
● Supports replay of a preconfigured set of messages
Auditor
● Broker monitoring (Heartbeating)
Auditor
● Broker performance testing
o Produce tens of thousands of messages per second on a single instance
o Act as consumers to test consumer impact
Kafka admin UI
● Still searching …
● Currently trying out KafkaManager

Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Challenges
● ZooKeeper client issues
● Cluster scaling
● Producer/consumer/broker tuning
ZooKeeper Client
● Challenges
o Broker/consumer cannot survive ZooKeeper cluster rolling push due to caching of private IP
o Temporary DNS lookup failure at new session initialization kills future communication

ZooKeeper Client
● Solutions
o Created our internal fork of the Apache ZooKeeper client
o Periodically refresh private IP resolution
o Fall back to the last good private IP resolution upon DNS lookup failure
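The resolution logic might look roughly like this sketch (plain java.net rather than the actual forked client): re-resolve periodically and keep the last good answer as a fallback when DNS hiccups.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch of the fork's behavior: refresh the private IP resolution
// periodically and fall back to the last successful answer when DNS fails.
public class RefreshingHostResolver {
    private final String hostname;
    private volatile InetAddress lastGood;  // last successful resolution

    public RefreshingHostResolver(String hostname) {
        this.hostname = hostname;
    }

    /** Called periodically (e.g. from a scheduled task) and before reconnects. */
    public InetAddress resolve() throws UnknownHostException {
        try {
            InetAddress fresh = InetAddress.getByName(hostname);  // picks up replaced instances
            lastGood = fresh;
            return fresh;
        } catch (UnknownHostException e) {
            if (lastGood != null) {
                return lastGood;  // transient DNS failure: do not kill the session
            }
            throw e;  // never resolved successfully; nothing to fall back to
        }
    }
}
```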
Scaling
● Provisioned for peak traffic
o … and we have regional fail-over
Strategy #1 Add Partitions to New Brokers
● Caveats
o Most of our topics do not use keyed messages
o Number of partitions is still small
o Requires the high-level consumer
Strategy #1 Add Partitions to New Brokers
● Challenge: existing admin tools do not support atomically adding partitions and assigning them to new brokers

Strategy #1 Add Partitions to New Brokers
● Solution: created our own tool to do it in one ZooKeeper change, repeated for all or selected topics
● Reduced the time to scale up from a few hours to a few minutes
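The deck doesn't show the tool itself, but conceptually it amounts to rewriting the topic's partition-assignment znode in a single update. The sketch below (Curator plus hand-built JSON against Kafka 0.8's ZooKeeper layout) is an assumption about how that could be done, not Netflix's actual tool.

```java
import org.apache.curator.framework.CuratorFramework;

// Illustrative only: append new partitions, assigned to new brokers, by rewriting
// /brokers/topics/<topic> in one ZooKeeper update. Kafka 0.8 stores the assignment
// as {"version":1,"partitions":{"0":[brokerIds], ...}}. The existing assignment is
// passed in pre-parsed; real tooling would read and merge the current JSON.
public class PartitionExpander {
    public static void addPartitions(CuratorFramework zk, String topic,
                                     java.util.Map<Integer, int[]> existing,
                                     int newPartitions, int[] newBrokerIds,
                                     int replicationFactor) throws Exception {
        java.util.Map<Integer, int[]> assignment = new java.util.TreeMap<>(existing);
        int nextPartition = existing.size();
        for (int i = 0; i < newPartitions; i++) {
            int[] replicas = new int[replicationFactor];
            for (int r = 0; r < replicationFactor; r++) {
                replicas[r] = newBrokerIds[(i + r) % newBrokerIds.length];  // spread replicas
            }
            assignment.put(nextPartition++, replicas);
        }
        zk.setData().forPath("/brokers/topics/" + topic, toJson(assignment).getBytes());
    }

    private static String toJson(java.util.Map<Integer, int[]> assignment) {
        StringBuilder sb = new StringBuilder("{\"version\":1,\"partitions\":{");
        boolean first = true;
        for (java.util.Map.Entry<Integer, int[]> e : assignment.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(e.getKey()).append("\":[");
            for (int i = 0; i < e.getValue().length; i++) {
                if (i > 0) sb.append(',');
                sb.append(e.getValue()[i]);
            }
            sb.append(']');
        }
        return sb.append("}}").toString();
    }
}
```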
Strategy #2 Move Partitions
● Should work without preconditions, but ...
● Huge increase in network I/O affecting incoming traffic
● A much longer process than adding partitions
● Sometimes confusing error messages
● Would work if the pace of replication could be controlled
Scale down strategy
● There is none
● Looking for more support to automatically move all partitions from a set of brokers to a different set
Client tuning
● Producer
o Batching is important to reduce CPU and network I/O on brokers
o Stick to one partition for a while when producing non-keyed messages
o “linger.ms” works well with the sticky partitioner
● Consumer
o With a huge number of consumers, set a proper fetch.wait.max.ms to reduce polling traffic on the broker
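A hedged illustration of the settings mentioned above: the endpoints, values, and partitioner class are placeholders, and the sticky behavior stands in for a custom partitioner (the built-in sticky partitioner only arrived in much later Kafka versions).

```java
import java.util.Properties;

// Illustrative config sketch; numbers are placeholders, not Netflix's values.
public class ClientTuningExample {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "fronting-kafka:9092");  // hypothetical endpoint
        p.put("linger.ms", "100");      // wait up to 100 ms so more records fit in a batch
        p.put("batch.size", "65536");   // larger batches -> fewer requests, less broker CPU
        // For non-keyed messages, a custom partitioner can stick to one partition
        // for a while so batches fill up, instead of spraying records randomly.
        p.put("partitioner.class", "com.example.StickyForAWhilePartitioner");  // hypothetical class
        return p;
    }

    static Properties consumerProps() {
        Properties p = new Properties();
        p.put("zookeeper.connect", "zk:2181");  // 0.8 high-level consumer
        p.put("group.id", "stream-consumers");
        // With a huge number of consumers, a higher fetch.wait.max.ms reduces
        // the rate of fetch (polling) requests hitting each broker.
        p.put("fetch.wait.max.ms", "500");
        return p;
    }
}
```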

Effect of batching
partitioner                    batched records per request   broker CPU util [1]
random without lingering       1.25                          75%
sticky without lingering       2.0                           50%
sticky with 100 ms lingering   15                            33%
[1] 10 MB & 10K msgs / second per broker, 1 KB per message
Broker tuning
● Use the G1 collector
● Use large page cache and memory
● Increase the max file descriptor limit if you have thousands of producers or consumers
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Roadmap
● Work with the Kafka community on rack/zone-aware replica assignment
● Failure resilience testing
o Chaos Monkey
o Chaos Gorilla
● Contribute to open source
o Kafka
o Schlep -- our messaging library including SQS and Kafka support
o Auditor

Thank you!
http://netflix.github.io/
http://techblog.netflix.com/
@NetflixOSS
@allenxwang
@stevenzwu

Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
Sally Laouacheria
 

Recently uploaded (20)

Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
 
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALLBLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
 

Netflix Data Pipeline With Kafka

  • 1. Netflix Data Pipeline with Kafka Allen Wang & Steven Wu
  • 2. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 4. Netflix is a logging company
  • 6. Numbers ● 400 billion events per day ● 8 million events & 17 GB per second during peak ● hundreds of event types
  • 7. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 8. Mission of Data Pipeline Publish, Collect, Aggregate, Move Data @ Cloud Scale
  • 9. In the old days ...
  • 13. Into the Future ...
  • 15. Serving Consumers off Diff Clusters (diagram: Event Producer → Fronting Kafka → Router → S3 / Druid / EMR / Consumer Kafka → Stream Consumers)
  • 16. Split Fronting Kafka Clusters ● Low-priority (error log, request trace, etc.) o 2 copies, 1-2 hour retention ● Medium-priority (majority) o 2 copies, 4 hour retention ● High-priority (streaming activities etc.) o 3 copies, 12-24 hour retention
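These per-tier retention targets map naturally onto Kafka's topic-level retention config. The fragment below is only an illustration of that mapping, not Netflix's actual tooling; the class name and exact millisecond values are assumptions, and the replication factor (2 or 3 copies) is chosen separately at topic-creation time.

```java
import java.util.Properties;

/** Illustrative per-tier topic configs matching the slide's retention targets. */
public class FrontingTierConfigs {
    // Low priority (error logs, request traces): ~1-2 hour retention, 2 copies.
    static Properties lowPriority() {
        Properties p = new Properties();
        p.put("retention.ms", String.valueOf(2L * 60 * 60 * 1000));
        return p;
    }

    // Medium priority (the majority of topics): ~4 hour retention, 2 copies.
    static Properties mediumPriority() {
        Properties p = new Properties();
        p.put("retention.ms", String.valueOf(4L * 60 * 60 * 1000));
        return p;
    }

    // High priority (streaming activities): ~12-24 hour retention, 3 copies.
    static Properties highPriority() {
        Properties p = new Properties();
        p.put("retention.ms", String.valueOf(24L * 60 * 60 * 1000));
        return p;
    }
}
```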
  • 17. Producer Resilience ● A Kafka outage should never disrupt existing instances from serving their business purpose ● A Kafka outage should never prevent new instances from starting up ● After the Kafka cluster is restored, event producing should resume automatically
  • 18. Fail but Never Block ● block.on.buffer.full=false ● Handle potential blocking of the first metadata request ● Periodically check whether the KafkaProducer was opened successfully
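The slides name the settings but not the surrounding code. The sketch below is one plausible way to apply them with the 0.8.2-era Java producer: fail fast when the buffer fills, bound the first metadata request, and open the producer off the application's critical path. The class name, thread name, and timeout value are assumptions, not Netflix's implementation.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NonBlockingSender {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private volatile KafkaProducer<byte[], byte[]> producer;

    public NonBlockingSender(final Properties props) {
        props.put("block.on.buffer.full", "false");      // fail fast instead of blocking the caller
        props.put("metadata.fetch.timeout.ms", "5000");  // bound the first metadata request (0.8.2-era config)
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Open the producer on a background thread so a Kafka outage
        // never prevents a new instance from starting up.
        new Thread(() -> {
            try {
                producer = new KafkaProducer<>(props);
                ready.set(true);
            } catch (Exception e) {
                // a watchdog would periodically retry the open in a real setup
            }
        }, "kafka-producer-init").start();
    }

    public void send(String topic, byte[] value) {
        if (!ready.get()) {
            return;                                      // drop rather than block the application
        }
        try {
            producer.send(new ProducerRecord<>(topic, value));  // async; never waits on the future
        } catch (Exception e) {
            // buffer full or other failure: count the drop and move on
        }
    }
}
```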
  • 19. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 20. What Does It Take to Run In Cloud ● Support elasticity ● Respond to scaling events ● Resilience to failures o Favors architecture without single point of failure o Retries, smart routing, fallback ...
  • 21. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 22. Netflix Kafka Container (diagram: a single Kafka JVM containing Kafka itself plus metric reporting, health check service, and bootstrap)
  • 23. Bootstrap ● Broker ID assignment o Instances obtain sequential numeric IDs using Curator’s locks recipe persisted in ZK o Clean up entries for terminated instances and reuse their IDs o Same ID upon restart ● Bootstrap Kafka properties from Archaius o Files o System properties/Environment variables o Persisted properties service ● Service registration o Register with Eureka for internal service discovery o Register with AWS Route53 DNS service
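The deck does not show the bootstrap code itself. The sketch below illustrates one way broker ID assignment could be done with Curator's lock recipe under the stated constraints (serialize assignment, reuse the same ID on restart, reuse freed IDs). The znode paths and class name are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

/** Sketch of broker ID assignment via Curator's lock recipe; paths and names are hypothetical. */
public class BrokerIdAssigner {
    private static final String IDS_PATH  = "/datapipeline/broker-ids";
    private static final String LOCK_PATH = "/datapipeline/broker-id-lock";

    public static int assignId(String zkConnect, String instanceId) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        zk.start();                                       // client stays open for later use
        InterProcessMutex lock = new InterProcessMutex(zk, LOCK_PATH);
        lock.acquire();                                   // serialize ID assignment across instances
        try {
            if (zk.checkExists().forPath(IDS_PATH) == null) {
                zk.create().creatingParentsIfNeeded().forPath(IDS_PATH);
            }
            List<String> taken = zk.getChildren().forPath(IDS_PATH);
            // Reuse the mapping for this instance if it already exists (same ID upon restart).
            for (String id : taken) {
                byte[] owner = zk.getData().forPath(IDS_PATH + "/" + id);
                if (instanceId.equals(new String(owner, StandardCharsets.UTF_8))) {
                    return Integer.parseInt(id);
                }
            }
            // Otherwise take the smallest unused numeric ID; IDs freed by cleaning up
            // entries of terminated instances get reused this way.
            int next = 0;
            while (taken.contains(Integer.toString(next))) {
                next++;
            }
            zk.create().forPath(IDS_PATH + "/" + next,
                    instanceId.getBytes(StandardCharsets.UTF_8));
            return next;
        } finally {
            lock.release();
        }
    }
}
```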
  • 24. Metric Reporting ● We use Servo and Atlas from NetflixOSS (diagram: Kafka → MetricsReporter (Yammer → Servo adaptor) → JMX → Atlas service)
  • 26. Health check service ● Use Curator to periodically read ZooKeeper data to find signs of unhealthiness ● Export metrics to Servo/Atlas ● Expose the service via embedded Jetty
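As a rough illustration of the "Curator plus embedded Jetty" combination, the sketch below checks whether a broker's ephemeral registration still exists under ZooKeeper's /brokers/ids and reports the result over HTTP. The port, class name, and the particular check are assumptions; the real service inspects more ZooKeeper data and also exports metrics to Servo/Atlas.

```java
import java.io.IOException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        final int brokerId = Integer.parseInt(args[0]);   // e.g. 0
        final CuratorFramework zk = CuratorFrameworkFactory.newClient(
                args[1],                                  // ZooKeeper connect string
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        Server jetty = new Server(7101);                  // hypothetical port
        jetty.setHandler(new AbstractHandler() {
            @Override
            public void handle(String target, Request baseRequest,
                               HttpServletRequest request, HttpServletResponse response)
                    throws IOException {
                // A live broker keeps an ephemeral znode under /brokers/ids/<id>.
                boolean healthy;
                try {
                    healthy = zk.checkExists().forPath("/brokers/ids/" + brokerId) != null;
                } catch (Exception e) {
                    healthy = false;
                }
                response.setStatus(healthy ? 200 : 500);
                baseRequest.setHandled(true);
            }
        });
        jetty.start();
        jetty.join();
    }
}
```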
  • 27. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 28. ZooKeeper ● Dedicated 5 node cluster for our data pipeline services ● EIP based ● SSD instance
  • 29. Auditor ● Highly configurable producers and consumers with their own set of topics and metadata in messages ● Built as a service deployable on single or multiple instances ● Runs as producer, consumer or both ● Supports replay of preconfigured set of messages
  • 31. Auditor ● Broker performance testing o Produces tens of thousands of messages per second on a single instance o Runs as a consumer to test consumer impact
  • 32. Kafka admin UI ● Still searching … ● Currently trying out KafkaManager
  • 34. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 35. Challenges ● ZooKeeper client issues ● Cluster scaling ● Producer/consumer/broker tuning
  • 36. ZooKeeper Client ● Challenges o Broker/consumer cannot survive ZooKeeper cluster rolling push due to caching of private IP o Temporary DNS lookup failure at new session initialization kills future communication
  • 37. ZooKeeper Client ● Solutions o Created our internal fork of the Apache ZooKeeper client o Periodically refresh private IP resolution o Fall back to the last good private IP resolution upon DNS lookup failure
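The internal fork is not public; the fragment below only sketches the resolution policy the slide describes (refresh periodically, keep the last good address on DNS failure). The class name is made up.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Sketch of the resolution policy: refresh periodically, keep the last good address on failure. */
class ResilientHostResolver {
    private volatile InetAddress lastGood;

    InetAddress resolve(String zkHost) {
        try {
            // A periodic refresh picks up new private IPs after a ZooKeeper rolling push.
            lastGood = InetAddress.getByName(zkHost);
        } catch (UnknownHostException e) {
            // Temporary DNS lookup failure: fall back to the last good resolution.
        }
        return lastGood;
    }
}
```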
  • 38. Scaling ● Provisioned for peak traffic o … and we have regional fail-over
  • 39. Strategy #1 Add Partitions to New Brokers ● Caveats o Most of our topics do not use keyed messages o The number of partitions is still small o Requires the high-level consumer
  • 40. Strategy #1 Add Partitions to new brokers ● Challenges: existing admin tools do not support atomically adding partitions and assigning them to new brokers
  • 41. Strategy #1 Add Partitions to new brokers ● Solutions: created our own tool to do it in one ZooKeeper change, repeated for all or selected topics ● Reduced the time to scale up from a few hours to a few minutes
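The tool itself is internal; the sketch below only illustrates the "one ZooKeeper change" idea, assuming the standard /brokers/topics/&lt;topic&gt; assignment format (e.g. {"version":1,"partitions":{"0":[100,101],"1":[101,102]}}). The class and method names are hypothetical, and the JSON passed in must already contain the existing partitions plus the new ones assigned to the new brokers.

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;

/** Hypothetical sketch: replace "add partitions" + "reassign" with a single znode write. */
class PartitionExpander {
    static void expandTopic(CuratorFramework zk, String topic, String fullAssignmentJson)
            throws Exception {
        // One atomic ZooKeeper write installs the new partitions and their broker assignment.
        zk.setData().forPath("/brokers/topics/" + topic,
                fullAssignmentJson.getBytes(StandardCharsets.UTF_8));
    }
}
```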
  • 42. Strategy #2 Move Partitions ● Should work without preconditions, but ... ● Huge increase in network I/O affecting incoming traffic ● A much longer process than adding partitions ● Sometimes confusing error messages ● Would work if the pace of replication could be controlled
  • 43. Scale down strategy ● There is none ● Looking for more tooling support to automatically move all partitions from a set of brokers to a different set
  • 44. Client tuning ● Producer o Batching is important to reduce CPU and network I/O on brokers o Stick to one partition for a while when producing non-keyed messages o “linger.ms” works well with a sticky partitioner ● Consumer o With a huge number of consumers, set a proper fetch.wait.max.ms to reduce polling traffic on brokers
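As a rough sketch of what such tuning might look like in client configs (the specific values here are assumptions, not the ones Netflix used):

```java
import java.util.Properties;

/** Illustrative producer/consumer settings for the tuning described above. */
public class ClientTuning {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("linger.ms", "100");     // give batches time to fill; pairs well with a sticky partitioner
        p.put("batch.size", "65536");  // bigger batches -> fewer requests, less broker CPU and network I/O
        // For non-keyed traffic, a custom partitioner that sticks to one partition for a
        // short interval would be plugged in via "partitioner.class" (name not shown in the deck).
        return p;
    }

    static Properties consumerProps() {
        Properties p = new Properties();
        p.put("fetch.wait.max.ms", "500");  // long-poll instead of tight polling against brokers
        p.put("fetch.min.bytes", "1024");   // wait for a minimum amount of data per fetch
        return p;
    }
}
```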
  • 45. Effect of batching [1]
        partitioner                   batched records per request   broker CPU util
        random without lingering      1.25                          75%
        sticky without lingering      2.0                           50%
        sticky with 100ms lingering   15                            33%
        [1] 10 MB & 10K msgs / second per broker, 1 KB per message
  • 46. Broker tuning ● Use the G1 collector ● Use a large page cache and plenty of memory ● Increase the max file descriptor limit if you have thousands of producers or consumers
  • 47. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 48. Road map ● Work with Kafka community on rack/zone aware replica assignment ● Failure resilience testing o Chaos Monkey o Chaos Gorilla ● Contribute to open source o Kafka o Schlep -- our messaging library including SQS and Kafka support o Auditor

Editor's Notes

  1. Netflix is a data-driven company. We just love data, and we log a lot of it. We have a broad range of events: error logs, request tracing, viewing activities, UI events. For example, a data scientist might generate recommendations of TV shows and movies based on viewing activity data. Metrics do not flow through our data pipeline; we have a separate, open-sourced metric system. The client is called Servo and the backend is called Atlas.
  2. Transfer data from producer to consumer reliably
  3. In the good old days, we only had one or two people supporting this pipeline. There is an Apache project for Chukwa, but our internal Chukwa code base has deviated from the open-source version since very early days; as far as we know, there is very little shared code.
  4. Nowadays there is a growing demand for real-time analytics. For example, some people want to feed data to ElasticSearch for real-time indexing, some want to feed data to Druid for real-time analytics, and some want to do stream processing using Storm, Spark, or Samza. So we introduced the real-time branch of the pipeline highlighted in the red box. In addition to uploading data to S3, Chukwa tees the traffic to Kafka; right now about 20% of traffic flows to the real-time Kafka branch. One important piece of the real-time branch is the routing service, which routes traffic from Kafka to ElasticSearch and Druid. Now our architecture looks a lot more complicated, and more importantly, there is overlap between what Chukwa does and what Kafka can do, so we want to remove the redundancy.
  5. We removed Chukwa and placed Kafka at the front gate of the pipeline. There are a few reasons for this move: Kafka has a very vibrant community and ecosystem with a lot of momentum; Kafka provides better durability with replication, which Chukwa doesn't support; and it simplifies the architecture by removing one major component (Chukwa) to maintain. In addition to Kafka, the other centerpiece of our infrastructure is the routing service, which can route traffic from Kafka to sinks like S3, ES, Druid, and Kafka. We are still in the middle of this transition: right now we are doing dual writes to both the old and new pipelines, and we plan to finish the migration in 2-3 months.
  6. Fronting Kafka clusters are the front gate to our data pipeline. There are hundreds of microservices and tens of thousands of instances producing events to our pipeline, so it is critical to keep the clusters up so that we don't lose data. Because Kafka supports the pub-sub model, anyone can attach a consumer to an existing topic, and it would be hard to control the fan-out of consumer traffic: if a high-traffic topic has many consumers, we can easily saturate the network links of the fronting clusters. We therefore copy data from the fronting clusters to dedicated clusters for serving consumers. This way we avoid any disruption from misbehaving consumers on the fronting Kafka. Data copying does cost extra resources; here we are trading resources for stability.
  7. We maintain a small number of fronting clusters, just as the Confluent team recommended in a recent blog post. We split incoming traffic into three clusters based on the importance of the events. The first cluster is for low-priority events, e.g. error logs, request tracing, and diagnostics data, where losing data occasionally won't be a big deal; for this cluster we keep only 2 copies of data with very low retention, like 1 or 2 hours. Medium-priority covers the majority of the use cases. For high-priority business events, we set the replication factor to 3 and retain data much longer, like 12 to 24 hours.
  8. Pretty much every application at Netflix sends events to our data pipeline. Whether we like it or not, we all know that outages are going to happen. What will happen to producers if there is a Kafka outage? As you may recall, our team's mission is to transfer data from producer to consumer reliably. In case of an outage, it is OK to drop messages, but it is NOT OK to affect the producer applications. Yes, data is important, but what is more important is that our users can still stream videos, because that's the service they pay us for. Since Kafka is the front gate of our pipeline, we tested pipeline outages by shutting down the Kafka cluster; we want to see that our producers can achieve these goals.
  9. In the new 0.8.2 Java producer, all requests are sent asynchronously except for one situation: the first metadata request can block.
  10. Now I will hand over to Allen, who will tell you more about Kafka.
  11. Instances come and go. Software should not assume that it will always run on the same set of physical machines; instead, it should make configuration, deployment, and start-up easy on any new instance. We constantly adjust server group sizes: we scale up to respond to more traffic from users or our upstream dependencies (middle-tier services), and scale down to save cost. Both are vital to keeping your software running in the cloud, and software should automatically respond to such events without much human intervention. Hardware failures are expected and can happen at any time, temporary network glitches are the norm, and a whole zone/rack outage can happen. Software should be resilient to such failures; we have observed that software without a single point of failure tends to run better in the cloud.
  12. We have three components running alongside Kafka in the same JVM.
  13. We need to map the string AWS instance ID to a numeric broker ID. Archaius is our open-source property management library. We need to register the Kafka service with service discovery so that other services know how to connect to Kafka.
  14. Servo is our application metric library. Atlas is our metric backend that provides aggregation and query functionality. Kafka's Yammer metrics are converted to Servo metrics and stored in JMX. An Atlas client later retrieves the metrics and sends them to the Atlas service.
  15. It is not quite the same as LinkedIn's auditing service: it does not track the flow of messages in the data pipeline; it is synthetic traffic.
  16. (Topic view) From a UI point of view, we are still in an early exploring stage. KafkaManager appears to be usable but has some issues. We definitely look forward to the admin UI from Confluent.
  17. The ZooKeeper cluster's public host name is resolved to its private IP, which is gone if the instance is terminated. The Apache ZooKeeper client caches the private IP for communicating with ZooKeeper and never refreshes it. The private IPs all change when we do a ZooKeeper rolling push, and after the rolling push, communications are basically stalled for both brokers and consumers.
  18. Provisioned for two regions' peak traffic. But not surprisingly, even with over-provisioning, we still have to scale up periodically to deal with growing traffic.
  19. Using the existing tools from Kafka, you need to first add new partitions, then create a reassignment plan, and then execute the new assignment, repeated for every topic. That's a lot of manual steps and eyeballing.
  20. Works with keyed topics and when there is a large number of partitions.
  21. We are still short on a scale-down strategy.
  22. Setup: message size 1 KB; traffic per broker 10 MB / 10K msgs per second; no compression; 36 topic partitions.
  23. For broker tuning, we followed the LinkedIn recommendation.
  24. Chaos Monkey randomly terminates an instance, which we are confident Kafka can survive. Chaos Gorilla simulates an AWS zone outage by removing all instances in a zone; we must have zone-aware replica assignment before we can try it.
  25. Before we open up for questions, I want to mention that we still have a lot to learn about operating Kafka at large scale in the cloud. We are looking forward to working with the community.