SlideShare a Scribd company logo
Netflix Data Pipeline
with Kafka
Allen Wang & Steven Wu
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What is Netflix?
Netflix is a logging company

Recommended for you

NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020

Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.

azure cloud analytics synapse data warehousing
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn: • The issues that arise when using the Hive table format at scale, and why we need a new table format • How a straightforward, elegant change in table format structure has enormous positive effects • The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it • The resulting benefits of this architectural design

high throughput and low latencyp99p99 conf
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.

awsemrhadoop
that occasionally streams video
Numbers
● 400 billion events per day
● 8 million events & 17 GB per second during
peak
● hundreds of event types
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
Mission of Data Pipeline
Publish, Collect, Aggregate, Move Data @
Cloud Scale

Recommended for you

Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3

Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. In this session, you'll learn: • Some background about big data at Netflix • Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive • How Iceberg maintains table metadata to make queries fast and reliable • The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse • How you can get started using Iceberg Speaker Ryan Blue, Software Engineer, Netflix

apache avroapache hiveapache parquet
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!

ksqlDB is a stream processing SQL engine, which allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Stream and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL like language and produce results again to a Kafka topic. By that, no single line of Java code has to be written and you can reuse your SQL knowhow. This lowers the bar for starting with stream processing significantly. ksqlDB offers powerful capabilities of stream processing, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for most part. This will be done in a live demo on a fictitious IoT sample.

ksqldbkafka
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013

The document discusses intra-cluster replication in Apache Kafka, including its architecture where partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas approach to strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability tradeoffs for producers and optimizing throughput for consumers.

In the old days ...
S3
EMR
Event
Producer
Nowadays ...
S3
Router
Druid
EMR
Existing Data Pipeline
Event
Producer
Stream
Consumers

Recommended for you

Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud

Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set. Speakers: Junjie Chen, Junping Du

Kafka 101
Kafka 101Kafka 101
Kafka 101

Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across clusters of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.

kafkajobteaserbeginner
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka

A brief introduction to Apache Kafka and describe its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that will help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.

streamingstreamsknolx
In to the Future ...
New Data Pipeline
S3
Router
Druid
EMR
Event
Producer
Stream
Consumers
Fronting
Kafka
Consumer
Kafka
Serving Consumers off Diff Clusters
S3
Router
Druid
EMR
Event
Producer
Stream
Consumers
Fronting
Kafka
Consumer
Kafka
Split Fronting Kafka Clusters
● Low-priority (error log, request trace, etc.)
o 2 copies, 1-2 hour retention
● Medium-priority (majority)
o 2 copies, 4 hour retention
● High-priority (streaming activities etc.)
o 3 copies, 12-24 hour retention

Recommended for you

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi

Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.

spark + ai summit

 *
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect

It covers a brief introduction to Apache Kafka Connect, giving insights about its benefits,use cases, motivation behind building Kafka Connect.And also a short discussion on its architecture.

apache kafkakafka connectapache kafka connect
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.

spark + ai summit

 *
Producer Resilience
● Kafka outage should never disrupt existing
instances from serving business purpose
● Kafka outage should never prevent new
instances from starting up
● After kafka cluster restored, event producing
should resume automatically
Fail but Never Block
● block.on.buffer.full=false
● handle potential blocking of first meta data
request
● Periodical check whether KafkaProducer
was opened successfully
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What Does It Take to Run In Cloud
● Support elasticity
● Respond to scaling events
● Resilience to failures
o Favors architecture without single point of failure
o Retries, smart routing, fallback ...

Recommended for you

Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh

This document discusses data mesh, a distributed data management approach for microservices. It outlines the challenges of implementing microservice architecture including data decoupling, sharing data across domains, and data consistency. It then introduces data mesh as a solution, describing how to build the necessary infrastructure using technologies like Kubernetes and YAML to quickly deploy data pipelines and provision data across services and applications in a distributed manner. The document provides examples of how data mesh can be used to improve legacy system integration, batch processing efficiency, multi-source data aggregation, and cross-cloud/environment integration.

Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot

Flink Forward San Francisco 2022. At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing. by Xiang Zhang & Pratyush Sharma & Xiaoman Dong

apache flinkstream processingbig data
Kafka basics
Kafka basicsKafka basics
Kafka basics

Kafka is an open-source message broker that provides high-throughput and low-latency data processing. It uses a distributed commit log to store messages in categories called topics. Processes that publish messages are producers, while processes that subscribe to topics are consumers. Consumers can belong to consumer groups for parallel processing. Kafka guarantees order and no lost messages. It uses Zookeeper for metadata and coordination.

kafkastreamingtutorial
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Netflix Kafka Container
Kafka
Metric reporting Health check
service Bootstrap
Kafka JVM
Bootstrap
● Broker ID assignment
o Instances obtain sequential numeric IDs using Curator’s locks recipe
persisted in ZK
o Cleans up entry for terminated instances and reuse its ID
o Same ID upon restart
● Bootstrap Kafka properties from Archaius
o Files
o System properties/Environment variables
o Persisted properties service
● Service registration
o Register with Eureka for internal service discovery
o Register with AWS Route53 DNS service
Metric Reporting
● We use Servo and Atlas from NetflixOSS
Kafka
MetricReporter
(Yammer → Servo adaptor)
JMX
Atlas Service

Recommended for you

Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake

Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.

Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents. We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern data Data Lakehouse. Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.

kafkasparkdata streaming
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka

This document provides an overview of Apache Kafka including its main components, architecture, and ecosystem. It describes how LinkedIn used Kafka to solve their data pipeline problem by decoupling systems and allowing for horizontal scaling. The key elements of Kafka are producers that publish data to topics, the Kafka cluster that stores streams of records in a distributed, replicated commit log, and consumers that subscribe to topics. Kafka Connect and the Schema Registry are also introduced as part of the Kafka ecosystem.

apache kafkakafkaschema registry
Kafka Atlas Dashboard
Health check service
● Use Curator to periodically read ZooKeeper
data to find signs of unhealthiness
● Export metrics to Servo/Atlas
● Expose the service via embedded Jetty
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
ZooKeeper
● Dedicated 5 node cluster for our data
pipeline services
● EIP based
● SSD instance

Recommended for you

Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015

Keystone - Processing over Half a Trillion events per day with 8 million events & 17 GB per second peaks, and at-least once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka based pipeline and support Spark Streaming.

samzapipelinebig data
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis

Mark Harrison presented on using Amazon Kinesis for event-driven microservices architectures. He discussed the limitations of traditional monolithic and microservice architectures, and how Kinesis can help address issues like tight coupling, high latency, and lack of event broadcasting. Key concepts around Kinesis included streams, shards, producers, consumers, and constraints like throughput limits. Harrison demonstrated a custom Scala/Akka client for Kinesis that provides asynchronous, non-blocking producers and consumers with features like throttling and checkpointing. Performance tests showed throughput scales linearly with additional shards. In closing, Harrison invited the audience to learn more about open source and job opportunities with Weight Watchers.

akkakinesiskafka
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker

We will talk about how we are migrating our Presto clusters from AWS EMR to Docker using production-grade orchestrators considering cluster management, configuration and monitoring. We will discuss between Hashicorp Nomad and Kubernetes as a base solution

big dataprestodocker
Auditor
● Highly configurable producers and
consumers with their own set of topics and
metadata in messages
● Built as a service deployable on single or
multiple instances
● Runs as producer, consumer or both
● Supports replay of preconfigured set of
messages
Auditor
● Broker monitoring (Heartbeating)
Auditor
● Broker performance testing
o Produce tens of thousands messages per second on
single instance
o As consumers to test consumer impact
Kafka admin UI
● Still searching …
● Currently trying out KafkaManager

Recommended for you

Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016. Topics covered: - Architectural design and principles for Keystone - Technologies that Keystone is leveraging - Best practices http://www.meetup.com/SF-Data-Engineering/events/228293610/

keystonenetflixreal time data infrastructure
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp

Apache Kafka is the most used data streaming broker by companies. It could manage millions of messages easily and it is the base of many architectures based in events, micro-services, orchestration, ... and now cloud environments. OpenShift is the most extended Platform as a Service (PaaS). It is based in Kubernetes and it helps the companies to deploy easily any kind of workload in a cloud environment. Thanks many of its features it is the base for many architectures based in stateless applications to build new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you to manage and deploy Apache Kafka brokers in OpenShift environments. These slides will introduce you Strimzi as a new component on OpenShift to manage your Apache Kafka clusters. Slides used at OpenShift Meetup Spain: - https://www.meetup.com/es-ES/openshift_spain/events/261284764/

apache kafkaopenshiftstrimzi
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline

http://www.oreilly.com/pub/e/3764 Keystone processes over 700 billion events per day (1 peta byte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in AWS cloud within a year. He'll also share plans on offering a Stream Processing as a Service for all of Netflix use.

kafkastream processingoreilly
Netflix Data Pipeline With Kafka
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Challenges
● ZooKeeper client issues
● Cluster scaling
● Producer/consumer/broker tuning
ZooKeeper Client
● Challenges
o Broker/consumer cannot survive ZooKeeper cluster
rolling push due to caching of private IP
o Temporary DNS lookup failure at new session
initialization kills future communication

Recommended for you

QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons

Disenchantment is a Netflix show following the medieval misadventures of a hard-drinking princess, her feisty elf, and her personal demon. In this talk, we will follow the story of Netflix’s container management platform, Titus, which powers critical aspects of the Netflix business (video encoding & streaming, big data, recommendations & machine learning, and other workloads). We’ll cover the challenges growing Titus from 10’s to 1000’s of workloads. We’ll talk about our feisty team’s work across container runtimes, scheduling & control plane, and cloud infrastructure integration. We’ll talk about the demons we’ve found on this journey covering operability, security, reliability and performance.

netflixtituscontainers
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2

In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix. The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache it evolves into an L1/L2 cache over RAM and SSDs. The second, Dynomite, is a framework to make any non-distributed data-store, distributed. Netflix's first implementation of Dynomite is based on Redis. Come learn about the products' features and hear from Thomson and Reuters, Diego Pacheco from Ilegra and other third party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.

dynomitenetflixossevcache
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016

Keystone processes over 700 billion events per day (1 peta byte) with at-least once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in AWS cloud within a year. We will also share our plans on offering a Stream Processing as a Service for all of Netflix use.

samzacloudmonal daxini
ZooKeeper Client
● Solutions
o Created our internal fork of Apache ZooKeeper
client
o Periodically refresh private IP resolution
o Fallback to last good private IP resolution upon DNS
lookup failure
Scaling
● Provisioned for peak traffic
o … and we have regional fail-over
Strategy #1 Add Partitions to New
Brokers
● Caveat
o Most of our topics do not use keyed messages
o Number of partitions is still small
o Require high level consumer
Strategy #1 Add Partitions to new
brokers
● Challenges: existing admin tools does not
support atomic adding partitions and
assigning to new brokers

Recommended for you

Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...

QuestDB es una base de datos open source de alto rendimiento. Mucha gente nos comentaba que les gustaría usarla como servicio, sin tener que gestionar las máquinas. Así que nos pusimos manos a la obra para desarrollar una solución que nos permitiese lanzar instancias de QuestDB con provisionado, monitorización, seguridad o actualizaciones totalmente gestionadas. Unos cuantos clusters de Kubernetes más tarde, conseguimos lanzar nuestra oferta de QuestDB Cloud. Esta charla es la historia de cómo llegamos ahí. Hablaré de herramientas como Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki o Grafana, pero también de retos como autenticación, facturación, multi-nube, o de a qué tienes que decir que no para poder sobrevivir en la nube.

analyticsdevopsquestdb
kafka
kafkakafka
kafka

This document discusses the evolution of Kafka clusters at AppsFlyer over time. The initial cluster had 4 brokers and handled hundreds of millions of messages with low partitioning and replication. A new cluster was designed with more brokers, replication across availability zones, and higher partitioning to support billions of messages. However, this led to issues like uneven leader distribution and failures. Various solutions were implemented like increasing brokers, splitting topics, and hardware upgrades. Ongoing testing and monitoring helped identify more problems and improvements around replication, partitioning, and automation. Key lessons learned included balancing replication and leaders, supporting dynamic changes, and thorough testing of failure scenarios.

Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming

This presentation discusses architectural differences between Apache Apex features with Spark Streaming. It discusses how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion. Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.

Strategy #1 Add Partitions to new
brokers
● Solutions: created our own tool to do it in
one ZooKeeper change and repeat for all or
selected topics
● Reduced the time to scale up from a few
hours to a few minutes
Strategy #2 Move Partitions
● Should work without precondition, but ...
● Huge increase of network I/O affecting
incoming traffic
● A much longer process than adding
partitions
● Sometimes confusing error messages
● Would work if pace of replication can be
controlled
Scale down strategy
● There is none
● Look for more support to automatically move
all partitions from a set of brokers to a
different set
Client tuning
● Producer
o Batching is important to reduce CPU and network
I/O on brokers
o Stick to one partition for a while when producing for
non-keyed messages
o “linger.ms” works well with sticky partitioner
● Consumer
o With huge number of consumers, set proper
fetch.wait.max.ms to reduce polling traffic on broker

Recommended for you

Kafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notificationsKafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notifications

This document discusses using Apache Kafka at scale to deliver real-time notifications. It describes using Kafka to listen to database events, process them into transactions, translate transactions into notifications, and push notifications to mobile devices in real-time. It outlines the system architecture, challenges faced around partitioning, scaling consumers, and upgrading Kafka clients. Monitoring metrics and consumer lag is also discussed using tools like Burrow and Datadog.

dublinkafkameetup
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter

Until recently, the Messaging team at Twitter had been running an in-house build Pub/Sub system, namely EventBus (built on top of Apache DistributedLog and Apache Bookkeeper, and similar in architecture to Apache Pulsar) to cater to our pubsub needs. In 2018, we made the decision to move to Apache Kafka by migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today, Kafka is now an essential piece of Twitter Infrastructure and processes over 200M messages per second. In this talk, we will share the learning and challenges in our journey moving to Apache Kafka.

kafka summitapache kafkatwitter
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT

Audience: Advanced About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand. This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production. Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more! Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited Bruno Lago is a solutions architect that has been involved with the Catalyst Cloud (New Zealand’s first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies. OpenStack Australia Day - Sydney 2016 https://events.aptira.com/openstack-australia-day-sydney-2016/

openstackdeploymentopenstack australia day
Effect of batching
partitioner batched records
per request
broker cpu util
[1]
random without
lingering
1.25 75%
sticky without
lingering
2.0 50%
sticky with 100ms
lingering
15 33%
[1] 10 MB & 10K msgs / second per broker, 1KB per message
Broker tuning
● Use G1 collector
● Use large page cache and memory
● Increase max file descriptor if you have
thousands of producers or consumers
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Road map
● Work with Kafka community on rack/zone
aware replica assignment
● Failure resilience testing
o Chaos Monkey
o Chaos Gorilla
● Contribute to open source
o Kafka
o Schlep -- our messaging library including SQS and
Kafka support
o Auditor

Recommended for you

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems

The document discusses Netflix's approach to handling big data problems. It summarizes Netflix's data pipeline system called Keystone that was built in a year to replace a legacy system. Keystone ingests over 1 trillion events per day and processes them using technologies like Kafka, Samza and Spark Streaming. The document emphasizes Netflix's culture of freedom and responsibility and how it helped the small team replace the legacy system without disruption while achieving massive scale.

simian armysparkdocker
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics

Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.

ureplicatorkafkaat-least once kafka
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud

"Confluent Cloud is a cloud-native service based on Apache Kafka. We run tens of thousands of clusters across all major cloud service providers (AWS, GCP and Azure). In this talk, we will go over our journey to make Confluent Cloud 10x faster than Apache Kafka. We will talk about how we designed our various workloads, the complexities involved in our cloud-native service, the challenges we faced, and the various pitfalls we ran into. We will also cover the interesting learnings, which in hindsight, are first principles from this multi-year journey. By attending this talk, attendees will be able to take our learnings from making Confluent Cloud latencies 10x better and possibly apply similar principles to their cloud native data streaming systems."

current 23confluent clouddata streaming
Thank you!
http://netflix.github.io/
http://techblog.netflix.com/
@NetflixOSS
@allenxwang
@stevenzwu

More Related Content

What's hot

Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
Guozhang Wang
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Jean-Paul Azar
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
Jun Rao
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
Clement Demonchy
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
Knoldus Inc.
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Kafka basics
Kafka basicsKafka basics
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 

What's hot (20)

Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Kafka basics
Kafka basicsKafka basics
Kafka basics
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 

Similar to Netflix Data Pipeline With Kafka

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Ricardo Bravo
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
Mark Harrison
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
Federico Palladoro
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Peter Bakas
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
José Román Martín Gil
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
Monal Daxini
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Monal Daxini
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
javier ramirez
 
kafka
kafkakafka
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
Apache Apex
 
Kafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notificationsKafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notifications
Sérgio Nunes
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
HostedbyConfluent
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
OpenStack
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
Monal Daxini
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
Ankur Bansal
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
HostedbyConfluent
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
kloia
 

Similar to Netflix Data Pipeline With Kafka (20)

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
kafka
kafkakafka
kafka
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Kafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notificationsKafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notifications
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
 

Recently uploaded

What is OCR Technology and How to Extract Text from Any Image for Free
What is OCR Technology and How to Extract Text from Any Image for FreeWhat is OCR Technology and How to Extract Text from Any Image for Free
What is OCR Technology and How to Extract Text from Any Image for Free
TwisterTools
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
Philip Schwarz
 
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
bhatinidhi2001
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdfAWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
karim wahed
 
Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.
shivamt017
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
ThousandEyes
 
FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)
FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)
FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)
Roshan Dwivedi
 
Safe Work Permit Management Software for Hot Work Permits
Safe Work Permit Management Software for Hot Work PermitsSafe Work Permit Management Software for Hot Work Permits
Safe Work Permit Management Software for Hot Work Permits
sheqnetworkmarketing
 
Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …
908dutch
 
Attendance Tracking From Paper To Digital
Attendance Tracking From Paper To DigitalAttendance Tracking From Paper To Digital
Attendance Tracking From Paper To Digital
Task Tracker
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
SSTech System
 
Migrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS CloudMigrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS Cloud
Ortus Solutions, Corp
 
Development of Chatbot Using AI\ML Technologies
Development of Chatbot Using AI\ML TechnologiesDevelopment of Chatbot Using AI\ML Technologies
Development of Chatbot Using AI\ML Technologies
MaisnamLuwangPibarel
 
WhatsApp Tracker - Tracking WhatsApp to Boost Online Safety.pdf
WhatsApp Tracker -  Tracking WhatsApp to Boost Online Safety.pdfWhatsApp Tracker -  Tracking WhatsApp to Boost Online Safety.pdf
WhatsApp Tracker - Tracking WhatsApp to Boost Online Safety.pdf
onemonitarsoftware
 
Splunk_Remote_Work_Insights_Overview.pptx
Splunk_Remote_Work_Insights_Overview.pptxSplunk_Remote_Work_Insights_Overview.pptx
Splunk_Remote_Work_Insights_Overview.pptx
sudsdeep
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
Severalnines
 
introduction of Ansys software and basic and advance knowledge of modelling s...
introduction of Ansys software and basic and advance knowledge of modelling s...introduction of Ansys software and basic and advance knowledge of modelling s...
introduction of Ansys software and basic and advance knowledge of modelling s...
sachin chaurasia
 
Cultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational TransformationCultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational Transformation
Mindfire Solution
 
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTIONBITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
ssuser2b426d1
 
NBFC Software: Optimize Your Non-Banking Financial Company
NBFC Software: Optimize Your Non-Banking Financial CompanyNBFC Software: Optimize Your Non-Banking Financial Company
NBFC Software: Optimize Your Non-Banking Financial Company
NBFC Softwares
 

Recently uploaded (20)

What is OCR Technology and How to Extract Text from Any Image for Free
What is OCR Technology and How to Extract Text from Any Image for FreeWhat is OCR Technology and How to Extract Text from Any Image for Free
What is OCR Technology and How to Extract Text from Any Image for Free
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
 
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdfAWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
 
Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
 
FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)
FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)
FAST Channels: Explosive Growth Forecast 2024-2027 (Buckle Up!)
 
Safe Work Permit Management Software for Hot Work Permits
Safe Work Permit Management Software for Hot Work PermitsSafe Work Permit Management Software for Hot Work Permits
Safe Work Permit Management Software for Hot Work Permits
 
Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …
 
Attendance Tracking From Paper To Digital
Attendance Tracking From Paper To DigitalAttendance Tracking From Paper To Digital
Attendance Tracking From Paper To Digital
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
 
Migrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS CloudMigrate your Infrastructure to the AWS Cloud
Migrate your Infrastructure to the AWS Cloud
 
Development of Chatbot Using AI\ML Technologies
Development of Chatbot Using AI\ML TechnologiesDevelopment of Chatbot Using AI\ML Technologies
Development of Chatbot Using AI\ML Technologies
 
WhatsApp Tracker - Tracking WhatsApp to Boost Online Safety.pdf
WhatsApp Tracker -  Tracking WhatsApp to Boost Online Safety.pdfWhatsApp Tracker -  Tracking WhatsApp to Boost Online Safety.pdf
WhatsApp Tracker - Tracking WhatsApp to Boost Online Safety.pdf
 
Splunk_Remote_Work_Insights_Overview.pptx
Splunk_Remote_Work_Insights_Overview.pptxSplunk_Remote_Work_Insights_Overview.pptx
Splunk_Remote_Work_Insights_Overview.pptx
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
 
introduction of Ansys software and basic and advance knowledge of modelling s...
introduction of Ansys software and basic and advance knowledge of modelling s...introduction of Ansys software and basic and advance knowledge of modelling s...
introduction of Ansys software and basic and advance knowledge of modelling s...
 
Cultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational TransformationCultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational Transformation
 
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTIONBITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
 
NBFC Software: Optimize Your Non-Banking Financial Company
NBFC Software: Optimize Your Non-Banking Financial CompanyNBFC Software: Optimize Your Non-Banking Financial Company
NBFC Software: Optimize Your Non-Banking Financial Company
 

Netflix Data Pipeline With Kafka

  • 1. Netflix Data Pipeline with Kafka Allen Wang & Steven Wu
  • 2. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 4. Netflix is a logging company
  • 6. Numbers ● 400 billion events per day ● 8 million events & 17 GB per second during peak ● hundreds of event types
  • 7. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 8. Mission of Data Pipeline Publish, Collect, Aggregate, Move Data @ Cloud Scale
  • 9. In the old days ...
  • 13. In to the Future ...
  • 15. Serving Consumers off Diff Clusters S3 Router Druid EMR Event Producer Stream Consumers Fronting Kafka Consumer Kafka
  • 16. Split Fronting Kafka Clusters ● Low-priority (error log, request trace, etc.) o 2 copies, 1-2 hour retention ● Medium-priority (majority) o 2 copies, 4 hour retention ● High-priority (streaming activities etc.) o 3 copies, 12-24 hour retention
  • 17. Producer Resilience ● Kafka outage should never disrupt existing instances from serving business purpose ● Kafka outage should never prevent new instances from starting up ● After kafka cluster restored, event producing should resume automatically
  • 18. Fail but Never Block ● block.on.buffer.full=false ● handle potential blocking of first meta data request ● Periodical check whether KafkaProducer was opened successfully
  • 19. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 20. What Does It Take to Run In Cloud ● Support elasticity ● Respond to scaling events ● Resilience to failures o Favors architecture without single point of failure o Retries, smart routing, fallback ...
  • 21. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 22. Netflix Kafka Container Kafka Metric reporting Health check service Bootstrap Kafka JVM
  • 23. Bootstrap ● Broker ID assignment o Instances obtain sequential numeric IDs using Curator’s locks recipe persisted in ZK o Cleans up entry for terminated instances and reuse its ID o Same ID upon restart ● Bootstrap Kafka properties from Archaius o Files o System properties/Environment variables o Persisted properties service ● Service registration o Register with Eureka for internal service discovery o Register with AWS Route53 DNS service
  • 24. Metric Reporting ● We use Servo and Atlas from NetflixOSS Kafka MetricReporter (Yammer → Servo adaptor) JMX Atlas Service
  • 26. Health check service ● Use Curator to periodically read ZooKeeper data to find signs of unhealthiness ● Export metrics to Servo/Atlas ● Expose the service via embedded Jetty
  • 27. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 28. ZooKeeper ● Dedicated 5 node cluster for our data pipeline services ● EIP based ● SSD instance
  • 29. Auditor ● Highly configurable producers and consumers with their own set of topics and metadata in messages ● Built as a service deployable on single or multiple instances ● Runs as producer, consumer or both ● Supports replay of preconfigured set of messages
  • 31. Auditor ● Broker performance testing o Produce tens of thousands messages per second on single instance o As consumers to test consumer impact
  • 32. Kafka admin UI ● Still searching … ● Currently trying out KafkaManager
  • 34. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 35. Challenges ● ZooKeeper client issues ● Cluster scaling ● Producer/consumer/broker tuning
  • 36. ZooKeeper Client ● Challenges o Broker/consumer cannot survive ZooKeeper cluster rolling push due to caching of private IP o Temporary DNS lookup failure at new session initialization kills future communication
  • 37. ZooKeeper Client ● Solutions o Created our internal fork of Apache ZooKeeper client o Periodically refresh private IP resolution o Fallback to last good private IP resolution upon DNS lookup failure
  • 38. Scaling ● Provisioned for peak traffic o … and we have regional fail-over
  • 39. Strategy #1 Add Partitions to New Brokers ● Caveat o Most of our topics do not use keyed messages o Number of partitions is still small o Require high level consumer
  • 40. Strategy #1 Add Partitions to new brokers ● Challenges: existing admin tools does not support atomic adding partitions and assigning to new brokers
  • 41. Strategy #1 Add Partitions to new brokers ● Solutions: created our own tool to do it in one ZooKeeper change and repeat for all or selected topics ● Reduced the time to scale up from a few hours to a few minutes
  • 42. Strategy #2 Move Partitions ● Should work without precondition, but ... ● Huge increase of network I/O affecting incoming traffic ● A much longer process than adding partitions ● Sometimes confusing error messages ● Would work if pace of replication can be controlled
  • 43. Scale down strategy ● There is none ● Look for more support to automatically move all partitions from a set of brokers to a different set
  • 44. Client tuning ● Producer o Batching is important to reduce CPU and network I/O on brokers o Stick to one partition for a while when producing for non-keyed messages o “linger.ms” works well with sticky partitioner ● Consumer o With huge number of consumers, set proper fetch.wait.max.ms to reduce polling traffic on broker
  • 45. Effect of batching partitioner batched records per request broker cpu util [1] random without lingering 1.25 75% sticky without lingering 2.0 50% sticky with 100ms lingering 15 33% [1] 10 MB & 10K msgs / second per broker, 1KB per message
  • 46. Broker tuning ● Use G1 collector ● Use large page cache and memory ● Increase max file descriptor if you have thousands of producers or consumers
  • 47. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 48. Road map ● Work with Kafka community on rack/zone aware replica assignment ● Failure resilience testing o Chaos Monkey o Chaos Gorilla ● Contribute to open source o Kafka o Schlep -- our messaging library including SQS and Kafka support o Auditor