This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to a Kafka-based pipeline, with producers and consumers handling 400 billion events per day. It covers the challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap, which includes contributing to open source projects like Kafka and testing failure resilience.
Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. The Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
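As a hedged illustration (not code from the presentation), the PySpark snippet below shows a few well-known S3A options of the kind such tuning discussions cover; the bucket paths and values are placeholders, not recommendations.

```python
# Minimal PySpark sketch: tuning a few well-known S3A options.
# Values are illustrative only; adjust them for your workload.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-tuning-example")
    # Stream multipart uploads instead of buffering whole files on local disk.
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    # Size of each multipart upload part (bytes); larger parts mean fewer requests.
    .config("spark.hadoop.fs.s3a.multipart.size", str(128 * 1024 * 1024))
    # Raise the HTTP connection pool and upload thread counts for parallel tasks.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.threads.max", "64")
    .getOrCreate()
)

# Read and write directly against S3 via the s3a:// scheme (bucket/paths are placeholders).
df = spark.read.parquet("s3a://example-bucket/input/")
df.write.mode("overwrite").parquet("s3a://example-bucket/output/")
```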
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk focuses on Iceberg, a new table metadata format designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker: Ryan Blue, Software Engineer, Netflix
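As a hedged illustration of working with an Iceberg table from Spark (not code from the talk), the sketch below assumes the iceberg-spark runtime is on the classpath; the catalog name, warehouse path, and table are placeholders.

```python
# Minimal PySpark sketch of creating and querying an Iceberg table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-example")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# Create an Iceberg table; manifests and snapshots are tracked by the table itself,
# so query planning does not depend on listing S3 directories.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")

# Readers always see a consistent snapshot; metadata tables expose the snapshot history.
spark.sql("SELECT * FROM demo.db.events").show()
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```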
ksqlDB is a stream processing SQL engine that enables stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing them in near real time with a SQL-like language, and producing results back to a Kafka topic. As a result, not a single line of Java code has to be written and you can reuse your SQL know-how. This lowers the bar for starting with stream processing significantly. ksqlDB offers powerful stream processing capabilities such as joins, aggregations, time windows, and support for event time. In this talk I will present how ksqlDB integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo based on a fictitious IoT example.
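As a hedged illustration of how little code such a solution needs, the sketch below submits ksqlDB statements over the server's REST API from Python; the server URL, topic, and schema are assumptions for a fictitious IoT example, not details from the talk.

```python
# Minimal sketch: submit ksqlDB statements over its REST API with Python.
import requests

KSQLDB_URL = "http://localhost:8088/ksql"  # default ksqlDB server port (placeholder)

statements = """
    CREATE STREAM sensor_readings (sensor_id VARCHAR, temperature DOUBLE)
      WITH (KAFKA_TOPIC='iot-readings', VALUE_FORMAT='JSON');

    CREATE TABLE avg_temperature AS
      SELECT sensor_id, AVG(temperature) AS avg_temp
      FROM sensor_readings
      WINDOW TUMBLING (SIZE 1 MINUTE)
      GROUP BY sensor_id
      EMIT CHANGES;
"""

resp = requests.post(
    KSQLDB_URL,
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": statements, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())
```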
The document discusses intra-cluster replication in Apache Kafka, including its architecture, in which partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas (ISR) approach to provide strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability trade-offs for producers and throughput optimization for consumers.
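A hedged sketch of the producer-side durability/latency trade-off, using the confluent-kafka Python client; the broker address, topic name, and settings are illustrative, not taken from the document.

```python
# Minimal sketch of durability vs. latency settings around Kafka replication.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

BOOTSTRAP = "broker1:9092"  # placeholder

# Create a topic with 3 replicas; min.insync.replicas=2 means a write is only
# acknowledged once the leader plus at least one follower have it.
admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
admin.create_topics([
    NewTopic("payments", num_partitions=6, replication_factor=3,
             config={"min.insync.replicas": "2"})
])

# acks=all waits for all in-sync replicas, trading latency for durability;
# acks=1 would return after the leader alone persists the record.
producer = Producer({
    "bootstrap.servers": BOOTSTRAP,
    "acks": "all",
    "enable.idempotence": True,
})
producer.produce("payments", key=b"order-42", value=b'{"amount": 10.0}')
producer.flush()
```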
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. This talk describes best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager and achieve the best performance. To evaluate performance, the TPC-DS benchmarking tool is used to analyze the performance impact of queries across configuration sets. Speakers: Junjie Chen, Junping Du
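A hedged sketch of what such a configuration might look like from PySpark; the API server URL, namespace, container image, and resource numbers are all illustrative assumptions, and in practice these settings are usually supplied via spark-submit.

```python
# Minimal sketch of Spark-on-Kubernetes settings expressed as a PySpark session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-sql-on-k8s")
    # Kubernetes acts as the cluster manager via the k8s:// master URL (placeholder address).
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "example/spark:3.0.0")
    # Executor sizing drives how well the cluster's resources are utilized.
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

# A TPC-DS-style query could then be run against tables registered in the metastore.
spark.sql("SELECT count(*) FROM store_sales").show()
```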
Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across clusters of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.
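A minimal, hedged sketch of this publish/subscribe model using the confluent-kafka Python client; the broker address, topic, and group id are placeholders.

```python
# Producers write records to a topic; consumers subscribe and read them back.
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"  # placeholder

# The record key determines which partition the record lands on.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce("page-views", key=b"user-1", value=b'{"page": "/home"}')
producer.flush()

# group.id lets multiple consumers share the topic's partitions.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "page-view-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```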
A brief introduction to Apache Kafka and its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Together with the Hive Metastore, these table formats aim to solve problems that have long stood in traditional data lakes, with features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
A brief introduction to Apache Kafka Connect, giving insights into its benefits, use cases, and the motivation behind building Kafka Connect, along with a short discussion of its architecture.
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes promptly to external storage, such as Delta or Kudu, for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code.
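As a hedged illustration of the replay step only, the sketch below consumes change events from a Kafka topic and hands them to a stub that would upsert into the target store; the topic name and the Debezium-style envelope fields (op/before/after) are assumptions, not details from this text.

```python
# Minimal sketch: replay change events from Kafka into a target store.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "cdc-replayer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mysql.orders.binlog"])  # placeholder topic name

def apply_change(op, before, after):
    # In a real pipeline this would upsert/delete into Delta, Kudu, etc.
    print(op, before, after)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Assumed envelope: "c" = insert, "u" = update, "d" = delete.
    apply_change(event["op"], event.get("before"), event.get("after"))
```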
This document discusses data mesh, a distributed data management approach for microservices. It outlines the challenges of implementing microservice architecture including data decoupling, sharing data across domains, and data consistency. It then introduces data mesh as a solution, describing how to build the necessary infrastructure using technologies like Kubernetes and YAML to quickly deploy data pipelines and provision data across services and applications in a distributed manner. The document provides examples of how data mesh can be used to improve legacy system integration, batch processing efficiency, multi-source data aggregation, and cross-cloud/environment integration.
Flink Forward San Francisco 2022. At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once capabilities of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset of trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing. by Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Kafka is an open-source message broker that provides high-throughput, low-latency data processing. It uses a distributed commit log to store messages in categories called topics. Processes that publish messages are producers, while processes that subscribe to topics are consumers. Consumers can belong to consumer groups for parallel processing. Kafka guarantees message ordering within a partition and durability against message loss. It uses ZooKeeper for metadata and coordination.
Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.
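A hedged PySpark sketch of the ACID, schema-enforcement, and versioning behaviour described above; the paths are placeholders and the session is assumed to have the delta-spark package available.

```python
# Minimal sketch of Delta Lake writes, schema enforcement, and time travel.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])

# Writes are transactional; readers never see a partially written table.
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Schema enforcement: appending mismatched columns fails instead of corrupting the table.
more = spark.createDataFrame([(3, "click")], ["id", "action"])
more.write.format("delta").mode("append").save("/tmp/delta/events")

# Versioning ("time travel"): read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```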
Apache Kafka in conjunction with Apache Spark has become the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents. We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse. Real-world use cases show the joint value and explore the benefits of the "delta lake" integration.
This document provides an overview of Apache Kafka including its main components, architecture, and ecosystem. It describes how LinkedIn used Kafka to solve their data pipeline problem by decoupling systems and allowing for horizontal scaling. The key elements of Kafka are producers that publish data to topics, the Kafka cluster that stores streams of records in a distributed, replicated commit log, and consumers that subscribe to topics. Kafka Connect and the Schema Registry are also introduced as part of the Kafka ecosystem.
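As a hedged illustration of the producer-plus-Schema-Registry part of the ecosystem, the sketch below serializes Avro records with the confluent-kafka Python client; the URLs, topic, and schema are assumptions, not details from the document.

```python
# Minimal sketch: produce Avro records registered with the Schema Registry.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page", "type": "string"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder
value = {"user_id": "u-1", "page": "/home"}
producer.produce(
    topic="page-views",
    value=serializer(value, SerializationContext("page-views", MessageField.VALUE)),
)
producer.flush()
```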
Keystone - Processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka-based pipeline and supporting Spark Streaming.
Mark Harrison presented on using Amazon Kinesis for event-driven microservices architectures. He discussed the limitations of traditional monolithic and microservice architectures, and how Kinesis can help address issues like tight coupling, high latency, and lack of event broadcasting. Key concepts around Kinesis included streams, shards, producers, consumers, and constraints like throughput limits. Harrison demonstrated a custom Scala/Akka client for Kinesis that provides asynchronous, non-blocking producers and consumers with features like throttling and checkpointing. Performance tests showed throughput scales linearly with additional shards. In closing, Harrison invited the audience to learn more about open source and job opportunities with Weight Watchers.
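For reference, a hedged Python sketch of the basic Kinesis producer and consumer calls via boto3 (not the custom Scala/Akka client from the talk); the stream name and region are placeholders.

```python
# Minimal sketch of putting and reading records on a Kinesis stream with boto3.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

# Producers put records; the partition key determines the target shard,
# and per-shard throughput limits are what drive resharding decisions.
kinesis.put_record(
    StreamName="order-events",
    Data=json.dumps({"order_id": "42", "status": "created"}).encode("utf-8"),
    PartitionKey="42",
)

# Consumers read per shard using a shard iterator.
shard_id = kinesis.describe_stream(StreamName="order-events")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="order-events", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]
print(records)
```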
We will talk about how we are migrating our Presto clusters from AWS EMR to Docker using production-grade orchestrators, considering cluster management, configuration, and monitoring. We will compare HashiCorp Nomad and Kubernetes as the base solution.
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016. Topics covered: - Architectural design and principles for Keystone - Technologies that Keystone is leveraging - Best practices http://www.meetup.com/SF-Data-Engineering/events/228293610/
Apache Kafka is the data streaming broker most widely used by companies. It can easily manage millions of messages and is the basis of many architectures built around events, microservices, orchestration, and, more recently, cloud environments. OpenShift is the most widely adopted Platform as a Service (PaaS). It is based on Kubernetes and helps companies easily deploy any kind of workload in a cloud environment. Thanks to many of its features, it is the foundation of many architectures based on stateless applications for building new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you deploy and manage Apache Kafka brokers in OpenShift environments. These slides introduce Strimzi as a new component on OpenShift to manage your Apache Kafka clusters. Slides used at OpenShift Meetup Spain: - https://www.meetup.com/es-ES/openshift_spain/events/261284764/
http://www.oreilly.com/pub/e/3764 Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans on offering Stream Processing as a Service for all of Netflix.
Disenchantment is a Netflix show following the medieval misadventures of a hard-drinking princess, her feisty elf, and her personal demon. In this talk, we will follow the story of Netflix’s container management platform, Titus, which powers critical aspects of the Netflix business (video encoding & streaming, big data, recommendations & machine learning, and other workloads). We’ll cover the challenges growing Titus from 10’s to 1000’s of workloads. We’ll talk about our feisty team’s work across container runtimes, scheduling & control plane, and cloud infrastructure integration. We’ll talk about the demons we’ve found on this journey covering operability, security, reliability and performance.
In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix. The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs. The second, Dynomite, is a framework to make any non-distributed data store distributed. Netflix's first implementation of Dynomite is based on Redis. Come learn about the products' features and hear from Thomson Reuters, Diego Pacheco from Ilegra, and other third-party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans on offering Stream Processing as a Service for all of Netflix.
QuestDB is a high-performance open source database. Many people told us they would like to use it as a service, without having to manage the machines themselves. So we got to work building a solution that would let us launch QuestDB instances with fully managed provisioning, monitoring, security, and upgrades. A few Kubernetes clusters later, we managed to launch our QuestDB Cloud offering. This talk is the story of how we got there. I will talk about tools like Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki, and Grafana, but also about challenges such as authentication, billing, and multi-cloud, and about what you have to say no to in order to survive in the cloud.
This document discusses the evolution of Kafka clusters at AppsFlyer over time. The initial cluster had 4 brokers and handled hundreds of millions of messages with low partitioning and replication. A new cluster was designed with more brokers, replication across availability zones, and higher partitioning to support billions of messages. However, this led to issues like uneven leader distribution and failures. Various solutions were implemented like increasing brokers, splitting topics, and hardware upgrades. Ongoing testing and monitoring helped identify more problems and improvements around replication, partitioning, and automation. Key lessons learned included balancing replication and leaders, supporting dynamic changes, and thorough testing of failure scenarios.
This presentation discusses architectural differences between Apache Apex and Spark Streaming. It discusses how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLAs, high throughput, and large-scale ingestion. It also covers fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, the computation and scheduling model, state management, and dynamic changes. Further, it discusses how these features affect time to market and total cost of ownership.
This document discusses using Apache Kafka at scale to deliver real-time notifications. It describes using Kafka to listen to database events, process them into transactions, translate transactions into notifications, and push notifications to mobile devices in real-time. It outlines the system architecture, challenges faced around partitioning, scaling consumers, and upgrading Kafka clients. Monitoring metrics and consumer lag is also discussed using tools like Burrow and Datadog.
Until recently, the Messaging team at Twitter had been running an in-house-built Pub/Sub system, namely EventBus (built on top of Apache DistributedLog and Apache BookKeeper, and similar in architecture to Apache Pulsar), to cater to our pub/sub needs. In 2018, we made the decision to move to Apache Kafka by migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today: Kafka is now an essential piece of Twitter infrastructure and processes over 200M messages per second. In this talk, we will share the learnings and challenges in our journey moving to Apache Kafka.
Audience: Advanced About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand. This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production. Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in-place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more! Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited Bruno Lago is a solutions architect who has been involved with the Catalyst Cloud (New Zealand’s first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies. OpenStack Australia Day - Sydney 2016 https://events.aptira.com/openstack-australia-day-sydney-2016/
The document discusses Netflix's approach to handling big data problems. It summarizes Netflix's data pipeline system called Keystone that was built in a year to replace a legacy system. Keystone ingests over 1 trillion events per day and processes them using technologies like Kafka, Samza and Spark Streaming. The document emphasizes Netflix's culture of freedom and responsibility and how it helped the small team replace the legacy system without disruption while achieving massive scale.
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
"Confluent Cloud is a cloud-native service based on Apache Kafka. We run tens of thousands of clusters across all major cloud service providers (AWS, GCP and Azure). In this talk, we will go over our journey to make Confluent Cloud 10x faster than Apache Kafka. We will talk about how we designed our various workloads, the complexities involved in our cloud-native service, the challenges we faced, and the various pitfalls we ran into. We will also cover the interesting learnings, which in hindsight, are first principles from this multi-year journey. By attending this talk, attendees will be able to take our learnings from making Confluent Cloud latencies 10x better and possibly apply similar principles to their cloud native data streaming systems."