This document provides an overview of Apache Kafka including its main components, architecture, and ecosystem. It describes how LinkedIn used Kafka to solve their data pipeline problem by decoupling systems and allowing for horizontal scaling. The key elements of Kafka are producers that publish data to topics, the Kafka cluster that stores streams of records in a distributed, replicated commit log, and consumers that subscribe to topics. Kafka Connect and the Schema Registry are also introduced as part of the Kafka ecosystem.
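As a minimal sketch of the publish side of the model described above, here is a hedged Java producer example; the broker address and topic name are placeholders, not anything from the original deck:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition, preserving per-key order
            // in the distributed, replicated commit log.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```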
Kafka Streams: Revisiting the decisions of the past (How I could have made it better), Jason Bell, Kafka DevOps Engineer @ Digitalis.io https://www.meetup.com/Cleveland-Kafka/events/272339276/
While Kafka has guarantees around the number of server failures a cluster can tolerate, it is prudent to have infrastructure in place to avoid service interruptions, or even data loss, when an environment becomes unavailable during a planned or unplanned outage. This talk describes the architectures available to you when planning for an outage. We will examine configurations including active/passive and active/active as well as availability zones and debate the benefits and limitations of each. We will also cover how to set up each configuration using the tools in Kafka. Whether downtime while you fail over clients to a backup is acceptable or you require your Kafka clusters to be highly available, this talk will give you an understanding of the options available to mitigate the impact of the loss of an environment.
In this talk, we'll discuss how VillageMD is able to use Kafka topic compaction for rapidly scaling our reprocessing pipelines to encompass hundreds of feeds. Within healthcare data ecosystems, privacy and data minimalism are key design priorities. Being able to handle data deletion in a reliable, timely manner within event-driven architectures is becoming more and more necessary with key governance frameworks like the GDPR and HIPAA. We'll be giving an overview of the building and governance of dead-letter queues for streaming data processing. We'll discuss: 1. How to architect a data sink for failed records. 2. How topic compaction can reduce duplicate data and enable idempotency. 3. Building a tombstoning system for removing successfully reprocessed records from the queues. 4. Considerations for monitoring a reprocessing system in production -- what metrics, dataops, and SLAs are useful?
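For context on the tombstoning mechanism: on a log-compacted topic, producing a record with a non-null key and a null value is a tombstone, and compaction eventually removes every earlier record with that key. A hedged sketch, with the topic and key names invented for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Assumes an already-configured producer and a log-compacted dead-letter topic
// (both names here are hypothetical, not VillageMD's actual topology).
void tombstone(KafkaProducer<String, byte[]> producer, String recordKey) {
    // A null value marks the key for deletion during compaction, so a
    // successfully reprocessed record disappears from the queue.
    producer.send(new ProducerRecord<>("dlq.claims.compacted", recordKey, null));
}
```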
Kafka is a vital part of data infrastructure in many organizations. When the Kafka cluster grows and more data is stored in Kafka for a longer duration, several issues related to scalability, efficiency, and operations become important to address. Kafka cluster storage is typically scaled by adding more broker nodes to the cluster, but this also adds needless memory and CPUs, making overall storage cost less efficient compared to storing the older data in external storage. Tiered storage extends Kafka's storage beyond the local storage available on the Kafka cluster by retaining the older data in cheaper stores, such as HDFS, S3, Azure Blob Storage, or GCS, with minimal impact on the internals of Kafka. We will talk about:
- how tiered storage addresses the above problems and also brings several other advantages,
- the high-level architecture of tiered storage, and
- future work planned as part of tiered storage.
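For illustration only: in Kafka releases that ship KIP-405 tiered storage (3.6+), opting a topic in is largely a matter of topic configuration. A hedged Admin API sketch, assuming brokers are already configured with a remote storage plugin (remote.log.storage.system.enable=true); the topic name and retention value are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clickstream");
            // Keep only recent segments on broker disks; older segments are served
            // from the cheaper remote tier.
            List<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                                  AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("local.retention.ms", "86400000"),  // 1 day locally
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```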
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
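To give a flavor of such hand-tuning, here is a hedged RocksDBConfigSetter sketch; the specific values are placeholders rather than recommendations — the talk's point is to derive them from Kafka Streams and RocksDB metrics:

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        tableConfig.setBlockSize(16 * 1024L);   // larger blocks favor sequential scans
        options.setTableFormatConfig(tableConfig);
        options.setMaxWriteBufferNumber(4);     // extra memtables absorb write bursts
    }

    @Override
    public void close(String storeName, Options options) { }
}
```

It would be registered with the application via props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class).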
Confluent Cloud runs a modified version of Apache Kafka - redesigned to be cloud-native and deliver a serverless user experience. In this talk, we will discuss key improvements we've made to Kafka and how they contribute to Confluent Cloud availability, elasticity, and multi-tenancy. You'll learn about innovations that you can use on-prem, and everything you need to make the most of Confluent Cloud.
Apache Pulsar was developed to address several shortcomings of existing messaging systems, including geo-replication, message durability, and lower message latency. We will implement a multi-currency quoting application that feeds pricing information to a cryptocurrency trading platform deployed around the globe. Given the volatility of cryptocurrency prices, sub-second message latency is critical to traders. Equally important is ensuring consistent quotes across all geographical locations, i.e., the price of Bitcoin shown to a user in the USA should be the same as the one shown to a trader in Hong Kong. We will highlight the advantages of Apache Pulsar over traditional messaging systems and show how its low latency and replication across multiple geographies make it ideally suited for globally distributed, real-time applications.
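A hedged sketch of what the latency-sensitive producer side might look like with the Pulsar Java client; the service URL and topic are placeholders, and geo-replication itself would be configured at the namespace level (e.g. via pulsar-admin), not in client code:

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class QuotePublisher {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder service URL
                .build();
        // Disabling batching trades some throughput for the lowest per-message latency,
        // which matters for sub-second quote delivery.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/btc-usd-quotes")  // hypothetical topic
                .enableBatching(false)
                .create();
        producer.send("BTC-USD,43201.55");   // illustrative quote payload
        producer.close();
        client.close();
    }
}
```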
This document provides an overview and summary of Apache Pulsar, a distributed streaming and messaging platform. It discusses Pulsar's benefits like data durability, scalability, geo-replication, and multi-tenancy. It outlines key use cases like message queuing and data streaming. The document also summarizes Pulsar's architecture, subscription modes, connectors, and integration with other technologies like Apache Flink, Apache NiFi, and MQTT. It highlights real-world customer implementations and provides demos of ingesting IoT data via Pulsar.
1. The document discusses implementing a new data structure, Cuckoo filters, as a Redis module. It describes the presenter's experience level in C programming and goals for the project. 2. Probabilistic data structures like Bloom filters and Cuckoo filters are described as space-efficient alternatives to sets for membership testing with some error. The presenter explains why Cuckoo filters were chosen over Bloom filters for the ability to delete items. 3. Advice is given on starting a new Redis module, including using existing code as examples and writing tests at different speeds. Good API design, leveraging existing work, and thorough testing are emphasized.
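The module discussed is written in C against the Redis module API; purely to illustrate the data structure itself, here is a minimal cuckoo-filter sketch in Java (single-slot buckets, toy hashing, all invented for illustration) showing the insert/evict cycle and why deletion — the reason cuckoo filters were chosen over Bloom filters — is possible:

```java
import java.util.Random;

// Illustrative only: one fingerprint per bucket and weak hashing. A production
// filter (like the Redis module discussed) uses multi-slot buckets and a real hash.
public class CuckooFilterSketch {
    private final byte[] table;            // 0 marks an empty slot
    private final int mask;                // numBuckets - 1 (power of two)
    private final Random rng = new Random();
    private static final int MAX_KICKS = 500;

    public CuckooFilterSketch(int numBuckets) {
        if (Integer.bitCount(numBuckets) != 1)
            throw new IllegalArgumentException("numBuckets must be a power of two");
        this.table = new byte[numBuckets];
        this.mask = numBuckets - 1;
    }

    private byte fingerprint(Object item) {
        return (byte) ((item.hashCode() * 0x9E3779B9 >>> 24) | 1);  // never zero
    }

    private int bucket(Object item) {
        return item.hashCode() & mask;
    }

    // Partial-key cuckoo hashing: the alternate bucket depends only on the
    // fingerprint, and XOR under a power-of-two mask is its own inverse.
    private int altBucket(int i, byte fp) {
        return (i ^ ((fp & 0xFF) * 0x5bd1e995)) & mask;
    }

    public boolean insert(Object item) {
        byte fp = fingerprint(item);
        int i1 = bucket(item), i2 = altBucket(i1, fp);
        if (table[i1] == 0) { table[i1] = fp; return true; }
        if (table[i2] == 0) { table[i2] = fp; return true; }
        int i = rng.nextBoolean() ? i1 : i2;
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            byte evicted = table[i];   // kick a resident fingerprint out...
            table[i] = fp;
            fp = evicted;
            i = altBucket(i, fp);      // ...and try its alternate bucket
            if (table[i] == 0) { table[i] = fp; return true; }
        }
        return false;                  // filter too full
    }

    public boolean mightContain(Object item) {
        byte fp = fingerprint(item);
        int i1 = bucket(item);
        return table[i1] == fp || table[altBucket(i1, fp)] == fp;
    }

    // Deletion is what Bloom filters cannot offer: remove the fingerprint
    // from whichever of the two candidate buckets holds it.
    public boolean delete(Object item) {
        byte fp = fingerprint(item);
        int i1 = bucket(item), i2 = altBucket(i1, fp);
        if (table[i1] == fp) { table[i1] = 0; return true; }
        if (table[i2] == fp) { table[i2] = 0; return true; }
        return false;
    }
}
```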
Data Pipeline with Kafka. These slides cover a Kafka introduction, topics/partitions, producers/consumers, a quick start, offset monitoring, example code, and Camus.
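To complement the producer sketch earlier in this collection, a minimal consumer with a poll loop; the broker address, group id, and topic are again placeholders, and the printed offsets are exactly the quantities an offset monitor tracks:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "quickstart-group");         // group for offset tracking
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));         // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Consumer lag = log end offset minus the group's committed offset.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```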
Having started with classic monolith applications in the late 90s and adopting a new microservice architecture in 2015, our organization needed a convenient, reliable, and low-cost way to push changes back and forth between them. One that preferably utilized technology already on hand and could exchange information between multiple data stores. In this session we will explore how Kafka Connect and its various connectors satisfied this need. We will review the two disparate tech stacks we needed to integrate, and the strategies and connectors we used to exchange information. Finally, we will cover some enhancements we made to our own processes including integrating Kafka Connect and its connectors into our CI/CD pipeline and writing tools to monitor connectors in our production environment.
Interactive Queries in Apache Kafka's Streams API allow users to query the local state of a Kafka Streams application without accessing an external database, treating the application as an embedded, lightweight database. The local state is fault-tolerant and can be sharded across tasks to scale horizontally. Users can discover other application instances and their state in order to query remote state if needed. Interactive Queries simplify stateful stream processing by reducing moving parts compared to using an external database.
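A minimal sketch of a local interactive query, assuming a running KafkaStreams instance with a store materialized under the hypothetical name `counts`:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Look up a value from this instance's local shard of a queryable store.
// For keys hosted on another instance, streams.queryMetadataForKey(...) can
// locate the owner so a thin REST layer can proxy the request there.
static Long localCount(KafkaStreams streams, String key) {
    ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType("counts", QueryableStoreTypes.keyValueStore()));
    return store.get(key);   // null if the key is absent from the local shard
}
```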
1. Zillow transitioned from using multiple messaging systems and data pipelines to using Kafka as their single streaming platform to unify their data infrastructure. 2. They took a bottom-up approach to gain trust from teams by publishing service level objectives, onboarding non-critical streams quickly, and meeting developers where they were with tools like Terraform. 3. An important lesson was to treat the platform as a product by providing documentation, libraries, and blog posts to make it easy for developers to use.
This document summarizes a presentation about Kafka multi-tenancy at LINE Corporation. The presentation discusses how LINE runs a single shared Kafka cluster to handle over 100 billion messages per day from many independent services. It describes the hardware used, requirements for multi-tenancy like protecting against abusive workloads and providing isolation. It then discusses specific issues identified like slow response times caused by disk reads, and the solutions implemented like request quotas, metrics, and pre-loading data into memory to avoid blocking. The presentation concludes that after addressing these issues, their shared Kafka hosting model works efficiently while maintaining a single data hub.
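As one concrete example of the protections described, per-client quotas can be applied through Kafka's Admin API (available since Kafka 2.6; quotas themselves predate it). The client id and limits below are invented for illustration — LINE's actual mechanism may differ:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class ApplyQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        try (Admin admin = Admin.create(props)) {
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "service-a"));  // hypothetical tenant
            // Cap produce bandwidth and broker-thread usage for this client so one
            // abusive workload cannot starve the shared cluster.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, List.of(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 10_485_760.0),
                    new ClientQuotaAlteration.Op("request_percentage", 50.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```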
Presented by Michael Noll, Product Manager, Confluent. Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all. Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. Particularly, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
(Mike Graham + Dan Carroll, Comcast) Kafka Summit SF 2018 Comcast manages over 2 million miles of fiber and coax, and over 40 million in home devices. This “outside plant” is subject to adverse conditions from severe weather to power grid outages to construction-related disruptions. Maintaining the health of this large and important infrastructure requires a distributed, scalable, reliable and fast information system capable of real-time processing and rapid analysis and response. Using Apache Kafka and the Kafka Streams Processor API, Comcast built an innovative new system for monitoring, problem analysis, metrics reporting and action response for the outside plant. In this talk, you’ll learn how topic partitions, state stores, key mapping, source and sink topics and processors from the Kafka Streams Processor API work together to build a powerful dynamic system. We will dive into the details about the inner workings of the state store—how it is backed by a Kafka “changelog” topic, how it is scaled horizontally by partition and how the instances are rebuilt on startup or on processor failure. We will discuss how these state stores essentially become like materialized views in a SQL database but are updated incrementally as data flows through the system, and how this allows the developers to maintain the data in the optimal structures for performing the processing. The best part is that the data is readily available when needed by the processors. You will see how a REST API using Kafka Streams “interactive queries” can be used to retrieve the data in the state stores. We will explore the deployment and monitoring mechanisms used to deliver this system as a set of independently deployed components.
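A hedged sketch of the Processor API shapes the talk describes — a source topic, a processor, a persistent state store backed by a changelog topic, and a sink — wired into a Topology. All names are hypothetical and String default serdes are assumed; this is not Comcast's actual code:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class PlantTopology {
    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("telemetry", "plant-telemetry");           // hypothetical source topic
        topology.addProcessor("health-check", HealthProcessor::new, "telemetry");
        // The store is backed by a Kafka changelog topic, scaled horizontally by
        // partition, and rebuilt from that topic on startup or processor failure.
        topology.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("device-health"),
                Serdes.String(), Serdes.String()), "health-check");
        topology.addSink("alerts", "plant-alerts", "health-check");   // hypothetical sink topic
        return topology;
    }

    static class HealthProcessor implements Processor<String, String, String, String> {
        private KeyValueStore<String, String> store;
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.store = context.getStateStore("device-health");
        }

        @Override
        public void process(Record<String, String> record) {
            String previous = store.get(record.key());
            store.put(record.key(), record.value());  // incremental, materialized-view-style update
            if (previous != null && !previous.equals(record.value())) {
                context.forward(record);               // emit a change event downstream
            }
        }
    }
}
```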
Matías Cascallares, Confluent, Customer Success Architect. Streams is one of the buzzwords of the moment! In this presentation, we will see how to implement stream processing with Kafka Streams, which considerations to keep in mind, and take a short tour of ksqlDB as a tool. https://www.meetup.com/Mexico-Kafka/events/276717476/
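As a taste of the kind of topology such a presentation typically walks through, a hedged word-count example in the Kafka Streams DSL; the topic names are placeholders:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("text-lines")               // hypothetical input topic
               .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
               .groupBy((key, word) -> word)                       // repartition by word
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("word-counts"))
               .toStream()
               .to("word-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The rough ksqlDB counterpart of the aggregation would be a CREATE TABLE ... AS SELECT word, COUNT(*) ... GROUP BY word statement over a stream.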
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses the evolution of Netflix's data pipeline to incorporate Kafka to handle 400 billion events per day. It describes how Netflix uses Kafka clusters with different priorities and configurations. It also outlines some of the challenges of using Kafka at Netflix's scale, such as Zookeeper client issues and cluster scaling, and the solutions Netflix developed to address these challenges.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
Kafka Connect allows data ingestion into Kafka from external systems by using connectors. It provides scalability, fault tolerance, and, for supported connectors, exactly-once semantics. Connectors run as tasks within workers, which can operate in either standalone or distributed mode. The Schema Registry works with Kafka Connect to handle schema validation and evolution.
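In distributed mode, connectors are created through the Connect REST API. A hedged sketch that registers the stock FileStreamSource connector; the worker URL, file path, and topic are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector name, file, and topic are illustrative placeholders.
        String config = """
                {
                  "name": "file-source-demo",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/tmp/input.txt",
                    "topic": "file-lines"
                  }
                }""";
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```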
Confluent Platform is supporting the London Metal Exchange's Kafka Centre of Excellence across a number of projects, with the main objective of providing a reliable, resilient, scalable, and overall efficient Kafka-as-a-Service model to teams across the entire London Metal Exchange estate.
Timothy will introduce Apache Pulsar, an open-source distributed messaging and streaming platform. He will discuss how to build real-time applications using Pulsar with various libraries, schemas, languages, frameworks and tools. The presentation will cover what Pulsar is, its functions and components, how it compares to other technologies like Apache Kafka, its advantages, and how to integrate it with tools like Apache Flink, Apache Spark, Apache NiFi and more. A demo and Q&A will follow.
DevFest UK & Ireland 2022: Using Apache NiFi with Apache Pulsar for fast data on-ramp. As the Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation, and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed. I will walk through how to get started, cover some use cases and demos, and answer questions. https://www.devfest-uki.com/schedule https://linktr.ee/tspannhw
Timothy Spann: Apache Pulsar for ML Data Science Online Camp 2023 Winter Website: https://dscamp.org Youtube: https://www.youtube.com/channel/UCeHtPZ_ZLZ-nHFMUCXY81RQ FB: https://www.facebook.com/people/Data-Science-Camp/100064240830422/
Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One. Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operations teams and bank tellers to assist with assessing risk and protecting customers in a myriad of ways. Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on their visions faster and in a more scalable way through the Kafka solution, helping establish Capital One as an innovator in the banking space. Join us for this online talk on lessons learned, best practices, and technical patterns of Capital One's deployment of Apache Kafka.
- Find out how Kafka delivers on a 5-second service-level agreement (SLA) for in-branch tellers.
- Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions.
- Understand how Capital One manages Kafka Docker containers using Kubernetes.
Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69
This document provides an overview of stream data processing and common stream processing tools. It discusses streaming basics like stateful and stateless operations. It also covers microbatch vs realtime streaming and compositional vs declarative stream processing engines. Typical stream processing architectures and use cases are presented. Main considerations for projects using stream processing are outlined. An overview of popular stream processing tools like Apache Spark, Storm, Flink, and cloud services from AWS, GCP and Azure is provided. The document concludes with a case study example and questions for discussion.
Watch the webcast here: https://videos.confluent.io/watch/bfDAKeLTiq751YyiZaCgHG Speaker: Brett Randall
JConf.dev 2022 - Apache Pulsar Development 101 with Java https://2022.jconf.dev/ In this session I will get you started with real-time, cloud-native streaming programming with Java. We will start off with a gentle introduction to Apache Pulsar and set up your first standalone cluster. We will then show you how to produce and consume messages with Pulsar using several different Java libraries, including the native Java client, AMQP/RabbitMQ, MQTT, and even Kafka. After this session you will be building real-time streaming and messaging applications with Java. We will also touch on Apache Spark and Apache Flink. Timothy Spann Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/speaker/185963
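In the spirit of that 101, a hedged produce-and-consume round trip with the native Pulsar Java client against a standalone cluster; the topic and subscription names are invented for illustration:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class PulsarHello {
    public static void main(String[] args) throws Exception {
        // Assumes a local cluster started with `bin/pulsar standalone`.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("jconf-demo")                        // hypothetical topic
                .subscriptionName("jconf-sub")
                .subscriptionType(SubscriptionType.Shared)  // multiple consumers share the work
                .subscribe();

        client.newProducer(Schema.STRING).topic("jconf-demo").create()
              .send("hello, Pulsar");

        Message<String> msg = consumer.receive();
        System.out.println("received: " + msg.getValue());
        consumer.acknowledge(msg);                          // acked messages are not redelivered

        consumer.close();
        client.close();
    }
}
```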
Music City Data: Hail Hydrate! From stream to lake with Apache Pulsar, Apache Flink, Apache NiFi, data lakes, cloud, streaming data, and IoT.
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable message broker, built for exchanging huge amounts of messages between a source and a target. This session will start with an introduction to Apache Kafka and present the role of Apache Kafka in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka in the Oracle stack, with products such as GoldenGate, Service Bus, and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.