Performance Tuning RocksDB for Kafka Streams’ State Stores, Bruno Cadonna, Contributor to Apache Kafka & Software Developer at Confluent and Dhruba Borthakur, CTO & Co-founder Rockset Meetup link: https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/273823025/
Apache Kafka becoming the message bus to transfer huge volumes of data from various sources into Hadoop. It's also enabling many real-time system frameworks and use cases. Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka in production. How to Secure a Kafka Cluster, How to pick topic-partitions and upgrading to newer versions. Migrating to new Kafka Producer and Consumer API. Also talk about the best practices involved in running a producer/consumer. In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects. We will showcase open sourced Kafka REST API and an Admin UI that will help users in creating topics, re-assign partitions, Issuing Kafka ACLs and monitoring Consumer offsets.
3 Things to Learn About: -How Kudu is able to fill the analytic gap between HDFS and Apache HBase -The trade-offs between real-time transactional access and fast analytic performance -How Kudu provides an option to achieve fast scans and random access from a single API
Presentation at Strata Data Conference 2018, New York The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure. Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker. Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
What is Kafka What problem does Kafka solve How does Kafka work What are the benefits of Kafka Conclusion
Kafka is becoming an ever more popular choice for users to help enable fast data and Streaming. Kafka provides a wide landscape of configuration to allow you to tweak its performance profile. Understanding the internals of Kafka is critical for picking your ideal configuration. Depending on your use case and data needs, different settings will perform very differently. Lets walk through performance essentials of Kafka. Let's talk about how your Consumer configuration, can speed up or slow down the flow of messages to Brokers. Lets talk about message keys, their implications and their impact on partition performance. Lets talk about how to figure out how many partitions and how many Brokers you should have. Let's discuss consumers and what effects their performance. How do you combine all of these choices and develop the best strategy moving forward? How do you test performance of Kafka? I will attempt a live demo with the help of Zeppelin to show in real time how to tune for performance.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale. In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
Speaker: Matthias J. Sax, Software Engineer, Confluent ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries. Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems. https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Flink Forward San Francisco 2022. Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state offers you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink. by Nico Kruber
Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.
Independent of the source of data, the integration and analysis of event streams gets more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular Streaming Analytics solutions: Spark Streaming and Kafka Streams. Spark is fast and general engine for large-scale data processing and has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. Kafka Streams is the stream processing solution which is part of Kafka. It is provided as a Java library and by that can be easily integrated with any Java application. This presentation shows how you can implement stream processing solutions with each of the two frameworks, discusses how they compare and highlights the differences and similarities.
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high-performance. Select configuration parameters and deployment topologies essential to achieve higher throughput and low latency across the pipeline are discussed. Lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100GB data under 25 minutes is discussed.
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. This slide deck covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example to demonstrate failover of brokers as well as consumers. Then it goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded on the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advance Kafka Producers.
This document provides an introduction to Apache Kafka, an open-source distributed event streaming platform. It discusses Kafka's history as a project originally developed by LinkedIn, its use cases like messaging, activity tracking and stream processing. It describes key Kafka concepts like topics, partitions, offsets, replicas, brokers and producers/consumers. It also gives examples of how companies like Netflix, Uber and LinkedIn use Kafka in their applications and provides a comparison to Apache Spark.
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner. In this talk, I am going examine a number common streaming design patterns in the context of the following questions. WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements? WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements? HOW are going to architect the solution? And how much are you willing to pay for it? Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
This document provides an overview of Apache Kafka. It begins with defining Kafka as a distributed streaming platform and messaging system. It then lists the agenda which includes what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. Core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
A presentation in ApacheCon Asia 2022 from Dan Wang and Yingchun Lai. Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store. Know more about Pegasus https://pegasus.apache.org, https://github.com/apache/incubator-pegasus
Tungsten Connector / Proxy for MySQL is truly the secret sauce for the Tungsten Clustering solution. Join us for a basic 30min introduction and tour of Tungsten Connector / Proxy, and gain an understanding of the various SQL routing methods available in Tungsten Connector / Proxy. AGENDA - Review the cluster architecture - Understand the role of the Connector - Explore Connector routing methods - Discuss user authentication - Review configuration files and their locations - Explore the command line interface
Using Kafka streaming platform to build a scalable, highly available, decouple realtime data pipeline at scale
Deep dive deck given at SQL Saturday Manchester 2015 covering Checkpoint File Pairs, Hash Indexign etc.
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
During this session Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse, using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
This is the version of my Apache Performance Tuning deck I last presented at ApacheCon, in 2008 in Amsterdam.
This document provides an overview and agenda for a presentation on Apache ActiveMQ 5.9.x and Apache Apollo. The presentation will cover new features in ActiveMQ 5.9.x including AMQP 1.0 support, REST management, a new default file-based store using LevelDB, and high availability replication of the store. It will also introduce Apache Apollo and allow for a question and discussion period.
Microservices can provide terabytes of data in microseconds by mapping data from SQL databases into in-memory key-value stores and column key stores within JVMs. This is done through periodic synchronization of changed data from databases into memory and mapping the in-memory data into fast access structures. The in-memory data is then exposed through Java Stream and REST APIs to microservices for high performance querying and analysis of large datasets. This architecture allows microservices to quickly share access to large datasets and restart rapidly by reloading from the synchronized persistent stores.
By leveraging memory-mapped files, Speedment and the Chronicle Engine supports large Java maps that easily can exceed the size of your server’s RAM.Because the Java maps are mapped onto files, these maps can be shared instantly between several microservice JVMs and new microservice instances can be added, removed, or restarted very quickly. Data can be retrieved with predictable ultralow latency for a wide range of operations. The solution can be synchronized with an underlying database so that your in-memory maps will be consistently “alive.” The mapped files can be tens of terabytes, which has been done in real-world deployment cases, and a large number of micro services can share these maps simultaneously. Learn more in this session.
Spark Streaming provides fault-tolerance through checkpointing and write ahead logs (WAL). Checkpointing saves metadata and generated RDDs to reliable storage to recover from driver failures. WAL saves all received data to log files to enable zero data loss recovery from executor failures. Structured Streaming uses checkpointing for fault-tolerance. Kafka achieves fault-tolerance through replication of partitions across brokers. Flume uses durable file channels and redundant topologies. HDFS replicates blocks across multiple machines. The Lambda architecture handles batch and real-time data through separate batch and speed layers that are merged in the serving layer.
Getting started with Riak in the Cloud involves provisioning a Riak cluster on Engine Yard and optimizing it for performance. Key steps include choosing instance types like m1.large or m1.xlarge that are EBS-optimized, having at least 5 nodes, setting the ring size to 256, disabling swap, using the Bitcask backend, enabling kernel optimizations, and monitoring and backing up the cluster. Benchmarks show best performance from high I/O instance types like hi1.4xlarge that use SSDs rather than EBS storage.
I explain the key points of Kafka that makes it a very fast and reliable distributed message bus even using mechanical disks.
Oracle has evolved from its first release in 1979 to become a leading database with various editions that can be used by individuals, workgroups or enterprises, and it provides developer tools and supports different database structures, security mechanisms, SQL for data access and transactions. Key components of an Oracle database include control files, data files, redo log files, tablespaces that logically organize storage, and various memory and file structures.
The document discusses large-scale stream processing in the Hadoop ecosystem. It provides examples of real-time stream processing use cases for computing player statistics and analyzing telco network data. It then summarizes several open source stream processing frameworks, including Apache Storm, Samza, Kafka Streams, Spark, Flink, and Apex. Key aspects like programming models, fault tolerance methods, and performance are compared for each framework. The document concludes with recommendations for further innovation in areas like dynamic scaling and batch integration.
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing the incoming data in real-time. Performing large scale stream analysis requires specialized tools and techniques which have become widely available in the last couple of years. This talk will give a deep, technical overview of the Apache stream processing landscape. We compare several frameworks including Flink , Spark, Storm, Samza and Apex. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner to help selecting the best tools for the specific applications. We will touch on the topics of API expressivity, runtime architecture, performance, fault-tolerance and strong use-cases for the individual frameworks. This talk is targeted towards anyone interested in streaming analytics either from user’s or contributor’s perspective. The attendees can expect to get a clear view of the available open-source stream processing architectures
Tim Vaillancourt is a senior technical operations architect specializing in MongoDB. He has over 10 years of experience tuning Linux for database workloads and monitoring technologies like Nagios, MRTG, Munin, Zabbix, Cacti, and Graphite. He discussed the various MongoDB storage engines including MMAPv1, WiredTiger, RocksDB, and TokuMX. Key metrics for monitoring the different engines include lock ratio, page faults, background flushing times, checkpoints/compactions, replication lag, and scanned/moved documents. High-level operating system metrics like CPU, memory, disk, and network utilization are also important for ensuring MongoDB has sufficient resources.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
This document provides an overview and agenda for a presentation on virtualizing SQL Server workloads on VMware vSphere. The presentation will cover designing SQL Server virtual machines for performance in production environments, consolidating multiple SQL Server workloads, and ensuring SQL Server availability using vSphere features. It emphasizes understanding the workload, optimizing for storage and network performance, avoiding swapping, using large memory pages, and accounting for NUMA when configuring SQL Server virtual machines.
The different challenges of working with state in a distributed streaming data pipeline and how we solve it with the 3S architecture and Kafka streams stores based on rocksDB
Per costruire applicazioni di AI affidabili, sicure e governate occorre una base dati in tempo reale altrettanto solida. Ancor più quando ci troviamo a gestire ingenti flussi di dati in continuo movimento. Come arrivarci? Affidati a una vera piattaforma di data streaming che ti permetta di scalare e creare rapidamente applicazioni di AI in tempo reale partendo da dati affidabili. Scopri di più! Non perdere il nostro prossimo webinar durante il quale avremo l’occasione di: • Esplorare il paradigma della GenAI e di come questa nuova tecnnologia sta rimodellando il panorama aziendale, rispondendo alla necessità di offrire un contesto e soluzioni in tempo reale che soddisfino le esigenze della tua azienda. • Approfondire le incertezze del panorama dell'AI in evoluzione e l'importanza cruciale del data streaming e dell'elaborazione dati. • Vedere in dettaglio l'architettura in continua evoluzione e il ruolo chiave di Kafka e Confluent nelle applicazioni di AI. • Analizzare i vantaggi di una piattaforma di streaming dei dati come Confluent nel collegare l'eredità legacy e la GenAI, facilitando lo sviluppo e l’utilizzo di AI predittive e generative.
As businesses strive to remain at the cutting edge of innovation, the demand for scalable and up-to-date conversational AI solutions has become paramount. Generative AI (GenAI) chatbots that seamlessly integrate into our daily lives and adapt to the ever-evolving nuances of human interaction are crucial. Real-time data plays a pivotal role in ensuring the responsiveness and relevance of these chatbots, empowering them to stay abreast of the latest trends, user preferences, and contextual information.