In this talk I will present the architecture that allows runners to execute a Beam pipeline. I will explain what needs to happen for a compatible runner to know which transforms to run, how data is passed from one step to the next, and how Beam allows runners to be SDK-agnostic when running pipelines.
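As a minimal sketch of the runner-agnostic idea (assuming the Beam Python SDK is installed), the same pipeline definition can be handed to different runners purely through options; only the runner name changes, never the transforms:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for e.g. "FlinkRunner" without touching the transforms.
opts = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=opts) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "a"])
     | "Count" >> beam.combiners.Count.PerElement()
     | "Print" >> beam.Map(print))
```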
Flink Forward San Francisco 2022. In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small-files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This reduces the number of concurrent files that each task writes, and it also improves data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling. by Gang Ye & Steven Wu
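This is not the actual Flink Iceberg sink code, just an illustrative PyFlink sketch of the core idea: keying records by their partition value before the writer means each writer task sees fewer distinct partitions, so it opens fewer concurrent files. The (hour, payload) records are hypothetical.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical (event_time_hour, payload) records destined for an
# hour-partitioned Iceberg table.
records = env.from_collection([(0, "a"), (1, "b"), (0, "c"), (1, "d")])

# Hash-distribute by the partition column; a smarter implementation could
# use range partitioning or bin packing to also balance skewed partitions.
shuffled = records.key_by(lambda r: r[0])

shuffled.print()
env.execute("shuffle-before-write-sketch")
```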
"There's little talk about capacity planning Kafka clusters, it's very much learn as you go, every cluster is different. In this talk Kafka DevOps Engineer Jason Bell takes you through the things that will help you, from broker capacity, thinking about topics and how the other Confluent components can affect throughput and performance. With a number of production deployments under his watchful gaze for over six years Jason has plenty of experience, stories and useful information that will help you. By the end of the talk you'll have a good understanding of designing the cluster for various scenarios, where the points of latency are to watch and monitor. And also how to prevent teams breaking the cluster behind your back. This talk is designed for everyone, anyone who is just starting to those who are operating Kafka on a daily basis."
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: real-time stream processing, machine learning at scale, graph analytics, and batch processing. In these slides, you will find answers to the following questions: What is the Apache Flink stack and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm, and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
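A small PySpark sketch to make the concepts concrete: a groupBy triggers a shuffle, and a few of the shuffle-related options the deck refers to can be set on the session. The specific values below are illustrative, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shuffle-demo")
         .config("spark.shuffle.compress", "true")          # compress shuffle output
         .config("spark.reducer.maxSizeInFlight", "48m")    # fetch buffer per reducer
         .config("spark.sql.shuffle.partitions", "200")     # reduce-side parallelism
         .getOrCreate())

df = spark.range(1_000_000).withColumnRenamed("id", "k")
# groupBy forces a repartition of the data by key: the shuffle write/read path.
df.groupBy((df.k % 10).alias("bucket")).count().show()
```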
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps: 1) Zstandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It's beneficial not only when you use `emptyDir` with the `memory` medium, but it also maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of the Zstandard buffer pool feature, and its performance gain is impressive, too. 2) Event log compression is another area to save storage cost on cloud storage like S3 and to improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard. 3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 supports Zstandard already, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression. 4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip. There are more community works to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0.
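A sketch of the Zstandard knobs mentioned above for a recent Spark 3.x; availability of each option depends on the exact version (e.g. ORC Zstandard needs the SPARK-33978 line of work), and the paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zstd-demo")
         # 1) shuffle / local disk IO compression
         .config("spark.io.compression.codec", "zstd")
         # 2) event log compression (takes effect when spark.eventLog.enabled is set)
         .config("spark.eventLog.compress", "true")
         .getOrCreate())

df = spark.range(1000)
# 3) columnar data files compressed with zstd
df.write.option("compression", "zstd").mode("overwrite").orc("/tmp/zstd_orc")
df.write.option("compression", "zstd").mode("overwrite").parquet("/tmp/zstd_parquet")
```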
This document discusses using ray tracing to perform visibility testing of voxel objects. It describes how ray tracing can be used to efficiently determine which voxel objects are visible without needing to render everything, as with traditional occlusion culling. The key steps are: (1) create a ray-tracing buffer in GPU memory; (2) trace rays from pixels into a KD-tree of voxel objects to find the closest visible object; (3) record the object IDs in the buffer. This approach is shown to perform faster than CPU occlusion culling by implementing it with CUDA on the GPU. Testing finds it can render frames at over 60 FPS even with large voxel worlds containing 50,000+ objects.
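The document's implementation is CUDA against a KD-tree; as a heavily simplified, language-agnostic illustration of the idea, here is a CPU sketch that casts one axis-aligned ray per pixel through a uniform voxel grid and records the ID of the first object hit in a visibility buffer. The world contents and pixel-to-ray mapping are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical 3D world: 0 = empty, otherwise an object ID.
world = np.zeros((64, 64, 64), dtype=np.int32)
world[30:34, 30:34, 40:44] = 7  # one small voxel object with ID 7

def visible_ids(width=8, height=8, max_steps=64):
    """March one axis-aligned ray per pixel along +z; return the ID buffer."""
    buf = np.zeros((height, width), dtype=np.int32)
    for y in range(height):
        for x in range(width):
            # Map pixel to a column of voxels (a stand-in for real ray setup).
            vx, vy = x * 8, y * 8
            for vz in range(max_steps):
                hit = world[vx, vy, vz]
                if hit:
                    buf[y, x] = hit  # record the closest visible object ID
                    break
    return buf

ids = np.unique(visible_ids())
print("visible object IDs:", ids[ids != 0])
```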
Flink Forward San Francisco 2022. In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We'll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times. by Piotr Nowojski
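A hedged PyFlink sketch of two checkpointing knobs in the spirit of the talk; the method names are from the PyFlink DataStream API in recent releases, and the intervals are illustrative:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60s

cfg = env.get_checkpoint_config()
# Unaligned checkpoints let barriers overtake buffered records, which can
# shrink checkpoint times for jobs suffering from backpressure.
cfg.enable_unaligned_checkpoints()
cfg.set_min_pause_between_checkpoints(30_000)
```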
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory, and how to tune it depending on the use case.
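As a worked example of the kind of calculation the session covers, here is the arithmetic of Spark's unified memory model for on-heap memory with default settings; a sketch of the formula, not a substitute for the official docs:

```python
heap_mb = 4096                     # --executor-memory 4g
reserved_mb = 300                  # fixed reserved memory
memory_fraction = 0.6              # spark.memory.fraction (default)
storage_fraction = 0.5             # spark.memory.storageFraction (default)

usable = heap_mb - reserved_mb                 # 3796 MB
unified = usable * memory_fraction             # 2277.6 MB for execution+storage
storage = unified * storage_fraction           # 1138.8 MB storage (evictable)
execution = unified - storage                  # 1138.8 MB execution
user = usable - unified                        # 1518.4 MB user memory

print(f"unified={unified:.0f}MB storage={storage:.0f}MB "
      f"execution={execution:.0f}MB user={user:.0f}MB")
```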
This is the intro to (blog) writing for developers that I presented at the "백수들의 Conference" on June 24, 2018. I mainly talked about know-how for reading lots of good writing and know-how for writing consistently! (There is nothing about how to actually write a post!) Feedback is always welcome :)
XStream is Facebook's unified stream processing platform that provides a fully managed stream processing service. It was built using the Stylus C++ stream processing framework and uses a common SQL dialect called CoreSQL. XStream employs an interpretive execution model using the new Velox vectorized SQL evaluation engine for high performance. This provides a consistent, high-efficiency stream processing platform to support diverse real-time use cases at planetary scale for Facebook.
The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting data on commonly filtered columns made row-group statistics more selective and reduced the data read by 30%. For the reader, splitting it to first evaluate the filter columns and only then read the remaining columns avoided loading unnecessary data. These changes improved Spark SQL performance at ByteDance without requiring changes to users' jobs.
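A sketch of the two ideas using standard PySpark (not ByteDance's internal reader changes): sort on a frequently filtered column before writing so Parquet row-group min/max statistics become selective, then rely on filter pushdown when reading. The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.parquet.filterPushdown", "true")
         .getOrCreate())

events = spark.range(1_000_000).withColumnRenamed("id", "user_id")
# Sorting clusters similar user_id values into the same row groups.
events.sort("user_id").write.mode("overwrite").parquet("/tmp/events_sorted")

# The filter can now be pushed to the Parquet reader, which skips row groups
# whose [min, max] range excludes the predicate.
hits = spark.read.parquet("/tmp/events_sorted").where("user_id = 4242")
print(hits.count())
```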
This document summarizes Michael Armbrust's presentation on real-time Spark. It discusses: 1. The goals of real-time analytics including having the freshest answers as fast as possible while keeping the answers up to date. 2. Spark 2.0 introduces unified APIs for SQL, DataFrames and Datasets to make developing real-time analytics simpler with powerful yet simple APIs. 3. Structured streaming allows running the same SQL queries on streaming data to continuously aggregate data and update outputs, unifying batch, interactive, and streaming queries into a single API.
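A minimal structured streaming sketch of that unification: the same DataFrame aggregation syntax as batch, continuously updated as new rows arrive. The built-in "rate" source is used so the example is self-contained; the bucketing is arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Continuously aggregate: counts per value-modulo-10 bucket, updated forever.
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("complete")   # re-emit the full updated result table
         .format("console")
         .start())
query.awaitTermination()
```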
MongoDB adopted the transaction feature (ACID properties) in MongoDB 4.0. This talk focuses on the internals of how MongoDB implemented the ACID properties with the WiredTiger storage engine. WiredTiger offers more future possibilities for MongoDB. This tech talk was presented at the Mydbops Database Meetup on 27-04-2019 by Manosh Malai, Senior DevOps/NoSQL Consultant with Mydbops, and Ranjith, Database Administrator with Mydbops.
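A minimal multi-document transaction sketch with PyMongo (MongoDB 4.0+, which requires a replica set); the connection string, database, and collection names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.shop

with client.start_session() as session:
    with session.start_transaction():
        # Both writes commit atomically, or neither does.
        db.orders.insert_one({"item": "book", "qty": 1}, session=session)
        db.stock.update_one({"item": "book"},
                            {"$inc": {"qty": -1}}, session=session)
```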
In this talk we present Zalando's microservices architecture, introduce Saiki – our next generation data integration and distribution platform on AWS – and show how we employ stream processing for near-real-time business intelligence. Zalando is one of the largest online fashion retailers in Europe. In order to secure our future growth and remain competitive in this dynamic market, we are transitioning from a monolithic to a microservices architecture and from a hierarchical to an agile organization. We first look at how business intelligence processes have worked inside Zalando over the last several years and then present our current approach – Saiki. It is a scalable, cloud-based data integration and distribution infrastructure that makes data from our many microservices readily available for analytical teams. We no longer live in a world of static data sets, but are instead confronted with an endless stream of events that constantly inform us about relevant happenings from all over the enterprise. Processing these event streams enables us to do near-real-time business intelligence. In this context we have evaluated Apache Flink vs. Apache Spark in order to choose the right stream processing framework. Given our requirements, we decided to use Flink as part of our technology stack, alongside Kafka and Elasticsearch. With these technologies we are currently working on two use cases: a near-real-time business process monitoring solution and streaming ETL. Monitoring our business processes enables us to check whether the Zalando platform is working from a technical standpoint. It also helps us analyze data streams on the fly, e.g. order velocities and delivery velocities, and to control service level agreements. On the other hand, streaming ETL is used to offload our relational data warehouse, which struggles with increasingly high loads. It also reduces latency and facilitates platform scalability. Finally, we give an outlook on our future use cases, e.g. near-real-time sales and price monitoring. Another aspect to be addressed is lowering the entry barrier of stream processing for our colleagues coming from a relational database background.
The first presentation for Kafka Meetup @ Linkedin (Bangalore) held on 2015/12/5 It provides a brief introduction to the motivation for building Kafka and how it works from a high level. Please download the presentation if you wish to see the animated slides.
This document summarizes Chen Yang's presentation on vectorized query execution in Apache Spark at Facebook. The key points are: 1) Spark is the largest SQL query engine at Facebook and uses columnar formats like ORC to improve storage efficiency. 2) Vectorized processing can improve performance over row-at-a-time processing by reducing per-row overhead and improving cache locality. 3) Facebook has implemented a vectorized ORC reader and writer in Spark that shows up to 8x speedup on microbenchmarks compared to the row-at-a-time approach.
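This is not Facebook's ORC reader, just a small illustration of why vectorized (batch-at-a-time) execution beats row-at-a-time processing: per-row dispatch overhead vanishes and the column is scanned with good cache locality.

```python
import time
import numpy as np

values = np.random.rand(5_000_000)

t0 = time.perf_counter()
total_rows = 0.0
for v in values:            # row-at-a-time: one dispatch per value
    total_rows += v
t1 = time.perf_counter()

total_vec = values.sum()    # vectorized: one call over the whole column
t2 = time.perf_counter()

print(f"row-at-a-time: {t1 - t0:.2f}s, vectorized: {t2 - t1:.3f}s")
```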
Badai Aqrandista, Confluent, Senior Technical Support Engineer. This session will be about a common issue with the Kafka producer: producer batch expiry. We will be discussing the Kafka producer internals, the common causes of batch expiry, such as a slow network or small batches, and how to overcome them. We will also be sharing some examples along the way! https://www.meetup.com/apache-kafka-sydney/events/279651982/
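A sketch of the producer settings involved in batch expiry, using the confluent-kafka Python client; the values and topic name are illustrative. A batch expires when it cannot be delivered within the delivery timeout.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 50,                # wait up to 50ms to fill bigger batches
    "batch.size": 131072,           # larger batches: fewer, fuller requests
    "delivery.timeout.ms": 120000,  # raise if slow networks cause expiry
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed (possibly expired): {err}")

producer.produce("demo-topic", b"hello", callback=on_delivery)
producer.flush()
```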
This document provides an introduction to Apache Flink and its streaming capabilities. It discusses stream processing as an abstraction and how Flink uses streams as its core abstraction. It compares Flink streaming to Spark Streaming and outlines some of Flink's streaming features, such as windows, triggers, and event-time handling.
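A hedged PyFlink sketch of the features named above: event time carried in the records, a bounded-out-of-orderness watermark, and a tumbling event-time window summing (key, value, timestamp) tuples. The sample records are hypothetical.

```python
from pyflink.common import Duration, Time, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows

class ThirdFieldTimestamp(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[2]  # event time in ms carried in the record itself

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection([("a", 1, 1000), ("a", 2, 4000), ("b", 5, 2000)])

watermarks = (WatermarkStrategy
              .for_bounded_out_of_orderness(Duration.of_seconds(5))
              .with_timestamp_assigner(ThirdFieldTimestamp()))

(events.assign_timestamps_and_watermarks(watermarks)
       .key_by(lambda e: e[0])
       .window(TumblingEventTimeWindows.of(Time.seconds(10)))
       .reduce(lambda a, b: (a[0], a[1] + b[1], max(a[2], b[2])))
       .print())

env.execute("windowed-sum-sketch")
```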