How a BEAM runner executes a pipeline
Javier Ramirez (@supercoco9)
Head of Engineering @teamdatatonic
2018-10-02
▪ Why do I care?
▪ Pipeline basics
▪ Runner Overview
▪ Exploring the graph
▪ Implementing PCollections
▪ Watermark Propagation
▪ Implementing PTransforms: Read, ParDo, GroupByKey, Window, and Flatten
▪ Optimising the execution plan
▪ Persisting state (coders and snapshots)
▪ FnApi and RunnerApi for SDK independence
Why
▪ I started using Beam when Dataflow was a private alpha and “just” a serverless runner
▪ Then Beam was born as a common layer on top of multiple runners
▪ I wanted to understand what is part of Beam and what is part of the runner
▪ It might help you choose the right runner for the job
Pipeline overview
■ Write pipeline code in Java, Python, or Go
■ The abstraction is a Directed Acyclic Graph (DAG) where nodes are transforms and edges are data flowing as PCollections
■ Both PTransforms and PCollections can be distributed and parallelised, and the model is fault-tolerant, so they need to be serializable to be sent across workers
■ Read data from one or more inputs, bounded or unbounded
■ Apply transforms, stateless or stateful
■ Write data to one or more outputs
■ Optionally, keep track of metrics
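As a minimal sketch with the Java SDK (the bucket paths are placeholders), the DAG below has a Read source node, two transform nodes, and a Write sink node:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("Read", TextIO.read().from("gs://my-bucket/input-*.txt"))      // source node
        .apply("CountWords", Count.perElement())                           // transform node
        .apply("Format", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("Write", TextIO.write().to("gs://my-bucket/output"));       // sink node
    p.run().waitUntilFinish();  // hand the DAG to the runner for execution
  }
}
```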

Runner overview
“BEAM-compatible” is a very flexible claim. A runner:
▪ Can choose to support only some languages (The portability API will change this)
▪ Can choose to support only batch or streaming processing
▪ Can choose to what extent to support early triggers and late data, refinements, state…
▪ Needs to translate from BEAM code to runner-native code
▪ Is responsible for submitting and monitoring the pipeline
▪ Must serialize/deserialize data and functions across workers and stages
▪ Is responsible for performance, scalability, optimisations, and for enforcing the BEAM model guarantees (some methods will be called exactly once; a transform will not be executed by more than one thread at a time within a worker; if a bundle of data is processed by a transform more than once, it will not generate duplicates…)
Runner entrypoint: exploring the DAG
■ Beam provides a method to traverse (visit) the graph. Runners need to walk the graph to:
■ Validate the pipeline
■ Get insights to choose the best execution strategy
❏ Example: Spark Runner
❏ Chooses whether to use the batch or the streaming engine by visiting the graph and checking if any PCollection is unbounded
❏ Detects PCollections that are used in more than one transform, and creates internal caches to store those collections
■ Translate the BEAM transforms into native transforms
■ Optimise the graph execution (to minimize serialization and shuffling)
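For illustration, a minimal sketch of that first use case using the Java SDK's traverseTopologically hook (the visitor class name is made up; the boundedness check mirrors what the Spark runner does):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.runners.TransformHierarchy;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PValue;

// Visits every value in the graph and records whether any PCollection is
// unbounded, so the runner can pick its batch or streaming engine.
class BoundednessVisitor extends Pipeline.PipelineVisitor.Defaults {
  boolean sawUnbounded = false;

  @Override
  public void visitValue(PValue value, TransformHierarchy.Node producer) {
    if (value instanceof PCollection
        && ((PCollection<?>) value).isBounded() == PCollection.IsBounded.UNBOUNDED) {
      sawUnbounded = true;
    }
  }
}
// Usage: pipeline.traverseTopologically(visitor); then translate with the
// streaming engine if visitor.sawUnbounded is true.
```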
Implementing PCollections
■ Unordered bags of elements
■ Might be bounded or unbounded
■ All the elements are of the same type, and the PCollection has a coder to serialize/deserialize elements
■ Every element always has:
■ A timestamp (might be negative infinity if not important)
■ A window, which is initially the global window, but can be changed via transforms
■ Every PCollection has a watermark estimating how complete it is
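A sketch of the coder contract mentioned above; encoding to bytes and decoding back is what lets a runner move elements between workers:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.values.KV;

public class CoderRoundTrip {
  public static void main(String[] args) throws Exception {
    KvCoder<String, Long> coder = KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of());
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    coder.encode(KV.of("user-42", 7L), out);  // serialize before the shuffle/wire
    KV<String, Long> back = coder.decode(new ByteArrayInputStream(out.toByteArray()));
    System.out.println(back);  // same value on the other side: KV{user-42, 7}
  }
}
```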
Watermark Propagation
Figure: watermark propagation, taken from the Flink documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html

Implementing PTransforms
■ Beam can do pretty complex things with just a few primitives
■ Read
■ Flatten
■ Window
■ GroupByKey
■ ParDo
Implementing Read
■ Read can be bounded or unbounded
■ The runner calls split to partition the source into bundles
■ The runner gets a reader object for each bundle and drives the read loop (see the sketch below)
■ If supported, the runner can call splitAtFraction to enable dynamic work rebalancing
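A sketch of that split-then-read loop, assuming the runner's translation layer hands us the source and options (the 64 MB bundle size is an arbitrary example):

```java
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

class BoundedReadSketch {
  static <T> void readAll(BoundedSource<T> source, PipelineOptions options)
      throws Exception {
    // Split the source into bundles of roughly 64 MB each.
    List<? extends BoundedSource<T>> bundles = source.split(64 << 20, options);
    for (BoundedSource<T> bundle : bundles) {
      // One reader per bundle; drive it with start()/advance().
      try (BoundedSource.BoundedReader<T> reader = bundle.createReader(options)) {
        for (boolean more = reader.start(); more; more = reader.advance()) {
          T element = reader.getCurrent();
          // ...feed `element` into the fused downstream transforms...
        }
      }
    }
  }
}
```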
Implementing unbounded Read
■ The general pattern is the same for both bounded and unbounded, but unbounded sources also:
■ Report a watermark that the runner needs to associate with the elements and propagate downstream
■ Provide record IDs in case we need to use deduplication to enforce exactly-once processing
■ Support checkpointing. The runner can get information about the current checkpoint on the stream, and can call finalizeCheckpoint to tell the unbounded source the elements are safely in the pipeline and can be acknowledged from the stream if needed (see the sketch below)
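A sketch of one bundle over an unbounded reader; the bundle budget and the durability point at which the checkpoint is finalized are runner decisions:

```java
import org.apache.beam.sdk.io.UnboundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.joda.time.Instant;

class UnboundedReadSketch {
  static <T, C extends UnboundedSource.CheckpointMark> void readBundle(
      UnboundedSource<T, C> source, PipelineOptions options, C restored) throws Exception {
    UnboundedSource.UnboundedReader<T> reader = source.createReader(options, restored);
    int budget = 1000;  // bundle size: a runner decision
    for (boolean more = reader.start(); more && budget-- > 0; more = reader.advance()) {
      T element = reader.getCurrent();
      byte[] recordId = reader.getCurrentRecordId();  // used to deduplicate, if required
      // ...emit `element` with reader.getCurrentTimestamp() downstream...
    }
    Instant watermark = reader.getWatermark();  // associate with output, propagate downstream
    UnboundedSource.CheckpointMark mark = reader.getCheckpointMark();
    // Only once the emitted elements are durably in the pipeline:
    mark.finalizeCheckpoint();  // the source may now ack/garbage-collect upstream
  }
}
```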
Implementing Flatten
■ The runner only needs to verify that the windowing strategies of all the PCollections being flattened are compatible
■ The result is a single PCollection containing all the elements and windows of the input PCollections, without any changes
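A sketch, with hypothetical input names:

```java
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

class FlattenSketch {
  // Both inputs must share a compatible windowing strategy; the output
  // contains every element and window of the inputs, unchanged.
  static PCollection<String> mergeLogs(
      PCollection<String> kafkaLogs, PCollection<String> fileLogs) {
    return PCollectionList.of(kafkaLogs).and(fileLogs).apply(Flatten.pCollections());
  }
}
```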

Implementing Window
■ Window is just a grouping key with a maximum timestamp
■ One element can conceptually be in one window only. If you need to assign an element to multiple windows, it counts as multiple elements from Beam’s point of view
■ The runner may choose a physical representation where one element appears to be assigned to multiple windows for storage efficiency, but conceptually that still maps to multiple elements
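For example, sliding windows put each element into more than one window, which the model counts as multiple (element, window) pairs; a sketch:

```java
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class WindowSketch {
  // 10-minute windows sliding every 5 minutes: each element is assigned to
  // two windows, i.e. conceptually duplicated, even if the runner stores it
  // once with two window labels.
  static PCollection<String> window(PCollection<String> events) {
    return events.apply(Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(10))
            .every(Duration.standardMinutes(5))));
  }
}
```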
Implementing GroupByKey
■ GroupByKey groups a PCollection of key-value pairs by Key and Window
■ GroupByKey will emit results only when window triggers allow it, and should automatically drop expired elements
■ Since GroupByKey is closely related to windows, it needs to be able to merge elements by window when requested, for example to support session windows
■ GroupByKey needs to choose the timestamp to emit with the results
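A sketch showing both the per-key-and-window grouping and an explicit choice of output timestamp (END_OF_WINDOW is one of the options the model offers):

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.TimestampCombiner;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class GroupByKeySketch {
  static PCollection<KV<String, Iterable<Long>>> group(
      PCollection<KV<String, Long>> input) {
    return input
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
            // The timestamp the runner must emit with each grouped result:
            .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
        .apply(GroupByKey.<String, Long>create());
  }
}
```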
Implementing ParDo
■ Conceptually simple:
■ Setup is called once per instance of the ParDo
■ The runner decides on the bundle size (some runners allow user control)
■ It calls startBundle once per bundle
■ It calls processElement once per element
■ If we are using timely processing, it calls onTimer for each timer activation
■ It finishes by calling finishBundle
■ If an element fails, the whole bundle is retried
■ Teardown is called to release ParDo resources
■ Under the hood, the runner needs to take into account that ParDos can be stateful and can have side inputs. In those cases the runner is responsible for keeping and propagating state, and for materialising the side inputs
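The lifecycle above maps directly onto DoFn method annotations in the Java SDK (timers would add an @OnTimer method); a sketch:

```java
import org.apache.beam.sdk.transforms.DoFn;

class AuditFn extends DoFn<String, String> {
  @Setup
  public void setup() { /* once per DoFn instance: open clients, caches */ }

  @StartBundle
  public void startBundle() { /* once per bundle */ }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toUpperCase());  // once per element
  }

  @FinishBundle
  public void finishBundle() { /* flush buffers; a failed element retries the whole bundle */ }

  @Teardown
  public void teardown() { /* release resources */ }
}
```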
Optimising the DAG execution
■ Two levels of optimisation
■ Execution plan (Supported by BEAM)
■ Intermediate data materialisation (Depends on the Runner)

Optimising the Execution Plan
■ Fusion and combine aggregation are core concepts behind the Dataflow/BEAM model
■ The Java SDK provides core helpers to deal with this
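For instance, a Combine.perKey is written as one transform, but a runner can "lift" the combine, running partial aggregations before the shuffle and merging them after it; a sketch:

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class CombineLiftingSketch {
  // The optimiser may split this into: partial sums on each worker (before
  // the GroupByKey shuffle) and a final merge of partials per key after it,
  // drastically reducing the data that gets shuffled.
  static PCollection<KV<String, Integer>> totals(PCollection<KV<String, Integer>> in) {
    return in.apply(Combine.perKey(Sum.ofIntegers()));
  }
}
```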
Intermediate data materialisation
■ What do we do if one transform fails downstream and we need to reprocess the data?
■ With some sources (like a static file) we could potentially replay the whole data and retry. Not deterministic or fast, but it would work
■ With unbounded sources (or with bounded sources with changing data), we might not be able to replay the whole data, so we need some way of “materialising/checkpointing/snapshotting” the data
❏ Flink and IBM Streams use distributed snapshots
❏ Samza uses incremental checkpoints
Transforms called more than once
SDK independent runners
■ Recap: Executing a user's pipeline can be broken up into the following categories:
■ Perform any grouping and triggering logic, including keeping track of a watermark, window merging, propagating status…
■ Pipeline execution management, such as polling for counter information, job status, …
■ Execute user-definable functions (UDFs), including custom sources/sinks, regular DoFns…
■ Executing UDFs is the only category which requires a language-specific SDK context to execute in. Moving the execution of UDFs to language-specific SDK harnesses, and using an RPC service between the two, allows for a cross-language / cross-runner portability story

SDK independent runners: RunnerAPI & FnAPI
■ The harness is a Docker container able to run the language-specific parts of the pipeline. The runner is responsible for launching and managing the container. Communication between runner and harness goes over the Fn API, implemented with gRPC
SDK independent runners: FnAPI
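As a sketch of how a Java pipeline targets this architecture (assuming a portable job server is already listening on localhost:8099; the option names follow the portability framework's Java options class):

```java
import org.apache.beam.runners.portability.PortableRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

class PortableSubmitSketch {
  static PortablePipelineOptions portableOptions() {
    PortablePipelineOptions options =
        PipelineOptionsFactory.create().as(PortablePipelineOptions.class);
    options.setJobEndpoint("localhost:8099");     // gRPC endpoint of the job server
    options.setDefaultEnvironmentType("DOCKER");  // UDFs run in the SDK harness container
    options.setRunner(PortableRunner.class);
    // The SDK serialises the pipeline to the Runner API proto and submits it
    // over gRPC; at execution time the runner talks to the harness via the Fn API.
    return options;
  }
}
```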
Cheers
Javier Ramirez (@supercoco9)
Head of Engineering @teamdatatonic

 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
 

How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018

Runner overview
“BEAM-compatible” is a very flexible claim. A runner:
▪ Can choose to support only some languages (the portability API will change this)
▪ Can choose to support only batch or only streaming processing
▪ Can choose to what extent to support early triggers, late data, refinements, state…
▪ Needs to translate BEAM code into runner-native code
▪ Is responsible for submitting and monitoring the pipeline
▪ Must serialize/deserialize data and functions across workers and stages
▪ Is responsible for performance, scalability, and optimisations, and for enforcing the BEAM model guarantees (some methods will be called exactly once, a transform will not be executed by more than one thread at a time within a worker, a bundle of data processed by a transform more than once will not generate duplicates…)
Runner entrypoint: exploring the DAG
■ Beam provides a method to traverse (visit) the graph. Runners need to walk the graph to:
■ Validate the pipeline
■ Get insights to choose the best execution strategy
❏ Example: the Spark runner
❏ Chooses whether to use the batch or the streaming engine by visiting the graph and checking if any PCollection is unbounded (see the visitor sketch below)
❏ Detects PCollections that are used in more than one transform, and creates internal caches to store those collections
■ Translate the BEAM transforms into native transforms
■ Optimise the graph execution (to minimise serialization and shuffling)
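As a rough illustration of that traversal hook, here is a minimal sketch against the Beam Java SDK's PipelineVisitor; the BoundednessVisitor class is invented for this example, and the boundedness check only mirrors what a runner such as Spark's could do:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.runners.TransformHierarchy;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PValue;

// Visits every value in the graph and records whether any PCollection is
// unbounded, so the runner can pick a batch or a streaming engine.
class BoundednessVisitor extends Pipeline.PipelineVisitor.Defaults {
  boolean seenUnbounded = false;

  @Override
  public void visitValue(PValue value, TransformHierarchy.Node producer) {
    if (value instanceof PCollection
        && ((PCollection<?>) value).isBounded() == PCollection.IsBounded.UNBOUNDED) {
      seenUnbounded = true;
    }
  }
}
```

A runner would call pipeline.traverseTopologically(new BoundednessVisitor()) before translation and branch on the recorded result.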
Implementing PCollections
■ Unordered bags of elements
■ Might be bounded or unbounded
■ All the elements are of the same type, and the PCollection has a coder to serialize/deserialize elements
■ Every element will always have:
■ A timestamp (might be negative infinity if not important)
■ A window, which is initially the global window, but can be changed via transforms
■ Every PCollection has a watermark estimating how complete it is
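For instance, a minimal Java SDK snippet that makes the timestamp and the coder explicit (element values and the class name are illustrative):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Instant;

public class TimestampedElements {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Each element gets an explicit event-time timestamp; its window is the
    // global window until a Window transform changes it.
    PCollection<String> events =
        p.apply(Create.timestamped(
                TimestampedValue.of("click", new Instant(1000L)),
                TimestampedValue.of("view", new Instant(2000L))))
            .setCoder(StringUtf8Coder.of()); // coder used to move elements across workers
    p.run().waitUntilFinish();
  }
}
```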
Watermark Propagation
■ Watermark propagation diagram taken from the Flink documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html
Implementing PTransforms
■ Beam can do pretty complex things with just a few primitives:
■ Read
■ Flatten
■ Window
■ GroupByKey
■ ParDo
Implementing Read
■ Read can be bounded or unbounded
■ The runner calls split to break the source into bundles
■ The runner gets a reader object for each bundle and drives the read loop (sketched below)
■ If supported, the runner can call splitAtFraction to enable dynamic work rebalancing
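The original slide pointed at a code snippet that is lost in this transcript; the following is a reconstructed sketch of the runner-side loop over the Beam Java BoundedSource API (the bundle size and the wrapper class are illustrative):

```java
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

class BoundedReadLoop {
  static <T> void readAll(BoundedSource<T> source, PipelineOptions options) throws Exception {
    // 1. Split the source into bundles; the desired bundle size is a runner choice.
    List<? extends BoundedSource<T>> bundles = source.split(64 * 1024 * 1024, options);
    for (BoundedSource<T> bundle : bundles) {
      // 2. Get a reader per bundle: start() once, then advance() until exhausted.
      try (BoundedSource.BoundedReader<T> reader = bundle.createReader(options)) {
        for (boolean more = reader.start(); more; more = reader.advance()) {
          T element = reader.getCurrent();
          // 3. A real runner emits the element here, paired with
          //    reader.getCurrentTimestamp().
        }
      }
    }
  }
}
```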
Implementing unbounded Read
■ The general pattern is the same for bounded and unbounded reads, but unbounded sources also:
■ Report a watermark that the runner needs to associate with the elements and propagate downstream
■ Provide record IDs in case we need deduplication to enforce exactly-once processing
■ Support checkpointing: the runner can get the current checkpoint mark on the stream, and can call finalizeCheckpoint to tell the unbounded source that the elements are safely in the pipeline and can be acknowledged on the stream if needed
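A sketch of those extra calls against the Java UnboundedSource reader API; the pump helper is invented for illustration, and a real runner only finalizes the checkpoint after the bundle is durably committed:

```java
import org.apache.beam.sdk.io.UnboundedSource;
import org.joda.time.Instant;

class UnboundedReadStep {
  // One iteration of a runner's read loop for an unbounded reader.
  static <T> void pump(UnboundedSource.UnboundedReader<T> reader) throws Exception {
    if (reader.advance()) {
      T element = reader.getCurrent();
      // Record ID: lets the runner deduplicate if exactly-once requires it.
      byte[] recordId = reader.getCurrentRecordId();
      // Watermark: attached to the output and propagated downstream.
      Instant watermark = reader.getWatermark();
      // Checkpoint mark: after the bundle is durably committed, calling
      // finalizeCheckpoint() lets the source ack the records upstream.
      UnboundedSource.CheckpointMark mark = reader.getCheckpointMark();
      mark.finalizeCheckpoint();
    }
  }
}
```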
Implementing Flatten
■ The runner only needs to verify that the window strategies of all the PCollections to flatten are compatible
■ The result is a single PCollection containing all the elements and windows of the input PCollections, without any changes
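From the pipeline author's side this is just the following Java SDK usage (element values are illustrative):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class FlattenExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<String> a = p.apply("A", Create.of("1", "2"));
    PCollection<String> b = p.apply("B", Create.of("3"));
    // Both inputs use the default global windowing, so they are compatible.
    PCollection<String> merged =
        PCollectionList.of(a).and(b).apply(Flatten.pCollections());
    p.run().waitUntilFinish();
  }
}
```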
Implementing Window
■ A window is just a grouping key with a maximum timestamp
■ One element can conceptually be in one window only. If you need to assign an element to multiple windows, it counts as multiple elements from Beam’s point of view
■ The runner may choose a physical representation where one element appears to be assigned to multiple windows for storage efficiency, but conceptually it still maps to multiple elements
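Sliding windows make this concrete: in the Java SDK sketch below (the helper is invented), each element lands in four overlapping windows, i.e. it becomes four conceptual (element, window) pairs.

```java
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class WindowingExample {
  static PCollection<String> slide(PCollection<String> input) {
    // 4-minute windows starting every minute: each element is assigned to
    // 4 overlapping windows.
    return input.apply(
        Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(4))
                .every(Duration.standardMinutes(1))));
  }
}
```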
Implementing GroupByKey
■ GroupByKey groups a PCollection of key-value pairs by key and window
■ GroupByKey emits results only when the window’s triggers allow it, and should automatically drop expired elements
■ Since GroupByKey is closely related to windowing, it needs to be able to merge windows when requested, for example to support session windows
■ GroupByKey needs to choose the timestamp to emit with the results
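A sketch tying those points together in the Java SDK: session windows (which force window merging), a watermark trigger, and an explicit choice of output timestamp. The windowing parameters are illustrative.

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.TimestampCombiner;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class GroupByKeyExample {
  static PCollection<KV<String, Iterable<Integer>>> group(
      PCollection<KV<String, Integer>> input) {
    return input
        .apply(
            Window.<KV<String, Integer>>into(
                    Sessions.withGapDuration(Duration.standardMinutes(10)))
                // Emit when the watermark passes the end of the (merged) window.
                .triggering(AfterWatermark.pastEndOfWindow())
                .withAllowedLateness(Duration.ZERO)
                // Timestamp chosen for each grouped result: end of its window.
                .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW)
                .discardingFiredPanes())
        .apply(GroupByKey.create());
  }
}
```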
Implementing ParDo
■ Conceptually simple:
■ Setup is called once per instance of the ParDo
■ The runner decides on the bundle size (some runners allow user control)
■ It calls startBundle once per bundle
■ It calls processElement once per element
■ If we are using timely processing, it calls onTimer for each timer activation
■ It finishes by calling finishBundle
■ If an element fails, the whole bundle is retried
■ Teardown is called to release ParDo resources
■ Under the hood, the runner needs to take into account that ParDos can be stateful and can have side inputs. In those cases the runner is responsible for keeping and propagating state, and for materialising the side inputs
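This lifecycle maps directly onto the Java SDK's DoFn annotations, as in the sketch below (hook bodies are placeholders):

```java
import org.apache.beam.sdk.transforms.DoFn;

class LifecycleFn extends DoFn<String, String> {
  @Setup
  public void setup() {
    // Once per DoFn instance: open clients, caches, connections...
  }

  @StartBundle
  public void startBundle() {
    // Once per bundle; the bundle size is the runner's choice.
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Once per element. If any element in the bundle fails,
    // the runner retries the whole bundle.
    c.output(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // Once per bundle: flush any buffered work.
  }

  @Teardown
  public void teardown() {
    // Release resources held by this instance.
  }
}
```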
Optimising the DAG execution
■ Two levels of optimisation:
■ Execution plan (supported by BEAM)
■ Intermediate data materialisation (depends on the runner)
Optimising the Execution Plan
■ Fusion and combine aggregation (combiner lifting) are core concepts behind the Dataflow/BEAM model
■ The Java SDK provides core helpers to deal with this
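As an example of what combiner lifting enables, a Combine.perKey like the Java SDK sketch below can be executed as partial sums on each worker before the shuffle, so only per-key partials cross the network (the wrapper class is illustrative):

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class CombineLiftingExample {
  static PCollection<KV<String, Integer>> totals(PCollection<KV<String, Integer>> input) {
    // Each worker can pre-aggregate locally before the GroupByKey-style
    // shuffle; the runner "lifts" the partial combine in front of it.
    return input.apply(Combine.perKey(Sum.ofIntegers()));
  }
}
```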
Intermediate data materialisation
■ What do we do if a transform fails downstream and we need to reprocess the data?
■ With some sources (like a static file) we could potentially replay the whole dataset and retry. Not deterministic or fast, but it would work
■ With unbounded sources (or with bounded sources over changing data), we might not be able to replay the whole dataset, so we need some way of “materialising/checkpointing/snapshotting” the data:
❏ Flink and IBM Streams: distributed snapshots
❏ Samza: incremental checkpoints
SDK-independent runners
■ Recap: executing a user’s pipeline can be broken up into the following categories:
■ Performing any grouping and triggering logic, including keeping track of a watermark, window merging, propagating status…
■ Pipeline execution management, such as polling for counter information, job status…
■ Executing user-definable functions (UDFs), including custom sources/sinks and regular DoFns
■ Executing UDFs is the only category which requires a language-specific SDK context to execute in. Moving the execution of UDFs to language-specific SDK harnesses, with an RPC service between the two, allows for a cross-language/cross-runner portability story.
SDK-independent runners: RunnerApi & FnApi
■ The harness is a Docker container able to run the language-specific parts of the pipeline. The runner is responsible for launching and managing the container. Communication between runner and harness happens via the FnApi, implemented over gRPC
▪ Why do I care?
▪ Pipeline basics
▪ Runner Overview
▪ Exploring the graph
▪ Implementing PCollections
▪ Watermark Propagation
▪ Implementing PTransforms: Read, ParDo, GroupByKey, Window, and Flatten
▪ Optimising the execution plan
▪ Persisting state (coders and snapshots)
▪ FnApi and RunnerApi for SDK independence
Cheers
Javier Ramirez (@supercoco9)
Head of Engineering @teamdatatonic