Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business,
and consumers of these datasets have detailed requirements for latency, cost, and
completeness.
Apache Beam (incubating) defines a new data processing programming model that evolved
from more than a decade of experience building Big Data infrastructure within Google,
including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch
and streaming use cases and neatly separates properties of the data from runtime
characteristics, allowing pipelines to be portable across multiple runtime environments, both
open source (e.g., Apache Flink, Apache Spark, et al.), and proprietary (e.g., Google Cloud
Dataflow).
This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main
concepts in the programming model. During the talk, we'll argue why Beam is unified, efficient,
and portable.
Abstract
Davor Bonaci
Apache Beam PPMC
Software Engineer, Google Inc.
Apache Beam:
A Unified Model for Batch and
Streaming Data Processing
Hadoop Summit, June 28-30, 2016, San Jose, CA
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
1. The Beam Model:
What / Where / When / How
2. SDKs for writing Beam pipelines:
• Java
• Python
3. Runners for existing distributed processing backends:
• Apache Flink
• Apache Spark
• Google Cloud Dataflow
• Local runner for testing
What is Apache Beam?
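For orientation, a minimal sketch of a Beam pipeline in the Java SDK. This is a sketch assuming incubating-era APIs such as TextIO and Count; the bucket paths and FormatCountsFn are hypothetical, not from the deck:

Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
p.apply(TextIO.Read.from("gs://my-bucket/input-*"))   // read
 .apply(Count.<String>perElement())                   // aggregate
 .apply(ParDo.of(new FormatCountsFn()))               // hypothetical formatting DoFn
 .apply(TextIO.Write.to("gs://my-bucket/output"));    // write
p.run();  // executed by whichever runner the options select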
[Architecture diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam Model; Beam Model Fn runners then execute them on Apache Flink, Apache Spark, or Google Cloud Dataflow.]

The Evolution of Beam
[Timeline: MapReduce, followed by Bigtable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and MillWheel, leading to Google Cloud Dataflow and Apache Beam.]
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Data...
...can be big...

...really, really big...
[Figure: growing daily datasets (Tuesday, Wednesday, Thursday)]
… maybe infinitely big...
[Figure: an unbounded timeline, 8:00 to 14:00]
… with unknown delays.
[Figure: an 8:00 event arriving hours later in processing time]
Reality
Formalizing Event-Time Skew
[Figure: processing time vs. event time; reality skews away from the ideal line]

Formalizing Event-Time Skew
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen"
[Figure: processing time vs. event time; the watermark approximates the ideal line]
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
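To make the "too fast" case concrete: a minimal sketch, assuming the incubating-era Java API and the deck's Minutes(n) shorthand, of bounding how long window state is kept around for late data with withAllowedLateness:

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .withAllowedLateness(Minutes(1)))  // accept data up to 1 minute past the watermark
    .apply(Sum.integersPerKey());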
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
Transformations are element-wise, aggregating, or composite.
What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
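The ParseFn above is referenced but not shown; a hypothetical sketch, assuming log lines of the form "team,score" and the pre-annotation DoFn style used elsewhere in this deck:

// Hypothetical ParseFn (not from the deck); assumes "team,score" log lines.
class ParseFn extends DoFn<String, KV<String, Integer>> {
  @Override
  public void processElement(ProcessContext c) {
    // "team,score" -> (team, score); malformed lines are skipped
    String[] parts = c.element().split(",");
    if (parts.length == 2) {
      c.output(KV.of(parts[0], Integer.parseInt(parts[1])));
    }
  }
}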

What: Computing Integer Sums
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
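A hedged sketch of the three windowing strategies pictured below, in the deck's Minutes(n) shorthand:

Window.into(FixedWindows.of(Minutes(2)));                      // fixed
Window.into(SlidingWindows.of(Minutes(2)).every(Minutes(1)));  // sliding
Window.into(Sessions.withGapDuration(Minutes(1)));             // sessions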
Where in event time?
[Figure: fixed, sliding, and session windows dividing three keys' data over time]
Where: Fixed 2-minute Windows
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

When in processing time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
[Figure: processing time vs. event time; the watermark approximates the ideal line]
When: Triggering at the Watermark
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

When: Early and Late Firings
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
How do refinements relate?
• How should multiple outputs per window accumulate?
• The appropriate choice depends on the consumer.
Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      [3]        3            3              3
Watermark        [5, 1]     6            9              9, -3
Late             [2]        2            11             11, -9
Last Observed               2            11             11
Total Observed              11           23             11
(Accumulating & Retracting not yet implemented.)
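A hedged sketch of how each column is requested on the window configuration (the third mode, per the note above, was not yet implemented at the time):

.discardingFiredPanes()                  // Discarding: each pane carries only new values
.accumulatingFiredPanes()                // Accumulating: each pane carries the running total
.accumulatingAndRetractingFiredPanes()   // Acc. & Retracting: totals plus retractions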
How: Add Newest, Remove Previous
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Distributed Systems are Distributed

Event Time Results are Stable
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Sessions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
Identifying Bursts of User Activity

Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
// 1. Classic Batch
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

// 5. Streaming with Retractions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

// 4. Streaming with Speculative + Late Data
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

// 3. Streaming
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

// 2. Batch with Fixed Windows
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

// 6. Sessions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions

Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Workloads vary in pipelines over time
[Figure: workload over time; batch pipelines go through stages, and a streaming pipeline's input varies]
Perils of fixed decisions
[Figure: workload over time; over-provisioned for the worst case vs. under-provisioned for the average case]

Ideal case
[Figure: provisioned resources tracking the workload over time]
Solution: bundles
class MyDoFn extends DoFn<String, String> {
  void startBundle(...) { }      // called once per bundle, before any elements
  void processElement(...) { }   // called once per element
  void finishBundle(...) { }     // called once per bundle, after all elements
}
• User code operates on
bundles of elements.
• Easy parallelization.
• Dynamic sizing.
• Parallelism decisions in the
runner’s hands.
The Straggler Problem
• Work is unevenly distributed across tasks.
• Reasons:
  • Underlying data.
  • Processing.
• Effects multiplied per stage.
[Figure: per-worker completion times, with a few stragglers dominating]
“Standard” workarounds for stragglers
• Split files into equal sizes?
• Pre-emptively over-split?
• Detect slow workers and re-execute?
• Sample extensively and then split?
[Figure: per-worker completion times]

No amount of upfront heuristic tuning (be it manual or
automatic) is enough to guarantee good performance:
the system will always hit unpredictable situations at run-time.
A system that's able to dynamically adapt and
get out of a bad situation is much more powerful
than one that heuristically hopes to avoid getting into it.
Solution: Dynamic Work Rebalancing
[Figure: done work, active work, and predicted completion per worker; splitting active work pulls in the average completion time]
Solution: Dynamic Work Rebalancing
class MyReader extends BoundedReader<T> {
  [...]
  getFractionConsumed() { }   // how much of this bundle is already done?
  splitAtFraction(...) { }    // hand back the work beyond a given fraction
}

class MySource extends BoundedSource<T> {
  [...]
  splitIntoBundles() { }        // initial split into parallel bundles
  getEstimatedSizeBytes() { }   // size hint for the runner
}
Real-world example

Dynamic bundles + work re-balancing + autoscaling
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
1. Write: Choose an SDK to write your
pipeline in.
2. Execute: Choose any runner at
execution time.
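In code, the separation looks roughly like this (a sketch; exact runner names varied across incubating releases, e.g. --runner=FlinkRunner, --runner=SparkRunner, --runner=DataflowRunner):

PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);  // construction is runner-independent
// ... apply the same transforms regardless of the chosen runner ...
p.run();  // the runner named in the options executes the pipeline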
Apache Beam Architecture
[Architecture diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam Model; Beam Model Fn runners then execute them on Apache Flink, Apache Spark, or Google Cloud Dataflow.]
Categorizing Runner Capabilities
http://beam.incubator.apache.org/capability-matrix/

1. End users: who want to write
pipelines or transform libraries in a
language that’s familiar.
2. SDK writers: who want to make
Beam concepts available in new
languages.
3. Runner writers: who have a
distributed processing
environment and want to support
Beam pipelines.
Multiple categories of users
[Architecture diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam Model; Beam Model Fn runners then execute them on Apache Flink, Apache Spark, or Google Cloud Dataflow.]
• If you have Big Data APIs, write a Beam
SDK or DSL or library of transformations.
• If you have a distributed processing
backend, write a Beam runner!
• If you have a data storage or messaging
system, write a Beam IO connector!
Growing the Open Source Community
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Visions are a Journey
02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Design for use cases, begin refactoring
Mid 2016: Additional refactoring, non-production uses
06/14/2016: 1st incubating release
June 2016: Python SDK moves to Beam
Late 2016: Multiple runners execute Beam pipelines
End 2016: Beam pipelines run on many runners in production uses

Learn More!
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
Thank you!
Extra material
Element-wise transformations
[Figure: elements flowing through processing time, 8:00 to 14:00]

Aggregating via Processing-Time Windows
[Figure: elements aggregated by arrival time, 8:00 to 14:00 processing time]
Aggregating via Event-Time Windows
[Figure: input in processing time mapped to output in event-time windows, 10:00 to 15:00]
Processing Time Results Differ
Identifying Bursts of User Activity

Recommended for you

DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017

The document summarizes a meetup on data streaming and machine learning with Google Cloud Platform. The meetup consisted of two presentations: 1. The first presentation discussed using Apache Beam (Dataflow) on Google Cloud Platform to parallelize machine learning training for improved performance. It showed how Dataflow was used to reduce training time from 12 hours to under 30 minutes. 2. The second presentation demonstrated building a streaming pipeline for sentiment analysis on Twitter data using Dataflow. It covered streaming patterns, batch vs streaming processing, and a demo that ingested tweets from PubSub and analyzed them using Cloud NLP API and BigQuery.

gcpgooglemachine learning
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21

The document summarizes a meetup on data streaming and machine learning with Google Cloud Platform. The meetup consisted of two presentations: 1. The first presentation discussed using Apache Beam and Google Cloud Dataflow to parallelize machine learning training for hyperparameter optimization. It showed how Dataflow reduced training time from 12 hours to under 30 minutes. 2. The second presentation demonstrated building a streaming Twitter sentiment analysis pipeline with Dataflow. It covered streaming patterns, batch vs streaming considerations, and a demo that ingested tweets from PubSub, analyzed sentiment with NLP, and loaded results to BigQuery.

#google#meetup#labsvibe
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações

A plataforma Windows Azure abre espaço a desenvimento de aplicações utilizando o novo paradigma: "A Nuvem". Aplicações escaláveis, redundantes, e mais próximas do utilizador final. Isto tudo utilizando como base os conhecimentos que já tem e o novo Visual Studio 2010.

computingsqlazuresql
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Calculating Session Lengths
input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength());
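CalculateWindowLength is referenced but not shown; a hypothetical sketch, assuming the old RequiresWindowAccess style of reading the enclosing window from a DoFn (in the pipeline above it would be wrapped as a composite PTransform):

// Hypothetical per-element logic (not from the deck).
class CalculateWindowLengthFn
    extends DoFn<KV<String, Integer>, KV<String, Long>>
    implements DoFn.RequiresWindowAccess {
  @Override
  public void processElement(ProcessContext c) {
    // Each session is an IntervalWindow; its length is end minus start.
    IntervalWindow w = (IntervalWindow) c.window();
    c.output(KV.of(c.element().getKey(),
                   w.end().getMillis() - w.start().getMillis()));
  }
}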
Calculating the Average Session Length
input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength())
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1))))
        .accumulatingFiredPanes())
    .apply(Mean.globally());

Apache Beam: A unified model for batch and stream processing data

  • 1. Unbounded, unordered, global scale datasets are increasingly common in day-today business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, Millwheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (e.g., Apache Flink, Apache Spark, et al.), and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, and describe main concepts in the programming model. During the talk, we’ll argue why Beam is unified, efficient and portable. Abstract
  • 2. Davor Bonaci Apache Beam PPMC Software Engineer, Google Inc. Apache Beam: A Unified Model for Batch and Streaming Data Processing Hadoop Summit, June 28-30, 2016, San Jose, CA
  • 3. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 4. 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines: • Java • Python 3. Runners for Existing Distributed Processing Backends • Apache Flink • Apache Spark • Google Cloud Dataflow • Local runner for testing What is Apache Beam? Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Google Cloud Dataflow Execution
  • 5. The Evolution of Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam
  • 6. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 10. … maybe infinitely big... (Figure: a timeline from 8:00 to 14:00.)
  • 11. … with unknown delays. (Figure: events that occurred at 8:00 arriving at various later points on the timeline.)
  • 13. Formalizing Event-Time Skew: Watermarks describe event time progress. "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late. (Figure: event time vs. processing time, showing the ideal line, the actual skew, and the ~watermark.)
  • 14. What are you computing? Where in event time? When in processing time? How do refinements relate?
  • 15. What are you computing? Element-Wise Aggregating Composite
  • 16. What: Computing Integer Sums

      // Collection of raw log lines
      PCollection<String> raw = IO.read(...);

      // Element-wise transformation into team/score pairs
      PCollection<KV<String, Integer>> input =
          raw.apply(ParDo.of(new ParseFn()));

      // Composite transformation containing an aggregation
      PCollection<KV<String, Integer>> scores =
          input.apply(Sum.integersPerKey());
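
ParseFn itself is never shown in the talk. Assuming, purely for illustration, that each log line is a comma-separated record of user, team, score, and timestamp (the field layout is invented), it might look roughly like:

  // Hypothetical ParseFn; the real log format is not given in the talk.
  class ParseFn extends DoFn<String, KV<String, Integer>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      // Assumed layout: user,team,score,timestamp
      String[] fields = c.element().split(",");
      c.output(KV.of(fields[1], Integer.parseInt(fields[2])));
    }
  }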
  • 19. Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. Where in event time? (Figure: fixed, sliding, and session windows laid out across keys and time; the constructs below spell these out.)
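
For reference, the three window shapes on this slide correspond to these constructs in the Beam Java SDK (the durations are arbitrary examples, not values from the talk):

  // Fixed: non-overlapping hourly windows.
  Window.into(FixedWindows.of(Duration.standardHours(1)));

  // Sliding: a 24-hour window advancing every hour; one element can fall
  // into multiple overlapping windows.
  Window.into(SlidingWindows.of(Duration.standardHours(24))
      .every(Duration.standardHours(1)));

  // Sessions: per-key bursts of activity, closed by a 30-minute gap.
  Window.into(Sessions.withGapDuration(Duration.standardMinutes(30)));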
  • 20. Where: Fixed 2-minute Windows

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2))))
          .apply(Sum.integersPerKey());
  • 22. When in processing time? • Triggers control when results are emitted. • Triggers are often relative to the watermark. (Figure: the event-time vs. processing-time plot again, with the ideal line, skew, and ~watermark.)
  • 23. When: Triggering at the Watermark

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()))
          .apply(Sum.integersPerKey());
  • 24. When: Triggering at the Watermark
  • 25. When: Early and Late Firings

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1))))
          .apply(Sum.integersPerKey());
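
The slides use compact pseudo-Java. One reasonable rendering of the same pipeline in the shipped Beam Java SDK, where a window must also declare how long to hold state for late data via withAllowedLateness (the one-day horizon below is an arbitrary choice, not from the talk):

  PCollection<KV<String, Integer>> scores = input
      .apply(Window.<KV<String, Integer>>into(
              FixedWindows.of(Duration.standardMinutes(2)))
          .triggering(AfterWatermark.pastEndOfWindow()
              // Speculative firing one minute after the first element arrives.
              .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                  .plusDelayOf(Duration.standardMinutes(1)))
              // One update per late element.
              .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardDays(1))
          .accumulatingFiredPanes())
      .apply(Sum.integersPerKey());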
  • 26. When: Early and Late Firings
  • 27. How do refinements relate? • How should multiple outputs per window accumulate? • Appropriate choice depends on consumer. Example: a window fires three times, with a speculative pane [3], a watermark pane [5, 1], and a late pane [2]:

      Mode               | Speculative [3] | Watermark [5, 1] | Late [2] | Last Observed | Total Observed
      Discarding         | 3               | 6                | 2        | 2             | 11
      Accumulating       | 3               | 9                | 11       | 11            | 23
      Acc. & Retracting  | 3               | 9, -3            | 11, -9   | 11            | 11

      (Accumulating & Retracting not yet implemented.)
  • 28. How: Add Newest, Remove Previous

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());
  • 29. How: Add Newest, Remove Previous
  • 32. Distributed Systems are Distributed
  • 33. Event Time Results are Stable
  • 35. Sessions

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());
  • 36. Identifying Bursts of User Activity
  • 38. 1. Classic Batch 2. Batch with Fixed Windows 3. Streaming 4. Streaming with Speculative + Late Data 5. Streaming with Retractions 6. Sessions
  • 40. All six variations, differing only in windowing and triggering:

      1. Classic Batch
         PCollection<KV<String, Integer>> scores = input
             .apply(Sum.integersPerKey());

      2. Batch with Fixed Windows
         PCollection<KV<String, Integer>> scores = input
             .apply(Window.into(FixedWindows.of(Minutes(2))))
             .apply(Sum.integersPerKey());

      3. Streaming
         PCollection<KV<String, Integer>> scores = input
             .apply(Window.into(FixedWindows.of(Minutes(2)))
                 .triggering(AtWatermark()))
             .apply(Sum.integersPerKey());

      4. Streaming with Speculative + Late Data
         PCollection<KV<String, Integer>> scores = input
             .apply(Window.into(FixedWindows.of(Minutes(2)))
                 .triggering(AtWatermark()
                     .withEarlyFirings(AtPeriod(Minutes(1)))
                     .withLateFirings(AtCount(1))))
             .apply(Sum.integersPerKey());

      5. Streaming with Retractions
         PCollection<KV<String, Integer>> scores = input
             .apply(Window.into(FixedWindows.of(Minutes(2)))
                 .triggering(AtWatermark()
                     .withEarlyFirings(AtPeriod(Minutes(1)))
                     .withLateFirings(AtCount(1)))
                 .accumulatingAndRetractingFiredPanes())
             .apply(Sum.integersPerKey());

      6. Sessions
         PCollection<KV<String, Integer>> scores = input
             .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
                 .triggering(AtWatermark()
                     .withEarlyFirings(AtPeriod(Minutes(1)))
                     .withLateFirings(AtCount(1)))
                 .accumulatingAndRetractingFiredPanes())
             .apply(Sum.integersPerKey());
  • 42. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 43. Workloads vary in pipelines over time: batch pipelines go through stages, and a streaming pipeline’s input varies. (Figure: workload vs. time.)
  • 44. Perils of fixed decisions: over-provisioned for the worst case, or under-provisioned for the average case. (Figures: workload vs. time.)
  • 46. Solution: bundles • User code operates on bundles of elements. • Easy parallelization. • Dynamic sizing. • Parallelism decisions in the runner’s hands.

      class MyDoFn extends DoFn<String, String> {
        void startBundle(...) { }
        void processElement(...) { }
        void finishBundle(...) { }
      }
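
One illustrative use of bundle boundaries (not from the talk): amortizing expensive per-element work, such as writes to an external service, by buffering within a bundle and flushing when the bundle finishes. The service call is a stand-in name.

  // Hypothetical example: batch external writes at bundle boundaries.
  class BatchingWriteFn extends DoFn<String, Void> {
    private transient List<String> buffer;

    @StartBundle
    public void startBundle() {
      buffer = new ArrayList<>();
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      buffer.add(c.element());
      if (buffer.size() >= 500) {
        flush();  // bound memory within large bundles
      }
    }

    @FinishBundle
    public void finishBundle() {
      flush();  // runs before the runner commits the bundle's results
    }

    private void flush() {
      // writeToExternalService(buffer);  // illustrative placeholder
      buffer.clear();
    }
  }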
  • 47. The Straggler Problem • Work is unevenly distributed across tasks. • Reasons: • Underlying data. • Processing. • Effects multiplied per stage. (Figure: worker vs. time.)
  • 48. “Standard” workarounds for stragglers • Split files into equal sizes? • Pre-emptively over-split? • Detect slow workers and re-execute? • Sample extensively and then split? (Figure: worker vs. time.)
  • 49. No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time. A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.
  • 50. Solution: Dynamic Work Rebalancing. (Figure: per-worker done work, active work, and predicted completion; splitting active work now pulls the average completion time earlier.)
  • 51. Solution: Dynamic work rebalancing

      class MyReader extends BoundedReader<T> {
        [...]
        getFractionConsumed() { }
        splitAtFraction(...) { }
      }

      class MySource extends BoundedSource<T> {
        [...]
        splitIntoBundles() { }
        getEstimatedSizeBytes() { }
      }
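
A rough sketch of how a runner can exploit these hooks (the scheduler call is invented): ask a busy reader how far along it is, then split off part of the unread remainder and hand it to an idle worker.

  // Simplified runner-side logic for dynamic work rebalancing.
  BoundedSource.BoundedReader<T> reader = ...;   // executing on a straggler
  double done = reader.getFractionConsumed();    // e.g. 0.30 of the bundle so far
  // Try to split halfway through what remains; null means the reader
  // could not split at that point.
  BoundedSource<T> residual =
      reader.splitAtFraction(done + (1.0 - done) / 2);
  if (residual != null) {
    scheduleOnIdleWorker(residual);              // hypothetical scheduler call
  }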
  • 53. Dynamic bundles + work re-balancing + autoscaling
  • 54. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 55. Apache Beam Architecture: 1. Write: Choose an SDK to write your pipeline in. 2. Execute: Choose any runner at execution time. (Figure: the Beam architecture diagram from slide 4.)
  • 57. Multiple categories of users: 1. End users: who want to write pipelines or transform libraries in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines. (Figure: the Beam architecture diagram from slide 4.)
  • 58. • If you have Big Data APIs, write a Beam SDK or DSL or library of transformations. • If you have a distributed processing backend, write a Beam runner! • If you have a data storage or messaging system, write a Beam IO connector! Growing the Open Source Community
  • 59. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 60. Visions are a Journey 02/01/2016 Enter Apache Incubator End 2016 Beam pipelines run on many runners in production uses Early 2016 Design for use cases, begin refactoring Mid 2016 Additional refactoring, non-production uses Late 2016 Multiple runners execute Beam pipelines 02/25/2016 1st commit to ASF repository 06/14/2016 1st incubating release June 2016 Python SDK moves to Beam
  • 61. Learn More! Apache Beam (incubating) http://beam.incubator.apache.org The World Beyond Batch 101 & 102 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Join the Beam mailing lists! user-subscribe@beam.incubator.apache.org dev-subscribe@beam.incubator.apache.org Follow @ApacheBeam on Twitter
  • 64. Element-wise transformations. (Figure: elements on a processing-time timeline, 8:00 to 14:00.)
  • 65. Aggregating via Processing-Time Windows. (Figure: windows on a processing-time timeline, 8:00 to 14:00.)
  • 66. Aggregating via Event-Time Windows. (Figure: input and output mapped between event time and processing time, 10:00 to 15:00.)
  • 68. Identifying Bursts of User Activity
  • 72. Calculating the Average Session Length

      input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark())
              .discardingFiredPanes())
          .apply(CalculateWindowLength())
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1))))
              .accumulatingFiredPanes())
          .apply(Mean.globally());

Editor's Notes

  1. Google published the original paper on MapReduce in 2004 -- it fundamentally changed the way we do distributed processing. <animate> Inside Google, kept innovating, but just published papers <animate> Externally the open source community created Hadoop. Entire ecosystem flourished, partially influenced by those Google papers. <animate> In 2014, Google Cloud Dataflow -- included both a new programming model and a fully managed service. To share this model more broadly -- both because it is awesome and because users benefit from a larger ecosystem and portability across multiple runtimes -- Google, along with a handful of partners, donated this programming model to the Apache Software Foundation, as the incubating project Apache Beam...
  2. here’s gaming logs each square represents an event where a user scored some points for their team
  3. game gets popular
  4. start organizing it into a repeated structure
  5. repetitive structure just a cheap way of representing an infinite data source. game logs are continuous. distributed systems can cause ambiguity...
  6. Lets look at some points that were scored at 8am <animate> red score 8am, received quickly <animate> yellow score also happened at 8am, received at 8:30 due to network congestion <animate> green element was hours late. this was someone playing in airplane mode on the plane. had to wait for it to land. so now we’ve got an unordered, infinite data set, how do we process it...
  7. Blue axis is event, Green is processing. Ideally no delay -- elements processed when they occurred <animate> Reality looks more like that red squiggly line, where processing time is slightly delayed off event time. <animate> The variable distance between reality and ideal is called skew. need to track in order to reason about correctness.
  8. red line the watermark -- no event times earlier than this point are expected to appear in the future. often heuristic based too slow → unnecessary latency. too fast → some data comes in late, after we thought we were done for a given time period. how do we reason about these types of infinite, out-of-order datasets...
  9. not too hard if you know what kinds of questions to ask! What results are calculated? sums, joins, histograms, machine learning models? Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? And finally, how do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another? Let’s dive into how each question contributes when we build a pipeline...
  10. The first thing to figure out is what you are actually computing. Some transformations process each element independently, similar to the Map function in MapReduce, and are easy to parallelize. Other transformations, like Grouping and Combining, require inspecting multiple elements at a time. Some operations are really just subgraphs of other, more primitive operations. Now let's see a code snippet for our gaming example...
  11. Pseudo Java for compactness/clarity! We start by reading a collection of raw events, transform it into a more structured collection containing key-value pairs with a team name and the number of points scored during the event, and then use a composite operation to sum up all the points per team. Let's see how this code executes...
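As a rough guide only: in the current Beam Java SDK, a runnable version of that snippet might look like the sketch below. The file path and the comma-separated log layout are assumptions for illustration, not the talk's actual source.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

    // Read the raw game events (hypothetical location).
    PCollection<String> raw =
        pipeline.apply(TextIO.read().from("gs://my-bucket/game-logs/*"));

    // Element-wise: parse each line into a (team, points) pair.
    PCollection<KV<String, Integer>> teamPoints = raw.apply("ParseEvent",
        ParDo.of(new DoFn<String, KV<String, Integer>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] f = c.element().split(",");  // assumed "team,points,..." layout
            c.output(KV.of(f[0], Integer.parseInt(f[1])));
          }
        }));

    // Composite operation: sum up all the points per team.
    PCollection<KV<String, Integer>> teamScores = teamPoints.apply(Sum.integersPerKey());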
  12. Looking at points scored for a given team: blue axis is event time, green axis is processing time. <animate> the ideal. <animate> This score of 3 from just before 12:07 arrives almost immediately. <animate> This one is 7 minutes delayed -- an elevator or subway. the graph is not big enough to show the offline mode from a transatlantic flight
  13. processing time is the thick white line. we accumulate the sum into intermediate state and produce output, represented by the blue rectangle. with all the data available, the rectangle covers all events, no matter when in time they occurred, and a single final result is emitted when it's all complete. pretty standard batch processing -> let's see what happens if we tweak the other questions
  14. windowing lets us create individual results for different slices of event time. it divides data into finite chunks based on the event time of each element. common patterns include fixed time (like hourly, daily, monthly), sliding windows (like the last 24 hours worth of data, every hour) -- a single element may be in multiple overlapping windows -- and session-based windows that capture bursts of user activity, unaligned and per key. very common when trying to do aggregations on infinite data. also actually a common pattern in batch, though historically done using composite keys.
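As a sketch of the first two patterns in current Beam Java spelling (sessions come up again a few notes below), reusing the teamPoints collection from the earlier sketch:

    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    // Fixed windows: one result per hour of event time.
    teamPoints.apply(Window.into(FixedWindows.of(Duration.standardHours(1))));

    // Sliding windows: the last 24 hours of data, emitted every hour;
    // each element lands in 24 overlapping windows.
    teamPoints.apply(Window.into(
        SlidingWindows.of(Duration.standardHours(24)).every(Duration.standardHours(1))));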
  15. fixed windows that are 2 minutes long
  16. an independent answer for every two-minute period of event time. but we're still waiting until the entire computation completes to emit any results. that won't work for infinite data! we want to reduce latency...
  17. triggers define when in processing time to emit results. they're often relative to the watermark, which is that heuristic about event time progress.
  18. request that results are emitted when we think we've roughly seen all the elements for a given window. this is actually the default -- we've just written it out for clarity.
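In current Beam Java spelling (rather than the slides' AtWatermark() shorthand), that default is written roughly as below. Note that once a trigger is set explicitly, the SDK also requires an allowed-lateness bound and an accumulation mode:

    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;

    teamPoints.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())  // fire when the watermark passes the window
        .withAllowedLateness(Duration.ZERO)            // anything later is dropped
        .discardingFiredPanes());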
  19. the left graph shows a perfect watermark -- it tracks when all the data for a given event time has arrived, and we emit the result from each window as soon as the watermark passes. <animate> the watermark is usually just a heuristic, so things look more like the graph on the right: now the late 9 is missed. and if the watermark is delayed, like in the first graph, we need to wait a long time for anything. we'd like speculative results. let's use a more advanced trigger...
  20. ask for early, speculative firings every minute, and get updates every time a late element comes in.
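A sketch of that more advanced trigger in current SDK spelling (the one-day allowed lateness is an illustrative choice, not from the talk):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;

    teamPoints.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // speculative firing a minute after each pane's first element...
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            // ...and an update for every element that arrives late.
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))
        .accumulatingFiredPanes());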
  21. in all cases, we're able to get speculative results before the watermark. we now get results when the watermark passes, but still handle the late value 9 even with a heuristic watermark. in this case, we accumulate across the multiple results per window: in the final window, we see and emit 3, but then still include that 3 in the next update of 12. this behavior around multiple firings is configurable...
  22. we fire three times for a window -- a speculative firing with 3, the watermark firing with two more values 5 and 1, and finally a late value 2. one option is to emit only the new elements that have come in since the last result, which requires the consumer to be able to do the final sum. or we could produce the running sum every time, but then a consumer combining panes may overcount. or we can produce both the new running sum and retract the old one.
  23. use accumulating and retracting.
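The Beam Java SDK spells the first two refinement modes as below; full accumulating-and-retracting is part of the model, though to my knowledge it was not yet exposed by the Java SDK at the time. With the note's numbers, discarding emits 3, then 6, then 2; accumulating emits 3, then 9, then 11:

    import org.apache.beam.sdk.transforms.windowing.Trigger;

    // The trigger from the previous sketch.
    Trigger trigger = AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withLateFirings(AfterPane.elementCountAtLeast(1));

    // Discarding: each pane carries only the elements since the last firing,
    // so the consumer must do the final combination itself.
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(trigger)
        .withAllowedLateness(Duration.standardDays(1))
        .discardingFiredPanes();

    // Accumulating: each pane carries the running total so far; a consumer
    // naively summing successive panes would overcount.
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(trigger)
        .withAllowedLateness(Duration.standardDays(1))
        .accumulatingFiredPanes();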
  24. speculative results, on time results, and retractions. now the final window emits 3, then retracts the 3 when emitting 12. So those are the four questions...
  25. those are the four key questions. are they the right questions? here are 5 reasons...
  26. the results we get are correct -- this is not something we've historically gotten with streaming systems.
  27. distributed systems are … distributed. if the winds had been blowing from the east instead of the west, elements might have arrived in a slightly different order.
  28. aggregating based on event time may have different intermediate results but the final results are identical across the two arrival scenarios.
  29. next, the abstractions can represent powerful and complex algorithms.
  30. earlier we mentioned session windows -- bursts of user activity. it's a simple code change, sketched below...
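A minimal sketch of that change in current SDK spelling -- only the WindowFn is swapped:

    import org.apache.beam.sdk.transforms.windowing.Sessions;

    // Per-key bursts of activity, closed once a one-minute gap elapses.
    teamPoints.apply(Window.into(Sessions.withGapDuration(Duration.standardMinutes(1))));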
  31. we want to identify two groupings of points. in other words, Tyler was playing the game, got distracted by a squirrel, and then resumed his play.
  32. ok… flexibility for covering all sorts of use cases
  33. By tuning our what/where/when/how knobs, we’ve covered everything from classic batch… to sessions
  34. And not only that, we do so with lovely modular code
  35. all these use cases -- and we never changed our core algorithm. it's just integer summing here, but the same would apply with much more complex algorithms too
  36. so there you go -- 5 reasons that these 4 questions are awesome
  37. Sources of imbalance -- Data: 1 file per task & files of different sizes; Bigtable key ranges partitioned lexicographically, assuming uniformity. Processing: hot shuffle key ranges; data-dependent computation
  38. Classic mitigations, each with drawbacks: a pre-job stage that chunks files into equal sizes (choice of constant? does not handle runtime asymmetry); pre-emptively over-splitting (how much is enough? how much is too much? per-task overheads can dominate); detecting slow workers and re-executing (does not handle processing asymmetry); sampling (maybe extensively) and then splitting (overhead, and it still does not handle runtime asymmetry)
  39. 400 workers running Read from GCS → Parse → GroupByKey → Write
  40. The Beam model is attempting to generalize semantics -- will not align perfectly with all possible runtimes. Started categorizing the features in the model and the various levels of runner support. This will help users understand mismatches like using event time processing in Spark or exactly once processing with Samza.
  41. we want to fully support three different categories of users: end users who want to write data processing pipelines; library writers adding value like additional connectors -- we've got Kafka!; and authors of community-sourced SDKs and runners. Each community has very different sets of goals and needs. having a vision and reaching it are two different things...
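For instance, reading events from Kafka with the KafkaIO connector might look like the sketch below; the broker address and topic are placeholders, and the Kafka module and its deserializers are assumed to be on the classpath:

    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.common.serialization.LongDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    PCollection<KV<Long, String>> events = pipeline.apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("broker-1:9092")  // placeholder broker
            .withTopic("game-events")               // placeholder topic
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());                    // keep just the key/value pairs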
  42. And one of the things we're most excited about is the collaboration opportunities that Beam enables. We've been doing this stuff for a while at Google -- a very hermetic environment. We're looking forward to incorporating new perspectives to build a truly generalizable solution, and to growing the Beam development community over the next few months -- whether contributors are looking to write transform libraries for end users, build new SDKs, or provide new runners.
  43. Beam entered incubation in early February. Quickly did the code donations and began bootstrapping the infrastructure. initial focus is on stabilizing internal APIs and integrating the additional runners. Part of that is understanding what different runners can do...
  46. Element-wise transformations work on individual elements -- parsing, translating, or filtering -- applied as elements flow past. but other transforms, like counting or joining, require combining multiple elements together ...
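A small sketch of the distinction in Beam Java (the comma-separated layout is again an assumption, and raw is the collection from the earlier sketch):

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Element-wise: parse and filter each element independently as it flows past.
    PCollection<String> teams = raw.apply(MapElements
        .into(TypeDescriptors.strings())
        .via(line -> line.split(",")[0]));
    PCollection<String> nonEmpty = teams.apply(Filter.by(team -> !team.isEmpty()));

    // Aggregating: counting has to bring multiple elements together.
    PCollection<KV<String, Long>> occurrences = nonEmpty.apply(Count.perElement());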
  47. when doing aggregations, we need to divide the infinite stream of elements into finite-sized chunks that can be processed independently. the simplest way is to use arrival time in fixed time periods. but that can mean elements are processed out of order: late elements may be aggregated with unrelated elements that merely arrived at about the same time...
  48. reorganize data based on when it occurred, not when it arrived. the red element arrived relatively on time and stays in the noon window. the green one that arrived at 12:30 was actually created about 11:30, so it moves up to the 11am window. this requires formalizing the difference between processing time and event time
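In Beam, event time is carried as a per-element timestamp. One way to attach it explicitly is the WithTimestamps transform, sketched here; GameEvent, its getEventTime() accessor (returning a Joda Instant), and the parsedEvents collection are all hypothetical:

    import org.apache.beam.sdk.transforms.WithTimestamps;

    // Re-stamp each element with when the event occurred, not when it arrived.
    PCollection<GameEvent> stamped =
        parsedEvents.apply(WithTimestamps.of((GameEvent e) -> e.getEventTime()));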
  49. if we were aggregating based on processing time, this would result in different results for the two orderings.
  50. now you can see the sessions being built over time. at first we see multiple components in the first session; it's not until late element 9 comes in that we realize it's one big session
  51. next -- we’ve seen what the four questions can do. what if we ask the questions twice?
  52. code to calculate the length of a user session
  53. Remember that these graphs are always shown per key. here's the graph calculating session lengths for Frances and the one for Tyler
  54. now let's take those session lengths per user and ask the questions again, this time using fixed windows to take the mean across the entire collection...
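Putting the two passes together in current SDK spelling: CalculateWindowLength stands in, as on the slide, for a composite transform that emits each session's duration (assumed here to be a Long), userEvents is an assumed per-user input collection, and withoutDefaults() is needed because the mean is computed per window rather than over the global window:

    PCollection<Double> avgSessionLength = userEvents
        // Pass 1 -- Where: per-user session windows; When: at the watermark.
        .apply(Window.into(Sessions.withGapDuration(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(new CalculateWindowLength())  // -> PCollection<Long> of session lengths
        // Pass 2 -- Where: fixed two-minute windows; When: early firings every minute.
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())
        .apply(Mean.<Long>globally().withoutDefaults());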
  55. Now we're calculating the average length of all sessions that ended in a given time period. if we rolled out an update to our game, this would let us quickly understand whether it resulted in a change in user behavior. if the change made the game less fun, we could see a sudden drop in how long users play