Will it Scale?
The Secrets behind Scaling Stream Processing
Applications
Navina Ramesh
Software Engineer, LinkedIn
Apache Samza, Committer & PMC
navina@apache.org
What is this talk about?
● Understand the architectural choices in stream processing systems that may
impact performance/scalability of stream processing applications
● Have a high level comparison of two streaming engines (Flink/Samza) with a
focus on scalability of the stream-processing application
What is this talk not about?
● Not a feature-by-feature comparison of existing stream processing systems
(such as Flink, Storm, Samza etc)
Agenda
● Use cases in Stream Processing
● Typical Data Pipelines
● Scaling Data Ingestion
● Scaling Data Processing
○ Challenges in Scaling Data Processing
○ Walk-through of Apache Flink & Apache Samza
○ Observations on state & fault-tolerance
● Challenges in Scaling Result Storage
● Conclusion

Spectrum of Processing (by Response Latency, starting at 0 ms)
● RPC: Synchronous
● Stream Processing: Milliseconds to minutes
● Batch Processing: Later. Typically, hours
Newsfeed
Cyber Security
Selective Push Notifications


Typical Data Pipeline - Batch
Ingestion Service → HDFS → Mappers → Reducers → HDFS/HBase → Query
Stages (left to right): Data Ingestion, Data Processing, Result Storage / Serving
Parallels in Streaming
Batch: Ingestion Service → HDFS → Mappers → Reducers → HDFS/HBase → Query
Streaming: Ingestion Service → Partition 0, Partition 1, ..., Partition N → Processors → Processors → HDFS, KV Store → Query
Stages (left to right): Data Ingestion, Data Processing, Result Storage / Serving
Parallels in Streaming
Batch
● Data processing on bounded data
● Acceptable latency - order of hours
● Processing occurs at regular intervals
● Throughput trumps latency
● Horizontal scaling to improve processing time
Streaming
● Data processing on unbounded data
● Low latency - order of sub-seconds
● Processing is continuous
● Horizontal scaling is not straightforward (stateful applications)
● Need tools to reason about time (esp. when re-processing stream)

Typical Data Ingestion
Producers → Stream A (Partition 0, Partition 1, Partition 2, Partition 3) → Consumer (host A), Consumer (host B)
Example message keys: key=0, key=3, key=23, key=10
- Typically, streams are partitioned
- Messages sent to partitions based on “Partition Key”
- Time-based message retention
Examples: Kafka, Kinesis
Scaling Data Ingestion
Producers → Stream A (Partition 0, Partition 1, Partition 2, Partition 3, Partition 4) → Consumer (host A), Consumer (host B)
Example message keys: key=0, key=10, key=23, key=3
- Scaling “up” -> Increasing partitions
- Changing partitioning logic re-distributes* the keys across the partitions
- Consuming clients (includes stream processors) should be able to re-adjust!
- Impact -> Over-provisioning of partitions in order to handle changes in load
Examples: Kafka, Kinesis
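The key re-distribution described on this slide can be sketched with a toy partitioner. This is a minimal illustration that assumes integer keys and a plain modulo rule; real systems such as Kafka hash the key bytes instead, and the function names here are hypothetical. It is not meant to reproduce the exact key placement drawn on the slide.

```python
def partition_for(key: int, num_partitions: int) -> int:
    """Toy partitioner: integer key modulo partition count (an assumption, not Kafka's hash)."""
    return key % num_partitions


def moved_keys(keys, old_count, new_count):
    """Return {key: (old_partition, new_partition)} for keys whose partition changes."""
    moved = {}
    for key in keys:
        old = partition_for(key, old_count)
        new = partition_for(key, new_count)
        if old != new:
            moved[key] = (old, new)
    return moved


keys = [0, 3, 10, 23]  # the example keys shown on the slide
print(moved_keys(keys, old_count=4, new_count=5))  # {10: (2, 0)}
```

Going from 4 to 5 partitions moves key 10 to a different partition, so any consumer holding per-key state for key 10 would have to re-adjust.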

Scaling Data Processing
● Increase number of processing units → Horizontal Scaling
But more machines means more $$$
● The impact is NOT only on CPU cores: “large” (order of TBs) stateful applications also impact network and disk!
Key Bottleneck in Scaling Data Processing
● Accessing State
○ Operator state
■ Read/Write state that is maintained during stream processing
■ E.g., windowed aggregation, windowed join
○ Adjunct state
■ To process events, applications might need to look up related or ‘adjunct’ data.
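As a rough illustration of operator state, here is a minimal sketch of a windowed aggregation: a per-key count over fixed, non-overlapping windows. The class and method names are hypothetical and do not come from Flink or Samza. The `state` dictionary is the operator state, and it grows with window size and the number of distinct keys.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per key in fixed-size, non-overlapping time windows."""

    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        # Operator state: (window_start, key) -> count.
        self.state = defaultdict(int)

    def process(self, key: str, timestamp_ms: int) -> None:
        # Align the event timestamp to the start of its window.
        window_start = timestamp_ms - (timestamp_ms % self.window_ms)
        self.state[(window_start, key)] += 1

    def close_windows(self, watermark_ms: int):
        """Emit and drop every window that ended at or before watermark_ms."""
        done = [k for k in self.state if k[0] + self.window_ms <= watermark_ms]
        return {k: self.state.pop(k) for k in sorted(done)}


counter = TumblingWindowCounter(window_ms=1000)
for key, ts in [("a", 100), ("a", 900), ("b", 950), ("a", 1200)]:
    counter.process(key, ts)
print(counter.close_windows(watermark_ms=1000))  # {(0, 'a'): 2, (0, 'b'): 1}
```

State for the still-open window stays in memory until it is closed, which is why large windows translate directly into large operator state.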

Accessing Operator State: Assemble Call Graph
Service Calls → Repartitioner (Partition events by “tree id”) → Assembler (Aggregate events by “tree id”)
Input events: homepage_service_call, feed_service_call, profile_service_call, pymk_service_call, ...
Assembled output:
Homepage_service_call (tree id: 10)
| |
| Pymk_service_call (tree id: 10)
| |
| Profile_service_call (tree id: 10)
|
Feed_service_call (tree id: 10)
Stateful Process!
Accessing Operator State: Assemble Call Graph
In-Memory Mapping
Service Calls → Repartitioner (Partition events by “tree id”) → Assembler (Aggregate events by “tree id”)
- In-memory structure to aggregate events until ready to output
- Concerns:
- Large windows can cause overflow!
- Restarting job after a long downtime can increase memory pressure!
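The in-memory approach on this slide can be sketched as a dictionary keyed by tree id. This is a simplified sketch under stated assumptions: a tree counts as complete after a fixed number of calls (`expected_calls`), and `max_trees` is a hypothetical cap that stands in for the memory limit. Neither rule comes from the talk.

```python
from collections import defaultdict


class InMemoryAssembler:
    """Buffers service-call events per tree id until the tree is complete."""

    def __init__(self, expected_calls: int, max_trees: int):
        self.expected_calls = expected_calls  # assumed completeness rule
        self.max_trees = max_trees            # assumed stand-in for available memory
        self.pending = defaultdict(list)      # tree_id -> buffered service calls

    def on_event(self, tree_id: int, service_call: str):
        """Buffer one event; return the full call list once the tree is complete."""
        if tree_id not in self.pending and len(self.pending) >= self.max_trees:
            # The "overflow" concern: too many open trees held in memory.
            raise MemoryError("too many open trees buffered in memory")
        self.pending[tree_id].append(service_call)
        if len(self.pending[tree_id]) == self.expected_calls:
            return self.pending.pop(tree_id)
        return None


assembler = InMemoryAssembler(expected_calls=4, max_trees=1000)
for call in ["homepage_service_call", "feed_service_call", "profile_service_call"]:
    assembler.on_event(10, call)
print(assembler.on_event(10, "pymk_service_call"))
```

Everything in `pending` is lost if the process restarts, and a long backlog after downtime opens many trees at once, which matches the memory-pressure concern above.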
Accessing Operator State: Assemble Call Graph
Service Calls → Repartitioner (Partition events by “tree id”) → Assembler (Aggregate events by “tree id”) ↔ Remote KV Store (operator state)
Concerns:
- Remote RPC is slow!! (Stream: ~1 million records/sec; DB: ~3-4K writes/sec)
- Mutations can’t be rolled back!
- Task may fail & recover
- Change in logic!
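To put the rate mismatch on this slide in numbers, here is a back-of-the-envelope sketch. It assumes one remote write per stream record and no batching, which is a simplification, and the helper name is hypothetical.

```python
def write_backlog(stream_rate: int, db_write_rate: int, seconds: int) -> int:
    """Records still waiting to be written after `seconds`, at one write per record."""
    return max(0, stream_rate - db_write_rate) * seconds


STREAM_RATE = 1_000_000  # records/sec, from the slide
DB_WRITE_RATE = 4_000    # writes/sec, upper end of the slide's ~3-4K range

slowdown = STREAM_RATE / DB_WRITE_RATE                           # 250.0
backlog = write_backlog(STREAM_RATE, DB_WRITE_RATE, seconds=60)  # 59_760_000
print(slowdown, backlog)
```

Under these assumptions the store is about 250 times slower than the stream, and one minute of input leaves roughly 60 million records unwritten.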


Accessing Operator State: Push Notifications
Online Apps, Relevance Score, User Action Data → Task (Generate active notifications - filtering, windowed-aggregation, external calls etc) → Notification System (Scheduler)
- Stream processing tasks consume from multiple sources - offline/online
- Performs multiple operations
- Filters information and buffers data for window of time
- Aggregates / Joins buffered data
- Total operator state per instance can easily grow to multiple GBs per Task
Accessing Adjunct Data: AdQuality Updates
AdClicks → Task (Stream-to-Table Join) → AdQuality Update
Task reads Member Data from Member Info (Look-up memberId & generate AdQuality improvements for the User)
Concerns:
- Remote look-up latency is high!
- DDoS on shared store - MemberInfo

Accessing Adjunct Data using Cache: AdQuality Updates
AdClicks → Task (Stream-to-Table Join) → AdQuality Update
Task reads Member Data from Member Info (Maintain a cache of member Info & do local lookup)
Concerns:
- Overhead of maintaining cache consistency based on the source of truth (MemberInfo)
- Warming up the cache after the job’s downtime can cause temporary spike in QPS on the shared store
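The cached stream-to-table join on this slide might look roughly like the sketch below. It assumes a simple time-to-live (TTL) expiry as the consistency mechanism, which is only one possible choice. `fetch_remote`, `MemberInfoCache`, and `join_clicks` are hypothetical names, not part of any framework.

```python
import time


class MemberInfoCache:
    """Local read-through cache in front of a remote MemberInfo store."""

    def __init__(self, fetch_remote, ttl_seconds: float, clock=time.monotonic):
        self.fetch_remote = fetch_remote  # hypothetical remote look-up function
        self.ttl = ttl_seconds            # staleness bound (assumed TTL policy)
        self.clock = clock
        self.entries = {}                 # member_id -> (value, fetched_at)
        self.remote_calls = 0             # counts load placed on the shared store

    def get(self, member_id):
        now = self.clock()
        hit = self.entries.get(member_id)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                 # fresh entry: local look-up only
        self.remote_calls += 1            # miss or expired: go to the shared store
        value = self.fetch_remote(member_id)
        self.entries[member_id] = (value, now)
        return value


def join_clicks(clicks, cache):
    """Stream-to-table join: enrich each ad click with cached member info."""
    return [(c["memberId"], c["adId"], cache.get(c["memberId"])) for c in clicks]


cache = MemberInfoCache(lambda member_id: {"id": member_id}, ttl_seconds=60)
clicks = [{"memberId": 7, "adId": "x"}, {"memberId": 7, "adId": "y"}]
print(join_clicks(clicks, cache), cache.remote_calls)
```

After a restart `entries` is empty, so the first look-up for every key goes to the shared store, which is the warm-up spike noted above.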
Apache Flink
Apache Flink: Processing
● Dataflows with streams and transformation operators
● Starts with one or more sources and ends in one or more sinks
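The source → transformation → sink shape can be illustrated with plain Python generators. This is a conceptual sketch only and deliberately does not use the Flink API. All function names and the sample events are hypothetical.

```python
def source(events):
    """Source: emits events one at a time."""
    yield from events


def transform_filter(stream, predicate):
    """Transformation operator: keeps only events matching the predicate."""
    return (event for event in stream if predicate(event))


def transform_map(stream, fn):
    """Transformation operator: applies fn to every event."""
    return (fn(event) for event in stream)


def sink(stream, out):
    """Sink: drains the stream into an output collection."""
    for event in stream:
        out.append(event)


results = []
clicks = source([{"user": "a", "n": 1}, {"user": "b", "n": 0}, {"user": "c", "n": 3}])
nonzero = transform_filter(clicks, lambda e: e["n"] > 0)
users = transform_map(nonzero, lambda e: e["user"])
sink(users, results)
print(results)  # ['a', 'c']
```

Each stage pulls from the previous one lazily, so events flow through the chain one at a time rather than as a materialized batch.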

Apache Flink: Processing
[Figure: Flink runtime architecture. A Job Manager (Actor System, Scheduler, Checkpoint Coordinator) and several Task Managers (each with an Actor System, Network Manager, and Memory & I/O Manager); every Task Manager holds Task Slots running Tasks, with a data Stream between Task Managers.]
● JobManager (master) coordinates distributed execution: checkpointing, recovery management, task scheduling, etc.
● TaskManager (JVM process) executes the subtasks of the dataflow, and buffers and exchanges data streams
● Each Task Slot may execute multiple subtasks and runs on a separate thread
Apache Flink: State Management
● Lightweight Asynchronous Barrier Snapshots
● Master triggers a checkpoint and each source inserts a barrier into its stream
● On receiving the barrier from all of its inputs, each operator stores its entire state, acks the checkpoint to the master, and emits the barrier on its output
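The barrier handling described above can be sketched as a small single-process simulation. This is only an illustrative model, not Flink's implementation: the `Operator` class, its `on_record` and `on_barrier` methods, and the dict used as a snapshot store are hypothetical names, and the sketch does not model the buffering of records that arrive on a channel after that channel's barrier.

```python
import copy


class Operator:
    """Toy operator that counts records and snapshots on aligned barriers."""

    def __init__(self, name, num_inputs, snapshot_store):
        self.name = name
        self.num_inputs = num_inputs          # number of input channels
        self.snapshot_store = snapshot_store  # dict standing in for HDFS
        self.state = {}                       # running count per key
        self.barriers_seen = {}               # checkpoint id -> channels seen
        self.acked = []                       # checkpoint ids acked to the master

    def on_record(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def on_barrier(self, checkpoint_id, channel):
        """Return True once the barrier has arrived on every input channel."""
        seen = self.barriers_seen.setdefault(checkpoint_id, set())
        seen.add(channel)
        if len(seen) < self.num_inputs:
            return False  # still waiting for the other inputs
        # Barrier received from all inputs: store the entire state ...
        key = (self.name, checkpoint_id)
        self.snapshot_store[key] = copy.deepcopy(self.state)
        # ... ack the checkpoint to the master ...
        self.acked.append(checkpoint_id)
        del self.barriers_seen[checkpoint_id]
        # ... and tell the caller to emit the barrier downstream.
        return True


store = {}
op = Operator("counter", num_inputs=2, snapshot_store=store)
op.on_record("a")
forwarded_early = op.on_barrier(1, channel=0)  # only one of two inputs so far
op.on_record("b")                              # still part of checkpoint 1
forwarded = op.on_barrier(1, channel=1)        # barrier now seen on all inputs
op.on_record("a")                              # belongs to the next checkpoint
```

The snapshot taken for checkpoint 1 holds only the records counted before the barrier was complete; the later record changes the live state but not the stored snapshot.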
Apache Flink: State Management
[Figure: a Job Manager and three Task Managers; the Task Managers write to a Snapshot Store (HDFS).]
● Periodically snapshot the entire state to the snapshot store
● Checkpoint mapping is stored in the Job Manager
● Snapshot Store (typically, HDFS)
○ operator state (windows/aggregation)
○ user-defined state (checkpointed)
Apache Flink: State Management
● Operator state is primarily stored in memory or on the local file system
● RocksDB was recently added as a state backend
● Allows user-defined operators to define state that should be checkpointed
Apache Flink: Fault Tolerance of State
[Figure: Job Manager, three Task Managers, and the Snapshot Store; one task fails ("Task Failure").]
[Figure: the same topology with HDFS as the store; state is recovered by a "Full Restore".]
● Full restore of the snapshot from the last completed checkpoint
● Processing continues after restoring the latest snapshot from the store
Apache Flink: Summary
● State Management Primitives:
○ Within a task, local state is stored primarily in memory (recently, also RocksDB)
○ Periodic snapshots (checkpoints + user-defined state + operator state) are written to the Snapshot Store
● Fault-Tolerance of State
○ Full state restored from Snapshot Store
Apache Flink: Observations
● Full snapshots are expensive for large states
● Frequent snapshots can quickly saturate the network
● Applications must trade off snapshot frequency against how large a state can be built within a task
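A back-of-the-envelope calculation illustrates this trade-off. The function name and all numbers below (state size per task, task count, snapshot interval) are made-up assumptions for illustration, not measurements from the talk.

```python
def snapshot_bandwidth_mb_per_s(state_mb_per_task, num_tasks, interval_s):
    """Average network bandwidth needed to ship a full snapshot of every task."""
    return state_mb_per_task * num_tasks / interval_s


# Hypothetical job: 100 tasks, each holding 2048 MB of state.
every_10_min = snapshot_bandwidth_mb_per_s(2048, 100, 600)
every_1_min = snapshot_bandwidth_mb_per_s(2048, 100, 60)

print(f"snapshot every 10 min: {every_10_min:.1f} MB/s")
print(f"snapshot every 1 min:  {every_1_min:.1f} MB/s")
```

Under these assumptions, snapshotting ten times as often needs ten times the bandwidth, so either the interval or the per-task state size has to give.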

Apache Samza
Apache Samza: Processing
● Samza Master handles container life-cycle and failure handling
● Each container (JVM process) contains one or more tasks to process the input stream partitions
[Figure: a Samza Master managing two Containers, each running two Tasks.]
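The relationship between input partitions, tasks, and containers can be sketched as follows. This assumes one task per input partition and a round-robin grouping of tasks into containers; the function `assign_tasks` and the `task-N` names are hypothetical and are not Samza's actual grouper API.

```python
def assign_tasks(num_partitions, num_containers):
    """Create one task per partition and group the tasks round-robin."""
    containers = {c: [] for c in range(num_containers)}
    for partition in range(num_partitions):
        containers[partition % num_containers].append(f"task-{partition}")
    return containers


# Four input partitions spread over two containers (JVM processes).
assignment = assign_tasks(num_partitions=4, num_containers=2)
print(assignment)
```

Adding containers in this model spreads the same tasks over more processes; the number of tasks stays tied to the number of input partitions.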
Apache Samza: State Management
● Tasks checkpoint periodically to a checkpoint stream
● A checkpoint records the position in the input from which processing has to continue after a container restart
[Figure: Samza Master and two Containers with two Tasks each, writing to the Checkpoint Stream.]
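A minimal sketch of checkpointing input positions to a stream, assuming an in-memory list in place of a real Kafka topic. `CheckpointStream`, `run_task`, and the `crash_at` parameter are hypothetical names used only for this illustration.

```python
class CheckpointStream:
    """Append-only log of (task, offset) records; stands in for a Kafka topic."""

    def __init__(self):
        self.log = []

    def write(self, task, offset):
        self.log.append((task, offset))

    def last_offset(self, task):
        """Scan the log and return the most recent checkpoint for a task."""
        latest = None
        for logged_task, offset in self.log:
            if logged_task == task:
                latest = offset
        return latest


def run_task(task, messages, checkpoints, checkpoint_every=2, crash_at=None):
    """Process messages starting after the last checkpoint."""
    last = checkpoints.last_offset(task)
    start = 0 if last is None else last + 1
    processed = []
    for offset in range(start, len(messages)):
        if crash_at is not None and offset == crash_at:
            return processed  # simulate a container failure
        processed.append(messages[offset])
        if (offset + 1) % checkpoint_every == 0:
            checkpoints.write(task, offset)  # offset of the last processed message
    return processed


checkpoints = CheckpointStream()
messages = ["m0", "m1", "m2", "m3", "m4"]
first_run = run_task("task-0", messages, checkpoints, crash_at=3)
second_run = run_task("task-0", messages, checkpoints)
```

In this sketch the message processed after the last checkpoint but before the crash is processed again after the restart, which is the usual at-least-once consequence of periodic checkpoints.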

Apache Samza: State Management
● State store is local to the task: typically RocksDB (off-heap) or in-memory (backed by a map)
● State store contains any operator state or adjunct state
● Allows the application to define state through a key-value interface
Apache Samza: State Management
● State store is continuously replicated to a changelog stream
● Each store partition is mapped to a specific changelog partition
[Figure: Samza Master and two Containers with two Tasks each, writing to the Changelog Stream and the Checkpoint Stream.]
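The idea of a local store mirrored to a changelog partition can be sketched as below. This is a toy model under stated assumptions: a dict stands in for RocksDB, a list stands in for a Kafka changelog partition, and `ChangelogBackedStore` is a hypothetical class, not Samza's key-value store API.

```python
class ChangelogBackedStore:
    """Local key-value store whose writes are mirrored to a changelog partition."""

    def __init__(self, changelog_partition):
        self.local = {}                       # stands in for RocksDB
        self.changelog = changelog_partition  # list standing in for a partition

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))   # replicate every write

    def get(self, key):
        return self.local.get(key)

    def delete(self, key):
        self.local.pop(key, None)
        self.changelog.append((key, None))    # None marks a deletion

    @classmethod
    def restore(cls, changelog_partition):
        """Rebuild the local store by replaying the changelog from the start."""
        store = cls(changelog_partition)
        for key, value in changelog_partition:
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store


# One changelog partition per store partition (here: one per task).
changelog = {"task-0": [], "task-1": []}
store = ChangelogBackedStore(changelog["task-0"])
store.put("user-1", 3)
store.put("user-1", 4)
store.put("user-2", 7)
store.delete("user-2")

recovered = ChangelogBackedStore.restore(changelog["task-0"])
```

Only the changes are shipped, one record per write, rather than a copy of the whole store.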
Apache Samza: Fault Tolerance of State
[Figure: Samza Master with two Containers (two Tasks each) on Machine A and Machine B, plus the Checkpoint Stream and Changelog Stream; one container fails ("Container Failure").]
[Figure: the failed container is "re-allocated on different host"; the machines shown are now Machine A and Machine X.]
● When a container is recovered on a different host, there is no state available locally

● The task then restores its state from the beginning of the changelog stream, i.e. a full restore
[Figure: the same topology on Machine A and Machine B, with a container failure.]
● State store is persisted to local disk on the machine, along with the offset from which to begin restoring the state from the changelog
[Figure: the same topology on Machine A and Machine B; the failed container is "re-allocated on same host".]
● Samza Master tries to re-allocate the container on the same host
● The feature where the Samza Master attempts to co-locate tasks with their built-up state stores (on the host where they were previously running) is called host-affinity
● If the container is re-allocated on the same host, the state store is only partially restored from the changelog stream (delta restore)
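The difference between a full restore and a delta restore can be sketched with a changelog held in a list. Everything here is an assumption made for illustration: `restore_store` is a hypothetical helper, and the `(store, next_offset)` pair models the state and offset information that survive on the host's local disk.

```python
def restore_store(changelog, local_snapshot=None):
    """Return (store, messages_replayed) after recovering a task's state.

    local_snapshot is a hypothetical (store_dict, next_offset) pair found on
    the host's disk; None means the task landed on a different host.
    """
    if local_snapshot is None:
        store, start = {}, 0               # different host: full restore
    else:
        saved, start = local_snapshot      # same host: delta restore
        store = dict(saved)
    for key, value in changelog[start:]:
        store[key] = value
    return store, len(changelog) - start


# 10,000 changelog entries over 100 keys.
changelog = [("k%d" % (i % 100), i) for i in range(10_000)]
# State on local disk just before the failure covers the first 9,900 writes.
on_disk = ({k: v for k, v in changelog[:9_900]}, 9_900)

full_store, full_replayed = restore_store(changelog)
delta_store, delta_replayed = restore_store(changelog, on_disk)
```

Both paths end with the same store contents; the delta restore only replays the entries written after the locally persisted offset.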

● Once state is restored, the checkpoint stream contains the correct offset for each task to begin processing
● Persisting state on local disk + host-affinity effectively reduces the time to recover state after a failure or an upgrade and to continue processing
● Only a subset of tasks may require a full restore, thereby reducing the time to recover from failure or to restart processing after upgrades
Apache Samza: Summary
● State Management Primitives
○ Within a task, data is stored in memory or on disk using RocksDB
○ Checkpoint state is stored in the checkpoint stream
○ User-defined and operator state is continuously replicated to a changelog stream
● Fault-Tolerance of State
○ Full state is restored by consuming the changelog stream if the user-defined state is not persisted on the task's machine
○ If locally persisted, only a partial restore is needed

Apache Samza: Observations
● State recovery from the changelog can be time-consuming and could potentially saturate Kafka clusters. Hence, partial restore is necessary.
● Host-affinity allows for faster failure recovery of task states, and faster job upgrades, even for large stateful jobs
● Since checkpoints are written to a stream and state is continuously replicated to the changelog, frequent checkpoints are possible.
Comparison of State & Fault-tolerance
|                       | Apache Samza                                         | Apache Flink                          |
| Durable State         | RocksDB                                              | FileSystem (recently added: RocksDB)  |
| State Fault Tolerance | Kafka-based changelog stream                         | HDFS                                  |
| State Update Unit     | Delta changes                                        | Full snapshot                         |
| State Recovery Unit   | Full restore + improved recovery with host-affinity  | Full restore                          |

Challenges in Scaling Result Storage / Serving
● Even a fast KV store can handle only a small query rate (on the order of thousands of QPS) compared to the output rate of stream processing (on the order of millions)
● The high output throughput can act as a denial of service (DoS) on the output store
[Figure: Online + Async Processor; annotations: "Processor Downtime ~30 min" and "Processor Restarts".]
Scaling Result Storage / Serving
[Figure: components shown: Online Apps, Query, Serving Platform (Derived Data), Stream Processing, Offline Processing, Distributed Queue (Kafka), Bulk Load, Change Stream.]

Conclusion
● Ingest/Process/Serve should be holistically scalable to successfully scale stream processing applications
● The notion of a “locally” accessible state is great for scaling stream processing applications for performance, but it brings the additional cost of making the state fault-tolerant
References
● Apache Samza - http://samza.apache.org
● Apache Flink - http://flink.apache.org
● https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
● https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html
● http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed
● https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
● Apache Kafka - http://kafka.apache.org
● http://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-resharding.html
● http://aws.amazon.com/streaming-data/
Contribute!
Exciting features coming up in Apache Samza:
● SAMZA-516 - Standalone deployment of Samza (independent of Yarn)
● SAMZA-390 - High-level Language for Samza
● SAMZA-863 - Multithreading in Samza
Join us!
● Apache Samza - samza.apache.org
● Samza Hello World! - https://samza.apache.org/startup/hello-samza/latest/
● Mailing List - dev@samza.apache.org
● JIRA - http://issues.apache.org/jira/browse/SAMZA
Thanks!

Conservation of Taksar through Economic Regeneration
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
 
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
 
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
UNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-ID
UNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-IDUNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-ID
UNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-ID
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
 

Will it Scale? The Secrets behind Scaling Stream Processing Applications

  • 1. Will it Scale? The Secrets behind Scaling Stream Processing Applications Navina Ramesh Software Engineer, LinkedIn Apache Samza, Committer & PMC navina@apache.org
  • 2. What is this talk about? ● Understand the architectural choices in stream processing systems that may impact the performance/scalability of stream processing applications ● Have a high-level comparison of two streaming engines (Flink/Samza) with a focus on the scalability of the stream-processing application
  • 3. What is this talk not about? ● Not a feature-by-feature comparison of existing stream processing systems (such as Flink, Storm, Samza, etc.)
  • 4. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 5. Spectrum of Processing [Diagram: response latency axis starting at 0 ms: RPC (synchronous) → Stream Processing (milliseconds to minutes) → Batch Processing (later; typically, hours)]
  • 9. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 10. Typical Data Pipeline - Batch [Diagram: Ingestion Service → HDFS → Mappers → Reducers → HDFS/HBase, served to Query]
  • 11. Typical Data Pipeline - Batch [Diagram: Ingestion Service → HDFS → Mappers → Reducers → HDFS/HBase, served to Query; stage label: Data Ingestion]
  • 14. Parallels in Streaming [Diagram: batch pipeline (Ingestion Service, HDFS, Mappers, Reducers, HDFS/HBase, Query) shown alongside a streaming pipeline (Partition 0, Partition 1, ..., Partition N, Processors, Processors, HDFS, KV Store, Query); stage labels: Data Ingestion, Data Processing, Result Storage / Serving]
  • 15. Parallels in Streaming [Diagram: batch pipeline (Ingestion Service, HDFS, Mappers, Reducers, HDFS/HBase, Query) shown alongside a streaming pipeline (Partition 0, Partition 1, ..., Partition N, Processors, Processors, HDFS, KV Store, Query); stage labels: Data Ingestion, Data Processing, Result Storage / Serving]
  • 16. Batch vs. Streaming. Batch: ● Data processing on bounded data ● Acceptable latency: order of hours ● Processing occurs at regular intervals ● Throughput trumps latency ● Horizontal scaling to improve processing time. Streaming: ● Data processing on unbounded data ● Low latency: order of sub-seconds ● Processing is continuous ● Horizontal scaling is not straightforward (stateful applications) ● Need tools to reason about time (esp. when re-processing a stream)
  • 17. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 18. Typical Data Ingestion - Typically, streams are partitioned - Messages are sent to partitions based on a “Partition Key” - Time-based message retention [Diagram: Producers write to Stream A (Partitions 0-3; example keys key=0, key=3, key=10, key=23), which is read by Consumer (host A) and Consumer (host B); examples: Kafka, Kinesis]
  • 19. Scaling Data Ingestion - Scaling “up” -> Increasing partitions - Changing partitioning logic re-distributes* the keys across the partitions [Diagram: Producers write to Stream A (Partitions 0-4; example keys key=0, key=3, key=10, key=23), which is read by Consumer (host A) and Consumer (host B); examples: Kafka, Kinesis]
  • 20. Scaling Data Ingestion - Scaling “up” -> Increasing partitions - Changing partitioning logic re-distributes* the keys across the partitions - Consuming clients (including stream processors) should be able to re-adjust! - Impact -> Over-provisioning of partitions in order to handle changes in load [Diagram: Producers write to Stream A (Partitions 0-4; example keys key=0, key=3, key=10, key=23), which is read by Consumer (host A) and Consumer (host B); examples: Kafka, Kinesis]
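The key re-distribution described on the ingestion slides can be sketched in a few lines. This is a minimal illustration, assuming a simple hash-modulo partitioner; the CRC32 hash and the helper names `partition_for` and `moved_keys` are assumptions for the example, not the partitioning logic of Kafka or Kinesis.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose partition changes when the partition count changes."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]

# Hypothetical key set; growing from 4 to 5 partitions re-maps many keys.
keys = [f"key={i}" for i in range(1000)]
moved = moved_keys(keys, old_count=4, new_count=5)
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Any consumer that keeps per-key state would have to re-adjust for every moved key, which is one reason partitions tend to be over-provisioned up front.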
  • 21. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 22. Scaling Data Processing ● Increase number of processing units → Horizontal Scaling
  • 23. Scaling Data Processing ● Increase number of processing units → Horizontal Scaling. But more machines means more $$$ ● The impact is not only on CPU cores: “large” (order of TBs) stateful applications also strain network and disk!
  • 24. Key Bottleneck in Scaling Data Processing ● Accessing State ○ Operator state ■ Read/write state that is maintained during stream processing ■ E.g., windowed aggregation, windowed join ○ Adjunct state ■ To process events, applications might need to look up related or ‘adjunct’ data.
  • 25. Accessing Operator State: Assemble Call Graph [Diagram: Service Calls (homepage_service_call, feed_service_call, profile_service_call, pymk_service_call, ...) → Repartitioner (partition events by “tree id”) → Assembler (aggregate events by “tree id”; stateful process!) → call graph for tree id 10, made of Homepage_service_call, Pymk_service_call, Profile_service_call and Feed_service_call]
  • 26. Accessing Operator State: Assemble Call Graph - In-memory structure (In-Memory Mapping) to aggregate events until ready to output - Concerns: - Large windows can cause overflow! - Restarting the job after a long downtime can increase memory pressure! [Diagram: Service Calls (homepage_service_call, feed_service_call, profile_service_call, pymk_service_call, ...) → Repartitioner (partition events by “tree id”) → Assembler (aggregate events by “tree id”) → call graph for tree id 10, made of Homepage_service_call, Pymk_service_call, Profile_service_call and Feed_service_call]
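A minimal sketch of the in-memory assembler idea from the slide above, assuming events arrive already partitioned by tree id. The class name `CallGraphAssembler`, the fixed time window, and the event fields are illustrative assumptions, not the actual LinkedIn job.

```python
from collections import defaultdict

class CallGraphAssembler:
    """Buffers service-call events per tree id and emits them once a window expires."""

    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.events = defaultdict(list)   # tree id -> buffered service calls
        self.first_seen = {}              # tree id -> timestamp of the first event

    def process(self, tree_id: int, service: str, ts_ms: int) -> None:
        self.first_seen.setdefault(tree_id, ts_ms)
        self.events[tree_id].append(service)

    def flush(self, now_ms: int) -> dict:
        """Emit and drop every tree whose window has elapsed."""
        ready = [t for t, first in self.first_seen.items()
                 if now_ms - first >= self.window_ms]
        out = {t: self.events.pop(t) for t in ready}
        for t in ready:
            del self.first_seen[t]
        return out

assembler = CallGraphAssembler(window_ms=1000)
assembler.process(10, "homepage_service_call", ts_ms=0)
assembler.process(10, "feed_service_call", ts_ms=200)
assembler.process(11, "profile_service_call", ts_ms=900)
print(assembler.flush(now_ms=1000))
# → {10: ['homepage_service_call', 'feed_service_call']}
```

Everything buffered here lives on the heap, so a longer window or a backlog after downtime grows `events` directly; that is the memory-pressure concern the slide raises.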
  • 27. Accessing Operator State: Assemble Call Graph, with operator state in a Remote KV Store. Concerns: - Remote RPC is slow! (Stream: ~1 million records/sec; DB: ~3-4K writes/sec) - Mutations can’t be rolled back! - Task may fail & recover - Change in logic! [Diagram: Service Calls (homepage_service_call, feed_service_call, profile_service_call, pymk_service_call, ...) → Repartitioner (partition events by “tree id”) → Assembler (aggregate events by “tree id”), backed by a Remote KV Store → call graph for tree id 10, made of Homepage_service_call, Pymk_service_call, Profile_service_call and Feed_service_call]
  • 28. Accessing Operator State: Push Notifications B2 Online Apps Relevance Score User Action Data Task (Generate active notifications - filtering, windowed-aggregation, external calls etc) Notification System (Scheduler)
  • 29. Accessing Operator State: Push Notifications B2 Online Apps Relevance Score User Action Data Task (Generate active notifications - filtering, windowed-aggregation, external calls etc) Notification System (Scheduler) - Stream processing tasks consume from multiple sources - offline/online - Performs multiple operations - Filters information and buffers data for window of time - Aggregates / Joins buffered data - Total operator state per instance can easily grow to multiple GBs per Task
  • 30. Accessing Adjunct Data: AdQuality Updates Task AdClicks AdQuality Update Read Member Data Member Info Stream-to-Table Join (Look-up memberId & generate AdQuality improvements for the User)
  • 31. Accessing Adjunct Data: AdQuality Updates Task AdClicks AdQuality Update Read Member Data Member Info Stream-to-Table Join (Look up memberId & generate AdQuality improvements for the User) Concerns: - Remote look-up latency is high! - Risk of DDoS-ing the shared store (MemberInfo)
  • 32. Accessing Adjunct Data using Cache: AdQuality Updates Task AdClicks AdQuality Update Read Member Data Member Info Stream-to-Table Join (Maintain a cache of member Info & do local lookup)
  • 33. Accessing Adjunct Data using Cache: AdQuality Updates Task AdClicks AdQuality Update Read Member Data Member Info Stream-to-Table Join (Maintain a cache of member Info & do local lookup) Concerns: - Overhead of maintaining cache consistency based on the source of truth (MemberInfo) - Warming up the cache after the job’s downtime can cause temporary spike in QPS on the shared store
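The cached stream-to-table join from the two slides above could look roughly like this. It is a sketch under stated assumptions: `TtlLruCache` and `fetch_member` are hypothetical names, expiry by TTL is one simple way to bound staleness against the source of truth, and the remote store is simulated by a local function.

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Small LRU cache whose entries expire after ttl_s seconds."""

    def __init__(self, capacity: int, ttl_s: float, clock=time.monotonic):
        self.capacity, self.ttl_s, self.clock = capacity, ttl_s, clock
        self.entries = OrderedDict()  # key -> (value, expiry time)

    def get(self, key, loader):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[1] > now:
            self.entries.move_to_end(key)       # mark as recently used
            return hit[0]
        value = loader(key)                     # remote look-up on miss or expiry
        self.entries[key] = (value, now + self.ttl_s)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        return value

remote_calls = []

def fetch_member(member_id):
    """Hypothetical stand-in for a read against the shared Member Info store."""
    remote_calls.append(member_id)
    return {"memberId": member_id, "segment": "demo"}

cache = TtlLruCache(capacity=2, ttl_s=60.0)
for click in [{"memberId": 1}, {"memberId": 1}, {"memberId": 2}]:
    member = cache.get(click["memberId"], fetch_member)
    joined = {**click, **member}                # stream-to-table join result
print(remote_calls)
# → [1, 2]
```

A freshly restarted job begins with an empty cache, so every early look-up goes to `fetch_member`; that is the warm-up spike in QPS mentioned in the concerns.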
  • 34. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 36. Apache Flink: Processing ● Dataflows with streams and transformation operators ● Starts with one or more sources and ends in one or more sinks
  • 37. Apache Flink: Processing ● JobManager (Master) coordinates distributed execution: checkpointing, recovery management, task scheduling, etc. ● TaskManagers (JVM processes) execute the subtasks of the dataflow, and buffer and exchange data streams ● Each Task Slot may execute multiple subtasks, each running on a separate thread [Diagram: Job Manager (Actor System, Scheduler, Checkpoint Coordinator) and two Task Managers (each with Task Slots, Actor System, Network Manager, Memory & I/O Manager) exchanging data streams]
  • 38. Apache Flink: State Management ● Lightweight Asynchronous Barrier Snapshots ● Master triggers checkpoint and source inserts barrier ● On receiving barrier from all input sources, each operator stores the entire state, acks the checkpoint to the master and emits snapshot barrier in the output
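The barrier handling described on the slide above can be illustrated with a toy, single-process simulation. This is not Flink's API: `SummingOperator` and its methods are hypothetical, and buffering records that arrive on an input after its barrier (aligned checkpointing) is assumed here based on the Flink checkpointing documentation listed in the references.

```python
class SummingOperator:
    """Toy operator with several input channels that snapshots on barrier alignment."""

    def __init__(self, num_inputs: int):
        self.num_inputs = num_inputs
        self.state = 0                 # running sum (the operator state)
        self.barriers_seen = set()     # input channels whose barrier has arrived
        self.held_back = []            # records buffered from channels already past the barrier
        self.snapshots = {}            # checkpoint id -> copy of the state
        self.output = []

    def on_record(self, channel: int, value: int) -> None:
        if channel in self.barriers_seen:
            self.held_back.append((channel, value))    # belongs to the next checkpoint epoch
            return
        self.state += value
        self.output.append(("record", self.state))

    def on_barrier(self, channel: int, checkpoint_id: int) -> None:
        self.barriers_seen.add(channel)
        if len(self.barriers_seen) < self.num_inputs:
            return                                      # wait for the remaining inputs
        self.snapshots[checkpoint_id] = self.state      # store the state (then ack the master)
        self.output.append(("barrier", checkpoint_id))  # forward the barrier downstream
        self.barriers_seen.clear()
        pending, self.held_back = self.held_back, []
        for ch, v in pending:                           # resume with the buffered records
            self.on_record(ch, v)

op = SummingOperator(num_inputs=2)
op.on_record(0, 5)
op.on_barrier(0, checkpoint_id=1)
op.on_record(0, 100)   # arrives after input 0's barrier: held back
op.on_record(1, 7)
op.on_barrier(1, checkpoint_id=1)
print(op.snapshots, op.state)
# → {1: 12} 112
```

The snapshot for checkpoint 1 contains only the effects of records that preceded the barrier on every input, which is what makes the stored state consistent across operators.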
  • 39. Apache Flink: State Management Job Manager Task Manager HDFS Snapshot Store Task Manager Task Manager ● Lightweight Asynchronous Barrier Snapshots ● Periodically snapshot the entire state to snapshot store ● Checkpoint mapping is stored in Job Manager ● Snapshot Store (typically, HDFS) ○ operator state (windows/aggregation) ○ user-defined state (checkpointed)
  • 40. Apache Flink: State Management ● Operator state is primarily stored in memory or on the local file system ● RocksDB support was added recently ● Allows user-defined operators to define state that should be checkpointed [Diagram: Job Manager, three Task Managers, HDFS Snapshot Store]
  • 41. Apache Flink: Fault Tolerance of State Job Manager Task Manager Snapshot Store Task Manager Task Manager Task Failure
  • 42. Apache Flink: Fault Tolerance of State Job Manager Task Manager HDFS Task Manager Task Manager ● Full restore of snapshot from last completed checkpointed state ● Continues processing after restoring from the latest snapshot from the store Full Restore
  • 43. Apache Flink: Summary ● State Management Primitives: ○ Within a task, local state is stored primarily in memory (recently, RocksDB) ○ Periodic snapshot (checkpoints + user-defined state + operator state) written to the Snapshot Store ● Fault-Tolerance of State ○ Full state restored from the Snapshot Store
  • 44. Apache Flink: Observations ● Full snapshots are expensive for large states ● Frequent snapshots can quickly saturate the network ● Applications must trade off snapshot frequency against how large a state can be built within a task
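The trade-off in the observations above is simple arithmetic: shipping a full snapshot of size S every T seconds needs an average bandwidth of S / T. The sketch below makes that explicit; the function names and all numbers are illustrative assumptions, not measurements from the talk.

```python
def snapshot_bandwidth_mb_per_s(state_mb: float, interval_s: float) -> float:
    """Average bandwidth needed to ship one full snapshot every interval_s seconds."""
    return state_mb / interval_s

def max_state_mb(budget_mb_per_s: float, interval_s: float) -> float:
    """Largest state that fits a bandwidth budget at a given snapshot interval."""
    return budget_mb_per_s * interval_s

# Illustrative numbers only: 10 GB of state per task, one snapshot per minute.
print(round(snapshot_bandwidth_mb_per_s(state_mb=10_000, interval_s=60), 1))
# With a hypothetical 100 MB/s budget and the same interval, the state is capped at:
print(max_state_mb(budget_mb_per_s=100, interval_s=60))
```

Halving the interval doubles the required bandwidth for the same state, so under full snapshots a task can either checkpoint often or hold a large state, but not both.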
  • 46. Apache Samza: Processing ● Samza Master handles container life-cycle and failure handling [Diagram: Samza Master and two Containers, each running two Tasks]
  • 47. Apache Samza: Processing ● Samza Master handles container life-cycle and failure handling ● Each container (JVM process) contains more than one task to process the input stream partitions [Diagram: Samza Master and two Containers, each running two Tasks]
  • 48. Apache Samza: State Management ● Tasks checkpoint periodically to a checkpoint stream ● A checkpoint records the position in the input from which processing must continue after a container restart [Diagram: Samza Master, two Containers with two Tasks each, Checkpoint Stream]
  • 49. Apache Samza: State Management ● State store is local to the task: typically RocksDB (off-heap) or In-Memory (backed by a map) ● State store contains any operator state or adjunct state ● Allows the application to define state through a Key-Value interface [Diagram: Samza Master, two Containers with two Tasks each, Checkpoint Stream]
  • 50. Apache Samza: State Management ● State store is continuously replicated to a changelog stream ● Each store partition is mapped to a specific changelog partition [Diagram: Samza Master, two Containers with two Tasks each, Changelog Stream, Checkpoint Stream]
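The changelog idea above can be sketched as a local key-value store that appends every write to a log, so the store can be rebuilt by replay. This is a toy model, not Samza's KeyValueStore API: `ChangelogBackedStore` is a hypothetical class and a Python list stands in for one changelog stream partition.

```python
class ChangelogBackedStore:
    """Local key-value store that mirrors every write to an append-only changelog."""

    def __init__(self, changelog: list):
        self.changelog = changelog    # stand-in for one changelog stream partition
        self.data = {}

    def put(self, key, value) -> None:
        self.data[key] = value
        self.changelog.append((key, value))

    def delete(self, key) -> None:
        self.data.pop(key, None)
        self.changelog.append((key, None))    # None acts as a tombstone

    @classmethod
    def restore(cls, changelog: list) -> "ChangelogBackedStore":
        """Rebuild local state by replaying the changelog from the beginning."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.data.pop(key, None)
            else:
                store.data[key] = value
        return store

changelog = []
store = ChangelogBackedStore(changelog)
store.put("tree:10", 3)
store.put("tree:11", 1)
store.put("tree:10", 4)
store.delete("tree:11")

recovered = ChangelogBackedStore.restore(changelog)
print(recovered.data)
# → {'tree:10': 4}
```

Only the changes are shipped, never the whole store, which is why the later comparison slide lists the state update unit as delta changes.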
  • 51. Apache Samza: Fault Tolerance of State [Diagram: Samza Master; Containers with two Tasks each on Machine A and Machine B; Checkpoint Stream; Changelog Stream; annotation: Container Failure]
  • 52. Apache Samza: Fault Tolerance of State ● When the container is recovered on a different host, there is no state available locally [Diagram: Samza Master; Containers on Machine A and Machine X; Checkpoint Stream; Changelog Stream; annotation: Re-allocated on different host!]
  • 53. Apache Samza: Fault Tolerance of State ● When the container comes up on a different host, there is no state available locally ● Restores from the beginning of the changelog stream -> Full restore! [Diagram: Samza Master; Containers on Machine A and Machine X; Checkpoint Stream; annotation: Re-allocated on different host!]
  • 54. Apache Samza: Fault Tolerance of State ● State store is persisted to local disk on the machine, along with info on which offset to begin restoring the state from the changelog [Diagram: Samza Master; Containers on Machine A and Machine B; Checkpoint Stream; Changelog Stream; annotation: Container Failure]
  • 55. Apache Samza: Fault Tolerance of State ● Samza Master tries to re-allocate the container on the same host ● The feature where the Samza Master attempts to co-locate tasks with their built-up state stores (where they were previously running) is called host-affinity. [Diagram: Samza Master; Containers on Machine A and Machine B; Checkpoint Stream; Changelog Stream; annotation: Re-allocated on same host!]
  • 56. Apache Samza: Fault Tolerance of State ● Samza Master tries to re-allocate the container on the same host ● The feature where the Samza Master attempts to co-locate tasks with their built-up state stores (where they were previously running) is called host-affinity. ● If the container is re-allocated on the same host, the state store is partially restored from the changelog stream (delta restore) [Diagram: Samza Master; Containers on Machine A and Machine B; Checkpoint Stream; Changelog Stream; annotation: Re-allocated on same host!]
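The difference between a full restore and a delta restore can be sketched as replaying a changelog from different starting offsets. This is a simplified model: `restore_state` is a hypothetical helper, the changelog is a plain list, and the locally persisted offset is assumed to be stored next to the on-disk state as the slides describe.

```python
def restore_state(changelog, local_state=None, local_offset=0):
    """Replay changelog entries from local_offset onto local_state.

    With no locally persisted state, local_offset is 0 and this is a full restore.
    Returns the restored state and the number of entries replayed.
    """
    state = dict(local_state or {})
    replayed = 0
    for key, value in changelog[local_offset:]:
        state[key] = value
        replayed += 1
    return state, replayed

changelog = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Different host: nothing on disk, so every entry is replayed.
full_state, full_cost = restore_state(changelog)

# Same host (host-affinity): the state as of offset 3 survived on local disk.
on_disk = {"a": 3, "b": 2}
delta_state, delta_cost = restore_state(changelog, on_disk, local_offset=3)

print(full_cost, delta_cost)
# → 5 2
```

Both paths end in the same state; host-affinity only changes how much of the changelog has to be read to get there.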
  • 57. Apache Samza: Fault Tolerance of State ● Once state is restored, the checkpoint stream contains the correct offset for each task to begin processing [Diagram: Samza AppMaster; Containers on Machine A and Machine B; Checkpoint Stream; Changelog Stream; annotation: Re-allocated on same host!]
  • 58. Apache Samza: Fault Tolerance of State ● Persisting state on local disk + host-affinity effectively reduces the time to recover state after a failure or upgrade and to continue processing [Diagram: Samza AppMaster; two Containers with two Tasks each; Checkpoint Stream; Changelog Stream]
  • 59. Apache Samza: Fault Tolerance of State ● Persisting state on local disk + host-affinity effectively reduces the time to recover state after a failure or upgrade and to continue processing ● Only a subset of tasks may require a full restore, thereby reducing the time to recover from failure or to restart processing upon upgrades! [Diagram: Samza AppMaster; two Containers with two Tasks each; Checkpoint Stream; Changelog Stream]
  • 60. Apache Samza: Summary ● State Management Primitives ○ Within a task, data is stored in memory or on disk using RocksDB ○ Checkpoint state stored in the checkpoint stream ○ User-defined and operator state continuously replicated in a changelog stream ● Fault-Tolerance of State ○ Full state restored by consuming the changelog stream, if the state is not persisted on the task’s machine ○ If locally persisted, only a partial restore is needed
  • 61. Apache Samza: Observations ● State recovery from changelog can be time-consuming. It could potentially saturate Kafka clusters. Hence, partial restore is necessary. ● Host-affinity allows for faster failure recovery of task states, and faster job upgrades, even for large stateful jobs ● Since checkpoints are written to a stream and state is continuously replicated in changelog, frequent checkpoints are possible.
  • 62. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 63. Comparison of State & Fault-tolerance
      Aspect | Apache Samza | Apache Flink
      Durable State | RocksDB | FileSystem (recently added: RocksDB)
      State Fault Tolerance | Kafka-based Changelog Stream | HDFS
      State Update Unit | Delta Changes | Full Snapshot
      State Recovery Unit | Full Restore + improved recovery with host-affinity | Full Restore
  • 64. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 65. Challenges in Scaling Result Storage / Serving ● Even a fast KV store handles far fewer QPS (order of thousands) than the output rate of stream processing (order of millions) ● The output store can suffer a denial of service (DoS) from the high write throughput
  • 66. Challenges in Scaling Result Storage / Serving [Chart: Online + Async Processor; annotations: Processor Downtime ~30 min, Processor Restarts]
  • 67. Scaling Result Storage / Serving [Diagram labels: Online Apps, Query, Serving Platform (Derived Data), Stream Processing, Offline Processing, Distributed Queue (Kafka), Bulk Load, Change Stream]
  • 68. Agenda ● Use cases in Stream Processing ● Typical Data Pipelines ● Scaling Data Ingestion ● Scaling Data Processing ○ Challenges in Scaling Data Processing ○ Walk-through of Apache Flink & Apache Samza ○ Observations on state & fault-tolerance ● Challenges in Scaling Result Storage ● Conclusion
  • 69. Conclusion ● Ingest/Process/Serve should be holistically scalable to successfully scale stream processing applications ● The notion of “locally” accessible state is great for scaling stream processing applications for performance, but it brings the additional cost of making the state fault-tolerant
  • 70. References ● Apache Samza - http://samza.apache.org ● Apache Flink - http://flink.apache.org ● https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html ● https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html ● http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed ● https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/ ● Apache Kafka - http://kafka.apache.org ● http://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-resharding.html ● http://aws.amazon.com/streaming-data/
  • 71. Contribute! Exciting features coming up in Apache Samza: ● SAMZA-516 - Standalone deployment of Samza (independent of Yarn) ● SAMZA-390 - High-level Language for Samza ● SAMZA-863 - Multithreading in Samza Join us! ● Apache Samza - samza.apache.org ● Samza Hello World! - https://samza.apache.org/startup/hello-samza/latest/ ● Mailing List - dev@samza.apache.org ● JIRA - http://issues.apache.org/jira/browse/SAMZA