Towards the True Elasticity
of Spark
Michael Le and Min Li
IBM T.J. Watson Research Center
Auto-Scaling
•  Detects changes in the resource usage of current workloads →
dynamically allocates/de-allocates resources
•  Meets SLA requirements at reduced cost
•  Existing auto-scaling approaches react slowly and often miss
optimization opportunities
•  YARN and Mesos have initial auto-scaling support, but how can
workloads benefit from this capability?
Infrastructure as a Service
Resource Manager (YARN/Mesos)
Main Focus
•  Analyze how scaling affects Spark workloads
–  Is simply adding new resources sufficient for performance
improvement?
•  Analyze pros/cons of current Spark auto-scaling mechanism
–  Is there room for performance improvement?
Agenda
•  Introduction
•  Evaluation setup
•  Impact of scaling
•  Auto-scaling in Spark
•  Lessons and future work

Experimental Setup
•  Baseline setup (6 nodes):
•  Node: 4 CPUs, 8GB, 100 GB HDDs
•  YARN executor: 3GB, 1CPU
•  Mesos executor: 6GB and 4CPUs
•  4 benchmarks (SparkBench*):
–  KMeans (input 37GB)
–  Page Rank (input 1.1GB)
–  Spark SQL: SQL queries (input 39GB)
–  Logistic Regression (input 47GB)
•  Scaling up: add 4 new nodes “instantaneously”
•  Wait ~45 seconds after benchmark run
•  Scale down – wait 3min after benchmark run
* https://bitbucket.org/lm0926/sparkbench
Mechanisms for Scaling
•  Scale up – add new node “instantaneously” (VM already
provisioned)
–  Run YARN node manager or Mesos slave daemon
–  New nodes are Task nodes (no HDFS component)
•  Scale down – kill resource manager processes
•  Spark on YARN – set total executors for app higher than
available in cluster to ensure executors get launched
•  Spark on Mesos – make use of new resource offers from
Mesos
Runtime – Spark on YARN
Benchmark | Runtime (baseline) | Runtime (scale out) | Runtime reduction
KMeans | 54min 44sec | 29min 26sec | 46%
Page Rank | 9min 28sec | 8min 18sec | 12.3%
Spark SQL queries | 15min 45sec | 13min 11sec | 16.3%
LogisticRegression | 13min 10sec | 12min 55sec | 2%
Similar behavior seen on Mesos
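The runtime reduction column can be recomputed from the two runtime columns. The sketch below assumes the reduction is defined as (baseline - scale out) / baseline; the helper names are illustrative and not from the slides.

```python
import re

def to_seconds(runtime: str) -> int:
    """Parse a runtime written like '54min 44sec' into seconds."""
    match = re.fullmatch(r"(\d+)min\s+(\d+)\s*sec", runtime.strip())
    if match is None:
        raise ValueError(f"unrecognized runtime: {runtime!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)

def runtime_reduction(baseline: str, scaled: str) -> float:
    """Percentage by which the scaled-out run is shorter than the baseline."""
    base = to_seconds(baseline)
    return 100.0 * (base - to_seconds(scaled)) / base

# Runtimes copied from the Spark on YARN table.
yarn_runs = {
    "KMeans": ("54min 44sec", "29min 26sec"),
    "Page Rank": ("9min 28sec", "8min 18sec"),
    "Spark SQL queries": ("15min 45sec", "13min 11sec"),
    "LogisticRegression": ("13min 10sec", "12min 55sec"),
}

for name, (baseline, scaled) in yarn_runs.items():
    print(f"{name}: {runtime_reduction(baseline, scaled):.1f}%")
```

Under that definition the recomputed values agree with the table after rounding.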

Why the variation?
•  Delay scheduling prevents tasks from being scheduled on
new nodes
Benchmark | Total tasks | % tasks with locality preference RACK or ANY | % tasks waiting >3s to be scheduled | % tasks on new nodes
KMeans | 7800 | 0.4% | 3.5% | 24.5%
Page Rank | 21600 | 0% | 0.01% | 13.5%
Spark SQL | 9805 | 3% | 0.2% | 11.8%
LogisticRegression | 6637 | 0% | 0.1% | 1.5%
Tuning Spark for Scaling
•  Locality wait time
–  How long to wait before relaxing the locality preference of
tasks in a stage
•  Resource revive interval
–  How often the scheduler is informed that an unused resource
is still available
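The slides do not name the configuration keys. Assuming the two knobs correspond to Spark's `spark.locality.wait` and `spark.scheduler.revive.interval` properties, a minimal sketch of passing them to `spark-submit` could look like this (the helper functions are hypothetical):

```python
def scaling_conf(locality_wait: str = "3s", revive_interval: str = "1s") -> dict:
    """Settings for the two knobs; the property keys are an assumed mapping."""
    return {
        "spark.locality.wait": locality_wait,
        "spark.scheduler.revive.interval": revive_interval,
    }

def to_submit_args(conf: dict) -> list:
    """Turn a settings dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# Example: shorten only the locality wait, as in the 100ms experiments.
command = ["spark-submit"] + to_submit_args(scaling_conf(locality_wait="100ms"))
print(" ".join(command))
```

The application jar and its arguments would still need to be appended to the command.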
Runtime – Tuning Spark
Benchmark | Runtime (baseline) | Runtime (scale out) | Runtime (scale out, revive interval 100ms) | Runtime (scale out, locality wait time 100ms) | Runtime (scale out, locality wait time 0ms)
KMeans | 54min 44sec | 29min 26sec | 30min 39sec | 11min 32sec | 14min 1sec
Page Rank | 9min 28sec | 8min 18sec | 8min 35sec | 7min 35sec | 5min 57sec
Spark SQL queries | 15min 45sec | 13min 11sec | 12min 55sec | 11min 40sec | 19min 57sec
LogisticRegression | 13min 10sec | 12min 55sec | 12min 43sec | 6min 54sec | 9min 12sec
Benchmark | % tasks on new node (locality wait time 3s) | % tasks on new node (locality wait time 100ms) | % tasks on new node (locality wait time 0ms)
KMeans | 24.5% | 39.1% | 38%
Page Rank | 13.5% | 16.1% | 39%
Spark SQL queries | 11.8% | 22.6% | 39%
LogisticRegression | 1.5% | 38% | 39%
KMeans - CPU Utilization Per Node
[Figure: CPU utilization, five panels: baseline (1 of the 6 base nodes); scale out with locality wait 100ms (1 of the 6 base nodes, 1 of the 4 new nodes); scale out with locality wait 0ms (1 of the 6 base nodes, 1 of the 4 new nodes)]

KMeans - Network Utilization
Baseline: 10 MB shuffle read, 12 MB shuffle write
Scale out (locality wait - 100ms): 10 MB shuffle read, 11 MB shuffle write
Scale out (locality wait - 0ms): 10 MB shuffle read, 12 MB shuffle write
Greater bandwidth consumption due to transferring of RDD partitions
KMeans - Memory Utilization
[Figure: memory utilization, three panels: baseline; scale out (locality wait - 100ms); scale out (locality wait - 0ms)]
PageRank - CPU Utilization
[Figure: CPU utilization, five panels: baseline (1 of the 6 nodes); scale out with locality wait 100ms (1 of the 6 nodes, 1 of the 4 new nodes); scale out with locality wait 0ms (1 of the 6 nodes, 1 of the 4 new nodes)]
PageRank - Network Utilization
Baseline: 15.50 GB shuffle read, 16.71 GB shuffle write
Scale out (locality wait - 100ms): 16.07 GB shuffle read, 16.71 GB shuffle write
Scale out (locality wait - 0ms): 16.10 GB shuffle read, 16.73 GB shuffle write
Greater bandwidth consumption due to transferring of RDD partitions

PageRank - Memory Utilization
[Figure: memory utilization, three panels: baseline; scale out (locality wait - 100ms); scale out (locality wait - 0ms)]
SQL – Network Utilization
[Figure: network utilization, two panels: scale out (locality wait - 0ms); scale out (locality wait - 100ms)]
LogisticRegression – Network Utilization
[Figure: network utilization, two panels: scale out (locality wait - 0ms); scale out (locality wait - 100ms)]
Take-aways
•  Locality wait time is key to improvement
– Tune during runtime?
– Adjust during scaling to force use of new
nodes
•  Need to weigh the gains from running tasks
on new nodes against the network bandwidth used

Scaling Down
Benchmark | Runtime – Baseline (10 nodes) | Runtime – Scale in
KMeans | 11min 54sec | 49min 13sec
PageRank | 11min 41sec | 13min 12sec
Spark SQL | 12min 50sec | 15min 52sec
LogisticRegression | 10min 20sec | 14min 7sec
Re-execution overhead worst for some workloads
Scaling Down with Mesos/YARN
•  Prevent Spark from scheduling more tasks on
nodes that are selected for removal
–  Mesos fine-grained: simply do not offer resources
from the node selected for removal
–  Mesos coarse-grained and YARN: require
cooperation from Spark to not schedule new
tasks on nodes selected for removal
•  Shut down the node once its tasks are drained
–  What to do about stored shuffle data?
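The drain-then-shut-down idea can be sketched as a toy model. This is not Spark, YARN, or Mesos code; the `Cluster` class and its methods are hypothetical, and stored shuffle data (the open question above) is deliberately not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    running: dict = field(default_factory=dict)  # node -> number of running tasks
    draining: set = field(default_factory=set)   # nodes selected for removal

    def add_node(self, node: str) -> None:
        self.running.setdefault(node, 0)

    def mark_for_removal(self, node: str) -> None:
        """From now on, no new task is placed on this node."""
        self.draining.add(node)

    def launch_task(self) -> str:
        """Place a task on the least-loaded node that is not draining."""
        candidates = [n for n in self.running if n not in self.draining]
        if not candidates:
            raise RuntimeError("no schedulable nodes left")
        node = min(candidates, key=lambda n: self.running[n])
        self.running[node] += 1
        return node

    def finish_task(self, node: str) -> None:
        self.running[node] -= 1

    def can_shut_down(self, node: str) -> bool:
        """A draining node is safe to stop once its running tasks are done."""
        return node in self.draining and self.running[node] == 0

cluster = Cluster()
cluster.add_node("node-a")
cluster.add_node("node-b")
first, second = cluster.launch_task(), cluster.launch_task()
cluster.mark_for_removal("node-b")
third = cluster.launch_task()            # can only land on node-a now
print(first, second, third, cluster.can_shut_down("node-b"))
cluster.finish_task("node-b")
print(cluster.can_shut_down("node-b"))
```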
Agenda
•  Introduction
•  Background
•  Evaluation setup
•  Impact of scaling
•  Auto-scaling in Spark
•  Lessons and future work
Existing Auto-scaling Mechanism
•  Dynamic Executor Allocation (DEA)
–  Works only with YARN
–  Spark requests new executors after a fixed time interval
when there are still pending tasks
–  Number of requested executors grows exponentially
•  Works, but has some potential inefficiencies
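The exponential growth can be illustrated with a simplified model. It assumes the number of executors added doubles each round (1, 2, 4, ...), that the pending task count stays constant during the ramp-up, and that the target is capped by both a configured maximum and the executors needed for the pending tasks. The function and parameter names are illustrative, not Spark's.

```python
import math

def executor_targets(pending_tasks: int, tasks_per_executor: int = 1,
                     initial: int = 0, max_executors: int = 64) -> list:
    """Executor target after each request round until the cap is reached."""
    needed = math.ceil(pending_tasks / tasks_per_executor)
    cap = min(max_executors, needed)
    target, to_add, targets = initial, 1, []
    while target < cap:
        target = min(target + to_add, cap)  # never exceed the cap
        targets.append(target)
        to_add *= 2                         # double the request each round
    return targets

print(executor_targets(pending_tasks=100, max_executors=20))
```

In this model a short job may finish before the later, larger requests are ever issued, which is one way the fixed-interval ramp-up can be inefficient.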


Improving Auto-scaling
•  Main reason tasks are not scheduled on new nodes
is data locality preferences
–  Delay scheduling prevents tasks from being
run on new nodes
•  Approach:
–  Change locality wait time dynamically during
application runtime
–  Ideally, reduce locality wait time at the point of scale
out, then, once the application stabilizes, revert to the
previous locality wait time value
Improving Auto-scaling Details
•  Ideally: tasks spread evenly among all executors
–  Average # of tasks per executor:
•  T = total tasks / # executors
•  t_i = # of tasks on executor i
•  If t_i < alpha*T for s seconds, then change locality
wait time
•  If t_i is still below the threshold r seconds after the first
change of locality wait time, then remove the executor
•  If no executors are below the threshold, reset locality wait
time to its initial value
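The rules above can be sketched as a small controller. This is an illustration under assumptions, not the authors' implementation: the class name is hypothetical, the value of r is made up (the slides give none), and the caller is assumed to supply the per-executor task counts t_i over whatever window it chooses.

```python
from dataclasses import dataclass

@dataclass
class LocalityWaitController:
    alpha: float = 0.30          # threshold fraction of the average T
    s: float = 5.0               # seconds below threshold before reducing the wait
    r: float = 30.0              # ASSUMED value; the slides do not give r
    initial_wait_ms: int = 3000  # 3s locality wait used in the earlier runs
    reduced_wait_ms: int = 100   # reduced locality wait time

    def __post_init__(self):
        self.wait_ms = self.initial_wait_ms
        self.below_since = {}    # executor -> time it first fell below threshold
        self.reduced_at = None   # time of the first locality wait change

    def update(self, now: float, tasks_per_executor: dict) -> list:
        """Adjust wait_ms and return the executors that should be removed."""
        if not tasks_per_executor:
            return []
        average = sum(tasks_per_executor.values()) / len(tasks_per_executor)
        threshold = self.alpha * average
        below = {e for e, t in tasks_per_executor.items() if t < threshold}
        self.below_since = {e: self.below_since.get(e, now) for e in below}
        if not below:
            # Every executor is above the threshold: restore the initial wait.
            self.wait_ms = self.initial_wait_ms
            self.reduced_at = None
            return []
        if self.reduced_at is None:
            if any(now - since >= self.s for since in self.below_since.values()):
                self.wait_ms = self.reduced_wait_ms
                self.reduced_at = now
            return []
        if now - self.reduced_at >= self.r:
            # Still starved r seconds after the change: candidates for removal.
            return sorted(e for e in below if self.below_since[e] <= self.reduced_at)
        return []

controller = LocalityWaitController(r=10.0)
controller.update(0, {"old-1": 10, "old-2": 10, "new": 0})
controller.update(5, {"old-1": 20, "old-2": 20, "new": 1})
print(controller.wait_ms)
print(controller.update(15, {"old-1": 30, "old-2": 30, "new": 2}))
```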
Dynamically Adjusting Locality Wait Time at
Runtime with Dynamic Executor Allocation
Benchmark | Runtime (scale out w/ DEA) | % tasks on new node (locality wait time 3s) | Runtime (scale out w/ DEA and runtime adjustment of locality wait time) | % tasks on new node (dynamic locality wait time)
KMeans | 32min 15sec | 28.6% | 20min 35sec | 35%
PageRank | 12min 2sec | 12.9% | 11min 23sec | 14%
Spark SQL | 12min 48sec | 11.5% | 12min 26sec | 16%
Logistic Regression | 11min 58sec | 0.7% | 12min 16sec | 3.3%
Parameters: alpha = 30%, s = 5sec, new locality wait time = 100ms
Mechanism helps increase the % of tasks on new nodes
Simple Auto-scaling Improvements
•  When to scale?
–  Existing mechanism: backlogged tasks exist for n secs
–  Drawback: requests CPU resources when tasks might be I/O bound
–  Proposal: CPU/memory of more than 50% of nodes is greater than a
threshold t (e.g., 98%) over n seconds
•  How many more VMs?
–  Existing mechanism: increase number of executors exponentially
until configured max or # of pending tasks
–  Proposal: initially under request n% of the task queue length
•  Whether to scale (still beneficial)?
–  Existing mechanism: always scale if above condition met
–  Drawback: scaling unnecessary if near end or short runtime
–  Proposal, option 1: according to the ratio of unprocessed data,
e.g., < 80%. Rough estimation of job execution time: proportional
to the data process rate
–  Proposal, option 2: model driven, predict runtime based on
previous runs
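Two of the proposals can be sketched in a few lines: the utilization-based trigger and the rough remaining-time estimate. The function names, the snapshot format, and the interpretation of "over n seconds" as "in every sample collected during the window" are assumptions made for illustration.

```python
def should_request_resources(samples: list, threshold: float = 0.98,
                             node_fraction: float = 0.5) -> bool:
    """True if, in every sample of the window, more than node_fraction of
    the nodes report utilization above threshold.

    samples: one dict per sampling instant, mapping node -> utilization in [0, 1].
    """
    if not samples:
        return False
    for snapshot in samples:
        busy = sum(1 for utilization in snapshot.values() if utilization > threshold)
        if busy <= node_fraction * len(snapshot):
            return False
    return True

def estimated_remaining_seconds(processed: float, total: float,
                                elapsed_seconds: float) -> float:
    """Rough estimate assuming the data processing rate stays constant."""
    if processed <= 0:
        raise ValueError("no data processed yet; rate is unknown")
    return (total - processed) * elapsed_seconds / processed

window = [
    {"n1": 0.99, "n2": 0.99, "n3": 0.50},
    {"n1": 0.99, "n2": 1.00, "n3": 0.99},
]
print(should_request_resources(window))
print(estimated_remaining_seconds(processed=10, total=40, elapsed_seconds=60))
```

A scaling decision could compare the remaining-time estimate with the time needed to bring up new nodes and skip scaling when the job is nearly done.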

Lessons
•  Naïve scaling can help but effectiveness varies greatly
across different workloads
•  Why do some workloads not respond well to scaling?
–  Delay scheduling prevents new nodes from being utilized
•  Dynamic executor allocation works but can be improved
–  Dynamically changing locality wait time can be
effective
•  Overhead of transferring RDD partitions can reduce
benefit of scaling
Future Work
•  Study scaling effects given multiple
simultaneous workloads
•  Implement better support for scaling down
•  Enhance DEA to make use of resource
monitors and job runtime prediction
Backup


Delay Scheduling
•  Intended for fixed-size clusters running
multiple workloads with short tasks
•  Emphasis on scheduling tasks on nodes
containing data
– Wait a short time for resources to free up
on nodes containing data rather than run
the task on a node that is available now but
further away from the data
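A simplified model of delay scheduling shows why freshly added nodes that hold no data stay idle until the wait expires. The three locality levels and the rule that each elapsed wait period relaxes the preference by one level are simplifying assumptions; all names are illustrative.

```python
LOCALITY_LEVELS = ["NODE_LOCAL", "RACK_LOCAL", "ANY"]

def allowed_locality(waited_seconds: float, locality_wait_seconds: float) -> str:
    """Most relaxed locality level a task may use after waiting this long."""
    if locality_wait_seconds <= 0:
        return LOCALITY_LEVELS[-1]  # no delay: any node is acceptable at once
    steps = int(waited_seconds // locality_wait_seconds)
    return LOCALITY_LEVELS[min(steps, len(LOCALITY_LEVELS) - 1)]

def can_run_on(node: str, data_nodes: set, rack_of: dict,
               waited_seconds: float, locality_wait_seconds: float) -> bool:
    """Whether a task whose input lives on data_nodes may start on node."""
    if node in data_nodes:
        return True
    level = allowed_locality(waited_seconds, locality_wait_seconds)
    if level == "ANY":
        return True
    if level == "RACK_LOCAL":
        return any(rack_of[node] == rack_of[d] for d in data_nodes)
    return False

# A new node on another rack, holding no input data.
racks = {"base-1": "rack-1", "new-1": "rack-2"}
for waited in (0, 3, 6):
    print(waited, can_run_on("new-1", {"base-1"}, racks, waited, 3))
print(can_run_on("new-1", {"base-1"}, racks, 0, 0))
```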
Resource Managers + Auto-scaling
•  Resource management
→ different frameworks can coexist
→ high resource utilization
•  Facilitates on-demand resource allocation – support
elastic services
Infrastructure as a Service
Resource Manager (YARN/Mesos)
Brief Intro: YARN
•  Resource Manager (RM) controls resource allocation
•  Application Master (AM) negotiates with RM for
resources and launches executors to run jobs
Brief Intro: Mesos
•  Framework schedulers accept or
reject offered resources
•  Resource preferences are
communicated to Mesos through common
APIs
•  Coarse-grained mode
–  Mesos launches one long-running Spark
executor on each node to execute all Spark
tasks
•  Fine-grained mode
–  Mesos launches executor for each Spark
task

Runtime – Spark on Mesos
Mode            Benchmark            Runtime (baseline)   Runtime (scale out)
Fine-grained    KMeans               92min 54sec          35min 37sec
Fine-grained    Page Rank            19min 29sec          16min 55sec
Fine-grained    Spark SQL queries    24min 45sec          19min 7sec
Fine-grained    LogisticRegression   14min 43sec          14min 39sec
Coarse-grained  KMeans               89min 24sec          11min 49sec
Coarse-grained  Page Rank            8min 29sec           7min 41sec
Coarse-grained  Spark SQL queries    10min 32sec          8min 30sec
Coarse-grained  LogisticRegression   10min 57sec          7min 56sec
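
To make the runtimes above easier to compare, the following sketch converts each entry to seconds and computes the scale-out speedup (baseline runtime divided by scale-out runtime). The runtime values are copied from the slide. The helper names `to_seconds` and `speedup` are illustrative and not part of the original talk.

```python
import re

# Runtimes copied from the slide: (baseline, scale out) per mode and benchmark.
RUNTIMES = {
    ("fine-grained", "KMeans"): ("92min 54sec", "35min 37sec"),
    ("fine-grained", "Page Rank"): ("19min 29sec", "16min 55sec"),
    ("fine-grained", "Spark SQL queries"): ("24min 45sec", "19min 7sec"),
    ("fine-grained", "LogisticRegression"): ("14min 43sec", "14min 39sec"),
    ("coarse-grained", "KMeans"): ("89min 24sec", "11min 49sec"),
    ("coarse-grained", "Page Rank"): ("8min 29sec", "7min 41sec"),
    ("coarse-grained", "Spark SQL queries"): ("10min 32sec", "8min 30sec"),
    ("coarse-grained", "LogisticRegression"): ("10min 57sec", "7min 56sec"),
}


def to_seconds(runtime: str) -> int:
    """Parse a string such as '92min 54sec' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s*(\d+)\s*sec\s*", runtime)
    if match is None:
        raise ValueError(f"unrecognized runtime format: {runtime!r}")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds


def speedup(baseline: str, scaled: str) -> float:
    """Return baseline runtime divided by scale-out runtime."""
    return to_seconds(baseline) / to_seconds(scaled)


if __name__ == "__main__":
    for (mode, benchmark), (baseline, scaled) in RUNTIMES.items():
        print(f"{mode:15s} {benchmark:20s} {speedup(baseline, scaled):.2f}x")
```

Read this way, KMeans in coarse-grained mode benefits most from scaling out (roughly 7.6x), while LogisticRegression in fine-grained mode is essentially unchanged, which fits the talk's question of whether simply adding resources is sufficient.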

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...




Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)

  • 1. Towards the True Elasticity of Spark Michael Le and Min Li IBM, T.J. Watson Research Center
  • 2. Auto-Scaling •  Detects changes of resource usage in current workloads → Dynamically allocate/de-allocate resources •  Meets SLA requirements at reduced cost •  Existing auto-scaling approaches react slowly and often miss optimization opportunities •  YARN and Mesos have initial auto-scaling support, yet how can workloads benefit from this capability? •  Diagram labels: Infrastructure as a Service; Resource Manager (YARN/Mesos)
  • 3. Main Focus •  Analyze how scaling affects Spark workloads –  Is simply adding new resources sufficient for performance improvement? •  Analyze pros/cons of current Spark auto-scaling mechanism –  Is there room for performance improvement?
  • 4. Agenda •  Introduction •  Evaluation setup •  Impact of scaling •  Auto-scaling in Spark •  Lessons and future work
  • 5. Experimental Setup •  Baseline setup (6 nodes): •  Node: 4 CPUs, 8 GB, 100 GB HDDs •  YARN executor: 3 GB, 1 CPU •  Mesos executor: 6 GB and 4 CPUs •  4 benchmarks (SparkBench*): –  KMeans (input 37 GB) –  Page Rank (input 1.1 GB) –  Spark SQL: SQL queries (input 39 GB) –  Logistic Regression (input 47 GB) •  Scaling up: add 4 new nodes “instantaneously” •  Wait ~45 seconds after benchmark run •  Scale down – wait 3 min after benchmark run * https://bitbucket.org/lm0926/sparkbench
  • 6. Mechanisms for Scaling •  Scale up – add new node “instantaneously” (VM already provisioned) –  Run YARN node manager or Mesos slave daemon –  New nodes are Task nodes (no HDFS component) –  Scale down – kill resource manager processes •  Spark on YARN – set total executors for app higher than available in cluster to ensure executors get launched •  Spark on Mesos – make use of new resource offers from Mesos
  • 7. Agenda •  Introduction •  Evaluation setup •  Impact of scaling •  Auto-scaling in Spark •  Lessons and future work
  • 8. Runtime – Spark on YARN
      Benchmark            Runtime (baseline)   Runtime (scale out)   Runtime reduction
      KMeans               54min 44sec          29min 26sec           46%
      Page Rank            9min 28sec           8min 18sec            12.3%
      Spark SQL queries    15min 45sec          13min 11sec           16.3%
      LogisticRegression   13min 10sec          12min 55sec           2%
      Similar behavior seen on Mesos
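The reduction column on slide 8 can be re-derived from the two runtime columns. The short sketch below does that arithmetic; the helper names `to_seconds` and `reduction_pct` are illustrative and do not come from the slides.

```python
# Recompute the "Runtime reduction" column of slide 8 from the reported runtimes.
# Helper names are hypothetical; only the runtimes themselves come from the slide.

def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xmin Ysec' runtime into seconds."""
    return minutes * 60 + seconds


def reduction_pct(baseline_s: int, scaled_s: int) -> float:
    """Percentage of the baseline runtime saved by scaling out."""
    return 100.0 * (baseline_s - scaled_s) / baseline_s


# (baseline, scale out) runtimes for Spark on YARN, as reported on the slide.
runtimes = {
    "KMeans": (to_seconds(54, 44), to_seconds(29, 26)),
    "Page Rank": (to_seconds(9, 28), to_seconds(8, 18)),
    "Spark SQL queries": (to_seconds(15, 45), to_seconds(13, 11)),
    "LogisticRegression": (to_seconds(13, 10), to_seconds(12, 55)),
}

reductions = {
    name: round(reduction_pct(base, scaled), 1)
    for name, (base, scaled) in runtimes.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}%")
```

The recomputed values agree with the slide once rounded: LogisticRegression comes out at 1.9%, which the slide reports as 2%.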
  • 9. Why the variation? •  Delay scheduling prevents tasks from being scheduled on new nodes
      Benchmark            Total tasks   % tasks with locality preference RACK or ANY   % tasks waiting >3s to be scheduled   % tasks on new nodes
      KMeans               7800          0.4%                                           3.5%                                  24.5%
      Page Rank            21600         0%                                             0.01%                                 13.5%
      Spark SQL            9805          3%                                             0.2%                                  11.8%
      LogisticRegression   6637          0%                                             0.1%                                  1.5%
  • 10. Tuning Spark for Scaling •  Locality wait time –  How soon to change the locality preference of tasks in a stage •  Resource revive interval –  How soon to inform the scheduler that a resource that has not been used is still available
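A minimal sketch of how the two knobs on slide 10 might be passed to a job. It assumes they correspond to the Spark properties `spark.locality.wait` and `spark.scheduler.revive.interval`; the slides do not name the properties, so that mapping is an assumption. The 100ms values mirror the settings tried in the experiments, and the `conf_flags` helper is hypothetical.

```python
# Build spark-submit "--conf" flags for the two tuning knobs discussed on slide 10.
# Assumption: "locality wait time" -> spark.locality.wait and
# "resource revive interval" -> spark.scheduler.revive.interval.

SCALE_OUT_TUNING = {
    "spark.locality.wait": "100ms",
    "spark.scheduler.revive.interval": "100ms",
}


def conf_flags(settings: dict) -> list:
    """Turn a {property: value} mapping into a flat list of --conf arguments."""
    flags = []
    for key, value in sorted(settings.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


# Example: arguments that could be appended to a spark-submit command line.
print(" ".join(conf_flags(SCALE_OUT_TUNING)))
```

Keeping the settings in one mapping makes it easy to run the same benchmark with and without the tuning applied.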
  • 11. Runtime – Tuning Spark
      Runtime (the last three columns are scale-out runs with the named setting)
      Benchmark            Baseline      Scale out     Revive interval 100ms   Locality wait 100ms   Locality wait 0ms
      KMeans               54min 44sec   29min 26sec   30min 39sec             11min 32sec           14min 1sec
      Page Rank            9min 28sec    8min 18sec    8min 35sec              7min 35sec            5min 57sec
      Spark SQL queries    15min 45sec   13min 11sec   12min 55sec             11min 40sec           19min 57sec
      LogisticRegression   13min 10sec   12min 55sec   12min 43sec             6min 54sec            9min 12sec

      % tasks on new node
      Benchmark            Locality wait 3s   Locality wait 100ms   Locality wait 0ms
      KMeans               24.5%              39.1%                 38%
      Page Rank            13.5%              16.1%                 39%
      Spark SQL queries    11.8%              22.6%                 39%
      LogisticRegression   1.5%               38%                   39%
  • 12. KMeans - CPU Utilization Per Node Scale out (locality wait - 100ms) 1 of the 4 new nodes Base line (1 of the 6 base nodes) Scale out (locality wait - 100ms) 1 of the 6 base nodes Scale out (locality wait - 0ms) 1 of the 6 base nodes Scale out (locality wait - 0ms) 1 of the 4 new nodes
  • 13. KMeans - Network Utilization •  Baseline: 10 MB shuffle read, 12 MB shuffle write •  Scale out (locality wait - 100ms): 10 MB shuffle read, 11 MB shuffle write •  Scale out (locality wait - 0ms): 10 MB shuffle read, 12 MB shuffle write •  Greater bandwidth consumption due to transferring of RDD partitions
  • 14. KMeans - Memory Utilization Baseline KMeans – scale out (locality wait - 100ms) KMeans – scale out (locality wait - 0ms)
  • 15. PageRank - CPU Utilization Scale out (locality wait - 100ms) 1 of the 4 new nodes Base line (1 of the 6 nodes) Scale out (locality wait - 100ms) 1 of the 6 nodes Scale out (locality wait - 0ms) 1 of the 4 new nodes Scale out (locality wait - 0ms) 1 of the 6 nodes
  • 16. PageRank - Network Utilization •  Baseline: 15.50 GB shuffle read, 16.71 GB shuffle write •  Scale out (locality wait - 100ms): 16.07 GB shuffle read, 16.71 GB shuffle write •  Scale out (locality wait - 0ms): 16.10 GB shuffle read, 16.73 GB shuffle write •  Greater bandwidth consumption due to transferring of RDD partitions
  • 17. PageRank - Memory Utilization Baseline Scale out (locality wait - 100ms) Scale out (locality wait - 0ms)
  • 18. SQL – Network Utilization •  Panels: Scale out (locality wait - 0ms); Scale out (locality wait - 100ms)
  • 19. LogisticRegression – Network Utilization •  Panels: Scale out (locality wait - 0ms); Scale out (locality wait - 100ms)
  • 20. Take-aways •  Locality wait time is key to improvement – Tune during runtime? – Adjust during scaling to force use of new nodes •  Need to consider gains from running task on new nodes vs. network bandwidth used
  • 21. Scaling Down
      Benchmark            Runtime – Baseline (10 nodes)   Runtime – Scale in
      KMeans               11min 54sec                     49min 13sec
      PageRank             11min 41sec                     13min 12sec
      Spark SQL            12min 50sec                     15min 52sec
      LogisticRegression   10min 20sec                     14min 7sec
      Re-execution overhead worst for some workloads
  • 22. Scaling Down with Mesos/YARN •  Prevent Spark from scheduling more tasks on nodes that are selected for removal –  Mesos fine-grained: simply do not offer resources from nodes selected for removal –  Mesos coarse-grained and YARN: require cooperation from Spark to not schedule new tasks on nodes selected for removal •  Shut down node once tasks are drained –  What to do about stored shuffle data?
  • 23. Agenda •  Introduction •  Background •  Evaluation setup •  Impact of scaling •  Auto-scaling in Spark •  Lessons and future work
  • 24. Existing Auto-scaling Mechanism •  Dynamic Executor Allocation (DEA) –  Works only with YARN –  Spark requests new executors after a fixed time interval when there are still pending tasks –  Number of requested executors grows exponentially •  Works but does have some potential inefficiency
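The exponential growth mentioned on slide 24 can be illustrated with a small simulation. This is a simplified sketch, not Spark's implementation: it assumes the request size starts at one executor, doubles each round while tasks remain pending, and stops at whichever is lower, a configured maximum or the number of executors needed for the pending tasks. The function and parameter names are hypothetical.

```python
# Simplified simulation of exponentially growing executor requests.
# Assumptions (not from the slides): the first round adds 1 executor and each
# later round doubles the number added; one request round happens per interval.

def executor_request_rounds(pending_tasks: int,
                            tasks_per_executor: int,
                            max_executors: int) -> list:
    """Return the total executor count after each request round."""
    # Executors needed to run every pending task at once (ceiling division).
    needed = -(-pending_tasks // tasks_per_executor)
    cap = min(max_executors, needed)

    total, step, history = 0, 1, []
    while total < cap:
        total = min(total + step, cap)  # never exceed the cap
        history.append(total)
        step *= 2                       # next round asks for twice as many
    return history


# 100 pending tasks, 1 task slot per executor, at most 20 executors.
print(executor_request_rounds(100, 1, 20))  # [1, 3, 7, 15, 20]
```

The same doubling that ramps up quickly can under-request in the first few rounds, which is one of the drawbacks the deck raises later.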
  • 25. Improving Auto-scaling •  Main reason tasks are not scheduled on new nodes is data locality preferences –  Delay scheduling prevents tasks from being run on new nodes •  Approach: –  Change locality wait time dynamically during application runtime –  Ideally, reduce locality wait time at the point of scale out, then, once the task distribution stabilizes, revert to the previous locality wait time value
  • 26. Improving Auto-scaling Details •  Ideally: tasks spread evenly among all executors –  Average # of tasks per executor: •  T = Total tasks / # executors •  t_i = # of tasks on executor i •  If t_i < alpha*T for s seconds, then change locality wait time •  If t_i is still below threshold after r seconds from first change of locality wait time, then remove executor •  If no executors are below threshold, reset locality wait time to initial value
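The rule on slide 26 can be written as a small decision function. The sketch below is one possible reading of the slide, not the authors' code: the function names, the action labels, and the default `r = 30` seconds are hypothetical, while `alpha = 0.30` and `s = 5` seconds follow the parameters reported with the results.

```python
# Sketch of the slide-26 rule: compare each executor's task count t_i with
# alpha * T, where T is the average number of tasks per executor.
from typing import Dict, Optional, Set, Tuple


def underused_executors(task_counts: Dict[str, int], alpha: float) -> Set[str]:
    """Executors whose task count t_i is below alpha * T."""
    average = sum(task_counts.values()) / len(task_counts)  # T
    return {e for e, t in task_counts.items() if t < alpha * average}


def next_action(task_counts: Dict[str, int],
                seconds_below: Dict[str, float],
                seconds_since_change: Optional[float],
                alpha: float = 0.30,
                s: float = 5.0,
                r: float = 30.0) -> Tuple[str, Set[str]]:
    """Pick one action; seconds_since_change is None until locality wait is lowered."""
    low = underused_executors(task_counts, alpha)
    if not low:
        # Tasks are spread well enough: go back to the initial locality wait.
        return ("reset_locality_wait", set())
    if seconds_since_change is not None and seconds_since_change >= r:
        # Lowering the locality wait did not help within r seconds.
        return ("remove_executors", low)
    if seconds_since_change is None and any(
            seconds_below.get(e, 0.0) >= s for e in low):
        # An executor has been starved for at least s seconds.
        return ("lower_locality_wait", low)
    return ("wait", low)


counts = {"exec-1": 100, "exec-2": 100, "exec-new": 10}  # T = 70, threshold = 21
print(next_action(counts, {"exec-new": 6.0}, None))
```

Returning the set of affected executors alongside the action keeps the removal step explicit about which executors it would target.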
  • 27. Dynamically Adjusting Locality Wait Time at Runtime with Dynamic Executor Allocation
      Benchmark            Runtime (scale out w/ DEA)   % tasks on new node (locality wait time 3s)   Runtime (scale out w/ DEA and runtime adjustment of locality wait time)   % tasks on new node (dynamic locality wait time)
      KMeans               32min 15sec                  28.6%                                         20min 35sec                                                               35%
      PageRank             12min 2sec                   12.9%                                         11min 23sec                                                               14%
      Spark SQL            12min 48sec                  11.5%                                         12min 26sec                                                               16%
      LogisticRegression   11min 58sec                  0.7%                                          12min 16sec                                                               3.3%
      Parameters: alpha = 30%, s = 5sec, new locality wait time = 100ms
      Mechanism helps increase the % of tasks on new nodes
  • 28. Simple Auto-scaling Improvements
      When to scale?
        –  Existing mechanism: Backlogged tasks exist for n secs
        –  Drawbacks: Tasks might be I/O bound
        –  Proposal: Request CPU resources when CPU/memory of more than 50% of nodes is greater than a threshold t (e.g., 98%) over n seconds
      How many more VMs?
        –  Existing mechanism: Increase number of executors exponentially until configured max or # of pending tasks
        –  Drawbacks: Initially under-requests
        –  Proposal: n% of the task queue length
      Whether to scale (still beneficial)?
        –  Existing mechanism: Always scale if above condition met
        –  Drawbacks: Scaling unnecessary if near end or short runtime
        –  Proposal: Option 1: According to the ratio of unprocessed data: e.g., < 80%. Rough estimation of job execution time: proportional to the data process rate. Option 2: Model driven – predict runtime based on previous runs
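The "When to scale?" proposal on slide 28 can be sketched as a check over recent utilization samples. This is an illustrative reading under stated assumptions: a node counts as saturated only if every sample in the window exceeds the threshold, and the window of n seconds is represented simply by the list of samples passed in. The function name and data layout are hypothetical.

```python
# Sketch of the proposed trigger: scale out when more than half of the nodes
# stay above a utilization threshold t for the whole observation window.
from typing import Dict, List


def should_scale_out(utilization_by_node: Dict[str, List[float]],
                     threshold: float = 0.98,
                     node_fraction: float = 0.5) -> bool:
    """utilization_by_node maps a node to its samples (0.0-1.0) over the window."""
    saturated = sum(
        1
        for samples in utilization_by_node.values()
        # Saturated only if the node never dropped to or below the threshold.
        if samples and min(samples) > threshold
    )
    return saturated > node_fraction * len(utilization_by_node)


window = {
    "node-1": [0.99, 1.00, 0.99],
    "node-2": [0.99, 0.99, 1.00],
    "node-3": [0.40, 0.55, 0.60],
}
print(should_scale_out(window))  # True: 2 of 3 nodes are saturated
```

Unlike a backlog-only trigger, this check does not fire when tasks are queued but the nodes are idle waiting on I/O, which is the drawback the slide points out.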
  • 29. Agenda •  Introduction •  Evaluation setup •  Impact of scaling •  Auto-scaling in Spark •  Lessons and future work
  • 30. Lessons •  Naïve scaling can help but effectiveness varies greatly across different workloads •  Why do some workloads not respond well to scaling? –  Delay scheduling prevents new nodes from being utilized •  Dynamic executor allocation works but can be improved –  Dynamically changing locality wait time can be effective •  Overhead of transferring RDD partitions can reduce benefit of scaling
  • 31. Future Work •  Study scaling effects given multiple simultaneous workloads •  Implement better support for scaling down •  Enhance DEA to make use of resource monitors and job runtime prediction
  • 33. Delay Scheduling •  Intended for fixed-size clusters running multiple workloads with short tasks •  Emphasis on scheduling task on nodes containing data – Wait short time for resources to free up on nodes containing data rather than run task on node available now but further away from data
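The trade-off on slide 33 can be shown with a toy placement function. This is a deliberately simplified sketch of delay scheduling with a single locality level, not Spark's scheduler, which steps through several locality levels; the function name and arguments are hypothetical.

```python
# Toy delay scheduling: prefer a free node that holds the task's data, and only
# fall back to any free node once the task has waited at least locality_wait_s.
from typing import List, Optional, Set


def place_task(preferred_nodes: Set[str],
               free_nodes: List[str],
               waited_s: float,
               locality_wait_s: float = 3.0) -> Optional[str]:
    """Return the node to run on, or None to keep waiting."""
    local = [n for n in free_nodes if n in preferred_nodes]
    if local:
        return local[0]          # data-local slot available now
    if waited_s < locality_wait_s or not free_nodes:
        return None              # keep waiting for a data-local slot
    return free_nodes[0]         # give up on locality, use any free node


# A newly added node holds no data, so it is only used after the wait expires.
print(place_task({"base-1"}, ["new-1"], waited_s=1.0))   # None
print(place_task({"base-1"}, ["new-1"], waited_s=3.5))   # new-1
```

With `locality_wait_s=0` the new node is used immediately, which matches the intuition behind the locality wait experiments: a shorter wait places more tasks on new nodes at the cost of moving data over the network.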
  • 34. Resource Managers + Auto-scaling •  Resource management → different frameworks can coexist → high resource utilization •  Facilitates on-demand resource allocation – support elastic services •  Diagram labels: Infrastructure as a Service; Resource Manager (YARN/Mesos)
  • 35. Brief Intro: YARN •  Resource Manager (RM) controls resource allocation •  Application Master (AM) negotiates with RM for resources and launches executors to run jobs
  • 36. Brief Intro: Mesos •  Framework schedulers accept or reject offered resource •  Resource preferences are communicated to Mesos thru common APIs •  Coarse-grained mode –  Mesos launches one long-running Spark executor on each node to execute all Spark tasks •  Fine-grained mode –  Mesos launches executor for each Spark task
  • 37. Runtime – Spark on Mesos
      Benchmark            Runtime (baseline)   Runtime (scale out)
      Fine-grained
      KMeans               92min 54sec          35min 37sec
      Page Rank            19min 29sec          16min 55sec
      Spark SQL queries    24min 45sec          19min 7sec
      LogisticRegression   14min 43sec          14min 39sec
      Coarse-grained
      KMeans               89min 24sec          11min 49sec
      Page Rank            8min 29sec           7min 41sec
      Spark SQL queries    10min 32sec          8min 30sec
      LogisticRegression   10min 57sec          7min 56sec