1
SPARK
INSTRUCTOR:
DR. SHIYONG LU
BY:
SRINATH REDDY KOTU
GRADUATE STUDENT
2
Data Processing Goals
Low-latency (interactive) queries on historical data: enable faster decisions
E.g., identify why a site is slow and fix it
Low-latency queries on live data (streaming): enable decisions on real-time data
E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 seconds)
Sophisticated data processing: enable “better” decisions
E.g., anomaly detection, trend analysis
3
The Need for Unification (1/2)
Today’s state-of-the-art analytics stack (diagram): an input splitter feeds three separate systems
A batch stack (e.g., Hadoop) serving ad-hoc queries on historical data
A streaming stack (e.g., Storm) serving real-time analytics
An interactive-query stack (e.g., HBase, Impala, SQL engines) serving interactive queries on historical data
Challenges:
Need to maintain three separate stacks
Expensive and complex
Hard to compute consistent metrics across stacks
Hard and slow to share data across stacks
4
Data Processing Stack
Data Processing Layer
Resource Management Layer
Storage Layer
5
Hadoop Stack
Data Processing Layer: Hadoop MR, Hive, Pig, HBase, Storm, …
Resource Management Layer: Hadoop YARN
Storage Layer: HDFS, S3, …
6
BDAS Stack
Data Processing Layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase
Resource Management Layer: Mesos
Storage Layer: Tachyon on top of HDFS, S3, …
7
How do BDAS & Hadoop fit together?
The two stacks share the lower layers: Mesos and Hadoop YARN sit side by side in the resource management layer, over HDFS, S3, … and Tachyon in the storage layer.
On top of that shared infrastructure, the BDAS components (Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase) run alongside the Hadoop components (Hadoop MR, Hive, Pig, HBase, Storm).
8
Apache Mesos (cluster manager)
Enable multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark)
Twitter’s large-scale deployment: 6,000+ servers, 500+ engineers running jobs on Mesos
Mesosphere: a startup commercializing Mesos
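As a rough illustration of what “sharing a cluster” means in practice, here is a minimal Scala sketch of pointing a Spark application at a Mesos master instead of a dedicated Spark cluster. The master URL, the executor tarball location, and the application name are placeholders, not real endpoints.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of running Spark on a Mesos-managed cluster.
// The URLs below are hypothetical placeholders.
object SparkOnMesosSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-on-mesos-sketch")
      .setMaster("mesos://mesos-master.example.com:5050")          // hypothetical Mesos master
      .set("spark.executor.uri",                                    // where Mesos agents fetch Spark from
           "hdfs://namenode.example.com/dist/spark.tgz")            // hypothetical location

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).reduce(_ + _))                // trivial job to exercise the cluster
    sc.stop()
  }
}
```

With this setup, Mesos hands out cluster resources to Spark alongside whatever other frameworks (Hadoop, Storm, …) are registered with the same master.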
9
Apache Spark
Distributed Execution Engine
Fault-tolerant, efficient in-memory storage (RDDs)
Powerful programming model and APIs (Scala, Python, Java)
Fast: up to 100x faster than Hadoop MapReduce
Easy to use: 5-10x less code than Hadoop MapReduce
General: support interactive & iterative apps
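A minimal sketch of the RDD model behind these claims: a file is loaded once, kept in memory with cache(), and then reused by several queries without rereading it from disk. The log path and the filter strings are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the RDD API: lazy transformations over an in-memory dataset.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

    val lines = sc.textFile("hdfs:///logs/access.log").cache()      // hypothetical path; cached in memory as an RDD

    val errors = lines.filter(_.contains("ERROR"))                   // lazily defined transformation
    println("error lines: " + errors.count())                        // first action scans the file and fills the cache
    println("timeouts:    " + errors.filter(_.contains("timeout")).count()) // second action reuses the cached data

    sc.stop()
  }
}
```

The second count() reuses the cached partitions rather than rereading the file, which is where the interactive and iterative speedups come from.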
10
Spark Streaming
Large scale streaming computation
Implement streaming as a sequence of <1s jobs
Fault tolerant
Handle stragglers
Ensure exactly-once semantics
Integrated with Spark: unifies batch, interactive, and streaming computations
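A minimal sketch of the micro-batch idea, modeled on the standard network word-count example: the stream is chopped into one-second batches and each batch is processed as a small Spark job. The host and port are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch of Spark Streaming's micro-batch model: one small job per second.
object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))                 // batch interval of one second

    val lines = ssc.socketTextStream("localhost", 9999)              // hypothetical text source
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                            // per-batch word counts
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because each one-second batch is an ordinary Spark job, failed or straggling tasks can simply be re-run, which is how the fault-tolerance and straggler points above are handled.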
11
Shark
Hive over Spark: full support for HQL and UDFs
Up to 100x faster than Hive when the input is in memory
Up to 5-10x faster when the input is on disk
Running on hundreds of nodes at Yahoo!
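A sketch of what driving Shark from Scala might look like; the SharkEnv/SharkContext entry points used here are assumptions based on Shark’s described design (HiveQL executed on Spark), not verified signatures, and the table and query are invented.

```scala
// Hypothetical sketch -- the shark.* entry points below are assumptions, not
// verified API; only the HiveQL strings reflect what the slide claims
// (unmodified HQL and UDFs running on Spark).
import shark.{SharkContext, SharkEnv}                                // assumed package and classes

object SharkSketch {
  def main(args: Array[String]): Unit = {
    // Assumed initializer returning a Hive-compatible context backed by Spark.
    val sc: SharkContext = SharkEnv.initWithSharkContext("shark-sketch")

    // Plain HiveQL; the "_cached" suffix marking an in-memory table is also an assumption.
    sc.sql("CREATE TABLE logs_cached AS SELECT * FROM logs")
    sc.sql("SELECT status, COUNT(*) FROM logs_cached GROUP BY status").foreach(println)
  }
}
```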
12
BlinkDB
Trade off query performance against accuracy using sampling
Why?
In-memory processing doesn’t guarantee interactive response times
E.g., it takes on the order of 10 seconds just to scan 512 GB of RAM!
The gap between memory capacity and memory transfer rate keeps increasing
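The following is a conceptual sketch of the sampling idea, not BlinkDB’s actual query interface: answer an aggregate over a small random sample and scale the result up, trading bounded accuracy for a much shorter scan. The dataset path, the filter, and the 1% sampling fraction are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Conceptual sketch of approximate answering via sampling (not BlinkDB's API).
object SamplingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sampling-sketch"))
    val events = sc.textFile("hdfs:///events/clicks.tsv")            // hypothetical dataset

    val fraction = 0.01                                              // scan ~1% of the data
    val sampleHits = events
      .sample(withReplacement = false, fraction, seed = 42L)         // uniform random sample
      .filter(_.contains("purchase"))
      .count()

    val estimate = sampleHits / fraction                             // scale the sample count back up
    println(s"estimated purchases: ~$estimate (from a $fraction sample)")
    sc.stop()
  }
}
```

BlinkDB builds pre-computed stratified samples and attaches error bars to answers; the point here is only that scanning 1% of the data is roughly 100x less work than scanning all of it.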
13
GraphX
Combine data-parallel and graph-parallel computations
Provide powerful abstractions:
PowerGraph and Pregel implemented in less than 20 LOC!
Leverage Spark’s fault tolerance
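A minimal sketch of mixing data-parallel and graph-parallel code in GraphX: the vertex and edge RDDs are built like any other RDDs, then a graph-parallel algorithm (PageRank) runs over them. The tiny three-node graph is invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Minimal sketch of GraphX: RDDs in, graph-parallel computation out.
object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch"))

    // Data-parallel: vertex and edge collections are ordinary RDDs.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph    = Graph(vertices, edges)

    // Graph-parallel: iterate PageRank until ranks change by less than the tolerance.
    val ranks = graph.pageRank(0.0001).vertices
    ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s rank=$rank%.3f")
    }
    sc.stop()
  }
}
```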
14
MLlib and MLbase
MLlib: high-quality library of ML algorithms
MLbase: make ML accessible to non-experts
Declarative interface: allow users to say what they want
E.g., classify(data)
Automatically pick the best algorithm for the given data and time budget
Allow developers to easily add and test new algorithms
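A minimal sketch of the MLlib side (calling a library algorithm directly); the MLbase-style declarative classify(data) interface is not shown. The input path, the whitespace-separated feature format, and the choice of k-means with k = 10 are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Minimal sketch of invoking an MLlib algorithm on an RDD of feature vectors.
object MllibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch"))

    // Parse whitespace-separated numeric features into MLlib vectors.
    val points = sc.textFile("hdfs:///data/features.txt")            // hypothetical path
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    val model = KMeans.train(points, 10, 20)                         // 10 clusters, 20 iterations
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```

MLbase’s goal is to make this choice for the user: given the data and a time budget, it would select the algorithm and its parameters automatically.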
15
Tachyon
In-memory, fault-tolerant storage system
Flexible API, including HDFS API
Allow multiple frameworks (including Hadoop) to share in-memory data
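A minimal sketch of sharing data through Tachyon from Spark, relying on the HDFS-compatible API the slide mentions: a tachyon:// path is used wherever an hdfs:// path would go. The master host, port, file layout, and paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of reading from and writing to Tachyon via its HDFS-style paths.
object TachyonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-sketch"))

    // Hypothetical Tachyon master and dataset; assumes a CSV with a rating in column 3.
    val ratings = sc.textFile("tachyon://tachyon-master.example.com:19998/datasets/ratings.csv")
    val highRated = ratings.filter(_.split(",")(2).toDouble >= 4.0)

    // Written back to Tachyon, the result stays in memory and is visible to other
    // frameworks (e.g., a Hadoop MapReduce job) without a round trip through disk.
    highRated.saveAsTextFile("tachyon://tachyon-master.example.com:19998/datasets/high_rated")
    sc.stop()
  }
}
```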
16
Thank You

Editor's Notes

1. So what does this mean? Well, it means that we want low response times on historical data, since the faster we can make a decision, the better. We want the ability to run queries on live data, since decisions based on real-time data are better than decisions based on stale data. Finally, we want to perform sophisticated processing on massive data because, in principle, processing more data leads to better decisions.