Spark
- 2.
Data Processing Goals
Low latency (interactive) queries on historical data: enable faster decisions
E.g., identify why a site is slow and fix it
Low latency queries on live data (streaming): enable decisions on real-time data
E.g., detect & block worms in real time (a worm may infect 1M hosts in 1.3 s)
Sophisticated data processing: enable “better” decisions
E.g., anomaly detection, trend analysis
- 3.
The Need for Unification (1/2)
Today’s state-of-the-art analytics stack (diagram): input flows through a splitter into three separate stacks:
Batch stack (e.g., Hadoop): ad-hoc queries on historical data
Streaming stack (e.g., Storm): real-time analytics
Interactive-query stack (e.g., HBase, Impala, SQL): interactive queries on historical data
Challenges:
Need to maintain three separate stacks
Expensive and complex
Hard to compute consistent metrics across stacks
Hard and slow to share data across stacks
- 6.
BDAS Stack
Data processing layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase
Resource management layer: Mesos
Storage layer: Tachyon, over HDFS, S3, …
- 7.
How do BDAS & Hadoop fit together?
On Mesos: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase, over Tachyon and HDFS, S3, …
On Hadoop YARN: the same BDAS components (Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, ML library, MLbase) run alongside Hadoop MR, Hive, Pig, HBase, and Storm
- 8.
Apache Mesos (cluster manager)
Enable multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark)
Twitter’s large-scale deployment:
6,000+ servers
500+ engineers running jobs on Mesos
Mesosphere: startup to commercialize Mesos
- 9.
Apache Spark
Distributed execution engine
Fault-tolerant, efficient in-memory storage (RDDs)
Powerful programming model and APIs (Scala, Python, Java)
Fast: up to 100x faster than Hadoop
Easy to use: 5-10x less code than Hadoop
General: supports interactive & iterative apps
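The conciseness claim above is easiest to see with word count, Spark's canonical example. The following is a pure-Python sketch of the RDD operation chain (flatMap → map → reduceByKey), not actual Spark code; the real program would be a similar one-liner on `sc.textFile(...)`.

```python
from collections import Counter

# Sketch of Spark's word count: each step mirrors one RDD transformation.
lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]
# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = Counter()
for w, n in pairs:
    counts[w] += n

print(counts["to"])  # 4
```

In Spark the same chain runs distributed across the cluster, with the RDD lineage providing fault tolerance: lost partitions are recomputed from the transformations that produced them.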
- 10.
Spark Streaming
Large-scale streaming computation
Implements streaming as a sequence of <1 s batch jobs
Fault tolerant
Handles stragglers
Ensures exactly-once semantics
Integrated with Spark: unifies batch, interactive, and streaming computations
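The "sequence of <1 s jobs" idea above can be sketched in pure Python (this is an illustration of the micro-batch model, not Spark Streaming's API): the live stream is chopped into short intervals, and each interval is processed by an ordinary batch computation.

```python
from collections import Counter

def word_count(batch):
    """The per-batch job: a plain batch computation, reused unchanged."""
    c = Counter()
    for record in batch:
        c.update(record.split())
    return c

# Each inner list stands for the records that arrived in one <1 s interval.
stream = [
    ["a b", "a c"],   # interval 1
    ["b b"],          # interval 2
]

totals = Counter()
for batch in stream:
    totals += word_count(batch)   # each interval is just another batch job

print(totals["b"])  # 3
```

Because each micro-batch is a deterministic batch job over a known slice of input, a failed or straggling batch can simply be recomputed, which is how the fault-tolerance and exactly-once properties fall out of the batch machinery.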
- 11.
Shark
Hive over Spark: full support for HQL and UDFs
Up to 100x faster when input is in memory
Up to 5-10x faster when input is on disk
Running on hundreds of nodes at Yahoo!
- 12.
BlinkDB
Trade off query performance against accuracy using sampling
Why? In-memory processing doesn’t guarantee interactive response times
E.g., ~10s of seconds just to scan 512 GB of RAM!
The gap between memory capacity and transfer rate keeps increasing
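The core trade-off can be shown with a toy aggregate query: answer it on a small random sample instead of the full column, scanning ~1% of the data for an answer that is close to exact. This is a pure-Python sketch of the idea, not BlinkDB's implementation; the data and sizes are made up.

```python
import random

random.seed(0)
# A "full" numeric column of one million rows.
table = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Exact query: AVG over the whole column (expensive full scan).
exact = sum(table) / len(table)

# Approximate query: AVG over a 1% uniform random sample.
sample = random.sample(table, 10_000)
approx = sum(sample) / len(sample)

# The sampled answer lands within a small error at ~1% of the scan cost.
assert abs(approx - exact) < 1.0
```

BlinkDB goes further than this sketch: it maintains pre-built stratified samples and attaches error bars to each answer, letting the user bound either the response time or the error.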
- 13.
GraphX
Combines data-parallel and graph-parallel computations
Provides powerful abstractions: PowerGraph and Pregel implemented in less than 20 LOC!
Leverages Spark’s fault tolerance
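To make the Pregel abstraction mentioned above concrete, here is a minimal pure-Python sketch of its "think like a vertex" loop (not GraphX code): in each superstep, every vertex receives messages from its neighbors, merges them into its value, and the computation stops when no value changes. The example propagates the maximum vertex value through a toy graph.

```python
# Toy graph: adjacency lists and initial per-vertex values.
edges = {1: [2], 2: [1, 3], 3: [2]}
value = {1: 5, 2: 1, 3: 9}

changed = True
while changed:                          # one iteration = one superstep
    changed = False
    # Each vertex sends its current value to its neighbors.
    inbox = {v: [] for v in value}
    for src, dsts in edges.items():
        for dst in dsts:
            inbox[dst].append(value[src])
    # Each vertex merges incoming messages (here: take the max).
    for v, msgs in inbox.items():
        new = max([value[v]] + msgs)
        if new != value[v]:
            value[v] = new
            changed = True

print(value)  # every vertex converges to the global max: {1: 9, 2: 9, 3: 9}
```

GraphX expresses exactly this pattern as a handful of operations over Spark RDDs, which is why Pregel fits in so few lines and inherits Spark's lineage-based fault tolerance for free.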
- 14.
MLlib and MLbase
MLlib: high-quality library of ML algorithms
MLbase: makes ML accessible to non-experts
Declarative interface: allows users to say what they want, e.g., classify(data)
Automatically picks the best algorithm for the given data and time budget
Allows developers to easily add and test new algorithms
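The declarative `classify(data)` call above can be sketched in a few lines of pure Python: the user states *what* they want, and the system picks the best candidate algorithm by held-out accuracy. Everything here (the two toy classifiers, the split rule) is invented for illustration and is not MLbase's actual search procedure.

```python
def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

def threshold_classifier(train):
    """Predict 1 when the feature exceeds the midpoint of the class means."""
    mean = lambda xs: sum(xs) / len(xs)
    m0 = mean([x for x, y in train if y == 0])
    m1 = mean([x for x, y in train if y == 1])
    cut = (m0 + m1) / 2
    return lambda x: 1 if x > cut else 0

def classify(data):
    """Declarative entry point: try candidate algorithms, keep the best."""
    split = len(data) * 2 // 3
    train, held_out = data[:split], data[split:]
    def accuracy(model):
        return sum(model(x) == y for x, y in held_out) / len(held_out)
    candidates = [majority_classifier(train), threshold_classifier(train)]
    return max(candidates, key=accuracy)

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.9, 1), (1.0, 1), (1.1, 1)]
model = classify(data)
print(model(1.05))  # 1
```

The point of the sketch is the division of labor: adding a new algorithm just means adding another candidate, and the user-facing call never changes.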
Editor's Notes
- So what does this mean? Well, it means we want low response time on historical data, since the faster we can make a decision, the better. We want the ability to perform queries on live data, since decisions on real-time data are better than decisions on stale data. Finally, we want to perform sophisticated processing on massive data since, in principle, processing more data leads to better decisions.