Spark
- 2.
Data Processing Goals
Low latency (interactive) queries on historical data: enable faster decisions
E.g., identify why a site is slow and fix it
Low latency queries on live data (streaming): enable decisions on real-time data
E.g., detect & block worms in real time (a worm may infect 1M hosts in 1.3 s)
Sophisticated data processing: enable “better” decisions
E.g., anomaly detection, trend analysis
- 3.
The Need for Unification (1/2)
Today’s state-of-the-art analytics stack (diagram): input flows through a splitter into three separate stacks:
Batch stack (e.g., Hadoop): ad-hoc queries on historical data
Streaming stack (e.g., Storm): real-time analytics
Interactive-query stack (e.g., HBase, Impala, SQL): interactive queries on historical data
Challenges:
Need to maintain three separate stacks
Expensive and complex
Hard to compute consistent metrics across stacks
Hard and slow to share data across stacks
- 6.
BDAS Stack
Data processing layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase
Resource management layer: Mesos
Storage layer: Tachyon, over HDFS, S3, …
- 7.
How do BDAS & Hadoop fit together?
On Mesos: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase, over Tachyon and HDFS, S3, …
On Hadoop YARN: the same BDAS components (Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, ML library, MLbase) run alongside Hadoop MR, Hive, Pig, HBase, and Storm
- 8.
Apache Mesos (cluster manager)
Enable multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark)
Twitter’s large-scale deployment:
6,000+ servers
500+ engineers running jobs on Mesos
Mesosphere: startup to commercialize Mesos
- 9.
Apache Spark
Distributed execution engine
Fault-tolerant, efficient in-memory storage (RDDs)
Powerful programming model and APIs (Scala, Python, Java)
Fast: up to 100x faster than Hadoop
Easy to use: 5-10x less code than Hadoop
General: supports interactive & iterative apps
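The conciseness claim above is easiest to see with word count, Spark's canonical example. The following is a pure-Python sketch of the RDD operation chain (flatMap → map → reduceByKey), not actual Spark code; the real program would be a similar one-liner on `sc.textFile(...)`.

```python
from collections import Counter

# Sketch of Spark's word count: each step mirrors one RDD transformation.
lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]
# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = Counter()
for w, n in pairs:
    counts[w] += n

print(counts["to"])  # 4
```

In Spark the same chain runs distributed across the cluster, with the RDD lineage providing fault tolerance: lost partitions are recomputed from the transformations that produced them.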
- 10.
Spark Streaming
Large-scale streaming computation
Implements streaming as a sequence of <1 s batch jobs
Fault tolerant
Handles stragglers
Ensures exactly-once semantics
Integrated with Spark: unifies batch, interactive, and streaming computations
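The "sequence of <1 s jobs" idea above can be sketched in pure Python (this is an illustration of the micro-batch model, not Spark Streaming's API): the live stream is chopped into short intervals, and each interval is processed by an ordinary batch computation.

```python
from collections import Counter

def word_count(batch):
    """The per-batch job: a plain batch computation, reused unchanged."""
    c = Counter()
    for record in batch:
        c.update(record.split())
    return c

# Each inner list stands for the records that arrived in one <1 s interval.
stream = [
    ["a b", "a c"],   # interval 1
    ["b b"],          # interval 2
]

totals = Counter()
for batch in stream:
    totals += word_count(batch)   # each interval is just another batch job

print(totals["b"])  # 3
```

Because each micro-batch is a deterministic batch job over a known slice of input, a failed or straggling batch can simply be recomputed, which is how the fault-tolerance and exactly-once properties fall out of the batch machinery.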
- 11.
Shark
Hive over Spark: full support for HQL and UDFs
Up to 100x faster when input is in memory
Up to 5-10x faster when input is on disk
Running on hundreds of nodes at Yahoo!
- 12.
BlinkDB
Trade off query performance against accuracy using sampling
Why? In-memory processing doesn’t guarantee interactive response times
E.g., ~10s of seconds just to scan 512 GB of RAM!
The gap between memory capacity and transfer rate keeps increasing
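The core trade-off can be shown with a toy aggregate query: answer it on a small random sample instead of the full column, scanning ~1% of the data for an answer that is close to exact. This is a pure-Python sketch of the idea, not BlinkDB's implementation; the data and sizes are made up.

```python
import random

random.seed(0)
# A "full" numeric column of one million rows.
table = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Exact query: AVG over the whole column (expensive full scan).
exact = sum(table) / len(table)

# Approximate query: AVG over a 1% uniform random sample.
sample = random.sample(table, 10_000)
approx = sum(sample) / len(sample)

# The sampled answer lands within a small error at ~1% of the scan cost.
assert abs(approx - exact) < 1.0
```

BlinkDB goes further than this sketch: it maintains pre-built stratified samples and attaches error bars to each answer, letting the user bound either the response time or the error.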
- 13.
GraphX
Combines data-parallel and graph-parallel computations
Provides powerful abstractions: PowerGraph and Pregel implemented in less than 20 LOC!
Leverages Spark’s fault tolerance
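To make the Pregel abstraction mentioned above concrete, here is a minimal pure-Python sketch of its "think like a vertex" loop (not GraphX code): in each superstep, every vertex receives messages from its neighbors, merges them into its value, and the computation stops when no value changes. The example propagates the maximum vertex value through a toy graph.

```python
# Toy graph: adjacency lists and initial per-vertex values.
edges = {1: [2], 2: [1, 3], 3: [2]}
value = {1: 5, 2: 1, 3: 9}

changed = True
while changed:                          # one iteration = one superstep
    changed = False
    # Each vertex sends its current value to its neighbors.
    inbox = {v: [] for v in value}
    for src, dsts in edges.items():
        for dst in dsts:
            inbox[dst].append(value[src])
    # Each vertex merges incoming messages (here: take the max).
    for v, msgs in inbox.items():
        new = max([value[v]] + msgs)
        if new != value[v]:
            value[v] = new
            changed = True

print(value)  # every vertex converges to the global max: {1: 9, 2: 9, 3: 9}
```

GraphX expresses exactly this pattern as a handful of operations over Spark RDDs, which is why Pregel fits in so few lines and inherits Spark's lineage-based fault tolerance for free.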
- 14.
MLlib and MLbase
MLlib: high-quality library of ML algorithms
MLbase: makes ML accessible to non-experts
Declarative interface: allows users to say what they want, e.g., classify(data)
Automatically picks the best algorithm for the given data and time budget
Allows developers to easily add and test new algorithms
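The declarative `classify(data)` call above can be sketched in a few lines of pure Python: the user states *what* they want, and the system picks the best candidate algorithm by held-out accuracy. Everything here (the two toy classifiers, the split rule) is invented for illustration and is not MLbase's actual search procedure.

```python
def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

def threshold_classifier(train):
    """Predict 1 when the feature exceeds the midpoint of the class means."""
    mean = lambda xs: sum(xs) / len(xs)
    m0 = mean([x for x, y in train if y == 0])
    m1 = mean([x for x, y in train if y == 1])
    cut = (m0 + m1) / 2
    return lambda x: 1 if x > cut else 0

def classify(data):
    """Declarative entry point: try candidate algorithms, keep the best."""
    split = len(data) * 2 // 3
    train, held_out = data[:split], data[split:]
    def accuracy(model):
        return sum(model(x) == y for x, y in held_out) / len(held_out)
    candidates = [majority_classifier(train), threshold_classifier(train)]
    return max(candidates, key=accuracy)

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.9, 1), (1.0, 1), (1.1, 1)]
model = classify(data)
print(model(1.05))  # 1
```

The point of the sketch is the division of labor: adding a new algorithm just means adding another candidate, and the user-facing call never changes.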
Editor's Notes
- So what does this mean? Well, it means we want low response time on historical data, since the faster we can make a decision, the better. We want the ability to perform queries on live data, since decisions on real-time data are better than decisions on stale data. Finally, we want to perform sophisticated processing on massive data since, in principle, processing more data leads to better decisions.