Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent Real-time Decisions | AWS Public Sector Summit 2017

Big Data in the Cloud: How the RISELab Enables
Computers to Make Intelligent Real-time Decisions
Ion Stoica,
RISELab Director, UC Berkeley
Executive Chairman, Databricks

Who am I?
RISELab director (2017-), UC Berkeley
• Co-director, AMPLab (2011-2016)
Co-founder & Executive Chairman, Databricks
• Unified Analytics Platform based on Apache Spark
• Hosted service in AWS
Co-founder & CTO, Conviva

Follows AMPLab...
AMPLab (2011-2016)
• Mission: “Make sense of big data”
• 8+ faculty, 60+ students
Government and Industry support:
[Diagram: Algorithms, Machines, People]

AMPLab impact
Industry:
• Three startups raising over $200M so far
• [Logo captions: 1,000s customers, 100K’s users; 100s customers; …; 100s customers]
Academic:
• Faculty at MIT, Stanford, CMU, Cornell, U Michigan, …
• Three ACM Dissertation Awards

Leveraged AWS from day one
Systems developed, tested, and deployed on AWS
• AWS first cloud to host Apache Spark (Feb, 2013)

Leveraged AWS from day one
First Terasort Apache Spark record (2014)
• 2013 Record: Hadoop, 2100 machines, 72 minutes
• 2014 Record: Spark, 207 machines, 23 minutes
Also sorted 1PB in 4 hours (bottlenecked by network)
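To put the two Terasort records side by side, the arithmetic below uses only the machine counts and run times quoted on the slide. "Machine-minutes" is an illustrative metric introduced here, not one reported in the talk, and the comparison assumes both runs sorted a comparable amount of data, which the slide does not state.

```python
# Figures are taken from the slide; machine-minutes is an illustrative metric.
hadoop_2013 = {"machines": 2100, "minutes": 72}
spark_2014 = {"machines": 207, "minutes": 23}


def machine_minutes(run):
    """Total machine time consumed by one sort run."""
    return run["machines"] * run["minutes"]


speedup = hadoop_2013["minutes"] / spark_2014["minutes"]
machine_reduction = hadoop_2013["machines"] / spark_2014["machines"]
resource_ratio = machine_minutes(hadoop_2013) / machine_minutes(spark_2014)

print(f"wall-clock speedup: {speedup:.1f}x")               # 3.1x
print(f"machine reduction: {machine_reduction:.1f}x")      # 10.1x
print(f"machine-minutes ratio: {resource_ratio:.1f}x")     # 31.8x
```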

From batch data to advanced analytics
AMPLab
From live data to real-time decisions
RISELab

Why?
Data only as valuable as the decisions (actions) it enables
What is a good decision?
• Faster decisions better than slower decisions
• Decisions on fresh data better than decisions on stale data
• Decisions on personalized data better than on generic data

What do we want?
Real-time decisions: decide in ms
on live data: current state of environment
with strong security: privacy, confidentiality, integrity

Typical decision system
[Diagram: the Environment (plus sensors & actuators) sends Observations and Feedback to the Decision System, which preprocesses them into intermediate data for the Decision Engine; a Query to the Decision System returns a Decision]
• Live: update latency (e.g., ~1 second)
• Real-time: decision latency (e.g., ~10 ms)
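The split between update latency and decision latency can be sketched as two separate code paths. This is a minimal toy, not the RISELab design: the class name, the running-mean "intermediate data", and the act/wait rule are all hypothetical.

```python
import time


class DecisionSystem:
    """Toy decision system with a slow update path and a fast query path."""

    def __init__(self):
        # Intermediate data: a running sum and count of observations.
        self.total = 0.0
        self.count = 0

    def update(self, observation):
        # Update path ("live"): fold a new observation into intermediate data.
        self.total += observation
        self.count += 1

    def decide(self, query):
        # Query path ("real-time"): read precomputed state only.
        mean = self.total / self.count if self.count else 0.0
        return "act" if query > mean else "wait"


def timed(fn, *args):
    """Return (result, elapsed milliseconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


system = DecisionSystem()
for obs in [1.0, 2.0, 3.0]:
    system.update(obs)

decision, latency_ms = timed(system.decide, 2.5)
print(decision, f"{latency_ms:.3f} ms")
```

Keeping the query path free of heavy work is what lets the decision latency stay far below the update latency.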

Example of decision systems
Reinforcement Learning Systems
• [Diagram: Observations and Rewards from the environment are used to Update Policy; Query Policy maps an observation (obs) to an action]
ML Pipeline
• [Diagram: Feedback drives Training, which produces Models (different complexity/accuracy tradeoffs); Model Serving answers a Query with an Action]

What else do we want from decisions?
Intelligent: complex decisions in uncertain environments
Robust: handle complex noise, failures, unforeseen inputs
• Need ability to say “I don’t know!”
Explainable: ability to explain non-obvious decisions
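One simple way to let a decision system say “I don’t know!” is to abstain when its confidence is low. The sketch below assumes the model exposes per-label probabilities; the function name and the 0.8 threshold are hypothetical, not values from the talk.

```python
def decide_or_abstain(probabilities, threshold=0.8):
    """Return the most likely label, or None when confidence is too low.

    `probabilities` maps label -> probability. `threshold` is a hypothetical
    tuning knob chosen for illustration.
    """
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return None  # the "I don't know" answer
    return label


print(decide_or_abstain({"stop": 0.95, "go": 0.05}))  # stop
print(decide_or_abstain({"stop": 0.55, "go": 0.45}))  # None
```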

RISELab goal
Develop open source platforms, tools, and
algorithms for intelligent real-time decisions on live data

Secure Real-time Decision Stack (SRDS)
Open source platform to develop RISE-like apps
Targets emerging AI applications
• Hyperparameter search
• Reinforcement Learning (RL)
Secure from ground up

Secure Real-time Decision Stack (SRDS)
[Stack diagram components: RISE μkernel (scheduler, object store); frameworks Ray, PyWren, …, Clipper, Spark/Opaque; Ground (data context service); optimizer; Time Machine]
SRDS: Microkernel
[Stack diagram repeated: RISE μkernel; Ray, PyWren, …; Ground (metadata manager); Clipper; optimizer; Spark/Opaque; Time Machine]
Minimalist execution engine:
• Support both data flow and task-parallel execution models
• High-throughput, low-latency scheduler

SRDS: Ground
Central repository for models, APIs to capture the
context in which data gets used and produced

SRDS: Time machine
Replaying of apps at fine granularity
• Simplify development, debugging
• Robustness: replay against perturbed inputs
• Explainability: identify inputs causing decision
• Security: confirm vulnerabilities, test security
patches, compliance auditing
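The replay idea can be illustrated with a recorded input log and a deterministic decision function. This is only a sketch under those assumptions: the `decide` rule, the field names, and the perturbation values are hypothetical and do not reflect the actual Time Machine design.

```python
def decide(inputs):
    """Hypothetical deterministic decision function under test."""
    if inputs["distance_m"] < 10 or inputs["speed_mps"] > 30:
        return "brake"
    return "cruise"


def replay(decision_fn, log):
    """Re-run a recorded input log and return the decisions in order."""
    return [decision_fn(record) for record in log]


def influential_inputs(decision_fn, record, perturbations):
    """Names of inputs whose perturbation flips the recorded decision."""
    baseline = decision_fn(record)
    flipped = []
    for name, new_value in perturbations.items():
        perturbed = dict(record, **{name: new_value})
        if decision_fn(perturbed) != baseline:
            flipped.append(name)
    return flipped


log = [{"distance_m": 5, "speed_mps": 20}, {"distance_m": 50, "speed_mps": 20}]
print(replay(decide, log))  # ['brake', 'cruise']
print(influential_inputs(decide, log[0], {"distance_m": 50, "speed_mps": 25}))  # ['distance_m']
```

Replaying against perturbed inputs serves both the robustness and the explainability bullets: the same loop shows whether a decision is stable and which input caused it.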

SRDS: Application frameworks
Computation frameworks to simplify development of RISE apps
• Ray: task-parallel framework to support RL workloads
• PyWren: serverless framework for AI applications
• Clipper: model serving supporting ensembles & cascading models
• Opaque: secure SparkSQL

PyWren
Parallel computing for non-CS people!
Cloud computing still hard
• What type? What instance? What base image?
• How many to spin up? What price? Spot?
• Devops?

Requirements
Minimal setup overhead
• No need to set up clusters
Minimal learning curve
• Anyone who can write Python should be able to invoke it
Target minute-long jobs

PyWren
Wren: small short-winged songbird found chiefly in the New World
• Most wrens are small and rather inconspicuous, except for their
loud and often complex songs.

AWS Lambda: Serverless computation
Micro-instances
• 300 seconds single-core
• 1.5GB RAM
• Python, Java, Node
No need for provisioning or managing servers
Billing is metered in increments of 100 milliseconds
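The effect of the 100 ms billing increment on a short job can be worked out directly. The sketch below uses only the two limits quoted on the slide; the function name is hypothetical, and no price is computed because the slide does not give one.

```python
import math

BILLING_INCREMENT_MS = 100  # from the slide
MAX_DURATION_S = 300        # from the slide


def billed_ms(duration_ms):
    """Round a run time up to the next billing increment."""
    if duration_ms > MAX_DURATION_S * 1000:
        raise ValueError("exceeds the 300-second limit quoted on the slide")
    return math.ceil(duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


print(billed_ms(101))   # 200
print(billed_ms(2500))  # 2500
```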

How does it work?
your laptop:
• future = runner.map(fn, data)
• Serialize func and data
• Put on Amazon S3
• Invoke AWS Lambda
the cloud:
• pull job from S3
• download anaconda runtime
• python to run code
• pickle result
• stick in S3
your laptop:
• future.result()
• poll S3
• unpickle and return result
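The flow above can be mimicked locally to make the steps concrete. This is a sketch, not PyWren's implementation: a dict stands in for Amazon S3, a direct function call stands in for an AWS Lambda invocation, and `marshal` stands in for PyWren's function serialization, so it only handles plain functions without closures. The names `store`, `worker`, `Future`, and `run_map` are hypothetical.

```python
import builtins
import marshal
import pickle
import types

store = {}  # in-memory stand-in for Amazon S3


def worker(job_key):
    """Stand-in for one AWS Lambda invocation."""
    code_bytes, data_bytes = store[job_key]                 # pull job from storage
    code = marshal.loads(code_bytes)
    fn = types.FunctionType(code, {"__builtins__": builtins})
    result = fn(pickle.loads(data_bytes))                   # run the code
    store[job_key + "/result"] = pickle.dumps(result)       # pickle result, store it


class Future:
    """Handle for a result that will appear in the store."""

    def __init__(self, job_key):
        self.job_key = job_key

    def result(self):
        # A real client would poll storage until the result object appears.
        return pickle.loads(store[self.job_key + "/result"])


def run_map(fn, data):
    """Serialize fn and each data item, 'invoke' a worker, return futures."""
    futures = []
    for i, item in enumerate(data):
        key = f"job-{i}"
        store[key] = (marshal.dumps(fn.__code__), pickle.dumps(item))
        worker(key)  # a real system would invoke this remotely and in parallel
        futures.append(Future(key))
    return futures


def square(x):
    return x * x


futures = run_map(square, [1, 2, 3])
print([f.result() for f in futures])  # [1, 4, 9]
```

Because the worker sees only bytes from the store, the function and its data must be fully serializable, which is the constraint the real system also has to satisfy.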

Scalability
[Figure panels: Compute, Data]
Near linear scalability
45 TFLOPS

Ray + Microkernel

Ray
Targets Reinforcement Learning (RL) applications
Currently includes μkernel functionality
[Diagram: reinforcement learning decision system, as shown earlier: Observations and Rewards are used to Update Policy; Query Policy maps an observation (obs) to an action]

Reinforcement Learning requirements
Process inputs from different sensors in parallel & real-time
Execute large number of rollouts (simulations)
Rollout outcomes are used to update policy (e.g., SGD)
[Diagram: repeated rounds of rollouts, each round followed by an Update policy step]
• Heterogeneous durations, dynamic execution graph
• 100s of millions of rollouts, each rollout as little as a few msec
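The rollout-then-update loop can be sketched with the standard library. This is a toy under stated assumptions, not Ray code: a thread pool stands in for a distributed task scheduler, the "policy" is a single number, and averaging the best candidates stands in for the SGD update the slide mentions. `TARGET`, `rollout`, `update_policy`, and `train` are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from functools import partial

TARGET = 3.0  # hypothetical optimum the toy policy should reach


def rollout(theta, seed):
    """Simulate one episode: try a perturbed parameter, report its reward."""
    rng = random.Random(seed)
    candidate = theta + rng.gauss(0.0, 0.5)
    return candidate, -(candidate - TARGET) ** 2


def update_policy(results, keep=4):
    """Move the policy to the mean of the best-performing candidates."""
    best = sorted(results, key=lambda r: r[1], reverse=True)[:keep]
    return sum(candidate for candidate, _ in best) / len(best)


def train(steps=20, rollouts_per_step=16, workers=4):
    theta = 0.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for step in range(steps):
            seeds = range(step * rollouts_per_step, (step + 1) * rollouts_per_step)
            # Run one round of rollouts concurrently, then update the policy.
            results = list(pool.map(partial(rollout, theta), seeds))
            theta = update_policy(results)
    return theta


print(train())
```

Each round must finish before the policy is updated, which is why many short tasks with uneven durations put pressure on scheduler throughput and latency.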

RL requirements
Policies are often implemented by DNNs
[Diagram: a DNN maps observations to actions]

Ray goals
Flexibility
• Combine neural networks, planning, search, simulation, etc
• Heterogeneous tasks: CPUs/GPUs, durations, computation
• Fine-grained data and task dependencies, dynamic execution
Performance
• Millions of tasks per second with msec level latencies
• Adapt to changing work in real-time
Ease of use
• Minimal changes to parallelize existing Python serial code

Ray architecture
[Diagram: a Driver and several Workers spread across three Nodes, each Node with its own Object store and Local scheduler; a Global control store holding the Object table, Task table, and Function table; a Global scheduler; Web UI, Debugging tools, and Error diagnosis]

Experiments
Tested and deployed on AWS
Latency
• Local task execution: ~300 usec
• Remote task execution: ~1ms
• Object store R/W throughput: 6GB/s for large objects

Ray scales to 1 million tasks/second

Robustness to node failures
Key feature to support spot instances

Ray Speeds up Rollouts for Policy Gradients
[Diagram: three execution schedules of Simulate Env and Evaluate Policy steps]
• Parallel rollouts on CPU: speedup 1.0x
• Policy evaluation on GPU: 1.3x
• Fine-grained rollouts (Simulate Env on CPUs, batched Evaluate Policy on GPU): 4.1x
Leverage P2 (GPU) instances
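Why batching policy evaluation helps can be seen by counting evaluation calls. The toy below is an assumption-laden sketch: the policy, the environment, and the call counter are hypothetical, and the call count is only a proxy for accelerator launches; it does not reproduce the speedups quoted on the slide.

```python
class CountingPolicy:
    """Toy policy that counts how many evaluation calls it receives."""

    def __init__(self):
        self.calls = 0

    def evaluate(self, observations):
        self.calls += 1  # one call per batch, e.g. one accelerator launch
        return [1 if obs < 5 else 0 for obs in observations]


def step_env(state, action):
    """Toy environment: the state advances by the chosen action."""
    return state + action


def run_unbatched(policy, states, steps):
    """Evaluate the policy separately for every environment at every step."""
    for _ in range(steps):
        states = [step_env(s, policy.evaluate([s])[0]) for s in states]
    return states


def run_batched(policy, states, steps):
    """Evaluate the policy once per step on all environments together."""
    for _ in range(steps):
        actions = policy.evaluate(states)
        states = [step_env(s, a) for s, a in zip(states, actions)]
    return states


unbatched, batched = CountingPolicy(), CountingPolicy()
final_a = run_unbatched(unbatched, [0, 2, 4, 6], 3)
final_b = run_batched(batched, [0, 2, 4, 6], 3)
print(final_a == final_b, unbatched.calls, batched.calls)  # True 12 3
```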

Summary
Many challenges in ML/AI, systems, security, architectures
Both PyWren and Ray already released and working on AWS
PyWren: https://github.com/pywren/pywren
Ray: https://github.com/ray-project/ray
RISELab goal: Develop open source platforms, tools
and algorithms for real-time decisions on live data
with strong security
