© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Data in the Cloud: How the RISELab Enables
Computers to Make Intelligent Real-time Decisions
Ion Stoica,
RISELab Director, UC Berkeley
Executive Chairman, Databricks
Who am I?
RISELab director (2017-), UC Berkeley
• Co-director, AMPLab (2011-2016)
Co-founder & Executive Chairman, Databricks
• Unified Analytics Platform based on Apache Spark
• Hosted service in AWS
Co-founder & CTO, Conviva
Follows AMPLab...
AMPLab (2011-2016)
• Mission: “Make sense of big data”
• 8+ faculty, 60+ students
Government and Industry support:
Algorithms
Machines People
AMPLab impact
Industry:
Three startups raising over $200M so far
Academic:
• Faculty at MIT, Stanford, CMU, Cornell, U Michigan, …
• Three ACM Dissertation Awards
[Startup logo labels: 1,000s customers; 100K’s users; 100s customers; …; 100s customers]
Leveraged AWS from day one
Systems developed, tested, and deployed on AWS
• AWS first cloud to host Apache Spark (Feb, 2013)
Leveraged AWS from day one
First Terasort Apache Spark record (2014)
• 2013 Record: Hadoop, 2100 machines, 72 minutes
• 2014 Record: Spark, 207 machines, 23 minutes
Also sorted 1PB in 4 hours (bottlenecked by network)
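To put the two sort records side by side, the arithmetic below compares them in machine-minutes. It is only a back-of-the-envelope sketch: it assumes the 2100 machines belong to the 2013 Hadoop run and that both runs sorted a comparable amount of data, and the variable names are illustrative.

```python
# Figures taken from the slide; the pairing of 2100 machines with the
# 2013 Hadoop record is an assumption based on the slide layout.
hadoop_machines, hadoop_minutes = 2100, 72
spark_machines, spark_minutes = 207, 23

# Total resources consumed, measured in machine-minutes.
hadoop_machine_minutes = hadoop_machines * hadoop_minutes
spark_machine_minutes = spark_machines * spark_minutes

# How many times fewer machine-minutes the 2014 run needed.
ratio = hadoop_machine_minutes / spark_machine_minutes

print(hadoop_machine_minutes, spark_machine_minutes, round(ratio, 1))
```

Under those assumptions the 2014 run finished about 3x faster on about 10x fewer machines.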
From batch data to advanced analytics
AMPLab
From live data to real-time decisions
RISELab
Why?
Data only as valuable as the decisions (actions) it enables
What is a good decision?
• Faster decisions better than slower decisions
• Decisions on fresh data better than decisions on stale data
• Decisions on personalized data better than on generic data
What do we want?
Real-time decisions (decide in ms)
on live data (current state of environment)
with strong security (privacy, confidentiality, integrity)
Typical decision system
[Diagram: the Environment (+ sensors & actuators) sends Observations and Feedback to the Decision System, which Preprocesses them into Intermediate data for the Decision Engine; the Decision Engine answers each Query with a Decision]
• Live: update latency (e.g., ~1 second)
• Real-time: decision latency (e.g., ~10 ms)
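The split between a slower update path and a fast query path can be sketched as follows. This is a toy illustration, not RISELab code: the class `DecisionSystem`, its methods, the running-mean state, and the threshold rule are all hypothetical.

```python
import time


class DecisionSystem:
    """Toy decision system with a slow update path and a fast query path."""

    def __init__(self):
        self.counts = {}  # intermediate data: observations seen per key
        self.totals = {}  # intermediate data: running sum per key

    def observe(self, key, value):
        """Update path: preprocess an observation into intermediate data."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.totals[key] = self.totals.get(key, 0.0) + value

    def decide(self, key, threshold):
        """Query path: answer from precomputed state only, no raw-data scan."""
        if key not in self.counts:
            return "unknown"
        mean = self.totals[key] / self.counts[key]
        return "act" if mean > threshold else "wait"


system = DecisionSystem()
for value in (0.2, 0.9, 0.8):
    system.observe("sensor-a", value)

start = time.perf_counter()
decision = system.decide("sensor-a", threshold=0.5)
latency_ms = (time.perf_counter() - start) * 1000
print(decision, f"{latency_ms:.3f} ms")
```

Because `decide` only reads the intermediate data, its latency stays independent of how many observations have arrived.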
Example of decision systems
Reinforcement Learning Systems
[Diagram labels: Decision System; Obs., Action; Update Policy; Policy (obs → action); Query Policy; Observations, Rewards]
ML Pipeline
[Diagram labels: Decision System; Query, Action; Training; Models (different complexity/accuracy tradeoffs); Model Serving; Feedback; Observations, Feedback]
What else do we want from decisions?
Intelligent: complex decisions in uncertain environments
Robust: handle complex noise, failures, unforeseen inputs
Need ability to say “I don’t know!”
Explainable: ability to explain non-obvious decisions
RISELab goal
Develop open source platforms, tools, and algorithms for
intelligent real-time decisions on live data
Secure Real-time Decision Stack (SRDS)
Open source platform to develop RISE-like apps
Targets emerging AI applications
• Hyperparameter search
• Reinforcement Learning (RL)
Secure from ground up
Secure Real-time Decision Stack (SRDS)
[Stack diagram components: RISE μkernel (scheduler, object store); Ray, PyWren, …; Ground (data context service); Clipper; optimizer; Spark/Opaque; Time Machine]
SRDS: Microkernel
Minimalist execution engine:
• Support both data flow and task-parallel execution models
• High-throughput, low-latency scheduler
SRDS: Ground
Central repository for models, APIs to capture the
context in which data gets used and produced
SRDS: Time machine
Replaying of apps at fine granularity
• Simplify development, debugging
• Robustness: replay against perturbed inputs
• Explainability: identify inputs causing decision
• Security: confirm vulnerabilities, test security
patches, compliance auditing
SRDS: Application frameworks
Computation frameworks to simplify development of RISE apps
• Ray: task-parallel framework to support RL workloads
• PyWren: serverless framework for AI applications
• Clipper: model serving supporting ensembles & cascading models
• Opaque: secure SparkSQL
PyWren
Parallel computing for non-CS people!
Cloud computing still hard
• What type? What instance?
What base image?
• How many to spin up?
What price? Spot?
• Devops?
Requirements
Minimal setup overhead
• No need to set up clusters
Minimal learning curve
• Anyone who can write Python should be able to invoke it
Target minute-long jobs
PyWren
Wren: small short-winged songbird found chiefly in the New World
• Most wrens are small and rather inconspicuous, except for their
loud and often complex songs.
AWS Lambda: Serverless computation
Micro-instances
• 300 seconds single-core
• 1.5GB RAM
• Python, Java, Node
No need for provisioning or managing servers
Billing is metered in increments of 100 milliseconds
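The 100 ms billing increment can be made concrete with a little arithmetic. The helper below is a sketch that assumes run time is rounded up to the next increment; the function name is hypothetical and no prices are modeled, since the slide gives none.

```python
import math

BILLING_INCREMENT_MS = 100  # increment stated on the slide


def billed_duration_ms(actual_ms):
    """Round a run time up to the next billing increment (assumed behavior)."""
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


for actual in (1, 100, 101, 2350):
    print(actual, "ms runs are billed as", billed_duration_ms(actual), "ms")
```

Under that assumption, very short functions pay for a full increment, which fits the target of jobs that run for seconds to minutes rather than milliseconds.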
PyWren API
How does it work?
Your laptop:
• future = runner.map(fn, data)
• Serialize func and data
• Put on Amazon S3
• Invoke AWS Lambda
The cloud:
• pull job from S3
• download anaconda runtime
• python to run code
• pickle result
• stick in S3
Your laptop:
• future.result()
• poll S3
• unpickle and return result
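The flow on this slide can be imitated locally to see how the pieces fit. The sketch below is not PyWren itself and makes no AWS calls: a dict (`fake_s3`) stands in for S3, a direct function call stands in for the Lambda invocation, and `marshal` stands in for whatever function serializer the real system uses. All helper names (`submit`, `run_worker`, `fetch_result`, `local_map`) are hypothetical.

```python
import builtins
import marshal
import pickle
import types

# In-memory dict standing in for an S3 bucket (assumption: no AWS involved).
fake_s3 = {}


def submit(fn, item, job_id):
    """Laptop side: serialize func and data, then put both in storage."""
    fake_s3[f"{job_id}/func"] = marshal.dumps(fn.__code__)
    fake_s3[f"{job_id}/data"] = pickle.dumps(item)


def run_worker(job_id):
    """Cloud side: pull the job, run the code, pickle the result, store it."""
    code = marshal.loads(fake_s3[f"{job_id}/func"])
    fn = types.FunctionType(code, {"__builtins__": builtins})
    item = pickle.loads(fake_s3[f"{job_id}/data"])
    fake_s3[f"{job_id}/result"] = pickle.dumps(fn(item))


def fetch_result(job_id):
    """Laptop side: read the stored result and unpickle it."""
    return pickle.loads(fake_s3[f"{job_id}/result"])


def local_map(fn, data):
    """Run fn over data, one simulated job per item."""
    job_ids = [f"job-{i}" for i in range(len(data))]
    for job_id, item in zip(job_ids, data):
        submit(fn, item, job_id)
        run_worker(job_id)  # a real system would invoke one Lambda per item
    return [fetch_result(job_id) for job_id in job_ids]


def square(x):
    return x * x


results = local_map(square, [1, 2, 3])
print(results)
```

Since workers share nothing except the storage layer, each job can run on any machine that can read and write the bucket.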
Scalability
[Plots: Compute; Data]
Near linear scalability
45 TFLOPS
Ray + Microkernel
Ray
Targets Reinforcement Learning (RL) applications
Currently includes μkernel functionality
[Diagram labels: Decision System; Obs., Action; Update Policy; Policy (obs → action); Query Policy; Observations, Rewards]
RL Example: Learning to run
Reinforcement Learning requirements
Process inputs from different sensors in parallel & real-time
Execute large number of rollouts (simulations)
Rollout outcomes are used to update policy (e.g., SGD)
• Heterogeneous durations, dynamic execution graph
• 100s of millions of rollouts, each rollout as little as a few msec
Often policies implemented by DNNs
[Diagram: repeated rounds of rollouts, each followed by a policy update; the policy maps observations to actions]
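The pattern of running many rollouts in parallel and then updating the policy from their outcomes can be sketched in a few lines. This is a toy under loud assumptions: the environment is a one-parameter quadratic reward, the update is simple random search rather than SGD, and a thread pool stands in for a distributed scheduler. The names `rollout`, `train`, and `TARGET` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

TARGET = 3.0  # hypothetical optimum that the toy environment rewards


def rollout(theta):
    """Simulate one episode for policy parameter theta and return its reward."""
    return -(theta - TARGET) ** 2


def train(num_iterations=20, rollouts_per_iteration=8, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    best_reward = rollout(theta)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(num_iterations):
            # Propose perturbed policies and run their rollouts in parallel.
            candidates = [theta + rng.gauss(0.0, 0.5)
                          for _ in range(rollouts_per_iteration)]
            rewards = list(pool.map(rollout, candidates))
            # Update the policy from the outcomes: keep the best candidate
            # if it beats the current policy.
            reward, candidate = max(zip(rewards, candidates))
            if reward > best_reward:
                theta, best_reward = candidate, reward
    return theta, best_reward


theta, reward = train()
print(theta, reward)
```

Each iteration waits for all of its rollouts before updating, which is exactly where heterogeneous rollout durations would hurt and why fine-grained, dynamic scheduling matters.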
Ray goals
Flexibility
• Combine neural networks, planning, search, simulation, etc.
• Heterogeneous tasks: CPUs/GPUs, durations, computation
• Fine-grained data and task dependencies, dynamic execution
Performance
• Millions of tasks per second with msec level latencies
• Adapt to changing work in real-time
Ease of use
• Minimal changes to parallelize existing Python serial code
Ray architecture
[Diagram: a Driver and several Nodes, each Node running Workers with its own Object store and Local scheduler; a Global control store holding the Object table, Task table, and Function table; a Global scheduler; plus Web UI, Debugging tools, and Error diagnosis]
Experiments
Tested and deployed on AWS
Latency
• Local task execution: ~300 usec
• Remote task execution: ~1ms
• Object store R/W throughput: 6GB/s for large objects
Ray scales to 1 million tasks/second
Robustness to node failures
Key feature to support spot instances
Ray Speeds up Rollouts for Policy Gradients
• Parallel rollouts on CPU (simulate environment and evaluate policy on CPU): speedup 1.0x
• Policy evaluation on GPU (simulate environment on CPU): 1.3x
• Fine-grained rollouts (simulate environments on CPU, batched policy evaluation on GPU): 4.1x
Leverage P2 (GPU) instances
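The idea behind batched policy evaluation is that several simulators advance in lockstep and the policy is evaluated once per step for all of them. The sketch below shows only that call structure on the CPU, with a toy linear policy and a toy environment. Every name is hypothetical and no GPU or Ray API is used.

```python
def evaluate_policy_batch(weights, observations):
    """Evaluate a toy linear policy on a whole batch of observations at once."""
    return [1 if sum(w * o for w, o in zip(weights, obs)) > 0 else 0
            for obs in observations]


def step_env(state, action):
    """Toy environment step: nudge every state component by the action."""
    delta = 0.1 if action else -0.1
    return [s + delta for s in state]


def batched_rollouts(weights, initial_states, num_steps):
    """Advance all environments together, one policy call per time step."""
    states = list(initial_states)
    policy_calls = 0
    for _ in range(num_steps):
        actions = evaluate_policy_batch(weights, states)
        policy_calls += 1
        states = [step_env(s, a) for s, a in zip(states, actions)]
    return states, policy_calls


final_states, calls = batched_rollouts(
    weights=[1.0, -1.0],
    initial_states=[[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]],
    num_steps=4,
)
print(calls, "policy calls for", len(final_states), "environments")
```

Evaluating per environment would take 12 policy calls here instead of 4; on an accelerator, fewer and larger calls are what make batching pay off.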
Summary
Many challenges in ML/AI, systems, security, architectures
Both PyWren and Ray already released and working on AWS
PyWren: https://github.com/pywren/pywren
Ray: https://github.com/ray-project/ray
RISELab goal: Develop open source platforms, tools
and algorithms for real-time decisions on live data
with strong security
Thank you!

Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent Real-time Decisions | AWS Public Sector Summit 2017


Editor's Notes

  1. Switching from lambda to ec2
  2. RL: Reinforcement Learning