Scientists, developers, and other technologists across many industries use Amazon Web Services for big data workloads, from analytics to data lakes, to make better decisions and keep up with the increasing volume, variety, and velocity of digital information. This session features UC Berkeley's RISELab (Real-time Intelligent Secure Execution), a lab recently created to enable computers to make intelligent, real-time decisions. You will hear how the lab is building on its earlier success with AMPLab to enable applications to interact intelligently and securely with their environment in real time. From cybersecurity to coordinating fleets of self-driving cars and drones to earthquake warning systems, you will come away with insight into how the lab uses AWS to develop and experiment with systems for this research. Learn More: https://aws.amazon.com/government-education/
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent Real-time Decisions | AWS Public Sector Summit 2017
2. Who am I?
RISELab director (2017-), UC Berkeley
• Co-director, AMPLab (2011-2016)
Co-founder & Executive Chairman, Databricks
• Unified Analytics Platform based on Apache Spark
• Hosted service in AWS
Co-founder & CTO, Conviva
3. Follows AMPLab...
AMPLab (2011-2016)
• Mission: “Make sense of big data”
• 8+ faculty, 60+ students
Government and Industry support:
Algorithms, Machines, People
4. AMPLab impact
Industry:
Three startups raising over $200M so far
Academic:
• Faculty at MIT, Stanford, CMU, Cornell, U Michigan, …
• Three ACM Dissertation Awards
1,000s customers
100K’s users
100s customers
…
100s customers
5. Leveraged AWS from day one
Systems developed, tested, and deployed on AWS
• AWS first cloud to host Apache Spark (Feb, 2013)
6. Leveraged AWS from day one
First Terasort Apache Spark record (2014)
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Also sorted 1PB in 4 hours (bottlenecked by network)
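The two records on this slide can be compared with a little arithmetic. The sketch below uses only the machine counts and run times quoted above; "machine-minutes" is an illustrative metric chosen here, not one the talk reports.

```python
# Figures taken from the slide; machine-minutes is an illustrative metric.
hadoop_2013 = {"machines": 2100, "minutes": 72}
spark_2014 = {"machines": 207, "minutes": 23}


def machine_minutes(record):
    """Total machine time consumed by one sort run."""
    return record["machines"] * record["minutes"]


wall_clock_speedup = hadoop_2013["minutes"] / spark_2014["minutes"]
machine_reduction = hadoop_2013["machines"] / spark_2014["machines"]
resource_ratio = machine_minutes(hadoop_2013) / machine_minutes(spark_2014)

print(f"wall-clock speedup: {wall_clock_speedup:.1f}x")    # 3.1x
print(f"fewer machines:     {machine_reduction:.1f}x")     # 10.1x
print(f"machine-minutes:    {resource_ratio:.1f}x fewer")  # 31.8x
```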
7. From batch data to advanced analytics
AMPLab
From live data to real-time decisions
RISELab
10. Why?
Data only as valuable as the decisions (actions) it enables
What is a good decision?
• Faster decisions better than slower decisions
• Decisions on fresh data better than decisions on stale data
• Decisions on personalized data better than on generic data
11. What do we want?
Real-time decisions: decide in ms
on live data: current state of environment
with strong security: privacy, confidentiality, integrity
12. Typical decision system
[Diagram: an environment with sensors and actuators sends observations and feedback to the decision system, which preprocesses them into intermediate data. The decision engine uses that data to answer each query with a decision.]
13. Typical decision system
[Diagram: the same decision system as on the previous slide, annotated with two latencies.]
• Live: update latency (e.g., ~1 second)
• Real-time: decision latency (e.g., ~10 ms)
14. Example of decision systems
[Diagram: two example decision systems.]
• Reinforcement learning systems: observations and rewards update a policy; querying the policy with an observation returns an action
• ML pipeline: observations and feedback drive training, which produces models with different complexity/accuracy tradeoffs; model serving answers each query with an action and collects feedback
15. What else do we want from decisions?
Intelligent: complex decisions in uncertain environments
16. What else do we want from decisions?
Intelligent: complex decisions in uncertain environments
Robust: handle complex noise
17. What else do we want from decisions?
Intelligent: complex decisions in uncertain environments
Robust: handle complex noise, failures, unforeseen inputs
Need ability to say “I don’t know!”
18. What else do we want from decisions?
Intelligent: complex decisions in uncertain environments
Robust: handle complex noise, failures, unforeseen inputs
Explainable: ability to explain non-obvious decisions
19. RISELab goal
Develop open source platforms, tools, and
algorithms for intelligent real-time decisions on live data
20. Secure Real-time Decision Stack (SRDS)
Open source platform to develop RISE-like apps
Targets emerging AI applications
• Hyperparameter search
• Reinforcement Learning (RL)
Secure from ground up
21. Secure Real-time Decision Stack (SRDS)
[Stack diagram. Components: RISE μkernel (scheduler, object store); Ray, PyWren, …; Ground (data context service); Clipper; optimizer; Spark/Opaque; Time Machine.]
22. SRDS: Microkernel
[Stack diagram. Components: RISE μkernel (scheduler, object store); Ray, PyWren, …; Ground (metadata manager); Clipper; optimizer; Spark/Opaque; Time Machine.]
Minimalist execution engine:
• Support both data flow and task-parallel execution models
• High-throughput, low-latency scheduler
23. SRDS: Ground
[Stack diagram. Components: RISE μkernel (scheduler, object store); Ray, PyWren, …; Ground (data context service); Clipper; optimizer; Spark/Opaque; Time Machine.]
Central repository for models, APIs to capture the
context in which data gets used and produced
24. SRDS: Time machine
[Stack diagram. Components: RISE μkernel (scheduler, object store); Ray, PyWren, …; Ground (data context service); Clipper; optimizer; Spark/Opaque; Time Machine.]
Replaying of apps at fine granularity
• Simplify development, debugging
• Robustness: replay against perturbed inputs
• Explainability: identify inputs causing decision
• Security: confirm vulnerabilities, test security
patches, compliance auditing
25. SRDS: Application frameworks
[Stack diagram. Components: RISE μkernel (scheduler, object store); Ray, PyWren, …; Ground (metadata manager); Clipper; optimizer; Spark/Opaque; Time Machine.]
Computation frameworks to simplify development of RISE apps
• Ray: task-parallel framework to support RL workloads
• PyWren: serverless framework for AI applications
• Clipper: model serving supporting ensembles & cascading models
• Opaque: secure SparkSQL
27. PyWren
Parallel computing for non-CS people!
Cloud computing still hard
• What type? What instance? What base image?
• How many to spin up? What price? Spot?
• Devops?
28. Requirements
Minimal setup overhead
• No need to set up clusters
Minimal learning curve
• Anyone who can write Python should be able to invoke it
Target minute-long jobs
29. PyWren
Wren: small short-winged songbird found chiefly in the New World
• Most wrens are small and rather inconspicuous, except for their
loud and often complex songs.
30. AWS Lambda: Serverless computation
Micro-instances
• 300 seconds single-core
• 1.5GB RAM
• Python, Java, Node
No need for provisioning or managing servers
Billing is metered in increments of 100 milliseconds
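As a small worked example of the 100 ms metering on this slide, the sketch below rounds an invocation's duration up to the next increment. Rounding up is an assumption about how the increments are applied, and `billed_ms` is a hypothetical helper, not an AWS API; no prices are modeled because the slide gives none.

```python
import math

BILLING_INCREMENT_MS = 100  # from the slide: metered in 100 ms increments
MAX_DURATION_S = 300        # from the slide: 300-second limit per invocation


def billed_ms(duration_ms: float) -> int:
    """Round a duration up to the next billing increment (assumed behavior)."""
    if duration_ms < 0 or duration_ms > MAX_DURATION_S * 1000:
        raise ValueError("duration outside the 0-300 s range")
    return math.ceil(duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


print(billed_ms(1))     # 100
print(billed_ms(250))   # 300
print(billed_ms(1000))  # 1000
```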
32. How does it work?
Your laptop:
• future = runner.map(fn, data)
• Serialize func and data
• Put on Amazon S3
• Invoke AWS Lambda
The cloud:
• Pull job from S3
• Download Anaconda runtime
• Python to run code
• Pickle result
• Stick in S3
Your laptop:
• future.result()
• Poll S3
• Unpickle and return result
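The flow on this slide can be sketched locally without any AWS account. This is a simulation under stated assumptions, not PyWren's implementation: a dict stands in for Amazon S3, a thread stands in for each AWS Lambda invocation, and `put`, `get`, `worker`, `Future`, and `map_jobs` are hypothetical names. Standard `pickle` also stores functions by reference, so the sketch only works within one process.

```python
import pickle
import threading
import time
import uuid

# Hypothetical stand-in for Amazon S3: a dict guarded by a lock.
_store = {}
_lock = threading.Lock()


def put(key, blob):
    with _lock:
        _store[key] = blob


def get(key):
    with _lock:
        return _store.get(key)


def worker(job_key, result_key):
    """Stand-in for one Lambda invocation: pull job, run it, store result."""
    func, arg = pickle.loads(get(job_key))
    put(result_key, pickle.dumps(func(arg)))


class Future:
    def __init__(self, result_key):
        self.result_key = result_key

    def result(self, poll_interval=0.01):
        # Poll the store until the worker has written the pickled result.
        while (blob := get(self.result_key)) is None:
            time.sleep(poll_interval)
        return pickle.loads(blob)


def map_jobs(func, data):
    """Serialize func with each item, store the job, start a worker for it."""
    futures = []
    for item in data:
        job_id = uuid.uuid4().hex
        job_key, result_key = f"jobs/{job_id}", f"results/{job_id}"
        put(job_key, pickle.dumps((func, item)))
        threading.Thread(target=worker, args=(job_key, result_key)).start()
        futures.append(Future(result_key))
    return futures


def square(x):
    return x * x


futures = map_jobs(square, [1, 2, 3])
print([f.result() for f in futures])  # [1, 4, 9]
```

The caller never talks to a worker directly. All coordination goes through the shared store, which is what lets the client side stay a plain `map` call followed by `result()`.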
40. Reinforcement Learning requirements
Process inputs from different sensors in parallel & real-time
Execute large number of rollouts (simulations)
Rollout outcomes are used to update policy (e.g., SGD)
[Diagram: batches of parallel rollouts, each followed by a policy update, repeated over time.]
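The rollout-then-update loop described on this slide can be sketched with the standard library alone. Everything here is a toy assumption: the one-parameter policy, the reward function, and the update rule (a simple move toward better-than-average actions, standing in for SGD) are illustrative, and threads only show the structure since they do not speed up CPU-bound Python code.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def rollout(policy_param, seed):
    """Toy simulation: reward is higher the closer the action is to 3.0."""
    rng = random.Random(seed)
    action = policy_param + rng.gauss(0.0, 0.5)
    reward = -(action - 3.0) ** 2
    return action, reward


def update_policy(policy_param, results, step_size=0.5):
    """Move the parameter toward the mean of better-than-average actions."""
    mean_reward = sum(r for _, r in results) / len(results)
    good = [a for a, r in results if r >= mean_reward]
    target = sum(good) / len(good)
    return policy_param + step_size * (target - policy_param)


policy_param = 0.0
with ThreadPoolExecutor(max_workers=4) as pool:
    for iteration in range(60):
        seeds = [iteration * 100 + i for i in range(16)]
        # Run one batch of rollouts in parallel, then update the policy once.
        results = list(pool.map(lambda s: rollout(policy_param, s), seeds))
        policy_param = update_policy(policy_param, results)

print(round(policy_param, 2))  # ends near the toy optimum of 3.0
```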
42. RL requirements
Process inputs from different sensors in parallel & real-time
Execute large number of rollouts (simulations)
Rollout outcomes are used to update policy (e.g., SGD)
Policies are often implemented by DNNs
[Diagram labels: observations, actions]
43. Ray goals
Flexibility
• Combine neural networks, planning, search, simulation, etc
• Heterogeneous tasks: CPUs/GPUs, durations, computation
• Fine-grained data and task dependencies, dynamic execution
Performance
• Millions of tasks per second with msec level latencies
• Adapt to changing work in real-time
Ease of use
• Minimal changes to parallelize existing Python serial code
44. Ray architecture
[Architecture diagram. Each node runs workers, an object store, and a local scheduler. A global control store holds the object table, task table, and function table. Other components: global scheduler, driver, web UI, debugging tools, error diagnosis.]
45. Experiments
Tested and deployed on AWS
Latency
• Local task execution: ~300 usec
• Remote task execution: ~1ms
• Object store R/W throughput: 6GB/s for large objects
48. Ray Speeds up Rollouts for Policy Gradients
[Diagram: three ways to schedule environment simulation and policy evaluation.]
• Parallel rollouts on CPU: speedup 1.0x
• Policy evaluation on GPU: 1.3x
• Fine-grained rollouts (batched policy evaluation on GPU, simulation on CPU): 4.1x
Leverage P2 (GPU) instances
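One way to read the batched configuration on this slide: stepping many simulations in lockstep lets a single policy call serve all of them, which amortizes per-call overhead. The sketch below only counts policy invocations on a toy problem; `CountingPolicy`, `step`, and both rollout functions are hypothetical, and it does not reproduce the speedups reported above.

```python
class CountingPolicy:
    """Toy policy that records how many times it is invoked."""

    def __init__(self):
        self.calls = 0

    def evaluate(self, observations):
        # One invocation handles a whole batch of observations.
        self.calls += 1
        return [1 if obs < 0 else -1 for obs in observations]


def step(obs, action):
    """Toy environment step: the action nudges the state toward zero."""
    return obs + 0.5 * action


def per_env_rollouts(policy, starts, horizon):
    """Run each simulation to completion, one policy call per step."""
    finals = []
    for obs in starts:
        for _ in range(horizon):
            (action,) = policy.evaluate([obs])
            obs = step(obs, action)
        finals.append(obs)
    return finals


def batched_rollouts(policy, starts, horizon):
    """Step all simulations in lockstep, one policy call per time step."""
    states = list(starts)
    for _ in range(horizon):
        actions = policy.evaluate(states)
        states = [step(o, a) for o, a in zip(states, actions)]
    return states


starts = [-2.0, -1.0, 1.0, 2.0]
p1, p2 = CountingPolicy(), CountingPolicy()
finals_a = per_env_rollouts(p1, starts, horizon=4)
finals_b = batched_rollouts(p2, starts, horizon=4)
print(finals_a == finals_b)  # True
print(p1.calls, p2.calls)    # 16 4
```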
49. Summary
Many challenges in ML/AI, systems, security, architectures
Both PyWren and Ray already released and working on AWS
PyWren: https://github.com/pywren/pywren
Ray: https://github.com/ray-project/ray
RISELab goal: Develop open source platforms, tools
and algorithms for real-time decisions on live data
with strong security