Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup

Recordings: https://youtube.datascienceonaws.com Book: https://www.amazon.com/Data-Science-AWS-End-End/dp/1492079391/ GitHub: https://github.com/data-science-on-aws/
Ray AI Runtime (AIR) on AWS:
Distributed ML with Amazon SageMaker,
EC2, EMR, and EKS!

Speakers
Chris Fregly
Principal Solution Architect, AI/ML
@ AWS
Antje Barth
Principal Developer Advocate, AI/ML
@ AWS
Apoorva Kulkarni
Senior Solution Architect, Containers
@ AWS

What is Ray?
Frictionless transition from research to production
Encourages iterative development and debugging
Environment management: “conda as a service” that auto-syncs files across the cluster
Makes TensorFlow, PyTorch, scikit-learn, and everything else as easy to scale as Spark

Ray ecosystem

Scale from laptop to cluster

ray up cluster.yaml

Ray Clusters

ray cluster-dump

Ray Quick Start on AWS

Ray Autoscale

Ray Dashboard

Frictionless transition from research to production
Local development
Remote cluster development
Remote cluster production job

Local development: local laptop and conda
python pytorch-huggingface-clothing.py       # train.py
    --num_train_epochs 1                     # hyper-parameter
    --max_length 64                          # hyper-parameter
    --num_workers 4                          # number of workers (i.e. CPUs or GPUs)
    --model_name_or_path roberta-base        # base BERT model
    --train_file ./data/train/part-algo-1-womens_clothing_ecommerce_reviews.csv
    --validation_file ./data/validation/part-algo-1-womens_clothing_ecommerce_reviews.csv
https://github.com/data-science-on-aws/data-science-on-aws/tree/5b5ed1a/wip/ray/train
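
The flags on this slide suggest how the training script takes its configuration. The following is a minimal sketch of an argument parser for those flags, assuming the script uses argparse. The function name `build_parser`, the types, and the defaults are assumptions for illustration; the actual script in the repository may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown on the slide."""
    parser = argparse.ArgumentParser(description="Fine-tune a text classifier")
    # Hyper-parameters (defaults are assumptions, not taken from the repo).
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--max_length", type=int, default=64)
    # Number of parallel training workers (CPUs or GPUs).
    parser.add_argument("--num_workers", type=int, default=4)
    # Name or local path of the base model to fine-tune.
    parser.add_argument("--model_name_or_path", type=str, default="roberta-base")
    parser.add_argument("--train_file", type=str, required=True)
    parser.add_argument("--validation_file", type=str, required=True)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--num_workers", "64", "--train_file", "train.csv", "--validation_file", "val.csv"]
)
print(args.num_workers, args.model_name_or_path)
```

Because only `--num_workers` changes between the laptop and the cluster, the same script can be reused unchanged in both places.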

Remote development: cluster and cluster-scope conda
ray submit cluster.yaml                  # run the same Python script on the Ray cluster!
    pytorch-huggingface-clothing.py      # train.py
    --num_workers 64                     # number of workers

Remote cluster production jobs: specify conda yaml per job
ray job submit
    --working-dir .                                                # copy everything from this directory and below
    --runtime-env job-pytorch-huggingface-clothing-runtime.yaml    # conda env YAML
    --address http://127.0.0.1:8265                                # port forward to cluster
    -- python pytorch-huggingface-clothing.py                      # train.py
    --num_workers 4                                                # number of workers (i.e. CPUs or GPUs)
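
When the job is launched from a script or CI pipeline rather than typed by hand, the command above can be assembled as an argument list. This is a rough sketch using only the flags shown on the slide; the helper name `build_job_submit_command` is hypothetical, and the sketch prints the command instead of executing it.

```python
import shlex
from typing import List


def build_job_submit_command(
    script: str,
    runtime_env: str,
    address: str,
    num_workers: int,
    working_dir: str = ".",
) -> List[str]:
    """Hypothetical helper: build the `ray job submit` argv shown on the slide."""
    return [
        "ray", "job", "submit",
        "--working-dir", working_dir,   # directory uploaded with the job
        "--runtime-env", runtime_env,   # per-job environment YAML
        "--address", address,           # forwarded cluster address
        "--",                           # everything after this is the entrypoint
        "python", script,
        "--num_workers", str(num_workers),
    ]


cmd = build_job_submit_command(
    script="pytorch-huggingface-clothing.py",
    runtime_env="job-pytorch-huggingface-clothing-runtime.yaml",
    address="http://127.0.0.1:8265",
    num_workers=4,
)
# Print a shell-safe rendering; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if a path or argument ever contains spaces.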

Ray environment management (“conda as a service”)

Ray debugging
https://github.com/data-science-on-aws/data-science-on-aws/tree/5b5ed1a/wip/ray/debug

Ray AIR (AI Runtime)

Ray AIR (AI Runtime) - Quickstart

Ray Data - not (yet) a DataFrame abstraction (i.e. no joins)
https://github.com/data-science-on-aws/data-science-on-aws/tree/5b5ed1a/wip/ray/datasets

Modin: Pandas on Ray

RayDP: Spark on Ray

Ray Tune
from ray import tune

# 1. Define an objective function.
def objective(config):
    score = config["a"] ** 2 + config["b"]
    return {"score": score}

# 2. Define a search space.
search_space = {
    "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
    "b": tune.choice([1, 2, 3]),
}

# 3. Start a Tune run and print the best result.
analysis = tune.run(objective, config=search_space)
print(analysis.get_best_config(metric="score", mode="min"))
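
To make the search space above concrete, here is a plain-Python sketch of what it expands to, assuming the default of one sample per grid point: the grid over "a" yields one trial per listed value, while "b" is drawn at random for each trial. This is not how Tune is implemented; `expand_trials` and the fixed seed are illustrative assumptions, and no Ray install is needed to run it.

```python
import random


def objective(config):
    """Same objective as on the slide."""
    return {"score": config["a"] ** 2 + config["b"]}


def expand_trials(grid_values, choices, seed=0):
    """Hypothetical helper: one trial per grid value, with a random choice for "b"."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [{"a": a, "b": rng.choice(choices)} for a in grid_values]


trials = expand_trials([0.001, 0.01, 0.1, 1.0], [1, 2, 3])
results = [{**config, **objective(config)} for config in trials]

# Equivalent in spirit to asking for the best config with mode="min".
best = min(results, key=lambda r: r["score"])
print(len(trials), best)
```

Tune runs these trials in parallel across the cluster; the sketch evaluates them sequentially.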

Ray RLlib: Initial beachhead for Ray
Ray Reinforcement Learning Ray Data & Ray Train/Tune

Ray Serve
Serving Framework on Ray
Python-native, supports any Python code, ML framework, etc
Compose multiple ML models into a deployment graph

Ray Serve: NLP inference pipeline with HuggingFace
https://github.com/data-science-on-aws/data-science-on-aws/tree/5b5ed1a/wip/ray/serve

Ray Serve: combine 2 NLP models, average the predictions
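
The combine step itself is simple. The sketch below shows it in plain Python, with two stand-in scoring functions in place of the real HuggingFace models. `model_a`, `model_b`, `ensemble_predict`, and the scores they return are all hypothetical; in the talk's setup each model would be its own Ray Serve deployment.

```python
from statistics import mean
from typing import Callable, Sequence


def model_a(text: str) -> float:
    """Hypothetical stand-in for the first NLP model (positive-sentiment score)."""
    return 0.9 if "great" in text.lower() else 0.2


def model_b(text: str) -> float:
    """Hypothetical stand-in for the second NLP model (positive-sentiment score)."""
    return 0.1 if "not" in text.lower() else 0.7


def ensemble_predict(text: str, models: Sequence[Callable[[str], float]]) -> float:
    """Send the same input to every model and average the predictions."""
    return mean([model(text) for model in models])


for text in ["Ray Serve is great!", "Serving frameworks without DAG support are not great."]:
    print(text, ensemble_predict(text, [model_a, model_b]))
```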

Ray Serve: DAGDriver (http server)

Ray Serve: build DAG with http inputs
InputNode() - HTTP inputs to the DAG
bind() - Graph building API on decorated body
serve.run() - Run deployment graph

Ray Serve: submit long-running “serve job” to cluster
ray job submit
    --working-dir .
    --runtime-env job-serve-runtime.yaml
    -- python serve-dag-huggingface.py

Ray Serve: run sample predictions with an http client
import requests

input_text_list = [
    "Ray Serve is great!",
    "Serving frameworks without DAG support are not great.",
]
for input_text in input_text_list:
    prediction = requests.get(
        "http://<cluster_host>:8080/invocations", data=input_text
    ).text
    print("Prediction for '{}' is {}".format(input_text, prediction))

Ray Workflows
High-performance, durable application workflows
Large-scale workflows (e.g. ML and data pipelines)
Long-running business workflows (when used with Ray Serve)
read_data() -> preprocessing() -> train() -> validate()

Ray Workflows - Define steps
https://github.com/data-science-on-aws/data-science-on-aws/tree/5b5ed1a/wip/ray/workflow

Ray Workflows - Initialize storage, set up and run workflow
Set up workflow DAG
Workflow.run() - Start workflow DAG
Workflow execution is durably logged to storage
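
The idea of durable logging can be sketched without Ray: each step writes its result to storage, so a rerun reuses finished steps instead of recomputing them. The code below is a conceptual illustration only, not the Ray Workflows API. The `run_step` and `run_workflow` helpers, the JSON-file storage layout, and the toy step bodies are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def run_step(storage: Path, name: str, func, *args):
    """Run one step, or reuse its logged result if it already finished."""
    checkpoint = storage / f"{name}.json"
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())
    result = func(*args)
    checkpoint.write_text(json.dumps(result))  # durable log of the step output
    return result


def read_data():
    return [1.0, 2.0, 3.0, 4.0]


def preprocessing(rows):
    return [x / max(rows) for x in rows]


def train(rows):
    return {"mean": sum(rows) / len(rows)}


def validate(model):
    return {"ok": 0.0 <= model["mean"] <= 1.0}


def run_workflow(storage: Path):
    """read_data() -> preprocessing() -> train() -> validate()"""
    storage.mkdir(parents=True, exist_ok=True)
    rows = run_step(storage, "read_data", read_data)
    clean = run_step(storage, "preprocessing", preprocessing, rows)
    model = run_step(storage, "train", train, clean)
    return run_step(storage, "validate", validate, model)


with tempfile.TemporaryDirectory() as tmp:
    first = run_workflow(Path(tmp))
    second = run_workflow(Path(tmp))  # replayed from the logged results
    print(first, second)
```

If the process crashes after train(), a rerun against the same storage directory would resume at validate() rather than starting over.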
https://github.com/data-science-on-aws/data-science-on-aws/tree/5b5ed1a/wip/ray/workflow

Ray + Kubernetes
KubeRay
https://shopify.engineering/merlin-shopify-machine-learning-platform

Demos: Ray on AWS
https://github.com/data-science-on-aws/data-science-on-aws/tree/2a968ba/wip/ray
