Introduction to Polyaxon
Yu Ishikawa
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
Objectives of introducing Polyaxon
- Make the lead time of experiments as short as possible.
- Make the financial cost to train models as cheap as possible.
- Make the experiments reproducible.
[ML Workflow diagram: Experiment Phase (Problem Setting → Collecting Data → Experiments → Off-line Evaluation) → Productionize Phase (Productionize ML system) → Operating Phase (Serving Models → On-line Evaluation → Retrain Model → Off-line Evaluation), with Polyaxon's role highlighted in the workflow.]
Why do we need Polyaxon?
- We are not able to manage experiments as a team today.
- Experiments are expensive in terms of both money and time. There is room to improve efficiency and productivity.
- Setting up experiment environments can be tough for ML engineers. Moreover, the environments tend not to be reproducible, so taking over another member's tasks can be expensive, and we cannot manage the training process as a team.
- Hyperparameter search with Python ML libraries like sklearn takes a long time, since Python on its own does not scale well.
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
What is Polyaxon?
- An open source platform for reproducible machine learning at scale.
- https://polyaxon.com/
- Features
- Notebook
- Hyperparameter search
- Powerful workspace
- User management
- Dynamic resources allocation
- Dashboard
- Versioning
Notebook environment with Jupyter
---
version: 1
kind: notebook
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
    - pip3 install jupyter
$ polyaxon notebook start -f polyaxon_notebook.yml
New version of CLI (0.3.5) is now available. To upgrade run:
pip install -U polyaxon-cli
Notebook is being deployed for project `quick-start`
It may take some time before you can access the notebook.
Your notebook will be available on:
http://35.184.217.84:80/notebook/root/quick-start/
- Polyaxon enables us to launch a Jupyter environment with a single command, and we can define the environment with a Docker image and build commands in a YAML file.
- We can reproduce the notebook experiments easily.
Hyperparameter tuning with Polyaxon
- Polyaxon supports some hyperparameter tuning methods:
- Grid search
- Random search
- Bayesian optimization
- Early stopping
- Hyperband
- We can control the concurrency of hyperparameter tuning with a YAML file.
- We can reproduce hyperparameter tuning jobs as well.
Hyperparameter tuning with high concurrency
---
version: 1
kind: group
hptuning:
  concurrency: 5
  matrix:
    learning_rate:
      linspace: 0.001:0.1:5
    dropout:
      values: [0.25, 0.3]
    activation:
      values: [relu, sigmoid]
declarations:
  batch_size: 128
  num_steps: 500
  num_epochs: 1
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
    - pip3 install --no-cache-dir -U polyaxon-helper
run:
  cmd: python3 model.py --batch_size={{ batch_size }} \
    --num_steps={{ num_steps }} \
    --learning_rate={{ learning_rate }} \
    --dropout={{ dropout }} \
    --num_epochs={{ num_epochs }} \
    --activation={{ activation }}
$ polyaxon run -f polyaxon_gridsearch.yml
New version of CLI (0.3.5) is now available. To
upgrade run:
pip install -U polyaxon-cli
Creating an experiment group with the following
definition:
---------------- -----------------
Search algorithm grid
Concurrency 5 concurrent runs
Early stopping deactivated
---------------- -----------------
Experiment group 1 was created
polyaxon_gridsearch.yml
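For reference, a minimal sketch of what the model.py referenced by the run cmd above might look like. The argument names mirror the polyaxonfile; the training body is a placeholder, and the send_metrics call is an assumption based on the polyaxon-helper package installed in build_steps (check the docs of your version for the exact API).

import argparse

# Assumption: polyaxon-helper (installed in build_steps above) exposes send_metrics,
# which reports metrics back to Polyaxon so they show up on the dashboard.
from polyaxon_helper import send_metrics


def train(args):
    # Placeholder for the real TensorFlow training loop.
    # It should return the metrics we want Polyaxon to track.
    return {"loss": 0.1, "accuracy": 0.95}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--num_steps", type=int, default=500)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--dropout", type=float, default=0.25)
    parser.add_argument("--num_epochs", type=int, default=1)
    parser.add_argument("--activation", type=str, default="relu")
    args = parser.parse_args()

    metrics = train(args)
    send_metrics(**metrics)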
Dashboard ~ Experiments
Dashboard ~ Metrics Visualization
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
Hyperparameter tuning with a single machine
[Diagram: a machine with many CPU cores; the data sits in shared memory, and each core trains with a different parameter set (A, B, C, ..., X).]
For instance, scikit-learn’s GridSearchCV can run experiments in parallel (see the sketch below). However, the number of processes is limited by the number of CPU cores: a machine with 64 CPU cores can run at most 64 concurrent trainings.
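As a rough illustration (not from the slides), here is a minimal scikit-learn sketch of that single-machine ceiling: n_jobs=-1 uses every available core, but the effective concurrency can never exceed the core count.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, 16, None],
}

# n_jobs=-1 parallelizes over all CPU cores of this single machine;
# a 64-core machine therefore runs at most 64 fits at a time.
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)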
Hyperparameter tuning of Polyaxon
[Diagram: training code is uploaded to and run on Polyaxon on k8s; the Polyaxon core builds the image and schedules one pod per parameter set (A–F) across nodes A, B, and C.]
The more nodes the k8s cluster has, the more training processes can run at once. Parallelism is no longer constrained by a single machine.
Even one experiment can be shorter
By leveraging multiple nodes on k8s, we can shorten the time of a single experiment through high concurrency.
[Timeline diagram: Experiments → Evaluation on a single machine vs. on Polyaxon over time t; the Polyaxon run finishes earlier, reducing training time.]
Auto-scalable & preemptible node pool with polyaxon
[Diagram: Polyaxon on k8s with two node pools: one for the Polyaxon core (three nodes) and one for experiments.]
Auto-scalable & preemptible node pool with polyaxon
[Diagram: training code of experiment X is uploaded and run with concurrency 100 and per-run requests of 1 CPU and 2 GB memory against the experiments node pool.]
Auto-scalable & preemptible node pool with polyaxon
[Diagram: to satisfy experiment X (concurrency 100, 1 CPU / 2 GB memory per run), the experiments node pool automatically launches new preemptible nodes.]
Auto-scalable & preemptible node pool with polyaxon
[Diagram: training code of experiment Y is uploaded and run with concurrency 50 and per-run requests of 1 CPU and 1 GB memory; three preemptible nodes are already running in the experiments node pool.]
Auto-scalable & preemptible node pool with polyaxon
[Diagram: the experiments node pool scales out with additional preemptible nodes to accommodate experiment Y (concurrency 50, 1 CPU / 1 GB memory per run).]
Preemptible instance/GPU/TPU pricing
[Pricing tables: preemptible instance, preemptible GPU, and preemptible TPU pricing.]
Regular instance cost vs Preemptible instance cost
[Chart: running cost over time t for a regular instance vs. a preemptible instance; the gap between them is the reduced cost.]
- We can reduce the cost of training models by leveraging preemptible instances with Polyaxon.
- Polyaxon enables us to use a preemptible node pool for experiments.
- Since Polyaxon automatically scales the node pool on GKE, we don’t need to keep static instances for experiments.
Multiple Experiments with a single machine
It takes a long time to run experiments sequentially on a single machine, because Python ML libraries like sklearn are not inherently scalable beyond it.
[Timeline diagram: three Experiments → Evaluation cycles run one after another over time t.]
Multiple Experiments with Polyaxon
Polyaxon enables us to easily run multiple experiments in parallel on k8s. We don’t need to wait for one experiment to finish before moving on to the next.
[Timeline diagram: three Experiments → Evaluation cycles run in parallel over time t.]
We can shorten the total experiment time through Polyaxon’s parallelism.
Sequential experiments cost vs parallel experiments cost
[Chart: running cost and labor cost over time t for sequential experiments vs. experiments in parallel; shortening the total time is the reduced cost.]
- Essentially, the instance costs should be about the same, since the cost of CPU usage is linear in running time.
- However, we should not overlook labor costs during experiments. Waiting for experiments wastes both time and money. Time is money!!
- We can reduce the total cost by shortening the total experiment time.
Power of multiple preemptible nodes
- The cost of 10 preemptible n1-standard-64 nodes running for 2 hours with 640 concurrent runs is cheap:
- $12.80 = $0.64/hour x 10 nodes x 2 hours
- Even if one parameter set takes about 20 minutes to train, we can run about 3,840 parameter sets in just 2 hours at that cost (see the sketch below).
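A back-of-the-envelope check of those numbers (using the $0.64/hour preemptible n1-standard-64 price quoted above; verify against current GCP pricing):

# Back-of-the-envelope cost/throughput estimate for the numbers above.
nodes = 10                  # preemptible n1-standard-64 nodes
cores_per_node = 64         # vCPUs per n1-standard-64
price_per_node_hour = 0.64  # USD, preemptible price quoted above
hours = 2
minutes_per_run = 20        # one parameter set per vCPU (1 CPU requested per run)

concurrency = nodes * cores_per_node                  # 640 concurrent runs
total_cost = price_per_node_hour * nodes * hours      # $12.80
runs = concurrency * (hours * 60 // minutes_per_run)  # 3,840 parameter sets

print(f"concurrency={concurrency}, cost=${total_cost:.2f}, parameter sets={runs}")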
We can achieve the objectives:
- Make the lead time of experiments as short as possible.
- Make the financial cost to train models as cheap as possible.
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
Demo
- Notebook
- Job
- Experiment
- Hyperparameter tuning at scale
Summary
- We can definitely achieve the objectives with Polyaxon on GKE.
- Make the lead time of experiments as short as possible.
- Make the financial cost to train models as cheap as possible.
- Make the experiments reproducible.
- All we ML engineers have to do is:
- Write the training code in Python as usual, and
- Define the YAML files for the experiments.
- What’s next?
- Supporting preemptible GPUs / TPUs.
Appendix A: Links
- Polyaxon
- https://polyaxon.com/
- Documentation
- https://docs.polyaxon.com/
- Examples
- https://github.com/polyaxon/polyaxon-quick-start
- https://github.com/polyaxon/deep-learning-with-python-notebooks-on-polyaxon
- https://github.com/polyaxon/polyaxon-examples
Appendix B: Architecture of Polyaxon