Apache Submarine: Unified Machine Learning Platform
- 1. Apache Submarine
Unified Machine Learning Platform
Keqiu Hu
Staff Software Engineer @ LinkedIn
Wangda Tan
Apache Hadoop PMC Member
Sr. Engineering Manager @ Cloudera
- 2. Agenda
ML in Production & Requirements
What is Apache Submarine?
Demo
Current state and future
- 6. Data Pipeline For Machine Learning
ETL
Data Exploration
Join / Sampling / Feature Extraction / Split train and test data sets, etc.
Model Training
Model Saving, Versioning, etc.
Model Deployment (Online Serving)
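The stages above can be sketched end to end in a few lines. This is a toy illustration only, not Submarine code: the in-memory rows stand in for ETL output, the "model" is just the mean label, and the helper names (`split_train_test`, `train_mean_model`, `save_model`) are hypothetical.

```python
import json
import random
import statistics
import tempfile
from pathlib import Path


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle deterministically, then split into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_mean_model(train_rows):
    """Toy 'model': remember the mean label seen during training."""
    return {"mean_label": statistics.fmean(label for _, label in train_rows)}


def save_model(model, model_dir, version):
    """Save the model under a version-specific file name."""
    path = Path(model_dir) / f"model-v{version}.json"
    path.write_text(json.dumps(model))
    return path


# Hypothetical in-memory data standing in for ETL / feature extraction output.
rows = [(x, 2.0 * x) for x in range(10)]
train, test = split_train_test(rows)
model = train_mean_model(train)
saved_path = save_model(model, tempfile.mkdtemp(), version=1)
print(len(train), len(test), saved_path.name)
```

In a real pipeline each stage would typically run on a different engine (for example Spark or Hive for pre-processing and a distributed trainer for the model), which is the gap a unified platform aims to close.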
- 7. Data Scientist
• Expert in ML algorithms, models, libraries, and feature engineering.
• Needs tools and platforms to gain insight into data, build models productively, and create ML pipelines (such as data labeling, transformation, etc.).
• Mostly familiar with Python, Spark, Hive, etc.
• Not familiar with platform internals.
- 8. What Do Data Scientists Expect? (Cont.)
Model Exploration
• Pre-process using Spark/Hive (or some small-scale alternatives).
• Experiment using a sampled dataset with notebooks (single node).
• Experiment with the full dataset to get the best results (distributed).
Reproducible Experiment
• Record the parameters, code, and metrics of each experiment.
• Dependency management: code once, run everywhere.
• Easy parameter fine-tuning, AutoML.
Model Management
• Easy to manage models and push them to production.
• Model assurance and monitoring.
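As an illustration of the "Reproducible Experiment" items on this slide, the sketch below bundles parameters, a hash of the training code, and metrics into one record. It is a minimal sketch, not a Submarine API: `record_experiment` and its field names are hypothetical, and the parameter and metric values are placeholders rather than real results.

```python
import hashlib
import json
import time


def record_experiment(params, code_text, metrics):
    """Bundle what is needed to reproduce and compare a run."""
    return {
        "timestamp": time.time(),
        "params": params,
        # Hashing the code makes it easy to tell whether two runs used the same script.
        "code_sha256": hashlib.sha256(code_text.encode("utf-8")).hexdigest(),
        "metrics": metrics,
    }


# Placeholder values for illustration only.
record = record_experiment(
    params={"learning_rate": 0.01, "batch_size": 64},
    code_text="print('training script contents go here')",
    metrics={"accuracy": 0.9},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such records next to the saved model is one simple way to connect an experiment to the model that is later pushed to production.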
- 9. What Are Data Scientists NOT Expected to Know?
• Deep understanding of resource management system concepts in YARN/Kubernetes (how the capacity scheduler works, k8s operations, etc.)
• Compute engine tuning (memory configuration, shuffle performance)
• Nitty-gritty details of the underlying infra; it should just work
- 12. Submarine “Hello World”
java -cp hadoop-submarine-<version>.jar submarine.cli job run \
  --framework tensorflow \
  --name tf-job-001 \
  --input_path "hdfs://default/dataset/cifar-10-data" \
  --checkpoint_path "hdfs://default/tmp/cifar-10-jobdir" \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker … " \
  --num_ps 1 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps ..."
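When jobs like this are submitted from scripts, it can help to assemble the argument list programmatically. The sketch below does only that: it uses the flags shown on this slide and nothing else, it does not invoke Submarine, and `build_job_run_args` plus the `roles` dictionary layout are hypothetical helpers, not part of the Submarine CLI. The launch commands remain placeholders, as on the slide.

```python
import shlex


def build_job_run_args(framework, name, input_path, checkpoint_path, roles):
    """Assemble the 'job run' arguments using only the flags shown on the slide."""
    args = [
        "job", "run",
        "--framework", framework,
        "--name", name,
        "--input_path", input_path,
        "--checkpoint_path", checkpoint_path,
    ]
    # (role prefix, count flag) pairs exactly as they appear on the slide.
    for role, count_flag in (("worker", "--num_workers"), ("ps", "--num_ps")):
        spec = roles[role]
        args += [
            count_flag, str(spec["count"]),
            f"--{role}_resources", spec["resources"],
            f"--{role}_launch_cmd", spec["launch_cmd"],
        ]
    return args


args = build_job_run_args(
    framework="tensorflow",
    name="tf-job-001",
    input_path="hdfs://default/dataset/cifar-10-data",
    checkpoint_path="hdfs://default/tmp/cifar-10-jobdir",
    roles={
        "worker": {"count": 2, "resources": "memory=8G,vcores=2,gpu=2",
                   "launch_cmd": "cmd for worker ..."},  # placeholder
        "ps": {"count": 1, "resources": "memory=4G,vcores=2,gpu=0",
               "launch_cmd": "cmd for ps ..."},  # placeholder
    },
)
# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(args))
```

Keeping the arguments as a list (rather than one string) avoids quoting mistakes if the list is later passed to a process launcher.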
- 21. Current State
Submarine v0.1 released after Apache Hadoop v3.2.0
Submarine v0.2 released
• PyTorch support
• Thin and uber jar
• Added LinkedIn TonY execution runtime (Hadoop 2.7.3+ compatibility)
• Zeppelin Submarine interpreter
• Mini-submarine (All-In-One image)
- 22. Future
• We are working with the Hadoop/Apache community to spin off Submarine into a new Apache project.
• Some more features we are working on:
Task                              | JIRA                        | Target Version
Submarine Workbench (Web/Server)  | SUBMARINE-98/SUBMARINE-131  | 0.3.0
Submarine Kubernetes Support      | SUBMARINE-154               | 0.3.0
Metrics Support (Like MLflow)     | TBD                         | TBD
- 24. Development Team
Community Members
Hadoop Community PMC & Committer
Zeppelin Community PMC & Committer
Cloudera: Wangda Tan, Zhankun Tang, Sunil Govind …
NetEase: Xun Liu, Quan Zhou
LinkedIn: Keqiu Hu
Alibaba: Jeff Zhang
Ke.com: Guoxian Zhao, Feng Liu, Huiyang Jian
JD: Wanqiang Ji
Dahua: Linhao Zhu
- 25. Community Use Cases
NetEase
• One of the largest online game/news/music providers in China.
• A 245-GPU cluster runs Submarine.
• One of the models built is a music recommendation model that is invoked 1B+ times per day.
LinkedIn
• 250+ GPU machines
• 500+ TensorFlow trainings/day.
• Serves applications in recommendation systems and NLP.
• Collaboration on runtime and SDK development.
Ke.com
• Largest online real-estate brokerage website in China.
• 50+ GPU machines (including 19 multi-V100 GPU machines), based on Hadoop trunk (3.3.0).
• Serves applications such as image/voice recognition.
- 26. Thank you!
Please join the community!
Website:
https://hadoop.apache.org/submarine/
Weekly Community Meeting:
https://docs.google.com/document/d/1XkrcyVil_ORV1UP-JhosGzK8qWGXXX3wuplo4RtC7u0/edit
Code:
https://github.com/apache/hadoop