Apache Submarine: Unified Machine Learning Platform

Apache Submarine
Unified Machine Learning Platform
Keqiu Hu
Staff Software Engineer @ LinkedIn
Wangda Tan
Apache Hadoop PMC Member
Sr. Engineering Manager @ Cloudera

Agenda
• ML in Production & Requirements
• What is Apache Submarine?
• Demo
• Current state and future

Machine Learning In Production

What is included in an ML training lifecycle?

Data Pipeline For Machine Learning
• ETL
• Data exploration
• Join / sampling / feature extraction / train and test data set split, etc.
• Model training
• Model saving, versioning, etc.
• Model deployment (online serving)
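As a minimal sketch of the "split train, test data set" stage above, the snippet below shuffles rows with a fixed seed so the split is repeatable. The function name `split_train_test` and its defaults are illustrative assumptions, not part of Submarine.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and return (train, test) lists.

    Hypothetical helper for illustration only; not a Submarine API.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Using a fixed seed means the same input always yields the same split, which helps when comparing experiments.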

Data Scientist
• Expert in ML algorithms, models, libraries, and feature engineering.
• Needs tools and platforms to gain insight into data, build models productively, and create ML pipelines (such as data labeling, transformation, etc.).
• Mostly familiar with Python, Spark, Hive, etc.
• Not familiar with platform details.

What Do Data Scientists Expect? (Cont)
Model Exploration
• Pre-process data using Spark/Hive (or small-scale alternatives).
• Experiment on a sampled dataset with notebooks (single node).
• Experiment on the full dataset to get the best results (distributed).
Reproducible Experiment
• Record the parameters, code, and metrics of each experiment.
• Dependency management: code once, run everywhere.
• Easy parameter fine-tuning, AutoML.
Model Management
• Easy to manage models and push them to production.
• Model assurance and monitoring.
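To illustrate what "record parameters, code, metrics of experiment" could look like in practice, here is a minimal sketch that bundles the three into one JSON document. The helper names and record fields are assumptions made for this example; they do not describe Submarine's own metric store.

```python
import hashlib
import json


def make_experiment_record(name, params, metrics, code_text):
    """Bundle parameters, metrics, and a fingerprint of the code.

    Hypothetical helper; the field names are illustrative only.
    """
    return {
        "name": name,
        "params": dict(params),
        "metrics": dict(metrics),
        # Hashing the source text ties the results to an exact code version.
        "code_sha256": hashlib.sha256(code_text.encode("utf-8")).hexdigest(),
    }


def to_json(record):
    """Serialize with sorted keys so identical records produce identical text."""
    return json.dumps(record, sort_keys=True, indent=2)


record = make_experiment_record(
    name="tf-job-001",
    params={"learning_rate": 0.01, "batch_size": 128},
    metrics={"accuracy": 0.91},
    code_text="print('training script placeholder')",
)
text = to_json(record)
print(text)
```

Storing a code hash next to the parameters and metrics makes it possible to tell later whether two runs used the same script.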

What Do Data Scientists NOT Expect to Know?
• Deep understanding of resource management concepts in YARN/Kubernetes (how the capacity scheduler works, k8s operations, etc.)
• Compute engine tuning (memory configuration, shuffle performance)
• Nitty-gritty details of the underlying infrastructure; it should just work

Submarine Architecture
[Architecture diagram. Components shown: Submarine Workbench, Projects, Model, Data Hub, Metric Store, Notebook, SDK, Java/Python/REST API, Mini-Submarine, Submarine Service, Compute Engine Connector, Cluster Orchestrator, Runtime]

Submarine "Hello World"
java -cp hadoop-submarine-<version>.jar submarine.cli job run \
  --framework tensorflow \
  --name tf-job-001 \
  --input_path "hdfs://default/dataset/cifar-10-data" \
  --checkpoint_path "hdfs://default/tmp/cifar-10-jobdir" \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker … " \
  --num_ps 1 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps ..."
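When a job is submitted from a script rather than typed by hand, the same command can be assembled from a dictionary of options. The sketch below does only that; `build_job_run_args` is a hypothetical helper, and the jar name, entry point, and flags are copied from the slide as-is, including its placeholders.

```python
import shlex


def build_job_run_args(jar, options):
    """Return the argument list for the 'job run' command shown on the slide.

    Hypothetical helper; it only assembles arguments and does not launch a job.
    """
    args = ["java", "-cp", jar, "submarine.cli", "job", "run"]
    for flag, value in options.items():
        args += ["--" + flag, str(value)]
    return args


options = {
    "framework": "tensorflow",
    "name": "tf-job-001",
    "input_path": "hdfs://default/dataset/cifar-10-data",
    "checkpoint_path": "hdfs://default/tmp/cifar-10-jobdir",
    "num_workers": 2,
    "worker_resources": "memory=8G,vcores=2,gpu=2",
    "worker_launch_cmd": "cmd for worker …",  # placeholder kept from the slide
    "num_ps": 1,
    "ps_resources": "memory=4G,vcores=2,gpu=0",
    "ps_launch_cmd": "cmd for ps ...",  # placeholder kept from the slide
}
args = build_job_run_args("hadoop-submarine-<version>.jar", options)
# shlex.join quotes each argument so the printed line is safe to paste in a shell.
print(shlex.join(args))
```

Passing the list to `subprocess.run(args)` would avoid shell quoting issues entirely, provided the `<version>` placeholder and launch commands are filled in first.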

Submarine training and inference demo

Current State
• Submarine v0.1 released after Apache Hadoop v3.2.0
• Submarine v0.2 released
  • PyTorch support
  • Thin and uber jar
  • Added LinkedIn TonY execution runtime (Hadoop 2.7.3+ compatibility)
  • Zeppelin Submarine interpreter
  • Mini-submarine (all-in-one image)

Future
• We are working with the Hadoop/Apache community to spin off Submarine into a new Apache project.
• Some more features we are working on:

Task                               JIRA                         Target Version
Submarine Workbench (Web/Server)   SUBMARINE-98/SUBMARINE-131   0.3.0
Submarine Kubernetes Support       SUBMARINE-154                0.3.0
Metrics Support (Like MLflow)      TBD                          TBD

Development Team
Community Members
Hadoop Community PMC & Committer
Zeppelin Community PMC & Committer
Cloudera Wangda Tan, Zhankun Tang, Sunil Govind …
NetEase Xun Liu, Quan Zhou
LinkedIn Keqiu Hu
Alibaba Jeff Zhang
Ke.com Guoxian Zhao, Feng Liu, Huiyang Jian
JD Wanqiang Ji
Dahua Linhao Zhu

Community Use Cases
NetEase
• One of the largest online game/news/music providers in China.
• A 245-GPU cluster runs Submarine.
• One of the models built is a music recommendation model that is invoked 1B+ times per day.
LinkedIn
• 250+ GPU machines.
• 500+ TensorFlow training jobs per day.
• Serves applications in recommendation systems and NLP.
• Collaboration on runtime and SDK development.
Ke.com
• Largest online real-estate brokerage website in China.
• 50+ GPU machines (including 19 multi-V100 GPU machines), based on Hadoop trunk (3.3.0).
• Serves applications like image/voice recognition, etc.

Thank you!
Please join the community!
Website:
https://hadoop.apache.org/submarine/
Weekly Community Meeting:
https://docs.google.com/document/d/1XkrcyVil_ORV1UP-JhosGzK8qWGXXX3wuplo4RtC7u0/edit
Code:
https://github.com/apache/hadoop
