Apache Submarine
Unified Machine Learning Platform
Keqiu Hu
Staff Software Engineer @ LinkedIn
Wangda Tan
Apache Hadoop PMC Member
Sr. Engineering Manager @ Cloudera
Agenda
ML in Production & Requirements
What is Apache Submarine?
Demo
Current state and future
Machine Learning In Production
Machine Learning in Tutorials
What is included in an ML training lifecycle?
Data Pipeline For Machine Learning
ETL
Data Exploration
Join / Sampling / Feature Extraction
Split into train and test data sets, etc.
Model Training
Model Saving, Versioning, etc.
Model Deployment (Online Serving)
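The middle stages of the pipeline above (split, train, save with a version) can be sketched in a few lines of plain Python. This is an illustrative toy, not Submarine code: the function names, the mean-label "model", and the `model-v<version>.json` naming scheme are all assumptions made for the example.

```python
import json
import random
from pathlib import Path


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_mean_model(train_rows):
    """Toy training step: the 'model' only stores the mean label."""
    labels = [label for _, label in train_rows]
    return {"mean_label": sum(labels) / len(labels)}


def save_model(model, model_dir, version):
    """Write the model to a version-specific file (hypothetical naming scheme)."""
    path = Path(model_dir) / f"model-v{version}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(model))
    return path
```

In a real pipeline each function would be replaced by a Spark/Hive job, a distributed training job, and a model registry, but the hand-offs between stages keep the same shape.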
Data Scientist
• Expert in ML algorithms, models, libraries, and feature engineering.
• Needs tools and platforms to gain insight into data, build models productively, and create ML pipelines (e.g. data labeling, transformation).
• Mostly familiar with Python, Spark, Hive, etc.
• Not familiar with platform internals.
What Do Data Scientists Expect? (Cont.)
Model Exploration
• Pre-process using Spark/Hive (or small-scale alternatives).
• Experiment on a sampled dataset with notebooks (single node).
• Experiment on the full dataset to get the best results (distributed).
Reproducible Experiment
• Record the parameters, code, and metrics of each experiment.
• Dependency management: code once, run everywhere.
• Easy parameter fine-tuning, AutoML.
Model Management
• Easy to manage models and push them to production.
• Model assurance and monitoring.
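To make the "record parameters, code, metrics" expectation concrete, here is a minimal sketch of an experiment record. It is not Submarine's API: the field names, the 12-character ID, and the idea of hashing parameters together with a code version string are assumptions chosen for illustration.

```python
import hashlib
import json
import time


def experiment_id(params, code_version):
    """Derive a stable ID from the parameters and code version of a run."""
    payload = json.dumps({"params": params, "code": code_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record_experiment(params, code_version, metrics):
    """Bundle what is needed to reproduce a run and compare it with others."""
    return {
        "id": experiment_id(params, code_version),
        "params": params,
        "code_version": code_version,
        "metrics": metrics,
        "recorded_at": time.time(),
    }
```

Because the ID depends only on the parameters and the code version, two runs with the same configuration share an ID, which makes reruns easy to spot even when their metrics differ.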
What Should Data Scientists NOT Be Expected to Know?
• A deep understanding of resource management concepts in YARN/Kubernetes (how the capacity scheduler works, k8s operations, etc.)
• Compute engine tuning (memory configuration, shuffle performance)
• Nitty-gritty details of the underlying infrastructure; it should just work.
What is Submarine?
Submarine Architecture
Submarine Workbench
Projects
Model
Data Hub
Notebook
Metric Store
Submarine Service
SDK
Java/Python/REST API
Mini-Submarine
Compute Engine
Connector
Cluster Orchestrator
Runtime
Submarine “Hello World”
java -cp hadoop-submarine-<version>.jar submarine.cli job run \
  --framework tensorflow \
  --name tf-job-001 \
  --input_path "hdfs://default/dataset/cifar-10-data" \
  --checkpoint_path "hdfs://default/tmp/cifar-10-jobdir" \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker … " \
  --num_ps 1 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps ..."
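When jobs like this are submitted from scripts, it can help to assemble the argument list programmatically. The sketch below is a hypothetical helper, not part of the Submarine SDK: `parse_resources` and `build_job_args` are names invented here, and the only flags used are the ones that appear in the command above.

```python
def parse_resources(spec):
    """Parse a string such as 'memory=8G,vcores=2,gpu=2' into a dict."""
    resources = {}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        resources[key.strip()] = value.strip()
    return resources


def build_job_args(name, framework, input_path, checkpoint_path,
                   num_workers, worker_resources, worker_launch_cmd,
                   num_ps=0, ps_resources=None, ps_launch_cmd=None):
    """Assemble the arguments that follow the jar and CLI class name."""
    args = [
        "job", "run",
        "--framework", framework,
        "--name", name,
        "--input_path", input_path,
        "--checkpoint_path", checkpoint_path,
        "--num_workers", str(num_workers),
        "--worker_resources", worker_resources,
        "--worker_launch_cmd", worker_launch_cmd,
    ]
    if num_ps > 0:
        # Parameter servers are optional, but need both settings when used.
        if ps_resources is None or ps_launch_cmd is None:
            raise ValueError("ps_resources and ps_launch_cmd are required when num_ps > 0")
        args += [
            "--num_ps", str(num_ps),
            "--ps_resources", ps_resources,
            "--ps_launch_cmd", ps_launch_cmd,
        ]
    return args
```

The returned list can be appended to the `java -cp ...` prefix and passed to `subprocess.run` without shell quoting; `parse_resources` is useful for sanity-checking a resource string before submission.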
Demo: Mini Submarine
Mini-Submarine demo
Demo: Zeppelin integration
Submarine training and inference demo
Demo: Submarine Workbench
Submarine Workbench
Current state
Current State and Future
Current State
Submarine v0.1 released after Apache Hadoop v3.2.0
Submarine v0.2 released
PyTorch support
Thin and uber jar
Added LinkedIn TonY execution runtime (Hadoop 2.7.3+ compatibility)
Zeppelin Submarine interpreter
Mini-submarine (All-In-One image)
Future
• We are working with the Hadoop/Apache community to spin off Submarine into a new Apache project.
• Some more features we are working on:
Task                             | JIRA                       | Target Version
Submarine Workbench (Web/Server) | SUBMARINE-98/SUBMARINE-131 | 0.3.0
Submarine Kubernetes Support     | SUBMARINE-154              | 0.3.0
Metrics Support (like MLflow)    | TBD                        | TBD
Submarine Community
Development Team
Community Members
Hadoop Community: PMC & Committer
Zeppelin Community: PMC & Committer
Cloudera: Wangda Tan, Zhankun Tang, Sunil Govind …
NetEase: Xun Liu, Quan Zhou
LinkedIn: Keqiu Hu
Alibaba: Jeff Zhang
Ke.com: Guoxian Zhao, Feng Liu, Huiyang Jian
JD: Wanqiang Ji
Dahua: Linhao Zhu
Community Use Cases
NetEase
• One of the largest online game/news/music providers in China.
• A 245-GPU cluster runs Submarine.
• One of the models built is a music recommendation model, which is invoked 1B+ times per day.
LinkedIn
• 250+ GPU machines.
• 500+ TensorFlow training jobs per day.
• Serves applications in recommendation systems and NLP.
• Collaboration on runtime and SDK development.
Ke.com
• Largest online real-estate brokerage website in China.
• 50+ GPU machines (including 19 multi-V100 GPU machines), based on Hadoop trunk (3.3.0).
• Serves applications such as image and voice recognition.
Thank you!
Please join the community!
Website: https://hadoop.apache.org/submarine/
Weekly Community Meeting: https://docs.google.com/document/d/1XkrcyVil_ORV1UP-JhosGzK8qWGXXX3wuplo4RtC7u0/edit
Code: https://github.com/apache/hadoop
