Ele.me TensorFlow Deep Learning Platform: elearn

AI
Machine Learning
Deep Learning

MapReduce
Hadoop & Spark
[Diagram: Data -> map -> five Executors -> reduce -> Result]

MapReduce
Hadoop & Spark
“In parallel computing, an embarrassingly parallel workload or problem
(also called perfectly parallel or pleasingly parallel) is one where little or
no effort is needed to separate the problem into a number of parallel
tasks.”
—— Wikipedia
https://en.wikipedia.org/wiki/Embarrassingly_parallel
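To make the map/reduce idea concrete, here is a minimal single-process sketch of a word count. It is not Hadoop or Spark code; the function names `map_chunk` and `merge` and the sample chunks are made up for illustration. Each `map_chunk` call touches only its own chunk, which is what makes the workload embarrassingly parallel.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words of one chunk, independently of all other chunks."""
    return Counter(chunk.split())


def merge(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Illustrative input; in a real cluster each chunk would live on a different executor.
chunks = ["spark hadoop spark", "hadoop hive", "spark"]

partials = [map_chunk(c) for c in chunks]      # could run in parallel, no shared state
result = reduce(merge, partials, Counter())    # combine the partial results
```

Because the map calls share no state, the list comprehension could be swapped for a process pool or a cluster scheduler without changing `map_chunk` or `merge`.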

[Diagram: Training Data, Validation Data, and Test Data -> train (cpu/gpu) -> Model, marked ✔ or ✘ -> Serving <- request]
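The three data sets in the diagram can be produced with a simple shuffled split. The sketch below is only an assumption about how such a split might look; the ratios, the seed, and the name `split_dataset` are illustrative and do not come from the slides.

```python
import random


def split_dataset(samples, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle the samples and split them into train, validation, and test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)          # deterministic shuffle for repeatability
    n_val = int(len(items) * val_ratio)
    n_test = int(len(items) * test_ratio)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


train, val, test = split_dataset(range(100))
```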

How does TensorFlow work
[Diagram: front ends (Python, Golang, Java, …) build a Graph (DAG); the C++ Core compiles and runs the Graph (DAG)]
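The "build a graph first, run it later" idea can be illustrated with a toy evaluator in plain Python. This is not TensorFlow's API; `Node`, `constant`, `add`, `mul`, and `run` are hypothetical names used only to show deferred execution over a DAG.

```python
class Node:
    """One vertex of the graph: an operation plus the nodes it depends on."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, cache=None):
    """Evaluate a node by evaluating its inputs first; shared nodes are computed once."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "const":
        result = node.value
    else:
        left, right = (run(i, cache) for i in node.inputs)
        result = left + right if node.op == "add" else left * right
    cache[id(node)] = result
    return result


x = constant(3)
y = constant(4)
z = add(mul(x, y), x)   # only describes the computation; nothing is evaluated yet
output = run(z)         # evaluation happens here, analogous to running the graph
```

In the slide's picture, the graph-building part is what the language bindings do, and the `run` part corresponds to the C++ core.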

[Diagram, two stages joined by Distributed Storage (NFS, HDFS, S3, …):
Preprocess (Hive, Spark, Storm, …): Data -> map -> five Executors -> reduce -> Result
Training (TensorFlow, Distributed Training): Training Data, Validation Data, and Test Data -> train (cpu/gpu) -> Model, marked ✔ or ✘ -> Serving <- request]

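• TaaS (TensorFlow as a Service)
• Started at the end of August 2016
• Inspired by Google Cloud's CloudML product
• Lets algorithm engineers focus on algorithms and leaves everything else to elearn
[Diagram labels: distributed storage, CPU, GPU, elastic demand, IP and port management for services, lifecycle of large numbers of containers, API]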

Datastore
• Abstraction of data
• Compatible with many types of distributed storage
• Decoupling computing and storage
• Bring your own storage
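One way to picture "abstraction of data" and "decoupling computing and storage" is an interface that training code talks to, with one implementation per storage backend. The sketch below is an assumption, not elearn's actual design; `Datastore`, `LocalDatastore`, and `save_checkpoint` are hypothetical names.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Datastore(ABC):
    """Hypothetical storage interface; each backend supplies its own implementation."""

    @abstractmethod
    def read(self, path):
        """Return the bytes stored at a path relative to the datastore root."""

    @abstractmethod
    def write(self, path, data):
        """Store bytes at a path relative to the datastore root."""


class LocalDatastore(Datastore):
    """Backed by a local or mounted directory (for example an NFS mount point)."""

    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path)

    def read(self, path):
        with open(self._full(path), "rb") as handle:
            return handle.read()

    def write(self, path, data):
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(data)


def save_checkpoint(store, step, blob):
    """Compute-side code depends only on the Datastore interface, not on the backend."""
    store.write("checkpoints/step-%d.bin" % step, blob)


store = LocalDatastore(tempfile.mkdtemp())
save_checkpoint(store, 100, b"weights")
```

"Bring your own storage" then amounts to adding another `Datastore` subclass without touching `save_checkpoint`.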

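Datastore
Storage | Client | Drawbacks | Use cases
NFS | nfs-utils | Performance problems, permission management | Storing training data, model training
S3 compatible | s3fs-fuse | Cannot create empty directories (mkdir) | Storing training data
S3 compatible | MinFS | Cannot create symlinks | Storing training data
GlusterFS | - | - | -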


GPU with Docker
• GPU memory usage needs care
https://www.tensorflow.org/tutorials/using_gpu
• GPU Docker images
https://hub.docker.com/r/nvidia/cuda/
• nvidia-docker is not required
CUDA_SO, NVIDIA_SO, DEVICES, …
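Running without nvidia-docker generally means passing the GPU device nodes and the driver libraries into the container by hand. The sketch below assumes that the names on the slide stand for exactly that: device flags and library mounts. The helper `docker_gpu_args` and the concrete paths are illustrative assumptions; only `--device` and `-v` are standard `docker run` flags.

```python
def docker_gpu_args(device_paths, library_paths):
    """Build `docker run` arguments exposing GPU devices and driver libraries."""
    args = []
    for dev in device_paths:
        args += ["--device", "%s:%s" % (dev, dev)]   # expose the device node as-is
    for lib in library_paths:
        args += ["-v", "%s:%s" % (lib, lib)]         # bind-mount the host driver library
    return args


# Assumed host paths; the real ones depend on the driver installation.
devices = ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"]
libs = ["/usr/lib/x86_64-linux-gnu/libcuda.so.1"]

command = ["docker", "run", "-it"] + docker_gpu_args(devices, libs) + ["nvidia/cuda"]
```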

GPU with Kubernetes
--feature-gates="Accelerators=true"
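With that feature gate enabled, a container requests GPUs through its resource limits. The sketch below builds such a Pod manifest as a Python dict. The resource name `alpha.kubernetes.io/nvidia-gpu` is the alpha-era name that, to my knowledge, went with the Accelerators gate; treat it as an assumption and check it against your Kubernetes version. The pod name and image are illustrative.

```python
import json


def gpu_pod_manifest(name, image, gpus):
    """Return a Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # Assumed alpha resource name for the Accelerators feature gate.
                "resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": gpus}},
            }],
        },
    }


manifest = gpu_pod_manifest("tf-worker-0", "tensorflow/tensorflow:latest-gpu", 1)
manifest_json = json.dumps(manifest, indent=2)   # could be piped to `kubectl create -f -`
```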

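TensorFlow's predecessor
DistBelief
• Solved the problem of data larger than the maximum GPU memory
• The largest model mentioned in the paper reached
1.7 billion parameters, utilizing 81 machines, delivering a 12x speedup.
• Its drawbacks:
• Good at image recognition, but a poor fit for other machine learning models, and no mobile support
• Maintaining the large-scale system became a burden; it lacked abstraction.
https://research.google.com/pubs/pub40565.html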

TensorFlow Ecosystem: Integrating TensorFlow with Your Infrastructure — Derek Murray, Jonathan Hseu

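👋 Manually managed resources
• IP or DNS
• port
• task ID
• CPU/GPU ID
😫 The maddening routine
Log in to 10 servers
Mount the shared storage 10 times
Carefully allocate GPU resources
Carefully design 10 launch commands
Type those 10 different launch commands
Manually start a TensorBoard
Manually create directories
Manage model versions by hand with mkdir + cp
……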
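The bookkeeping listed above (addresses, ports, task IDs, one distinct command per machine) is mechanical, which is why a platform can generate it. The sketch below is a rough illustration, not elearn's code. It assumes a parameter-server/worker layout and a hypothetical training script `train.py` that accepts `--ps_hosts`, `--worker_hosts`, `--job_name`, and `--task_index` flags; the IP addresses and port are made up.

```python
def build_cluster(ps_hosts, worker_hosts, port=2222):
    """Map each job name to its list of host:port task addresses."""
    return {
        "ps": ["%s:%d" % (host, port) for host in ps_hosts],
        "worker": ["%s:%d" % (host, port) for host in worker_hosts],
    }


def launch_commands(cluster, script="train.py"):
    """Generate one launch command per task; they differ only in job name and task index."""
    commands = []
    for job, addresses in sorted(cluster.items()):
        for index in range(len(addresses)):
            commands.append(
                "python %s --ps_hosts=%s --worker_hosts=%s --job_name=%s --task_index=%d"
                % (script, ",".join(cluster["ps"]), ",".join(cluster["worker"]), job, index)
            )
    return commands


# One parameter server and nine workers: ten machines, ten different commands.
cluster = build_cluster(["10.0.0.1"], ["10.0.0.%d" % i for i in range(2, 11)])
commands = launch_commands(cluster)
```

Every command still has to be run on the right machine with the right GPU visible, which is the part a scheduler such as Kubernetes takes over.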

Good grief! And that is just one training run.
How are we supposed to train every week, or even every day?!

TensorFlow (GPU/CPU) cluster benchmark
[Chart: throughput of master + worker clusters]
• GPU clusters: 3, 6, and 9 GPUs in total; the 6-GPU cluster is 1 master with 2 GPUs plus 4 workers with 1 GPU each
• GPU results shown on the chart: 11, 28, and 45 global steps/s
• CPU cluster: 9 CPUs in total (1 master with 2 CPUs plus 7 workers with 1 CPU each), 5 global steps/s

[Diagram: Cluster (Monthly Training) writes versions of the Recommendation Model (model - 2017-07-23, model - 2017-07-16, model - 2017-07-09, model - 2017-07-02) to the System Datastore; a model is copied (Copy) to the User Defined Datastore; Online Serving downloads it (Download), uses the TensorFlow Golang/Java binding to load the model, and is reached over GRPC]
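With date-stamped model versions like the ones in the diagram, choosing which version to serve can be a simple comparison. The sketch below assumes that serving wants the newest version and that version names are plain ISO dates; both are assumptions for illustration, and `latest_version` is a hypothetical helper.

```python
from datetime import datetime


def latest_version(version_names):
    """Return the newest version name, parsing each name as a YYYY-MM-DD date."""
    return max(version_names, key=lambda name: datetime.strptime(name, "%Y-%m-%d"))


versions = ["2017-07-02", "2017-07-09", "2017-07-16", "2017-07-23"]
newest = latest_version(versions)
```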

Tailor-made for TensorFlow
Making elearn support other deep learning frameworks would be easy:
just write a driver. But I will not do that.

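Make the most of Kubernetes' distinctive features and orchestration capabilities
Deployment, Job, StatefulSet, ConfigMap, Nginx Ingress, …
Making elearn support other clouds would also be easy: make the interfaces generic enough,
then write a driver. But that would defeat the point of using Kubernetes,
so the elearn cloud interfaces are designed around features.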

Turn Kubernetes into the OS
Every small operation is dispatched as a Job
For example, the post run after training completes and saving the modelVersion are both Kubernetes Jobs;
they do not run on the elearn server.
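As an illustration of dispatching a small operation as a Job, the sketch below builds a `batch/v1` Job manifest for a post-run step. It is an assumption about what such a manifest could look like, not elearn's code; the helper `post_run_job`, the image name, and the command are all hypothetical.

```python
def post_run_job(training_id, image, command):
    """Return a Kubernetes Job manifest that runs one short post-training task."""
    name = "post-run-%s" % training_id
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",   # a Job pod must use Never or OnFailure
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                    }],
                },
            },
        },
    }


# Hypothetical image and command, for illustration only.
job = post_run_job("42", "example/elearn-tools:latest", ["save-model-version", "--training=42"])
```

The server only has to submit this manifest and watch the Job's status; the work itself runs wherever Kubernetes schedules the pod.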

I'm Jiang Jun (江骏) / ohmystack
@Ele.me
https://github.com/ohmystack
http://weibo.com/jiangjun1990
http://ohmystack.com/
Recommended projects
dt (docker-tool)
Best feature: dt ssh
gotool
A handy tool for managing GOPATH in a Golang development environment

Follow the Ele.me tech community
