Distributed TensorFlow
on Kubernetes
Information and Communications Research Laboratories — Mac Chiang (蔣是文)
Copyright 2017 ITRI (Industrial Technology Research Institute)
Agenda
• Kubernetes Introduction
• Scheduling GPUs on Kubernetes
• Distributed TensorFlow Introduction
• Running Distributed TensorFlow on Kubernetes
• Experience Sharing
• Summary
What’s Kubernetes
• “Kubernetes” is Greek for helmsman or pilot
• Built on Google's experience and designed by Google
• Kubernetes is a production-grade, open-source platform that
orchestrates the placement (scheduling) and execution of
application containers within and across computer clusters.
Masters manage the cluster, while nodes host the running applications.
Why Kubernetes
• Automatic binpacking
• Horizontal scaling
• Automated rollouts and rollbacks
• Service monitoring
• Self-healing
• Service discovery and load balancing
• 100% Open source, written in Go
Scheduling GPUs on Kubernetes
• NVIDIA drivers must be installed on the nodes
• Turn on the alpha feature gate Accelerators across the system
▪ --feature-gates="Accelerators=true"
• Nodes must use the Docker engine as the container runtime
Scheduling GPUs on Kubernetes (cont.)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container-1
    image: gcr.io/google_containers/pause:2.0
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2 # requesting 2 GPUs
  - name: gpu-container-2
    image: gcr.io/google_containers/pause:2.0
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 3 # requesting 3 GPUs
Scheduling on Different GPU Versions
• Label nodes with the GPU hardware type
• Specify the GPU type via Node Affinity rules (a sketch follows the diagram below)
[Diagram: Node1 has 2 × Tesla K80 GPUs and is labeled gpu: k80; Node2 has 1 × Tesla P100 GPU and is labeled gpu: p100]
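A hedged sketch of how this can be expressed. The node names and gpu labels mirror the diagram above; a simple nodeSelector is shown here, and a full nodeAffinity rule with matchExpressions works the same way:

# Label the nodes with their GPU hardware type first, e.g.:
#   kubectl label nodes node1 gpu=k80
#   kubectl label nodes node2 gpu=p100
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-k80
spec:
  nodeSelector:
    gpu: k80                              # schedule only onto nodes labeled gpu=k80
  containers:
  - name: gpu-container
    image: gcr.io/google_containers/pause:2.0
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1 # requesting 1 GPU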
Access to CUDA libraries from Docker
[Diagram: the container accesses libcuda.so provided by the host's nvidia-driver installation]
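With the alpha GPU support the container does not automatically see the host's driver libraries, so they are typically mounted in from the host. A minimal sketch, assuming the driver libraries live under /usr/lib/nvidia-384 on the host (this path varies by driver version and distribution):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-with-cuda-libs
spec:
  containers:
  - name: gpu-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-libraries             # makes libcuda.so and friends visible inside the container
      mountPath: /usr/local/nvidia/lib64
  volumes:
  - name: nvidia-libraries
    hostPath:
      path: /usr/lib/nvidia-384          # assumption: host path of the NVIDIA driver libraries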
TensorFlow
• Originally developed by the Google Brain Team
within Google's Machine Intelligence research
organization
• An open source software library for numerical
computation using data flow graphs
• Nodes in the graph represent mathematical
operations, while the graph edges represent the
multidimensional data arrays (tensors)
communicated between them.
• Supports one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
Distributed TensorFlow
http://www.inwinstack.com/index.php/zh/2017/04/17/tensorflow-on-kubernetes/
Distributed TensorFlow (cont.)
• Replication
▪ In-graph
▪ Between-graph
• Training
▪ Asynchronous
▪ Synchronous
Distributed TensorFlow on K8S
• TensorFlow ecosystem
▪ https://github.com/tensorflow/ecosystem
Distributed TensorFlow on K8S (cont.)
• Prepare the code for distributed training
▪ Flags for configuring the task
▪ Construct the cluster and start the server
▪ Set the device before graph construction

import tensorflow as tf

# Flags for configuring the task
flags = tf.app.flags
flags.DEFINE_integer("task_index", None,
                     "Worker task index, should be >= 0. task_index=0 is "
                     "the master worker task that performs the variable "
                     "initialization.")
flags.DEFINE_string("ps_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("worker_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("job_name", None, "job name: worker or ps")
FLAGS = flags.FLAGS

# Construct the cluster and start the server
ps_spec = FLAGS.ps_hosts.split(",")
worker_spec = FLAGS.worker_hosts.split(",")
cluster = tf.train.ClusterSpec({
    "ps": ps_spec,
    "worker": worker_spec})
server = tf.train.Server(
    cluster, job_name=FLAGS.job_name,
    task_index=FLAGS.task_index)
if FLAGS.job_name == "ps":
  server.join()

# Set the device before graph construction
with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:%d" % FLAGS.task_index,
    cluster=cluster)):
  # Construct the TensorFlow graph.
  # Run the TensorFlow graph.
Distributed TensorFlow on K8S (cont.)
• Build the docker image
▪ Prepare the Dockerfile
FROM tensorflow/tensorflow:latest-gpu
COPY mnist_replica.py /
ENTRYPOINT ["python", "/mnist_replica.py"]
▪ Build the docker image
docker build -t <image_name>:v1 -f Dockerfile .
docker build -t macchiang/mnist:v7 -f Dockerfile .
▪ Push the image to Docker Hub
docker push <image_name>:v1
docker push macchiang/mnist:v7
Distributed TensorFlow on K8S (cont.)
• My revision history on Docker Hub
▪ https://hub.docker.com/r/macchiang/mnist/tags/
Distributed TensorFlow on K8S (cont.)
• Specify parameters in the Jinja template file
▪ name, image, worker_replicas, ps_replicas, script, data_dir, and train_dir
▪ You may optionally specify credential_secret_name and
credential_secret_key if you need to read and write to Google Cloud
Storage
• Generate K8S YAML and create services and pods
▪ python render_template.py mnist.yaml.jinja | kubectl create -f -
command:
- "/mnist_replica.py"
args:
- "--data_dir=/data/raw-data"
- "--task_index=0"
- "--job_name=worker"
- "--worker_hosts=mnist-worker-0:5000,mnist-worker-1:5000"
- "--ps_hosts=mnist-ps-0:5000"
Distributed TensorFlow on K8S (cont.)
[Diagram: Worker0 and Worker1 pods are exposed through Services mnist-worker-0 and mnist-worker-1 on port 5000, and PS0 is exposed through Service mnist-ps-0 on port 5000; the workers read the training data from one NFS share and write the training results to another NFS share.]
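The Services in the diagram give each task a stable DNS name (mnist-worker-0, mnist-worker-1, mnist-ps-0) on port 5000. A minimal sketch of one such Service, assuming the pods carry the illustrative job/task labels used in the Pod sketch above:

apiVersion: v1
kind: Service
metadata:
  name: mnist-worker-0
spec:
  selector:
    job: worker                          # hypothetical pod labels
    task: "0"
  ports:
  - port: 5000                           # the port other tasks use to reach this worker
    targetPort: 5000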
Distributed TensorFlow with CPUs
• Container orchestration platform: 35 nodes, 35 containers
▪ Each node: 2 × Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48 GB memory
• ImageNet data (145 GB) read from NFS; training results written to NFS
• Inception model ("Rethinking the Inception Architecture for Computer Vision")
▪ Training took 9.23 days
Summary
• Kubernetes
▪ Production-grade container orchestration platform
▪ GPU resources management
a. NVIDIA GPUs only for now
b. In Kubernetes 1.8, you can use the NVIDIA device plugin (see the sketch after this list).
» https://github.com/NVIDIA/k8s-device-plugin
• Kubernetes + Distributed TensorFlow
▪ Easy to build a distributed training cluster
▪ Leverage Kubernetes advantages
a. Restarting failed containers
b. Monitoring
c. Scheduling
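As noted above, with the NVIDIA device plugin on Kubernetes 1.8 a pod requests GPUs through the nvidia.com/gpu resource instead of the alpha resource name. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-device-plugin
spec:
  containers:
  - name: gpu-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1                # requesting 1 GPU via the NVIDIA device plugin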
Thank you!
macchiang@itri.org.tw
Kubernetes Taiwan User Group
Editor's Notes

  1. https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-intro/
  2. In mid-April last year Google released TensorFlow 0.8, adding distributed computing support so that TensorFlow can run training across hundreds of machines and build all kinds of machine learning models, cutting model training that used to take days or weeks down to a few hours. A TensorFlow job can be split into multiple tasks with the same function, and jobs are divided into Parameter servers and Workers, whose roles are as follows: • Parameter server: updates the variables according to the gradients and stores them in tf.Variable; it can be thought of as storing only the model's variables and holding the Variable replicas. • Worker: usually called a compute node; it runs the compute-intensive graph operations, computes the gradients of the variables, and can also hold a copy of the graph. • Client: the process that builds the TensorFlow computation graph and creates the tensorflow::Session used to interact with the cluster, typically implemented in Python or C++; a single client can connect to several TF servers at once, and a TF server can likewise serve several clients. • Master Service: an RPC service process that connects remotely to a set of distributed devices; it provides the tensorflow::Session interface and communicates with the job's tasks through the Worker Service. • Worker Service: RPC logic that runs part of the graph on local devices (CPUs or GPUs), implemented through the worker_service.proto interface; every TensorFlow server includes the Worker Service logic.
  3. There are two modes for launching a distributed deep-learning training job in TensorFlow. One is in-graph replication: all of the neural network's parameters are kept in a single TensorFlow computation graph, and only the computation is spread across different compute servers. The other is between-graph replication: every compute server also creates the parameters, but the parameters are assigned to the parameter servers in a consistent way. Because in-graph replication handles very large amounts of data less well, between-graph replication is the more commonly used mode. In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker. Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before, using tf.train.replica_device_setter to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker. Deep-learning models are commonly trained with two distributed update schemes: synchronous updates and asynchronous updates. As shown in the slide above, in synchronous mode all servers read the same parameter values, compute the gradients, and then apply the updates together. In asynchronous mode each server reads the parameters, computes the gradients, and updates the parameters on its own, without synchronizing with the other servers. The biggest drawback of synchronous updates is that every server has to finish each step together, so fast servers wait for slow ones and resource utilization is somewhat lower; asynchronous updates, on the other hand, may apply stale gradients, which can hurt training quality. Each scheme has its own pros and cons, so it is hard to say which one is better in general; it depends on the specific problem. Asynchronous training. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above. Synchronous training. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and between-graph replication (e.g. using the tf.train.SyncReplicasOptimizer).