Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero at Big Data Spain 2017
DEEP LEARNING & MULTI-GPUs
Training Deep Learning Models on Multiple GPUs in the Cloud
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 Madrid
hablemos@beeva.com
www.beeva.com
ENRIQUE OTERO
enrique.otero@beeva.com
@beevalabs_eom
Data Scientist at BEEVA
hablemos@beeva.com | www.beeva.com
The intro: deep learning & GPUs
The training: challenges & benchmarks on image classification
The lessons: science, engineering, infrastructure & business
BIG DATA
CLOUD COMPUTING
MACHINE INTELLIGENCE
● INNOVATION LABS
● INNOVATION SERVICES
100%
+40% annual growth rate in last 4 years
+650 employees in Spain
+800 employees globally
WE MAKE COMPLEX THINGS SIMPLE
Deep Learning disruption
● Computer Vision
● Speech Recognition
● Machine Translation
● Ranking
Why now?
● more (labeled) data
● more computing power
● some tricks
Source: http://yann.lecun.com/exdb/lenet/
GPUs: Nvidia & CUDA
From shaders & hacks (OpenGL, DirectX) to general-purpose GPU computing
THANK YOU GAMERS!
Training times vs. accuracy
Accelerating training is essential!
Source: https://github.com/sailing-pmls/pmls-caffe/
Source: Canziani et al. 2017
● Stochastic Gradient Descent (SGD)
● Mini-batch SGD
Source: Andrew Ng.
Source: http://www.eeng.dcu.ie/~mcguinne/
Error (loss) function
Stochastic gradient descent
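For reference, here is a minimal NumPy sketch of mini-batch SGD on a toy squared-error (linear-model) loss; the learning rate, batch size and model are illustrative assumptions only, not anything from the talk.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=5):
    w = np.zeros(X.shape[1])                       # linear-model weights
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)           # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of 0.5 * ||X w - y||^2 averaged over the mini-batch.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                         # one SGD step per mini-batch
    return w
```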
Distributed training
Data parallel vs. model parallel
● Faster or larger models?
Asynchronous vs. synchronous
● Fast or precise?
Source: https://github.com/tensorflow/models/tree/master/research/inception
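To make the synchronous, data-parallel case concrete, here is a minimal NumPy sketch (not any particular framework's API): each replica computes gradients on its own shard of the global batch, the gradients are averaged (conceptually an all-reduce), and every replica applies the same update. The weight vector and the grad_fn loss-gradient function are hypothetical placeholders.

```python
import numpy as np

def sync_data_parallel_step(weights, shards, grad_fn, lr=0.01):
    # One gradient per "GPU": each shard is that worker's slice of the global batch.
    grads = [grad_fn(weights, X, y) for X, y in shards]
    avg_grad = np.mean(grads, axis=0)   # all-reduce: average gradients across workers
    return weights - lr * avg_grad      # every replica applies the identical update
```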
(Multi-node) third-party benchmarks
Seems quite good!!
Easy?
Source: https://chainer.org
(Multi-node) third-party benchmarks
ResNet152 (8 to 256 GPUs): 95% to 90% efficiency
AlexNet (8 to 256 GPUs): 78% to 53% efficiency
Source: MXNet on AWS, 16 x p2.16xlarge
(Multi-node) third-party benchmarks
Small print:
● High-speed connections!
● Synthetic data vs. real data
● Bottlenecks in hard disk
And more...
● accuracy penalization
● number of parameter servers
Source: tensorflow.org
Source: https://chainer.org
(Multi-node) third-party benchmarks
So let's start single-node multi-GPU in the cloud
Let’s begin: Tesla K80
K80 GPUs on:
● AWS p2: 1, 8 & 16
○ ready-to-go AMIs
● Azure NC: 1, 2 & 4
● Google Cloud Platform: 1 to 8
○ setup scripts
Let’s begin: MNIST & cloud
● Goal: saturate GPUs!
● Bottlenecks:
○ I/O Reads
○ Image pipelines
■ Decoding
■ Data augmentation
○ Communications:
■ efficient primitives: NCCL
■ Overlap with computation
■ QPI < PCIe < NVLink
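One common way to attack the I/O and image-pipeline bottlenecks above is to decode and augment in parallel and to prefetch batches so input preparation overlaps with GPU computation. A minimal sketch with the tf.data API; the file pattern, the decode_and_augment placeholder and the batch size are assumptions for illustration.

```python
import tensorflow as tf

def decode_and_augment(serialized_example):
    # Placeholder: parse the TFRecord, decode the JPEG and apply augmentation here.
    return serialized_example

def make_pipeline(file_pattern='train-*.tfrecord', batch_size=256):
    files = tf.data.Dataset.list_files(file_pattern)
    ds = tf.data.TFRecordDataset(files)
    ds = ds.map(decode_and_augment, num_parallel_calls=8)  # parallel decoding / augmentation
    ds = ds.shuffle(10000).batch(batch_size)
    return ds.prefetch(2)  # prefetch so the input pipeline overlaps with GPU compute
```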
Data pipeline bottlenecks
(internal) Interconnections matter
Azure NC24 (4 K80):
Google n1-highmem-32 with 8 K80:
GPUDirect & PCIe
GPUDirect support
(internal) Interconnections matter
More lessons: CIFAR10 on AWS
AWS p2.8x = 8x K80 GPUs
Synchronous data-parallel
After 8 epochs...
MXNet:
● validation accuracy = [0.77, 0.82]
TensorFlow:
● validation accuracy = [0.47, 0.59]
Batch sizes matter
● Larger batches reduce communication overhead
○ More throughput
● But degrade convergence.
○ Less accuracy!
Accuracy vs. throughput
Empirical workaround for learning rates:
● warm up: start small
● increase over 5 epochs...
● finish at #GPUs x lr
Scenario:
● NVLink, 50 Gbps Ethernet (> 15 Gbps)
● Caffe2 + NCCL + gloo
● Synchronous SGD + Momentum
Source: Facebook AI Research
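A minimal sketch of that empirical recipe (in the spirit of Facebook AI Research's large-minibatch SGD work): ramp linearly from the single-GPU learning rate up to #GPUs x lr over the first epochs, then stay at the scaled rate. The base rate, 8 GPUs and 5 warm-up epochs are illustrative assumptions.

```python
def learning_rate(epoch, base_lr=0.1, num_gpus=8, warmup_epochs=5):
    target_lr = base_lr * num_gpus                      # linear scaling rule: #GPUs x lr
    if epoch < warmup_epochs:                           # gradual warm-up: start small
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr
```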
Being practical: fine-tuning with MXNet
Source: https://mxnet.incubator.apache.org/how_to/finetune.html
Being practical: fine-tuning with MXNet
Scenario:
● p2.8x
● ResNet50
● batch size: 16 per GPU
● lr = lr_i x #GPUs
● 1 epoch
94% efficiency :)
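A hedged sketch of multi-GPU fine-tuning with MXNet's Module API, loosely based on the fine-tuning tutorial cited above. The checkpoint prefix, data file, 10-class output layer and hyperparameters are illustrative assumptions, not the talk's exact configuration.

```python
import mxnet as mx

num_gpus, per_gpu_batch = 8, 16
batch_size = per_gpu_batch * num_gpus          # global batch = 16 x #GPUs

# Pretrained ResNet-50 checkpoint (downloaded separately, hypothetical prefix).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)

# Replace the final classifier with a new one for the target task.
flatten = sym.get_internals()['flatten0_output']
fc_new = mx.sym.FullyConnected(data=flatten, num_hidden=10, name='fc_new')
net = mx.sym.SoftmaxOutput(data=fc_new, name='softmax')

# RecordIO training data (hypothetical file), decoded with several threads.
train_iter = mx.io.ImageRecordIter(path_imgrec='train.rec',
                                   data_shape=(3, 224, 224),
                                   batch_size=batch_size,
                                   shuffle=True, rand_mirror=True,
                                   preprocess_threads=16)

mod = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(num_gpus)])
mod.fit(train_iter,
        arg_params=arg_params, aux_params=aux_params, allow_missing=True,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.01 * num_gpus},  # lr scaled linearly with #GPUs
        num_epoch=1)
```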
Being practical: fine-tuning with MXNet
Scenario:
● p2.8x
● ResNet152
● batch size: 16, 32 per GPU
● lr = lr_i x #GPUs
● 1 epoch
● val-acc = [0.830, 0.836]
95% efficiency :)
< 1% accuracy loss
What about costs?
Tesla K80 price on premises
Source: amazon.com, November 2017
Tesla K80 prices on cloud
$1/h with per-second billing
Only ~$0.30/h on the AWS spot market
Buy one K80, or rent 4,000 to 12,000 GPU-hours for the same money!
Training ResNet50 on ImageNet-1K (100 epochs): $180 to $730
Fine-tuning (8 epochs): < $2
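The arithmetic behind figures like these is simply GPU-hours times the price per GPU-hour; the hour counts below are assumptions chosen only to show how a $180 to $730 range could arise, not measured values from the talk.

```python
def training_cost(gpu_hours, price_per_gpu_hour):
    return gpu_hours * price_per_gpu_hour

print(training_cost(600, 0.30))   # ~600 K80 GPU-hours at the ~$0.30/h spot price -> $180
print(training_cost(730, 1.00))   # ~730 K80 GPU-hours at the ~$1/h on-demand price -> $730
```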
2014 to 2017: from Kepler... to Volta!
Source: aws.amazon.com, October 2017 (New!)
And Tesla P100 (Pascal) in beta on Google Cloud Platform, September 2017 (New!)
(on-demand vs. spot prices)
Extra: NVIDIA optimized containers! (New!)
Extra: NVIDIA Volta on AWS P3 instances!
• Great performance!
• Cost-effective (on-demand)
• (still) scarce availability
Summary
SCIENCE
Batch sizes & learning rates matter!
● large batch sizes degrade convergence
● linear scaling rule & warm-up
ENGINEERING
Data pipeline matters!
● Data feed
● Overlap computation & communications
INFRASTRUCTURE
Architecture & bandwidth matter!
● Volta > Pascal > Kepler
● NVLink > PCIe > (25 Gbps) Ethernet
BUSINESS
Pricing matters!
● Cost-effective cloud instances in the spot market
THANKS FOR YOUR TIME
hablemos@beeva.com | www.beeva.com
And we’re hiring!