Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero at Big Data Spain 2017
DEEP LEARNING & MULTI-GPUs
Training Deep Learning Models on Multiple GPUs in the Cloud
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 Madrid
hablemos@beeva.com
www.beeva.com
ENRIQUE OTERO
enrique.otero@beeva.com
@beevalabs_eom
Data Scientist at BEEVA
hablemos@beeva.com | www.beeva.com
The intro: deep learning & GPUs
The training: challenges & benchmarks on image classification
The lessons: science, engineering, infrastructure & business
BIG DATA
CLOUD COMPUTING
MACHINE INTELLIGENCE
● INNOVATION LABS
● INNOVATION SERVICES
100%
+40% annual growth rate in last 4 years
+650 employees in Spain
+800 employees globally
WE MAKE COMPLEX THINGS SIMPLE
Deep Learning disruption
● Computer Vision
● Speech Recognition
● Machine Translation
● Ranking
Why now?
● more (labeled) data
● more computing power
● some tricks
Source: http://yann.lecun.com/exdb/lenet/
GPUs: Nvidia & CUDA
From shaders & hacks (OpenGL, DirectX) to general-purpose GPU computing
THANK YOU GAMERS!
Training times vs. accuracy
Accelerating training is essential!
Source: https://github.com/sailing-pmls/pmls-caffe/
Source: Canziani et al. 2017
● Stochastic Gradient Descent (SGD)
● Mini-batch SGD
Source: Andrew Ng.
Source: http://www.eeng.dcu.ie/~mcguinne/
Error (loss) function
Stochastic gradient descent
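For reference, here is a minimal NumPy sketch of mini-batch SGD on a toy squared-error (linear-model) loss; the learning rate, batch size and model are illustrative assumptions only, not anything from the talk.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=5):
    w = np.zeros(X.shape[1])                       # linear-model weights
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)           # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of 0.5 * ||X w - y||^2 averaged over the mini-batch.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                         # one SGD step per mini-batch
    return w
```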
Distributed training
Data parallel vs. model parallel
● Faster or larger models?
Asynchronous vs. synchronous
● Fast or precise?
Source: https://github.com/tensorflow/models/tree/master/research/inception
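To make the synchronous, data-parallel case concrete, here is a minimal NumPy sketch (not any particular framework's API): each replica computes gradients on its own shard of the global batch, the gradients are averaged (conceptually an all-reduce), and every replica applies the same update. The weight vector and the grad_fn loss-gradient function are hypothetical placeholders.

```python
import numpy as np

def sync_data_parallel_step(weights, shards, grad_fn, lr=0.01):
    # One gradient per "GPU": each shard is that worker's slice of the global batch.
    grads = [grad_fn(weights, X, y) for X, y in shards]
    avg_grad = np.mean(grads, axis=0)   # all-reduce: average gradients across workers
    return weights - lr * avg_grad      # every replica applies the identical update
```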
(Multi-node) third-party benchmarks
Seems quite good!!
Easy?
Source: https://chainer.org
(Multi-node) third-party benchmarks
ResNet152 (8 to 256 GPUs): 95% to 90% efficiency
AlexNet (8 to 256 GPUs): 78% to 53% efficiency
Source: MXNet on AWS, 16 x p2.16xlarge
(Multi-node) third-party benchmarks
Small print:
● High-speed connections!
● Synthetic data vs. real data
● Bottlenecks in hard disk
And more...
● accuracy penalization
● number of parameter servers
Source: tensorflow.org
Source: https://chainer.org
(Multi-node) third-party benchmarks
So let's start single-node multi-GPU in the cloud
Let’s begin: Tesla K80
K80 GPUs on:
● AWS p2: 1, 8 & 16
○ ready-to-go AMIs
● Azure NC: 1, 2 & 4
● Google Cloud Platform: 1 to 8
○ setup scripts
Let’s begin: MNIST & cloud
● Goal: saturate GPUs!
● Bottlenecks:
○ I/O Reads
○ Image pipelines
■ Decoding
■ Data augmentation
○ Communications:
■ efficient primitives: NCCL
■ Overlap with computation
■ QPI < PCIe < NVLink
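One common way to attack the I/O and image-pipeline bottlenecks above is to decode and augment in parallel and to prefetch batches so input preparation overlaps with GPU computation. A minimal sketch with the tf.data API; the file pattern, the decode_and_augment placeholder and the batch size are assumptions for illustration.

```python
import tensorflow as tf

def decode_and_augment(serialized_example):
    # Placeholder: parse the TFRecord, decode the JPEG and apply augmentation here.
    return serialized_example

def make_pipeline(file_pattern='train-*.tfrecord', batch_size=256):
    files = tf.data.Dataset.list_files(file_pattern)
    ds = tf.data.TFRecordDataset(files)
    ds = ds.map(decode_and_augment, num_parallel_calls=8)  # parallel decoding / augmentation
    ds = ds.shuffle(10000).batch(batch_size)
    return ds.prefetch(2)  # prefetch so the input pipeline overlaps with GPU compute
```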
Data pipeline bottlenecks
(internal) Interconnections matter
Azure NC24 (4 K80):
Google n1-highmem-32 with 8 K80:
GPUDirect & PCIe
GPUDirect support
(internal) Interconnections matter
More lessons: CIFAR10 on AWS
AWS p2.8x = 8x K80 GPUs
Synchronous data-parallel
After 8 epochs...
MXNet:
● validation accuracy = [0.77, 0.82]
TensorFlow:
● validation accuracy = [0.47, 0.59]
Batch sizes matter
● Larger batches reduce communication overhead
○ More throughput
● But degrade convergence.
○ Less accuracy!
Accuracy vs. throughput
Empirical workaround for learning rates:
● warm up: start small
● increase over 5 epochs...
● finish at #GPUs x lr
Scenario:
● NVLink, 50 Gbps Ethernet (> 15 Gbps)
● Caffe2 + NCCL + gloo
● Synchronous SGD + Momentum
Source: Facebook AI Research
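A minimal sketch of that empirical recipe (in the spirit of Facebook AI Research's large-minibatch SGD work): ramp linearly from the single-GPU learning rate up to #GPUs x lr over the first epochs, then stay at the scaled rate. The base rate, 8 GPUs and 5 warm-up epochs are illustrative assumptions.

```python
def learning_rate(epoch, base_lr=0.1, num_gpus=8, warmup_epochs=5):
    target_lr = base_lr * num_gpus                      # linear scaling rule: #GPUs x lr
    if epoch < warmup_epochs:                           # gradual warm-up: start small
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr
```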
Being practical: fine-tuning with MXNet
Source: https://mxnet.incubator.apache.org/how_to/finetune.html
Being practical: fine-tuning with MXNet
Scenario:
● p2.8x
● ResNet50
● batch size: 16 per GPU
● lr = lr_i x #GPUs
● 1 epoch
94% efficiency :)
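A hedged sketch of multi-GPU fine-tuning with MXNet's Module API, loosely based on the fine-tuning tutorial cited above. The checkpoint prefix, data file, 10-class output layer and hyperparameters are illustrative assumptions, not the talk's exact configuration.

```python
import mxnet as mx

num_gpus, per_gpu_batch = 8, 16
batch_size = per_gpu_batch * num_gpus          # global batch = 16 x #GPUs

# Pretrained ResNet-50 checkpoint (downloaded separately, hypothetical prefix).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)

# Replace the final classifier with a new one for the target task.
flatten = sym.get_internals()['flatten0_output']
fc_new = mx.sym.FullyConnected(data=flatten, num_hidden=10, name='fc_new')
net = mx.sym.SoftmaxOutput(data=fc_new, name='softmax')

# RecordIO training data (hypothetical file), decoded with several threads.
train_iter = mx.io.ImageRecordIter(path_imgrec='train.rec',
                                   data_shape=(3, 224, 224),
                                   batch_size=batch_size,
                                   shuffle=True, rand_mirror=True,
                                   preprocess_threads=16)

mod = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(num_gpus)])
mod.fit(train_iter,
        arg_params=arg_params, aux_params=aux_params, allow_missing=True,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.01 * num_gpus},  # lr scaled linearly with #GPUs
        num_epoch=1)
```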
Being practical: fine-tuning with MXNet
Scenario:
● p2.8x
● ResNet152
● batch size: 16, 32 per GPU
● lr = lr_i x #GPUs
● 1 epoch
● val-acc = [0.830, 0.836]
95% efficiency :)
< 1% accuracy loss
What about costs?
Tesla K80 price on premises
Source: amazon.com, November 2017
Tesla K80 prices on cloud
$1/h with per-second billing
Only ~$0.30/h on the AWS spot market
Buy one K80, or rent 4,000 to 12,000 GPU-hours for the same money!
Training ResNet50 on ImageNet-1K (100 epochs): $180 to $730
Fine-tuning (8 epochs): < $2
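The arithmetic behind figures like these is simply GPU-hours times the price per GPU-hour; the hour counts below are assumptions chosen only to show how a $180 to $730 range could arise, not measured values from the talk.

```python
def training_cost(gpu_hours, price_per_gpu_hour):
    return gpu_hours * price_per_gpu_hour

print(training_cost(600, 0.30))   # ~600 K80 GPU-hours at the ~$0.30/h spot price -> $180
print(training_cost(730, 1.00))   # ~730 K80 GPU-hours at the ~$1/h on-demand price -> $730
```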
2014 to 2017: from Kepler... to Volta!
Source: aws.amazon.com, October 2017 (New!)
And Tesla P100 (Pascal) in beta on Google Cloud Platform, September 2017 (New!)
(on-demand vs. spot prices)
Extra: NVIDIA optimized containers! (New!)
Extra: NVIDIA Volta on AWS P3 instances!
• Great performance!
• Cost-effective (on-demand)
• (still) scarce availability
Summary
SCIENCE
Batch sizes & learning rates matter!
● large batch sizes degrade convergence
● linear scaling rule & warm-up
ENGINEERING
Data pipeline matters!
● Data feed
● Overlap computation & communications
INFRASTRUCTURE
Architecture & bandwidth matter!
● Volta > Pascal > Kepler
● NVLink > PCIe > (25 Gbps) Ethernet
BUSINESS
Pricing matters!
● Cost-effective cloud instances in the spot market
THANKS FOR YOUR TIME
hablemos@beeva.com | www.beeva.com
And we’re hiring!