THE CONVERGENCE OF HPC
AND DEEP LEARNING
Axel Koehler, Principal Solution Architect
HPC Advisory Council 2018, April 10th 2018, Lugano
3
FACTORS DRIVING CHANGES IN HPC
End of Dennard Scaling places a cap on single-threaded performance
Increasing application performance will require fine-grain parallel code with significant computational intensity
AI and Data Science emerging as important new components of scientific discovery
Dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data
Cloud-based usage models, in-situ execution and visualization emerging as new workflows critical to the science process and productivity
Tight coupling of interactive simulation, visualization, data analysis/AI
Service Oriented Architectures (SOA)
4
Multiple Experiments Coming or
Upgrading In the Next 10 Years
15 TB/Day
10X Increase in Data Volume
Exabyte/Day
30X Increase in Power
Personal Genomics
Cryo-EM
5
TESLA PLATFORM
ONE Data Center Platform for Accelerating HPC and AI
TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX / DGX-Station, NVIDIA HGX-1, System OEM, Cloud
NVIDIA SDK: Deep Learning SDK (DeepStream SDK, NCCL, cuBLAS, cuSPARSE, cuDNN, TensorRT), ComputeWorks (CUDA C/C++, FORTRAN)
INDUSTRY FRAMEWORKS & TOOLS: Frameworks, Ecosystem Tools
APPLICATIONS: Internet Services, Enterprise Applications (Manufacturing, Automotive, Healthcare, Finance, Retail, Defense, ...), HPC (450+ applications)
6
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
5120 energy-efficient cores + Tensor Cores
7.8 TF Double Precision (FP64), 15.6 TF Single Precision (FP32),
125 Tensor TFLOP/s mixed precision
Huge requirement on communication and memory bandwidth
NVLink
6 links per GPU at 50 GB/s bi-directional
each for maximum scalability
between GPUs
CoWoS with HBM2
900 GB/s Memory Bandwidth
Unifying Compute & Memory
in Single Package
Huge requirement on compute power (FLOPS)
NCCL
High-performance multi-GPU
and multi-node collective
communication primitives
optimized for NVIDIA GPUs
GPU Direct /
GPU Direct RDMA
Direct communication
between GPUs by
eliminating the CPU from
the critical path
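To illustrate how the GPUDirect path is typically exercised from application code, here is a minimal, hedged sketch using a CUDA-aware MPI build (buffer size and rank pairing are assumptions for the example): device pointers are handed straight to MPI, which removes the CPU staging copy from the critical path and lets GPUDirect RDMA be used where the system supports it.

// Hedged sketch: assumes an MPI library built with CUDA support (CUDA-aware MPI).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                      // example buffer size (assumption)
    float *d_send, *d_recv;
    cudaMalloc(&d_send, N * sizeof(float));
    cudaMalloc(&d_recv, N * sizeof(float));

    // Pairwise exchange between neighbouring ranks; the device pointers go
    // directly into the MPI call, with no explicit copy through host memory.
    int peer = rank ^ 1;
    if (peer < size) {
        MPI_Sendrecv(d_send, N, MPI_FLOAT, peer, 0,
                     d_recv, N, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}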
7
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
• CUDA C++ Warp Level Matrix Operations
• CUDA C++ Warp Level Matrix Operations
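A minimal sketch of the CUDA C++ warp-level matrix operations mentioned above (the nvcuda::wmma API introduced with CUDA 9). One warp computes a single 16x16x16 tile of D = A*B + C with FP16 inputs and FP32 accumulation; the row/column layouts and the leading dimension of 16 are assumptions for the example.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: D[FP32] = A[FP16] * B[FP16] + C[FP32].
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                       // FP16 input tile
    wmma::load_matrix_sync(b_frag, B, 16);                       // FP16 input tile
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);  // FP32 accumulator
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);              // Tensor Core MMA
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}

Launched with a single warp, e.g. wmma_tile<<<1, 32>>>(dA, dB, dC, dD); a real kernel tiles the full matrices across many warps.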
8
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x faster matrix-matrix multiply
[Charts: relative cuBLAS GEMM performance vs. matrix size (M=N=K from 512 to 4096).
Mixed precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x faster than P100 (CUDA 8).
Single precision (FP32): V100 (CUDA 9) up to 1.8x faster than P100 (CUDA 8).]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
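For reference, a hedged sketch of how the mixed-precision path in the chart above is typically invoked through cublasGemmEx (CUDA 9-era signature and enum names; handle creation, allocation and data transfer are omitted, and the square M=N=K shape mirrors the benchmark):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs, FP32 compute/output on Tensor Cores: C = alpha*A*B + beta*C
void gemm_mixed(cublasHandle_t handle, const __half *dA, const __half *dB,
                float *dC, int n /* M = N = K */) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core math
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n, &alpha,
                 dA, CUDA_R_16F, n,
                 dB, CUDA_R_16F, n,
                 &beta,
                 dC, CUDA_R_32F, n,
                 CUDA_R_32F,                            // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);        // prefer Tensor Core algorithms
}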
9
COMMUNICATION BETWEEN GPUS
Large scale models:
• Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes
• The size of the model will further increase in the future
Data parallel training
• Each worker trains the same layers on a different data batch
• NVLINK allows the separation of data loading and gradient averaging
Model parallel training
• All workers train on the same batch; workers communicate as frequently as the network allows
• NVLINK allows the separation of data loading and the exchange of activations
http://mxnet.io/how_to/multi_devices.html
10
NVLINK AND MULTI-GPU SCALING
For Data Parallel Training

NVLink based system (two CPUs, 8 GPUs behind PCIe switches, GPUs linked by NVLink):
• Data loading over PCIe
• Gradient averaging over NVLink
• No sharing of communication resources: no congestion

PCIe based system (two CPUs connected by a QPI link, 8 GPUs behind PCIe switches):
• Data loading over PCIe
• Gradient averaging over PCIe and QPI
• Data loading and gradient averaging share communication resources: congestion
11
NVLINK AND CNTK MULTI-GPU SCALING
12
NVIDIA Collective Communications Library (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective
communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as
Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and
more
Multi-Node:
InfiniBand verbs
IP Sockets
Multi-GPU:
NVLink
PCIe
Automatic
Topology
Detection
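To make "easy to integrate" concrete, here is a minimal single-process sketch of an NCCL 2 in-place allreduce across the GPUs of one node (the gradient buffers, GPU count and the cap of 16 devices are assumptions for the example; error checking is omitted):

#include <nccl.h>
#include <cuda_runtime.h>

// In-place sum-allreduce of one gradient buffer per GPU (single process, single node).
void allreduce_gradients(float **grad, size_t count, int nGpus) {
    ncclComm_t comms[16];                    // assumes nGpus <= 16
    cudaStream_t streams[16];
    int devs[16];
    for (int i = 0; i < nGpus; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, nGpus, devs);     // one communicator per GPU

    ncclGroupStart();                        // group the calls so NCCL schedules them together
    for (int i = 0; i < nGpus; ++i)
        ncclAllReduce(grad[i], grad[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
}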
13
NVIDIA DGX-2
1. NVIDIA Tesla V100 32GB
2. Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bi-directional bandwidth
5. PCIe Switch Complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB System Memory
8. 30 TB NVMe SSDs internal storage
9. Dual 10/25 Gb/sec Ethernet
1414
• 18 NVLINK ports
• 50 GB/s per port bi-directional
• 900 GB/s total bi-directional
• Fully connected crossbar
• x4 PCIe Gen2 management port
• GPIO
• I2C
• 2 billion transistors
NVSWITCH
15
FULL NON-BLOCKING BANDWIDTH
16
UNIFIED MEMORY PROVIDES
• Single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
NVLINK PROVIDES
• All-to-all high-bandwidth peer mapping
between GPUs
• Full inter-GPU memory interconnect
(incl. Atomics)
NVSWITCH
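A minimal sketch of the behaviour described above for two GPUs (device IDs and buffer size are illustrative): one cudaMallocManaged allocation is visible to both GPUs, peer access enables the direct NVLink/NVSwitch mapping where present, and cudaMemPrefetchAsync gives the user control over data locality.

#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));       // one allocation, visible to all GPUs

    // Direct peer mapping between GPU 0 and GPU 1 (NVLink/NVSwitch path where present).
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // User control of locality: migrate the data to GPU 0, work there, then move it to GPU 1.
    cudaSetDevice(0);
    cudaMemPrefetchAsync(data, N * sizeof(float), 0, 0);
    // ... launch kernels on GPU 0 that read/write data ...

    cudaSetDevice(1);
    cudaMemPrefetchAsync(data, N * sizeof(float), 1, 0);
    // ... launch kernels on GPU 1 that read/write the same pointer ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}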
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B and C submit work through the CUDA Multi-Process Service control to a Volta GV100 GPU, which provides hardware-accelerated work submission and hardware isolation between clients during GPU execution.]
Volta MPS Enhancements:
• MPS clients submit work directly to
the work queues within the GPU
• Reduced launch latency
• Improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent
address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
VOLTA MPS FOR INFERENCE
Efficient inference deployment without a batching system
[Chart: ResNet-50 images/sec at 7 ms latency. Multiple Volta clients using MPS (no batching) deliver 7x the throughput of a single Volta client without MPS, reaching about 60% of the performance of Volta with a batching system.]
V100 measured on pre-production hardware.
20
DEEP LEARNING IS AN HPC WORKLOAD
HPC expertise is important for success
• HPC and Deep Learning require a huge amount of compute power (FLOPS)
• Mainly Double Precision arithmetic for HPC
• Single, half or 8-bit precision for Deep Learning training/inference
• HPC and Deep Learning use inherently parallel algorithms
• HPC needs less memory per FLOP than Deep Learning
• HPC is more demanding on network bandwidth than Deep Learning
• Data scientists like GPU-dense systems (as many GPUs as possible per node)
• HPC has had more demand for scalability than Deep Learning up to now
• Distributed training frameworks like Horovod (Uber) are now available
21
• Current DIY deep learning environments are complex and time-consuming to build, test and maintain
• The same issues affect HPC and other accelerated applications
• Multiple jobs from different users need to co-exist on the same servers
NVIDIA Libraries
NVIDIA Docker
NVIDIA Driver
NVIDIA GPU
Open Source
Frameworks
SOFTWARE CHALLENGES
22
NVIDIA GPU CLOUD REGISTRY
Deep Learning
All major frameworks with multi-GPU optimizations. Uses NCCL for NVLink data exchange. Multi-threaded I/O to feed the GPUs.
Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
HPC
NAMD, Gromacs, LAMMPS, GAMESS, Relion, Chroma, MILC
HPC Visualization
ParaView with OptiX, IndeX and Holodeck, with OpenGL visualization based on NVIDIA Docker 2.0; IndeX, VMD
Single NGC Account
For use on GPUs everywhere - https://ngc.nvidia.com
Common Software stack across NVIDIA GPUs
NVIDIA GPU Cloud containerizes GPU-
optimized frameworks, applications, runtimes,
libraries, and operating system, available at no
charge
23
NVIDIA SATURN V
AI supercomputer with 660 x DGX-1V
40 PF Peak FP64 Performance,
660 PF DL Tensor Performance
• Primarily research focused
• Used internally for Deep Learning applied
research
• Many users testing algorithms, networks and
new approaches
• Embedded, robotic, auto, hyperscale, HPC
• Partner with university research and industry
collaborations
• Study convergence of data science and HPC
• All jobs are containerized
24
DEEP LEARNING DATA CENTER
Reference Architecture
http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html
25
COMBINING THE STRENGTHS OF HPC AND AI
HPC
• Proven algorithms based on first principles theory
• Proven statistical models for accurate results in multiple science domains
• Develop training data sets using first-principles models
• Incorporate AI models in semi-empirical style applications to improve throughput
• Validate new findings from AI

AI
• Implement inference models with real-time interactivity
• Train inference models to improve accuracy and comprehend more of the physical parameter space
• Analyze data sets that are simply intractable with classic statistical models
• Control and manage complex scientific experiments
• New methods to improve predictive accuracy, insight into new phenomena and response time
26
MULTI-MESSENGER
ASTROPHYSICS
"Despite the latest developments in computational power, there is still a large gap in linking relativistic theoretical models to observations."
Max Planck Institute
Background
The aLIGO (Advanced Laser Interferometer Gravitational Wave Observatory) experiment successfully discovered signals proving Einstein's theory of General Relativity and the existence of cosmic gravitational waves. While this discovery was by itself extraordinary, it is highly desirable to combine multiple observational data sources to obtain a richer understanding of the phenomena.
Challenge
The initial aLIGO discoveries were successfully completed using classic data analytics. The processing pipeline used hundreds of CPUs, and the bulk of the detection processing was done offline. This latency is far outside the range needed to activate resources such as the Large Synoptic Survey Telescope (LSST), which observes phenomena in the electromagnetic spectrum, in time to "see" what aLIGO can "hear".
Solution
A DNN was developed and trained using a data set derived from CACTUS simulations with the Einstein Toolkit. The DNN was shown to produce better accuracy with latencies 1000x lower than the original CPU-based waveform detection.
Impact
Faster and more accurate detection of gravitational waves with the potential to
steer other observational data sources.
27
Background
Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry
(QC) simulations are important to accurately screen millions of potential drugs to
a few most promising drug candidates.
Challenge
QC simulation is computationally expensive, so researchers use approximations, compromising on accuracy. Screening 10M drug candidates would take 5 years of compute on CPUs.
Solution
Researchers at the University of Florida and the University of North Carolina leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular energy surfaces at very high speed (microseconds versus several minutes), with extremely high (DFT-level) accuracy, and at one to ten millionths of the cost of current computational methods.
Impact
Faster, more accurate screening at far lower cost
AI Quantum Breakthrough
28
SUMMARY
• The same GPU technology that enables powerful science is also enabling the revolution in deep learning
• Deep learning is enabling many use cases in science (e.g. image recognition, classification, ...)
• Applications can use DL to train neural networks on already simulated data, and the trained network can then predict the output
• GPUs are the right technology for HPC and DL
Axel Koehler (akoehler@nvidia.com)
THE CONVERGENCE OF HPC
AND DEEP LEARNING
