THE CONVERGENCE OF HPC
AND DEEP LEARNING
Axel Koehler, Principal Solution Architect
HPC Advisory Council 2018, April 10th 2018, Lugano
3
FACTORS DRIVING CHANGES IN HPC
End of Dennard Scaling places a cap on single-threaded performance
Increasing application performance will require fine-grain parallel code with significant computational intensity
AI and Data Science emerging as important new components of scientific discovery
Dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data
Cloud-based usage models, in-situ execution and visualization emerging as new workflows critical to the science process and productivity
Tight coupling of interactive simulation, visualization, data analysis/AI
Service Oriented Architectures (SOA)
4
Multiple Experiments Coming or
Upgrading In the Next 10 Years
15 TB/Day
10X Increase in Data Volume
Exabyte/Day
30X Increase in Power
Personal Genomics
Cryo-EM
5
TESLA PLATFORM
ONE Data Center Platform for Accelerating HPC and AI
TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX / DGX-Station, NVIDIA HGX-1, System OEM, Cloud
NVIDIA SDK: Deep Learning SDK (DeepStream SDK, NCCL, cuBLAS, cuSPARSE, cuDNN, TensorRT), ComputeWorks (CUDA C/C++, FORTRAN)
INDUSTRY FRAMEWORKS & TOOLS: Frameworks, Ecosystem Tools
APPLICATIONS: Internet Services, Enterprise Applications (Manufacturing, Automotive, Healthcare, Finance, Retail, Defense, ...), HPC (450+ applications)
6
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
5120 energy-efficient cores + Tensor Cores
7.8 TF Double Precision (FP64), 15.6 TF Single Precision (FP32),
125 Tensor TFLOP/s mixed precision
Huge requirement on communication and memory bandwidth
NVLink
6 links per GPU at 50 GB/s bi-directional
each for maximum scalability
between GPUs
CoWoS with HBM2
900 GB/s Memory Bandwidth
Unifying Compute & Memory
in Single Package
Huge requirement on compute power (FLOPS)
NCCL
High-performance multi-GPU
and multi-node collective
communication primitives
optimized for NVIDIA GPUs
GPU Direct /
GPU Direct RDMA
Direct communication
between GPUs by
eliminating the CPU from
the critical path
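To illustrate how the GPUDirect path is typically exercised from application code, here is a minimal, hedged sketch using a CUDA-aware MPI build (buffer size and rank pairing are assumptions for the example): device pointers are handed straight to MPI, which removes the CPU staging copy from the critical path and lets GPUDirect RDMA be used where the system supports it.

// Hedged sketch: assumes an MPI library built with CUDA support (CUDA-aware MPI).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                      // example buffer size (assumption)
    float *d_send, *d_recv;
    cudaMalloc(&d_send, N * sizeof(float));
    cudaMalloc(&d_recv, N * sizeof(float));

    // Pairwise exchange between neighbouring ranks; the device pointers go
    // directly into the MPI call, with no explicit copy through host memory.
    int peer = rank ^ 1;
    if (peer < size) {
        MPI_Sendrecv(d_send, N, MPI_FLOAT, peer, 0,
                     d_recv, N, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}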
7
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
• CUDA C++ Warp Level Matrix Operations
• CUDA C++ Warp Level Matrix Operations
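A minimal sketch of the CUDA C++ warp-level matrix operations mentioned above (the nvcuda::wmma API introduced with CUDA 9). One warp computes a single 16x16x16 tile of D = A*B + C with FP16 inputs and FP32 accumulation; the row/column layouts and the leading dimension of 16 are assumptions for the example.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: D[FP32] = A[FP16] * B[FP16] + C[FP32].
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                       // FP16 input tile
    wmma::load_matrix_sync(b_frag, B, 16);                       // FP16 input tile
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);  // FP32 accumulator
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);              // Tensor Core MMA
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}

Launched with a single warp, e.g. wmma_tile<<<1, 32>>>(dA, dB, dC, dD); a real kernel tiles the full matrices across many warps.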
8
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x faster matrix-matrix multiply
[Charts: relative cuBLAS GEMM performance vs. matrix size (M=N=K from 512 to 4096).
Mixed precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x faster than P100 (CUDA 8).
Single precision (FP32): V100 (CUDA 9) up to 1.8x faster than P100 (CUDA 8).]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
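For reference, a hedged sketch of how the mixed-precision path in the chart above is typically invoked through cublasGemmEx (CUDA 9-era signature and enum names; handle creation, allocation and data transfer are omitted, and the square M=N=K shape mirrors the benchmark):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs, FP32 compute/output on Tensor Cores: C = alpha*A*B + beta*C
void gemm_mixed(cublasHandle_t handle, const __half *dA, const __half *dB,
                float *dC, int n /* M = N = K */) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core math
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n, &alpha,
                 dA, CUDA_R_16F, n,
                 dB, CUDA_R_16F, n,
                 &beta,
                 dC, CUDA_R_32F, n,
                 CUDA_R_32F,                            // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);        // prefer Tensor Core algorithms
}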
9
COMMUNICATION BETWEEN GPUS
Large scale models:
• Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes
• The size of the model will further increase in the future
Data parallel training
• Each worker trains the same layers on a different data batch
• NVLINK allows the separation of data loading and gradient averaging
Model parallel training
• All workers train on the same batch; workers communicate as frequently as the network allows
• NVLINK allows the separation of data loading and the exchange of activations
http://mxnet.io/how_to/multi_devices.html
10
NVLINK AND MULTI-GPU SCALING
For Data Parallel Training

NVLink based system (two CPUs, 8 GPUs behind PCIe switches, GPUs linked by NVLink):
• Data loading over PCIe
• Gradient averaging over NVLink
• No sharing of communication resources: no congestion

PCIe based system (two CPUs connected by a QPI link, 8 GPUs behind PCIe switches):
• Data loading over PCIe
• Gradient averaging over PCIe and QPI
• Data loading and gradient averaging share communication resources: congestion
11
NVLINK AND CNTK MULTI-GPU SCALING
12
NVIDIA Collective Communications Library (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective
communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as
Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and
more
Multi-Node:
InfiniBand verbs
IP Sockets
Multi-GPU:
NVLink
PCIe
Automatic
Topology
Detection
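To make "easy to integrate" concrete, here is a minimal single-process sketch of an NCCL 2 in-place allreduce across the GPUs of one node (the gradient buffers, GPU count and the cap of 16 devices are assumptions for the example; error checking is omitted):

#include <nccl.h>
#include <cuda_runtime.h>

// In-place sum-allreduce of one gradient buffer per GPU (single process, single node).
void allreduce_gradients(float **grad, size_t count, int nGpus) {
    ncclComm_t comms[16];                    // assumes nGpus <= 16
    cudaStream_t streams[16];
    int devs[16];
    for (int i = 0; i < nGpus; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, nGpus, devs);     // one communicator per GPU

    ncclGroupStart();                        // group the calls so NCCL schedules them together
    for (int i = 0; i < nGpus; ++i)
        ncclAllReduce(grad[i], grad[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
}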
13
NVIDIA DGX-2
1. NVIDIA Tesla V100 32GB
2. Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bi-directional bandwidth
5. PCIe Switch Complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB System Memory
8. 30 TB NVMe SSDs internal storage
9. Dual 10/25 Gb/sec Ethernet
1414
• 18 NVLINK ports
• 50 GB/s per port bi-directional
• 900 GB/s total bi-directional
• Fully connected crossbar
• x4 PCIe Gen2 management port
• GPIO
• I2C
• 2 billion transistors
NVSWITCH
15
FULL NON-BLOCKING BANDWIDTH
16
UNIFIED MEMORY PROVIDES
• Single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
NVLINK PROVIDES
• All-to-all high-bandwidth peer mapping
between GPUs
• Full inter-GPU memory interconnect
(incl. Atomics)
NVSWITCH
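A minimal sketch of the behaviour described above for two GPUs (device IDs and buffer size are illustrative): one cudaMallocManaged allocation is visible to both GPUs, peer access enables the direct NVLink/NVSwitch mapping where present, and cudaMemPrefetchAsync gives the user control over data locality.

#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));       // one allocation, visible to all GPUs

    // Direct peer mapping between GPU 0 and GPU 1 (NVLink/NVSwitch path where present).
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // User control of locality: migrate the data to GPU 0, work there, then move it to GPU 1.
    cudaSetDevice(0);
    cudaMemPrefetchAsync(data, N * sizeof(float), 0, 0);
    // ... launch kernels on GPU 0 that read/write data ...

    cudaSetDevice(1);
    cudaMemPrefetchAsync(data, N * sizeof(float), 1, 0);
    // ... launch kernels on GPU 1 that read/write the same pointer ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}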
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B and C submit work through the CUDA Multi-Process Service control to a Volta GV100 GPU, which provides hardware-accelerated work submission and hardware isolation between clients during GPU execution.]
Volta MPS Enhancements:
• MPS clients submit work directly to
the work queues within the GPU
• Reduced launch latency
• Improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent
address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
VOLTA MPS FOR INFERENCE
Efficient inference deployment without a batching system
[Chart: ResNet-50 images/sec at 7 ms latency. Multiple Volta clients using MPS (no batching) deliver 7x the throughput of a single Volta client without MPS, reaching about 60% of the performance of Volta with a batching system.]
V100 measured on pre-production hardware.
20
DEEP LEARNING IS AN HPC WORKLOAD
HPC expertise is important for success
• HPC and Deep Learning require a huge amount of compute power (FLOPS)
• Mainly Double Precision arithmetic for HPC
• Single, half or 8-bit precision for Deep Learning training/inference
• HPC and Deep Learning use inherently parallel algorithms
• HPC needs less memory per FLOP than Deep Learning
• HPC is more demanding on network bandwidth than Deep Learning
• Data scientists like GPU-dense systems (as many GPUs as possible per node)
• HPC has had more demand for scalability than Deep Learning up to now
• Distributed training frameworks like Horovod (Uber) are now available
21
• Current DIY deep learning environments are complex and time-consuming to build, test and maintain
• The same issues affect HPC and other accelerated applications
• Multiple jobs from different users need to co-exist on the same servers
NVIDIA Libraries
NVIDIA Docker
NVIDIA Driver
NVIDIA GPU
Open Source
Frameworks
SOFTWARE CHALLENGES
22
NVIDIA GPU CLOUD REGISTRY
Deep Learning
All major frameworks with multi-GPU optimizations. Uses NCCL for NVLink data exchange. Multi-threaded I/O to feed the GPUs.
Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
HPC
NAMD, Gromacs, LAMMPS, GAMESS, Relion, Chroma, MILC
HPC Visualization
ParaView with OptiX, IndeX and Holodeck, with OpenGL visualization based on NVIDIA Docker 2.0; IndeX, VMD
Single NGC Account
For use on GPUs everywhere - https://ngc.nvidia.com
Common Software stack across NVIDIA GPUs
NVIDIA GPU Cloud containerizes GPU-
optimized frameworks, applications, runtimes,
libraries, and operating system, available at no
charge
23
NVIDIA SATURN V
AI supercomputer with 660 x DGX-1V
40 PF Peak FP64 Performance,
660 PF DL Tensor Performance
• Primarily research focused
• Used internally for Deep Learning applied
research
• Many users testing algorithms, networks and
new approaches
• Embedded, robotic, auto, hyperscale, HPC
• Partner with university research and industry
collaborations
• Study convergence of data science and HPC
• All jobs are containerized
24
DEEP LEARNING DATA CENTER
Reference Architecture
http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html
25
COMBINING THE STRENGTHS OF HPC AND AI
HPC
• Proven algorithms based on first principles theory
• Proven statistical models for accurate results in multiple science domains
• Develop training data sets using first-principles models
• Incorporate AI models in semi-empirical style applications to improve throughput
• Validate new findings from AI

AI
• Implement inference models with real-time interactivity
• Train inference models to improve accuracy and comprehend more of the physical parameter space
• Analyze data sets that are simply intractable with classic statistical models
• Control and manage complex scientific experiments
• New methods to improve predictive accuracy, insight into new phenomena and response time
26
MULTI-MESSENGER
ASTROPHYSICS
"Despite the latest developments in computational power, there is still a large gap in linking relativistic theoretical models to observations."
Max Planck Institute
Background
The aLIGO (Advanced Laser Interferometer Gravitational Wave Observatory) experiment successfully discovered signals proving Einstein's theory of General Relativity and the existence of cosmic gravitational waves. While this discovery was by itself extraordinary, it is highly desirable to combine multiple observational data sources to obtain a richer understanding of the phenomena.
Challenge
The initial aLIGO discoveries were successfully completed using classic data analytics. The processing pipeline used hundreds of CPUs, and the bulk of the detection processing was done offline. This latency is far outside the range needed to activate resources such as the Large Synoptic Survey Telescope (LSST), which observes phenomena in the electromagnetic spectrum, in time to "see" what aLIGO can "hear".
Solution
A DNN was developed and trained using a data set derived from CACTUS simulations with the Einstein Toolkit. The DNN was shown to produce better accuracy with latencies 1000x lower than the original CPU-based waveform detection.
Impact
Faster and more accurate detection of gravitational waves with the potential to
steer other observational data sources.
27
Background
Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry
(QC) simulations are important to accurately screen millions of potential drugs to
a few most promising drug candidates.
Challenge
QC simulation is computationally expensive, so researchers use approximations, compromising on accuracy. Screening 10M drug candidates would take 5 years of compute on CPUs.
Solution
Researchers at the University of Florida and the University of North Carolina leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular energy surfaces at very high speed (microseconds versus several minutes), with extremely high (DFT-level) accuracy, and at one to ten millionths of the cost of current computational methods.
Impact
Faster, more accurate screening at far lower cost
AI Quantum Breakthrough
28
SUMMARY
• The same GPU technology that enables powerful science is also enabling the revolution in deep learning
• Deep learning is enabling many use cases in science (e.g. image recognition, classification, ...)
• Applications can use DL to train neural networks on already simulated data, and the trained network can then predict the output
• GPUs are the right technology for HPC and DL
Axel Koehler (akoehler@nvidia.com)
THE CONVERGENCE OF HPC
AND DEEP LEARNING
