R&D work on pre-Exascale HPC systems
- 1. Current R&D hands-on work for pre-Exascale HPC systems
Joshua Mora, PhD
Agenda
•Heterogeneous HW flexibility on SW request.
•PRACE 2011 prototype, 1TF/kW efficiency ratio.
•Performance debugging of HW and SW on multi-socket, multi-chipset, multi-GPUs, multi-rail (IB).
•Affinity/binding.
•In background: [c/nc] HT topologies, I/O performance implications, performance counters, NIC/GPU device interaction with processors, SW stacks.
- 2. If (HPC==customization) {Heterogeneous HW/SW flexibility;} else {Enterprise solutions;}
On the HW side, customization is all about:
•Multi-socket systems, to provide CPU computing density (i.e. aggregated G[FLOP/Byte]/s) and memory capacity.
•Multi-chipset systems, to attach multiple GPUs (to accelerate computation) and NICs (to tightly couple processors/GPUs across computing nodes).
But
•Pay attention, among other things, to the [nc/c]HT topologies to avoid running out of bandwidth (i.e. resource limitations) between devices when the HPC application starts pushing the limits in all directions.
- 3. Example: RAM NFS (RNA Networks) over 4 QDR IB cards
QDR IB switch: 36 ports.
FAT NODE in 4U (RNAcache appliance):
•8x six-core @ 2.6 GHz
•256 GB RAM @ DDR2-667
•4x PCIe gen2
•4x QDR IB, 2x IB ports
CLUSTER NODES 01-08, each in 1U (RNAcache clients):
•2x six-core @ 2.6 GHz
•16 GB RAM @ DDR2-800
•1x PCIe gen2
•1x QDR IB, 1 IB port
(Diagram link labels: 6 GB/s bidirectional on the cluster-node links; 2x10 GB/s and 8x10 GB/s on the fat-node side.)
Ultra-fast local/swap/shared cluster file system: reads/writes @ 20 GB/s aggregated.
- 4. Example: OpenMP application on a NUMA system
Is it better to configure the system with memory node interleaving disabled, or with interleaving enabled? (Interleaving causes huge cHT traffic, which becomes the limitation.)
Answer: modify the application to exploit the NUMA system by allocating local arrays after the threads have started (see the sketch below).
(Charts: DRAM bandwidth (GB/s) over time with node interleaving disabled vs. enabled; series are per-thread memory-controller total DRAM accesses (MB/s) for threads t00, t06, t12, t18 of the gsmpsrch_interleave run.)
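A minimal sketch of the pattern recommended above (not the actual gsmpsrch code): plain C with OpenMP, where each thread first-touches its own slice of the array inside a parallel region so that the pages are placed on the memory node local to the core running that thread; thread pinning (e.g. via numactl or GOMP_CPU_AFFINITY) is assumed:

#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t n = 1UL << 28;                       /* illustrative array size */
    double *a = malloc(n * sizeof(double));     /* virtual allocation only, no pages yet */

    /* First touch inside the parallel region: each thread writes its own
       static chunk, so those pages land on that thread's local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    /* Compute loops reuse the same static schedule, so threads mostly hit
       local DRAM instead of pulling data over the cHT links. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}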
- 5. PRACE 2011 prototype, HW description
Multi-rail QDR IB, fat-tree topology.
- 6. PRACE 2011 prototype, HW description (cont.)
•6U sandwich unit replicated 6 times within the 42U rack.
•Compute nodes connected with multi-rail IB.
•Single 36-port IB switch (fat tree).
•Management network: 1 Gb/s.
- 7. PRACE 2011 prototype, SW description
•“Plain” C/C++/Fortran (application)
•Pragma directives for OpenMP (multi-threading)
•Pragma directives for HMPP (CAPS enterprise) (GPUs); see the sketch after this list
•MPI (OpenMPI, MVAPICH) (communications)
•GNU/Open64/CAPS (compilers)
•GNU debuggers
•ACML, LAPACK, ... (math libraries)
•“Plain” OS (e.g. SLES11 SP1, with support for Multi-Chip Modules)
•No need to develop kernels in OpenCL (for ATI cards) or CUDA (for NVIDIA cards).
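As an illustration of the directive-based GPU approach listed above, a rough sketch of an HMPP-annotated codelet; the exact directive syntax, clauses, and supported targets depend on the CAPS HMPP Workbench version, so treat the pragmas below as an approximation rather than verified syntax (function and label names are made up for this sketch):

/* Codelet that HMPP can turn into a GPU kernel. */
#pragma hmpp saxpy codelet, target=CUDA, args[y].io=inout
void saxpy(int n, float alpha, float x[n], float y[n])
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The marked call site is replaced by data transfers plus a launch of
       the generated kernel; without HMPP the pragma is ignored and the
       plain C loop runs on the CPU. */
    #pragma hmpp saxpy callsite
    saxpy(N, 2.0f, x, y);

    return 0;
}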
- 8. PRACE 2011 prototype : exceeding all requirements
Requirements and Metrics
•Budget of 400k EUR
•20 TFLOP/s peak
•Contained in 1 square meter (i.e. a single rack)
•0.6 TFLOP/kW
•This system delivers real double-precision 10 TFLOP/s per rack within 10 kW, assuming the multi-core CPUs only pump data in and out of the GPUs.
•100 racks can deliver 1 PFLOP/s within 1 MW, at a cost on the order of 20-25 million EUR (100 x 250k EUR per rack).
•This reaches the affordability point for the private sector (e.g. O&G).
Metric Requirement Proposal Exceeded
peak TF/s 20 22 yes
m^2 1 1 (42U rack) met
TF/kW 0.6 1.1 yes
EUR 400k 250K yes
- 9. PRACE 2011 prototype : HW flexibility on SW request
NextIO box:
•PCIe gen2 switch box/appliance (GPUs, IB, FusionIO, ...)
•Connection of x8 and x16 PCIe gen1/2 devices
•Hot-swappable PCI devices
•Dynamically reconfigurable through SW without having to reboot the system
•Virtualization capabilities through the IOMMU
•Applications can be “tuned” to use more or fewer GPUs on request: some GPU kernels are very efficient and their PCIe bandwidth requirements are low, so more GPUs can be dynamically allocated to fewer compute nodes, increasing performance/watt.
•Targeting SC10 to do a demo/cluster challenge.
- 10. Performance debugging of HW/SW on multi…you name it.
Example: 2 processors + 2 chipsets + 2 GPUs
PCIeSpeedTest available at developer.amd.com
(Diagram: four NUMA nodes CPU 0-3, each with local memory Mem0-Mem3; GPU 0 behind Chipset 0 and GPU 1 behind Chipset 1, each chipset attached over an ncHT link; HT links labeled b-12 GB/s (bidirectional) or u-6 GB/s (unidirectional), with some hops at 5-6 GB/s.)
Pay attention to the cHT topology and I/O balancing, since cHT links can restrict CPU↔GPU bandwidth (down to u-5 GB/s restricted bandwidth); a minimal measurement sketch follows below.
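For readers without PCIeSpeedTest at hand, a minimal sketch of the kind of measurement it performs (host-to-device and device-to-host copy bandwidth), written here against the standard OpenCL host API rather than the CAL API the actual tool uses; the transfer size, device selection, and use of pageable host memory are illustrative assumptions (a real benchmark would use pinned buffers and repeat/average the transfers):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)                          /* wall-clock seconds */
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    size_t bytes = 256UL << 20;                  /* 256 MB per transfer */
    void *host = malloc(bytes);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    double t0 = now();                           /* CPU -> GPU */
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
    printf("CPU->GPU: %.2f GB/s\n", bytes / (now() - t0) / 1e9);

    t0 = now();                                  /* GPU -> CPU */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
    printf("GPU->CPU: %.2f GB/s\n", bytes / (now() - t0) / 1e9);

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    free(host);
    return 0;
}

Running the same measurement pinned to different cores and memory nodes (e.g. with numactl) is what exposes the topology effects discussed on the following slides.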
- 11. Performance debugging of HW on multi…you name it.
Problem: CPU→GPU is good, GPU→CPU is bad. Why?
(Chart: bandwidth (GB/s) over time for core0_mem0 to/from the GPU; series are the node's memory-controller total DRAM accesses, DRAM writes, DRAM reads including prefetch, and HyperTransport HT3 transmit, all in MB/s.)
First phase: CPU→GPU at ~6.4 GB/s (good); the memory controller does reads.
Second phase: GPU→CPU at 3.5 GB/s (low); something is wrong.
In the second phase the memory controller is doing both writes and reads. The reads do not make sense: the memory controller should only do writes.
- 12. While we figure out why there are reads on the MCT for GPU→CPU, let's look at affinity/binding
•On Node 0 → GPU 0: GPU 0 → CPU 0 at u-3.5 GB/s
•On Node 1 → GPU 0: GPU 0 → CPU 1 at u-2.1 GB/s
•On Node 1 (process) but memory/GART on Node 0: GPU 0 → CPU 1 at u-3.5 GB/s
•We cannot pin the buffers of all nodes to the node closest to the GPU, since that overloads the MCT of that node.
(Charts: CPU→GPU and GPU→CPU bandwidth (GB/s) vs. transfer size from 1 byte to 128 MB, for core0_node0, core4_node1, and core4_node0.)
- 13. Linux/Windows driver fix got rid of the reads on GPU→CPU
(Charts: after the fix, bandwidth over time for core0_mem0 to/from the GPU shows only reads on the memory controller for CPU→GPU and only writes for GPU→CPU; a second chart shows the % improvement vs. transfer size (256 KB to 256 MB) for CPU→GPU and GPU→CPU.)
- 14. Linux affinity/binding for CPU and GPU
It is important to set process/thread affinity/binding as well as memory affinity/binding.
•Externally, using numactl. Caution: it is getting obsolete on Multi-Chip Modules (MCM), since it sees all the cores on the socket as a single chip with a single memory controller and a single shared L3 cache.
•New tool: hwloc (www.open-mpi.org/projects/hwloc); memory binding is not yet implemented. It is being used in OpenMPI and MVAPICH.
•hwloc also replaces PLPA (obsolete), since PLPA cannot properly see MCM processors (e.g. Magny-Cours, Beckton).
•For CPU and GPU work, set the process/thread affinity and bind memory locally to it (see the sketch after this list). Do not rely on first touch for the GPU.
•The GPU driver can put the GART on any node and incur remote memory accesses when sending data to/from the GPU.
•Enforcing memory binding creates the GART buffers on the same memory node, maximizing I/O throughput.
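A minimal sketch on Linux of the pattern described in this list, using sched_setaffinity plus the libnuma API (link with -lnuma); the core and node numbers are illustrative assumptions, and real code would derive from the cHT topology (e.g. via hwloc) which core and memory node sit closest to the GPU's chipset:

#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

int main(void)
{
    int core = 0, node = 0;   /* assumption: core 0 / node 0 are closest to the GPU */

    /* Pin the calling process/thread to one core near the GPU's chipset. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* Prefer the same node for all further allocations, so buffers the
       driver later maps into the GART also end up on the local node. */
    numa_set_preferred(node);

    /* Explicitly node-bound allocation for the transfer buffer itself. */
    size_t bytes = 64UL << 20;
    double *buf = numa_alloc_onnode(bytes, node);

    /* ... hand 'buf' to the CPU<->GPU transfer path ... */

    numa_free(buf, bytes);
    return 0;
}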
- 15. Linux results with processor and memory binding
(Charts: with processor and memory binding, CPU→GPU and GPU→CPU bandwidth (GB/s) vs. transfer size from 1 byte to 128 MB, for core0_mem0 and core4_mem1.)
- 16. Windows affinity/binding for CPU and GPU
•There is no numactl command on Windows.
•There is a start /affinity command at the command prompt, but it only sets process affinity.
•The hwloc tool is also available on Windows, but again memory node binding is mandatory (and hwloc does not yet provide it).
•There are huge performance penalties if memory node binding is not done.
•It requires the API provided by Microsoft for Windows HPC and Windows 7. Main function call: VirtualAllocExNuma.
•There is a debate among SW development teams about where the fix needs to be introduced. Possible SW layers: application, OpenCL, CAL, user driver, kernel driver.
•Ideally, OpenCL/CAL should read the affinity of the process/thread and set the NUMA node binding before creating GART resources.
•Running out of time, a quick fix was implemented at the application level (i.e. in the microbenchmark). There is a concern about complexity, since the application developer needs to be aware of the cHT topology and NUMA nodes.
- 17. Portion of code changed and results
Example of allocating a simple array with memory node binding:
/* NUMA-bound allocation replacing the plain malloc(): */
a = (double *) VirtualAllocExNuma(hProcess, NULL, array_sz_bytes,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                  NodeNumber);   /* pages placed on the chosen NUMA node */
/* a = (double *) malloc(array_sz_bytes); */
for (i = 0; i < array_sz; i++) a[i] = function(i);
/* With MEM_RELEASE the size argument must be 0: */
VirtualFreeEx(hProcess, (void *)a, 0, MEM_RELEASE);
/* free((void *)a); */
For PCIeSpeedTest, the fix was introduced at the USER LEVEL with VirtualAllocExNuma + calResCreate2D, instead of calResAllocRemote2D, which performs the allocations inside CAL without giving the USER the possibility to set the affinity.
(Chart: Windows with memory node 0 binding, CPU→GPU and GPU→CPU bandwidth (GB/s) vs. transfer size from 1 byte to 128 MB.)
- 18. Q&A
How many QDR IB cards can a single chipset handle? Same performance as the Gemini interconnect.
THANK YOU
(Chart: 3 QDR IB cards on a single PCIe gen2 chipset, RDMA write bandwidth (GB/s) on the client side over time; series are the client node's memory-controller total DRAM accesses, HT3 transmit, and HT3 CRC, in MB/s.)
- 19. How to assess Application Efficiency with performance counters
–Efficiency questions (theoretical vs. maximum achievable vs. real applications) on I/O subsystems: cHT, ncHT, (multi) chipsets, (multi) HCAs, (multi) GPUs, HDs, SSDs.
–Most of those efficiency questions can be represented graphically.
Example: the roofline model (Williams), which allows comparing architectures and how efficiently the workloads exploit them. Values are obtained through performance counters (see the sketch below).
Appendix: Understanding workload requirements
(Roofline chart: GF/s vs. arithmetic intensity (GF/s)/(GB/s) on log scales for a 2P G34 Magny-Cours at 2.2 GHz; the peak GFLOP/s ceiling is plotted together with Stream Triad (OP=0.082), GROMACS (OP=5, GF=35), and DGEMM (OP=14.15).)
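A minimal sketch of the roofline arithmetic behind the chart above: attainable GF/s = min(peak GF/s, OP x peak GB/s), where OP is the operational (arithmetic) intensity in FLOP/byte. The peak compute and memory-bandwidth numbers below are placeholders, not measured values for the 2P Magny-Cours system; the OP values are the ones quoted on the slide:

#include <stdio.h>

/* Roofline model: performance is capped either by peak compute or by
   memory bandwidth times operational intensity, whichever is lower. */
static double roofline(double peak_gflops, double peak_gbs, double op)
{
    double mem_bound = op * peak_gbs;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    double peak_gflops = 200.0;   /* assumed double-precision peak, placeholder */
    double peak_gbs    = 40.0;    /* assumed sustained DRAM bandwidth, placeholder */

    printf("Stream Triad (OP=0.082): %6.1f GF/s attainable\n",
           roofline(peak_gflops, peak_gbs, 0.082));
    printf("GROMACS      (OP=5):     %6.1f GF/s attainable\n",
           roofline(peak_gflops, peak_gbs, 5.0));
    printf("DGEMM        (OP=14.15): %6.1f GF/s attainable\n",
           roofline(peak_gflops, peak_gbs, 14.15));
    return 0;
}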