Toward a practical “HPC Cloud”:
Performance tuning of a virtualized HPC cluster

Ryousei Takano

Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST), Japan

SC2011@Seattle, Nov. 15, 2011
Outline
•  What is HPC Cloud?
•  Performance tuning method for HPC Cloud
  –  PCI passthrough
  –  NUMA affinity
  –  VMM noise reduction
•  Performance evaluation




HPC Cloud
HPC Cloud utilizes cloud resources for High Performance Computing (HPC)
applications.

(figure: users request resources according to their needs, and the provider
allocates each user a dedicated virtual cluster on demand, carved out of
virtualized clusters running on a physical cluster)
HPC Cloud (cont’d)
•  Pros:
   –  User side: easy deployment
   –  Provider side: high resource utilization
•  Cons:
   –  Performance degradation?

  Performance tuning methods for virtualized environments are not yet
  established.
Toward a practical HPC Cloud
(figure: road map from the current HPC Cloud to a “true” HPC Cloud)
•  Current HPC Cloud: its performance is poor and unstable.
•  Tuning steps applied to a KVM guest (a VM is a QEMU process whose guest
   threads run on VCPU threads scheduled by the Linux kernel):
   –  Use PCI passthrough so the physical driver in the guest reaches the NIC
      directly instead of going through the VMM.
   –  Set NUMA affinity between VCPU threads and CPU sockets.
   –  Reduce VMM noise (not completed): reduce the overhead of interrupt
      virtualization and disable unnecessary services on the host OS
      (e.g., ksmd); see the sketch below.
•  “True” HPC Cloud: its performance approaches that of bare metal.
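
The host-side part of the VMM-noise step can be illustrated with a few shell
commands. This is a minimal sketch, assuming a Linux host with KSM enabled;
the extra service names are illustrative only, and the exact list of services
to stop depends on the distribution:

    #!/bin/sh
    # Stop KSM page scanning: ksmd periodically scans and merges guest memory
    # pages and is one source of VMM noise on a dedicated compute host.
    echo 0 > /sys/kernel/mm/ksm/run

    # Disable other host services that a dedicated compute node does not need
    # (illustrative examples only).
    /etc/init.d/cron stop
    /etc/init.d/atd stop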
PCI passthrough
(figure: three I/O virtualization approaches, compared in terms of VM sharing
and performance)
•  I/O emulation: the guest driver goes through a vSwitch and the physical
   driver inside the VMM before reaching the NIC; VMs share the device.
•  PCI passthrough: the physical driver runs inside the guest and accesses
   the NIC directly, bypassing the VMM (a host-side sketch follows below).
•  SR-IOV: each guest runs a physical driver against its own virtual
   function, and an embedded switch (VEB) on the NIC multiplexes the VMs.
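
On the host, PCI passthrough with KVM roughly amounts to detaching the device
from its host driver and assigning it to the guest. A minimal sketch, assuming
the pci-stub/pci-assign mechanism of qemu-kvm builds from this era; the PCI
address, vendor/device IDs, and disk image are placeholders (check yours with
"lspci -nn"), and option names vary between versions:

    #!/bin/sh
    # Placeholder PCI address and vendor/device IDs of the InfiniBand HCA.
    BDF=0000:0b:00.0
    IDS="15b3 673c"

    # Detach the HCA from its host driver and bind it to pci-stub so the
    # host no longer touches it.
    modprobe pci_stub
    echo "$IDS" > /sys/bus/pci/drivers/pci-stub/new_id
    echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind
    echo "$BDF" > /sys/bus/pci/drivers/pci-stub/bind

    # Boot the guest with the device assigned to it (8 VCPUs and 45 GB of
    # memory, matching the VM environment used in the evaluation).
    qemu-system-x86_64 -enable-kvm -smp 8 -m 46080 \
        -device pci-assign,host=0b:00.0 \
        -drive file=guest.img,if=virtio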
Virtual CPU scheduling
(figure: virtual CPU scheduling; bare metal, Xen, and KVM compared)
•  Xen: a VM (DomU) runs guest threads on VCPUs V0-V3, which the Xen
   hypervisor’s domain scheduler maps onto physical CPUs P0-P3; the guest OS
   cannot run numactl.
•  KVM: a VM is a QEMU process whose VCPU threads are scheduled onto the
   physical CPUs of each CPU socket by the ordinary Linux process scheduler;
   the Virtual Machine Monitor (VMM) is the Linux kernel plus the KVM module
   (see the sketch below).
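
Because a KVM guest is just a QEMU process, its VCPUs are ordinary host
threads handled by the Linux scheduler and can be inspected with standard
tools. A minimal sketch; the process name depends on how the qemu-kvm binary
is invoked:

    # Each VCPU of a running guest appears as one thread (LWP column) of the
    # QEMU process.
    ps -eLf | grep [q]emu

    # The QEMU monitor command "info cpus" likewise prints one line per VCPU
    # with its host thread_id, which is what the pinning on the next slide
    # needs.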
NUMA affinity
(figure: NUMA affinity on bare metal and under KVM)
•  Bare metal: numactl binds application threads and memory to a CPU socket
   and its local memory through the Linux process scheduler.
•  KVM: numactl inside the guest binds threads to a virtual socket (vSocket),
   and taskset on the host pins each VCPU thread to a physical CPU (Vn = Pn);
   see the sketch below.
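
The binding scheme in the figure uses exactly the two tools named on the
slide: taskset on the host and numactl in the guest. A minimal sketch,
assuming a four-VCPU guest whose VCPU thread IDs (placeholders below) were
taken from "ps -eLf" or the QEMU monitor, and an illustrative application
name:

    # Host side: pin each VCPU thread to the matching physical CPU (Vn = Pn).
    taskset -pc 0 12345   # V0 -> P0
    taskset -pc 1 12346   # V1 -> P1
    taskset -pc 2 12347   # V2 -> P2
    taskset -pc 3 12348   # V3 -> P3

    # Guest side: bind the application's threads and memory to the virtual
    # socket (vSocket) seen by the guest.
    numactl --cpunodebind=0 --membind=0 ./app

With both bindings in place, a guest thread placed on V0 always executes on
P0, so memory locality seen inside the guest matches the physical NUMA
topology.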
Evaluation
Evaluation of HPC applications on a 16-node cluster
(part of the AIST Green Cloud Cluster)

Compute node: Dell PowerEdge M610
  CPU           Intel quad-core Xeon E5540/2.53GHz x2
  Chipset       Intel 5520
  Memory        48 GB DDR3
  InfiniBand    Mellanox ConnectX (MT26428)
Blade switch
  InfiniBand    Mellanox M3601Q (QDR, 16 ports)

Host machine environment
  OS            Debian 6.0.1
  Linux kernel  2.6.32-5-amd64
  KVM           0.12.50
  Compiler      gcc/gfortran 4.4.5
  MPI           Open MPI 1.4.2
VM environment
  VCPU          8
  Memory        45 GB
MPI Point-to-Point communication performance
(figure: bandwidth [MB/sec] vs. message size [byte], from 1 byte to 1 GB on
log-log axes; higher is better; series: Bare Metal and KVM)
PCI passthrough improves MPI communication throughput to close to that of
bare metal machines. (Bare Metal: non-virtualized cluster)
NUMA affinity
Execution time on a single node: NPB multi-zone (Computational Fluid
Dynamics) and Bloss (non-linear eigensolver)

                 SP-MZ [sec]      BT-MZ [sec]      Bloss [min]
  Bare Metal     94.41 (1.00)     138.01 (1.00)    21.02 (1.00)
  KVM            104.57 (1.11)    141.69 (1.03)    22.12 (1.05)
  KVM (w/ bind)  96.14 (1.02)     139.32 (1.01)    21.28 (1.01)

NUMA affinity is an important performance factor not only on bare metal
machines but also on virtual machines.
NPB BT-MZ: Parallel efficiency
(figure: performance [Gop/s total] and parallel efficiency [%] vs. number of
nodes, 1 to 16; higher is better; series: Bare Metal, KVM, and Amazon EC2,
each with its parallel efficiency curve)
Degradation of parallel efficiency: KVM 2%, EC2 14%.
Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver
  –  Hierarchical parallel program using MPI and OpenMP
(figure: parallel efficiency [%] vs. number of nodes, 1 to 16; series: Bare
Metal, KVM, Amazon EC2, and Ideal)
Degradation of parallel efficiency: KVM 8%, EC2 22%, attributed to the
overhead of communication and virtualization.
Summary
HPC Cloud is promising!
•  The performance of coarse-grained parallel
   applications is comparable to that of bare
   metal machines
•  We plan to operate a private cloud service
   “AIST Cloud” for HPC users
•  Open issues
  –  VMM noise reduction
  –  VMM-bypass device-aware VM scheduling
  –  Live migration with VMM-bypass devices
LINPACK Efficiency
(figure: LINPACK efficiency [%] vs. TOP500 rank for the TOP500 June 2011 list,
plotted per interconnect: InfiniBand, 10 Gigabit Ethernet, Gigabit Ethernet;
Efficiency = Rmax (maximum LINPACK performance) / Rpeak (theoretical peak
performance))
Annotations: InfiniBand: 79%; 10 Gigabit Ethernet: 74%; Gigabit Ethernet: 54%;
GPGPU machines; #451 Amazon EC2 cluster compute instances.
Virtualization causes the performance degradation!
Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver
  –  Hierarchical parallel program using MPI and OpenMP
(figure: parallel efficiency [%] vs. number of nodes, 1 to 16; series: Bare
Metal, KVM, KVM (w/ bind), Amazon EC2, and Ideal)
Binding threads to physical CPUs can be sensitive to VMM noise and degrade
the performance.
