Toward a practical “HPC Cloud”:
Performance tuning of a virtualized HPC cluster

Ryousei Takano

Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST), Japan

SC2011@Seattle, Nov. 15, 2011
Outline
•  What is HPC Cloud?
•  Performance tuning method for HPC Cloud
  –  PCI passthrough
  –  NUMA affinity
  –  VMM noise reduction
•  Performance evaluation




HPC Cloud
HPC Cloud utilizes cloud resources for High Performance Computing (HPC)
applications.

(figure: users request resources according to their needs, and the provider
allocates each user a dedicated virtual cluster on demand, carved out of
virtualized clusters running on a physical cluster)
HPC Cloud (cont’d)
•  Pros:
   –  User side: easy deployment
   –  Provider side: high resource utilization
•  Cons:
   –  Performance degradation?

  Performance tuning methods for virtualized environments are not yet
  established.
Toward a practical HPC Cloud
(figure: road map from the current HPC Cloud to a “true” HPC Cloud)
•  Current HPC Cloud: its performance is poor and unstable.
•  Tuning steps applied to a KVM guest (a VM is a QEMU process whose guest
   threads run on VCPU threads scheduled by the Linux kernel):
   –  Use PCI passthrough so the physical driver in the guest reaches the NIC
      directly instead of going through the VMM.
   –  Set NUMA affinity between VCPU threads and CPU sockets.
   –  Reduce VMM noise (not completed): reduce the overhead of interrupt
      virtualization and disable unnecessary services on the host OS
      (e.g., ksmd); see the sketch below.
•  “True” HPC Cloud: its performance approaches that of bare metal.
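
The host-side part of the VMM-noise step can be illustrated with a few shell
commands. This is a minimal sketch, assuming a Linux host with KSM enabled;
the extra service names are illustrative only, and the exact list of services
to stop depends on the distribution:

    #!/bin/sh
    # Stop KSM page scanning: ksmd periodically scans and merges guest memory
    # pages and is one source of VMM noise on a dedicated compute host.
    echo 0 > /sys/kernel/mm/ksm/run

    # Disable other host services that a dedicated compute node does not need
    # (illustrative examples only).
    /etc/init.d/cron stop
    /etc/init.d/atd stop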
PCI passthrough
(figure: three I/O virtualization approaches, compared in terms of VM sharing
and performance)
•  I/O emulation: the guest driver goes through a vSwitch and the physical
   driver inside the VMM before reaching the NIC; VMs share the device.
•  PCI passthrough: the physical driver runs inside the guest and accesses
   the NIC directly, bypassing the VMM (a host-side sketch follows below).
•  SR-IOV: each guest runs a physical driver against its own virtual
   function, and an embedded switch (VEB) on the NIC multiplexes the VMs.
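
On the host, PCI passthrough with KVM roughly amounts to detaching the device
from its host driver and assigning it to the guest. A minimal sketch, assuming
the pci-stub/pci-assign mechanism of qemu-kvm builds from this era; the PCI
address, vendor/device IDs, and disk image are placeholders (check yours with
"lspci -nn"), and option names vary between versions:

    #!/bin/sh
    # Placeholder PCI address and vendor/device IDs of the InfiniBand HCA.
    BDF=0000:0b:00.0
    IDS="15b3 673c"

    # Detach the HCA from its host driver and bind it to pci-stub so the
    # host no longer touches it.
    modprobe pci_stub
    echo "$IDS" > /sys/bus/pci/drivers/pci-stub/new_id
    echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind
    echo "$BDF" > /sys/bus/pci/drivers/pci-stub/bind

    # Boot the guest with the device assigned to it (8 VCPUs and 45 GB of
    # memory, matching the VM environment used in the evaluation).
    qemu-system-x86_64 -enable-kvm -smp 8 -m 46080 \
        -device pci-assign,host=0b:00.0 \
        -drive file=guest.img,if=virtio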
Virtual CPU scheduling
(figure: virtual CPU scheduling; bare metal, Xen, and KVM compared)
•  Xen: a VM (DomU) runs guest threads on VCPUs V0-V3, which the Xen
   hypervisor’s domain scheduler maps onto physical CPUs P0-P3; the guest OS
   cannot run numactl.
•  KVM: a VM is a QEMU process whose VCPU threads are scheduled onto the
   physical CPUs of each CPU socket by the ordinary Linux process scheduler;
   the Virtual Machine Monitor (VMM) is the Linux kernel plus the KVM module
   (see the sketch below).
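
Because a KVM guest is just a QEMU process, its VCPUs are ordinary host
threads handled by the Linux scheduler and can be inspected with standard
tools. A minimal sketch; the process name depends on how the qemu-kvm binary
is invoked:

    # Each VCPU of a running guest appears as one thread (LWP column) of the
    # QEMU process.
    ps -eLf | grep [q]emu

    # The QEMU monitor command "info cpus" likewise prints one line per VCPU
    # with its host thread_id, which is what the pinning on the next slide
    # needs.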
NUMA affinity
(figure: NUMA affinity on bare metal and under KVM)
•  Bare metal: numactl binds application threads and memory to a CPU socket
   and its local memory through the Linux process scheduler.
•  KVM: numactl inside the guest binds threads to a virtual socket (vSocket),
   and taskset on the host pins each VCPU thread to a physical CPU (Vn = Pn);
   see the sketch below.
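
The binding scheme in the figure uses exactly the two tools named on the
slide: taskset on the host and numactl in the guest. A minimal sketch,
assuming a four-VCPU guest whose VCPU thread IDs (placeholders below) were
taken from "ps -eLf" or the QEMU monitor, and an illustrative application
name:

    # Host side: pin each VCPU thread to the matching physical CPU (Vn = Pn).
    taskset -pc 0 12345   # V0 -> P0
    taskset -pc 1 12346   # V1 -> P1
    taskset -pc 2 12347   # V2 -> P2
    taskset -pc 3 12348   # V3 -> P3

    # Guest side: bind the application's threads and memory to the virtual
    # socket (vSocket) seen by the guest.
    numactl --cpunodebind=0 --membind=0 ./app

With both bindings in place, a guest thread placed on V0 always executes on
P0, so memory locality seen inside the guest matches the physical NUMA
topology.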
Evaluation
Evaluation of HPC applications on a 16-node cluster
(part of the AIST Green Cloud Cluster)

Compute node: Dell PowerEdge M610
  CPU           Intel quad-core Xeon E5540/2.53GHz x2
  Chipset       Intel 5520
  Memory        48 GB DDR3
  InfiniBand    Mellanox ConnectX (MT26428)
Blade switch
  InfiniBand    Mellanox M3601Q (QDR, 16 ports)

Host machine environment
  OS            Debian 6.0.1
  Linux kernel  2.6.32-5-amd64
  KVM           0.12.50
  Compiler      gcc/gfortran 4.4.5
  MPI           Open MPI 1.4.2
VM environment
  VCPU          8
  Memory        45 GB
MPI Point-to-Point communication performance
(figure: bandwidth [MB/sec] vs. message size [byte], from 1 byte to 1 GB on
log-log axes; higher is better; series: Bare Metal and KVM)
PCI passthrough improves MPI communication throughput to close to that of
bare metal machines. (Bare Metal: non-virtualized cluster)
NUMA affinity
Execution time on a single node: NPB multi-zone (Computational Fluid
Dynamics) and Bloss (non-linear eigensolver)

                 SP-MZ [sec]      BT-MZ [sec]      Bloss [min]
  Bare Metal     94.41 (1.00)     138.01 (1.00)    21.02 (1.00)
  KVM            104.57 (1.11)    141.69 (1.03)    22.12 (1.05)
  KVM (w/ bind)  96.14 (1.02)     139.32 (1.01)    21.28 (1.01)

NUMA affinity is an important performance factor not only on bare metal
machines but also on virtual machines.
NPB BT-MZ: Parallel efficiency
(figure: performance [Gop/s total] and parallel efficiency [%] vs. number of
nodes, 1 to 16; higher is better; series: Bare Metal, KVM, and Amazon EC2,
each with its parallel efficiency curve)
Degradation of parallel efficiency: KVM 2%, EC2 14%.
Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver
  –  Hierarchical parallel program using MPI and OpenMP
(figure: parallel efficiency [%] vs. number of nodes, 1 to 16; series: Bare
Metal, KVM, Amazon EC2, and Ideal)
Degradation of parallel efficiency: KVM 8%, EC2 22%, attributed to the
overhead of communication and virtualization.
Summary
HPC Cloud is promising!
•  The performance of coarse-grained parallel
   applications is comparable to that of bare
   metal machines
•  We plan to operate a private cloud service
   “AIST Cloud” for HPC users
•  Open issues
  –  VMM noise reduction
  –  VMM-bypass device-aware VM scheduling
  –  Live migration with VMM-bypass devices
LINPACK Efficiency
(figure: LINPACK efficiency [%] vs. TOP500 rank for the TOP500 June 2011 list,
plotted per interconnect: InfiniBand, 10 Gigabit Ethernet, Gigabit Ethernet;
Efficiency = Rmax (maximum LINPACK performance) / Rpeak (theoretical peak
performance))
Annotations: InfiniBand: 79%; 10 Gigabit Ethernet: 74%; Gigabit Ethernet: 54%;
GPGPU machines; #451 Amazon EC2 cluster compute instances.
Virtualization causes the performance degradation!
Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver
  –  Hierarchical parallel program using MPI and OpenMP
(figure: parallel efficiency [%] vs. number of nodes, 1 to 16; series: Bare
Metal, KVM, KVM (w/ bind), Amazon EC2, and Ideal)
Binding threads to physical CPUs can be sensitive to VMM noise and degrade
the performance.
