Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016
- 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dr. Jeffrey B. Layton
Global Scientific Computing
June 20, 2016
Building HPC Clusters as Code in
the [Almost] Infinite Cloud
- 2. Agenda
• Why cloud for HPC?
• Tools for creating clusters in the cloud
• Spot + HPC = peas and carrots
• Fermi National Accelerator Laboratory
• Demo of scaling jobs on a budget
• Summary
- 3. Why cloud for HPC?
Scalability
• If you need to run on lots of cores, just spin them up
• If you don’t need nodes, turn them off (and don’t pay for them)
Time to research
• Usually on-premises high-performance computing (HPC) resources are centralized (shared)
• Researchers like to have their own nodes when they need them
World-wide collaboration
• Share data and interact with it by using the cloud
Latest technology and various instance types
Can save $$$
Flexibility: code as infrastructure
- 4. AWS HPC architectures—phases of deployment
• Fork lift
• Make it look like on-premises
• Cloud “port”
• Adapt to cloud features
• Auto Scaling
• Spot
• Born in the cloud
• Cycle computing
• Rethink application
• Microservices and serverless computing
You must think in “cloud”
You cannot think in “on-prem” and transpose
You must think in “cloud”
Do you think you can do that, Mr. Gant?
- 5. AWS HPC architecture
[Diagram: side-by-side comparison. On-premises: master node, compute nodes, and storage (NFS, parallel). AWS Cloud: master instance, compute instances, and storage (NFS, parallel), with additional compute instances added elastically]
- 6. HPC tools
MIT StarCluster
• No longer supported or developed
Bright Cluster Manager
• Good for hybrid solutions
CloudyCluster
• Omnibond (out of Clemson University)
AWS CfnCluster
• A good starting point for writing your own tools
Alces Flight—on AWS Marketplace
- 7. Alces Flight
Alces Flight is software offering self-service
supercomputers through the AWS
Marketplace (the cloud’s “App Store”). It
creates self-scaling clusters with more than 750
popular scientific applications pre-installed,
complete with libraries and various compiler
optimizations, ready to run. The clusters use
the AWS Spot market by default.
5 minutes
http://alces-flight.com
- 8. Alces Flight is familiar and flexible
• Same tools as virtually all HPC systems
• Environment modules
• Job scheduler (SGE)
• Catalog of 750+ prebuilt scientific applications and
libraries including visualization tools
• Alces gridware tool for application management
• Integrated with modules
• Defaults to the Spot market
• Auto Scaling cluster based on queued jobs
- 9. Flight enables collaboration
Access the graphical console of your
control node simultaneously with your
collaborators
• Run visual apps that use the elastic
cluster to drive visual results, and work
together in the visual console in real time
Shared and secure cloud workspaces
• Control access and focus on data
analysis
• Make more discoveries faster
• Save lives
• Change the world
Collaborative IGV
Integrative Genomics
Viewer (IGV) workspace
for variant analysis
- 12. Spot Market filler
[Chart: # CPUs vs. time, with Spot capacity filling the gaps left by other workloads]
Spot Market: our ultimate space filler.
Spot Instances allow you to name your own price for spare AWS computing capacity.
Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).
- 13. Spot vs. On-Demand (YMMV)
4 compute nodes, 2 hours
• On-Demand (us-east-1): $19.13
• Spot (us-west-1): $7.22
• About 1/3 the cost!
16 compute nodes, 32 hours
• On-Demand (us-east-1): $1,018.77
• Spot (us-west-1): $223.11
• About 1/5 the cost!
- 15. Fermi National Accelerator Laboratory
Fermilab is America’s particle physics and accelerator lab.
• Mission: solve the mysteries of matter, energy, space and time
for the benefit of all.
More than 4,200 scientists worldwide use Fermilab and its
particle accelerators, detectors and computers for their
research.
- 16. Particle Physics Science Drivers
Utilize high-energy particle beam collisions to discover
• the origin of mass, the nature of dark matter, extra dimensions.
Employ high-flux beams to explore
• neutrino interactions, to answer questions about the origins of
the universe, matter-antimatter asymmetry, force unification.
• rare processes, to open a doorway to realms of ultra-high
energies, close to the unification scale.
- 19. Fermilab Facility Evolution: HEPCloud
HEPCloud: provide cost-effective
and efficient “elastic” resource
deployment, using a sophisticated
decision engine and middleware for
automation. A single portal to
heterogeneous computing and
storage resources, both local
and “rental” (commercial or
academic).
• Initial focus on commercial
clouds → AWS
- 20. AWS infrastructure
• AWS CloudFormation
automates the setup and
teardown of the Amazon Route
53 DNS entries, the Elastic
Load Balancing load balancer,
the Auto Scaling group, and
Amazon CloudWatch
monitoring
• Launched in each Availability
Zone prior to workflows being
run
- 21. On-demand services
● Workflows depend on software services to run
● Automating the on-demand deployment of these services on AWS enables scalability and cost savings
o Services include data caching (e.g., Squid), workload management system (WMS), submission service, data transfer, etc.
o As services are made deployable on demand, instantiate ensembles of services together
(e.g., through AWS CloudFormation)
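As an illustrative sketch only (the stack name, template URL, and parameter below are placeholders, not Fermilab’s actual artifacts), such an ensemble could be instantiated from the AWS CLI:

```shell
# Illustrative only: launch an on-demand service stack from a
# CloudFormation template. Stack name, template URL, and parameter
# are placeholders, not real Fermilab artifacts.
STACK_NAME="ondemand-squid"
TEMPLATE_URL="https://example-bucket.s3.amazonaws.com/squid-service.json"

aws cloudformation create-stack \
  --stack-name "$STACK_NAME" \
  --template-url "$TEMPLATE_URL" \
  --parameters ParameterKey=InstanceType,ParameterValue=c4.xlarge \
  || echo "create-stack failed; check AWS CLI configuration and template URL"
```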
● Example: On-demand Squid
- 25. HPC needs of particle physics workflows
Now that the HTC use case is out of the way…
Machine learning for pattern recognition
Specialized HPC demands
Very large computations (petascale) of physics processes
necessary for theoretical interpretations
Very large computations (petascale) for modeling particle
accelerators and detectors
- 28. Summary
Easy to “recreate” clusters in the cloud
• Extremely scalable and flexible
Spot + HPC is a wonderful combination
• Saves time and money
Customer example—FNAL
Alces Flight in AWS Marketplace
This is only the beginning—rethink HPC applications for
fault tolerance, extreme scalability, etc.
- 31. Introduction
Set up a cluster with 2 compute nodes, where:
• Master node = c4.8xlarge
• 2x compute nodes = c4.xlarge
• 10 GigE networking
Run the compute nodes on the Spot market and the master node
On-Demand
Access the cluster from a Microsoft Windows box (using PuTTY)
- 32. Step 1
Start up cluster using Alces Flight JSON file
• CloudFormation service
• Click Create Stack
• Answer questions
• Key file is critical! You will use it to log in to master node.
• Choose a reasonable Spot price (check current market in region)
– http://aws.amazon.com/ec2/pricing/ (near bottom of page)
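As a sketch, the current Spot market can also be checked from the AWS CLI (the region and instance type below are the ones used in this demo; assumes the CLI is installed and configured):

```shell
REGION="us-west-1"        # example region
INSTANCE_TYPE="c4.xlarge" # compute node type used in this demo

# Show recent Spot price points for Linux instances of that type
aws ec2 describe-spot-price-history \
  --region "$REGION" \
  --instance-types "$INSTANCE_TYPE" \
  --product-descriptions "Linux/UNIX" \
  --max-items 5 \
  --query 'SpotPriceHistory[].[Timestamp,AvailabilityZone,SpotPrice]' \
  --output table \
  || echo "Spot price query failed; check that the AWS CLI is installed and configured"
```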
- 34. Create a stack
• Specify the details of template instantiation
• Called a “stack”
• Allows you to tailor the stack to your needs
- 35. Stack details—top portion
Name of cluster
Spot bid
Instance type for compute nodes
Amazon S3 bucket for customizations
Key pair for that region
Number of initial nodes in cluster
- 44. Cluster configuration
Recall that Alces Flight comes with:
• Environment modules (connected to Alces Gridware)
• pdsh
• SGE job scheduler
• GNU Compilers
• Alces Gridware
• Built on CentOS 7
• 750+ applications and libraries (MPI included)
Log into master node and try out commands (PuTTY). Run
application.
- 46. Keep alive in PuTTY
Go to Connection in the left menu and click it
Select “Enable TCP keepalives”
This keeps the PuTTY connection alive
- 47. Add key to PuTTY session
Go to SSH on left menu
Expand menu
Select Auth
Use Browse to locate the private key (should be the same one used when the cluster was created)
Note: it has to be in .ppk format (you might have to convert it from .pem format with PuTTYgen)
- 48. Log in to master node!
Use “alces” as the login (should match what you input when creating the cluster)
No password needed (uses the key pair)
Ready to go!
- 49. Check number of nodes
pdsh uses genders:
– “nodes” includes only the compute nodes
– “cluster” also includes the master node
Be sure to check “qhost” for the compute nodes
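The checks above can be sketched as a dry run (commands are echoed so the script reads as a checklist; on a live Alces Flight master node you would run them directly):

```shell
# Dry-run sketch of the node checks; swap the echo for real execution
# when logged in to the master node.
run() { echo "+ $*"; }

run pdsh -g nodes uptime    # genders group "nodes": compute nodes only
run pdsh -g cluster uptime  # genders group "cluster": compute nodes + master
run qhost                   # SGE's view of the hosts that can accept jobs
```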
- 52. Install an application
Search for an application using “alces gridware search …”
Install an application using “alces gridware install …”
Environment modules are updated when an application is installed
- 55. Remove module and application
First, remove the module
Second, run “alces gridware purge …”
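The install/remove lifecycle from the last two slides, sketched as a dry run (the package name apps/imb is an assumption taken from the IMB job script used later in the demo):

```shell
# Dry-run sketch of the Gridware application lifecycle; replace the
# echo with real execution on the master node. "apps/imb" is an
# assumed example package name.
run() { echo "+ $*"; }

run alces gridware search imb        # find the package
run alces gridware install apps/imb  # install it; a matching module appears
run module load apps/imb             # use it
run module unload apps/imb           # first, remove the module ...
run alces gridware purge apps/imb    # ... then purge the application
```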
- 57. Demo 3
Cluster is up—show running MPI application
• Which MPI application (make it something reasonable)
Install application
• Show change in modules
Job script (go over details)
Submit job—show output of qstat
• Auto Scaling?
Show output from application (yes it’s running)
- 62. Create job script
Don’t forget that Alces Flight uses SGE
#!/bin/bash
#$ -j y -N imb -o $HOME/imb_out.$JOB_ID
#$ -pe mpinodes-verbose 2 -cwd -V
module load mpi/openmpi
module load apps/imb
mpirun IMB-MPI1
Alces also has job templates available:
“alces gridware templates list”
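As a sketch, the job script above can be written to a file and submitted like this (qsub and qstat exist only on the cluster; the file-writing part runs anywhere):

```shell
# Write the SGE job script from the slide above to imb.sh
cat > imb.sh <<'EOF'
#!/bin/bash
#$ -j y -N imb -o $HOME/imb_out.$JOB_ID
#$ -pe mpinodes-verbose 2 -cwd -V

module load mpi/openmpi
module load apps/imb

mpirun IMB-MPI1
EOF

# On the master node you would then run:
#   qsub imb.sh   # submit the job
#   qstat         # watch it; queued jobs can trigger Auto Scaling
echo "wrote imb.sh"
```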
Editor's Notes
- Alces Flight can instantly take a researcher from zero to hero by building HPC clusters at any scale, with one of the largest catalogs of scientific applications ever put in one place, all immediately accessible in the AWS Cloud.
Through AWS Marketplace, Alces Flight gives researchers access to a massive catalog of scientific apps in exactly the same way they’re used to working with a national supercomputing center, along with libraries, compilers and job schedulers that provide a very familiar look and feel. It also provides many things that national shared facilities can’t easily provide, like console access for GUIs and visualization tools, or admin access to install packages or modify the environment to suit the specific needs of a user.