Near Exascale Computing in the Cloud:
the use of GPU bursts for Multi-Messenger
Astrophysics with IceCube Data
Frank Würthwein
OSG Executive Director
UCSD/SDSC
Jensen Huang keynote @ SC19
2
The Largest Cloud Simulation in History
50k NVIDIA GPUs in the Cloud
350 Petaflops for 2 hours
Distributed across US, Europe & Asia
Saturday morning before SC19 we bought all GPU capacity
that was for sale worldwide across AWS, Azure, and Google
A Story of 3 Cloud Bursts
• Saturday before SC19:
- Buy the entire GPU capacity worldwide that is for sale in AWS, Azure,
and Google for a couple of hours.
- Proof of principle & measurement of global GPU capacity
• February 4th 2020:
- Buy a workday’s worth of GPU capacity of only the most cost effective
GPUs for our application.
- Establish standard operations & cost
• November 4th 2020:
- Repeat without any storage in the cloud. All data input and output via
network. EGRESS via cloud connect to minimize charges.
- Establish on-prem to cloud networking and cloud connect routing
3
We will discuss this story from beginning to end.
The Science Case

IceCube
5
A cubic kilometer of ice at the South Pole is instrumented with 5160 optical sensors.
Astrophysics:
• Discovery of astrophysical neutrinos
• First evidence of neutrino point source (TXS)
• Cosmic rays with surface detector
Particle Physics:
• Atmospheric neutrino oscillation
• Neutrino cross sections at TeV scale
• New physics searches at highest energies
Earth Science:
• Glaciology
• Earth tomography
A facility with very diverse science goals.
This talk is restricted to high energy astrophysics.
High Energy Astrophysics
Science case for IceCube
6
The universe is opaque to light at the highest energies and distances.
Only gravitational waves and neutrinos can pinpoint the most violent events in the universe.
Fortunately, the highest energy neutrinos are of cosmic origin.
They are effectively “background free” as long as the energy is measured correctly.
High energy neutrinos from
outside the solar system
7
First 28 very high energy neutrinos from outside the solar system
Red curve is the photon flux
spectrum measured with the
Fermi satellite.
Black points show the
corresponding high energy
neutrino flux spectrum
measured by IceCube.
This demonstrates both the opaqueness of the universe to high energy photons, and the ability of IceCube to detect neutrinos above the maximum energy at which light can still reach us through this opaqueness.
Science 342 (2013). DOI:
10.1126/science.1242856
Understanding the Origin
8
We now know high energy events happen in the universe. What are they?
$p + \gamma \to \Delta^{+} \to p + \pi^{0} \to p + \gamma\gamma$
$p + \gamma \to \Delta^{+} \to n + \pi^{+} \to n + \mu^{+} + \nu_{\mu}$
(Delta-resonance photoproduction by cosmic-ray protons; figure credit: Aya Ishihara)
The hypothesis:
The same cosmic events produce
neutrinos and photons
We detect the electrons or muons from neutrinos that interact in the ice.
Neutrinos interact very weakly => we need a very large instrumented array of ice to maximize the chance that a cosmic neutrino interacts inside the detector.
Pointing accuracy is needed to point back to the origin of the neutrino.
Telescopes the world over then try to identify the source in the direction IceCube is pointing to for the neutrino.
Multi-messenger Astrophysics


The ν detection challenge
9
(Figure: optical properties of the ice and data/simulation agreement; credit: Aya Ishihara)
Ice properties change with depth and wavelength.
Observed pointing resolution at high energies is systematics limited.
Central value moves for different ice models.
Improved e and τ reconstruction => increased neutrino flux detection => more observations.
Photon propagation through ice runs efficiently on single precision GPUs.
Detailed simulation campaigns to improve pointing resolution by improving the ice model.
Improvement in reconstruction with better ice model near the detectors.
First evidence of an origin
10
(Figure: Event display for neutrino event IceCube-170922A, side and top views, 125 m scale, time axis 0 to 3000 nanoseconds. The time at which a DOM observed a signal is reflected in the color of the hit, with dark blues for the earliest hits and yellow for the latest.)
First location of a source of very high energy neutrinos.
A neutrino produced a high energy muon near IceCube. The muon produced light as it traversed the IceCube volume, and this light was detected by IceCube's array of phototubes.
IceCube alerted the astronomy community to the observation of a single high energy neutrino on September 22, 2017.
A blazar designated by astronomers as TXS 0506+056 was subsequently identified as the most likely source in the direction IceCube was pointing. Multiple telescopes saw light from TXS at the same time IceCube saw the neutrino.
Science 361, 147-151
(2018). DOI:10.1126/science.aat2890
IceCube’s Future Plans
11
(Figure: The IceCube-Gen2 Facility, preliminary timeline for MeV- to EeV-scale physics, showing the surface array, high energy array, radio array, PINGU, and IC86, with R&D, design & approval, construction, and IceCube Upgrade deployment phases spanning 2016 through ~2032. Slide credit: Summer Blot, TeVPA 2018.)
Near term:
add more phototubes to the deep core to increase the granularity of measurements.
Longer term:
• Extend the instrumented volume at smaller granularity.
• Extend the even smaller granularity deep core volume.
• Add a surface array.
Improve the detector for low & high energy neutrinos.
Details on the Cloud Burst(s)

The Idea
• Integrate all GPUs available for sale worldwide into a single HTCondor pool.
- Use 28 regions across 3 cloud providers for a burst of a couple of hours or so.
• IceCube submits their photon propagation workflow to this HTCondor pool.
- We handle the input, the jobs on the GPUs, and the output as a single globally distributed system.
13
Run a GPU burst relevant in scale
for future Exascale HPC systems.
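For readers unfamiliar with HTCondor, here is a minimal sketch of what submitting such a bundle of GPU jobs could look like through the htcondor Python bindings (assuming the HTCondor 9+ submit API). The executable name, resource requests, and job count are illustrative placeholders, not IceCube's actual production workflow.

```python
import htcondor  # HTCondor Python bindings (assumes the HTCondor >= 9 submit API)

# Describe one photon-propagation task. All names and resource numbers here
# are illustrative placeholders, not IceCube's production values.
job = htcondor.Submit({
    "executable": "run_photon_prop.sh",   # hypothetical wrapper around the GPU code
    "arguments": "$(Process)",            # each job works on its own chunk of input
    "request_gpus": "1",                  # one GPU per job
    "request_cpus": "1",
    "request_memory": "4GB",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "photon_prop.$(Cluster).$(Process).out",
    "error": "photon_prop.$(Cluster).$(Process).err",
    "log": "photon_prop.$(Cluster).log",
})

# Submit a bundle of independent jobs to one of the dedicated schedds;
# the pool then matches them to GPU slots in whichever cloud region has capacity.
schedd = htcondor.Schedd()
result = schedd.submit(job, count=1000)
print("Submitted cluster", result.cluster())
```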
A global HTCondor pool
• IceCube, like all OSG user communities, relies on
HTCondor for resource orchestration
- This demo used the standard tools
• Dedicated HW setup
- Avoid disruption of OSG production system
- Optimize HTCondor setup for the spiky nature of the demo
§ multiple schedds for IceCube to submit to
§ collecting resources in each cloud region, then collecting from all
regions into global pool
14
HTCondor Distributed CI
15
(Diagram: IceCube submits through 10 schedds; cloud VMs report to per-region collectors, which feed a central collector and negotiator, forming one global resource pool.)
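As a rough sketch of the "one global resource pool" view, the top-level collector can be queried for all GPU slots and tallied per cloud region with the htcondor Python bindings. The collector hostname and the "CloudRegion" machine-ad attribute below are assumptions made for illustration; the attribute naming in the actual demo setup may have differed.

```python
from collections import Counter
import htcondor  # HTCondor Python bindings

# Hypothetical hostname of the top-level (global) collector.
collector = htcondor.Collector("global-collector.example.edu")

# Query machine (startd) ads that advertise at least one GPU.
# "GPUs" and "CloudRegion" are illustrative attribute names.
ads = collector.query(
    htcondor.AdTypes.Startd,
    constraint="GPUs >= 1",
    projection=["Machine", "GPUs", "CloudRegion"],
)

# Tally GPU slots per cloud region to see how the global pool is spread out.
per_region = Counter()
for ad in ads:
    per_region[ad.get("CloudRegion", "unknown")] += int(ad.get("GPUs", 0))

for region, ngpus in per_region.most_common():
    print(f"{region:25s} {ngpus:7d} GPUs")
```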
Using native Cloud storage
• Input data pre-staged into native Cloud storage
- Each file in one-to-few Cloud regions
§ some replication to deal with limited predictability of resources per region
- Local to Compute for large regions for maximum throughput
- Reading from “close” region for smaller ones to minimize ops
• Output staged back to region-local Cloud storage
• Deployed simple wrappers around Cloud native file
transfer tools
- IceCube jobs do not need to customize for different Clouds
- They just need to know where input data is available
(pretty standard OSG operation mode)
16
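A minimal sketch of what such a wrapper could look like, assuming a dispatch on the URL scheme to each provider's own transfer CLI (aws s3 cp, gsutil cp, azcopy copy). The URL forms and the mapping are illustrative; the actual IceCube wrappers may be structured differently.

```python
import subprocess
import sys

def cloud_copy(src: str, dst: str) -> None:
    """Copy a file to or from native cloud storage using the provider's own CLI.

    Illustrative mapping: s3:// -> AWS CLI, gs:// -> gsutil,
    *.blob.core.windows.net -> azcopy. The job only needs to know the URL.
    """
    url = src if "://" in src else dst
    if url.startswith("s3://"):
        cmd = ["aws", "s3", "cp", src, dst]
    elif url.startswith("gs://"):
        cmd = ["gsutil", "cp", src, dst]
    elif ".blob.core.windows.net" in url:
        cmd = ["azcopy", "copy", src, dst]
    else:
        raise ValueError(f"No known transfer tool for {url}")
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Example (hypothetical bucket and file names):
    #   cloud_copy.py s3://icecube-inputs/sim_00042.i3.zst ./input.i3.zst
    cloud_copy(sys.argv[1], sys.argv[2])
```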


Science with 51,000 GPUs
achieved as peak performance
17
(Plot: GPUs in use vs. time in minutes; each color is a different cloud region in the US, EU, or Asia. Total of 28 regions in use.)
Peaked at 51,500 GPUs
~380 Petaflops of fp32
8 generations of NVIDIA GPUs used.
Summary of stats at peak
A Heterogeneous Resource Pool
18
28 cloud Regions across 4 world regions
providing us with 8 GPU generations.
No one region or GPU type dominates!
Science Produced
19
The distributed High-Throughput Computing (dHTC) paradigm, implemented via HTCondor, provides global resource aggregation.
The largest cloud region provided 10.8% of the total.
The dHTC paradigm can aggregate on-prem resources anywhere, HPC at any scale, and multiple clouds.
Performance and Cost


Performance vs GPU type
21
42% of the science was done on V100 in 19% of the wall time.
Second Cloud Burst focused on
maximizing science/$$$
The 2nd burst was an 8-hour work day in the Pacific time zone on a random Tuesday in February.
Do a burst that we could repeat anytime,
with any dHTC application.
A Day of Cloud Use
23
Integrated one EFLOP32 hour, with a plateau of 170 PFLOP32s.
Total bill: ~$60k, including networking and storage.
We did a 2nd run on February 4th 2020 to focus on a cost-effective 8-hour work day.
We picked a “random” Tuesday during peak working hours (Pacific).
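A quick back-of-envelope check on these headline numbers, using only the figures quoted on this slide: sustaining the 170 PFLOP32 plateau for roughly 6 hours integrates one EFLOP32 hour, and a ~$60k bill then corresponds to about $60 per PFLOP32 hour.

```python
# Back-of-envelope check using only the numbers quoted on this slide.
integrated_pflop32_hours = 1000.0  # "one EFLOP32 hour" = 1000 PFLOP32 hours
plateau_pflop32 = 170.0            # sustained fp32 throughput at the plateau
total_bill_usd = 60_000.0          # ~$60k including networking and storage

hours_at_plateau = integrated_pflop32_hours / plateau_pflop32
cost_per_pflop32_hour = total_bill_usd / integrated_pflop32_hours

print(f"~{hours_at_plateau:.1f} hours at the plateau to integrate one EFLOP32 hour")
print(f"~${cost_per_pflop32_hour:.0f} per PFLOP32 hour")
```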
Cost to support cloud as a
“24x7” capability
• February 2020: roughly $60k per ExaFLOP32 hour
• This burst was executed by 2 people
- Igor Sfiligoi (SDSC) to support the infrastructure.
- David Schultz (UW Madison) to create and submit the
IceCube workflows.
§ A “David”-type person is also needed for on-prem science workflows.
• To make this a routine operations capability for any
open science that is dHTC capable would require
another 50% FTE “Cloud Budget Manager”.
- There is substantial effort involved in just dealing with cost &
budgets for a large community of scientists.
24

To provide an aggregate ExaFLOP32-hour-per-day dHTC production capability in the commercial cloud for the sum of many sciences today would require:
1.5 FTE of human effort
$60k of cloud costs per day
This does not include the human effort to train the community, define the workflows, and run the workflows, … i.e. it does not include what the scientists themselves still have to do.
3rd Cloud Burst
Buy enough GPUs to saturate the 100 Gbps network to UW Madison, with overflow to UCSD.
Do EGRESS entirely via Internet2 Cloud Connect for AWS, Azure, and Google.
The scale of the GPU burst peaked at 60% of the second cloud burst.
We used a smaller set of regions, but still all 3 providers.
Egress data intensive in nature
• Cloud burst ~saturated the 100 Gbps link
- To make good use of a large fraction of available Cloud GPUs
• IceCube simulations are relatively heavy in egress data
- 2 GB per job
- Job length ~= 0.5 hour
• And very spiky
- The whole file is transferred after compute completes
• Input sizes small-ish
- 0.25 GB
Peaked at 90.3 Gbps at UW Madison, plus an additional 10-20 Gbps at UCSD.
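A rough sketch of why this saturates the link, using only the per-job numbers above (an estimate, not a measured figure from the run): each running job averages under 10 Mbps of egress, so filling a 100 Gbps link takes on the order of ten thousand concurrent GPU jobs, and because each 2 GB file is pushed only after the compute finishes, the instantaneous load is far spikier than that average.

```python
# Back-of-envelope: how many concurrent GPU jobs does ~100 Gbps of egress imply?
# Uses only the per-job numbers quoted above; an estimate, not a measurement.
output_gb_per_job = 2.0    # GB of egress per job
job_length_hours = 0.5     # roughly half-hour jobs

# Average egress rate of one running job, in Gbps (2 GB over ~1800 s).
per_job_gbps = output_gb_per_job * 8 / (job_length_hours * 3600)

link_gbps = 100.0
jobs_to_saturate = link_gbps / per_job_gbps

print(f"~{per_job_gbps * 1000:.1f} Mbps average egress per running job")
print(f"~{jobs_to_saturate:,.0f} concurrent jobs to fill a {link_gbps:.0f} Gbps link")
```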
Using Internet2 Cloud Connect Service
• Egress costs notoriously high
• Buying dedicated links is
cheaper
- If provisioned on demand
• Internet2 acts as provider for
the research community
- For AWS, Azure and GCP
• No 100Gbps links available
- Had to stitch together 21 links,
at 10Gbps, 5Gbps and 2 Gbps
(Plot: each color band belongs to one network link; 130 TB transferred in 5 hours.)
https://internet2.edu/services/cloud-connect/

Struggled with spiky workload during trial run
• Attempted burst in trial run led to “oscillatory” network use.
• Noticed links to different providers behave differently
- Some capped, some flexible
- Long upload times when congested => waste of money
(Plots: trial-run network use for AWS and Azure, at up to ~50 Gbps.)
Slow & careful during big “burst”
• Ramp up for over 2 hours
• Still not perfect
- But much smoother
(Plots: the 2-hour ramp in GB/sec across the 21 provisioned network links; IO across individual links was quite chaotic.)
Started slow to randomize job end times, and thus network transfers.
And yet, the individual link utilization is still quite spiky.
Screenshot of provisioned links
Bought:
10 links @ 5Gbps
5 links @ 2Gbps
6 links @ 10Gbps
Our ability to use a link depended on the
availability of GPUs in the corresponding region.
A bit of a Tetris problem.
Very different provisioning in the 3 Clouds
• AWS the most complex
- And requires initiation by on-prem network engineer
• Many steps after initial request
- Accept connection request
- Create VPG
- Associate VPG with VPC
- Create DCG
- Create VIF
§ Relay back to on-prem the BGP key
- Establish VPC -> VPG routing
- Associate DCG -> VPG
• And don’t forget the Internet routers, if you use dedicated VPCs
• GCP the simplest
- Create Cloud Router
- Create Interconnect
§ Provide key to on-prem
• Azure not much harder
- Make sure the VPC has a Gateway subnet
- Create ExpressRoute (ER)
§ Provide key to on-prem
- Create VNG
- Create connection between ER and VNG
- But Azure comes with many more options to choose from
This was the hardest of our 3 cloud bursts because it required a lot of coordination, and had too many parts without automation.
(Tetris problem of GPU availability, job end time, link bandwidth)

Additional on-prem
networking setup needed
• Quote from Michael Hare, UW Madison
Network engineer:
In addition to network configuration [at] UW Madison (AS59), we provisioned BGP
based Layer 3 MPLS VPNs (L3VPNs) towards Internet2 via our regional aggregator,
BTAA OmniPop.
This work involved reaching out to the BTAA NOC to coordinate on VLAN numbers
and paths and to [the] Internet2 NOC to make sure the newly provisioned VLANs
were configurable inside OESS.
Due to limitations in programmability or knowledge at the time regarding duplicate
IP address towards the cloud (GCP, Azure, AWS) endpoints, we built several discrete
L3VPNs inside the Internet2 network to accomplish the desired topology.
Not something domain scientists can expect to accomplish.
Applicability beyond IceCube
• All the large instruments we know of
- LHC, LIGO, DUNE, LSST, …
• Any midscale instrument we can think of
- XENON, GlueX, Clas12, Nova, DES, Cryo-EM, …
• A large fraction of Deep Learning
- But not all of it …
• Basically, anything that has bundles of independently schedulable jobs that can be partitioned to adjust workloads to have 0.5- to few-hour runtimes on modern GPUs.
34
IceCube is ready for Exascale
• Humanity has built extraordinary instruments by pooling
human and financial resources globally.
• The computing for these large collaborations fits perfectly into the cloud, or into scheduling holes in Exascale HPC systems, due to its “ingeniously parallel” nature. => dHTC
• The dHTC computing paradigm applies to a wide range of
problems across all of open science.
- We are happy to repeat this with anybody willing to spend $50k in the
clouds.
35
Contact us at: support@opensciencegrid.org
Or me personally at: fkw@ucsd.edu
Demonstrated elastic burst at 51,500 GPUs
IceCube is ready for Exascale
Acknowledgements
• This work was partially supported by NSF grants OAC-1941481, MPS-1148698, OAC-1841530, OAC-1904444, OAC-1826967, and OPP-1600823.
36

 
CONSOLSCI8_Lesson1. presentation for NLC
CONSOLSCI8_Lesson1. presentation for NLCCONSOLSCI8_Lesson1. presentation for NLC
CONSOLSCI8_Lesson1. presentation for NLC
 
Summer program introduction in Yunnan university
Summer program introduction in Yunnan universitySummer program introduction in Yunnan university
Summer program introduction in Yunnan university
 
Properties of virus(Ultrastructure and types of virus)
Properties of virus(Ultrastructure and types of virus)Properties of virus(Ultrastructure and types of virus)
Properties of virus(Ultrastructure and types of virus)
 
MACRAMÉ-ChiPs: Patchwork Project Family & Sibling Projects (24th Meeting of t...
MACRAMÉ-ChiPs: Patchwork Project Family & Sibling Projects (24th Meeting of t...MACRAMÉ-ChiPs: Patchwork Project Family & Sibling Projects (24th Meeting of t...
MACRAMÉ-ChiPs: Patchwork Project Family & Sibling Projects (24th Meeting of t...
 
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
 
1,1 and 1,2 Migratory insertion reactions.pptx
1,1 and 1,2 Migratory insertion reactions.pptx1,1 and 1,2 Migratory insertion reactions.pptx
1,1 and 1,2 Migratory insertion reactions.pptx
 
El Nuevo Cohete Ariane de la Agencia Espacial Europea-6_Media-Kit_english.pdf
El Nuevo Cohete Ariane de la Agencia Espacial Europea-6_Media-Kit_english.pdfEl Nuevo Cohete Ariane de la Agencia Espacial Europea-6_Media-Kit_english.pdf
El Nuevo Cohete Ariane de la Agencia Espacial Europea-6_Media-Kit_english.pdf
 
Electrostatic force class 8 ncert. .pptx
Electrostatic force class 8 ncert. .pptxElectrostatic force class 8 ncert. .pptx
Electrostatic force class 8 ncert. .pptx
 
lipids_233455668899076544553879848657.pptx
lipids_233455668899076544553879848657.pptxlipids_233455668899076544553879848657.pptx
lipids_233455668899076544553879848657.pptx
 
Electrostatic force class 8 physics .pdf
Electrostatic force class 8 physics .pdfElectrostatic force class 8 physics .pdf
Electrostatic force class 8 physics .pdf
 
ScieNCE grade 08 Lesson 1 and 2 NLC.pptx
ScieNCE grade 08 Lesson 1 and 2 NLC.pptxScieNCE grade 08 Lesson 1 and 2 NLC.pptx
ScieNCE grade 08 Lesson 1 and 2 NLC.pptx
 
Dalghren, Thorne and Stebbins System of Classification of Angiosperms
Dalghren, Thorne and Stebbins System of Classification of AngiospermsDalghren, Thorne and Stebbins System of Classification of Angiosperms
Dalghren, Thorne and Stebbins System of Classification of Angiosperms
 
GIT hormones- II_12345677809876543235780963.pptx
GIT hormones- II_12345677809876543235780963.pptxGIT hormones- II_12345677809876543235780963.pptx
GIT hormones- II_12345677809876543235780963.pptx
 

Near Exascale Computing in the Cloud

  • 1. Near Exascale Computing in the Cloud: the use of GPU bursts for Multi-Messenger Astrophysics with IceCube Data Frank Würthwein OSG Executive Director UCSD/SDSC
  • 2. Jensen Huang keynote @ SC19 2 The Largest Cloud Simulation in History 50k NVIDIA GPUs in the Cloud 350 Petaflops for 2 hours Distributed across US, Europe & Asia Saturday morning before SC19 we bought all GPU capacity that was for sale worldwide across AWS, Azure, and Google
  • 3. A Story of 3 Cloud Bursts • Saturday before SC19: - Buy the entire GPU capacity worldwide that is for sale in AWS, Azure, and Google for a couple of hours. - Proof of principle & measurement of global GPU capacity • February 4th 2020: - Buy a workday’s worth of GPU capacity of only the most cost effective GPUs for our application. - Establish standard operations & cost • November 4th 2020: - Repeat without any storage in the cloud. All data input and output via network. EGRESS via cloud connect to minimize charges. - Establish on-prem to cloud networking and cloud connect routing 3 We will discuss this story from beginning to end.
  • 5. IceCube: A cubic kilometer of ice at the South Pole is instrumented with 5,160 optical sensors. Astrophysics: • Discovery of astrophysical neutrinos • First evidence of a neutrino point source (TXS) • Cosmic rays with the surface detector. Particle Physics: • Atmospheric neutrino oscillation • Neutrino cross sections at the TeV scale • New-physics searches at the highest energies. Earth Science: • Glaciology • Earth tomography. A facility with very diverse science goals; this talk is restricted to high-energy astrophysics.
  • 6. High Energy Astrophysics Science Case for IceCube: The universe is opaque to light at the highest energies and distances. Only gravitational waves and neutrinos can pinpoint the most violent events in the universe. Fortunately, the highest-energy neutrinos are of cosmic origin, and are effectively "background free" as long as the energy is measured correctly.
  • 7. High-energy neutrinos from outside the solar system: the first 28 very-high-energy neutrinos from outside the solar system. The red curve is the photon flux spectrum measured with the Fermi satellite; the black points show the corresponding high-energy neutrino flux spectrum measured by IceCube. This demonstrates both the opaqueness of the universe to high-energy photons and the ability of IceCube to detect neutrinos above the maximum energy at which we can see light due to this opaqueness. Science 342 (2013). DOI: 10.1126/science.1242856
  • 8. Understanding the Origin: We now know high-energy events happen in the universe. What are they? The hypothesis: the same cosmic events produce neutrinos and photons, via photo-hadronic interactions such as p + γ → Δ⁺ → p + π⁰ (with π⁰ → γ + γ) and p + γ → Δ⁺ → n + π⁺ (with π⁺ → μ⁺ + ν_μ) [figure credit: Aya Ishihara]. We detect the electrons or muons from neutrinos that interact in the ice. Neutrinos interact very weakly, so a very large volume of instrumented ice is needed to maximize the chance that a cosmic neutrino interacts inside the detector, along with pointing accuracy to point back to the origin of the neutrino. Telescopes the world over then try to identify the source in the direction IceCube is pointing to for that neutrino. Multi-messenger Astrophysics.
  • 9. The ν detection challenge [figure on the optical properties of the ice, credit: Aya Ishihara]: Ice properties change with depth and wavelength. The observed pointing resolution at high energies is systematics limited, and the central value moves for different ice models. Improved e and τ reconstruction ⇒ increased neutrino flux detection ⇒ more observations. Photon propagation through ice runs efficiently on single-precision GPUs. Detailed simulation campaigns improve the pointing resolution by improving the ice model; reconstruction improves with a better model of the ice near the detectors.
  • 10. First evidence of an origin [Figure: event display for neutrino event IceCube-170922A, side and top views; the time at which a DOM observed a signal is reflected in the color of the hit, from dark blue for the earliest hits to yellow for the latest]: First location of a source of very-high-energy neutrinos. The neutrino produced a high-energy muon near IceCube; the muon produced light as it traversed the IceCube volume, and that light was detected by IceCube's array of phototubes. IceCube alerted the astronomy community to the observation of a single high-energy neutrino on September 22, 2017. A blazar designated by astronomers as TXS 0506+056 was subsequently identified as the most likely source in the direction IceCube was pointing, and multiple telescopes saw light from TXS at the same time IceCube saw the neutrino. Science 361, 147-151 (2018). DOI: 10.1126/science.aat2890
  • 11. IceCube's Future Plans [Figure: preliminary timeline of the IceCube-Gen2 facility through ~2032, covering MeV- to EeV-scale physics with the surface array, high-energy array, radio array, PINGU, and the IceCube Upgrade; from "IceCube Upgrade and Gen2", Summer Blot, TeVPA 2018]: Near term: add more phototubes to the deep core to increase the granularity of measurements. Longer term: • Extend the instrumented volume at smaller granularity. • Extend the even smaller granularity deep-core volume. • Add a surface array. Improve the detector for both low- and high-energy neutrinos.
  • 12. Details on the Cloud Burst(s)
  • 13. The Idea • Integrate all GPUs available for sale worldwide into a single HTCondor pool. - Use 28 regions across 3 cloud providers for a burst of a couple of hours or so. • IceCube submits its photon propagation workflow to this HTCondor pool. - We handle the input, the jobs on the GPUs, and the output as a single globally distributed system. Run a GPU burst relevant in scale for future Exascale HPC systems (see the sketch below for how such a job might be described).
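For illustration, a minimal sketch of how one such GPU job could be described and queued with the HTCondor Python bindings (assuming a recent bindings version). The executable name, input file name, and resource requests below are hypothetical placeholders, not IceCube's actual workflow.

import htcondor  # HTCondor Python bindings

# Describe one photon-propagation-style GPU job; all values are placeholders.
job = htcondor.Submit({
    "executable": "propagate_photons.sh",    # hypothetical wrapper around the GPU simulation
    "arguments": "input_block_000.i3.zst",   # hypothetical input file
    "request_gpus": "1",
    "request_cpus": "1",
    "request_memory": "4GB",
    "transfer_input_files": "input_block_000.i3.zst",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "job_$(Cluster)_$(Process).out",
    "error": "job_$(Cluster)_$(Process).err",
    "log": "burst.log",
})

schedd = htcondor.Schedd()            # one of the ~10 schedds used during the burst
result = schedd.submit(job, count=1)  # in practice, one queued job per input file
print("submitted cluster", result.cluster())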
  • 14. A global HTCondor pool • IceCube, like all OSG user communities, relies on HTCondor for resource orchestration - This demo used the standard tools • Dedicated HW setup - Avoid disruption of the OSG production system - Optimize the HTCondor setup for the spiky nature of the demo: multiple schedds for IceCube to submit to; collectors gathering resources in each cloud region, then feeding all regions into the global pool
  • 15. HTCondor Distributed CI [Architecture diagram: IceCube submits to 10 schedds; per-region collectors report cloud VMs up to a central collector and negotiator, forming one global resource pool]
  • 16. Using native Cloud storage • Input data pre-staged into native Cloud storage - Each file in one to a few Cloud regions, with some replication to deal with the limited predictability of resources per region - Local to compute for large regions, for maximum throughput - Reading from a "close" region for smaller ones, to minimize operations • Output staged back to region-local Cloud storage • Deployed simple wrappers around Cloud-native file transfer tools, as sketched below - IceCube jobs do not need to be customized for different Clouds - They just need to know where input data is available (a pretty standard OSG operation mode)
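A minimal sketch of what such a transfer wrapper could look like, assuming the provider command-line tools (aws, gsutil, azcopy) are available in the job environment; the function name and URL handling are illustrative, not the actual IceCube tooling.

import subprocess

def fetch(url: str, dest: str) -> None:
    """Copy one object from cloud storage to a local path by dispatching
    on the URL scheme to the corresponding provider's own CLI."""
    if url.startswith("s3://"):
        cmd = ["aws", "s3", "cp", url, dest]      # AWS S3
    elif url.startswith("gs://"):
        cmd = ["gsutil", "cp", url, dest]         # Google Cloud Storage
    elif ".blob.core.windows.net/" in url:
        cmd = ["azcopy", "copy", url, dest]       # Azure Blob Storage
    else:
        raise ValueError("unsupported storage URL: " + url)
    subprocess.run(cmd, check=True)

# The job only needs to know the URL of its input, not which cloud it runs in, e.g.:
# fetch("gs://example-bucket/input_block_000.i3.zst", "input_block_000.i3.zst")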
  • 17. Science with 51,000 GPUs achieved as peak performance [Plot: GPUs in use vs. time in minutes; each color is a different cloud region in the US, EU, or Asia]. Total of 28 regions in use. Peaked at 51,500 GPUs, ~380 petaflops of fp32. 8 generations of NVIDIA GPUs used. Summary of stats at peak.
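As a rough cross-check of those peak numbers (a back-of-envelope, not stated on the slide): 380 PFLOP32 spread over 51,500 GPUs averages to about 7.4 TFLOP32 per GPU, which is plausible for a mix of eight NVIDIA generations whose individual fp32 peaks range from a few TFLOPS for the oldest cards to roughly 15 TFLOPS for a V100.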
  • 18. A Heterogeneous Resource Pool: 28 cloud regions across 4 world regions, providing us with 8 GPU generations. No one region or GPU type dominates!
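One way such a per-region, per-GPU-type breakdown can be pulled from a running pool is to query the central collector for machine ads, for example with the HTCondor Python bindings. The attribute names below ("CloudRegion" and "GPUs_DeviceName") are assumptions for illustration; in practice they depend on how the cloud VM ads were configured.

from collections import Counter
import htcondor

collector = htcondor.Collector()  # top-level collector of the global pool
ads = collector.query(
    htcondor.AdTypes.Startd,
    constraint="TotalGPUs > 0",                     # assumes GPU slots advertise TotalGPUs
    projection=["CloudRegion", "GPUs_DeviceName"],  # assumed attribute names
)

counts = Counter(
    (ad.get("CloudRegion", "unknown"), ad.get("GPUs_DeviceName", "unknown"))
    for ad in ads
)
for (region, gpu), n in counts.most_common():
    print(region, gpu, n)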
  • 19. Science Produced: The distributed High-Throughput Computing (dHTC) paradigm, implemented via HTCondor, provides global resource aggregation; the largest cloud region provided 10.8% of the total. The dHTC paradigm can aggregate on-prem resources anywhere, HPC at any scale, and multiple clouds.
  • 21. Performance vs. GPU type: 42% of the science was done on V100s in 19% of the wall time.
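Put differently (a back-of-envelope, not stated on the slide): per unit of wall time the V100s delivered roughly 0.42 / 0.19 ≈ 2.2 times the science of the pool-wide average, the kind of per-GPU-type comparison that feeds into the cost-effectiveness choices of the second burst.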
  • 22. Second Cloud Burst focused on maximizing science/$$$: The 2nd burst was an 8-hour work day in the Pacific time zone on a random Tuesday in February. The goal: a burst that we could repeat anytime, with any dHTC application.
  • 23. A Day of Cloud Use: Integrated one EFLOP32-hour, with a 170 PFLOP32 plateau. Total bill: ~$60k, including networking and storage. We did this 2nd run on February 4th, 2020 to focus on a cost-effective 8-hour work day, picking a "random" Tuesday during peak working hours (Pacific).
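As a sanity check on those two numbers (a back-of-envelope, not from the slide): a 170 PFLOP32 plateau sustained for about 6 hours integrates to 170 × 6 ≈ 1,000 PFLOP32-hours, i.e. roughly the quoted one EFLOP32-hour, consistent with an 8-hour work day that includes ramp-up and ramp-down.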
  • 24. Cost to support cloud as a "24x7" capability • February 2020: roughly $60k per ExaFLOP32-hour • This burst was executed by 2 people - Igor Sfiligoi (SDSC) to support the infrastructure. - David Schultz (UW Madison) to create and submit the IceCube workflows; a "David"-type person is needed for on-prem science workflows as well. • To make this a routine operations capability for any open science that is dHTC capable would require another 50% FTE "Cloud Budget Manager". - There is substantial effort involved in just dealing with cost & budgets for a large community of scientists.
  • 25. To provide an aggregate ExaFLOP32-hour-per-day dHTC production capability in the commercial cloud for the sum of many sciences today would require: 1.5 FTE of human effort and $60k of cloud costs per day. This does not include the human effort to train the community, define the workflows, run the workflows, … i.e. it does not include what the scientists themselves still have to do.
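Extrapolated naively to a full year (a back-of-envelope, not a figure from the talk), $60k per day corresponds to roughly 365 × $60k ≈ $22M of annual cloud spend, on top of the 1.5 FTE of operations effort.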
  • 26. 3rd Cloud Burst: Buy enough GPUs to saturate the 100 Gbps network link to UW Madison, with overflow to UCSD. Do egress entirely via Internet2 Cloud Connect for AWS, Azure, and Google. The scale of the GPU burst peaked at 60% of the second cloud burst, using a smaller set of regions but still all 3 providers.
  • 27. Egress data intensive in nature • The cloud burst roughly saturated the 100 Gbps link - needed to make good use of a large fraction of the available Cloud GPUs • IceCube simulations are relatively heavy in egress data - 2 GB per job - job length ~0.5 hour • And very spiky - the whole file is transferred only after the compute completes • Input sizes are small-ish - 0.25 GB. Peaked at 90.3 Gbps at UW Madison plus an additional 10-20 Gbps at UCSD.
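A back-of-envelope estimate (not stated on the slide) shows why this burst had to be smaller than the second one: 2 GB of output every ~0.5 hour averages to roughly 9 Mbit/s of egress per running GPU, so a ~100 Gbps link can sustain on the order of 10,000 concurrently running jobs on average, and the spiky end-of-job transfers require additional headroom on top of that.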
  • 28. Using the Internet2 Cloud Connect Service • Egress costs are notoriously high • Buying dedicated links is cheaper - if provisioned on demand • Internet2 acts as the provider for the research community - for AWS, Azure and GCP • No 100 Gbps links were available - had to stitch together 21 links, at 10 Gbps, 5 Gbps and 2 Gbps [Plot: link utilization over time; each color band belongs to one network link; 130 TB transferred in 5 hours]. https://internet2.edu/services/cloud-connect/
  • 29. Struggled with the spiky workload during the trial run • The attempted burst in the trial run led to "oscillatory" network use. • Noticed that links to different providers behave differently - some capped, some flexible - long upload times when congested => a waste of money [Plots: network use for AWS and Azure, on a ~50 Gbps scale].
  • 30. Slow & careful during the big "burst" • Ramped up for over 2 hours • Still not perfect - but much smoother [Plots: aggregate throughput in GB/sec showing the 2-hour ramp, and I/O across the 21 provisioned network links, which remains quite chaotic]. We started slowly to randomize job end times, and thus network transfers (one way to implement such a staggered ramp is sketched below); and yet the individual link utilization is still quite spiky.
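A minimal sketch of such a staggered ramp, under the assumption that it was implemented simply by spacing out submissions (the slides do not say exactly how it was done): submitting the workload in small batches spread over ~2 hours decorrelates job end times and therefore the end-of-job output transfers.

import random
import time

RAMP_SECONDS = 2 * 3600   # spread submissions over ~2 hours
N_BATCHES = 48            # hypothetical number of batches
JOBS_PER_BATCH = 500      # hypothetical batch size

def submit_batch(n_jobs: int) -> None:
    # Placeholder: in practice this would call condor_submit or the Python bindings.
    print("submitting", n_jobs, "jobs")

for _ in range(N_BATCHES):
    submit_batch(JOBS_PER_BATCH)
    # Jitter the spacing slightly so batch boundaries do not line up either.
    time.sleep(RAMP_SECONDS / N_BATCHES * random.uniform(0.8, 1.2))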
  • 31. Screenshot of provisioned links. Bought: 10 links @ 5 Gbps, 5 links @ 2 Gbps, 6 links @ 10 Gbps. Our ability to use a link depended on the availability of GPUs in the corresponding region; a bit of a Tetris problem.
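For context (a back-of-envelope from the numbers above, not stated on the slides): those 21 links add up to 10 × 5 + 5 × 2 + 6 × 10 = 120 Gbps of provisioned capacity, while 130 TB moved in 5 hours corresponds to an average of about 58 Gbps, roughly half of the nominal aggregate; the Tetris problem of matching GPUs to regions and the spiky per-link utilization account for the difference.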
  • 32. Very different provisioning in the 3 Clouds • AWS is the most complex - and requires initiation by an on-prem network engineer • Many steps after the initial request - Accept connection request - Create VPG - Associate VPG with VPC - Create DCG - Create VIF (relay the BGP key back to on-prem) - Establish VPC -> VPG routing - Associate DCG -> VPG • And don't forget the Internet routers, if you use dedicated VPCs • GCP is the simplest - Create Cloud Router - Create Interconnect (provide key to on-prem) • Azure is not much harder - Make sure the VPC has a Gateway subnet - Create ExpressRoute (ER) (provide key to on-prem) - Create VNG - Create connection between ER and VNG - but Azure comes with many more options to choose from. This was the hardest of our 3 cloud bursts because it required a lot of coordination and had too many parts without automation. (Tetris problem of GPU availability, job end time, link bandwidth)
  • 33. Additional on-prem networking setup needed • Quote from Michael Hare, UW Madison Network engineer: In addition to network configuration [at] UW Madison (AS59), we provisioned BGP based Layer 3 MPLS VPNs (L3VPNs) towards Internet2 via our regional aggregator, BTAA OmniPop. This work involved reaching out to the BTAA NOC to coordinate on VLAN numbers and paths and to [the] Internet2 NOC to make sure the newly provisioned VLANs were configurable inside OESS. Due to limitations in programmability or knowledge at the time regarding duplicate IP address towards the cloud (GCP, Azure, AWS) endpoints, we built several discrete L3VPNs inside the Internet2 network to accomplish the desired topology. Not something domain scientists can expect to accomplish.
  • 34. Applicability beyond IceCube • All the large instruments we know of - LHC, LIGO, DUNE, LSST, … • Any midscale instrument we can think of - XENON, GlueX, CLAS12, NOvA, DES, Cryo-EM, … • A large fraction of Deep Learning - but not all of it … • Basically, anything that has bundles of independently schedulable jobs that can be partitioned to adjust workloads to have 0.5- to few-hour runtimes on modern GPUs.
  • 35. IceCube is ready for Exascale • Humanity has built extraordinary instruments by pooling human and financial resources globally. • The computing for these large collaborations fits perfectly into the cloud, or into scheduling holes in Exascale HPC systems, due to its "ingeniously parallel" nature => dHTC. • The dHTC computing paradigm applies to a wide range of problems across all of open science. - We are happy to repeat this with anybody willing to spend $50k in the clouds. Contact us at: support@opensciencegrid.org or me personally at: fkw@ucsd.edu. Demonstrated elastic burst at 51,500 GPUs.
  • 36. Acknowledgements • This work was partially supported by the NSF grants OAC-1941481, MPS-1148698, OAC-1841530, OAC-1904444, OAC-1826967, and OPP-1600823.