The document summarizes Dr. Larry Smarr's presentation on the Pacific Research Platform (PRP) and its role in working toward a National Research Platform. It describes how UCSD has been building toward the PRP for over 15 years and how the PRP now connects research teams and devices across multiple campuses. It also details PRP innovations such as Flash I/O Network Appliances (FIONAs) and the use of Kubernetes to manage distributed resources. Finally, it outlines opportunities to further integrate the PRP with the Open Science Grid and to expand the platform internationally through partnerships.
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting Stem Cell Research (Larry Smarr)
11.05.13
Invited Presentation
Sanford Consortium for Regenerative Medicine
Salk Institute, La Jolla
Larry Smarr, Calit2 & Phil Papadopoulos, SDSC/Calit2
Title: High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting Stem Cell Research
Distributed Cyberinfrastructure to Support Big Data Machine Learning (Larry Smarr)
Panel on the Future of Machine Learning
California Institute for Telecommunications and Information Technology
University of California, Irvine
May 24, 2018
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research (Larry Smarr)
11.12.12
Seminar Presentation
Princeton Institute for Computational Science and Engineering (PICSciE)
Princeton University
Title: A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research
Princeton, NJ
STAC, ZARR, COG, K8S and Data Cubes: The brave new world of satellite EO anal... (GEO Analytics Canada)
The document discusses new technologies and approaches for analyzing satellite earth observation (EO) data in the cloud, including file formats like COG and ZARR that optimize data access, metadata standards like STAC for discovery, and platforms like Kubernetes and data cubes that enable scalable analytics. It argues that traditional approaches are now obsolete, and that Canada should embrace these new cloud native techniques to become a leader in using satellite data to improve society, as the country's space agency president advocates.
A talk at NASA Goddard, February 27, 2013
Large and diverse data result in challenging data management problems that researchers and facilities are often ill-equipped to handle. I propose a new approach to these problems based on the outsourcing of research data management tasks to software-as-a-service providers. I argue that this approach can both achieve significant economies of scale and accelerate discovery by allowing researchers to focus on research rather than mundane information technology tasks. I present early results with the approach in the context of Globus Online
This document discusses a large-scale GPU-based cloud burst simulation run by the IceCube collaboration to calibrate simulations of natural ice. The simulation was data-intensive, producing over 130 TB of data and exceeding 10 Gbps of egress bandwidth. Internet2 Cloud Connect service was used to provision over 20 dedicated network links between collaborators' institutions and cloud providers to enable high-throughput data transfer at a lower cost than commercial routes. Careful planning was required to smoothly ramp up the burst and avoid overloading individual network links.
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with IceCube (Igor Sfiligoi)
NRP Engagement webinar: description of the 380 PFLOP32s, 51k-GPU multi-cloud burst using HTCondor to run the IceCube photon propagation simulation.
Presented January 27th, 2020.
Burst data retrieval after 50k GPU Cloud run (Igor Sfiligoi)
We ran a 50k GPU multi-cloud simulation to support the IceCube science. This talk provided an overview of what happened to the associated data.
Presented at the Internet2 booth at SC19.
"Building and running the cloud GPU vacuum cleaner"Frank Wuerthwein
This talk, describing the "Largest Cloud Simulation in History" (Jensen Huang at SC19), was given at the MAGIC meeting on Dec. 4th 2019. MAGIC stands for "Middleware and Grid Interagency Cooperation", and is a group within NITRD. Current federal agencies that are members of MAGIC include DOC, DOD, DOE, HHS, NASA, and NSF.
This document discusses using cloud computing for bioinformatics. It begins by defining cloud computing and describing its key characteristics like on-demand access to computing resources and rapid elasticity. It then discusses different cloud delivery models like Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The document provides examples of public cloud providers for each delivery model. It also introduces tools like CloudBridge that help make applications cloud-independent and CloudLaunch, a portal for deploying cloud-enabled bioinformatics applications. Finally, it briefly discusses how these tools and cloud resources can help improve bioinformatics workflows by providing scalable infrastructure for processing large genomic datasets.
This document introduces SkyhookDM, a system that offloads computation from clients to storage nodes. It does this by embedding Apache Arrow data access libraries inside Ceph object storage devices (OSDs). This allows large Parquet files to be scanned and processed directly on the OSDs without needing to move all the data to clients. Experiments show SkyhookDM reduces latency, CPU usage, and network traffic compared to traditional approaches. It has also been integrated with the Coffea analysis framework. Ongoing work involves optimizing Arrow serialization for network transfers.
The Pacific Research Platform (PRP) aims to achieve transparent and rapid data access among collaborating scientists at multiple institutions through an integrated implementation of data-focused networking that extends the university campus Science DMZ model to a regional, national, and, eventually, a global scale.
PRP researchers are routinely achieving high-performance end-to-end networking from their labs to their collaborators’ labs and data centers, traversing multiple, heterogeneous Science DMZs and wide-area networks connecting multiple campus gateways, enabling researchers across the partnership to transfer data over dedicated optical lightpaths at speeds from 10Gb/s to 100Gb/s.
How HPC and large-scale data analytics are transforming experimental science (inside-BigData.com)
In this deck from DataTech19, Debbie Bard from NERSC presents: Supercomputing and the scientist: How HPC and large-scale data analytics are transforming experimental science.
"Debbie Bard leads the Data Science Engagement Group NERSC. NERSC is the mission supercomputing center for the USA Department of Energy, and supports over 7000 scientists and 700 projects with supercomputing needs. A native of the UK, her career spans research in particle physics, cosmology and computing on both sides of the Atlantic. She obtained her PhD at Edinburgh University, and has worked at Imperial College London as well as the Stanford Linear Accelerator Center (SLAC) in the USA, before joining the Data Department at NERSC, where she focuses on data-intensive computing and research, including supercomputing for experimental science and machine learning at scale."
Watch the video: https://wp.me/p3RLHQ-kLV
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
A California-Wide Cyberinfrastructure for Data-Intensive Research (Larry Smarr)
The document discusses creating a California-wide cyberinfrastructure for data-intensive research. It outlines efforts to connect all UC campuses and other research institutions across California with high-speed optical networks. This would create a "big data plane" to share large datasets. Several campuses have received NSF grants to upgrade their networks and implement Science DMZ architectures with 10-100Gbps connections to CENIC. Connecting these resources would provide researchers access to high-performance computing, large scientific instruments, and datasets. This would support collaborative big data science across disciplines like physics, climate modeling, genomics and microscopy.
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie... (Igor Sfiligoi)
Presented at PEARC20.
This talk presents the expansion of IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major cloud providers: Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, integrating over one EFLOP32-hour's worth of science output for a price tag of about $60k. We provide the reasoning behind cloud instance selection, a description of the setup, an analysis of the provisioned resources, and a short description of the actual science output of the exercise.
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ... (Igor Sfiligoi)
- IceCube is a neutrino observatory that detects high-energy neutrinos from astrophysical sources to study violent cosmic events. It uses over 5000 optical sensors buried in Antarctic ice to detect neutrinos.
- A cloud burst was performed using over 50,000 GPUs across multiple cloud providers worldwide to simulate photon propagation through ice for IceCube data analysis. This was the largest cloud simulation to date and demonstrated the ability to burst to exascale-class capacity.
- The simulation helped improve IceCube's neutrino detection and pointing resolution to identify the first known source of high-energy neutrinos, a blazar, demonstrating IceCube's potential for multi-messenger astrophysics.
Overview of what has happened in HNSciCloud over the last five months, Delivered by Helge Meinhard of CERN at the HEPiX Workshop on October 21st, 2016, in Berkeley, California, USA.
Berkeley Cloud Computing Meetup, May 2020 (Larry Smarr)
The Pacific Research Platform (PRP) is a high-bandwidth global private "cloud" connected to commercial clouds that provides researchers with distributed computing resources. It links Science DMZs at universities across California and beyond using a high-performance network. The PRP utilizes Data Transfer Nodes called FIONAs to transfer data at near full network speeds. It has adopted Kubernetes to orchestrate software containers across its resources. The PRP provides petabytes of distributed storage and hundreds of GPUs for machine learning. It allows researchers to perform data-intensive science across multiple universities much faster than possible individually.
The document discusses the Pacific Research Platform (PRP), a distributed cyberinfrastructure that connects researchers and data across multiple campuses in California and beyond using optical fiber networking. Key points:
- The PRP uses high-speed networking infrastructure like the CENIC network to connect data generators and consumers across 15+ campuses, creating an integrated "big data freeway system".
- It deploys specialized data transfer nodes called FIONAs to enable high-speed transfer of large datasets between sites at near the full network speed.
- Recent additions include using Kubernetes to orchestrate containers across the PRP infrastructure and integrating machine learning resources through the CHASE-CI grant to support data-intensive AI applications.
Looking Back, Looking Forward: NSF CI Funding 1985-2025 (Larry Smarr)
This document provides an overview of the development of national research platforms (NRPs) from 1985 to the present, with a focus on the Pacific Research Platform (PRP). It describes the evolution of the PRP from early NSF-funded supercomputing centers to today's distributed cyberinfrastructure utilizing optical networking, containers, Kubernetes, and distributed storage. The PRP now connects over 15 universities across the US and internationally to enable data-intensive science and machine learning applications across multiple domains. Going forward, the document discusses plans to further integrate regional networks and partner with new NSF-funded initiatives to develop the next generation of NRPs through 2025.
Global Research Platforms: Past, Present, Future (Larry Smarr)
The Pacific Research Platform: a Science-Driven Big-Data Freeway System (Larry Smarr)
The Pacific Research Platform will create a regional "Big Data Freeway System" along the West Coast to support science. It will connect major research institutions with high-speed optical networks, allowing them to share vast amounts of data and computational resources. This will enable new forms of collaborative, data-intensive research for fields like particle physics, astronomy, biomedicine, and earth sciences. The first phase aims to establish a basic networked infrastructure, with later phases advancing capabilities to 100Gbps and beyond with security and distributed technologies.
Towards a High-Performance National Research Platform Enabling Digital Research (Larry Smarr)
The document summarizes Dr. Larry Smarr's keynote presentation on enabling a high-performance national research platform. It describes how multi-institutional research increasingly relies on access to large datasets, requiring new cyberinfrastructure. The Pacific Research Platform provides high-bandwidth networking between universities to support research collaborations across disciplines. The next steps involve scaling this model into a national and global platform. The presentation highlights how the PRP enables various scientific applications and drives innovation through improved data transfer capabilities and distributed computing resources.
The Pacific Research Platform (PRP) is a multi-institutional cyberinfrastructure project that connects researchers across California and beyond to share large datasets. It spans the 10 University of California campuses, major private research universities, supercomputer centers, and some out-of-state universities. Fifteen multi-campus research teams in fields like physics, astronomy, earth sciences, biomedicine, and multimedia will drive the technical needs of the PRP over five years. The goal is to create a "big data freeway" to allow high-speed sharing of data between research labs, supercomputers, and repositories across multiple networks without performance loss over long distances.
The Pacific Research Platform: Building a Distributed Big-Data Machine-Learni... (Larry Smarr)
The document summarizes the Pacific Research Platform (PRP) which connects researchers across multiple universities with high-speed networks and computing resources for big data and machine learning applications. Key points:
- PRP connects 15 universities with optical networks, distributed storage devices (FIONAs), and over 350 GPUs for data analysis and AI training.
- It allows researchers to rapidly share and analyze large datasets, with one example reducing a workflow from 19 days to 52 minutes.
- Other projects using PRP resources include climate modeling, astrophysics simulations, and machine learning courses involving thousands of students.
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
The Rise of Supernetwork Data Intensive Computing (Larry Smarr)
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
My Remembrances of Mike Norman Over The Last 45 Years (Larry Smarr)
Mike Norman has been a leader in computational astrophysics for over 45 years. Some of his influential work includes:
- Cosmic jet simulations in the early 1980s which helped explain phenomena from galactic centers.
- Pioneering the use of adaptive mesh refinement in the 1990s to achieve dynamic load balancing on supercomputers.
- Massive cosmology simulations in the late 2000s with over 100 trillion particles using thousands of processors across multiple supercomputing sites, producing petabytes of data.
- Developing end-to-end workflows in the 2000s to couple supercomputers, high-speed networks, and large visualization systems to enable real-time analysis of extremely large astrophysics simulations.
Metagenics: How Do I Quantify My Body and Try to Improve its Health? June 18, 2019 (Larry Smarr)
Larry Smarr discusses quantifying his body and health over time through extensive self-tracking. He measures various biomarkers through regular blood tests and analyzes his gut microbiome by sequencing stool samples. This revealed issues like chronic inflammation and an unhealthy microbiome. Smarr then took steps like a restricted eating window and increasing plant diversity in his diet, which reversed metabolic syndrome issues and correlated with shifts in his microbiome ecology. His goal is to continue precisely measuring factors like toxins, hormones, gut permeability and food/supplement impacts to further optimize his health.
Panel: Reaching More Minority Serving Institutions (Larry Smarr)
This document discusses engaging more minority serving institutions (MSIs) in cyberinfrastructure development through regional networks. It provides data showing the importance of MSIs like historically black colleges and universities (HBCUs) in educating underrepresented minority students in STEM fields. Regional networks can help equalize opportunities by assisting MSIs in overcoming barriers to resources through training, networking infrastructure support, and helping institutions obtain necessary staffing and funding. Strategies mentioned include collaborating with MSIs on grants and addressing issues identified in surveys like lack of vision for data use beyond compliance. The goal is to broaden participation in STEAM fields by leveraging the success MSIs have shown in supporting underrepresented students.
Global Network Advancement Group - Next Generation Network-Integrated Systems (Larry Smarr)
This document summarizes a presentation on global petascale to exascale workflows for data intensive sciences. It discusses a partnership convened by the GNA-G Data Intensive Sciences Working Group with the mission of meeting challenges faced by data-intensive science programs. Cornerstone concepts that will be demonstrated include integrated network and site resource management, model-driven frameworks for resource orchestration, end-to-end monitoring with machine learning-optimized data transfers, and integrating Qualcomm's GradientGraph with network services to optimize applications and science workflows.
Wireless FasterData and Distributed Open Compute Opportunities and (some) Us... (Larry Smarr)
This document discusses opportunities for ESnet to support wireless edge computing through developing a strategy around self-guided field laboratories (SGFL). It outlines several potential science use cases that could benefit from wireless and distributed computing capabilities, both in the short term through technologies like 5G, LoRa and Starlink, and longer term through the vision of automated SGFL. The document proposes some initial ideas for deploying and testing wireless edge computing technologies through existing projects to help enable the SGFL vision and further scientific opportunities. It emphasizes that exploring these emerging areas could help drive new science possibilities if done at a reasonable scale.
The Asia Pacific and Korea Research Platforms: An Overview (Jeonghoon Moon, Larry Smarr)
This document provides an overview of Asia Pacific and Korea research platforms. It discusses the Asia Pacific Research Platform working group in APAN, including its objectives to promote HPC ecosystems and engage members. It describes the Asi@Connect project which provides high-capacity internet connectivity for research across Asia-Pacific. It also discusses the Korea Research Platform and efforts to expand it to 25 national research institutes in Korea. New related projects on smart hospitals, agriculture, and environment are mentioned. The conclusion discusses enhancing APAN and the Korea Research Platform and expanding into new areas like disaster and AI education.
Panel: Reaching More Minority Serving Institutions (Larry Smarr)
This document discusses engaging more minority serving institutions (MSIs) in the National Research Platform (NRP). It provides data showing that MSIs serve a disproportionate number of underrepresented minority students and are important producers of STEM graduates from these groups. The NRP can help broaden participation in STEAM fields by providing MSIs access to advanced cyberinfrastructure resources, new learning modalities, and opportunities for collaborative research between MSIs and other institutions. Regional networks also have a role to play in helping MSIs overcome barriers and attracting them to collaborative grants. The goal is to tear down walls between research and teaching and reinvent the university experience for more inclusive learning and innovation.
How We Added Replication to QuestDB - J On The Beach (Javier Ramirez)
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made and their trade-offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
### Data Description and Analysis Summary for Presentation
#### 1. **Importing Libraries**
Libraries used:
- `pandas`, `numpy`: Data manipulation
- `matplotlib`, `seaborn`: Data visualization
- `scikit-learn`: Machine learning utilities
- `statsmodels`, `pmdarima`: Statistical modeling
- `keras`: Deep learning models
#### 2. **Loading and Exploring the Dataset**
**Dataset Overview:**
- **Source:** CSV file (`mumbai-monthly-rains.csv`)
- **Columns:**
- `Year`: The year of the recorded data.
- `Jan` to `Dec`: Monthly rainfall data.
- `Total`: Total annual rainfall.
**Initial Data Checks:**
- Displayed first few rows.
- Summary statistics (mean, standard deviation, min, max).
- Checked for missing values.
- Verified data types.
**Visualizations:**
- **Annual Rainfall Time Series:** Trends in annual rainfall over the years.
- **Monthly Rainfall Over Years:** Patterns and variations in monthly rainfall.
- **Yearly Total Rainfall Distribution:** Distribution and frequency of annual rainfall.
- **Box Plots for Monthly Data:** Spread and outliers in monthly rainfall.
- **Correlation Matrix of Monthly Rainfall:** Relationships between different months' rainfall.
#### 3. **Data Transformation**
**Steps:**
- Ensured 'Year' column is of integer type.
- Created a datetime index.
- Converted monthly data to a time series format.
- Created lag features to capture past values.
- Generated rolling statistics (mean, standard deviation) for different window sizes.
- Added seasonal indicators (dummy variables for months).
- Dropped rows with NaN values.
**Result:**
- Transformed dataset with additional features ready for time series analysis.
#### 4. **Data Splitting**
**Procedure:**
- Split the data into features (`X`) and target (`y`).
- Further split into training (80%) and testing (20%) sets without shuffling to preserve time series order.
**Result:**
- Training set: `(X_train, y_train)`
- Testing set: `(X_test, y_test)`
#### 5. **Automated Hyperparameter Tuning**
**Tool Used:** `pmdarima`
- Automatically selected the best parameters for the SARIMA model.
- Evaluated using metrics such as AIC and BIC.
**Output:**
- Best SARIMA model parameters and statistical summary.
#### 6. **SARIMA Model**
**Steps:**
- Fit the SARIMA model using the training data.
- Evaluated on both training and testing sets using MAE and RMSE.
**Output:**
- **Train MAE:** Indicates accuracy on training data.
- **Test MAE:** Indicates accuracy on unseen data.
- **Train RMSE:** Measures average error magnitude on training data.
- **Test RMSE:** Measures average error magnitude on testing data.
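A minimal Python sketch of the split, automated SARIMA search, and evaluation steps described in sections 4-6 above, assuming the column layout from the dataset overview (`Year`, `Jan`...`Dec`); the lag features, rolling statistics, and seasonal dummies of section 3 are omitted for brevity.

```python
# Sketch only: chronological split + pmdarima auto-ARIMA + MAE/RMSE, assuming
# the mumbai-monthly-rains.csv layout described above (Year, Jan..Dec, Total).
import pandas as pd
import pmdarima as pm
from sklearn.metrics import mean_absolute_error, mean_squared_error

df = pd.read_csv("mumbai-monthly-rains.csv")
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Convert the wide monthly columns into a single chronological series
long = df.melt(id_vars="Year", value_vars=months, var_name="Month", value_name="Rain")
long["Date"] = pd.to_datetime(long["Year"].astype(str) + long["Month"], format="%Y%b")
y = long.sort_values("Date").set_index("Date")["Rain"]

# 80/20 split without shuffling, preserving time series order
split = int(len(y) * 0.8)
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Automated (AIC-based) hyperparameter search over seasonal ARIMA orders
model = pm.auto_arima(y_train, seasonal=True, m=12, suppress_warnings=True)

pred = model.predict(n_periods=len(y_test))
print("Test MAE :", mean_absolute_error(y_test, pred))
print("Test RMSE:", mean_squared_error(y_test, pred) ** 0.5)
```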
#### 7. **LSTM Model**
**Preparation:**
- Reshaped data for LSTM input.
- Converted data to `float32`.
**Model Building and Training:**
- Built an LSTM model with one LSTM layer and one Dense layer.
- Trained the model on the training data.
**Evaluation:**
- Evaluated on both training and testing sets using MAE and RMSE.
**Output:**
- **Train MAE:** Accuracy on training data.
- **Test MAE:** Accuracy on unseen data.
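For the LSTM step in section 7, a self-contained sketch along the same lines; here only lagged monthly values are used as input (the notebook also adds rolling statistics and seasonal dummies), and the layer size, lag count, and epoch count are illustrative assumptions.

```python
# Sketch only: one LSTM layer + one Dense layer on lagged monthly values,
# cast to float32 and reshaped to (samples, timesteps, features) as in section 7.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM, Dense

df = pd.read_csv("mumbai-monthly-rains.csv")
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
y = (df.melt(id_vars="Year", value_vars=months, var_name="Month", value_name="Rain")
       .assign(Date=lambda d: pd.to_datetime(d["Year"].astype(str) + d["Month"], format="%Y%b"))
       .sort_values("Date")["Rain"].to_numpy(dtype="float32"))

lags = 12                                         # previous 12 months as the input window
X = np.stack([y[i:i + lags] for i in range(len(y) - lags)])[:, :, None]
t = y[lags:]
split = int(len(X) * 0.8)                         # chronological 80/20 split

model = Sequential([LSTM(50, input_shape=(lags, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mae")
model.fit(X[:split], t[:split], epochs=50, verbose=0)

pred = model.predict(X[split:]).ravel()
print("Test MAE :", np.mean(np.abs(pred - t[split:])))
print("Test RMSE:", np.sqrt(np.mean((pred - t[split:]) ** 2)))
```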
Airline Satisfaction Project using Azure
This presentation is created as a foundation of understanding and comparing data science/machine learning solutions made in Python notebooks locally and on Azure cloud, as a part of Course DP-100 - Designing and Implementing a Data Science Solution on Azure.
Toward a National Research Platform
1. “Toward a National Research Platform”
Invited Presentation
Open Science Grid All Hands Meeting
Salt Lake City, UT
March 20, 2018
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
2. 30 Years Ago, NSF Brought to University Researchers a DOE HPC Center Model (1985/6)
NCSA Was Modeled on LLNL; SDSC Was Modeled on MFEnet
3. I-WAY: Information Wide Area Year, Supercomputing '95 (UIC)
• The First National Telecom-Interconnected 155 Mbps Research Network
– 65 Science Projects
– Into the San Diego Convention Center
• I-WAY Featured:
– Networked Visualization Applications
– Large-Scale Immersive Displays
– I-Soft Programming Environment
– Led to the Globus Project
http://archive.ncsa.uiuc.edu/General/Training/SC95/GII.HPCC.html
See talk by: Brian Bockelman
4. NSF's PACI Program Was Built on the vBNS to Prototype America's 21st Century Information Infrastructure
The PACI Grid Testbed (National Computational Science, 1997)
vBNS led to Key Role of Miron Livny & Condor
5. UCSD Has Been Working Toward PRP for Over 15 Years: NSF OptIPuter, Quartzite, and Prism Awards
• OptIPuter: PI Smarr, 2002-2009
• Quartzite: PI Papadopoulos, 2004-2007
• Prism: PI Papadopoulos, 2013-2015
Precursors to DOE Defining the Science DMZ in 2010
6. Based on Community Input and on ESnet’s Science DMZ Concept,
NSF Has Funded Over 100 Campuses to Build DMZs
Red 2012 CC-NIE Awardees
Yellow 2013 CC-NIE Awardees
Green 2014 CC*IIE Awardees
Blue 2015 CC*DNI Awardees
Purple Multiple Time Awardees
Source: NSF
NSF Program Officer: Kevin Thompson
7. Logical Next Step: The Pacific Research Platform Networks Campus DMZs to Create a Regional End-to-End Science-Driven "Big Data Superhighway" System
NSF CC*DNI Grant
$5M 10/2015-10/2020
PI: Larry Smarr, UC San Diego Calit2
Co-PIs:
• Camille Crittenden, UC Berkeley CITRIS,
• Tom DeFanti, UC San Diego Calit2/QI,
• Philip Papadopoulos, UCSD SDSC,
• Frank Wuerthwein, UCSD Physics and SDSC
Letters of Commitment from:
• 50 Researchers from 15 Campuses
• 32 IT/Network Organization Leaders
NSF Program Officer: Amy Walton
Source: John Hess, CENIC
8. Note That the OSG Cluster Map
Has Major Overlap with the NSF-Funded DMZ Map
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
NSF CC* Grants
9. Bringing OSG Software and Services
to a Regional-Scale DMZ
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
10. Key PRP Innovation: UCSD Designed FIONAs to Solve the Disk-to-Disk Data Transfer Problem at Full Speed on 10/40/100G Networks
Big Data Science Data Transfer Nodes (DTNs): Flash I/O Network Appliances (FIONAs)
• FIONA PCs [a.k.a. ESnet DTNs], 10/40G, ~$8,000 Big Data PC with:
– 1 CPU
– 10/40 Gbps Network Interface Cards
– 3 TB SSDs or 100+ TB Disk Drive
– Extensible for Higher Performance to:
– +NVMe SSDs for 100 Gbps Disk-to-Disk
– +Up to 8 GPUs [4M GPU Core Hours/Week]
– +Up to 160 TB Disks for Data Posting
– +Up to 38 Intel CPUs
– $700 10 Gbps FIONAs Being Tested
• FIONettes are $270 FIONAs (1G; also listed at $250):
– 1 Gbps NIC with USB-3 for Flash Storage or SSD
Phil Papadopoulos, SDSC & Tom DeFanti, Joe Keefe & John Graham, Calit2
11. We Measure Disk-to-Disk Throughput with 10 GB File Transfers Using Globus GridFTP, 4 Times Per Day in Both Directions for All PRP Sites
From the start of monitoring, the mesh grew from 12 DTNs (January 29, 2016) to 24 DTNs connected at 10-40G (July 21, 2017) in 1½ years.
Source: John Graham, Calit2/QI
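As a rough illustration of one probe in such a throughput mesh, the sketch below times a single 10 GB disk-to-disk transfer using the Globus Python SDK; the endpoint UUIDs, access token, and paths are placeholders, and the PRP's actual monitoring harness (driven by Globus GridFTP between every pair of DTNs) is not shown here.

```python
# Hypothetical sketch: time one Globus transfer between two DTNs and report the
# achieved rate. Endpoint UUIDs, token, and paths are placeholders.
import time
import globus_sdk

TOKEN = "..."                        # placeholder Globus transfer access token
SRC = "UUID-OF-SOURCE-DTN"           # placeholder endpoint IDs
DST = "UUID-OF-DEST-DTN"

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

tdata = globus_sdk.TransferData(tc, SRC, DST, label="PRP 10GB disk-to-disk probe")
tdata.add_item("/data/testfile_10GB", "/data/testfile_10GB")

start = time.time()
task_id = tc.submit_transfer(tdata)["task_id"]
while not tc.task_wait(task_id, timeout=60, polling_interval=10):
    pass                             # poll until the transfer task completes
elapsed = time.time() - start        # note: includes Globus queueing time

task = tc.get_task(task_id)
gbits = task["bytes_transferred"] * 8 / 1e9
print(f"{gbits:.1f} Gb in {elapsed:.0f} s -> {gbits / elapsed:.2f} Gb/s disk-to-disk")
```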
12. PRP's First 2 Years: Connecting Multi-Campus Application Teams and Devices
Earth Sciences
13. PRP Over CENIC Couples UC Santa Cruz Astrophysics Cluster to LBNL NERSC Supercomputer
CENIC 2018 Innovations in Networking Award for Research Applications
14. 100 Gbps FIONA at UCSC Allows Downloads to the UCSC Hyades Cluster from the LBNL NERSC Supercomputer for DESI Science Analysis
• 300 images per night, 100 MB per raw image, 120 GB per night
• 250 images per night, 530 MB per raw image, 800 GB per night
Precursors to LSST and NCSA
NSF-Funded Cyberengineer Shaw Dong @UCSC Receiving FIONA, Feb 7, 2017
Source: Peter Nugent, LBNL, Professor of Astronomy, UC Berkeley
15. Jupyter Has Become the Digital Fabric for Data Sciences
PRP Creates UC-JupyterHub Backbone
Source: John Graham, Calit2
Goal: Jupyter Everywhere
16. LHCONE Traffic Growth Is Large Now But Will Explode in 2026
31 petabytes in January 2018, a +38% change within the last year
The LHC accounts for 47% of total ESnet traffic today
Dramatic data volume growth expected for the HL-LHC in 2026
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
17. Data Transfer Rates From 40 Gbps DTN in UCSD Physics Building,
Across Campus on PRISM DMZ, Then to Chicago’s Fermilab Over CENIC/ESnet
Based on This Success,
Würthwein Will Upgrade 40G DTN to 100G
For Bandwidth Tests & Kubernetes Integration
With OSG, Caltech, and UCSC
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
18. LHC Data Analysis
Running on PRP
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
Two Projects:
• OSG Cluster-in-a-Box for “T3”
• Distributed Xrootd Cache for “T2”
20. PRP Distributed Tier-2 Cache Across Caltech & UCSD
Each site runs cache servers behind a redirector, with a top-level cache redirector spanning UCSD and Caltech; applications can connect to the Global Data Federation of CMS at a local or the top-level cache redirector, and the system can be tested as individual or joint caches.
Provisioned pilot systems:
• PRP UCSD: 9 x 12 SATA disks of 2 TB @ 10 Gbps for each system
• PRP Caltech: 2 x 30 SATA disks of 6 TB @ 40 Gbps for each system
Production use (UCSD only): I/O in production is limited by the number of applications hitting the cache and their I/O patterns.
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
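To make the data path concrete, here is a hedged sketch of how an analysis job could read a CMS file through such a cache redirector; the redirector hostname, file path, and tree name are invented placeholders, reading root:// URLs requires an XRootD client to be installed, and production CMS jobs use their own tooling rather than this snippet.

```python
# Hypothetical sketch: open a file via a cache redirector so reads are served
# (and cached) by the distributed Tier-2 cache instead of the origin site.
import uproot

url = "root://cache-redirector.example.edu//store/mc/SAMPLE/events.root"  # placeholder

with uproot.open(url) as f:
    tree = f["Events"]               # assumed tree name, for illustration only
    print(tree.num_entries, "events readable through the cache")
```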
21. Game Changer: Using Kubernetes to Manage Containers Across the PRP
"Kubernetes is a way of stitching together a collection of machines into, basically, a big computer." --Craig McLuckie, Google, and now CEO and founder of Heptio
"Everything at Google runs in a container." --Joe Beda, Google
"Kubernetes has emerged as the container orchestration engine of choice for many cloud providers including Google, AWS, Rackspace, and Microsoft, and is now being used in HPC and Science DMZs." --John Graham, Calit2/QI, UC San Diego
See talk by: Rob Gardner
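A small sketch of what "stitching machines into a big computer" looks like from the user side: listing every node the cluster's Kubernetes API server knows about, with its advertised GPU capacity. It assumes the official `kubernetes` Python client and a kubeconfig already pointing at a (here hypothetical) PRP/Nautilus cluster.

```python
# Sketch: enumerate cluster nodes and their GPU capacity via the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()            # assumes credentials in ~/.kube/config
v1 = client.CoreV1Api()

total_gpus = 0
for node in v1.list_node().items:
    gpus = int(node.status.capacity.get("nvidia.com/gpu", 0))
    total_gpus += gpus
    print(f"{node.metadata.name:40s} {gpus:3d} GPUs")

print(f"Cluster-wide GPU capacity: {total_gpus}")
```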
22. Distributed Computation on the PRP Nautilus HyperCluster: Coupling SDSU Cluster and SDSC Comet Using Kubernetes Containers
Simulating the Injection of CO2 in Brine-Saturated Reservoirs: poroelastic and pressure-velocity fields solved in parallel with MPI, using domain decomposition across containers.
• Developed and executed MPI-based execution on the PRP Kubernetes cluster
• [CO2,aq] 100-year simulation, run in 4 days (figure panels at 25, 75, and 100 years)
• Domain: 0.5 km x 0.5 km x 17.5 m; three sandstone layers separated by two shale layers
Source: Chris Paolini and Jose Castillo, SDSU
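The SDSU code itself is not shown in the deck; below is a generic mpi4py sketch of the pattern it describes: one MPI rank per container, each owning a slab of the domain and exchanging halo cells every step, with the actual poroelastic and pressure-velocity update stubbed out. Grid size and step count are illustrative only.

```python
# Generic sketch of MPI domain decomposition across containers (not the SDSU code).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1200                                        # illustrative global cell count
local = np.zeros(N // size + 2)                 # this rank's slab plus two halo cells
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # exchange halo cells with neighboring subdomains
    comm.Sendrecv(local[1:2], dest=left, recvbuf=local[-1:], source=right)
    comm.Sendrecv(local[-2:-1], dest=right, recvbuf=local[:1], source=left)
    # ... update interior cells here (poroelastic & pressure-velocity solve) ...

if rank == 0:
    print(f"completed {size}-way domain-decomposed run")
```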
23. Rook is Ceph Cloud-Native Object Storage
‘Inside’ Kubernetes
https://rook.io/
Source: John Graham, Calit2/QI
See talk by:
Shawn McKee
24. FIONA8: Adding GPUs to FIONAs Supports Data Science Machine Learning
Multi-Tenant Containerized GPU JupyterHub Running Kubernetes / CoreOS
Eight Nvidia GTX 1080 Ti GPUs, 32 GB RAM, 3 TB SSD, 40G and dual 10G ports; ~$13K
Source: John Graham, Calit2
25. Nautilus: A Multi-Tenant Containerized PRP HyperCluster for Big Data Applications, Running Kubernetes with Rook/Ceph Cloud-Native Storage and GPUs for Machine Learning
[Cluster diagram: FIONA8 nodes and 40G SSD / 100G NVMe FIONAs at Calit2, SDSC, SDSU, Caltech, UCAR, UCI, UCR, USC, UCLA, Stanford, UCSB, UCSC, and Hawaii, plus an sdx-controller and controller-0.]
Rook/Ceph block/object/file storage, Swift-API compatible with SDSC, AWS, and Rackspace; Kubernetes on CentOS 7.
Source: John Graham, Calit2/QI, March 2018
26. Running Kubernetes/Rook/Ceph on the PRP Allows Us to Deploy a Distributed PB+ of Storage for Posting Science Data
[Cluster diagram: the same Nautilus sites as above, with FIONA8 nodes and 40G 160 TB / 100G NVMe FIONAs at Calit2, SDSC, SDSU, Caltech, UCAR, UCI, UCR, USC, UCLA, Stanford, UCSB, UCSC, and Hawaii.]
Rook/Ceph block/object/file storage, Swift-API compatible with SDSC, AWS, and Rackspace; Kubernetes on CentOS 7.
Source: John Graham, UCSD, March 2018
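Ceph's gateway exposes an S3-compatible API alongside Swift, so posting a dataset to this kind of distributed store can look like the sketch below; the endpoint URL, credentials, bucket, and file names are placeholders rather than real PRP values.

```python
# Hypothetical sketch: post a dataset to a Rook/Ceph object store via its
# S3-compatible gateway. Endpoint, credentials, and names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.nautilus.example.org",   # placeholder Ceph RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="my-science-data")
s3.upload_file("results/co2_run_100yr.h5", "my-science-data", "co2_run_100yr.h5")

for obj in s3.list_objects_v2(Bucket="my-science-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```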
27. Collaboration Opportunity with OSG & PRP on Distributed Storage
OSG is operating a distributed caching CI; at present, 4 caches provide significant use, and the total data volume pulled last year is dominated by those 4 caches (1.8 PB, 1.6 PB, 1.2 PB, and 210 TB).
PRP Kubernetes infrastructure could either grow existing caches by adding servers or add additional locations.
StashCache users include LIGO and DES.
See talks by: Alex Feltus, Derek Weitzel, and Marcelle Soares-Santos
Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
28. New NSF CHASE-CI Grant Creates a Community Cyberinfrastructure: Adding a Machine Learning Layer Built on Top of the Pacific Research Platform
Participating campuses: Caltech, UCB, UCI, UCR, UCSD, UCSC, Stanford, MSU, UCM, SDSU
NSF grant for a high-speed "cloud" of 256 GPUs for 30 ML faculty and their students at 10 campuses, for training AI algorithms on big data.
NSF Program Officer: Mimi McClure
29. UCSD Adding >350 Game GPUs to Data Sciences Cyberinfrastructure, Devoted to Data Analytics and Machine Learning
• 48 GPUs for OSG applications
• SunCAVE: 70 GPUs
• WAVE + Vroom: 48 GPUs
• FIONAs with 8 game GPUs each: 88 GPUs for students
• CHASE-CI grant provides 96 GPUs at UCSD for training AI algorithms on big data
30. Next Step: Surrounding the PRP Machine Learning Platform with Clouds of GPUs and Non-Von Neumann Processors
• Microsoft installs Altera FPGAs into Bing servers, and 384 into TACC for academic access
• CHASE-CI 64-TrueNorth cluster
• 64-bit GPUs (4352x NVIDIA Tesla V100 GPUs)
See talk by: Hurtado Anampa
31. PRP is Partnering with NSF Grants Supporting Advanced Cyberinfrastructure Facilitators to Explore PRP Extension Toward the NRP
PRP Connected: ACI-REF and CaRCC
ACI-REF has also spawned the 35-member Campus Research Computing Consortium (CaRCC), funded by the NSF as a Research Coordination Network (RCN). CaRCC is dedicated to sharing best practices, expertise, and resources, enabling the advancement of campus-based research computing activities across the nation.
Jim Bottum, Principal Investigator
Tom Cheatham, ACI-REF Chair of Campus PIs
See talk by: Tom Cheatham
32. Expanding to the Global Research Platform Via CENIC/Pacific Wave, Internet2, and International Links
PRP's current international partners: Netherlands, Guam, Australia, Korea, Japan, and Singapore.
Korea shows distance is not the barrier to above-5 Gb/s disk-to-disk performance.
33. The Second National Research Platform Workshop, Bozeman, MT, August 6-7, 2018
A follow-up FIONA workshop will be held as a lead-in to the 2nd NRP workshop in Bozeman, starting August 2nd. While the workshop will be open to the community, there is a specific focus on EPSCoR-affiliated and minority-serving institutions.
Co-Chairs: Larry Smarr, Calit2; Inder Monga, ESnet; Ana Hunsinger, Internet2
Local Host: Jerry Sheehan, MSU
34. Our Support:
• US National Science Foundation (NSF) awards
CNS 0821155, CNS-1338192, CNS-1456638, CNS-1730158,
ACI-1540112, & ACI-1541349
• University of California Office of the President CIO
• UCSD Chancellor’s Integrated Digital Infrastructure Program
• UCSD Next Generation Networking initiative
• Calit2 and Calit2 Qualcomm Institute
• CENIC, Pacific Wave and StarLight
• DOE ESnet