Award Abstract # 1032778
SDCI NMI Improvement: Maintenance of the Rocks Cluster Toolkit and Enhancements for Scalable, Reliable, and Resilient Clustered Data Storage

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UNIVERSITY OF CALIFORNIA, SAN DIEGO
Initial Amendment Date: August 29, 2010
Latest Amendment Date: March 1, 2011
Award Number: 1032778
Award Instrument: Standard Grant
Program Manager: Daniel Katz
OAC, Office of Advanced Cyberinfrastructure (OAC)
CSE, Directorate for Computer & Information Science & Engineering
Start Date: September 1, 2010
End Date: August 31, 2013 (Estimated)
Total Intended Award Amount: $499,999.00
Total Awarded Amount to Date: $499,999.00
Funds Obligated to Date: FY 2010 = $499,999.00
History of Investigator:
  • Philip Papadopoulos (Principal Investigator)
    ppapadopoulos@ucsd.edu
  • Mason Katz (Former Co-Principal Investigator)
  • Gregory Bruno (Former Co-Principal Investigator)
Recipient Sponsored Research Office: University of California-San Diego
9500 GILMAN DR
LA JOLLA
CA, US 92093-0021
(858)534-4896
Sponsor Congressional District: 50
Primary Place of Performance: University of California-San Diego
9500 GILMAN DR
LA JOLLA
CA, US 92093-0021
Primary Place of Performance Congressional District: 50
Unique Entity Identifier (UEI): UYTTZT6G9DT1
Parent UEI:
NSF Program(s): SOFTWARE DEVELOPMENT FOR CI
Primary Program Source: 01001011DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s):
Program Element Code(s): 768300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.080

ABSTRACT

Today an increasing number of scientists need reliable, extensible, large-scale, and high-performance storage that is tightly integrated into their computing and analysis workflows. These researchers often require their data to be available not only on remote computational clusters but also on their specialized in-laboratory equipment, workstations, and displays. Unfortunately, many are hindered because their labs lack the fundamental storage capacity and bandwidth that is often available only in specialized data centers. For example, leaders in biological research are making their science data intensive by developing and using instruments, such as high field-strength electron microscopes, that generate terabytes of data weekly. While specialized parallel storage systems can be built today by experts to meet both the capacity and throughput needs of computationally intensive analysis on clusters, the administrative effort is enormous for both initial deployment and ongoing operation. For many scientists, their needs are rapidly entering the data-intensive realm, but their access to capable and reliable storage is limited by the complexity or expense (or both) of existing solutions. The same kind of limit existed for computational clusters a decade ago, and the Rocks cluster toolkit has successfully addressed it.

In this award, the established and widely used Rocks clustering software toolkit will be expanded to include not only ongoing production support and engineering enhancements for computational clusters but also to progressively address issues directly related to clustered storage provisioning, monitoring, and event generation. In particular, the impact of this award will be to bring the current simplicity of compute cluster deployment to (1) farms of network-attached file servers and (2) dedicated parallel, high-performance storage clusters, through the standard Rocks extension mechanism called Rolls. In addition, development will begin on a monitoring architecture targeted specifically at storage subsystems, including per-disk metrics, file-system metrics, and aggregated network utilization for both Lustre parallel and NFS-based server farms. The investigators will also start the design and development of mechanisms that enable correlation of file-server utilization with jobs running on clients and remote workstations.
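
As a rough illustration of the kind of per-disk and file-system metrics described above, the minimal Python sketch below samples I/O counters from /proc/diskstats and capacity via statvfs on a Linux file server. It is hypothetical: the mount point (/export), the sampling interval, and the reporting format are assumptions, and it does not represent the project's actual monitoring architecture or its integration with cluster-wide monitoring.

    #!/usr/bin/env python
    """Minimal sketch: sample per-disk write bandwidth and file-system fullness
    on a Linux file server. Illustrative only."""

    import os
    import time


    def read_disk_stats():
        """Return {device: (sectors_read, sectors_written)} from /proc/diskstats."""
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # Fields: major, minor, device name, then read/write counters.
                dev = fields[2]
                stats[dev] = (int(fields[5]), int(fields[9]))
        return stats


    def read_fs_stats(mount_points):
        """Return {mount: (capacity_bytes, free_bytes)} for each file system."""
        fs = {}
        for mnt in mount_points:
            st = os.statvfs(mnt)
            fs[mnt] = (st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize)
        return fs


    if __name__ == "__main__":
        before = read_disk_stats()
        time.sleep(5)
        after = read_disk_stats()
        # Per-disk write bandwidth over the interval (sectors are 512 bytes).
        for dev in after:
            if dev in before:
                written = (after[dev][1] - before[dev][1]) * 512
                print("%s: %.1f KB/s written" % (dev, written / 5.0 / 1024.0))
        for mnt, (cap, free) in read_fs_stats(["/export"]).items():
            print("%s: %.1f%% full" % (mnt, 100.0 * (cap - free) / cap))

In a full deployment, samples of this kind would be aggregated across the server farm and correlated with job activity on clients, as outlined above.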

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Papadopoulos, Philip M., "Extending Clusters to Amazon EC2 Using the Rocks Toolkit," International Journal of High Performance Computing Applications, v.25, 2011, p.317. doi:10.1177/1094342011414747

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The Rocks cluster toolkit began as a project to make it easier for domain scientists to deploy and maintain commodity clusters in their laboratories without requiring significant administrative effort. Prior to tools like Rocks, researchers would typically spend days, weeks, and sometimes months installing and configuring software to build a functioning compute or Beowulf cluster. With Rocks, a complete, functional, up-to-date cluster can be built from "bare metal" (no pre-installed operating system) in just a few hours.

Based upon the CentOS distribution (but compatible with nearly any Red Hat-derived distribution), Rocks enables users to easily construct customizable, robust, high-performance clusters. Even though the roots of the toolkit are in high-performance computing (HPC), Rocks can be and is used to build tiled-display walls, high-performance storage systems, and virtualized computing (cloud) hosting. The intellectual merit of this project is to investigate efficient, robust, and reproducible methods for building a variety of clustered systems. The end result has been a series of public, open-source software releases that increase both the flexibility and functionality of clustered systems. Rocks is used around the world and at leading US production computing resource providers such as Pacific Northwest National Laboratory, the San Diego Supercomputer Center, and the Texas Advanced Computing Center. It is used at many research and teaching universities throughout the US, including the University of Nevada, Reno; the University of Memphis; Rutgers University; the University of Wisconsin; Florida Atlantic University; Clark Atlanta University; and the University of Connecticut. Small and large companies also deploy clusters using Rocks. The reach of the software toolkit as computing infrastructure means that it impacts every scientist who utilizes Rocks-managed computing and data, even if they are unaware that their favorite computing system is built with this toolkit.

The toolkit tames the inherent complexity of building and configuring clustered data systems. A full build of the released software creates over 300 packages that complement the standard single-node operating system installation. MPI (Message Passing Interface), InfiniBand, several load schedulers, upgraded versions of common scripting languages like Python and Perl, bioinformatics tools, advanced storage (e.g., the ZFS file system), virtualization, creation of EC2-compatible cloud images, web servers, and cluster monitoring are all available as optional extensions to the basic toolkit.

A fundamental design and intellectual goal was user extensibility via the Rolls mechanism, which enables anyone to develop new packages and, more importantly, the configuration needed for their automatic deployment in a cluster. The toolkit uses an extensible "graph" to define a particular cluster, where Rolls provide well-defined subgraphs. This approach enables software like load schedulers (e.g., HTCondor or Torque) to easily define server and client configurations. A Roll automates this configuration so that end users deploy a basic working configuration that can then be further customized to meet their needs. Rolls are standalone components of cluster functionality that can be mixed and matched to create the specific configuration desired by the end user simply by including the Roll at the time of installation.
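
To make the graph idea concrete, the toy Python sketch below models a configuration graph in which each Roll contributes nodes and edges, and an appliance's package set is the union of everything reachable from its starting node. The node names, edges, and package names are invented for illustration; the real Rocks graph is expressed in XML and carries configuration actions as well as package lists.

    """Toy sketch of graph-based cluster configuration composed from Rolls.
    All names below are hypothetical; this is not the Rocks kickstart graph format."""

    from collections import defaultdict

    # Edges contributed by the base distribution and by a hypothetical "hpc" roll.
    edges = defaultdict(list)
    edges["compute"] += ["base", "hpc-client"]
    edges["frontend"] += ["base", "hpc-server"]
    edges["hpc-client"] += ["mpi"]
    edges["hpc-server"] += ["mpi", "scheduler-server"]

    # Packages attached to individual graph nodes.
    packages = {
        "base": ["kernel", "openssh"],
        "mpi": ["openmpi"],
        "hpc-client": ["scheduler-execd"],
        "hpc-server": [],
        "scheduler-server": ["scheduler-qmaster"],
    }


    def resolve(appliance):
        """Walk the graph from an appliance node and collect every package."""
        seen, stack, pkgs = set(), [appliance], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            pkgs.update(packages.get(node, []))
            stack.extend(edges.get(node, []))
        return sorted(pkgs)


    print("compute node packages:", resolve("compute"))
    print("frontend packages:", resolve("frontend"))

Because each Roll only adds a self-contained subgraph, Rolls selected at installation time compose cleanly without hand-editing a monolithic configuration.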


Rocks itself is quite scalable, with production deployments on large systems. Gordon (at the San Diego Supercomputer Center) is part of XSEDE, with 1,024 nodes and 16K cores. The PIC (Programmatic Institutional Computing) cluster at Pacific Northwest National Laboratory is over 20K cores. Many have developed Rolls that go beyond the core functionality o...
