Jsm madduri-august-2015

globus.org/genomics
Finding Needles in a Haystack – Big Data
Management and Analysis using Globus
Ravi Madduri
madduri@anl.gov
JSM 2015, Seattle, Washington

Who We Are
• Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions

Finding needles in haystacks
[Figure: discovery cycle] Pose question → Design experiment → Collect data → Analyze data → Identify patterns → Hypothesize explanation → Test hypothesis → Publish results

Imagine if a researcher, when tackling a problem, could easily:
• Assemble, integrate, and interpret all relevant data within a knowledge network
• Be informed of anomalies, patterns, gaps
• Formulate & apply computational models
• Outsource tasks if local expertise lacking
• Launch automated processes to test hypotheses, expand knowledge network
• Pay for all this by taking on other tasks

We will cover
• Accelerating Scientific Discovery Process
by providing Science as a Service
– Research Data Management
– Analyzing Research Data
• Interactive Analysis
• Large-scale Analysis
– Publishing Results so others can
• Discover
• Validate
• Reproduce/Use

90% of cancer patients carry a mutation that may be responsive to a known drug
Mark Rubin, Weill Cornell Medical College and NewYork-Presbyterian Hospital in New York, in Nature, April 2015

Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack
– Nancy Cox (Vanderbilt)

Higgs discovery “only possible because of the extraordinary achievements of … grid computing”
Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, billions of tasks

How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine?
[Image: Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia]

Managing big data with Globus
1. PI initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. PI selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Dublin core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
• SaaS: only a web browser required
• Access using your campus credentials
• Globus monitors and informs throughout
[Figure labels: Light Source, Compute Facility, Personal Computer, Publication Repository]
Globus Platform-as-a-Service
Identity, Group, Profile Management Services
…
Sharing Service
Transfer Service
Globus Toolkit
Globus APIs
Globus Connect

Globus Adoption and Usage
• 166,449 active Globus endpoints
• 27,961 users registered
• Biggest transfer: 500.42TB
• Longest running transfer: 182 days.
• Fastest transfer: 58.5Gbps (average)
• 55TB moved per day, on average, since the service was launched in November 2010
• Average throughput: 637.7Mbps (since service launch)
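To put these figures on a common scale, the short calculation below converts the daily volume into an aggregate bit rate and estimates how long the biggest transfer would take at the quoted average throughput. It is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes) and assumes the 637.7 Mbps figure is a per-transfer average, neither of which the slide states.

```python
# Figures quoted on the slide; decimal units are an assumption (1 TB = 10**12 bytes).
TB = 10**12
SECONDS_PER_DAY = 86_400


def mean_rate_gbps(bytes_moved: float, seconds: float) -> float:
    """Average bit rate in Gbps for a given volume moved over a given time."""
    return bytes_moved * 8 / seconds / 1e9


def transfer_days(bytes_moved: float, rate_bps: float) -> float:
    """Days needed to move a volume at a constant bit rate."""
    return bytes_moved * 8 / rate_bps / SECONDS_PER_DAY


# 55 TB per day, expressed as a sustained aggregate rate across the service.
aggregate_gbps = mean_rate_gbps(55 * TB, SECONDS_PER_DAY)

# The 500.42 TB transfer, if it ran at the 637.7 Mbps average throughput.
days_for_biggest = transfer_days(500.42 * TB, 637.7e6)

print(f"{aggregate_gbps:.2f} Gbps aggregate; {days_for_biggest:.1f} days at average throughput")
```

Under those assumptions the service as a whole sustains roughly 5 Gbps, and the biggest transfer would need a little over 72 days at the average per-transfer rate.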

Analyzing Big Data using Globus
[Figure labels: Galaxies, Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab, Galaxy Data Libraries, Globus Genomics]
Data management
Globus provides for
• High-performance
• Fault-tolerant
• Secure
file transfer between all data-endpoints
Data analysis
Galaxy-based workflow management
• Globus integrated within Galaxy
• Web-based UI
• Drag-Drop workflow creation
• Easily modify workflows with new tools
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
Pipeline shown: Fastq, Ref Genome → Alignment → Variant Calling (tools shown: Picard, GATK)

Our Science Stack
• Galaxy (SaaS)
– Interactive execution, iPython, R
– Creation, Execution, Sharing, Discovering Workflows
• Globus (PaaS)
– Data management
– Identity Management
• AWS (IaaS)
– HTCondor, Chef, EC2, EBS, S3, SNS
– Spot, Route 53, Cloud Formation

Examples of what researchers have done

Cox lab, UChicago
• 134 samples and 4 workflows
• 4 TB data initially
• 2200 core hours in 6 days

Consensus Caller

[Figure panels: Rediscovery of previously observed variants; Transition/Transversion Ratio; Genotype Mendel Error Rate; Distributions of Mendel Error Counts per Trio]
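One of the quality metrics named in these panels, the transition/transversion (Ti/Tv) ratio, is simple to compute from a list of single-nucleotide variants. The sketch below shows the conventional definition (A↔G and C↔T are transitions, every other base change is a transversion); it is an illustration with made-up toy calls, not the QC code used for the plots, and `ti_tv_ratio` is a hypothetical helper name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs) -> float:
    """Ti/Tv over (ref, alt) pairs; indels and non-ACGT alleles are skipped."""
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single-nucleotide substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return transitions / transversions


# Toy calls (made up): three transitions, two transversions, one indel (ignored).
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"), ("AT", "A")]
ratio = ti_tv_ratio(calls)
```

A call set whose Ti/Tv ratio drifts far from the value expected for the sequenced region is a common sign of false-positive calls, which is why the metric appears alongside rediscovery and Mendel error rates.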

Contaminated Samples

Olopade lab, UChicago
A profile of inherited predisposition to breast
cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner,
S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola,
O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
• 200 targeted exomes
• 200 GB data initially
• 76,920 core hours in 1.25 days
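The core-hour figures quoted for the two labs imply very different levels of parallelism. The calculation below converts core hours and wall-clock days into an average number of concurrently busy cores; it assumes the work was spread evenly over the stated period, which the slides do not claim, so the results are averages rather than peak cluster sizes.

```python
def average_cores(core_hours: float, wall_days: float) -> float:
    """Average number of cores busy at once, assuming an even spread over time."""
    return core_hours / (wall_days * 24)


# Cox lab: 2200 core hours in 6 days.
cox_cores = average_cores(2200, 6)

# Olopade lab: 76,920 core hours in 1.25 days.
olopade_cores = average_cores(76_920, 1.25)

print(f"Cox lab: about {cox_cores:.0f} cores; Olopade lab: about {olopade_cores:.0f} cores")
```

On these assumptions the Cox lab run averaged about 15 cores, while the Olopade lab run averaged about 2,564, which is the kind of burst that dynamically provisioned cloud resources make practical.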

Expanding Consensus Genotyper – SNVs, Indels, SVs
[Figure: pipeline] RAW FASTQs → callers (GATK Pipeline/HC, FreeBayes, SAMtools mpileup, GATK Pipeline/UG, Atlas2, Delly/Contra), each producing a VCF → Consensus Genotyper → VCF
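The idea of merging call sets from several variant callers can be sketched as a simple vote. The slide does not say which rule the Consensus Genotyper applies, so the example below assumes the simplest one: keep a variant if at least `min_callers` callers report it. The function name, caller labels, and coordinates are hypothetical toy values.

```python
from collections import Counter


def consensus_calls(calls_by_caller: dict, min_callers: int) -> set:
    """Keep variants reported by at least `min_callers` callers.

    Each variant is a (chrom, pos, ref, alt) tuple, as it might be read from a VCF.
    """
    support = Counter()
    for variants in calls_by_caller.values():
        support.update(set(variants))  # each caller votes once per variant
    return {variant for variant, votes in support.items() if votes >= min_callers}


# Toy call sets (made up); in practice each would be parsed from a caller's VCF.
calls_by_caller = {
    "caller_a": [("chr17", 100, "A", "G"), ("chr13", 200, "C", "T")],
    "caller_b": [("chr17", 100, "A", "G")],
    "caller_c": [("chr17", 100, "A", "G"), ("chr13", 200, "C", "T"), ("chr16", 300, "G", "GA")],
}
consensus = consensus_calls(calls_by_caller, min_callers=2)
```

Raising `min_callers` trades sensitivity for precision; a real implementation would also need to normalize indel representations across callers before comparing them, which this sketch does not attempt.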

Preliminary Results are very encouraging
14 deleterious SNVs and 11 damaging Indels (BRCA1: 15, BRCA2: 4, PALB2: 2, BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were found in 29 subjects, and they were all confidently detected among 5 callers. Identified SNVs and Indels were all confirmed by Sanger sequencing.

Combining Data management and Analysis
[Figure labels: PPMI, ADNI, Adenocarcinoma, Adrenal, Brain, ERMrest, BDDS Collection, QC, Alignment, Feature count, Alignment Files, Differential expression]
http://bit.ly/1M0h6Yx
http://bit.ly/A10R89y
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
3. Publish bags
4. Discover published data and execute comparison workflow
Gene Expression Results

Globus Genomics at a glance
• 30 institutions, groups, labs
• 10s million core hours
• 2 PBs raw sequences analyzed
• >1500 analysis tools
• 1000s genomes processed
• >50 workflows
• 99% uptime over the past two years
• 1 PB largest single transfer to do
• 5 days longest running workflow
• 100s different species

Other Globus Genomics users
Dobyns Lab
Cox Lab
Volchenboum Lab
Olopade Lab
Nagarajan Lab

Costs are remarkably low
Pricing includes
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support

Globus Genomics – Making it routine to find needles in NGS haystacks
www.globus.org/genomics

Other Examples of Science as a Service
• PDACS - Portal for data analysis services for cosmological simulations
• CVRG Galaxy – Large-scale ECG Data Analysis
• Globus Proteomics
• eMatter – Material Science Simulations
• FACE-IT - Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)

• More information on Globus Genomics: www.globus.org/genomics
• More information on Globus: www.globus.org

Our work is supported by:
U.S. Department of Energy

Thank you!
@madduri
