The Garvan Institute of Medical Research and Genome.One undertook a joint pilot project with AWS to optimise and customise our genomic analysis. We have run over 4,500 genomes through a new genomic pipeline, developed according to the Broad Institute's Best Practices and leveraging Amazon ECS, SQS, RDS, CloudFormation, CloudWatch, and Spot Instances to optimise specifically for the cloud. This session presents our architecture, our learnings, and the cost reductions we've achieved.
Speaker: Dr. Liviu Constantinescu, DevOps Team Lead, Genome.One
2. Liviu Constantinescu - Containerised Bioinformatics Pipeline on AWS kccg.garvan.org.au l.constantinescu@garvan.org.au
The Team
- Aaron Statham, Chief Bioinformatics Officer
- Andrew Stone, Head of Cohort Sequencing
- Mark Pinese, Senior Research Officer
- David Thomas, Director, Kinghorn Cancer Centre
- Binoop Nanu, Genome.One DevOps Engineer
- Liviu Constantinescu, Head of Genome.One DevOps
- Ben Thurgood, Principal Solutions Architect, AWS
- Jamie Nelson, Account Manager, AWS
Genome.One
Garvan (via Genome.One) is the largest genome sequencing facility in the Southern Hemisphere, with XTen sequencing sites around the world.
Our researchers are leaders in biomedical and clinical sciences. We have one of the first clinical genomics enterprises in the world, and strengths in key enabling technologies such as data mining and software engineering. Using our Illumina XTen sequencers, a person's entire DNA sequence (three billion letters of genetic code) can now be read in Australia in just a few days, for about $1,500.
The Problem of Scale
And the Importance of Cost Optimisation

Faced with an increasingly commoditised market, and competitors with much larger teams, we need to lower costs in every aspect of our enterprise while maintaining our very high levels of accuracy, traceability, documentation, consistency, functionality, and quality. Further, we need to be able to pivot quickly, and innovate fast, to stay at the forefront of genomics research in Australia and the world.

We've tackled this by adopting agile, lean approaches at every level of our business, and by building strong in-house software development capabilities. Now we are shifting towards a "software first, hardware last" model for our bioinformatics pipelines, applying the same rigorous, change-managed continuous improvement model to our infrastructure, science, and compute that we apply to the software we develop in-house.

Genome.One owns 12 high-throughput next-generation sequencers, and analyses up to 300 genomes every week.
Software First, Hardware Last

Hardware is big, slow-moving, and quickly out of date. It represents a high transaction cost, and depreciates in value rapidly post-purchase. This applies to everything from our genetic sequencers to the machines that analyse the data that comes out of them. Treating our hardware infrastructure the same way we treat our code offers us numerous advantages:
- Cost reduction
- Faster analysis speed
- Risk reduction, by removing errors and security violations
Aside from that, it also led to a 20x cost reduction in compute.
A Containerised Bioinformatics Pipeline

The Processing Pipeline
- We sequence whole human genomes at scale (up to 18,000 per year).
- Each genome produces ~80 GB of raw data (in FASTQ format).
- We need to run every such genome through an analysis process yielding BAM files (~160 GB) and finally gVCFs (~8 GB).
- Each genome can be processed independently, and is thus a great candidate for computing in parallel at scale.
- Continuous improvement also drives further increases in data size.
What We Built
- We created a Docker image that:
  - downloads raw data from NCI or Amazon S3;
  - processes this data;
  - uploads the results to NCI or Amazon S3.
- It is optimised for the c3.8xlarge instance type (using 320 GB of ephemeral storage).
- It runs a genome in approximately 20 hours (including data transfer time).
- Each stage maxes out either CPU or RAM.
- The c3.8xlarge on-demand price is $2.117/h (~$40/genome).
- Infrastructure as Code allows us to take advantage of the Spot price, however, which hovers around $0.35-0.50/h (~$7-10/genome).
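The on-demand versus Spot arithmetic above is easy to sanity-check. A minimal sketch, using only the figures quoted above (~20 h per genome on a c3.8xlarge; the function name is illustrative):

```python
# Per-genome compute cost at on-demand vs. Spot pricing, using the
# figures quoted above: ~20 hours per genome on one c3.8xlarge.
RUNTIME_HOURS = 20
ON_DEMAND_PRICE = 2.117          # c3.8xlarge on-demand, USD/h
SPOT_PRICE_RANGE = (0.35, 0.50)  # observed Spot range, USD/h

def cost_per_genome(price_per_hour: float, hours: float = RUNTIME_HOURS) -> float:
    """Cost of processing one genome at a given hourly instance price."""
    return price_per_hour * hours

on_demand = cost_per_genome(ON_DEMAND_PRICE)                          # ~$42
spot_low, spot_high = (cost_per_genome(p) for p in SPOT_PRICE_RANGE)  # ~$7-10
print(f"on-demand: ${on_demand:.2f}/genome, spot: ${spot_low:.2f}-${spot_high:.2f}/genome")
```

This is where the roughly fivefold Spot saving per genome comes from, before any other optimisation.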
This Pipeline's First Research Application
Introducing ISKS and the Medical Genome Reference Bank
Developing a Method of Identifying Individuals at High Risk of Sarcoma

Sarcomas are rare and deadly cancers that are usually diagnosed at an advanced stage or following metastasis. Earlier diagnosis of sarcoma is expected to lead to greatly improved patient survival, but the rarity of the cancer in the general population makes rapid diagnosis extremely challenging.

Multiple lines of evidence suggest that sarcoma risk has a substantial genetic component: sarcomas overwhelmingly affect the young, and sarcoma survivors are at increased risk of second cancers. The two research cohorts used in this study represent the extremes of sarcoma risk:
- ISKS (n=1,000): young, affected
- MGRB (n=4,000): elderly, healthy
They will be made available to all Australian scientists via the SGC web portal at sgc.garvan.org.au.
The Benefit to Australia
Major Advances in Sarcoma Research, and a Universal Control Cohort for Genomics Studies

The MGRB and ISKS projects together represent 10 million AUD of research investment. Two main analyses will follow.

MGRB will be analysed to:
- assess the genomic patterns associated with healthy old age;
- provide a control sequencing cohort for disease linkage studies.

MGRB will be compared with ISKS to:
- determine the loci and genes involved in sarcoma risk.
An Investment in the Future
Our Process, Step by Step
How we Turned our Container into a Change-Managed, Optimised & Repeatable Pipeline

1. Optimised Compute Instance
2. Smart Queuing System
3. Data Mover Architecture
4. Monitoring, Tagging and Visibility
(Automate with CloudFormation)

We never intended this architecture to be running 24/7, and neither did we intend it to be dedicated to this specific research study. Nor, for that matter, do we want to be limited to a single instance. Like all other code in the Genome.One ecosystem, and the container the pipeline is based on, our entire compute architecture is source-controlled, change-managed, and deployable at the click of a button. For this, we used AWS CloudFormation.
Compute Optimisation
- Every stage of the pipeline has differing compute needs.
- Each stage maxes out either CPU or RAM.
Smart Queuing System

Amazon SQS
- If a task does not have a queued or running job, create a new job and submit it to an Amazon SQS queue.
- When tasks are in the Amazon SQS queue, scale up the Amazon EC2/ECS cluster.
- When containers are idle and the Amazon SQS queue is empty, scale down the Amazon EC2/ECS cluster.

Amazon ECS
- Create an Amazon ECS cluster that runs 1 container per instance.
- Create an Amazon EC2 autoscaling group to run the containers on.

Amazon RDS
- Serves a simple task-management database:
  - 1 genome == 1 task
  - 1 job == 1 attempt at completing a task
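The queue-driven scaling rules above reduce to a small decision function over two numbers: the SQS queue depth and the count of busy containers. A sketch of that logic, assuming nothing beyond what the slide states (the function names, enum, and job-state strings are illustrative, not the actual implementation):

```python
from enum import Enum

class Action(Enum):
    SCALE_UP = "scale up the EC2/ECS cluster"
    SCALE_DOWN = "scale down the EC2/ECS cluster"
    HOLD = "leave the cluster as-is"

def scaling_action(queued_tasks: int, busy_containers: int) -> Action:
    """Apply the rules above: scale up while work is queued; scale
    down only when the queue is empty and no container is working."""
    if queued_tasks > 0:
        return Action.SCALE_UP
    if busy_containers == 0:
        return Action.SCALE_DOWN
    return Action.HOLD

# Task/job model from the RDS schema: 1 genome == 1 task,
# 1 job == 1 attempt at completing that task.
def needs_new_job(job_states: list) -> bool:
    """Submit a new job only if the task has no queued or running job."""
    return not any(state in ("queued", "running") for state in job_states)
```

For example, a task whose previous jobs all failed (`needs_new_job(["failed", "failed"])`) gets a fresh job submitted to SQS, which in turn drives `scaling_action` toward scaling up.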
Design of Production-Scale Deployment
(architecture diagram)
Information Radiators
Displaying Pipeline Information at a Glance

Shared Vision and Transparency
We continuously monitored run time, the number of tasks in the queue, the prices of the instances we needed on Spot, the instances in use, and the data in the Amazon S3 buckets. We tied alerts to all of these.

Total Control of Billing
By using billing tags, automatically assigning them to all resources via the AWS CloudFormation scripts that create them, and specifying our desired range of Spot prices, we maintained complete control over what we were paying, and when.

Complete End-to-End Reporting
Connecting our existing suite of CI/CD tools (Atlassian's Bamboo and Bitbucket, with sample flow monitored in Atlassian JIRA) to the system was easy and effective, via some simple dashboarding tools and the AWS CLI.

(Pipeline Monitoring Dashboard)
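The alerting described above amounts to threshold checks over a handful of monitored values. A hypothetical sketch of that shape (the metric keys and thresholds are invented for illustration; in practice these alerts were wired through CloudWatch, not hand-rolled):

```python
def check_alerts(metrics: dict, max_spot_bid: float, max_queue: int) -> list:
    """Return alert messages for any monitored value out of range.
    The keys mirror what we watched: Spot price, queue depth, run time."""
    alerts = []
    if metrics["spot_price"] > max_spot_bid:
        alerts.append(f"spot price ${metrics['spot_price']:.2f} exceeds bid ${max_spot_bid:.2f}")
    if metrics["queued_tasks"] > max_queue:
        alerts.append(f"{metrics['queued_tasks']} tasks queued (limit {max_queue})")
    if metrics["run_time_hours"] > 24:  # well past the ~20 h per-genome norm
        alerts.append(f"job running {metrics['run_time_hours']} h, likely stuck")
    return alerts
```

Feeding the same function both the dashboard and the alert channel is what makes it an information radiator: everyone sees the same out-of-range values at the same time.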
State of the Pipeline

Overview
- 4,500 whole genomes processed.
- Raw data currently staged on NCI.
- Data is pulled up to Amazon EC2 for processing, and results are deposited into an Amazon S3 bucket.
- Processing occurs in a c3.8xlarge autoscaling group of max 150 instances, with a max Spot bid of $2.90. (Costing for Phase 1 follows.)
- After the project is finished, results are pulled back down to NCI using data mover nodes in a Direct Connect VPC.
- Phase 1 data is already up on sgc.garvan.org.au, accessible for free.

Egress Options
- Direct Connect between AWS and NCI.
- Direct Connect between AWS and Garvan (via UNSW).
- Use AWS Snowball to deliver the genomes on a physical device.
- A partial egress waiver is available through AWS for scientific studies like this one.
- Phase 2 metrics: 2,859 high-quality samples, with roughly 70 million loci genotyped in each; roughly 3 million CPU-hours and 1.1 PB of data.
Phase 1 Reliability & Speed Metrics
- 856 genomes were completed in 1,040 jobs (attempts).
- Compute time for each of the successful jobs was 22.6 hours on average.
- Compute for this phase was completed between the 1st and the 11th of April (11 days).

Reasons for retries: accidental terminations (~20), Spot terminations, and container terminations (due to pipeline errors). Pipeline errors were largely out-of-memory errors in the HaplotypeCaller stage, likely due to recent real-world data being larger than the test data.
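The retry overhead implied by these figures can be checked directly (using only the numbers on the slide; the variable names are illustrative):

```python
# Phase 1 job accounting, from the figures above.
genomes_completed = 856   # tasks finished (1 genome == 1 task)
jobs_run = 1040           # attempts (1 job == 1 attempt at a task)

retried_jobs = jobs_run - genomes_completed        # attempts that did not complete a genome
success_fraction = genomes_completed / jobs_run    # fraction of attempts that succeeded
print(f"{retried_jobs} retried jobs; {success_fraction:.0%} of attempts completed a genome")
```

So roughly one attempt in six was a retry, which is consistent with the mix of Spot terminations, accidental terminations, and OOM failures described above.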
Spot Price Fluctuation Observed
(Some terminations here and there, mainly in AZ 2A.)
- In our trials, Spot prices scaled comfortably to about 150 c3.8xlarge instances and 200 r3.8xlarge instances (about 10,000 cores total) before prices were significantly affected.
NAT Network Throughput
We observed significant spend on data mover instances, and will optimise this in future.
- NAT Gateway pricing was sub-optimal here.
- Instead, we used c3.large instances, but maxed out at ~4 GB/s data throughput.
- Upgrading to c3.8xlarge saw bandwidth increase to 20 GB/s or higher.
- Due to the significant costs involved, we suggest that anyone building a similar architecture use a Squid proxy instead.
- See: https://aws.amazon.com/articles/5995712515781075
Per-Genome Costs Observed
- Compute: total for 856 samples = $15,093 ($17.60 per genome)
  - $12,409 on c3.8xlarge Spot instances for compute
  - $2,476 on NAT instances (could be improved via the Squid proxy solution)
- Data egress: Amazon S3 -> Direct Connect -> UNSW -> NCI, roughly $8.70 per genome
- Amazon S3 storage: total for 856 samples = $1,954 ($2.20 per genome)
- Grand total estimate: ~$28.50 USD per genome (no GST), or $37.33 AUD per genome.
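The per-genome breakdown above can be reproduced from the cohort totals. A sketch using only the slide's USD figures (egress is only quoted per genome on the slide, so it is carried as a rate):

```python
# Reproduce the observed per-genome costs from the Phase 1 totals (USD).
SAMPLES = 856
compute_total = 15_093     # c3.8xlarge Spot instances + NAT instances
storage_total = 1_954      # Amazon S3 storage for the cohort
egress_per_genome = 8.70   # S3 -> Direct Connect -> UNSW -> NCI, quoted rate

compute_per_genome = compute_total / SAMPLES   # ~ $17.60
storage_per_genome = storage_total / SAMPLES   # ~ $2.28
total_per_genome = compute_per_genome + storage_per_genome + egress_per_genome

print(f"compute  ${compute_per_genome:.2f}/genome")
print(f"storage  ${storage_per_genome:.2f}/genome")
print(f"egress   ${egress_per_genome:.2f}/genome")
print(f"total   ~${total_per_genome:.2f}/genome")
```

The computed total lands within about ten cents of the quoted ~$28.50, the gap being rounding in the per-line figures.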
Further Evolution
Our Roadmap for Refining the Bioinformatics Pipeline

1. Complete – Containerised AWS Pipeline
The pipeline works end-to-end, is optimised for the cloud, and is able to produce a high-quality result within the budget available.

2. Phase 2 – Adjust Data Metrics
We can reduce the frequency of out-of-memory errors with some more work on the pipeline's optimisation, to account for the spikes in RAM usage we observed.

3. Phase 3 – More Instance Types
We'd like to optimise container versions for multiple instance types, and use Spot Fleet to deploy based on what's economical/available. Our pipeline's output ought to scale linearly with CPU cores, given sufficient RAM. We'll also investigate the use of Amazon EBS over ephemeral storage.

4. Phase 4 – Optimise NAT Instances
Switch to a Squid proxy model over our current NAT instances, or simply find a cheaper instance type with higher throughput. Alternatively, we can preload all raw data from NCI to Amazon S3 (over Direct Connect) and likely avoid needing (expensive) NAT instances altogether.

5. Alternative – AWS Batch
Amazon's new service, AWS Batch, was not available in Sydney when we built this pipeline, and we are very interested in its potential to bring even more gains.
Special Thanks
- Our sponsors and collaborators (shown to left), and our sequencing staff at Garvan and Genome.One.
- NSW OHMR, for funding the research study.
- Aspree, for providing samples.
- The 45 and Up Study, for providing samples.
- Amazon Web Services, for Jamie and Ben's valuable assistance completing this pilot study, and for defraying some of the costs of developing the pipeline and of data egress for the study.