CONTAINERISED BIOINFORMATICS PIPELINE ON AWS
JOINT PILOT PROJECT BETWEEN GARVAN AND GENOME.ONE
PRESENTED BY LIVIU CONSTANTINESCU
Contact: kccg.garvan.org.au | l.constantinescu@garvan.org.au
The Team

AARON STATHAM – CHIEF BIOINFORMATICS OFFICER
ANDREW STONE – HEAD OF COHORT SEQUENCING
MARK PINESE – SENIOR RESEARCH OFFICER
DAVID THOMAS – DIRECTOR, KINGHORN CANCER CENTRE
BINOOP NANU – GENOME.ONE DEVOPS ENGINEER
LIVIU CONSTANTINESCU – HEAD OF GENOME.ONE DEVOPS
BEN THURGOOD – PRINCIPAL SOLUTIONS ARCHITECT, AWS
JAMIE NELSON – ACCOUNT MANAGER, AWS
Genome.One

Garvan (via Genome.One) is the largest genome sequencing facility in the Southern Hemisphere.

Our researchers are leaders in biomedical and clinical sciences. We have one of the first clinical genomics enterprises in the world, and strengths in key enabling technologies such as data mining and software engineering. Using our Illumina XTen sequencers, a person's entire DNA sequence – three billion letters of genetic code – can now be read in Australia in just a few days, for about $1,500.

(Map: XTen sequencing sites around the world.)
The Problem of Scale
And the Importance of Cost Optimisation
Faced with an increasingly commoditised market, and
competitors with much larger teams, we need to lower
costs in all aspects of our enterprise while maintaining
our very high levels of accuracy, traceability,
documentation, consistency, functionality and quality.
Further, we need to be able to pivot quickly, and innovate
fast, to stay at the forefront of genomics research in
Australia, and the world.
We’ve tackled this by adopting agile, lean approaches at
every level of our business, and building strong in-house
software development capabilities. Now we are shifting
towards a “software first, hardware last” model for our
bioinformatics pipelines, applying the same rigorous,
change-managed continuous improvement model to our
infrastructure, science and compute that we do to the
software we develop in-house.
Genome.One owns 12 high-throughput next-generation sequencers, and analyses up to 300 genomes every week.
Software First, Hardware Last

Hardware is big, slow-moving, and quickly out of date. It represents a high transaction cost, and depreciates in value rapidly post-purchase. This applies to everything from our genetic sequencers to the machines that analyse the data that comes out of them. Treating our hardware infrastructure the same way we do our code offers us numerous advantages:

• Cost Reduction
• Faster Analysis Speed
• Risk Reduction, by Removing Errors and Security Violations

Aside from that, it also led to a 20× cost reduction in compute.
A Containerised Bioinformatics Pipeline

The Processing Pipeline
• We sequence whole human genomes at scale (up to 18,000 per year).
• Each genome produces ~80 GB of raw data (in FASTQ format).
• We need to run every such genome through an analysis process yielding BAM files (~160 GB) and finally gVCFs (~8 GB).
• Each genome can be processed independently, and is thus a great candidate for computing in parallel at scale.
• Continuous improvement also drives further increases in data size.

What we Built
• We created a Docker image that:
  • Downloads raw data from NCI or Amazon S3.
  • Processes this data.
  • Uploads the results to NCI or Amazon S3.
• It is optimised for the c3.8xlarge instance type (using 320 GB of ephemeral storage).
• It runs a genome in approximately 20 hours (including data transfer time).
• Each stage maxes out either CPU or RAM.
• The c3.8xlarge on-demand price is $2.117/h (~$40/genome).
• Infrastructure as Code allows us to take advantage of the Spot price, however, which hovers around $0.35–0.50/h (~$7–10/genome).
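To make the container's role concrete, here is a minimal Python sketch of its download-process-upload flow. The bucket name, key layout and run_pipeline.sh wrapper are hypothetical placeholders, not the pipeline's actual code:

import subprocess

import boto3

s3 = boto3.client("s3")
BUCKET = "example-genomics-bucket"  # hypothetical; raw data can also come from NCI

def process_genome(sample_id: str) -> None:
    """Download raw FASTQs, run the analysis stages, upload the results."""
    # 1. Pull ~80 GB of raw FASTQ data onto fast ephemeral storage.
    for read in ("R1", "R2"):
        key = f"fastq/{sample_id}_{read}.fastq.gz"
        s3.download_file(BUCKET, key, f"/scratch/{sample_id}_{read}.fastq.gz")

    # 2. Run the analysis (alignment -> BAM -> gVCF); this wrapper script
    #    stands in for the real tool invocations.
    subprocess.run(["bash", "run_pipeline.sh", sample_id], check=True)

    # 3. Push the ~160 GB BAM and ~8 GB gVCF back to S3.
    s3.upload_file(f"/scratch/{sample_id}.bam", BUCKET, f"bam/{sample_id}.bam")
    s3.upload_file(f"/scratch/{sample_id}.g.vcf.gz", BUCKET,
                   f"gvcf/{sample_id}.g.vcf.gz")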
This Pipeline’s First Research Application
Introducing ISKS and the Medical Genome Reference Bank
Sarcomas are rare and deadly cancers that are usually
diagnosed at an advanced stage or following metastasis.
Earlier diagnosis of sarcoma is expected to lead to greatly
improved patient survival, but the rarity of the cancer in the
general population makes rapid diagnosis extremely
challenging.
Multiple lines of evidence suggest that sarcoma risk has a
substantial genetic component: sarcomas overwhelmingly
affect the young, and sarcoma survivors are at increased risk
of second cancers. The two research cohorts used in this study
(MGRB and ISKS) represent the extremes of sarcoma risk.
They will be made available to all Australian scientists, via the
SGC web portal at: sgc.garvan.org.au
Developing a Method of Identifying Individuals at High Risk of Sarcoma

ISKS (n=1,000) – Young, Affected
MGRB (n=4,000) – Elderly, Healthy
The Benefit to Australia
Major Advances in Sarcoma Research, and a Universal Control Cohort for Genomics Studies

The MGRB & ISKS projects together represent 10 million AUD of research investment. Two main analyses will follow.

MGRB will be analysed to:
• Assess the genomic patterns associated with healthy old age.
• Provide a control sequencing cohort for disease linkage studies.

MGRB will be compared with ISKS to:
• Determine loci and genes involved in sarcoma risk.
An Investment in the Future
Our Process, Step By Step
How we Turned our Container into a Change-Managed, Optimised & Repeatable Pipeline

1. Optimised Compute Instance
2. Smart Queuing System
3. Data Mover Architecture
4. Monitoring, Tagging and Visibility
(All automated with AWS CloudFormation.)

We never intended this architecture to be running 24/7, and neither did we intend it to be dedicated to this specific research study. Nor, for that matter, do we want to be limited to a single instance.

Like all other code in the Genome.One ecosystem, and the container the pipeline is based on, our entire compute architecture is source controlled, change managed, and deployable at the click of a button. For this, we used AWS CloudFormation.
Deployment Automation
Using Nested Templates in AWS CloudFormation

• kccgpipelines-master.json
  • Network.json
  • Nat.json
  • S3endpoint.json
  • Bastion.json
  • DB.json
  • SQS.json
  • ECSCluster.json
  • App.json
  • ECSCluster-monitor.json
  • DirectConnect.json
  • DataMover.json
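With everything expressed as nested templates, deploying the whole environment reduces to one call against the master template. A minimal sketch using boto3 (the template URL and parameter are hypothetical; nested templates must themselves be reachable in S3):

import boto3

cfn = boto3.client("cloudformation", region_name="ap-southeast-2")

# One call stands up the entire environment; kccgpipelines-master.json
# pulls in Network.json, SQS.json, ECSCluster.json, etc. as nested stacks.
cfn.create_stack(
    StackName="kccg-pipelines",
    TemplateURL="https://s3.amazonaws.com/example-templates/kccgpipelines-master.json",  # hypothetical
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "production"}],
    Capabilities=["CAPABILITY_IAM"],  # the nested stacks create IAM roles
)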
Compute Optimisation
• Every stage of the pipeline has differing compute needs.
• Each stage maxes out either CPU or RAM.
Smart Queuing System

Amazon SQS
• If a task does not have a queued or running job, create a new job and submit it to an Amazon SQS queue.
• When tasks are in the Amazon SQS queue, scale up the Amazon EC2/ECS cluster.
• When containers are not working and the Amazon SQS queue is empty, scale down the Amazon EC2/ECS cluster.

Amazon ECS
• Create an Amazon ECS cluster to run 1 container per instance.
• Create an Amazon EC2 autoscaling group to run containers on.

Amazon RDS
• Serves a simple task management database:
  • 1 genome == 1 task
  • 1 job == 1 attempt at completing a task

A sketch of the scaling logic follows below.
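As a minimal Python sketch of that scaling logic (the queue URL and autoscaling group name are hypothetical, and this polling function is an illustration, not necessarily how the production system wires SQS to the cluster):

import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/genome-tasks"  # hypothetical
ASG_NAME = "kccg-pipeline-workers"  # hypothetical

def scale_to_queue_depth(max_instances: int = 150) -> None:
    """Size the EC2/ECS cluster to match the number of queued tasks."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    queued = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # One container per instance, so desired capacity tracks queue depth,
    # capped at the group's maximum; an empty queue scales the cluster down.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=min(queued, max_instances),
    )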
Design of Production-Scale Deployment
(Architecture diagram of the production-scale deployment.)
Information Radiators
Displaying Pipeline Information at a Glance

Shared Vision and Transparency
We continuously monitored run time, the number of tasks in the queue, the prices of the instances we needed on Spot, the instances being used, and the data in the Amazon S3 buckets. We tied alerts to all of these.

Total Control of Billing
By using billing tags, automatically assigning them to all resources via the AWS CloudFormation scripts that create them, and specifying our desired range of Spot prices, we maintained complete control over what we were paying, and when.

Complete End-to-End Reporting
Connecting our existing suite of CI/CD tools (Atlassian's Bamboo/Bitbucket, with sample flow monitored in Atlassian JIRA) to the system was easy and effective, via some simple dashboarding tools and Atlassian JIRA, alongside the AWS CLI.

(Screenshot: pipeline monitoring dashboard.)
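The deck does not spell out how the alerts were wired; one plausible sketch is a CloudWatch alarm on queue depth feeding an SNS topic (the alarm name, queue name, threshold and topic ARN are all hypothetical):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the task queue backs up beyond what the cluster can absorb.
cloudwatch.put_metric_alarm(
    AlarmName="genome-queue-backlog",  # hypothetical
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "genome-tasks"}],  # hypothetical queue
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-southeast-2:123456789012:pipeline-alerts"],  # hypothetical
)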
State of the Pipeline

Overview
• 4,500 whole genomes processed.
• Raw data currently staged on NCI.
• Data pulled up to Amazon EC2 for processing, results deposited into an Amazon S3 bucket.
• Processing occurs in a c3.8xlarge autoscaling group of max 150 instances, with a max spot bid @ $2.90. Costing for Phase 1 to follow.
• After the project is finished, results are pulled back down to NCI using data mover nodes in a Direct Connect VPC.
• Phase 1 data is already up on sgc.garvan.org.au, accessible for free.

Egress Options
• Direct Connect between AWS and NCI.
• Direct Connect between AWS and Garvan (via UNSW).
• Use AWS Snowball to deliver the genomes on a physical device.
• A partial egress waiver is available through AWS for scientific studies like this one.
• Phase 2 metrics: 2,859 high-quality samples, with roughly 70 million loci genotyped in each; roughly 3 million CPU-hours and 1.1 petabytes of data.
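To illustrate what a data mover node does, here is a minimal sketch that mirrors one result prefix from S3 down to NCI-attached storage (the bucket, prefix and destination path are hypothetical, and real nodes would presumably parallelise transfers):

import pathlib

import boto3

bucket = boto3.resource("s3").Bucket("example-results-bucket")  # hypothetical

def pull_results(prefix: str, dest: str) -> None:
    """Download every object under one prefix, preserving the key layout."""
    for obj in bucket.objects.filter(Prefix=prefix):
        target = pathlib.Path(dest) / obj.key
        target.parent.mkdir(parents=True, exist_ok=True)
        bucket.download_file(obj.key, str(target))

pull_results("gvcf/", "/g/data/example-project/results")  # hypothetical NCI path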
Phase 1 Reliability & Speed Metrics

• 856 genomes were completed in 1,040 jobs (attempts).
• Compute time for each of the successful jobs was on average 22.6 hours.
• Compute for this phase was completed between the 1st of April and the 11th of April (11 days).

Reasons for retries: accidental terminations (~20), Spot terminations, and container terminations (due to pipeline errors). Pipeline errors were largely OOM errors in the HaplotypeCaller stage, likely due to recent real-world data being larger than the test data.
Spot Price Fluctuation Observed
(Chart: Spot price over time; some terminations here and there, mainly in AZ 2A.)

• In our trials, spot prices scaled comfortably to about 150 c3.8xlarge instances and 200 r3.8xlarge instances (about 10,000 cores total) before prices were significantly affected.
NAT Network Throughput
We observed significant spend on data mover instances, and will optimise this in future.

• NAT Gateway pricing was sub-optimal here.
• Instead, we used c3.large NAT instances, but these maxed out at ~4 GB/s data throughput.
• Upgrading to c3.8xlarge saw bandwidth increase to 20 GB/s or higher.
• Due to the significant costs involved, we suggest that anyone building a similar architecture use a Squid proxy instead.
• See: https://aws.amazon.com/articles/5995712515781075
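For the Squid alternative, the idea is that worker instances send their S3 traffic to an in-VPC proxy rather than routing it through a NAT instance. A minimal client-side sketch in Python (the proxy hostname and port are hypothetical):

import boto3
from botocore.config import Config

# Point the S3 client at the in-VPC Squid proxy instead of routing via NAT.
proxy_config = Config(proxies={"https": "http://squid.internal.example:3128"})  # hypothetical host
s3 = boto3.client("s3", config=proxy_config)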
Per-Genome Costs Observed

• Compute
  – Total for 856 samples = $15,093 ($17.60 per genome)
    • $12,409 on c3.8xlarge spot instances for compute
    • $2,476 on NAT instances (could be improved via a Squid proxy solution)
• Data egress
  – Amazon S3 -> Direct Connect -> UNSW -> NCI
  – Roughly $8.70 per genome
• Amazon S3 storage
  – Total for 856 samples = $1,954 ($2.20 per genome)
• Grand total estimate: $28.50 USD per genome (no GST), or $37.33 AUD.
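The grand total is just the sum of the rounded per-genome components above; a quick arithmetic check (the exchange rate is the one implied by the deck's own figures, not a quoted value):

# Rounded per-genome components quoted above (USD).
compute, egress, storage = 17.60, 8.70, 2.20
total_usd = compute + egress + storage   # 28.50
total_aud = total_usd / 0.7635           # implied 2017 AUD/USD rate of ~0.7635
print(f"${total_usd:.2f} USD = ${total_aud:.2f} AUD per genome")  # $28.50 USD = $37.33 AUD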
Further Evolution
Our Roadmap for Refining the Bioinformatics Pipeline

1. Complete – Containerised AWS Pipeline
The pipeline works end-to-end, is optimised for the cloud, and is able to produce a high-quality result within the budget available.

2. Phase 2 – Adjust Data Metrics
We can reduce the frequency of out-of-memory errors with some more work on the pipeline's optimisation, to account for the spikes in RAM usage we observed.

3. Phase 3 – More Instance Types
We'd like to optimise container versions for multiple instance types, and use Spot Fleet to deploy based on what's economical/available (see the sketch after this list). Our pipeline's output ought to scale linearly with CPU cores, given sufficient RAM. We'll also investigate the use of Amazon EBS over ephemeral storage.

4. Phase 4 – Optimise NAT Instances
Switch over to a Squid proxy model over our current NAT instances, or simply find a cheaper instance type with higher throughput. Alternatively, we can preload all raw data from NCI to Amazon S3 (over Direct Connect) and likely avoid needing (expensive) NAT instances altogether.

5. Alternative – AWS Batch
Amazon's new service, AWS Batch, was not available in Sydney when we built this pipeline, and we are very interested in its potential to bring even more gains.
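A minimal sketch of the Spot Fleet idea from Phase 3: request a pool of capacity across several instance types and let AWS fill it with whatever is cheapest and available (the IAM role ARN, AMI ID and instance-type mix are hypothetical):

import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",  # hypothetical
        "AllocationStrategy": "lowestPrice",
        "TargetCapacity": 150,
        "SpotPrice": "2.90",  # same ceiling as the Phase 1 max bid
        "LaunchSpecifications": [
            {"ImageId": "ami-12345678", "InstanceType": itype}  # hypothetical ECS-ready AMI
            for itype in ("c3.8xlarge", "r3.8xlarge", "c4.8xlarge")
        ],
    }
)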
Special Thanks

• Our sponsors and collaborators (shown at left).
  – Our sequencing staff at Garvan and Genome.One.
• NSW OHMR
  – For funding the research study.
• Aspree
  – For providing samples.
• The 45 and Up Study
  – For providing samples.
• Amazon Web Services
  – For Jamie and Ben's valuable assistance completing this pilot study.
  – For defraying some of the costs of developing the pipeline, and of data egress for the study.