VariantSpark on AWS

Lynn Langit for CSIRO Bioinformatics

Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
David Levy
Dan Andrews
Kaitao Lai, PhD
Arash Bayat, PhD
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Jim Counts
Matthew Jones
Natalie Twine, PhD
Prabha Pillay
Transformational Bioinformatics Team
www.csiro.au
Denis C. Bauer | @allPowerde

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4

VariantSpark works with VCF Data
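As a rough illustration of what "working with VCF data" involves, the sketch below turns the genotype (GT) fields of a few VCF lines into a variants-by-samples matrix of alternate-allele counts. It is not VariantSpark's loader; the function names and the choice to count missing calls as 0 are assumptions made for this example.

```python
def genotype_to_dosage(gt: str) -> int:
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Missing alleles ('.') are counted as 0, a simplifying assumption.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))


def parse_vcf_lines(lines):
    """Return (samples, variant_ids, matrix); matrix[i][j] is the dosage
    of variant i for sample j."""
    samples, variant_ids, matrix = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos = fields[0], fields[1]
        gt_index = fields[8].split(":").index("GT")
        variant_ids.append(f"{chrom}_{pos}")
        matrix.append(
            [genotype_to_dosage(s.split(":")[gt_index]) for s in fields[9:]]
        )
    return samples, variant_ids, matrix


# Tiny hand-written example, not real data.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "2\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t./.\t1|1",
]
samples, variant_ids, matrix = parse_vcf_lines(example)
```

Each row of the matrix is one variant (one feature) and each column one sample, which is the "wide" shape the random forest slides refer to.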

Supervised ML: Wide Random Forests

Custom Splits
• Horizontal
• Vertical
Gini Scoring
• Local
• Global
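The slide names Gini scoring without defining it. The sketch below shows the standard Gini impurity and the impurity reduction of one candidate split on a genotype value. It is the textbook definition, not VariantSpark's implementation, and it does not model the local/global distinction on the slide; the function names and the toy data are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(genotypes, labels, threshold):
    """Impurity reduction when samples are split on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Toy data: genotype dosage (0, 1, 2) for one variant, binary phenotype.
genotypes = [0, 0, 1, 2, 2, 1]
labels = [0, 0, 0, 1, 1, 1]
gain = split_gain(genotypes, labels, threshold=1)
```

A tree builder would evaluate this gain for many candidate variants and keep the split with the largest reduction.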

Performance – Faster and More Accurate
VariantSpark can scale to 100% of the genome
[Figure: accuracy (low to high) plotted against speed (low to high)]

Scaling to 50M variables & 10K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~$8620.00
• Yarn Cluster
• 12 workers
• 16 x Intel Xeon E5-2660 @ 2.20 GHz CPUs
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6 GB per executor (0.75 TB total)
• Synthetic dataset
Whole Genome Range
GWAS Range
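The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below confirms the total executor memory and compares the ratio of the two AWS cost estimates with the ratio of the two runtime ranges. The slide does not say how the costs were derived, so treating cost as proportional to runtime is an assumption of this example.

```python
# Memory: 128 executors at 6 GB each, expressed in TB (1 TB = 1024 GB).
executors = 128
gb_per_executor = 6
total_tb = executors * gb_per_executor / 1024

# Cost and runtime figures as quoted on the slide.
low_cost_usd, high_cost_usd = 215.50, 8620.00
low_hours = (5, 50)
high_hours = (200, 2000)

cost_ratio = high_cost_usd / low_cost_usd
runtime_ratio_min = high_hours[0] / low_hours[0]
runtime_ratio_max = high_hours[1] / low_hours[1]

print(f"total executor memory: {total_tb} TB")
print(f"cost ratio: {cost_ratio:.1f}")
print(f"runtime ratios: {runtime_ratio_min:.1f}, {runtime_ratio_max:.1f}")
```

All three ratios come out at 40, which is consistent with the cost estimates scaling linearly with cluster hours.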

Building a Cloud Data Pipeline
Spark
• IaaS, PaaS, SaaS Vendors
• Alibaba, AWS, GCP…

Moving to the Cloud – SaaS / Databricks

Synthetic Phenotype: Who is a Bondi Hipster?

Hello VariantSpark via Hipster-Index
BUILDS Community

Try it out SaaS: VariantSpark Notebook

Transformational Bioinformatics | @allPowerde

Spark Server Cluster Pipeline Pattern
Jupyter Notebook Data Lake

Try it out: VariantSpark on AWS EMR

Try it out: VariantSpark on AWS EKS

Apache Spark 2.3+ with Kubernetes
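Spark 2.3 added a native Kubernetes scheduler, reached through `spark-submit` with a `k8s://` master URL. The sketch below only assembles such a command as an argument list. The API server address, container image, jar path, and main class are hypothetical placeholders, not values from the talk or from VariantSpark.

```python
def spark_submit_k8s_args(api_server, image, app_jar, main_class, executors=2):
    """Build a spark-submit argument list targeting a Kubernetes cluster.

    All argument values are supplied by the caller; nothing here is
    specific to VariantSpark.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes API server URL
        "--deploy-mode", "cluster",          # driver runs in a pod
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Hypothetical placeholder values for illustration only.
args = spark_submit_k8s_args(
    api_server="https://example.invalid:6443",
    image="example/spark:latest",
    app_jar="local:///opt/app/app.jar",
    main_class="com.example.Main",
    executors=4,
)
print(" ".join(args))
```

On EKS, the API server URL would be the cluster endpoint, and the image would need to be available from a registry the cluster can pull from.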

CSIRO Team Trains Other Researchers
Team creates reproducible cloud environments
• AWS CloudFormation Templates for EMR
• Setup screencasts for Databricks and AWS
• Scripts and recommended parameters

Next Steps
• VariantSpark on GCP
• Use GCP Dataproc – compare to AWS EMR
• Use GCP GKE – compare to AWS EKS (K8s)
• VariantSpark on Terra.bio
• First optimize container for GCP raw compute
• Write WDL for VariantSpark tool/workflow
• Publish on Dockstore as Tool/Workflow
• Publish example Jupyter notebooks
• Publish example Terra.bio VariantSpark workflow

Bioinformatics Tools on AWS
Lynn Langit for CSIRO Bioinformatics
