Lynn Langit for CSIRO Bioinformatics
VariantSpark on AWS
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
David Levy
Dan Andrews
Kaitao Lai, PhD
Arash Bayat, PhD
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Jim Counts
Matthew Jones
Natalie Twine, PhD
Prabha Pillay
Transformational Bioinformatics Team
www.csiro.au
Denis C. Bauer | @allPowerde
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
VariantSpark works with VCF Data
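To make the VCF input concrete, here is a minimal sketch of turning one VCF data line into per-sample alternate-allele counts (0, 1 or 2), which is a common numeric encoding for machine learning on genotypes. The column layout (nine fixed columns followed by one column per sample, with GT listed in the FORMAT column) follows the VCF specification. The function names and the example record are illustrative, and whether VariantSpark encodes genotypes exactly this way internally is an assumption.

```python
def alt_allele_count(gt):
    """Count non-reference alleles in a VCF GT value such as '0/1' or '1|1'.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def encode_vcf_line(line):
    """Return ((chrom, pos), [count per sample]) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    # Column 9 (index 8) is FORMAT; it tells us where GT sits in each sample field.
    gt_index = fields[8].split(":").index("GT")
    counts = [alt_allele_count(sample.split(":")[gt_index]) for sample in fields[9:]]
    return (chrom, pos), counts


# Made-up record with four samples (illustrative only).
example = "\t".join(
    ["1", "12345", "rs1", "A", "G", "50", "PASS", ".", "GT:DP",
     "0/0:12", "0|1:9", "1/1:15", "./.:0"]
)
variant, counts = encode_vcf_line(example)
print(variant, counts)
```

Each variant becomes one feature and each sample one observation, which is why whole-genome data produces very wide tables.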
Supervised ML: Wide Random Forests
Custom Splits
• Horizontal
• Vertical
Gini Scoring
• Local
• Global
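The Gini scoring mentioned above can be sketched as follows. Gini impurity for a set of labels is 1 minus the sum of squared class proportions, and a candidate split is scored by the size-weighted impurity of its two sides. The "local" then "global" step is shown here as each partition of variables proposing its best split and a final step picking the lowest score overall. That reading of the slide, and every name in the code, is an assumption rather than VariantSpark's actual implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k; 0.0 for an empty set."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def local_best_split(variants, labels, thresholds=(0, 1)):
    """Best (name, threshold, score) among the variants held by one partition."""
    best = None
    for name, values in variants.items():
        for t in thresholds:
            score = split_gini(values, labels, t)
            if best is None or score < best[2]:
                best = (name, t, score)
    return best


def global_best_split(partitions, labels):
    """Pick the lowest-scoring split among the local winners."""
    local_bests = [local_best_split(p, labels) for p in partitions if p]
    return min(local_bests, key=lambda b: b[2])


# Toy data: four samples, two partitions holding one variant each (illustrative).
labels = [0, 0, 1, 1]
partitions = [{"v1": [0, 1, 0, 1]}, {"v2": [0, 0, 2, 2]}]
best = global_best_split(partitions, labels)
print(best)
```

In this toy case variant v2 separates the two classes perfectly, so its split has a weighted impurity of 0.0 and wins globally.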
Performance – Faster and More Accurate
VariantSpark can scale to 100% of the genome
[Figure: axes are Accuracy (low to high) and Speed (low to high)]
Scaling to 50M variables & 10K samples
100K trees: 5 to 50 h
AWS: ~$215.50
100K trees: 200 to 2,000 h
AWS: ~$8,620.00
• Yarn Cluster
• 12 workers
• 16 x Intel CPUs
• Xeon E5-2660@2.20GHz
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6 GB per executor (0.75 TB total)
• Synthetic dataset
[Figure labels: Whole Genome Range, GWAS Range]
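The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted above. The implied hourly rate is derived from them and is not stated in the slides, so treat it as an inference.

```python
# Memory: 128 executors at 6 GB each.
executors = 128
gb_per_executor = 6
total_tb = executors * gb_per_executor / 1024  # GB to TB

# Two runtime estimates for 100K trees, with their quoted AWS costs.
fast_hours = (5, 50)
slow_hours = (200, 2000)
fast_cost = 215.50
slow_cost = 8620.00

runtime_ratio_low = slow_hours[0] / fast_hours[0]
runtime_ratio_high = slow_hours[1] / fast_hours[1]
cost_ratio = slow_cost / fast_cost

# Inference: both costs are consistent with the upper runtime bound
# at the same hourly rate.
rate_fast = fast_cost / fast_hours[1]
rate_slow = slow_cost / slow_hours[1]

print(f"total executor memory: {total_tb} TB")
print(f"runtime ratio: {runtime_ratio_low}x to {runtime_ratio_high}x")
print(f"cost ratio: {cost_ratio}x")
print(f"implied rate: ${rate_fast:.2f}/h and ${rate_slow:.2f}/h")
```

Both the runtime ratio and the cost ratio come out at 40, so the two estimates differ by the same factor in time and in money.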
Building a Cloud Data Pipeline
Spark
• IaaS, PaaS, SaaS Vendors
• Alibaba, AWS, GCP…
True Cloud Costs
Cloud Compute Services Choices
Moving to the Cloud – SaaS / Databricks
Synthetic Phenotype: Who is a Bondi Hipster?
Example Notebook: Databricks
Hello VariantSpark via Hipster-Index
BUILDS Community
Try it out SaaS: VariantSpark Notebook
Transformational Bioinformatics | @allPowerde
Spark Server Cluster Pipeline Pattern
Jupyter Notebook Data Lake
Configuration Challenges
Try it out: VariantSpark on AWS EMR
Try it out: VariantSpark on AWS EKS
Apache Spark 2.3+ with Kubernetes
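Since Spark 2.3, spark-submit can target a Kubernetes API server directly. The sketch below assembles such a command as an argument list. The flags (--master with a k8s:// prefix, --deploy-mode cluster, --class, and the spark.executor.instances and spark.kubernetes.container.image settings) come from the Spark "Running on Kubernetes" documentation. Every value passed in (API server address, image name, main class, jar path) is a hypothetical placeholder, not VariantSpark's real configuration.

```python
def build_spark_submit(api_server, image, executors, main_class, app_jar, app_args=()):
    """Assemble a spark-submit command line for a Kubernetes cluster."""
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "variantspark-sketch",  # hypothetical job name
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]
    cmd.extend(app_args)
    return cmd


# All values below are placeholders for illustration.
cmd = build_spark_submit(
    api_server="https://example.invalid:443",
    image="example/spark-image:latest",
    executors=4,
    main_class="example.MainClass",
    app_jar="local:///opt/app/app.jar",
)
print(" ".join(cmd))
```

Building the command as a list rather than a single string keeps it safe to pass to subprocess.run without shell quoting issues.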
CSIRO Team Trains Other Researchers
Team creates reproducible cloud environments
• AWS CloudFormation Templates for EMR
• Setup screencasts for Databricks and AWS
• Scripts and recommended parameters
Next Steps
• VariantSpark on GCP
• Use GCP Dataproc – compare to AWS EMR
• Use GCP GKE – compare to AWS EKS (K8s)
• VariantSpark on Terra.bio
• First optimize container for GCP raw compute
• Write WDL for VariantSpark tool/workflow
• Publish on Dockstore as Tool/Workflow
• Publish example Jupyter notebooks
• Publish example Terra.bio VariantSpark workflow
Bioinformatics Tools on AWS
Lynn Langit for CSIRO Bioinformatics

Editor's Notes

  1. VariantSpark - https://bioinformatics.csiro.au/variantspark GT-Scan Suite - https://gt-scan.csiro.au/
  2. https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2269-7
  3. http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/
  4. https://github.com/aehrc/VariantSpark
  5. https://academics.cloud.databricks.com/#notebook/170398/command/170419
  6. Photo from - http://www.drjasonfox.com/
  7. https://aehrc.github.io/VariantSpark/notebook-examples/VariantSpark_HipsterIndex.html
  8. https://docs.databricks.com/applications/genomics/variant-spark.html
  9. https://aehrc.com/als-genomic-research/
  10. Quickly access a managed Spark cluster - AWS EC2 / spot instances Link to your data and perform whole genome analysis in real-time
  11. https://medium.com/@lynnlangit/scaling-custom-machine-learning-on-aws-part-2-emr-6dfc3cd91a1f
  12. https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html https://github.com/jamesrcounts/VariantSpark/tree/spark2.3/kubernetes