VariantSpark on AWS
- 2. Denis Bauer,
PhD
Oscar Luo,
PhD
Rob Dunne,
PhD
Piotr Szul
Aidan O’Brien
Laurence Wilson,
PhD
Adrian White
Andy Hindmarch
David Levy
Dan Andrews
Kaitao Lai,
PhD
Arash Bayat,
PhD
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Jim Counts
Matthew Jones
Natalie Twine,
PhD
Prabha Pillay
Transformational Bioinformatics Team
www.csiro.au
Denis C. Bauer | @allPowerde
- 7. Performance – Faster and More Accurate
VariantSpark can scale to 100% of the genome
[Figure: methods plotted by accuracy (low to high) and speed (low to high)]
- 8. Scaling to 50M variables & 10K samples
100K trees: 5–50 h, AWS: ~$215.50
100K trees: 200–2000 h, AWS: ~$8,620.00
• Yarn Cluster
• 12 workers
• 16 x Intel CPUs
• Xeon E5-2660@2.20GHz
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6 GB per executor (0.75 TB in total)
• Synthetic dataset
[Figure labels: Whole Genome Range, GWAS Range]
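The arithmetic behind these figures can be checked with a short Python sketch. The runtimes, costs, and executor settings are copied from the slide. The slide does not say which point of each runtime range the AWS estimate corresponds to, so the implied hourly rate is computed for both ends of the range as an assumption, and the run labels are placeholders.

```python
# Figures copied from the slide. The labels "smaller run" and "larger run"
# are placeholders; the slide does not name the two configurations here.
runs = {
    "smaller run": {"hours": (5, 50), "aws_cost_usd": 215.50},
    "larger run": {"hours": (200, 2000), "aws_cost_usd": 8620.00},
}


def implied_hourly_rates(hours, cost_usd):
    """Return the cluster cost per hour implied by the quoted total cost,
    first assuming it matches the upper runtime bound, then the lower."""
    low, high = hours
    return cost_usd / high, cost_usd / low


for name, run in runs.items():
    at_upper, at_lower = implied_hourly_rates(run["hours"], run["aws_cost_usd"])
    print(f"{name}: ${at_upper:.2f}/h at upper bound, ${at_lower:.2f}/h at lower bound")

# Both the runtime ranges and the costs differ by the same factor.
runtime_ratio = runs["larger run"]["hours"][0] / runs["smaller run"]["hours"][0]
cost_ratio = runs["larger run"]["aws_cost_usd"] / runs["smaller run"]["aws_cost_usd"]

# Aggregate executor memory: 128 executors with 6 GB each.
executors = 128
gb_per_executor = 6
total_memory_tb = executors * gb_per_executor / 1024
print(f"runtime ratio {runtime_ratio:.0f}x, cost ratio {cost_ratio:.0f}x, "
      f"executor memory {total_memory_tb} TB")
```

Whichever end of the range the estimate refers to, both rows imply the same hourly cluster rate, so the quoted cost grows in proportion to runtime.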
- 9. Building a Cloud Data Pipeline
Spark
• IaaS, PaaS, SaaS Vendors
• Alibaba, AWS, GCP…
- 25. CSIRO Team Trains Other Researchers
Team creates reproducible cloud environments
• AWS CloudFormation Templates for EMR
• Setup screencasts for Databricks and AWS
• Scripts and recommended parameters
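As an illustration of what a CloudFormation template for EMR can look like, the Python sketch below builds a minimal template as a dictionary and serialises it to JSON. It is not the team's actual template. The cluster name, release label, instance type, and worker count are hypothetical values chosen for the example, and the two role names are the EMR default roles, which are assumed to exist in the account.

```python
import json


def emr_cluster_template(name, release_label, instance_type, core_count):
    """Build a minimal CloudFormation template with one Spark-enabled EMR cluster.

    All argument values are supplied by the caller; nothing here is taken
    from the VariantSpark project's own templates.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative EMR cluster with Spark installed",
        "Resources": {
            "SparkCluster": {
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": name,
                    "ReleaseLabel": release_label,
                    "Applications": [{"Name": "Spark"}],
                    "Instances": {
                        "MasterInstanceGroup": {
                            "InstanceCount": 1,
                            "InstanceType": instance_type,
                        },
                        "CoreInstanceGroup": {
                            "InstanceCount": core_count,
                            "InstanceType": instance_type,
                        },
                    },
                    # Default EMR roles; assumed to exist in the account.
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "ServiceRole": "EMR_DefaultRole",
                },
            }
        },
    }


# Hypothetical example values.
template = emr_cluster_template("variantspark-demo", "emr-5.29.0", "r4.2xlarge", 12)
template_json = json.dumps(template, indent=2)
print(template_json)
```

Keeping the cluster definition in a template file means the same environment can be recreated on demand, which is the reproducibility point this slide makes.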
- 26. Next Steps
• VariantSpark on GCP
  • Use GCP Dataproc – compare to AWS EMR
  • Use GCP GKE – compare to AWS EKS (Kubernetes)
• VariantSpark on Terra.bio
  • First optimize container for GCP raw compute
  • Write WDL for VariantSpark tool/workflow
  • Publish on Dockstore as Tool/Workflow
  • Publish example Jupyter notebooks
  • Publish example Terra.bio VariantSpark workflow
Editor's Notes
- VariantSpark - https://bioinformatics.csiro.au/variantspark
GT-Scan Suite - https://gt-scan.csiro.au/
- https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2269-7
- http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/
- https://github.com/aehrc/VariantSpark
- https://academics.cloud.databricks.com/#notebook/170398/command/170419
- Photo from - http://www.drjasonfox.com/
- https://aehrc.github.io/VariantSpark/notebook-examples/VariantSpark_HipsterIndex.html
- https://docs.databricks.com/applications/genomics/variant-spark.html
- https://aehrc.com/als-genomic-research/
- Quickly access a managed Spark cluster - AWS EC2 / spot instances
Link to your data and perform whole genome analysis in real-time
- https://medium.com/@lynnlangit/scaling-custom-machine-learning-on-aws-part-2-emr-6dfc3cd91a1f
- https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
https://github.com/jamesrcounts/VariantSpark/tree/spark2.3/kubernetes