Everything Comes in 3's
Angel Pizarro
Director, ITMAT Bioinformatics Facility
University of Pennsylvania School of Medicine
Outline
This talk looks at the practical aspects of cloud computing. We will be diving into specific examples:
- 3 pillars of systems design
- 3 storage implementations
- 3 areas of bioinformatics, and how they are affected by clouds
- 3 interesting internal projects
"There are 2 hard problems in computer science: caching, naming, and off-by-1 errors."
Pillars of Systems Design
- Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.). Not discussed further, since this is the WHOLE POINT of cloud computing.
- Configuration: how to get a system up to the point where you can do something with it.
- Command and Control: how to tell the system what to do.
System Configuration with Chef
- Automatic installation of packages, service configuration and initialization
- Specifications use a real programming language with known behavior
- Brings the system to an idempotent state
http://opscode.com/chef/
http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
Chef Recipes & Cookbooks
- The specification for installing and configuring a system component
- Able to support more than one platform
- Has access to system-wide information: hostname, IP address, RAM, number of processors, etc.
- Contain templates, documentation, static files & assets
- Can define dependencies on other recipes
- Executed in order; execution stops at the first failure
Simple Recipe: Rsync
- Installs rsync on the system (a minimal sketch follows below)
- A metadata file states which platforms are supported
- Note that Chef is a Linux-centric system
- BUT, the wiki is messy
- Look at Chef Solo & Resources
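For illustration, a minimal rsync cookbook might look like the following sketch; the file paths and metadata values are assumptions, not the exact recipe shown on the slide:

    # cookbooks/rsync/recipes/default.rb -- install the rsync package
    package "rsync" do
      action :install
    end

    # cookbooks/rsync/metadata.rb -- declare which platforms the recipe supports
    maintainer  "Your Name"
    description "Installs rsync"
    version     "0.1.0"
    %w{ debian ubuntu centos redhat fedora }.each do |os|
      supports os
    end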
More Complex Recipe: Heartbeat
- Installs the heartbeat package
- Registers the service and specifies that it can be restarted and provides a status message
- Finally, it starts the service (see the sketch below)
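A sketch of what such a recipe might look like in Chef's resource DSL (not the exact recipe from the slide):

    # cookbooks/heartbeat/recipes/default.rb
    package "heartbeat"                               # install the heartbeat package

    service "heartbeat" do
      supports :restart => true, :status => true      # can be restarted and reports status
      action [ :enable, :start ]                      # register the service and start it
    end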
Command and Control
Traditional grid computing
- QSUB: SGE, PBS, Torque
- Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared executable & library locations
- Best for capability processes (e.g. MPI)
Map-Reduce is the new hotness
- Best for data-parallel processes
- Assumes loosely coupled, non-static components
- Job staging is a critical component
Map Reduce in a Nutshell
- Algorithm pioneered by Google for distributed data analysis
- Data-parallel analyses fit well into this model
- Split data, work on each part in parallel, then merge results
- Hadoop, Disco, CloudCrowd, …
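As a toy illustration of the split / parallel-map / merge idea (this is plain Ruby, not Hadoop, and the file name and motif are made up):

    # Split the data, work on each chunk (conceptually in parallel), then merge.
    lines  = File.readlines("reads.txt")
    chunks = lines.each_slice(1000).to_a                    # split

    partial_counts = chunks.map do |chunk|                  # "map" step
      chunk.count { |line| line.include?("GATTACA") }       # hypothetical per-chunk work
    end

    total = partial_counts.reduce(0) { |sum, n| sum + n }   # "reduce" step
    puts total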
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS
Define small scripts to:
- Split a FASTA file
- Run a BLAT search
The first script defines the inputs of the second.
Then:
- Submit the input FASTA to S3
- Start a master node as the central communication hub
- Start slave nodes, configured to ask for work from the master and save results back to S3
- Press "Play"
Workflow of Distributed BLAT
[Diagram] The PC submits the BLAT job, boots the master and slave nodes, and uploads the inputs to S3. An initial process splits the FASTA file; subsequent jobs BLAT the smaller files, and each slave saves its result to S3 as it goes. The PC downloads the results from S3.
Master Node => Resque
- Background job processing framework developed by GitHub
- Jobs are attached to a class from your application and stored as JSON
- Uses the Redis key-value store
- Simple front end for viewing job queue status and failed jobs
- Resque can invoke any class that has a class method perform() (sketch below)
http://github.com/defunkt/resque
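A minimal sketch of such a worker class, assuming the roll-your-own BLAT setup described above (the class, queue, and file names are hypothetical):

    require 'resque'

    class BlatJob
      @queue = :blat

      # Resque invokes this class method with the arguments that were enqueued.
      def self.perform(chunk_file)
        system("blat", "reference.2bit", chunk_file, "#{chunk_file}.psl") ||
          raise("blat failed on #{chunk_file}")
        # ...upload "#{chunk_file}.psl" to S3 here...
      end
    end

    # Enqueued by the master once the FASTA file has been split:
    # Resque.enqueue(BlatJob, "chunk_0.fa")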
The scripts
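The original slides showed the scripts themselves; as a rough reconstruction under stated assumptions (chunk size, file names, and the reference database are all made up), the splitter and the BLAT runner might look like this:

    # split_fasta.rb -- split input.fa into chunks of up to 500 records
    count, index = 0, 0
    out = File.open("chunk_0.fa", "w")
    File.foreach("input.fa") do |line|
      if line.start_with?(">")
        count += 1
        if count > 500
          out.close
          index += 1
          count = 1
          out = File.open("chunk_#{index}.fa", "w")
        end
      end
      out.write(line)
    end
    out.close

    # run_blat.rb -- BLAT one chunk against the reference, writing a PSL result
    chunk = ARGV.fetch(0)
    system("blat", "reference.2bit", chunk, "#{chunk}.psl") || abort("blat failed on #{chunk}")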
Storage in the Cloud: S3
- Permanent storage for your data
- Pay as you go for storage and data transfer
- Eliminates backups
- Pretty good CDN; able to hook into a better CDN SLA via CloudFront
- Can be slow at times: reports of 10-second delays, but the average response is around 300 ms
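For reference, pushing inputs to and pulling results from S3 can be scripted with the AWS SDK for Ruby (a newer gem than what was available at the time of this talk; bucket and key names are made up):

    require 'aws-sdk-s3'

    s3     = Aws::S3::Resource.new(region: "us-east-1")
    bucket = s3.bucket("my-blat-bucket")

    # Upload an input FASTA chunk, then later fetch the result back down.
    bucket.object("inputs/chunk_0.fa").upload_file("chunk_0.fa")
    bucket.object("results/chunk_0.psl").download_file("chunk_0.psl")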
S3 Costs
Storage 2: Distributed FS on EC2
- Hadoop HDFS, GigaSpaces, etc.
- Network latency may be an issue for traditional DFSs: Gluster, GPFS, etc.
- Tighter integration with the execution framework, better performance?
[Diagram: your data spread across EC2 node disks]
DFS on EC2 m1.xlarge Costs
* Does not take into account transmission fees or data redundancy. The final cost is probably >= S3.
Storage 3: Memory Grids
- "RAM is the new Disk"
- Application-level RAM clustering: Terracotta, GemStone GemFire, Oracle, Cisco, GigaSpaces
- Performance for capability jobs?
[Diagram: your data spread across the RAM of several EC2 nodes]
* There is also a "Disk is the new RAM" camp, where redundant disk is used to mitigate seek times on subsequent reads
Memory Grid Cost
Take-home message: unless your needs are small, you may be better off procuring bare-metal resources.
Cloud Influence on Bioinformatics
Computational Biology
- Algorithms will need to account for large I/O latency
- Statistical tests will need to account for incomplete information or incremental results
Software Engineering
- Built-for-the-cloud algorithms are popping up
- CloudBurst is a featured example in AWS Elastic MapReduce!
Application to Life Sciences
- Deploy ready-made images for use
- Cycle Computing, ViPDAC, others soon to follow
Algorithms need to be I/O centric
Incur a slightly higher computational burden to reduce I/O across non-optimal networks (P. Balaji, W. Feng, H. Lin 2008).
Some Internal Projects
Resource Manager
- Service for on-demand provisioning and release of EC2 nodes
- Utilizes Chef to define and apply roles (compute node, DB server, etc.)
- Terminates idle compute nodes at 52 minutes (sketch below)
Workflow Manager
- Defines and executes data analysis workflows
- Relies on the Resource Manager to provision nodes
- Once appropriate worker nodes are available, acts as the central work queue
RUM (RNA-Seq Ultimate Mapper)
- Map-Reduce RNA-Seq analysis pipeline
- Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
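Presumably the 52-minute cutoff exists because EC2 instances were billed by the whole hour, so an idle node is released just before the next billing hour begins. A rough sketch of that check (the node interface here is hypothetical, not the actual Resource Manager code):

    # Terminate a node only if it is idle and near the end of its billing hour.
    def minutes_into_billing_hour(launch_time)
      ((Time.now - launch_time) / 60).to_i % 60
    end

    def should_terminate?(node)
      node.idle? && minutes_into_billing_hour(node.launch_time) >= 52
    end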
Bowtie Alone
RUM (Bowtie + BLAT + processing)
Significantly increases the confidence of your data.
RUM Costs
- Computational cost: ~$100 - $200
- 6-8 hours per lane on m2.4xlarge ($2.40 / hour), roughly $14 - $19 of compute per lane
- Cost of reagents: ~$10,000
- Compute is about 1% of the total
Acknowledgements
Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser
NIH & UPENN for support
My Team: David Austin, Andrew Brader, Weichen Wu
Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s


Editor's Notes

Reference: P. Balaji, W. Feng, H. Lin. "Semantic-based Distributed I/O with the ParaMEDIC Framework." ACM/IEEE International Symposium on High-Performance Distributed Computing, April 2008. http://www.mpiblast.org/About/Publications