Everything Comes in 3's
Angel Pizarro
Director, ITMAT Bioinformatics Facility
University of Pennsylvania School of Medicine
Outline
This talk looks at the practical aspects of cloud computing. We will be diving into specific examples:
- 3 pillars of systems design
- 3 storage implementations
- 3 areas of bioinformatics, and how they are affected by clouds
- 3 interesting internal projects
"There are 2 hard problems in computer science: caching, naming, and off-by-1 errors."
Pillars of Systems Design
- Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.). Not discussed further, since this is the WHOLE POINT of cloud computing.
- Configuration: how to get a system up to the point where you can do something with it.
- Command and Control: how to tell the system what to do.
System Configuration with Chef
- Automatic installation of packages, service configuration and initialization
- Specifications use a real programming language with known behavior
- Brings the system to an idempotent state
http://opscode.com/chef/
http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
Chef Recipes & Cookbooks
- The specification for installing and configuring a system component
- Able to support more than one platform
- Has access to system-wide information: hostname, IP address, RAM, number of processors, etc.
- Contain templates, documentation, static files & assets
- Can define dependencies on other recipes
- Executed in order; execution stops at the first failure
Simple Recipe: Rsync
- Installs rsync on the system (a minimal sketch follows below)
- A metadata file states which platforms are supported
- Note that Chef is a Linux-centric system
- BUT, the wiki is messy
- Look at Chef Solo & Resources
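For illustration, a minimal rsync cookbook might look like the following sketch; the file paths and metadata values are assumptions, not the exact recipe shown on the slide:

    # cookbooks/rsync/recipes/default.rb -- install the rsync package
    package "rsync" do
      action :install
    end

    # cookbooks/rsync/metadata.rb -- declare which platforms the recipe supports
    maintainer  "Your Name"
    description "Installs rsync"
    version     "0.1.0"
    %w{ debian ubuntu centos redhat fedora }.each do |os|
      supports os
    end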
More Complex Recipe: Heartbeat
- Installs the heartbeat package
- Registers the service and specifies that it can be restarted and provides a status message
- Finally, it starts the service (see the sketch below)
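A sketch of what such a recipe might look like in Chef's resource DSL (not the exact recipe from the slide):

    # cookbooks/heartbeat/recipes/default.rb
    package "heartbeat"                               # install the heartbeat package

    service "heartbeat" do
      supports :restart => true, :status => true      # can be restarted and reports status
      action [ :enable, :start ]                      # register the service and start it
    end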
Command and Control
Traditional grid computing
- QSUB: SGE, PBS, Torque
- Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared executable & library locations
- Best for capability processes (e.g. MPI)
Map-Reduce is the new hotness
- Best for data-parallel processes
- Assumes loosely coupled, non-static components
- Job staging is a critical component
Map Reduce in a Nutshell
- Algorithm pioneered by Google for distributed data analysis
- Data-parallel analyses fit well into this model
- Split data, work on each part in parallel, then merge results
- Hadoop, Disco, CloudCrowd, …
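As a toy illustration of the split / parallel-map / merge idea (this is plain Ruby, not Hadoop, and the file name and motif are made up):

    # Split the data, work on each chunk (conceptually in parallel), then merge.
    lines  = File.readlines("reads.txt")
    chunks = lines.each_slice(1000).to_a                    # split

    partial_counts = chunks.map do |chunk|                  # "map" step
      chunk.count { |line| line.include?("GATTACA") }       # hypothetical per-chunk work
    end

    total = partial_counts.reduce(0) { |sum, n| sum + n }   # "reduce" step
    puts total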
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS
Define small scripts to:
- Split a FASTA file
- Run a BLAT search
The first script defines the inputs of the second.
Then:
- Submit the input FASTA to S3
- Start a master node as the central communication hub
- Start slave nodes, configured to ask for work from the master and save results back to S3
- Press "Play"
Workflow of Distributed BLAT
[Diagram] The PC submits the BLAT job, boots the master and slave nodes, and uploads the inputs to S3. An initial process splits the FASTA file; subsequent jobs BLAT the smaller files, and each slave saves its result to S3 as it goes. The PC downloads the results from S3.
Master Node => Resque
- Background job processing framework developed by GitHub
- Jobs are attached to a class from your application and stored as JSON
- Uses the Redis key-value store
- Simple front end for viewing job queue status and failed jobs
- Resque can invoke any class that has a class method perform() (sketch below)
http://github.com/defunkt/resque
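A minimal sketch of such a worker class, assuming the roll-your-own BLAT setup described above (the class, queue, and file names are hypothetical):

    require 'resque'

    class BlatJob
      @queue = :blat

      # Resque invokes this class method with the arguments that were enqueued.
      def self.perform(chunk_file)
        system("blat", "reference.2bit", chunk_file, "#{chunk_file}.psl") ||
          raise("blat failed on #{chunk_file}")
        # ...upload "#{chunk_file}.psl" to S3 here...
      end
    end

    # Enqueued by the master once the FASTA file has been split:
    # Resque.enqueue(BlatJob, "chunk_0.fa")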
The scripts
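The original slides showed the scripts themselves; as a rough reconstruction under stated assumptions (chunk size, file names, and the reference database are all made up), the splitter and the BLAT runner might look like this:

    # split_fasta.rb -- split input.fa into chunks of up to 500 records
    count, index = 0, 0
    out = File.open("chunk_0.fa", "w")
    File.foreach("input.fa") do |line|
      if line.start_with?(">")
        count += 1
        if count > 500
          out.close
          index += 1
          count = 1
          out = File.open("chunk_#{index}.fa", "w")
        end
      end
      out.write(line)
    end
    out.close

    # run_blat.rb -- BLAT one chunk against the reference, writing a PSL result
    chunk = ARGV.fetch(0)
    system("blat", "reference.2bit", chunk, "#{chunk}.psl") || abort("blat failed on #{chunk}")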
Storage in the Cloud: S3
- Permanent storage for your data
- Pay as you go for storage and data transfer
- Eliminates backups
- Pretty good CDN; able to hook into a better CDN SLA via CloudFront
- Can be slow at times: reports of 10-second delays, but the average response is around 300 ms
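For reference, pushing inputs to and pulling results from S3 can be scripted with the AWS SDK for Ruby (a newer gem than what was available at the time of this talk; bucket and key names are made up):

    require 'aws-sdk-s3'

    s3     = Aws::S3::Resource.new(region: "us-east-1")
    bucket = s3.bucket("my-blat-bucket")

    # Upload an input FASTA chunk, then later fetch the result back down.
    bucket.object("inputs/chunk_0.fa").upload_file("chunk_0.fa")
    bucket.object("results/chunk_0.psl").download_file("chunk_0.psl")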
S3 Costs
Storage 2: Distributed FS on EC2
- Hadoop HDFS, GigaSpaces, etc.
- Network latency may be an issue for traditional DFSs: Gluster, GPFS, etc.
- Tighter integration with the execution framework, better performance?
[Diagram: your data spread across EC2 node disks]
DFS on EC2 m1.xlarge Costs
* Does not take into account transmission fees or data redundancy. The final cost is probably >= S3.
Storage 3: Memory Grids
- "RAM is the new Disk"
- Application-level RAM clustering: Terracotta, GemStone GemFire, Oracle, Cisco, GigaSpaces
- Performance for capability jobs?
[Diagram: your data spread across the RAM of several EC2 nodes]
* There is also a "Disk is the new RAM" camp, where redundant disk is used to mitigate seek times on subsequent reads
Memory Grid Cost
Take-home message: unless your needs are small, you may be better off procuring bare-metal resources.
Cloud Influence on Bioinformatics
Computational Biology
- Algorithms will need to account for large I/O latency
- Statistical tests will need to account for incomplete information or incremental results
Software Engineering
- Built-for-the-cloud algorithms are popping up
- CloudBurst is a featured example in AWS Elastic MapReduce!
Application to Life Sciences
- Deploy ready-made images for use
- Cycle Computing, ViPDAC, others soon to follow
Algorithms need to be I/O centric
Incur a slightly higher computational burden to reduce I/O across non-optimal networks (P. Balaji, W. Feng, H. Lin 2008).
Some Internal Projects
Resource Manager
- Service for on-demand provisioning and release of EC2 nodes
- Utilizes Chef to define and apply roles (compute node, DB server, etc.)
- Terminates idle compute nodes at 52 minutes (sketch below)
Workflow Manager
- Defines and executes data analysis workflows
- Relies on the Resource Manager to provision nodes
- Once appropriate worker nodes are available, acts as the central work queue
RUM (RNA-Seq Ultimate Mapper)
- Map-Reduce RNA-Seq analysis pipeline
- Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
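Presumably the 52-minute cutoff exists because EC2 instances were billed by the whole hour, so an idle node is released just before the next billing hour begins. A rough sketch of that check (the node interface here is hypothetical, not the actual Resource Manager code):

    # Terminate a node only if it is idle and near the end of its billing hour.
    def minutes_into_billing_hour(launch_time)
      ((Time.now - launch_time) / 60).to_i % 60
    end

    def should_terminate?(node)
      node.idle? && minutes_into_billing_hour(node.launch_time) >= 52
    end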
Bowtie Alone
RUM (Bowtie + BLAT + processing)
Significantly increases the confidence of your data.
RUM Costs
- Computational cost: ~$100 - $200
- 6-8 hours per lane on m2.4xlarge ($2.40 / hour), roughly $14 - $19 of compute per lane
- Cost of reagents: ~$10,000
- Compute is about 1% of the total
Acknowledgements
Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser
NIH & UPENN for support
My Team: David Austin, Andrew Brader, Weichen Wu
Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s


Editor's Notes

Reference: P. Balaji, W. Feng, H. Lin. "Semantic-based Distributed I/O with the ParaMEDIC Framework." ACM/IEEE International Symposium on High-Performance Distributed Computing, April 2008. http://www.mpiblast.org/About/Publications