The document discusses Internet2, an advanced networking consortium that operates a 15,000-mile fiber-optic network for research and education. It provides very-high-speed connectivity and collaboration technologies to facilitate large-scale data sharing and frictionless research. Examples are given of life-sciences projects using Internet2's network for genomic research, and of agricultural applications involving terabytes of satellite and sensor data. The network is expanding to include cloud computing resources and supercomputing centers to enable global-scale distributed scientific computing and collaboration.
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform will create a regional "Big Data Freeway System" along the West Coast to support science. It will connect major research institutions with high-speed optical networks, allowing them to share vast amounts of data and computational resources. This will enable new forms of collaborative, data-intensive research for fields like particle physics, astronomy, biomedicine, and earth sciences. The first phase aims to establish a basic networked infrastructure, with later phases advancing capabilities to 100Gbps and beyond with security and distributed technologies.
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
Building the Pacific Research Platform: Supernetworks for Big Data Science
The document summarizes Dr. Larry Smarr's presentation on building the Pacific Research Platform (PRP) to enable big data science across research universities on the West Coast. The PRP provides 100-1000 times more bandwidth than today's internet to support research fields from particle physics to climate change. In under 2 years, the prototype PRP has connected researchers and datasets across California through optical networks and is now expanding nationally and globally. The next steps involve adding machine learning capabilities to the PRP through GPU clusters to enable new discoveries from massive datasets.
Berkeley Cloud Computing Meetup, May 2020
The Pacific Research Platform (PRP) is a high-bandwidth global private "cloud" connected to commercial clouds that provides researchers with distributed computing resources. It links Science DMZs at universities across California and beyond using a high-performance network. The PRP utilizes Data Transfer Nodes called FIONAs to transfer data at near full network speeds. It has adopted Kubernetes to orchestrate software containers across its resources. The PRP provides petabytes of distributed storage and hundreds of GPUs for machine learning. It allows researchers to perform data-intensive science across multiple universities much faster than possible individually.
Cyberenvironments integrate shared and custom cyberinfrastructure resources into a process-oriented framework to support scientific communities and allow researchers to focus on their work rather than managing infrastructure. They enable more complex multi-disciplinary challenges to be tackled through enhanced knowledge production and application. Key challenges include coordinating distributed resources and users without centralization and evolving systems rapidly to keep pace with advancing science.
The document provides an overview of the development of the NIH Data Commons. It discusses factors driving the need for a data commons, including large amounts of data being generated and increased support for data sharing. It outlines the goals of making data findable, accessible, interoperable and reusable. Several pilots are exploring the feasibility of the commons framework, including placing large datasets in the cloud and developing indexing methods. Considerations in fully realizing the commons are also discussed, such as standards, discoverability, policies and incentives.
EMBL Australian Bioinformatics Resource AHM - Data Commons
This document discusses the development of the NIH Data Commons, which aims to create a shared framework and infrastructure for biomedical data. It notes the increasing amounts of data being generated and the need for data sharing and interoperability. The Data Commons framework treats data, tools, and publications as digital objects that are findable, accessible, interoperable and reusable. Current pilots include deploying reference datasets in the cloud, indexing data and tools, and a credits system for cloud resources. Challenges discussed include metrics, costs, standards, incentives and sustainability. The framework's relevance for supporting open data in Australia is also addressed.
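The "digital object" idea behind the Data Commons framework can be illustrated with a toy registry: every dataset, tool, or publication gets a persistent identifier plus metadata, so the commons can make it findable and resolvable. This is a conceptual sketch only, not NIH software; all names, fields, and locations below are made up for illustration.

```python
# Toy sketch (not NIH software) of the FAIR "digital object" idea:
# each object gets a persistent identifier plus metadata, and the
# commons index makes objects findable by metadata queries.
import hashlib

class Commons:
    def __init__(self):
        self.index = {}

    def register(self, kind, name, location, **metadata):
        # Derive a stable identifier from the object's descriptive fields.
        digest = hashlib.sha256(f"{kind}:{name}".encode()).hexdigest()[:12]
        obj_id = f"do:{digest}"
        self.index[obj_id] = {"kind": kind, "name": name,
                              "location": location, **metadata}
        return obj_id

    def find(self, **criteria):
        """Findability: match registered objects on any metadata fields."""
        return [obj_id for obj_id, meta in self.index.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

commons = Commons()
ref = commons.register("dataset", "example-genomes-v1",
                       "s3://example-bucket/genomes", assembly="GRCh38")
print(commons.find(kind="dataset", assembly="GRCh38") == [ref])  # → True
```

Registering once and querying by metadata is the point: accessibility and reuse follow from every object being resolvable through the same index.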
Global Research Platforms: Past, Present, Future
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise boosts blood flow, releases endorphins, and promotes changes in the brain which help regulate emotions and stress levels.
Peering The Pacific Research Platform With The Great Plains Network
The Pacific Research Platform (PRP) connects research institutions across the western United States with high-speed networks to enable data-intensive science collaborations. Key points:
- The PRP connects 15 campuses across California and links to the Great Plains Network, allowing researchers to access remote supercomputers, share large datasets, and collaborate on projects like analyzing data from the Large Hadron Collider.
- The PRP utilizes Science DMZ architectures with dedicated data transfer nodes called FIONAs to achieve high-speed transfer of large files. Kubernetes is used to manage distributed storage and computing resources.
- Early applications include distributed climate modeling, wildfire science, plankton imaging, and cancer genomics. The PR
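The FIONA approach above rests on a simple idea: a single stream rarely fills a fast pipe, so data transfer nodes move a large file over many parallel streams. The sketch below imitates that locally with ranged reads and writes over threads; it is a conceptual illustration, not PRP or FIONA code, and real deployments use tuned WAN transfer tools rather than local file copies.

```python
# Conceptual sketch (not FIONA code): imitate a data transfer node's
# parallel streams by copying a file in several byte ranges at once.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def copy_range(src_fd, dst_fd, offset, length):
    """Copy one byte range, as a single 'stream' would."""
    data = os.pread(src_fd, length, offset)
    os.pwrite(dst_fd, data, offset)
    return len(data)

def parallel_copy(src_path, dst_path, streams=4):
    size = os.path.getsize(src_path)
    chunk = -(-size // streams)  # ceiling division
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.ftruncate(dst_fd, size)  # pre-size so ranges can land anywhere
        ranges = [(i * chunk, min(chunk, size - i * chunk))
                  for i in range(streams) if i * chunk < size]
        with ThreadPoolExecutor(max_workers=streams) as pool:
            copied = sum(pool.map(lambda r: copy_range(src_fd, dst_fd, *r),
                                  ranges))
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return copied

with tempfile.TemporaryDirectory() as d:
    src, dst = os.path.join(d, "big.bin"), os.path.join(d, "copy.bin")
    payload = os.urandom(1_000_003)  # odd size exercises the final chunk
    with open(src, "wb") as f:
        f.write(payload)
    copied = parallel_copy(src, dst, streams=4)
    with open(dst, "rb") as f:
        assert f.read() == payload
    print("copied", copied, "bytes over 4 streams")
```

Over a real long-haul network, each range would ride its own TCP stream, which is how a Science DMZ node keeps a 10-100 Gbps link busy.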
Pacific Research Platform Science Drivers
The document discusses the vision and progress of the Pacific Research Platform (PRP) in creating a "big data freeway" across the West Coast to enable data-intensive science. It outlines how the PRP builds on previous NSF and DOE networking investments to provide dedicated high-performance computing resources, like GPU clusters and Jupyter hubs, connected by high-speed networks at multiple universities. Several science driver teams are highlighted, including particle physics, astronomy, microbiology, earth sciences, and visualization, that will leverage PRP resources for large-scale collaborative data analysis projects.
Looking Back, Looking Forward NSF CI Funding 1985-2025
This document provides an overview of the development of national research platforms (NRPs) from 1985 to the present, with a focus on the Pacific Research Platform (PRP). It describes the evolution of the PRP from early NSF-funded supercomputing centers to today's distributed cyberinfrastructure utilizing optical networking, containers, Kubernetes, and distributed storage. The PRP now connects over 15 universities across the US and internationally to enable data-intensive science and machine learning applications across multiple domains. Going forward, the document discusses plans to further integrate regional networks and partner with new NSF-funded initiatives to develop the next generation of NRPs through 2025.
Massive-Scale Analytics Applied to Real-World Problems
In this deck from PASC18, David Bader from Georgia Tech presents: Massive-Scale Analytics Applied to Real-World Problems.
"Emerging real-world graph problems include: detecting and preventing disease in human populations; revealing community structure in large social networks; and improving the resilience of the electric power grid. Unlike traditional applications in computational science and engineering, solving these social problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for research on scalable algorithms and development of frameworks for solving these real-world problems on high performance computers, and for improved models that capture the noise and bias inherent in the torrential data streams. In this talk, Bader will discuss the opportunities and challenges in massive data-intensive computing for applications in social sciences, physical sciences, and engineering."
Watch the video: https://wp.me/p3RLHQ-iPk
Learn more: https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
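A toy instance of the sparse-graph analytics Bader describes: finding the connected groupings in a graph stored as adjacency lists, the representation typically used when social-network data is sparse and lacks locality. Real community detection uses far richer algorithms; this is only a minimal, self-contained illustration of the data layout and traversal pattern.

```python
# Toy illustration (not from the talk): connected components of a sparse
# graph held as adjacency lists, traversed breadth-first.
from collections import deque, defaultdict

def connected_components(edges):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(sorted(comp))
    return sorted(components)

edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]
print(connected_components(edges))  # → [['a', 'b', 'c'], ['d', 'e'], ['f']]
```

The same adjacency-list skeleton underlies the scalable frameworks the talk argues for; the hard part at massive scale is partitioning this structure across machines despite its lack of locality.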
Talk at the DOE CIO's Big Data Tech Summit -- my latest take on the why and wherefore of software as a service (SaaS) for science, and the Globus Online work we are doing, with various DOE examples.
The Clinical Decision Support Consortium aims to advance clinical decision support through several research projects. The Consortium will assess, define, demonstrate, and evaluate best practices for clinical decision support across multiple healthcare settings and electronic health record platforms. Several research teams will focus on knowledge management, clinical guideline translation, decision support content development and delivery, and evaluating demonstrations of decision support. The goal is to improve healthcare quality by facilitating widespread use of evidence-based clinical decision support.
This document summarizes a presentation about providing next-generation sequencing analysis capabilities using Globus Genomics. It outlines challenges with current manual approaches to sequencing data analysis, including difficulties moving large datasets between locations and maintaining complex analysis scripts. The presentation introduces Globus Genomics, which uses Globus data transfer services integrated with Galaxy to provide a workflow-based system for sequencing analysis without requiring local installation or configuration. Key benefits include on-demand access to scalable cloud resources, ability to easily modify and reuse analysis workflows, and integration with data sources. The system aims to accelerate genomic research by automating and simplifying analysis.
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
1) Globus Genomics addresses challenges in sequencing analysis by providing a platform that integrates data transfer via Globus Online, workflow management in Galaxy, and scalable compute resources in AWS.
2) An example collaboration with the Dobyns Lab saw over a 10x speedup in exome data analysis by replacing a manual process with Globus Genomics.
3) Globus Genomics leverages XSEDE services like Globus Transfer and Nexus while integrating additional resources like sequencing centers and cloud computing, in order to reduce the costs and complexities of genomic research for communities not traditionally using advanced cyberinfrastructure.
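The reusable-workflow idea behind points 1-3 can be sketched as a small DAG runner: analysis steps declare their prerequisites, and the runner executes them in dependency order. The step names and stand-in functions below are illustrative only, not the real Galaxy pipelines that Globus Genomics orchestrates.

```python
# Minimal sketch of a reusable analysis workflow as a DAG of named steps,
# in the spirit of the Galaxy pipelines Globus Genomics automates.
# Step names and bodies are illustrative, not the real pipeline.
from graphlib import TopologicalSorter

def run_workflow(steps, deps, data):
    """steps: name -> callable(data) -> data; deps: name -> {prerequisites}."""
    for name in TopologicalSorter(deps).static_order():
        data = steps[name](data)
    return data

steps = {
    "align": lambda d: d + ["aligned"],
    "dedup": lambda d: d + ["deduplicated"],
    "call":  lambda d: d + ["variants_called"],
}
deps = {"align": set(), "dedup": {"align"}, "call": {"dedup"}}
print(run_workflow(steps, deps, ["reads"]))
# → ['reads', 'aligned', 'deduplicated', 'variants_called']
```

Because the workflow is data, not a hand-maintained script, swapping a step or re-running on a new exome batch is an edit to the DAG rather than a rewrite, which is where the reported 10x speedup over manual processes comes from.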
This poster was prepared for the upcoming BD2K All Hands meeting. We present the BDDS Knowledge Discovery platform as applied to understanding the role of amyloid burden in Alzheimer's and Parkinson's disease.
Globus Genomics provides tools and services to help researchers manage and analyze large genomic datasets. It uses Globus data management tools to securely transfer data between institutions. Researchers can then run analysis workflows on cloud compute resources through Galaxy interfaces. This enables researchers to assemble diverse datasets, apply multiple computational models, and publish results for others to discover, validate, and reuse. Examples show researchers using Globus Genomics to process petabytes of sequencing data and perform genome-wide analysis across many institutions. The goal is to accelerate scientific discovery by making it easier for researchers to find "needles in haystacks" through data-intensive computational approaches.
Presentation offered at http://www.smartiotlondon.com/2016-seminar-programme/big-data-and-genomics-the-future-of-genetic-engineering
Bioinformatics: the marriage of biology and Big Data, and how this will change the way we perform genetic engineering.
This presentation explains how our company (Alkol Biotech) compares DNA strands, focusing on its development of the “EunergyCane” sugarcane crop: Europe’s only sugarcane variety. It explains the tools the company uses, such as Big Data, Machine Learning, and Fast Sequencing.
Learning Outcomes:
1 – Learn about the new field of “Bioinformatics”, the marriage of IT and biology
2 – Learn how Big Data is changing the game in genetic engineering
3 – Learn what tools are used and what results to expect
ADAS&ME presentation @ the SCOUT project expert workshop (22-02-2017, Brussels)
The document summarizes the ADAS&ME project which aims to develop advanced driver assistance systems that can automatically transfer control between the vehicle and driver based on the driver's state and environmental context. The project has a budget of 9.5 million euros over 42 months and involves companies and research institutions developing technologies like high-definition maps, vehicle connectivity, and systems for monitoring driver state and handling non-reactive drivers. Several use cases are outlined focusing on commercial vehicles and motorcycles, with scenarios presented for smooth transitions between automated and manual driving and handling emergencies if the driver does not respond.
The document discusses effective ways to use Ansible including setting up passwordless SSH authentication, limiting playbooks to single machines, managing private keys, using roles for scalability, and tuning Ansible for performance. The objective is to deploy Kubernetes from scratch with one command using best practices for Ansible configuration, authentication, and organization.
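The practices above (limiting a playbook run to a single machine, pointing at a managed private key) boil down to flags on the `ansible-playbook` command line; `--limit` and `--private-key` are real options, while the playbook name, inventory file, and hostname below are hypothetical. A small command builder makes the shape easy to see and test:

```python
# Sketch: assembling an ansible-playbook invocation that applies the
# practices above. The flags are real; the filenames and hostnames
# ("kubernetes.yml", "hosts.ini", "master01") are hypothetical.
import shlex

def playbook_command(playbook, limit=None, private_key=None,
                     inventory="hosts.ini"):
    cmd = ["ansible-playbook", playbook, "-i", inventory]
    if limit:
        cmd += ["--limit", limit]            # confine the run to one host/group
    if private_key:
        cmd += ["--private-key", private_key]  # explicit SSH key management
    return cmd

cmd = playbook_command("kubernetes.yml", limit="master01",
                       private_key="/home/ops/.ssh/cluster_key")
print(shlex.join(cmd))
# → ansible-playbook kubernetes.yml -i hosts.ini --limit master01 --private-key /home/ops/.ssh/cluster_key
```

Wrapping the invocation this way is also how "deploy Kubernetes with one command" is usually achieved: the wrapper pins the inventory, key, and limit so operators cannot fat-finger them.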
This document summarizes a presentation about Globus Genomics, a service that provides genomic data analysis tools and workflows through a web interface. It allows users to securely transfer data, run standardized analysis pipelines, access computational resources on demand through Amazon Web Services, and collaborate on shared data and workflows. The service aims to make genomic analysis more accessible, reproducible, and sustainable through various pricing models and support for individual labs and bioinformatics cores.
This document summarizes Senator Barack Obama's health policy plan, which focuses on achieving universal health care coverage, health care reform, and strengthening public health. It outlines some of the key problems in the current US healthcare system from the perspectives of providers, purchasers, and consumers. Obama's plan would invest in health information technology and reform reimbursement to align with quality. The plan is estimated to cost $50-65 billion annually but could save $120-200 billion through reduced administrative costs, improved disease management, and health IT savings. If implemented, it could lower family insurance costs by $2,500 and cover 10 million more people.
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
This document provides an overview of Apache Sqoop and discusses the transition from Sqoop 1 to Sqoop 2. Sqoop is a tool for transferring data between relational databases and Hadoop. Sqoop 1 was connector-based and had challenges around usability and security. Sqoop 2 addresses these with a new architecture separating connections from jobs, centralized metadata management, and role-based security for database access. Sqoop 2 is the primary focus of ongoing development to improve ease of use, extensibility, and security of data transfers with Hadoop.
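The key architectural change, separating connections from jobs, can be shown with a small registry sketch: connections (with credentials) are created once and held centrally, and jobs merely reference them by id. The class and field names below are illustrative, not the Sqoop 2 API.

```python
# Sketch of the Sqoop 2 idea of separating connections (created once,
# credentials held centrally) from jobs (which reference a connection
# by id). Names and fields are illustrative, not the Sqoop 2 API.
class Registry:
    def __init__(self):
        self.connections, self.jobs = {}, {}

    def create_connection(self, conn_id, jdbc_url, username, password):
        # Credentials live only here, not in every job definition,
        # which is what enables role-based control over database access.
        self.connections[conn_id] = {"url": jdbc_url, "user": username,
                                     "password": password}

    def create_job(self, job_id, conn_id, table, direction):
        assert conn_id in self.connections, "job must reference a known connection"
        self.jobs[job_id] = {"connection": conn_id, "table": table,
                             "direction": direction}  # "import" or "export"

registry = Registry()
registry.create_connection("prod-db", "jdbc:mysql://db:3306/sales",
                           "etl", "s3cret")
registry.create_job("daily-orders", "prod-db", table="orders",
                    direction="import")
print(registry.jobs["daily-orders"]["connection"])  # → prod-db
```

With this split, an administrator can grant a user the right to run jobs against "prod-db" without ever exposing the credentials embedded in the connection.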
Ramesh Raskar
MIT Media Lab
Ramesh Raskar is an Associate Professor at MIT Media Lab. Ramesh Raskar joined the Media Lab from Mitsubishi Electric Research Laboratories in 2008 as head of the Lab’s Camera Culture research group. His research interests span the fields of computational photography, inverse problems in imaging and human-computer interaction. Recent projects and inventions include transient imaging to look around a corner, a next generation CAT-Scan machine, imperceptible markers for motion capture (Prakash), long distance barcodes (Bokode), touch+hover 3D interaction displays (BiDi screen), low-cost eye care devices (Netra, Catra), new theoretical models to augment light fields (ALF) to represent wave phenomena, and algebraic rank constraints for 3D displays (HR3D).
In 2004, Raskar received the TR100 Award from Technology Review, which recognizes top young innovators under the age of 35, and in 2003, the Global Indus Technovator Award, instituted at MIT to recognize the top 20 Indian technology innovators worldwide. In 2009, he was awarded a Sloan Research Fellowship, and in 2010 he received the DARPA Young Faculty Award. Other honors include a Marr Prize honorable mention (2009); the LAUNCH Health Innovation Award, presented by NASA, USAID, the US State Department, and NIKE (2010); and first place in the Vodafone Wireless Innovation Project Award (2011). He holds over 50 US patents and has received four Mitsubishi Electric Invention Awards. He is currently co-authoring a book on computational photography.
This document provides an overview of the Leap Motion controller and its capabilities for tracking hand gestures and finger positions. It explains how to set up a basic Leap Motion application using a 2D canvas, initialize the Leap controller, and include an animation loop to continuously draw hand tracking data. Examples are given for visualizing the Leap geometric system, common gestures like swipes and taps, and hand parameters using online code snippets and demos.
This document discusses various types of 3D displays. It begins with an overview of depth cues that can be presented to the human visual system, both monocular cues like size and occlusion, as well as binocular cues like retinal disparity and convergence. The document then presents a taxonomy of 3D display technologies, categorizing them as either glasses-bound or unencumbered designs. Specific display types are described in more detail, including head-mounted displays, spatial and temporal multiplexing, parallax barriers, integral imaging, and volumetric and holographic displays. Multi-view rendering techniques for generating stereoscopic images are also covered, such as using OpenGL for anaglyph generation and off-axis perspective projection.
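One of the multi-view rendering techniques surveyed, anaglyph generation, is simple enough to sketch without OpenGL: the red channel of the output comes from the left view and the green/blue channels from the right view, so red-cyan glasses route each view to the correct eye. Images here are tiny in-memory grids of (r, g, b) tuples rather than framebuffers.

```python
# Sketch of red-cyan anaglyph generation: red from the left view,
# green and blue from the right view, per pixel.
def anaglyph(left, right):
    return [
        [(lpx[0], rpx[1], rpx[2]) for lpx, rpx in zip(lrow, rrow)]
        for lrow, rrow in zip(left, right)
    ]

left  = [[(200, 10, 10), (180, 20, 20)]]    # mostly-red left view
right = [[(10, 200, 200), (20, 180, 180)]]  # mostly-cyan right view
print(anaglyph(left, right))
# → [[(200, 200, 200), (180, 180, 180)]]
```

An OpenGL pipeline does the same thing with per-channel color masks while rendering the scene twice from the two eye positions.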
What is Media in MIT Media Lab, Why 'Camera Culture'
'Media' is the plural of 'medium'. At the MIT Media Lab, the medium through which digital technologies make their impact can be photons, electrons, neurons, atoms, cells, musical notes, and more.
Over the last 40 years, computing has moved from the processor to the network, to the social, and increasingly to the sensory.
The MIT Media Lab works at the intersection of computing and such media to build human-centric technologies.
This document provides an overview of the pipeline for multiview computer vision. It describes taking multiple photographs, detecting and matching features between images, estimating homographies to relate the images, generating blended intermediate frames, and creating a video from the sequence of frames. It also provides details on steps like feature detection and description, matching features, estimating homographies, image blending, and writing video files.
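The homography-estimation step of that pipeline can be sketched from first principles: four point correspondences (x, y) → (u, v) give the standard 8x8 linear system for the entries of H (with the bottom-right entry fixed to 1). Plain Gaussian elimination stands in here for the RANSAC-plus-least-squares solvers a real pipeline would use over many noisy feature matches.

```python
# Sketch: direct linear estimation of a homography from exactly four
# correspondences, solved with Gaussian elimination (h33 fixed to 1).
def estimate_homography(src, dst):
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def apply_h(H, x, y):
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# A pure translation by (5, -2) should be recovered exactly.
src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(x + 5, y - 2) for x, y in src]
H = estimate_homography(src, dst)
u, v = apply_h(H, 0.5, 0.5)
print(round(u, 6), round(v, 6))  # → 5.5 -1.5
```

Once H is known, warping one image into the other's frame and blending the overlap yields the intermediate frames the pipeline strings into a video.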
Glass contains a 1 GHz processor, 16GB storage, 512MB RAM, GPS, Bluetooth, WiFi, 5MP camera, bone conduction speaker, gyroscope, accelerometer, and 640x360 display. It detects blinks and winks using a proximity sensor and IR emitter/receiver near the right eye. Developers can build apps using the Mirror API or by sideloading Android APKs that utilize Glass-specific APIs. The documentation provides examples of building simple apps.
What is SIGGRAPH NEXT?
By Juliet Fiss
What will be the next big thing at SIGGRAPH, and how can the SIGGRAPH community contribute in an impactful way to fields outside of traditional computer graphics? SIGGRAPH NEXT at SIGGRAPH 2015 explored these questions. In this new addition to the SIGGRAPH program, an eclectic set of speakers gave TED-style talks and posed grand challenges to the SIGGRAPH community. In this blog post, Professor Ramesh Raskar of the MIT Media Lab introduces SIGGRAPH NEXT and outlines his vision for it.
What will be the next big thing at SIGGRAPH?
The SIGGRAPH community has a set of hammers that it uses to solve problems: geometry processing, rendering, animation, and imaging. What will be the next hammer, the next major field of study, to appear at SIGGRAPH? Let’s examine where our research ideas come from. Often, advances in machine learning, optimization, signal processing, and optics forge our hammers. Our selection of hammer also depends on the nails we see. The most common application areas of computer graphics currently include computer-aided design, movies, games, and photography.
We often ask: “Does this work contribute to SIGGRAPH techniques?”
We should also ask, “Does this work contribute SIGGRAPH techniques to _____?”
When we answer the challenges posed by these traditional application areas of computer graphics, we are “drinking our own champagne.” We have made amazing progress in these application areas, and we should celebrate! SIGGRAPH NEXT is about finding new varieties of champagne; for that, we need new varieties of grapes. We should invite others from nontraditional and emerging application areas to enjoy our champagne with us, and they will become part of our community. First, we can expand our work in existing areas like mobile, user interaction, virtual reality, fabrication, and new types of cameras. We can also expand into emerging areas such as healthcare, energy, education, entrepreneurship, materials, tissue fabrication, and social media. What’s next?
Professor Raskar highlights three top areas where we can make an impact. One big take-home message is that many of these applications involve biology: bio is the new digital, and it will affect us ubiquitously.
BeSTGRID aims to enhance research capability in New Zealand by providing skills and infrastructure to help researchers engage with new eResearch services and kick start centralized infrastructure. Since 2006, BeSTGRID has delivered services and tools to support research collaboration on shared datasets and computational resources. BeSTGRID coordinates access to compute and data resources, provides discipline-specific services and applications, and builds a sustainable community to develop middleware, applications, and services.
The document summarizes the Safe Share project, which aims to enable the secure exchange of health data between research sites for medical research. It establishes a higher assurance network using encrypted overlays between network nodes. It also explores implementing an authentication, authorization and accounting infrastructure to allow researchers to access data and systems using their home institution credentials. Several pilot programs are underway to test the network and authentication capabilities. The overall goal is to accelerate medical research while maintaining strict security and privacy of sensitive health data.
The Pacific Research Platform: Building a Distributed Big-Data Machine-Learni... (Larry Smarr)
The Pacific Research Platform (PRP) is a distributed big data and machine learning cyberinfrastructure connecting researchers across multiple UC campuses. It was established in 2015 with NSF funding and has since expanded to include other California universities and national/international partners. The PRP provides high-speed networks, storage, and computing resources like GPUs. It has enabled new data-intensive collaborations and significantly accelerated research workflows. The PRP also supports educational initiatives, providing computing resources for data science courses impacting thousands of students.
The Pacific Research Platform (PRP) is a multi-institutional cyberinfrastructure project that connects researchers across California and beyond to share large datasets. It spans the 10 University of California campuses, major private research universities, supercomputer centers, and some out-of-state universities. Fifteen multi-campus research teams in fields like physics, astronomy, earth sciences, biomedicine, and multimedia will drive the technical needs of the PRP over five years. The goal is to create a "big data freeway" to allow high-speed sharing of data between research labs, supercomputers, and repositories across multiple networks without performance loss over long distances.
The document discusses the Pacific Research Platform (PRP), a distributed cyberinfrastructure that connects researchers and data across multiple campuses in California and beyond using optical fiber networking. Key points:
- The PRP uses high-speed networking infrastructure like the CENIC network to connect data generators and consumers across 15+ campuses, creating an integrated "big data freeway system".
- It deploys specialized data transfer nodes called FIONAs to enable high-speed transfer of large datasets between sites at near the full network speed.
- Recent additions include using Kubernetes to orchestrate containers across the PRP infrastructure and integrating machine learning resources through the CHASE-CI grant to support data-intensive AI applications.
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS – Ed Dodds
1) Globus Genomics addresses challenges in sequencing analysis by providing a platform that integrates data transfer via Globus Online, workflow management in Galaxy, and scalable compute resources in AWS.
2) An example collaboration with the Dobyns Lab saw over a 10x speedup in exome data analysis by replacing a manual process with Globus Genomics.
3) Globus Genomics leverages XSEDE services like Globus Transfer and Nexus while integrating additional resources like sequencing centers and cloud computing, in order to reduce the costs and complexities of genomic research for communities not traditionally using advanced cyberinfrastructure.
Role of Amyloid Burden in Cognitive Decline – Ravi Madduri
This poster is prepared for the upcoming BD2K All Hands meeting. We present the BDDS Knowledge Discovery platform as applied to understanding the role of amyloid burden in Alzheimer's and Parkinson's.
Globus Genomics provides tools and services to help researchers manage and analyze large genomic datasets. It uses Globus data management tools to securely transfer data between institutions. Researchers can then run analysis workflows on cloud compute resources through Galaxy interfaces. This enables researchers to assemble diverse datasets, apply multiple computational models, and publish results for others to discover, validate, and reuse. Examples show researchers using Globus Genomics to process petabytes of sequencing data and perform genome-wide analysis across many institutions. The goal is to accelerate scientific discovery by making it easier for researchers to find "needles in haystacks" through data-intensive computational approaches.
Presentation offered at http://www.smartiotlondon.com/2016-seminar-programme/big-data-and-genomics-the-future-of-genetic-engineering
Bioinformatics: the marriage of biology and Big Data, and how this will change the way we perform genetic engineering.
This presentation explains how our company (Alkol Biotech) compares DNA strands, focusing on its development of the “EunergyCane” sugarcane crop: Europe’s only sugarcane variety. It explains the tools the company uses, such as Big Data, Machine Learning, and Fast Sequencing.
Learning Outcomes:
1 – Learn about the new field of “Bioinformatics”, the marriage of IT and biology
2 – Learn how Big Data is changing the game in genetic engineering
3 – Learn what tools are used and what results to expect
ADAS&ME presentation @ the SCOUT project expert workshop (22-02-2017, Brussels) – joseplaborda
The document summarizes the ADAS&ME project which aims to develop advanced driver assistance systems that can automatically transfer control between the vehicle and driver based on the driver's state and environmental context. The project has a budget of 9.5 million euros over 42 months and involves companies and research institutions developing technologies like high-definition maps, vehicle connectivity, and systems for monitoring driver state and handling non-reactive drivers. Several use cases are outlined focusing on commercial vehicles and motorcycles, with scenarios presented for smooth transitions between automated and manual driving and handling emergencies if the driver does not respond.
The document discusses effective ways to use Ansible including setting up passwordless SSH authentication, limiting playbooks to single machines, managing private keys, using roles for scalability, and tuning Ansible for performance. The objective is to deploy Kubernetes from scratch with one command using best practices for Ansible configuration, authentication, and organization.
This document summarizes a presentation about Globus Genomics, a service that provides genomic data analysis tools and workflows through a web interface. It allows users to securely transfer data, run standardized analysis pipelines, access computational resources on demand through Amazon Web Services, and collaborate on shared data and workflows. The service aims to make genomic analysis more accessible, reproducible, and sustainable through various pricing models and support for individual labs and bioinformatics cores.
This document summarizes Senator Barack Obama's health policy plan, which focuses on achieving universal health care coverage, health care reform, and strengthening public health. It outlines some of the key problems in the current US healthcare system from the perspectives of providers, purchasers, and consumers. Obama's plan would invest in health information technology and reform reimbursement to align with quality. The plan is estimated to cost $50-65 billion annually but could save $120-200 billion through reduced administrative costs, improved disease management, and health IT savings. If implemented, it could lower family insurance costs by $2,500 and cover 10 million more people.
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools – Yahoo Developer Network
This document provides an overview of Apache Sqoop and discusses the transition from Sqoop 1 to Sqoop 2. Sqoop is a tool for transferring data between relational databases and Hadoop. Sqoop 1 was connector-based and had challenges around usability and security. Sqoop 2 addresses these with a new architecture separating connections from jobs, centralized metadata management, and role-based security for database access. Sqoop 2 is the primary focus of ongoing development to improve ease of use, extensibility, and security of data transfers with Hadoop.
Ramesh Raskar
MIT Media Lab
Ramesh Raskar is an Associate Professor at MIT Media Lab. Ramesh Raskar joined the Media Lab from Mitsubishi Electric Research Laboratories in 2008 as head of the Lab’s Camera Culture research group. His research interests span the fields of computational photography, inverse problems in imaging and human-computer interaction. Recent projects and inventions include transient imaging to look around a corner, a next generation CAT-Scan machine, imperceptible markers for motion capture (Prakash), long distance barcodes (Bokode), touch+hover 3D interaction displays (BiDi screen), low-cost eye care devices (Netra,Catra), new theoretical models to augment light fields (ALF) to represent wave phenomena and algebraic rank constraints for 3D displays(HR3D).
In 2004, Raskar received the TR100 Award from Technology Review, which recognizes top young innovators under the age of 35, and in 2003, the Global Indus Technovator Award, instituted at MIT to recognize the top 20 Indian technology innovators worldwide. In 2009, he was awarded a Sloan Research Fellowship. In 2010, he received the Darpa Young Faculty award. Other awards include Marr Prize honorable mention 2009, LAUNCH Health Innovation Award, presented by NASA, USAID, US State Dept and NIKE, 2010, Vodafone Wireless Innovation Project Award (first place), 2011. He holds over 50 US patents and has received four Mitsubishi Electric Invention Awards. He is currently co-authoring a book on Computational Photography.
This document provides an overview of the Leap Motion controller and its capabilities for tracking hand gestures and finger positions. It explains how to set up a basic Leap Motion application using a 2D canvas, initialize the Leap controller, and include an animation loop to continuously draw hand tracking data. Examples are given for visualizing the Leap geometric system, common gestures like swipes and taps, and hand parameters using online code snippets and demos.
This document discusses various types of 3D displays. It begins with an overview of depth cues that can be presented to the human visual system, both monocular cues like size and occlusion, as well as binocular cues like retinal disparity and convergence. The document then presents a taxonomy of 3D display technologies, categorizing them as either glasses-bound or unencumbered designs. Specific display types are described in more detail, including head-mounted displays, spatial and temporal multiplexing, parallax barriers, integral imaging, and volumetric and holographic displays. Multi-view rendering techniques for generating stereoscopic images are also covered, such as using OpenGL for anaglyph generation and off-axis perspective projection.
'Media' is the plural of 'medium'. The medium for impact of digital technologies at MIT Media Lab can be photons, electrons, neurons, atoms, cells, musical notes and more.
Over the last 40 years, computing has moved from the processor to the network, to the social, and increasingly to the sensory.
MIT Media Lab works at the intersection of computing and such media for human-centric technologies.
This document provides an overview of the pipeline for multiview computer vision. It describes taking multiple photographs, detecting and matching features between images, estimating homographies to relate the images, generating blended intermediate frames, and creating a video from the sequence of frames. It also provides details on steps like feature detection and description, matching features, estimating homographies, image blending, and writing video files.
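The pipeline summarized above hinges on estimating homographies from matched features. As a minimal sketch of the standard direct linear transform (DLT) approach (the function names here are illustrative, not from the deck, and feature detection/matching is assumed to have already produced point correspondences):

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: stack two linear equations per point correspondence into A,
    then take the right singular vector with the smallest singular value
    as the flattened 3x3 homography H mapping src -> dst."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix scale (and sign) so H[2,2] == 1

def apply_homography(H, pts):
    """Apply H to 2-D points, with the homogeneous divide."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Sanity check: recover a known homography from four exact correspondences.
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, 3.0],
                   [1e-4, 2e-4, 1.0]])
src = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0]])
dst = apply_homography(H_true, src)
H_est = estimate_homography(src, dst)
print(np.allclose(H_est, H_true, atol=1e-6))
```

In practice the correspondences are noisy, so production pipelines wrap this estimator in RANSAC and normalize coordinates first; the sketch shows only the core linear-algebra step.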
Glass contains a 1 GHz processor, 16GB storage, 512MB RAM, GPS, Bluetooth, WiFi, 5MP camera, bone conduction speaker, gyroscope, accelerometer, and 640x360 display. It detects blinks and winks using a proximity sensor and IR emitter/receiver near the right eye. Developers can build apps using the Mirror API or by sideloading Android APKs that utilize Glass-specific APIs. The documentation provides examples of building simple apps.
What is SIGGRAPH NEXT?
By Juliet Fiss
What will be the next big thing at SIGGRAPH, and how can the SIGGRAPH community contribute in an impactful way to fields outside of traditional computer graphics? SIGGRAPH NEXT at SIGGRAPH 2015 explored these questions. In this new addition to the SIGGRAPH program, an eclectic set of speakers gave TED-style talks and posed grand challenges to the SIGGRAPH community. In this blog post, Professor Ramesh Raskar of the MIT Media Lab introduces SIGGRAPH NEXT and outlines his vision for it.
What will be the next big thing at SIGGRAPH?
The SIGGRAPH community has a set of hammers that it uses to solve problems: geometry processing, rendering, animation, and imaging. What will be the next hammer, the next major field of study, to appear at SIGGRAPH? Let’s examine where our research ideas come from. Often, advances in machine learning, optimization, signal processing, and optics forge our hammers. Our selection of hammer also depends on the nails we see. The most common application areas of computer graphics currently include computer-aided design, movies, games, and photography.
We often ask: “Does this work contribute to SIGGRAPH techniques?”
We should also ask, “Does this work contribute SIGGRAPH techniques to _____?”
When we answer the challenges posed by these traditional application areas of computer graphics, we are “drinking our own champagne.” We have made amazing progress in these application areas, and we should celebrate! SIGGRAPH NEXT is about finding new varieties of champagne; for that, we need new varieties of grapes. We should invite others from nontraditional and emerging application areas to enjoy our champagne with us, and they will become part of our community. First, we can expand our work in existing areas like mobile, user interaction, virtual reality, fabrication, and new types of cameras. We can also expand into emerging areas such as healthcare, energy, education, entrepreneurship, materials, tissue fabrication, and social media. What’s next?
Professor Raskar highlights three top areas where we can make an impact. One big take-home message is that many of these applications involve biology: bio is the new digital, and it will affect us ubiquitously.
BeSTGRID aims to enhance research capability in New Zealand by providing skills and infrastructure to help researchers engage with new eResearch services and kick start centralized infrastructure. Since 2006, BeSTGRID has delivered services and tools to support research collaboration on shared datasets and computational resources. BeSTGRID coordinates access to compute and data resources, provides discipline-specific services and applications, and builds a sustainable community to develop middleware, applications, and services.
Application of Assent in the safe - Networkshop44 – Jisc
The document summarizes the Safe Share project, which aims to enable the secure exchange of health data between research sites for medical research. It establishes a higher assurance network using encrypted overlays between network nodes. It also explores implementing an authentication, authorization and accounting infrastructure to allow researchers to access data and systems using their home institution credentials. Several pilot programs are underway to test the network and authentication capabilities. The overall goal is to accelerate medical research while maintaining strict security and privacy of sensitive health data.
High Performance Cyberinfrastructure for Data-Intensive Research – Larry Smarr
This document summarizes a lecture given by Dr. Larry Smarr on high performance cyberinfrastructure for data-intensive research. The summary discusses:
1) The need for dedicated high-bandwidth networks separate from the shared internet to enable big data research due to the increasing volume of digital scientific data.
2) Extensions being made to networks like CENIC in California to provide campus "Big Data Freeways" connecting instruments, computing resources, and remote facilities.
3) The use of networks like HPWREN to provide high-performance wireless access for data-intensive applications in rural areas like astronomy, wildfire detection, and more.
BeSTGRID aims to enhance research capability in New Zealand by providing skills training and infrastructure support. Since 2006, BeSTGRID has delivered services and tools to support research collaboration on shared data and computational resources. BeSTGRID coordinates access to compute and storage resources across New Zealand and provides discipline-specific applications and services to support researchers.
The BlueBRIDGE approach to collaborative research – Blue BRIDGE
Gianpaolo Coro, ISTI-CNR, at BlueBRIDGE workshop on "Data Management services to support stock assessement", held during the Annual ICES Science conference 2016
This document summarizes a presentation about the African Open Science Platform (AOSP). It discusses challenges during the 2014-2015 Ebola outbreak in sharing health data openly. The AOSP vision is for African scientists to be leaders in open science and addressing challenges. Its mission is to provide a trusted system for finding, depositing, managing, and reusing research data, software and metadata. It discusses similar initiatives like the European Open Science Cloud and Google's plan for a new internet cable to Africa. It outlines AOSP's pilot activities from 2016-2019 and outlines draft plans for its data science school, eInfrastructure ecosystem, and flagship data-intensive project. National and international strategies supporting open science and the AOSP
This document provides an introduction to big data, including:
- Big data is characterized by its volume, velocity, and variety, which makes it difficult to process using traditional databases and requires new technologies.
- Technologies like Hadoop, MongoDB, and cloud platforms from Google and Amazon can provide scalable storage and processing of big data.
- Examples of how big data is used include analyzing social media and search data to gain insights, enabling personalized experiences and targeted advertising.
- As data volumes continue growing exponentially from sources like sensors, simulations, and digital media, new tools and approaches are needed to effectively analyze and make sense of "big data".
Democratizing Science through Cyberinfrastructure - Manish Parashar – Larry Smarr
This document summarizes a presentation by Manish Parashar on democratizing science through cyberinfrastructure. The key points are:
1) Broad, fair, and equitable access to advanced cyberinfrastructure is essential for democratizing 21st century science, but there are significant barriers related to knowledge, technical issues, social factors, and balancing capabilities.
2) An advanced cyberinfrastructure ecosystem for all requires integrated portals, access to local and national resources through high-speed networks, diverse allocation modes, embedded expertise networks, and broad training.
3) Realizing this vision will require a scalable federated ecosystem with diverse capabilities and incentives for partnerships to meet growing needs for cyberinfrastructure and
The document outlines the vision, mission, and strategy of the STFC (Science and Technology Facilities Council) in implementing e-Science technologies. The goals are to exploit data from STFC facilities through innovative infrastructure, integrate activities nationally and internationally, and improve computation and data management capabilities to enable new scientific discoveries.
Shared services - the future of HPC and big data facilities for UK research – Martin Hamilton
Slides from Jisc panel session at HPC & Big Data 2016 with contributions from the Francis Crick Institute, QMUL and King's College London covering their use of the Jisc shared data centre and the eMedLab project
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R... – Larry Smarr
Invited Presentation
Symposium on Computational Biology and Bioinformatics:
Remembering John Wooley
National Institutes of Health
Bethesda, MD
July 29, 2016
The document summarizes the Ticer Summer School held on August 24th, 2006. It discusses topics related to digital libraries, grids, e-science, and data management. Examples are provided of different e-science projects that utilize grids for data-intensive applications in domains such as climate modeling, biomedicine, and high-energy physics. Requirements for users and owners of data resources on grids are also outlined.
Arkady Zaslavsky, Charith Perera, Dimitrios Georgakopoulos, Sensing as a Service and Big Data, Proceedings of the International Conference on Advances in Cloud Computing (ACC), Bangalore, India, July, 2012, Pages 21-29 (8)
1. Pushing Discovery with Internet2: Cloud to Supercomputing in Life Sciences
DAN TAYLOR
Director, Business Development, Internet2
BIO-IT WORLD 2016
BOSTON
APRIL, 2016
2. Internet2 Overview
• An advanced networking consortium
– Academia
– Corporations
– Government
• Operates a best-in-class national optical network
– 15,000 miles of dedicated fiber
– 100G routers and optical transport systems
– 8.8 Tbps capacity
• For over 20 years, our mission has been to
– Provide cost-effective broadband and collaboration technologies to facilitate frictionless research in Big Science: broad collaboration, extremely large data sets
– Create tomorrow’s networks & a platform for networking research
– Engage stakeholders in
• Bridging the IT/researcher gap
• Developing new technologies critical to their missions
3. The 4th Gen Internet2 Network
Internet2 Network by the numbers:
• 17 Juniper MX960 nodes
• 31 Brocade and Juniper switches
• 49 custom colocation facilities
• 250+ amplification racks
• 15,717 miles of newly acquired dark fiber
• 2,400 miles of partnered capacity with Zayo Communications
• 8.8 Tbps of optical capacity
• 100 Gbps of hybrid Layer 2 and Layer 3 capacity
• 300+ Ciena ActiveFlex 6500 network elements
4. Technology
• A research-grade high-speed network, optimized for “elephant flows”
• Layer 1 – secure point-to-point wavelength networking
• Advanced Layer 2 Services – open virtual network for Life Sciences with connectivity speeds up to 100 Gbps
• SDN network virtualization customer trials now
• Advanced Layer 3 Services – high-speed IP connectivity to the world
• Superior economics
• Secure sharing of online research resources – federated identity management system
5. Internet2 Members and Partners
255 Higher Education members
67 Affiliate members
41 R&E Network members
82 Industry members
65+ Int’l partners reaching over 100 nations
93,000+ community anchor institutions
Focused on member technology needs since 1996
"The idea of being able to collaborate with anybody, anywhere, without constraint…" —Jim Bottum, CIO, Clemson University
Community
6. Strong international partnerships
• Agreements with international networking partners offer interoperability and access
• Enable collaboration between U.S. researchers and overseas counterparts in over 100 international R&E networks
Community
10. Life Sciences Research Today
• Sharing Big Data sets (genomic, environmental, imagery) is key to basic and applied research
• Reproducibility – need to capture methods as well as raw data
– High variability in analytic processes and instruments
– Inconsistent formats and standards
– Lack of metadata & standards
• Biological systems are immensely complicated and dynamic (S. Goff, CyVerse/iPlant)
– 21k human genes can make >100k proteins
– >50% of genes are controlled by day-night cycles
– Proteins have an average half-life of 30 hours
– Several thousand metabolites are rapidly changing
– Traits are environmentally and genetically controlled
• Information technology – high-performance computing and networking – can now explore these systems through simulation
• Collaboration
– Cross-domain, cross-discipline
– Distribution of systems and talent is global
– Resources are public, private, and academic
11. BIO-IT Trends in the Trenches 2015, with Chris Dagdigian
Take-aways:
– Science is changing faster than the IT funding cycle for data-intensive computing environments
– Forward-looking 100G multi-site, multi-party collaborations required
– Cloud adoption driven by capability vs. cost
– Centralized data center is dead; the future is distributed computing/data stores
– Big pharma security challenge has been met
– SDN is real and happening now; part of the infrastructure automation wave
– Blast radius more important than ever: DOE’s Science DMZ architecture is a solution
https://youtu.be/U6i0THTxe4o
http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches
2015 Bio-IT World Conference & Expo
• Change
• Networking
• Cloud
• Decentralized Collaboration
• Security
• Mission Networks
15. 2012: US–China 10 Gbps Link
Transferring Sample.fa (24 GB):
– FedEx: 2 days
– Internet + FTP: 26 hours
– China–US 10G link: 30 secs
Dr. Lin Fang, Dr. Dawei Lin
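The comparison on this slide is straightforward bandwidth arithmetic. As a rough sketch (ideal link rates only, ignoring protocol overhead, which is why the observed 30 seconds exceeds the theoretical minimum):

```python
def transfer_seconds(size_bytes, link_bps):
    """Ideal (protocol-free) transfer time: payload bits / link rate."""
    return size_bytes * 8 / link_bps

SAMPLE_FA = 24 * 10**9  # the 24 GB Sample.fa file from the slide

# Dedicated 10 Gbps link: ~19 s ideal, vs. the ~30 s observed in practice.
ideal = transfer_seconds(SAMPLE_FA, 10 * 10**9)
print(f"10G link, ideal: {ideal:.1f} s")  # 19.2 s

# Effective rate implied by the 26-hour "Internet + FTP" figure.
eff_bps = SAMPLE_FA * 8 / (26 * 3600)
print(f"FTP path effective rate: {eff_bps / 1e6:.1f} Mbit/s")  # ~2.1 Mbit/s
```

The gap between the ~2 Mbit/s effective rate of the commodity-internet path and the 10 Gbps dedicated link is the roughly 3,000x speedup the slide is illustrating.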
16. NCBI/UC Davis/BGI: First ultra-high-speed transfer of genomic data between China & US, June 2012
“The 10 Gigabit network connection is even faster than transferring data to most local hard drives,” said Dr. Lin [of UC Davis]. “The use of a 10 Gigabit network connection will be groundbreaking, very much like email replacing hand-delivered mail for communication. It will enable scientists in the genomics-related fields to communicate and transfer data more rapidly and conveniently, and bring the best minds together to better explore the mysteries of life science.” (BGI press release)
Life Sciences Engagement
Community
18. USDA Agriculture Research Services Science Network
• USDA scope is far beyond human
19. USDA Agricultural Research Services Use Cases
• Drought (Soil Moisture) Project – challenging volumes of data
– NASA satellite data storage: 7 TB/mo., 36-month mission
– ARS Hydrology and Remote Sensing Lab analysis: 108 TB
– Data completely re-processed 3 to 5 times
• Microbial Genomics Project – computational bottlenecks
– Individual strains of bacteria and microorganism communities related to food safety, animal health, and feed efficiency
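The drought-project figures above imply substantial totals. A back-of-envelope sketch (assuming, as the slide suggests, that the 7 TB/month NASA feed runs for the full 36-month mission):

```python
# Raw satellite data accumulated over the mission.
monthly_tb = 7
mission_months = 36
raw_tb = monthly_tb * mission_months
print(f"raw NASA data: {raw_tb} TB")  # 252 TB

# The ARS lab's 108 TB analysis set, re-processed 3 to 5 times,
# means 3-5x that volume moves through compute and storage.
analysis_tb = 108
low, high = 3 * analysis_tb, 5 * analysis_tb
print(f"data handled across re-processing: {low}-{high} TB")  # 324-540 TB
```

Hundreds of terabytes moved repeatedly between storage and HPC is exactly the workload the Science DMZ and 100 Gb links described on the next slides are meant to absorb.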
20. ARS Big Data Initiative
Big Data Workshop Recommendations (February 2013)
Three pillars of the ARS Big Data Implementation Plan – Network, HPC, Virtual Research Support (April 2014):
• Develop a Science DMZ
• Enable high-speed, low-latency transfer of research data to HPC and storage from ARS locations
• Virtual researcher support
Implementation complete (Nov. 2015): Clay Center, NE; Albany, CA; Beltsville Labs/Nat’l Ag. Library, Beltsville, MD; Stoneville, MS; Ft. Collins, CO; Ames/NADC, IA
• ARS Scientific Computing Assessment – final report March 2014
21. SCInet Locations and Gateways – USDA Agricultural Research Service
(Map: Albany, CA; Ft. Collins, CO; Clay Center, NE; Ames, IA; Stoneville, MS; Beltsville, MD – sites connected by 100 Gb and 10 Gb links)
22. Cloud & Distributed Research Computing @Scale
Community
Internet2 approach:
• Agile scaling of resources and capacity
• Access to multi-domain, multi-discipline expertise in one dynamic global community
• Offer a bottomless toolbox for innovation for the researcher
23. New High Speed Cloud Collaborations
(Diagram: connections at 10G, x10G, x100G)
25. Syngenta Science Network
• Syngenta is a leading agriculture company helping to improve global food security by enabling millions of farmers to make better use of available resources.
• Key research challenge: how to grow plants more efficiently?
• Internet2 members, especially land grant universities, are important research partners.
26. The Challenge
– Increasing size of scientific data sets
– Growing number of useful external resources and partners
– Complexity of genomic analyses is increasing
– Need for big data collaborations across the globe
– Must innovate
27. – Higher data throughput
– High-speed connectivity to AWS Direct Connect
• Surge HPC
• Collaborations with the academic community
– High-speed connections to best-in-class supercomputing resources
• NCSA – University of Illinois: leverage NCSA expertise in building custom R&D workflows; leverage the NCSA Industry Partnership Program
• A*Star Supercomputing Center in Singapore: supports a global, distributed scientific computing capability
– Global scale: creating a global fabric for computing and collaboration
28. “I want to be 15 minutes behind NCSA and 6 months ahead of my competition” – Keith Gray, BP
National Center for Supercomputing Applications
30. NCSA Mayo Clinic @Scale Genome-Wide Association Study for Alzheimer’s Disease
• NCSA Private Sector Program
– UIUC HPCBio
– Mayo Clinic
• The Blue Waters team and the Swiss Institute of Bioinformatics worked together to identify which genetic variants interact to influence gene expression patterns that may associate with Alzheimer’s disease
31. Big Data and Big Compute Problem
• 50,011,495,056 pairs of variants
• Each variant pair is tested against 181 subjects and 24,544 genic regions
• Computationally large problem: PLINK ~2 years at Mayo; FastEpistasis ~6 hours on Blue Waters
• Can be a big data problem:
– 500 PB if keeping all results
– 4 TB when using a conservative cutoff
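The slide's storage figures can be sanity-checked with a little arithmetic. A hedged back-of-envelope sketch, using only the numbers given (the per-pair and per-region breakdowns are my inference, not stated on the slide):

```python
pairs = 50_011_495_056   # variant pairs tested
regions = 24_544         # genic regions each pair is tested against
full_pb = 500            # PB if all results were kept
cutoff_tb = 4            # TB kept with a conservative cutoff

# Implied result volume per variant pair, and per pair-region test.
bytes_per_pair = full_pb * 10**15 / pairs
print(f"~{bytes_per_pair / 1e6:.0f} MB of results per pair")          # ~10 MB
print(f"~{bytes_per_pair / regions:.0f} bytes per pair-region test")  # ~400 B

# Fraction of the full output that survives the conservative cutoff.
kept = cutoff_tb * 10**12 / (full_pb * 10**15)
print(f"cutoff keeps {kept:.1e} of the full output")  # 8.0e-06
```

Keeping roughly one result in 125,000 is what turns a 500 PB problem into a 4 TB one, which is why the cutoff matters as much as the compute.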
33. UCSC Cancer Genomics Hub: Large Data Flows to End Users
(Chart: cumulative TBs of CGH files downloaded – 30 PB – over 1G, 8G, and 15G connections)
Data source: David Haussler, Brad Smith, UCSC; Larry Smarr, Calit2
http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
34. SDSC Protein Data Bank Archive
• Repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to the others in the molecule. Information is annotated and publicly released into the archive by the wwPDB.
35. SDSC
• Expertise
– Bioinformatics programming and applications support
– Computational chemistry methods
– Compliance requirements, e.g., for dbGaP, FISMA, and HIPAA
– Data mining techniques, machine learning, and predictive analytics
– HPC and storage system architecture and design
– Scientific workflow systems and informatics pipelines
• Education and Training
– Intensive boot camps for working professionals: Data Mining, Graph Analytics, and Bioinformatics and Scientific Workflows
– Customized, on-site training sessions/programs
– Data Science Certificate program
– “Hackathon” events in data science and other topics
36. Sherlock Cloud: A HIPAA-Compliant Cloud
Healthcare IT Managed Services – SDSC Center of Excellence
• Expertise in systems, cyber security, data management, analytics, application development, advanced user support, and project management
• Operating the first & largest FISMA data warehouse platform for Medicaid fraud, waste, and abuse analysis
• Leveraged FISMA experience to offer HIPAA-compliant managed hosting for UC and academia
• Supporting HHS CMS, NIH, UCOP, and other UC campuses
• Sherlock services: Data Lab, Analytics, Case Management, and Compliant Cloud
39. Community Data Science Resources
RENCI RADII and GWU HIVE
Driving infrastructure virtualization; enabling reproducibility for FDA submissions
40. RADII: Resource Aware Datacentric collaboratIve Infrastructure
Goal: make data-driven collaborations a ‘turn-key’ experience for domain researchers and a ‘commodity’ for the science community
Approach: a new cyberinfrastructure to manage data-centric collaborations, based upon natural models of collaborations that occur among scientists
RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal, Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar
SDSC: Amit Majumdar
Duke: Erich Huang
Workflows – especially data-driven workflows and workflow ensembles – are becoming a centerpiece of modern computational science.
41. RADII Rationale
• Multi-institutional research teams grapple with a multitude of resources
– Policy-restricted large data sets
– Campus compute resources
– National compute resources
– Instruments that produce data
• Interconnected by networks
– Campus, regional, national providers
• Many options, much complexity
• Data and infrastructure are treated separately
RADII Creates
A cyberinfrastructure that integrates data and resource
management from the ground up to support data-centric research.
RADII allows scientists to easily map collaborative data-driven
activities onto a dynamically configurable cloud infrastructure.
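The mapping RADII describes - from a collaboration's datasets and their policies onto concrete infrastructure - can be sketched in miniature. All class, site, and dataset names below are hypothetical illustrations, not RADII's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class DataSet:
    name: str
    size_tb: float
    policy: str            # e.g. "HIPAA" or "public"

@dataclass
class ComputeSite:
    name: str
    cores: int
    supports_policies: set

@dataclass
class Collaboration:
    """A data-centric collaboration: datasets plus candidate hosting sites."""
    datasets: list = field(default_factory=list)
    sites: list = field(default_factory=list)

    def map_to_infrastructure(self) -> dict:
        """Place each dataset on the first site that satisfies its policy."""
        placement = {}
        for ds in self.datasets:
            for site in self.sites:
                if ds.policy in site.supports_policies:
                    placement[ds.name] = site.name
                    break
        return placement

collab = Collaboration(
    datasets=[DataSet("genomes", 50.0, "HIPAA"), DataSet("reference", 1.0, "public")],
    sites=[ComputeSite("sdsc-cloud", 2048, {"HIPAA", "public"}),
           ComputeSite("campus-cluster", 512, {"public"})],
)
print(collab.map_to_infrastructure())
```

The point of the sketch is the separation of concerns: the researcher declares datasets and policies, and the placement onto infrastructure is computed, not hand-configured.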
42. RADII: Foundational Technologies
The gap: disjoint solutions with incompatible resource abstractions.
– Infrastructure management has no visibility into data resources.
– Data management solutions have no visibility into the infrastructure.
To reduce the data-infrastructure management gap, RADII builds on:
– Data grids, which present distributed data under a single abstraction and authorization layer.
– Networked Infrastructure as a Service (NIaaS) for rapid deployment of programmable virtual network infrastructure (clouds).
43. RADII System – Virtualizing Data, Compute and Network for Collaboration
• Novel mechanisms to represent data-centric collaborations using DFD formalism
• Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations
• Novel mechanisms to map data processes, computations, storage and organizational entities onto infrastructure
44. FDA and George Washington University
Big Data Decisions:
Linking Regulatory and Industry Organizations with
HIVE Bio-Compute Objects
Presented by: Dan Taylor, Internet2 | Bio IT | Boston | 2016
45. HIVE
High-performance Integrated Virtual Environment
A regulatory NGS data analysis platform
From Jan 2016: Vahan Simonyan, Raja Mazumder
lecture, NIH Frontiers in Data Science Series
https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1
46. BIG DATA: from a range of samples and instruments to approval for use
NGS lifecycle: from a biological sample to biomedical research and regulation
sample → sequencing run → file transfer → archival → computation pipelines → analysis and review → regulation
Pain points along the way:
– produced files are massive in size
– transfer is slow
– too large to keep forever; not standardized
– difficult to validate
– difficult to visualize and interpret
– how do we avoid mistakes?
47. • Data Size: petabyte scale, soon exabytes
• Data Transfer: too slow over existing networks
• Data Archival: retaining consistent datasets across many years of mandated
evidence maintenance is difficult
• Data Standards: floating standards, multiplicity of formats, inadequate
communication protocols
• Data Complexity: sophisticated IT framework needed for complex dataflow
• Data Privacy: constrictive legal framework and ownership issues across the board
from the patient bedside to the FDA regulation
• Data Security: large number of complicated security rules and data protection tax IT
subsystems and cripple performance
• Computation Size: distributed computing, inefficiently parallelized, requires large
investment of hardware, software and human-ware
• Computation Standards: non-canonical computation protocols that are difficult to
compare, reproduce, and rely on
• Computation Complexity: significant investment of time and efforts to learn
appropriate skills and avoid pitfalls in complex computational pipelines
• Interpretation: large outputs from enormous computations are difficult to visualize
and summarize
• Publication: peer review and audit require communication of massive amounts of
information
... and how do we avoid mistakes ?
software challenges and needs
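The archival and standards problems above are partly a matter of verifiable consistency: a checksum manifest lets a repository detect drift in datasets retained across years of mandated evidence maintenance. A minimal sketch (the file names are invented):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(files: dict) -> dict:
    """files: name -> bytes. Returns a manifest of per-file checksums."""
    return {name: sha256_of(blob) for name, blob in files.items()}

def verify(files: dict, manifest: dict) -> list:
    """Return names of files whose current checksum no longer matches."""
    return [n for n, blob in files.items() if sha256_of(blob) != manifest.get(n)]

archive = {"run1.fastq": b"ACGT" * 1000, "run2.fastq": b"TTGA" * 1000}
manifest = build_manifest(archive)
archive["run2.fastq"] = b"corrupted"
print(verify(archive, manifest))  # ['run2.fastq']
```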
48. HIVE is an End to End Solution
• Data retrieval from anywhere in the world
• Storage of extra large scale data
• Security approved by OIM
• Integrator platform to bring different data and analytics together
• Tailor-made analytics designed around needs
• Visualization made to help in interpretation of data
• Support of the entire hardware, software and knowledge infrastructure
• Expertise accumulated in the agency
• Bio-Compute objects repository to provide reproducibility and
interoperability and long term referable storage of computations and results
HIVE is not
• an application to perform a few tasks
• yet another database
• a computer cluster or a cloud or a data center
• an IT subsystem
More: http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
49. HIVE data universe
Instantiation: a Data Typing Engine holds the definitions that make results reproducible:
– definitions of metadata types (data type definitions)
– definitions of computation metadata
– definitions of algorithms and pipeline descriptions
Data + computational protocols → bio-compute, yielding verifiable results within acceptable uncertainty/error and scientifically reliable interpretation.
50. Regulatory iterations: industry → FDA regulatory analysis → consumer
1. data-forming (industry)
2. compute
3. submit
4. SOPP/protocols (FDA regulatory analysis)
5. regulatory decision
6. issues, resubmits ($ millions of dollars)
7. yes / no
~$800 million in R&D dollars for a single drug; ~$2.6 billion total cost.
51. Bio-compute as a way to link regulatory and industry organizations
Industry: 1. data-forming (in a public HIVE, or tools such as Galaxy, CLC, DNA-nexus) → 2. compute → 3. submit → 4. submit bio-compute
FDA: 2. HIVE SOPP/protocols → 3. compute → 4. SOPP/protocols → 5. bio-compute integration → 6. issues, resubmits ($ millions of dollars) → 7. yes / no → consumer
Bio-compute objects facilitate integration between the two sides.
53. Trusted Identity in Research
Community-developed framework of trust enables:
• Secure, streamlined sharing of protected resources
• Consolidated management of user identities and access
• Delivery of an integrated portfolio of community-developed solutions
The standard for over 600 higher education institutions—and counting!
54. Foundation for Trust & Identity
• 425+ Academic Participants
• 160+ Sponsored Partners
• 2000+ Registered Service Providers
• 7.8 million Individuals served by federated IdM
55. • Eric Boyd, Internet2
• Stephen Wolff, Internet2
• Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona
• Chris Dagdigian, BioTeam
• Daiwei Lin, PhD, NIAID, NIH
• Paul Gibson, USDA ARS
• Paul Travis, Syngenta
• Evan Burness, NCSA
• Sandeep Chandra, SDSC
• Jonathan Allen, PhD, Lawrence Livermore National Lab
• Claris Castillo, PhD, RENCI
• Vahan Simonyan, PhD, FDA
• Raja Mazumder, PhD, George Washington University
• Eli Dart, ESnet, US Department of Energy
• BGI
• Nature
Acknowledgements
58. Rising expectations
Chart: network throughput required to move y bytes in x time
(US Dept of Energy - http://fasterdata.es.net)
Annotations: "should be easy" vs. "this year"
Greetings, I'm Dan Taylor from Internet2 – thanks for joining us. I'm going to talk a bit about Internet2 and the work we're doing with clouds and other compute resources in our community. There are a lot of slides and I'll move quickly, so please stop by our booth or download the slides if you have questions.
Internet2 is the research and education network for the US. We're a membership consortium of academia, government and corporations.
Internet2 is an advanced networking consortium comprised of 221 U.S. universities, in cooperation with 45 leading corporations, 66 government agencies, laboratories and other institutions of higher learning, 35 regional and state research and education networks and more than 100 national research and education networking organizations representing over 50 countries
Internet2 actively engages our stakeholders in the development of important new technologies including middleware, security, network research and performance measurement capabilities which are critical to the achievement of the mission goals of our members.
Throughout our first 15 years, Internet2 has served a unique role among networking organizations, pioneering the use of advanced network applications and technologies, and facilitating their development—to facilitate the work of the research community.
Internet2 operates an advanced national optical network based on 17,500 miles of dedicated fiber and utilizes the latest 100G routers and optical transport systems with 8.8Tbps of system capacity
Goal: deepen, extend, sustain, and advance the digital resources ecosystem. Value: a growing portfolio of resources and services - advanced computing, high-end visualization, data analysis, and more - plus interoperability with other infrastructures.
membership numbers as of 2014-03-27
Campus Champions (200 at 175 institutions); 14,000 participants in training workshops (online and in person).
Absolutely key to our success is the global partnerships we have formed.
[>>] Internet2 partners with over 50 national research and education networks including our friends in Canada to enable connectivity to more than 100 international networks.
These partnerships provide the basis for understanding how to facilitate collaborations between the US Internet2 community and counterparts in other countries
Our global partnerships have yielded important developments in new technologies. For example - the DICE collaborative is a partnership between GEANT, Internet2, CANARIE and ESnet which provides a joint forum for North American and European investment in advanced networking leadership
Our collaboration has led to the development of world-leading tools like PerfSONAR and dynamic circuit networking – which I will touch on later. Our focus in 2010 is to deliver direct services to our members as a result of our development investments
Our community has a track record of IT successes; we haven't looked at life sciences yet, but I'm pretty sure the Internet2 community's impact is even greater there.
R&E must keep constructing the conditions that spur innovation
Give innovators an environment where they’re free to try new, untested, unpopular, ridiculously challenging things
Innovation requires a big playground
An innovation platform must encourage utilization, not limit it
Life sciences research shares many of the trends we see elsewhere in big science - data set sizes growing rapidly, increased need for collaboration - but we also see a new ecosystem fueling research. At the same time, however, diminishing R&D dollars are pressuring industry and government.
Chris Dagdigian does a great job detailing how IT deals with the changes in life sciences research. I have a couple of takeaways from his talks; it's useful to see how Internet2 addresses what's going on:
Scientific instrument technology - which generates scientific data - is changing faster than the IT refresh cycle.
Organizations see the big data wave coming and are now implementing 100 G networks to get ahead of the rising tide
Organizations are going to the cloud to be able to do things they can't do on their own, not just to save money.
Centralization will not eliminate the need to move data.
Security concerns with high speed transfers and collaboration can be addressed
Virtualized infrastructure is moving to the wide area
Big science flows are more disruptive than ever to enterprise networks - there's a trend toward separating business and research networks.
One of the things we're used to in the R&E community is change in scientific data growth.
The Internet2 community has dealt with the data tsunami for many years now. The LHC shut down for 2 years to upgrade its power - annual output has jumped from 13 to 30 petabytes a year. This data is distributed throughout the world by the R&E networks. In life sciences the driver is NGS, falling rapidly in price, with a proliferation of devices generating data all over the world.
http://www.nature.com/news/large-hadron-collider-the-big-reboot-1.16095
Our network has responded
Back in 2012 we showed how a 10G link from Beijing to UC Davis could change the game. A 24 GB file that would take 26 hours to traverse the internet was transferred in 30 seconds.
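The back-of-envelope behind those numbers: 24 GB over a 10 Gbit/s link takes about 19 seconds in the ideal case (~30 s once protocol overhead is included), while the quoted 26 hours implies an effective commodity-internet rate of roughly 2 Mbit/s. That 2 Mbit/s figure is back-solved from the slide, not a measured value:

```python
def transfer_seconds(size_bytes: float, rate_bps: float) -> float:
    """Idealized transfer time for size_bytes over a link of rate_bps bits per second."""
    return size_bytes * 8 / rate_bps

GB = 1e9
# 24 GB over an assumed ~2 Mbit/s effective commodity path vs. a 10 Gbit/s R&E link
slow_hours = transfer_seconds(24 * GB, 2e6) / 3600   # ~26.7 hours
fast_secs = transfer_seconds(24 * GB, 10e9)          # 19.2 seconds, before overhead
print(round(slow_hours, 1), fast_secs)
```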
Researchers likened the difference in collaboration to going from letters to email.
So we’re seeing organizations get ahead of the tsunami by getting bigger networks. I recently helped the Department of Agriculture’s Agriculture Research Service do just this.
I like to show this slide to illustrate how much life there is beyond humans - USDA ARS has to deal with many of these species, and with how they impact our world. It shows the genome sizes of various species, with the x axis on a log scale. Humans are there at the top, one of a number of mammals the USDA is interested in.
But they are also interested in birds, crustaceans, fish, fungi, algae, bacteria and protozoans - and of course plants. And some are extremely complex: the wheat genome is orders of magnitude larger than the human genome.
Beyond genomics , these kinds of projects create huge volumes of data as well as computational bottlenecks
To attack this problem they gathered requirements in 2013, hired BioTeam to do an assessment, and we completed a 6-node science network of 10 and 100G links by the end of 2015. That was fast!
R&E collaborations are handled by the 100G links on the coasts, and another 100G feeds the new HPC center in Ames, Iowa.
You can view Internet2 as the medium for all the data and computing resources, forming a problem solving community around these high speed connections
Syngenta, a life sciences company, is a great example of an organization making the most of these connections.
They are an agribusiness with a mission to improve plant productivity; they stay on the leading edge of science through their internal research and their collaboration with the academic community.
Syngenta was challenged by many of the issues USDA saw, but on a global scale and with even more pressure to innovate.
We installed a 10G Layer 2 service that provided high-speed Direct Connect access to AWS, where they could do surge HPC and retrieve sequencing data outsourced to the academic community. They could also connect to NCSA to build and run custom pipelines, and use the connection to work with the A*STAR supercomputer center in Singapore, where they intend to build an Asian genomics center. Finally, we expect to bring up locations in Switzerland and Great Britain, completing a global research network.
I just mentioned NCSA, and this resource deserves a few seconds. NCSA does a lot of work with industry, and a comment from a VP at BP says it all.
Leveraging its talent and one of the fastest computers in the world, NCSA provides companies with a full range of services to help them innovate.
They do a lot of work in the life sciences; the one I'll note here is an Alzheimer's GWAS study with the Mayo Clinic.
In this one they handled an enormous amount of data and strong-armed the computational challenge: what would have taken 2 years at Mayo was done in 6 hours on Blue Waters.
Another incredible resource in the community is SDSC
You may know them as the home of CGHub, which holds The Cancer Genome Atlas. Note the bits-per-second growth from 1G to 15G from 2012 to 2015.
CGHub is a large-scale data repository and portal for the National Cancer Institute’s Cancer Genome Research Programs
Current capacity is 5 Petabytes, scalable to 20 Petabytes.
The Cancer Genome Atlas, one data collection of many in CGHub, by itself could produce 10 PB in the next four years.
As an illustration of how Internet2 is making network resources accessible, consider the UCSC Cancer Genomics Hub, operated by the University of California at Santa Cruz and located at the San Diego Supercomputer Center co-location facility. Without the "big pipes" provided between SDSC and Internet2, CGHub would not be able to keep pace with demand for its data.
As both users and data in the repository grew over a three year period, the bandwidth needed to support the activity grew by 15x.
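A 15x increase over three years corresponds to roughly 2.5x growth per year, since the annual multiplier is the cube root of the total:

```python
def annual_growth_factor(total_growth: float, years: float) -> float:
    """Per-year multiplier implied by total growth over a span of years."""
    return total_growth ** (1 / years)

print(round(annual_growth_factor(15, 3), 2))  # 2.47
```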
SDSC also has other important data sources like the Protein Data Bank archive.
They also have consulting services very much focused to support life sciences research
I'd also note the cloud environment they built for HHS CMS - FISMA-compliant and HIPAA-ready.
The National Labs are also a huge part of the community
Whenever I run into a Metagenomics problem I reference Jonathan Allen’s huge microbiome work with metagenomics
We also have a number of interesting efforts to facilitate collaboration and reproducibility.
RADII is an exciting project that virtualizes clouds, leveraging iRODS and virtual networks. The idea is to allow researchers, not IT, to spin up and monitor local and cloud resources - compute and network infrastructure - on demand: for example, when I need to complete a collaborative workflow and move data and compute across a number of compute resources.
Radii allows you to
represent data-centric collaborations using standard modeling mechanisms;
map data processes, computations, storage, and organizational entities onto the physical infrastructure with the click of a button
provision and de-provision infrastructure dynamically throughout the lifecycle of the collaboration.
RADII builds on the data management of iRODS and the infrastructure virtualization of ORCA and ExoGENI to give researchers control over the infrastructure that's necessary for collaboration.
Here's an example of this virtualization, with researchers at Duke, UNC, and Scripps sharing data and workflows on SDSC compute resources.
Ease of use; improved end-to-end performance as perceived by the scientists.
To enable this vision we need two technologies with a high level of programmability and automation.
A collaboration between the FDA and GW is looking to improve reproducibility by using bio-compute objects. This should accelerate regulatory approvals and reduce costs.
This represents the process for FDA submissions supported by NGS. There are a lot of opportunities for making mistakes along the way. These mistakes result in delays and costly resubmissions.
Of the challenges in gaining agreement at the end of this process, many are addressed by HIVE; its potential to improve reproducibility is the most exciting.
The HIVE platform is a big data analysis solution used by the FDA and available to industry. The bio-compute objects repository is key to reproducibility.
To get to better reproducibility, HIVE relies on a data typing engine to define metadata for the data, the computations, and both algorithms and pipelines, creating a bio-compute object related to the submission that's reusable by the FDA.
Data typing engine: a facility for registering the structure, syntax and ontologies of the information fields of objects.
Metadata type: descriptive information on the structure of data files or electronic records.
Computation metadata: description of arguments and parameters (not values) for computational analysis.
Definitions of algorithms and pipeline descriptions: descriptions of the characteristics of executable applications.
Data: the collection of actual values observed and accumulated during experimentation by a device or an observer.
Computational protocol: a well-parameterized computational pipeline designed to produce scientifically meritable outcomes with appropriate data.
Bio-compute: an instance of an actual execution of a computational protocol on a given set of data, with actual parameter values, generating identifiable outcomes/results.
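Following the definitions above, a bio-compute object can be thought of as a content-addressed record of one execution: the pipeline description, the actual parameter values, and checksums of the inputs, hashed into a stable identifier. This is only an illustrative sketch - the field names, the pipeline name, and the hashing scheme are invented for this example, not HIVE's actual format:

```python
import hashlib
import json

def make_biocompute_object(pipeline: str, parameters: dict, input_checksums: dict) -> dict:
    """Record one concrete pipeline execution so a reviewer can re-run and compare.
    The id is a digest over everything that determined the result."""
    record = {
        "pipeline": pipeline,          # algorithm / pipeline description
        "parameters": parameters,      # actual values used, not just argument names
        "inputs": input_checksums,     # checksums of the data that was analyzed
    }
    # sort_keys makes the serialization, and hence the id, deterministic
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return {"id": digest, **record}

bco = make_biocompute_object(
    pipeline="variant-calling-v2",                         # hypothetical name
    parameters={"min_quality": 30, "aligner": "bwa-mem"},
    input_checksums={"sample.fastq": "ab12cd34"},
)
print(bco["id"][:12])  # same inputs always yield the same id
```

Because the identifier is deterministic, a reviewer who re-creates the object from the submitted pipeline, parameters, and inputs can confirm it matches the submitted one.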
HIVE would help by recording the parameters of the analysis as bio-compute objects (or using existing ones in the public repository) and sharing them with the FDA so they can verify the analysis.
Data-forming is done using a public HIVE and integrated with your usual analytic tools. The resulting bio-compute objects are submitted to the FDA; these bio-compute objects are used in the FDA HIVE to validate the results of the submission.
Finally, I'll say a few words about federated identity.
Over 10 years ago the R&E community recognized the importance of trust in collaborations and created the InCommon federated identity management solution.
We now have a leading solution with around 8 million users. Please stop by the booth for more information.