The document discusses big data use cases and requirements. It provides 51 detailed use cases across various domains that generate many terabytes to petabytes of data. It also describes extracting 437 specific requirements from the use cases and analyzing trends. The next steps involve matching requirements to a reference architecture and prioritizing use cases for implementation.
big_data_casestudies_2.ppt
1. Big Data Use Cases and Requirements
Ilkay ALTINTAS and Geoffrey FOX - March, 2014
2. Requirements and Use Case Subgroup
The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains.
Tasks
• Gather use case input from all stakeholders
• Derive Big Data requirements from each use case (a sketch of this step follows the slide)
• Analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployments
• Work with the Reference Architecture subgroup to validate the requirements against the reference architecture
• Develop a set of general patterns capturing the “essence” of the use cases (to do)
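To make the second task above concrete, here is a minimal Python sketch of how a single use-case record might be mapped onto general requirements. The field names, mapping rules, and record layout are illustrative assumptions rather than the actual NIST template (which has 26 fields); the example values are taken from the Census Bureau use case later in this deck.

```python
# Illustrative sketch only: a simplified use-case record (not the real 26-field
# NIST template) and a toy rule-based mapping from fields to requirements.
use_case = {
    "title": "Census Bureau Statistical Survey Response Improvement",
    "domain": "Government Operation",
    "data_volume": "~1 PB",
    "data_velocity": "~150 million records streamed during the decennial census",
    "software": ["Hadoop", "Spark", "Hive", "Cassandra"],
    "security": "confidential, secure, auditable",
}

def derive_requirements(uc):
    """Map selected template fields onto general Big Data requirements."""
    reqs = []
    if "PB" in uc["data_volume"]:
        reqs.append("scalable distributed storage")
    if "streamed" in uc["data_velocity"]:
        reqs.append("streaming ingest and near-real-time processing")
    if "confidential" in uc["security"]:
        reqs.append("end-to-end security and auditability")
    return reqs

print(derive_requirements(use_case))
```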
3. Use Case Template
• 26 fields completed for 51 use cases (a quick tally check follows this list)
– Government Operation: 4
– Commercial: 8
– Defense: 3
– Healthcare and Life Sciences: 10
– Deep Learning and Social Media: 6
– The Ecosystem for Research: 4
– Astronomy and Physics: 5
– Earth, Environmental and Polar Science: 10
– Energy: 1
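Since these tallies are the basis for the 51 use cases discussed throughout the deck, a trivial Python check (domain counts copied verbatim from this slide) confirms they sum to 51:

```python
# Domain counts as listed on this slide.
counts = {
    "Government Operation": 4,
    "Commercial": 8,
    "Defense": 3,
    "Healthcare and Life Sciences": 10,
    "Deep Learning and Social Media": 6,
    "The Ecosystem for Research": 4,
    "Astronomy and Physics": 5,
    "Earth, Environmental and Polar Science": 10,
    "Energy": 1,
}

total = sum(counts.values())
assert total == 51, f"expected 51 use cases, got {total}"
print(total)  # 51
```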
4. 51 Detailed Use Cases: Many TBs to Many PBs
• Government Operation: National Archives and Records Administration, Census Bureau
• Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense: Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media: Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research: Metadata, Collaboration, Language Translation, Light source experiments
• Astronomy and Physics: Sky Surveys compared to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science: Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy: Smart grid
The next step involves matching the extracted requirements to the reference architecture; alternatively, develop a set of general patterns capturing the “essence” of the use cases.
5. Some Trends
• Practitioners consider themselves Data Scientists
• Images are a major source of Big Data
– Radar
– Light Synchrotrons
– Phones
– Bioimaging
• Hadoop and HDFS are dominant
• Business – the main emphasis at NIST – is interested in analytics and assumes HDFS
• Academia is also extremely interested in data management
• Clouds v. Grids
6. Example Use Case I: Summary of Genomics
• Application (Use Case 19): NIST Genome in a Bottle Consortium
– Integrates data from multiple sequencing technologies and methods
– Develops highly confident characterization of whole human genomes as reference materials
– Develops methods to use these Reference Materials to assess the performance of any genome sequencing run
• Current Approach:
– The ~40 TB NFS storage at NIST is full; there are also PBs of genomics data at NIH/NCBI.
– Open-source sequencing bioinformatics software from academic groups is used on a 72-core cluster at NIST, supplemented by larger systems at collaborators.
• Futures:
– DNA sequencers can generate ~300 GB of compressed data per day, a volume that has increased much faster than Moore’s Law (a back-of-the-envelope calculation follows this slide).
– Future data could include other ‘omics’ measurements, which will be even larger than DNA sequencing. Clouds have been explored.
Healthcare/Life Sciences
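A back-of-the-envelope calculation, using only the figures on this slide, shows why the ~40 TB store is full: one sequencer producing ~300 GB of compressed data per day yields roughly 110 TB per year and would fill the NIST NFS store in about four and a half months. A minimal sketch:

```python
# Figures taken from the slide; decimal units (1 TB = 1000 GB) for simplicity.
DAILY_OUTPUT_TB = 0.3   # ~300 GB compressed per sequencer per day
NFS_CAPACITY_TB = 40    # ~40 TB NFS store at NIST

yearly_output_tb = DAILY_OUTPUT_TB * 365
days_to_fill = NFS_CAPACITY_TB / DAILY_OUTPUT_TB

print(f"~{yearly_output_tb:.0f} TB/year from one sequencer")  # ~110 TB/year
print(f"~{days_to_fill:.0f} days to fill the 40 TB store")    # ~133 days
```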
7.
Example Use Case II: Census Bureau Statistical Survey Response Improvement (Adaptive Design)
• Application: Survey costs are increasing as survey response declines.
  – Uses advanced “recommendation system techniques” that are open and scientifically objective.
  – Data are mashed up from several sources and from historical survey paradata (administrative data about the survey) to drive operational processes.
  – The end goal is to increase quality and reduce the cost of field surveys.
• Current Approach:
  – ~1 PB of data coming from surveys and other government administrative sources.
  – Data can be streamed: during the decennial census, approximately 150 million records of field data are transmitted as a continuous stream.
  – All data must be both confidential and secure.
  – All processes must be auditable for security and confidentiality, as required by various legal statutes.
  – Data quality should be high and statistically checked for accuracy and reliability throughout the collection process.
  – Uses Hadoop, Spark, Hive, R, SAS, Mahout, AllegroGraph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig.
• Futures:
  – Analytics need to be developed that give more detailed statistical estimates, in nearer real time, at lower cost.
  – The reliability of estimates from such “mashed up” sources still must be evaluated.
Government Operation
8.
Example Use Case III: Large-scale Deep Learning
• Application: 26: Large-scale Deep Learning
  – Large models (e.g., neural networks with more neurons and connections) combined with large datasets are increasingly the top performers in benchmark tasks for vision, speech, and natural language processing.
  – One needs to train a deep neural network from a large (>>1 TB) corpus of data (typically imagery, video, audio, or text).
  – Such training procedures often require customization of the neural network architecture, learning criteria, and dataset pre-processing.
  – In addition to the computational expense demanded by the learning algorithms, the need for rapid prototyping and ease of development is extremely high.
• Current Approach:
  – The largest applications so far are image recognition and scientific studies of unsupervised learning, with 10 million images and up to 11 billion parameters on a 64-GPU HPC InfiniBand cluster.
  – Both supervised (using existing classified images) and unsupervised applications have been investigated.
• Futures:
  – Large datasets of 100 TB or more may be necessary to exploit the representational power of larger models.
  – Training a self-driving car could take 100 million images at megapixel resolution.
  – Deep learning shares many characteristics with the broader field of machine learning. The paramount requirements are high computational throughput for mostly dense linear algebra operations, and extremely high productivity for researcher exploration.
  – One needs integration of high-performance libraries with high-level (Python) prototyping environments, as in the sketch below.
Deep Learning and Social Media
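To make the last point concrete, here is a minimal sketch of the dense linear algebra that dominates deep-network training, written in plain NumPy as a stand-in for the high-performance (GPU) libraries a real system would call; the layer size, batch size, and learning rate are illustrative assumptions, not figures from the use case.

import numpy as np

# One fully connected layer trained by gradient descent: the forward and
# backward passes are both dense matrix multiplies, which is where the
# computational-throughput requirement comes from.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 1024))        # mini-batch of 256 examples
y = rng.standard_normal((256, 10))          # regression targets
W = rng.standard_normal((1024, 10)) * 0.01  # layer weights

for step in range(100):
    pred = X @ W                            # forward pass (dense GEMM)
    grad = X.T @ (pred - y) / len(X)        # backward pass (dense GEMM)
    W -= 0.01 * grad                        # parameter update

print("final mean squared error:", float(np.mean((X @ W - y) ** 2)))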
9.
Example Use Case IV: EISCAT 3D Incoherent Scatter Radar System
• Application: EISCAT 3D incoherent scatter radar system
  – EISCAT: European Incoherent Scatter Scientific Association.
  – Research on the lower, middle and upper atmosphere and ionosphere using the incoherent scatter radar technique.
  – This technique is the most powerful ground-based tool for these research applications.
  – EISCAT studies instabilities in the ionosphere, as well as investigating the structure and dynamics of the middle atmosphere. It is also a diagnostic instrument in ionospheric modification experiments, with the addition of a separate heating facility.
  – EISCAT currently operates 3 of the 10 major incoherent scatter radar instruments worldwide, with its facilities in the Scandinavian sector, north of the Arctic Circle.
• Current Approach:
  – The current EISCAT radars generate data at terabytes-per-year rates and present no special challenges.
• Futures:
  – The next-generation radar, EISCAT_3D, will consist of a core site with transmitting and receiving radar arrays, and four sites with receiving antenna arrays some 100 km from the core.
  – The fully operational 5-site system will generate several thousand times the data volume of the current EISCAT system, reaching 40 PB/year in 2022, and is expected to operate for 30 years.
  – The EISCAT 3D data e-Infrastructure plans to use high-performance computers for central-site data processing and high-throughput computers for mirror-site data processing.
  – Downloading the full data is not time-critical, but operations require real-time information about certain pre-defined events to be sent from the sites to the operations center, and a real-time link from the operations center to the sites to change the radar operating mode with immediate effect.
Astronomy and Physics
10.
Example Use Case V: Consumption Forecasting in Smart Grids
• Application: 51: Consumption forecasting in Smart Grids
  – Predict energy consumption for customers, transformers, sub-stations and the electrical grid service area, using smart meters that provide measurements every 15 minutes at the granularity of individual consumers within the service area of smart power utilities.
  – Combines the head-end of smart meters (distributed), utility databases (customer information, network topology; centralized), US Census data (distributed), NOAA weather data (distributed), micro-grid building information systems (centralized), and micro-grid sensor networks (distributed).
  – This generalizes to real-time, data-driven analytics for time series from cyber-physical systems.
• Current Approach:
  – GIS-based visualization.
  – Data is around 4 TB a year for a city with 1.4 million sensors, as in Los Angeles.
  – Uses R/Matlab, Weka, and Hadoop software.
  – Significant privacy issues require anonymization by aggregation.
  – Combines real-time and historic data with machine learning to predict consumption (see the sketch below).
• Futures:
  – Widespread deployment of smart grids, with new analytics integrating diverse data and supporting curtailment requests. Mobile applications for client interactions.
Energy
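A minimal sketch of the two ideas just mentioned, anonymization by aggregation and data-driven consumption forecasting, using synthetic 15-minute readings; the meter count, gamma-distributed loads, and lag-regression model are illustrative assumptions, not the utility's actual pipeline.

import numpy as np

rng = np.random.default_rng(1)
meters = rng.gamma(2.0, 0.5, size=(1000, 96))    # 1000 meters x 96 readings/day (kWh)

# (1) Anonymization by aggregation: roll individual meters up to the feeder
#     level before any analysis leaves the utility.
feeder_load = meters.sum(axis=0)

# (2) Forecast the next 15-minute interval from the previous hour of readings
#     with ordinary least squares on lagged values.
lags = 4
X = np.column_stack([feeder_load[i:len(feeder_load) - lags + i] for i in range(lags)])
y = feeder_load[lags:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("next-interval forecast (kWh):", float(feeder_load[-lags:] @ coef))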
11.
Example Use Case VI: Pathology Imaging
• Application: 17: Pathology Imaging / Digital Pathology
• Current Approach:
  – 1 GB raw image data + 1.5 GB analytical results per 2D image.
  – MPI for image analysis; MapReduce + Hive with spatial extensions on supercomputers and clouds.
  – GPUs are used effectively.
  – [Figure: architecture of Hadoop-GIS, a spatial data warehousing system over MapReduce that supports spatial analytics for analytical pathology imaging.]
• Futures:
  – Recently, 3D pathology imaging has been made possible through 3D laser technologies or by serially sectioning hundreds of tissue sections onto slides and scanning them into digital images.
  – Segmenting 3D microanatomic objects from registered serial images could produce tens of millions of 3D objects from a single image, providing a deep “map” of human tissues for next-generation diagnosis.
  – 1 TB raw image data + 1 TB analytical results per 3D image, and 1 PB of data per moderate-sized hospital per year.
Healthcare/Life Sciences
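The spatial analytics themselves reduce to queries like "which segmented objects fall inside this region of interest?". Below is a minimal single-node Python sketch of such a bounding-box query, illustrating the kind of operation Hadoop-GIS parallelizes over MapReduce; the object coordinates are made up, not real segmentation output.

# Axis-aligned bounding boxes as (x_min, y_min, x_max, y_max).
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

segmented_nuclei = [(10, 10, 12, 12), (50, 48, 53, 51), (200, 310, 204, 313)]
region_of_interest = (0, 0, 100, 100)

hits = [obj for obj in segmented_nuclei if intersects(obj, region_of_interest)]
print(f"{len(hits)} objects intersect the region of interest")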
12.
Example Use Case VII: Metagenomics
• Application: 20: Comparative analysis for metagenomes and genomes
  – Given a metagenomic sample: (1) determine the community composition in terms of other reference isolate genomes; (2) characterize the function of its genes; (3) begin to infer possible functional pathways; (4) characterize similarity or dissimilarity with other metagenomic samples; (5) begin to characterize changes in community composition and function due to changes in environmental pressures; (6) isolate sub-sections of data based on quality measures and community composition.
• Current Approach:
  – An integrated comparative analysis system for metagenomes and genomes, fronted by an interactive Web UI with core data, backend precomputations, and batch job submission from the UI.
  – Provides an interface to standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors, …); a minimal sketch of invoking such a tool follows below.
• Futures:
  – Management of the heterogeneity of biological data is currently performed by an RDBMS (Oracle). Unfortunately, it does not scale even for the current 50 TB data volume.
  – NoSQL solutions aim to provide an alternative, but unfortunately they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and sometimes have robustness issues.
Healthcare/Life Sciences
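As a concrete illustration of the backend batch computations behind the Web UI, here is a minimal Python sketch that invokes one of the standard tools listed above (BLAST) as an external command; it assumes a local NCBI BLAST+ installation and a pre-built reference database, and the file names are placeholders.

import subprocess

def run_blast(query_fasta, database, out_tsv):
    # blastn compares nucleotide reads against a nucleotide database;
    # -outfmt 6 produces tabular output that is easy to post-process.
    subprocess.run(
        ["blastn", "-query", query_fasta, "-db", database,
         "-outfmt", "6", "-out", out_tsv],
        check=True,
    )

run_blast("sample_reads.fasta", "reference_isolates", "hits.tsv")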
13.
Example Use Case VIII: Consumer Photography
• Application: 27: Organizing large-scale, unstructured collections of consumer photos
  – Produce 3D reconstructions of scenes using collections of millions to billions of consumer images, where neither the scene structure nor the camera positions are known a priori.
  – Use the resulting 3D models to allow efficient browsing of large-scale photo collections by geographic position.
  – Geolocate new images by matching them to 3D models, and perform object recognition on each image.
  – 3D reconstruction is posed as a robust non-linear least squares optimization problem in which observed relations between images are constraints and the unknowns are the 6-D camera pose of each image and the 3-D position of each point in the scene (a toy sketch follows below).
• Current Approach:
  – Hadoop cluster with 480 cores processing data for initial applications.
  – Note that there are over 500 billion images on Facebook and over 5 billion on Flickr, with over 500 million images added to social media sites each day.
Deep Learning and Social Media
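To illustrate the robust non-linear least squares formulation mentioned above, here is a toy Python sketch that recovers a single unknown 3-D point from noisy 2-D projections through cameras whose poses are assumed known; real bundle adjustment also solves for the 6-D pose of every camera, so this keeps only the robust-residual idea, and all numbers are illustrative.

import numpy as np
from scipy.optimize import least_squares

true_point = np.array([1.0, 0.5, 4.0])
camera_positions = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [2, 1, 0]], float)

def project(point, cam):
    q = point - cam                      # point in an axis-aligned camera frame
    return q[:2] / q[2]                  # pinhole projection onto the image plane

observations = np.array([project(true_point, c) for c in camera_positions])
observations[3] += 0.5                   # one badly matched image (an outlier)

def residuals(point):
    return np.concatenate([project(point, c) - o
                           for c, o in zip(camera_positions, observations)])

# The robust loss keeps the single bad observation from dragging the solution off.
fit = least_squares(residuals, x0=np.array([0.0, 0.0, 2.0]), loss="soft_l1")
print("estimated 3-D point:", fit.x)     # close to true_point despite the outlier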
14.
27: Organizing large-scale, unstructured collections of consumer photos II
• Futures:
  – Many analytics are needed, including feature extraction, feature matching, and large-scale probabilistic inference, which appear in many or most computer vision and image processing problems, including recognition, stereo resolution, and image denoising.
  – Need to visualize large-scale 3-D reconstructions and navigate large-scale collections of images that have been aligned to maps.
Deep Learning and Social Media
15.
Example Use Case IX: Twitter Data
• Application: 28: Truthy: Information diffusion research from Twitter data
  – Understanding how communication spreads on socio-technical networks.
  – Detecting potentially harmful information spread at an early stage (e.g., deceptive messages, orchestrated campaigns, untrustworthy information).
• Current Approach:
  – (1) Acquisition and storage of a large volume (30 TB a year, compressed) of continuous streaming data from Twitter (~100 million messages/day, ~500 GB data/day and increasing).
  – (2) Near-real-time analysis of the data for anomaly detection, stream clustering, signal classification and online learning.
  – (3) Data retrieval, big data visualization, data-interactive Web interfaces, and a public API for data querying. Uses Python/SciPy/NumPy/MPI for data analysis. Information diffusion, clustering, and dynamic network visualization capabilities already exist.
• Futures:
  – Truthy plans to expand to incorporate Google+ and Facebook.
  – Needs to move towards Hadoop/IndexedHBase and HDFS distributed storage.
  – Use Redis as an in-memory database to act as a buffer for real-time analysis (see the sketch below).
  – Needs streaming clustering, anomaly detection, and online learning.
Deep Learning and Social Media
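A minimal sketch of the "Redis as an in-memory buffer" idea: a collector pushes incoming messages onto a Redis list and a near-real-time worker pops them for clustering or anomaly detection. It assumes a Redis server running locally and the redis-py client; the message fields are illustrative, not Truthy's actual schema.

import json
import redis

r = redis.Redis()                                     # local Redis instance

def collect(message):
    # Producer side: buffer each incoming message in memory.
    r.lpush("truthy:stream", json.dumps(message))

def analyze_next():
    # Consumer side: block briefly for the next buffered message.
    item = r.brpop("truthy:stream", timeout=1)
    if item is None:
        return None
    _key, payload = item
    return json.loads(payload)                        # hand off to the online learner

collect({"user": "u1", "text": "example tweet", "ts": 1396310400})
print(analyze_next())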
17.
Requirements Gathering
• Data sources
  – data size, file formats, rate of growth, at rest or in motion, etc.
• Data lifecycle management
  – curation, conversion, quality checks, pre-analytic processing, etc.
• Data transformation
  – data fusion/mashup, analytics
• Capability infrastructure
  – software tools, platform tools, hardware resources such as storage and networking
• Security & privacy; and data usage
  – processed results in text, table, visual, and other formats
A total of 437 specific requirements were extracted, grouped under 35 high-level generalized requirement summaries.
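One simple way to picture the outcome is a mapping from the five categories above to generalized requirement summaries; the sketch below uses placeholder entries, not the actual 35 NIST summaries.

# Placeholder grouping of extracted requirements by category; the entries are
# illustrative, not the real NIST generalized requirement summaries.
requirements = {
    "Data sources": ["handle high-velocity ingest", "support diverse file formats"],
    "Data lifecycle management": ["curation and quality checks", "format conversion"],
    "Data transformation": ["fusion/mashup of sources", "scalable analytics"],
    "Capability infrastructure": ["elastic storage and networking", "platform tools"],
    "Security, privacy, and usage": ["auditable access", "results as text/table/visual"],
}
print(sum(len(v) for v in requirements.values()), "example requirements in",
      len(requirements), "categories")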
18.
Interaction Between Subgroups
• Technology Roadmap
• Requirements & Use Cases
• Definitions & Taxonomies
• Reference Architecture
• Security & Privacy
Due to time constraints, activities were carried out in parallel.
19.
Reference Architecture
• Multiple stacks of technologies
  – open and proprietary
• Provide example stacks for different applications
• Come up with usage patterns and best practices
20.
Next Steps
• Approach for RDA to implement use cases and for NBD to identify an abstract interface
• Planning for implementation of use cases
  – Resource availability
  – Application-specific support
  – Computation and storage leverage
• Multiple potential directions
  – Prioritization is one of the goals for this meeting.
21.
Key Links
• Use cases listing: http://bigdatawg.nist.gov/usecases.php
• Latest version of the document (dated Oct 12, 2013): http://bigdatawg.nist.gov/_uploadfiles/M0245_v5_6066621242.docx