The document discusses the evolution of computer architectures from early technological achievements like the transistor and integrated circuit. It describes increasing transistor densities following Moore's Law. Future technologies will focus on increasing core counts while decreasing cycle times and voltages. Performance will come from parallelism rather than clock speed increases due to heat limitations. The document outlines challenges in scaling to exascale systems by 2018.
The document discusses NVIDIA's role in powering the world's fastest supercomputers. It notes that the Summit supercomputer at Oak Ridge National Laboratory is now the fastest system, powered by 27,648 Volta Tensor Core GPUs to achieve over 122 petaflops. NVIDIA GPUs also power 17 of the world's 20 most energy-efficient supercomputers, including Europe's fastest system, Piz Daint, and Japan's fastest, the ABCI supercomputer. Over 550 applications are now accelerated using NVIDIA GPUs.
Tianhe-2 is the world's fastest supercomputer according to top500.org. This presentation is based on Tianhe-2 and highlights its software architecture, hardware architecture, specifications, motivating factors, and several other key aspects.
This presentation was prepared by me together with four other Computer Science undergraduates from the University of Colombo School of Computing, Sri Lanka.
Fugaku is a Japanese supercomputer utilizing Fujitsu's A64FX CPU. It was designed through an iterative co-design process between application developers and Fujitsu to achieve over 100x performance gain compared to the previous K computer within a 30-40MW power budget. The A64FX CPU utilizes 7nm technology and features 48 Arm-based cores with high bandwidth memory to achieve superior floating point and memory bandwidth performance efficiently. Early evaluations show Fugaku meeting performance and power targets and outperforming x86 processors for real applications.
Supercomputers can perform calculations much faster than ordinary computers due to their high speeds and large memory. They are used for complex tasks like weather forecasting and scientific research that require extensive calculations. Supercomputers have evolved over time from single processors to massively parallel systems with thousands of processors. Their power is now measured in petaflops, or quadrillions of calculations per second. The top supercomputers are currently based in China and the United States, and they use Linux operating systems and programming languages like Fortran and C++.
This document discusses how GPU clusters are accelerating scientific discovery by providing significantly higher performance and energy efficiency compared to CPU-only systems. Several examples are given of scientific applications such as molecular dynamics simulations, weather modeling, computational fluid dynamics, lattice quantum chromodynamics, and protein-DNA docking that have seen 10-100x speedups using GPUs. The world's fastest and most powerful supercomputers like Titan and Blue Waters are incorporating thousands of GPUs to deliver petascale performance for open science.
The document describes the TSUBAME2 supercomputer system at the Tokyo Institute of Technology. It has the following key aspects:
- It has over 17 PFlops of computing power across 1408 thin computing nodes, 24 medium nodes, and 10 fat nodes with Intel processors and NVIDIA GPUs.
- It has a total storage capacity of 11PB including 7PB of HDD storage, 4PB of tape storage, and 200TB of SSD storage.
- It utilizes a high-performance Infiniband QDR network with 12 core switches and over 180 edge switches for fast interconnectivity between nodes and storage.
This document provides an overview of supercomputers including their common uses, challenges, history and top systems. Some key points:
- Supercomputers are used for highly complex tasks like weather forecasting, climate modeling, and simulating nuclear weapons. They can process vast amounts of data and perform quadrillions of calculations per second.
- Major challenges include cooling systems to manage the large amounts of heat generated and high-speed data transfer between components.
- The US and Japan have historically dominated supercomputing. Early systems included the CDC 6600 (1964) and Cray-1 (1976). Modern systems use thousands of processors networked together.
- The top supercomputers today include China's Tianhe
NVIDIA's Jetson platform provides an AI computing solution for applications at the edge by running deep neural networks on low-power modules like the Jetson TX1. The Jetson TX1 module has powerful GPU processing capable of over 1 teraflop/s while consuming under 10 watts, making it suitable for applications in areas like industrial automation, robotics, smart cities, and more. Developers can use the Jetpack SDK and resources like the Deep Learning Institute to train models on servers and deploy them to Jetson modules for running AI inference in end products at the edge.
A Platform for Accelerating Machine Learning Applications (NVIDIA Taiwan)
Robert Sheen from HPE gave a presentation on machine learning applications and accelerating deep learning. He provided a quick introduction to neural networks, discussing their structure and how they are inspired by biological neurons. Deep learning requires high performance computing due to its computational intensity during training. Popular deep learning frameworks like CogX, which provide tools and libraries to help build and optimize neural networks, were also discussed. Finally, several enterprise use cases for machine learning and deep learning were highlighted, such as in finance, healthcare, security, and geospatial applications.
The CRAY-1 was the world's fastest supercomputer in the 1970s. Unveiled in 1976, it had a clock speed of 80MHz and computational rates of 138 MFLOPS sustained and 250 MFLOPS in bursts. The CRAY-1 was a vector processor that used techniques like chaining and vectorization to achieve high performance. It had a unique physical design of 12 wedge-shaped columns and pioneered new cooling technologies. The CRAY-1 established the class of supercomputers and paved the way for modern high performance computing.
Hi guys!!!
This is a presentation on supercomputers which gives you a better overview of supercomputers, and I hope that this ppt has been useful for you.
If you have any doubts on this topic, don't hesitate to ask in the comments section.
THANK YOU!!!!
AI and Big Data Infrastructure Leveraging the Latest HPC Technology: Tokyo Tech TSUBAME3.0 and AIST ABCI (NVIDIA Japan)
- The document discusses the latest HPC technologies used in AI/Big Data infrastructures such as TSUBAME3.0 at Tokyo Institute of Technology and ABCI at AIST.
- It provides an overview of the capabilities and achievements of these supercomputers, including TSUBAME2.0 receiving the 2011 ACM Gordon Bell Prize.
- It emphasizes that future supercomputers need to focus on "BYTES" capabilities like bandwidth and capacity to better support large-scale data processing for AI/Big Data applications.
The document discusses supercomputers and provides details about some of the fastest, including Tianhe-2, Titan, and Sequoia. Tianhe-2 has an average performance of 33.86 petaflops and a peak of almost 55 petaflops, making it the fastest. Titan is being replaced by Summit in 2017, which will have over 100 petaflops of power. Sequoia, located at Lawrence Livermore National Laboratory, is also being replaced in 2017. The document also provides information on the components of supercomputers, including RAM and CPUs, and discusses options for building your own smaller supercomputer cluster.
Customization of a Deep Learning Accelerator, Based on NVDLA (Shien-Chun Luo)
This document discusses customizing a deep learning accelerator. It begins with a demonstration of object detection using a Tiny YOLO v1 model on an FPGA-based prototype. It then discusses designing a high-efficiency accelerator with three steps: 1) increasing MAC processing elements and utilization, 2) increasing data supply, and 3) improving energy efficiency. Various neural network models are profiled to analyze memory bandwidth and computational power tradeoffs. The document proposes a customizable architecture and discusses solutions like layer fusion, quantization-aware training, and post-training quantization. Performance estimates using an equation-based profiler for sample models are provided to demonstrate the customized accelerator design.
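As a rough illustration of the post-training quantization idea mentioned above (not the accelerator's actual scheme; the symmetric int8 scaling below is a generic, assumed choice), a minimal sketch might look like this:

```python
import numpy as np

# Minimal sketch of symmetric post-training quantization to int8.
# The scale choice (max-abs weight mapped to 127) is an assumed, generic scheme.
def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)   # stand-in for trained layer weights
q, s = quantize_int8(w)
print("mean abs quantization error:", np.abs(w - dequantize(q, s)).mean())
```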
Towards Exascale Simulations of Stellar Explosions with FLASH (Ganesan Narayanasamy)
- ORNL is managed by UT-Battelle for the US Department of Energy and conducts research including simulations of stellar explosions using the FLASH code.
- The research aims to prepare FLASH to run on the upcoming Summit supercomputer by accelerating components like the nuclear kinetics module using GPUs.
- Preliminary results show significant speedups from using GPUs for large nuclear reaction networks that were previously too computationally expensive.
Designing High Performance Computing Architectures for Reliable Space Applications (Fisnik Kraja)
This document summarizes Fisnik Kraja's PhD defense on designing high performance computing architectures for reliable space applications. Kraja proposed an architecture using parallel processing nodes connected via a radiation-hardened management unit. Benchmarking of the 2DSSAR image reconstruction application covered optimizations for shared-memory, distributed-memory, and heterogeneous CPU/GPU systems. The best performance was achieved using a heterogeneous node with a multi-core CPU and dual GPUs, providing a 34.46x speedup. Kraja's conclusions recommended a design built from powerful shared-memory parallel processing nodes, each with CPUs and GPUs, resorting to distributed memory only if multiple nodes are needed.
This document summarizes information about supercomputers, including their history, uses, challenges, and top models. It notes that supercomputers can perform specialized calculations like weather forecasting and nuclear testing simulations. The earliest supercomputers date to the 1940s, while modern ones like Sunway Taihulight in China can perform trillions of calculations per second. Supercomputers run optimized operating systems and require complex cooling. The document lists the top 5 supercomputers as of 2018 and notes Bangladesh's plans to acquire a supercomputer to help with flood prediction and data analysis.
Supercomputers are the fastest and most powerful computers designed to solve complex problems quickly. They were introduced in the 1960s and are used for nuclear simulation, structural analysis, crash analysis, climatic predictions, cryptography, and computational chemistry. Modern supercomputer architectures trade processor speed for low power consumption to support more processors at room temperature. The IBM Blue Gene supercomputer and K computer are examples of large, energy efficient supercomputing systems that use different processor and cooling approaches.
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data (Rakuten Group, Inc.)
Rakuten Technology Conference 2013
"TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data"
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
The document provides an overview of the history and evolution of semiconductors and integrated circuits from 1947 to present. It discusses key inventions and milestones such as the transistor in 1947, the integrated circuit in 1961, and Moore's Law predicting transistor doubling every two years. It also covers different chip design approaches including full custom, standard cell, gate arrays, and FPGAs, along with their relative costs, performance, and design complexities.
The document discusses parallel computing and multicore processors. It notes that Berkeley researchers believe multicore is the future of computing. It also discusses building an academic "manycore" research system using FPGAs to allow researchers to experiment with parallel algorithms, compilers, and programming models on thousands of processor cores. This would help drive innovation and avoid long waits between hardware and software iterations.
The document discusses the emergence of computation for interdisciplinary large data analysis. It notes that exponential increases in computational power and data are driving changes in science and engineering. Computational modeling is becoming a third pillar of science alongside theory and experimentation. However, continued increases in clock speeds are no longer feasible due to power constraints, necessitating the use of multi-core processors and parallelism. This is driving changes in software design to expose parallelism.
This document discusses how moving computationally intensive algorithms to Maxeler dataflow hardware can provide significant speedups, reductions in power consumption and space, while freeing up funds. Specifically, it notes that if a computer center spends €50M annually on electricity but reduces power usage 20x by using Maxeler, €47.5M could be saved. This money could pay the salaries of over 1000 PhD students per year.
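The arithmetic behind that claim is straightforward; a minimal check (figures from the summary above, the per-student division is my inference):

```python
# Savings from a 20x reduction in a 50M EUR annual electricity bill,
# and what that implies per PhD student if spread over 1000 students.
annual_power_bill_eur = 50e6
reduction_factor = 20

savings = annual_power_bill_eur * (1 - 1 / reduction_factor)
print(f"annual savings: {savings / 1e6:.1f} M EUR")            # 47.5 M EUR
print(f"per student (1000 students): {savings / 1000:,.0f} EUR/year")
```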
Mateo Valero - Big Data: From Scientific Research to Business Management (Fundación Ramón Areces)
On 3 July 2014 we held a conference at the Fundación Ramón Areces under the theme "Big Data: from scientific research to business management". It examined the challenges and opportunities of Big Data in the social sciences, economics, and business management. Speakers included experts from the London School of Economics, BBVA, Deloitte, the Universities of Valencia and Oviedo, and the Barcelona Supercomputing Center (Centro Nacional de Supercomputación)...
Evolution of Computing Microprocessors and SoCs (azmathmoosa)
The document discusses the evolution of microprocessors from Intel's early 4004 chip in 1971 to modern multi-core processors. It highlights several generations of Intel x86 processors including the 4004, 8086, 80286, 80386, 80486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and later processors using the Core microarchitecture. Each new generation brought improvements like higher clock speeds, additional instruction sets, and architectural changes like pipelining to improve performance. The Pentium 4 introduced the NetBurst microarchitecture with a 20-stage pipeline and new capabilities like hyperthreading.
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip (see the sketch after this list).
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
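A rough sketch of what the 200x figure in point 2 implies at exascale; the 10 pJ per on-chip operation below is an assumed round number, not a figure from the talk:

```python
# Why off-chip data movement dominates the power budget at exascale.
PJ_PER_ON_CHIP_OP = 10.0      # assumed energy per on-chip operation (pJ)
OFF_CHIP_PENALTY = 200        # ratio quoted in the summary above
OPS_PER_SECOND = 1e18         # one exaflop/s

compute_mw = PJ_PER_ON_CHIP_OP * 1e-12 * OPS_PER_SECOND / 1e6
movement_mw = compute_mw * OFF_CHIP_PENALTY   # if every operand came from off-chip

print(f"on-chip compute: ~{compute_mw:.0f} MW")                 # ~10 MW
print(f"worst-case off-chip movement: ~{movement_mw:.0f} MW")   # ~2000 MW
```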
The Yellow Brick Road of Semiconductor Technology
The talk provides a historical perspective on how the computer industry has taken advantage of Moore's Law and how we got to the era of multi-core processors. The talk will also address some of the challenges facing the industry in the future.
This document provides an overview of integrated circuit technology. It discusses the history of ICs from early mechanical computers to modern microprocessors containing billions of transistors. It explains why ICs were developed, including benefits like smaller size, higher speed, lower power consumption, and reduced manufacturing costs compared to discrete components. The document also summarizes different IC design approaches like full custom, standard cell, and gate array designs as well as classification of ICs by technology, design type, size, and other attributes. Finally, it provides examples of modern ICs and projections for continued advancement and scaling of IC technology.
In this video from the Argonne Training Program on Extreme-Scale Computing 2019, Jeffrey Vetter from ORNL presents: The Coming Age of Extreme Heterogeneity.
"In this talk, I'm going to talk about the high-level trends guiding our industry. Moore’s Law as we know it is definitely ending for either economic or technical reasons by 2025. Our community must aggressively explore emerging technologies now!"
Watch the video: https://wp.me/p3RLHQ-lic
Learn more: https://ft.ornl.gov/~vetter/
and
https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
The document summarizes Kraken, the second most powerful academic supercomputer. Kraken was funded by the National Science Foundation for $65 million and developed by the University of Tennessee, Knoxville. It enhances computational power for research and supports simulation research in fields like astrophysics, climate science, earth science and materials science. Kraken uses AMD quad core processors, integrated memory controllers and a Cray SeaStar2 interconnect in a 3D torus configuration running Linux and providing over 166 teraflops of processing power.
This document discusses challenges and opportunities for end nodes with multigigabit networking. It covers increasing bandwidth capabilities through technologies like DWDM and 10GbE. It also examines hardware challenges for processor, memory, and I/O buses. Software challenges discussed include zero-copy networking, ULNI/OS bypass, and network path pipelining. The document also summarizes network protocols like AQM, ECN, MPLS and their roles in high-speed networking.
This document discusses a lecture on hardware acceleration. It begins by providing background on Moore's law and how increasing transistor density led to issues with power consumption and thermal constraints. This motivated the evolution of specialized hardware acceleration to improve performance. The lecture then covers topics like coprocessors vs accelerators, common acceleration techniques, and examples of hardware acceleration. It also discusses challenges like debugging and coherency when designing accelerated systems.
Experiences in Application Specific Supercomputer Design - Reasons, Challenges... (Heiko Joerg Schick)
The document discusses challenges faced with application specific supercomputer design. It provides an example of QPACE, a supercomputer designed for quantum chromodynamics (QCD) computations. Key challenges discussed include data ordering issues when using InfiniBand networking that could cause computations to use invalid data if ordering of writes to memory was not enforced. Ensuring proper data ordering is important to avoid software consuming data before it is valid.
The document discusses the limits of information and communication technologies (ICT) such as computing power, data storage, and network bandwidth. It proposes that future networks will need to scale in both size and functionality through approaches like federation of multiple networks. Cloud computing is presented as a potential approach to tackle these limits by providing on-demand access to shared computing resources over a network in a scalable and elastic manner. However, cloud computing is still surrounded by much marketing hype, and open questions remain regarding its impact and how it can integrate with existing technologies.
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelism... (Slide_N)
This document summarizes a presentation given at the 2005 IEEE Hot Chips conference about parallelism in modern processors and how it relates to programming models. It discusses different types of parallelism available at the processor, system, and application levels. It then examines approaches to parallelism used by general-purpose CPUs, special-purpose CPUs like the Cell processor, and GPUs. While parallelism is increasing in these devices, programming them effectively remains challenging due to the difficulty of parallel programming and lack of appropriate language and tooling support. The document calls for more research in parallel programming models and languages to make better use of emerging multi-core architectures.
The document describes a system called QuEM for measuring quality of experience in multimedia services over IP. QuEM can detect quality-degrading events such as audio dropouts, video freezes, pixelation, and coding degradation. It uses techniques such as audio power analysis, motion vectors, and quantization parameters to identify these events. It integrates the measurements over time and provides a simple yet powerful measure of the user's quality of experience.
This document presents an analysis and optimization of video downloading on mobile devices. It integrates the SPIN and ns-2 tools to perform a goal-driven, parameterized analysis of video download scenarios over TCP in mobile environments subject to disconnections. The proposed approach significantly reduces analysis time by focusing only on the simulations that meet the specified goals.
This document presents an analysis of the channel response of a vehicle's power distribution network. Channels were measured in different engine states and analyzed under two models: time-invariant and periodically time-varying. The results showed that the channels exhibit negligible cyclic variations with respect to the engine cycle, so a linear time-invariant model is adequate to represent channels in vehicle power networks.
This document proposes a comprehensive framework for managing service level agreements (SLAs) across telecommunication services. The framework includes an SLA management process and non-hierarchical architecture to support SLAs involving end users, customers, service aggregators, software and infrastructure providers, and telecommunication network capabilities. The goal of the framework is to provide an automatic way to manage SLAs throughout complex multi-party service arrangements and integrate SLA penalties into service charging.
This document evaluates the performance of an OFDM system over a vehicle's electrical network in accordance with the new G.hn standard. The channel is modeled as a linear time-invariant filter with stationary colored noise. The OFDM system uses pulse shaping on transmission and windowing on reception to improve spectral confinement. The results show bit rates above 315 Mbps in 50% of the channels and additional gains of 2.5-7% when windowing is used.
The document describes the MANTICORE project, which seeks to allow third parties to access and control network infrastructures through virtualized network services such as IP Network as a Service. MANTICORE offers a software environment that automates the processes of acquiring and configuring IP networks for infrastructure providers, network service providers, and end users through a virtual marketplace. The project aims to implement and test these services with three research communities.
This document presents the WiMAX module for the DEMOCLES® platform, which makes it possible to simulate fixed and mobile WiMAX networks. It describes the models implemented for the WiMAX physical and MAC layers, along with examples of video streaming simulations. The module provides an accurate simulation that accounts for interference and propagation, and it can scale to study new wireless technologies. Future work includes updating the simulator with WiMAX advances and incorporating technologies
This document describes the GENESISX project, which seeks to develop new capabilities for next-generation mobile networks. The project will explore direct access to IMS over 2G/3G networks, mobile WiMAX, ad-hoc and P2P networks, IMS roaming, QoS in WiMAX, mobile multimedia services, intelligent network monitoring, and WiMAX network management. The expected benefits include ubiquitous bandwidth, services based on
This document describes a real-time MIMO-LTE test-bed developed by Telefónica I+D to test LTE technology. The test-bed includes a radio frontend module, communications and control module, and system processing module. It allows for demonstration of basic LTE operations and analysis of resource distribution and latency. Future work will focus on implementing advanced LTE technologies on a new hardware platform.
This document proposes an architecture to extend the UPnP network beyond the home using IMS. The architecture includes an extended control point, UPnP proxies in the homes and in the core network, and metadata and ontology servers to enable semantic, secure search and playback of multimedia content between homes. The document also describes how to handle content URLs across home networks and how to create and manage multimedia resource metadata in a distributed way.
NUBA: A Federated Cloud Platform for Infrastructure Services (TELECOM I+D)
The document describes the NUBA platform, which develops a federated cloud platform to provide infrastructure services simply and automatically. The platform enables the dynamic deployment of business services in the cloud through the federation of multiple cloud providers, resource management based on business criteria, and simplified service definition and deployment. Several organizations are participating in the project to validate the advanced capabilities
The document describes mechanisms for saving energy in mobile networks by analyzing traffic behavior. It proposes reducing network resources during low-traffic nighttime hours to save energy while still ensuring good quality for users. According to simulations, switching cells off allows greater energy savings than other strategies.
The document presents a new travel application that uses collective intelligence to offer personalized recommendations to travelers. The application exploits information from multimedia resources and from personal and social patterns detected on the Internet to rank popular places and trends and thus recommend the most relevant places for each user according to their profile, preferences, and location. The document concludes with a case study showing how the application can help plan a trip to Madrid intelligently.
This document describes the development of the VITALAS system for large-scale indexing and retrieval of videos and images. VITALAS was designed following a user-centered approach, with iterations and user evaluations to improve the interface. Users confirmed that VITALAS offers an improvement over other tools thanks to its intuitive interface, which provides multiple innovative search services.
This document describes the challenges of monitoring and managing the quality of experience (QoE) of telecommunications service users. It explains how monitoring platforms can correlate QoE with quality of service to optimize the network and improve customer satisfaction. It also highlights the importance of standardizing QoE metrics and measurement systems to facilitate cooperation between operators.
Indoor guidance and detection system using mobile devices and Bluetooth technology (TELECOM I+D)
This document describes an indoor guidance system that uses mobile devices and Bluetooth beacons. The system guides users from the entrance of a building or parking garage to their destination, computing the optimal route based on the detected beacons. Successful tests were carried out in a building, and the system provides robust guidance even in the presence of errors. However, it requires calibration for each scenario and can increase battery consumption on mobile devices.
This document describes several technical solutions for optimizing DVB-T networks to provide local and mobile services. These solutions include the use of Local Area Services, antenna diversity, AL-FEC, hierarchical modulation, scalable video coding, and time slicing. These techniques were evaluated through simulations and field trials to reduce the minimum SNR requirements for mobile reception and to provide localized content.
Oral communication system for deaf people (TELECOM I+D)
The document describes the development of a two-way communication system between deaf and hearing people for driving license renewal. A parallel corpus in Spanish and Spanish Sign Language (LSE) was generated, and modules were developed for speech recognition, translation into LSE, and speech generation from LSE. The evaluation with civil servants and deaf users showed good objective results but also areas for improvement, such as the naturalness of the virtual agent and the complexity of the interface.
The document describes an architecture for generating and transmitting metadata associated with multimedia traffic in a synchronized way. The architecture allows metadata to be generated externally to the multimedia stream, then inserted and synchronized so that network elements can access the additional information and provide advanced services such as targeted advertising or presence analysis in 3D videoconferences.
The document describes the three missions of universities, which include teaching, research, and contributing to socioeconomic development. It also describes the motivations for and barriers to collaboration between companies and research centers, such as access to broader knowledge versus differences in timescales. In addition, it explains CDTI's instruments and opportunities for funding R&D projects and promoting the creation of technology-based companies.
7. Power Density. [Chart of power density (watts/cm2) from the i386 and i486 through the Pentium®, Pentium® Pro, Pentium® II, Pentium® III and Pentium® 4, on a 1-1000 W/cm2 scale annotated against a hot plate, a rocket nozzle, a nuclear reactor and the Sun's surface.] * "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies" – Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
9. Technology Outlook (Shekhar Borkar, Micro37)
High Volume Manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology Node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
Delay = CV/I scaling: 0.7, ~0.7, >0.7; delay scaling will slow down
Energy/Logic Op scaling: >0.35, >0.5, >0.5; energy scaling will slow down
Bulk Planar CMOS: High Probability → Low Probability
Alternate, 3G etc.: Low Probability → High Probability
Variability: Medium → High → Very High
ILD (K): ~3, <3, reduce slowly towards 2-2.5
RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
Metal Layers: 6-7, 7-8, 8-9, adding 0.5 to 1 layer per generation
10. We have seen increasing numbers of gates on a chip and increasing clock speeds. Heat is becoming an unmanageable problem, with Intel processors exceeding 100 watts. We will not see dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase. [Diagram: lower voltage, increased clock rate and transistor density, more gates packed together with shorter cycle times; evolution from a single core with cache, to multi-core chips, to many-core chips with clusters of cores (C1-C4) sharing cache.]
11. Increasing chip performance: Intel's teraflop chip, with 80 processors in a die of 300 square mm and terabytes per second of memory bandwidth. Note: the teraflops barrier was first reached by Intel in 1996 using 10,000 Pentium Pro processors housed in more than 85 cabinets occupying 200 square meters. This will be possible within 3 years from now. ICPP-2009, September 23rd 2009. Thanks to Intel.
16. Looking at the Gordon Bell Prize
1 GFlop/s; 1988; Cray Y-MP; 8 processors; static finite element analysis
1 TFlop/s; 1998; Cray T3E; 1024 processors; modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors; superconductive materials
1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads)
Jack Dongarra
17. BSC-CNS and international initiatives: IESP. Build an international plan for developing the next generation of open source software for scientific high-performance computing. Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment.
18. 1 EFlop/s "Clean Sheet of Paper" Strawman
4 FPUs + register files per core (= 6 GF @ 1.5 GHz)
1 chip = 742 cores (= 4.5 TF/s); 213 MB of L1 I&D; 93 MB of L2
1 node = 1 processor chip + 16 DRAMs (16 GB)
1 group = 12 nodes + 12 routers (= 54 TF/s)
1 rack = 32 groups (= 1.7 PF/s); 384 nodes per rack; 3.6 EB of disk storage included
1 system = 583 racks (= 1 EF/s); 166 million cores; 680 million FPUs; 3.6 PB = 0.0036 bytes/flop; 68 MW with aggressive assumptions
Sizing done by "balancing" power budgets with achievable capabilities. Largely due to Bill Dally. Courtesy of Peter Kogge, UND.
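As a sanity check, the strawman's totals can be rolled up from the per-core figure; the sketch below uses only numbers from the slide (the arithmetic is mine), and the small mismatch with the quoted 680 million FPUs comes from rounding in the slide's own figures:

```python
# Roll-up of the exascale "clean sheet of paper" strawman sizing.
GF_PER_CORE = 6            # 4 FPUs + register files per core at 1.5 GHz
CORES_PER_CHIP = 742
NODES_PER_GROUP = 12       # 1 node = 1 processor chip + 16 DRAMs
GROUPS_PER_RACK = 32
RACKS = 583

chip_tf = CORES_PER_CHIP * GF_PER_CORE / 1e3                   # ~4.5 TF/s
rack_pf = chip_tf * NODES_PER_GROUP * GROUPS_PER_RACK / 1e3    # ~1.7 PF/s
system_ef = rack_pf * RACKS / 1e3                              # ~1 EF/s

cores = CORES_PER_CHIP * NODES_PER_GROUP * GROUPS_PER_RACK * RACKS
print(f"system: {system_ef:.2f} EF/s, cores: {cores/1e6:.0f} M, FPUs: {4*cores/1e6:.0f} M")
print(f"bytes/flop with 3.6 PB of memory: {3.6e15 / (system_ef * 1e18):.4f}")
```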
19. Education for Parallel Programming: a multicore-based pacifier; I ♥ multi-core programming; I ♥ many-core programming; we all ♥ massively parallel programming; I ♥ games.
21. Initial developments
Mechanical machines
1854: Boolean algebra by G. Boole
1904: Diode vacuum tube by J.A. Fleming
1938: Boolean algebra & electronic switches, C. Shannon
1945: Stored program by J. von Neumann
1946: ENIAC by J.P. Eckert and J. Mauchly
1947: First transistor (Bell Labs)
1949: EDSAC by M. Wilkes
1952: UNIVAC I and IBM 701
22. In 50 years... ENIAC (Eckert & Mauchly, 1946): 18,000 vacuum tubes. Pentium III playing a DVD (1998): 24 M transistors.
23. Technology Trends: Microprocessor Capacity. 2X transistors per chip every 1.5 years, called "Moore's Law". Microprocessors have become smaller, denser, and more powerful. Not just processors: bandwidth, storage, etc. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
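For a sense of scale, that doubling period compounds quickly; the minimal sketch below (my arithmetic, assuming a clean 1.5-year doubling period) shows roughly 100x growth per decade.

```python
# Compound growth implied by "2X transistors per chip every 1.5 years".
def growth_factor(years, doubling_period=1.5):
    return 2 ** (years / doubling_period)

print(f"after 10 years: ~{growth_factor(10):.0f}x")   # roughly 100x
print(f"after 15 years: ~{growth_factor(15):.0f}x")   # 1024x
```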
25. Computer Architecture Achievements
1951: Microprogramming (M. Wilkes)
1962: Virtual memory (Atlas, Manchester)
1964: Pipelining (CDC 6600, S. Cray, 10 Mflop/s)
1965: Cache memory (M. Wilkes)
1975: Vector processors (S. Cray)
1980: RISC architecture (IBM, Berkeley, Stanford)
1982: Multiprocessors with distributed memory
1990: Superscalar processors: PA-RISC (HP) and RS/6000 (IBM)
1991: Multiprocessors with distributed shared memory
1994: SMT (M. Nemirovsky, D. Tullsen, S. Eggers)
1994: Speculative multiprocessors (G. Sohi, Wisconsin)
1996: Value prediction (J.P. Shen and M. Lipasti, CMU)
2000: Multicore/manycore architectures
27. Virtual Worlds have huge potential beyond games: commerce & advertising, corporate, education, first responders, government, health, military, science, community facilitation, social change.
28. Cray XT5-HE system (Jaguar @ ORNL: 1.75 PF/s). Over 37,500 six-core AMD Opteron processors running at 2.6 GHz, 224,162 cores. Power: 6.95 MW. 300 terabytes of memory. 10 petabytes of disk space. 240 gigabytes per second of disk bandwidth. Cray's SeaStar2+ interconnect network. Jack Dongarra.
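A rough peak-versus-Linpack check on those numbers; the core count and clock are from the slide, while the 4 double-precision flops per cycle per Opteron core is my assumption for that processor generation:

```python
# Theoretical peak of Jaguar (Cray XT5-HE) versus its 1.75 PF/s Linpack result.
cores = 224_162
clock_hz = 2.6e9
flops_per_cycle = 4            # assumed for that Opteron generation

peak_pf = cores * clock_hz * flops_per_cycle / 1e15
print(f"theoretical peak: ~{peak_pf:.2f} PF/s")                    # ~2.33 PF/s
print(f"Linpack efficiency at 1.75 PF/s: ~{1.75 / peak_pf:.0%}")   # ~75%
```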
29. MareIncognito: project structure
Applications: 4 relevant apps (Materials: SIESTA; Geophysics imaging: RTM; Computational mechanics: ALYA; Plasma: EUTERPE) plus general kernels
Performance analysis tools: automatic analysis, coarse/fine grain prediction, sampling, clustering, integration with Peekperf
Interconnect: contention, collectives, overlap of computation/communication, slimmed networks, direct versus indirect networks, contribution to new Cell design
Processor and node: support for the programming model, load balancing and performance tools; issues for future processors
Load balancing: coordinated scheduling (run time, process, job), power efficiency
Programming models: StarSs (CellSs, SMPSs), OpenMP++, MPI + OpenMP/StarSs
Models and prototype
30. Supercomputing and eScience. 22 elite groups, more than 120 senior researchers, more than 300 PhD students. BSC-CNS: the backbone of supercomputing research in Spain. Application scopes: "Earth Sciences", "Astrophysics", "Engineering", "Physics", "Life Sciences". Compilers and tuning of application kernels. Programming models and performance tuning tools. Architectures and hardware technologies.
31. High Performance Computing as key enabler. [Chart, courtesy of AIRBUS France: available computational capacity (1 Giga (10^9) through 1 Tera (10^12), 1 Peta (10^15), 1 Exa (10^18) to 1 Zeta (10^21) Flop/s) against capacity in number of overnight load cases run (10^2 to 10^6), from 1980 to 2030, as aerodynamic simulation advances from low- and high-speed RANS, the HS design data set and unsteady RANS through LES and CFD-based noise simulation to CFD-based loads & HQ, aero optimisation & CFD-CSM, full MDO and real-time CFD-based in-flight simulation.] "Smart" use of HPC power: algorithms, data mining, knowledge. Capability achieved during one-night batch.
34. Weather, Climate and Earth Sciences: Roadmap
2009: resolution 80 km; memory ≈ 110 GB; storage ≈ 8 TB; NEC SX-9 with 48 vector processors: ≈ 40-day run
2015: resolution 20 km; memory ≈ 3.5 TB; storage ≈ 180 TB; high-resolution model with complete carbon cycle model; challenges: data visualization and post-processing, data discovery, archiving
2020: resolution 1 km; memory ≈ 4 PB; storage ≈ 150 PB; higher resolution with global cloud-resolving model; challenges: data sharing and transfer, memory management, I/O management
35. Education for Parallel Programming: a multicore-based pacifier; I ♥ multi-core programming; I ♥ many-core programming; we all ♥ massively parallel programming; I ♥ games.
Access latency for main memory, even using a modern SDRAM with a CAS latency of 2, will typically be around 9 cycles of the **memory system clock** -- the sum of:
- the latency between the FSB and the chipset (Northbridge) (+/- 1 clock cycle)
- the latency between the chipset and the DRAM (+/- 1 clock cycle)
- the RAS to CAS latency (2-3 clocks, charging the right row)
- the CAS latency (2-3 clocks, getting the right column)
- 1 cycle to transfer the data
- the latency to get this data back from the DRAM output buffer to the CPU (via the chipset) (+/- 2 clock cycles)
Assuming a typical 133 MHz SDRAM memory system (e.g. either PC133 or DDR266/PC2100), and assuming a 1.3 GHz processor, this makes 9*10 = 90 cycles of the CPU clock to access main memory! Yikes, you say! And it gets worse -- a 1.6 GHz processor would take it to 108 cycles, a 2.0 GHz processor to 135 cycles, and even if the memory system was increased to 166 MHz (and still stayed CL2), a 3.0 GHz processor would wait a staggering 162 cycles!
Caches make the memory system seem almost as fast as the L1 cache, yet as large as main memory. A modern primary (L1) cache has a latency of just two or three **processor cycles**, which is dozens of times faster than accessing main memory, and modern primary caches achieve hit rates of around 90% for most applications. So 90% of the time, accessing memory only takes a couple of cycles.
Good overview: http://www.pattosoft.com.au/Articles/ModernMicroprocessors/
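A minimal sketch of that arithmetic, assuming the roughly 9 memory-bus cycles described above and simply scaling by the CPU-to-bus clock ratio (the note rounds 1.3 GHz / 133 MHz up to 10, hence its 90-cycle figure):

```python
# Main-memory access latency expressed in CPU clock cycles.
def mem_latency_cpu_cycles(cpu_hz, bus_hz, bus_cycles=9):
    return bus_cycles * cpu_hz / bus_hz

for cpu_ghz, bus_mhz in [(1.3, 133), (1.6, 133), (2.0, 133), (3.0, 166)]:
    cycles = mem_latency_cpu_cycles(cpu_ghz * 1e9, bus_mhz * 1e6)
    print(f"{cpu_ghz} GHz CPU, {bus_mhz} MHz bus: ~{cycles:.0f} CPU cycles")
```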
It is the conclusion of this TTA that, in the very near future (in fact some early examples are clearly in evidence right now), virtual worlds will extend their reach well beyond their current subject matter of on-line fantasy gaming to incorporate all manner of business and commerce. This evolution will quickly encompass many industries and business processes where IBM has traditionally had significant business interests. In the education industry, it is not at all a stretch to imagine a university physics professor convening a kinematics lecture in a virtual world in which the professor could alter the force of gravity and move large, virtual objects to demonstrate environments on other planets. Closer to our industry, an IBM Industry Solution sales specialist could arrange to meet a client in a virtual world populated by highly realistic (virtual) world venues containing software solutions created by IBM and select business partners. In these virtual sales worlds, clients would interact with the solutions in the same manner as real-world users, exploiting all the solution's functional capacities. For example, a virtual mobile work force solution could be demonstrated from multiple perspectives in the context of real business scenarios: the control center, the mobile vehicle, etc. The solution demonstration would totally immerse the client in the solution experience, thereby creating an unparalleled selling tool. The possibilities are limitless.
From top left, clockwise:
(1) World of Warcraft: a tavern. This is just a symbolic representation of commerce & advertising within games. Many people run their own businesses within virtual worlds, trading both virtual and real items for virtual and real currencies. Microsoft's acquisition of Massive Inc. has also now secured them a huge advertising ecosystem of game development companies, advertising agencies and leading brands, using online video games as another advertising channel for directed and personalized ads and product placement deals. The tavern represents the real-world metaphors that build community within virtual worlds, much like the 18th-century coffee houses led to the formation of stock exchanges. Incidentally, there is a game advertising summit in San Francisco, June 9th 2006.
(2) Hazmat Hot Zone: a project based at the Entertainment Technology Center at Carnegie Mellon University, one of the earliest serious game projects, now with several scenarios up and running using Unreal Tournament-based graphics and game play. Intended users: fire-department personnel who handle HazMat response. HazMat uses multiplayer gaming technology and augmented communication practices to assist with team-based training vital to HazMat and other disaster response practices.
(3) Virtual Iraq: not only is the army using virtual world simulations for the training of troops and engagement planning, but also for the treatment of Post Traumatic Stress Disorder (PTSD) through the ability to "relive" traumatic events through simulation. (http://www.washingtonpost.com/ac2/wp-dyn/A58360-2005Mar22?language=printer)
(4) Simulation of forest fire disasters and how to combat them.
(5) Virtual Acropolis: an example of using virtual environments as an educational and research tool for the humanities, in this case ancient history. The use of highly detailed models, created collaboratively by historians and researchers, to model world heritage sites for a variety of uses, including tourism, education, simulation of "what-if" scenarios, etc.
Imagine teaching the history of a famous era or battle by immersing the student in a highly realistic, immersive simulation complete with architecture, artifacts and even populace of the period. These may also help the study of social history and sociological development and evolution via large-scale community participation.
(6) Food Force: from the United Nations World Food Program (WFP), Food Force is an educational video game telling the story of a hunger crisis on the fictitious island of Sheylan. Comprised of 6 mini-games or "missions", the game takes young players from an initial crisis assessment through to delivery and distribution of food aid, with each sequential mission addressing a particular aspect of this challenging process. (http://www.food-force.com/)
(7) Yourself!Fitness: a complete fitness program on a disc; exercise, diet, motivation, and fitness tracking are all included. Your host is Maya, a dynamically generated digital personality who guides you through all aspects of the application. You need nothing more than an Xbox and a television set to partake. (http://www.yourselffitness.com/)
(8) Pulse!!: the virtual clinical learning lab and simulation, for training first responders in treatments, and for medical and nursing students. (http://www.businessweek.com/innovate/content/apr2006/id20060410_051875.htm?chan=innovation_game+room_features)
(10) Another picture of World of Warcraft: this is just to illustrate the breadth, diversity and scale of virtual environments. It is easy to take for granted that this huge architectural vista and the tavern above are both parts of a single virtual world, WoW; this is a challenge for the rendering engine, which must deal with a broad spectrum of conditions. Why is this important? It means that the same middleware engine can these days be used for a broad variety of simulation environments and applications, rather than purpose-built or specialized simulations for specific scenarios, and is configurable through XML & scripting mechanisms.
(centre) Google Earth: now being offered as Enterprise Services for a variety of applications including real estate, architecture & engineering, insurance, and media. Google's provision of 3D modelling tools and an open repository for free is a significant step in making Google Earth a platform for application development, using it as a visualization engine and the MySpace of the future.
NEED FOR STANDARDS: multiple virtual worlds, interconnected & interdependent, independently operated. Open standard interfaces, to allow: avatar portability, property portability, security, metering, billing, separations, settlements, distributed problem determination, distributed systems management.
(Please note: this slide includes 2 animation steps.) An exciting question to ask is where this research is heading. In this slide you can see what is probably a familiar chart depicting the progress that has been made in supercomputing since the early 90s. (At each time point, the green line shows the 500th fastest supercomputer, the dark blue line the fastest supercomputer, and the light blue line the summed power of the top 500 machines.) These lines show a nice trend, which we've extrapolated out 10 years. [ANIMATE SLIDE] The IBM team's latest simulation results fall here on the graph. These latest results represent a model at about 4.5 percent of the scale of the cerebral cortex, which was run at 1/83 of real time. The machine used provided 144 TB of memory and 0.5 PFlop/s. [ANIMATE SLIDE] Turning to the future, you can see that running human-scale cortical simulations will require 4 PB of memory, and running these simulations in real time will require over 1 EFlop/s. If the current trends in supercomputing continue, however, the IBM team believes they will have the ability to perform such simulations in the not too distant future.