Valladolid final-septiembre-2010

“ Evolución de la Arquitectura de Computadores ” Valladolid, Septiembre 2010 Prof. Mateo Valero Director

Technological Achievements Transistor (Bell Labs, 1947) DEC PDP-1 (1957) IBM 7090 (1960) Integrated circuit (1958) IBM System 360 (1965) DEC PDP-8 (1965) Microprocessor (1971) Intel 4004

Power Density 1 10 100 1000           i386 i486 Pentium® Pentium® Pro Pentium® II Pentium® III Hot plate Nuclear Reactor Sun's Surface Rocket Nozzle * “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote - 1999. Pentium® 4 Watts/cm 2

Technology Outlook Shekhar Borkar, Micro37, P Medium High Very High Variability Energy scaling will slow down >0.5 >0.5 >0.35 Energy/Logic Op scaling 0.5 to 1 layer per generation 8-9 7-8 6-7 Metal Layers 1 1 1 1 1 1 1 1 RC Delay Reduce slowly towards 2-2.5 <3 ~3 ILD (K) Low Probability High Probability Alternate, 3G etc 128 11 2016 High Probability Low Probability Bulk Planar CMOS Delay scaling will slow down >0.7 ~0.7 0.7 Delay = CV/I scaling 256 64 32 16 8 4 2 Integration Capacity (BT) 8 16 22 32 45 65 90 Technology Node (nm) 2018 2014 2012 2010 2008 2006 2004 High Volume Manufacturing

We have seen increasing number of gates on a chip and increasing clock speed. Heat becoming an unmanageable problem, Intel Processors > 100 Watts We will not see the dramatic increases in clock speeds in the future. However, the number of gates on a chip will continue to increase. Increasing the number of gates into a tight knot and decreasing the cycle time of the processor Lower Voltage Increase Clock Rate & Transistor Density Core Cache Core Cache Core C1 C2 C3 C4 Cache C1 C2 C3 C4 Cache C1 C2 C3 C4 C1 C2 C3 C4 C1 C2 C3 C4 C1 C2 C3 C4

Increasing chip performance: Intel´s Petaflop chip 80 processors in a die of 300 square mm. Terabytes per second of memory bandwidth Note: The barrier of the Teraflops was obtained by Intel in 1991 using 10.000 Pentium Pro processors contained in more than 85 cabinets occupying 200 square meters  This will be possible in 3 years from now ICPP-2009, September 23rd 2009 Thanks to Intel

NVIDIA Fermi Architecture Unified 768KB L2 cache serves all threads GigaThread hardware scheduler assigns Thread Blocks to SMs Wide DRAM interface provides 12 GB/s bandwidth 16 Streaming- Multiprocessors (512 cores) execute Thread Blocks 620 Gigaflops

Cell Broadband Engine TM : A Heterogeneous Multi-core Architecture * Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.

Intel/UPC Since 2002 (Roger Espasa, Toni Juan) 40 People Microprocessor Development (Larrabee x86 many core)

Looking at the Gordon Bell Prize 1 GFlop/s; 1988; Cray Y-MP; 8 Processors Static finite element analysis 1 TFlop/s; 1998; Cray T3E; 1024 Processors Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method. 1 PFlop/s; 2008; Cray XT5; 1.5x10 5 Processors Superconductive materials 1 EFlop/s; ~2018; ?; 1x10 7 Processors (10 9 threads) Jack Dongarra

BSC-CNS e iniciativas a nivel internacional: IESP Build an international plan for developing the next generation open source software for scientific high-performance computing Improve the world’s simulation and modeling capability by improving the coordination and development of the HPC software environment

1 EFlop/s “Clean Sheet of Paper” Strawman 4 FPUs+RegFiles/Core (=6 GF @1.5GHz) 1 Chip = 742 Cores (=4.5TF/s) 213MB of L1I&D; 93MB of L2 1 Node = 1 Proc Chip + 16 DRAMs (16GB) 1 Group = 12 Nodes + 12 Routers (=54TF/s) 1 Rack = 32 Groups (=1.7 PF/s) 384 nodes / rack 3.6EB of Disk Storage included 1 System = 583 Racks (=1 EF/s) 166 MILLION cores 680 MILLION FPUs 3.6PB = 0.0036 bytes/flops 68 MW w’aggressive assumptions Sizing done by “balancing” power budgets with achievable capabilities Largely due to Bill Dally Courtesy of Peter Kogge, UND

Education for Parallel Programming Multicore-based pacifier I multi-core programming I many-core programming We all massive parallel prog. I games

Initial developments Mechanical machines 1854: Boolean algebra by G. Boole 1904: Diode vacuum tube by J.A. Fleming 1938: Boolean Algebra & Electronics Switches, C. Shannon 1946: ENIAC by J.P. Eckert and J. Mauchly 1945: Stored program by J.V. Neuma nn ?????? 1947 : First transistor (Bell Labs) 1949: EDSAC by M. Wilkes 1952: UNIVAC I and IBM 701

In 50 Years ... Eniac , Eckert&Mauchly1946 ... 18000 vacuum tubes Pentium III playing DVD, 1998 ... 24 M transistors

Technology Trends: Microprocessor Capacity 2X transistors/Chip Every 1.5 years Called “ Moore’s Law ” Moore’s Law Microprocessors have become smaller, denser, and more powerful. Not just processors, bandwidth, storage, etc Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Computer Architecture Achievements 1951 : Microprogramming (M. Wilkes) 1962 : Virtual Memory (Atlas, Manchester) 1964 : Pipeline (CDC 6600, S. Cray, 10 Mflop/s) 1965 : Cache memory (M. Wilkes) 1975 : Vector processors (S. Cray) 1980 : RISC architecture (IBM, Berkeley, Stanford) 1982 : Multiprocessors with distributed memory 1990 : Superscalar processors : PA-Risc (HP) and RS-6000 (IBM) 1991 : Multiprocessors with distributed shared memory 1994 : SMT (M. Nemirowski, D. Tullsen, S. Eggers) 1994 : Speculative Multiprocessors ( G. Sohi, Winsconsin) 1996 : Value Prediction (J.P.Shen and M.Lipasti, CMU) 2000: Multicore/Manycore Architectures

Virtual Worlds have huge potential beyond Games Commerce & Advertising Corporate Education First Responders Government Health Military Science Community Facilitation Social Change

Cray XT5-HE system Over 37,500 quad-core AMD Opteron processors running at 2.6 GHz, 224,162 cores. Power: 6.95 Mwatts 300 terabytes of memory 10 petabytes of disk space. 240 gigabytes per second disk bandwidth Cray's SeaStar2+ interconnect network. Jaguar @ ORNL: 1.75 PF/s Jack Dongarra

MareIncognito: Project structure 4 relevant apps: Materials: SIESTA Geophisics imaging: RTM Comp. Mechanics: ALYA Plasma: EUTERPE General kernels Automatic analysis Coarse/fine grain prediction Sampling Clustering Integration with Peekperf Contention, Collectives Overlap computation/communication Slimmed Networks Direct versus indirect networks Contribution to new Cell design Support for programming model Support for load balancing Support for performance tools Issues for future processors Coordinated scheduling: Run time, Process, Job Power efficiency StarSs: CellSs, SMPSs [email_address] OpenMP++ MPI + OpenMP/StarSs Performance analysis tools Processor and node Load balancing Interconnect Applications Programming models Models and prototype

Supercomputación y eCiencia 22 grupos de élite M ás de 120 investigadores seniors Más de 300 estudiantes de doctorado BSC-CNS: vertebrador de la investigación en supercomputación en España Application scope “Earth Sciences” Application scope “Astrophysics” Application scope “Engineering” Application scope “Physics” Application scope “Life Sciences” Compilers and tuning of application kernels Programming models and performance tuning tools Architectures and hardware technologies

High Performance Computing as key-enabler 1980 1990 2000 2010 2020 2030 Capacity: # of Overnight Loads cases run Available Computational Capacity [Flop/s] CFD-based LOADS & HQ Aero Optimisation & CFD-CSM Full MDO Real time CFD based in flight simulation x10 6 1 Zeta (10 21 ) 1 Peta (10 15 ) 1 Tera (10 12 ) 1 Giga (10 9 ) 1 Exa (10 18 ) 10 2 10 3 10 4 10 5 10 6 LES CFD-based noise simulation RANS Low Speed RANS High Speed HS Design Data Set UnsteadyRANS “ Smart” use of HPC power: Algorithms Data mining knowledge Capability achieved during one night batch Courtesy AIRBUS France

Diseño del ITER TOKAMAK (JET, Oxford)

Supercomputación, teoría y experimentación Cortesia de IBM

Weather, Climate and Earth Sciences: Roadmap 2009 Resolution : 80 Km Memory: ≈110 GB Storage: ≈ 8 TB NEC-SX9 48 vector procs: ≈ 40 days run 2015 Resolution : 20 Km MemSory: ≈ 3,5 TB Storage: ≈ 180 TB High resolution model with complete carbon cycle model Challenges: data viz and post-processing, data discovery, archiving 2020 Resolution : 1 Km Memory: ≈ 4 PB Storage: ≈ 150 PB Higher resolution with global cloud resolving model Challenges: data sharing, transfer memory management, I/O management

Valladolid final-septiembre-2010

More Related Content

What's hot

What's hot (20)

Similar to Valladolid final-septiembre-2010

Similar to Valladolid final-septiembre-2010 (20)

More from TELECOM I+D

More from TELECOM I+D (20)

Valladolid final-septiembre-2010

Editor's Notes