²Integrated Systems Laboratory
¹Department of Electrical, Electronic and Information Engineering
RISC-V open-ISA and open-HW – a Swiss army knife for HPC
ICS2020, Workshop on RISC-V and OpenPOWER, 29.06.2020
Andrea Bartolini¹ & PULP team¹,²
||
Energy efficiency challenge: Exascale
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
[Chart: peak system PERFORMANCE (1 TFLOPS up to 100 EFLOPS) and ENERGY PER OPERATION (2 nJ/FLOP down to 0.2 pJ/FLOP) over 2000-2033; performance grows x10 every 4 years while energy per operation must shrink /10 every 4 years.]
HPC is now power-bound → we need a 10x energy-efficiency improvement every 4 years.
*For a 20 MWatt supercomputer, performance and energy per operation are two sides of the same budget: 1 EFLOPS → 20 pJ/FLOP.
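The 20 pJ/FLOP figure is simply the power budget divided by the target rate: 20 MW / 10^18 FLOP/s = 2·10^-11 J/FLOP = 20 pJ/FLOP.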
||
HPC trends
Peak Performance (Moore's law): Exaflops = 10^18 FLOP/s
FPU Performance (Dennard scaling): Gigaflops = 10^9 FLOP/s
Number of FPUs (Moore + Dennard): 10^9
App. Parallelism (Amdahl's law): serial fraction 1/10^9
We need programmability support, and we are already at 20 MWatt.
C. Cavazzoni
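The Amdahl figure follows from the speedup bound S(N) = 1 / (s + (1 - s)/N): to keep on the order of 10^9 gigaflop-class FPUs busy and approach the 10^9-fold speedup an exaflop machine implies, the serial fraction s of the application must stay around 1/10^9 or below, which is why programmability and parallelization support matter as much as raw hardware.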
||
“Traditional” CPU chips are designed for maximum performance on all possible workloads.
Silicon area is wasted to maximize single-thread performance.
[Figure: compute power vs. energy within a fixed datacenter capacity.]
C. Cavazzoni
Energy trends
||
New chips are designed for maximum performance on a reduced set of workloads.
Simple functional units, poor single-thread performance, but maximum throughput.
[Figure: compute power vs. energy within a fixed datacenter capacity.]
C. Cavazzoni
Change of paradigm #1
||
Change of paradigm #2
||
https://indico-jsc.fz-juelich.de/event/76/session/0/contribution/1/material/slides/0.pdf
Wayne Joubert - OpenPOWER ADG 2018
Change of paradigm #3
||
Change of paradigm #4
[Figure: a scalable monitoring framework; heterogeneous sensors from the cluster, CRAC, PDU and environment feed a common interface, on top of which sit performance analysis, machine learning, data visualization, resource management, energy efficiency and job scheduling, closing reactive and proactive feedback loops.]
||
…. a Swiss army knife for HPC
ARIANE: The 64b Application Processor
ARA: The Vector Engine
NTX: The Network Training Accelerator
HERO: The Open Heterogeneous Research Platform
ControlPULP: The Power Controller for HPC servers
SNITCH: The Pseudo Dual-Issue Processor for FP Workloads
sPIN on PULP: Network-Accelerated Memory Transfers
EXASCALE 2021
https://pulp-platform.org/
||
Architecture: Ariane RISC-V Cores
▪ RV64GC, 6-stage, in-order issue, out-of-order execute
▪ 16 KiB instruction cache, 32 KiB data cache
▪ Transprecision floating-point unit (TP-FPU) [3]
▪ double-, single- and half-precision FP formats
▪ two custom formats, FP16alt and FP8
▪ all standard RISC-V formats as well as SIMD
▪ Two different implementations:
▪ Ariane High Performance (AHP): tuned for high-performance applications
▪ Ariane Low Power (ALP): tuned for light, single-threaded applications
ARIANE: The 64b Application Processor
"The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-ready 1.7-GHz 64-bit RISC-V Core in 22-nm FDSOI Technology"
||
OpenPiton+Ariane
▪ Boots SMP Linux
▪ New write-through cache subsystem with invalidations and the TRI interface
▪ LR/SC in the L1.5 cache
▪ Fetch-and-op in the L2 cache
▪ RISC-V Debug
▪ RISC-V Peripherals
If you are really passionate about cache-coherent “scalable” machines…
"OpenPiton+Ariane: The First Open-Source, SMP Linux-booting RISC-V System Scaling From One to Many Cores"
||
Architecture: Network Training Accelerator (NTX)
▪ “Network Training Accelerator”
▪ 32-bit float streaming co-processor (IEEE 754 compatible)
▪ Custom 300-bit “wide-inside” fused multiply-accumulate
▪ 1.7x lower RMSE than a conventional FPU
▪ 1 RISC-V core (“RI5CY”) and DMA
▪ 8 NTX co-processors
▪ 64 kB L1 scratchpad memory (comparable to 48 kB in V100)
Key ideas to increase hardware efficiency:
▪ Reduction of the von Neumann bottleneck (load/store elision through streaming)
▪ Latency hiding through DMA-based double-buffering
Schuiki, Fabian, Michael Schaffner, Frank K. Gürkaynak, and Luca Benini. "A scalable near-memory architecture for training deep neural networks on large in-memory datasets." IEEE Transactions on Computers 68, no. 4 (2018): 484-497.
Schuiki, Fabian, Michael Schaffner, and Luca Benini. "NTX: An energy-efficient streaming accelerator for floating-point generalized reduction workloads in 22 nm FD-SOI." In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 662-667. IEEE, 2019.
||
Flexible Architecture NTX accelerated cluster
▪ 1 processor core controls 8 NTX coprocessors
▪ Attached to 128 kB shared TCDM via a logarithmic interconnect
▪ DMA engine used to transfer data (double buffering; see the sketch below)
▪ Multiple clusters connected via interconnect (crossbar/NoC)
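As a concrete illustration of the double-buffering pattern above, the C sketch below overlaps DMA transfers with NTX computation; dma_copy_in, dma_wait and ntx_run are hypothetical helper names, not the actual PULP runtime API.

// Hypothetical helpers standing in for the cluster DMA and the NTX job launch.
extern void dma_copy_in(float *dst, const float *src, int elems);
extern void dma_wait(void);
extern void ntx_run(const float *l1_tile, int elems);

// Process n_tiles tiles of tile_elems floats each, ping-ponging between the
// two halves of the L1 scratchpad while the DMA prefetches the next tile.
void process_tiles(const float *ext, float *l1_buf[2], int n_tiles, int tile_elems)
{
    int cur = 0;
    dma_copy_in(l1_buf[cur], ext, tile_elems);             // prefetch first tile
    for (int t = 0; t < n_tiles; t++) {
        dma_wait();                                        // tile t is now resident in L1
        if (t + 1 < n_tiles)                               // start fetching tile t+1
            dma_copy_in(l1_buf[cur ^ 1], ext + (t + 1) * tile_elems, tile_elems);
        ntx_run(l1_buf[cur], tile_elems);                  // compute on the resident tile
        cur ^= 1;                                          // swap buffers
    }
}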
||
Network Training Accelerator (NTX)
▪ The processor configures the NTX register interface and manages DMA double-buffering in L1 memory
▪ The controller issues AGU, HWL, and FPU micro-commands based on that configuration
▪ AGUs generate the address streams for data access (modeled in the sketch below)
▪ FMAC with extended precision + ML functions
▪ Reads/writes data via 2 memory ports (2 operand streams and 1 write-back stream)
[Figure: one RISC-V core ("1 for 8") controlling eight NTX co-processors attached to a multi-banked L1.]
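To make the role of the AGUs and hardware loops (HWL) concrete, the C model below enumerates a two-level affine address stream; the structure and its field names are invented for this example and only stand in for whatever the processor writes to the register interface.

#include <stdint.h>

// Hypothetical two-level affine address-stream configuration.
typedef struct {
    uintptr_t base;       // base address of the operand stream
    int       bound[2];   // iteration counts: inner (0) and outer (1) hardware loop
    intptr_t  stride[2];  // byte strides: inner (0) and outer (1) hardware loop
} agu_cfg_t;

// Software model of the address sequence an AGU would emit: one operand
// address per (outer, inner) iteration of the hardware loops.
static void emit_stream(const agu_cfg_t *cfg, void (*access)(uintptr_t addr))
{
    for (int i = 0; i < cfg->bound[1]; i++)          // outer hardware loop
        for (int j = 0; j < cfg->bound[0]; j++)      // inner hardware loop
            access(cfg->base + (uintptr_t)(i * cfg->stride[1] + j * cfg->stride[0]));
}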
||
Kosmodrom: Ariane and NTX on the same technology
▪ 22nm FDX technology
▪ Two application-class RISC-V Ariane cores [1] - DP
▪ RV64GCXsmallfloat
▪ General-purpose workloads
▪ Network Training Accelerator (NTX) [2] - FP
▪ Accelerates oblivious kernels: deep neural network training, stencils, general linear algebra workloads
▪ 1.25 MiB of shared L2 memory
▪ Peripherals
ARIANE: The 64b Application Processor
Schuiki, Fabian, Michael Schaffner, and Luca Benini. "NTX: A 260 Gflop/sW Streaming Accelerator for Oblivious Floating-Point Algorithms in 22 nm FD-SOI." In 2019 International SoC Design Conference (ISOCC), pp. 117-118. IEEE, 2019.
||
Summary on Kosmodrom: State of the Art
▪ We achieve higher energy efficiency for AHP and ALP than competitive RISC-V processors (Rocket)
▪ Ariane contains slightly larger caches (32 KiB compared to 16 KiB)
▪ The ALP implementation is penalized by the less mature cell libraries available to us (7k cells vs. 2k cells)
▪ NTX achieves a 2x gain in energy efficiency compared to Tesla V100
[Figure annotations: 6x, 18x]
Zaruba, Florian, Fabian Schuiki, Stefan Mach, and Luca Benini. "The Floating Point Trinity: A Multi-modal Approach to Extreme Energy-Efficiency and Performance." In 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 767-770. IEEE, 2019.
||
Enter Ara: an Open-Source RISC-V Vector Engine
⚫ Ara targets 0.5 DP-FLOP/B
– Memory bandwidth scales with the number of physical lanes
[Block diagram: Ariane (1 GHz, 2 DP-GFLOPS, 8 GB/s, I$/D$, 64b instruction and data ports) coupled to Ara (1 GHz, 16 DP-GFLOPS, 32 GB/s, VRF, 256b data port) through an instruction queue with an ACK/TRAP back-channel, both sitting on a 256b interconnect.]
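The 0.5 DP-FLOP/B target matches the diagram: 16 DP-GFLOPS against 32 GB/s of memory bandwidth is 16/32 = 0.5 FLOP per byte, and because both the FPUs and the memory interface grow with the lane count, the ratio is preserved as lanes are added.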
Cavalcante, Matheus, Fabian Schuiki, Florian Zaruba, Michael Schaffner, and Luca
Benini. "Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor
With Multiprecision Floating-Point Support in 22-nm FD-SOI." IEEE Transactions on
Very Large Scale Integration (VLSI) Systems 28, no. 2 (2019): 530-543.
||
Matrix multiplication on Ara
⚫ Load row i of matrix B into vB
⚫ for (int j = 0; j < n; j++)
– Load element A[j, i]
– Broadcast it into vA
– vC ← vA · vB + vC

vld vB, 0(addrB)
(Unrolled loop)
ld t0, 0(addrA)
addi addrA, addrA, 8
vins vA, t0, zero
vmadd vC, vA, vB, vC
ld t0, 0(addrA)
addi addrA, addrA, 8
vins vA, t0, zero
vmadd vC, vA, vB, vC
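For reference, the scalar C equivalent of one such step (the update the vector code performs for a fixed row i of B); the names mirror the pseudocode above and row-major storage is assumed.

// One step of the formulation above: C[j][:] += A[j][i] * B[i][:] for all j.
void matmul_step(int n, int i, const double *A, const double *B, double *C)
{
    const double *vB = &B[i * n];              // row i of B, held in a vector register
    for (int j = 0; j < n; j++) {
        double vA = A[j * n + i];              // scalar load, broadcast into vA
        for (int k = 0; k < n; k++)            // one vmadd over the whole row
            C[j * n + k] += vA * vB[k];        // vC <- vA * vB + vC
    }
}

Repeating this step for every i (every row of B) completes C = A·B; on Ara the inner k loop is a single vmadd, so each j iteration costs one scalar load, one vins, and one vmadd.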
||
Issue rate performance limitation
⚫ vmadds are issued at best
every four cycles
– Since Ariane is single-issue
⚫ If the vector MACs take less
than four cycles to execute,
the FPUs starve waiting for
instructions
– Von Neumann Bottleneck
⚫ This translates to a boundary
in the roofline plot
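To make the bound concrete: in the 4-lane configuration of the next slide, and assuming each lane retires one double-precision FMA (2 FLOP) per cycle, a vmadd over a vector of length l keeps the FPUs busy for roughly l/4 cycles. Since the scalar core issues at best one vmadd every four cycles, vectors shorter than about 16 elements leave the lanes idle, and throughput is capped at 2·l/4 FLOP per cycle no matter how many lanes are present; this is the issue-rate boundary visible in the roofline plot.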
||
Ara: Figures of Merit
⚫ Ara: 4-lane implementation in GF 22FDX
⚫ Clock frequency: 1.25 GHz (nominal), 0.92 GHz (worst case), at 40 gate delays
⚫ Area: 3400 kGE (0.68 mm²)
⚫ 256 x 256 MATMUL:
– Performance: 9.8 DP-GFLOPS
– Power: 259 mW
– Efficiency: 38 DP-GFLOPS/W, ~2.5x better than Ariane on the same benchmark
[Figure: area breakdown]
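These numbers are self-consistent: 9.8 DP-GFLOPS at 259 mW gives 9.8 / 0.259 ≈ 38 DP-GFLOPS/W.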
||
Ara: Scalability
⚫ Each lane is almost independent
– Contains part of the VRF and its functional units
⚫ Scalability limitations
– VLSU and SLDU need to communicate with all banks
⚫ Instance with 16 lanes:
– 1.04 GHz (nominal), 0.78 GHz (worst case)
– 10.7 MGE (2.13 mm²)
– 32.4 DP-GFLOPS
– 40.8 DP-GFLOPS/W (peak)
[Figure labels: VLSU, Ariane, SLDU]
16 Aras give you 1 TFLOP at 12 W - NOT BAD!
||
SNITCH
▪ Built around the Snitch core: RV32I, 15 kGE
▪ Add a 64b FPU subsystem: core complex (CC)
▪ 4 CCs, MULDIV, I-cache: hive
▪ 2 hives, TCDM, peripherals: cluster
▪ N clusters, system X-bar, memory: system
▪ The float subsystem adds novel HW:
▪ 2 stream semantic registers
▪ FPU sequencer
[Figure: cluster 0 with hive 0 and hive 1 (CC0..CC3, each a Snitch core plus FPU, sharing MULDIV and I$), TCDM banks B0..B31 and peripherals; clusters 1, 2, … and memory hang off the system crossbar.]
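One way to read this hierarchy is as a set of nested replication counts; the constants below merely restate the slide in code form (N, the number of clusters, is a design parameter).

// Illustrative restatement of the Snitch hierarchy; not an actual configuration file.
enum {
    SNITCH_CORES_PER_CC    = 1,  // one RV32I Snitch core (~15 kGE) ...
    FPU_SUBSYSTEMS_PER_CC  = 1,  // ... plus one 64b FPU subsystem form a core complex (CC)
    CCS_PER_HIVE           = 4,  // four CCs share a MULDIV unit and an I-cache
    HIVES_PER_CLUSTER      = 2,  // two hives share the TCDM and the peripherals
    TCDM_BANKS_PER_CLUSTER = 32  // B0..B31 in the figure
    // N clusters, the system crossbar and memory complete the system.
};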
||
Stream Semantic Registers and FREP
▪ Vanilla RISC cores: low functional-unit utilization
▪ The usual solutions are complex: CISC, VLIW, vectoring
▪ Map registers to memory streams: SSRs
▪ Reads and writes become memory requests
▪ A programmable generator emits the addresses
+ Unmodified ISA
+ Orthogonal to hardware loops
▪ Snitch CC: the FPU instruction stream is decoupled
▪ The FPU sequencer can buffer and loop over instructions
▪ Core and FPU are fed in parallel → pseudo-dual issue

Baseline loop:
dotp: fld ft0, 0(a0)
      fld ft1, 0(a1)
      fmadd.d ft2, ft0, ft1, ft2
      addi a0, a0, 8
      addi a1, a1, 8
      blt a0, t0, dotp

With SSRs and FREP:
      call conf_addr_gen
      frep buflen, rep
      fmadd.d ft2, ft0, ft1, ft2
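Both snippets implement the same kernel; in plain C it is the dot product below, and the SSR/FREP version strips the explicit loads, pointer updates and branch out of the issued stream so that only the fmadd.d remains for the FPU.

// Scalar reference of the kernel shown in the two snippets above.
double dotp(const double *a, const double *b, long n)
{
    double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += a[i] * b[i];   // one fmadd.d per element
    return acc;
}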
Zaruba, Florian, Fabian Schuiki, Torsten Hoefler, and Luca Benini. "Snitch: A 10
kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of
Floating-Point Intensive Workloads." arXiv preprint cs.AR/2002.10143 (2020).
Schuiki, Fabian, Florian Zaruba, Torsten Hoefler, and Luca Benini. "Stream
Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute
Utilization in Single-Issue Cores." IEEE Transactions on Computers (2020).
||
SNITCH Figures of Merit
[Plot: normalized performance]
Higher performance than the vector processor (VP)
Almost 80 DP-GFlop/sW
→ 2x more efficient than Ara
→ 13 pJ per DP FLOP
→ Exascale!!
22nm FDX technology
||
System-Level Integration of Accelerators
• Our accelerators show leading performance and energy efficiency in silicon at the core and cluster level
• How can we unleash this potential in real computing systems?
Image: Nvidia Xavier die shot annotated by WikiChip. Image: Summit Supercomputer by OLCF at ORNL.
||
HERO: Open-Source Heterogeneous Research Platform
HERO combines
• general-purpose Host CPUs
• domain-specific programmable many-core accelerators
to unite versatility with performance,
enabling task offloading and data sharing across heterogeneous
• ISAs (e.g., ARMv8 and RV32)
• memory subsystems (e.g., caches and SPMs, virtual and physical addresses)
• data models (e.g., LP64 and ILP32)
• OSes and runtime libraries (e.g., Linux and OpenMP Device RTL)
with minimal run-time overhead and transparent to application programmers.
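HERO's offload model follows standard OpenMP device constructs (the OpenMP Device RTL mentioned above); the minimal host-side example below illustrates that style and is not taken from the HERO codebase.

#include <stdio.h>

// Offload a vector scale to the accelerator device; the map() clauses
// express data movement between the host and accelerator address spaces.
int main(void)
{
    enum { N = 1024 };
    float x[N];
    for (int i = 0; i < N; i++) x[i] = (float)i;

    #pragma omp target map(tofrom: x[0:N])
    for (int i = 0; i < N; i++)
        x[i] *= 2.0f;

    printf("x[10] = %f\n", x[10]);   // 20.0 if the offloaded region ran
    return 0;
}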
A. Kurth, P. Vogel, A. Marongiu, A. Capotondi, and L. Benini: "HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA." Proceedings of the First Workshop on Computer Architecture Research with RISC-V (CARRV), pp. 1-7, IEEE/ACM, 2017.
A. Kurth, A. Capotondi, P. Vogel, L. Benini, and A. Marongiu: "HERO: an Open-Source Research Platform for HW/SW Exploration of Heterogeneous Manycore Systems." Proceedings of the Second Workshop on Autotuning and Adaptivity Approaches for Energy-Efficient HPC Systems (ANDARE), pp. 13-18, ACM, 2018.
||
Network-Accelerated Memory Transfers: sPIN on PULP
Processing user-defined network packet kernels on a PULP-based accelerator in the NIC
S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beranek, M. Besta, L. Benini, D. Roweth, and T. Hoefler: "Network-Accelerated Non-Contiguous Memory Transfers." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-14, ACM, 2019.
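In the sPIN model, small user-defined handlers run on the NIC-side accelerator for each arriving packet; the fragment below is only a conceptual sketch of such a kernel, and its types and function name are invented for illustration rather than taken from the sPIN API.

#include <stdint.h>
#include <string.h>

// Conceptual per-packet kernel: scatter a non-contiguous payload directly
// into its final place in memory as packets arrive, instead of staging it
// in a bounce buffer and copying it again on the host CPU.
typedef struct { uint32_t offset; uint32_t len; } chunk_hdr_t;  // app-defined header

int scatter_packet(const uint8_t *payload, uint32_t len, uint8_t *dst)
{
    const chunk_hdr_t *hdr = (const chunk_hdr_t *)payload;
    if (len < sizeof(*hdr) || hdr->len > len - sizeof(*hdr))
        return -1;                                   // malformed packet, drop it
    memcpy(dst + hdr->offset, payload + sizeof(*hdr), hdr->len);
    return 0;
}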
||
ControlPULP
[Figure: ControlPULP sits next to the PEs, VRM, DIMMs and BMC; an in-band path connects it to the operating system and application (governors, hints/prescriptions, power cap, energy vs. throughput), and an out-of-band path over RJ45 connects it to system management / RM (node power cap, RAS).]
Main architectural blocks:
- Sensors (PVT, utilization, architectural)
- Controls (f, Vdd, Vbb, power gating, clock gating)
- In-band, i.e. low-latency / user-space telemetry (power, performance, …)
- O.S. PM governors:
  - cpufreq / cpuidle
  - based on O.S. metrics
  - slow & often unused
- Low-latency PM requests and/or suggestions from the application/run-time:
  - Power cap => max perf @ P < Pmax
  - Energy => min energy @ f = f*
  - Throughput => F > Fmax @ T, P < max
- Out-of-band, zero-overhead telemetry
- Node power cap: max perf @ Pnode < Pmax
- RAS: error and condition reporting
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
Coming Soon…
Andrea Bartolini, et al. "A PULP-based Parallel Power Controller for Future Exascale Systems," in: 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS).
||
ControlPULP
Coming Soon…
PM task:
• Read voltage regulator power and status (VRM)
• Update the power model
T control (every control period):
• Watchdog reset
• Write power controller settings
• Write telemetry data to internal memory
• Read PVT sensors
• Read workload from the O.S.
• Read target P/C-state settings and power budget
• Read pending BMC requests
• Compute controller settings
BMC task:
• Read the pending command queue
• Decode command/data
• Perform the action:
  • Change target P/C state or power budget
  • Set pending BMC
  • Ask for telemetry data
A sketch of the periodic control loop follows the acknowledgements below.
Copyright © European Processor Initiative 2019. EPI Tutorial/Bologna/22-01-2020
Ack. Robert Balas, Giovanni Bambini, Andrea Bentivogli, Davide Rossi, Antonio Mastrandrea, Christian Conficoni, Simone Benatti, Andrea Tilli
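The periodic "T control" structure above can be outlined in pseudo-C; every function name is a placeholder for the corresponding firmware step on the slide (not an actual ControlPULP API), and the writes at the top of the period presumably apply the settings computed in the previous iteration.

// Hypothetical firmware hooks, one per step listed on the slide.
void kick_watchdog(void);
void write_power_controller_settings(void);
void write_telemetry_to_memory(void);
void read_pvt_sensors(void);
void read_os_workload(void);
void read_target_pc_states_and_power_budget(void);
void read_pending_bmc_requests(void);
void compute_controller_settings(void);

// One iteration of the periodic control task, following the order above.
void t_control_period(void)
{
    kick_watchdog();                            // watchdog reset
    write_power_controller_settings();          // apply previously computed f/Vdd settings
    write_telemetry_to_memory();                // expose telemetry for in/out-of-band readers
    read_pvt_sensors();                         // power, voltage, thermal sensors
    read_os_workload();                         // workload hints from the O.S.
    read_target_pc_states_and_power_budget();   // targets set in-band or by the BMC
    read_pending_bmc_requests();                // out-of-band commands
    compute_controller_settings();              // run the power-cap / thermal control law
}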
||
HPC Vertical: The European Processor Initiative
▪ High-performance general-purpose processor for HPC
▪ High-performance RISC-V based accelerator
▪ Computing platform for autonomous cars
▪ Will also target the AI, Big Data and other markets in order to be economically sustainable
Europe needs its own processors:
▪ Processors now control almost every aspect of our lives
▪ Security (back doors etc.)
▪ Possible future restrictions on exports to the EU due to increasing protectionism
▪ A competitive EU supply chain for HPC technologies will create jobs and growth in Europe
▪ Sovereignty (data, economic, embargo)
||
First Generation EPI chips
General Purpose Processor (GPP) chip:
▪ 7 nm, chiplet technology
▪ ARM-SVE tiles
▪ EPAC RISC-V vector+AI accelerator tiles
▪ L1, L2, L3 cache subsystem + HBM + DDR
RISC-V Accelerator Demonstrator Test Chip:
▪ 22 nm FDSOI
▪ Only one RISC-V accelerator tile
▪ On-chip L1, L2 + off-chip HBM + DDR PHY
▪ Targets 128 DP-GFLOPS (vector) and 200+ GOPS/W SP (STX)
[Die diagrams: the GPP chip with its HBM and DDR interfaces; the test chip with high-speed SerDes, two scalar cores, vector lanes and STX units.]
Scalar core + STX units based on NTX and Snitch!
GPP power manager based on ControlPULP!
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
||
…. a Swiss army knife for HPC
EXASCALE 2021
||
http://pulp-platform.org
The fun is just beginning