²Integrated Systems Laboratory
¹Department of Electrical, Electronic and Information Engineering
RISC-V open-ISA and open-HW – a Swiss army knife for HPC
ICS2020, Workshop on RISC-V and OpenPOWER, 29.06.2020
Andrea Bartolini¹ & PULP team¹,²
||
Energy efficiency challenge: Exascale
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
[Chart: peak system PERFORMANCE (1 TFLOPS up to 100 EFLOPS) and ENERGY PER OPERATION (2 nJ/FLOP down to 0.2 pJ/FLOP) over 2000-2033; performance grows x10 every 4 years while energy per operation must shrink /10 every 4 years.]
HPC is now power-bound → we need a 10x energy-efficiency improvement every 4 years.
*For a 20 MWatt supercomputer, performance and energy per operation are two sides of the same budget: 1 EFLOPS → 20 pJ/FLOP.
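The 20 pJ/FLOP figure is simply the power budget divided by the target rate: 20 MW / 10^18 FLOP/s = 2·10^-11 J/FLOP = 20 pJ/FLOP.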
||
HPC trends
Peak Performance (Moore's law): Exaflops = 10^18 FLOP/s
FPU Performance (Dennard scaling): Gigaflops = 10^9 FLOP/s
Number of FPUs (Moore + Dennard): 10^9
App. Parallelism (Amdahl's law): serial fraction 1/10^9
We need programmability support, and we are already at 20 MWatt.
C. Cavazzoni
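The Amdahl figure follows from the speedup bound S(N) = 1 / (s + (1 - s)/N): to keep on the order of 10^9 gigaflop-class FPUs busy and approach the 10^9-fold speedup an exaflop machine implies, the serial fraction s of the application must stay around 1/10^9 or below, which is why programmability and parallelization support matter as much as raw hardware.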
||
“Traditional” CPU chips are designed for maximum performance on all possible workloads.
Silicon area is wasted to maximize single-thread performance.
[Figure: compute power vs. energy within a fixed datacenter capacity.]
C. Cavazzoni
Energy trends
||
New chips are designed for maximum performance on a reduced set of workloads.
Simple functional units, poor single-thread performance, but maximum throughput.
[Figure: compute power vs. energy within a fixed datacenter capacity.]
C. Cavazzoni
Change of paradigm #1
||
Change of paradigm #2
||
https://indico-jsc.fz-juelich.de/event/76/session/0/contribution/1/material/slides/0.pdf
Wayne Joubert - OpenPOWER ADG 2018
Change of paradigm #3
||
Change of paradigm #4
[Figure: a scalable monitoring framework; heterogeneous sensors from the cluster, CRAC, PDU and environment feed a common interface, on top of which sit performance analysis, machine learning, data visualization, resource management, energy efficiency and job scheduling, closing reactive and proactive feedback loops.]
||
…. a Swiss army knife for HPC
ARIANE: The 64b Application Processor
ARA: The Vector Engine
NTX: The Network Training Accelerator
HERO: The Open Heterogeneous Research Platform
ControlPULP: The Power Controller for HPC servers
SNITCH: The Pseudo Dual-Issue Processor for FP Workloads
sPIN on PULP: Network-Accelerated Memory Transfers
EXASCALE 2021
https://pulp-platform.org/
||
Architecture: Ariane RISC-V Cores
▪ RV64GC, 6-stage, in-order issue, out-of-order execute
▪ 16 KiB instruction cache, 32 KiB data cache
▪ Transprecision floating-point unit (TP-FPU) [3]
▪ double-, single- and half-precision FP formats
▪ two custom formats, FP16alt and FP8
▪ all standard RISC-V formats as well as SIMD
▪ Two different implementations:
▪ Ariane High Performance (AHP): tuned for high-performance applications
▪ Ariane Low Power (ALP): tuned for light, single-threaded applications
ARIANE: The 64b Application Processor
"The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-ready 1.7-GHz 64-bit RISC-V Core in 22-nm FDSOI Technology"
||
OpenPiton+Ariane
▪ Boots SMP Linux
▪ New write-through cache subsystem with invalidations and the TRI interface
▪ LR/SC in the L1.5 cache
▪ Fetch-and-op in the L2 cache
▪ RISC-V Debug
▪ RISC-V Peripherals
If you are really passionate about cache-coherent “scalable” machines…
"OpenPiton+Ariane: The First Open-Source, SMP Linux-booting RISC-V System Scaling From One to Many Cores"
||
Architecture: Network Training Accelerator (NTX)
▪ “Network Training Accelerator”
▪ 32-bit float streaming co-processor (IEEE 754 compatible)
▪ Custom 300-bit “wide-inside” fused multiply-accumulate
▪ 1.7x lower RMSE than a conventional FPU
▪ 1 RISC-V core (“RI5CY”) and DMA
▪ 8 NTX co-processors
▪ 64 kB L1 scratchpad memory (comparable to 48 kB in V100)
Key ideas to increase hardware efficiency:
▪ Reduction of the von Neumann bottleneck (load/store elision through streaming)
▪ Latency hiding through DMA-based double-buffering
Schuiki, Fabian, Michael Schaffner, Frank K. Gürkaynak, and Luca Benini. "A scalable near-memory architecture for training deep neural networks on large in-memory datasets." IEEE Transactions on Computers 68, no. 4 (2018): 484-497.
Schuiki, Fabian, Michael Schaffner, and Luca Benini. "NTX: An energy-efficient streaming accelerator for floating-point generalized reduction workloads in 22 nm FD-SOI." In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 662-667. IEEE, 2019.
||
Flexible Architecture NTX accelerated cluster
▪ 1 processor core controls 8 NTX coprocessors
▪ Attached to 128 kB shared TCDM via a logarithmic interconnect
▪ DMA engine used to transfer data (double buffering; see the sketch below)
▪ Multiple clusters connected via interconnect (crossbar/NoC)
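As a concrete illustration of the double-buffering pattern above, the C sketch below overlaps DMA transfers with NTX computation; dma_copy_in, dma_wait and ntx_run are hypothetical helper names, not the actual PULP runtime API.

// Hypothetical helpers standing in for the cluster DMA and the NTX job launch.
extern void dma_copy_in(float *dst, const float *src, int elems);
extern void dma_wait(void);
extern void ntx_run(const float *l1_tile, int elems);

// Process n_tiles tiles of tile_elems floats each, ping-ponging between the
// two halves of the L1 scratchpad while the DMA prefetches the next tile.
void process_tiles(const float *ext, float *l1_buf[2], int n_tiles, int tile_elems)
{
    int cur = 0;
    dma_copy_in(l1_buf[cur], ext, tile_elems);             // prefetch first tile
    for (int t = 0; t < n_tiles; t++) {
        dma_wait();                                        // tile t is now resident in L1
        if (t + 1 < n_tiles)                               // start fetching tile t+1
            dma_copy_in(l1_buf[cur ^ 1], ext + (t + 1) * tile_elems, tile_elems);
        ntx_run(l1_buf[cur], tile_elems);                  // compute on the resident tile
        cur ^= 1;                                          // swap buffers
    }
}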
||
Network Training Accelerator (NTX)
▪ The processor configures the NTX register interface and manages DMA double-buffering in L1 memory
▪ The controller issues AGU, HWL, and FPU micro-commands based on that configuration
▪ AGUs generate the address streams for data access (modeled in the sketch below)
▪ FMAC with extended precision + ML functions
▪ Reads/writes data via 2 memory ports (2 operand streams and 1 write-back stream)
[Figure: one RISC-V core ("1 for 8") controlling eight NTX co-processors attached to a multi-banked L1.]
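To make the role of the AGUs and hardware loops (HWL) concrete, the C model below enumerates a two-level affine address stream; the structure and its field names are invented for this example and only stand in for whatever the processor writes to the register interface.

#include <stdint.h>

// Hypothetical two-level affine address-stream configuration.
typedef struct {
    uintptr_t base;       // base address of the operand stream
    int       bound[2];   // iteration counts: inner (0) and outer (1) hardware loop
    intptr_t  stride[2];  // byte strides: inner (0) and outer (1) hardware loop
} agu_cfg_t;

// Software model of the address sequence an AGU would emit: one operand
// address per (outer, inner) iteration of the hardware loops.
static void emit_stream(const agu_cfg_t *cfg, void (*access)(uintptr_t addr))
{
    for (int i = 0; i < cfg->bound[1]; i++)          // outer hardware loop
        for (int j = 0; j < cfg->bound[0]; j++)      // inner hardware loop
            access(cfg->base + (uintptr_t)(i * cfg->stride[1] + j * cfg->stride[0]));
}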
||
Kosmodrom: Ariane and NTX on the same technology
▪ 22nm FDX technology
▪ Two application-class RISC-V Ariane cores [1] - DP
▪ RV64GCXsmallfloat
▪ General-purpose workloads
▪ Network Training Accelerator (NTX) [2] - FP
▪ Accelerates oblivious kernels: deep neural network training, stencils, general linear algebra workloads
▪ 1.25 MiB of shared L2 memory
▪ Peripherals
ARIANE: The 64b Application Processor
Schuiki, Fabian, Michael Schaffner, and Luca Benini. "NTX: A 260 Gflop/sW Streaming Accelerator for Oblivious Floating-Point Algorithms in 22 nm FD-SOI." In 2019 International SoC Design Conference (ISOCC), pp. 117-118. IEEE, 2019.
||
Summary on Kosmodrom: State of the Art
▪ We achieve higher energy efficiency for AHP and ALP than competitive RISC-V processors (Rocket)
▪ Ariane contains slightly larger caches (32 KiB compared to 16 KiB)
▪ The ALP implementation is penalized by the less mature cell libraries available to us (7k cells vs. 2k cells)
▪ NTX achieves a 2x gain in energy efficiency compared to Tesla V100
[Figure annotations: 6x, 18x]
Zaruba, Florian, Fabian Schuiki, Stefan Mach, and Luca Benini. "The Floating Point Trinity: A Multi-modal Approach to Extreme Energy-Efficiency and Performance." In 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 767-770. IEEE, 2019.
||
Enter Ara: an Open-Source RISC-V Vector Engine
⚫ Ara targets 0.5 DP-FLOP/B
– Memory bandwidth scales with the number of physical lanes
[Block diagram: Ariane (1 GHz, 2 DP-GFLOPS, 8 GB/s, I$/D$, 64b instruction and data ports) coupled to Ara (1 GHz, 16 DP-GFLOPS, 32 GB/s, VRF, 256b data port) through an instruction queue with an ACK/TRAP back-channel, both sitting on a 256b interconnect.]
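The 0.5 DP-FLOP/B target matches the diagram: 16 DP-GFLOPS against 32 GB/s of memory bandwidth is 16/32 = 0.5 FLOP per byte, and because both the FPUs and the memory interface grow with the lane count, the ratio is preserved as lanes are added.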
Cavalcante, Matheus, Fabian Schuiki, Florian Zaruba, Michael Schaffner, and Luca
Benini. "Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor
With Multiprecision Floating-Point Support in 22-nm FD-SOI." IEEE Transactions on
Very Large Scale Integration (VLSI) Systems 28, no. 2 (2019): 530-543.
||
Matrix multiplication on Ara
⚫ Load row i of matrix B into vB
⚫ for (int j = 0; j < n; j++)
– Load element A[j, i]
– Broadcast it into vA
– vC ← vA · vB + vC

vld vB, 0(addrB)
(Unrolled loop)
ld t0, 0(addrA)
addi addrA, addrA, 8
vins vA, t0, zero
vmadd vC, vA, vB, vC
ld t0, 0(addrA)
addi addrA, addrA, 8
vins vA, t0, zero
vmadd vC, vA, vB, vC
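For reference, the scalar C equivalent of one such step (the update the vector code performs for a fixed row i of B); the names mirror the pseudocode above and row-major storage is assumed.

// One step of the formulation above: C[j][:] += A[j][i] * B[i][:] for all j.
void matmul_step(int n, int i, const double *A, const double *B, double *C)
{
    const double *vB = &B[i * n];              // row i of B, held in a vector register
    for (int j = 0; j < n; j++) {
        double vA = A[j * n + i];              // scalar load, broadcast into vA
        for (int k = 0; k < n; k++)            // one vmadd over the whole row
            C[j * n + k] += vA * vB[k];        // vC <- vA * vB + vC
    }
}

Repeating this step for every i (every row of B) completes C = A·B; on Ara the inner k loop is a single vmadd, so each j iteration costs one scalar load, one vins, and one vmadd.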
||
Issue rate performance limitation
⚫ vmadds are issued at best
every four cycles
– Since Ariane is single-issue
⚫ If the vector MACs take less
than four cycles to execute,
the FPUs starve waiting for
instructions
– Von Neumann Bottleneck
⚫ This translates to a boundary
in the roofline plot
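To make the bound concrete: in the 4-lane configuration of the next slide, and assuming each lane retires one double-precision FMA (2 FLOP) per cycle, a vmadd over a vector of length l keeps the FPUs busy for roughly l/4 cycles. Since the scalar core issues at best one vmadd every four cycles, vectors shorter than about 16 elements leave the lanes idle, and throughput is capped at 2·l/4 FLOP per cycle no matter how many lanes are present; this is the issue-rate boundary visible in the roofline plot.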
||
Ara: Figures of Merit
⚫ Ara: 4-lane implementation in GF 22FDX
⚫ Clock frequency: 1.25 GHz (nominal), 0.92 GHz (worst case), at 40 gate delays
⚫ Area: 3400 kGE (0.68 mm²)
⚫ 256 x 256 MATMUL:
– Performance: 9.8 DP-GFLOPS
– Power: 259 mW
– Efficiency: 38 DP-GFLOPS/W, ~2.5x better than Ariane on the same benchmark
[Figure: area breakdown]
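These numbers are self-consistent: 9.8 DP-GFLOPS at 259 mW gives 9.8 / 0.259 ≈ 38 DP-GFLOPS/W.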
||
Ara: Scalability
⚫ Each lane is almost independent
– Contains part of the VRF and its functional units
⚫ Scalability limitations
– VLSU and SLDU need to communicate with all banks
⚫ Instance with 16 lanes:
– 1.04 GHz (nominal), 0.78 GHz (worst case)
– 10.7 MGE (2.13 mm²)
– 32.4 DP-GFLOPS
– 40.8 DP-GFLOPS/W (peak)
[Figure labels: VLSU, Ariane, SLDU]
16 Aras give you 1 TFLOP at 12 W - NOT BAD!
||
SNITCH
▪ Built around the Snitch core: RV32I, 15 kGE
▪ Add a 64b FPU subsystem: core complex (CC)
▪ 4 CCs, MULDIV, I-cache: hive
▪ 2 hives, TCDM, peripherals: cluster
▪ N clusters, system X-bar, memory: system
▪ The float subsystem adds novel HW:
▪ 2 stream semantic registers
▪ FPU sequencer
[Figure: cluster 0 with hive 0 and hive 1 (CC0..CC3, each a Snitch core plus FPU, sharing MULDIV and I$), TCDM banks B0..B31 and peripherals; clusters 1, 2, … and memory hang off the system crossbar.]
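One way to read this hierarchy is as a set of nested replication counts; the constants below merely restate the slide in code form (N, the number of clusters, is a design parameter).

// Illustrative restatement of the Snitch hierarchy; not an actual configuration file.
enum {
    SNITCH_CORES_PER_CC    = 1,  // one RV32I Snitch core (~15 kGE) ...
    FPU_SUBSYSTEMS_PER_CC  = 1,  // ... plus one 64b FPU subsystem form a core complex (CC)
    CCS_PER_HIVE           = 4,  // four CCs share a MULDIV unit and an I-cache
    HIVES_PER_CLUSTER      = 2,  // two hives share the TCDM and the peripherals
    TCDM_BANKS_PER_CLUSTER = 32  // B0..B31 in the figure
    // N clusters, the system crossbar and memory complete the system.
};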
||
Stream Semantic Registers and FREP
▪ Vanilla RISC cores: low functional-unit utilization
▪ The usual solutions are complex: CISC, VLIW, vectoring
▪ Map registers to memory streams: SSRs
▪ Reads and writes become memory requests
▪ A programmable generator emits the addresses
+ Unmodified ISA
+ Orthogonal to hardware loops
▪ Snitch CC: the FPU instruction stream is decoupled
▪ The FPU sequencer can buffer and loop over instructions
▪ Core and FPU are fed in parallel → pseudo-dual issue

Baseline loop:
dotp: fld ft0, 0(a0)
      fld ft1, 0(a1)
      fmadd.d ft2, ft0, ft1, ft2
      addi a0, a0, 8
      addi a1, a1, 8
      blt a0, t0, dotp

With SSRs and FREP:
      call conf_addr_gen
      frep buflen, rep
      fmadd.d ft2, ft0, ft1, ft2
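Both snippets implement the same kernel; in plain C it is the dot product below, and the SSR/FREP version strips the explicit loads, pointer updates and branch out of the issued stream so that only the fmadd.d remains for the FPU.

// Scalar reference of the kernel shown in the two snippets above.
double dotp(const double *a, const double *b, long n)
{
    double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += a[i] * b[i];   // one fmadd.d per element
    return acc;
}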
Zaruba, Florian, Fabian Schuiki, Torsten Hoefler, and Luca Benini. "Snitch: A 10
kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of
Floating-Point Intensive Workloads." arXiv preprint cs.AR/2002.10143 (2020).
Schuiki, Fabian, Florian Zaruba, Torsten Hoefler, and Luca Benini. "Stream
Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute
Utilization in Single-Issue Cores." IEEE Transactions on Computers (2020).
||
SNITCH Figures of Merit
[Plot: normalized performance]
Higher performance than the vector processor (VP)
Almost 80 DP-GFlop/sW
→ 2x more efficient than Ara
→ 13 pJ per DP FLOP
→ Exascale!!
22nm FDX technology
||
System-Level Integration of Accelerators
• Our accelerators show leading performance and energy efficiency in silicon at the core and cluster level
• How can we unleash this potential in real computing systems?
Image: Nvidia Xavier die shot annotated by WikiChip. Image: Summit Supercomputer by OLCF at ORNL.
||
HERO: Open-Source Heterogeneous Research Platform
HERO combines
• general-purpose Host CPUs
• domain-specific programmable many-core accelerators
to unite versatility with performance,
enabling task offloading and data sharing across heterogeneous
• ISAs (e.g., ARMv8 and RV32)
• memory subsystems (e.g., caches and SPMs, virtual and physical addresses)
• data models (e.g., LP64 and ILP32)
• OSes and runtime libraries (e.g., Linux and OpenMP Device RTL)
with minimal run-time overhead and transparent to application programmers.
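HERO's offload model follows standard OpenMP device constructs (the OpenMP Device RTL mentioned above); the minimal host-side example below illustrates that style and is not taken from the HERO codebase.

#include <stdio.h>

// Offload a vector scale to the accelerator device; the map() clauses
// express data movement between the host and accelerator address spaces.
int main(void)
{
    enum { N = 1024 };
    float x[N];
    for (int i = 0; i < N; i++) x[i] = (float)i;

    #pragma omp target map(tofrom: x[0:N])
    for (int i = 0; i < N; i++)
        x[i] *= 2.0f;

    printf("x[10] = %f\n", x[10]);   // 20.0 if the offloaded region ran
    return 0;
}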
A. Kurth, P. Vogel, A. Marongiu, A. Capotondi, and L. Benini: "HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA." Proceedings of the First Workshop on Computer Architecture Research with RISC-V (CARRV), pp. 1-7, IEEE/ACM, 2017.
A. Kurth, A. Capotondi, P. Vogel, L. Benini, and A. Marongiu: "HERO: an Open-Source Research Platform for HW/SW Exploration of Heterogeneous Manycore Systems." Proceedings of the Second Workshop on Autotuning and Adaptivity Approaches for Energy-Efficient HPC Systems (ANDARE), pp. 13-18, ACM, 2018.
||
Network-Accelerated Memory Transfers: sPIN on PULP
Processing user-defined network packet kernels on a PULP-based accelerator in the NIC
S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beranek, M. Besta, L. Benini, D. Roweth, and T. Hoefler: "Network-Accelerated Non-Contiguous Memory Transfers." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-14, ACM, 2019.
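In the sPIN model, small user-defined handlers run on the NIC-side accelerator for each arriving packet; the fragment below is only a conceptual sketch of such a kernel, and its types and function name are invented for illustration rather than taken from the sPIN API.

#include <stdint.h>
#include <string.h>

// Conceptual per-packet kernel: scatter a non-contiguous payload directly
// into its final place in memory as packets arrive, instead of staging it
// in a bounce buffer and copying it again on the host CPU.
typedef struct { uint32_t offset; uint32_t len; } chunk_hdr_t;  // app-defined header

int scatter_packet(const uint8_t *payload, uint32_t len, uint8_t *dst)
{
    const chunk_hdr_t *hdr = (const chunk_hdr_t *)payload;
    if (len < sizeof(*hdr) || hdr->len > len - sizeof(*hdr))
        return -1;                                   // malformed packet, drop it
    memcpy(dst + hdr->offset, payload + sizeof(*hdr), hdr->len);
    return 0;
}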
||
ControlPULP
[Figure: ControlPULP sits next to the PEs, VRM, DIMMs and BMC; an in-band path connects it to the operating system and application (governors, hints/prescriptions, power cap, energy vs. throughput), and an out-of-band path over RJ45 connects it to system management / RM (node power cap, RAS).]
Main architectural blocks:
- Sensors (PVT, utilization, architectural)
- Controls (f, Vdd, Vbb, power gating, clock gating)
- In-band, i.e. low-latency / user-space telemetry (power, performance, …)
- O.S. PM governors:
  - cpufreq / cpuidle
  - based on O.S. metrics
  - slow & often unused
- Low-latency PM requests and/or suggestions from the application/run-time:
  - Power cap => max perf @ P < Pmax
  - Energy => min energy @ f = f*
  - Throughput => F > Fmax @ T, P < max
- Out-of-band, zero-overhead telemetry
- Node power cap: max perf @ Pnode < Pmax
- RAS: error and condition reporting
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
Coming Soon…
Andrea Bartolini, et al. "A PULP-based Parallel Power Controller for Future Exascale Systems," in: 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS).
||
ControlPULP
Coming Soon…
PM task:
• Read voltage regulator power and status (VRM)
• Update the power model
T control (every control period):
• Watchdog reset
• Write power controller settings
• Write telemetry data to internal memory
• Read PVT sensors
• Read workload from the O.S.
• Read target P/C-state settings and power budget
• Read pending BMC requests
• Compute controller settings
BMC task:
• Read the pending command queue
• Decode command/data
• Perform the action:
  • Change target P/C state or power budget
  • Set pending BMC
  • Ask for telemetry data
A sketch of the periodic control loop follows the acknowledgements below.
Copyright © European Processor Initiative 2019. EPI Tutorial/Bologna/22-01-2020
Ack. Robert Balas, Giovanni Bambini, Andrea Bentivogli, Davide Rossi, Antonio Mastrandrea, Christian Conficoni, Simone Benatti, Andrea Tilli
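The periodic "T control" structure above can be outlined in pseudo-C; every function name is a placeholder for the corresponding firmware step on the slide (not an actual ControlPULP API), and the writes at the top of the period presumably apply the settings computed in the previous iteration.

// Hypothetical firmware hooks, one per step listed on the slide.
void kick_watchdog(void);
void write_power_controller_settings(void);
void write_telemetry_to_memory(void);
void read_pvt_sensors(void);
void read_os_workload(void);
void read_target_pc_states_and_power_budget(void);
void read_pending_bmc_requests(void);
void compute_controller_settings(void);

// One iteration of the periodic control task, following the order above.
void t_control_period(void)
{
    kick_watchdog();                            // watchdog reset
    write_power_controller_settings();          // apply previously computed f/Vdd settings
    write_telemetry_to_memory();                // expose telemetry for in/out-of-band readers
    read_pvt_sensors();                         // power, voltage, thermal sensors
    read_os_workload();                         // workload hints from the O.S.
    read_target_pc_states_and_power_budget();   // targets set in-band or by the BMC
    read_pending_bmc_requests();                // out-of-band commands
    compute_controller_settings();              // run the power-cap / thermal control law
}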
||
HPC Vertical: The European Processor Initiative
▪ High-performance general-purpose processor for HPC
▪ High-performance RISC-V based accelerator
▪ Computing platform for autonomous cars
▪ Will also target the AI, Big Data and other markets in order to be economically sustainable
Europe needs its own processors:
▪ Processors now control almost every aspect of our lives
▪ Security (back doors etc.)
▪ Possible future restrictions on exports to the EU due to increasing protectionism
▪ A competitive EU supply chain for HPC technologies will create jobs and growth in Europe
▪ Sovereignty (data, economic, embargo)
||
First Generation EPI chips
General Purpose Processor (GPP) chip:
▪ 7 nm, chiplet technology
▪ ARM-SVE tiles
▪ EPAC RISC-V vector+AI accelerator tiles
▪ L1, L2, L3 cache subsystem + HBM + DDR
RISC-V Accelerator Demonstrator Test Chip:
▪ 22 nm FDSOI
▪ Only one RISC-V accelerator tile
▪ On-chip L1, L2 + off-chip HBM + DDR PHY
▪ Targets 128 DP-GFLOPS (vector) and 200+ GOPS/W SP (STX)
[Die diagrams: the GPP chip with its HBM and DDR interfaces; the test chip with high-speed SerDes, two scalar cores, vector lanes and STX units.]
Scalar core + STX units based on NTX and Snitch!
GPP power manager based on ControlPULP!
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
||
…. a Swiss army knife for HPC
EXASCALE 2021
||
http://pulp-platform.org
The fun is just beginning