Mirabilis_Presentation_DAC_June_2024.pptx

Mirabilis Design
EDA Software Company based in Silicon Valley
Integrating sub-system teams to the mission using System-Level Design
Highly experienced management and engineering team
Over 150 man-years of background in semiconductors, automotive and aerospace
VisualSim Architect: Design the Right Product
Graphical modeling and simulation platform with complete set of system-level modeling IP
Eliminate all surprises prior to integration
Optimize specifications and collaboration between mission, sub-systems and suppliers; evaluate use-cases and identify test scenarios for system validation
Networking
18th companies
& 32nd universities
Electronics Modeling
35th customer
2008
Company Incorporated
2011
First Engagement with
HP and ISRO
2013
Announced
VisualSim
2014
University Program
10th Customer
2015
Stochastic and
Network modeling
2016 2018 2019
Automotive
& Avionics
2020
System-level IP
Open API
2022/23
Re-engineered
AI, DNN, Power, GPU
2021
Requirements Tracking
50th customer

VisualSim- The Product
Spend time designing … not working on Word/Excel/PowerPoint

VisualSim IP Library
Custom Creator: Scripting, RegEx, Task graph, Use cases, Hardware Builder, C/C++/Java/Python, MatLab, STK
Communication: RF, Baseband, Channels, Communication systems, A/D transceivers, Antenna, Analog, Signal/audio/Image Processing
Power: Power States, Allocation, Transition, Loss, Battery, Consumption, Management, Generation, Distribution, and Thermal
Traffic: Sensors, Interfaces, Distribution, Traces, Software, VCD, ML, DNN
Reports: Latency, Throughput, Utilization, Ave/peak power (instant, ave), hit-ratio, Heat, Temp
RISC-V and Chiplets: SiFive, In-Order/Out-of-Order Generator, Tilelink
RTOS and Software: Generic RTOS, ARINC 653, AUTOSAR, task Graph
SOC: AMBA (AHB/APB/AXI/CHI), Tilelink, Corelink (600, 700), NoC (Generic, Arteris, Signature, OpenEdges), Virtual Channel, DMA, Crossbar, Serial Switch, Bridge, UCIe
Board-Level: VME, PCI/PCI-X/PCIe 6.0, SPI 3.0, 1553B, FlexRay, CAN-FD/XL, AFDX, TTEthernet, OpenVPX
Processors: ARM (M0-55), R5, Cortex (A8, A72, A53, A76, A77, A65, A78, A720), Nvidia (Pascal to Ampere), Generic GPU, µC, Leon, Power, X86, DSP (TI and ADI), Tensilica, Renesas SH, AI Engine, TPU
Stochastic: Queue, Time Queue, Quantity Queue, Resources, Scheduler
Storage: Flash, NVMe, Disk, SSD, NAS, Fibre Channel, FireWire
Networking: TSN, AVB, 10BaseT1S, Switched Ethernet, Resilient Packet Ring, RP3, WiFi 802.11, Bluetooth, PAN, Spacewire, SpaceFibre, IEEE802.1Q, Time-Triggered Ethernet, AFDX, 5G
Memory: Memory Controller, SDR, DDR DRAM 2, 3, 4, 5, LPDDR 2, 3, 4, 5, HBM2.0, HMC, QDR, RDRAM, MPMC, cache, Coherent cache
FPGA: Xilinx (Versal, Zynq, Ultrascale, Kintex), Altera (Stratix, Arria), Microsemi (Smartfusion), Programmable logic generator
Trade-Off: Requirements, Thermal, Power, Performance, Failure Verification, Upgrade

Assemble System Model using Pre-Built System-Level IP
Scheduling/Arbitration: proportional share, WFQ, static, dynamic, fixed priority, EDF, TDMA, FCFS
Communication Templates
Computation Templates: DSP, AI, GPU, DRAM, CPU, FPGA, µE
[Figure: Architecture #1 and Architecture #2. Labels: DSP, TDMA, Priority, EDF, WFQ, RISC, LookUp, Cipher, AI, CPU, GPU, µE, DDR, static]
Which architecture is better suited for our application?

Add the Task Graph to Define the Workload
I/O
DSP
CPU1
CPU2
task1 task2 task3 task4
Contention
- limited resources
- scheduling/arbitration
Interference of multiple
applications
- limited resources
- scheduling/arbitration
- anomalies
Complex behavior
- input stream
- data dependent behavior
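
The contention effect listed above can be illustrated with a minimal Python sketch. This is not VisualSim code. The task and resource names come from the slide, while the execution times, the first-come first-served policy and the two-application scenario are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    resource: str   # resource the task is mapped to
    exec_ms: float  # hypothetical execution time in milliseconds


def run_fcfs(apps, arrivals):
    """Simulate chained task graphs that share resources, served first-come first-served.

    apps: dict of app name -> ordered list of Task (each task depends on the previous one)
    arrivals: dict of app name -> time the application's input arrives
    Returns a dict of app name -> end-to-end latency.
    """
    free_at = {}                      # resource -> time it becomes free
    ready = dict(arrivals)            # app -> time its next task becomes ready
    index = {app: 0 for app in apps}  # app -> position of its next task
    latency = {}
    while len(latency) < len(apps):
        # Serve the pending application whose next task became ready first.
        app = min((a for a in apps if a not in latency), key=lambda a: (ready[a], a))
        task = apps[app][index[app]]
        start = max(ready[app], free_at.get(task.resource, 0.0))
        end = start + task.exec_ms
        free_at[task.resource] = end
        ready[app] = end
        index[app] += 1
        if index[app] == len(apps[app]):
            latency[app] = end - arrivals[app]
    return latency


CHAIN = [
    Task("task1", "I/O", 1.0),
    Task("task2", "DSP", 4.0),
    Task("task3", "CPU1", 2.0),
    Task("task4", "CPU2", 3.0),
]

alone = run_fcfs({"A": CHAIN}, {"A": 0.0})
shared = run_fcfs({"A": CHAIN, "B": CHAIN}, {"A": 0.0, "B": 0.0})
print(alone)   # {'A': 10.0}
print(shared)  # {'A': 10.0, 'B': 14.0}
```

With these made-up numbers, the second application waits behind the first on the I/O and DSP resources, so its latency grows even though no task became slower.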

Analyze the Results
System with faster Bus is slower in places
Unpredictable System Response

Impact of System Architecture Exploration
• System sizing and topology design
• Power consumption, cooling & management
• Device distribution across one/multi-die
• Application mapping on CPU, GPU, TPU, DSP
• SW, firmware, scheduler and network tuning
• Merges Shift-Left and Shift-Right
• System-level model integrates requirements,
creates a single model of the entire system,
trades off power, performance and area, and
generates tests
• To optimize associated area
• To design thermal structure
• To create Chiplet IP industry
• To meet timing and power
• To meet mission requirements
• Single platform from Concept to End-of-Life
• Collaboration between design teams,
suppliers, customers

ARM Cortex A53
Benchmark | FPGA | VisualSim | Difference | Comments
ED1 | 5.94 ms | 6.425 ms | 7.55% | Integer processing
MM | 12.084 ms | 11.863 ms | 1.08% | Most load operations with random addresses
MM_st | 13.984 ms | 14.65 ms | 4.5% | Most store operations with random addresses
Test System
Xilinx Zynq® UltraScale+™ XCZU9EG-2FFVB1156E MPSoC running on the ZCU102 board
Specification: 4-core ARM Cortex A53 at 1200 MHz; 32 KiB I-cache; 32 KiB D-cache; 1 MiB L2; 2 GB DDR4-2400 DRAM

Comparing Power for ARM Cortex A53
Frequency | VisualSim Simulated Power | Measured Power as reported by Anandtech | Delta percentage
500.0 MHz | 0.037 W | 0.038 W | 2.63%
600.0 MHz | 0.053 W | 0.051 W | -3.92%
700.0 MHz | 0.073 W | 0.080 W | 8.75%
800.0 MHz | 0.097 W | 0.090 W | -7.77%
1000.0 MHz | 0.157 W | 0.159 W | 1.25%
1100.0 MHz | 0.193 W | 0.188 W | -2.65%
1200.0 MHz | 0.233 W | 0.227 W | -2.64%
1300.0 MHz | 0.277 W | 0.269 W | -2.97%
Source: Anandtech.com
Over 97% accuracy
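
For readers who want to recompute the delta column, here is a small Python sketch. It assumes the delta is (measured - simulated) / measured, which reproduces the first rows of the table. The slide does not say how the per-frequency deltas are aggregated into the accuracy figure, so the mean absolute delta below is only one possible summary.

```python
# (frequency in MHz, simulated power in W, measured power in W), copied from the table above
ROWS = [
    (500, 0.037, 0.038),
    (600, 0.053, 0.051),
    (700, 0.073, 0.080),
    (800, 0.097, 0.090),
    (1000, 0.157, 0.159),
    (1100, 0.193, 0.188),
    (1200, 0.233, 0.227),
    (1300, 0.277, 0.269),
]


def delta_percent(simulated, measured):
    """Signed difference relative to the measured value, in percent (assumed formula)."""
    return (measured - simulated) / measured * 100.0


def summarize(rows):
    """Return the per-row deltas and their mean absolute value."""
    deltas = [delta_percent(sim, meas) for _, sim, meas in rows]
    mean_abs = sum(abs(d) for d in deltas) / len(deltas)
    return deltas, mean_abs


deltas, mean_abs = summarize(ROWS)
for (freq, _, _), d in zip(ROWS, deltas):
    print(f"{freq} MHz: {d:+.2f}%")
print(f"mean absolute delta: {mean_abs:.2f}%")
```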

Comparing different Cores- Dhrystone
Processor | MoP Hit Ratio | MoP Mean Latency | I1 Hit Ratio | I1 Mean Latency | D1 Hit Ratio | D1 Mean Latency | L2 Hit Ratio | L2 Mean Latency | DSU Hit Ratio | DSU Mean Latency
ARM Cortex A53 | - | - | 99.97 | 1.93E-09 | 99.98 | 2.02E-09 | 18.75 | 9.33E-08 | - | -
ARM Cortex A77 | 99.90 | 1.75E-09 | 67.22 | 6.25E-08 | 99.96 | 7.32E-10 | 14.19 | 1.82E-07 | 6.96 | 2.05E-09
RISC-V u74 | - | - | 99.98 | 4.15E-09 | 99.98 | 1.86E-09 | 39.58 | 5.25E-08 | - | -

Processor | Instructions | Latency | Max MIPS
ARM Cortex A53 | ~5,666,000 | 0.0055846 | ~1039
ARM Cortex A77 | ~4,478,000 | 0.0011795 | ~3960
RISC-V u74 | ~6,058,000 | 0.007726 | ~797

VisualSim drives Efficiency & Productivity
Model Creation (6)
Implementation (18)
Using Current Design Methodology
Project Schedule
Implementation (12)
Using VisualSim Design Methodology
Time savings
based on 24
month project
is 20-40%
Note: All times in months
Communication and Refinement (4)
Analysis (2.5)
Model Creation (0.5)
Analysis (1.5)
Communication and Refinement (6)
Advantageous over a generic modeling environment due to shorter duration and greater applicability

VisualSim System Model using UCIe in ADAS SoC

Vary Compute, Interconnect and Traffic
Package_Type = Advanced
Max_Link_Speed_GTps = 32
Number of Modules = 4
Tx_Buffer_Size = 8192 (no packets dropped)
Protocol = PCIe_Gen6
Flit_Size = 256 Bytes
Num_of_Flits_per_Flow_Control_Check = 8
Run Simulation with Different Configurations and Topology
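
A configuration sweep of this kind can be sketched in Python as follows. The base values are taken from the slide, with keys lightly renamed to be valid identifiers. The alternative values, the `sweep` helper and the dictionary layout are hypothetical, and how each configuration is handed to a simulation run is tool-specific and not shown.

```python
from itertools import product

# Base values from the slide; key names adapted for use as dictionary keys.
BASE = {
    "Package_Type": "Advanced",
    "Max_Link_Speed_GTps": 32,
    "Number_of_Modules": 4,
    "Tx_Buffer_Size": 8192,
    "Protocol": "PCIe_Gen6",
    "Flit_Size_Bytes": 256,
    "Num_of_Flits_per_Flow_Control_Check": 8,
}

# Hypothetical alternatives to explore; only the BASE values appear on the slide.
VARIATIONS = {
    "Protocol": ["PCIe_Gen5", "PCIe_Gen6"],
    "Flit_Size_Bytes": [128, 256],
    "Number_of_Modules": [2, 4],
}


def sweep(base, variations):
    """Yield one complete configuration per combination of the varied parameters."""
    names = list(variations)
    for values in product(*(variations[name] for name in names)):
        config = dict(base)
        config.update(zip(names, values))
        yield config


configs = list(sweep(BASE, VARIATIONS))
print(len(configs), "configurations to simulate")
```

Keeping the base configuration separate from the varied parameters makes it easy to see which results differ only in protocol, flit size or module count.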

Add the Power and Thermal
1. Power Generation
• 4 Types of Power Generators in VisualSim
• Constant, variable, motor, solar charge
• Charge sent to battery
2. Power Storage
• Different charging schemes
• Impact of surge and shocks
• Battery Lifecycle
• Battery Consumption
• Statistics
3. Power Consumption
• State based power consumption of electronics (controller, SOC) and Mechanical (brakes, wheels)
• Average, instant and Cumulative
• Power per device and application
4. Power Management
• Change in power state controlled by time, utilization, temperature and expected activity
5. Thermal Management
• Heat and temperature
• Impact of cooling strategy
• Add impact of power spikes
6. Verification and Debugging
• Optimize and test the power management algorithms
• Sizing of power generators and battery
• Optimize the schedule, supplynet and voltage
• Estimate power consumed by the software application
7. Downstream Integration
• Generate UPF file with power domains and associated voltage levels
• Generate SystemVerilog power testbench
• Generate powerState change VCD dump

Behavior Task Graph
Power Table
Power management Unit
SystemVerilog Output for Power System Test
VCD Waveform for Verification
create_power_domain PD_Top -include_scope
create_power_domain -name PD_1_2.0 -elements {"CLKMUX"}
create_power_domain -name PD_1_1.0 -elements {"PLL","G2","G3"}
create_power_domain -name PD_1_3.0 -elements {"PROC"}
create_supply_port -port VDD_1.0 -direction in -domain PD_Top
create_supply_port -port VSS_0.0 -direction in -domain PD_Top
create_supply_net VDD_1.0 -domain PD_Top
create_supply_net VSS_0.0 -domain PD_Top
connect_supply_net VDD_1.0 -ports VDD_1.0
connect_supply_net VSS_0.0 -ports VSS_0.0
add_power_state PD_1_2.0 -state Active {-supply_expr (VDD_2.0 == {ON, 2.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_2.0 -state OFF {-supply_expr (VDD_2.0 == {OFF, 0.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_1.0 -state OFF {-supply_expr (VDD_1.0 == {OFF, 0.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_3.0 -state OFF {-supply_expr (VDD_3.0 == {OFF, 0.0}) && (VSS_0.0 == {ON, 0.0})}
Power Modeling Integration
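
To make the idea of generating such a listing from a power table concrete, here is a hedged Python sketch. It only formats text in the style of the listing above. The function names and the input dictionary are hypothetical, the output has not been validated against the IEEE 1801 (UPF) standard or any tool, and it does not represent how VisualSim produces its files.

```python
def power_domain_lines(domains):
    """Format one create_power_domain line per domain.

    domains: dict of domain name -> list of element names (hypothetical input layout).
    """
    lines = []
    for name, elements in domains.items():
        quoted = ",".join(f'"{element}"' for element in elements)
        lines.append(f"create_power_domain -name {name} -elements {{{quoted}}}")
    return lines


def off_state_line(domain, supply, ground="VSS_0.0"):
    """Format an OFF power state whose supply is off while ground stays on."""
    return (
        f"add_power_state {domain} -state OFF "
        f"{{-supply_expr ({supply} == {{OFF, 0.0}}) && ({ground} == {{ON, 0.0}})}}"
    )


# Domain names and elements taken from the listing above.
DOMAINS = {
    "PD_1_2.0": ["CLKMUX"],
    "PD_1_1.0": ["PLL", "G2", "G3"],
    "PD_1_3.0": ["PROC"],
}

for line in power_domain_lines(DOMAINS):
    print(line)
print(off_state_line("PD_1_1.0", "VDD_1.0"))
```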

System Verification
• Validate product not just HW/SW
• Application relevant test vectors
• Generate test cases and run against RTL
• Compare simulation output against RTL
• Match architecture timing within range
• Verify functional correctness
• Task sequencing @ DSP/uP
• Resource contention
Eliminate product failure by maximizing relevant verification
[Figure labels: Golden Reference; Comparator; Match Tag; Architecture model of IP; Verilog/C/Hardware]

Reference Data
Example: Infotainment

Architecting Hardware-Software for Infotainment System
Mirabilis Design Confidential
[Block diagram labels: Video Camera, Packet, SRAM, DRAM, Display, IO, CPU, GPU, Display Ctrl, AMBA AXI Bus, PCIe]
• System Overview
• Camera: 30 fps, VGA equivalent
• CPU: Multi-core ARM Cortex-A53, 1.2 GHz
• GPU: 64 cores (8 warps × 8 PEs), 32 threads, 1 GHz
• DisplayCtrl: display buffer 293,888 bytes
• SRAM: SDR, 64 MB, 1.0 GHz
• DRAM: DDR3, 64 MB, 2.4 GHz
Explore at the board- and semiconductor-level to size uP/GPU, memory bandwidth and bus/switch configuration

System Model of an Infotainment System
NXP i.MX6 /
nVIDIA Drive PX
Xilinx FPGA
Kintex 8
Discrete
DMA
ARM A53
GPU
Display Ctrl
SRAM3
DRAM3
Video IN
Parameters
Video OUT

Conducting Architecture Trade-off
• By changing the amount of video input data (packet number), observe the SRAM -> DRAM transfer performance and examine the upper limit of video input that the system can tolerate.
[Plot labels: 210 Packet/Sec, 12 ms, 21 Packet/Sec, 41.4 us, 300 Packet/Sec]
• 250 Packet/Sec is the system limit
• With 300 Packet/Sec, simulation cannot be
executed due to FIFO buffer overflow.
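
The overflow behaviour can be approximated with a simple fluid model, sketched below in Python. The 250 Packet/Sec drain rate is the system limit stated on the slide. The FIFO depth of 100 packets is a made-up value, and the model ignores burstiness, so it only illustrates why any sustained rate above the limit must eventually overflow a finite buffer.

```python
import math


def time_to_overflow(arrival_rate, drain_rate, fifo_depth):
    """Seconds until a FIFO holding fifo_depth packets overflows, or math.inf if it never does.

    Fluid approximation: the backlog grows at (arrival_rate - drain_rate) packets per second.
    """
    growth = arrival_rate - drain_rate
    if growth <= 0:
        return math.inf
    return fifo_depth / growth


DRAIN_RATE = 250   # packets per second, the system limit from the slide
FIFO_DEPTH = 100   # packets, hypothetical

for rate in (21, 210, 250, 300):
    print(rate, "Packet/Sec ->", time_to_overflow(rate, DRAIN_RATE, FIFO_DEPTH), "s")
```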

Reference Data:
Mapping Applications onto SoC

Mapping Algorithm to Multi-Resources
Standard HW
Library
Component
Basic/Starting Configuration
Grayscale_Conversion - PS [A72 Core 1]
IIR – Logic (PL)
FFT – AI Engine Tile
Edge_Image - Logic (PL)
iFFT – AI Engine Tile
Edge_Image_Enhancement – Logic (PL)
Segmentation – PS [A72 Core 2]
Image
Processing
Algorithm

Experiments with Different Implementations
Run 1 – Base Configuration Mapped to Logic and ARM
• Application latency increases over time.
• Latency increases due to Segmentation.
• Remap the segmentation task to AI Tiles.
Run 2 – Segmentation Mapped to AI Engine
• Application latency stays in a bounded range.
• NoC utilization is high.
• Changed interconnect for Segmentation from NoC to Direct.
Run 3 – Using Direct Path between Logic and AI
• Latency is deterministic.
• Latency requirement (App latency < 80 msec) is met.
• Utilization across NoC is acceptable.

VisualSim Chiplet
Solution
Using the Chiplet Library to Design SoC

ADAS SoC Block Diagram
[Block diagram labels: UCIe, AI Engine Tiles, Warp Scheduler, PE, Local Mem, GPU, Memory chiplet, ADC, DDR5, Processor subsystem, Core, L1, Bus, SLC]
• Optimal mesh size (m×n)?
• Best sample size (16 bytes vs 32 bytes, etc.)?
• Use a single protocol stack or a multi-protocol stack?
• Do we need PCIe Gen6, or can we still use Gen5 to meet application requirements?

Statistics for Multi-Die SoC
• Note the AI Engine latency spikes.
• For multi protocol, half bandwidth for each protocol.
• Older gen protocols are mixed with PCIe 6.
• Lower FLIT size increases latency.

Comparing Different Configurations using UCIe Interface
All Die Adapters using PCIe 6.0
Die Adapters using PCIe 6.0
and Streaming Protocols (AXI)
Lower latency when using PCIe 6.0

Reference Data
Example: Deep Neural Network

Mask Region-CNN (MR-CNN) for object detection and image
segmentation
Overall representation of Mask
R-CNN model
Network Architecture of Mask R-CNN
output
CPU Preprocessing
CPU Postprocessing

Using ChatGPT to translate an AI model (Mask R-CNN) into a VisualSim Task Graph
• Each layer is defined as a separate task in the task graph, and the dependencies between them are modeled.
• A database lists the layers/functions and the parameters associated with them.
• These are used to determine the number of Multiply-Accumulate (MAC) operations for each layer/function.
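
The MAC count per layer can be derived from the layer parameters with the standard formulas, sketched here in Python. The layer rows below are hypothetical examples, not the actual Mask R-CNN parameters used in the model. Bias additions are ignored, and output dimensions are supplied directly rather than computed from stride and padding.

```python
def conv2d_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    """MACs for a 2D convolution: one k_h x k_w x in_ch dot product per output element."""
    return out_h * out_w * out_ch * in_ch * k_h * k_w


def dense_macs(in_features, out_features):
    """MACs for a fully connected layer."""
    return in_features * out_features


# Hypothetical layer database rows (illustrative values only).
LAYERS = [
    {"name": "conv1", "type": "conv", "out_h": 112, "out_w": 112,
     "out_ch": 64, "in_ch": 3, "k_h": 7, "k_w": 7},
    {"name": "fc", "type": "dense", "in_features": 1024, "out_features": 81},
]


def layer_macs(layer):
    """Dispatch on the layer type and return its MAC count."""
    if layer["type"] == "conv":
        return conv2d_macs(layer["out_h"], layer["out_w"], layer["out_ch"],
                           layer["in_ch"], layer["k_h"], layer["k_w"])
    if layer["type"] == "dense":
        return dense_macs(layer["in_features"], layer["out_features"])
    raise ValueError(f"unknown layer type: {layer['type']}")


macs = {layer["name"]: layer_macs(layer) for layer in LAYERS}
print(macs)  # {'conv1': 118013952, 'fc': 82944}
```

Per-layer MAC counts of this kind can then serve as the workload size of each task in the task graph.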
[Figure outputs: class, box, mask]

VisualSim Model of DNN Hardware and Task Graph
Application sequence from
Task Graph is mapped to
HW architecture
• PE – 12x14
• 4 memory hierarchy
• Power computation
per PE, Buses and
memory

Results – Base model (168 AI Cores, 90% data availability at
SRAM)
• Peak Power
consumption at
around 10.8 Watts
• Obtained FPS = 0.414

Results – 8x8 (64) cores, 90% data availability at SRAM
• Peak Power consumption at
around 5.6 Watts as the number
of cores were reduced
• Obtained FPS = 0.29, which is
lower than the base model
results as the number of
resources for doing MAC
operations were lower

Results - 100% data availability at
SRAM, 168 cores
• The number of off-chip memory accesses was reduced. The only accesses made were to load the images and weights into the SRAM.
• Obtained FPS is higher than the base model results as the number of off-chip memory accesses was reduced.
• Peak power consumption (10.4 W) is lower as off-chip memory accesses were reduced.

Results - 60% data availability at SRAM,
168 cores
• The number of off-chip memory accesses was increased.
• Obtained FPS is lower than the base model results as the number of off-chip memory accesses was increased.

Reference Data:
Hardware-Software Partitioning SoC Architecture Design

SoC System Specification
Processor Core: RISC-V or ARM A53 core
Processor Speed: 1200 MHz
L1 Cache
I Cache: 32 KB, 2-way set associative
D Cache: 32 KB, 4-way set associative
L2 Cache
Size: 1 MB
Associativity: 16-way
Ext DRAM
Size: 4 GB
Type: DDR4
Speed: 2400 MHz
HW Accelerator
Speed: 100 MHz
Software
Multimedia task
Stochastic instruction trace
Goals
Peak Power < 1.0 W
Number of Matrices > 19K

VisualSim SoC Model
MPEG Application
IP or RISC-V level
• Evaluate pipeline stages
• Width, Speed
• Number of execution units, Levels of cache
SoC
• Number of RISC-V cores
• Accelerators
• Cache memory hierarchy and coherence
System level
• Development of an IoT device, ECU or an
integrated platform
Behavior
Hardware
Bus Topology

CASE 1: All SW tasks
Observations:
1. Avg power consumption within requirements (< 1.0 W)
2. Performance requirement not achieved (only a max of 9.4K frames)

Sequence diagram
Rotate Frame task is found to be resource intensive

CASE 2: Run Rotate Frame Task on HW Accelerator
Observations:
1. Avg power consumption requirement not met (> 1.3 W)
2. Performance requirement achieved (max of 19.9K frames)

CASE 3: Run Rotate Frame task on
HW Accelerator + Power management
Observations:
1. Avg power consumption requirement met (< 1.0 W)
2. Performance requirement achieved (max of 19.8K frames)

Comparing different
Processor Cores
ARM, RISC-V

Generated Statistics
• Per-execution-unit stats, stall percentages and buffer occupancies are reported.
• Detailed cache, bus and memory stats are generated per simulation.
• Stats include hit ratio, throughput, latency, number of write-backs, evictions, etc.

Use cases
Run Num | Description | M4 (Latency) | M55 (Latency) | U74 (Latency)
1 | Running Dhrystone on core. No cache/bus/memory access | 5.576700039E-4 | 9.47200014E-5 | 1.77875568E-5
2 | Cache/Bus/Memory access | 8.7438000752E-4 | 1.6319750281E-4 | 5.05307708E-5
* The number of loops is different for each core

Automotive applications
Mapping tasks to RISC-V

ECU Performance Analysis under Different Use Cases
Demo environment
1. Brake ECU integrated to a CAN Network
2. Sensors write data to the memory
3. Brake Pedal or Proximity sensor triggers the braking action from the Brake ECU
ECU
Using a RISC-V processor for the Brake ECU
Analysis
1. Latency (Time taken for the signal to reach all the wheels from the Brake ECU)
2. Processor performance (MIPS)
3. Power Consumption (braking activity, ECU usage and network activity)
6/28/2024 Mirabilis Design Inc. 52

System Overview
Gateway: transfers messages between different CAN networks
CAN Bus: the network that connects sensors and ECUs
[Figure labels: Wheel 1, Wheel 2, Wheel 3, Wheel 4, Gateway, Engine, Proximity Sensor, Brake Pedal, Gyro Sensor, Road condition sensor, ECU, three CAN buses]

Automotive Network System
[Figure: legend N = CAN Node, CAN Wire; labels: Wheel1, Wheel2, Wheel3, Wheel4, Brake Pedal, Proximity Sensor, Gyro Sensor, Gateway, ECU, Road condition sensor, Engine, three CAN buses]

VisualSim Model
RISC-V
Model location: VS_ARdemo automotiveBrake_Model_With_ECU_A53 Brake_CAN_model_ECU_new_RISC-V.xml

Configuration of the ECU/Processor
Processor Spec
1. Processor (ECU): RISC-V, 5 pipeline stages
2. Number of cores: 1 - 2
3. Processor Speed: 100 MHz - 1.2 GHz
4. DRAM Type: DDR3 SDRAM (Synchronous DRAM)
5. DRAM Speed Range: 400 - 1066 MHz
6. Cache Speed: 500 MHz
7. Cache Size: 64 Kbytes
8. Memory Controller: DDR3, 750 MHz
9. Bus: CAN
ECU Data input
1. Wheels
2. Engine
3. Proximity Sensor
4. Brake Pedal
5. Gyro Sensor
6. Road Condition Sensor

Designing Brake ECU using Single Core – RISC-V

Results – single core RISC-V
Slight improvement in processor task latency at a few instances
