
I'll be working with a board that has an STM32H743 on it, and I'm having a hard time reasoning about the f32 matrix-vector multiply performance I can expect from the M7 core. As I understand it, the core itself should be able to issue one fused multiply-add per cycle, as long as I can keep it fed with data. Said data consists of 128×128 float32 matrices and a 128-element float32 vector, and I'd like to multiply a bunch of them, often and fast.
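For concreteness, the kernel I have in mind is just the plain row-major loop below (a C sketch; I'm assuming the compiler contracts the multiply-add into a single vfma.f32, which I haven't verified):

```c
#include <stdint.h>

#define N 128

/* y = A * x, with the 128x128 matrix A stored row-major as a flat array:
   one multiply-add per matrix element, i.e. 16384 per product. */
void matvec_f32(const float *A, const float *x, float *y)
{
    for (uint32_t r = 0; r < N; ++r) {
        float acc = 0.0f;
        for (uint32_t c = 0; c < N; ++c) {
            acc += A[r * N + c] * x[c];   /* hopefully one vfma.f32 per iteration */
        }
        y[r] = acc;
    }
}
```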

However, I can't seem to find clear information on what to expect from the integrated RAM or the integrated flash. Would either or both of these be able to sustain one FMA per cycle, i.e. stream 4 bytes of matrix data into the core every cycle? The vector data I suppose I can keep in tightly-coupled memory, which means I can move it into the proper registers without delay, I think? The flash can stream in code instructions without delay, since there is special hardware for that; but what about general data accesses, what throughput do those sustain?
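To put rough numbers on that (assuming the 480 MHz maximum clock of the H743): one fresh float32 of matrix data per FMA per cycle works out to 480 M × 4 B ≈ 1.9 GB/s of sustained read bandwidth; a single 128×128 float32 matrix is 128 × 128 × 4 B = 64 KiB; and one full matrix-vector product is 16384 FMAs, i.e. about 34 µs, if one FMA per cycle is actually achievable.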

And what about the RAM that isn't tightly coupled? It is hard to find granular data on its behavior, nor can I find high-level benchmarks for matrix-vector performance of the system as a whole. I'm pretty sure the ARM core could put out a number of FMAs equal to its clock frequency, but this memory model isn't easy to reason about, given that my programming thus far has always been on x86. The tightly-coupled RAM is big enough that I could easily hold two of my matrices in it, so perhaps I could use DMA copying to page a new one into TCM while the other one is being crunched? Does this sound at all sane?
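Concretely, the ping-pong scheme I'm imagining looks something like the C sketch below. `start_mdma_copy()` and `wait_for_mdma()` are placeholders for whatever the real MDMA driver interface turns out to be (they are not actual HAL calls), and the `.dtcm` section name is illustrative and would have to match my linker script:

```c
#include <stddef.h>

#define N 128

/* Placeholders for the real MDMA driver interface -- not actual HAL calls. */
extern void start_mdma_copy(void *dst, const void *src, size_t bytes);
extern void wait_for_mdma(void);
extern void matvec_f32(const float *A, const float *x, float *y);  /* the kernel from above */

/* Two matrix-sized ping-pong buffers in DTCM; the section name is illustrative
   and has to match an entry in the linker script. */
static float tcm_buf[2][N * N] __attribute__((section(".dtcm")));

/* 'matrices' points at 'count' row-major 128x128 matrices stored back to back
   (e.g. in AXI SRAM or flash). The output y is reused for every product just
   to keep the sketch short. */
void process_all(const float *matrices, size_t count, const float *x, float *y)
{
    unsigned cur = 0;
    start_mdma_copy(tcm_buf[cur], matrices, sizeof tcm_buf[0]);

    for (size_t i = 0; i < count; ++i) {
        wait_for_mdma();                          /* copy into tcm_buf[cur] has finished */

        if (i + 1 < count)                        /* start streaming the next matrix... */
            start_mdma_copy(tcm_buf[cur ^ 1],
                            matrices + (i + 1) * N * N,
                            sizeof tcm_buf[0]);

        matvec_f32(tcm_buf[cur], x, y);           /* ...while this one is being crunched */
        cur ^= 1;
    }
}
```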

(Note: I am well aware there are chips out there with more compute, but this is the one I'll be working with.)

  • If this is for a commercial board where the vector throughput is a core function of the application, wouldn't this have been promoted to a research issue to be resolved very early in the project? I think your question is good! Don't get me wrong. But when I read that you'll "be working with a board that has an STM32H743 on it", which means that decision has already been taken, I wonder about the early design process that didn't resolve this question before someone started moving forward with board design. Seems like the cart before the horse. What's going on? Or is this a personal project? – Commented May 9 at 20:12
  • Well... in actuality this is a research issue very early in the project; I'm just trying to avoid dragging chip selection into this discussion, and to learn more about what I can expect compute-wise from STM32 chips. – Commented May 9 at 20:14
  • I think I've been pretty explicit about the vector operation in question, no? 128-element f32 matrix-vector multiplies. If I can do about 10k of those a second I'm happy. Then there are a million other things that go into my chip selection. I might make another question about that, but I really don't want to pull that into scope here. – Commented May 9 at 20:34
  • It's interesting that I cannot even find figures for simpler questions like DMA RAM-to-RAM copy bandwidth. What I can find for lower-end STM32 models looks kind of scary; on high-power CPUs you expect DMA to be an order of magnitude faster than the CPU could consume the data, but from what I'm seeing of these low-end STM32 DMA controllers you are lucky to get one byte per 10 CPU cycles... which wouldn't be the answer I was hoping for. – Commented May 9 at 20:48
  • The first page of the STM32H743x datasheet says it has a 32-bit, 100 MHz memory bus. DMA performance will depend on what memory you connect to that bus, but you certainly won't copy faster than the bus can run. – Commented May 9 at 23:38

2 Answers


If you think you'll run out of this resource, you're too close to the capacity of the SoC. It sounds to me like you would want a bigger SoC, or at least one that can deal uniformly with external SDRAM, and has a cache and prefetch that will let you keep the arithmetic units always fed with data. Caches are much easier to use than manually moving data around between "fast" and "slow" RAM.
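For what it's worth, on a Cortex-M7 part the caches are enabled with the standard CMSIS-Core calls, and from then on a plain loop over data in AXI SRAM or external memory benefits from cache line fills without any manual copying. A minimal sketch (the device header name depends on the toolchain setup):

```c
#include "stm32h7xx.h"   /* device header; pulls in core_cm7.h -- adjust to your setup */

void enable_caches(void)
{
    SCB_EnableICache();   /* instruction cache */
    SCB_EnableDCache();   /* data cache: matrix reads now arrive as cache line fills */
}
```

The usual caveat applies: once the D-cache is on, any buffer that a DMA writes needs explicit cache maintenance (for example SCB_InvalidateDCache_by_Addr()) or an MPU region configured as non-cacheable.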

  • As a rule of thumb it would surprise me if external RAM had higher throughput than on-die tightly-coupled RAM. A processor like the STM32MP1 seems more geared towards applications like this, but I have not been any more successful in figuring out how many FMAs its memory system would sustain. – Commented May 10 at 5:15
  • I've got to say this is a common refrain: "if you have to ask about the compute capability of an STM32 chip, you don't want to know". I was expecting a lot of answers like this, so I'll take it to heart and assume there is wisdom in it... But I'm also quite stubborn: I really do have very minimal compute requirements, and I am curious to learn what the actual possibilities of this chip are. – Commented May 10 at 5:49

DTCM RAM has zero wait states, but among the DMA controllers only the MDMA can access it.

The AXI SRAM has a 64-bit bus and low wait states, but it runs at half the core frequency, and, as far as I recall, the AXI bus itself adds one wait state.

Flash is quite slow: 4 wait states at 180+ MHz (and it also runs at half the core frequency).

SRAM1 to SRAM3 have roughly the same speed as the AXI SRAM, but only a 32-bit bus.
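As an illustration of how that split typically shows up in code, data is pinned to a particular memory with section attributes that must match the linker script; the section names below are examples only, not fixed conventions:

```c
/* Section names are illustrative; they must correspond to entries in your .ld file. */
float big_coeffs[128][128] __attribute__((section(".axisram")));  /* AXI SRAM at 0x24000000 */
float hot_vector[128]      __attribute__((section(".dtcm")));     /* DTCM at 0x20000000 */
const float lut[128] = { 1.0f };   /* const data normally stays in flash */
```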

  • Could you help me unpack that information a bit? How many floats do you think I could move into the M7 core per one of its 550 MHz cycles? The MDMA sounds like it's designed for prefetching blocks of RAM like this... but I can't find any data on its bandwidth. – Commented May 10 at 5:26
  • In the above comment I was talking about RAM. But flash would have about 1/12th of the bandwidth required to feed one float32 per M7 cycle into the core? I suppose that relegates it to being loaded from at boot only, but that's OK. If I can get one float32 per 4 cycles of the M7 core from general RAM to TCM, I'm OK with that, really. – Commented May 10 at 5:39
  • Assuming "zero wait states" means what I think it means: it can be used like a register, and the superscalar architecture can hide the required move instruction in its pipeline. – Commented May 10 at 5:41
  • In reality the flash might still be quite useful even at less than a tenth of the RAM bandwidth, since there is an access pattern to my matrices: the frequently used ones can be kept in RAM from boot, and the less frequent but more numerous ones can be kept in flash and be rotated/paged into RAM with DMA, which should give me some 2.5x the effective memory to work with. – Commented May 10 at 5:56
  • @Eelco Hoogendoorn: 1. Are you sure about 550 MHz for the STM32H743? As far as I recall, only the H723 runs at 550 MHz; the H747 is 480 MHz or less. Also, 400+ MHz demands some sort of "overvoltage", with a slightly more complex power supply. 2. I'm not sure about one AXI-RAM-to-DTCM transfer per 4 CPU clocks; it's easier to check the real value with a debugger. Also, keep in mind that the BDMA is not alone on this bus. 3. If you only need 10k streamed mul-adds per second, the bandwidth will be enough. Also, you could choose the H755 series with an additional 240 MHz M4 core.
    – Theoristos
    Commented May 11 at 6:34
