I have two STM32L4Q5 controllers communicating over a 500 kbit/s CAN bus.
The test application generates messages with variable length and data. The data checksum is used as the low byte of the message ID, while the high bits flip between high and low values so that both sides have a chance to win arbitration. The application sends the next message whenever a TX buffer is free and counts sent/received messages every second.
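For illustration, a minimal sketch of the ID scheme just described (the specific checksum function and bit layout are my assumptions, not the actual test code):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: a simple sum-of-bytes checksum of the payload
 * becomes the low byte of the 11-bit standard CAN ID, and a toggling
 * high bit alternates arbitration priority between the two nodes. */
static uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

static uint16_t make_msg_id(const uint8_t *data, size_t len, int high)
{
    /* Bit 10 is the most significant bit of a standard 11-bit identifier. */
    return (high ? 0x400u : 0x000u) | checksum8(data, len);
}
```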
The code itself has been in use for years and worked just fine. However, the MCU was recently changed and I had to create the Cube project from scratch, at which point I enabled the auto-retransmit functionality (we used to have it disabled).
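For reference, the setting in question looks roughly like this in the CubeMX-generated init code (a sketch based on the STM32L4+ FDCAN HAL; the handle name and surrounding fields are assumed):

```c
/* Config fragment, not self-contained: the AutoRetransmission flag is
 * the auto-retransmit setting discussed in this post. */
hfdcan1.Init.AutoRetransmission = ENABLE;  /* was DISABLE in the old project */
if (HAL_FDCAN_Init(&hfdcan1) != HAL_OK)
    Error_Handler();
```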
In the first test both nodes could only send 130 to 250 messages per second, with some seconds passing without any messages sent at all. So it looked like serious collisions were occurring.
At this point I disabled auto-retransmit on one node, and immediately it was able to "send" around 5800 messages per second. At first I thought this was simply the number of messages placed into TX buffers, with the majority of them failing anyway. But then I checked the other node, and it was actually receiving around 5500 MPS, losing on average only 300 messages. Even stranger, the rate of successful transmissions from the other node increased threefold too, and there were no more blank periods without any messages being sent.
Finally I replaced the code on the second node as well, and now both were behaving identically, sending on average 5800 messages per second and successfully receiving about 2800 of them.
The last test means the bus bandwidth is sufficient for at least 5600 messages per second. Why, then, does automatic retransmission not work at this rate? I feel like I am missing something obvious in the protocol (which I thought I knew pretty well), but I cannot pinpoint it.
P.S. I found this post with somewhat similar symptoms but, unfortunately, no usable answer. Since the controller used there is completely different, it reinforces my belief that this is something protocol-related, not a hardware or software issue.
P.P.S. The actual application uses the CANopen protocol with fairly high PDO rates. A conscious design decision was made to tolerate some level of message loss, because by the time a message is retransmitted its data may already be obsolete. It is better to send fresh data in a new message than to have the network clogged with stale ones. So, while I am quite satisfied with how it works at the moment, I would still like to know what is going wrong with auto-retransmission, to avoid mistakes in the future.
UPDATE: Mystery resolved!
The butler did it! In this case, myself. It turns out that "random" message ID generation by including the data checksum was not so random after all. CAN arbitration can only resolve conflicts between different IDs: when two nodes transmitted the same ID with different data, both kept driving the bus past the arbitration field, and the first differing data bit was seen as a bit error, corrupting the frame and accumulating error counts on both nodes. Since we had automatic bus-off handling enabled, the nodes eventually went bus-off and recovered, which explains those blank no-message intervals.
With truly random IDs, both nodes transmit and receive on average 2900 MPS with zero lost messages.
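A minimal sketch of the corrected ID generation; the actual RNG used in the test is an assumption, so a tiny self-contained xorshift PRNG stands in here:

```c
#include <stdint.h>

/* Hypothetical fix from the update: the low byte of the ID is drawn from
 * a PRNG instead of the payload checksum, so the two nodes no longer
 * systematically start arbitration with identical IDs but different data. */
static uint16_t make_random_id(int high, uint32_t *state)
{
    /* xorshift32 step; state must be seeded non-zero. */
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return (high ? 0x400u : 0x000u) | (uint16_t)(x & 0xFFu);
}
```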
In the end, it was a flawed test that produced erroneous results.