2
\$\begingroup\$

I have two STM32L4Q5 controllers communicating over a 500 kbit/s CAN bus.

The test application generates messages with variable length and data. The data checksum is used as low byte of message ID, while the high bits are flipping between high and low values to make sure both sides have a chance to win arbitration. The application sends the next message whenever there is a free TX buffer and counts sent/received messages every second.

The code itself was used for years and worked just fine. However, the MCU was recently changed and I had to make the Cube project from scratch, at which point I enabled the auto-retransmit functionality (we used to have it disabled).

In the first test both nodes could only send 130 to 250 messages per second, with some seconds passing without any messages sent at all. So, it looked like serious collisions were going on.

At this point I disabled auto-retransmit on one node and immediately it was able to "send" around 5800 messages per second. At first I thought that this was simply the number of messages placed into TX buffers, with the majority of them failing anyway. But then I checked the other node and it was actually receiving around 5500 MPS, losing on average only 300 messages. Even stranger, the rate of successful transmissions from the other node increased threefold too, and there were no more blank periods without any messages being sent.

Finally I replaced the code on the second node as well and now they were working identically, sending on average 5800 messages and successfully receiving about 2800 of them.

The last test means that the bus bandwidth is sufficient for at least 5600 messages per second. Why, then, does automatic retransmission not work at this rate? I feel like I am missing something obvious in the protocol (which I thought I knew pretty well), but cannot pinpoint it.

P.S. I found this post with somewhat similar symptoms, but no usable answer, unfortunately. As the controller used in there is completely different it reinforces my belief that this is something protocol-related, not a hardware or software issue.

P.P.S. The actual application uses the CANopen protocol with fairly high PDO rates. A conscious design decision was made to allow some level of message loss, because by the time a message is retransmitted its data may be obsolete. It is better to send fresh data in a new message than to have the network clogged by old messages. So, while I am quite satisfied with how it works at the moment, I would still like to know what is going wrong with auto-retransmission to avoid mistakes in the future.

UPDATE: Mystery resolved!

The butler did it! In this case, myself. It turns out that random message ID generation by including the data checksum in the ID was not so random after all. So, when two messages with the same ID but different data collided, the result was message corruption and error accumulation. Since we had automatic bus-off handling enabled, the nodes eventually recovered, which explains those blank no-message intervals.

With true random IDs both nodes transmit and receive on average 2900 MPS with zero lost messages.

In the end, it was a flawed test that produced erroneous results.
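
The collision described in the update can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the original checksum algorithm is not shown, so `checksum8` is a hypothetical 8-bit additive checksum, and `make_id` assumes an 11-bit ID with three caller-chosen high bits. The point is only that a one-byte checksum leaves 256 possible low bytes, so two different payloads can easily map to the same ID.

```python
import random


def checksum8(data: bytes) -> int:
    """Hypothetical 8-bit additive checksum (the real algorithm is not shown)."""
    return sum(data) & 0xFF


def make_id(data: bytes, high_bits: int) -> int:
    """Assumed layout: 11-bit ID, 3 high bits from the caller, low byte = checksum."""
    return ((high_bits & 0x7) << 8) | checksum8(data)


def collision_fraction(trials: int, seed: int = 1) -> float:
    """Fraction of random payload pairs that differ in data but share an ID.

    Both payloads use the same high bits, i.e. the case where neither node
    wins arbitration before the data field starts.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = bytes(rng.randrange(256) for _ in range(rng.randint(1, 8)))
        b = bytes(rng.randrange(256) for _ in range(rng.randint(1, 8)))
        if a != b and make_id(a, 0) == make_id(b, 0):
            hits += 1
    return hits / trials


# Two different payloads, same checksum, therefore the same ID.
p1 = bytes([0x01, 0x02])
p2 = bytes([0x02, 0x01])
print(hex(make_id(p1, 0)), hex(make_id(p2, 0)))

# For uniformly random payloads the fraction is close to 1/256.
print(collision_fraction(50_000))
```

Under these assumptions roughly one in 256 simultaneous transmission attempts with matching high bits would carry the same ID with different data. On a saturated bus where both nodes always have a frame pending, that is no longer a negligible rate, and real payloads that are less random than these would collide more often.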

\$\endgroup\$

2 Answers

2
\$\begingroup\$

With this ID-building logic, is it possible for two messages to get the same arbitration field but a different data field? Or the messages could be absolutely identical, with no node left to acknowledge them.

In this case, without automatic retransmission both nodes would get an error and change the messages. With automatic retransmission the nodes would keep trying until one gets to the error-passive state. But second-long silence periods look strange.
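
As a rough sketch of how quickly repeated failures escalate, the snippet below applies only the basic transmit error counter (TEC) rule from the CAN fault-confinement scheme: the counter goes up by 8 on each transmit error, a node is error-passive once it exceeds 127, and bus-off once it exceeds 255. It assumes the counter starts at 0 and that every attempt fails; it ignores the decrement on successful transmissions and the special cases in the specification, so it is a lower bound on the number of attempts, not a model of the actual test.

```python
ERROR_PASSIVE_ABOVE = 127  # TEC above this value: node is error-passive
BUS_OFF_ABOVE = 255        # TEC above this value: node goes bus-off
TEC_STEP_ON_ERROR = 8      # increment per transmit error


def errors_until(threshold: int, tec: int = 0) -> int:
    """Count consecutive transmit errors needed for the TEC to exceed threshold."""
    count = 0
    while tec <= threshold:
        tec += TEC_STEP_ON_ERROR
        count += 1
    return count


print(errors_until(ERROR_PASSIVE_ABOVE))  # 16
print(errors_until(BUS_OFF_ABOVE))        # 32
```

So a frame that fails on every automatic retry needs only a few dozen attempts to move a node through error-passive towards bus-off, which takes a small fraction of a second at 500 kbit/s.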

With automatic retransmission enabled, is there a reason to manually flip the ID bits? One node would lose arbitration and resend its message once the bus is free. Or do you need a balanced data stream from the two nodes?

\$\endgroup\$
4
  • \$\begingroup\$ Yes, it is possible for two nodes to generate same ID with different data or even with absolutely identical messages. I estimate the chances of that are minuscule, but I might be wrong. The reason to flip high bits is exactly as you've guessed, to have symmetrical data flow for the test. In real application we have approximately even data volume in all directions. \$\endgroup\$
    – Maple
    Commented Nov 17, 2021 at 8:18
  • 1
    \$\begingroup\$ You've given me a compelling idea. If two nodes happen to generate exactly the same message and attempt to send it at exactly the same time, then there would be nobody to ACK. Hmm... I'll try to add one silent node to the test just to do the ACK-ing. \$\endgroup\$
    – Maple
    Commented Nov 17, 2021 at 9:02
  • \$\begingroup\$ Tried with a third silent node and throughput dropped even more. So, this is not an ACK problem. \$\endgroup\$
    – Maple
    Commented Nov 19, 2021 at 0:44
  • \$\begingroup\$ See an update in OP. Your very first guess was spot-on. We did have messages with same arbitration field but different data. \$\endgroup\$
    – Maple
    Commented Nov 19, 2021 at 1:19
1
\$\begingroup\$

The data checksum is used as low byte of message ID, while the high bits are flipping between high and low values to make sure both sides have a chance to win arbitration

Two nodes should not transmit a message with the same ID. They cannot possibly send data about the same subject. Unless it is network management. Strange arbitration effects will occur when you try.

Auto-retransmit is just that: when it keeps sending the same message, the message is either not being acknowledged or has an error. Decent CAN analyzers will show you these corrupted messages so you can see what went on.

With auto-retransmit enabled, you will have to manually revoke a messagebox if transmitting takes too long.

Retransmit is essential for a robust CAN bus, so either use it on Auto or do it in software.

5800 messages per second sounds like a bit too much for CAN bus in general. Are you counting failed transmits in this number? Consider that J1939 says 70% is the maximum.

\$\endgroup\$
6
  • \$\begingroup\$ Regarding retransmission, please see the updated question. I explained in the P.P.S. why we don't bother with it in the actual application. 5800 messages includes failed attempts. The actual number successfully received is 5600 for two nodes. Since the test varies data length, the average message is 76 bits, so 425600 bits per second. I think that is what is to be expected for a bus running at 500 kbit/s. \$\endgroup\$
    – Maple
    Commented Nov 17, 2021 at 8:50
  • \$\begingroup\$ "have to manually revoke a messagebox" is interesting, I did not think of it. So, what happens if I don't? Will the controller go into the bus-off state after a while? Try to reset? We do have the "auto bus-off management" function enabled too. \$\endgroup\$
    – Maple
    Commented Nov 17, 2021 at 8:57
  • \$\begingroup\$ @Maple So you're saturating the CAN bus... Stop doing that and things will improve; bxCAN can't do that much with only two fifo'd messageboxes on HAL. And a messagebox with retransmit will repeat until eternity. \$\endgroup\$
    – Jeroen3
    Commented Nov 17, 2021 at 9:09
  • \$\begingroup\$ Well, the actual application only needs about 1000 MPS. The test was made specifically to push the limits and make sure we will not run into unexpected bottlenecks. Regarding repeating until eternity, I rather hoped you'd say it will go into bus-off. That would explain those seconds-long breaks in communication. \$\endgroup\$
    – Maple
    Commented Nov 17, 2021 at 9:16
  • \$\begingroup\$ @Maple The messagebox will only reach the empty state after a successful transmission or without retransmit. It will only go into bus-off when actual error frames are sent; arb and ack don't count. If you get into bus-off, messageboxes need to be manually revoked. bxCAN is weird, and you just have to deal with it. The long breaks are just your application waiting for free messageboxes that never get freed because you don't abort them. Meaning you never detect that the bus is cleared again (with two nodes). \$\endgroup\$
    – Jeroen3
    Commented Nov 17, 2021 at 10:11
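
The bus-load arithmetic from the comments above can be checked with a few lines. The inputs are the figures quoted in the thread (5600 received messages per second, an average of 76 bits per message, 500 kbit/s); whether the 76-bit average includes stuff bits and interframe space is not stated, so the result is only as good as that figure. The function name `bus_load` is just an illustrative helper.

```python
def bus_load(messages_per_second: float, avg_bits_per_message: float,
             bit_rate: float) -> float:
    """Return the fraction of the bus bit rate consumed by the given traffic."""
    return messages_per_second * avg_bits_per_message / bit_rate


bits_per_second = 5600 * 76
load = bus_load(5600, 76, 500_000)
print(bits_per_second)   # 425600
print(f"{load:.1%}")     # 85.1%
```

That puts the test at roughly 85% of the nominal bit rate, above the 70% figure cited from J1939, while the 1000 MPS needed by the actual application would be about 15% with the same average frame length.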
