\$\begingroup\$

I have multiple devices using the ESP32-WROOM-32E and TWAI (CAN). They would encounter bus errors and then fail to recover via twai_initiate_recovery(). I solved the recovery issue by enabling all of the "CONFIG_TWAI_ERRATA_FIX*" options, but I still have the issue that causes bus off in the first place.

The issue seems to be rare and is most likely related to electrical noise. One device goes bus off about every 3-4 months, another went bus off after about 2 weeks, and a third generated about 44k errors and then went bus off just last week. These devices operate near generators and electrical equipment. It is a noisy environment; however, the cables are all shielded, the power supplies are isolated, and the transceivers (ISO1042) are rated for high common-mode voltage.

This last failure showed about 44k bus errors out of 150M frames, with no TX errors and no RX errors. Almost all of the errors happened when the TWAI controller went to the bus-off state. After a 3 second delay, recovery was initiated and communication continued as if nothing had happened. I had set the delay before starting recovery to 3 seconds because that is what the ESP-IDF example uses. During the bus-off period no control was available, which is not acceptable for the application this device is used in.

I believe the issue may be down to the default timings for 250 kbps TWAI (.brp = 16, .tseg_1 = 15, .tseg_2 = 4, .sjw = 3, .triple_sampling = false). By my calculations, this puts the sample point at 80%, and the SJW allows re-synchronization by up to 3 Tq. J1939 recommends a sample point of 87.5% and an SJW of 1.
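The arithmetic behind those figures can be checked with a few lines of C. This is only a sketch: `can_timing_t` is a plain illustrative struct that mirrors the quoted fields (it is not the ESP-IDF type), and the 80 MHz source clock is an assumption inferred from 1 Tq = 200 ns at brp = 16.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative struct mirroring the quoted fields; NOT the ESP-IDF type. */
typedef struct {
    uint32_t brp;    /* baud rate prescaler */
    uint32_t tseg_1; /* quanta between the sync quantum and the sample point */
    uint32_t tseg_2; /* quanta after the sample point */
    uint32_t sjw;    /* synchronization jump width, in quanta */
} can_timing_t;

/* Time quanta per bit: 1 sync quantum + tseg_1 + tseg_2. */
uint32_t quanta_per_bit(const can_timing_t *t)
{
    return 1u + t->tseg_1 + t->tseg_2;
}

/* Length of one time quantum in nanoseconds for a given source clock. */
uint32_t tq_ns(const can_timing_t *t, uint32_t clk_hz)
{
    assert(clk_hz > 0);
    return (uint32_t)(((uint64_t)t->brp * 1000000000ull) / clk_hz);
}

/* Resulting bit rate in bits per second. */
uint32_t bit_rate_bps(const can_timing_t *t, uint32_t clk_hz)
{
    assert(t->brp > 0);
    return clk_hz / (t->brp * quanta_per_bit(t));
}

/* Sample point in tenths of a percent (800 means 80.0 %). */
uint32_t sample_point_permille(const can_timing_t *t)
{
    return (1000u * (1u + t->tseg_1)) / quanta_per_bit(t);
}
```

With the default values and an assumed 80 MHz clock, this gives 200 ns per quantum, 20 quanta per bit, 250000 bit/s and a sample point of 800 (80.0%). As a hypothetical alternative that has not been tried on hardware, brp = 20, tseg_1 = 13, tseg_2 = 2, sjw = 1 gives 16 quanta per bit, still 250000 bit/s, and a sample point of 875 (87.5%).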

Questions:

  1. Is it possible that the default timings could allow the controller to get out of sync and create bus errors until it went bus off?
  2. Is the 80% sample time to allow for propagation delay on the bus and in the transceiver so that it's closer to 87.5% once taken into account? (ISO1042 loop time = 152ns, 1 Tq = 200ns)
  3. Can I shorten the bus recovery time to limit how long the device is in bus off? (try instant recovery, if that fails, try again in 100ms, double the time until recovered with a max of say 5 seconds)
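The retry schedule proposed in question 3 can be sketched as a small pure function. The name `recovery_delay_ms` and its parameters are hypothetical; on the device, each returned delay would be waited out before calling twai_initiate_recovery() again.

```c
#include <assert.h>
#include <stdint.h>

/* Delay in milliseconds before recovery attempt number `attempt`.
 * Attempt 0 is immediate, attempt 1 waits first_retry_ms, and every
 * later attempt doubles the previous delay, capped at max_ms. */
uint32_t recovery_delay_ms(uint32_t attempt, uint32_t first_retry_ms,
                           uint32_t max_ms)
{
    assert(first_retry_ms > 0 && first_retry_ms <= max_ms);
    if (attempt == 0) {
        return 0;
    }
    uint32_t delay = first_retry_ms;
    for (uint32_t i = 1; i < attempt; i++) {
        /* Stop doubling once the next step would exceed the cap;
         * this also keeps the multiplication from overflowing. */
        if (delay > max_ms / 2) {
            return max_ms;
        }
        delay *= 2;
    }
    return delay;
}
```

With first_retry_ms = 100 and max_ms = 5000 the delays are 0, 100, 200, 400, 800, 1600, 3200 and then 5000 ms for every further attempt. Note that the CAN bus-off recovery sequence itself requires 128 occurrences of 11 consecutive recessive bits, so at 250 kbps even an immediate attempt cannot complete in less than about 5.6 ms.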

Thanks for taking the time to read, let me know if you have any questions.

Edit: Adding schematics for more clarity

Note: Layout is a 4 layer pcb with 2 internal ground planes as returns for top and bottom layer. CAN related signals are run as length matched differential pairs.

Power Supply Section

CAN Bus Interface and Protection

Control Section/ESP32

\$\endgroup\$
  • \$\begingroup\$ Overall it would be helpful if you could share a schematic and more info about the bus, since error frames are very likely caused by hardware problems. EMI is unlikely to affect the bus that much unless you have ground problems or bad cables/connectors. \$\endgroup\$
    – Lundin
    Commented May 9, 2023 at 6:31
  • \$\begingroup\$ @Lundin I added some pictures for reference and a brief about the circuit layout. The bus is less than 50ft long, terminated at both ends via 120 Ohm resistors. Each device has a drop of less than 3ft. Cable contains power and data pairs that are individually shielded. \$\endgroup\$ Commented May 9, 2023 at 12:38
  • \$\begingroup\$ "Each device has a drop of less than 3ft" That's beyond the recommended stub length of ~ 0.3m. Depending on how long the bus is overall, this may or may not be an issue. Though personally I can't say that I ever faced any problems when using all manner of more or less dirty stubs in CAN. \$\endgroup\$
    – Lundin
    Commented May 9, 2023 at 12:42
  • \$\begingroup\$ How do you terminate the bus? \$\endgroup\$
    – Lundin
    Commented May 9, 2023 at 12:43
  • \$\begingroup\$ Also, in industrial applications it's common practice to place a common mode filter on the CANH + CANL lines. \$\endgroup\$
    – Lundin
    Commented May 9, 2023 at 12:48

1 Answer

\$\begingroup\$

Is it possible that the default timings could allow the controller to get out of sync and create bus errors until it went bus off?

I would say that's not very likely if you are only communicating with devices that have a similar setup and clocks. However, your system clock might be too inaccurate, which would be a big issue and could cause problems like those you describe. Don't pick one with >1% inaccuracy; an external quartz is mandatory for 250 kbps.
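As a rough cross-check of the clock accuracy requirement, CAN bit-timing literature commonly gives a resynchronization limit on oscillator tolerance of df <= SJW / (20 * NBT), where NBT is the number of time quanta per bit. The sketch below evaluates only that one condition; the function name is hypothetical, and the second common condition (which depends on the phase segments) is left out because the timing values quoted in the question do not separate the propagation segment from phase segment 1.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on oscillator tolerance, in parts per million, from the
 * resynchronization condition df <= SJW / (20 * NBT).
 * Only this one condition is evaluated here. */
uint32_t max_osc_tolerance_ppm(uint32_t sjw, uint32_t quanta_per_bit)
{
    assert(quanta_per_bit > 0);
    return (uint32_t)((1000000ull * sjw) / (20ull * quanta_per_bit));
}
```

For the timings in the question (SJW = 3, 20 quanta per bit) this gives 7500 ppm, or 0.75%; with SJW = 1 and the same 20 quanta it gives 2500 ppm, or 0.25%. Both are in the same range as the 1% rule of thumb above and far looser than what a crystal-based clock typically delivers.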

Is the 80% sample time to allow for propagation delay on the bus and in the transceiver so that it's closer to 87.5% once taken into account?

The first segment serves the purpose of acting as propagation delay. You have a sync bit, then a propagation delay, then the sample point happens between two equally large segments often called phase segment 1 and 2. See this.

I don't know why you picked 80%; like you say, the industry standards recommend 87.5%. SJW=3 sounds unnecessary too.

Can I shorten the bus recovery time to limit how long the device is in bus off? (try instant recovery, if that fails, try again in 100ms, double the time until recovered with a max of say 5 seconds)

I don't know your specific controller, so I can't say. But generally, if you hit bus off, the practice is to reset the MCU/CAN controller. Please note that error frames and even bus off are expected during power-up, since all nodes won't start at the same time.

\$\endgroup\$
  • \$\begingroup\$ Thanks for the response. The ESP32 module has a built-in 80MHz oscillator with +-10ppm tolerance, which I believe should be more than enough to keep in sync. ESP-IDF (the development framework for ESP32) has its predefined timings set to 80%; I assumed it was 87.5% until I looked into the source behind their API. I was curious whether there was a reason they specifically chose 80%. In practice, calling twai_initiate_recovery() successfully recovers the bus after bus off without a reset. I have the delay in there because it was in their example, but I believe it is unnecessarily long. \$\endgroup\$ Commented May 9, 2023 at 12:28
