\$\begingroup\$

I've seen an issue on a 14-node, 250 kbit/s CAN bus (with a mixture of 11-bit and 29-bit nodes) where CAN frames are often corrupted by spurious bus activity. The screenshot shows it more clearly.

This generally happens around 10-15 seconds after the system is powered up, and the disturbance can last anywhere between about 100 and 268 bit times; the screenshot shows an event that lasts 205 bit times. The longer disturbances can cause some nodes to enter the "bus off" state.
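For reference, the bit-time figures above convert to wall-clock time as sketched below. This assumes the nominal 250 kbit/s rate with no allowance for oscillator tolerance, and the helper name `bit_times_to_us` is just an illustrative choice.

```python
BIT_RATE = 250_000  # bit/s, nominal rate of this bus (assumed exact)


def bit_times_to_us(bit_times, bit_rate=BIT_RATE):
    """Convert a number of bit times to microseconds at the given bit rate."""
    return bit_times * 1_000_000 / bit_rate


# Shortest, pictured, and longest disturbances mentioned above
for n in (100, 205, 268):
    print(f"{n} bit times = {bit_times_to_us(n):.0f} us")
```

At 250 kbit/s one bit time is 4 µs, so the disturbances last from roughly 0.4 ms up to a little over 1 ms.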

I think what is happening is that a good CAN frame gets as far as the data section when some other node begins to apply dominant bits. These may go undetected initially, as the transmitter's data may contain a number of 0s. At some point, either the frame transmitter detects a dominant bit while it is trying to send a recessive one, or other nodes fail to see a stuff bit (or both), and an error frame is signalled (the section with the largest amplitude). The bus is then left in a dominant state, presumably by just one node, but this node eventually releases the bus and allows it to operate normally again.
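To illustrate the mechanism I have in mind, here is a minimal sketch of the two detection paths. It models the bus as a wired-AND of two bit streams (0 = dominant, 1 = recessive) and deliberately ignores the arbitration and ACK fields, where a recessive bit being overwritten is not an error. The bit patterns and function names are made up for the example and are not taken from the capture.

```python
DOMINANT, RECESSIVE = 0, 1


def bus_levels(tx_bits, interferer_bits):
    """Wired-AND: the bus is dominant if either node drives dominant."""
    return [a & b for a, b in zip(tx_bits, interferer_bits)]


def first_bit_error(tx_bits, bus_bits):
    """Index where the transmitter sends recessive but reads back dominant."""
    for i, (sent, seen) in enumerate(zip(tx_bits, bus_bits)):
        if sent == RECESSIVE and seen == DOMINANT:
            return i
    return None


def first_stuff_error(bus_bits):
    """Index of the bit that completes a run of six identical bits."""
    run, prev = 0, None
    for i, bit in enumerate(bus_bits):
        run = run + 1 if bit == prev else 1
        prev = bit
        if run == 6:
            return i
    return None


# Hypothetical stuffed data: five dominant bits followed by a recessive stuff bit
tx = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
# Hypothetical interferer: idle for two bits, then holds the bus dominant
interferer = [1, 1] + [0] * 8

bus = bus_levels(tx, interferer)
print(first_bit_error(tx, bus))  # transmitter's view
print(first_stuff_error(bus))    # receivers' view
```

In this example the interference starts at index 2 but is hidden by the transmitter's own dominant bits; both the transmitter and the receivers first notice it at index 5, where the stuff bit should have been recessive.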

Initially, I thought it might be a node that is unsynchronised with the rest of the bus and starts a CAN frame when it is not supposed to. However, it seemingly makes no attempt to put a legitimate frame out, even assuming it was running at a very low baud rate, and that would not explain why the number of dominant bits varies.

Has anyone experienced this kind of error before, or can anyone offer possible solutions?

I've not seen any node that was missing before the error and present afterwards, which would have suggested a culprit. I've started to take nodes off the bus one by one to see if the problem goes away, but any other suggestions would be welcome.
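The presence check I have been doing by eye could be automated along these lines. This is only a sketch: it assumes a logged trace is available as `(timestamp_seconds, can_id)` pairs, and the function name, the one-second window, and the sample IDs are all hypothetical.

```python
def presence_change(log, t_start, t_end, window=1.0):
    """Compare the CAN IDs seen just before and just after a disturbance.

    log     -- iterable of (timestamp_seconds, can_id) pairs
    t_start -- time at which the disturbance begins
    t_end   -- time at which the bus recovers
    window  -- how long to look before t_start and after t_end, in seconds
    """
    before = {cid for t, cid in log if t_start - window <= t < t_start}
    after = {cid for t, cid in log if t_end < t <= t_end + window}
    return {
        "appeared": sorted(after - before),
        "vanished": sorted(before - after),
    }


# Made-up trace: 0x200 stops transmitting after the disturbance at t = 12.0 s
trace = [
    (11.2, 0x100), (11.3, 0x200), (11.4, 0x18FF0001),
    (11.7, 0x100), (11.8, 0x200),
    (12.3, 0x100), (12.4, 0x18FF0001), (12.8, 0x100),
]
print(presence_change(trace, t_start=12.0, t_end=12.001))
```

IDs listed under `appeared` would point to a node that joined the bus around the time of the disturbance, and IDs under `vanished` to one that dropped off.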

Thanks in advance for any help.

[Screenshot: capture of the bus disturbance]

\$\endgroup\$
  • \$\begingroup\$ So taking nodes offline didn't help? Or is it cumbersome to do? \$\endgroup\$
    – PMF
    Commented Jan 30 at 8:01
  • \$\begingroup\$ @PMF It is cumbersome to remove some nodes, but I can probably pull some fuses to get much the same effect. I have removed what I thought was the most likely source of the problem, and the problem now seems to last slightly less time, although I'd need to repeat the test a few more times to be confident of that. \$\endgroup\$ Commented Jan 30 at 8:09
  • \$\begingroup\$ The question is where that 4.3V thing is coming from. Apparently something causes the transceivers to go loco and pull the lines to a dominant state. (Some sort of failsafe/latch-up?) My guess is that this happens locally on a board on the Tx/Rx side and not on the bus side. Can you share schematics and more details of the nature of the application? \$\endgroup\$
    – Lundin
    Commented Jan 30 at 10:19
  • \$\begingroup\$ It would also be helpful to compare the Tx/Rx lines with CANH or CANL to see if the same noise is present there or not. \$\endgroup\$
    – Lundin
    Commented Jan 30 at 10:20
  • \$\begingroup\$ I have experienced similar problems on a vehicle bus (heavy machinery) where a PLC was activating valves or motors, resulting in a large, momentary current draw. This in turn made ground potentials dance around (which you ought to expect on heavy machinery), and the reference levels for the CAN parts in the PLC then went outside whatever tolerances their transceiver had, resulting in sporadic error frames and the PLC rebooting (bus off). The problems were caused both by the lack of a dedicated signal ground (there was either no ground or only chassis ground) and by internal PCB design mistakes inside the PLC. \$\endgroup\$
    – Lundin
    Commented Jan 30 at 10:25

1 Answer

\$\begingroup\$

I've now discovered what the problem is. A node on the bus was powering off when it should not have been, and it seems to put this burst of dominant bits onto the bus just before it fully shuts down. I do not know the mechanism because I cannot access the hardware, but I suspect a software or hardware issue within the node itself.

It pays to trust your instinct as to which are likely to be problematic nodes and to remove them, individually, from the CANBUS (physically or by pulling each device's fuse) until the problem is resolved.

\$\endgroup\$
  • \$\begingroup\$ Most CAN transceivers have a timeout that disables their bus driver if the controller holds TXD in a permanent dominant state for too long. The bus will recover. E.g. the MCP2551 does this after about 1.25 ms. \$\endgroup\$
    – Jeroen3
    Commented Jan 31 at 8:01
