Disclaimer
Based on multiple attempts to find the cause of the problem, I have determined that I might be facing a hardware issue rather than a programmatic failure, hence why I am posting this here and not in the Arduino channel. This is perhaps the weirdest situation I have ever faced in my professional career involving a custom Arduino board. Please bear with me. I will be as succinct as possible.
In short
I have designed a complex data acquisition and transmission system based on a barebones Arduino MEGA 2560. My code, based on a custom FSM, is nearly 6000 lines long. Right after flashing the MCU, the system works perfectly fine: it responds to a set of commands and reads from and writes to an onboard SPI flash SD card. It keeps working even after power cycling the system WHILE leaving the UART-USB converters attached to the PC. All commands work fine. HOWEVER, the crash occurs exclusively under the following sequence:
- Power down the system (5V/3.3V),
- Unplug and plug back in the UART-USB converters (USB side)
- Power up the system
- Issue one specific command to read from the SD card line by line and decode data --> the crash occurs
My system consists of the following blocks.
Background
After booting up, the system enters a simple FSM where it waits for a serial command over UART3 (driven by SerialEvent3) to start acquiring data from a group of I2C sensors. When the routine starts, the values are saved to an SPI flash SD card in ASCII format for a predetermined time. USART1 is used as a debugging port to print general events, information and values to determine the health of the system at any time.
Using a specific instruction (let's call it DATA XXX), the system is able to dump the desired file to UART3 by reading the file line by line using a custom algorithm that converts ASCII values to their HEX representation. This algorithm works flawlessly when the MCU has been recently flashed. Cycling the power doesn't seem to affect the response, and the board responds well to the DATA XXX command. Issuing this command several consecutive times works fine, which makes a RAM issue unlikely.
It is important to mention that the DATA XXX command uses the SPI flash card extensively, opening and closing it several times during data read and conversion, as the data is read and processed on the fly. This is due to the lack of a readLine method in the SD library, as discussed here.
System description, test conditions
- The system is powered by a custom-designed linear power supply rated at 3.3V, 5V with a maximum current delivery of 1.5A per channel. The power is clean as tested with an oscilloscope. The ATMEGA2560 is powered by 5V.
- The MCU uses a 16MHz crystal oscillator (automotive grade) with 22pF load capacitors.
- All ATMEGA2560 chips are sourced from Mouser (same batch)
- The firmware is flashed via SPI (ICSP) using an ATMEL-ICE board and AVRDUDESS
- All power pins are properly decoupled using 10uF Tant + 100nF ceramic capacitors.
- The PCB is a 4-layer design with proper grounding and stitching. The MCU also has EMI shielding with proper guarding.
- The SPI Flash card has a level shifter on the MOSI, SCK and CS lines. It is fed with 3.3V supply and properly decoupled.
- All I2C sensors respond fine and they are NOT involved during the DATA XXX command.
- USART3 uses a level shifter consisting of one BSS138 high-speed MOSFET per line (Tx/Rx), as it communicates with an external 3.3V logic system.
- Both UARTs are connected to a PC using UART-USB converters. One end is connected to the UART pins using DuPont cables and the USB connector is plugged in using a USB-A extension. I have tested different commercial boards based on different chips such as the CH340, CH341, FTDI, and the MCP2200, all purchased from different suppliers. The behaviour is the same.
- We have tested the same code on 3 identical systems (exact same hardware).
- We use YAT as the serial terminal for all tests on different computers.
- We have tried different fuse configurations for the ATMEGA2560
- We have tried with and without a bootloader (based on previous reports). We have loaded Optiboot (from MEGACORE) trying different configurations without success. The stock stk500boot_v2_mega2560.hex was reflashed too, due to old reports of issues with UARTs in previous versions.
- We have tested the system by plugging it directly into an external controller via UART3, which issues the DATA XXX command after 10 seconds; the board crashes in that case too.
- EDIT: as @Justme pointed out, it is not good practice to connect pins of powered chips to unpowered chips. However, we have also used the system's own power supply to power the UART-USB converters (by removing the USB fuse that feeds a CH343), obtaining the same result. More tests are needed, though.
On an important note, we have observed that when connecting the UART-USB converter to UART1 (the port without a level shifter), the MCU is leaking voltage all the way to the 5V rail and thus the RST line. The voltage is approximately 3V. When unplugging the converter, this voltage disappears. Similarly, the converter on UART3 leaks a lower voltage of 1.6V. This "phantom power" is suspected of creating issues in the internal buffers; however, this is not absolutely clear, since other instructions respond fine.
For instance, when issuing the DATA XXX command after flashing the firmware, the system receives the instruction and processes the file as intended:
Here, the values are shown in their HEX representation, including the transmitted command:
The last two screenshots show the same information in different representations. File number 6 is read from the SD card line by line and converted to HEX. This behaviour remains even after sending the command several times.
Now, after powering the system off, unplugging both UART-USB converters, plugging either of them back into the USB port, and powering the system on again, the system boots correctly. However, the DATA XXX command crashes the system, as observed here:
Please note that the crash point is random. It crashes at different stages of the processing.
However, the board still responds (after a power cycle) when a nonexistent file is requested. For instance, DATA 070 returns the expected ERROR message:
The DATA XXX command searches for the requested file and, if it is not found, the ERROR message is displayed, indicating that the SD card is in fact readable.
Finally, even after removing all power, unplugging all cables and leaving the board to "discharge" for some time, the system NEVER responds to the same DATA XXX command again. It crashes EVERY time. The UART port seemingly gets corrupted after power is "injected" through the UART pins before Vcc is applied.
Please share your ideas, comments and questions.
- Why is my system crashing on a specific command when plugging the board in the specific sequence described before?
- Is my batch of MCUs defective?
- Do you think my code could still be the cause of the problem?
I am surely missing some relevant information. Trust me that I have tried everything I can think of, and I have searched extensively for issues pertaining to similar system crashes; however, I have not found one similar to my case. I have found only one somewhat similar case, where the Windows driver was found to be the probable cause.