
Disclaimer: Based on multiple attempts to find the cause of the problem, I have determined that I might be facing a hardware issue rather than a programmatic failure, which is why I am posting this here and not in the Arduino channel. This is perhaps the weirdest situation I have ever faced in my professional career involving a custom Arduino board. Please bear with me. I will be as succinct as possible.

In short

I have designed a complex data acquisition and transmission system based on a barebones Arduino MEGA 2560. My code, based on a custom FSM, is nearly 6000 lines long. Right after flashing the MCU, the system works perfectly fine: it responds to a set of commands and reads from and writes to an onboard SPI flash SD card. It keeps working even after power cycling the system WHILE leaving the UART-USB converters attached to the PC. All codes and commands work fine. HOWEVER, the crash occurs exclusively under the following conditions:

  1. Power down the system (5V/3.3V),
  2. Unplug and plug back in the UART-USB converters (USB side)
  3. Power up the system
  4. Issue one specific command to read from the SD card line by line and decode data --> the crash occurs

My system consists of the following blocks: (block diagram)

Background

After booting up, the system enters a simple FSM where it waits for a serial command over UART3 (driven by SerialEvent3) to start acquiring data from a group of I2C sensors. When the routine starts, the values are saved to an SPI flash SD card in ASCII format for a predetermined time. USART1 is used as a debugging port to print general events, information and values to determine the health of the system at any time.

Using a specific instruction (let's call it DATA XXX), the system is able to dump the desired file to UART3 by reading the file line by line using a custom algorithm that converts ASCII values to their HEX representation. This algorithm works flawlessly when the MCU has been recently flashed. Cycling the power doesn't seem to affect the response, and the board responds well to the DATA XXX command. Issuing this command several consecutive times works fine, which makes a RAM issue unlikely.

It is important to mention that the DATA XXX command extensively uses the SPI Flash card opening and closing it several times during data read and conversion as the data is read and processed "on-the-fly". This is due to the lack of a readLine method from the SD library, as discussed here.

System description, test conditions

  • The system is powered by a custom-designed linear power supply rated at 3.3V, 5V with a maximum current delivery of 1.5A per channel. The power is clean as tested with an oscilloscope. The ATMEGA2560 is powered by 5V.
  • The MCU uses a 16MHz crystal oscillator (automotive grade) with 22pF load capacitors.
  • All ATMEGA2560 chips are sourced from Mouser (same batch).
  • The firmware is flashed via SPI (ICSP) using an ATMEL-ICE board and AVRDUDESS
  • All power pins are properly decoupled using 10uF Tant + 100nF ceramic capacitors.
  • The PCB is a 4-layer design with proper grounding and stitching. The MCU also has EMI shielding with proper guarding.
  • The SPI Flash card has a level shifter on the MOSI, SCK and CS lines. It is fed with 3.3V supply and properly decoupled.
  • All I2C sensors respond fine and they are NOT involved during the DATA XXX command.
  • USART3 uses a level-shifter consisting of a BSS138 HI-Speed MOSFET per line Tx/Rx as it communicates with an external 3.3V logic system.
  • Both UARTs are connected using UART-USB converters to a PC. One end is connected using DuPont cables to the UART pins and the USB connector is plugged in using a USB-A extension. I have tested different commercial boards based on different chips such as the CH340, CH341, FTDI, and the MCP2200 all purchased from different suppliers. The behaviour is the same.
  • We have tested the same code on 3 identical systems (exact same hardware).
  • We use YAT as the serial terminal for all tests on different computers.
  • We have tried different fuse configurations for the ATMEGA2560.
  • We have tried with and without bootloader (based on previous reports). We have loaded Optiboot (from MEGACORE) trying different configurations without success. The stock stk500boot_v2_mega2560.hex was reflashed too due to old reports of issues with UARTs in previous versions.
  • We have tested the system by plugging it directly into an external controller via UART3, which issues the DATA XXX command after 10 seconds. The board crashes in this setup too.
  • EDIT: as @Justme pointed out, it is not a good practice to plug pins from powered chips to unpowered chips, however, we have also used the same power supply from the system to power on the UART-USB converters (by removing the USB fuse that feeds a CH343) obtaining the same result. More tests are needed, though.

On an important note, we have observed that when connecting the UART-USB converter to UART1 (port without level-shifter) the MCU is leaking voltage all the way to the 5V rail and thus the RST line. The voltage is approximately 3V. When unplugging the converter, this voltage disappears. Similarly, the converter on UART3 leaks a lower voltage of 1.6V. This "phantom power" is suspected of creating issues in the internal buffers; however, this is not absolutely clear since other instructions respond fine.

For instance, when issuing the DATA XXX command after flashing the firmware, the system receives the instruction and processes the file as intended:

Successful processing

Here, the values are shown in their HEX representation, including the transmitted command: (screenshot)

The last two screenshots show the same information in different representations. File number 6 is read from the SD card line by line and converted to HEX. This behaviour persists even after sending the command several times.

Now, after powering the system off, unplugging both UART-USB converters, plugging either of them back into the USB port, and powering the system on again, the system loads correctly. However, the DATA XXX command crashes the system, as observed here:

system crashes

Please note that the crash point is random. It crashes at different stages of the processing.

However, the board still responds (after a power cycle) when a nonexistent file is requested, for instance DATA 070, by returning the expected ERROR message:

Expected ERROR response

The DATA XXX command searches for the desired file and if not found, the ERROR message is displayed, indicating that the SD Card is effectively readable.

Finally, even after removing all power, unplugging all cables and leaving the board to "discharge" for some time, the system NEVER responds to the same DATA XXX command ever again. It crashes EVERY time. The UART port is seemingly getting corrupted after "injecting" power through the UART port before getting Vcc.

Please share your ideas, comments and questions.

  • Why is my system crashing on a specific command when plugging the board in the specific sequence described before?
  • Is my batch of MCUs defective?
  • Do you think my code could still be the cause of the problem?

I am surely missing some relevant information. Trust me that I have tried everything I can think of, and I have searched extensively for issues pertaining to similar system crashes. However, I have not found one similar to my case. I have found only one somewhat similar case, where the Windows driver was found to be the probable cause.

  • The block diagram is not detailed enough to solve the problem. Please post the schematics. Is the MCU supply running on 3.3V or 5V? Also you should not be connecting data pins of powered and unpowered chips together, it will cause problems like this.
    – Justme
    Commented May 12 at 0:20
  • @Justme I have improved the block diagram. The MCU is powered by 5V. Unfortunately, I cannot share the schematic as the design was commissioned to me by a customer. I understand your point regarding connecting pins; however, I have also tried powering both chips from the same power supply, with the same results. I will clarify this in the description. Commented May 12 at 0:26
  • @Justme please let me know what missing information regarding the hardware might be relevant to you. Thanks for your input. Commented May 12 at 0:29
  • This isn't really a problem we can solve. It could be anywhere in the stack: in your state machine, support routines, interrupts, buffer or stack overflows, SD card, etc. How does your software handle errors in UART communication? SD? Internals? Have you subdivided the problem, what is the minimum case that causes it to fail? Have you determined what kind of a failure it is, is your overall system simply unresponsive but still running, is the CPU absolutely crashed, can you run a debug terminal via interrupts on a spare UART, etc.? Commented May 12 at 0:31
  • @DanielMelendrez As I already mentioned, the actual schematics are missing. Anyone who would like to duplicate your setup must read your wall of text that describes the schematics and then draw the schematics to see them. It would be less work for you to post the schematics and update the requested details. But if the problem appears when connecting powered UART chips to an unpowered MCU, then that is the problem, so don't do it. It is wrong to do that no matter what chips are involved. Fix the schematics to prevent half-powering the MCU from UART pins.
    – Justme
    Commented May 12 at 9:27

3 Answers


This answer attempts to record some suggestions about failure modes, debugging and potential modifications. While it is somewhat speculative, it was too long for a comment.

Likely cause of Phantom Power

The ATmega2560 datasheet shows ESD protection diodes on the I/O pins:

(figure from the datasheet: ESD protection diodes on an I/O pin)

Outputs from a powered USB UART connected to an unpowered ATmega2560 will try to power the ATmega2560 via the protection diode between the I/O pin and VCC. This is likely the cause of the Phantom Power in the question title.

How Phantom Power might be causing failure

Finally, even after removing all power, unplugging all cables and leaving the board to "discharge" for some time, the system NEVER responds to the same DATA XXX command ever again. It crashes EVERY time.

I don't use ATmega2560 microcontrollers, but after the failure, is it possible to use the ICSP to verify that the flash still contains the expected program?

I.e. to determine whether the act of plugging in the UART-USB converter before main power, and so powering the ATmega2560 through its I/O pins, causes the flash to become corrupted. A potential failure mode is that powering the ATmega2560 through its I/O pins leads to mis-execution of instructions, which corrupts the program in flash.

Possible protection against Phantom Power

USART3 uses a level-shifter consisting of a BSS138 HI-Speed MOSFET per line Tx/Rx as it communicates with an external 3.3V logic system.

Perhaps using a level-shifter specified with support for partial power-down would avoid the issue. E.g. the TI Voltage Level Translation Guide has some devices described as:

The devices are fully specified for partial-power-down applications using IOFF. The IOFF circuitry disables the outputs, preventing damaging current backflow through the device when it is powered down.

While UART1 currently doesn't use a level-shifter, it could potentially use the same kind of level-shifter with partial-power-down support as USART3, but with both sides operating at the same I/O voltage.


Connecting powered ICs to unpowered ICs will back-feed the unpowered IC through the data pin.

Your MCU has a 16 MHz clock and according to the data sheet, it needs at least 4.5V to run at 16 MHz.

If you are connecting the unpowered MCU to another powered IC, whether it has 5V IO or 3.3V IO through level shifter, and you use pull-up resistors from MCU to the data pins, you have weak leakage paths that power your MCU from the data pins of the other chip.

Therefore the AVR starts to run and might appear to work. If it does something that requires more current, supply may dip too low for it to work. The problem is that at 16 MHz, voltage needs to be at least 4.5V. It will work at much lower voltage if it runs at 8 MHz.

Maybe prevent back-feeding of supplies from data pins to solve the problem. Use a proper level shifter that does not pass current through to unpowered MCU.

Maybe enable the low-voltage (brown-out) detector to keep the AVR in reset until power is really applied, so it won't start running until enough voltage is present.

Maybe add reset circuitry to keep AVR externally in reset until power supply is good.
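Whether brown-out or power-on resets are actually occurring can be checked by reading the reset flags at startup. The sketch below decodes the ATmega2560 `MCUSR` bits (PORF, EXTRF, BORF, WDRF, JTRF); the `ResetCause` struct and the `decodeResetFlags` name are made up for illustration. The brown-out detector itself is enabled through the BODLEVEL fuse bits, not in code.

```cpp
#include <stdint.h>

// Bit positions of the reset flags in the ATmega2560 MCUSR register.
const uint8_t kPowerOnBit  = 0;  // PORF
const uint8_t kExternalBit = 1;  // EXTRF
const uint8_t kBrownOutBit = 2;  // BORF
const uint8_t kWatchdogBit = 3;  // WDRF
const uint8_t kJtagBit     = 4;  // JTRF

// Hypothetical container for the decoded flags. Several can be set at once.
struct ResetCause {
    bool powerOn;
    bool external;
    bool brownOut;
    bool watchdog;
    bool jtag;
};

ResetCause decodeResetFlags(uint8_t mcusr) {
    ResetCause c;
    c.powerOn  = ((mcusr >> kPowerOnBit)  & 1) != 0;
    c.external = ((mcusr >> kExternalBit) & 1) != 0;
    c.brownOut = ((mcusr >> kBrownOutBit) & 1) != 0;
    c.watchdog = ((mcusr >> kWatchdogBit) & 1) != 0;
    c.jtag     = ((mcusr >> kJtagBit)     & 1) != 0;
    return c;
}

// On the target (assumption: no bootloader has cleared the register first):
//   uint8_t flags = MCUSR;   // read once, early in setup()
//   MCUSR = 0;               // clear so the next reset is reported cleanly
//   ResetCause cause = decodeResetFlags(flags);
```

Printing the decoded flags on the debug UART after each power-up sequence would show whether the failing sequence goes through a brown-out reset at all.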

Basically, if the AVR starts to run code with power applied only through the data IO pins, there is not enough voltage on the supply pins to run the AVR itself or any external components. Any code the AVR has executed may then have been executed incorrectly. Any external component the AVR wants to communicate with may also not run, either because it has no power or because it is undervolted. When communicated with, it may also start consuming enough current to drop the supply voltage to a level at which the AVR cannot execute any opcode properly or stays in reset due to undervoltage, and it may not recover even when the main supplies are turned on.

Also, when the communication interface is powering the AVR, any data transmitted will cause the voltage to drop. The AVR receives no power during the zero bits, so depending on the command you send, enough zero bits will make the voltage drop.
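How strongly the payload affects the back-fed supply can be estimated by counting how long the line is high within one frame. This is a minimal sketch, assuming 8N1 framing (one start bit, eight data bits, one stop bit), which the question does not state; `highBitTimes8N1` is a hypothetical helper.

```cpp
#include <stdint.h>

// Number of bit times (out of 10) that a UART TX line spends high
// while sending one byte with 8N1 framing.
// The start bit is always low and the stop bit is always high.
int highBitTimes8N1(uint8_t value) {
    int ones = 0;
    for (int i = 0; i < 8; ++i) {
        ones += (value >> i) & 1;
    }
    return ones + 1;  // data bits that are 1, plus the stop bit
}
```

A 0x00 byte keeps the line high for only 1 of 10 bit times, while 0xFF keeps it high for 9 of 10. An idle line stays high the whole time.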


The problem is in the reset circuit

It should be capable of issuing a reset to the MCU when the main power comes back on.

That reset should lead to a software-commanded reset of all external peripherals (so the flash card).
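For an SD card in SPI mode, a software reset amounts to sending CMD0 (GO_IDLE_STATE) after clocking the card with chip select held high. The sketch below only builds the 6-byte command frame, including its CRC7; `crc7` and `buildSdCommand` are hypothetical names, and actually sending the bytes over SPI is left out. Re-running the SD library's initialization after an MCU reset would be expected to have the same effect.

```cpp
#include <stddef.h>
#include <stdint.h>

// CRC7 as used by SD commands (polynomial x^7 + x^3 + 1).
uint8_t crc7(const uint8_t* data, size_t len) {
    uint8_t crc = 0;
    for (size_t i = 0; i < len; ++i) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = static_cast<uint8_t>(crc << 1);
            if ((d ^ crc) & 0x80) {
                crc ^= 0x09;
            }
            d = static_cast<uint8_t>(d << 1);
        }
    }
    return crc & 0x7F;
}

// Fills frame[0..5] with an SD command:
// start/transmission bits + index, 32-bit argument, CRC7 + end bit.
void buildSdCommand(uint8_t index, uint32_t arg, uint8_t frame[6]) {
    frame[0] = static_cast<uint8_t>(0x40 | (index & 0x3F));
    frame[1] = static_cast<uint8_t>(arg >> 24);
    frame[2] = static_cast<uint8_t>(arg >> 16);
    frame[3] = static_cast<uint8_t>(arg >> 8);
    frame[4] = static_cast<uint8_t>(arg);
    frame[5] = static_cast<uint8_t>((crc7(frame, 5) << 1) | 0x01);
}
```

For CMD0 with a zero argument this yields the bytes 40 00 00 00 00 95.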
