I asked a version of this question on Ask Ubuntu, but I have more specific hardware questions, so I'm asking them here.
It seems my RAM was bad: Memtest86+ found roughly 6000 errors, and I had 10+ freezes and hard reboots in 1 hour. But now that I've simply unplugged both RAM modules and plugged them back in, I can't get a single new error. The laptop is under warranty, so Dell is willing to swap the entire motherboard and both RAM sticks (8 GB each) for free this coming week. I'm inclined to reject the offer, but I'm worried my hardware may still be bad. Now that no errors are coming up, I wonder whether letting them swap the entire motherboard is more risk than I need right now, especially since they will be using refurbished parts, and my experience with refurbished hardware in general (not with Dell specifically) tells me to stay away unless I really have no choice.
What should I do? Was my RAM ever bad? Or was it just a mechanical pin alignment or debris issue that got resolved simply by unplugging the RAM and plugging it back in?
Note my computer is 1 year old. It is a high-end Dell laptop. Recently, I wiped Windows 10 completely and installed Ubuntu 20.04.
Here's the full description I sent to the Dell support team. They never had an engineer look at it, so I'd like to see if someone here knows what may have happened and what the solution is.
[MESSAGE I SENT TO DELL (START)]
I've done some troubleshooting and it is leaving me stumped.
Note that my operating system is Linux Ubuntu 20.04.
Over the last 2 weeks I've been experiencing occasional freezes, but rarely, and usually during boot or shutdown. Sometimes during boot it would freeze and I'd have to hold the power button to try again. I didn't think too much of it, but was confused by it still. 3 days ago I experienced repeated total freezing where no form of soft reboot would work, not even interrupting the Linux kernel with a special Ctrl + Alt + PrScr + REISUB sequence used to soft-reboot Linux computers. I had to do a total hard reboot each and every time. This occurred again and again and again--about 10+ times within a single hour. The system was completely unusable.
I booted into the Dell Diagnostics menu and ran the diagnostics twice. Each time they froze on the Memory testing screen for ~15 minutes, with something like 4 min 20 sec remaining frozen on the screen, so each time I hard rebooted to exit.
I then upgraded the BIOS from 1.9 to 1.15.1 at that time (3 days ago) and the freezing continued. I then enabled legacy boot in the BIOS/UEFI, booted into Memtest86+ v5.01 (https://www.memtest.org/), and ran a memory test. It found thousands of errors within like 6 minutes, for a total of 5632 errors within 2 hrs or so. I then called you.
Here are screenshots of those errors. This screenshot shows errors in Test 10 at address 003e295861c, for instance:
This screenshot shows the memory mapping from address to DIMM slot. As you can see, this address maps to DIMM B, which means that memory is bad:
This screenshot shows errors in Test 7 at address 0017dfdf1b8, for instance, within just 5 minutes 35 sec of beginning the test. This maps to DIMM A, which means that memory is bad. Therefore, both memories are bad:
However, I can no longer reproduce the errors (now that I have swapped the RAM sticks around during further testing). Whether I test the memories individually or together, in DIMM A or in DIMM B, they now pass. Additionally, the Dell Diagnostic test from the boot menu now runs to completion and passes. Does this make any sense!? I went from 10+ freezes per hour and 5632 errors to nothing? I wonder if it's a glitchy motherboard, but all Dell Diagnostics tests which I run from the boot menu also now pass. I need this computer to work and be reliable and not produce memory corruption. What do you think? Thanks!
[MESSAGE I SENT TO DELL (END)]
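As a rough sanity check on the two failing addresses quoted in the message above, here is a minimal sketch that converts each hex address to an offset in MiB. It assumes, purely for illustration, that the two 8 GiB modules are mapped back to back (DIMM A first, then DIMM B) with no channel interleaving or address remapping. Real laptops often do both, so the Memtest86+ mapping screen remains the authoritative source. The function names `addr_to_mib` and `guess_dimm` are hypothetical.

```shell
#!/bin/sh
# Convert a hex physical address (as printed by Memtest86+) to whole MiB.
# $1: address with or without a leading 0x, e.g. 003e295861c
addr_to_mib() {
    hex=${1#0x}                      # drop an optional 0x prefix
    echo $(( 0x$hex / 1048576 ))     # 1 MiB = 1048576 bytes
}

# Guess which module an address falls in.
# ASSUMPTION: 8 GiB (8192 MiB) per module, DIMM A mapped first, no interleaving.
guess_dimm() {
    mib=$(addr_to_mib "$1")
    if [ "$mib" -lt 8192 ]; then
        echo "DIMM A"
    else
        echo "DIMM B"
    fi
}

guess_dimm 003e295861c   # address reported in Test 10
guess_dimm 0017dfdf1b8   # address reported in Test 7
```

Under that simplified layout the first address sits at about 15913 MiB and the second at about 6111 MiB, which agrees with the DIMM B and DIMM A assignments described in the message.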
Also, I have even run a stress test with this command, for 8 hours at 100% CPU usage (all 4 cores/8 hardware threads at 100%), and at ~98% RAM usage the whole time, and it ran fine too:
stress-ng --cpu 8 --vm 8 --vm-bytes 100% --timeout 8h --metrics
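One possible follow-up to a run like this is to scan the kernel log for hardware error reports afterwards. The sketch below is only that: the helper name `count_hw_errors` and the search pattern are my own assumptions, and the pattern is not exhaustive. Note also that ordinary non-ECC laptop memory usually corrupts data silently, so a clean log does not prove the RAM is good.

```shell
#!/bin/sh
# Count kernel log lines that hint at hardware trouble.
# Reads log text on stdin and prints the number of matching lines.
# ASSUMPTION: the keyword list below is illustrative, not complete.
count_hw_errors() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # function's exit status at 0 while still printing the count (0).
    grep -c -i -E 'machine check|mce:|edac|hardware error' || true
}

# Typical use after a stress run (needs stress-ng and systemd's journalctl):
#   stress-ng --cpu 8 --vm 8 --vm-bytes 100% --verify --timeout 10m --metrics
#   journalctl -k -b | count_hw_errors
```

The `--verify` flag asks stress-ng to check the results of its own memory operations, which makes a shorter run more likely to notice corruption than load alone.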
And I have now run Memtest86+ for 30+ hours with both RAM sticks reinserted, and I get zero errors.
How do I go from 5632 errors to zero!?
Note: I also ran Memtest86+ v5.01 only in single-threaded mode, so none of my errors were due to its known bugs with running in multi-threaded mode.
Related:
- Related, but definitely inconclusive and not a duplicate: Can the dust cause DDR RAM errors?
- kinda-sorta related--also not a duplicate: ram errors solved by swapping slots used by ram
Future troubleshooting notes to self (looking back: what I wish I had done):
- I wish I had run the Memtest86+ test 2 or 3 more times, for < 1 hr each time, before unplugging any RAM modules, just to see whether I was consistently getting those thousands of failures.
- Then, assuming the errors were consistent, I wish my first troubleshooting step had been to unplug both RAM modules and plug them back in exactly as they were, then run the test again. If the test had passed immediately after failing several times in a row, that would have been strong evidence that the RAM modules were just improperly seated somehow and that reseating them fixed the problem.
References:
- How I first started learning about the stress-ng Linux stress test command-line tool: https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
Run & Halt indicators would be on); the fix would be to unplug all boards and reseat them.