
I asked this question on AskUbuntu, but have more specific hardware questions I'm asking here now.

It seems my RAM was bad: Memtest86+ found ~6000 errors, and I had 10+ freezes and hard reboots in 1 hour. But now that I've simply unplugged both RAM modules and plugged them back in, I can't get a single new error. The laptop is under warranty, so Dell is willing to swap the entire motherboard and both RAM sticks (8 GB each), for free, this coming week. I'm inclined to reject the offer, but I'm worried my hardware may still be bad. Now that no errors are coming up, I wonder whether letting them swap the entire motherboard is more risk than I need right now, especially since they will be using refurbished parts, and my experience with refurbished hardware in general (not with Dell specifically) tells me to stay away unless I really have no choice.

What should I do? Was my RAM ever bad? Or, was it somehow just a mechanical pin alignment or debris issue that somehow got resolved simply by me unplugging and plugging back in the RAM?

Note my computer is 1 year old. It is a high-end Dell laptop. Recently, I wiped Windows 10 completely and installed Ubuntu 20.04.

Here's my full description I sent to the Dell support team, but they never had an engineer look at my descriptions, so I'd like to see if someone here has knowledge of what may have happened and what the solution is.


[MESSAGE I SENT TO DELL (START)]

I've done some troubleshooting and it is leaving me stumped.

Note that my operating system is Linux Ubuntu 20.04. 

Over the last 2 weeks I've been experiencing occasional freezes, but rarely, and usually during boot or shutdown. Sometimes during boot it would freeze and I'd have to hold the power button to try again. I didn't think too much of it, but was confused by it still. 3 days ago I experienced repeated total freezing where no form of soft reboot would work, not even interrupting the Linux kernel with a special Ctrl + Alt + PrScr + REISUB sequence used to soft-reboot Linux computers. I had to do a total hard reboot each and every time. This occurred again and again and again--about 10+ times within a single hour. The system was completely unusable. 

I booted into the Dell Diagnostics menu and ran the diagnostics twice. Each time they froze on the Memory testing screen for ~15 minutes, with something like 4 min 20 sec remaining frozen on the screen, so each time I hard rebooted to exit.

I then upgraded the BIOS from 1.9 to 1.15.1 at that time (3 days ago) and the freezing continued. I then enabled legacy boot in the BIOS/UEFI, booted into Memtest86+ v5.01 (https://www.memtest.org/), and ran a memory test. It found thousands of errors within like 6 minutes, for a total of 5632 errors within 2 hrs or so. I then called you. 

Here are screenshots of those errors. This screenshot shows errors in Test 10 at address 003e295861c, for instance: 

enter image description here

This screenshot shows the memory mapping from address to DIMM slot. As you can see, this address maps to DIMM B, which means that memory is bad:

enter image description here

This screenshot shows errors in Test 7 at address 0017dfdf1b8, for instance, within just 5 minutes 35 sec of beginning the test. This maps to DIMM A, which means that memory is bad. Therefore, both memories are bad:

enter image description here

However, I can no longer reproduce the errors (now that I have swapped the RAM sticks around during further testing). Whether I test the memories individually or together, in DIMM A or in DIMM B, they now pass. Additionally, the Dell Diagnostic test from the boot menu now runs to completion and passes. Does this make any sense!? I went from 10+ freezes per hour and 5632 errors to nothing? I wonder if it's a glitchy motherboard, but all Dell Diagnostics tests which I run from the boot menu also now pass. I need this computer to work and be reliable and not produce memory corruption. What do you think?   Thanks!

[MESSAGE I SENT TO DELL (END)]


Also, I have even run a stress test with this command, for 8 hours at 100% CPU usage (all 4 cores/8 hardware threads at 100%), and at ~98% RAM usage the whole time, and it ran fine too:

stress-ng --cpu 8 --vm 8 --vm-bytes 100% --timeout 8h --metrics

And I have now run Memtest86+ for 30+ hours with both RAM sticks reinserted, and I get zero errors.

How do I go from 5632 errors to zero!?

Note: I also ran Memtest86+ v5.01 only in single-threaded mode, so none of my errors were due to its known bugs with running in multi-threaded mode.
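
One more check that could accompany the stress-ng run above is scanning the kernel log for hardware error reports afterwards. The following is only a minimal sketch: the function name count_hw_errors and the grep patterns are assumptions chosen for illustration, not an exhaustive list. Also note that laptops with non-ECC RAM generally do not report bit flips to the kernel at all, so a count of zero does not prove the memory is healthy.

```shell
# Count log lines (read from stdin) that look like hardware/memory
# error reports. The patterns below are an assumption, not a complete list.
count_hw_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # nothing matches, so "|| true" keeps the function's exit status clean.
    grep -c -i -E 'hardware error|machine check|edac' || true
}

# Hypothetical usage after a stress run (reads this boot's kernel messages):
#   journalctl -k -b | count_hw_errors
```

A nonzero count would be a reason to read the matching lines directly; zero only means the kernel logged nothing that matches these patterns.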

Related:

  1. Related, but definitely inconclusive and not a duplicate: Can the dust cause DDR RAM errors?
  2. kinda-sorta related--also not a duplicate: ram errors solved by swapping slots used by ram

Future troubleshooting notes to self (Looking back: what I wish I would have done):

  1. I wish I would have run the Memtest86+ test 2 or 3 more times for < 1 hr each time before unplugging any RAM modules, just to see if I was consistently getting those thousands of failures.
  2. Then, assuming the errors were consistent, I wish the first thing I would have done to troubleshoot them would have been to just unplug both RAM modules and then plug them exactly back in as they were! Then, run the test again, and if the test passes immediately, after having failed several times in a row just before, I would know with certainty the RAM modules were just improperly seated somehow, and unplugging them and plugging them back in fixed the problem!
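
The reasoning in the two notes above can be sketched as a small decision helper. This is an illustration under stated assumptions, not a definitive procedure: the function name interpret_reseat, its arguments, and the verdict strings are all hypothetical, and each "run" is assumed to be one complete Memtest86+ session judged as pass or fail.

```shell
# Arguments (all hypothetical, for illustration only):
#   $1 = number of test runs done BEFORE reseating the RAM
#   $2 = how many of those runs reported errors
#   $3 = number of runs reporting errors AFTER reseating
interpret_reseat() {
    if [ "$2" -lt "$1" ]; then
        # Errors were not consistent to begin with, so a clean run
        # after reseating proves little.
        echo "inconclusive"
    elif [ "$3" -eq 0 ]; then
        # Failed every time before, passes right after reseating.
        echo "likely-seating-or-contact"
    else
        echo "reseating-did-not-help"
    fi
}

# Example: 3 runs before reseating, all 3 failed, 0 failures afterwards.
interpret_reseat 3 3 0
```

The point of requiring consistent failures first is that an intermittent fault can pass a test by chance, with or without reseating.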

References:

  1. How I first started learning about the stress-ng Linux stress test command-line tool: https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
  • 2
    It might have just cleaned up the contacts - gonna be hard to tell now. I'd let Dell know & see if they're willing to hold their offer for say, a month, so you can test further.
    – Tetsujin
    Commented Aug 23, 2020 at 7:20
  • 2
    "Can unplugging your RAM stick and plugging it back in solve RAM errors/problems?" -- Quite possible. Especially if you touched the contacts of the DIMM, and got skin oils on them. Low-quality IC sockets are notorious for causing issues, and reseating the chips is standard procedure. One project I worked on used a Mil-spec computer, and every 2-3 months it would go bonkers (e.g. both Run & Halt indicators would be on); the fix would be to unplug all boards and reseat them.
    – sawdust
    Commented Aug 23, 2020 at 8:27
  • Ah, the old "blow into the cartridge" trick ;)
    – gronostaj
    Commented Aug 24, 2020 at 6:49
  • Did they test the old components after they replaced them? Did they update you?
    – Didi Kohen
    Commented Dec 30, 2020 at 20:02
  • 1
    @DidiKohen, I don't know if they tested them. No, they didn't/don't update me. Probably takes more time for them than they care to do. See my comments under the accepted answer. They ended up replacing the RAM and motherboard for me and I have had no more problems since then. I'm on that computer now--runs perfectly so far. Commented Dec 30, 2020 at 20:10

1 Answer


Taking the RAM out and putting it back in can certainly fix these kinds of issues.
(But the problem may come back in a couple of months.)

Basically there are 3 separate issues here:

  1. A flaky contact in the socket due to mechanical tolerances and the RAM/socket slightly shrinking/expanding as it heats up and cools down many times over a long period of usage. That may have created a bad contact and/or a very thin layer of oxidation building up on the contacts. Re-seating the RAM can fix this by mechanically re-aligning the contacts and/or scraping off the oxidation.
  2. The metal of the contacts of the RAM and of the socket is usually chemically not quite the same (different alloys). This can lead to chemical reactions between the metals, that gradually create a very thin film of reaction material on the boundary between them. This layer usually has worse electrical properties than the contacts themselves, which can lead to stability issues.
    Taking the RAM out/back in scrapes that layer off and you are good to go until it forms again. Especially computers used in a relatively humid environment can be subject to this, but it usually takes several years before this becomes an issue.
  3. Only applies if the RAM has been handled by people: skin gives off an oily residue. If that gets on the contacts of the RAM, it may react slightly with the metal, again forming a thin film on the contacts that affects the electrical properties.

The 3 effects above can appear in combination and amplify each other. They can also start popping up after a long period of using the computer without issues. It can happen even in computers whose internals you have never touched since they left the factory.

Testing suspect RAM is tricky, especially if you don't have another known-good system available.

The typical thing to do when you suspect bad RAM is to first take out the RAM.
Visually inspect it for bent contacts: if there are any, throw it away immediately. It will never be 100% reliable again.
Then clean the contacts and re-seat the RAM in the same slot. Then re-test.
If it still tests bad you can try a known good RAM in that slot. (Not always possible if the motherboard needs a specific combination of slots to be used.) If that also tests bad the slot itself is usually the culprit.
And you can test with only the suspect RAM in another slot.

If the motherboard/memory controller is the problem, any RAM you test in that same slot will appear bad. But beware: when you change the memory layout/configuration (e.g. test with fewer or different-size RAM strips), the problem can move to another slot. It is also possible that the system is reliably unstable in some memory combinations and stable in others (depending on the physical layout of the RAM present).
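
The swap tests described above amount to a small decision table, which can be sketched as follows. This is an illustration only: the function name diagnose_swap, its arguments, and the verdict strings are hypothetical, and it assumes each test result has already been reduced to "pass" or "fail". As noted above, changing the memory configuration can move the problem around, so treat the verdict as a hint rather than proof.

```shell
# Arguments (hypothetical, for illustration only):
#   $1 = result of known-good RAM tested in the suspect slot ("pass"/"fail")
#   $2 = result of the suspect RAM tested alone in another slot ("pass"/"fail")
diagnose_swap() {
    if [ "$1" = "fail" ]; then
        # Even known-good RAM fails in this slot.
        echo "slot-or-motherboard"
    elif [ "$2" = "fail" ]; then
        # The suspect module fails wherever it is placed.
        echo "ram-module"
    else
        # Nothing reproduces: possibly a contact issue fixed by reseating.
        echo "inconclusive"
    fi
}

# Example: known-good RAM passes in the slot, suspect RAM fails elsewhere.
diagnose_swap pass fail
```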

And always test with the RAM timing in the BIOS set to standard timing. Overclocked RAM can cause its own issues and make tests unreliable.

If you have another computer that is known to be good, it is probably easiest to run that second computer with just 1 RAM module from the problem system. Test all RAM modules one by one. Then test the motherboard of the flaky computer by running it with RAM that has checked out to be good in the previous tests.

A few words on cleaning the contacts:
Don't try to clean the slots on the motherboard. Very easy to damage them.
The friction of a RAM strip being taken out/inserted is enough to scrape the contacts clean.
On the RAM strips themselves:
Gently rub them with a pencil eraser in the correct direction. (When you hold the RAM horizontally with the contacts pointing down you rub it from top to bottom. So along the contact in the direction of where the slot would be if it was inserted in a slot.)
Do both sides and try to avoid touching the contacts with your fingers.
If you did touch them (or just to be on the safe side) dab a Q-tip/cotton swab in isopropyl alcohol (available at any pharmacy) and run that over the contacts. Keep repeating until you don't see any dark smudge on the Q-tip anymore.

  • So what would you do about Dell offering to replace the motherboard and RAM with refurbished parts? Would you accept that offer, or just decline and keep the current parts which now seem to be working? It sounds like I should remove the RAM sticks and at least clean the contacts though. Commented Aug 24, 2020 at 17:53
  • 1
    @GabrielStaples If it works, don't fix it. If you're not planning to return the computer, then just wait - you can call Dell again if the problem returns.
    – gronostaj
    Commented Aug 24, 2020 at 20:25
  • 1
    @GabrielStaples I agree with Gronostaj. If it works, just keep it as is. You can always make a new warranty claim if it starts acting up again. By the way: I would not worry about replacement parts being refurbished. As long as Dell does the replacing, they should be as good as new and you will retain the original warranty on the laptop.
    – Tonny
    Commented Aug 25, 2020 at 15:21
  • 1
    So, I've had no issues since reseating the RAM...until yesterday. I picked up the laptop, on & open, to carry it downstairs. Each step on the stairs I saw the screen flicker & a bunch of rows of pixels on the screen get garbled. On my 3rd or 4th step it stopped. I checked, & the computer was frozen. It required a hard reboot to restart it. This indicates there is a mechanical problem: some sort of connections on the motherboard or something which are fractured or become disconnected only under momentary times of physical stress or strain or minor jolting. I'm going to accept the replacements. Commented Aug 30, 2020 at 20:45
  • 1
    @GabrielStaples If things shake loose just by carrying the laptop there is definitely a loose contact somewhere. Going for a replacement is the right thing to do.
    – Tonny
    Commented Aug 31, 2020 at 7:11
