
Hope someone can give me an idea of how to resolve this issue short of replacing the whole machine.

Background/History

I have an ASUS P8Z68-M Pro motherboard, a G620 CPU, and 16GB of DDR3 1333MHz CL9-9-9-24 RAM. The system is about 4 years old, and it had memory errors about 2 years ago. At that time I bought new RAM, RMA'd the bad set, and kept the returned replacement as a spare.

Last week I noticed some weird errors in FreeNAS (which had apparently been happening for some time), so I took the machine down, ran Memtest86+ v4.2, and found an easily reproducible error in one of the DIMMs at address 0019bd12878.

The first time, the memory failed on Pass 1, Test 2: the error bit mask was 00010000 (expected 0, read 1).

The second time, it failed on Pass 1, Test 1: the error bit mask was 00020000, again expected 0 and read 1.
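To see which data bit actually flipped, the two masks can be decoded into bit positions. This is a small sketch that assumes the masks are hexadecimal, as Memtest86+ displays them. It shows that each run flipped a single bit, and that the two bits (16 and 17) are neighbors.

```python
def set_bits(mask: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Error-bit masks from the two Memtest86+ runs, assumed to be hexadecimal.
for mask in (0x00010000, 0x00020000):
    print(f"{mask:#010x} -> bit(s) {set_bits(mask)}")
```

A single set bit per mask means a single-bit flip in that word. On its own, this does not say whether the DIMM or the board is at fault.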

The problem was very easy to reproduce: I put the bad DIMM in a different slot for each of the two tests, and it failed both times.

The problem

I replaced the bad RAM with the spare set from the first RMA: brand-new Patriot VIPER DDR3 1600MHz CL9-9-9-24, which I set to run at 1333MHz in the BIOS because the G620 won't take the higher multiplier. I enabled XMP in the BIOS and then set the memory clock to 1333MHz.
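Lowering the clock while keeping the same CL changes the absolute CAS latency. The sketch below assumes the 9-9-9-24 timings from the XMP profile carried over unchanged when only the clock was lowered. The post doesn't say whether that happened, so check the BIOS. DDR transfers twice per clock, so one cycle lasts 2000 / data rate nanoseconds.

```python
def cas_latency_ns(cl: int, data_rate_mts: int) -> float:
    """Absolute CAS latency in ns: CL cycles times a cycle of 2000 / data rate ns."""
    return cl * 2000 / data_rate_mts


print(round(cas_latency_ns(9, 1600), 2))  # 11.25 (rated XMP profile)
print(round(cas_latency_ns(9, 1333), 2))  # 13.5 (as configured, assuming CL9 kept)
```

If that assumption holds, the modules have more time per access at 1333MHz than at their rated 1600MHz, not less.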

I now have a weird situation with the replacement.

This ran fine for just over 24 hours, and then I started getting a few errors at 0004d2fxxxx. This is a range of addresses: the program only shows a few on screen, and I don't have a printer hooked up or any other way to capture more details.

Without taking the machine down, I changed the Memtest86+ settings to spot-test the area that was reporting errors and got about 4500 errors very quickly. All of the errors were reported by Test 8 ("Random Patterns").
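To spot-test a partially recorded address, it helps to turn it into explicit lower and upper bounds. This sketch reads each trailing "x" in 0004d2fxxxx as one unknown hex digit, which is an assumption about what the screen showed. The helper name `address_range` is hypothetical.

```python
def address_range(pattern: str) -> tuple[int, int]:
    """Turn a hex address with unknown trailing digits ('x') into (low, high)."""
    p = pattern.lower()
    # Only trailing unknowns give a single contiguous range.
    if "x" in p.rstrip("x"):
        raise ValueError("unknown digits must be at the end of the address")
    return int(p.replace("x", "0"), 16), int(p.replace("x", "f"), 16)


low, high = address_range("0004d2fxxxx")
print(f"{low:#x} - {high:#x}")  # 0x4d2f0000 - 0x4d2fffff
print(f"starts about {low / 2**20:.0f} MiB into physical memory")
```

With dual-channel memory, consecutive physical addresses are usually interleaved across channels. A range like this therefore doesn't identify a single DIMM by itself, which is one reason pulling DIMMs is still needed to localize the fault.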

I then tried to reproduce and localize the problem by pulling one of the two DIMMs, and the errors stopped. So the power cycle and/or reseating the other DIMM cleared the problem.

I went back to the original configuration, and so far it has been running error-free for over 37 hours, which makes a simple thermal problem less likely.

Questions

  1. Any suggestions on how I can localize this problem?
  2. Any other test programs I should run that might help?
  3. Is this more likely to be a memory problem or a motherboard problem (or even a CPU or power supply issue)?

Any suggestions or input would be most appreciated.

Thanks.

  • This may be hard to pin down, but based on the information you provided, it seems you're underclocking your RAM. DIMMs require a constant refresh cycle, and if the rate gets too low the data becomes corrupted, which I suspect may be the case here. Commented Aug 12, 2016 at 17:33
  • Yes, that is correct: I am running 1600MHz RAM at 1333MHz. I was told this was no problem (if that is not the case, I'd appreciate comments). Update: the system is now at over 47.5 hours of continuous testing with no errors.
    – user73383
    Commented Aug 13, 2016 at 2:39
  • It depends on the actual memory chip's specifications. Commented Aug 13, 2016 at 3:18
  • Can you tell me what to check? I'll see if I can find the chip specs. Correct me if I'm wrong, but wouldn't that problem show a more consistent pattern of errors? It's now up to 20 complete passes and 59.5 hours of successful tests. What I don't get is why a consistent pattern of errors would go away. A change in the mechanical connection or the power cycle appears to be what cleared the problem. I wonder if it's contact oxidation.
    – user73383
    Commented Aug 13, 2016 at 14:38
  • Based on what you did to solve it, it does point towards a physical connection issue. Commented Aug 13, 2016 at 15:35

1 Answer


I can't tell for sure whether I've found a solution. I found and applied a BIOS update that is purported to improve stability and memory compatibility.

So far after applying the patch the system has been running successfully for almost 48 hours. At this point I don't know if I've solved the problem, or just haven't found what causes it to fail.

