
My primary PC experiences a BSOD (bug check 124) roughly once a day and has been doing so for several months. These BSODs appear to be related to warnings 500 and 501 in the Windows event log. Both message types say "The Desktop Window Manager is experiencing heavy resource contention". 500 adds "The DWM responsiveness has degraded". 501 adds "Graphics subsystem resources are over-utilized. A consistent degradation in frame rate for the DW".

After checking that the graphics driver was up-to-date, I replaced the AMD graphics card with an Nvidia card from another machine. Although replacing the graphics card is expensive, I thought it was the most likely suspect, and it's easier than replacing the motherboard or the power supply. But this made no difference: still the same 500/501 warnings and a daily BSOD.

No hardware events in the event log. No errors or warnings in the device manager. Nothing else unusual that I could find. So I have 3 questions:

  • Any other investigative technique available (short of a voltmeter)?
  • Any alternative to replacing the motherboard and/or the power supply?
  • Any other likely causes for the BSOD?

EDIT 1: I've run the built-in Windows memory diagnostic twice, and had a clean result both times. But when I ran the Prime95 torture test (blended, lots of RAM testing) twice, it caused the same BSOD both times within 30 seconds. When I ran the Prime95 torture test (small FFTs, RAM not tested much), it ran fine for 10 minutes, although the temperature on a couple of the cores reached a nasty-looking 91C on full boost (33C at idle, ambient temperature 22C). So perhaps a memory hardware or voltage issue.

EDIT 2: I've changed the memory voltage setting so that it can go as high as 1.6 (from the default of 1.5). The Prime95 blended torture test now runs for 10 minutes without a BSOD, although 3 of the 4 cores reach the terrifying temperature of 98C! I'm going to watch for 500/501 events over the next couple of days.

EDIT 3: I'm unable to disable the core with the dodgy L2 cache, as the BIOS doesn't allow me to disable specific cores. But changing to a profile with memory voltage raised from 1.5 to 1.6 and the overclocking boost reduced from 4.6 to 4.2 GHz appears to have eliminated the BSODs.

System details

  • Motherboard: Asus P8Z68-V LE
  • Graphics: Nvidia GTX 770 2 GB
  • Power: Corsair 600W
  • CPU: Intel i7 2600K 3.4 GHz (on-demand to 4.6 GHz)
  • Cooling: Noctua NH-D14
  • Memory: 16 GB PC3-10666 1333MHz DDR3
  • OS: Windows 7 Pro with Aero switched-off
  • All device drivers up-to-date. OS fully-patched.
  • Machine is rarely pushed hard - maybe once a month.
11
  • Reinstall Windows.
    Commented Feb 1, 2014 at 17:52
  • Interestingly you get this issue while Aero is turned off. Have you got any .dmp file in your C:\Windows\Minidump folder?
    – and31415
    Commented Feb 1, 2014 at 18:34
  • @and31415, lots of them. Nearly all of them indicate an unexpected hardware error, but I can't isolate which hardware.
    – HTTP 410
    Commented Feb 1, 2014 at 19:53
  • @RoadWarrior Copy the entire Minidump folder to the desktop, compress it in a .zip/.7z archive, and upload it somewhere (e.g. http://ge.tt/). Then post the resulting link here.
    – and31415
    Commented Feb 1, 2014 at 20:38
  • @RoadWarrior Well, the file is empty (0 KB).
    – and31415
    Commented Feb 2, 2014 at 0:55

1 Answer


Here's the output of !analyze -v and !errrec for your dump file.

I'm not that experienced with kernel debugging, but it would seem that GCACHEL2_ERR_ERR (Proc 0 Bank 8) indicates a problem with the L2 cache on one of the i7's physical cores.

Why it does that ... who knows :)

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800de4e028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000be200000, High order 32-bits of the MCi_STATUS value.
Arg4: 000000000005110a, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------


BUGCHECK_STR:  0x124_GenuineIntel
CUSTOMER_CRASH_COUNT:  1
DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT
PROCESS_NAME:  System
CURRENT_IRQL:  f
STACK_TEXT:  
nt!KeBugCheckEx


STACK_COMMAND:  kb
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: GenuineIntel
IMAGE_NAME:  GenuineIntel
DEBUG_FLR_IMAGE_TIMESTAMP:  0
FAILURE_BUCKET_ID:  X64_0x124_GenuineIntel_PROCESSOR_CACHE
BUCKET_ID:  X64_0x124_GenuineIntel_PROCESSOR_CACHE
Followup: MachineOwner

0: kd> !errrec fffffa800de4e028
===============================================================================
Common Platform Error Record @ fffffa800de4e028
-------------------------------------------------------------------------------
Record Id     : 01cf07525f60f483
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 1/2/2014 20:45:39 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800de4e0a8
Section       @ fffffa800de4e180
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Generic
Flags         : 0x00
Level         : 2
CPU Version   : 0x00000000000206a7
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800de4e0f0
Section       @ fffffa800de4e240
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : a7 06 02 00 00 08 10 00 - bf e3 9a 1f ff fb eb bf
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800de4e240

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800de4e138
Section       @ fffffa800de4e2c0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : GCACHEL2_ERR_ERR (Proc 0 Bank 8)
  Status      : 0xbe2000000005110a
  Address     : 0x0000000132de9a40
  Misc.       : 0x000000d080034086
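
The Status value in the dump above can be unpacked by hand, which shows where the GCACHEL2_ERR_ERR name comes from. The following is a minimal sketch in Python, assuming the architectural MCi_STATUS layout and the cache-hierarchy error-code format (000F 0001 RRRR TTLL) described in Intel's Software Developer's Manual. The function and table names are my own, and the model-specific bits are not interpreted.

```python
# Illustrative decoder for an IA32_MCi_STATUS value (not a complete one).
# Bit positions are taken from the machine-check architecture chapter of the
# Intel SDM; all names below are hypothetical helpers for this sketch.

STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Sub-fields of a cache-hierarchy error code: 000F 0001 RRRR TTLL
TRANSACTION = {0b00: "I", 0b01: "D", 0b10: "G"}  # instruction, data, generic
LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}


def combine(high32, low32):
    """Join bug check Arg3 (high half) and Arg4 (low half) into 64 bits."""
    return (high32 << 32) | low32


def decode_status(status):
    """Return the flag names and error-code fields found in `status`."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mnemonic": None,  # filled in only for cache-hierarchy errors
    }
    # Ignore the F bit (bit 12) when matching the cache-hierarchy pattern.
    if (mca_code & 0xEF00) == 0x0100:
        tt = TRANSACTION.get((mca_code >> 2) & 0b11, "?")
        ll = LEVEL[mca_code & 0b11]
        rrrr = REQUEST.get((mca_code >> 4) & 0xF, "?")
        result["mnemonic"] = f"{tt}CACHE{ll}_{rrrr}_ERR"
    return result


if __name__ == "__main__":
    status = combine(0xBE200000, 0x0005110A)  # Arg3 and Arg4 from the dump
    info = decode_status(status)
    print(hex(status), info["flags"], info["mnemonic"])
```

For this dump, the low 16 bits (0x110a) fall in the cache-hierarchy family with a generic transaction type, level L2, and a generic request, and the UC and PCC flags are both set. That is consistent with an uncorrected error that Windows has to treat as fatal, although it does not by itself say why the cache is failing.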
3
  • Many thanks for the Windbg analysis. I'm inclined to believe that the CPU could indeed be damaged because of that horrific 98C during Prime95. Unfortunately 3 months out of warranty, although the BSODs started to happen during the warranty period.
    – HTTP 410
    Commented Feb 2, 2014 at 11:45
  • Let's assume it's one physical core that's damaged. You could try turning Hyper-Threading off or seeing if your BIOS allows you to limit the number of physical cores that are enabled. msconfig allows you to adjust the number of processors that Windows will use, but to my knowledge you can't choose which ones.
    Commented Feb 2, 2014 at 11:56
  • Assuming it's one physical core damaged rather than the L2 cache itself, I'll check the BIOS to see if a core can be disabled. HT is already off because of the type of work that I'm doing (coding and testing chess engines).
    – HTTP 410
    Commented Feb 2, 2014 at 14:18
