  • Locked thread
Apze
Jul 4, 2014
Problem description:

A few months ago, my computer blue screened a few times. I had little success fixing it, so I did a Windows reset from the install media, and it blue screened again on the clean install minutes after the first boot. After that, I updated my BIOS and the issue appeared to go away. Recently, it began blue screening again with a variety of error codes: mostly irql_not_less_or_equal, with an occasional kmode_exception_not_handled or whea_uncorrectable_error.

As it stands right now, on a cold boot it will generally blue screen during load or within 30-120 seconds afterwards, until it stabilizes a bit. If I manage to load into a somewhat strenuous game, it will be completely stable during play, but once I stop it will generally blue screen at the desktop within 10 minutes at most. In addition, it is 100% stable in safe mode, but I have had it blue screen at the Windows 10 recovery screen and at the safe-mode selection menu after blue screening on boot a few times. It has blue screened during less strenuous gaming, but only twice out of the couple hundred blue screens I've had over the last few weeks.

My CPU temperatures sit at 31-36C at idle, under gaming load are around 62-64C, and under max load such as AIDA get up to low 70s.

Attempted fixes:
  • Windows reset to default install.
  • Memtest86 for 4 loops as well as memtest86+ for 4 loops with no errors
  • Swapped and reseated my ram (has blue screened with each stick in individually)
  • Removed all PCI-E devices (video and sound cards) aside from the M.2 SSDs
  • Reseated all my power connections.
  • Updated all my drivers, my bios, and my SSD firmware
  • Disabled all non-microsoft services.
  • Took apart the minidumps in WinDbg (I have 15 or so saved as a sample), but they appear to be completely random errors originating from completely random processes or kernel space.

  • Reset bios to default settings
  • Manually set my vcore to 1.2v, under a guess that my idle vcore was going too low.
  • Disabled SpeedStep
  • Disabled c-states.... this one actually seems to have an effect, because turning c-states back on causes wildly more instability (I basically don't even get a bluescreen, just a freeze at windows loading)
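For anyone doing the same minidump triage, a quick tally can make the "is it really random?" question less of a gut feeling. The sketch below is only an illustration, not what was actually run here: it assumes you copy the bug check code and faulting process by hand from each dump's `!analyze -v` output, and `tally_crashes`, `looks_random`, and the 50% threshold are made-up names and values. The three codes in the table are the standard values for the stop errors named above.

```python
from collections import Counter

# Standard Windows bug check codes for the stop errors named in this post.
BUGCHECK_NAMES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000124: "WHEA_UNCORRECTABLE_ERROR",
}


def tally_crashes(records):
    """Count crashes per bug check name and per faulting process.

    `records` is an iterable of (bugcheck_code, process_name) tuples,
    transcribed by hand from each dump (hypothetical input format).
    """
    by_code = Counter()
    by_process = Counter()
    for code, process in records:
        # Fall back to the hex code for anything not in the table above.
        by_code[BUGCHECK_NAMES.get(code, hex(code))] += 1
        by_process[process.lower()] += 1
    return by_code, by_process


def looks_random(by_process, threshold=0.5):
    """True if no single process accounts for more than `threshold` of crashes.

    The 0.5 cutoff is an arbitrary assumption, not an established rule.
    """
    total = sum(by_process.values())
    if total == 0:
        return True
    return max(by_process.values()) / total <= threshold
```

If one driver or process dominates the tally, that points at software; a flat spread across unrelated processes and kernel space is more consistent with a hardware fault.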

One of my earlier guesses was that my vcore was dropping too low when idling, so I manually set my vcore to 1.2v and disabled C-states. It might be a placebo, but it did seem to become slightly more stable. I have tried with XMP both on and off; turning XMP off seemed to make it worse (could still be placebo), so it is on. The CPU is running at stock clock speed and all other options are BIOS defaults. I tried setting my LLC to one of the higher settings, but this seemed to make it worse (placebo again, maybe). I have updated my BIOS to the newest version and reset to optimal defaults, which did not help. Unfortunately, there have been no 100% predictable scenarios for not crashing other than full-load gaming or safe mode.

I unfortunately do not have another PSU lying around to test with, although if it were a hardware issue I would expect it to crash under load or in safe mode.
I think I may need to do another windows reinstall, just completely fresh instead of a reset, as it appears it doesn't necessarily change EVERYTHING back to default.

Recent changes: No recent hardware changes, no software changes that I can think of that would be relevant. My computer was more stable a month or so ago and nothing has happened since then. These issues started around the time of the Spectre/Meltdown microcode updates, which makes me super suspicious.

--

Operating system: Windows 10 Pro x64

System specs:

Self-built computer:
Intel i7 8700k
Asus Maximus X Hero Wi-fi
4x 8GB DDR4-3000 Corsair RAM
Nvidia GTX 1080 TI
Soundblaster X AE-5 Soundcard
Corsair HX850i PSU
2x Samsung 960 EVO 1TB M.2 SSDs
Razer peripherals


Location: USA

I have Googled and read the FAQ: Yes


Thanks for any help in advance.

Myrridinos
Jan 7, 2010
My finger usually points to the memory for these types of blue screens, but from the troubleshooting you've described it sounds like the motherboard could be the culprit.

It would be unusual, but there's also a chance it could be the M.2 drive having an issue or a really messed up peripheral.

Apze
Jul 4, 2014
It's interesting you mention the peripheral, because I have had some strange-ish mouse issues, so I tried a new mouse in a front USB port instead of the back, and managed to get into Windows with C-states enabled. It blue screened a bit later, but typically I can't even get that far. I still had my keyboard plugged into the other USB port back there, so I am going to try a new keyboard on a front port as well. I guess it's possible it could be the USB ports themselves.


edit: That was somewhat of a red herring and unfortunately didn't fix it. One interesting thing, though: I had been using the computer heavily before my recent testing, and it appears to have become less stable over time. When I turned C-states back on to see whether the USB changes had fixed them, it actually booted into Windows. After a few blue screens, however, it could no longer boot into Windows, at which point I disabled them again and it booted more or less correctly.

edit2: A friend suggested that I try disabling my swap file, and it appears that this makes my system incredibly more stable: once I get into Windows, it has run at idle for ~1 hour of testing with no blue screens. However, I then attempted to re-enable C-states, and it blue screened (but after getting into Windows, which is somewhat better than normal).

Apze fucked around with this message at 03:57 on Apr 22, 2018

Myrridinos
Jan 7, 2010
I have seen corrupted swap files cause memory errors before. I'd think that a fresh install would have cleared that out but I don't know if a windows reset would re-use the existing swap file.

I'd see if I could run a memtest overnight without it crashing.

Let me know how things turn out.

Apze
Jul 4, 2014
Ya, definitely really weird that it increased stability so much. The real test will be tomorrow when it's been cold forever, since generally it blue screens 10+ times before it stabilizes somewhat. I'm also going to try loosening my heatsink a bit, as the main M.2 is right next to the CPU; maybe my motherboard is warped a bit due to overtightening.

MF_James
May 8, 2008

Myrridinos posted:

I have seen corrupted swap files cause memory errors before. I'd think that a fresh install would have cleared that out but I don't know if a windows reset would re-use the existing swap file.

I'd see if I could run a memtest overnight without it crashing.

Let me know how things turn out.

If you have an SSD you should not have the paging file turned on, which is, I assume, what you meant by swap file.

Myrridinos
Jan 7, 2010

MF_James posted:

If you have an SSD you should not have the paging file turned on, which is, I assume, what you meant by swap file.

Swap files and paging files are synonymous; which term gets used depends on the operating system.

With the increasing endurance of SSDs and adequate memory, having a swap file enabled should make no noticeable difference to how long the SSD takes to wear out. If you are hitting the swap file that much, you need more RAM anyway.

Having the page file off can cause issues and application crashes if you do have a program that needs more RAM than you have.

On a typical Windows system, the modern consensus is to keep the swap file on, solid state drive or not.

That being said I generally keep my swap/paging on my secondary mechanical hard drive, as I try to squeeze every bit of life out of my limited budget.
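As a rough back-of-the-envelope check on the endurance point, the sketch below estimates how long a drive takes to reach its rated terabytes-written (TBW) figure with and without extra paging writes. Every number in it is an assumption for illustration: the 400 TBW rating should be checked against the drive's actual spec sheet, and the daily write volumes are hypothetical, not measurements from this machine.

```python
def years_to_rated_endurance(rated_tbw, daily_writes_gb):
    """Years until the drive's rated terabytes-written figure is reached."""
    if daily_writes_gb <= 0:
        raise ValueError("daily_writes_gb must be positive")
    total_gb = rated_tbw * 1000  # vendors rate TBW in decimal terabytes
    return total_gb / daily_writes_gb / 365


RATED_TBW = 400           # assumed rating for a 1TB drive; check the spec sheet
BASELINE_GB_PER_DAY = 30  # hypothetical everyday writes
PAGING_GB_PER_DAY = 5     # hypothetical extra writes caused by the paging file

without_paging = years_to_rated_endurance(RATED_TBW, BASELINE_GB_PER_DAY)
with_paging = years_to_rated_endurance(
    RATED_TBW, BASELINE_GB_PER_DAY + PAGING_GB_PER_DAY
)
print(f"{without_paging:.1f} years without paging, {with_paging:.1f} years with")
```

Under those assumed numbers both figures land in the decades, which is the gist of the argument: the paging file moves the estimate, but nowhere near the useful life of the machine.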

Myrridinos fucked around with this message at 23:48 on Apr 22, 2018

Apze
Jul 4, 2014
So I basically went through and reseated everything, as well as inspecting CPU pins and whatnot. It's so far 100% stable, unless I turn on automatic management of the swap file size, or re-enable C-states. My best guess is that I am having issues with PCI-E devices entering low power states, but whether that is due to a bad PSU, mobo or CPU I don't know. I don't actually care too much about having the C-states disabled, but I really wish I knew what the exact root cause of this is, as I'd like to just replace it. I guess I can just wait for the thing to deteriorate further, or completely explode, then just replace them one at a time.

Myrridinos
Jan 7, 2010
That's a little bit of an odd combo. C-states are processor power management; the paging file is storage writes, in your case through the PCI-E bus. If I had to take a gamble I'd still point my finger at the motherboard.

If it's stable now and everything works, I would recommend letting it ride and keeping a close eye on it.

Apze
Jul 4, 2014
So after having it off for ~16 hours, it bluescreened under those settings one time, then went back to being basically as stable as it was yesterday. Since I don't really want to mess with it TOO much anymore, I'm also leaning towards the most likely culprit being motherboard, so I went ahead and got a new one that I'll get to try out tomorrow.

It is a little weird, but I'm led to believe that the C-states affect PCI-E link power management when package C-state support is enabled. In addition, one of the rarer blue screens I saw was a clock_watchdog_timeout, so very likely due to something like a too-long read on the SSDs. Together, those suggest to me an issue directly with the PCI-E bus. If the new motherboard doesn't fix it, I think the SSDs are still more likely than the CPU, but since it also blue screened after I disabled paging on the other drive, both SSDs would basically have to be bad. In either case, hopefully the new motherboard fixes the problem, and if it doesn't, I'll hopefully get a little more insight into it.

Apze fucked around with this message at 06:16 on Apr 24, 2018

Apze
Jul 4, 2014
So I replaced the motherboard, exact same issue/blue screening occurred. I decided to try to reinstall windows, and it locks up at the load screen for the windows install. To make sure it wasn't the SSDs doing something wonky, I removed all my drives, and tried to get into the windows installer, same issue. So at this point I've removed (at different times) every device in my computer except the CPU and PSU. I'm not sure that a PSU issue would be so consistent in the issues it's causing. Looks like I may have to go pick up a CPU tomorrow.
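The process of elimination at this point can be written out as a simple set difference. This is just a sketch of the reasoning in the thread, with the component names taken from the spec list in the first post; `remaining_suspects` is a made-up helper name.

```python
def remaining_suspects(components, cleared):
    """Return the components that have not yet been ruled out, sorted by name."""
    return sorted(set(components) - set(cleared))


# Components from the spec list in the first post.
components = {"CPU", "motherboard", "RAM", "GPU", "sound card", "PSU", "SSDs"}

# Parts that were removed or swapped at some point while the crashes continued.
cleared = {
    "RAM",          # blue screened with each stick in individually
    "GPU",          # removed along with the sound card
    "sound card",
    "motherboard",  # replaced, same blue screens
    "SSDs",         # installer still locked up with all drives removed
}

suspects = remaining_suspects(components, cleared)
print(suspects)  # ['CPU', 'PSU']
```

The catch is that this only narrows the list; it doesn't rank what's left, so the CPU vs PSU call still comes down to judgment or a swap test.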

Myrridinos
Jan 7, 2010
I'm sure you've tested it but does the installation media you use boot up properly in other computers?

It's rare to have a CPU go bad but it happens.

I've seen failing power supplies result in some wacky behavior. This doesn't sound like a power supply issue, but I wouldn't rule it out at this point without testing it.

This computer has been running you down a rabbit hole.

Apze
Jul 4, 2014
Welp, fingers crossed, but it appears the new CPU has fixed the problem. Boots fine with all BIOS default settings, enabled XMP and still good. Bad CPU is a rare problem, but I guess I'm just lucky like that.

Myrridinos
Jan 7, 2010
I think I've run into a bad CPU maybe twice in the last 4 years. Try not to walk outside in any thunderstorms with that luck.

Glad to hear that you found the problem and are now stable.

Apze
Jul 4, 2014
I know right? I think this is my first legit bad CPU ever, outside of a couple that I know exactly why they died.
