|
Problem description: A few months ago my computer blue screened a few times. I had little success fixing it, so I did a Windows reset from the install media, and the clean install blue screened again minutes after its first boot. After this, I updated my BIOS and the issue appeared to go away. Recently it began blue screening again with a variety of error codes: mostly irql_not_less_or_equal, with an occasional kmode_exception_not_handled or whea_uncorrectable_error.

As it stands right now, on a cold boot it will generally bluescreen during load or within 30-120 seconds thereafter, until it stabilizes a bit. If I manage to load into a somewhat strenuous game, it is completely stable during play, but once I stop it will generally bluescreen at the desktop within 10 minutes at most. In addition, it is 100% stable in safe mode, but I have had it blue screen at the Windows 10 recovery screen and at the safe-mode selection menu after blue screening on boot a few times. I have had it bluescreen during less strenuous gaming, but that has happened twice total out of the couple hundred bluescreens I've had over the last few weeks. My CPU temperatures sit at 31-36C at idle, around 62-64C under gaming load, and in the low 70s under max load such as AIDA.

Attempted fixes: One of my earlier guesses was that my vcore was dropping too low when idling, so I manually set my vcore to 1.2v and disabled C-states. It might be a placebo, but it did seem to become slightly more stable. I have tried with XMP both on and off, and turning XMP off seemed to make it worse (could still be placebo). The CPU is running at stock clock speed and all other options are at BIOS default. I tried setting my LLC to one of the higher settings, but this seemed to make it worse (again, possibly placebo). I have updated my BIOS to the newest version and reset to optimal defaults, which did not work. Unfortunately, there have been no 100% predictable scenarios for not crashing other than full-load gaming or safe mode. I do not have another PSU lying around to test with, although if it were a hardware issue I would expect it to crash under load or in safe mode. I think I may need to do another Windows reinstall, completely fresh this time instead of a reset, as it appears a reset doesn't necessarily change EVERYTHING back to default.

Recent changes: No recent hardware changes, and no software changes that I can think of that would be relevant. My computer was more stable a month or so ago and nothing has happened since then. These issues started around the time of the Spectre/Meltdown microcode updates, which makes me super suspicious.
--
Operating system: Windows 10 Pro x64
System specs: Self-built computer:
Intel i7 8700k
Asus Maximus X Hero Wi-fi
4x 8GB DDR4-3000 Corsair RAM
Nvidia GTX 1080 TI
Soundblaster X AE-5 Soundcard
Corsair HX850i PSU
2x Samsung 960 EVO 1TB M.2 SSDs
Razer peripherals
Location: USA
I have Googled and read the FAQ: Yes

Thanks for any help in advance.
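For anyone trying to spot a pattern in a pile of stop codes like the ones named in this post, here is a minimal Python sketch that maps the three names to their numeric bugcheck values and tallies a list of crashes. The `sample` list is made up for illustration and is not data from this machine; `tally` is just a helper name chosen for this sketch.

```python
from collections import Counter

# Numeric bugcheck values for the three stop codes named in the post.
BUGCHECK_CODES = {
    "IRQL_NOT_LESS_OR_EQUAL": 0x0000000A,
    "KMODE_EXCEPTION_NOT_HANDLED": 0x0000001E,
    "WHEA_UNCORRECTABLE_ERROR": 0x00000124,
}


def tally(observed):
    """Count crashes per stop code; names are matched case-insensitively."""
    counts = Counter(name.upper() for name in observed)
    return [(name, BUGCHECK_CODES.get(name), n) for name, n in counts.most_common()]


# Hypothetical crash log, typed in by hand for illustration only.
sample = [
    "irql_not_less_or_equal",
    "irql_not_less_or_equal",
    "whea_uncorrectable_error",
]

for name, code, n in tally(sample):
    label = f"0x{code:08X}" if code is not None else "unknown"
    print(f"{name} ({label}): {n}")
```

A tally like this only shows which codes dominate; it does not identify the faulty part on its own.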
|
# ? Apr 21, 2018 21:12 |
|
|
My finger usually points to the memory for these types of blue screens, but from the troubleshooting you've described it sounds like the motherboard could be the culprit. It would be unusual, but there's also a chance it could be the M.2 drive having an issue or a really messed up peripheral.
|
# ? Apr 22, 2018 01:24 |
|
It's interesting you mention the peripheral, because I have had some strange-ish mouse issues, so I tried a new mouse in a front USB port instead of the back and managed to get into Windows with C-states enabled. It bluescreened a bit later, but typically I can't even get that far. I still had my keyboard plugged into the other USB port back there, so I am going to try a new keyboard on a front port as well. I guess it's possible it could be the USB ports themselves.

edit: That was somewhat of a red herring and unfortunately didn't fix it. However, one interesting thing to note is that I had been using it heavily before the recent testing, and it appears to have become less stable over time: when I turned C-states back on to see if the USB changes had fixed them, it actually booted into Windows. After a few blue screens, however, it was no longer able to boot into Windows, at which point I re-disabled them and it booted in correctly-ish.

edit2: A friend suggested that I try disabling my swap file, and this appears to make my system far more stable: once I get into Windows, it has run at idle for ~1 hour of testing with no blue screens. However, I then attempted to re-enable C-states and it bluescreened (but after getting into Windows, which is somewhat better than normal). Apze fucked around with this message at 03:57 on Apr 22, 2018 |
# ? Apr 22, 2018 01:54 |
|
I have seen corrupted swap files cause memory errors before. I'd think that a fresh install would have cleared that out but I don't know if a windows reset would re-use the existing swap file. I'd see if I could run a memtest overnight without it crashing. Let me know how things turn out.
|
# ? Apr 22, 2018 05:23 |
|
Ya, definitely really weird that it increased stability so much. The real test will be tomorrow when it's been cold for a long time, since it generally bluescreens 10+ times before it stabilizes somewhat. I'm also going to try loosening my heatsink a bit: the main M.2 is right next to the CPU, so maybe my motherboard is warped a bit due to overtightening.
|
# ? Apr 22, 2018 05:48 |
|
Myrridinos posted:
"I have seen corrupted swap files cause memory errors before. I'd think that a fresh install would have cleared that out but I don't know if a windows reset would re-use the existing swap file."

If you have an SSD you should not have the paging file turned on, which is, I assume, what you meant by swap file.
|
# ? Apr 22, 2018 20:42 |
|
MF_James posted:
"If you have an SSD you should not have paging file turned on, which is, I assume what you meant by swap file"

Swap files and paging files are synonymous, and which term gets used depends on the operating system. With the increasing endurance of SSDs and adequate memory, there should be no noticeable difference in SSD wear-out time from having a swap file enabled. If you are hitting the swap file that much, you need more RAM anyway. Having the page file off can cause issues and application crashes if you do have a program that needs more RAM than you have. On a typical Windows system, the modern consensus is to keep the swap file on, solid state drive or not. That being said, I generally keep my swap/paging file on my secondary mechanical hard drive, as I try to squeeze every bit of life out of my limited budget. Myrridinos fucked around with this message at 23:48 on Apr 22, 2018 |
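To check how the page file is actually configured before and after changing it, here is a rough Python sketch. It assumes the setting lives in the `PagingFiles` registry value under `Session Manager\Memory Management`, with entries of the form `path min_mb max_mb`, and it treats an entry with no sizes or with `0 0` as system-managed; verify that against your own machine. The function names are made up for this sketch.

```python
def parse_paging_entry(entry):
    """Split one PagingFiles entry into (path, min_mb, max_mb).

    Assumption: sizes of 0/0, or no sizes at all, mean system-managed.
    """
    parts = entry.rsplit(" ", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        return parts[0], int(parts[1]), int(parts[2])
    return entry, 0, 0


def read_paging_files():
    """Read and parse the PagingFiles registry value (Windows only)."""
    import winreg  # standard library, only available on Windows

    key_path = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        entries, _value_type = winreg.QueryValueEx(key, "PagingFiles")
    # An empty list here would suggest no page file is configured.
    return [parse_paging_entry(e) for e in entries if e]


# Parsing a hand-written example entry (not read from any real system).
print(parse_paging_entry(r"C:\pagefile.sys 1024 4096"))
```

On a Windows machine, calling `read_paging_files()` should return one tuple per configured page file.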
# ? Apr 22, 2018 23:45 |
|
So I basically went through and reseated everything, as well as inspecting CPU pins and whatnot. It's so far 100% stable, unless I turn on automatic management of the swap file size, or re-enable C-states. My best guess is that I am having issues with PCI-E devices entering low power states, but whether that is due to a bad PSU, mobo or CPU I don't know. I don't actually care too much about having the C-states disabled, but I really wish I knew what the exact root cause of this is, as I'd like to just replace it. I guess I can just wait for the thing to deteriorate further, or completely explode, then just replace them one at a time.
|
# ? Apr 23, 2018 03:18 |
|
That's a little bit of an odd combo. C-states are processor power management, while the paging file is storage writes, in your case through the PCI-E bus. If I had to take a gamble I'd still point my finger at the motherboard. If it's stable now and everything works, I would recommend letting it ride and keeping a close eye on it.
|
# ? Apr 24, 2018 05:25 |
|
So after having it off for ~16 hours, it bluescreened under those settings one time, then went back to being basically as stable as it was yesterday. Since I don't really want to mess with it TOO much anymore, and I'm also leaning towards the motherboard being the most likely culprit, I went ahead and got a new one that I'll get to try out tomorrow. It is a little weird, but I'm led to believe that the C-states affect the PCI-E link power management when package C-state support is enabled. In addition, one of the rarer blue screens that I saw was a clock_watchdog_timeout, which could well be due to something like a too-long read on the SSDs. Those together suggest to me an issue directly with the PCI-E bus. If this doesn't fix it, I think the SSDs are still more likely than the CPU, but if it's not the motherboard, it would basically require both of them to be bad, based on the bluescreen happening after disabling paging on the other drive. In either case, hopefully the new motherboard fixes the problem, and if it doesn't, I'll hopefully get a little more insight into the problem. Apze fucked around with this message at 06:16 on Apr 24, 2018 |
# ? Apr 24, 2018 06:12 |
|
So I replaced the motherboard, and the exact same blue screening occurred. I decided to try to reinstall Windows, and it locks up at the load screen for the Windows install. To make sure it wasn't the SSDs doing something wonky, I removed all my drives and tried to get into the Windows installer: same issue. So at this point I've removed (at different times) every device in my computer except the CPU and PSU. I'm not sure that a PSU issue would be so consistent in the issues it's causing. Looks like I may have to go pick up a CPU tomorrow.
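The process of elimination in this post can be written down as a small Python sketch. The part names come from the spec list in the first post; the `tests` entries are my reading of "removed every device except the CPU and PSU", not a log of what was actually done, and `remaining_suspects` is a helper name invented here.

```python
# Parts in the build, taken from the spec list in the first post.
SUSPECTS = {
    "CPU", "motherboard", "RAM", "GPU",
    "sound card", "PSU", "SSDs", "peripherals",
}


def remaining_suspects(suspects, tests):
    """Drop every part that was swapped or removed while the crash persisted.

    `tests` is a list of (part, crash_persisted) pairs. A part is only
    cleared when the crash still happened without the original part.
    """
    cleared = {part for part, crash_persisted in tests if crash_persisted}
    return suspects - cleared


# Assumed summary of the thread so far: each of these was swapped or
# pulled at some point, and the blue screens or lockups continued.
tests = [
    ("motherboard", True),
    ("SSDs", True),
    ("peripherals", True),
    ("RAM", True),
    ("GPU", True),
    ("sound card", True),
]

print(sorted(remaining_suspects(SUSPECTS, tests)))
```

With those assumed results, only the CPU and PSU are left, which matches the conclusion in the post.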
|
# ? Apr 25, 2018 03:04 |
|
I'm sure you've tested it, but does the installation media you use boot up properly in other computers? It's rare to have a CPU go bad, but it happens. I've seen failing power supplies result in some wacky behavior. This doesn't sound like a power supply issue, but I wouldn't rule it out at this point without testing it. This computer has been running you down a rabbit hole.
|
# ? Apr 25, 2018 17:11 |
|
Welp, fingers crossed, but it appears the new CPU has fixed the problem. Boots fine with all BIOS default settings, enabled XMP and still good. Bad CPU is a rare problem, but I guess I'm just lucky like that.
|
# ? Apr 25, 2018 21:08 |
|
I think I've run into a bad CPU maybe twice in the last 4 years. Try not to walk outside in any thunderstorms with that luck. Glad to hear that you found the problem and are now stable.
|
# ? Apr 26, 2018 02:45 |
|
|
I know right? I think this is my first legit bad CPU ever, outside of a couple that I know exactly why they died.
|
# ? Apr 26, 2018 04:40 |