TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down
Hello All, I'm trying to troubleshoot a problem with my Dell Precision T7810 workstation that serves as my UnRaid server (NAS, VMs, Dockers, etc.). It has served me well for the better part of a year since I picked it up from Shamino's fire sale. It's about 5 years old and I believe it spent half that time in a storage container, so it doesn't have THAT many hours of use on it.

Specs: T7810 XL, Dual E5-2643 v3, 64GB of ram (4x16), Quadro K4200 graphics card, 825W PSU
Memory has been installed in CPU1 DIMM1, CPU1 DIMM2, etc. as appropriate.

Thursday morning I woke to find that my server had shut down unexpectedly. The power LED was still on but the machine was unresponsive, so I did a hard reboot and it came back up. Thinking it was a fluke I went on with my day, but it went down again about 15 minutes later. I pulled it out of my rack and into my office so I could observe what was happening, and it was suddenly shutting off within 10-20 minutes of bootup. After some time, it wouldn't even make it through the full UnRaid boot before doing so.

I took out the drives, LSI card, and SSD card and put them into another case to get the server back up and running so I could troubleshoot this shutdown problem.

I've spent nearly all of my troubleshooting in the built-in hardware diagnostics tool. Initially, whenever the test ran it would power off immediately upon hitting the processor. Thinking it may be a thermal issue I ordered some fresh thermal paste and applied it. So far that hasn't had any effect.

I wanted to check the memory, so I started running the tests individually rather than letting the tool cycle through the list and fail at the processor. The shutoff would occur at the memory test as well. Now I was thinking it was a bad DIMM, so I started swapping them out and trying different configurations of DIMMs. What was strange was that at some point I was able to run the whole hardware diagnostic without crashing, but then on a subsequent test the shutoff would occur again.

There doesn't appear to be any consistency other than the longer it's all running, the more prone to crashing, which makes me go back to a potential thermal issue? Right now the CPU Stress test has CPU1 at 72C with a high of 73C and CPU2 at 75C with a high of 75C. No errors in the BIOS log.

I'm continuing to try different memory configurations but am running out of things to diagnose. I've currently taken out the Quadro K4200 and put in a 600 to see if it's the video card. So far it hasn't crashed, but I'm just getting started for the day. As my next step, I was considering removing one of the CPUs to see if it's a bad proc.

Any ideas on where to go next to troubleshoot this, or what the potential failure points are?

TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down
Update: I have not had a crash since removing the K4200 and using the 600, so I'm starting to wonder if that video card was causing an issue somehow. I haven't been able to spend a lot of time continually running the tests (to generate heat/load). Is there an ISO I can load that can run stress tests for a period of time to really get this guy tested?
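For anyone wanting a rough way to hold the CPUs under load for a fixed period, here is a minimal Python sketch. It is not a bootable ISO and not a substitute for a dedicated stress-testing tool; it assumes a working OS with Python 3 installed, and the names `run_stress` and `WORKER_CODE` are made up for this example. It only generates CPU load and does not read temperatures or touch the GPU.

```python
import os
import subprocess
import sys
import time

# Code run inside each child process: spin on integer math until the
# deadline passed as the first command-line argument (in seconds) expires.
WORKER_CODE = """
import sys, time
deadline = time.monotonic() + float(sys.argv[1])
x = 0
while time.monotonic() < deadline:
    x = (x * 31 + 7) % 1000003
"""


def run_stress(seconds, workers=None):
    """Start one busy-loop process per worker and wait for all to finish.

    Returns (exit_codes, elapsed_seconds). Defaults to one worker per
    logical CPU reported by the OS.
    """
    workers = workers or os.cpu_count() or 1
    start = time.monotonic()
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER_CODE, str(seconds)])
        for _ in range(workers)
    ]
    codes = [p.wait() for p in procs]
    return codes, time.monotonic() - start
```

Calling `run_stress(1800)` would keep every logical core busy for about 30 minutes. If the machine restarts partway through, that at least narrows things down to something that shows up under sustained load.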

Zogo
Jul 29, 2003

This is what some use for GPU testing:
https://geeks3d.com/furmark/downloads/

TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down

Zogo posted:

This is what some use for GPU testing:
https://geeks3d.com/furmark/downloads/

Thanks! I’m going to check it out. Everything has been working out just swimmingly after taking it out and replacing it.

TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down
Update on this. With the confidence above, I set everything back up and re-racked the T7810 for full-time service again. Less than a week later the problem reoccurred (better described as a restart, not a poweroff), so I took it back out of service, set it aside, and got depressed.

I brought this issue back up in the Packrat NAS thread since, admitting defeat, I was (and still am) intending to build a new server to replace it. It was suggested there that this sounds like a power supply issue.

Question I have is since the T7810 has one of those fancy slot power supplies that plugs into a board that has the 20pin connector, is it possible to use a traditional PSU, raw-dogging it in the case, to test out if the PSU is the failure point? Or am I forced to buy one and hope that it resolves it?

Appreciate any insight.

Zogo
Jul 29, 2003

TraderStav posted:

Question I have is since the T7810 has one of those fancy slot power supplies that plugs into a board that has the 20pin connector, is it possible to use a traditional PSU, raw-dogging it in the case, to test out if the PSU is the failure point? Or am I forced to buy one and hope that it resolves it?

If it's a high quality PSU and you're 100% sure the pins line up correctly you could try it.

Not something I recommend though. Especially if it was a junky/older PSU.

TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down

Zogo posted:

If it's a high quality PSU and you're 100% sure the pins line up correctly you could try it.

Not something I recommend though. Especially if it was a junky/older PSU.

I found a good eBay seller with free returns so ordered up an official one to test. Should be here on Tuesday and then can run more tests.

If that's not it, I think I'm left with swapping out CPUs and RAM before the mobo. Hopefully there's enough generous eBay sellers out there to assist me in troubleshooting this.

Last time I buy anything with proprietary parts like this!

TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down

TraderStav posted:

I found a good eBay seller with free returns so ordered up an official one to test. Should be here on Tuesday and then can run more tests.

If that's not it, I think I'm left with swapping out CPUs and RAM before the mobo. Hopefully there's enough generous eBay sellers out there to assist me in troubleshooting this.

Last time I buy anything with proprietary parts like this!

Quick update. Received the PSU today and plugged it in. I ran the diagnostic that was repeatedly failing at the Processor test with the old PSU, and it pushed right through. I had no idea what the final diagnostic report looked like, as I had never gotten to it before! Going to run a larger battery of tests, but I'm feeling confident that this was a PSU problem!
