Dell Precision T3610 Computer Demons


I'm going to start posting my random computer questions here in the hope that someone might be able to answer them. I am a software engineer by trade and have been building my own PCs for 20+ years. I'm not exactly a novice, so whenever I ask a question here it's because I've concluded that computer demons are at fault and am hoping that someone has a better answer, because clearly that can't be right...can it?

This question involves a Dell Precision T3610 with an Intel Xeon E5-2697 v2 (Ivy Bridge) 12-core CPU, 64 GB of DDR3-1600 ECC RAM and an NVIDIA Quadro K2000 video card.

When I first got this machine, it had a recent version of Linux Mint installed. It seemed to work fine. I installed BOINC on it (which I do with every computer I get my hands on) and added my typical projects: Einstein@Home, Rosetta@home, MilkyWay@home and World Community Grid. After it downloaded work units and started crunching them, the computer started responding VERY sluggishly. While BOINC works a computer hard, it runs at low priority, so it typically uses only what would otherwise be idle cycles, and a computer running BOINC would normally still be very responsive. In addition, I noticed that the work units were progressing much more slowly than they should. Everything else looked the way it should, however...the CPU was running at about 3 GHz, BOINC tasks were using most of the CPU cycles, etc. Nothing looked out of place to indicate why it would be operating more slowly than normal.

Upon rebooting and paying closer attention, I saw that Linux was spitting out some messages during boot that indicated corrected memory errors. I forget exactly what the errors were, but when I looked them up it sounded like the problem MIGHT not be actual memory errors but a bad driver. Instead of trying to screw around with that, I decided to install a fresh copy of Xubuntu (my normal Linux of choice) to see what would happen. Perhaps unsurprisingly, I got the same results.

Then I installed Windows 10. Windows 10 behaved basically the same way. It got very sluggish when BOINC started up its tasks. I did notice that the System task was using quite a bit of CPU (at least one full core) and also hitting the disk pretty hard. But this is so often the case with Windows that it's hard to say for sure if it is related. So CLEARLY this is a problem of bad memory, right? Well, maybe...but here is where it gets a little weird...

I discovered this "fix" quite by accident. When you install Windows 10, it defaults to putting your computer to sleep after 30 minutes of inactivity. Since I run BOINC I never want this to happen, but I inevitably forget that setting until it happens the first time after a new install. So sure enough, I forgot to change that setting and the computer went to sleep after 30 minutes. I pressed the power button to wake it up and it woke up...but without the sluggishness it had before. Also, BOINC tasks seemed to be progressing at a more reasonable rate. It seems whatever the problem was had been cured by a short nap. A fluke, you say! But no...it's repeatable. If I reboot the computer, it behaves sluggishly when BOINC is running and runs much slower than it seems it should. Put it to sleep and wake it up again, and it performs normally until rebooted again. And when I say rebooted, I don't even mean power cycled...just rebooting brings the problem back.

I also ran Windows Memory Diagnostic and it found no errors.

This machine has the latest BIOS available and Windows does not show any missing drivers (and of course seemingly the same problem existed under Linux as well...I'm not sure whether a sleep and a wake-up would have fixed it there too). The memory is new, but that doesn't eliminate the possibility of a bad stick (it has four 16 GB modules in a quad-channel configuration). It just seems odd that putting the computer to sleep and waking it up solves the problem and rebooting brings it back. What could possibly cause that? Since the problem exists across multiple operating systems, surely it is a hardware issue of some sort.

Personally, I'm leaning toward a slightly more obscure solution than bad memory as I don't see how sleeping and waking could possibly fix that. Maybe a flaky memory controller on the CPU? I would think if that were the case though that I would see all kinds of stability issues but there are none, either in its "sluggish" state or in its "fixed" state. It can run for hours on end either way with no crashes or other signs of instability, all the while using nearly 100% of the CPU and GPU for BOINC tasks.

Poking around in Windows Event Viewer, I did find a whole crapload of WHEA-Logger errors that seemed to correspond to when the system was sluggish that say "A corrected hardware error has occurred. A record describing the condition is contained in the data section of this event." But the "data" section might as well be random numbers for all the use it is.

For one reason or another, I suspect that I am getting a constant stream of memory errors (that are ECC correctable so no crash) in the sluggish state and this is somehow resolved by sleeping and waking. Can correctable memory errors lead to a sluggish system? Why would sleeping and waking resolve this sort of problem?
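One way to put a number on the difference between the two states would be to time a fixed memory-heavy workload right after a reboot and again after a sleep/wake cycle. What follows is a rough sketch in Python, not a proper benchmark: the function name `time_memory_copy` and the buffer size are arbitrary choices for illustration, it only exercises one core, and the absolute figure means little on its own. Only the ratio between the "sluggish" and "fixed" runs would be interesting.

```python
import time


def time_memory_copy(size_mb=64, repeats=5):
    """Return the best observed copy throughput in MB/s.

    Copies a buffer of size_mb megabytes `repeats` times and keeps the
    fastest run, which reduces noise from other processes.
    """
    buf = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy = bytes(buf)  # forces a full read and write of the buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        del copy  # release the copy before the next iteration
    return size_mb / best


if __name__ == "__main__":
    print(f"{time_memory_copy():.0f} MB/s")
```

Running it with BOINC suspended in both states would keep the comparison fair. A large gap would at least suggest the slowdown is tied to memory access rather than to BOINC itself.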

Like I said, it's computer demons...

I guess by process of elimination I could try swapping out the memory and then the CPU, I just don't know if I have appropriate spares lying around at the moment. And it's not like this is my primary computer...it's just a toy to play with so as long as sleeping and waking it resolves the issue, then that's what I'll do. It just seems so weird.



6 comments

That is strange... Did you try a live cd?


Well, I installed Xubuntu from a live CD. The sluggishness really doesn't become noticeable until the CPU is under load though. That didn't really happen until after it was installed. When booting from the CD I did see errors similar to:

kernel: [5585143.108121] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#0_DIMM#0...

As I understand it this usually indicates a correctable memory error.
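To see whether the errors always land on the same module, the log lines could be tallied by location. This is a minimal sketch that assumes the messages follow the same shape as the line quoted above; the regular expression and the helper names `parse_edac_ce` and `tally_by_dimm` are made up for illustration and may need adjusting for other kernels or EDAC drivers.

```python
import re
from collections import Counter

# Pattern modeled on the single log line quoted above (an assumption,
# not a documented format).
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<message>.+?) on "
    r"CPU_SrcID#(?P<socket>\d+)_Channel#(?P<channel>\d+)_DIMM#(?P<dimm>\d+)"
)


def parse_edac_ce(line):
    """Return a dict describing one corrected-error line, or None if no match."""
    match = EDAC_CE.search(line)
    if match is None:
        return None
    return {
        "mc": int(match.group("mc")),
        "count": int(match.group("count")),
        "message": match.group("message"),
        "socket": int(match.group("socket")),
        "channel": int(match.group("channel")),
        "dimm": int(match.group("dimm")),
    }


def tally_by_dimm(lines):
    """Sum corrected-error counts per (socket, channel, dimm) location."""
    totals = Counter()
    for line in lines:
        event = parse_edac_ce(line)
        if event is not None:
            key = (event["socket"], event["channel"], event["dimm"])
            totals[key] += event["count"]
    return totals
```

Feeding it the kernel log (for example, the saved output of `dmesg`) would show whether one channel or DIMM slot accounts for nearly all of the corrected errors, which would make swapping that one module the obvious first test.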

Windows also logged correctable hardware errors. It's just odd that the problem goes away after putting the computer to sleep and waking it up, and that a reboot (soft or hard) brings the problem back. At least that works in Windows. I discovered that completely by accident, so I didn't try it in Linux to see if it fixed the issue there as well.


I've never heard of anything like it... Is the computer still under warranty?


No, the computer is many years old and long out of warranty. I'm really just trying to learn something and satisfy my own curiosity.


I hope you will be able to find the answer. Things like that are a good challenge :) ...
