I've kinda always half-joked that a system is only stable until it is unstable. On my main system, this is starting to get tiresome. (specs in sig)
To recap, I tend to run prime number finding software 24/7 that is comparable to Prime95. I aim to tune all my systems for best performance without cooking them, while keeping 100% stability. Well, as close to 100% stability as I can prove, anyway. In the case of my main system, each time I think I'm 100% stable, give it some weeks or months and I'll get a detected error. I'm fairly confident it isn't the software or other external influence, as my other systems don't suffer from this.
I'm wondering if part of it is a recent update which enabled multi-threaded code. Before that, you would run one task per real core, since HT gives no benefit. In that scenario, for all but the smallest tests, the workload will exceed the CPU cache and it will hit ram hard. This is why I keep tooting about ram bandwidth being lacking in modern systems. With the change to multi-thread, a lot of work now fits within the CPU cache sizes, so it is arguably hitting the CPU and cache harder, and is less limited by ram bandwidth, if at all. Could this be why I'm starting to see a change? At least partially.
The 1st unexplained bad unit of the multi-thread era was on April 23. That task was a 1440k FFT, so it would take about 11MB of data, thus not entirely fitting into the CPU cache of 8MB. I didn't have time to investigate, so all I did was drop the CPU clock from 4.2 fixed to 4.0 fixed. That's like stock without turbo, so it must be safe? Voltage was set to 1.275. It was previously 1.25, but I had other suspected instability so I thought I'd try increasing it.
That seemed fine until yesterday, when I got another bad unit. I had switched subproject since then, and this one was using a 640k FFT, or 5MB. That would fit in cache. If that means I can rule out the ram... it must be the CPU.
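The cache-fit reasoning above can be sketched with some quick arithmetic. This is a rough estimate only: it assumes 8 bytes (one double) per FFT element and an 8MB L3 cache, and ignores scratch buffers and twiddle tables the real software also keeps around.

```python
# Rough working-set estimate for Prime95-style FFT lengths.
# Assumptions (not measured from this system): 8 bytes per FFT
# element, 8 MB of L3 cache, no scratch/twiddle overhead.
L3_CACHE_MB = 8

def fft_working_set_mb(fft_len_k: int) -> float:
    """Approximate data size in MB for an FFT of fft_len_k * 1024 elements."""
    return fft_len_k * 1024 * 8 / (1024 * 1024)

for fft_k in (640, 1440):
    mb = fft_working_set_mb(fft_k)
    verdict = "fits in" if mb <= L3_CACHE_MB else "exceeds"
    print(f"{fft_k}k FFT ~ {mb:.2f} MB -> {verdict} {L3_CACHE_MB} MB L3")
```

By this estimate the 640k task (~5MB) stays inside L3 while the 1440k task (~11.25MB) spills to ram, which is why an error on the 640k task points away from the memory subsystem.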
I had a look at the mobo bios, and it was somewhat out of date, from mid last year. Ok, I'll take it up to current; the interim releases had the usual vague "improve ram stability, system stability, compatibility, whateverbility". The update of course nuked my existing settings, so I wrote down the fan settings to restore them, and this time I thought: while I'll leave the ram at XMP, I'm going to let the CPU run stock Intel. That is, auto voltage, 4.2 single-core turbo, 4.0 otherwise. No Asus all-core enhancement or stuff like that.
I got back into windows, checked the voltages and... saw it was running 1.20v! Auto actually lowered it? VID is showing 1.3+. It is kinda coming back to me now: I think I previously increased it from 1.20 to 1.250, then 1.275, to see if it helped, and it appears it didn't.
Temps are nothing to worry about. There's a Noctua D14 on it and recent max temps are around 70C.
That brings me almost full circle back to the ram. While it shouldn't be implicated in the 2nd error, I can't rule out any and all ram access during that time. This system had ram stability problems early on with Ripjaws 4 3333, and the current Ripjaws 5 3200C16 2x8 seems to be ok. Is it?...
At times like these I start thinking about junking the system and getting a modern dual Xeon with ECC or something.