
On again, off again, stability woes


mackerel

Member
Joined Mar 7, 2008
I've kinda always half joked that a system is only stable until it is unstable. On my main system, this is starting to get tiresome. (specs in sig)

To recap, I tend to run prime number finding software 24/7, comparable to Prime95. I aim to tune all my systems for best performance without cooking them, while keeping 100% stability. Well, as close to 100% stability as I can prove, anyway. In the case of my main system, each time I think I'm 100% stable, give it some weeks or months and I'll get a detected error. I'm fairly confident it isn't the software or some other external influence, as my other systems don't suffer from this.

I'm wondering if it is partly down to a recent update which enabled multi-threaded code. Before that, you would run one task per real core, since HT gives no benefit. In that scenario, for all but the smallest tests, the workload will exceed the CPU cache and hit the ram hard. This is why I keep tooting about ram bandwidth being lacking in modern systems. The change to multi-threading brings a lot of the work within the CPU cache sizes, so performance is arguably hitting the CPU and cache harder, and is less limited by ram bandwidth, if at all. Could this be why I'm starting to see a change? At least partially.

The first unexplained bad unit in the multi-thread era was on April 23. That task was a 1440k FFT, so it would take about 11MB of data, thus not entirely fitting into the CPU cache of 8MB. I didn't have time to investigate, so all I did was drop the CPU clock from 4.2 fixed to 4.0 fixed. That's like stock without turbo, so it must be safe? Voltage was set to 1.275. It was previously 1.25, but I had other suspected instability so I thought I'd try increasing it.

That seemed fine until yesterday, when I got another bad unit. I had switched subprojects since then, and this one was using a 640k FFT, or 5MB. This would fit in cache. If that means I can rule out the ram... it must be the CPU.
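To put rough numbers on the two failures, here's a quick sketch of the working-set arithmetic. It assumes double-precision FFT data at 8 bytes per element; the 8MB figure is the L3 cache size mentioned above.

CACHE_BYTES = 8 * 1024 * 1024  # the 8MB L3 mentioned above

def fft_working_set_bytes(fft_len_k):
    # fft_len_k thousands of elements, 8 bytes per double-precision element (assumed)
    return fft_len_k * 1024 * 8

for fft_k in (1440, 640):
    size = fft_working_set_bytes(fft_k)
    verdict = "fits in" if size <= CACHE_BYTES else "exceeds"
    print(f"{fft_k}k FFT: about {size / 2**20:.2f} MB of data, {verdict} the 8MB cache")

That prints roughly 11.25MB for the 1440k FFT (exceeds cache) and 5MB for the 640k FFT (fits), matching the sizes quoted above.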

I had a look at the mobo bios, and it was somewhat out of date, from mid last year. Ok, I'll take it up to current. The interim releases had the usual vague notes: improve ram stability, system stability, compatibility, whateverbility. The update of course nuked my existing settings, so I wrote down the fan settings I had so I could restore them. This time I thought that while I'll leave the ram at XMP, I'm going to let the CPU run at stock Intel settings. That is, auto voltage, 4.2 single core turbo, 4.0 otherwise. No Asus all-core enhancement or stuff like that.

I got back into Windows, checked the voltages and... saw it was running at 1.20v! Auto actually lowered it? VID is showing 1.3+. It is kinda coming back to me now: I think I previously increased it from 1.20 to 1.250 and then 1.275 to see if it helped, and it appears that it didn't.

Temps are nothing to worry about. There's a Noctua D14 on it and recent max temps are around 70C.

That brings me almost full circle back to the ram. While it shouldn't be implicated in the 2nd error, I can't rule out any and all ram access during that time. This system had ram stability problems early on with Ripjaws 4 3333, and the current Ripjaws V 3200C16 2x8 seems to be ok. Is it?...

At times like these I start thinking about junking the system and getting a modern dual Xeon with ECC or something.
 
Two things tend to leave my rig unstable. One is Win10 updating; it's now to the point that when I run a calculation I block the internet connection to the machine and hook it up to its own wireless router so I can access it from downstairs.
The other issue is that I have one unstable core, #6, that was causing mayhem. No stress test could find it; I just had to run the workload and turn off one core at a time until I found the one causing the temp to fall after 10-12 hours and the rate of completion to increase.
You might try an increase in cache voltage,
a bump in SA voltage,
increasing the ram voltage a notch,
reducing the ram speed,
or loosening the ram timings one notch.
 
I'm wondering if it is partly down to a recent update which enabled multi-threaded code. Before that, you would run one task per real core, since HT gives no benefit. In that scenario, for all but the smallest tests, the workload will exceed the CPU cache and hit the ram hard. This is why I keep tooting about ram bandwidth being lacking in modern systems. The change to multi-threading brings a lot of the work within the CPU cache sizes, so performance is arguably hitting the CPU and cache harder, and is less limited by ram bandwidth, if at all. Could this be why I'm starting to see a change? At least partially.

This statement seems to contradict itself. Guess I'm missing something here.
 
Only one core at a time can access the system memory for a read or write operation. Memory bandwidth is not a problem; memory speed and latency are. Carbon nanotube memory data access is measured in picoseconds, while DDR4 data access is measured in nanoseconds.
 
This statement seems to contradict itself. Guess I'm missing something here.

In the single-thread days, you ran one task per real core to load up the CPU. On a quad core, you have 4 sets of data being worked on. Once the total size exceeds the CPU cache, performance drops as they all compete for the limited ram bandwidth. With the move to multi-threading, all cores can work on the same data set, so the fixed amount of cache goes further before the working data exceeds it. It still isn't well understood exactly how performance is impacted, but it seems like there is enough commonality between the cores and the data that ram demands are still reduced in that case. That is, 4 cores doing one thing faster is less demanding on ram bandwidth than 4 cores doing 4 separate things.
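A rough sketch of that argument, with assumed numbers for illustration (a quad core with 8MB of L3 and the ~11.25MB task from earlier):

CORES = 4          # assumed quad core
CACHE_MB = 8.0     # assumed L3 size
TASK_MB = 11.25    # e.g. the 1440k FFT from earlier

one_task_per_core = CORES * TASK_MB   # 4 independent data sets in flight
one_mt_task = TASK_MB                 # all cores share one data set

for label, footprint in (("one task per core", one_task_per_core),
                         ("one multi-threaded task", one_mt_task)):
    spill = max(0.0, footprint - CACHE_MB)
    print(f"{label}: {footprint:.2f} MB working set, {spill:.2f} MB spills to ram")

With one task per core the combined working set is 45MB, most of which has to come from ram; with one multi-threaded task only a few MB spill out of the cache.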

Only one core at a time can access the system memory for a read or write operation. Memory bandwidth is not a problem; memory speed and latency are. Carbon nanotube memory data access is measured in picoseconds, while DDR4 data access is measured in nanoseconds.

Now it is my turn to miss something, as I currently see this as somewhere between irrelevant and meaningless. Memory bandwidth is a very real problem. Latency masking in this type of application is pretty good, so it has minimal effect on performance. Both of these are easy to observe, if a bit time consuming.
 
What was the first version of Windows to incorporate multi-threading? Seems like we have had that ability for a long time now. When were the "single thread days"?
 
The OS doesn't matter, as it is the application that has to use it. The specific piece of software I'm using was updated earlier this year to support it.
 
Now it is my turn to miss something, as I currently see this as somewhere between irrelevant and meaningless. Memory bandwidth is a very real problem. Latency masking in this type of application is pretty good, so it has minimal effect on performance. Both of these are easy to observe, if a bit time consuming.

System memory frequency and latency go hand in hand. Here is an exaggerated example to make my point: if the latency meant it took a minute to transfer data, that would be one minute per access even at 1600MHz. So it is not the bandwidth that is the problem, it is the latency.

Bandwidth, by definition, is bit rate only.
 
But again, in real-life computing, and even in a lot of benchmarks, latency seems to make little difference, since so much of the memory dependency is handled by the large caches of today's modern processors.
 
But again, in real-life computing, and even in a lot of benchmarks, latency seems to make little difference, since so much of the memory dependency is handled by the large caches of today's modern processors.

In real life, system memory speed has gone up while latency timings have increased. Memory bench tests benefit from pipelining data.
And there are other latency timings to add up: CAS Latency 14, RAS to CAS Delay 14, RAS Precharge 14, Active to Precharge Delay 34, Command Rate 2.

14-CL: CAS Latency. The time it takes between a command having been sent to the memory and when it begins to reply to it. It is the time between the processor asking for some data from the memory and the memory returning it.
14-tRCD: RAS to CAS Delay. The time it takes between the activation of the line (RAS) and the column (CAS) where the data are stored in the matrix.
14-tRP: RAS Precharge. The time it takes between disabling the access to a line of data and the beginning of the access to another line of data.
34-tRAS: Active to Precharge Delay. How long the memory has to wait until the next access to the memory can be initiated.
2-CMD: Command Rate. The time it takes between the memory chip having been activated and when the first command may be sent to the memory. Sometimes this value is not announced. It usually is T1 (1 clock cycle) or T2 (2 clock cycles).

Read more at http://www.hardwaresecrets.com/understanding-ram-timings/#CgMzY7HRqV4DqU0F.99

Take a look at a chart of CL in nanoseconds for DDR2 vs DDR4.
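For anyone who doesn't want to dig out a chart, CL in cycles converts to nanoseconds as CL * 2000 / data rate, since the I/O clock runs at half the MT/s figure. A quick sketch, using typical module speeds and CL values as examples rather than anything measured in this thread:

def cas_ns(cl_cycles, data_rate_mts):
    # I/O clock period in ns is 2000 / data rate (MT/s); CAS latency is CL periods
    return cl_cycles * 2000.0 / data_rate_mts

examples = (("DDR2-800 CL5", 5, 800),
            ("DDR3-1600 CL9", 9, 1600),
            ("DDR4-3200 CL16", 16, 3200))

for name, cl, rate in examples:
    print(f"{name}: {cas_ns(cl, rate):.2f} ns")

That gives 12.5ns, 11.25ns and 10ns respectively: the cycle counts climb each generation while the time in nanoseconds stays in roughly the same range.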

Because memory modules have multiple internal banks, and data can be output from one during access latency for another, the output pins can be kept 100% busy regardless of the CAS latency through pipelining; the maximum attainable bandwidth is determined solely by the clock speed. Unfortunately, this maximum bandwidth can only be attained if the address of the data to be read is known long enough in advance; if the address of the data being accessed is not predictable, pipeline stalls can occur, resulting in a loss of bandwidth. For a completely unknown memory access (AKA Random access), the relevant latency is the time to close any open row, plus the time to open the desired row, followed by the CAS latency to read data from it. Due to spatial locality, however, it is common to access several words in the same row. In this case, the CAS latency alone determines the elapsed time. https://en.wikipedia.org/wiki/CAS_latency
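As a worked instance of that last part, here is a small sketch using the 14-14-14-34 timings quoted above; the DDR4-3200 data rate is an assumption, since the speed those timings run at wasn't stated:

DATA_RATE = 3200               # MT/s, assumed
CYCLE_NS = 2000.0 / DATA_RATE  # one I/O clock period in ns
CL = TRCD = TRP = 14           # cycles, from the timings quoted above

open_row_access = CL * CYCLE_NS               # row already open: pay CL only
random_access = (TRP + TRCD + CL) * CYCLE_NS  # close row, open new row, then CL

print(f"open-row access: {open_row_access:.2f} ns")
print(f"fully random access: {random_access:.2f} ns")

So a fully random access pays about 26ns versus about 9ns for an open-row hit, which is why the pipelining and spatial locality described above matter so much.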
 
At times like these I start thinking about junking the system and getting a modern dual Xeon with ECC or something.

Yep, I built one of those with an Intel S2600CP dual LGA 2011 board, a pair of E5-2670s cooled by Hyper 212s, and 64GB of DDR3-1333 Reg/ECC. 100% stable 100% of the time, and fairly cheap.
 
How did we get onto the ram discussion? I don't need a theoretical discussion. I know the following facts: bandwidth matters, dual rank is better than single rank, latency makes barely any measurable difference. If some different application might behave differently, I don't care. I'm not running it.

For dual Xeons, I was thinking of something more current, since I can't live without AVX2. The only thing stopping me at the moment is that I want to hold out for the rumoured near-future AVX-512, as that will smash future Ryzen, if I can afford it :D
 
How did we get onto the ram discussion? I don't need a theoretical discussion. I know the following facts: bandwidth matters, dual rank is better than single rank, latency makes barely any measurable difference. If some different application might behave differently, I don't care. I'm not running it.

This is where I'm coming from. Spending much more to get lower latency RAM at the same frequency just doesn't seem to give much if any performance increase, at least in most applications.
 
How did we get onto the ram discussion? I don't need a theoretical discussion. I know the following facts: bandwidth matters, dual rank is better than single rank, latency makes barely any measurable difference. If some different application might behave differently, I don't care. I'm not running it.
Sorry for the facts I posted. They are over your head.

This is where I'm coming from. Spending much more to get lower latency RAM at the same frequency just doesn't seem to give much if any performance increase, at least in most applications.

I can't tell the difference between 2133 and 3200 memory speed in most applications. :D
 
I don't know if it has changed with DDR4, but with DDR3 I have seen many benchmark evaluations of latency vs. frequency, and frequency always won out.
 