Ram bottleneck much?

mackerel · Dec 3, 2017

Above is the result of a bunch of testing... let me explain.

System
i3-8350k at either 4.0 GHz core and 4.0 GHz cache, or 5.0 GHz core, -3 AVX offset, 4.4 GHz cache
Asrock Z370 Pro4 bios 1.30
Ram is TridentZ 3000C14, 2x8GB, containing Samsung B-die
Rest probably doesn't matter.

Test software is the latest Aida64, or Prime95 29.3 set to benchmark a 4096k FFT in one-worker and four-worker configurations. 4 cores with one worker puts all cores on the same data set, 32MB in size. 4 cores with 4 workers is one worker per core, so the total load is 4x 32MB = 128MB. Because of the AVX offset, the 5 GHz configuration would be running at 4.7 GHz during the Prime95 runs.
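For anyone wanting to check the sizes above, here is a rough sketch of the arithmetic. It assumes 8 bytes (one double) per FFT element, which is what makes a 4096k FFT come out at 32MB, and it assumes each AVX offset step is one 100 MHz multiplier bin. The function names are made up for illustration.

```python
# Assumption: 8 bytes per FFT element, "4096k" = 4096 * 1024 elements.
BYTES_PER_ELEMENT = 8

def fft_working_set_mib(fft_size_k, workers=1):
    """Total data set size in MiB when each worker has its own FFT."""
    elements = fft_size_k * 1024
    return elements * BYTES_PER_ELEMENT * workers / (1024 * 1024)

def avx_clock_mhz(core_mhz, avx_offset_bins, bclk_mhz=100):
    """Core clock under AVX load, assuming one offset step = one BCLK bin."""
    return core_mhz - avx_offset_bins * bclk_mhz

print(fft_working_set_mib(4096, workers=1))  # 32.0
print(fft_working_set_mib(4096, workers=4))  # 128.0
print(avx_clock_mhz(5000, 3))                # 4700
```

Either way the data set is far bigger than the CPU cache, which is the point of picking this size.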

I've been overclocking the ram, hence the various scenarios above.
The baseline is just the SPD default, 2133. Safe, slow.
The ram contains an XMP profile, which sets 3000C14.
After some tinkering, I got 3600C16 running, requiring some voltage tweaking for stability.
Here I hit a wall, and couldn't get 3700 running no matter what. Then one day I decided to try 3733, and it worked first time! I pushed on to...
3866, which required some more manual tuning of voltages.
Then my self torture started, and I went about tweaking the secondary and tertiary timings. This took a lot of trial and error, mostly error. I haven't done a final stability test, but what I have right now seems stable enough.

The Prime95 results are iter/s, where higher is better. The code is very efficient at getting work done on the core, and for data sets that don't entirely fit in CPU cache, ram performance becomes significant. The test size was chosen to ensure that would be the case. I had previously determined a rough rule of thumb: to not be significantly ram limited, a quad core Intel needs dual channel ram with a rated speed comparable to the core clock. E.g. for a 4 GHz CPU, you'd aim for 4000 rated ram. That's tricky! This is also in part why I'm concerned about rising core counts without a corresponding rise in memory channels.
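As a rough sketch of that rule of thumb, together with the theoretical peak bandwidth of each ram speed: this assumes the usual 64-bit (8 byte) bus per DDR4 channel, and the function names are hypothetical. Measured bandwidth will come in below these peak figures.

```python
def peak_bandwidth_gb_s(rated_mt_s, channels=2, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s: transfers/s * bytes per transfer * channels."""
    return rated_mt_s * bus_bytes * channels / 1000

def meets_rule_of_thumb(core_mhz, rated_mt_s):
    """Quad core, dual channel rule of thumb: rated ram speed >= core clock in MHz."""
    return rated_mt_s >= core_mhz

for speed in (2133, 3000, 3600, 3866, 4000):
    print(speed, peak_bandwidth_gb_s(speed), meets_rule_of_thumb(4000, speed))
```

By this rule even 3866 falls slightly short for a 4 GHz quad core, and well short for one at 4.7 GHz.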

Results here certainly do illustrate the ram limiting in action. For the overclocked CPU configuration, an 81% increase in ram speed (plus timing optimisations) gets up to a 69% throughput increase! That suggests we're still deep in a ram bandwidth limited situation. By testing a near-stock and an overclocked condition, we can also see the CPU overclock gains are held back significantly by the ram. This is why, in the past, I haven't overclocked much on my prime number finding systems: the ram is limiting, so trying to make the CPU faster just ends up burning more power without getting much more throughput.
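The percentages can be checked quickly. This is only a sketch: the 69% is the "up to" figure quoted above, and the ratio lumps the timing optimisations in with the speed increase, so it is not a pure speed scaling number.

```python
def pct_increase(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100

ram_gain = pct_increase(2133, 3866)   # SPD default vs best overclock
throughput_gain = 69.0                # best-case Prime95 gain quoted in the post

print(round(ram_gain))                        # 81
print(round(throughput_gain / ram_gain, 2))   # 0.85
```

Throughput following ram speed at roughly 0.85 to 1 is what you would expect from a workload that is mostly waiting on memory.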

It is also interesting to compare 3866 with auto timings against 3866 with manual timings. Depending on the scenario, we're seeing a 6% to 14% increase. Remember, this is the same speed, but with highly optimised timings. In the past I had tried adjusting primary timings and they didn't make much difference. It would seem the key lies elsewhere. In the hopes of explaining this difference, I took the Aida64 measurements also. These don't show so much difference, around 4 to 6%. So there is still something missing from the picture here. Even if you stack the latency difference on top, that only gains up to another couple of %.

Woomack · Dec 4, 2017

Not many programs will show the same behaviour, so high frequency memory is not always the best option. Usually the memory controller and additional latency cause worse scaling at higher frequency. On some platforms overclocking cache/IMC/NB helps; on others it barely changes anything or runs into stability limits.
RAM itself has a fixed max bandwidth. Software like AIDA64 shows the maximum theoretical bandwidth at the given settings. Usually it's ~20% lower than the maximum which you see in benchmarks.
Additional timings help, and if you set them right they help even more than the main timings. It's safe to leave them at auto as most programs won't really be affected, but if you want max performance then it's good to set them all manually. On the other hand, if you set values that are too low, too high or simply wrong for some timings (some have to match, or be higher or lower than, other timings) then the memory will lose stability or performance will be worse. For example, when you set CL it has to be the same as or higher than wCL. Most new motherboards will correct some mistakes but not all.

One more thing for you to test and it's the easiest way to set sub-timings:
- set memory at auto
- write down all sub timings ( motherboard will show that )
- switch to manual or XMP mode
- set higher memory frequency like 2666-3000 but also set all timings like you saw at auto settings
- check if it's working, if yes then set 3200, 3466, 3600, ...

Usually there are no big issues passing 3200; in some cases you may need higher voltage or changes to single timings. It's actually how you can set max performance for benchmarks, but you will know that the timings are correct as they come from the auto/JEDEC profile.

With Z170/Z270/Z370 chipsets and Samsung B-die, the above settings should let you pass 4000 on good motherboards, and can still be adjusted further. However, stability at 3600+ depends on the motherboard, traces and other things that may cause interference.

mackerel · Dec 4, 2017

With hindsight I should have taken representative power readings while I was doing all this. Be interesting to see if ram improves performance/watt (obviously aggressive CPU OC will destroy that anyway). I know my use case is somewhat niche, but it is my use case, and I do care about that performance.

Yes, I still need to try what I'll now call the "Woomack method" of setting timings. I actually never finished tweaking my manual attempt, and left it as close enough. Maybe I could gain a little more by being more aggressive with tRFC and tREFI, as I've only used safer assumptions there for now. Given the time it takes, it is not something I can do as a matter of routine. This is a one off, so it will be interesting to see the diff between that and auto settings for a much faster configuration.

I did have a quick go at stabilising 4000 but didn't manage to find it. I also never retried lowering primary timings with more voltage...

Woomack · Dec 4, 2017

tRFC helps in benchmarks that like fast memory ... Samsung single rank modules can work at a low tRFC like 280 at 3600+ while auto is something like 500+. tREFI can be even 64k+ and it won't change much.

mackerel · Dec 4, 2017

I'm currently running tRFC at 380, down from I think 526/528 of auto. I know 300 doesn't boot, but there's still that range in between untested. tREFI I increased from ~11000 auto to 16000 so far.
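To put those numbers in perspective, here is a rough sketch of how much time the ram spends refreshing. It uses the common simplification that the memory is unavailable for tRFC cycles once every tREFI cycles, and assumes both values are in memory clock cycles with the clock at half the rated speed. The auto values are the approximate ones quoted above, and the function names are made up for illustration.

```python
def cycles_to_ns(cycles, rated_mt_s):
    """Convert a timing in memory clock cycles to nanoseconds (clock = rated / 2)."""
    clock_mhz = rated_mt_s / 2
    return cycles * 1000 / clock_mhz

def refresh_overhead(trfc, trefi):
    """Simplified fraction of time spent in refresh: tRFC out of every tREFI."""
    return trfc / trefi

auto = refresh_overhead(526, 11000)    # approximate auto values from the post
manual = refresh_overhead(380, 16000)  # current manual values from the post

print(f"auto:   {auto:.1%}")    # 4.8%
print(f"manual: {manual:.1%}")  # 2.4%
print(f"tRFC 380 at 3866: {cycles_to_ns(380, 3866):.0f} ns")  # 197 ns
```

Under this simple model the manual settings roughly halve the refresh overhead, but that is only a couple of percent of total time, so refresh on its own probably doesn't explain the full gain from manual timings.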

Woomack · Dec 4, 2017

We could have some memory bandwidth ranking, or some benchmark which responds well to memory performance with various settings, but it's hard to find someone who would keep it up to date. There are many similar threads about timings and memory settings with bandwidth results.
Geekbench has various memory benchmarks and is free. Just one idea.

mackerel · Dec 4, 2017

I'm not that familiar with Geekbench, but isn't the free version crippled in the way that it runs, as is often the case? Can't be sure, might even be time limited, but I might be thinking of something else. I already have Aida64 for synthetics, does Geekbench do it better?

That's why I use the Prime95 built in benchmark. At large FFT sizes it is highly ram dependent. Since 29.x there is a new benchmark window making it a lot easier to run. I use 4096k both because it is a round number and because it is ball park representative of the actual work my systems are doing. I did do some further analysis of my results earlier, and there isn't a linear correlation between P95 speeds and measured bandwidth or latency. It looks more like a power or exponential type curve, but I need more data points before trying to fit anything to it.
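For when there are enough data points, a minimal sketch of fitting a power curve y = a * x^b by doing a straight line least squares fit in log-log space. The numbers below are NOT measurements: they are synthetic points generated from a known curve, purely to check the fit recovers it. Real bandwidth and iter/s pairs would go in their place.

```python
import math

def fit_power_law(xs, ys):
    """Least squares fit of y = a * x**b via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept gives the scale factor
    return a, b

# Synthetic placeholder data from y = 2 * x**0.5, not real results.
xs = [10.0, 20.0, 30.0, 40.0, 50.0]
ys = [2 * x ** 0.5 for x in xs]

a, b = fit_power_law(xs, ys)
print(round(a, 3), round(b, 3))
```

An exponential curve can be tried the same way by taking the log of y only, then comparing which of the two fits leaves smaller residuals.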

Woomack · Dec 5, 2017

AIDA64 is better for bandwidth tests, but updates are really frequent and it's hard to keep one ranking based on 50 versions (literally, updates are sometimes every 1-2 weeks).
Geekbench 3 is free in the 32-bit version. Geekbench 4 is free in 64-bit too but has no compute benchmark, which is more of an additional benchmark so it doesn't matter. Also, Geekbench updates don't affect scores.

We don't have any ranking like that and forum users are often asking questions so I thought it would be nice to start something ... I just have no time to keep it up to date.
