
Benchmark(s) sensitive to ram latency


mackerel
Member
Joined Mar 7, 2008
I have an idea for a test that I don't think exists, but I think it could be done indirectly if there is a benchmark that can be run on a single core and is as sensitive to RAM latency as possible. What would such a benchmark be? I have a vague recollection that one of the Pi benches might fit that requirement.

Basically I want to try to measure differences in latency between individual cores and RAM. Especially on the Intel side, there could be some variation due to the ring bus to take advantage of.
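
As a rough illustration (my sketch, not an existing tool): the usual trick for this is a pointer-chasing microbenchmark pinned to one core at a time. The sketch below assumes Windows/MSVC; the buffer size, the use of SetThreadAffinityMask, and QueryPerformanceCounter timing are all my choices, not anything prescribed.

```c
/* latency_probe.c -- hypothetical sketch: pin to one core, time dependent loads.
   Build with MSVC: cl /O2 latency_probe.c
   Usage: latency_probe <core_index>                                            */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define N ((size_t)1 << 25)   /* 32M slots * 8 B = 256 MB, far beyond any LLC */

int main(int argc, char **argv)
{
    unsigned core = (argc > 1) ? (unsigned)atoi(argv[1]) : 0;
    size_t *chain = malloc(N * sizeof *chain);
    if (!chain) return 1;

    /* Sattolo's shuffle: turns the identity array into one random cycle, so
       every load depends on the previous one and the prefetcher can't help. */
    for (size_t i = 0; i < N; i++) chain[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        /* combine two rand() calls; slight modulo bias is fine for a sketch */
        size_t j = ((size_t)rand() * ((size_t)RAND_MAX + 1) + (size_t)rand()) % i;
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    /* Pin this thread to the requested core before measuring. */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << core);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = chain[p];   /* dependent load chain */
    QueryPerformanceCounter(&t1);

    /* printing p stops the compiler from discarding the loop */
    printf("core %u: %.1f ns per load (p=%zu)\n", core,
           (double)(t1.QuadPart - t0.QuadPart) * 1e9 / (double)freq.QuadPart / (double)N,
           p);
    free(chain);
    return 0;
}
```

Run it once per core index and compare the ns-per-load figures; with SMT enabled, remember that two logical cores map onto each physical core, so the mapping from mask bit to physical core needs care.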
 
SuperPi 32M is the one I also recall being memory sensitive. Not sure if it was more bandwidth- or latency-heavy, though.
 
That could be the one. I think that was latency, as I sucked at that bench. I could get higher speeds but couldn't get the timings down...

I did have one further thought: I wonder if the AIDA64 test would work if forced by affinity onto a single core...
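(Side note, not from the thread: on Windows you can launch a program pre-pinned from a command prompt with something like `start "" /affinity 1 SomeBench.exe`, where the value is a hex mask of allowed logical cores, so 1 = core 0, 2 = core 1, 4 = core 2, and so on.)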
 
Benchmarks react not to latency alone but to overall memory performance, which is something like a balance between high bandwidth and low latency. In synthetic tests this roughly corresponds to the memory copy result (not a direct measure, but something like memory performance in general). This is what AIDA64 points out in its latest versions (after the test there is a question mark in a circle above the read/write/copy results).
Another way to check it is to run 'winsat mem' from a command prompt. That shows how memory "speed" translates into performance in the Windows environment.

A higher memory clock lowers latency, and tighter timings lower it too. A balance between a high clock and tight timings always gives the best results. On Intel, tight timings count for much more than on modern AMD.
On quad-channel platforms there is already plenty of bandwidth and the memory clock is usually limited, so there is a bigger performance gain from tight timings than from pushing memory frequency to its limits.

Most benchmarks used in competitive benchmarking (at least those that give good points) don't react particularly well to high memory performance. You can get about 50 points more in Cinebench R11/15/20. The 3DMarks react well in some configurations (the physics tests). In the aforementioned SuperPi 32M, memory is really important, and it's probably the only benchmark where it's worth spending a lot of time on memory tweaking. x265 benchmarks get somewhat better results too, but nothing significant.

Some more examples:
On Intel Z3xx, 4000 CL12-12-12 2N gives similar results to 4800-4900 18-18-18 2N in most benchmarks. On X299, 3600 12-12-12/13-13-13 seems optimal, and it will be hard to get past 3733 at timings that tight. On AMD, 3600 CL14 and below is optimal; 3800-4000 is of course better but usually not possible. On 3000-series Ryzen the next step seems to be 5000+, as 4700-4800 is a bit slower or faster depending on the test.
 
Thanks for the info, but it is not relevant to my intended application of the test I proposed. My goal is to use the results to try to work out the logical structure of core placement, and possibly from that the physical placement. Particularly on a ring/mesh bus, some cores will be closer to the IMC than others. Ideally I would also need a test to work out the logical distance between cores, but that's outside the scope of my original post. I have seen a Linux-only bit of software that claims to do that, although I forget its name for now.
 
I guess I missed the point. You could also check a compress/decompress test like 7-Zip, which has a built-in benchmark. There you can set the number of threads and the amount of RAM used, and it reacts to memory speed.
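(A hedged usage note, not from the post above: 7-Zip's built-in benchmark is its `b` command, which per the 7-Zip documentation takes a thread count and dictionary size, e.g. `7z b -mmt1 -md26` for a single-threaded run with a 2^26-byte (64 MB) dictionary; a bigger dictionary pushes the working set further out of cache.)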

On the other hand, I somewhat doubt you will see a difference between cores, as everything goes through the CPU cache, which is shared in most modern series. Most results in memory benchmarks are based on combined RAM + CPU cache + IMC speed, and RAM settings alone don't always give significant results. As you can see on Skylake-X, bumping the mesh/cache clock gives much better results than overclocking the RAM (above a certain point, of course).
 
Caches can be negated quite easily: use a data set far bigger than the cache and do random accesses. I don't know whether that applies to Pi. Prime95 large FFT, for example, wouldn't be a good choice here, as its accesses are more predictable, which makes it more bandwidth- than latency-sensitive.
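
To make the contrast concrete (my sketch, with an arbitrary buffer size and plain clock() timing, not anything from this thread): reading the same out-of-cache buffer sequentially lets the hardware prefetcher hide DRAM latency, while chasing a random dependent chain through it exposes the full latency on every load.

```c
/* seq_vs_rand.c -- hypothetical sketch: predictable streaming vs. random
   pointer chasing over the same buffer, both far larger than the cache.  */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 25)   /* 32M slots * 8 B = 256 MB */

int main(void)
{
    size_t *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* Random single-cycle chain (Sattolo's shuffle), as in the earlier sketch. */
    for (size_t i = 0; i < N; i++) a[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = ((size_t)rand() * ((size_t)RAND_MAX + 1) + (size_t)rand()) % i;
        size_t t = a[i]; a[i] = a[j]; a[j] = t;
    }

    clock_t t0 = clock();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += a[i];   /* predictable: prefetcher streams it */
    clock_t t1 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = a[p];      /* dependent: every load waits on DRAM */
    clock_t t2 = clock();

    printf("sequential: %.2f s   random chase: %.2f s   (sum=%zu p=%zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum, p);
    free(a);
    return 0;
}
```

On typical desktop hardware the random chase should come out an order of magnitude slower per element, which is exactly the latency-bound behaviour the per-core test needs.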

Best I can do is try it. If it works, great. If it doesn't, it doesn't.
 