After much testing, which I touched upon in the other thread, I have some numbers to share.
I did testing with one system, an i7-6700k at 4.2 GHz, HT off, 4.1 GHz cache, G.Skill F4-3333C16-4GRRD at 3000 16-18-18-38. This is a 4x4GB kit. For each line, first value is for 4 modules fitted, 2nd value for 2 modules fitted (still in dual channel mode), and last one is first divided by 2nd to show relative performance.
Prime95 28.7 built-in benchmark, 4 workers, throughput in iterations/second at various FFT sizes. Run once each.
1024k 997.60 819.47 1.22
2048k 468.45 376.43 1.24
4096k 228.53 182.85 1.25
8192k 111.84 86.51 1.29
Here's the performance I was missing previously: 22% at 1024k, rising to 29% at 8192k as the workload grows. A 1024k FFT would be 8MB of data per worker, so around 32MB across the 4 workers, and this hits ram hard. Bigger FFT, more ram. It wasn't the motherboard. It wasn't the motherboard settings. It was the ram population. But is this effect seen in other benchmarks too? I tried a couple of others as follows.
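To put rough numbers on the footprint estimate above, here is a minimal sketch. It assumes 8 bytes (one double) per FFT element, which is what the 8MB figure for 1024k implies, and compares the total against the 8MB L3 cache of the i7-6700k. The helper name is made up for illustration, and Prime95's real memory use will include extra tables on top of this.

```python
# Rough working-set estimate. Assumes 8 bytes (one double) per FFT element;
# this is an assumption inferred from "1024k FFT would be 8MB", not a
# figure taken from Prime95 itself.
BYTES_PER_ELEMENT = 8
WORKERS = 4
L3_CACHE_MB = 8  # i7-6700k L3 cache size

def fft_footprint_mb(fft_size_k):
    """Approximate data size in MB for one worker at the given FFT size (in K)."""
    return fft_size_k * 1024 * BYTES_PER_ELEMENT / (1024 * 1024)

for size_k in (1024, 2048, 4096, 8192):
    per_worker = fft_footprint_mb(size_k)
    total = per_worker * WORKERS
    print(f"{size_k}k: {per_worker:.0f} MB per worker, {total:.0f} MB total, "
          f"{total / L3_CACHE_MB:.0f}x the L3 cache")
```

Under these assumptions even the smallest size tested is several times larger than the cache, so every size in the table should be streaming from ram.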
MaxxMEM2 1.99 (run 3 times, best results taken)
copy 33808 32634 1.04
read 26034 25583 1.02
write 30956 30956 1.00
latency 54.2 52.3 1.04
A 4% increase in copy and 2% in read isn't much. Write is unchanged. Latency went up by about 4% though, which is slightly worse.
PassMark PerformanceTest 8 (ram tests only, run 3 times, best results taken)
Memory Database Operations 122.6 124.2 0.99
Memory Read Cached 31912 31922 1.00
Memory Read Uncached 20204 20244 1.00
Memory Write 16117 16397 0.98
Memory Latency 19.7 19 1.04
Memory Threaded 35007 34503 1.01
We see the same latency increase as with MaxxMEM2, but the rest doesn't change significantly.
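For anyone wanting to check the third column, the ratios can be recomputed with a short sketch like the one below. The values are copied from a few rows of the tables above (4 modules first, 2 modules second); the dictionary and function names are just for illustration.

```python
# (4 modules, 2 modules) pairs copied from the tables in this post.
results = {
    "Prime95 1024k": (997.60, 819.47),
    "Prime95 8192k": (111.84, 86.51),
    "MaxxMEM2 copy": (33808, 32634),
    "MaxxMEM2 read": (26034, 25583),
    "MaxxMEM2 latency": (54.2, 52.3),
    "PassMark Memory Write": (16117, 16397),
    "PassMark Memory Latency": (19.7, 19.0),
}

def ratio(four_modules, two_modules):
    """Relative result: the 4-module value divided by the 2-module value."""
    return round(four_modules / two_modules, 2)

for name, (four, two) in results.items():
    print(f"{name}: {ratio(four, two):.2f}")
```

Note that for the latency rows a ratio above 1 means 4 modules were slower, whereas for the throughput rows above 1 means faster.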
So, any explanations for these observations? Is it the rank thing? Bear in mind I only care about prime finding performance, not any other benchmarks. It has long been known that as task sizes get bigger, the demands on ram performance increase. But I'm not aware of anyone trying to figure out how that works in practice. Real world testing sometimes even shows that running 4 tasks gives lower throughput than running 3. In most cases running 4 does give you more, but not much more than 3, since it is so ram limited. From what I've seen so far, latency and processor cache size don't seem to play a significant role for these big units, although I haven't tested it in depth. Bandwidth seems to be king, but what is changing when going from 2 modules fitted to 4?