Why not Quad memory channels in Skylake/Broadwell?

EarthDog · Jul 15, 2016

petteyg359 said:
Isn't that what we're talking about? More cores being the need for more memory channels with a side effect of more bandwidth?

Yep.

mackerel said:
It's not quite a simple count of cores, as clock speed counts towards it too. You need enough combined CPU demand to saturate the memory interface. The problems I see only really kick in with quad cores on a dual channel interface, particularly with a higher clocked CPU and/or lower clocked ram (e.g. DDR3-1600).

For your work, correct!

magellan · Jul 15, 2016

mackerel said:
I think I've posted here about this in the past: I run compute software where ram performance is as important as CPU. As a rough balance, for a Skylake quad core, ideally I'd need the ram to be numerically the same speed as the CPU clock, e.g. a quad core Skylake running at 4 GHz needs 4000 MT/s dual channel ram to be "practically unlimited". This also assumes dual rank ram. There is a stiff performance penalty for single rank, which is particularly annoying as rank is rarely specified on ram kits and even 8 GB modules are moving towards single rank these days.
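As a rough illustration of the rule of thumb above, here is a minimal Python sketch. The function names `required_ram_mts` and `ram_balance` are made up for this example, and the rule itself is only the poster's estimate for a quad core Skylake with dual channel, dual rank ram, not a general formula.

```python
def required_ram_mts(cpu_ghz: float) -> float:
    """Rule of thumb from the post: ram speed in MT/s should numerically
    match the CPU clock in MHz (quad core Skylake, dual channel, dual rank)."""
    return cpu_ghz * 1000.0


def ram_balance(cpu_ghz: float, ram_mts: float) -> float:
    """Ratio of actual ram speed to the rule-of-thumb target.
    Values below 1.0 suggest the ram may be the limiting factor."""
    return ram_mts / required_ram_mts(cpu_ghz)


if __name__ == "__main__":
    # Hypothetical setups: a 4 GHz quad core with two different ram speeds.
    for ram in (2133, 4000):
        print(f"4.0 GHz CPU with {ram} MT/s ram: balance {ram_balance(4.0, ram):.2f}")
```

By this estimate a 4 GHz quad core on standard 2133 ram sits at roughly half of the suggested target.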

The side effect of the above is that I find the i3 is a great performer. You get two fast cores, and two ram channels to feed them. Nothing can be overclocked, nor needs to be, so it can be done very cheaply compared to an OC i5 or i7 setup where you have to consider quality components. Note: this is for a compute scenario where you don't care whether the cores are in separate boxes or not, you just need a lot of them.

Edit: noticing which forum I wrote this on, I should clarify that I don't consider the i3 can be overclocked in any meaningful way for this application. The method that breaks AVX would cost you far more in performance than any clock gain. Maybe the Asrock non-Z ram OC could help a little, but it would be minor enough that it should only be a consideration if it doesn't add to the cost.

In the white papers I've read, the difference in compute intensive benchmarks between dual rank and single rank RAM was only 2%, but this was for older Westmere-based systems running DDR3 RAM @ 1333MHz.

An IBM redbook article I read stated that having more ranks on a DIMM can limit the maximum speed the DIMM can run at, because more ranks increase the electrical load on the memory bus.

mackerel · Jul 16, 2016

The difference ram makes depends on a combination of the software's demands on the ram, ram performance, and CPU performance. My first encounter with this was when I did my first Skylake build last August. I bought a 6700K with DDR4-3333 ram. It tore through work, but was unstable. While diagnosing it, I set the ram to standard 2133 and saw a massive performance drop, on the order of tens of percent. Skylake cores have the highest effective IPC so far, it was quite highly clocked, and the ram simply wasn't keeping up. I could choose other work which doesn't make demands on the ram since it fits inside the L3 cache, and in those cases pure CPU clock matters.
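For a sense of scale, theoretical peak bandwidth can be sketched as transfers per second times bytes per transfer times channels. This is a minimal Python sketch, assuming the standard 64-bit (8 byte) data width of a DDR3/DDR4 channel. The helper name `peak_bandwidth_gbs` is made up for this example, and real-world throughput will be lower than these peaks.

```python
BYTES_PER_TRANSFER = 8  # assumes a 64-bit wide DDR3/DDR4 channel


def peak_bandwidth_gbs(mts: float, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).
    mts is the ram speed in mega-transfers per second."""
    return mts * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


if __name__ == "__main__":
    # The two dual channel settings mentioned in the post.
    for mts in (2133, 3333):
        print(f"DDR4-{mts} dual channel: {peak_bandwidth_gbs(mts, 2):.1f} GB/s peak")
```

On paper that is roughly 34 GB/s against 53 GB/s. How much of that gap shows up in practice depends on how ram limited the workload is.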

I have yet to properly quantify the value of dual rank over single rank, but it seems to be in the ballpark of over 10% in my use cases. That is, you would need a faster grade of single rank ram to be equivalent to dual rank, but I'm not confident enough to put a specific number on that. The trouble is, unless you're heavily ram limited, the relationship between ram speed and performance is influenced by the CPU in a non-linear way. To test this I would have to intentionally cripple the ram (perhaps go single channel and underclock) to make sure it is the dominant limit. The rank bonus doesn't seem to be captured in AIDA benchmarks, possibly because real world mixed access patterns benefit more than a synthetic measure of one part.

As for more ranks limiting performance, I suspect this is more of a server problem than a consumer problem. If you load a channel with multiple modules it could start to have problems, but I don't think the up to 2 modules per channel found on consumer kit would be particularly significant in that respect.

Woomack · Jul 16, 2016

AIDA64 scales with cores mainly on quad channel platforms. You can check my results on Z170 + i3 and i7. Bandwidth is about the same. The differences are on X99, where 8 cores have up to 20GB/s higher max bandwidth than 6 cores. That's because the memory controller is totally different. Because of that you also can't compare 2 and 4 channel platforms.
In this thread there are mixed results on the i3 6320, i5 6600K and i7 6700K.

CPU clock doesn't have much to do with memory performance. It's the cache clock that counts. On some platforms the cache clock can't be higher than the CPU clock, so it's hard to notice that.
The next thing is that memory latency is related to cache speed, and what you see in benchmarks is not pure memory performance but a mix of cache, IMC and memory. 4 channel platforms have longer traces and more delay between all these elements. Also, the cache in Haswell-E/Broadwell-E is slower than in Skylake.

Memory ranks don't affect all applications. Also, 4x single rank modules are in theory as fast as 2x dual rank (it doesn't look exactly like that in practice, but that's the theory). In both cases they should address the same data in a single clock cycle.
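The equivalence of 4x single rank and 2x dual rank can be seen by counting ranks per channel. This is a minimal Python sketch, assuming a dual channel board with modules spread evenly across the channels. The function name `ranks_per_channel` is made up for this example.

```python
def ranks_per_channel(modules: int, ranks_per_module: int, channels: int = 2) -> int:
    """Number of ranks sharing each memory channel, assuming the modules
    are distributed evenly across the channels."""
    if modules % channels != 0:
        raise ValueError("modules must divide evenly across channels")
    return (modules // channels) * ranks_per_module


if __name__ == "__main__":
    print("4x single rank:", ranks_per_channel(4, 1), "ranks per channel")
    print("2x dual rank:  ", ranks_per_channel(2, 2), "ranks per channel")
    print("2x single rank:", ranks_per_channel(2, 1), "ranks per channel")
```

Both of the first two configurations end up with two ranks on each channel, while 2x single rank has only one.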

mackerel · Jul 16, 2016

If you look at ram in isolation, then CPU clock might not matter, but in reality it is the CPU demand interacting with the ram that gives the overall performance you get in any application.

In my previous testing, I agree that 4 single rank modules on dual channel do behave near enough the same as dual rank, but that doesn't really help in terms of buying since, for example, 4x4 isn't that different in cost to 2x8.

wingman99 · Jul 16, 2016

Quad channel performance all depends on how much data needs to be retrieved in one cycle: 256, 128, 64, 32, 16 or 8 bits. From what I remember from Anandtech about how memory and channels work, the memory controller retrieves memory in blocks: 128 bits for dual channel, 256 bits for quad channel. Even if the program only needs 8 bits of data, it still gets the whole block and discards the rest. That is why it is so hard to see an improvement with more channels: the memory does not travel faster, just bigger blocks of data are received whether needed or not. Most of the time only a small amount of data is needed in any one cycle, except for graphics, which is highly parallel.
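The block model described above can be put into numbers. This is a minimal Python sketch that only follows the post's description (one 64-bit slice per channel, fetched as a single block). It is not a model of how any specific memory controller schedules transfers, and the names `block_bits` and `useful_fraction` are made up for this example.

```python
CHANNEL_WIDTH_BITS = 64  # per-channel width implied by 128 bits for dual, 256 for quad


def block_bits(channels: int) -> int:
    """Size of the block fetched at once under the post's model."""
    return channels * CHANNEL_WIDTH_BITS


def useful_fraction(needed_bits: int, channels: int) -> float:
    """Share of the fetched block that the program actually uses."""
    block = block_bits(channels)
    return min(needed_bits, block) / block


if __name__ == "__main__":
    for channels in (2, 4):
        for needed in (8, 64, 256):
            frac = useful_fraction(needed, channels)
            print(f"{channels} channels, {needed} bits needed: {frac:.1%} of the block used")
```

Under this model, an 8 bit request uses a smaller share of the block on quad channel than on dual channel, which matches the point that wider channels only help when the program needs large amounts of data at once.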
