
DDR4 Memory - Bandwidth, Latency, Quad vs Dual Discussion

Thanks!

Cache plays a role, indeed. But those are Xeon and HEDT CPUs with the mesh interconnect. Mainstream CPUs do not have the mesh. From Haswell-E to Skylake-E the cache was halved and restructured, yet bandwidth is still huge. We also see this scaling from quad to hex core without mesh.

Cache plays a role indeed, just not sure it's doing the lion's share of the work.
 



From some of the benchmarks, I now get the impression that the cores are handling the nuts and bolts of pushing bits through the pipeline while the cache, although important, appears more passive in its role.

Trying to piece together some sort of accurate blueprint from disparate bits of information is certainly a challenge. I realize they're only disparate because of the (yuge!) gaps in my knowledge, but those are the gaps I'm working on. :D
 
I was thinking you were talking about your Skylake-X going from fewer cores to more cores and seeing bandwidth improvement in the AIDA64 memory test. If you have not tested from fewer cores to more, you could check whether it is the number of cores that makes the difference in AIDA64.
 
We already know cores make a difference in AIDA64 bandwidth. The discussion is about the role cache plays in those values.
 
Well, you already have my opinion: more cores means more cache lines in flight, which in turn increases bandwidth. On the Haswell-E to Skylake-E point, the reason they halved the LLC on Skylake-E compared to Haswell-E is the larger L2 and the non-inclusive cache reducing the necessary LLC storage on Skylake-E. In turn, Skylake-E would have better cache-line bandwidth as more cores increase the number of cache lines in flight.

Non-inclusive cache
in this context, when a data line is present in the L2, it does not immediately go into L3. If the value in L2 is modified or evicted, the data then moves into L3, storing an older copy. (The reason it is not called an exclusive cache is because the data can be re-read from L3 to L2 and still remain in the L3). This is what we usually call a victim cache, depending on if the core can prefetch data into L2 only or L2 and L3 as required. In this case, we believe the SKL-SP core cannot prefetch into L3, making the L3 a victim cache similar to what we see on Zen, or Intel’s first eDRAM parts on Broadwell. Victim caches usually have limited roles, especially when they are similar in size to the cache below it (if a line is evicted from a large L2, what are the chances you’ll need it again so soon), but some workloads that require a large reuse of recent data that spills out of L2 will see some benefit.

So why move to a victim cache on the L3? Intel’s goal here was the larger private L2. By moving from 256KB to 1MB, that’s a double double increase. A general rule of thumb is that a doubling of the cache increases the hit rate by 41% (square root of 2), which can be the equivalent to a 3-5% IPC uplift. By doing a double double (as well as doing the double double on the associativity), Intel is effectively halving the L2 miss rate with the same prefetch rules. Normally this benefits any L2 size sensitive workloads, which some enterprise environments such as databases can be L2 size sensitive (and we fully suspect that a larger L2 came at the request of the cloud providers).
https://www.anandtech.com/show/1155...-core-i9-7900x-i7-7820x-and-i7-7800x-tested/4
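That "square root of 2" rule of thumb is easy to sanity-check with a few lines of arithmetic. A quick Python sketch; the 10% starting miss rate is just an assumed illustration, not a measured figure:

```python
import math

def miss_rate_after_doubling(miss_rate: float, doublings: int = 1) -> float:
    """Rule of thumb: each doubling of cache size divides the
    miss rate by sqrt(2), i.e. cuts misses by ~29% per doubling."""
    return miss_rate / (math.sqrt(2) ** doublings)

# Skylake-SP's L2 went from 256 KB to 1 MB: two doublings ("double double"),
# so sqrt(2)^2 = 2, and the rule predicts the miss rate is halved.
base_miss = 0.10   # assumed 10% L2 miss rate, for illustration only
after = miss_rate_after_doubling(base_miss, doublings=2)
print(f"{after:.3f}")  # prints 0.050
```

Which matches the article's claim that the double double "effectively halves the L2 miss rate."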
 
I believe they haven't used ganged memory channels (comparable to RAID 0) since the multi-core processors of around 2010. From the research I have done into the past: since the start of multi-core processors and multithreaded applications, due to the lackluster performance gains in applications, they switched to unganged memory channels, which are comparable to a non-RAID "just a bunch of disks" architecture. Unganged dual-channel memory maintains two 64-bit memory buses but allows independent access to each channel, in support of multithreading with multicore processors.

Ok, maybe RAID 0 wasn't the best analogy. To be honest I'm not up to speed on the ganged-or-not status, but fundamentally my argument stands: the main benefit of more RAM channels is more bandwidth potential. Whether one core is accessing a bit on one channel and another core on another, or there is some group-sharing thing going on, doesn't really matter.
 

I was using the term bandwidth to mean transfers belonging to the same read or write of dual-channel striped bytes from one data source, as with ganged memory channels, compared to unganged memory channels, which handle reads and writes as separate operations.
In computing, bandwidth is the maximum rate of data transfer across a given path: https://en.wikipedia.org/wiki/Bandwidth_(computing)
With unganged memory channels the transfers are technically not on the same path for that definition of bandwidth. However, separate memory channel operations have the potential to increase the combined amount of data transferred with multiple threads and cores. With a second memory access happening at the same time on a different channel, it is possible to reduce latency, since the second operation overlaps with the first across the two channels, in turn making it faster than ganged memory channels.
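A rough way to picture the ganged vs. unganged distinction is how a physical address picks a channel. The Python model below is deliberately simplified (real memory controllers use fancier interleaving and hashing) and assumes 64-byte cache lines:

```python
CACHE_LINE = 64  # bytes; a simplifying assumption for this sketch

def ganged_channels(addr: int) -> list:
    # Ganged: the two 64-bit channels act as one 128-bit bus,
    # so every access occupies both channels at once.
    return [0, 1]

def unganged_channels(addr: int) -> list:
    # Unganged: one address bit (here at cache-line granularity)
    # selects a single independent 64-bit channel, so two cores
    # touching different lines can use both channels in parallel.
    return [(addr // CACHE_LINE) & 1]

print(unganged_channels(0))    # [0]
print(unganged_channels(64))   # [1] -> different channel, can proceed concurrently
print(ganged_channels(0))      # [0, 1] -> both channels tied up by one access
```

The point of the sketch: under the unganged scheme, two independent accesses to adjacent cache lines land on different channels, which is exactly the multithreaded benefit described above.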
 
Too much to read ... The cache also affects the memory controller's max bandwidth. When you OC the cache, max memory bandwidth in synthetic tests goes up; it can improve max bandwidth by up to 20% on some platforms. On Skylake-X and every X series before it, overclocking the cache improved general "memory" performance more than memory frequency did (above some point). The IMC is usually bottlenecked by low cache/mesh speed.
As we know, higher memory frequency = higher bandwidth and lower latency. However, memory bandwidth is still limited by the IMC's maximum speed and in most cases does not scale well past ~DDR4-4000. Above some point, higher frequency = almost only lower latency.
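On the frequency-vs-latency point: the first-word latency in nanoseconds can be estimated from the data rate and the CAS latency number. A quick back-of-the-envelope in Python; the CL16 timings below are assumed examples, not specific kits:

```python
def cas_latency_ns(data_rate_mt_s: float, cl: int) -> float:
    """First-word latency: CL clock cycles at the memory clock,
    which is half the data rate on DDR (double data rate)."""
    clock_mhz = data_rate_mt_s / 2
    return cl / clock_mhz * 1000  # cycles / MHz -> nanoseconds

# Same CL at a higher data rate = lower absolute latency.
print(round(cas_latency_ns(3200, 16), 1))  # 10.0 ns
print(round(cas_latency_ns(4000, 16), 1))  # 8.0 ns
```

This is why, past the point where the IMC stops scaling in bandwidth, raising the data rate still helps: the same cycle counts simply take fewer nanoseconds.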

In theory there is no difference whether a CPU has 2 or 20 cores: the maximum possible bandwidth is the same if the same chipset/IMC is in use. However, the maximum real bandwidth (under load) will be different with multithreading, as you can see in tests like AIDA64. More threads can utilize the bus much better.

quick example (not really correct numbers but something close):
4 channels @3600 and 8 core CPU = ~80GB/s
4 channels @3600 and 10 core CPU = ~90GB/s
4 channels @3600 and 12 core CPU = ~100GB/s
even though in all 3 cases, theoretical max is something above 120GB/s
4 channels @3600 and 10 core CPU with cache/mesh overclocked by 500MHz = ~105GB/s
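The "theoretical max" in the example above follows straight from channels × 64-bit bus width × data rate. A quick Python check, using the DDR4-3600 quad-channel setup from the example:

```python
def peak_bandwidth_gb_s(channels: int, data_rate_mt_s: float,
                        bus_bits: int = 64) -> float:
    """Theoretical peak = channels * bus width in bytes * transfers/s.
    data_rate_mt_s is in MT/s, result in GB/s (decimal)."""
    return channels * (bus_bits / 8) * data_rate_mt_s / 1000

print(peak_bandwidth_gb_s(4, 3600))  # 115.2 GB/s, quad-channel DDR4-3600
print(peak_bandwidth_gb_s(2, 3600))  # 57.6 GB/s, dual-channel
```

That works out to ~115 GB/s for quad channel, in the same ballpark as the "something above 120GB/s" quoted above (which presumably assumed a slightly higher data rate), and the observed AIDA64 numbers sit well below it, as the post says.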

You don't see that as much on dual-channel platforms, where the cache is faster and runs at a higher frequency (and even at the same frequency it is faster than the cache on X platforms).
End users usually don't have to worry about IMC, memory, etc. settings, as motherboards set the optimal mode. Sometimes I wonder why all these options exist when they are never set to a "slower" mode. As long as the memory sits in the slots designed for multi-channel operation, it always runs in the optimal performance mode (unless someone botched the BIOS or motherboard design, which doesn't really happen).
 
I received the memory channel information from Intel: dual and quad channels are independent, 64 bits wide for each channel. :)


This message was posted on behalf of Intel Corporation
Hello wingman99,
Thank you for your response and your patience.
In this case, the terms "ganged" and "unganged" are not used in Intel's literature, but are used by other companies in reference to concepts comparable to our technology, such as dual-channel mode (2 independent 64-bit wide channels) or quad-channel mode (4 independent 64-bit wide channels).
All the information available can be found here:
https://www.intel.la/content/www/xl...core/8th-gen-core-family-datasheet-vol-1.html
 
I've never seen it on Intel before, but I recall seeing thread after thread about AMD and ganged/unganged.
 
Yup. I got used to it with a string of AMD rigs, and up until very recently ignored most memory settings due to a bad experience with PNY and ignorance. LOL
 
Ganged and unganged is AMD terminology. It's funny you asked Intel about that. :)

The AMD terminology describes the same thing Intel is doing with separate 64-bit-wide channels, so unganged memory channels is what Intel is doing. The terminology is the same idea as ganged vs. unganged HDDs.
 
I can't say ganged is common vernacular for an HDD setup... I'm sure it's used, but I've never run across it in anything I've read that I recall. :)

I get it, but it's seemingly not in common use. Most call it RAID. :)
 
Yes.

AMD could choose between 1x128 bits (ganged) or 2x64 bits (unganged) for their memory. AFAIK it has nothing to do with dual or single channel.
 