
Why no CPUs w/384-bit or 512-bit memory data buses?


magellan

Member
Joined
Jul 20, 2002
If AMD and Nvidia can design PCBs and GPUs that access memory over 384-bit and 512-bit memory data buses, why can't the same thing be done for CPUs? It might not have much application to the desktop market, but what about servers? They already make motherboards with 8 and 9 DIMM sockets, so it's not like there's a lack of motherboard real estate to add more.
 
Non-technical reason - same reason why a garden hose isn't 8" around... because the size of the pipe (bandwidth) isn't a limiting factor.
 

Maybe for desktop applications and gaming, but in the HPC environment I work in, the datasets run from hundreds of GiB up into the TiB range. The memory load on the nodes that make up the high-mem queue sits at 80% to 90% while these jobs run for days on end. PLINK scans through genotype/phenotype data, and those datasets are held entirely in memory, so the speed at which you can crank through them is affected by memory bandwidth. The sample dataset I was testing against was over a TiB in size and took nearly 24 hours to complete. There are medical imaging applications with even larger datasets than I was working with (one being a scan of the human brain).
 
I've designed a rudimentary processor for an EE class I took, and I'm fairly sure the answer is "space concerns". Every bit of bus width requires a wire and a connection to the processor, so a 512-bit memory bus requires 512 data connections. Video cards can get away with running memory chips in parallel in a smaller package, which gives them the benefit of a wider bus.

I'm going to point Dolk to this thread since he knows a hell of a lot more than I do.
 
I need a gif for when I am summoned into a sub.

The answers here are pretty much what I would have said. The reason you don't see higher bandwidth is physical limitations all around. The memory controller (MC) would need a bunch more physical address and data lines, and every core has to connect to the MC. The routing around the DIMMs is at its absolute physical limit with the current connector style, and individual memory modules are at their limits too when it comes to pins per unit area. Ever seen the back of a memory module? It's basically a pin field with no room left. What room there is goes to power and signal integrity.

Plain and short, there has been no need to move up in pin count. And it won't matter much longer, as RAM will be migrated onto the CPU package with the next generation of CPUs; additional external RAM will only be needed for extreme cases like HPC servers.
 

Why can't Intel and/or AMD just add more channels of existing DIMM technology? The old Westmere servers I work with have 3 memory channels that support 3 DIMMs per channel, for a total of potentially 9 DIMMs bused to the CPU (but not all active simultaneously).

But if two more channels of DDR3 or DDR4 memory were added to a modern quad-channel CPU design, we would have a 384-bit memory bus across six channels, but would that mean adding more than 500 pins to the CPU?
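To put rough numbers on that (a back-of-envelope sketch; the per-channel pin figure below is my own assumption, not a JEDEC or Intel number):

// Back-of-envelope: bus width and extra CPU pins for more DDR channels.
// The pins-per-channel figure is an assumption (data + strobes + address/
// command + control + clocks), not an exact DDR3/DDR4 value.
#include <cstdio>

int main() {
    const int bits_per_channel = 64;          // data width of one DDR channel
    const int approx_pins_per_channel = 125;  // assumed rough signal-pin count

    for (int channels = 3; channels <= 6; ++channels) {
        std::printf("%d channels -> %3d-bit bus, ~%d memory signal pins\n",
                    channels,
                    channels * bits_per_channel,
                    channels * approx_pins_per_channel);
    }
    // Under this assumption, going from 4 channels (256-bit) to 6 (384-bit)
    // adds roughly 2 * 125 = 250 signal pins, before extra power/ground.
    return 0;
}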

If that's true, then why not just remove features from the CPU that are irrelevant to the HPC/server market? Why not remove all the direct PCIe connectivity and push it onto the PCH? QPI has enough bandwidth to keep up with 10Gb Ethernet.

This discussion also sheds some light on the bizarre NUMA node interleaving setup that IBM x3550 servers have, in which memory is striped across both sockets. I guess this is the way you would maximize the available physical memory.
 
Specialized CPUs will not happen for a long long long time. Especially for the HPC market. The HPC market is a small chunk of the server industry. However, this doesn't mean that you won't see more flavors of high end server CPUs in the near future. Memory issues in servers will be a small problem come next refresh.

Now with the current generation of CPUs, Skylake, you will have the option of up to 2 DIMMs per channel in a quad-channel configuration. You get your 8 DIMMs of total memory that way, plus a decrease in latency since it's quad-channel.

Why not go to 384-bit memory? See my last post. Memory is LITERALLY at the physical limits of its size and dimensions. You would have to change the memory standard in order to go up in pin count, and that's just dumb when we are one generation away from HBM on all CPUs.

EDIT

Also, QPI (now UPI) is way above 10GbE in bandwidth.

The NUMA box you describe is not an Intel/AMD based server; it uses IBM Power7 CPUs. Those CPUs are specifically designed for high-socket-count, high-memory-density server designs. Their use of cross-socket memory is very complicated and unique. Not too many vendors have implemented this type of architecture due to its immense complexity and high latency.
 

I wouldn't figure adding a channel or two of memory would require changing the design of DIMMs one iota.

The Westmere-based IBM x3550 M3 servers we're running right now can take 3 DIMMs per channel, for a total of 9 DIMMs per socket.

All the systems in our cluster are NUMA (Non-Uniform Memory Access) capable x86 servers; none of them use the IBM Power architecture. All the systems that have both sockets populated are running in NUMA mode.
 
More and more HPC systems are adding GPUs, since GPUs are much faster for some workloads. Those applications need a low-latency, high-bandwidth connection between GPU and CPU, so trading PCIe for a wider memory bus wouldn't be a good idea.

As for why CPUs don't have higher memory bandwidth in general: like others said, it's probably just because bandwidth isn't the bottleneck.

A 1 TB dataset isn't really that big. A modern Xeon can already stream close to 100 GB/s, so one full pass over 1 TB takes on the order of ten seconds. Unless the whole task finishes within a few minutes, memory access time is a small fraction of the runtime.
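Just to show the arithmetic behind that claim (using the ~100 GB/s streaming figure above; nothing here is a measurement):

// Time for one full streaming pass over a dataset at a given memory bandwidth.
#include <cstdio>

int main() {
    const double bandwidth_gb_per_s = 100.0;   // ~100 GB/s per socket, as claimed above
    const double dataset_sizes_gb[] = {100.0, 512.0, 1024.0};

    for (double size_gb : dataset_sizes_gb) {
        std::printf("%6.0f GB dataset: ~%.1f s per full pass\n",
                    size_gb, size_gb / bandwidth_gb_per_s);
    }
    // Even 1 TB streams in ~10 s per pass, so a job that runs for hours is
    // dominated by compute (or latency) rather than raw bandwidth, unless it
    // makes an enormous number of passes over the data.
    return 0;
}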

Also, there's only so much bandwidth a core/thread can use, even if the IMC has infinite bandwidth. The charts here summarize it nicely - http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/10

At 8 cores, total bandwidth is limited by how much each thread can use, which is why hyperthreading increases total bandwidth. At 14 or 18 cores, the IMC becomes the bottleneck.

However, no real life workload will be totally memory-bound like that.

There are many programming tricks you can use to reduce memory traffic. Almost all algorithms can be written so that they work mostly out of data sitting in L3.
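As one hedged illustration of that idea (my own sketch, not anything from this thread or from PLINK): a cache-blocked matrix transpose touches the data in tiles small enough to stay resident in cache instead of striding across the whole array.

// Cache-blocking sketch: transpose an n x n row-major matrix in tiles chosen
// so each tile pair (source + destination) fits comfortably in cache, instead
// of striding through the whole matrix and missing on nearly every access.
#include <algorithm>
#include <cstddef>
#include <vector>

void transpose_blocked(const std::vector<double>& src, std::vector<double>& dst,
                       std::size_t n, std::size_t block = 64) {
    // 2 * block * block * sizeof(double) should sit well below the L2/L3 size;
    // block = 64 means about 64 KiB of doubles touched per tile pair.
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t jj = 0; jj < n; jj += block)
            for (std::size_t i = ii; i < std::min(ii + block, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + block, n); ++j)
                    dst[j * n + i] = src[i * n + j];
}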

Then there's also the question of whether DRAM throughput is actually limited by the bus width, versus the speed of the DRAM chips themselves.
 

A 1 TiB dataset isn't all that big? Really? When have you analysed a dataset of more than 1 TiB in size, and to what end?

From page 223 of the Tuning IBM System X Servers For Performance:

"SPECfp_rate is used as an indicator for high-performance computing (HPC)
workloads. It tends to be memory bandwidth-intensive and should reveal
significant improvements for this workload as memory frequency increases.
As expected, a number of sub-benchmarks demonstrate improvements as
high as the difference in memory bandwidth. As shown in Figure 10-22 on
page 224, there is a 13% gain going from 800MHz to 1066MHz and another
6% improvement with 1333MHz. SPECfp_rate captures almost 50% of the
memory bandwidth improvement."

Show me an algorithm that will sort a 150 MiB file out of the L3 cache exclusively.
 

I haven't. My last project was doing machine learning on a bunch of brain MRI scans, and that was only a few hundred GBs. Memory bandwidth wasn't a problem at all.


That's a pretty common problem. First divide it into chunks that fit in L3. Sort each chunk individually. Then merge them.

Obviously it's not exclusively out of L3, but it only makes 2 O(n) passes through the dataset. Memory bandwidth won't be the bottleneck.
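For what it's worth, here is a minimal in-memory sketch of that chunk-then-merge approach (my own illustration with an arbitrary, roughly L3-sized chunk; not the poster's code):

// Chunked sort: sort L3-sized runs individually, then do one k-way merge.
// Two passes over the data in memory: one to sort runs, one to merge them.
#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> chunked_sort(std::vector<int> data,
                              std::size_t chunk_elems = 1u << 21) {  // ~8 MiB of ints
    // Pass 1: sort each chunk in place; a chunk fits in L3, so this pass is
    // mostly cache-resident work.
    std::vector<std::pair<std::size_t, std::size_t>> runs;  // [begin, end)
    for (std::size_t b = 0; b < data.size(); b += chunk_elems) {
        std::size_t e = std::min(b + chunk_elems, data.size());
        std::sort(data.begin() + b, data.begin() + e);
        runs.emplace_back(b, e);
    }

    // Pass 2: k-way merge of the sorted runs via a min-heap of run cursors.
    using Cursor = std::pair<int, std::size_t>;  // (current value, run index)
    auto cmp = [](const Cursor& a, const Cursor& b) { return a.first > b.first; };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(cmp)> heap(cmp);

    std::vector<std::size_t> pos(runs.size());
    for (std::size_t r = 0; r < runs.size(); ++r) {
        pos[r] = runs[r].first;
        if (pos[r] < runs[r].second) heap.push({data[pos[r]], r});
    }

    std::vector<int> out;
    out.reserve(data.size());
    while (!heap.empty()) {
        auto [val, r] = heap.top();
        heap.pop();
        out.push_back(val);
        if (++pos[r] < runs[r].second) heap.push({data[pos[r]], r});
    }
    return out;
}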
 

Your algorithm isn't much of an algorithm at all and can't be translated into code as is.

So it can't be done exclusively in the L3 cache, gee who would've thunk it?

WRT your assertion that memory bandwidth isn't an issue in HPC applications:

http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2356

In particular:

"If you look at the performance and speedup, you see that the Nehalem keeps getting faster up to 8 cores per node, but in the Westmere case, when we go beyond 8 cores per node, we see diminishing returns of 12% performance improvement when increasing cores by 25% from 8 to 10. The drop off is more significant between 10 and 12 cores as the code only gets a 2.7% increase in performance with 20% more cores! Note that for reference, the speedup when using 12 cores of Nehalem on two nodes are shown to show what the performance could have been if the Westmere cores had not encountered a bottleneck getting data from memory. This simply means that for this code, the Nehalem's computing power (for WRF) didn't exceed the bandwidth available, but the Westmere has so much more computing power that you can create a traffic jam if you run WRF on all cores. "
 

It's a trivial problem and a trivial algorithm. Yes, I can translate it into code easily in about 20 minutes. No I am not going to do that because I have better use for my 20 minutes, like writing actually useful code.

If you read my posts above, I have never claimed that everything can be done exclusively in L3. You are the one who keeps putting that claim in my mouth. What I claimed is that almost everything can be done mostly in L3, to the point that memory bandwidth isn't the bottleneck.


Yes, WRF is one of the very few memory intensive applications. That's why I said "most". Even then you can see Nehalem already scales much better, and memory bandwidth has been growing faster than CPU power for the past few generations.
 

What HPC clustering applications have you worked with? What DRM did you use? What clustering software did you use?

It would seem to me that if you have 6 or 8 processes (i.e. with independent memory address spaces) running on 6 or 8 cores against a dataset much larger than the L3 cache, there would be significant L3 cache thrashing.

It would be interesting to see how the execution times for my PLINK dataset (which can take from 8 to 24 hours, depending on which compute nodes and datasets I use) would change if I slowed down the memory bus. Maybe I'll ask if I can try that on one of our R&D systems.
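Short of actually down-clocking the memory, a rough way to see how close a node gets to its bandwidth ceiling is a STREAM-style triad microbenchmark. This is a single-threaded sketch of my own (not PLINK, not our cluster tooling); a real measurement would use multiple threads and the actual STREAM benchmark.

// STREAM-style "triad" sketch: a[i] = b[i] + s * c[i] over arrays much larger
// than L3, timed to estimate sustained memory bandwidth for one thread.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 25;              // 32M doubles (~256 MiB) per array
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double s = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];                  // reads b and c, writes a: 24 bytes/element
    auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    const double gbytes = 3.0 * n * sizeof(double) / 1e9;
    std::printf("triad: %.2f GB in %.3f s -> %.1f GB/s (single thread)\n",
                gbytes, secs, gbytes / secs);
    std::printf("checksum %.1f\n", a[n / 2]);    // keep the loop from being optimized away
    return 0;
}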
 