
Skylake and ram scaling in Prime95


mackerel
I tested. I tested some more. Had a mug of tea, then went back for more testing. Then I made some charts and here are the results!

Test system:
CPU: i7-6700k at 4.2 GHz, ring at 4.1 GHz, HT off
Mobo: MSI Gaming Pro, BIOS 1.7
GPU: 9500 GT (just to make sure no ram bandwidth is stolen by integrated graphics)
RAM: for the results presented I use two types
G.Skill F4-3333C16-4GRRD Ripjaws 4, 4x4GB kit
G.Skill F4-3200C16-8GVK Ripjaws V, 2x8GB kit

Testing was performed using the Prime95 28.7 built-in benchmark in Windows 7 64-bit. Each setting was run once, after the PC had been given time to settle down following a reboot. All test configurations had the ram in dual channel mode. Timing values listed are ordered CAS-RCD-RP-RAS, as commonly shown in most software.

Most testing was with all 4 modules of the Ripjaws 4 kit fitted, for reasons discussed later. This ram is known from previous experience not to boot in this mobo at 3333 with 4 modules fitted, so I tested at common ram speeds from 2133 to 3200. To avoid complicating matters with timings, these were fixed at 16-18-18-38 for the scaling tests, which may disadvantage the slower speeds since the values would typically be lower in practice. Latency will be considered separately later.


As the clock increases we see no significant difference in performance. This is not ram limited.


There is a slight increase in performance here as ram clocks go up, but not much.


Now we are starting to see something happen.


And here we see a clear relation with speed and performance.


Here we alter the display a bit so we can compare ram timings. Three frequencies are covered, and two of them are not exciting. At 3200 the results for 15-16-16-36 and 16-18-18-38 are practically identical. At 2800, 14-16-16-36 and 16-18-18-38 gave a 1% average advantage to C14, but this is so small it is hard to say whether it is just measurement variation. It gets a little more interesting at 2133, where three timing sets were tested: 14-14-14-35, 15-15-15-35, and 16-18-18-38. The last one is on average 4% slower than the other two, which were the same as each other. This may be an area for future research, although it seems ram speed is more important to performance. Timings might get you a little more as a secondary optimisation.


Putting the ram to one side, how does CPU speed affect performance? These 4 lines show the combinations of CPU at 3.5 and 4.2 GHz, with 1 or 4 workers active.

With one worker, the scaling is near perfect: the faster CPU is 19% faster, compared to 20% for ideal clock scaling.

With 4 workers, it would seem the ram is the limit. We only see a 4% increase for the 20% clock increase. This may present opportunities for power saving, as the higher clock doesn't help here. It may be interesting to see how scaling applies over a wider range of CPU speeds.
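If anyone wants to play with the numbers, here's a minimal sketch of that scaling arithmetic in Python (the throughput values are illustrative placeholders, not my measured data):

[code]
# Minimal sketch of the clock scaling arithmetic above.
# Throughput values are placeholders, not the measured data from the charts.

def clock_scaling(clk_low, clk_high, tput_low, tput_high):
    """Compare measured speedup against ideal clock scaling."""
    ideal = clk_high / clk_low - 1        # 4.2/3.5 - 1 = 0.20
    measured = tput_high / tput_low - 1
    return measured, ideal

# One worker: near-ideal scaling (~19% measured vs 20% ideal).
print(clock_scaling(3.5, 4.2, 100.0, 119.0))
# Four workers: ram limited, only ~4% gained from 20% more clock.
print(clock_scaling(3.5, 4.2, 100.0, 104.0))
[/code]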


And finally, this is the cause of some unexpected behaviour I saw. I had two comparable systems, but I saw a massive performance difference between them which I struggled to explain. I tried various things and even wrongly blamed the mobo for being rubbish, but it would seem module rank has a major influence. This isn't commonly discussed or even specified. Thaiphoon Burner is free software that can read it. The Ripjaws 4 modules are single rank, and the Ripjaws V modules are dual rank (caution: other parts in those series may vary!). General consensus seems to be that having more ranks can slightly increase bandwidth, at the cost of slightly higher latency.

This chart is going to take some explaining. It again shows the 4 worker throughput. The grey line is the Ripjaws V kit and the light blue line is the Ripjaws 4 kit with 4 modules fitted, both at 3200. So each memory channel carries a total of two ranks, and performance is so identical you can't see the light blue line under the grey line! So far so good? Let's take two of the Ripjaws 4 modules out, leaving it running in dual channel mode. Logically, this shouldn't make a difference: it is still 2 channels, running at the same clock and timings. Nope. We see a 19% drop in performance (orange line). This is massive! How massive? The yellow and blue lines are 4 modules running at 2666 and 2400 respectively, and they land neatly either side of the orange line. That is quite a performance drop!

The tentative conclusion is that it is worth having higher rank modules, or running more modules to get there; otherwise you will give up a significant chunk of your potential performance. Unfortunately it doesn't seem that easy to find out what rank a module is before buying it.

Ideally more testing could be done to make sure it is the rank, and not something else. I'd need, for example, single rank 8GB modules to make sure the module capacity isn't in some way influencing it. Or alternatively, dual rank 4GB modules.


I have quite a lot of data from this testing, so if there are different ways the data could be cut, I could have a go at showing it.
 
The last data set/chart is a little hard to follow... So you're saying that your 2x8GB is double-sided and your 4x4GB kit is single-sided, and they both perform the same at 3200C15 when installed in your setup, but when running with only two sticks (2x4GB single-sided from your 4x4GB kit) at 3200C15 it shows a ~19% performance drop due to having less memory available for the Prime benchmark?

I have a 2x8GB single-sided kit and i5-6600K that I could test with the benchmark...
 
Yes, although the 2x8GB kit is actually C16; that doesn't seem to make a difference. I would be interested in the results. Past testing not shown here suggests the cache difference between i5 and i7 isn't really significant, though I should still follow that up under more controlled tests.
 
I'll run it at 4.2 GHz with 4100 MHz cache @3200C15, and perhaps at a few other combinations too. I'll be testing on an ASUS Maximus VIII Impact with G.Skill F4-3600C16D-16GTZ. If running more memory helps on this benchmark, then maybe one of the new 2x16GB double-sided kits might be the best/ideal way to go?
 
[attached image: polroger.png]

Just plotted that, alongside the 2 and 4 module results from mine earlier. Your 3200 results are still ~4% faster on average than my 3200 2 module results, so there might be something else going on too. I didn't note the other timings, so possibly some of those were chosen differently by the mobo and had some impact.
 
I ran some more of this Prime95 benchmark on some of my setups over the weekend. I was curious about scaling over different Intel generations, as well as increasing DRAM quantity and different IC density.

Speed was set at 4.4GHz with 4400MHz cache @4 cores, no HT, for all platforms. I also did two runs on Ivy-E @6 cores, no HT. Quad channel seems to help offset the somewhat slower IPC of Ivy Bridge, and the two extra cores actually allow the Ivy-E to post the fastest results.

It's a stressful benchmark with AVX, so if you don't have your overclock dialed in right it will freeze/hang or even BSOD.


4.4GHz 4400MHz 4 core No HT:

[attached image: Prime95 v28.7 Benchmark @4.4GHz 4400Mhz cache 4core with no HT.PNG]

4.4GHz 4400MHz 6 core No HT:

[attached image: Prime95 v28.7 Benchmark @4.4GHz 4400Mhz cache 6 core with no HT.PNG]
 
Interesting... I never had IB, but comparing SB to Haswell clock for clock in totally non-ram-limited situations, I found Haswell about 50% faster than SB! Presumably IB isn't going to be that different from SB, both lacking the FMA that gives Haswell and later CPUs quite a boost. Similarly, Skylake is about 14% faster than Haswell for reasons I can't prove, but it is documented that FMA instruction latency was reduced from 5 cycles in Haswell to 4 cycles in Skylake, which might be a contributing factor depending on how well it is pipelined. Or it could be something else entirely.

Side note: I did some playing around on Broadwell too. It looks like that also benefits from having 2 rank per channel compared to 1 rank.
 
I never had IB, but comparing SB to Haswell clock for clock in totally non-ram limited situations, I found Haswell about 50% faster than SB!
I'll bite... What application showed a 50% increase? Is this your one application that showed significant motherboard performance differences?
 
Specifically it was LLR running tasks using 128k FFTs. I'm using Prime95 as an easier to use substitute, although the benchmark there starts at 1024k FFT, which is about the largest a single thread will fit in the L3 cache of an i7 before you become ram dependent (a 1024k FFT works on 8MB of data, matching the 8MB L3). They share a common math library, so are roughly comparable as far as the heavy lifting is concerned. Where supported, they will use FMA, which provides a decent boost over AVX, which in turn provides a massive boost over not having AVX at all. Pretty much anything before Sandy Bridge isn't worth running this on. For reasons I don't claim to understand, AMD's implementation of the same instructions doesn't give any real-world benefit, so they're totally out of the running.
 
Side note: I did some playing around on Broadwell too. It looks like that also benefits from having 2 rank per channel compared to 1 rank.

Here is what I got when comparing DDR3 kits, double-sided vs. single-sided, using my 4770K @4.4GHz 4400MHz cache, 8GB (2x4GB) 2400 10-12-11-31. The kits have either Hynix CFR ICs (double-sided) or Hynix MFR ICs (single-sided):

[attached image: Prime95 v28.7 Benchmark @4.4GHz 4400Mhz cache 4core with no HT running Hynix CFR & MFR.PNG]

I would think that an 8-core Haswell-E running quad channel and 32/64GB DDR4 (double-sided) would get you the best performance on this bench.

What is the relationship between LLR and PrimeGrid?
 
PrimeGrid runs a variety of prime number finding projects you can pick and choose from. Of the CPU apps, LLR is the most commonly used, testing whether a number is prime or not.

As for optimisation, for PrimeGrid there is a little twist. The BOINC server sends each task out twice, to double-check the results as it goes along. If a number is found to be prime, the first computer to report it gets the discovery. For small units, computer speed is less important, as the delay between sending the two copies can often decide who returns first. For longer tasks, which can take hours or days even on a fast system, this delay isn't significant any more and you can power your way to being first to return. Fewer, faster cores is the better optimisation in this scenario.

In what I've seen so far, it looks like one high speed ram channel can supply just over one high speed core. So I think a 6 core system with quad channel ram would be preferable over 8 cores which would suffer memory bandwidth limitations and thus run slower per-core. The Skylake i3 CPUs are also interesting at one core per channel, and you can get away without buying overclocking kit.
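As a rough illustration of that heuristic, here's a quick per-core bandwidth calculation (the core/channel combinations are just examples, assuming DDR4-3200):

[code]
# Rough per-core bandwidth for a few core/channel combinations,
# assuming DDR4-3200: 3200 MT/s * 8 bytes = 25.6 GB/s per channel.
per_channel = 3200 * 8 / 1000  # GB/s

for cores, channels in [(4, 2), (6, 4), (8, 4)]:
    print(f"{cores} cores, {channels} channels: "
          f"{per_channel * channels / cores:.1f} GB/s per core")
# 4c/2ch: 12.8, 6c/4ch: 17.1, 8c/4ch: 12.8 - the 6 core quad channel
# system keeps the most bandwidth per core.
[/code]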

If I get the time and motivation, I want to try estimating the optimal bandwidth per core clock. It will be a bit fuzzy, since I won't be able to consider every variable, like timings.
 


I just tried to find a way to re-express the previous data that gives better insight into ram requirements. I think I found it.

I took the 3200 speed ram result as the reference, as it was the fastest I ran at, and thus should be least limited. Strictly speaking, it would be nice to have even more bandwidth but that's not happening unless I get a quad channel ram system.

Anyway, I took the 1, 2, 3, 4 worker results for the tested ram speeds (2133 to 3200) and divided them by the reference results to give a scaling indication. That was then divided again by the number of workers. I then divided the resulting value into the calculated ram bandwidth at each speed. For indication, 3200 ram in dual channel mode should offer about 50GB/s (3200 MT/s × 8 bytes × 2 channels = 51.2 GB/s). I tested with 4 single rank modules, so this is the higher performing state with 2 ranks per channel. The CPU was the i7-6700k at 4.2 GHz with cache at 4.1 GHz.
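For anyone following along, the same calculation as a minimal Python sketch (the example throughput numbers are placeholders, not my data):

[code]
# Minimal sketch of the normalisation described above.

def bandwidth_gbs(mt_per_s, channels=2, bus_bytes=8):
    """Theoretical peak: transfers/s * bus width * channels."""
    return mt_per_s * bus_bytes * channels / 1000  # GB/s

def bw_per_scaled_worker(tput, ref_tput, workers, ram_mt):
    """Divide per-worker scaling (vs the 3200 reference) into bandwidth."""
    scaling = tput / ref_tput      # throughput relative to the 3200 run
    per_worker = scaling / workers
    return bandwidth_gbs(ram_mt) / per_worker

print(bandwidth_gbs(3200))  # 51.2 GB/s for dual channel 3200
# e.g. 4 workers at 2400 reaching 85% of the 3200 reference throughput:
print(bw_per_scaled_worker(tput=85.0, ref_tput=100.0, workers=4, ram_mt=2400))
[/code]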

As there was a lot of data, I tried to simplify it by only showing 1, 2, 4, 8M FFT sizes. They follow a similar trend, with minor variations throughout. I will leave that for another day, but the overall trend is clear enough. More bandwidth = faster up to a point of diminishing returns.

I should add this only applies to the Skylake at 4.2 GHz. Presumably a slower clocked CPU will have more ram bandwidth relative to the CPU speed, and shift the charts up a bit. I need to test this and will have to work in some lower clocked CPU results later.

I need to do more checking to see how well this fits in with past real data on scaling.
 
I had data at 3.5 GHz as well as the 4.2 from earlier. I had to simplify at this point, and since there wasn't much variation between the FFT sizes I just used 1024k as the first number that falls out. When I normalised the horizontal axis to include CPU clock, the two sets overlaid on top of each other nicely. I probably should do more to be extra sure, but I already need to handle more dimensions in Excel than I know how to! So there's a lot of manual work and it is getting messy...

I also tried splitting out the number of workers to look at the discontinuities. I think there are two things going on there. As mentioned in the original testing, when I changed ram speed I kept the same numerical timings. This puts slower ram at a disadvantage, since the timings would typically tighten as the clock goes down. Secondly, it looks like where you have fast ram + more cores vs. slower ram + fewer cores for the same bandwidth ratio, fewer cores takes a slight edge. So a combination of these effects is contributing.

I got a rough indication from that of how to scale ram and CPU clocks. As it turned out, if the nominal ram speed (e.g. 3200) matches the CPU clock in MHz (for a quad core), then each worker runs at about 90% of the speed of the relatively unlimited case of running 1 worker. For 95%, knock the CPU clock down another 12% or so. Note it may be more efficient, but you will still be getting less work done due to the lower clock, assuming ram is fixed. Put another way, if you keep increasing core clock you will still do more, but the efficiency drop will reduce the gain you see. At some point they may practically cancel each other out and you are purely ram limited.
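Putting hypothetical numbers on that rule of thumb (the 90%/95% figures are the ones above; the clocks are just for illustration):

[code]
# Hypothetical worked example of the rule of thumb above.
# Nominal ram speed matching CPU clock (quad core): ~90% per-worker speed.
cpu_mhz_90 = 3200          # e.g. 3.2 GHz CPU with 3200 ram
cpu_mhz_95 = 3200 * 0.88   # knock ~12% off the clock for ~95%

# Total throughput scales with clock * per-worker efficiency:
tput_90 = cpu_mhz_90 * 0.90
tput_95 = cpu_mhz_95 * 0.95
print(tput_95 / tput_90)   # ~0.93: more efficient, but less total work done
[/code]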

Again this assumes you're running dual channel ram, with modules that are either 2 rank or have 2x single rank per channel. Otherwise downscale your performance expectations.
 
The last data set/chart is a little hard to follow... So you're saying that your 2x8GB is double sided (D/R) and your 4x4GB kit is single sided (S/R), and they both perform the same at 3200C15 when installed in your setup, but when running with only two sticks 2x4GB (S/R) from your 4x4GB kit at 3200C15 it shows a ~19% performance drop due to having less memory available for the Prime benchmark?

I have a 2x8GB single sided (S/R) kit and i5-6600K that I could test with the benchmark...

"Sides" of RAM have nothing to do w/ranks. You could have 4 ranks on a single-sided stick of RAM.
 
"Sides" of RAM have nothing to do w/ranks. You could have 4 ranks on a single-sided stick of RAM.

Yes, this can be true, and it seems to apply more to server ram. But for the common unbuffered desktop ram tested above, I believe the single-sided kits were single rank and the double-sided kits were dual rank.
 
In desktop RAM, in most cases single sided = single rank and double sided = dual rank. There are no quad rank modules in desktop RAM, as quad rank requires ECC.

In DDR4 it looks like:
4GB modules = single rank
8GB modules = single or dual rank (dual rank only with newer higher capacity ICs, like the new Samsung chips)
16GB modules = dual rank

Since everything is moving to higher capacity ICs, I don't expect any new dual rank 8GB modules, and no new 4GB modules at all.
 
Yes this can be true and seems to apply more to server ram but for common unbuffered desktop type ram like those that were tested above I believe that the single-sided kits were single rank and the double-sided kits were dual rank.

How many ranks a DIMM has depends on the density of the individual memory ICs installed and has nothing to do w/how many sides are populated:

http://www.simmtester.com/page/news/showpubnews.asp?num=128
 
How many ranks a DIMM has depends on the density of the individual memory ICs installed and has nothing to do w/how many sides are populated:

http://www.simmtester.com/page/news/showpubnews.asp?num=128

Ok... You are right. :)

So on that note... I went back and edited out some of the references to S/R and D/R from my posts. :thup:


Kits used for testing:

G.Skill F3-17000CL8D-4GBXMD 2x2GB sticks (8 chips per side of each dimm)
G.Skill F3-2666C11D-8GTXD 2x4GB sticks (8 chips per side of each dimm)
Team DDR3-2666C11 TXD34G2666HC11CBK 2x4GB sticks (8 chips on just one side of each dimm)
G.Skill F3-19200CL9Q-16GBZMD 4x4GB sticks (8 chips per side of each dimm)
G.Skill F3-19200CL11Q-16GBZHD 4x4GB sticks (8 chips per side of each dimm)
G.Skill F4-3600C16D-16GTZ 2x8GB sticks (8 chips on just one side of each dimm)
Crucial Ballistix DDR4-2400 BLS2K8G4D240FSA 2x8GB sticks (8 chips per side of each dimm)
 
How many ranks a DIMM has depends on the density of the individual memory ICs installed and has nothing to do w/how many sides are populated:

http://www.simmtester.com/page/news/showpubnews.asp?num=128

In modern DDR (that article is 10 years old) a module can't have fewer than 8 memory chips because of the architecture. You are right that ranks don't correspond to how many PCB sides are used, but in the current non-ECC generation of DDR4 we have only 2 options:
8 chips - single side - single rank (4GB and 8GB modules)
16 chips - double side - dual rank (8GB and 16GB modules)
 