
i9-7960X, 16 cores, HT: why does Prime95 produce more heat with 16 workers than with 32?


kkm

New Member
Joined
May 24, 2018
I am building a workhorse computer for computing needs at work. The board is an ASRock Taichi XE, and the CPU is the i9-7960X. I am not shooting for the best GHz, rather for longer-term stability and a three-year working life of the machine. My life is crunching numbers, not tuning hardware (although I have a BSEE and an MS in physics, so I have some idea of what I am tuning). At the very least, as a former physicist, I was taught very, very well not to trust any measurement instrument. And now I see something that truly baffles me.

I am using Prime95 to get the most heat out of the CPU. Currently I am charting the thermals (package power / hottest core) vs. the number of workers on the same 8K FFT, with AVX instructions disabled in Prime95. What strikes me as super odd is that I am getting the most package power when I run 16 workers, not when I run 32. All my runs are a 2-minute ramp-up, then 10 minutes of measurement. I am using HWiNFO (the latest release, I believe) and reading the highest core Tj and the package power it reports. The sampling period is set to 500 ms.

In both the 16- and 32-worker tests, the core multipliers sit at x39. Core voltage varies by a few mV and is the same within the margin of error in both runs. Power min/max differ by a few watts over the 10-minute run. But. Here are the power dissipation results of an earlier run, before I tuned Vcore down:

16 workers: 265 ± 4 W
32 workers: 188 ± 6 W

That's a huge difference!

I expected the hyperthreading switching to produce more heat. I have read (but never tried it) that disabling HT lowers the dissipation somewhat. Now, with 32 workers on 16 physical and 32 logical HT cores, there should be really a lot of HT context switches; with 16 workers, far fewer, caused only by OS background processes. This is where my thinking went wrong, and my understanding is obviously incorrect.

In case it matters: Turbo Boost is enabled (I am hand-tuning the table to get the load chart more even), Turbo Boost 3.0 is also enabled (but makes no difference), and I have disabled both EIST and SST [EDIT: typo, was "RST"] in the BIOS (and HWiNFO reflects all 4 flags set as I configured them, if I can trust it).

Does anyone have a similar experience? What is going on?
 
In my experience, across dozens of CPUs, disabling HT uses less power and yields lower temps. Granted, I'm not setting specifics in P95 as you are; I simply use the default small FFT for CPU testing and Blend to test more memory. Perhaps it's in the one-off settings used? Does it happen when using a 'canned' test from P95, or does it hold true with other stress-testing applications? It may well be P95 and your lengths run better when the cache is split over 16 total threads than when HT is enabled? No idea.

I've got a 7960X at 4.5 GHz on all cores with HT disabled (I don't need the cores; the chip was a free replacement for a borked 7900X).
 
You want max heat, and then you disable AVX?

My experience is with AVX on only, so I'm not sure if it behaves differently with it off. I can't think of an obvious reason for the described effect unless your system is going into some kind of limiting with 32 workers that it isn't with 16. With AVX + HT on, the CPU does produce more heat but not more throughput. I suggest you repeat the testing, but instead of stress mode, use the benchmark mode (easiest to use in 29.x). Is there a reduction in throughput with 32 workers compared to 16 that would explain the difference in power usage?

What codepath is it using when you disable AVX? I think, but am not 100% sure, that it'll probably use SSE of some form, which I don't think behaves much differently. If it falls back to x87, then HT on should be faster and therefore use more power.
 
That does sound odd, almost as if it's throttling or not hitting the higher turbo clocks when all threads are active. Are you monitoring the core speed? You can check for throttling using Intel XTU.
 
I'm not setting specifics in P95 as you are; I simply use the default small FFT for CPU testing

Very close. The small FFT test uses 8K to 16K by default. I just wanted it to be very consistent, to exclude any randomness from the test, so I set both MinTortureFFT and MaxTortureFFT to 8. But it should not be very different, I guess.
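For reference, the relevant bits end up roughly like this: the torture-test options go in prime.txt, and the AVX override is the commonly cited local.txt switch. Treat this as a sketch only; apart from MinTortureFFT/MaxTortureFFT, the option names are from memory and may differ between versions, and the # annotations are just notes here, not part of the real files.

prime.txt (torture test section):
TortureThreads=16    # 32 for the other run
MinTortureFFT=8
MaxTortureFFT=8
TortureMem=0         # in-place, "small FFT" style test

local.txt (override to fall back from the AVX code paths):
CpuSupportsAVX=0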

It may well be P95 and your lengths run better when the cache is split over 16 total threads than when HT is enabled?

That's an idea worth pondering, thanks. I just trusted Prime95's test description ("data fits into L2 cache"), but with HT and a worker per logical core, the cache usage doubles. On the other hand, 1 MB is a lot of cache, and I am using the smallest FFT size.

I did not try disabling HT, but I'll run that test too.
 
I did not try disabling HT, but I'll run that test too
ding ding ding!!!

So you cut back from 32 workers to 16 on a 16c/32t CPU. HT itself was never disabled. Why the results come out like that, though, I'm not sure. But actually disabling HT will lower temps and save power... at least in all the tests I've run over the years.
 
With AVX on, running one worker per core (with good affinity) gives the same performance/behaviour as the same number of workers with HT off. There isn't a throughput advantage for AVX when you exceed this, although if you have other things going on at the same time, it can dilute the reduction in performance. If you don't have good affinity, you lose about 10% throughput as Windows occasionally collides threads on the same core, which does not benefit these workloads.
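As a concrete illustration of what "good affinity" means here, a minimal Python sketch that builds an affinity mask for one worker per physical core. It assumes the common Windows enumeration where even-numbered logical processors (0, 2, 4, ...) are the first hardware thread of each physical core, so verify the mapping on the actual box before relying on it:

# one bit per physical core, skipping the HT sibling of each core
physical_cores = 16
mask = sum(1 << (2 * i) for i in range(physical_cores))
print(hex(mask))   # 0x55555555 here; usable e.g. with `start /affinity <hexmask> <program>` on Windows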

My memory is a bit rusty, but on (probably) a 6700K doing small FFT tasks (which fit in CPU cache), I observed no throughput difference between 4c/4t and 4c/8t running all threads, but the latter case was about 10 °C hotter. I don't recall the cooler I was using.

There is also a rule of thumb for the RAM requirements of FFTs: take the FFT size and multiply by 8 to get the RAM per worker, e.g. an 8K FFT = 64 KB of RAM per worker. There is also some extra static data, but I'm not sure how big it is. For most Intel CPUs, you get high performance as long as the total RAM requirement fits within the total L3 cache; beyond that you are often RAM-bandwidth limited. For CPU tests (not RAM dependent) I like to pick a bigger FFT size that fills up the cache some more. A 128K FFT per real core (1 MB each) works well, or 64K if you want to run two per core with HT on.
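A quick Python sketch of that rule of thumb, checking how the working set compares with the caches. The L2/L3 figures are my assumptions for the i9-7960X (1 MB L2 per core, 22 MB shared L3), and the extra static data is ignored:

def fft_ram_bytes(fft_size_k):
    # RAM per worker for an FFT length given in K, e.g. 8 -> 8K FFT
    return fft_size_k * 1024 * 8

L2_PER_CORE = 1 * 1024 * 1024   # assumed: 1 MB L2 per Skylake-X core
L3_TOTAL = 22 * 1024 * 1024     # assumed: 22 MB shared L3 on the 7960X

for workers, fft_k in [(16, 8), (32, 8), (16, 128), (32, 64)]:
    per_worker = fft_ram_bytes(fft_k)
    total = workers * per_worker
    print(f"{workers:2d} workers x {fft_k:3d}K FFT: {per_worker // 1024:4d} KB/worker, "
          f"{total // 1024:5d} KB total, fits in L3: {total <= L3_TOTAL}")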

Skylake-X is an exception to the above, with the new cache structure implemented on it. I don't understand it well yet, but many people have seen gains overclocking the cache clock. L3 usage is not the same as with previous CPUs and it may be safer to assume you want to sit in L2 for maximum performance.
 
You want max heat, and then you disable AVX?

I'm going to tune the AVX offsets separately. Some of our computations are AVX-heavy (LAPACK) and even use AVX-512 through MKL; others yield much less to SIMD. So I am thinking of tuning it well for both types of workload.

My experience is with AVX on only, so I'm not sure if it behaves differently with it off. I can't think of an obvious reason for the described effect unless your system is going into some kind of limiting with 32 workers that it isn't with 16. With AVX + HT on, the CPU does produce more heat but not more throughput. I suggest you repeat the testing, but instead of stress mode, use the benchmark mode (easiest to use in 29.x). Is there a reduction in throughput with 32 workers compared to 16 that would explain the difference in power usage?

What codepath is it using when you disable AVX?

The gwnum library uses certain code paths based on the presence of SSE2 and SSE4.1, I believe: https://github.com/rudimeier/mprime/blob/master/gwnum/

With AVX on, running one worker per core (with good affinity) gives the same performance/behaviour as the same number of workers with HT off. There isn't a throughput advantage for AVX when you exceed this, although if you have other things going on at the same time, it can dilute the reduction in performance. If you don't have good affinity, you lose about 10% throughput as Windows occasionally collides threads on the same core, which does not benefit these workloads.

My memory is a bit rusty, but on (probably) a 6700K doing small FFT tasks (which fit in CPU cache), I observed no throughput difference between 4c/4t and 4c/8t running all threads, but the latter case was about 10 °C hotter. I don't recall the cooler I was using.

Interesting, thanks. It did not occur to me to disable HT, but then our loads are quite uniform at any given time (a large computation is sharded, one Linux process per shard). Maybe I'll get better overall throughput from disabling HT and then using the freed-up thermal headroom for an extra couple hundred MHz.

I am doing the tuning in Windows, because of the tools, but I've disabled even EIST, not to mention Speed Shift, since Linux support for it is far from stellar.

Skylake-X is an exception to the above, with the new cache structure implemented on it. I don't understand it well yet, but many people have seen gains overclocking the cache clock. L3 usage is not the same as with previous CPUs and it may be safer to assume you want to sit in L2 for maximum performance.

Thanks for this heads-up too. I need to do more reading on this. Half of the computation time is in the MKL code, which is supposed to be optimized to a crisp, but then there is the other half.

Are you monitoring the core speed? You can check for throttling using Intel XTU

I am using HWiNFO. Are there any known issues with it? The bar graphs are all even at max load, and I am also dumping the CSV report one row per second and churning through it afterwards (I'm kind of habitually crunching numbers :))
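Something like this (Python, pandas) is what the churning looks like. A minimal sketch only: the HWiNFO column names vary by version and CPU, so the headers below are assumptions to adjust against the actual log:

import pandas as pd

log = pd.read_csv("hwinfo_log.csv", encoding="latin-1")   # HWiNFO logs are not always UTF-8

# assumed column names -- check the actual CSV header
power = pd.to_numeric(log["CPU Package Power [W]"], errors="coerce").dropna()
core_temps = log[[c for c in log.columns if "Core" in c and "C]" in c]].apply(pd.to_numeric, errors="coerce")

print(f"Package power: {power.mean():.1f} +/- {power.std():.1f} W")
print(f"Hottest core sample: {core_temps.max().max():.1f} C")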
 
You and mackrel are two peas in a pod. :) :thup:

If anyone knows Prime95's minutia, you have the right guy here!!
 
I'm going to tune the AVX offsets separately. Some of our computations are AVX-heavy (LAPACK) and even use AVX-512 through MKL; others yield much less to SIMD. So I am thinking of tuning it well for both types of workload.
This is sounding more like HPC-type workloads... I don't have any experience in that area, other than knowing that my hardware requirements for running Prime95 and similar software share more with HPC than with typical consumer choices (RAM bandwidth!!!).

Interesting, thanks. It did not occur to me to disable HT, but then our loads are quite uniform at any given time (a large computation is sharded, one Linux process per shard). Maybe I'll get better overall throughput from disabling HT and then using the freed-up thermal headroom for an extra couple hundred MHz.

I am doing the tuning in Windows, because of the tools, but I've disabled even EIST, not to mention Speed Shift, since Linux support for it is far from stellar.

Whether HT is worth it will depend on the task. HT-friendly tasks can see up to a 50% throughput improvement, but some, like Prime95 (in FMA3 mode), see practically zero improvement; at least, nothing significant beyond measurement tolerances. The best thing to do is test it: run all threads with HT on and off and look at the performance of the workload. Do you get better throughput with HT on? If you have a power meter, you can also look at that while it is running. These are inexpensive devices, so it's worth getting one regardless.
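If it helps, the comparison boils down to something like this small Python sketch: take the benchmark throughput (total iterations/sec across workers) and the wall power for the HT-off and HT-on runs, then compare relative throughput and work per joule. The numbers here are placeholders, not measurements:

runs = {
    "HT off, 16 workers": {"iters_per_sec": 1000.0, "watts": 250.0},   # placeholder
    "HT on, 32 workers":  {"iters_per_sec": 1010.0, "watts": 270.0},   # placeholder
}

base = runs["HT off, 16 workers"]["iters_per_sec"]
for name, r in runs.items():
    rel = r["iters_per_sec"] / base
    per_joule = r["iters_per_sec"] / r["watts"]
    print(f"{name}: {rel:.1%} of HT-off throughput, {per_joule:.2f} iterations per joule")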

I am using HWiNFO. Are there any known issues with it? The bar graphs are all even at max load, and I am also dumping the CSV report one row per second and churning through it afterwards (I'm kind of habitually crunching numbers :))

It picks up the numbers from somewhere and displays them. If it looks wrong, it probably is. Sometimes you might see nonsense temperatures on some sensors, or the CPU reports a very low power, which I've heard (but not verified) can be related to power settings when overclocking.
 