
BD and PD understanding the shared FPU


Zerix01

Member
Joined
Mar 12, 2007
I just upgraded from a 1055T to an FX-8320. I run F@H most of the time on this system, so that is my basis for benchmarking.

I was wondering how the shared FPU handles work between the two integer threads. If I understand correctly, the FPU can handle four 64-bit, two 128-bit, or one 256-bit operation per clock. Now if only one thread is running on a particular module, shouldn't it have access to, let's say, all four 64-bit or both 128-bit parts of the FPU?

I have been running F@H with this chip clocked at 4.4GHz. With eight threads I get 27,100 PPD. With four threads I get about 15,000 PPD. That is quite a drop in points. So is this just the way F@H is programmed, or am I confused about how the shared FPU works in this architecture?
 
Interesting. If I am following certain parts correctly, that means one "core" has a hard time feeding the FPU unless it is a 256-bit instruction. So two threads running on a module actually helps FPU utilization, as opposed to competing for it like is commonly portrayed online. Also, F@H uses single-precision SSE instructions, so I think a lot of this design is not being utilized to its fullest.
 
You are likely accurate that some of the FX is not used to the extent it might be. It is hard to get a 'true' picture of what goes on inside BD or PD. Most of the write-ups that tell much run deep, and once you do get to some understanding, it seems Windows just does not use a BD or PD as it could be used, and may never.
 
You are correct in that the FPU in each module can be used as a single 256-bit FPU or two 128-bit FPUs. To keep it simple, the reason you don't see performance as strong as you'd expect from 8x 128-bit FPUs is that the "front end" of the module that assigns instructions to a particular FPU is shared between both 128-bit FPUs when they are acting as such. Unfortunately, this front end can only assign one instruction per cycle, so if both cores have instructions to issue, one core gets its instructions first, then the other. In a true dual-core scenario, both cores could receive their instructions at the same time.

This doesn't mean you get half the performance, since it isn't all that common for both FPUs to need instructions assigned at the same time, but it does happen. AMD estimates that this, and a few other factors, account for roughly a 20% loss from using modules; this is shown in RGone's last link. It's actually a lot more complicated than this (as you can imagine) and there are other factors to consider (cache sharing, branch prediction, etc.), but that's a basic explanation.
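If it helps to picture it, here is a toy Python model of that shared front end. It is purely illustrative (the probabilities and numbers are made up, not measured on real hardware): each cycle, each of the two cores in a module has an FP instruction ready with probability p, but the shared front end can dispatch only one instruction per cycle, so simultaneous requests spill into later cycles.

```python
# Toy model of a shared front end that can assign only ONE FP
# instruction per cycle to a module's two 128-bit FPUs.
# ASSUMPTION: p, the chance a core has an FP instruction ready in a
# given cycle, is an invented parameter for illustration only.
import random

def lost_throughput(p, cycles=100_000, seed=1):
    """Fraction of requested FP work left undispatched after `cycles`."""
    random.seed(seed)
    pending = 0          # instructions queued because of the conflict
    dispatched = wanted = 0
    for _ in range(cycles):
        ready = (random.random() < p) + (random.random() < p)  # 0, 1, or 2
        wanted += ready
        pending += ready
        if pending:
            pending -= 1   # the front end issues at most one per cycle
            dispatched += 1
    return 1 - dispatched / wanted

# Both cores wanting the FPU every cycle hits the worst case (~50% lost);
# lighter FP pressure loses almost nothing, matching the point above that
# the conflict only costs you when both cores compete at once.
print(lost_throughput(1.0))   # worst case, both cores always ready
print(lost_throughput(0.2))   # light FP load, near-zero loss
```

With p near 1 the loss approaches 50%, and with low p it is negligible, which is why the real-world penalty AMD quotes is closer to 20% than to half.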

What this means to the end user is that in almost every case, it is better to assign one thread per module until more threads are needed. If all the cores can be utilized, it should always be better to have all cores active than to restrict the OS scheduler to one thread per module.
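For anyone who wants to try the one-thread-per-module layout by hand, here is a small Linux-only sketch using Python's `os.sched_setaffinity`. ASSUMPTION: the kernel enumerates a module's paired cores as (0,1), (2,3), (4,5), (6,7), which is typical for FX but worth confirming against the "core id" lines in /proc/cpuinfo on your own box.

```python
# Sketch: pin the current process to one core per module on an FX-8xxx.
# ASSUMPTION: cores (0,1), (2,3), (4,5), (6,7) share a module, so the
# even-numbered CPUs give you one core from each module.
import os

def one_core_per_module(n_cpus):
    """Return the even-numbered CPUs: one from each two-core module."""
    return set(range(0, n_cpus, 2))

if __name__ == "__main__":
    cpus = one_core_per_module(8)          # FX-8320: 8 logical cores
    print(sorted(cpus))                    # [0, 2, 4, 6]
    if hasattr(os, "sched_setaffinity"):   # available on Linux only
        os.sched_setaffinity(0, cpus)      # 0 = this process
        print(sorted(os.sched_getaffinity(0)))
```

The same effect can be had from the shell with `taskset -c 0,2,4,6 <command>`; either way, a four-thread job then lands on four separate modules instead of sharing front ends.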
 