
FRONTPAGE Bulldozer Architecture Explained


Overclockers.com — Member, joined Nov 1, 1998
George Harris is one of our most active Senior Members at the Overclockers.com Forums, as well as a frequent contributor to articles on the site. As a fifth year undergraduate at Missouri University of Science and Technology studying Computer Engineering (with an emphasis on Computer Architecture), he has both the knowledge and insight to bring us a more technical view on the Bulldozer architecture. This article, while at first glance quite wordy, will allow us all to gain deeper insight into what makes the (admittedly rather underwhelming) Bulldozer architecture 'tick'.

[Image: AMD FX tech deck slide]

... Return to article to continue reading.
 
Dolk; good read. I agree about the FP and I was actually trying to bring that up the other night (face to face discussion with edumicated ppl) but decided to drop it so as not to cause confusion. Sometimes I feel like OCF, TR and some other forums are the only places I am understood; most of the time anyway.

I made this post somewhere else this morning:

The issue here is that the architecture and software do not mesh, according to AMD. To me it is no better than HT in some cases and worse in others. In cases where it is better than the iX, it is most likely software that does not take full advantage of the FPU.

In essence I am saying that in heavy float-op work it is no better than a quad, but in straight integer ops it acts like an octo with a limited FSB.
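A crude probe like this Python sketch (loop counts and worker counts are just placeholders, not a real benchmark) is what I would run to see it: pure integer workers should keep scaling past four, while pure float workers should flatten out once the shared FPUs are saturated.

```python
# Crude scaling probe: run the same busy loop with N worker processes,
# once with integer math and once with floating-point math, and compare
# wall time. On a chip with one FPU shared per two-core module you would
# expect the FP case to stop scaling once every module's FPU is busy.
import time
from multiprocessing import Pool

ITERS = 2_000_000  # arbitrary placeholder workload size

def int_work(_):
    x = 0
    for i in range(ITERS):
        x = (x + i) ^ (x >> 3)          # pure integer ops
    return x

def fp_work(_):
    x = 1.0
    for _ in range(ITERS):
        x = x * 1.0000001 + 0.5         # pure floating-point ops
    return x

def timed(fn, workers):
    start = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(fn, range(workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        print(f"{workers} workers  int: {timed(int_work, workers):.2f}s"
              f"  fp: {timed(fp_work, workers):.2f}s")
```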
 
We're hitting the wall harder than a drunk Sailor on a Friday night, man.

Anybody else notice the actual visible improvements in CPU performance getting smaller with each gen?
Or not existing?

Come on, graphene...
 

And what do you mean by visible improvements? Compute power goes up, software demands go up, programs get sloppy because of the immense amount of code, memory requirements go up, and programs get larger to take advantage of the memory size and speed as well as the compute power, so storage speed and size go up too. I would not trade the PC I have today (the slowest) for the best I had 10 years ago.

If you DC or encode quite a bit, and a few other things, you can see and feel the powar :) I still want moar powar. I want it!!! Give it to me now; I demand it!!! :rain: OOPS :( Be careful what you ask for; lightning does strike :(

Dolk: Did you get any insight into the cache issues? I just feel that AMD's cache scheme is ineffective. It seems that the L3 offers little gain to the average user and that a large L2 is where it is at for them (speaking about the move from winzer to brizzy to PhI to PhII to AthII (winzer 2.0 on the duals) x4 and so on). You can see that the Windsor had a great run, but the brizzy was lacking.
 
We're hitting the wall harder than a drunk Sailor on a Friday night, man.

Anybody else notice the actual visible improvements in CPU performance getting smaller with each gen?
Or not existing?

Come on, graphene...

I feel the same way. My E6750 felt the same as my i7 920. Synthetic benchmarks are the only place I saw improvements. I don't really do anything CPU intensive.

But the jump from P4 -> E6750 was massive, or so it felt.

SSDs were the huge jump in performance that I was looking for. I wonder when something else will come out that will make as much of a noticeable improvement as SSDs did.
 
@archer, sorry for the late reply. I have become very busy as of late. I do not believe there is a cache problem, or at least not a problem with the architectural plan for the cache. There may have been something wrong with the execution, but I cannot fully determine that without a full disclosure from AMD.

The L3 cache has greatly improved, and if AMD had kept their original plan of having a single L3 cache, then we would have seen far less gain from the Phenom II to the BD architecture. You have to think of the cache system in BD this way: the L1 is only for each core to store its own data; the L2 is for the cores to talk to one another inside their respective module; and the L3 is for the modules to talk to each other inside the CPU. Each cache gets all the information within its hierarchical set. That means the L2 will always contain the information from all L1 caches in the module, and the L3 will always contain the information from all L2 caches in the CPU.

Again, how this idea was executed is unknown to me. It is more guesswork without actually knowing the full details of each segment of the produced architecture. My only guess is that the cache system is being held back by the CMOS layout setup that AMD decided to go with (the Auto layout).
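To make the inclusive idea above concrete, here is a toy Python model of my own (a deliberate simplification, not AMD's actual implementation): every line a core touches also gets filled into its module's L2 and the chip's L3, so a sibling core or another module can always find it one level up.

```python
# Toy model of an inclusive L1/L2/L3 hierarchy as described above:
# L1 is private to a core, L2 is shared by the two cores of a module,
# L3 is shared by all modules on the chip. A fill at L1 also fills L2
# and L3, so higher levels always contain what the lower levels hold.
# This ignores capacity, eviction, and coherence entirely.

class ToyChip:
    def __init__(self, modules=4, cores_per_module=2):
        self.l1 = {}                       # (module, core) -> set of addresses
        self.l2 = {}                       # module -> set of addresses
        self.l3 = set()                    # chip-wide
        for m in range(modules):
            self.l2[m] = set()
            for c in range(cores_per_module):
                self.l1[(m, c)] = set()

    def access(self, module, core, addr):
        """Return which level served the access, filling every level on the way."""
        if addr in self.l1[(module, core)]:
            level = "L1"
        elif addr in self.l2[module]:
            level = "L2"                   # data brought in by the sibling core
        elif addr in self.l3:
            level = "L3"                   # data brought in by another module
        else:
            level = "memory"
        # inclusive fill: every level on the path now holds the line
        self.l1[(module, core)].add(addr)
        self.l2[module].add(addr)
        self.l3.add(addr)
        return level

chip = ToyChip()
print(chip.access(0, 0, 0x40))   # memory (cold miss)
print(chip.access(0, 1, 0x40))   # L2 (sibling core in the same module)
print(chip.access(1, 0, 0x40))   # L3 (different module)
```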
 

You know, I think they would let you in. In a way, by explaining some of these things in the manner you have, you have done them a service. I personally want to see the white papers :)

Well, a module-level cache is good, but I just have to speculate on the actual efficiency of the entire setup. They need another float unit in there, and IMHO it can cause unnecessary WS and flushes of data that has waited around for too long. If you are running a math-intensive (float and int mix) program, I can see that there could be a substantial gain, but I can also see that if the instructions were not written (compiled) in such a way that they specifically take advantage of the design, things could be worse than on a traditional quad.

I am beginning to think that the Intel fake cores (at least the way they do it) could be inferior to this (and should be), but only if they can get things hammered out for PD. This design is good and it is an advancement, but I would just feel like an early adopter of Win 95 waiting for a patch if I had jumped on this thing.
 
At the time BD was first proposed, the trend for programmers was pushing FP onto the GPU rather than the CPU. The CPU was not strong enough. In the five years of development since, we have seen that the CPU can do some good FP calculations with enough resources.

If you look at Fusion and BD, you may see something of what is coming in the future. Having an on-board GPU will allow FP calculations to be pushed permanently onto the GPU, and not the CPU. That way the CPU can go back to its task of multi-threaded integer and memory work.

Furthermore, if we move over to the server side, most of the calculations there do not involve FP. If they do, those servers usually have CUDA clusters or vector processors to help with the efficiency.

Yeah, BD will not be good at breaking benchmarks, but maybe it's time that we changed how we bench our CPUs? Just like the time when we changed our benchmarks from single-threaded to multi-threaded, or 32-bit to 64-bit. Maybe it's time we go to only multi-threaded integer and memory.

Hmm, I should have stated that in my article; that would have been a good paragraph.
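Something like this rough Python sketch is the direction I mean for a "multi-threaded integer and memory only" score; the buffer size, pass count, worker count, and reported figure are arbitrary placeholders, not a proposed standard.

```python
# Sketch of a "multi-threaded integer and memory only" score: each worker
# streams CRC32 over a private buffer, which exercises the integer ALUs and
# the memory subsystem while doing no floating point at all.
import time
import zlib
from multiprocessing import Pool, cpu_count

BUF_MB = 64      # per-worker buffer size
PASSES = 8       # how many times each worker streams its buffer

def worker(seed):
    buf = bytes(BUF_MB * 1024 * 1024)   # zero-filled buffer to stream over
    crc = seed
    for _ in range(PASSES):
        crc = zlib.crc32(buf, crc)      # integer + memory bound, no FP
    return crc

if __name__ == "__main__":
    workers = cpu_count()
    start = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(worker, range(workers))
    elapsed = time.perf_counter() - start
    total_mb = workers * PASSES * BUF_MB
    print(f"{workers} workers: {total_mb / elapsed:.0f} MB/s integer+memory throughput")
```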
 

Well, I look more at how much work a CPU can do, so I am with you on changing some standards to represent what the processor actually does as far as work unit type. Unfortunately, you fall into that entire class of ppl who get a bad rap for being partial when you point out facts that they refuse to consider.

In the end I don't think eliminating benches and saying "we don't like your tests" is the solution. I think expanding the benches and evaluating every year what direction the total package is going in, by looking at usage models, software sales, software in development and task environment, might be a great solution. My i5 can crunch and fold better than my PhII, but when running I really cannot tell the difference. Honestly, these days we need to stop looking so hard for stellar performance and focus on the deficiencies. Any mainstream desktop CPU manufactured today will play games as well as any other depending on the system setup (I should say any quad), as long as you have the supporting hardware. People can say all they want, but if I felt my i processors were any faster in day-to-day use in my 24/7 rig (daily driver), I would throw out every AMD system in the house. It is just not the case.
 
The problem is that we are still figuring out our limits with the CPU. As of right now for gaming and general usage of the CPU, we have not seen progression since we started working in multi-threading.

CPU architects are still trying to find the best combination with the resources they have. In the past year or two, we have seen that increasing cores is done; we have exploited that technique to its max. The next step is exploiting each core to its max, and that is what we are currently doing. Once we find that, we will move on to other tasks: maybe different types of doping of the silicon, different kinds of transistors, new memory architectures, or finally going 128-bit (which will be sooner than some of you can imagine).
 
The problem is that we are still figuring out our limits with the CPU. As of right now for gaming and general usage of the CPU, we have not seen progression since we started working in multi-threading.

The way I understand what you are saying, I cannot agree. Run today's highest-end games on the first batch of duals. We have come a long way.

The raw compute power has increased dramatically as well.

What needs to be considered, as I have said before, is the entire package. Just think of some of the architectural changes to the CPU that have not involved compute power: the APU that you have mentioned already, new SIMD instructions, moving many of the NB components onto the CPU to lessen the bottlenecking, and a few other things.

I see the progression that you say has not been realized. Killing the FSB is a huge jump for heavy background multitasking for the general-purpose user who might want to DC. The memory controller's ability to run faster memory...

CPU architects are still trying to find the best combination with the resources they have. In the past year or two, we have seen that increasing cores is done; we have exploited that technique to its max. The next step is exploiting each core to its max, and that is what we are currently doing. Once we find that, we will move on to other tasks: maybe different types of doping of the silicon, different kinds of transistors, new memory architectures, or finally going 128-bit (which will be sooner than some of you can imagine).

Your argument seems to contradict itself. You said we are not reaching potential, yet you almost sound happy about 128-bit.

Not that 128-bit is bad, but to me it just makes for bigger, sloppier programs with sloppy, generic coding.

I personally do agree with the theme of your post: it is the software that has to take advantage of the hardware, and do it efficiently.

Also, I was making a case that benches could be revamped with usage models that are more representative of today's average user. With FB and all of this social garbage and streaming and blah, blah, we cannot readily say that a CPU is crap when it is just as fast at FB as the rest of the pack.

Yeah, gaming benches show this and that, but unless I am playing a game where frame rates give a specific advantage, 80 FPS or 200 FPS makes no difference. I want to see the parts that stress the CPU revealed. I mean, who cares if you max at 200 FPS and the other guy maxes at 180, when he bottoms out at 60 FPS and you bottom out at 40 FPS, and it is consistent across video cards? How well is the CPU handling heavy bot AI?

I would also like to see timed tests and real workloads used.
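For example, a tiny Python helper like this (the sample frame times are made up) would report the numbers I actually care about: average, minimum, and 1% low FPS rather than the peak.

```python
# Given per-frame times in milliseconds from a benchmark run, report
# average, minimum, and "1% low" FPS instead of just the peak.
def fps_report(frame_times_ms):
    frames = sorted(frame_times_ms, reverse=True)   # slowest frames first
    avg_fps = 1000.0 * len(frames) / sum(frames)
    min_fps = 1000.0 / frames[0]                    # worst single frame
    worst_1pct = frames[:max(1, len(frames) // 100)]
    low_1pct_fps = 1000.0 * len(worst_1pct) / sum(worst_1pct)
    return avg_fps, min_fps, low_1pct_fps

# Made-up run: mostly 200 FPS with a handful of 40 FPS dips.
sample = [5.0] * 900 + [12.5] * 90 + [25.0] * 10
avg, mn, low1 = fps_report(sample)
print(f"avg {avg:.0f} FPS, min {mn:.0f} FPS, 1% low {low1:.0f} FPS")
```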
 
I hope that this was some type of typo on your part because it (obviously) couldn't be further from the truth:

This is the first time that AMD has implemented branch prediction.
 
Nice write up! :thup:

One comment - in your article you state:
"As a side note, it was actually the multi-cores that came before the paralleled software itself. It was a last ditch effort of the computer architects to keep an aging technology growing and relevant to newer, more advanced computers."

I think you should qualify this as "for Windows PCs", because parallel software has been around for decades. It is true that writing parallel code is newer to Windows PCs. This is mostly because Microsoft never really took it seriously.

I used parallel software techniques in the code I wrote on UNIX systems back in the '90s, and it goes back further than that. Software running on supercomputers is parallelized. Windows PCs got stuck in the "one CPU does it all" model because that's what the first IBM Personal Computers did (the forefathers of modern PCs). They could just as easily have had dual or more CPUs. It's too bad that we didn't switch over to parallelized software sooner, when the first multi-core consumer PCs came out.

All the above said, I do hope that we get better software that can actually use all these wonderful new cores/nodes!
 
@Gautam, This is true in a sense. AMD has used a different type of branch prediction before, but the one they have implemented for BD is much different than usual. Let me look through my notes to see where I got this quote.

@Owenator, again also true. I should have stated this for the PC side. Servers and workstations have been using parallel coding for some time.
 
@Gautam, This is true in a sense.

No, it's not true in any sense. Branch prediction has been around for decades (longer than you or I have). AMD has simply moved away from a more traditional hierarchical BP to one that's decoupled from fetch and uses branch fusion. There's a world of difference between improving their prediction and using it for the first time (which was probably some time in the early '90s, with whatever it was they were pitting against the Pentium).
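For anyone following along, this is roughly what the decades-old textbook version looks like, a 2-bit saturating counter predictor sketched in Python; it is a generic illustration, not BD's decoupled, branch-fusing design.

```python
# Classic 2-bit saturating counter predictor, indexed by branch address:
# each counter moves between 0 (strongly not-taken) and 3 (strongly taken),
# and the branch is predicted taken when its counter is 2 or 3.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [2] * entries            # start weakly taken

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

# A 10-iteration loop branch mispredicts only once per pass (at the loop
# exit) once the counter is warmed up.
bp = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3         # three passes through the loop
correct = 0
for taken in outcomes:
    if bp.predict(0x400) == taken:
        correct += 1
    bp.update(0x400, taken)
print(f"{correct}/{len(outcomes)} branches predicted correctly")   # 27/30
```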
 
Yeah, I am going to take back what I said; I really have no idea why I put that statement in. I went through all my notes and couldn't find anything to back up that statement. Let me re-write that. Thanks for the catch, I'll update it to be accurate.
 