Bulldozer Architecture Explained

October 20, 2011 George Harris Editorials, Processors

George Harris is one of our most active Senior Members at the Overclockers.com Forums, as well as a frequent contributor to articles on the site. As a fifth year undergraduate at Missouri University of Science and Technology studying Computer Engineering (with an emphasis on Computer Architecture), he has both the knowledge and insight to bring us a more technical view on the Bulldozer architecture. This article, while at first glance quite wordy, will allow us all to gain deeper insight into what makes the (admittedly rather underwhelming) Bulldozer architecture ‘tick’. — David

Most likely you have just finished looking at all the reviews of the Bulldozer architecture CPU, the “Zambezi” FX-8150. Some of you are probably wondering whether they are all correct, or whether each and every reviewer was an idiot when working with the chip. The honest truth is that the reviews are right: the Bulldozer architecture has the reported level of performance. People have asked about microcode updates and BIOS changes, but those will not significantly change the performance Bulldozer offers. This article aims to give a brief, simplified explanation of the approach Bulldozer has taken and why its performance is what it is. Some of my examples will use complicated terms that you may not understand right away. In the future, I will be posting a paper that looks at CPU architecture in a simple way, but here I will try to explain everything in the easiest form without going too far in depth.

The CPU

To start off, we have to understand what a CPU is in its general form. Going back to the single core CPUs we see the essence of what a CPU actually is: a dumb machine that takes inputs and creates outputs based on a certain set of algorithms laid out by intricate logic networks. To break that down, you can think of the CPU as a city, and the roads are the logic networks. The roads guide the cars to their destinations. Probably not the best explanation, but it gives you a sense of what I am talking about.

The CPU has become more and more complicated each year as more features get integrated onto a single die. Let’s consider the original CPU: we have a CPU that has a single core, a memory subsystem, and an I/O subsystem. The core takes care of the actual execution of the information passed to it. The memory subsystem takes care of controlling the RAM and L3 cache memory. The I/O handles information being sent to the Northbridge and any other directly connected ICs.

The core has 5 basic stages: Fetch, Decode, Execute, Memory, and Write Back. The Fetch portion takes in an instruction and places it into a buffer. For this example we will use the Reorder Buffer (RB). The RB looks at the information passed in by the Fetch and constructs an ordered table of which instructions are to be executed from first to last. A perfect example of this is an instruction set that is contained in a “for” loop. The RB can look at the loop and determine how to unravel it so that the execution process will be the quickest and safest for the CPU. Once an instruction is ready in the RB, the Decoder unit will take that instruction and dissect it. Before the execution of the instruction, the CPU needs to know what type of instruction it has. Is the instruction FP, Integer, SSE2, etc? With the decoder knowing which instruction it has, it can reserve an execution unit for the instruction.
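To make the five stages concrete, here is a minimal Python sketch of an idealized pipeline. It is a toy model built on assumptions that are not in the article: instructions run strictly in order, one instruction enters Fetch each cycle, and nothing ever stalls. The names `pipeline_timeline` and `total_cycles` are made up for this illustration, and the reordering done by the RB is deliberately left out.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write Back"]


def pipeline_timeline(instructions):
    """Return {cycle: {stage: instruction}} for an ideal in-order pipeline.

    Assumption: one instruction enters Fetch per cycle and every
    instruction advances exactly one stage per cycle (no stalls).
    """
    timeline = {}
    for index, instr in enumerate(instructions):
        for stage_number, stage in enumerate(STAGES):
            cycle = index + stage_number + 1  # cycles are counted from 1
            timeline.setdefault(cycle, {})[stage] = instr
    return timeline


def total_cycles(instruction_count):
    """Cycles needed to drain the ideal pipeline."""
    if instruction_count == 0:
        return 0
    # The first instruction needs one cycle per stage; each later
    # instruction finishes one cycle after the one before it.
    return len(STAGES) + instruction_count - 1


if __name__ == "__main__":
    program = ["load x", "add y = x + 2", "store y"]
    for cycle, stages in sorted(pipeline_timeline(program).items()):
        print(cycle, stages)
```

The point of the sketch is only that several instructions occupy different stages in the same cycle, which is why a full pipeline finishes roughly one instruction per cycle even though each instruction spends five cycles in flight.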

Once an execution unit is free, the Decoder will send the instruction along, and the execution unit will execute the instruction. After this is done, the CPU needs to know the result of the instruction just in case another instruction relies on that instruction to produce an answer. For example, something like y=x+2 will require two instructions, one to solve x, and another to solve y. Now, y cannot be solved until x is solved, thus the CPU needs to know x after it is executed to solve for y. There are several techniques to get this task done – one of the oldest is to store the information into the L1 cache and have it on hand for later. Other ways involve small buffers near the front end that can quickly look up recently executed information. At the very end of all this, the executed instruction is stored throughout the memory subsystem. The L1, L2, L3, and RAM will each have a copy for a certain amount of time. The memory subsystem controls this; for now we will not go too in depth with this.
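The y=x+2 dependency can be sketched in a few lines of Python. This is an illustration only, not how any real CPU tracks dependencies: the `Instr` record, the `issue_waves` function and the three-instruction program are all hypothetical. Instructions are grouped into "waves", where a wave holds every instruction whose inputs are already known.

```python
from collections import namedtuple

# Hypothetical instruction record: the value it writes and the values it reads.
Instr = namedtuple("Instr", ["text", "dest", "sources"])


def issue_waves(instructions, initial_values=()):
    """Group instructions into waves that could execute together.

    An instruction joins a wave only when every value it reads is
    already available, either initially or from an earlier wave.
    """
    available = set(initial_values)
    pending = list(instructions)
    waves = []
    while pending:
        ready = [i for i in pending if all(s in available for s in i.sources)]
        if not ready:
            raise ValueError("an instruction waits on a value nobody produces")
        waves.append([i.text for i in ready])
        available.update(i.dest for i in ready)
        pending = [i for i in pending if i not in ready]
    return waves


PROGRAM = [
    Instr("x = a * 3", "x", ("a",)),
    Instr("y = x + 2", "y", ("x",)),  # must wait for x
    Instr("z = a - 1", "z", ("a",)),  # independent of x and y
]

if __name__ == "__main__":
    print(issue_waves(PROGRAM, initial_values={"a"}))
```

With `a` known up front, the sketch places `x` and `z` in the first wave and `y` in the second, which is the "y cannot be solved until x is solved" rule from the paragraph above.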

This represents the absolute basics of a CPU and how it works. I could easily write a book on the rest of the information, but you are here to learn more about Bulldozer and the results it has produced. Again, before we get to that, we have to look at one more piece of history. Yeah, hate me now, but knowing the past creates an understandable present and future.

Evolution to the Multi-Core CPU

As of right now, we are hitting the limits of Moore’s Law, which pretty much states that the number of transistors that fit on a chip doubles roughly every two years as transistors shrink. This relation allows for more and more complicated CPUs. In most forms, these complicated CPUs are just CPUs with crazy amounts of cores. In the world we live in, we are finding it easier to parallelize our programs so that they can be executed faster. As a side note, the multi-core CPUs actually came before the parallelized software itself. They were a last-ditch effort by computer architects to keep an aging technology growing and relevant to newer, more advanced computers.
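As a rough worked example, Moore’s Law is commonly summarized as the transistor count on a chip doubling about every two years. The Python sketch below just applies that doubling rule; the two-year period is the usual rule of thumb, and the starting count in the usage line is an arbitrary illustrative number, not a figure for any real chip.

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Project a transistor count assuming steady doubling.

    Assumption: the count doubles once every `doubling_period_years`.
    """
    return initial_count * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Arbitrary starting point of one million transistors, ten years out.
    print(projected_transistors(1_000_000, 10))
```

Ten years at a two-year doubling period is five doublings, so the count grows by a factor of 32. That extra transistor budget is what ends up being spent on additional cores.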

Now, since we are stuck with this technology, computer architects are trying to figure out the best way to implement it. How many cores do we really need, how do we exploit these cores to always have them fully utilized? Do we add in threads on top of our cores, or add more cores to handle threads? These are the questions that computer architects are exploring right now. As we already have seen, Intel and AMD have chosen separate ways to answer this question. Intel has chosen the approach of logically filling up their cores with enough instructions so that the execution units are always being used. AMD has decided to throw more cores at the problem and somewhat brute force their way into executing as many instructions as possible.

Definition of a Core

Now before I continue, I have to mention the definition of a core, because Intel and AMD are starting to skew the term to distract the customers. A core contains the 5 basic stages of a CPU that I have already gone over: Fetch, Decode, Execute, Memory, and Write Back. Intel and AMD both say they have 2, 4, 6, 8, etc. cores. In some ways this is true, especially on the server side. On the desktop side we do not see this: Intel currently has the 2600K which has 4 cores and 8 threads. This is the proper way of stating the CPU’s identity. AMD’s FX-8150 has 4 modules, 8 cores. This is the wrong way to state the CPU’s identity. Well kind of…

AMD has approached the idea of manipulating each core by adding in more execution units and a completely separate thread, based on hardware, to each core. Intel has approached the idea of manipulating each core by adding in a beefier front end and allowing a separate thread to run in a core. Alright, so what does this exactly mean? AMD has taken the hardware approach, and Intel the virtual route. So when AMD says that the FX-8150 is an 8-core CPU, it really is not. Even though it has two separate execution units and L1 cache, the module is still sharing the front end and the back end. That means that a Module is really a single core with two threads.

The Module system is exactly like Intel’s Hyperthreading system, but completely not. Both systems implement a second thread in a single core. How each company handles that thread is what makes the systems different.

Intel’s Thoughts

If we go back to our basic CPU, we see that the execution units are not always being utilized. Intel saw this and decided that they could branch their front end so that it can bring in more instructions, but keep it separate from the core itself. These instructions are executed in the core, and even written back inside the core. In a lot of ways this is a great idea, but there is one huge problem: what happens when the core is using an execution unit, and a thread is told that the execution unit can be used? A huge slowdown is created which can cause an entire instruction set to be stalled for some period, or reproduced and executed again. Another downside is that the order of instructions has to be perfect so that both the thread and the core can each complete their set of instructions in a specific time. You do not want to have one of these waiting for units to be free: it defeats the purpose of having a thread on top of a core.
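The trade-off described above can be sketched with a toy Python model. Everything here is a simplifying assumption rather than Intel’s actual design: the core has a fixed number of execution units per cycle, the primary thread is served first, the second thread only gets whatever is left over, and unmet demand is simply counted instead of being replayed. The function name and the demand lists are hypothetical.

```python
def share_execution_units(unit_count, primary_demand, secondary_demand):
    """Return (utilization, secondary_shortfall) for two threads on one core.

    Each demand list gives how many execution units a thread wants in
    each cycle. The primary thread is served first; the secondary thread
    fills whatever units remain.
    """
    used = 0
    secondary_shortfall = 0
    cycles = 0
    for want_primary, want_secondary in zip(primary_demand, secondary_demand):
        got_primary = min(want_primary, unit_count)
        free = unit_count - got_primary
        got_secondary = min(want_secondary, free)
        used += got_primary + got_secondary
        secondary_shortfall += want_secondary - got_secondary
        cycles += 1
    utilization = used / (unit_count * cycles) if cycles else 0.0
    return utilization, secondary_shortfall


if __name__ == "__main__":
    alone = share_execution_units(4, [2, 4, 1], [0, 0, 0])
    shared = share_execution_units(4, [2, 4, 1], [2, 2, 2])
    print("one thread :", alone)
    print("two threads:", shared)
```

In this made-up workload the second thread lifts utilization from 7 of 12 unit-slots to 11 of 12, but in the cycle where the primary thread wants all four units the second thread gets nothing. That is the "waiting for units to be free" case the paragraph warns about.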

AMD’s Thoughts

For the longest time, AMD has thought that adding more cores will create a faster CPU. The thinking is pretty simple: look at the negatives of threading, namely execution units filling up and the drastic penalties that follow, and avoid them by adding physical cores instead. The increase in the number of physical cores allows these negatives to disappear altogether. If we go back to our original CPU, we do not fully utilize the execution units. The question now is: how do we create each core so that it does not have an excess of execution units, but enough so that any large unraveling of loops can be executed in a timely manner? AMD has done a pretty good job in this market. The Thuban is an excellent example of the potential of “cores only” CPUs.

AMD faced a new problem after the Thuban showed that increasing the number of cores would not produce a huge gain in performance on the desktop side. The next logical step was to go the way of Intel, but still keep their roots. This is where the idea of the modules came into play. The idea is to create a beefier middle end, and have a very intelligent front end and memory subsystem to control the instructions. Right now, AMD has to play catch up with Intel and the architecture implemented in the front end. Since there are so many execution units to fill up, the front end has to take in enough instructions and the decoder has to assign enough instructions so that the units are fully utilized, but all the same the RB has to be able to make sure that the instructions are executed in a manner that is best for the CPU. This is very complicated, and requires many stages in the CPU for all of this to work out. Furthermore, since the Module has to take in two separate threads, and assign them as a normal process, the front end has to make sure no conflict arises in the distribution of registers and resources.

AMD Bulldozer Architecture

Now we can jump into the architecture of Bulldozer. To begin we look at the front end of the module, the Fetch. Bulldozer’s front end includes branch prediction, instruction fetching, instruction decoding and macro-op dispatch. These stages are effectively multi-threaded with single cycle switching between threads. The arbitration between the two cores is determined by a number of factors including fairness, pipeline occupancy, and stalling events.[1] In a lot of ways, this helps resolve the conflict issues. The problem with this is that it may take a thread a much longer time to get through the front end. There may also be a situation where a thread could be stalled for a long time while another thread is being fixed. Despite all these disadvantages, the front end works well.
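The three arbitration factors named above (fairness, pipeline occupancy, and stalling events) can be combined into a simple selection rule. The Python sketch below is purely hypothetical: AMD’s real arbitration policy is not described in this article, and the priority order used here (skip stalled threads, then prefer the emptier pipeline, then prefer the thread that did not go last) is an assumption chosen only to show how such factors might interact.

```python
def pick_thread(threads, last_picked):
    """Choose which thread the shared front end serves this cycle.

    `threads` maps a thread name to {"stalled": bool, "occupancy": int}.
    Hypothetical policy: ignore stalled threads, prefer the thread with
    fewer instructions already in flight, and break ties in favor of
    the thread that was not served last (fairness).
    """
    eligible = [name for name, state in threads.items() if not state["stalled"]]
    if not eligible:
        return None  # both threads are stalled, nothing to fetch for
    return min(
        eligible,
        key=lambda name: (threads[name]["occupancy"], name == last_picked, name),
    )


if __name__ == "__main__":
    state = {
        "T0": {"stalled": False, "occupancy": 3},
        "T1": {"stalled": False, "occupancy": 3},
    }
    print(pick_thread(state, last_picked="T0"))
```

Even in this tiny model you can see the downside the paragraph mentions: a thread that keeps its pipeline full, or that sits behind a long stall, can go many cycles without being served by the front end.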

We now move down to the Decode stage of the CPU. This area has gotten a bit of a beefing up in terms of the number of instructions that get processed. Up to four instructions can be decoded and be ready for execution. Most of the instructions that will go through the decode are simple and will only require 1 of the 4 instruction spots to be processed. Other instructions like a 256-bit FP must take 2 of the 4 instruction spots to be processed.
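The slot arithmetic above is easy to check with a short Python sketch. The four slots per cycle and the two-slot cost of a 256-bit FP instruction come from the paragraph; the rest is assumed for illustration: instructions are decoded strictly in order, an instruction that does not fit in the remaining slots waits for the next cycle, and the labels `simple` and `fp256` are invented names.

```python
DECODE_SLOTS_PER_CYCLE = 4
SLOT_COST = {"simple": 1, "fp256": 2}  # hypothetical instruction labels


def decode_cycles(instruction_kinds):
    """Count decode cycles for an in-order stream of instructions.

    Assumption: if the next instruction does not fit in the slots left
    in the current cycle, the leftover slots are wasted and it starts
    the next cycle.
    """
    cycles = 0
    free_slots = 0
    for kind in instruction_kinds:
        cost = SLOT_COST[kind]
        if cost > free_slots:
            cycles += 1
            free_slots = DECODE_SLOTS_PER_CYCLE
        free_slots -= cost
    return cycles


if __name__ == "__main__":
    print(decode_cycles(["simple"] * 4))
    print(decode_cycles(["fp256", "fp256", "fp256"]))
```

Under these assumptions four simple instructions fit in one cycle, while only two 256-bit FP instructions do, so an FP-heavy stream gets through the decoder at half the rate.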

Next up is the Execution process. Each thread is assigned a separate “core” for execution. This is where AMD talks about the number of “cores” in Bulldozer. Since there are effectively 8 execution units, at least for integers, Bulldozer can execute 8 threads. These units are utilized to their maximum potential based on the demand of the module itself. By assumption, the threads for each module could either be identical, in that they share resources, or they are different, in that they do not share resources. Since most of the resources are shared between each core, it could be assumed that each thread in a module could be similar in resources.
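The counting behind "4 modules, 8 cores" can be written down as a small data model. This Python sketch is only an illustration: the class and field names are invented, and the layout of two integer cores plus one shared FP unit per module is taken from this article’s description rather than from an AMD specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    integer_cores: int = 2    # one per thread, per the article
    shared_fp_units: int = 1  # shared by both threads in the module


@dataclass(frozen=True)
class Chip:
    name: str
    modules: tuple

    @property
    def integer_cores(self):
        return sum(m.integer_cores for m in self.modules)

    @property
    def fp_units(self):
        return sum(m.shared_fp_units for m in self.modules)


FX_8150 = Chip("FX-8150", tuple(Module() for _ in range(4)))

if __name__ == "__main__":
    print(FX_8150.name, len(FX_8150.modules), "modules,",
          FX_8150.integer_cores, "integer cores,",
          FX_8150.fp_units, "FP units")
```

Counting integer cores gives 8, which is the number AMD advertises; counting modules or FP units gives 4, which is the number this article argues is the fairer comparison.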

Since the execution process and the thread choosing process have not been fully disclosed, there have to be a lot of assumptions about how the execution process is fully utilized. To clarify, can we have both integer cores processing while the FP unit is working? Can we have one of the integer cores running, and have the FP working? Does it have to be FP or both integer cores? From what I can tell, it is FP or integer cores, not a combination of both. It would seem easier for the module to focus on dedicated threads instead of two different threads at the same time. If an FP thread was working and an integer thread was working at the same time, then it may cause complications. But if it is just the FP thread working or both the integer cores working, then fewer complications could take place.

The Memory subsystem has been completely redone. A lot of the cache techniques have been completely reworked to support faster ways for retrieving data or writing data. I am not going to go into specifics because most of it is technical details that would require more explanations. The one part I will focus on is the L3 cache since that is completely different from what most people would expect. Each module has its own L3 cache instead of having one giant L3 cache. The reasoning behind this is to allow modules to look up information faster. If there was one big L3 cache for the whole CPU, then all 4 modules would have to wait in line to access it. So instead, each module can easily access its own L3 cache. The L3 cache of a module has to keep the same information that all other L3 caches have. This way a module does not have to ask another module for any information. The only downside to this technique is the updating of all L3 caches. Each L3 must have the same information at all times. A cache miss could cause a stall until the issue is resolved.

Reasons for Unexpected FX Results

Now it’s time to answer the question: why are we seeing these results from the FX-8150? Before I go further, I want to say that this section is completely my opinion but I will do my best to support all of my claims with facts.

To start off this discussion, I want to attack the FP units. The FPU process may be completely thrown away due to the GPU’s ability to process FPU instructions at a much greater rate. With this fact, AMD decided to not build an aggressive FP unit for each module. Furthermore, since FP units are so big, and require so much power, if each core had its own FP unit it would waste even more energy and may require a larger die. In the end, AMD did the right thing. Since GPUs will be integrated more and more with the CPU or with the coding itself, FP units could disappear altogether.

I’m going to switch gears and attack the front end. Branch prediction pretty much allows the front end to guess which way a branch will go, so it can keep fetching instructions for each thread instead of waiting for the branch to be resolved. It also increases the throughput for multi-threading. Considering this is the first time AMD has ever created a very up-to-date front end, they did not do a bad job. Overall, it works and assigns each module with enough threads to keep up with tasks. The problem is how the front end is assigning those tasks, and how the execution units are working with those threads.

I talked about how the integer cores and the FP unit may not work at the same time. I would like to know if this is true or not, because there could be a huge loss of performance from Bulldozer. Let’s say a single thread requiring the FP unit is sent down the module along with an integer thread. Both units could be on, executing their threads at the same time. Instead, the integer thread has to wait until the FP thread is complete. This is truly a waste of resources.
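To put a number on that potential loss, the Python sketch below compares the two possibilities for one module running one FP thread and one integer thread. It is a back-of-the-envelope model, not a measurement: whether Bulldozer really serializes FP and integer work is exactly the open question raised above, and the function name and cycle counts are hypothetical.

```python
def module_cycles(fp_cycles, int_cycles, units_can_overlap):
    """Cycles for one module to finish one FP thread and one integer thread.

    If the FP unit and the integer core can run at the same time, the
    module is done when the longer thread finishes. If they cannot (the
    unconfirmed guess discussed above), the two threads run back to back.
    """
    if units_can_overlap:
        return max(fp_cycles, int_cycles)
    return fp_cycles + int_cycles


if __name__ == "__main__":
    fp_work, int_work = 100, 60  # made-up cycle counts
    print("overlapping:", module_cycles(fp_work, int_work, True))
    print("exclusive  :", module_cycles(fp_work, int_work, False))
```

With these made-up numbers the exclusive case takes 160 cycles instead of 100, which is the kind of gap that would show up as a large performance loss in mixed workloads if the guess is correct.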

In the time I was writing this, a lot of new information has been coming out. I just read this article over at xbitlabs, and immediately realized what the engineer was talking about. Next to the architecture, the silicon process is the next most important feature of the CPU. If what he says is true, and it could most definitely be true from some of the results I have seen, AMD made a huge mistake with Bulldozer. Working with the transistors and making sure each one is optimized for this task can play a huge role in performance, power, and heat. The good news is that this could probably be resolved with a new stepping, but I would not hold my breath for it being completely fixed.

Regarding power issues, this one came as a surprise to me. From what I have read about the power management, AMD implemented four rings inside each module and one ring around the module itself. The front end, both integer cores, the FPU, and the L2 cache all can be turned off to save power. Each module can also be turned off to save more power, and to also allow for scaling during Turbo Core. The design process for the power gating came from Llano. If any of you are unfamiliar with that APU, the power gating for the Llano is one of the best. It uses very little power for being the king of low end processors. Knowing this makes me even more upset with Bulldozer and its hunger for power, although this may depend on the previous point I discussed.

Closing Remarks

Why talk about this? Why try to defend the AMD Bulldozer? The purpose is to hopefully extinguish some of your torches, and put down your pitchforks. You have to realize that AMD is in a new league. Instead of doing what they normally have done, they are exploiting what they have learned to hopefully make a faster CPU, and you know what: they did. Compare the results you get with the FX-8150 and the Deneb 965. What you should be doing is not comparing this CPU as if it’s an 8-core CPU, because it is not; compare it with the four core CPUs, because that is what it is. The FX-8150 is just another four core CPU with some major tweaks.

This may not have been the savior we have been looking for in AMD against Intel but give it more time and we could really see this architecture improve. We have had the same architecture for many years, and its final run, Thuban, showed us what could happen in the late game. The same needs to happen with each new architecture: time and patience.

– Dolk

References:

[1] AMD’s Bulldozer Microarchitecture by David Kanter (www.realworldtech.com)

26 replies

Archer0915

"The Expert"

5,063 messages 221 likes

#1 Oct 20, 2011

Dolk; good read. I agree about the FP and I was actually trying to bring that up the other night (face to face discussion with edumicated ppl) but decided to drop it so as not to cause confusion. Sometimes I feel like OCF, TR and some other forums are the only places I am understood; most of the time anyway.

I made this post somewhere else this morning:

The issue here is the architecture and software do not mesh, according to AMD. To me it is not better than HT in some cases and worse than others. In cases where it is better than the iX it is most likley not software that takes full advantage of the FPU.

In essence I am saying that in heavy float ops work it is no better than a quad but in straight int ops it acts like a octo with a limited FSB.

muddocktor

Retired

12,975 messages 0 likes

#2 Oct 20, 2011

Very good read, Dolk.:thup:

Theocnoob

Member

12,062 messages 0 likes

#3 Oct 20, 2011

We're hitting the wall harder than a drunk Sailor on a Friday night, man.

Anybody else notice the actual visible improvements in CPU performance getting smaller with each gen?
Or not existing?

Come on, graphene...

Archer0915

"The Expert"

5,063 messages 221 likes

#4 Oct 20, 2011

And what do you mean by visible improvements? Compute power goes up, software demands go up, programs get sloppy because of the immense amounts of code and memory requirements go up, programs get larger to take advantage of the memory size and speed as well as the compute power so storage speed and size go up. I would not trade the PC I have today (the slowest) for the best I had 10 years ago.

If you DC or encode quite a bit and a few other things you can see and feel the powar:) I still want moar powar. I want it!!! Give it to me now; I demand it!!!:rain: OOPS:( Be careful what you ask for lightning does strike:(

Dolk: Did you get any insight into the cache issues? I just feel that AMD cache scheme is ineffective. It seems that the L3 offers little gain to the average user and that a large L2 is where it is at for them. (speaking about the move from winzer to brizzy to PhI to PhII to AthII (winzer 2.0 on the duals) x4 and so on. You can see that the windsor had a great run but the brizzy was lacking.

kskwerl

Member

535 messages 0 likes

#5 Oct 20, 2011

Very well written and informative article, thank you.

Metallica

Member

1,058 messages 0 likes

#6 Oct 20, 2011

I feel the same way. My E6750 felt the same as my i7 920. Synthetic benchmarks is the only place I saw improvements. I don't really do anything CPU intensive.

But the jump from p4 -> e6750 was massive, or so it felt.

SSD's were the huge jump in performance that I was looking for. I wonder when something else will come out that will make as much of a noticeable improvement as SSD's.

bmwbaxter

Member

4,135 messages 7 likes

#7 Oct 20, 2011

Nice article! it was an informative read :thup:

Lord_of_Decay

Member

459 messages 0 likes

#8 Oct 20, 2011

Excellent article Dolk.

Dolk

I once overclocked an Intel

6,877 messages 14 likes

#9 Oct 21, 2011

@archer, sorry for the late reply. I have become very busy lately. I do not believe there is a cache problem, or at least there is a problem with the architecture plan for the cache. There may have been something wrong with the execution, but I cannot fully determine that without a full disclosure from AMD.

The L3 cache has greatly improved, and if AMD had kept their original plan of having a single L3 cache, then we would have seen a far smaller gain from the Phenom II to BD architecture. You have to think of it this way for the cache system in BD. The L1 is only for the cores to store their data. The L2 is for the cores to talk to one another inside their respective module. The L3 is for the modules to talk to each other inside their respective CPU. Now each cache gets all the information within that hierarchical set. That means the L2 will always contain the information from all L1 caches in the module, and the L3 will always contain the information from all L2 caches in the CPU.

Again how this idea was executed is unknown to me. It is more guess work without actually knowing the full details of each segment of the actual produced architecture. My only guess is that the cache system is being held back by the CMOS layout setup that AMD decided to go with (the Auto layout).

Archer0915

"The Expert"

5,063 messages 221 likes

#10 Oct 21, 2011

You know I think they would let you in. In a way by explaining some of these thing in the manner you have you have done them a service. I personally want to see the white papers:)

Well a module level cache is good but I just have to speculate on the actual efficency of the entire setup. They need another float unit in there and IMHO it can cause unnessary WS and flushes of data that has waited around for too long. If you are running a math intensive (float and int mix) program I can see that there could be a substantial gain but I can also see that if the instructions were not written (compiled) in such a way that they would specifically take advantage of the design that things could be worse than a traditional quad.

I am beginning to think that the Intel fake cores (at least the way they do it) could be inferior to this (and should be); but only if they can get things hammered out for PD. This design is good and it is an advancment but I would just feel like an early adopter of win 95 waiting for a patch if I had juped on this thing.
