Bulldozer Architecture Explained


George Harris is one of our most active Senior Members at the Overclockers.com Forums, as well as a frequent contributor to articles on the site. As a fifth year undergraduate at Missouri University of Science and Technology studying Computer Engineering (with an emphasis on Computer Architecture), he has both the knowledge and insight to bring us a more technical view on the Bulldozer architecture. This article, while at first glance quite wordy, will allow us all to gain deeper insight into what makes the (admittedly rather underwhelming) Bulldozer architecture ‘tick’. — David

 

AMD FX CPU Die

Most likely you have just finished looking at all the reviews of the Bulldozer architecture CPU, the “Zambezi” FX-8150. Some of you are probably wondering if they are all correct, or if each and every reviewer was an idiot when working with the chip. The honest truth is that the reviews are right: the Bulldozer architecture really does deliver the reported level of performance. People have asked about microcode updates and BIOS changes, but those will not significantly change the performance Bulldozer offers. Without going too in-depth, this article aims to give a brief, simplified explanation of the approach Bulldozer has taken and why the performance is what it is. Some of my examples use complicated terms that you may not understand right away. In the future I will be posting a paper on looking at CPU architecture in a simple way, but here I will try to explain everything in the easiest form and not go too deep.

The CPU

To start off, we have to understand what a CPU is in its general form. Going back to single-core CPUs, we see the essence of what a CPU actually is: a dumb machine that takes inputs and creates outputs based on a certain set of algorithms laid out by intricate logic networks. To break that down, you can think of the CPU as a city and the logic networks as its roads: the roads guide the cars to their destinations. Probably not the best analogy, but it gives you a sense of what I am talking about.

The CPU has become more and more complicated each year as more features get integrated onto a single die. Let’s consider the original CPU: a single core, a memory subsystem, and an I/O subsystem. The core takes care of the actual execution of the information passed to it. The memory subsystem controls the RAM and the L3 cache. The I/O handles information being sent to the Northbridge and any other directly connected ICs.

The core has five basic stages: Fetch, Decode, Execute, Memory, and Write Back. The Fetch stage takes in an instruction and places it into a buffer; for this example we will use the Reorder Buffer (RB). The RB looks at the information passed in by Fetch and constructs an ordered table of which instructions are to be executed, from first to last. A perfect example is an instruction stream contained in a “for” loop: the RB can look at the loop and determine how to unravel it so that execution will be the quickest and safest for the CPU. Once an instruction is ready in the RB, the Decode unit takes it and dissects it. Before executing an instruction, the CPU needs to know what type of instruction it has: is it FP, integer, SSE2, etc.? Once the decoder knows which instruction it has, it can reserve an execution unit for it.
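To make the loop example concrete, here is a minimal C sketch of a simple loop and a four-way unrolled equivalent. This is only an illustration (the function names are made up, and it is not Bulldozer-specific): unrolling exposes independent additions that out-of-order hardware, or a compiler, can spread across the available execution units.

```c
/* Illustration only: a plain loop and a 4-way unrolled equivalent.
   The unrolled form exposes independent work that out-of-order
   hardware (or a compiler) can schedule across execution units. */
int sum_simple(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

int sum_unrolled(const int *a, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* four independent accumulators */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)             /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```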

Once an execution unit is free, the Decoder sends the instruction along, and the execution unit executes it. After this is done, the CPU needs to know the result of the instruction, in case another instruction relies on it to produce an answer. For example, something like y = x + 2 requires two instructions: one to solve x and another to solve y. Since y cannot be solved until x is, the CPU needs to know x after it is executed in order to solve for y. There are several techniques to get this done. One of the oldest is to store the result in the L1 cache and have it on hand for later; other approaches use small buffers near the front end that can quickly look up recently executed results. At the very end of all this, the executed result is stored throughout the memory subsystem: the L1, L2, L3, and RAM will each hold a copy for a certain amount of time. The memory subsystem controls this; for now we will not go too in-depth on it.
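Written out in C, the y = x + 2 case looks like the short sketch below. It is only an illustration of the dependency being described, not Bulldozer-specific code.

```c
/* A read-after-write dependency: the second statement cannot execute
   until the result of the first is available, whether it is forwarded
   straight from the execution unit or read back from the cache. */
int compute(int a, int b) {
    int x = a * b;   /* produces x */
    int y = x + 2;   /* consumes x, so it must wait for x to be ready */
    return y;
}
```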

This represents the absolute basics of a CPU and how it works. I could easily write a book on the rest of the information, but you are here to learn more about Bulldozer and the results it has produced. Again, before we get to that, we have to look at one more piece of history. Yeah, hate me now, but knowing the past creates an understandable present and future.

Evolution to the Multi-Core CPU

As of right now, we are hitting the limits of Moore’s Law, which observes that as transistors shrink, the number of transistors we can fit on a chip keeps doubling. This relation allows for more and more complicated CPUs. In most cases, these complicated CPUs are just CPUs with crazy numbers of cores, and we are finding it easier to parallelize our programs so that they can be executed faster. As a side note, it was actually the multi-core CPUs that came before the parallelized software itself; multi-core was a last-ditch effort by computer architects to keep an aging technology growing and relevant.

Now, since we are stuck with this technology, computer architects are trying to figure out the best way to implement it. How many cores do we really need? How do we exploit those cores so they are always fully utilized? Do we add threads on top of our cores, or add more cores to handle the threads? These are the questions computer architects are exploring right now. As we have already seen, Intel and AMD have chosen separate answers. Intel has chosen to logically fill up each core with enough instructions that the execution units are always being used. AMD has decided to throw more cores at the problem and somewhat brute-force its way into executing as many instructions as possible.

Definition of a Core

Now, before I continue, I have to mention the definition of a core, because Intel and AMD are both starting to skew the term to distract customers. A core contains the five basic stages of a CPU that I have already gone over: Fetch, Decode, Execute, Memory, and Write Back. Intel and AMD both say they have 2, 4, 6, 8, etc. cores. In some ways this is true, especially on the server side, but on the desktop side we do not see it. Intel currently has the 2600K, which has 4 cores and 8 threads; that is the proper way of stating a CPU’s identity. AMD’s FX-8150 is billed as 4 modules, 8 cores; that is the wrong way to state the CPU’s identity. Well, kind of…

AMD’s approach to filling each core is to add more execution units and a completely separate, hardware-based thread to each core. Intel’s approach is to add a beefier front end and allow a separate thread to run inside a core. So what does this actually mean? AMD has taken the hardware route, and Intel the virtual route. So when AMD says the FX-8150 is an 8-core CPU, it really is not. Even though each module has two separate sets of execution units and L1 caches, the module still shares the front end and the back end. That means a module is really a single core with two threads.

The module system is a lot like Intel’s Hyper-Threading, and at the same time completely unlike it. Both systems run a second thread in a single core; how each company handles that thread is what sets the two apart.

Intel’s Thoughts

If we go back to our basic CPU, we see that the execution units are not always being utilized. Intel saw this and decided they could branch the front end so that it brings in more instructions, while keeping the extra thread logically separate from the core itself. These instructions are executed in the core, and even written back inside the core. In a lot of ways this is a great idea, but there is one huge problem: what happens when the core is already using an execution unit and a thread is told that the unit is free? A huge slowdown is created, which can stall an entire instruction stream for some time, or force it to be re-issued and executed again. Another downside is that the order of instructions has to be nearly perfect so that both the thread and the core can each complete their instructions in a given time. You do not want one of them sitting around waiting for units to free up; that defeats the purpose of putting a thread on top of a core.
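You can get a rough feel for this contention yourself. The sketch below is a hypothetical experiment of my own, not anything from the article: it pins two floating-point-heavy threads first to two logical CPUs that share a physical core, then to two separate cores, and compares the wall-clock time. It assumes Linux and GCC; which CPU numbers are Hyper-Threading siblings varies by machine (check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list), and the size of the gap depends heavily on the workload.

```c
/* Rough contention sketch. Build with: gcc -O2 -pthread contention.c
   The CPU numbers passed to run_pair() are placeholders; substitute a
   real sibling pair and a real cross-core pair for your machine. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000000UL

static void *worker(void *arg) {
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Four independent multiply chains keep the FP hardware busy. */
    double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
    for (unsigned long i = 0; i < ITERS; i++) {
        a *= 1.0000001; b *= 1.0000003;
        c *= 1.0000005; d *= 1.0000007;
    }
    volatile double sink = a + b + c + d;   /* keep the work alive */
    (void)sink;
    return NULL;
}

static double run_pair(int cpu_a, int cpu_b) {
    pthread_t ta, tb;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, worker, &cpu_a);
    pthread_create(&tb, NULL, worker, &cpu_b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same physical core: %.2f s\n", run_pair(0, 1));
    printf("separate cores:     %.2f s\n", run_pair(0, 2));
    return 0;
}
```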

AMD’s Thoughts

For the longest time, AMD has believed that adding more cores creates a faster CPU. The thinking is pretty simple: take the negatives of threading, namely execution units filling up and the drastic penalties that follow, and make them go away by adding physical cores instead of threads. If we go back to our original CPU, we do not utilize as many of the execution units as we could. The question then becomes: how do we design each core so that it does not have an excess of execution units, but has enough that any large unraveling of loops can be executed in a timely manner? AMD has done a pretty good job in this area, and the Thuban is an excellent example of the potential of “cores only” CPUs.

AMD faced a new problem after the Thuban showed that simply increasing the number of cores would not produce a huge gain in desktop performance. The next logical step was to go the way of Intel, but still keep their roots. This is where the idea of the module came into play: create a beefier middle end, and have a very intelligent front end and memory subsystem controlling the instructions. Right now, AMD has to play catch-up with Intel on the architecture of that front end. Since there are so many execution units to fill, the front end has to take in enough instructions, and the decoder has to assign enough instructions, so that the units are fully utilized; at the same time, the RB has to make sure the instructions are executed in the manner that is best for the CPU. This is very complicated and requires many stages in the CPU for it all to work out. Furthermore, since the module has to take in two separate threads and schedule them like normal processes, the front end has to make sure no conflicts arise in the distribution of registers and resources.

AMD Bulldozer Architecture

Bulldozer Realized

Now we can jump into the architecture of Bulldozer. To begin, we look at the front end of the module: the Fetch. Bulldozer’s front end includes branch prediction, instruction fetching, instruction decoding, and macro-op dispatch. These stages are effectively multi-threaded, with single-cycle switching between threads, and the arbitration between the two cores is determined by a number of factors including fairness, pipeline occupancy, and stalling events.[1] In a lot of ways this helps resolve the conflict issues. The problem is that it may take a thread much longer to get through the front end, and a thread can be stalled for a long while as the other thread is handled. Despite these disadvantages, the front end works well.

We now move down to the Decode stage. This area has gotten a bit of a beefing up in terms of how many instructions get processed: up to four instructions can be decoded and made ready for execution per cycle. Most instructions that come through decode are simple and only need one of the four decode slots. Others, like a 256-bit FP instruction, must take two of the four slots.
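For reference, a “256-bit FP instruction” here means an AVX operation like the one in this small sketch, which works on eight single-precision values at once; per the article, an operation of this width occupies two of the module’s four decode slots. The file name and build line are just examples.

```c
/* One 256-bit AVX operation: eight single-precision adds at once.
   Build with: gcc -O2 -mavx avx_add.c */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set1_ps(1.5f);   /* eight copies of 1.5        */
    __m256 b = _mm256_set1_ps(2.5f);   /* eight copies of 2.5        */
    __m256 c = _mm256_add_ps(a, b);    /* a single 256-bit FP add    */

    float out[8];
    _mm256_storeu_ps(out, c);          /* write the result to memory */
    printf("%f\n", out[0]);            /* prints 4.000000            */
    return 0;
}
```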

Shared Frontend

Next up is the execution process. Each thread is assigned a separate “core” for execution; this is where AMD gets its count of “cores” in Bulldozer. Since there are effectively eight sets of integer execution units, Bulldozer can execute eight integer threads. These units are utilized to their maximum potential based on the demand of the module itself. The two threads in a module could be identical, in that they share resources, or different, in that they do not; since most of a module’s resources are shared between its two cores, it is reasonable to assume the two threads in a module will often look similar in the resources they need.

Since the execution process and the thread-scheduling process have not been fully disclosed, a lot of assumptions have to be made about how the execution units are kept utilized. To clarify: can both integer cores be processing while the FP unit is working? Can one integer core run while the FPU works? Or does it have to be the FPU alone, or both integer cores alone? From what I can tell, it is FP or integer cores, not a combination of both. It would seem easier for the module to focus on one kind of thread instead of two different kinds at the same time. If an FP thread and an integer thread were working simultaneously, it could cause complications; if it is just the FP thread working, or just the two integer cores, fewer complications can arise.
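If you wanted to poke at this question yourself, the workload would look something like the sketch below: one integer-only worker and one FP-heavy worker run together and timed. This is a hypothetical test of my own, not something from AMD; to make it meaningful on an FX chip you would also pin the two threads to the two cores of a single module (as in the earlier affinity sketch), and the right CPU numbers for that are machine-specific.

```c
/* Mixed integer/FP workload sketch. Run both threads together, then each
   one alone, and compare the times. Build with: gcc -O2 -pthread mix.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 400000000UL

static void *int_worker(void *arg) {
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < ITERS; i++)
        acc += i ^ (i >> 3);          /* integer ALU work only */
    (void)arg;
    return NULL;
}

static void *fp_worker(void *arg) {
    volatile double acc = 1.0;
    for (unsigned long i = 0; i < ITERS; i++)
        acc *= 1.0000001;             /* floating-point work only */
    (void)arg;
    return NULL;
}

int main(void) {
    struct timespec t0, t1;
    pthread_t a, b;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, int_worker, NULL);
    pthread_create(&b, NULL, fp_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("int + fp together: %.2f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
```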

The memory subsystem has been completely redone. A lot of the cache techniques have been reworked to support faster ways of retrieving and writing data. I am not going to go into specifics, because most of them are technical details that would require more explanation. The one part I will focus on is the L3 cache, since it is completely different from what most people would expect. Each module has its own L3 cache instead of the CPU having one giant L3. The reasoning is to let modules look up information faster: if there were one big L3 for the whole CPU, all four modules would have to wait in line to access it, whereas this way each module can easily access its own L3. The catch is that each module’s L3 has to hold the same information as all the other L3 caches, so that a module never has to ask another module for data. The downfall of this technique is keeping every L3 updated: each one must hold the same information at all times, and a cache miss can cause a stall until the issue is resolved.
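The article does not give cache sizes or latencies, but you can get a feel for a cache hierarchy on your own machine with a pointer-chasing sketch like the one below (my own illustration, not tied to Bulldozer). The average access time jumps each time the working set outgrows a cache level.

```c
/* Pointer-chasing sketch: average access time versus working-set size.
   The shuffled links defeat the prefetchers, so the jumps in the output
   roughly mark the L1, L2, and L3 boundaries of whatever machine runs it.
   Build with: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t n_elems, size_t steps) {
    size_t *next = malloc(n_elems * sizeof *next);
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: turn the array into one big random cycle. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    volatile size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        p = next[p];                  /* serial, cache-bound dependency chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / steps;                /* nanoseconds per access */
}

int main(void) {
    for (size_t kb = 16; kb <= 32 * 1024; kb *= 2)
        printf("%6zu KB: %.1f ns/access\n",
               kb, chase(kb * 1024 / sizeof(size_t), 20000000));
    return 0;
}
```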

Reasons for Unexpected FX Results

Now it’s time to answer the question:  why are we seeing these results from the FX-8150? Before I go further, I want to say that this section is completely my opinion but I will do my best to support all of my claims with facts.

To start off this discussion, I want to attack the FP units. Heavy FP work may be taken off the CPU entirely, because GPUs can process FP instructions at a much greater rate. With that in mind, AMD decided not to build an aggressive FP unit for each core. Furthermore, since FP units are so big and require so much power, giving each core its own FPU would waste even more energy and might require a larger die. In the end, AMD did the right thing: since GPUs will be integrated more and more with the CPU, and with the code itself, dedicated FP units could disappear altogether.

I’m going to switch gears and attack the front end. Branch prediction lets the front end guess which way a branch will go, so it can keep fetching the right instructions without stalling; a good predictor also increases throughput when the front end is juggling two threads. Considering how much of the front end AMD rebuilt for Bulldozer, they did not do a bad job. Overall it works, and it keeps each module fed with enough work to keep up with its tasks. The problem is how the front end is assigning those tasks, and how the execution units are working with the threads it hands them.
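Branch prediction’s impact is easy to see for yourself, independent of Bulldozer. In the classic sketch below (an illustration, not from the article), the same loop over the same values runs markedly faster once the data is sorted, because the branch becomes predictable.

```c
/* Classic branch-prediction demonstration: the branch is taken about half
   the time either way, but with sorted data its pattern is predictable.
   Build with: gcc -O1 branch.c  (higher optimization levels may replace
   the branch with a branchless conditional move and hide the effect). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static double time_sum(const int *data) {
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int pass = 0; pass < 20; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)        /* the branch being predicted */
                sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("(sum %lld) ", sum);        /* keep the loop from being removed */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;        /* random values 0..255 */

    printf("shuffled: %.2f s\n", time_sum(data));
    qsort(data, N, sizeof *data, cmp_int);
    printf("sorted:   %.2f s\n", time_sum(data));
    free(data);
    return 0;
}
```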

I talked about how the integer cores and the FP unit may not be able to work at the same time. I would like to know whether this is true, because if it is, there could be a huge loss of performance in Bulldozer. Let’s say a thread requiring the FPU is sent down the module along with an integer thread. Ideally, both could be active and executing at the same time; instead, the integer thread has to wait until the FP thread is complete. That is truly a waste of resources.

In the time I was writing this, a lot of new information has been coming out. I just read an article over at X-bit Labs and immediately realized what the engineer was talking about. Next to the architecture itself, the silicon process is the most important part of a CPU. If what he says is true, and from some of the results I have seen it very well could be, AMD made a huge mistake with Bulldozer. Working at the transistor level and making sure each one is optimized for its task can play a huge role in performance, power, and heat. The good news is that this could probably be improved with a new stepping, but I would not hold my breath for it being completely fixed.

Regarding the power issues, this one came as a surprise to me. From what I have read about the power management, AMD implemented four power-gating rings inside each module and one ring around the module itself: the front end, both integer cores, the FPU, and the L2 cache can all be turned off to save power. Each module can also be turned off entirely, both to save more power and to allow scaling during Turbo Core. The design of the power gating came from Llano, and if you are unfamiliar with that APU, its power gating is among the best: it uses very little power while being the king of low-end processors. Knowing this makes me even more upset with Bulldozer and its hunger for power, although that may come back to the process issue I discussed above.

Closing Remarks

Why talk about all this? Why try to defend Bulldozer? The purpose is to hopefully extinguish some of your torches and get you to put down your pitchforks. You have to realize that AMD is in a new league. Instead of doing what they have always done, they are exploiting what they have learned to hopefully make a faster CPU, and you know what: they did. Compare the results you get with the FX-8150 against the Deneb 965. What you should not do is compare this CPU as if it were an 8-core CPU, because it is not; compare it with the four-core CPUs, because that is what it is. The FX-8150 is just another four-core CPU with some major tweaks.

Core Roadmap - Big Machinery!

This may not have been the savior we were looking for in AMD’s fight against Intel, but give it more time and we could really see this architecture improve. We had the previous architecture for many years, and its final run, Thuban, showed us what could happen in the late game. The same needs to happen with each new architecture: time and patience.

– Dolk

 

References:

[1] AMD’s Bulldozer Microarchitecture by David Kanter (www.realworldtech.com)


Discussion
  1. Dolk; good read. I agree about the FP and I was actually trying to bring that up the other night (face to face discussion with edumicated ppl) but decided to drop it so as not to cause confusion. Sometimes I feel like OCF, TR and some other forums are the only places I am understood; most of the time anyway.
    I made this post somewhere else this morning:
    The issue here is that the architecture and software do not mesh, according to AMD. To me it is no better than HT in some cases and worse in others. In the cases where it is better than the iX, it is most likely not software that takes full advantage of the FPU.

    In essence I am saying that in heavy float-ops work it is no better than a quad, but in straight int ops it acts like an octo with a limited FSB.
    We're hitting the wall harder than a drunk Sailor on a Friday night, man.
    Anybody else notice the actual visible improvements in CPU performance getting smaller with each gen?
    Or not existing?
    Come on, graphene...

    And what do you mean by visible improvements? Compute power goes up, software demands go up, programs get sloppy because of the immense amounts of code, and memory requirements go up; programs get larger to take advantage of the memory size and speed as well as the compute power, so storage speed and size go up. I would not trade the PC I have today (the slowest) for the best I had 10 years ago.
    If you DC or encode quite a bit and a few other things you can see and feel the powar:) I still want moar powar. I want it!!! Give it to me now; I demand it!!!:rain: OOPS:( Be careful what you ask for lightning does strike:(
    Dolk: Did you get any insight into the cache issues? I just feel that AMD's cache scheme is ineffective. It seems that the L3 offers little gain to the average user and that a large L2 is where it is at for them (speaking about the move from winzer to brizzy to PhI to PhII to AthII (winzer 2.0 on the duals) X4 and so on). You can see that the windsor had a great run but the brizzy was lacking.
    Theocnoob
    We're hitting the wall harder than a drunk Sailor on a Friday night, man.
    Anybody else notice the actual visible improvements in CPU performance getting smaller with each gen?
    Or not existing?
    Come on, graphene...

    I feel the same way. My E6750 felt the same as my i7 920. Synthetic benchmarks are the only place I saw improvements. I don't really do anything CPU intensive.
    But the jump from p4 -> e6750 was massive, or so it felt.
    SSDs were the huge jump in performance that I was looking for. I wonder when something else will come out that makes as noticeable an improvement as SSDs did.
    @Archer, sorry for the late reply. I have become very busy as of late. I do not believe there is a cache problem, or at least not a problem with the architectural plan for the cache. There may have been something wrong with the execution, but I cannot fully determine that without full disclosure from AMD.
    The L3 cache has greatly improved, and if AMD had kept their original plan of having a single L3 cache, then we would have seen far less gain from the Phenom II to the BD architecture. You have to think of the cache system in BD this way: the L1 is only for the cores to store their data; the L2 is for the cores to talk to one another inside their respective module; the L3 is for the modules to talk to each other inside the CPU. Each cache holds all the information within that hierarchical set, which means the L2 will always contain the information from all L1 caches in the module, and the L3 will always contain the information from all L2 caches in the CPU.
    Again, how this idea was executed is unknown to me. It is guesswork without knowing the full details of each segment of the architecture as actually produced. My only guess is that the cache system is being held back by the CMOS layout setup that AMD decided to go with (the auto layout).
    You know, I think they would let you in. In a way, by explaining some of these things in the manner you have, you have done them a service. I personally want to see the white papers:)
    Well, a module-level cache is good, but I just have to speculate on the actual efficiency of the entire setup. They need another float unit in there, and IMHO it can cause unnecessary WS and flushes of data that has waited around for too long. If you are running a math-intensive (float and int mix) program, I can see that there could be a substantial gain, but I can also see that if the instructions were not written (compiled) in a way that specifically takes advantage of the design, things could be worse than on a traditional quad.
    I am beginning to think that the Intel fake cores (at least the way they do it) could be inferior to this (and should be), but only if they can get things hammered out for PD. This design is good and it is an advancement, but I would just feel like an early adopter of Win 95 waiting for a patch if I had jumped on this thing.
    At the time BD was first proposed, the trend for programmers was to push FP onto the GPU rather than the CPU; the CPU was not strong enough. In the five years of development since, we have seen that the CPU can do some good FP calculations with enough resources.
    If you look at Fusion and BD, you may see something of what is coming. Having an on-board GPU will allow FP calculations to be pushed permanently onto the GPU and off the CPU. That way the CPU can go back to doing its task of multi-threaded integer and memory work.
    Furthermore, if we move over to the server side, most of the calculations there do not involve the FPU, and when they do, those servers usually have CUDA clusters or vector processors to help with efficiency.
    Yeah, BD will not be good at breaking benchmarks, but maybe it's time we changed how we bench our CPUs? Just like when we changed our benchmarks from single-threaded to multi-threaded, or 32-bit to 64-bit. Maybe it's time we go to multi-threaded integer and memory only.
    Hmm, I should have stated that in my article; that would have been a good paragraph.
    Well, I look more at how much work a CPU can do, so I am with you on changing some standards to represent what the processor actually does as far as work-unit type. Unfortunately you fall into that entire class of ppl who get a bad rap for being partial when you point out facts that they refuse to consider.
    In the end I don't think eliminating benches and saying "we don't like your tests" is the solution. I think expanding the benches and evaluating every year what direction the total package is going in, by looking at usage models, software sales, software in development and task environment, might be a great solution. My i5 can crunch and fold better than my PhII, but when running I really cannot tell. Honestly, these days we need to stop looking so hard for stellar performance and focus on the deficiencies. Any mainstream desktop CPU manufactured today will play games as well as any other depending on the system setup (I should say any quad), as long as you have the supporting hardware. People can say all they want, but if I felt my i-processors were any faster in day-to-day use in my 24/7 rig (daily driver), I would throw out every AMD system in the house. It is just not the case.
    The problem is that we are still figuring out our limits with the CPU. Right now, for gaming and general usage of the CPU, we have not seen much progression since we started working in multi-threading.
    CPU architects are still trying to find the best combination with the resources they have. In the past year or two, we have seen that increasing cores is done; we have exploited that technique to its max. The next step is exploiting each core to its max, and that is what we are currently doing. Once we find that, we will move on to other things: maybe different types of doping of the silicon, or different kinds of transistors, or new memory architectures, or finally going 128-bit (which will come sooner than some of you can imagine).
    The way I understand what you are saying, I cannot agree. Run today's highest-end games on the first batch of duals; we have come a long way.
    The raw compute power has increased dramatically as well.
    What needs to be considered, as I have said before, is the entire package. Just think of some of the architectural changes to the CPU that have not involved compute power: the APU that you have mentioned already, new SIMD instructions, moving many of the NB components onto the CPU to lessen the bottlenecking, and a few other things.
    I do see the progression you say has not been realized. Killing the FSB is a huge jump for heavy background multitasking for the GP user who might want to DC. The memory controller's ability to operate faster memory...............
    Your argument seems to contradict itself. You said we are not reaching potential, yet you almost sound happy about 128-bit.
    Not that 128-bit is bad, but to me it just makes for bigger, sloppier programs with sloppy, generic coding.
    I personally do agree with the theme of your post: it is the software that has to take advantage of the hardware, and do it efficiently.
    Also, I was making a case that benches could be revamped with usage models that are more representative of today's average user. With FB and all of this social garbage and streaming and blah, blah, we cannot readily say that a CPU is crap when it is just as fast at FB as the rest of the pack.
    Yeah, gaming benches show this and that, but unless I am playing a game where frame rates give a specific advantage, then 80 FPS or 200 FPS makes no difference. I want to see the parts that stress the CPU revealed. I mean, who cares if you max at 200 FPS and the other guy maxes at 180, when one of you bottoms out at 60 FPS and the other at 40 FPS, and it is consistent across video cards? How well is the CPU handling heavy bot AI?
    I would also like to see timed tests and real workloads used.
    I hope that this was some type of typo on your part because it (obviously) couldn't be further from the truth:
    This is the first time that AMD has implemented branch prediction.
    Nice write up! :thup:
    One comment - in your article you state:
    "As a side note, it was actually the multi-cores that came before the paralleled software itself. It was a last ditch effort of the computer architects to keep an aging technology growing and relevant to newer, more advanced computers."
    I think you should qualify this as "for Windows PCs", because parallel software has been around for decades. It is true that writing parallel code is newer to Windows PCs; this is mostly because Microsoft never really took it seriously.
    I used parallel software techniques in the code I wrote on UNIX systems back in the '90s, and it goes back further than that. Software running on supercomputers is parallelized. Windows PCs got stuck in the one-CPU-does-it-all model because that's what the first IBM Personal Computers did (<-- forefathers of modern PCs). They could just as easily have had dual or more CPUs. It's too bad that we didn't switch over to parallelized software sooner, when the first multi-core consumer PCs came out.
    All the above said, I do hope that we get better software that can actually use all these wonderful new cores/nodes!
    @Gautam, This is true in a sense. AMD has used a different type of branch prediction before, but the one they have implemented for BD is much different than usual. Let me look through my notes to see where I got this quote.
    @Owenator, again also true. I should have stated this for the PC side. Servers and workstations have been using parallel coding for some time.
    Dolk
    @Gautam, This is true in a sense.

    No, it's not true in any sense. Branch prediction has been around for decades (longer than you or I have). AMD has simply moved away from a more traditional hierarchical BP to one that's decoupled from fetch and uses branch fusion. There's a world of difference between improving their prediction and using it for the first time (which was probably some time in the early '90s, with whatever it was they were pitting against the Pentium).
    Yeah, I am going to take back what I said; I really have no idea why I put that statement in. I went through all my notes and couldn't find anything to back up that statement. Let me re-write it. Thanks for the catch, I'll update it to be accurate.