
Conroe technical documents


proth

For those who really care about what's "Intel Inside": below are a few articles explaining Conroe's internals. Contributions by Ross, OChungry, Mikeguava, and Strat.



A very good article comparing Conroe to other architectures (thanks to Ross)
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2748


Light Reading:
http://www.extremetech.com/article2/0,1697,1935727,00.asp
http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144
http://arstechnica.com/articles/paedia/cpu/core.ars
ftp://download.intel.com/technology/architecture/new_architecture_06.pdf



Heavy Duty stuff:
http://www.intel.com/design/pentium4/manuals/253665.htm
http://www.intel.com/design/pentium4/manuals/253666.htm
 
Dang, my secret info sources are out, LOL. The realworldtech one is good if you don't like reading through the Intel papers...
 
Since there have been many questions, in several threads, regarding micro/macro-op fusion, SSE4, 128-bit SIMD registers, etc., I thought I'd post the Intel white-paper, which seems to address nearly all of these issues. I found it a pretty good read, and, IMHO, the white-papers provide a better overview of new technologies than the datasheets, which tend to be "nuts & bolts" specific (as they should be).

Though, I suppose that to fully appreciate such things as the 4-instruction-wide execution core, et al., one needs to already know that the current P4 Netburst technology is 3 wide, etc.

Well, anyway, FWIW:

[56K WARNING - .pdf]

Core Technology Micro-Architecture Whitepaper

BTW - JMO, but when you see all the various aggregate "improvements" over the current Netburst technology, it starts to become obvious why these things have the potential to really kick some serious arse on their earlier brethren, at least in my mind.

Strat
 
Thanks for posting the link Strat.

There are actually some websites linked above that give a somewhat more in-depth explanation in addition to comparing to P4/Yonah. I posted the realworldtech one in another thread a while ago and it's probably the best one IMHO, but it looks like the link is hosed here.

proth, you might want to edit and correct the OP...here's the correct link: http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144

When you really look at it compared to the other cores, Conroe is all that and a bag of chips, if the benches didn't alert you to that fact already, LOL.

In a nutshell, macro fusion is its ability to combine 2 incoming instructions (4+1 per clock if a pair can be combined) while still producing only 4 ops going out. That not only increases actual instructions per clock when possible, but also reduces the # of micro-ops needed to complete the instructions. Good stuff ;)
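If it helps anyone picture it, here's a toy Python sketch of that "4+1 in, 4 out" decode behavior. It's not cycle-accurate and the instruction names are just placeholders; the only point is that a fusable pair (e.g. a compare followed by a conditional jump) counts as two instructions in but one op out:

```python
# Toy model of a 4-wide decoder with macro-fusion (a sketch, not a real
# CPU model): a cmp/test immediately followed by a conditional jump is
# fused into a single op, so up to 5 x86 instructions can be consumed
# per cycle while only 4 ops leave the decoders.

FUSABLE_FIRST = {"cmp", "test"}
JCC = {"je", "jne", "jl", "jg", "jz", "jnz"}

def decode_cycle(stream, width=4):
    """Consume instructions from the front of `stream` for one cycle.
    Returns (ops_emitted, instructions_consumed)."""
    ops, consumed = 0, 0
    while ops < width and consumed < len(stream):
        insn = stream[consumed]
        nxt = stream[consumed + 1] if consumed + 1 < len(stream) else None
        if insn in FUSABLE_FIRST and nxt in JCC:
            consumed += 2      # two x86 instructions in...
        else:
            consumed += 1
        ops += 1               # ...one decoded op out
    return ops, consumed

# A loop body ending in cmp+jne: 5 instructions fit in one decode cycle.
print(decode_cycle(["mov", "add", "mov", "cmp", "jne"]))  # (4, 5)
```

Without the fusable pair at the end, the same decoder only takes 4 instructions per cycle, which is where the "extra" instruction per clock comes from.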
 
Ross/Electron... Thanks, I read that Anandtech article yesterday and looked to update this post, but couldn't find it...Will add it into the top post
 
Well, a guy at amdzone.com posted the comments below (I know, I know, but that doesn't mean all info there is biased, and this is a user comment, not an editorial). Anyway, I thought I had a source for some of this, but I couldn't find it, and I can't avoid studying for my finals any more tonight ;) So take this as you will - an opinion and that is all; I have no reliable source at this time to back it up.

I wish to point out a few "problems" with this article:



The "Core" has 4-way execution vs. Hammer's 3-way (and thus a 96- vs. 72-entry reorder buffer - which is not a coincidence, since 72*4/3=96). However, it's shown in some texts that the IPC of single-threaded programs rarely exceeds 3, and thus AMD's 3-way is really an economical and sufficient choice, while Intel's is the "rich man's approach". One notable exception is floating-point programs, which can have IPC > 3. That is probably the rationale for K8L having wider FP units.

The micro-ops and macro-ops comparison is moot, because AFAIK there is no "macro-ops" in Hammer. Its description of AMD "K8" macro-ops is thus wrong: most simple x86 instructions are decoded to one or two *micro-ops* by K8 fastpath decoders, while the complicated ones are decoded by the microcode engine (to probably a variable number of micro-ops). In Hammer, even SSE instructions follow the fastpath.

The "smart" decoding also exists in Hammer, and can hardly be called an advantage of the Core.

The comparison of load/store is also wrong. AMD Opteron CAN return load results out-of-order to the execution units; this is relatively easy, anyway. Opteron also can forward store results to dependent loads (store buffer forwarding). Intel Core's prediction logic is probably more sophisticated on memory address disambiguation and prediction, though.
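For anyone unfamiliar with the store-buffer forwarding he mentions, here's a minimal Python sketch of the idea. It's not modeled on either CPU, and real hardware has to handle partial overlaps, sizes, and speculation, which this ignores: a load first checks the in-flight stores for a matching address and takes the newest value, only reading memory on a miss.

```python
# Minimal sketch of store-to-load forwarding: pending stores sit in a
# buffer until retirement; a load to the same address gets its value
# forwarded from the buffer instead of waiting for memory.

class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory          # backing store: dict of addr -> value
        self.pending = []             # in-flight stores, oldest first

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr):
        # Scan newest-to-oldest so the most recent store wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v              # forwarded from the store buffer
        return self.memory[addr]      # no match: read from memory

    def retire_oldest(self):
        addr, value = self.pending.pop(0)
        self.memory[addr] = value     # now the store is visible in memory

mem = {0x10: 1, 0x20: 2}
sb = StoreBuffer(mem)
sb.store(0x10, 99)                   # store not yet written back
print(sb.load(0x10))                 # 99: forwarded, memory still holds 1
print(sb.load(0x20))                 # 2: buffer miss, read from memory
```

The disambiguation/prediction he credits Intel with is a separate question: it's about guessing whether a load may safely run ahead of stores whose addresses aren't known yet, not about the forwarding itself.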
 
Anyway, still not real major stuff, Conroe definitely still has its share of advantages, but you know how I am about getting down to the truth ;)
 
Well, let me help you. Most of the "problems" he points out are either not problems at all or something he likely misread.

The "Core" has 4-way execution vs. Hammer's 3-way (and thus a 96- vs. 72-entry reorder buffer - which is not a coincidence, since 72*4/3=96). However, it's shown in some texts that the IPC of single-threaded programs rarely exceeds 3, and thus AMD's 3-way is really an economical and sufficient choice, while Intel's is the "rich man's approach". One notable exception is floating-point programs, which can have IPC > 3. That is probably the rationale for K8L having wider FP units.
This is not a problem or even a point. He simply says that "3 units are enough most of the time on single-threaded apps". Actually, just a dumb thing to say as it has no bearing on anything. If 3 are efficient and sufficient enough, K8L wouldn't have 4 :)

The micro-ops and macro-ops comparison is moot, because AFAIK there is no "macro-ops" in Hammer. Its description of AMD "K8" macro-ops is thus wrong: most simple x86 instructions are decoded to one or two *micro-ops* by K8 fastpath decoders, while the complicated ones are decoded by the microcode engine (to probably a variable number of micro-ops). In Hammer, even SSE instructions follow the fastpath.
That is not moot or wrong just because he doesn't like what it's called. The article was careful to point out what AMD themselves define as and call macro-ops. Maybe he doesn't like the phrase or it doesn't really suit it, but they are in fact called macro-ops. Maybe he needs to brush up: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22054.pdf

The "smart" decoding also exists in Hammer, and can hardly be called an advantage of the Core.
Absolutely nowhere in that article does it say that smart decoding in and of itself is an advantage for Core. Quite the contrary - the very first line of the very first paragraph in that section says: "Similar to the K8 architecture, Core pre-decodes instructions that are fetched." The only thing remotely close to talking about an advantage for Core in that whole section is:
anand said:
So how do Intel's Core and AMD's Hammer compare when it comes to decoding? It is hard to say at the moment without access to Intel's optimization manuals. However, we can get a pretty good idea. In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer can do only 3.

The comparison of load/store is also wrong. AMD Opteron CAN return load results out-of-order to the execution units; this is relatively easy, anyway. Opteron also can forward store results to dependent loads (store buffer forwarding). Intel Core's prediction logic is probably more sophisticated on memory address disambiguation and prediction, though.
Again, nowhere does it say that the AMD can't do OOO. The load/store part I am not sure about, but one thing I noticed quickly was that all of a sudden he's talking about Opteron, which is mentioned (again) nowhere in the entire article. The article is pretty consistently talking about the Athlon64. I don't know how he made that jump, or if the Opteron/Athlon use the same or similar arch, but he needs to be careful interchanging products like that if he wants to use them as a base to contradict what someone else is talking about.

Suffice it to say, as much as he knows or sounds like he knows, there's nothing there that remotely proves anything in the article is "incorrect info", particularly about Core, and on most of it he is flat-out wrong, likely from misunderstanding/misinterpreting what was written.
 
Ross said:
Well, let me help you. Most of the "problems" he points out are either not problems at all or something he likely misread.

This is not a problem or even a point. He simply says that "3 units are enough most of the time on single-threaded apps". Actually, just a dumb thing to say as it has no bearing on anything. If 3 are efficient and sufficient enough, K8L wouldn't have 4 :)

I hadn't read anywhere that K8L had 4, are you sure? I think his point was just that many people/websites seem to stress this part of the architecture and in many cases, it doesn't offer any real advantage with current software.


That is not moot or wrong just because he doesn't like what it's called. The article was careful to point out what AMD themselves define as and call macro-ops. Maybe he doesn't like the phrase or it doesn't really suit it, but they are in fact called macro-ops. Maybe he needs to brush up: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22054.pdf

Part of the problem here is labeling. Anyway, I think his point is that AMD "macro-ops" are not what intel (and the Anandtech article) consider macro-ops and therefore the macro-ops comparison is kind of pointless. I kind of agree with him that the differences in architecture make a comparison much more difficult and I'm sure we'd hear a very different story if the article was based off a conversation from an AMD engineer, but either way, it's more of a labeling issue.

Absolutely nowhere in that article does it say that smart decoding in and of itself is an advantage for Core. Quite the contrary - the very first line of the very first paragraph in that section says: "Similar to the K8 architecture, Core pre-decodes instructions that are fetched." The only thing remotely close to talking about an advantage for Core in that whole section is:

I assume he's talking about micro-op fusion - not really sure - in which case he is right and wrong. AMD can do this, but Anandtech did mention it as well; maybe he just missed it in the article.

Again, nowhere does it say that the AMD can't do OOO. The load/store part I am not sure about, but one thing I noticed quickly was that all of a sudden he's talking about Opteron, which is mentioned (again) nowhere in the entire article. The article is pretty consistently talking about the Athlon64. I don't know how he made that jump, or if the Opteron/Athlon use the same or similar arch, but he needs to be careful interchanging products like that if he wants to use them as a base to contradict what someone else is talking about.

Suffice it to say, as much as he knows or sounds like he knows, there's nothing there that remotely proves anything in the article is "incorrect info", particularly about Core, and on most of it he is flat-out wrong, likely from misunderstanding/misinterpreting what was written.

This is actually where I think he has the biggest point (if he is indeed correct), he's not saying anything about OOO in itself, just the loads in OOO. The A64/Opteron thing isn't important IMO, they are the same architecture, only difference is the # of HT connects activated to allow multiple cpu systems, I mix and match them myself from time to time. If he was writing an actual article, he needs to get it right, but in the situation at hand, not a big deal. Like I said earlier, what he brings up isn't a big deal and it really depends on if he's correct in some of his own information.
 
I hadn't read anywhere that K8L had 4, are you sure? I think his point was just that many people/websites seem to stress this part of the architecture and in many cases, it doesn't offer any real advantage with current software.
No. I thought he specifically said 4, but he did say they had wider units, so I assume they will. I know nothing about AMDs :D As for the advantage, it will really depend on how often it makes use of the extra units, and rest assured, software will catch up with CPUs. With multi-cores so obviously the trend, it won't be long before software is even better optimised to make use of them. A patch here and there is all it takes ;)

Part of the problem here is labeling. Anyway, I think his point is that AMD "macro-ops" are not what intel (and the Anandtech article) consider macro-ops and therefore the macro-ops comparison is kind of pointless. I kind of agree with him that the differences in architecture make a comparison much more difficult and I'm sure we'd hear a very different story if the article was based off a conversation from an AMD engineer, but either way, it's more of a labeling issue.
Yeah, if you look at what Anand said and what he said, they're basically describing the exact same thing. Point being, though, it's not Anand's labelling, it's AMD's, and he's saying Anand was wrong. Macro-ops obviously mean 2 different things between Intel and AMD in the technical sense.

This is actually where I think he has the biggest point (if he is indeed correct), he's not saying anything about OOO in itself, just the loads in OOO. The A64/Opteron thing isn't important IMO, they are the same architecture, only difference is the # of HT connects activated to allow multiple cpu systems, I mix and match them myself from time to time. If he was writing an actual article, he needs to get it right, but in the situation at hand, not a big deal. Like I said earlier, what he brings up isn't a big deal and it really depends on if he's correct in some of his own information.
Yeah, like I said, I don't really know anything about the AMDs. Now at least I know A64s/Opties have the same architecture :) If he's right, that would be a mistake on Anand's part, but I would think they did their homework on it before writing ;)
 