Well, let me help you. Most of the "problems" he points out are either not problems at all or something he likely misread.
The "Core" has 4-way execution vs. Hammer's 3-way (and thus a 96- vs. 72-entry reorder buffer, which is not a coincidence, since 72*4/3=96). However, it's shown in some texts that the IPC of single-threaded programs rarely exceeds 3, and thus AMD's 3-way is really an economical and sufficient choice, while Intel's is the "rich man's approach". One notable exception is floating-point programs, which could have IPC > 3. That is probably the rationale for K8L having wider FP units.
This is not a problem, or even a point. He simply says that "3 units are enough most of the time on single-threaded apps". Honestly, it's a dumb thing to say, since it has no bearing on whether anything in the article is wrong. If 3 were really efficient and sufficient, K8L wouldn't have 4.
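For what it's worth, the reorder-buffer arithmetic in the quote above does check out. Here is a quick Python sketch of it, assuming the buffer size simply scales linearly with issue width. That linear scaling is the quoted poster's assumption, not something I know either vendor to have stated, and the function name is made up for illustration.

```python
from fractions import Fraction


def scaled_rob_entries(base_entries: int, base_width: int, new_width: int) -> int:
    """Scale a reorder-buffer size linearly with issue width.

    Linear scaling is an assumption taken from the quoted post,
    not a documented design rule.
    """
    scaled = Fraction(base_entries * new_width, base_width)
    if scaled.denominator != 1:
        raise ValueError("scaling does not give a whole number of entries")
    return int(scaled)


# Hammer: 72 entries at 3-way; scale up to a 4-way design.
print(scaled_rob_entries(72, 3, 4))  # 72 * 4 / 3
```

All this shows is that the two numbers are consistent with each other. It says nothing about whether 3-way or 4-way is the better choice.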
The micro-ops and macro-ops comparison is moot, because AFAIK there are no "macro-ops" in Hammer. Its description of AMD "K8" macro-ops is thus wrong: most simple x86 instructions are decoded to one or two *micro-ops* by K8 fastpath decoders, while the complicated ones are decoded by the microcode engine (to probably a variable number of micro-ops). In Hammer, even SSE instructions follow the fastpath.
That is not moot or wrong just because he doesn't like what it's called. The article was careful to point out what AMD themselves define as and call macro-ops. Maybe he doesn't like the phrase or it doesn't really suit it, but they are in fact called macro-ops. Maybe he needs to brush up:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22054.pdf
The "smart" decoding also exists in Hammer, and can hardly be called an advantage of the Core.
Absolutely nowhere in that article does it say that smart decoding in and of itself is an advantage for Core. Quite the contrary, the very first line of the very first paragraph in that section says: "Similar to the K8 architecture, Core pre-decodes instructions that are fetched." The only thing in that whole section that comes remotely close to claiming an advantage for Core is:
anand said:
So how do Intel's Core and AMD's Hammer compare when it comes to decoding? It is hard to say at the moment without access to Intel's optimization manuals. However, we can get a pretty good idea. In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer can do only 3.
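To put the quoted decode numbers side by side, here is a rough Python sketch of the peak decode throughput they imply. The function name is made up, and the 25% figure for how often fusion yields a fifth instruction is purely hypothetical; the article only says "sometimes". These are upper bounds on decode, not measured IPC.

```python
def peak_decoded(cycles: int, width: int, fused_cycles: int = 0) -> int:
    """Upper bound on x86 instructions decoded over a number of cycles.

    width        -- instructions decoded per cycle (4 for Core, 3 for Hammer,
                    per the quoted article)
    fused_cycles -- cycles in which fusion lets one extra instruction through
                    (hypothetical value, the article gives no rate)
    """
    if not 0 <= fused_cycles <= cycles:
        raise ValueError("fused_cycles must be between 0 and cycles")
    return cycles * width + fused_cycles


cycles = 100
core = peak_decoded(cycles, width=4, fused_cycles=25)  # assumed 25% fusion rate
hammer = peak_decoded(cycles, width=3)

print(core, hammer)
```

Even with no fusion at all, the quoted widths give Core a 4-to-3 edge at the decoders, which is all the article is claiming in that passage.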
The comparison of load/store is also wrong. AMD Opteron CAN return load results out-of-order to the execution units; this is relatively easy, anyway. Opteron also can forward store results to dependent loads (store buffer forwarding). Intel Core's prediction logic is probably more sophisticated on memory address disambiguation and prediction, though.
Again, nowhere does it say that AMD can't do OOO. I am not sure about the load/store part, but one thing I noticed quickly was that all of a sudden he's talking about Opteron, which (again) is mentioned nowhere in the entire article. The article is pretty consistently talking about Athlon64. I don't know how he made that jump, or whether the Opteron and Athlon use the same or a similar architecture, but he needs to be careful about interchanging products like that if he wants to use them as a basis for contradicting what someone else wrote.
Suffice it to say, as much as he knows or sounds like he knows, there's nothing there that remotely proves anything in the article is "incorrect info", particularly about Core, and on most of it he is flat-out wrong, likely from misunderstanding or misinterpreting what was written.