Theories About Modern CPU Cache

Today there are two major CPU manufacturers, Intel and AMD. Both have good products on the market. But when overclocking, there are many factors to consider when buying a CPU, as the constant battle between the Athlon and the Coppermine shows.

The Athlon, with its fast 200 MHz bus and a huge 128 KB Level 1 cache, gave AMD a huge advantage. The Athlon was clearly the champion. But soon Intel came out with the Coppermine and its 256 KB on-die Level 2 cache. Then things started to go downhill for AMD. I think I have a good explanation why.

Please remember I am using information obtained off the Internet. I have tried to get the most accurate information there is, but if I have made any mistakes, please bring them to my attention.

I will start with what I believe is the source of the problem. The CPU’s L1 and L2 caches each have a property called the cache association (usually called associativity). It determines how many different sections of main memory can be mirrored into the cache at the same time. For example, the K5 CPU has a 16 KB data cache with 4-way association, so the CPU can hold four 4 KB blocks of main memory in the L1 cache.
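The block size in that example is just the cache size divided by the number of ways. Here is a minimal sketch of the arithmetic, using the K5 figures quoted above. The 32-byte line size and the function names are assumptions made for illustration, not figures from the sources I used.

```python
def way_size_kb(cache_kb: int, ways: int) -> float:
    """Size of one way: the cache capacity split evenly among its ways."""
    return cache_kb / ways


def num_sets(cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets: each set holds one line from every way."""
    return cache_bytes // (ways * line_bytes)


# K5 data cache as quoted in the article: 16 KB, 4-way.
print(way_size_kb(16, 4))          # 4.0 (KB per way)

# ASSUMPTION: 32-byte cache lines, chosen only for illustration.
print(num_sets(16 * 1024, 4, 32))  # 128 sets
```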

Now the Pentiums (75-200 non-MMX) have an 8 KB data cache with 2-way association. So if everything but the caches were equal, the K5 would be superior. The early Pentiums and the K5 are different internally, but the cache association can still help superscalar CPUs. Think about it: in a superscalar Pentium (at 166 MHz), suppose data is needed from 3 places to complete a set of instructions.

Let’s say that 2 of these instructions can be paired and completed in one clock cycle, so it will take 2 clock cycles to complete all 3 instructions. But because the 2-way L1 cache cannot hold the third location, there is an L1 cache miss, and an area of the L1 cache will need to be flushed.

Then the CPU will wait for the slower (66 MHz) system bus to load the needed data. So instead of taking 1 clock cycle, the last instruction will now take a minimum of 4 clock cycles (the access to the L2 cache takes at least 3 cycles, plus the cycle to complete the instruction).
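To make the cycle count concrete, here is a small sketch of the same arithmetic. The constants come straight from the example; the names are mine, and everything the note below leaves out is also left out here.

```python
PAIRED_CYCLES = 1     # two paired instructions complete together in one cycle
EXECUTE_CYCLES = 1    # the third instruction, executed on its own
L2_ACCESS_CYCLES = 3  # minimum L2 access time used in the example


def total_cycles(l1_miss: bool) -> int:
    """Cycles for the three-instruction sequence, with or without an L1 miss."""
    last_instruction = EXECUTE_CYCLES + (L2_ACCESS_CYCLES if l1_miss else 0)
    return PAIRED_CYCLES + last_instruction


print(total_cycles(l1_miss=False))  # 2
print(total_cycles(l1_miss=True))   # 5
```

So a single miss turns a 2-cycle sequence into one of at least 5 cycles in this simplified model.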

The 166 MHz K5 has an obvious advantage in this sort of situation. The K5 can hold all the data in the L1 cache and continue to run at full speed. The Pentium would have to swap the data in and out of the L2 cache, or possibly main memory.
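The situation can be sketched with a toy set-associative cache. This is only an illustration under stated assumptions: least-recently-used replacement, 32-byte lines, and three addresses picked on purpose so that they land in the same set. The class and function names are hypothetical, and real CPUs are more complicated than this.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int = 32):
        self.ways = ways
        self.line_bytes = line_bytes  # ASSUMPTION: 32-byte lines
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss (loading the line)."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)   # evict the least recently used line
        entries[tag] = None
        return False


def count_misses(size_bytes: int, ways: int, addresses, rounds: int = 100) -> int:
    cache = SetAssociativeCache(size_bytes, ways)
    for _ in range(rounds):
        for address in addresses:
            cache.access(address)
    return cache.misses


# Three locations spaced 4 KB apart map to the same set in both caches.
locations = [0, 4096, 8192]
print(count_misses(8 * 1024, 2, locations))   # Pentium-style, 2-way: 300
print(count_misses(16 * 1024, 4, locations))  # K5-style, 4-way: 3
```

With this access pattern the 2-way cache misses on every access, while the 4-way cache misses only on the first pass. Addresses that fall into different sets would not collide like this, so this is the worst case for the 2-way cache and not typical behaviour.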

Note: I did not take into account the L1 access, L2 latency, instruction fetch, instruction decode and control cycles that are necessary in a CPU and L2 cache. I have tried to keep this example simple but accurate.

Now in the new CPUs, like the Coppermine and Athlon, a trip to main memory can mean that the CPU slows down to the speed of the main memory. So the more often the data is found in the L1 and L2 caches, the faster the CPU is overall. The cache association can affect this greatly.
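One common way to put numbers on this is the average memory access time. The sketch below uses the usual textbook formula. The latencies and hit rates are made-up values chosen only to show the trend; they are not measurements of any Coppermine or Athlon.

```python
def average_access_cycles(l1_hit_rate: float, l2_hit_rate: float,
                          l1_cycles: float, l2_cycles: float,
                          memory_cycles: float) -> float:
    """Average cycles per memory access for a two-level cache."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    # Every access pays the L1 time; L1 misses also pay the L2 time;
    # accesses that miss in both caches also pay the main memory time.
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# ASSUMPTION: illustrative latencies of 1, 10 and 100 cycles.
better = average_access_cycles(0.95, 0.90, 1, 10, 100)
worse = average_access_cycles(0.90, 0.90, 1, 10, 100)
print(f"95% L1 hits: {better:.2f} cycles per access")
print(f"90% L1 hits: {worse:.2f} cycles per access")
```

With these assumed numbers, dropping the L1 hit rate from 95% to 90% raises the average from about 2 cycles to about 3 cycles per access.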

The new Coppermines and Athlons have different cache layouts. The Coppermine has 16 KB data and instruction caches with 4-way association. The L2 cache is 256 KB and has 8-way association. So this means that the Coppermine can hold data and instructions from a total of 12 locations in main memory (4 in L1 plus 8 in L2).

The Athlon was praised for its 7th generation architecture and its huge 128 KB L1 cache. The caches may be big, but the data and code L1 caches and even the L2 cache are only 2-way associative. So if the Athlon, even though it can execute about 6 instructions per cycle, needs data from 3 or more memory locations, it will have to slow down and go to its L2 cache or even main memory to get the necessary data.

This would also explain what I have heard about the poor performance of the new Celerons with the Coppermine core. They have only a 128 KB L2 cache with 4-way association, so they are inferior to the true Coppermine in terms of cache performance.
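Counting the locations the same way for each CPU (L1 ways plus L2 ways) gives a quick side-by-side view. This is only a sketch of my own rule of thumb, not a standard benchmark figure. The Celeron's L1 is assumed to match the Coppermine's 4-way L1, since it uses the same core.

```python
# Ways per cache level, as quoted in the article.
# ASSUMPTION: the Coppermine-core Celeron keeps the Coppermine's 4-way L1.
CACHE_WAYS = {
    "Coppermine": {"L1": 4, "L2": 8},
    "Athlon": {"L1": 2, "L2": 2},
    "Celeron (Coppermine core)": {"L1": 4, "L2": 4},
}


def total_locations(ways_by_level: dict) -> int:
    """Sum the ways of every cache level, following the article's counting."""
    return sum(ways_by_level.values())


for cpu, ways in CACHE_WAYS.items():
    print(f"{cpu}: {total_locations(ways)} locations")
```

By this count the Coppermine comes out at 12, the Celeron at 8 and the Athlon at 4.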

This is a plausible cause of the performance differences between the new CPUs of today.

Benjamin Whetham

