A Case-Study On The Impact Of L2 Cache Size On CPU Performance

Short Summary

In this article, we examine the impact of the L2 cache size on CPU performance by comparing benchmarks of a PIII 1200 MHz with a PIII-S 1266 MHz. We find that a larger L2 cache gives real performance gains and attempt to quantify those gains.

Introduction

Modern processors ship with L2 caches of varying sizes. For example, Celerons have 128KB of L2 cache, the newer Celeron II’s (Tualatin) have 256KB, PIII (Coppermines and Tualatins) and Athlons have 256KB, PIII-S chips have 512KB, Xeons can have 1MB or more, and P4’s have either 256KB or 512KB.

When comparing processors, it is important to note that the size of the L2 cache can have a great impact on that comparison. It is generally accepted that a larger L2 cache gives greater performance, but direct benchmark comparisons that attempt to isolate the effect of the L2 cache are hard to come by.

Recently, I replaced my PIII 1200 MHz (256KB L2 cache) with a PIII-S 1266 MHz (512KB L2 cache), giving me the opportunity to run these benchmarks on identical hardware with identical software and driver settings. This allowed me to isolate, as well as possible, the effects of the L2 cache size on CPU performance.

History: Just what is the L2 Cache?

It is natural at this point to wonder just what the L2 cache is and what it does. (It’s also natural to wonder if there is an L1 cache.)

Modern processors have two levels of cache: L1 and L2.

The L1 cache is a small amount of memory, generally 32KB to 128KB, that runs at the same speed as the processor and stores instructions and data to feed to the CPU core. It sits directly on the CPU die itself. The task of the L1 cache is to send new instructions to the core as quickly as the core can take them; otherwise, the core will sit idle and underutilized.

To assist the L1 cache, most modern processors have another, larger layer of cache: L2.

This layer of memory will generally run faster than system memory and can be accessed directly by the CPU / core. The L2 cache will not only act as a buffer for incoming instructions and data, but will also store recent instructions and data and try to anticipate what will be done next. When successful, this anticipation can feed instructions to the CPU even faster and thereby increase its utilization.

On a Pentium III, this predictive cache is called ATC: Advanced Transfer Cache. Clearly, when the L2 cache works at its best, the CPU can be more effectively used. And even when it isn’t at its best, having more L2 cache allows more instructions and data to be retained and increases the probability that the cache’s anticipation will be correct.
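To make the last point concrete, here is a toy simulation, not a model of the real Pentium III cache. It assumes a fully associative cache with least-recently-used replacement, 32-byte lines, and uniformly random accesses over a fixed working set. The function name hit_rate and the 768KB working set are illustrative choices, not anything measured in this article.

```python
import random
from collections import OrderedDict

LINE_BYTES = 32  # assumed cache line size for this sketch


def hit_rate(cache_kb, working_set_kb, accesses=100_000, seed=0):
    """Fraction of accesses served from a toy LRU cache (hypothetical model)."""
    cache_lines = cache_kb * 1024 // LINE_BYTES
    working_lines = working_set_kb * 1024 // LINE_BYTES
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    cache = OrderedDict()      # keys are line numbers, ordered oldest to newest
    hits = 0
    for _ in range(accesses):
        line = rng.randrange(working_lines)
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / accesses


if __name__ == "__main__":
    for size_kb in (256, 512):
        print(f"{size_kb}KB cache, 768KB working set: {hit_rate(size_kb, 768):.2f}")
```

In this simplified model the larger cache retains a larger share of the working set, so its hit rate is higher. Once the working set fits entirely in the cache, extra capacity stops helping.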

Traditionally, the L2 cache was located on the motherboard. At that time, the L2 cache was still much faster than the system memory, but it was hamstrung by the fact that it was so far from the CPU core and L1 cache. With the early PIII’s, Intel moved the L2 cache (512KB at that point) from the motherboard to the SECC module (processor packaging) and ran it at half the processor speed. This was a great improvement for CPU performance, but as clock speeds increased, it was once again a bottleneck.

More recently, the L2 cache was moved from the processor packaging to the CPU die itself. While the cache size had to be reduced for the then large CPU cores, it reaped a large benefit: It ran at full-speed, rather than half-speed. This more than made up for the reduced L2 cache size for the processors of the day. (AMD’s evolution of the L2 cache was largely identical.)

At this point, the reader may wonder just why the cache is smaller as it gets closer to the CPU core, just where it is needed the most. In short, manufacturing memory on the CPU die itself is very expensive, and space is limited. This puts a constraint (in terms of price and available real estate) on the L1 and L2 cache sizes.

One last note: With the recent die shrinks from the 0.18 micron process to the 0.13 micron process, Intel has gained room on the CPU die for a larger cache. This is why we’ve begun to see 512KB L2 caches on Northwood P4’s and PIII-S chips. AMD also has larger L2 cache sizes on its roadmaps.

Okay, with a brief description of the L1 and L2 caches and what they do for performance, we can turn to our first benchmark: Matlab. (Edit: I couldn’t afford to keep an idle PIII 1200 any longer, and so I was unable to run any additional benchmarking.)

Test #1: Matlab Benchmarks

Introduction: Matlab is a numerical linear algebra package which is used extensively in advanced mathematics, engineering, and scientific computation. For example, I use it in my research to numerically solve the system of partial differential equations arising from a tumor growth model. (Applied computational fluid mechanics.)

This software is about as real-world as you can get; many of the computations you see in this benchmark are similar to those used in audio, graphics, and gaming applications.

Matlab 6.0.0.42a, Release 12 includes a benchmark function, bench, which runs 6 different tests:

  1. LU (from LINPACK, n=1000): does a floating point LU-decomposition of a 1000×1000 matrix using regular memory access
  2. FFT: does a fast Fourier transform using floating point values and irregular memory access
  3. ODE: solves an ordinary differential equation using various Matlab-specific data structures
  4. Sparse: solves a sparse linear system of mixed integer and floating point values
  5. 2-D: tests 2-D drawing graphics
  6. 3-D: tests OpenGL graphics

Additional information on Matlab and bench can be found at Mathworks.

Software Setup:

  • My OS is Windows XP Pro, with SP1 installed, on a 40GB WD SE drive in NTFS.
  • As this is an Intel machine, I have the latest version of Intel’s chipset drivers and application accelerator installed.
  • My GF2 MX-400 has driver version 40.72 (WHQL)

Hardware Setup:

  • I have 512 MB of Kingmax PC150, which I ran at 2-2-2-5/7 for all tests

Testing Procedure:

I used the following procedure for each test:

  1. Restart computer
  2. Allow all startup procedures to stop running
  3. Open Matlab. Disable warning messages and enable OpenGL.
  4. Run bench once to load all textures, etc. into memory.
  5. Run bench 10 times and average the results.
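The averaging in the last step can be sketched as follows. This is a minimal illustration in Python rather than Matlab; the function name average_runs and the timings are placeholders invented for the example, not measured results.

```python
from statistics import mean

TESTS = ("LU", "FFT", "ODE", "Sparse", "2-D", "3-D")


def average_runs(runs):
    """Average each test's time across runs; each run holds one time per test."""
    if any(len(run) != len(TESTS) for run in runs):
        raise ValueError("each run needs one time per test")
    # zip(*runs) regroups the data so each tuple holds one test's times
    return {name: mean(times) for name, times in zip(TESTS, zip(*runs))}


# Placeholder timings in seconds, invented purely for illustration.
example_runs = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [3.0, 2.0, 1.0, 4.0, 7.0, 6.0],
]
averages = average_runs(example_runs)
```

Discarding the first warm-up run before collecting the timed runs, as in step 4, keeps one-time loading costs out of the averages.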

Results: 1270 MHz (varied FSB, equal clock speed)

For this test, I ran the PIII 1200 at 141 FSB x 9 = 1269 MHz, and the PIII-S 1266 at 134 FSB x 9.5 = 1272 MHz. (Discrepancies are due to FSB variations.) This allows us to test the effects of the L2 cache size independently of the CPU clock speed, although the memory bandwidth (via the FSB) will vary. Notice that the PIII-S (and thus the memory) is running at a lower FSB, so the results here most likely slightly understate the performance differences.
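The clock arithmetic behind this setup is simply FSB times multiplier. The sketch below uses the nominal settings quoted above; the function name core_clock_mhz is illustrative. Because nominal FSB values are rounded, the computed figures can differ by a MHz or so from the reported ones.

```python
def core_clock_mhz(fsb_mhz, multiplier):
    """Nominal core clock: front-side bus frequency times the CPU multiplier."""
    return fsb_mhz * multiplier


piii = core_clock_mhz(141, 9)      # PIII 1200 in this test
piii_s = core_clock_mhz(134, 9.5)  # PIII-S 1266 in this test
gap_pct = 100 * abs(piii_s - piii) / piii  # how closely the two clocks match
print(f"PIII: {piii} MHz, PIII-S: {piii_s} MHz, gap: {gap_pct:.2f}%")
```

With these settings the two core clocks differ by well under half a percent, which is what makes the comparison a reasonable test of cache size rather than clock speed.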

(Graph "1270": Matlab bench execution times for the PIII and PIII-S at roughly 1270 MHz.)

Notice that the PIII-S chip with its increased L2 cache performs better in most benchmarks.

(Graph "1270rel": relative change in execution times at roughly 1270 MHz.)

The relative change gives the percentage decrease in execution times and thus the percentage increase in performance. Notice that the PIII-S chip with its larger L2 cache performs better in most benchmarks, on average 5.66% better.
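As a rough sketch of how such a relative change could be computed (the article does not spell out the exact formula, so averaging the per-test percentages is an assumption), the helpers below use hypothetical names and made-up timings. For small changes, the percentage decrease in time and the percentage speedup are close but not identical.

```python
def time_decrease_pct(t_base, t_new):
    """Percentage by which execution time dropped relative to the baseline."""
    return 100 * (t_base - t_new) / t_base


def speedup_pct(t_base, t_new):
    """Percentage increase in throughput (work per unit time)."""
    return 100 * (t_base / t_new - 1)


def average_decrease_pct(base_times, new_times):
    """Mean of the per-test percentage decreases (assumed averaging method)."""
    changes = [time_decrease_pct(b, n) for b, n in zip(base_times, new_times)]
    return sum(changes) / len(changes)


# Made-up example: a test that took 1.00 s on the baseline and 0.95 s after.
decrease = time_decrease_pct(1.00, 0.95)  # about 5.0
speedup = speedup_pct(1.00, 0.95)         # about 5.26
```

A negative value from time_decrease_pct would indicate a test where the second chip was slower.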

Results: 1341 MHz (varied FSB, equal clock speed)

For this test, I ran the PIII 1200 at 149 FSB x 9 = 1341 MHz, and the PIII-S 1266 at 141 FSB x 9.5 = 1341 MHz. (Discrepancies are due to FSB variations.) Once again, this is to test the effects of the L2 cache independently of the clock speed. Notice that the PIII-S (and thus the memory) is running at a lower FSB, so the results here most likely slightly understate the performance differences.

(Graph "1341": Matlab bench execution times for the PIII and PIII-S at 1341 MHz.)

Once again, the PIII-S chip with its increased L2 cache outperforms the PIII.

(Graph "1341rel": relative change in execution times at 1341 MHz.)

The relative change gives the percentage decrease in execution times and thus the percentage increase in performance. Notice that the PIII-S chip with its increased L2 cache performs better in most benchmarks, on average 4.46% better in this case.

Results: 149 MHz FSB (equal FSB, varied clock speed)

For this test, I ran the PIII 1200 at 149 FSB x 9 = 1341 MHz, and the PIII-S 1266 at 149 FSB x 9.5 = 1415 MHz. (Discrepancies are due to FSB variations.) This allows us to test independently of the memory bandwidth and any other FSB effects. If the L2 cache is not important to the performance, then the execution times should scale linearly with the multipliers. That is, the PIII-S times should be

100*(1-(9/9.5))% = 5.26%

lower than those of the PIII if the L2 cache is not an important factor. If the results differ by more than this amount, then we have again shown that the increased L2 cache improves performance on a level that is discernible from any improvement due to raw clock speed.
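The 5.26% baseline can be checked with a short calculation. It assumes execution time is inversely proportional to core clock at a fixed FSB, so the time ratio equals the ratio of the multipliers; the function name is illustrative.

```python
def expected_time_reduction_pct(mult_base, mult_new):
    """Expected % drop in execution time if time scales as 1 / core clock."""
    return 100 * (1 - mult_base / mult_new)


expected = expected_time_reduction_pct(9, 9.5)
print(f"{expected:.2f}%")  # prints 5.26%
```

Any measured reduction beyond this figure is what the comparison attributes to something other than raw clock speed.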

(Graph "149fsb": Matlab bench execution times for the PIII and PIII-S at 149 MHz FSB.)

Once again, the PIII-S chip outperforms the PIII. The question is, can this increase in performance be attributed to the increased multiplier alone? The next graph addresses that.

(Graph "149fsbrel": relative change in execution times at 149 MHz FSB.)

The relative change gives the percentage decrease in execution times and thus the percentage increase in performance. Notice that the PIII-S chip with its increased L2 cache performs better in most benchmarks, on average 8.78% better in this case. As this result exceeds the 5.26% predicted by linear scaling by about 67% (in relative terms), we can safely conclude that the L2 cache is responsible for a substantial portion of the faster execution times.
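The 67% figure is the relative gap between the observed average and the linear-scaling prediction. A minimal check using the two percentages quoted in this article (the function name is illustrative):

```python
def excess_over_linear_pct(observed_pct, expected_pct):
    """How far the observed gain exceeds the expected one, in relative terms."""
    return 100 * (observed_pct - expected_pct) / expected_pct


# 8.78% observed average versus 5.26% expected from the multipliers alone.
excess = excess_over_linear_pct(8.78, 5.26)
print(f"{excess:.0f}%")  # prints 67%
```

In absolute terms the gap is about 3.5 percentage points of execution time.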

Conclusions

At equivalent CPU clock speeds (and even with reduced memory bandwidth), a CPU with a larger L2 cache demonstrates appreciably better performance in computationally-intensive tasks.

In the case of the PIII architecture, doubling the L2 cache improves computational performance on the order of 4-5%.

*CORRECTION:* A reader pointed out an error regarding the introduction of the on-module L2 cache. The L2 cache was actually moved to the SECC module with the Pentium II. Furthermore, the Pentium Pro also had an on-module L2 cache. (Which was actually my first chip. 😉)

An interesting “historical” article turned up by Google is here.

Paul Macklin
