L2 Cache Performance Benchmarks: 512KB vs. 256KB

macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Short Summary
In this article, we examine the impact of the L2 cache size on CPU performance by comparing benchmarks of a PIII 1200 MHz with a PIII-S 1266 MHz. We find that a larger L2 cache gives real performance gains.
Introduction
In modern processors, the size of the L2 cache varies considerably. For example, Celerons have 128KB of L2 cache, the newer Celeron II's (Tualatin) have 256KB, PIII Coppermines and Athlons have 256KB, PIII-S chips have 512KB, Xeons can have 1MB or more, and P4's have either 256KB or 512KB. When comparing processors, it is important to note that the size of the L2 cache can have a great impact on that comparison. It is generally accepted that a larger L2 cache gives greater performance, but direct benchmark comparisons that attempt to isolate the effect of the L2 cache are hard to come by.

Recently, I replaced my PIII 1200 MHz (256KB L2 cache) with a PIII-S 1266 (512KB L2 cache), giving me the opportunity to run these benchmarks on identical hardware with identical software and driver settings. This allowed me to isolate, as well as possible, the effects of the L2 cache size on CPU performance.

History: Just what is the L2 Cache?
It is natural at this point to wonder just what the L2 cache is and what it does. (It's also natural to wonder if there is an L1 cache.)

Modern processors have two levels of cache: L1 and L2. The L1 cache is a small amount of memory, generally 64KB or 128KB, that runs at the same speed as the processor and stores instructions and data to feed to the CPU core. It sits directly on the CPU die itself. It is the task of the L1 to be able to send new instructions to the core as quickly as the core can take them, for otherwise, the core will sit idle and under-utilized.

To assist the L1 cache, most modern processors have another, larger layer of cache: L2. This layer of memory will generally run faster than system memory and can be accessed directly by the CPU / core. The L2 cache will not only act as a buffer for incoming instructions and data, but will also store recent instructions and data and try to anticipate what will be done next. When successful, this anticipation can feed instructions to the CPU even faster and thereby increase its utilization. On a Pentium III, this predictive cache is called ATC: advanced transfer cache. Clearly, when the L2 cache works at its best, the CPU can be used more effectively. And even when it isn't at its best, having more L2 cache allows more instructions and data to be retained and increases the probability that the cache's anticipation will be correct.
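To make the "more cache retains more data" point concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: it models a fully associative cache with least-recently-used (LRU) eviction and a made-up looping workload, whereas a real PIII L2 is set-associative and works on cache lines of real memory traffic. The function name and the workload are hypothetical.

```python
from collections import OrderedDict

def lru_hit_rate(cache_lines, addresses):
    """Fraction of accesses served by a fully associative LRU cache (toy model)."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(addresses)

# Hypothetical workload: loop 10 times over a working set of 384 "lines".
working_set = list(range(384)) * 10

small = lru_hit_rate(256, working_set)  # working set does not fit
large = lru_hit_rate(512, working_set)  # working set fits after the first pass

print(f"256-line cache hit rate: {small:.2f}")  # 0.00
print(f"512-line cache hit rate: {large:.2f}")  # 0.90
```

The gap is deliberately extreme: a loop slightly larger than an LRU cache is the worst case, so the smaller cache never hits. Real workloads fall somewhere in between, which is why the measured gains below are a few percent rather than night and day.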

Traditionally, the L2 cache was located on the motherboard. At that point in time, the L2 cache was still much faster than the system memory, but it was hamstrung by the fact that it was so far from the CPU core and L1 cache. With the early PIII's, Intel moved the L2 cache (512KB at that point) from the motherboard to the SECC module (processor packaging) and ran it at half the processor speed. This was a great improvement for the CPU performance, but as clock speeds increased, it was once again a bottleneck.

More recently, the L2 cache was moved from the processor packaging to the CPU die itself. While the cache size had to be reduced for the then large CPU cores, it reaped a large benefit: It ran at full-speed, rather than half-speed. This more than made up for the reduced L2 cache size for the processors of the day.

At this point, the reader may wonder just why the cache is smaller as it gets closer to the CPU core, just where it is needed the most. In short, manufacturing memory on the CPU die itself is very expensive, and space is limited. This puts a constraint (in terms of price and available real estate) on the L1 and L2 cache sizes.

One last note: With the recent die shrinks from the .18micron process to the .13micron process, Intel has gained room on the CPU die for a larger cache. This is why we've begun to see 512KB L2 caches on Northwood P4's and PIII-S chips.

Okay, with a brief description of the L1 and L2 caches and what they do for performance, we can turn to our first benchmark: Matlab.
Test #1: Matlab Benchmarks
Introduction: Matlab is a numerical linear algebra package which is used extensively in advanced mathematics, engineering, and scientific computation. For example, I use it in my research to numerically solve the system of partial differential equations arising from a tumor growth model. (Applied computational fluid mechanics.) This software is about as real-world as you can get.

Matlab 6.0.0.42a, Release 12 includes a benchmark function, bench, which runs 6 different tests:
1) LU (from LINPACK, n=1000): does a floating point LU-decomposition of a 1000x1000 matrix using regular memory access.
2) FFT: does a fast Fourier transform using floating point values and irregular memory access.
3) ODE: solves an ordinary differential equation using various Matlab-specific data structures.
4) Sparse: solves a sparse linear system of mixed integer and floating point values.
5) 2-D: tests 2-D drawing graphics.
6) 3-D: tests OpenGL graphics.

Additional information on Matlab and bench can be found at Mathworks.

Software Setup:
+My OS is Windows XP Pro, with SP1 installed, on a 40GB WD SE drive in NTFS.
+As this is an Intel machine, I have the latest version of Intel's chipset drivers and application accelerator installed.
+My GF2 MX-400 has driver version 40.72 (WHQL)
Hardware Setup:
+I have 512 MB of Kingmax PC150, which I ran at 2-2-2-5/7 for all tests
Testing procedure:
I used the following methodology for the testing for each test:
1) Restart computer
2) Allow all startup procedures to stop running
3) Open matlab. Disable warning messages and enable OpenGL.
4) Run bench once to load all textures, etc. into memory.
5) Run bench 10 times and average the results.
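The averaging in steps 4 and 5 can be sketched as follows. This is a minimal Python illustration, not the script I actually used: the function name is hypothetical, and the timings are placeholders rather than measured data. Each run is assumed to be a list of six execution times in seconds, in the order of the tests listed above.

```python
from statistics import mean

TESTS = ["LU", "FFT", "ODE", "Sparse", "2-D", "3-D"]

def average_runs(all_runs, warmup=1):
    """Average each test's time over the runs, skipping the first `warmup` runs."""
    timed = all_runs[warmup:]
    return {name: mean(run[i] for run in timed) for i, name in enumerate(TESTS)}

# Placeholder timings in seconds (NOT measured data): one warm-up run, two timed runs.
all_runs = [
    [9.0, 9.0, 9.0, 9.0, 9.0, 9.0],  # warm-up run, discarded
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.2, 2.2, 3.2, 4.2, 5.2, 6.2],
]

averages = average_runs(all_runs)
for name, seconds in averages.items():
    print(f"{name}: {seconds:.2f} s")
```

Discarding the warm-up run keeps one-time costs (loading textures and libraries into memory) out of the averages.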
Results: 1270 MHz (varied FSB, equal clock speed)
For this test, I ran the PIII 1200 at 141 FSB x 9 = 1269 MHz, and the PIII-S 1266 at 134 FSB x 9.5 = 1272 MHz. (Discrepancies are due to FSB variations.) This allows us to test the effects of the L2 cache size independently of the CPU clock speed, although the memory bandwidth (via the FSB) will vary. Notice that the PIII-S (and thus the memory) is running at a lower FSB, so the results here most likely slightly understate the performance differences.
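As a quick check of the arithmetic, the nominal core clock is just FSB times multiplier. The short Python sketch below uses the nominal FSB settings, so its output can differ by a MHz or so from the figures quoted above, which reflect the actual FSB; the function name is hypothetical.

```python
def core_clock_mhz(fsb_mhz, multiplier):
    """Nominal core clock in MHz: front-side bus speed times CPU multiplier."""
    return fsb_mhz * multiplier

piii = core_clock_mhz(141, 9)      # PIII 1200 in this test
piii_s = core_clock_mhz(134, 9.5)  # PIII-S 1266 in this test

clock_gap = 100.0 * abs(piii_s - piii) / piii  # how closely the clocks match
fsb_gap = 100.0 * (141 / 134 - 1.0)            # FSB advantage of the PIII setup

print(f"PIII:   {piii:.0f} MHz")
print(f"PIII-S: {piii_s:.0f} MHz")
print(f"clock mismatch: {clock_gap:.2f}%")          # 0.32%
print(f"PIII FSB advantage: {fsb_gap:.2f}%")        # 5.22%
```

The clocks match to within about a third of a percent, while the PIII's bus (and memory) runs roughly 5% faster, which is why the comparison is tilted slightly against the PIII-S.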

1270.bmp

Notice that the PIII-S chip with its increased L2 cache performs better in most benchmarks.

1270rel.bmp

The relative change gives the percentage decrease in execution times and thus the percentage increase in performance. Notice that the PIII-S chip with its larger L2 cache performs better in most benchmarks, on average 5.66% better.
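For reference, the relative change can be sketched in Python as below. The function names are hypothetical, the sample times are placeholders rather than my measurements, and averaging the per-test percentages is an assumption about how the overall figure is formed.

```python
def relative_change(t_ref, t_new):
    """Percentage decrease in execution time of t_new relative to t_ref."""
    return 100.0 * (1.0 - t_new / t_ref)

def mean_relative_change(ref_times, new_times):
    """Average of the per-test percentage decreases (assumed averaging method)."""
    changes = [relative_change(r, n) for r, n in zip(ref_times, new_times)]
    return sum(changes) / len(changes)

def throughput_gain(time_decrease_percent):
    """Convert a percentage decrease in time into a percentage increase in speed."""
    return 100.0 * (1.0 / (1.0 - time_decrease_percent / 100.0) - 1.0)

# Placeholder times in seconds (NOT measured data).
print(f"{relative_change(2.0, 1.9):.2f}%")                    # 5.00%
print(f"{mean_relative_change([2.0, 4.0], [1.9, 3.6]):.2f}%")  # 7.50%
print(f"{throughput_gain(5.66):.2f}%")                        # 6.00%
```

Strictly speaking, a decrease in execution time and an increase in speed are not the same number: a 5.66% shorter run corresponds to roughly 6.0% more work per unit time. For changes this small, the two are close enough to use interchangeably.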

Results: 1341 MHz (varied FSB, equal clock speed)
For this test, I ran the PIII 1200 at 149 FSB x 9 = 1341 MHz, and the PIII-S 1266 at 141 FSB x 9.5 = 1341 MHz. (Discrepancies are due to FSB variations.) Once again, this is to test the effects of the L2 cache independently of the clock speed. Notice that the PIII-S (and thus the memory) is running at a lower FSB, so the results here most likely slightly understate the performance differences.

1341.bmp

Once again, the PIII-S chip with its increased L2 cache outperforms the PIII.

1341rel.bmp

The relative change gives the percentage decrease in execution times and thus the percentage increase in performance. Notice that the PIII-S chip with its increased L2 cache performs better in most benchmarks, on average 4.46% better in this case.
Results: 149 MHz FSB (equal FSB, varied clock speed)
For this test, I ran the PIII 1200 at 149 FSB x 9 = 1341 MHz, and the PIII-S 1266 at 149 FSB x 9.5 = 1415 MHz. (Discrepancies are due to FSB variations.) This allows us to test independently of the memory bandwidth and any other FSB effects. If the L2 cache is not important to the performance, then the execution times should scale linearly with the multipliers. That is, the PIII-S times should be

100*(1-9/9.5)% = 5.26%

lower than those of the PIII if the L2 cache is not an important factor. If the measured decrease is larger than this amount, then we have again shown that the increased L2 cache improves performance to a degree that is discernible from any improvement due to raw clock speed.

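The linear-scaling baseline above is easy to reproduce. A minimal Python sketch, with a hypothetical function name, assuming execution time is inversely proportional to the core clock when the FSB is held fixed:

```python
def expected_time_decrease(mult_slow, mult_fast):
    """Percentage drop in execution time if time scales as 1 / multiplier."""
    return 100.0 * (1.0 - mult_slow / mult_fast)

expected = expected_time_decrease(9.0, 9.5)
print(f"expected decrease from clock speed alone: {expected:.2f}%")  # 5.26%
print(f"PIII-S time as a share of PIII time: {100.0 - expected:.2f}%")  # 94.74%
```

Any measured decrease beyond this baseline cannot be explained by the multiplier alone.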
149fsb.bmp

Once again, the PIII-S chip outperforms the PIII. The question is, can this increase in performance be attributed to the increased multiplier alone? The next graph addresses that.

149fsbrel.bmp

The relative change gives the percentage decrease in execution times and thus the percentage increase in performance. Notice that the PIII-S chip with its increased L2 cache performs better in most benchmarks, on average 8.78% better in this case. As this is about 67% larger than the 5.26% expected from linear scaling with the multiplier, we can safely conclude that the L2 cache is responsible for a substantial portion of the faster execution times.
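The 67% figure comes from comparing the observed average with the linear-scaling baseline. A short Python sketch of that arithmetic, using only the numbers quoted in this write-up (the variable names are mine):

```python
observed = 8.78                       # average decrease measured at 149 FSB, in percent
expected = 100.0 * (1.0 - 9.0 / 9.5)  # decrease predicted by the multipliers alone

excess_points = observed - expected        # extra percentage points
excess_ratio = observed / expected - 1.0   # how far the result exceeds the baseline

print(f"baseline: {expected:.2f}%")                     # 5.26%
print(f"excess: {excess_points:.2f} points")            # 3.52 points
print(f"relative to baseline: {100 * excess_ratio:.0f}%")  # 67%
```

In other words, roughly 3.5 percentage points of the improvement are left over after accounting for clock speed, which is in the same range as the gains seen in the two equal-clock tests.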

Conclusions:
At equivalent CPU clock speeds (and even with reduced memory bandwidth), a CPU with a larger L2 cache demonstrates appreciably better performance in computationally intensive tasks. In the case of the PIII architecture, doubling the L2 cache from 256KB to 512KB improves computational performance on the order of 4-5%.

Coming Soon:
Additional benchmarks at the same FSB speeds in the SiSoft Sandra CPU Arithmetic and Multimedia benchmarks. Also, a clearer writeup of the results thus far.

Feedback is appreciated!
 
deez

Member
Joined
Jul 9, 2001
Location
Louisville, KY
I'd definitely like to see both cpu's at 149 FSB for the next test. I'll also be redoing my seti bench with the 1.4 tonight

For another comparison check out the benchmark results at www.ocsetiteam.com and you'll see that the P3-S smokes the celeron T. And for those of you out there with these chips we need more benchmarks
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Deez, thanks for the suggestions. I decided to run the tests at those speeds so that the clock speeds would be roughly equal. So, differences observed in performance are related to the difference in L2 cache and FSB speed.

I like your idea. I'll do the tests with equal FSB so that the results are related to the differing multipliers and L2 caches. If the results scale linearly with the multiplier so that the PIII-S gives times 94.74% of the PIII at equal FSB but unequal multiplier / clock speed, then the L2 cache plays little role in the performance. Conversely, if the results don't scale linearly, then the L2 cache will again be demonstrated to have a measurable impact on the performance.

I'm not sure when I'll get to those additional benchmarks, but I will soon. Thanks again! -- Paul
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
The original write-up was edited to include a comparison with a constant FSB. -- Paul
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Thank you! I hope to be updating it with further benchmarks to see if the 5% finding is consistent. Hopefully, it will be interesting enough for others to take note of and refer to ...

I felt there was a need for some direct testing of L2 cache size, since we often see the bigger caches touted as increasing performance, but rarely see direct comparison.

Thanks again! -- Paul
 

james.miller

Member
Joined
Jun 8, 2002
Location
Dunstable, uk
also o/c the two systems to see what difference it makes to the performance.

my bet is: the pIII-s will pull further and further away as the speed increases
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Hi, and thanks for the idea and comment.

Well, here's how the tests are thus far:

test: #1
PIII FSB: 141 (overclocked)
PIII-S FSB: 134 (stock)
matched speed approx 1270 MHz
Performance difference: 5.66%, all of which is due to L2 cache difference, and is understated due to lower memory speed of PIII-S in the test.

test: #2
PIII FSB: 149 (overclocked)
PIII-S FSB: 141 (overclocked)
matched speed approx 1341 MHz
Performance difference: 4.64%, all of which is due to L2 cache difference, and is understated due to lower memory speed of PIII-S in the test.

test: #3
PIII FSB: 149 (overclocked)
PIII-S FSB: 149 (overclocked)
unmatched speed, 1341 MHz vs. 1415 MHz, equal memory speeds.
Performance difference: 8.78%, some of which is due to L2 cache difference, and some of which is due to the actual clock speed difference. However, if it were only due to clock speed difference, the expected performance difference would be 5.26%, so indeed some of the jump was due to the cache.

So, thus far it appears that the difference between the two was less at higher clock speeds, but then again, it's just a few data points. Next, I'll run the PIII-S at 1200 = 126 * 9.5 to get another equal clock speed comparison.

I also plan to run the benchmarks in SiSoft Sandra to get another means of comparison. (I already ran all the tests on the PIII 1.2 before uninstalling that chip.)

Thanks again! -- Paul
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Okay, I updated the section on the L2 cache. Feedback and improvements are welcome! :) -- Paul
 

toastedzergling

Member
Joined
May 25, 2002
Nice work Paul, maybe you should do as much benchies as you can and submit it to the front page, As a tualatin owner I am searching for this kind of info for a long long time.
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Hey, glad you like it!

I think I will eventually submit it. I'd like to finish my sandra benches first, though, and do a little better analysis.

Who knows, maybe this will be sticky material ... (nah!)

Did my L2 explanation sound okay?

Thanks! -- Paul
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
Hey, Zak!

What do you mean by general performance increase? Windows feels snappier, there's no doubt about that. The temps are also quite nice -- feels about the same as the last one for temps, maybe even slightly lower. I attribute that primarily to doing a better job at the ASIII application, but maybe the voltage really is running on the low side ...

But actually, my daily use is matlab, folding, and fluid mech. ;) (Okay, and too much outlook express.:)) Between my new hdd and CPU, my computer feels like it can literally fly! :D I can't thank you enough!

BTW, is my L2 explanation okay? Also, were the 149 FSB results what you expected?

Thanks -- Paul
 

JCLW

Member
Joined
Apr 1, 2002
Nicely done. :)

Try running clibench (mk III smp 0.7.15 (win32 i386)). It's a really small (160kb) simple benchmark (which is probably why it's so often overlooked) that has been used in the SMP community for years. It does the following tests:
Dhrystone V 2.1: Standard benchmark for the integer performance. The program code is about 16 kB, that means, depending on your CPU architecture, that it runs fully in level one cache. The faster the clock speed and the more sophisticated your CPU's architecture the better the results.

Whetstone: A standard for the floating point performance. Runs also fully in level 1 cache. Measures your FPU.

Eight queens problem: Runs mostly in level 1 cache. It's a test that measures how your CPU handles recursive functions. Depends mostly on the CPU's architecture.

Matrix operations: Runs mostly in level 2 cache. If you've got no cache you'll get very bad results.

Number crunch performance: Shows how fast your CPU does integer calculations. It does some time consuming calculations.

Floating point performance: This test uses nothing but FPU functions. It does calculations, conversions and other stuff.
Run one thread for every logical processor. You don't have to fill out all the info it asks when you run the tests. The memory benchmarks are known to be a bit odd - but still sometimes interesting.

- JW
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
My pleasure. I'm glad you liked it!

Oh, here's a fun point.

Some processors (read high-level Xeons) actually have an L3 cache, too. That must give some interesting results! :) -- Paul
 
macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
They can certainly hold their own ... :)

In my benches thus far, I see a bigger difference on 2D than 3D, but that makes sense, as the 2D is most likely controlled by the CPU / MMX / SSE, and the 3D by the gfx card. But even the 3D graphics improved somewhat.

But a Duron does have exceptional performance / price. ;)

-- Paul
 

Karl04

Member
Joined
Aug 1, 2002
Location
Suffolk, VA, USA
yeah they do, but if you wanted to get any of those that are beyond 1.1gig, then i think we'd all be overpaying. cause i've seen xp1600 for cheaper than durons

Karl