This is what I want from a benchmark program:
1) The ability to let me test a wide range of programs in standalone mode.
2) The ability to let me test programs running concurrently, and, besides an overall number, give me a performance breakout for each application being run.
3) The ability to either give me a choice of concurrent scenarios or let me build my own, and again, give me an overall number, and a breakout for each application.
Is this so much to ask? Apparently it is, judging from what the benchmarkers are doing.
Why Standalones Are Important
Programs in a set “category” often don’t act alike. A few examples (I’m using this as the basis for discussion):
Willies eat Athlons alive in Windows Media Encoder. Athlons win most other categories, sometimes by a little, sometimes by a lot. PIIIs almost always trail both, but PIIIs beat both in Photoshop.
But an average doesn’t tell me that. If my major application is Windows Media Encoder, I should definitely get a Willy, but if I just look at an average, I won’t know that.
On the other hand, if I never use Windows Media Encoder, I’m going to think the difference between a Willy and Athlon is closer than what it really is. If enough weight is put on Windows Media Encoder, I might think a Willy is better for my kind of work when it may actually be awful.
Let’s say I’m a typical office worker. Let’s say I use Word all the time. There’s not a substantial performance difference between a Willy and Athlon using Word. There is a substantial difference between a Willy and Athlon using Excel, or PowerPoint.
Let’s say I’m a creative type. Let’s say I use Premiere a lot. You might assume from Willy’s stellar performance in Windows Media Encoder that it would be great in Premiere, too (all that good bandwidth and all). And you would blunder bigtime. Willy is terrible at Premiere; Athlons rule there.
Or let’s say I use Photoshop all the time. Even a 2GHz Willy gets smacked by a 1GHz PIII. Athlons get wiped. If Photoshop is my life, I should be looking at a PIII or maybe a Tualatin.
But People Do More Than One Thing At A Time
Of course they do. Did you see me say “get rid of concurrent testing?” No. I’m saying, “Let’s see just how well these apps play with each other.”
The benchmark program has to compile that to come up with an average, anyway. Why can’t we see it?
Who knows, maybe some programs do very badly while IE is running, and similar programs don’t. Maybe they do badly on Platform A, but not Platform B. Wouldn’t you like to know that? Wouldn’t you like to test that?
More importantly, if I don’t even have half the programs running concurrently, how useful is any result that includes them?
As I’ve discussed before, to compare anything, you have to make choices. So long as those choices are reasonable, and people know what they are along with their weight, they have a basis upon which to judge the comparison.
We don’t find that out in current benchmarks, and that’s a fatal flaw. It’s a fatal flaw for two reasons:
First, I’ve already given examples of why an average can be useless or even detrimental to a typical user. This wouldn’t be so bad if you could at least isolate the individual factors and know what weight is given to each application, but that’s precisely what the benchmarkers are not telling you.
Second, if I don’t have to reveal how I weight testing, I can make anything win. If I want Willys to win, no problem. I just put more weight on programs like Windows Media Encoder or Naturally Speaking, and less on programs where Willy does poorly.
Conversely, I can stretch out the general Athlon advantage quite a bit if I feel like it. If I really have no shame, I could even make the PIII win.
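The weighting trick is easy to put in a few lines. The scores and weights below are made up purely for illustration, but the mechanism is exactly what a suite's overall number does behind the curtain:

```python
# Hypothetical per-application scores (higher is better) for two CPUs.
# These numbers are invented for illustration, not measured results.
scores = {
    "Windows Media Encoder": {"Willy": 140, "Athlon": 100},
    "Premiere":              {"Willy": 70,  "Athlon": 100},
    "Photoshop":             {"Willy": 80,  "Athlon": 90},
}

def overall(weights):
    """Weighted arithmetic mean across applications for each CPU."""
    total = sum(weights.values())
    return {
        cpu: sum(weights[app] * scores[app][cpu] for app in scores) / total
        for cpu in ("Willy", "Athlon")
    }

# Lean on the encoder and the Willy "wins" the overall number...
print(overall({"Windows Media Encoder": 3, "Premiere": 1, "Photoshop": 1}))
# ...lean on Premiere instead and the Athlon "wins" -- same raw data both times.
print(overall({"Windows Media Encoder": 1, "Premiere": 3, "Photoshop": 1}))
```

Same per-application results, opposite overall winners; the only thing that changed is a number the benchmarkers don't publish.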
That’s the what. Never mind the why. If Benchmark X suddenly shows Willy in a much more favorable light, there may be absolutely legitimate reasons for this. Or it may be Intel blackmail. Without more detail, we can’t check. The Athlon adherents/conspiracy theorists will automatically assume the latter. To stop that, the benchmarkers have to prove the former, and “Because I said so” doesn’t cut it.
Pandering To Pea Brains?
The benchmarkers are moving away from those kinds of measurements that actually are useful to real people using a mix of real applications. ZDNet’s High-End Winstone at least used to give you a breakdown of how each of the apps did. No more. All you deserve is a single Content Creation number.
Sysmark 2000 provided application breakdowns. Sysmark 2001 doesn’t, and won’t. Tough if you’d like to know how well the app you actually use does. It’s our scenario or no scenario.
You’re a general of an army division. Your troops need boots. If your quartermaster didn’t want to be bothered finding out boot sizes, found out that the average men’s shoe size is a size 10, and ordered 20,000 size 10s, would you congratulate him? If he pulled out the book with that statistic during his court-martial, would you stop the trial and give him a medal instead? Like hell you would.
If you base a purchase decision solely on one of those numbers, you are just like that quartermaster. You might get stuck with ordering just one size, but at least find out which size fits most of your troops.
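The quartermaster's mistake is the whole problem with averages in miniature. The sizes below are invented for illustration:

```python
# Hypothetical boot sizes for a squad. Invented numbers for illustration only.
sizes = [7, 8, 8, 9, 9, 11, 11, 12, 12, 13]

avg = sum(sizes) / len(sizes)
fits = sum(1 for s in sizes if s == round(avg))

# The average is exactly 10.0 -- yet zero soldiers actually wear a size 10.
print(avg, fits)
```

An average can be a number that describes nobody in the group it averages; a single benchmark score works the same way.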
Do you know what the benchmarkers are essentially doing by giving you just one number under these circumstances?
At best, they are saying, “If only a single number will fit into your heads, here it is, since reality is too hard for you. So we’ll be Oz and give you a number to CYA, and we’ll take out anything that might let somebody with a half-a-brain show you up.”
At worst, they’re saying, “BWAHAHAHAHA!!! You are puppets on our string, and you’re too stupid to even know it.”
The truth lies somewhere between these two extremes.
Not very complimentary, is it?
Don’t Get Mad At Dorothy, Get Mad At Oz
I suppose some of you will get insulted by this. Good, you should be. Now let’s work on whom you should be mad at.
Essentially, I’ve said one number doesn’t reflect reality. I’ve shown why, and given some reasons why this is bad for you.
If what I said doesn’t apply, it wasn’t meant for you.
If it does apply, all I’ve tried to do is open your eyes in the most anonymous way possible. How can I personally insult you when I don’t even know who “you” are?
There are two kinds of dimwittery: that due to ignorance, and that due to stupidity. We are all ignorant and stupid in at least some things; we’re human. Our best course is to fix the ignorance, and avoid applying the stupidity.
What is worse, somebody pointing out a problem, or somebody possibly taking advantage of it? What’s worse, an uncomfortable truth, or a comfy lie?
Think about it.
Tomorrow, why we have problems with the ZDNet, Sysmark, Benchmark Studio and many other popular benchmarks.
Something to keep in mind: there are two basic uses for benchmarks: as a purchase tool and as a diagnostic tool. We have to view benchmarks as something that informs you before you buy; so would someone responsible for buying computers for a business.
The average member of this audience, on the other hand, has already decided. He or she uses benchmarks to see if the system is up to snuff compared to other similar systems, or how much better it is than his or her old system.
So a benchmark can stink for our purposes, but be fine for yours.
No Benchmarks Just Because We Ran Them
What we find all too often is essentially, “We’re going to run five (or ten, or ninety-seven) benchmarks, and by God, we’ll put up a chart for every single one of them whether it means anything or not.”
We’re not going to do that. We’re not here to provide you with information; we’re here to provide you with useful information.
If something is clearly useless, we’re not going to present it as if it weren’t. You get enough chaff in your life; we won’t add to it.
If we run a benchmark and find there’s no real difference, we’re just going to say, “There’s no real difference,” and then say why. In areas of doubt, we’ll err on the side of providing more rather than less, but we’ll point that out, too.
Just to give an example: if we run something, and everything comes within 1% of everything else, we’re not going to give a big chart showing everything coming within 1% of everything else. We’re just going to say, “Everything came within 1% of each other; no real difference there.”
No doubt there will be some who’ll say, “Give me everything, let me decide.” Our answer to that is pretty simple.
Reading a review should be educational. It should teach you what is important, and what isn’t, not just hand you Your Favorite Number.
Numbers always need to be interpreted, if for no other reason than to state whether or not they’re important, and why. This is meant to give you a deeper understanding of the number.
If you don’t want that, then I guess you really didn’t want everything, after all.
As I’ve said before, a number tells you what at some point in time. It does not tell you how it came about, or why it is what it is. That’s our job, to say what we think and why we think it. Your job is to decide whether or not we make sense, and if you disagree, to tell us why.
We like these kinds of benchmarks very much for two reasons:
Unfortunately, none of the current versions of the major app-based benchmarks give us precisely what we want, for a number of reasons.
ZDNet Benchmarks: Winstone and Content Creation
We’ve used these in the past, and have not much liked them. They have generally not given us a per-app breakdown on performance (High End Winstone used to, but ZDNet doesn’t want to support that anymore).
We’ve also found them
to be rather erratic and susceptible to being influenced. You can read about some of those past travails here.
BapCo Benchmarks: Sysmark2000/2001
Sysmark2000 does provide for individual scoring of applications. We liked that idea very much, and figured we’d pick up SysMark2001 for our future testing.
Unfortunately, SysMark then proceeded to take out precisely what we found most valuable about it, and essentially said, “Tough.”
Here’s what they had to say about it:
Q8. Can you run individual applications versus the entire suite? And if not why?
A8. No, you cannot run individual applications. You can run scenarios (Office Productivity and Internet Content Creation) that contain applications that make up the scenario. Due to the nature of the workload it is difficult to pinpoint on one application, as there are a number of activities going on in a multi-tasking environment. However, the performance of the individual application will be reflected in the performance score of the scenarios.
Q11. Why can’t I see individual application scores on SYSmark 2001 like I could on SYSmark 2000?
A11. SYSmark 2001 operates in a multitasking environment where many applications execute concurrently. In this scenario the performance of individual applications cannot be captured as they operate concurrently with other applications. Hence the individual application scores are not shown.
This is a lot of nonsense. As for running individual applications, Sysmark2000 did precisely that. If BapCo wanted to provide concurrent scenarios, that’s fine and dandy, but it’s not an either/or. They’re charging double the price of SysMark2000; there should be some money for a little updated scripting.
No, Sysmark decided you’re not entitled to know how individual applications run.
As for the claim that “the performance of individual applications cannot be captured as they operate concurrently with other applications,” this is complete nonsense. They can’t note when each application starts and stops?
Now what would be likely is that performance of each application would be lower than it would be in standalone mode. This is something I know I would like to know very much if I were a corporate buyer. If everybody in my office uses A, and performance gets killed if I’m doing B, too, maybe I tell my employees not to do A and B at the same time. Or maybe I find out that I need to put more RAM in each machine, and the problem goes away.
But BapCo doesn’t want you to know that. Don’t tell me what you told me was critical in your year 2000 program is now completely unimportant in 2001.
You might say, “Why don’t you just buy SysMark2000?” Well, we would have liked a concurrent measurement, and more importantly, the programs used in SysMark2000 are being rapidly outdated. We may still do it, but it’s far from ideal.
CSA Benchmarks: Benchmark Studio
This is a new one that’s ready to be released. Some have rather liked the betas. It’s certainly far more flexible than the others, and for that reason will probably be best for serious corporate testing.
However, for us, it has two huge problems:
Gaming FPS: This can often be useful, but not always. We think this is often useless in video card comparisons and can be almost completely useless in some other comparisons. We’ll put up FPS benchmarks when we think they’re useful; we won’t put them up when they aren’t.
We’re going to apply what I call the Superman test to these comparisons. If only Superman could tell the difference between the two, it doesn’t matter, and you shouldn’t base your decision solely on that.
Nor is there any point in putting up a chart that essentially tells you nothing important. That’s detrimental, because it implies the information in the chart is important when it’s not.
Here are some of the principles we’ll follow in determining what we will and won’t put up:
Low frame-rate situations: If you are comparing video cards in a particular game, and everything is coming in at, say, less than 60fps, any significant difference between cards probably makes a difference in actually playing the game.
However, if Video Card A gets 194fps, and Video Card B gets 184fps, you’d have to be Superman to tell the difference. That should not be the big deciding factor.
No significant difference: If I use Motherboard A and get 46.34 fps, and Motherboard B gives me just 45.51 fps, again, you have to be Superman to tell the difference.
On the other hand, if Video Card A gets 180fps, and Video Card B gets 60fps, while you might have to be close to Superman to tell the difference, there’s a good chance that such a difference might make a real difference somewhere down the road.
Differences at low resolution: This is one that needs to be handled carefully. If video cards hit a wall at higher resolutions due to CPU limitations, then testing at very low resolutions can be helpful. If the wall is hit due to the limitations of the card itself, then it isn’t.
As a check on something else: Sometimes you can use fps to see how much improvement you get from something else. DDR is a very good example of this. Some games are sensitive to memory speed; others aren’t. It may well end up being a useless improvement for that particular game (see above), but at least it’s a harbinger of the future.
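The Superman test is really about frame time, the reciprocal of frame rate: what you perceive is how long each frame takes, so a big fps gap at the high end can be a tiny gap in practice. A quick sketch of the arithmetic, using the numbers from the examples above:

```python
# Frame time is the reciprocal of frame rate. Perceived smoothness tracks
# frame time, so fps gaps mean very different things at different speeds.
def frame_time_ms(fps):
    return 1000.0 / fps

# The Superman case: 194 vs 184 fps is about a quarter of a millisecond
# per frame -- nobody can feel that.
gap_high = frame_time_ms(184) - frame_time_ms(194)

# A gap down where it matters: 60 vs 45 fps is over 5 ms per frame.
gap_low = frame_time_ms(45) - frame_time_ms(60)

print(f"194 vs 184 fps: {gap_high:.2f} ms per frame")
print(f" 60 vs  45 fps: {gap_low:.2f} ms per frame")
```

Roughly a twenty-fold difference in what the same-looking "10-15 fps gap" actually costs you per frame.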
CPU Benchmarks: We see few uses for these. For a new family of processors, or as a comparison between different families of processors, certainly. For an overclocking effort, maybe. To see how other equipment compares, no.
I’m not saying I’m going to rip them out of any user article; it’s just that they’re often used when there’s no point to them.
Memory Benchmarks: We think these have been grossly overused as of late, and, outside of a fast preview, should never be used all by themselves.
Some have been almost completely useless. I’ve seen a few LinPack graphs that essentially just told me that CPUs have L1 and L2 cache. I kind of knew that.
Others have been displayed as if they were the only thing that mattered. Uh-huh. Memory speed has at best something, at worst almost nothing, to do with increased performance.
What we plan to do with these benchmarks is to try to correlate them to performance in other areas. You really shouldn’t get all hot and bothered about a 10% memory speed difference if it only has a 1% effect on the programs you run. If that 10% translates to a 3-4% improvement, then that’s a different story.
Hard Drive Benchmarks: We’re trying to find a good one here, too, for what we know we’ll be doing for a while: essentially testing mobos. We won’t be testing hard drives per se, but rather the hard drive controllers and the drivers for them. We’re not too big on Winbench from past experience; HDTach is more useful for hard drive testing. We’re taking a close look at IOMeter; we’ll have to see about that.
What We’ll Probably Do
We’ll put up a varying set of benchmarks; exactly which ones will depend upon what we are testing and what we think the critical factors or questions are surrounding the piece of equipment.
As of the moment, our next major testing project over the next couple months will be the various flavors of DDR mobos. We’ll look to see what degree of difference there is between them, and between DDR and the KT133A option.
We’ll probably usually include an app-based benchmark, a memory benchmark, at least one game benchmark which looks to be memory sensitive and at least one that doesn’t, and at least initially a hard drive benchmark.
We’ll also probably test using a variety of tweaks, one at a time, at least until we get a good idea of what helps and what doesn’t.
Consistency Is Not Its Own Virtue
We may be running exactly the same tests two months from now; we probably won’t. The purpose of benchmarking is to find answers, not to run benchmarks. If questions and issues get answered, we won’t run those tests anymore. If new questions come up, we’ll use new benchmarks.
We’ll take usefulness and importance over consistency every single time.
Here are some people who agree with us:
The only completely consistent people are dead.
Consistency is the last refuge of the unimaginative.
Consistency requires you to be as ignorant today as you were a year ago.
A foolish consistency is the hobgoblin of little minds . . . .
Ralph Waldo Emerson
They said it, I didn’t, so if you disagree, go argue with them first. If you get them to change their minds, have them email me. 🙂