There Are Two Tests A Benchmark Must Meet
What should a benchmark be?
Some think it should be an idealized, level playing field where no side has any sort of “unfair” advantage. They even create their own benchmarks to do just that: benchmarks meant essentially to test the hardware.
Others feel that since they use real applications to do real work, the most valuable benchmarks are those which test real applications doing real work.
I belong to the second camp.
My general opinion of the first camp can be summarized by an exchange I had a couple years ago with someone who said that a K6-2 did Mandelbrot fractals better than a PII. I essentially said,
“Who sits around all day doing Mandelbrot fractals?”
Now if doing Mandelbrot fractals or something similar all day long floats your boat and is the mark of relevancy for you, that’s fine by me. Insisting that your measurement of relevancy has to be mine is not so fine.
If I use Word or Excel or Photoshop or Quake all day, though, I don’t care how well Brand X CPU does against Brand Y CPU in Mandelbrot fractals.
I want to know how well Brand X CPU does against Brand Y CPU running Word or Excel or Photoshop or Quake.
Don’t tell me that Mandelbrot fractals or whatever are the only true measure. Not for me they aren’t. Don’t tell me I’m wrong, stupid or evil for wanting to know how it does with my real work, because I’m the one who’s going to have to live with the consequences, not you.
If “your” processor does “your” benchmark better, but does “my” work slower than the other guy’s product, “your” ideal benchmark didn’t help me; it hurt me.
As I said before, “fairness” hardly matters when it comes to finding a relevant benchmark. Sure, it’s “unfair” that Adobe didn’t put in 3DNow optimizations, but if I use Photoshop all the time, that’s the world I have to live in.
If every application in the world were SSE-optimized up the ying-yang, and none were 3DNow-optimized, would that mean we should never compare an Athlon against an Intel product? Of course not. If that’s the real world, and you’re interested in the real world, that’s the environment you’re going to have to live in, whether you like it or not.
Who Is Better, Michael Jordan or Me?
No, you can’t ask, “Doing what?”
Pretty stupid question, isn’t it? But that’s what the advocates of an all-purpose number are asking, no matter who comes up with it.
The best benchmark for you is the one that most closely approximates what you do. Relevancy is, well, relative. There is not and cannot be a “one shoe fits all” benchmark, and anybody who insists otherwise is trying to sell you a bridge.
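To make the arithmetic behind that concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the function name usage_weighted_time, the two unnamed CPUs, the timings, and the hours of use are all assumptions, not measurements of any real processor. It simply weights each application's timing by how much of your day goes to that application.

```python
def usage_weighted_time(seconds_per_app, hours_per_app):
    """Average task time, weighted by how much you use each application."""
    total_hours = sum(hours_per_app.values())
    return sum(
        seconds_per_app[app] * hours / total_hours
        for app, hours in hours_per_app.items()
    )

# Hypothetical timings in seconds (lower is better); not real measurements.
cpu_x = {"fractals": 10.0, "photoshop": 50.0}
cpu_y = {"fractals": 20.0, "photoshop": 40.0}

# Hypothetical hours per day spent in each application.
fractal_fan = {"fractals": 9.0, "photoshop": 1.0}
photoshop_user = {"fractals": 0.0, "photoshop": 10.0}

for name, usage in [("fractal fan", fractal_fan), ("Photoshop user", photoshop_user)]:
    x = usage_weighted_time(cpu_x, usage)
    y = usage_weighted_time(cpu_y, usage)
    winner = "X" if x < y else "Y"
    print(f"{name}: CPU X {x:.1f}s, CPU Y {y:.1f}s -> CPU {winner}")
```

With these invented numbers, the same two chips come out in opposite order depending on who is asking. Change the hours and the "winner" can change with them, which is the whole point.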
I first talked about this here. The principles still hold true.
Not only does a benchmark need to be relevant, it also has to be reasonably representative of what it is measuring.
It’s ironic, but I’ve been arguing for a benchmark I don’t much like (Sysmark2000), and am using solely because the other alternatives are even worse for my purposes.
A few people seem to think that I’m against open-source benchmarks, even though I’ve said the opposite earlier on. Let me repeat, I’m not, but just where is the open-source benchmark that covers the same ground as a SysMark or ZDNet benchmark?
What I’ve been primarily arguing about lately is the relevancy of using a popular program like Photoshop in a benchmark meant to cover a wide spectrum of uses.
That kind of benchmark isn’t for everyone. It would be a stupid one to use if you played games all the time. But it is potentially valuable to some.
I have not argued that a conglomeration of such benchmarks is a good idea; indeed, as the link above shows, I’ve been against that for a long time.
Nor have I argued that the particular benchmark necessarily does a good, representative job of measuring that particular application. All I’ve said is that there has been insufficient research and testing done to come to a definitive conclusion that it doesn’t.
I’m working on that now. I’m comparing how the AthlonMP does against a TBird in the Photoshop section of SysMark 2000, and I’m also comparing the two using alternative Photoshop measurements.
I’m getting some pretty erratic measurements so far, so erratic that I’m going to have to come up with additional tests. This is going to take a while.
I’ll tell you one thing, though: innocent or guilty, there’s going to be a lot more basis for any decision than what we’ve seen so far.
It’s Been Surreal
To put it mildly, it’s been rather odd arguing about a benchmark in which the Athlon easily won overall against the PIV. It’s like winning 80-90% of the time isn’t good enough; one side feels it has to win all the time.
It’s been even more surreal arguing about what is arguably an outdated benchmark. It seems to me that if you want to argue about a SysMark benchmark, it should be the current one.
Of course, testing its accuracy will be a bit more difficult, but it’s doable. We’ll likely do that once a .13 micron Willamette becomes a viable option for our audience, no doubt with SysMark2002.