William Campbell says in this article that most benchmarking of computer products isn’t worth the electrons it’s printed on. He’s right.
He looks at one specific test and points out that it shows a pretty high level of variability given the high level of control over the variables. He’s right (and it’s probably true for other tests, too).
He says that in order to claim that Product A is better than Product B, and for that claim to have statistical validity, you need to run a particular test a few dozen times on each machine. He’s right.
He says no one tests to that degree, much less performs statistical tests on the results. He’s right.
He says that even if all this testing had been done, reviews often present small degrees of differences as being significant when in fact they are statistically insignificant. He’s right.
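To make that last point concrete, here is a minimal sketch (mine, not Mr. Campbell’s) of the kind of significance check a reviewer would need: a Welch’s two-sample t-test on repeated benchmark runs from two machines. The run counts and timings below are invented for illustration.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances (n-1)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb          # squared standard error of the mean difference
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical benchmark times in seconds, 30 runs per machine.
# Machine A looks about 1% "faster" on average, but run-to-run noise
# is on the same order as that difference.
random.seed(1)
machine_a = [random.gauss(100.0, 3.0) for _ in range(30)]
machine_b = [random.gauss(101.0, 3.0) for _ in range(30)]

t, df = welch_t(machine_a, machine_b)
print(f"t = {t:.2f}, df = {df:.1f}")
# If |t| stays below roughly 2.0, the ~1% gap is not significant
# at about the 5% level -- exactly the situation reviews gloss over.
```

A review that reports a one-percent gap from a single run on each machine is, in effect, skipping every line of this.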
In fact, Mr. Campbell is absolutely, totally right in everything he says except for one unspoken assumption: that because he’s absolutely right and current review practice is statistically absurd, anything is going to change.
It won’t, for a very simple reason: on the whole, readers want simple, clear winners and losers, and they want them fast, period. They don’t want to hear anything else.
Well, some readers do realize that this is mostly nonsense and want statistically valid testing instead. That brings up a problem Mr. Campbell didn’t address.
After you’ve run your 25-30 tests, you may be able to say that the particular unit you tested is better than the other particular unit you tested, but you can’t say that Brand A, Model B is statistically better than Brand C, Model D. For that, you’d need about thirty units of each.
This is not going to happen.
The reality is that the typical current product review can only indisputably measure gross differences between products and catch inherent faults in the product.
But most people don’t want to hear that, and I don’t think there’s a damn thing that can be done about it.