On graphs not starting from zero: such a presentation tends to magnify small differences. If series A is 1000 and series B is 1001, they differ (relatively) by 0.1%, which may be within measurement error. But plotting on an axis from, say, 995 to 1005 (a window spanning about 1% of the data's magnitude) makes any difference look far more significant than it is, regardless of statistical significance.
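To make the distortion concrete, here's a quick sketch using the hypothetical 1000-vs-1001 numbers from above:

```python
# Hypothetical numbers illustrating the clipped-axis effect.
a, b = 1000.0, 1001.0

rel_diff = (b - a) / a        # true relative difference: 0.1%

full_axis = 1005 - 0          # axis starting at zero
clipped_axis = 1005 - 995     # axis clipped tightly around the data

print(f"relative difference:   {rel_diff:.1%}")
print(f"share of full axis:    {(b - a) / full_axis:.1%}")
print(f"share of clipped axis: {(b - a) / clipped_axis:.1%}")
```

The same 1-unit gap goes from invisible (0.1% of the plot) to filling a tenth of it: a 100x visual exaggeration.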
Looking at the 3DMark03 scores, we see a spread from about 106k to 111k, a relative difference of 4.7%. The standard deviations for the measurements look pretty small (though I can't calculate them without the raw numbers), so the differences appear to be statistically significant, even for the small sample sizes. But the differences aren't as huge as the graph makes them appear, which is worth noting.
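Since the raw runs aren't published, here's the kind of check I'd want to see, with made-up placeholder scores standing in for the real data:

```python
from scipy import stats

# Hypothetical per-run scores -- the article doesn't publish raw numbers,
# so these are placeholders purely to illustrate the check.
xp_runs   = [106_200, 106_500, 105_900]
win7_runs = [111_100, 110_800, 111_300]

# Welch's t-test: is the ~4.7% gap large relative to run-to-run variation?
t_stat, p_value = stats.ttest_ind(win7_runs, xp_runs, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With run-to-run spreads this tight, even three runs per OS would show significance; with noisy runs, the same 4.7% gap might not.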
Similar story on 3DMark05: about a 3.6% relative spread.
On 3DMark06, the relative difference from high to low is a paltry 1.5%, with larger measurement variation than in the other benchmarks. There is an observable trend, but statistically it's not much greater than the measurement variation; this is a case where the graph may not be worthwhile. Vista32 and 7-32 are within 0.1% of one another, well within measurement error from what I can tell.
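When the effect is that close to the noise, a quick gap-versus-noise comparison (again with hypothetical runs, since the raw data aren't available) tells the story better than a clipped bar chart:

```python
import statistics

# Hypothetical runs: a ~1.5% mean gap on top of noisy measurements.
vista32 = [9_900, 10_050, 9_800, 10_010]
seven32 = [10_080, 10_200, 9_950, 10_130]

gap   = statistics.mean(seven32) - statistics.mean(vista32)
noise = statistics.stdev(vista32 + seven32)  # rough pooled run-to-run spread

print(f"mean gap:  {gap:.0f} points")
print(f"run noise: {noise:.0f} points")
# When the gap is comparable to the noise, the 'trend' on the graph
# may be nothing more than measurement error.
```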
Calling XP "decisively in last place" in this case, with a clipped graph range (focused on about 3% of the data's magnitude), is scientifically misleading in my opinion. If I were reviewing this as a paper for a journal, I would absolutely require the authors to tone down the strong claims made for this graph. The data simply don't bear them out.
Aquamark3: about a 4% relative spread.
SuperPi 1.5M is similar: all tests within 1%.
wPrime 32M, excluding the outlier: a 3.5% spread.
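For reference, every percentage above is just the relative spread of the scores; a one-liner computes it:

```python
def rel_spread(scores):
    """Relative spread: (max - min) / min, as a percentage."""
    return (max(scores) - min(scores)) / min(scores) * 100

# e.g. the approximate 3DMark03 endpoints read off the graph:
print(f"{rel_spread([106_000, 111_000]):.1f}%")  # -> 4.7%
```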
I think the overall conclusion should be different: for most of these operating systems, benchmark performance is within 5% of one another, and those differences may or may not be statistically significant. Given how small the differences are and how uncertain their significance is, other criteria should form the basis for choosing an operating system. It is interesting, however, to note that Vista and Win7 can be trimmed to perform on par with XP.
I'm not trying to be harsh or unduly critical; this is just my view as a scientist looking in. The benchmarking techniques are great, but the data analysis and presentation could use improvement. Relative differences give a clear picture and should be reported, rather than relying on clipped graphs alone. I know clipped graphs are common in benchmarking articles and reviews, but, well, there's not a high standard of rigor out there. We can be the best.