
[O/C]Windows Showdown: 8 Operating Systems in 6 Benchmarks

For those like myself who aren't as up to speed on benchmarking as some, John, the owner of www.madshrimps.com, lends some insight into the results:

http://www.madshrimps.be/vbulletin/...-i7-ati-tested-8-windows-os-68535/#post250569

Mainly, his comments are a clarification of what the hardware Gautam chose means for the benchmark results. It's good insight, along with most things you'll find John posting at Madshrimps.

You can find their articles here:
http://madshrimps.com/?action=articles

As well as their news here:
http://www.madshrimps.be/?action=news
 
Well, another thing which has a BIG impact on these benchmarks is which service packs he installed (if any).

SP3 for XP32 is a performance killer de luxe - likewise, I found Vista without SP1 to perform noticeably worse than with SP1 (never tried SP2). Microsoft put a lot of work into SP1 (and probably the same with SP2) to increase Vista's performance, because Vista lagged behind XP in performance at that time.

Can Gautam clarify which service packs were used?

I got flashbacks to Nvidia's frame-jump cheating after M$ released SP1 for Vista ;)
 
I benched XP SP2 vs. SP3 when it came out (screenshots are long gone) and SP3 was either the same or better with my setup at the time (E4400, IP35-E, 8800GTX). Do you have documentation or reference to it being a benchmark killer?
 
On graphs not starting from zero: such a presentation tends to magnify small differences. If series A is 1000 and series B is 1001, then they differ (relatively) by 0.1%, which may be within measurement error. But plotting (for example) on a scale from 995 to 1005 (about 1% of the data range) will make any difference look much more significant, regardless of the statistical significance.
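To make that concrete, here's a quick Python sketch using the made-up 1000 vs. 1001 numbers above, showing how much of the visible plot a 0.1% difference occupies depending on the axis range:

[CODE]
# Quick sketch using the made-up 1000 vs. 1001 example above: how much of
# the visible plot a 0.1% difference occupies, depending on the axis range.
a, b = 1000.0, 1001.0

print(f"actual relative difference: {(b - a) / a:.1%}")          # 0.1%

# Zero-based axis: the gap fills ~0.1% of the plot height.
print(f"visual gap, axis 0..{b:.0f}: {(b - a) / (b - 0):.1%}")

# Clipped axis from 995 to 1005: the same gap fills 10% of the height.
lo, hi = 995.0, 1005.0
print(f"visual gap, axis {lo:.0f}..{hi:.0f}: {(b - a) / (hi - lo):.1%}")
[/CODE]

Same data, same 1-point gap; only the denominator (the axis range) changed.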

Looking at the 3DMark03 scores, we see a spread from about 106k to 111k, or a relative difference of 4.7%. The standard deviations for the measurements look pretty small (though I can't calculate without the raw numbers; see the sketch after the per-benchmark rundown below), so the differences appear to be statistically significant, even for the small sample sizes. But the differences aren't as huge as they appear on the graph, so it's worth noting.

Similar story on 3DMark05: about a 3.6% relative spread.

On 3Dmark06, the relative difference from high to low is a paltry 1.5%, with larger measurement variation than the other benchmarks. While there is an observable trend, it's statistically not much greater than the measurement variation. This is a case where the graph may not be worthwhile. Vista32 vs 7-32 are within 0.1% of one another, well within measurement error from what I can tell.

Calling XP "decisively in last place" in this case, with a clipped graph range (focused upon 3% of the data range), is scientifically misleading in my opinion. If I were reviewing this paper for a journal, I would absolutely require the authors to revise these strong claims for this graph. The data simply don't bear them out.

Aquamark3: about 4%.

SuperPi 1.5M is similar: all tests within 1%.

wPrime 32M without the outlier: 3.5%
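
Since the raw numbers aren't published, here's the sort of significance check I have in mind, sketched in Python with invented scores (SciPy's two-sample t-test; all values hypothetical):

[CODE]
# Hypothetical significance check: are three runs per OS enough to tell a
# ~5% spread from run-to-run noise? Scores are invented for illustration;
# the article's raw numbers aren't published.
from scipy import stats

xp_runs    = [106_200, 106_500, 106_100]   # hypothetical 3DMark03 runs
vista_runs = [111_000, 110_800, 111_300]

t, p = stats.ttest_ind(xp_runs, vista_runs)
spread = (max(vista_runs) - min(xp_runs)) / min(xp_runs)
print(f"relative spread: {spread:.1%}, t = {t:.1f}, p = {p:.5f}")
# If the per-run standard deviations really are this small, p comes out
# tiny and the gap is statistically significant despite n = 3.
[/CODE]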

I think the overall conclusion should be different: for most of these operating systems, performance is within 5% of one another in benchmarks. The differences may or may not be statistically significant. Given the small differences in the benchmarks and their unknown statistical significance, other metrics should be the basis for choosing an operating system. It is interesting, however, to note that Vista and Win7 can be trimmed to perform on par with XP.

I'm not trying to be harsh or unduly critical; this is just my view as a scientist looking in. The benchmarking techniques are great, but the data analysis and presentation could use improvement. The relative differences give a clear picture and should be included, rather than looking at clipped graphs. I know it's common in benchmarking articles and reviews, but well, there's not a high standard of rigor out there. We can be the best. :)
 
Agree with this - average + std deviation bars present the idea in less space (even with three samples; see the sketch at the end of this post). More samples would always be nice, though. :cool:

Anyway, the theoretical implications behind this are that the optimizations in each OS are weighted differently, leading to these results, correct? And thus Win7 + Vista are statistically indistinguishable performance-wise given these benchmarks.

Finally, I don't really have a solution for this given your dataset, but I dislike graphs not starting from zero, since that skews the scale of the real differences. Dunno what really to do here, though.
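
Something like this is what I have in mind for the error-bar idea - a minimal matplotlib sketch, with invented 3-run samples per OS and a zero-based axis:

[CODE]
# Minimal matplotlib sketch of "average + std deviation bars", with
# invented 3-run samples per OS and a zero-based axis.
import matplotlib.pyplot as plt
import statistics

runs = {
    "XP32":    [106_200, 106_500, 106_100],   # hypothetical scores
    "Vista32": [111_000, 110_800, 111_300],
    "7-64":    [110_500, 110_200, 110_900],
}
xs     = range(len(runs))
means  = [statistics.mean(v) for v in runs.values()]
stdevs = [statistics.stdev(v) for v in runs.values()]

plt.errorbar(xs, means, yerr=stdevs, fmt="o", capsize=4)
plt.xticks(xs, runs.keys())
plt.ylim(bottom=0)            # zero-based axis keeps the scale honest
plt.ylabel("3DMark03 score (hypothetical)")
plt.show()
[/CODE]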

You are absolutely right. They all perform within 5% of one another, often within one or two percent. The differences are barely significant. The graphs tell the wrong story.
 
Paul, what about evaluating the claims in the frame of reference of hwbot.org performance?

I initially had similar concerns with the presentation of the data. However, buried in this growing thread is this rationalization from Gautam:

Probably plenty don't like the graphs not starting from zero, but it's conducive to what I'm trying to present. This is not supposed to be an academic paper, it's supposed to tell people clearly what OS scores the best. The real difference is tiny, yes. For example even between the averages of 7 64 and Vista 32, it's just 2.35%, which in real terms is tiny. However, on the 3DMark05 hall of fame, it's actually larger than the difference between 1st place and 5th place, which is a very big deal for anyone competitive.

The graphs are meant to illustrate that there's a consistent and reliable performance difference. These differences may not be statistically significant to the average user, but the article, as well as Gautam's comments, makes it clear that's not the intended audience. As a benchmarking piece, it's talking about OS differences which mean the difference between 1st in the world and 5th in the world. If I were at the top of the benchmarking food chain, these small differences would begin to look significant.

From the sites I've seen that have picked up the article, the intended audience mostly interpreted it as it was intended. In contrast, I think I saw some posts on the MacRumors forum where the results may have been taken incorrectly.
 
Interesting point. Of course, it's great work and among the best out there. Sure, it's better than they are, but that doesn't make the conclusion correct.

To validly claim 2.3% difference as significant, you need more samples. And there's no getting around the fact that the graphs, without some statement on the small actual relative differences, are misleading, particularly for those younger audience members who haven't had much experience in science, mathematics, statistics, etc.
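
Here's a back-of-the-envelope sketch of the sample-size point, using my own assumed noise levels rather than the article's data:

[CODE]
# Back-of-the-envelope sample-size sketch (assumed noise levels, not the
# article's data): runs per OS needed to detect a 2.35% mean difference
# with a two-sample t-test at alpha = 0.05 and 80% power.
from scipy.stats import norm

delta = 0.0235                          # difference to detect (fraction)
z = norm.ppf(0.975) + norm.ppf(0.80)    # two-sided alpha 0.05, power 0.80

for sigma in (0.01, 0.03):              # assumed per-run noise: 1% and 3%
    n = 2 * (z * sigma / delta) ** 2
    print(f"noise {sigma:.0%}: ~{n:.0f} runs per OS")
# With 1% noise, three runs can suffice; at 3% noise you'd need ~26.
[/CODE]

The point being: whether three runs is enough depends entirely on the run-to-run noise, which is why the standard deviations need to be reported alongside the averages.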

The conclusion should be "little-to-no significant difference in benchmarks", not "XP definitively worst" etc. I'd hate to see people shell out $200 or $300 to upgrade their OS based upon magnified views of performance differences that may not even be significant.

Regarding 3Dmarkland, perhaps the real lesson is that the top 5 places are statistically equivalent.
 
I benched XP SP2 vs. SP3 when it came out (screenshots are long gone) and SP3 was either the same or better with my setup at the time (E4400, IP35-E, 8800GTX). Do you have documentation or reference to it being a benchmark killer?

Approximately as much documentation as you have... I can dig up tons of articles that claim SP2 to be faster than SP3 - likewise, you will be able to dig up a ton of articles that say the opposite ;)
In other words, I am interested for my own sake in which SPs were used here...
So for me this test will be "just one of those tests" - I know that I will have to install XP, Vista, 2008, and 7 and test all of those on my system to find which OS performs best, since a lot of the performance is driver-dependent. One of the most noticeable performance impacts in my system comes from the Areca 1680ix-4GB RAID controller (6x SSDs in RAID 0) - and this controller seems to favour XP.


I can also dig up articles that say that XP SP3 is (was?) a lot faster (claimed up to 10% faster) than Vista SP1 was in benchmarks...
 
That's a definite problem: there's a lot of anecdotal information out there.

The thing I like about Gautam's article is the precision he attempted. I'm just not thrilled about the data presentation and the conclusions drawn from it.
 
I think presenting it any other way would have missed the target audience it was directed at, so I do see the value in how it's presented.

I could also see value in it being more complete, looking at it from your perspective, Paul - the same set of graphs on a zero-based scale would ensure it's clearer to any layperson. Perhaps it would make sense to present them after the conclusion - but I'm not motivated to work up the graphs myself. Perhaps Gautam is. For an astute reader, however, all the information is sufficiently presented for one to draw the proper conclusion.

Maybe we set the bar high with expectations for readers to look at the results critically, but I really thought it was sufficiently clear. I can also see danger in setting the bar too low - putting the kid gloves on it is going to make a lot of prominent overclockers yawn when all they are really interested in is the hard data. It's the people at the forefront of the hobby who drive interest, and those are likely the people we should rightly cater to. As a site, we generally do very well at getting people starting at a base level up to speed - I'd say that's our strong suit.

That wasn't very concise. Essentially, I think there's a balance and I think we're in the right place.
 
I think the data presentation is an issue, but not huge if mentioned.

However, I still think the conclusions are incorrect.

"The only thing XP remains good for are 2D benchmarks, falling far behind the pack in all things 3D. Once again, this article only sets out to show which the fastest operating systems are by the numbers. "

Incorrect conclusion.

All the operating systems perform within 5% of one another on all the benchmarks, and most often within 2%. This is a near-statistical dead heat. The real message is that (1) if properly tuned, Vista isn't much of a hit, and (2) upgrading to Win7 is not much of a gain. The numbers say to stick with what you have, which is a completely different conclusion. A couple of percent is not "far behind." The numbers don't lie per se, but they do need to be understood in proper context. In quantitative work, that context is called "statistical significance" or at least "relative differences."

Saying the article told which OS is fastest "by the numbers" implies that the numbers significantly supported that conclusion. They did not; at least, not such a strong conclusion.

Ultimately, this is why I think the graphs are a problem: they tend to not just mislead the readers, but the writers. I've found that the workflow usually goes like this: (1) do the work. (2) plot the results (3) analyze the graphs to make sense of the data (4) write the article accordingly. A misleading plot in (2) throws the whole thing off. In fact, I find that clear, well-selected graphs are more important to the writers than the readers, as it makes or breaks the science.

If it takes a skewed plot to show a result, then there is probably no result. (Which is itself a result, just not the one presented here. ;))
 
The skewed plots' results show a very significant difference relative to the target point of view. As he said, when it comes to world records, the difference in top scores is much less than the differences in the benchmarks he graphed. Therefore, even if not statistically significant for anyone but the target audience, the differences are eons apart at that level of competition.

I think he did a superb job and accounted for margin of error very well with the number of times he ran the benches. This isn't a dissertation or peer-reviewed journal article and doesn't pretend to be. It's an article written by an authority on benchmarking, directed at like-minded individuals and attempting to show the differences between operating systems. It accomplished that well.

The largest point that might be important to note regarding his results is the one I.M.O.G. linked to at Mad Shrimps. That may be worth exploring at some point.

For everyday use, the layperson can take their pick of operating system. They may see lots of differences, but the largest of those will not be speed. If the article were aimed at that audience, graphing from zero would have made more sense. Indeed, the conclusion you have proposed (the difference is < 5% for any tested OS) would be valid. However, that's not the audience this was written for, and I think even the layperson would be able to tell that. Actually, the layperson would probably be sitting there thinking "WTF is wPrime?"
 
Okay, but remember that overclockers.com audience != benchmarking audience. We're a more diverse group than that.

And if the difference between scores is statistically insignificant, then maybe that's all there is to it: the top 5 places are all statistically equivalent, so the exact ordering really doesn't matter (except as a matter of pride, which I understand).

The skewed plots' results show a very significant difference relative to the target point of view.

No, they don't. Magnifying insignificant differences does not make them significant. Significance has a very real and non-fuzzy meaning, and it has not been achieved here.

I'm not a layperson, and even I was "confused" that the article was aimed at me as an overclockers.com reader, even though I'm not a benchmarker. ;)

As I said, we have a diverse audience: hardcore benchmarkers, hardcore coolers, people looking for the best bang for the buck, people who just want to find efficient cooling methods, people who want reliable hardware based upon the experience of people who have truly stressed it, etc. So, the different conclusions are valid, precisely because we aren't a monolithic group. -- Paul

*edit*
As I think about it more, there are really two different benchmarking communities out there.

Group 1: Uses benchmarking to compare product A to product B. Do they differ significantly?

Group 2: Uses benchmarking to compete. Who gets the highest number?

What we have here is a member of group 2 writing an article for the audiences of both groups.
*/edit*
 
OK, so I'm gonna go put my Vista install disk back in now... hehehe

Glad to be a part of these forums... you guys are awesome!!
 
Interesting point. Of course, it's great work and among the best out there. Sure, it's better than they are, but that doesn't make the conclusion correct.

To validly claim 2.3% difference as significant, you need more samples. And there's no getting around the fact that the graphs, without some statement on the small actual relative differences, are misleading, particularly for those younger audience members who haven't had much experience in science, mathematics, statistics, etc.

The conclusion should be "little-to-no significant difference in benchmarks", not "XP definitively worst" etc. I'd hate to see people shell out $200 or $300 to upgrade their OS based upon magnified views of performance differences that may not even be significant.

Regarding 3Dmarkland, perhaps the real lesson is that the top 5 places are statistically equivalent.
The conclusion is: "Overall, the two most solid performers are Server 2008 32 and Vista 32. [...] in benchmarks, Vista 32 performs very well." I stress over and over again that this only applies to how they do in benchmarks, not real-world usage or anything else.

I don't know exactly how to reconcile the other issues. Yes, anyone that's been through a mathematics or engineering background (as I have myself) has been repeatedly taught that it's wrong to present data as I have. Nevertheless, I'm not budging on that point. First of all, the graphs look bad if they're zeroed. Second, ask yourself, how does statistical significance translate into the real world? It's possible to force a statistically significant outcome that's insignificant in the real world. Conversely, some things that are statistically insignificant can be very significant in the real world. Sure, in terms of the GDP, your salary is statistically insignificant, but it would certainly be significant to you if that statistically insignificant part of the GDP were to disappear. ;)

And besides that, even though the sample size is very small, the standard deviations are quite tiny in every case, far smaller than the difference between the best and worst OS. A percentage difference doesn't really mean much if you don't consider the standard deviation.
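
To put that in code terms, here's a quick sketch (numbers invented, not from the article) of why the gap matters relative to the standard deviation:

[CODE]
# Sketch of the point above (numbers invented): the same gap can be
# decisive or meaningless depending on the run-to-run standard deviation.
import statistics

best  = [111_000, 110_800, 111_300]   # hypothetical runs, best OS
worst = [108_700, 108_900, 108_600]   # hypothetical runs, worst OS

gap = statistics.mean(best) - statistics.mean(worst)
sd  = statistics.mean([statistics.stdev(best), statistics.stdev(worst)])
print(f"gap: {gap:.0f} points ({gap / statistics.mean(worst):.1%})")
print(f"gap / std dev: {gap / sd:.1f}")  # many std devs apart -> consistent
[/CODE]

A ~2% gap that is many standard deviations wide is a reliable ordering, even if the absolute difference is small.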

Oh and I should have noted this in the article but everything was as new as possible. SP3 for XP, SP2 for XP64, SP2 for Vista/08.
 
When G first began to post his results in the lounge, the "target audience" was the benchers. And it did exactly what it was supposed to do. When I set up my rig with an LN2 pot on top of it, my first question is: what am I gonna be benching today? When that's decided, I go to G's post and figure out which OS I need to try to get the best out of that benching session. So I end up setting up 2 or 3 hard drives, and that way I can expect to get the most out of the $100 worth of nitro I just bought for that benching session. For me, that OS comparison G did is worth gold. I use it for the purpose it was written for, and I don't really care if the graphs don't start from zero... Just my $0.02
 
Gautam, thanks for the nice response.

You make some good points, and I want to reserve this space to write something back.

More later. Thanks -- Paul
 
Oh and I should have noted this in the article but everything was as new as possible. SP3 for XP, SP2 for XP64, SP2 for Vista/08.

Thanks Gautam

I think I'll stick with XP SP2 for a while longer - I use my "high-end" system mostly for things other than gaming and benching :)

But I'm certainly gonna give Server 2008 x64 a go - x86 OSes which only support 4 GB of memory (with PAE) aren't interesting anymore, IMO. So you, like others, are pushing me in that direction, so to say ;)
 
All too often, people sacrifice the stability of a 32-bit OS in cases where 3.5 GB of RAM is more than enough to cover their needs.

I considered 64-bit only as part of a multi-boot, so that I would have the option of booting into 64-bit when I really need it... but so far I could not find justification for over 3.5 GB of RAM for my personal use, which includes older programs incompatible with a 64-bit OS.


On my triple Windows 7 / XP / Vista boot [all 32-Bit], I have certainly found Vista to be slower in a way that I can feel in comparison to Windows 7 and Windows XP.


I suppose benchmarks measure things once they get going, while real life also includes getting them going - which is what I mean by feeling faster vs. slower.


I stress over and over again that this only applies to how they do in benchmarks, not real world usage or anything else.
 