Windows Showdown: 8 Operating Systems in 6 Benchmarks

Since its debut, Windows Vista has taken nothing but flak from almost every demographic one could think of. Everyone from the casual user looking to browse the web and type up a few reports to the benchmark fanatic obsessed with squeezing out every last bit of speed is likely to complain about Vista being bloated and slow. Windows 7, on the other hand, has been hailed as noticeably better performing, and supposedly as light as XP. And what about XP itself? How do they really stack up against one another? This article examines those questions.

Personally, I’m an avid benchmark junkie, so I can only look at this from that perspective. I’m unconcerned with how things “feel”; what matters to me is how they score. Hard numbers might not matter to many, but they measure speed in its true essence, devoid of any subjectivity. Bearing this in mind, I selected six of the most popular benchmarks used by overclocking enthusiasts, each with its own bias toward a particular part of the system.

Editor’s Note: While the author is being modest, Gautam is a world-renowned benchmarker and an authority on the subject of Windows benchmarks.

The Benchmarks

3DMark03 – Predominantly measures GPU performance
3DMark05 – Predominantly measures CPU and memory performance
3DMark06 – Measures both GPU and CPU/memory, and additionally tests multi-threaded performance
Aquamark3 – Almost exclusively measures CPU and memory performance, with an emphasis on the latter
SuperPi 1M – Measures single-threaded CPU performance and is slightly influenced by memory
wPrime 32M – Measures multi-threaded CPU performance with no influence from memory

Some might be wondering why 3DMark Vantage was omitted. The main reason is that it would be a bit boring. Each operating system appears to score nearly identically in 3DMark Vantage, and any variations are within the margin of error.

System Configuration

I used a setup that I would consider fairly typical for an overclocking enthusiast:

Intel Core i7 965 Extreme at 4 GHz
ASUS P6T Deluxe OC Palm Edition
6GB Corsair Dominator GT 2000C7 at 960MHz CAS 7
2x ATi Radeon HD 4890s at the stock frequencies of 850/975 MHz (core/memory)

To be perfectly honest, the system configuration will likely have an impact on how the various operating systems compare with each other. Therefore, using one that is modern and high-performing is, in my eyes, the fairest way to compare them.

The Operating Systems

Windows Server 2008 x64
Windows Server 2008 x32
Windows 7 x64
Windows 7 x32
Windows Vista x64
Windows Vista x32
Windows XP x64
Windows XP x32

The operating systems are the usual suspects, all with the latest service packs installed. I added Windows Server 2008 because some people have supposed that it is faster than Vista, which it is based on, and I wished to put that theory to the test. Additionally, I tested both 32-bit and 64-bit variants of each operating system. How they handle the memory subsystem is important when it comes to benchmark performance, as we will see. Lastly, to make the tests as fair as possible, I trimmed all eight operating systems using nLite and vLite. In doing so, I kept the running services constant across all of them to rule those out as a factor. My vLite profile is as follows:

vlite profile

Only the important stuff remains, with all the fluff removed.
And nLite (for XP 32 and 64):

nlite profile

The Results

3DMark03 Results

Only one thing predominantly sticks out when viewing the results for 3DMark03—good ol’ XP doesn’t fare too well, while all the others are very close, with Windows 7 being slightly in the lead. Since 3DMark03 is heavily GPU-centric, this dead heat is not too much of a surprise. The benchmark depends mostly on GPU performance and is not heavily influenced by much on the system side, OS included. Still, it certainly shows XP’s obsolescence.

3DMark05 Results

Now is when things start getting interesting. 3DMark05 emphasizes CPU and memory performance, and consequently we can see the operating system having a very noticeable impact on performance. In fact, the only two operating systems that even perform similarly are Server 2008 and Vista. This does not come as much of a surprise, considering that the two are mostly the same under the hood, and are even more similar after I ensured that the running services and installed components were as close between them as possible. XP once again lags far behind the rest of the pack, but interestingly enough both 7 32 and 7 64 also score considerably lower than Vista and Server 08. 7 and XP being the worst performers certainly flies in the face of conventional beliefs. Another very interesting thing to note is that the 64-bit variants of 7, Vista and Server 08 all perform worse than their 32-bit counterparts. We must bear in mind that this benchmark uses under 1GB of memory, but at that quantity, the 64-bit OSes handle the memory subsystem a bit more slowly.

3DMark06 Results

The results in 3DMark06 are somewhat similar to those in 05. This time around, however, Windows 7 pulls far ahead of its 05 showing, scoring almost evenly with Vista. Also, the hit going from 32-bit to 64-bit in Windows 7 is much smaller than it is in Vista and Server 2008. XP is still decisively in last place, but the margin is a bit smaller this time around, thanks to XP scoring better in the CPU test portion of 3DMark06 than the newer OSes.

Aquamark3 Results

The results from Aquamark3 are quite similar to those from 06. Windows 7 once again makes a strong showing, and once again, 64-bit does not seem to hurt 7 very much, but takes a slight toll on Vista. Both versions of XP are far behind, but curiously enough XP 64 is considerably better than XP 32. Server 2008 is similar to Vista; however, it’s only fair to point out that run #3 for Server 08 32 was a bit of an outlier, what one would call an unlucky run. The first two runs had it performing on par with Vista.

One important thing to note about Aquamark3 in particular is that it depends heavily on graphics drivers. These results only look this way on ATi GPUs, like those used in this test. On nVidia GPUs, XP is actually slightly ahead of the others. You’ll have to take my word on that since nVidia results aren’t included in this roundup, but curiously enough, running ATi in Windows 7 scores about equal to a comparable nVidia setup in XP.

SuperPi 1M Results

These results are very different indeed from those obtained in the 3D benchmarks, and are almost completely the opposite. XP 64 has a noticeable lead over all the others, and is also the most consistent. Interestingly enough, this is the only benchmark where Server 2008 appears to be considerably faster than Vista. However, just like in the 3D benchmarks, the 64-bit variants of Vista and Server 2008 are slower. In 7 it’s the complete opposite, with 7 64 noticeably outperforming 7 32, further supporting the idea that the 64-bit version of 7 is optimized in some way that 64-bit Vista is not.

wPrime Results

I’ll start out by saying that I tried to work out exactly why XP 64 scored so poorly, but I’m afraid I can’t offer any explanation, so it has to be taken at face value. Otherwise, XP 32 is still ahead of the newer OSes, but by a smaller amount than it is in SuperPi. All OSes in fact are very close to each other, barring XP64. Windows 7 though once again shows some weakness on the 2D side of things, but 32-bit and 64-bit are in a dead heat.

Conclusion

So, who’s the winner? Well, for those who scrolled past the graphs: every single benchmark has a different operating system on top. Overall, the two most solid performers are Server 2008 32 and Vista 32. Both of these are at the top for the 3D benchmarks, and fare okay in the 2D benchmarks as well. Deserving of flak in everyday usage or not, in benchmarks, Vista 32 performs very well. Contrary to popular belief, XP and 7’s supposed “lightness” does not really translate into benchmark scores. In fact, the more CPU-centric a benchmark is, the worse 7 tends to do. The only thing XP remains good for are 2D benchmarks, falling far behind the pack in all things 3D. Once again, this article only sets out to show which the fastest operating systems are by the numbers. The fastest choice might not necessarily be the best one for you.

Questions and discussion of this article are on Overclockers Forums, join in!

- Gautam

92 Comments:

hokiealumnus's Avatar
This is a superb article. He did a ton of work on it. Well worth a read for any bencher worth their salt. Thanks Gautam!
I.M.O.G.'s Avatar
Great article Gautam! Best benchmark article I've seen on windows - right to the guts!

If you like it, and have a Digg account, hit that Digg button at this link!
http://digg.com/microsoft/Windows_Sh...n_6_Benchmarks
AmbientFiction's Avatar
Very nice add to the front page.
stereo555's Avatar
Very good article! Interesting outcome on some of the findings. Thx for the heads up.
EarthDog's Avatar
This is a GREAT article. Especially for overclockers who trim their OS's. Awesome work!

I would love to have seen a stock vs. stock OS comparison as that is what most people (non serious benchers) run.
johnz's Avatar
Great article. I'm not entirely surprised, but it's nice to see the findings on "paper". Very well done :^)
johan851's Avatar
Very interesting. Nice job!
David's Avatar
I have a few questions.

Firstly: how do they perform if you strip them all down to the bare minimum? I understand why you chose to keep them all on an even keel for this article but I think there would be some value in doing the tests again with each OS stripped down as much as possible.

Secondly: How do Vista and Vista SP1 compare?

Thirdly: Would it be worth seeing how XP/XPSP1/XPSP2/XPSP3 compare?

A very interesting article, which also raises some questions :-)
||Console||'s Avatar
Very nice paper .

Glad some one as trusting as you took the time to do all of these .
johan851's Avatar
He did this, didn't he?
c627627's Avatar
Excellent work Gautam. World's definitive benchmark article.
EarthDog's Avatar
I thought the same thing...

However I believe he wanted a COMPLETELY stripped down OS as opposed to the version he has. I will hope he clarifies soon.

Personally, I would rather have seen this run at all stock since that's how 99% of people here run their OS's.
nikhsub1's Avatar
Very nice, Gautam! I can't help but wonder how 2008 R2 would fare...??? Isn't R2 based on a different kernel?
I.M.O.G.'s Avatar
Go to google and search for Windows benchmarks - you'll see plenty of places doing stock windows comparisons.

Taking out all the extra garbage that runs on the OS, and actually evaluating each release on an even scale, performed by someone who knows what they are doing - that hasn't been done anywhere else. (Except maybe on xtremesystems, assuming he might have released his stuff over there too)
EarthDog's Avatar
I have, thanks. But that content is not here, which was my point. Regardless, it's a stellar article for its target demographic.
I.M.O.G.'s Avatar
I was just messing with you bud, and I agree it wouldn't hurt to have someone here doing the stock tests that we can trust.
EarthDog's Avatar
I know, all good, never thought otherwise .

And that was my other thought (stock tests here we can TRUST!).

These guys have A LOT to offer and it's wonderful to see them even more involved!!! W00t!
Gautam's Avatar
Thanks guys.
If I went any lower it would be at the point where they start acting goofy. Let me put it this way, I can already fit all my OSes on CD's (not DVD's) and installation takes about 8 minutes tops.

Not only that, IMHO "stripping" is overrated anyways. I don't think disabling services or removing components changes how an OS scores much anyways. I made them all even, though, just for the sake of being thorough. Vista without SP1 is fraught with issues, especially with a Crossfire setup like I used, so it wouldn't quite be right to try without it. And that along with your 3rd question can also be answered like this. 3DMark06 takes almost 9 minutes to run all by itself. Then multiply that by 3 and you're easily surpassing 30 minutes per operating system per benchmark. I think you know where I'm going with this.
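
A rough sketch of the time arithmetic above, in Python. The 9-minute run length and the three runs come from this comment; the per-run overhead for rebooting and setup is a hypothetical placeholder, not a measured figure.

```python
# Rough time budget for one benchmark across all operating systems.
RUN_MINUTES = 9    # approximate length of one 3DMark06 run (from the comment above)
RUNS_PER_OS = 3    # each OS was benchmarked three times
OS_COUNT = 8       # eight operating systems in the article


def minutes_per_os(run_minutes, runs, overhead_per_run=0.0):
    """Total minutes to benchmark one OS, with optional per-run overhead."""
    return runs * (run_minutes + overhead_per_run)


per_os = minutes_per_os(RUN_MINUTES, RUNS_PER_OS)
print(f"Pure run time per OS: {per_os:.0f} min")                  # 27 min
print(f"Across {OS_COUNT} OSes: {per_os * OS_COUNT / 60:.1f} h")  # 3.6 h

# Hypothetical: 2 minutes of reboot/setup per run pushes it past 30 minutes.
with_overhead = minutes_per_os(RUN_MINUTES, RUNS_PER_OS, overhead_per_run=2.0)
print(f"With overhead: {with_overhead:.0f} min")                  # 33 min
```

Even before any overhead, a single long benchmark costs several hours across eight installs, which is the point being made about adding more service-pack variants.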

2008 R2 is based on 7. I'm thinking of giving it a whirl as well, but we can probably expect the difference between 2008 R2 and 7 to be similar to that between 2008 and Vista (in other words, negligible)
jhanby's Avatar
Great article !
Joeteck's Avatar
Very nice. I also have a few questions...

How does it do with 1156 Intel CPUs? Some benchmarks have proven dual channel to be better...

I would love to see these same tests on a core i7 860 @ 4GHz.... Then those compared to the 965 as well...
mdcomp's Avatar
Great article dude!


Matt
sacha35's Avatar
Outstanding work Gautam, very intensive.
Deanzo's Avatar
Great article and work G
David's Avatar
Yeah, I mean stripped down all the way. The way Gautam did it he stripped them down to about the same point, was just wondering if some OSes lend themselves more to being stripped right the way down.

Admittedly, it's a lot of work for relatively little info.
Deanzo's Avatar
Hey G,

Just looking at the XP64 wPrime score, did you try with and without graphics drivers, to see if it changed?
It looks like the kind of drop in score you get by not running drivers, maybe an issue with that OS and its drivers?
Omega Destroyer's Avatar
This is a great article.

I just had one comment. Instead of putting the results of all three runs and an average, I think it might have been more useful to only show the average with some error bars.
I.M.O.G.'s Avatar
Good input. I considered that when I first looked at the graphs also. Given more thought, I like the transparency of the way G presented it - with the shades of grey and the average in Red, I think it made it very clear.

We both took the same considerations in mind however, which is interesting.
shadowdr's Avatar
Great article. I have been thinking of installing Vista as my primary OS for a while, but honestly mine still has bugs. I don't strip them down at all, and XP has a few issues of its own, mostly with the .NET Framework, but I have never been able to install SP2 for Vista. Then again, there is so much software to install as well, so I think I will wait until XP is unrecoverable.
Gautam's Avatar
Hmm...it's a nice sounding theory, but that same XP64 is the one that I used for all the other benches, every single one, and with most of them being 3D benches. And it's only in wPrime that it goes weird.

I'll take another look. All the OSes are still in their same state.
Omsion's Avatar
Agree with this - average + std deviation bars presents the idea in less space (even with three samples). More samples would always be nice, though.
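
As a minimal sketch of the "average plus standard deviation" summary suggested here, using only Python's standard library. The three scores are hypothetical placeholders, not values from the article.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) for a list of scores."""
    return mean(runs), stdev(runs)


# Hypothetical scores for three runs on one OS (not taken from the article).
example_runs = [30100, 30250, 30175]
avg, sd = summarize(example_runs)
print(f"{avg:.0f} +/- {sd:.0f}")  # 30175 +/- 75
```

`stdev` is the sample standard deviation (dividing by n - 1), which is the usual choice for a handful of runs; with only three samples the error bar itself is a rough estimate.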

Anyway, the theoretical implications behind this are that the optimizations in each OS are weighted differently, leading to these results, correct? And thus Win7 + Vista are statistically indistinguishable performance-wise given these benchmarks.

Finally, I don't really have a solution for this given your dataset, but I dislike graphs not starting from zero, since that skews the scale of the real differences. Dunno what really to do here, though.
icebob's Avatar
Great work G! I know you like Vista but the results just speak for themselves... guess I have to hit the egg and get a couple more hdd!!! One question tho: do you think they (OS) will behave the same when they get the nose dripping?
Gautam's Avatar
Actually, I too was thinking about std deviation bars, but I didn't feel like trying to figure out how at first, put them like that to begin with and then just liked the way it looked. I also specifically recorded the order in which I did each run, mostly for myself. Since I did, I figured I might as well present it to the reader too. As for more data points, once again, to add even one more data point per bench means around 5-6 hours, not to be whiny about it, but that's the truth. And if you look at the results it's pretty clear that there's not even much variation between runs in each given OS. Whether I can show it statistically or not, 7 always scores lower than Vista in 05 for example. Even one run can tell you that. I could do more runs to make it more statistically sound, but pragmatically it wouldn't help a thing.

Probably plenty don't like the graphs not starting from zero, but it's conducive to what I'm trying to present. This is not supposed to be an academic paper, it's supposed to tell people clearly what OS scores the best. The real difference is tiny, yes. For example even between the averages of 7 64 and Vista 32, it's just 2.35%, which in real terms is tiny. However, on the 3DMark05 hall of fame, it's actually larger than the difference between 1st place and 5th place, which is a very big deal for anyone competitive.
icebob's Avatar
You know what I mean: will they behave the same at, let's say, 5.5 GHz versus 4.2 GHz? Let me rephrase: do you think the OS will affect your max OC?
Deanzo's Avatar
I'm not even sure why having 3D drivers loaded matters so much on a 2D (cpu only) bench
But it could be a one off issue with that bench/OS, or not
Gautam's Avatar
Very possible. Almost no one ever uses XP64 for benching anything, so I have nothing to go off of.
freeagent's Avatar
Nice article hombre

A little O.T...

I like the new look of the main page, looks good
I.M.O.G.'s Avatar
Well stated. I also didn't care for the graphs not starting from zero, but after I digested the article, it seemed clear to me you made the right choice in presenting the data.

Thank you. iNet and our mods played a large part in making it what it is. Dogsoldier also deserves an immense amount of credit - he took part and won the logo design contest we held on the forums, and the entire design was centered on his logo. I owe him an article still, highlighting his artistic portfolio.
SeanBest's Avatar
Nice read ... could've used more pictures!
I.M.O.G.'s Avatar
For those like myself, who aren't as up to speed on benchmarking as some, John the owner of www.madshrimps.com lends some insight into the results:

http://www.madshrimps.be/vbulletin/f...35/#post250569

Mainly, his comments are a clarification of what the hardware Gautam chose means to the benchmark results. It's good insight, along with most things you'll find John posting at madshrimps.

You can find their articles here:
http://madshrimps.com/?action=articles

As well as their news here:
http://www.madshrimps.be/?action=news
xtreeme's Avatar
Well, another thing which has a B I G impact on these benchmarks is which service packs he installed (if any) ??????

SP3 for XP32 is a performance killer de luxe. Likewise, I found Vista without SP1 to perform noticeably worse than with SP1 (never tried SP2). Microsoft did put a lot of work into SP1 (and probably the same with SP2) to increase Vista's performance because Vista lagged behind XP in performance at that time.

Can Gautam clarify the question regarding which service packs ????

*****I got flashbacks to Nvidias frame jump cheating after M$ released SP1 for Vista *****
hokiealumnus's Avatar
I benched XP SP2 vs. SP3 when it came out (screenshots are long gone) and SP3 was either the same or better with my setup at the time (E4400, IP35-E, 8800GTX). Do you have documentation or reference to it being a benchmark killer?
macklin01's Avatar
On graphs not starting from zero: such a presentation tends to magnify small differences. If series A is 1000 and series B is 1001, then they differ (relatively) by 0.1%, which may be within measurement error. But plotting (for example) on a scale from 995 to 1005 (about 1% of the data range) will make any difference look much more significant, regardless of the statistical significance.
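
A small Python sketch of the magnification effect described above, using the same 1000 vs. 1001 example. The function name `apparent_ratio` is made up for illustration; it simply compares bar heights measured from wherever the axis starts.

```python
def apparent_ratio(a, b, axis_min=0.0):
    """Ratio of bar heights for values a and b when the axis starts at axis_min."""
    if min(a, b) <= axis_min:
        raise ValueError("axis_min must be below both values")
    return (b - axis_min) / (a - axis_min)


# Series A = 1000, series B = 1001, as in the example above.
print(f"{apparent_ratio(1000, 1001):.3f}")                # axis from 0   -> 1.001
print(f"{apparent_ratio(1000, 1001, axis_min=995):.3f}")  # axis from 995 -> 1.200
```

On the clipped axis the second bar is drawn 20% taller even though the values differ by 0.1%.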

Looking at the 3DMark03 scores, we see a spread from about 106k to 111k, or a relative difference of 4.7%. The standard deviations for the measurements look pretty small (but I can't calculate without the raw numbers), so the differences appear to be statistically significant, even for the small sample sizes. But the differences aren't as huge as they appear on the graph, so it's worth noting.
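
The relative-difference figure can be reproduced with a short Python sketch. The 106k and 111k endpoints are the approximate values read off the graph in this comment, not raw scores, and the difference is taken relative to the lower score.

```python
def relative_spread(low, high):
    """Percentage difference between the highest and lowest score, relative to the lowest."""
    return (high - low) / low * 100.0


print(f"{relative_spread(106_000, 111_000):.1f}%")  # 4.7%
print(f"{relative_spread(1000, 1001):.1f}%")        # 0.1%
```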

Similar story on 3dmark05. About a 3.6% relative spread.

On 3Dmark06, the relative difference from high to low is a paltry 1.5%, with larger measurement variation than the other benchmarks. While there is an observable trend, it's statistically not much greater than the measurement variation. This is a case where the graph may not be worthwhile. Vista32 vs 7-32 are within 0.1% of one another, well within measurement error from what I can tell.

Calling XP "decisively in last place" in this case, with a clipped graph range (focused upon 3% of the data range), is scientifically misleading in my opinion. If I were reviewing this paper for a journal, I would absolutely require the authors to revise these strong claims for this graph. The data simply don't bear them out.

Aquamark3: about 4%.

SuperPi 1M is similar: all tests within 1%.

wPrime 32M without the outlier: 3.5%

I find the overall conclusion should be different: for most of these operating systems, the performance is within 5% of one another in benchmarks. The differences may or may not be statistically significant. Given the small difference in the benchmarks and the unknown statistical significance of those differences, other metrics should be used for the basis of choosing the operating system. It is interesting, however, to note that Vista and Win7 can be trimmed to perform on par with XP.

I'm not trying to be harsh or unduly critical; this is just my view as a scientist looking in. The benchmarking techniques are great, but the data analysis and presentation could use improvement. The relative differences give a clear picture and should be included, rather than looking at clipped graphs. I know it's common in benchmarking articles and reviews, but well, there's not a high standard of rigor out there. We can be the best.
macklin01's Avatar
You are absolutely right. They all perform within 5% of one another, often within one or two percent. The differences are barely significant. The graphs tell the wrong story.
I.M.O.G.'s Avatar
Paul, what about evaluating the claims in the frame of reference of hwbot.org performance?

I initially took similar concerns with the presentation of data. However, most specifically buried in this growing thread above is this rationalization from Gautam:

The graphs are meant to illustrate there's a consistent and reliable performance difference. These differences may not be statistically significant to the average user, but then the article, as well as Gautam's comments also make it clear that's not the intended audience. As a benchmarking piece, talking about the OS differences which mean the difference between 1st in the world and 5th in the world... If I were at the top of the benchmarking foodchain, these small differences begin to look significant.

From the sites I've seen that have picked up the article, the intended audience mostly interpreted it as it was intended. In contrast, I think I saw some posts on the macrumors forum where they may have been taken incorrectly.
macklin01's Avatar
Interesting point. Of course, it's great work and among the best out there. Sure, it's better than they are, but that doesn't make the conclusion correct.

To validly claim 2.3% difference as significant, you need more samples. And there's no getting around the fact that the graphs, without some statement on the small actual relative differences, are misleading, particularly for those younger audience members who haven't had much experience in science, mathematics, statistics, etc.

The conclusion should be "little-to-no significant difference in benchmarks", not "XP definitively worst" etc. I'd hate to see people shell out $200 or $300 to upgrade their OS based upon magnified views of performance differences that may not even be significant.

Regarding 3Dmarkland, perhaps the real lesson is that the top 5 places are statistically equivalent.
xtreeme's Avatar
Approx. as much documentation as you have....... I can dig up tons of articles that claim SP2 to be faster than SP3. Likewise, you will be able to dig up a ton of articles that say the opposite.
In other words, I am interested for my own sake in the information about which SPs were used here...
So for me this test will be "just one of those tests". I know that I will have to install XP, Vista, 2008 and 7 and test all of them in my system to find which OS performs best, since a lot of the performance is driver dependent. One of the most noticeable performance impacts in my system comes from the Areca 1680ix - 4GB raid controller (6x SSDs in raid0), and this controller seems to favour XP.


I can also dig up articles that says that XP SP3 is (was?) a lot faster (claimed up to 10% faster) than Vista SP1 was in benchmarks...
macklin01's Avatar
That's a definite problem: there's a lot of anecdotal information out there.

The thing I like about Gautam's article is the precision he attempted. Just not thrilled about the data presentation and the conclusions drawn from them.
I.M.O.G.'s Avatar
I think presenting it any other way would have missed the target audience it was directed at, so I do see the value in how it's presented.

I could also see value in it being more complete looking at it from your perspective Paul - the same set of graphs based on a 0 scale would ensure it's more clear to any lay person. Perhaps that would make sense to present it after the conclusion - but I'm not motivated to work up the graphs myself. Perhaps Gautam is. For an astute reader however, all the information is sufficiently presented for one to draw the proper conclusion.

Maybe we set the bar high with expectations for readers to look at the results critically, but really I thought it was sufficiently clear. I can also see danger in setting the bar too low - putting the kid gloves on it is going to make a lot of prominent overclockers yawn when all they are really interested in is the hard data. It's the people at the forefront of the hobby which drive interest, and those are likely the people we should rightly cater to. As a site, we generally do very well at getting the people starting at a base level up to speed - I'd say that's our strong suit.

That wasn't very concise. Essentially, I think there's a balance and I think we're in the right place.
macklin01's Avatar
I think the data presentation is an issue, but not huge if mentioned.

However, I still think the conclusions are incorrect.

"The only thing XP remains good for are 2D benchmarks, falling far behind the pack in all things 3D. Once again, this article only sets out to show which the fastest operating systems are by the numbers. "

Incorrect conclusion.

All the operating systems perform within 5% of one another on all the benchmarks, and most often within 2%. This is a near statistical dead-heat. The real message is that (1) if properly tuned, Vista isn't much of a hit, and (2) upgrading to Win-7 is not much of a gain. The numbers tell me to stick with what you have, which is a completely different conclusion. A couple of percent is not "far behind." The numbers don't lie per se, but do need to be understood in proper context. In quantitative work, that context is called "statistical significance" or at least "relative differences."

Saying that article told which OS is fastest "by the numbers" implies that the numbers significantly supported that conclusion. They did not. At least not such a strong conclusion.

Ultimately, this is why I think the graphs are a problem: they tend to not just mislead the readers, but the writers. I've found that the workflow usually goes like this: (1) do the work. (2) plot the results (3) analyze the graphs to make sense of the data (4) write the article accordingly. A misleading plot in (2) throws the whole thing off. In fact, I find that clear, well-selected graphs are more important to the writers than the readers, as it makes or breaks the science.

If it takes a skewed plot to show a result, then there is probably no result. (Which in itself is a result, just not the one presented here.)
hokiealumnus's Avatar
The skewed plots' results show a very significant difference relative to the target point of view. As he said, when it comes to world records, the difference in top scores is much less than the differences in the benchmarks he graphed. Therefore, even if not statistically significant for anyone but the target audience, the differences are eons apart at that level of competition.

I think he did a superb job and accounted for margin of error very well with the number of times he ran the benches. This isn't a dissertation or peer-reviewed journal article and doesn't pretend to be. It's an article written by an authority on benchmarking, directed at like-minded individuals and attempting to show the differences between operating systems. It accomplished that well.

The largest point that might be important to note regarding his results is the one I.M.O.G. linked to at Mad Shrimps. That may be worth exploring at some point.

For every day use, the layperson can take their pick of operating system. They may see lots of differences, but the largest of those will not be speed. If the article was aimed at that audience, graphing from zero would have made more sense. Indeed, the conclusion you have proposed (the difference is < 5% for any tested OS) would be valid. However, it's not the audience this was written for and I think even the layperson would be able to tell that. Actually, the lay person would probably be sitting there thinking "WTF is Wprime?"
macklin01's Avatar
Okay, but remember that overclockers.com audience != benchmarking audience. We're a more diverse group than that.

And if the difference between scores is statistically insignificant, then maybe that's all there is to it: the top 5 places are all statistically equivalent, so the exact ordering really doesn't matter (except as a matter of pride, which I understand).

No, they don't. Magnifying insignificant differences does not make them significant. Significance has a very real and non-fuzzy meaning, and it has not been achieved here.

I'm not a layperson, and even I was "confused" that the article was aimed at me as an overclockers.com reader, even though I'm not a benchmarker.

As I said, we have a diverse audience. Hardcore benchmarkers, hardcore coolers, people who are looking for the best bang for the buck, people who just want to find efficient cooling methods, people who want reliable hardware based upon people who have truly stressed it, etc. So, the different conclusions are valid, precisely because we aren't a monolithic group. -- Paul

*edit*
As I think about it more, there are really two different benchmarking communities out there.

Group 1: Uses benchmarking to compare product A to product B. Do they differ significantly?

Group 2: Uses benchmarking to compete. Who gets the highest number?

What we have here is a member of group 2 writing an article for the audiences of both groups.
*/edit*
Xtreme Barton's Avatar
ok so im gonna go put my vista install disk back now ..hehehehe

glad to be a part of these forums .. you guys are awesome !!
Gautam's Avatar
The conclusion is: "Overall, the two most solid performers are Server 2008 32 and Vista 32.[...]in benchmarks, Vista 32 performs very well." I stress over and over again that this only applies to how they do in benchmarks, not real world usage or anything else.

I don't know exactly how to reconcile the other issues. Yes, anyone that's been through a mathematics or engineering background (as I have myself) has been repeatedly taught that it's wrong to present data as I have. Nevertheless, I'm not budging on that point. First of all, the graphs look bad if they're zeroed. Second, ask yourself, how does statistical significance translate into the real world? It's possible to force a statistically significant outcome that's insignificant in the real world. Conversely, some things that are statistically insignificant can be very significant in the real world. Sure, in terms of the GDP, your salary is statistically insignificant, but it would certainly be significant to you if that statistically insignificant part of the GDP were to disappear.

And besides that, even though the sample size is very tiny, the standard deviations are quite tiny in every case, far smaller than the difference between the best and worst OS. A percentage difference doesn't really mean much if you don't consider the standard deviation.

Oh and I should have noted this in the article but everything was as new as possible. SP3 for XP, SP2 for XP64, SP2 for Vista/08.
icebob's Avatar
When G first began to post his results in the lounge, the "target audience" was the benchers. And it did exactly what it was supposed to do. When I set up my rig with an LN2 pot on top of it, my first question is: what am I gonna be benching today? When that's decided I go to G's post and figure out what OS I need to try to get the best out of that benching session. So I end up setting up 2 or 3 hard drives, and that way I can expect to get the most out of the $100 worth of nitro I just bought for that benching session. For me that OS comparison G did is worth gold. I use it for the purpose it was written for and I don't really care if the graphs don't start from zero... Just my $0.02
I.M.O.G.'s Avatar
I slipped this into the original article under the OS section, thank you.
macklin01's Avatar
Gautam, thanks for the nice response.

You make some good points, and I want to reserve this space to write something back.

More later. Thanks -- Paul
xtreeme's Avatar
Thanks Gautam

I think I'll stick with XP SP2 for a while longer - I use my "high-end" system mostly for other things than gaming and benching

But I'm certainly gonna give Server 2008 x64 a go - x86 OSes which only support 4 GB memory (w. PAE) aren't interesting anymore imo, so you and others are at least pushing me in that direction, so to say
c627627's Avatar
All too often people sacrifice the stability of a 32-Bit OS in cases when 3.5 GB of RAM is more than enough to cover their needs.

I considered 64-bit only as part of a multi boot so that I would have the option of booting into 64-bit when I really need it... but so far I could not find justification for over 3.5 GB of RAM for my personal use which includes use of older programs incompatible with 64-Bit OS.


On my triple Windows 7 / XP / Vista boot [all 32-Bit], I have certainly found Vista to be slower in a way that I can feel in comparison to Windows 7 and Windows XP.


I suppose benchmarks measure things once they get going and real life also includes getting them to go, which is what I mean by feeling faster vs. slower.


Jo3f1sh's Avatar
Wow. Great read. I have a copy of XP x64 that i never got around to installing and always wondered how it fared. Not well it seems.
I.M.O.G.'s Avatar
Are you a benchmarker? If not, XP64 does fine. The difference is minuscule.
xtreeme's Avatar
Ever tried to run Windows x64 with min. 8 GB RAM and no page file? I promise you that it flies vs an x86 OS when the x86 system starts seriously swapping
I do use photoshop a lot and especially with some heavy filters I always end up with a system that starts swapping madly to the disks (Raid0 on Areca 1680ix w. 4GB ).

The performance difference is then suddenly very noticeable in a x64 system with 8GB (or more) vs. x86 with its memory limitations.

If I were an avid gamer or bencher I would have gone much further than Gautam did in his stripping - I would start off with MicroXP and strip it to the bone. And of course: no AV running in the system - actually not a single startup program at all.

If it's all about performance - I think MicroXP is a good place to start

From my experience with both Vista and 7 - I see (saw with Vista - didn't try it after SP1) that XP always loaded programs faster, search was faster (without indexing on), rendering was faster +++

I guess (hope?) that these are things that Microsoft will sort out in 7 - remember that XP had a lot of problems at the start too; XP got good after SP2..

EDIT: I guess a lot of the old-timers who aren't impressed by the eye candy that 7 offers will keep on running XP till the bitter end
macklin01's Avatar
Color me one of those old-timers. I have 64-bit Vista Ultimate sitting in a box after a year of using it. Just installed XP-64 on my new 1.5 TB drives. The performance difference is tangible, again in day-to-day use, searching, task-switching, etc. And not in any way optimized; I'll have to check out these suggestions.

The ID3 tag bug in 64-bit XP may be what pushes me over to linux, though. :-)
JigPu's Avatar
Very interesting article Gautam! Thanks for all your hard work!

I understand your motivation in magnifying differences both graphically and in word choice, and I have to say that it hits home for an audience of competitive benchmarkers. Such an audience IMHO relies on statistical deviations as much as hardware in its quest to beat the next guy (since, after all, glory goes to the man with the best single datapoint, not the best statistical average).

However, as somebody who doesn't competitively benchmark I agree with macklin -- the magnification of tiny differences just misleads me. I'm not a statistician so I can't comment on just how statistically significant things are (or aren't), but the OSes aren't as differentiated as the language and graphs suggest.

If I may, I have a few suggestions:
  • Try to make your intended audience more clear. Your second paragraph could be read as targeting competitive benchmarkers, but that's not how I read it.
  • If the site supports it, rollover graphs would be wonderful. You could keep the magnified views by default, but show the zero-based ones if the reader rolls their mouse over.
  • Tone down the language a little and/or couple it with language that emphasizes these differences are very small but possibly quite significant to a competitive benchmarker.
  • Make your data available. I've seen very few sites do this (there could be a reason why, but I don't know), and I think it would be interesting to provide readers with the raw data should they wish to do a more in-depth analysis.

JigPu
macklin01's Avatar
Those are great points, JigPu.

Indeed, you could use the data as a follow-up article for non-benchmarkers, because your results are also significant to us, but with different conclusions (as I mentioned above). It's interesting that the same data tell different stories depending upon your target. I'd be happy to help write a very short follow-up note / article-ette.

It's actually funny, because we end up using the same (software) tools for very different purposes.

Since we have both target audiences here, it might be a nice way to get further mileage from your great, hard work. Also, it might be nice to have our "cultures" intermingle.
Jo3f1sh's Avatar
Well...not really. But my main reason for installing it would have been to upgrade from XP 32-bit, probably expecting a big performance increase. I suppose what i meant to say was that it doesn't look like it would have been as much of a performance upgrade as i thought.

I guess it helps to clarify.
senorbum's Avatar
I didn't read through all the posts, so maybe it's been covered. But I think what really comes out of this article is that OS does not have a huge impact on performance. For somebody trying to be in the tops for benchmarking stats it's fine, but even that varies depending on exactly what you are doing. Most of these graphs are actually quite poorly displayed (in a statistical sense). The SuperPi graph is a superb example of this. Windows 7 64 looks 70% slower than Windows XP 64 in this graph. In reality it is 0.56% slower.

That being said, it is certainly interesting to see the effect that the operating system has on a given system for various applications. I wish more of us had the resources and time to do similar benchmarks in order to be able to compare with different configurations.
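
To make the graph point concrete, here is a minimal Python sketch of how a truncated axis inflates a gap. The two times and the axis minimum are hypothetical values picked only so the output reproduces the 0.56% and 70% figures above; they are not taken from the article's data.

```python
def actual_pct_diff(a: float, b: float) -> float:
    """Percent by which b exceeds a, measured from zero."""
    return (b - a) / a * 100.0


def apparent_pct_diff(a: float, b: float, axis_min: float) -> float:
    """Percent by which bar b looks longer than bar a when the axis starts at axis_min."""
    bar_a = a - axis_min  # drawn length of the shorter bar
    bar_b = b - axis_min  # drawn length of the longer bar
    return (bar_b - bar_a) / bar_a * 100.0


# Hypothetical SuperPi 1M times in seconds (assumed, not measured).
fast, slow = 10.000, 10.056
axis_min = 9.920  # hypothetical lower bound of the truncated axis

print(f"actual gap:   {actual_pct_diff(fast, slow):.2f}%")
print(f"apparent gap: {apparent_pct_diff(fast, slow, axis_min):.0f}%")
```

The closer the axis minimum sits to the smaller value, the larger the apparent gap becomes, while the actual gap stays the same.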
I.M.O.G.'s Avatar
These forum comments are now integrated with the article. Styling work remains to be done, but the basics of the system are done. Go here to see what this looks like:
http://www.overclockers.com/windows-...-6-benchmarks/
Gautam's Avatar
Yes, and perhaps unsurprisingly no one on the hwbot forum raised the issue of the graphs not starting at zero. Their issues were mainly that they wanted more hardware configurations tested. I suppose each set of audiences behaved somewhat as expected. The benchmark junkies were determined to know nothing other than what they should be using to get every last point out without much thought, while plenty of you guys considered things a bit deeper.

About the tone and all of that...I guess what I probably should have stated up front is that this began for the benching team. In fact, it was in the private team lounge in a much less refined state for months, but I was asked to make it public. So the nature of the testing and the conclusions was from the getgo intended for them. (It's also why it remained private...using Vista over XP was somewhat of a "trade secret" that's been used successfully to grab some records)

One other example that might hit home to a lot of people here is that if you were to take 3% off of 4000MHz, it'd put you at 3880MHz. However, I can assure you that many members of this forum have gone to great lengths to get that extra 3%.
I.M.O.G.'s Avatar
I noticed the same thing reading the hwbot thread - they had the data points and that's all they were concerned with. The difference between audiences and frame of reference is certainly interesting.

Out of all the sites that picked up your article (about half a dozen highly relevant community sites), www.hwbot.org and www.madshrimps.be had the most on point evaluation and commentary. Props to them.
Omsion's Avatar
That's certainly true - the question for some members of your audience, then, is how consistently that 3%, if not random noise, carries over to real world usage.

icebob's Avatar
This is exactly what I'm talking about. G posted that in the lounge right before the last Forum Warz. I can tell you that this was our Bible for setting up our rigs. Should I remind you how we did in the last Warz?
Gautam's Avatar
It's not "random noise" and it is consistent. Even from a statistical viewpoint, if a data point is 5 deviations from the mean then the difference is certainly statistically significant. In fact I'm not clear what reasoning you guys are using to dismiss a certain percentage as being "insignificant."
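
One way to put numbers on the "how many deviations apart" argument is sketched below in Python. This is not the method used in the article; the scores are hypothetical, and the Welch t statistic is just one common choice that would still have to be compared against a t distribution, which matters with only a handful of runs.

```python
from math import sqrt
from statistics import mean, stdev


def gap_in_sds(runs_a, runs_b):
    """Gap between the two means, in units of the larger sample standard deviation."""
    return abs(mean(runs_a) - mean(runs_b)) / max(stdev(runs_a), stdev(runs_b))


def welch_t(runs_a, runs_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = stdev(runs_a) ** 2
    var_b = stdev(runs_b) ** 2
    standard_error = sqrt(var_a / len(runs_a) + var_b / len(runs_b))
    return (mean(runs_a) - mean(runs_b)) / standard_error


# Hypothetical scores for two operating systems, three runs each (assumed data).
os_a = [30150, 30170, 30160]
os_b = [29850, 29870, 29860]

print(f"gap in standard deviations: {gap_in_sds(os_a, os_b):.1f}")
print(f"Welch t statistic:          {welch_t(os_a, os_b):.1f}")
```

With these made-up numbers the gap is about 1% of the score yet many standard deviations wide, which is the distinction between "small" and "insignificant" that the thread keeps circling.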
icebob's Avatar
You see G that thing should have stayed in the lounge.....
rdrash's Avatar
Thanks for the hard work Gautam.... I know it must have taken hours and hours to accomplish and is very much appreciated! Not many people would have bothered with such an exhaustive effort, kudos.

....sorry to see some people giving you headaches.

..... lol Bob, you might catch grief for saying that, but +1 brother I'm with you.
macklin01's Avatar
I disagree. This kind of discussion is healthy and enlightening for all of us. We all learn something and are forced to reassess and strengthen our arguments. Sometimes we find we were wrong (and can be thankful for new knowledge, saving money, or whatever), and sometimes we find we were right but now have a deeper understanding of why (and have more effective arguments for the next time).

There's a great risk when a group keeps itself isolated because it doesn't want to hear contrary opinions or analyses. The group loses out because it develops a monoculture that's susceptible to unchallenged dogma. The broader community loses out because they don't get the group's in-depth expertise. When both work together, both are enriched. They just have to learn one another's vocabularies and motivations.
macklin01's Avatar
Again, a well-done work comes out stronger after tackling constructive criticism. It's part of how we learn and evolve.

I believe that Gautam's work is in this category: well-done work that will emerge all the stronger.

Firewalling ourselves from differing points of view isn't healthy or conducive to understanding. If our analyses can only convince people who agree with us, then they probably aren't very good analyses. Fortunately, that's not the case here.

I think there's a good opportunity here to intermingle and strengthen the bonds within our diverse community. Again, I'd like to extend my offer to G to do something together as a follow-up. I'm learning a lot as I read through here.
icebob's Avatar
Ok, let me put it in a different perspective: if you want a new car and want advice on what is more cost/performance effective you will probably look in Car and Driver, but if you already have the car and want to get the most out of it you will probably look in Muscle Car. You see my point: this comparo was done for a Muscle Car audience, not for the Car and Driver reader. You don't seem to understand how much work is involved in getting, let's say, 1 second less in wPrime, and G's reference guide helps us accomplish that. I can assure you that switching from Vista 32 to Win 7 won't let you get your email faster
macklin01's Avatar
Thanks. I appreciate the difference.

I wouldn't say that. In fact, I greatly appreciate and admire how difficult it is. I myself would never have the time, patience, or budget to do that. But I admire seeing what's possible, and I appreciate that pushing the envelope of the hardware helps advance the state of hardware for the rest of us. At the absolute very least, what you do (1) helps us figure out what hardware has enough quality to survive 24-7 heavy-duty use in less extreme settings (e.g., a 5% overclock applied to a cancer simulation), and (2) pushes the hardware manufacturers to improve their top-end products, which in turn improves the mid- and lower-end products as well. It's a win for everyone. I don't think anybody denies that. And nobody denies that there are benefits to the broader community far beyond this.

What we have here is an interesting discussion. You're presenting work that started in a niche but is interesting to everyone. You're finding different points of view on the same data. That's enlightening for all of us. It's not that somebody or other "doesn't get it." It's that they have a different frame of reference.

The data may or may not be statistically significant. Some plots are, some may not be. I believe most individually are. Nonetheless, a near-NULL result is extremely interesting for the general readership, and the individual results are interesting to the benchers. We all win here. And I think taking care to remember that we are a broader audience is valuable. We gain data that we didn't have before, even if for different conclusions. It's a beautiful case of getting twice as much out of the same data as previously thought. That's a benefit of opening up to a broader group--you find things you would not have otherwise expected.

That's been the case for me. I've been exposed to the thoughts and methods of a completely new group. Aside from reading a few "world record LN2 overclock" articles here and there, this is new to me. And I gained for it. So thanks for opening up. Don't let constructive critiques scare anyone away--it means that we're genuinely interested and want to learn more. You might just get some new recruits for it.

Opening yourself up and presenting your work to a broader, often skeptical audience is challenging and scary. I know exactly how this feels, because I do it every day as a mathematician working on cancer and molecular/cellular biology. The discussions can be heated and draining, but you learn so much and advance your knowledge and your presentation skills so much, that you always come out the stronger for it.

I've also found that the more I learn, the more education I acquire, the more I find myself able and willing to say "I was wrong. I hadn't thought of it that way. That's interesting. That has so much more meaning than I had appreciated. That's deep, and I think I can use it."
hokiealumnus's Avatar
Spoken like a true PhD. Thanks for your input. Even those of us that didn't write the article are getting some good advice for the future.
macklin01's Avatar
I hope so, hokie. I hope I'm not just stirring up trouble.

I hope that Gautam realizes that I wouldn't even be commenting if I didn't think it was a great article worth discussing.
Gautam's Avatar
Yeah, I also think that the concerns are valid.

But perhaps I should do one benchmark with something like 20-50 trials which will also exhibit that the error between results is very small, and when you have even a couple of percent worth of difference, it is significant.
Omsion's Avatar
Yes, that definitely appears true for some of the comparisons (generally speaking, XP vs the rest). But for quite a few, I don't think we could conclude a statistically significant difference of mean values with the current sample size (just by looking at the graphs, no actual hypothesis testing).
For example, Vista vs Server 08 vs Win7 in 3DMark03, 3DMark05, minus Vista64 in 3DMark06, etc. Thus my original conclusion(s). I definitely agree with you that there are non-random differences in the mix.

macklin01 got to this first, so I'll let his word stand.

But for myself, I learned a lot here from this back and forth, and especially about what benchmarkers look for. I understand now that this is an especially great guide for choosing which OS to run when targeting different benchmarks. This is something I wouldn't have gotten out of this without this discussion.
macklin01's Avatar
Beautiful. Thanks, Omsion.
Gautam's Avatar
In 03, 06 and Aquamark, yes Vista and 7 are basically equal. In 05 they are certainly not. (And in 06 the difference between 32-bit and 64-bit for Vista is significant)

How about I focus on just 05, just Vista 32 and 7 32 for example, and give them each a much larger amount of trials?
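
As a rough sketch of why more trials would help settle this, the standard error of a mean score shrinks with the square root of the number of runs. The standard deviation of 10 points and the starting count of 3 runs below are hypothetical values, not figures from the article.

```python
from math import sqrt


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean for n independent runs with standard deviation sd."""
    return sd / sqrt(n)


run_sd = 10.0  # hypothetical run-to-run standard deviation, in benchmark points

for n in (3, 20, 50):
    print(f"{n:2d} runs -> standard error {standard_error(run_sd, n):.2f} points")
```

Going from 3 to 50 runs cuts the uncertainty on each mean by roughly a factor of four under these assumptions, so a gap that looks ambiguous at 3 runs may become clear either way.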
Omsion's Avatar
Yeah, I probably mis-wrote some of those. I plead too many graphs and quick glances

Anyway, do you think a fair conclusion given these numbers, for an average user interested in upgrading to Win7 from Vista only for performance reasons is "Don't bother - many insignificant results, couple significant ones but only resulting in small differences in both directions depending on benchmark"?
I.M.O.G.'s Avatar

The response has been overwhelmingly positive, and even Gautam would agree it was time to release his work. 6 major community outlets picked up his article, as well as many other smaller ones.

The negativity in response to open discussion is the only thing out of place here.
macklin01's Avatar
I think that's fair; see above. I think that's what makes this so interesting--depending upon the purpose, there are two radically different sets of conclusions to draw. Also, the fact that the highest scorer varied among all the tests seems to show that overall it's a draw, when not considering specific benchmarks as the goal. I also think that another interesting result shown here is that Vista can be trimmed to perform essentially as well as XP and Win-7 (at least in benchmarks; task switching, etc. is another matter). These are interesting results outside the benching community.

It might be good to use the term "small" rather than "insignificant." The differences may well turn out to be statistically significant but not large enough to justify the time spent in an OS reinstallation. Again, depending upon the purpose of the system.

Another funny thought: for some of the benches, there may not be a statistically significant "winner." In those cases, a bencher would be better served by running the benchmark multiple times and waiting for a random event to push them higher than reinstalling their OS. That's actually kind of cool.
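
The "wait for a random event" idea can be sketched with a small simulation. It assumes run-to-run noise is Gaussian around a fixed true score, which is a simplification, and every number here is hypothetical.

```python
import random


def best_of_n(true_score: float, noise_sd: float, n: int, rng: random.Random) -> float:
    """Best single result out of n noisy runs (Gaussian noise is an assumption)."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(n))


def average_best(true_score: float, noise_sd: float, n: int,
                 sessions: int = 2000, seed: int = 0) -> float:
    """Average best-of-n score over many simulated benching sessions."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total = sum(best_of_n(true_score, noise_sd, n, rng) for _ in range(sessions))
    return total / sessions


# Hypothetical benchmark: true score 30000 points, run-to-run noise of 10 points.
for runs in (1, 5, 20):
    print(f"best of {runs:2d} runs averages {average_best(30000, 10, runs):.1f}")
```

If the gap between two OSes is smaller than what repeated runs gain this way, rerunning is the cheaper route to a higher submitted score under these assumptions.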
xtreeme's Avatar
You are absolutely right there. I actually did hold the world record in 3DMark 2001 - the result came after hours of benching. I never expected the result

Futuremark (formerly MadOnion) has created a hype - I understood their goal early: earning money off others' work.... so I just jumped off
g0dM@n's Avatar
I'm sure glad I'm on Windows 7 and ditched XP.
Hopefully I can afford to implement Win 7 on all of my PCs and laptops.

GREAT ARTICLE!!
xhrist's Avatar
That's a lot of good info and a lotta hard work. Thumbs up!
Neuromancer's Avatar
Was just reading some of the comments and I can not believe some of the statements made "XP is just fine by the data" and it is "tangibly faster than Vista" etc.

First of all, "Vista is tangibly slower" is a subjective conclusion, which is at odds with your complaints about the article not being "scientific". It has also been known since the OS was in beta that the perceived slowness is a UI effect meant to make the OS seem more appealing.

To deal with the Vista comment. It is faster than XP, the difference is in the UI. The "aero theme" has a 1000ms delay that you can adjust. This will make Vista "tangibly appear faster" than XP. But it gets rid of the nice effects. Way back in the day XP had the same issue. They added a delay to the start menu and tweakers hacked the hell out of that OS to make it a benchable system over 2000.

(subjective) For me Vista boots faster, loads programs faster and runs a lot more solidly than XP does. I DREAD having to work on people's PCs that still use XP. Sad but true. I am even starting to appreciate 7 a bit now that I have forced myself to use it for more than a month. It's still no Vista64 but, it might be. (7-64 would not let me run a ton of software I like so that choice was not an option)


As for the basis of the article: it is quite clear, and would be too hard to read if the graphs started at 0%. I like seeing them start at 0, and oft times when I see a review that does not start there.. I anticipate a biased report. But I can see why they chose to work it how they did. Yes, 5% is small in terms of "desktop readiness", but it is HUGE when talking about benchmarking. A 5% boost in performance could lead to a 50-2000% improvement in boints. (not a typo).

Just saying the article is great. Thanks Gautam for your diligence. I find myself linking to or referring to this article quite a bit