
Culprit found on v1.15 core! You need a new PSU!!


LandShark

Super Shark Moderator
Joined
Aug 13, 2001
Location
Deep Blue Sea (Maryland)
Engineers at NVIDIA (notably Scott LeGrand) have come up with a theory for the EUEs seen in core 1.15 (and a few others in the 1.15 to 1.18 range) on certain hardware. They found that this core had code optimizations that drove the GPU so hard that it drew a lot more electricity (one sign of this was running hotter). In some boxes, this was too much draw, and it led to numerical instabilities. When the same machine was given a beefier power supply, the problem went away.

We've been told that 8800s require 600W power supplies, but we're finding that going even a little bigger (e.g., at least 650W) is important to leave some room for error. We are working to see if there is some way to detect this issue in software, but for now, if you're getting EUEs on the NV GPU client, this is something to consider.

By the way, this will be very important for us to consider future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine w/this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization for cards which can handle it.

We're still looking into this. For now, if you're seeing issues with your card, please consider trying out a bigger power supply. We will continue to look to see if this is indeed the problem and what we can do to help the situation such that the code runs stably on all machines.

PSU?? Well, if someone is using some cheap PSU for GPU folding, then yes, I can accept blaming the PSU. However, when quite a lot of members are using known good brand/quality PSUs and still experiencing the EUE/UM problem with the optimized core versions (1.15~1.18), I can't agree (in other words, I would call it B....S.......)!

you can read more @ the folding forum.
 
Word... I think that's bogus. I'm using a Corsair 650TX on my 8800GTS and still had errors.

However, I'm also using an Antec TruePower II 430W to power two 8800GT 256MB cards. :D Not necessarily a "cheap" PSU, but not the quality or power rating of my other Corsair and PCP&C units... I said that to say this... the failure rate on that system with the Antec wasn't any better or worse than my system using the 650TX.

I, for one, would like some sort of switch on the client if they do indeed go back to the optimizations, so I can turn them off manually... and to be honest, what's the freakin' point of the optimizations anyway... a little faster production? Isn't it pretty damn fast already? It also doesn't sound like they're necessary for the science. Why take the chance of donors producing more errant work and possibly :temper: because the core doesn't work correctly on their more-than-capable systems? I know that's how I feel.

EDIT: What I just posted on FF.org

harlam357 said:
VijayPande said:
Xilikon said:
I believe this is bogus. A quality PSU like an Antec, Seasonic or Corsair can run it well. I ran 2x8800GT with a Corsair VX550W without issues, and right now 2 of my GPU computers have just a 500W PSU without tossing an EUE. If it EUEs even with a 1000W PSU, something else is causing issues, not an insufficient PSU.

Perhaps the issue is in terms of the non-quality PSU's. Anyway, this is an easy enough hypothesis to test and has been verified in the NV labs at least for their boxes which show the problems. For now, we have a software fix (dialing back the optimizations), but this is an issue for us since we'd like to dial them back in.

Agreed... bogus. Sounds like another excuse... if donors are expected to fork over $$$ for a 1000W psu, then you're going to lose a lot of donors. Like others have said, I've had problems with quality Corsair psus... single 12v rail running a single GPU. You can't tell me that one 8800GTS is too much for a 650W Corsair.

What's the point in the optimizations in the first place? A little more speed? Is it necessary for the science? Isn't GPU2 fast as heck already?

I do have a suggestion, that might make everyone happy... if you must put the optimizations back in, I ask that a command line flag be added to turn them on. That way the average Joe Gamer who stumbles onto FAH and decides to run it on his GPU can install it, run it, and be generally unaware of what's going on and still produce useful results. Then the FAH faithful, who know about the optimization flag and wish to run it, can turn it on if they find they have a system that has a 10,000W psu that can handle the load.
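A minimal sketch of what such an opt-in switch could look like (the `-forceopt` flag name and the two code paths are hypothetical illustrations of the suggestion, not anything Pande Group has announced):

```python
import argparse

def run_workunit(optimized: bool) -> str:
    """Pretend core: pick the aggressive or the conservative code path."""
    return "optimized kernel" if optimized else "conservative kernel"

parser = argparse.ArgumentParser(description="GPU2 client sketch")
# Hypothetical opt-in switch; off by default, so Joe Gamer who never reads
# the docs stays on the safe path and still produces useful results.
parser.add_argument("-forceopt", action="store_true",
                    help="enable aggressive shader optimizations "
                         "(may EUE on marginal power supplies)")
args = parser.parse_args([])          # no flag given: defaults to off
print(run_workunit(args.forceopt))    # conservative path by default
```

The point of the default-off design is exactly what the post asks for: the faithful who know about the flag can turn it on, and everyone else never sees the risky path.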

I just ask this - Give Us The Choice!!!
 
yup, I too have a Corsair HX620 running just ONE 8800GT and still getting plenty of EUEs.

to be honest, I really do think the GPU2 client is another battleground for NV & ATI. They both try to one-up each other just like in gaming/benchmarks, and remember how nasty they could get, each accusing the other of cheating over certain "special" drivers for certain benchmarks, or producing artifacts from pushing too hard. I really do think the optimization is the same case: pushed too far, thus less stable. Why optimize at all? Well, it's just like in gaming; NV is treating the GPU2 client as another game whose score it will always try to improve! (Sometimes it works, sometimes it produces artifacts or gets caught "cheating".)

and yes, one can always optimize an app for a certain thing. But as any hardware/software engineer should know, it's broad stability/compatibility that counts!! People complain about a new OS being less stable and less compatible, yet the OS vendor dismisses or ignores it and shifts the blame onto you (your hardware is just too old for my OS). Now Stanford somehow makes me feel the same......

and yes, they said it's "a possibility" as the cause of the problem. But then they're saying you might want to go trash your PSU and buy a new one AND SEE IF IT'LL WORK, while my old trusty PSUs (Corsair/Antec/Silverstone/you_name_it) are PERFECTLY FINE for the job!!!
 
I don't have a single card out of 12 that hasn't had a problem with v1.13 and v1.15 of the core. I have one rig running a 9800GTX on an OCZ GameStream 700W PS that ran a 10% UM/EUE rate on v1.15 with the shaders clocked @ 1728. On v1.19 it has a 1% EUE/UM rate with shaders at 2052. Needless to say, it produces more PPD on v1.19 than on v1.15.

Most of my rigs are running on single rail Corsair PSUs that have about the highest 12v ampacity on the market for a given wattage rating. I don't buy the PS theory either. However, it could be related to the PS/motherboard combo. Perhaps a PS that puts out 11.8v on the 12v rail combined with a mobo that really needs 12.2 v to supply proper PCIe voltage. It just might not have anything to do with available amps, it could be PCIe vdroop or something.
 
9800GT here with 2 DVD drives, 6 hard drives, Couple extra 120mm fans, PCP&C 510 Express. Have not gotten an EUE.
 
I saw that thread on the F@H forum and e-mailed another user here about it. I'm running Seasonic power supplies in all my systems. I can look back at the stats and see for the weeks of 09.21.08 and 09.28.08 my two machines running three 8800GTs finished 434 GPU work units without a single error. There's nothing wrong with my power supplies, the only reason I'm seeing errors now is because they put out a GPU core that isn't stable, and replaced it with another core that isn't stable.

At a certain point I'm just going to start using this machine for what it was built for, playing games. I get blamed for enough things that aren't my fault as it is, I'm not going to be told the $150 power supply I have in this computer is a cheap piece of garbage and if I'd just purchased a quality power supply everything would be fine.

I can see from the new units they are getting ready to put out that they are planning to lower the points for nVidia owners again. I got those on both systems, my two EVGA 8800GT cards finished them, though obviously at a much lower ppd. My XFX 8800GT 256 couldn't handle the new units, and went into an EUE loop. I guess when they switch over I'll be retiring that computer from folding and I'll have to give some thought to whether I'll keep this one running 24/7.
 
I like the optimization flag suggestion H ! :thup:
That's an awesome idea, and I hope they implement it if they decide to use the optimizations again later.
All I really care about is the much more stringent QA they seem to be applying now,
unlike what they released in the past.

But yeah, bah on PSU theory !
I have a couple of pieces of hardware that have a higher rate of UMs & EUEs than others.
I tried another DC project that utilizes CUDA-capable cards on 'em.
No probs runnin' 'em whatsoever, aside from initial set-up on my part. Their WUs are much longer too - 8hrs - 12hrs depending on HW
I can see from the new units they are getting ready to put out that they are planning to lower the points for nVidia owners again. I got those on both systems, my two EVGA 8800GT cards finished them, though obviously at a much lower ppd. My XFX 8800GT 256 couldn't handle the new units, and went into an EUE loop. I guess when they switch over I'll be retiring that computer from folding and I'll have to give some thought to whether I'll keep this one running 24/7.
I don't think it's really about lowering the nVidia PPD intentionally; it's more like a transition to larger WUs, which, sad to say, affects PPD.
Wha?? Did I actually just defend Stanford? Please someone hit me... hard. :eek:

Even before jumping on the GPU2 bandwagon, I anticipated huge PPD adjustments favoring higher-end cards as the project went forward. That's why I went the cheaper route and got mostly 96-shader cards; less power consumption too. It's just to supplement my SMP folding.

The way I see it, even if PPD were cut in half on 8800GS/9600GSO, or any GPU for that matter, it's more or less equal to running a single instance of quad WinSMP - for less than a third of the price of a quad. Oh, and it has lifetime warranty, so F@H can bork it at any time.

I believe RL issues or use of rigs for their primary purposes should always take precedence over F@H, if that is the case.
 
I like the optimization flag suggestion H ! :thup:
That's an awesome idea, and I hope they implement it if they decide to use the optimizations again later.

Thanks! :) I have a good one every once in a while. :)

The idea has gained the support of Xilikon over @ FF.org... so maybe Pande Group will listen. :D If core v1.19 is generally error free for most, I think it's best to stop with this core for v6.20 (and v6.20r1) of the client. Adding that flag will require a new client and core... so folks who continue to run current client versions will stay stable and producing useful results, then others who want the extra optimizations can upgrade their client, get the new core, and turn on the flag. I think that's the best approach at this juncture.
 
Was just thinking that if this were true (i.e., the PSU is the culprit), then for those of us also utilizing the CPU for other tasks (i.e., the SMP client), the EUEs should be noticeably fewer after reducing the load on the CPU (i.e., shutting down the SMP client). At 3.4GHz and 1.35V, a Q6600 G0 uses in the neighborhood of 155 watts at full load (from the 12V rail, which would be the same rail as the GPU's extra plug if the PSU has only one 12V rail). So wouldn't removing the load from the CPU make a huge difference in the GPU client's EUEs, according to this theory? Perhaps the PSUs do have something to do with the EUEs, but I would hazard that very few of the EUEs were PSU-related.
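Rough back-of-envelope numbers for that scenario. The CPU figure is the poster's ~155 W estimate; the GPU draw and the rail rating are my own illustrative guesses, not measurements:

```python
# All figures approximate. CPU load is the poster's estimate;
# GPU load and rail rating are illustrative guesses.
cpu_load_w = 155   # Q6600 G0 @ 3.4 GHz, 1.35 V, full load (12 V rail)
gpu_load_w = 110   # rough 8800GT-class folding draw (guess)
rail_amps  = 40    # single 12 V rail rating on a mid-range PSU (guess)

rail_capacity_w      = 12 * rail_amps
headroom_with_cpu    = rail_capacity_w - (cpu_load_w + gpu_load_w)
headroom_without_cpu = rail_capacity_w - gpu_load_w

print(f"12V rail capacity:          {rail_capacity_w} W")
print(f"headroom with SMP running:  {headroom_with_cpu} W")
print(f"headroom with SMP stopped:  {headroom_without_cpu} W")
```

Under these assumptions, stopping the SMP client frees roughly 155 W on the shared 12 V rail, so if the PSU really were the bottleneck, the EUE rate should drop visibly when SMP is shut down, which is exactly the test the post proposes.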
 
I call BS as well on the PSU theory, for the reasons others have mentioned. I can see it happening on your $20 no-name PSUs, which I think most folders are smart enough to avoid.

Assuming the PSU was the culprit, I would have some choice words with Corsair, as my rigs all use their units...
 
Even though I don't fold any more, I thought this was extremely interesting and read most of the thread that LS linked. And I pretty much feel that blaming the PSU for the majority of the problems is pure BS myself. But I did read a few posters hypothesizing that it might be the actual circuitry on the vid cards themselves, or perhaps the silicon, which sounds much more likely to me. They also pointed out that this generally doesn't seem to be happening much with the 260 and 280 class vid cards, which seems to point to card design or silicon problems with the lower classes of Nvidia cards. Most DC'ers who do DC for any length of time know the importance of good power and good cooling, so they don't run Powmax PSUs and stock cooling solutions in their machines. Sounds like typical "point the finger at anything else" Stanford responses. :rolleyes:
 
Now this has me interested... because I have a 9800 GTX, a GTX 260, and a GTX 280. The only card that's ever had EUE issues is the 9800, and it's run at stock (which granted is a factory OC). The plot thickens...
 
yes, I'm w/ Mudd. After more thinking, I'm leaning towards the circuitry/board design and the silicon itself. I think they are pushing too much, too hard (remember, it draws more power w/ the new optimized core) for circuitry/silicon that was not meant to be run this way. I think from now on they should validate/test their silicon/chips by running the GPU2 client, to make sure their chips can handle it properly!!

btw, just to see whether I'm too biased/blind/confused/mad or NV/Stanford really are full of it, I tried to run a single 8800GT + a stock-clocked E6600 + 1 HDD + 2 sticks of memory off an Antec 850 Quattro. Guess what, it's still producing EUEs/UMs at about the same rate!!! Maybe some of you could point me to where I can find a 2000W PSU so I can run my tri & quad folders reliably??!!

If NV admits it and rewrites their web site to state that in order to run their GPUs reliably on everything, you must have an 800W+ PSU even for a mid-low range 8800GT card, then I'll believe it's the PSU......
 
It's a stepping stone; NV/AMD will never admit to it being their chip/silicon. It is almost like there is no "cure" to be found anymore, but more like Pande is pushing this as the new "benchmark". However, it now appears we need a 10K watt power supply to even offer a GPU contribution :p
 
Sounds like typical "point the finger at anything else" Stanford responses. :rolleyes:

Sorry you feel that way. New information from Scott LeGrand, today, for those that may have missed it;

slegrand said:
Here are the current facts:

1. Something very odd is up with some and I do mean *some* G8x/G9x chips.
2. This problem wasn't evident until recently or the NVIDIA client would never have made it out the door, but sometime recently, like a harmonic convergence so to speak, a subset of G8x/G9x chips started having random failures. It may be a hardware issue, but it seems to be caused by some sort of software change. I'd guess something is messing up some sort of timing on the chip, but that's just a guess.
3. Some chips stop exhibiting this problem with a beefy enough power supply.
4. Some don't, but they all do it less often.
5. For whatever reason, it doesn't happen on GTX260/280 - I've had a GTX260 running F@H for the past 2 months straight without a single instance of this.
6. Reproing this bug takes anywhere from 40 minutes to 8 hours of computation, so fixing it is going awfully slowly; 40 minutes was the norm for an underpowered system, and 4-8 hours is the current norm now that I've addressed that.

Keep in mind that GPUs currently do not have ECC memory. But, in graphics, if a memory error occurs, the write target is defined by the hardware itself as a specific pixel in the framebuffer or a render target, and all inputs are done in terms of texture coordinates. This constrains the reads and the writes to stay in reasonable areas of memory and limits the worst-case scenario to a corrupted pixel.

In contrast, in Folding@Home, naked memory pointers are used both for reads and writes. When a memory error occurs, this can lead to an invalid read or write of random memory. When this happens, a kernel for the GPU fails. This is what is happening here. Memory errors are almost guaranteed to occur if there is insufficient power for the GPU. But, as I just said, when it's in graphics, the worst you're likely to see is a corrupt pixel for a single frame (obviously one can come up with more bizarre failure scenarios, but this is the lion's share of them).

Alternatively, if an atom coordinate is misread from memory, it can cause the forces to shoot off to the moon, and that leads to a cascade of NaNs, which is the other EUE failure scenario here.

I'm now seeing it repro with an 800W power supply and a 9800GTX. But the frequency of reproduction is much lower than with the 460W power supply with which I initially did so.

I can force a fix of this in the same way that I once fixed a bug on the Atari Jaguar, by reading memory twice and then comparing the values, but that's a kludge that merely reduces the frequency of memory errors by a factor of 1e9 or so, and since the 9800s were all working just fine a month or two ago, there's a root cause, and it really points to being a software failure.

So I'm going to end on a bright note - we can repro this, that means we can fix it. Getting to this stage was the hardest part.
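The NaN cascade described above is easy to demonstrate on the CPU: corrupt a single coordinate and the bad value propagates through every force that touches it, because any arithmetic involving NaN yields NaN. A toy 1-D pairwise-force loop of my own making, not F@H code:

```python
import math

def pairwise_forces(coords):
    """Toy 1-D inverse-square forces between every pair of 'atoms'."""
    n = len(coords)
    forces = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = coords[i] - coords[j]
                # NaN in either coordinate makes dx NaN, which makes
                # this contribution NaN, which poisons the whole sum.
                forces[i] += dx / (abs(dx) ** 3)
    return forces

atoms = [0.0, 1.0, 2.5, 4.0]
print(pairwise_forces(atoms))        # all forces finite

atoms[2] = float("nan")              # one misread coordinate...
print(pairwise_forces(atoms))        # ...and every force is now NaN
```

Every atom interacts with the corrupted one, so a single bad memory read wrecks the entire force array in one step, which is why the core has no choice but to flag the work unit as an EUE.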
 
I had a 780i, two HDs in RAID, and (3) 8800GTs running off of the same PCP&C 510 Express that I'm using now.

Good to see that they're admitting it's not JUST the PSU issue, though.
 