
Strange Issue


Wicked_Pixie

Member
Joined
Jan 25, 2008
Location
Los Angeles, CA
On one of my rigs, a CUDA WU would start, pause at 'waiting to run', and move on to a new WU.
I now have about a dozen WUs 'waiting to run' at various stages of completion. When it resumes a WU that is 99% complete, I get an error - display driver stopped working - and then it picks up a new WU.

Not all WUs that resume from '99% done' end up as 'computational error', but the errors are increasing in frequency.

Never did that before until last night...

Using 190.62 drivers w/ CUDA 2.3 drop-in files and optimised unified installer.
GPU is GTX295, if that makes any difference.

Anyone had this problem?
Should I re-install driver or go back to 190.38??
Anyone using BOINC 6.10.4? Would that be better?
 
I just updated my main system to the drivers and optimized clients that you are running, Pixie, so I will keep an eye out for this myself (the computational error part) with my GTX260. The other part - quitting a WU in the middle of processing and going to another - is part of how the BOINC client prioritizes WUs. If it sees a WU that it classifies as needing immediate processing, it will shut down a WU with a longer due date and work on the more immediate, high-priority WU. Personally, I think the BOINC client does that too often. This usually happens when you receive a fresh load of work units and some have short due dates on them.
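For anyone curious what that deadline-driven preemption looks like, here's a toy sketch in Python. The names and the slack rule are made up for illustration; BOINC's real scheduler is considerably more involved:

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    name: str
    deadline: float   # report deadline, seconds from now
    remaining: float  # estimated compute time left, seconds

def needs_edf(wu: WorkUnit) -> bool:
    """A WU goes 'high priority' when its estimated remaining time
    leaves no slack before its report deadline."""
    return wu.remaining >= wu.deadline

def pick_next(queue: list[WorkUnit]) -> WorkUnit:
    # Any deadline-pressed WU preempts the rest, earliest deadline first;
    # otherwise just keep crunching the first queued unit.
    urgent = [wu for wu in queue if needs_edf(wu)]
    if urgent:
        return min(urgent, key=lambda wu: wu.deadline)
    return queue[0]

queue = [
    WorkUnit("long_wu", deadline=864000, remaining=7200),  # due in 10 days
    WorkUnit("short_wu", deadline=3600, remaining=5400),   # due in 1 hour, can't make it
]
print(pick_next(queue).name)  # -> short_wu (preempts long_wu)
```

If the deadline estimates jump around - as they do with optimized apps the server doesn't know about - a different WU can look "urgent" every time this runs, which matches the random-looking pausing described above.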
 
Thanks for the reply MD!
Well, I thought it was prioritising due dates too. But when I looked at the dates, it was totally random.
According to these threads here and here, it seems to be a BOINC issue - EDF specifically (I don't know what that means).


I'm now running 6.10.4. Same issue as the last version, but at least I hardly ever get the display driver quitting on me now.

However, I found out accidentally that if I disable SLI in the NV control panel, it completes the WU very quickly (under 4 min.). BOINC only sees one GPU core, though. Is it possible that both GPU cores were running the WU at the same time? I dunno.
But if I can confirm later that the results are good, I may just disable SLI entirely.
Anyone with SLI cards care to try this for testing?
 

I don't run SLI, but I am pretty sure you can get both cards to crunch different work units with SLI disabled... don't know what you have to do... curse of only having 1 PCI-E slot on the mobo.
 
That version you are running must be a beta build. I just d/l'ed the latest official build this morning and it's 6.6.36. That might be part of your problem with the errors. I've run into that in the past with the beta clients, so I tend to stick with the official builds now.
 
If I remember right (and I don't run SLI so this is something dredged from memory) you should set your cards to run independently, not in SLI. Can't remember how to get BOINC to see the other card(s) though. I'll try to find the thread where that was discussed - it's been awhile.


Did you reboot after you disabled SLI ...?


Edit:
Success! Know anything about "dummy plugs"?
Yes, you will need the dummy plug on the second GPU in order for Windows to allow you to enable the ghost display so BOINC will see it.

Though one "trick" I noticed was that if you enable PhysX (no SLI), you can enable the second monitor with no dummy plug... not sure why that works, but I've been unable to do that under Vista and W7.
http://www.ocforums.com/showpost.php?p=6092361&postcount=6
http://www.ocforums.com/showthread.php?t=605690
 

This is a very common issue, actually. There was a thread on it at the SETI forums, but I can't find it. I did find the solution thread, though.

FIX

The problem is SETI doesn't account for the speed boost of the optimized apps and therefore the predicted end times are not valid, so they jump around all the time. By manually adding the flop count for the CPU and GPU, you somewhat mitigate the problem. I've applied the flops correction myself and haven't had units paused in the middle for a long time now.

The only part I would change is:

8. For each of the apps, multiply the p_fpops value by the factor below and put this into the appropriate flops entry in the app_info given below. For Multibeam 608 you need the estimated Gflops. The app_info given below has the values for a GTS250.
Application = calculation:
Astropulse 503 = p_fpops x 2.6
Astropulse 505 = p_fpops x 2.6
Multibeam 603 = p_fpops x 1.75
Multibeam 608 = Est. Gflops x 0.2

If you are using the latest unified installer, change the Est. Gflops factor to 0.3, and if you are using the 2.3 DLL drop-in files, use 0.4 or even 0.5 to account for the improved performance.
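As a sanity check, the table above boils down to a few lines of Python. The p_fpops and Gflops inputs here are placeholder examples, not measured values; substitute your own figures from client_state.xml and the BOINC Messages tab:

```python
def flops_entry(benchmark_flops: float, factor: float) -> int:
    """<flops> value for app_info.xml: a benchmark figure (in flops)
    times the correction factor for the optimized app."""
    return int(round(benchmark_flops * factor))

# Placeholder inputs: p_fpops comes from client_state.xml (CPU Whetstone),
# the GPU figure from the "est. ...GFlops" line in BOINC's Messages tab.
p_fpops = 3.593e9
gpu_gflops = 106

print(flops_entry(p_fpops, 2.6))           # Astropulse 503/505
print(flops_entry(p_fpops, 1.75))          # Multibeam 603
print(flops_entry(gpu_gflops * 1e9, 0.5))  # Multibeam 608, 2.3 drop-ins
```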
 

Thanks for the tip and link, duner !!! :)

So, I looked at my app_info and this is what it looks like.


Code:
<app_version>
    <app_name>astropulse_v505</app_name>
    <version_num>505</version_num>
    <file_ref>
        <file_name>ap_5.05r168_SSE3.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
<app>
    <name>setiathome_enhanced</name>
</app>
<file_info>
    <name>MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe</name>
    <executable/>
</file_info>
<file_info>
    <name>cudart.dll</name>
    <executable/>
</file_info>
<file_info>
    <name>cufft.dll</name>
    <executable/>
</file_info>
<file_info>
    <name>libfftw3f-3-1-1a_upx.dll</name>
    <executable/>
</file_info>
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <plan_class>cuda</plan_class>
    <avg_ncpus>0.040000</avg_ncpus>
    <max_ncpus>0.040000</max_ncpus>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <file_name>MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>cudart.dll</file_name>
    </file_ref>
    <file_ref>
        <file_name>cufft.dll</file_name>
    </file_ref>
    <file_ref>
        <file_name>libfftw3f-3-1-1a_upx.dll</file_name>
    </file_ref>
</app_version>
</app_info>

Now, I searched for the client_state.xml file (to look up the p_fpops figure) and cannot find it anywhere.
I looked at the BOINC files in Program Files and the AppData folders. :confused:


The flops for the GTX295 should be easy to figure out. But since math is not my strong suit, could someone see if this is correct...
106 Gflops = 106000000000 flops; 106000000000 flops x 0.5 = 53000000000 flops


So, I need to figure out what my CPU flops are.
According to this post, you can estimate it by taking the floating point MIPS (Whetstone) figure and adding six zeroes behind it.

For 3593 floating point MIPS (Whetstone) per CPU, <flops>3593e+6</flops> is the simplest. 3.593e+9 or 3593000000 are other forms you could use.
Joe
Based on that example, is that computation correct?

Duner, if it is not too much trouble, would it be alright for you to post or PM me your app_info? I just want to compare exactly where I insert them. I am afraid of borking that file. I'm in no hurry, though, since I am finishing up all my queued WUs before editing the app_info file.

Thanks again for your help. :beer:
 

I was only running SLI 'cos I thought the current drivers solved that issue for DC projects. I know other CUDA projects where it is OK; SETI might be the exception. But it did work flawlessly in SLI for a few days...

They are running independently now.
Oh, and I forgot to reboot after disabling SLI.
My bad. :eek::eek:
 
I don't know any of this first-hand, just trying to help where I can while trying not to make things worse. Seems you have more of a handle on it than me - I didn't know the newer drivers were supposed to correct that issue ...
 

Code:
<app_version>
    <app_name>astropulse_v505</app_name>
    <version_num>505</version_num>
    <file_ref>
        <file_name>ap_5.05r168_SSE3.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
<app>
    <name>setiathome_enhanced</name>
</app>
<file_info>
    <name>MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe</name>
    <executable/>
</file_info>
<file_info>
    <name>cudart.dll</name>
    <executable/>
</file_info>
<file_info>
    <name>cufft.dll</name>
    <executable/>
</file_info>
<file_info>
    <name>libfftw3f-3-1-1a_upx.dll</name>
    <executable/>
</file_info>
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <plan_class>cuda</plan_class>
    <avg_ncpus>0.040000</avg_ncpus>
    <max_ncpus>0.040000</max_ncpus>
    <flops>xxxxxxxxxxx.x</flops>    <!-- highlighted: this is the line to insert -->
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <file_name>MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>cudart.dll</file_name>
    </file_ref>
    <file_ref>
        <file_name>cufft.dll</file_name>
    </file_ref>
    <file_ref>
        <file_name>libfftw3f-3-1-1a_upx.dll</file_name>
    </file_ref>
</app_version>
</app_info>

I highlighted the line in your app_info file that needs to be inserted. You'll need to use data provided by BOINC to calculate the number to be used:

SETI@Home Forums said:
4. Browse the BOINC log file to get the estimated speed of your GPU (or before you shut BOINC down, click on the messages tab). This is usually given at the top and is in Gflops. Some estimates from my testing are:
a) 9800GT = 60Gflops
b) GTS250 = 84Gflops
c) GTX260 (216 sp) = 96Gflops

The line in my BOINC Messages tab looks like
"CUDA Device: 8400GS ... (... est. 4GFlops)"

and is about 10-12 lines down from the top.

NOTE: The calculation given did not work very well for my cheapie card - I had to get there by trial and error. If I were you I'd double-check the estimated completion time shown in BOINC after doing this ... ;)
 
I put the flop count right under the

<version_num>608</version_num>
<flops>xxxxxxxxxx</flops>

Also, the client_state.xml file is hidden by default.
Set your preferences to show hidden files, then search for the file by name. That way you can get the proper p_fpops value.
 
Before I took my main rig down, I was running it with SLI on and still crunching on all 3 GPUs. I was using a dummy plug on the two that were not in use; that may have been avoidable as well, but I didn't check. So the newer drivers seem to account for SLI.
 
Here is another strange issue with CUDA crunching that I came upon. For the race, I really OC'd my 8600GTS from the 702 MHz core in my sig (normal OC) to ~775 MHz core, with a step at ~740 MHz in between. Well, the 740 MHz step produced a nice boost in RAC; however, when I pushed the card to 775 MHz, I started getting worse RAC and WU completion times than at the 702 MHz core. So it looks like GPUs have a sweet spot for number crunching.
 
While we're at it, does BOINC/SETI benefit more from GPU core/memory speed or (like Folding) from shader speed?

No idea... I don't think anyone has really tested it... CUDA crunching is still in the early phase, as it was just released a couple of months ago.
 
Success!!

Thanks everyone!! :beer:

After the scavenger hunt for XML files and some editing, no more pre-empted CUDA WUs, or whatever it is they call it.


Oh, and.... Duner rawks !! :attn:
 