F@H Stability error
Zerix01
05-15-07, 07:27 PM
I was looking through my F@H history and came across this.
[05:29:36] Completed 2240000 out of 4000000 steps (56%)
[06:00:53] Writing local files
[06:00:53] Completed 2280000 out of 4000000 steps (57%)
[06:32:06] Writing local files
[06:32:06] Completed 2320000 out of 4000000 steps (58%)
[06:40:31] Quit 101 - Fatal error: NaN detected: (ener[13])
[06:40:31]
[06:40:31] Simulation instability has been encountered. The run has entered a
[06:40:31] state from which no further progress can be made.
[06:40:31] This may be the correct result of the simulation, however if you
[06:40:31] often see other project units terminating early like this
[06:40:31] too, you may wish to check the stability of your computer (issues
[06:40:31] such as high temperature, overclocking, etc.).
[06:40:31] Going to send back what have done.
[06:40:31] logfile size: 55538
[06:40:31] - Writing 56101 bytes of core data to disk...
[06:40:31] ... Done.
[06:40:31]
[06:40:31] Folding@home Core Shutdown: EARLY_UNIT_END
[06:40:31] CoreStatus = 72 (114)
[06:40:31] Sending work to server
[06:40:31] + Attempting to send results
[06:40:32] + Results successfully sent
[06:40:32] Thank you for your contribution to Folding@Home.
[06:40:32] - Preparing to get new work unit...
[06:40:32] + Attempting to get work packet
[06:40:32] - Connecting to assignment server
[06:40:33] - Successful: assigned to (171.65.103.160).
[06:40:33] + News From Folding@Home: Welcome to Folding@Home
[06:40:33] Loaded queue successfully.
[06:40:35] + Closed connections
After that it fetched a new WU and has been running fine since (now at 33%). I am wondering whether this is WU related or a system stability issue; I have only had this system running for a few months. Also, is there any way to monitor for this on all WUs? Is there a log file that stores errors like this?
I run two single-core Linux clients on my Athlon X2 3600 (AMD retail cooling, no OC). CPU temps reported by lm-sensors and the K8 integrated sensor were 42 C on core0 and 36 C on core1. The temperature in my apartment was up to 80 F today, so I guess that would be about the ambient temp inside the case. I think these are the highest temps I've seen on this CPU.
ihrsetrdr
05-15-07, 08:07 PM
Those temps don't look bad at all to me.
You run 2 regular units? You could get muuuch better PPD by running SMP for Linux-64 Beta (http://forum.folding-community.org/forum57,smp-for-linux-64-beta.html)
;)
AFAIK, if it only happens once in a blue moon it's probably just that particular WU r/c/g, but if you're getting them every other WU then it's your machine.
Zerix01
05-15-07, 10:17 PM
Those temps don't look bad at all to me.
You run 2 regular units? You could get muuuch better PPD by running SMP for Linux-64 Beta (http://forum.folding-community.org/forum57,smp-for-linux-64-beta.html)
;)
I'm still running a 32-bit version of Kubuntu. Also, the SMP client is made for 4 cores and is inefficient on 2 cores. The WUs should finish faster using this method. As for PPD, eh, whatever; I really only look at how many units have been done (habits from my SETI@Home days). If this gets more units out faster then it should be more science worthy.
Thanks for the feedback.
Zerix, you're being given good FAH advice here. This is not Seti. Exactly why the Linux 64 bit SMP client is better for FAH, I can't tell you, but if it wasn't in the best interests of FAH, Vijay would not be giving it the bonus points that he's giving it, for sure.
All work units are definitely NOT created equal or of equal value to the science. Further, what IS of greatest value to the science, will change as research goals, etc., change.
Your error message is a classic of a WU that, given its starting position and energies, cannot continue. These happen once in a while, and you don't need to fret over them. If they happen twice in a week, with WUs from different project numbers, then it's likely a stability issue with the rig.
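That rule of thumb can be written down as a rough check. This is only a sketch: the function name, the 7-day window parameter, and the project numbers in the example are made up for illustration, and you would have to collect the dates and project numbers of your early unit ends yourself.

```python
from datetime import date, timedelta


def looks_like_rig_problem(eues, window_days=7):
    """eues: list of (date, project_number), one entry per early unit end.

    Returns True when two EUEs from different projects fall within
    window_days of each other (the "twice in a week" rule of thumb).
    """
    ordered = sorted(eues)  # sort by date so later entries are never earlier
    for i, (day_a, proj_a) in enumerate(ordered):
        for day_b, proj_b in ordered[i + 1:]:
            if day_b - day_a > timedelta(days=window_days):
                break  # everything after this is even further away
            if proj_a != proj_b:
                return True
    return False


# Hypothetical history: two EUEs three days apart, on different projects.
history = [(date(2007, 5, 15), 2124), (date(2007, 5, 18), 3040)]
print(looks_like_rig_problem(history))  # True
```

A single EUE, or repeats on the same project number, returns False, which matches the "probably just that WU" reading.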
ihrsetrdr
05-16-07, 02:23 AM
I'm still running a 32-bit version of Kubuntu. Also, the SMP client is made for 4 cores and is inefficient on 2 cores. The WUs should finish faster using this method. As for PPD, eh, whatever; I really only look at how many units have been done (habits from my SETI@Home days). If this gets more units out faster then it should be more science worthy.
Thanks for the feedback.
I run 64-bit Kubuntu 6.10 and 6.06 LTS and fold SMP work units, which return a lot quicker than the "regular" WUs; the science gets done quicker and the points (which teams thrive on) are so much greater.
One caveat: deadlines for SMP work units are way shorter, 2-4 days max, which means you would need to keep your machine(s) running 24/7. For those unable to do so, the regular WUs would be better suited.
An odd feature of SMP folding: even though 64-bit hardware and OS (Linux) are necessary, the client is 32-bit, which means you have to apt-get install ia32-libs in order to fold....:shrug:
There has been some talk (very little) about releasing an SMP client for 32-bit Linux; not enough tho' to save my old Dell dual Xeon from the auction block. :eh?:
Zerix01
05-16-07, 02:28 AM
Your error message is a classic of a WU that, given its starting position and energies, cannot continue. These happen once in a while, and you don't need to fret over them. If they happen twice in a week, with WUs from different project numbers, then it's likely a stability issue with the rig.
Well, like I mentioned before, this system is new-ish and this is the hottest it's been around here since the system was built, so it just all seemed to tie together. But I don't normally check my WU status, so this could very well have happened before without my noticing. I guess I'll be checking the logs daily for a while.
Zerix, you're being given good FAH advice here. This is not Seti. Exactly why the Linux 64 bit SMP client is better for FAH, I can't tell you, but if it wasn't in the best interests of FAH, Vijay would not be giving it the bonus points that he's giving it, for sure.
I'm not sure why the bonus points are awarded, but I know the single-core and SMP clients can process the same type of data (just packaged differently). From what I have read all over their forums, the SMP client was made for four processors and will run inefficiently on dual or dual quad systems; I'm still looking into why. I know people are running the SMP client on dual-core systems similar to mine and making the deadline, but it still sounds more worthwhile, clock for clock, to run two standard clients side by side.
For the C2Ds (Conroe), you can hardly consider production approaching 75% of a C2Q at the same GHz to be inefficient. However, your AMD X2, with 256 KB of cache per core, isn't going to be a good SMP folder and will likely make more points on bonus uniprocessor WUs. The uniprocessor client is pretty much trouble free as well. But when the bonus WUs run out, points fall 60 to 85%.
WU count isn't an indication of the work being done in FAH because the WUs are so different. Some WUs will complete in hours on your rig while others will take more than a week.
Zerix01
05-16-07, 08:49 AM
For the C2Ds (Conroe), you can hardly consider production approaching 75% of a C2Q at the same GHz to be inefficient. However, your AMD X2, with 256 KB of cache per core, isn't going to be a good SMP folder and will likely make more points on bonus uniprocessor WUs. The uniprocessor client is pretty much trouble free as well. But when the bonus WUs run out, points fall 60 to 85%.
X2 3600 Brisbane has 512KB of cache per core. Any idea of why the SMP client needs so much more cache than the regular client?
WU count isn't an indication of the work being done in FAH because the WUs are so different. Some WUs will complete in hours on your rig while others will take more than a week.
My last system (and only folder) was an Athlon XP 1600; I'm just glad to see my WUs churning out every day or so, times two :clap: . How do I see the PPD for each WU? On my stats page all I see is the total points.
There's no way to know what an X2 has from the model number. Even with 512 KB/core it isn't going to be a very efficient SMP folder, but it will probably make the deadlines running 24/7.
The SMP client runs 4 threads, so it stands to reason it would need at least 4 times the cache of the uniprocessor client.
Most of us use a third party monitoring program, like FAHmon (http://fahmon.silent-blade.org/index.php?n=Main.Download) and keep track of standings at folding.extremeoverclocking.com (http://folding.extremeoverclocking.com/user_summary.php?s=&u=238077) (link is to my cpu contest user name for sig rig #4).
the SMP client was made for four processors and will run inefficiently on dual or quad systems,
Quad cores ARE four processors! :D
They may share level 2 cache, etc., sure.
Zerix01
05-16-07, 08:41 PM
Quad cores ARE four processors! :D
LOL yeah, I was typing that before a meeting and my boss walked in, so I wasn't able to look over what I typed. I meant a dual quad-core system (eight processors). But then why not run two SMP clients? I'm feeling like I have been reading some bad information.
thideras
05-16-07, 08:46 PM
OP:
The EARLY_UNIT_END error is returned whenever a WU dies with a known error that can provide useful results. All this error code signifies is that the results can be returned even though the MD calculations failed. The points received are proportional to the work done. The causes can be machine related (see link below). It is also possible that several machines EUE on it at the same point, but that it is completed (100%) by yet another.
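As a worked example of "proportional to the work done", here is a minimal sketch that assumes the credit is strictly linear in completed steps. The function name and the 100-point full value are made up for illustration; the step counts are the ones from the OP's log.

```python
def partial_credit(full_points, steps_done, total_steps):
    """Credit for an early-ended WU, assuming strictly linear scaling."""
    return full_points * steps_done / total_steps


# The OP's WU stopped at 2320000 of 4000000 steps (58%).
# The 100-point full value is hypothetical.
print(partial_credit(100, 2320000, 4000000))  # 58.0
```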
Zerix01
05-17-07, 02:25 AM
I keep forgetting everything has a wiki now :D . Thanks for the info. I also dug this up, which more or less confirms that as long as I don't see them too much, it should be project related, not computer related.
As noted above, there are several causes for WU to EUE. All true EUEs result in the core exiting with a Core status of 72 (114). Other abnormal exits characterised by different core statuses are not EUEs, but rather the symptom of another problem.
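Since every true EUE ends with that core status line, one way to keep an eye on it is to scan the client log for it. A minimal sketch follows, assuming the log lines look like the excerpt in the first post; the function name is made up, and where your client keeps its log file is something you would need to check yourself.

```python
import re

# Patterns based on the log excerpt posted at the top of this thread.
EUE_STATUS = re.compile(r"CoreStatus = 72 \(114\)")
TIMESTAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")


def find_eues(lines):
    """Return the timestamp of every line reporting CoreStatus = 72 (114).

    A line without a leading [hh:mm:ss] stamp is reported as None.
    """
    hits = []
    for line in lines:
        if EUE_STATUS.search(line):
            stamp = TIMESTAMP.match(line)
            hits.append(stamp.group(1) if stamp else None)
    return hits


sample = [
    "[06:40:31] Quit 101 - Fatal error: NaN detected: (ener[13])",
    "[06:40:31] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[06:40:31] CoreStatus = 72 (114)",
    "[06:40:31] Sending work to server",
]
print(find_eues(sample))  # ['06:40:31']
```

To run it over a real log, pass an open file instead of the sample list, e.g. `with open(path) as f: find_eues(f)`, where `path` is wherever your client writes its log. Other core statuses are deliberately ignored, since per the quote above they are not EUEs.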