Farm production down.....File-IO-Errors
Papsomax
08-21-04, 11:20 AM
Well, I'm not sure what the heck is going on with my plexi farm. I am still using the OcixLTSP version that runs FAH4.exe with the -forceSSE flag, and for the last 3 days my clients have been getting checksum errors and/or File-IO-Errors, predominantly on Core78 proteins. Core65 seems to be OK. AND the clients that are getting these errors are (a) not oc'd and/or (b) have been stable for months.
Also, has Core79 changed? I have an AXP 2400 that is now working on a Core79. I thought Core79 was assigned to SSE2 processors only?
I've taken the farm down, restarted the server, downloaded new cores and FAH4, and I'm still getting these blasted errors! Since I am booting off the CD, I can't change the flags or go to FAH5 unless I do an hdd install, and when I do the hdd install and restart the terminal server, for some reason the clients won't boot. I get an error message that says knoppix boot cd not found???
Figures, I get into the Top 10 Team 32 folders and this crap happens... grrrrrrrrrr :bang head: :mad:
Any suggestions? My boxen farms have all been changed over to FAH5 and I am not having any problems with them.
Paps
overdoze
08-21-04, 11:32 AM
Try downloading the latest overclockix_LTSP and see if that helps.
overdoze
08-21-04, 11:50 AM
It could also be that the home directory where the FAH folding data is stored is full. You can check it with the command
df /home/knoppix
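A rough sketch of turning that check into a pass/fail test, assuming POSIX-style `df -P` output. The 90% limit is an arbitrary example value, and `usage_pct` and `check_home` are hypothetical helper names, not part of overclockix_LTSP:

```shell
#!/bin/sh
# usage_pct: read `df -P` output on stdin and print the capacity column
# of the first filesystem line as a bare number (e.g. 66).
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_home DIR LIMIT: warn when DIR's filesystem is at or above LIMIT percent.
check_home() {
    dir=$1
    limit=${2:-90}   # 90 is an arbitrary example threshold
    pct=$(df -P "$dir" | usage_pct)
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $dir is ${pct}% full"
        return 1
    fi
    echo "OK: $dir is ${pct}% full"
}

# Example: check_home /home/knoppix 90
```

Keeping the parsing in its own function means it can also be fed `df -P` output collected from a client over ssh.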
Papsomax
08-21-04, 11:58 AM
The version I am using is dated 06/22/04
df /home/knoppix = 66% used
paps
Also, what's with Core79? An AXP 2400 folding with Core79? Interesting!
overdoze
08-21-04, 12:13 PM
Hm, sounds like you have the latest version and home is not full. On the latest version you can control the folding flags by putting the flag(s) inside a text file called options and placing it in the first client folding directory of each layer, such as server, 240, 239, etc. If you install to hard drive, one thing I've noticed is that I have to configure LTSP every time I boot up the server. Hopefully I can get that fixed in the next release.
Core79 is Dgromacs, which helps production if you have an SSE2-enabled CPU.
Papsomax
08-21-04, 01:08 PM
An example of the Options text would be?
Ya I know Core79 is a dgrome, but the processor that it has been assigned to is client 234 which is an AXP 2400! Go figure!
overdoze
08-21-04, 02:27 PM
-forceSSE
The above line is the entire contents of the options file.
This is exactly what runs by default, however. I might use the following flags just for kicks:
-forceasm -forceSSE -advmethods
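As a hedged sketch of creating that file: the snippet below writes a flags line into a file called options inside a given client directory. The directory is a placeholder, since the exact layout of the server, 240, 239 ... folders is not spelled out here, and `write_options` is a hypothetical helper name.

```shell
#!/bin/sh
# write_options DIR FLAGS...: put the given client flags on a single
# line in DIR/options, replacing any previous contents.
write_options() {
    dir=$1
    shift
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    printf '%s\n' "$*" > "$dir/options"
}

# Example (CLIENT_DIR is a placeholder for the first client folding
# directory of a layer):
# write_options "$CLIENT_DIR" -forceasm -forceSSE -advmethods
```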
I'm getting Dgromacs on AMD processors once in a while. I'm still getting a production boost even when it is running on an SSE-only processor such as an AMD Athlon.
PS: How many clients do you have right now? You could also increase the checkpoint interval to 10-15 minutes rather than 6 to decrease the frequency of writes from all clients, since I suspect the server might be getting overloaded. The option to change is inside your client.cfg file, for example:
checkpoint=12
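A minimal sketch of making that change from a script, assuming client.cfg already contains a line starting with `checkpoint=`. `set_checkpoint` is a hypothetical helper name; it only rewrites an existing line and refuses to guess where a missing one should go. It is probably safest to stop the folding client before editing its config.

```shell
#!/bin/sh
# set_checkpoint FILE MINUTES: rewrite an existing "checkpoint=" line in
# FILE to the new interval. Fails if FILE has no such line.
set_checkpoint() {
    cfg=$1
    minutes=$2
    grep -q '^checkpoint=' "$cfg" || {
        echo "no checkpoint= line in $cfg" >&2
        return 1
    }
    tmp="$cfg.tmp.$$"
    # Write to a temporary file first so a failed sed never truncates the config.
    sed "s/^checkpoint=.*/checkpoint=$minutes/" "$cfg" > "$tmp" && mv "$tmp" "$cfg"
}

# Example: set_checkpoint client.cfg 12
```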
Arkaine23
08-22-04, 03:44 PM
Hmmm, something like this could signal a runaway core on the boxes exhibiting the issue. I.E. there are 2 clients on a box, but 3 cores are running. The extra core is overwriting the real core for one of the clients and thus you get an I/O error when folding checks itself.
This can happen if you kill folding, but the command/script that kills it is not enough to kill the core (often the cores come back to life).
So the solution is to reboot the whole farm, or to SSH around and count the number of cores- manually kill folding clients and cores on machines that have too many cores running, then restart the foldclient init script.
These are examples and I might have them a little bit off, since overdoze set up the LTSP version. Feel free to correct me....
as root-
ssh IP_address top -b -n 1 (tells you what's running on a client machine; batch mode prints one snapshot, since plain top needs a terminal)
ssh IP_address killall -9 FahCore_78.exe (to manually kill all cores of a certain type with the strongest kill command available)
ssh IP_address killall -15 FAH4Console-Linux.exe
(to kill all folding clients and cores on a node. Note: not sure if overdoze renamed the folding client to something else. Also note, the cores might come back to life a few seconds after issuing this command.)
ssh IP_address /etc/init.d/foldclient (to restart the foldclient script. Note, only do this on a client that is not already running folding)
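To avoid eyeballing top on every node, the core count could be scripted. This is only a sketch: it assumes the cores show up in the process list under names starting with FahCore_ (as in the killall example above), that root can ssh to each node, and that 2 clients per box is the expected number. `count_cores` and `check_node` are hypothetical helper names.

```shell
#!/bin/sh
# count_cores: read a process list on stdin (one command name per line)
# and print how many entries start with FahCore_.
count_cores() {
    awk '/^FahCore_/ { n++ } END { print n + 0 }'
}

# check_node HOST EXPECTED: compare the number of running cores on HOST
# with the number of folding clients expected there (default 2).
check_node() {
    host=$1
    expected=${2:-2}
    # -n keeps ssh from swallowing stdin when this is called inside a loop.
    n=$(ssh -n "$host" ps -e -o comm= | count_cores)
    if [ "$n" -gt "$expected" ]; then
        echo "$host: $n cores running, expected $expected (possible runaway core)"
        return 1
    fi
    echo "$host: $n cores running"
}

# Example: check_node IP_address 2
```

A node that reports more cores than clients is the one to clean up by hand with the kill commands above.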
overdoze
08-22-04, 08:08 PM
I have noticed that some large WUs have been released. By large I mean the memory footprint is big enough to take over the available memory. Since we are folding 2 instances on each client, I'm afraid that is what causes the error. The only way to figure this out is to add more memory to a client and try again. All the Tinker WUs are still taking only 10M of RAM, whereas some Gromacs cores are taking a whopping 40M of RAM per instance.
I'm working on the next overclockix_LTSP right now. The list of changes will be:
Based on the Knoppix 3.4 release
Only runs when installed to hard drive
Folds only one instance per client by default (to conserve RAM)
Folds using wine to emulate Windows (better points production on Tinker cores). This requires X to run on each client
User has the ability to upgrade the FAHConsole as well as control the flags
Will keep you guys posted.
Arkaine23
08-22-04, 09:06 PM
Overdoze, you can configure wine to use the tty driver instead of X and run wine folding in text mode. You can find the info in discussion of the finstall folding installer on folding-community forums..... That might help conserve some ram.
overdoze
08-23-04, 01:29 PM
Thanks I found it. This should help quite a bit.
I tested on Tinker P639 and I'm getting a 25% boost in production using wine on this WU. Really strange, but on most Gromacs WUs I get no production boost. On most Tinker WUs I get some boost. The larger the Tinker WU, the faster it is using wine.
Papsomax
08-23-04, 04:01 PM
It only involves gromes; tinkers are fine, and it's not every client every day. There might be 1 to 3 different clients a day that have either the checksum error or the File-IO error. E.g. client 240 may have a grome with no problems while 240_2 has a grome and is having this problem. I have restarted the farm. Matter of fact, I completely started over by deleting all client data, so when each client booted it would get new cores. Still the same problem. What hacks me off is that it may have completed 80% of a grome, then this error occurs and it deletes the work unit, so I've lost that 80%. Looking in the home directories, each has one core of either 65 or 78, and one has 65, 78 and 79. I didn't notice any extra cores.
I did top on all of the clients and there are only two cores running on each client. I only have 8 clients at this time, plus the server.
paps
Arkaine23
08-23-04, 04:07 PM
You'd need to check the running processes with the top command. The extra core would be running in RAM, i.e. one folding client has launched its core twice. This is probably not the issue; more likely it's what overdoze said about the ramdisk on a client being full because of a particularly big WU.
As root-
ssh IP_address cat /proc/meminfo (should give you info about a node's memory usage)
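If only one number is of interest, the meminfo output can be filtered. A small sketch, assuming the usual "Name:   value kB" lines in /proc/meminfo; `mem_field` is a hypothetical helper name:

```shell
#!/bin/sh
# mem_field NAME: read /proc/meminfo-style text on stdin and print the
# value (in kB) of the field called NAME, e.g. MemFree or SwapFree.
mem_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# Example against a node (IP_address is a placeholder):
# ssh IP_address cat /proc/meminfo | mem_field MemFree
```

Comparing MemFree across nodes while a big WU is running should show whether the failing clients are the ones running short of memory.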