I've lost a lot of work in the last couple of weeks like this.
NedClocker
08-22-08, 03:20 PM
Units are not getting sent and when I restart a client, it starts over with a new unit.
Any suggestions?
jeff@jeff-desktop:~$ cd ~/folding/FAH
jeff@jeff-desktop:~/folding/FAH$ ./fah6 -smp -verbosity 9
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.
2 cores detected
--- Opening Log file [August 22 20:05:05]
# SMP Client ################################################## ################
################################################## #############################
Folding@Home Client Version 6.02
http://folding.stanford.edu
################################################## #############################
################################################## #############################
Launch directory: /home/jeff/folding/FAH
Executable: ./fah6
Arguments: -smp -verbosity 9
[20:05:05] - Ask before connecting: No
[20:05:05] - User name: Ned_Clocker (Team 32)
[20:05:05] - User ID: 38A22004165DDE5D
[20:05:05] - Machine ID: 1
[20:05:05]
[20:05:05] Loaded queue successfully.
[20:05:05]
[20:05:05] + Processing work unit
[20:05:05] Core required: FahCore_a2.exe
[20:05:05] Core found.
[20:05:05] - Autosending finished units...
[20:05:05] Trying to send all finished work units
[20:05:05] + Attempting to send results
[20:05:05] - Reading file work/wuresults_03.dat from core
[20:05:06] Working on Unit 05 [August 22 20:05:06]
[20:05:06] + Working ...
[20:05:06] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -checkpoint 30 -verbose -lifeline 4767 -version 602'
[20:05:06] (Read 26792167 bytes from disk)
[20:05:06] Connecting to http://171.64.65.56:8080/
[20:05:07]
[20:05:07] *------------------------------*
[20:05:07] Folding@Home Gromacs SMP Core
[20:05:07] Version 1.91 (2007)
[20:05:07]
[20:05:07] Preparing to commence simulation
[20:05:07] - Ensuring status. Please wait.
[20:05:13] - Couldn't send HTTP request to server
[20:05:13] + Could not connect to Work Server (results)
[20:05:13] (171.64.65.56:8080)
[20:05:13] - Error: Could not transmit unit 03 (completed August 21) to work server.
[20:05:13] - 12 failed uploads of this unit.
[20:05:13] + Attempting to send results
[20:05:13] - Reading file work/wuresults_03.dat from core
[20:05:13] (Read 26792167 bytes from disk)
[20:05:13] Connecting to http://171.64.122.86:8080/
[20:05:24] - Looking at optimizations...
[20:05:24] - Working with standard loops on this execution.
[20:05:24] - Previous termination of core was improper.
[20:05:24] - Going to use standard loops.
[20:05:24] - Files status OK
[20:05:24] Error: Work unit read from disk is invalid
[20:05:24] Finalizing output
[20:05:26] - Expanded 4922555 -> 24360573 (decompressed 494.8 percent)
[20:05:26]
[20:05:26] Project: 2662 (Run 2, Clone 133, Gen 16)
[20:05:26]
[20:05:26] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=1, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=3, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=2, HOSTNAME=jeff-desktop
NODEID=3 argc=14
NODEID=1 argc=14
NODEID=0 argc=14
NODEID=2 argc=14
[20:05:32] Will resume from checkpoint file
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_20070720 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) mdrun (-:
Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
starting mdrun 'HGG in water'
250000 steps, 500.0 ps.
old size= 35159 old_crc=2088768
-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_20070720
Source code file: md.c, line: 831
Fatal error:
Checkpoint error on step 4043760
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4
gcq#0: Thanx for Using GROMACS - Have a Nice Day
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[20:05:35] Completed 43760 out of 25File work/wudata_05.log has changed since last checkpoint
[cli_1]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
[0]0:Return code = 255
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[20:05:39] CoreStatus = FF (255)
[20:05:39] Client-core communications error: ERROR 0xff
[20:05:39] Deleting current work unit & continuing...
[20:08:12] Posted data.
[20:08:13] Initial: 0000; - Uploaded at ~145 kB/s
[20:08:13] - Averaged speed for that direction ~56 kB/s
[20:08:13] - Server does not have record of this unit. Will try again later.
[20:08:13] Could not transmit unit 03 to Collection server; keeping in queue.
[20:08:13] + Sent 0 of 1 completed units to the server
[20:08:13] - Autosend completed
After shutdown
[0]0:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[20:10:08] - Warning: Could not delete all work unit files (5): Core file absent
[20:10:08] Trying to send all finished work units
[20:10:08] + Attempting to send results
[20:10:08] - Reading file work/wuresults_03.dat from core
[20:10:08] (Read 26792167 bytes from disk)
[20:10:08] Connecting to http://171.64.65.56:8080/
[20:10:15] - Couldn't send HTTP request to server
[20:10:15] + Could not connect to Work Server (results)
[20:10:15] (171.64.65.56:8080)
[20:10:15] - Error: Could not transmit unit 03 (completed August 21) to work server.
[20:10:15] - 13 failed uploads of this unit.
[20:10:15] + Attempting to send results
[20:10:15] - Reading file work/wuresults_03.dat from core
[20:10:15] (Read 26792167 bytes from disk)
[20:10:15] Connecting to http://171.64.122.86:8080/
[20:10:15] - Couldn't send HTTP request to server
[20:10:15] + Could not connect to Work Server (results)
[20:10:15] (171.64.122.86:8080)
[20:10:15] Could not transmit unit 03 to Collection server; keeping in queue.
[20:10:15] + Sent 0 of 1 completed units to the server
[20:10:15] - Preparing to get new work unit...
[20:10:15] + Attempting to get work packet
[20:10:15] - Will indicate memory of 1002 MB
[20:10:15] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 6
[20:10:15] - Connecting to assignment server
[20:10:15] Connecting to http://assign.stanford.edu:8080/
[20:10:15] Posted data.
[20:10:15] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[20:10:15] + News From Folding@Home: Welcome to Folding@Home
[20:10:16] Loaded queue successfully.
[20:10:16] Connecting to http://171.64.65.56:8080/
[20:10:21] Posted data.
[20:10:21] Initial: 0000; - Receiving payload (expected size: 4920469)
[20:10:31] - Downloaded at ~480 kB/s
[20:10:31] - Averaged speed for that direction ~195 kB/s
[20:10:31] + Received work.
[20:10:31] + Closed connections
[20:10:36]
[20:10:36] + Processing work unit
[20:10:36] Core required: FahCore_a2.exe
[20:10:36] Core found.
[20:10:37] Working on Unit 06 [August 22 20:10:37]
[20:10:37] + Working ...
[20:10:37] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 30 -verbose -lifeline 4767 -version 602'
[20:10:37]
[20:10:37] *------------------------------*
[20:10:37] Folding@Home Gromacs SMP Core
[20:10:37] Version 1.91 (2007)
[20:10:37]
[20:10:37] Preparing to commence simulation
[20:10:37] - Ensuring status. Please wait.
[20:10:54] - Looking at optimizations...
[20:10:54] - Working with standard loops on this execution.
[20:10:54] - Previous termination of core was improper.
[20:10:54] - Going to use standard loops.
[20:10:54] - Files status OK
[20:10:54] Error: Work unit read from disk is invalid
[20:10:54] Finalizing output
[20:10:55] - Expanded 4919957 -> 24360573 (decompressed 495.1 percent)
[20:10:56]
[20:10:56] Project: 2662 (Run 2, Clone 393, Gen 12)
[20:10:56]
[20:10:56] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=3, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=1, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=0, HOSTNAME=jeff-desktop
NODEID=2 argc=14
NODEID=0 argc=14
NODEID=3 argc=14
NODEID=1 argc=14
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_20070720 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) mdrun (-:
Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
starting mdrun 'HGG in water'
250000 steps, 500.0 ps.
[20:11:04] (0%)
[20:11:04] ed 0 out of 250000 steps (0%)
the garynator
08-22-08, 03:27 PM
I know you've probably checked this already, but could it be a firewall?
What have you tried so far?
Is this happening on only one machine?
WarriorII
08-22-08, 03:44 PM
I'm gonna take a stab instead of asking.
By your flags, this is an Ubuntu machine, correct?
cd ~/folding/FAH
:~/folding/FAH$ ./fah6 -smp -verbosity 9
Have you tried deleting the info inside the Work folder and the Queue.dat file, and starting over from scratch that way?
I know, these are SMP WUs and they have a time limit on them.
If you can't get them uploaded quickly enough, you still don't get points for doing them.
Was it just this one WU, or were there more from other machines?
There is a Connectivity Test in the Stickies you may want to try too.
Didn't Stanford go down for a bit this past week too?
Good to see you Ned.
Gary too !
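For anyone trying the "delete and start over" route, here is a minimal Python sketch of one way to do it without throwing the finished unit away. It copies the work folder and the queue file aside first, then empties them. Assumptions: the client is stopped before this runs, the queue file is named `queue.dat` in lowercase on disk, and the `backup-<timestamp>` folder name is made up for this example.

```python
import shutil
import time
from pathlib import Path


def back_up_and_clear(fah_dir, backup_root=None):
    """Copy work/ and queue.dat aside, then clear them inside fah_dir.

    Returns the path of the backup directory that was created.
    """
    fah_dir = Path(fah_dir)
    # Hypothetical backup location: a timestamped folder, by default
    # inside the client directory itself.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root or fah_dir) / f"backup-{stamp}"
    backup_dir.mkdir(parents=True, exist_ok=False)

    work = fah_dir / "work"
    queue = fah_dir / "queue.dat"  # assumed lowercase file name

    if work.is_dir():
        # Keep a full copy, then leave an empty work/ folder behind.
        shutil.copytree(work, backup_dir / "work")
        shutil.rmtree(work)
        work.mkdir()
    if queue.is_file():
        shutil.copy2(queue, backup_dir / "queue.dat")
        queue.unlink()
    return backup_dir
```

Calling `back_up_and_clear("/home/jeff/folding/FAH")` would use the launch directory shown in the log above. Whether a backed-up unit can still be sent later is not something this sketch settles; it only keeps the files from being lost outright.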
Mark620
08-22-08, 07:32 PM
It would appear that Stanford has serious problems.
This type of problem is proliferating...
Mr.Guvernment
08-22-08, 07:58 PM
Ya, could be the aftershock of the power outages they had.
In my experience, there are three things that cause this problem:
[20:10:08] + Attempting to send results
[20:10:08] - Reading file work/wuresults_03.dat from core
[20:10:08] (Read 26792167 bytes from disk)
[20:10:08] Connecting to http://171.64.65.56:8080/
[20:10:15] - Couldn't send HTTP request to server
[20:10:15] + Could not connect to Work Server (results)
[20:10:15] (171.64.65.56:8080)
[20:10:15] - Error: Could not transmit unit 03 (completed August 21) to work server.
[20:10:15] - 13 failed uploads of this unit.
1) The connection is blocked by a firewall or router, or your ISP has shut down that port to save bandwidth.
If you can connect with the router in question and get an "OK" or a ping reply, then it's not #1, and obviously your internet connection is working.
2) Either your comm protocol or Stanford's protocol is bad. With the old clients (5.02, 5.04, etc.), it could be things like:
a) Using the IE software settings. This does not apply to newer clients.
or
b) Stanford's server is bonkers. Even if it says the server's status is great, that doesn't always mean that the server will communicate with boxes *outside* Stanford.
It just means the server *will* communicate with *one* other server inside Stanford's FAH lab.
And this looks like a 2b problem to me.
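A quick way to rule cause #1 in or out is to try opening a plain TCP connection to the server and port from the log. The following is only a sketch using Python's standard `socket` module; the function name `can_connect` and the 10-second timeout are choices made for this example, not anything the client itself does.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection raises OSError (including timeouts and
        # refused connections) when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("171.64.65.56", 8080)` tests the work server address from the quoted log (an address from 2008 that may no longer answer). A `True` result only says the port is reachable from your box, which argues against #1. It does not prove that the server will accept an upload, so a 2b problem can still be in play.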
Mark620
08-23-08, 06:19 AM
I hope it's "2b".
None of my computers will communicate with Stanford.
[07:41:09] *------------------------------*
[07:41:09] Folding@Home GPU Core - Beta
[07:41:09] Version 1.09 (Fri Aug 1 11:46:54 PDT 2008)
[07:41:09]
[07:41:09] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[07:41:09] Build host: amoeba
[07:41:09] Board Type: Nvidia
[07:41:09] Core :
[07:41:09] Preparing to commence simulation
[07:41:09] - Looking at optimizations...
[07:41:09] - Created dyn
[07:41:09] - Files status OK
[07:41:09] - Expanded 45377 -> 246249 (decompressed 542.6 percent)
[07:41:09] Called DecompressByteArray: compressed_data_size=45377 data_size=246249, decompressed_data_size=246249 diff=0
[07:41:09] - Digital signature verified
[07:41:09]
[07:41:09] Project: 5506 (Run 8, Clone 434, Gen 59)
[07:41:09]
[07:41:09] Assembly optimizations on if available.
[07:41:09] Entering M.D.
[07:41:15] Working on p5506_supervillin_e1
[07:41:16] Client config found, loading data.
[07:41:16] Starting GUI Server
[07:42:47] Completed 1%
[07:44:17] Completed 2%
[07:45:47] Completed 3%
[07:47:18] Completed 4%
[07:48:48] Completed 5%
[07:50:18] Completed 6%
[07:51:49] Completed 7%
[07:53:19] Completed 8%
[07:54:49] Completed 9%
[07:56:20] Completed 10%
[07:57:50] Completed 11%
[07:59:20] Completed 12%
[08:00:51] Completed 13%
[08:02:21] Completed 14%
[08:03:52] Completed 15%
[08:05:22] Completed 16%
[08:06:52] Completed 17%
[08:08:23] Completed 18%
[08:09:53] Completed 19%
[08:11:23] Completed 20%
[08:12:54] Completed 21%
[08:14:24] Completed 22%
[08:15:54] Completed 23%
[08:17:25] Completed 24%
[08:18:55] Completed 25%
[08:20:25] Completed 26%
[08:21:56] Completed 27%
[08:23:26] Completed 28%
[08:24:56] Completed 29%
[08:26:27] Completed 30%
[08:27:57] Completed 31%
[08:29:28] Completed 32%
[08:30:58] Completed 33%
[08:32:28] Completed 34%
[08:33:59] Completed 35%
[08:35:29] Completed 36%
[08:36:59] Completed 37%
[08:38:30] Completed 38%
[08:40:00] Completed 39%
[08:41:30] Completed 40%
[08:43:01] Completed 41%
[08:44:31] Completed 42%
[08:46:01] Completed 43%
[08:47:32] Completed 44%
[08:49:02] Completed 45%
[08:50:32] Completed 46%
[08:52:03] Completed 47%
[08:53:33] Completed 48%
[08:55:04] Completed 49%
[08:56:34] Completed 50%
[08:58:04] Completed 51%
[08:59:35] Completed 52%
[09:01:05] Completed 53%
[09:02:35] Completed 54%
[09:04:06] Completed 55%
[09:05:36] Completed 56%
[09:07:06] Completed 57%
[09:08:37] Completed 58%
[09:10:07] Completed 59%
[09:11:37] Completed 60%
[09:13:08] Completed 61%
[09:14:38] Completed 62%
[09:16:08] Completed 63%
[09:17:39] Completed 64%
[09:19:09] Completed 65%
[09:20:40] Completed 66%
[09:22:10] Completed 67%
[09:23:40] Completed 68%
[09:25:11] Completed 69%
[09:26:41] Completed 70%
[09:28:11] Completed 71%
[09:29:42] Completed 72%
[09:31:12] Completed 73%
[09:32:42] Completed 74%
[09:34:13] Completed 75%
[09:35:43] Completed 76%
[09:37:13] Completed 77%
[09:38:44] Completed 78%
[09:40:14] Completed 79%
[09:41:44] Completed 80%
[09:43:15] Completed 81%
[09:44:45] Completed 82%
[09:46:16] Completed 83%
[09:47:46] Completed 84%
[09:49:16] Completed 85%
[09:50:47] Completed 86%
[09:52:17] Completed 87%
[09:53:47] Completed 88%
[09:55:18] Completed 89%
[09:56:48] Completed 90%
[09:58:18] Completed 91%
[09:59:49] Completed 92%
[10:01:19] Completed 93%
[10:02:49] Completed 94%
[10:04:20] Completed 95%
[10:05:50] Completed 96%
[10:07:20] Completed 97%
[10:08:51] Completed 98%
[10:10:21] Completed 99%
[10:11:52] Completed 100%
[10:12:52]
[10:12:52] Finished Work Unit:
[10:12:52] - Reading up to 34800 from "work/wudata_09.trr": Read 34800
[10:12:52] trr file hash check passed.
[10:12:52] - Reading up to 1127024 from "work/wudata_09.xtc": Read 1127024
[10:12:52] xtc file hash check passed.
[10:12:52] edr file hash check passed.
[10:12:52] logfile size: 130367
[10:12:52] Leaving Run
[10:12:54] - Writing 1293263 bytes of core data to disk...
[10:12:55] Done: 1292751 -> 1154770 (compressed to 89.3 percent)
[10:12:55] ... Done.
[10:12:55] - Shutting down core
[10:12:55]
[10:12:55] Folding@home Core Shutdown: FINISHED_UNIT
[10:12:57] CoreStatus = 64 (100)
[10:12:57] Sending work to server
[10:12:57] Project: 5506 (Run 8, Clone 434, Gen 59)
[10:12:57] - Read packet limit of 540015616... Set to 524286976.
[10:12:57] + Attempting to send results [August 23 10:12:57 UTC]
[10:14:45] - Couldn't send HTTP request to server
[10:14:45] + Could not connect to Work Server (results)
[10:14:45] (171.64.65.106:8080)
[10:14:45] + Retrying using alternative port
[10:16:39] - Couldn't send HTTP request to server
[10:16:39] + Could not connect to Work Server (results)
[10:16:39] (171.64.65.106:80)
[10:16:39] - Error: Could not transmit unit 09 (completed August 23) to work server.
[10:16:39] Keeping unit 09 in queue.
[10:16:39] Project: 5015 (Run 5, Clone 758, Gen 41)
[10:16:39] - Read packet limit of 540015616... Set to 524286976.
[10:16:39] + Attempting to send results [August 23 10:16:39 UTC]
[10:17:00] - Couldn't send HTTP request to server
[10:17:00] + Could not connect to Work Server (results)
[10:17:00] (171.64.65.20:8080)
[10:17:00] + Retrying using alternative port
[10:18:56] - Couldn't send HTTP request to server
[10:18:56] + Could not connect to Work Server (results)
[10:18:56] (171.64.65.20:80)
[10:18:56] - Error: Could not transmit unit 00 (completed August 22) to work server.
[10:18:56] - Read packet limit of 540015616... Set to 524286976.
[10:18:56] + Attempting to send results [August 23 10:18:56 UTC]
[10:18:56] - Couldn't send HTTP request to server
[10:18:56] (Got status 503)
[10:18:56] + Could not connect to Work Server (results)
[10:18:56] (171.64.122.76:8080)
[10:18:56] + Retrying using alternative port
[10:18:56] - Couldn't send HTTP request to server
[10:18:56] (Got status 503)
[10:18:56] + Could not connect to Work Server (results)
[10:18:56] (171.64.122.76:80)
[10:18:56] Could not transmit unit 00 to Collection server; keeping in queue.
[10:18:56] Project: 5015 (Run 9, Clone 640, Gen 6)
[10:18:56] - Read packet limit of 540015616... Set to 524286976.
[10:18:56] + Attempting to send results [August 23 10:18:56 UTC]
[10:22:06] - Couldn't send HTTP request to server
[10:22:06] + Could not connect to Work Server (results)
[10:22:06] (171.64.65.20:8080)
[10:22:06] + Retrying using alternative port
[10:24:18] - Couldn't send HTTP request to server
[10:24:18] + Could not connect to Work Server (results)
[10:24:18] (171.64.65.20:80)
[10:24:18] - Error: Could not transmit unit 01 (completed August 22) to work server.
[10:24:18] - Read packet limit of 540015616... Set to 524286976.
[10:24:18] + Attempting to send results [August 23 10:24:18 UTC]
[10:24:19] - Couldn't send HTTP request to server
[10:24:19] (Got status 503)
[10:24:19] + Could not connect to Work Server (results)
[10:24:19] (171.64.122.76:8080)
[10:24:19] + Retrying using alternative port
[10:24:19] - Couldn't send HTTP request to server
[10:24:19] (Got status 503)
[10:24:19] + Could not connect to Work Server (results)
[10:24:19] (171.64.122.76:80)
[10:24:19] Could not transmit unit 01 to Collection server; keeping in queue.
[10:24:19] Project: 5506 (Run 7, Clone 801, Gen 12)
[10:24:19] - Read packet limit of 540015616... Set to 524286976.
[10:24:19] + Attempting to send results [August 23 10:24:19 UTC]
[10:26:40] - Couldn't send HTTP request to server
[10:26:40] + Could not connect to Work Server (results)
[10:26:40] (171.64.65.106:8080)
[10:26:40] + Retrying using alternative port
[10:29:14] - Couldn't send HTTP request to server
[10:29:14] + Could not connect to Work Server (results)
[10:29:14] (171.64.65.106:80)
[10:29:14] - Error: Could not transmit unit 02 (completed August 22) to work server.
[10:29:14] - Read packet limit of 540015616... Set to 524286976.
[10:29:14] + Attempting to send results [August 23 10:29:14 UTC]
[10:29:14] - Couldn't send HTTP request to server
[10:29:14] (Got status 503)
[10:29:14] + Could not connect to Work Server (results)
[10:29:14] (171.64.122.76:8080)
[10:29:14] + Retrying using alternative port
[10:29:14] - Couldn't send HTTP request to server
[10:29:14] (Got status 503)
[10:29:14] + Could not connect to Work Server (results)
[10:29:14] (171.64.122.76:80)
[10:29:14] Could not transmit unit 02 to Collection server; keeping in queue.
[10:29:14] Project: 5506 (Run 2, Clone 916, Gen 14)
[10:29:14] - Read packet limit of 540015616... Set to 524286976.
[10:29:14] + Attempting to send results [August 23 10:29:14 UTC]
[10:31:27] - Couldn't send HTTP request to server
[10:31:27] + Could not connect to Work Server (results)
[10:31:27] (171.64.65.106:8080)
[10:31:27] + Retrying using alternative port
[10:33:47] - Couldn't send HTTP request to server
[10:33:47] + Could not connect to Work Server (results)
[10:33:47] (171.64.65.106:80)
[10:33:47] - Error: Could not transmit unit 03 (completed August 22) to work server.
[10:33:47] - Read packet limit of 540015616... Set to 524286976.
[10:33:47] + Attempting to send results [August 23 10:33:47 UTC]
[10:33:47] - Couldn't send HTTP request to server
[10:33:47] (Got status 503)
[10:33:47] + Could not connect to Work Server (results)
[10:33:47] (171.64.122.76:8080)
[10:33:47] + Retrying using alternative port
[10:33:47] - Couldn't send HTTP request to server
[10:33:47] (Got status 503)
[10:33:47] + Could not connect to Work Server (results)
[10:33:47] (171.64.122.76:80)
[10:33:47] Could not transmit unit 03 to Collection server; keeping in queue.
[10:33:47] Project: 5015 (Run 1, Clone 909, Gen 8)
[10:33:47] - Read packet limit of 540015616... Set to 524286976.
[10:33:47] + Attempting to send results [August 23 10:33:47 UTC]
[10:37:58] - Couldn't send HTTP request to server
[10:37:58] + Could not connect to Work Server (results)
[10:37:58] (171.64.65.20:8080)
[10:37:58] + Retrying using alternative port
[10:38:59] - Couldn't send HTTP request to server
[10:38:59] + Could not connect to Work Server (results)
[10:38:59] (171.64.65.20:80)
[10:38:59] - Error: Could not transmit unit 04 (completed August 22) to work server.
[10:38:59] - Read packet limit of 540015616... Set to 524286976.
[10:38:59] + Attempting to send results [August 23 10:38:59 UTC]
[10:38:59] - Couldn't send HTTP request to server
[10:38:59] (Got status 503)
[10:38:59] + Could not connect to Work Server (results)
[10:38:59] (171.64.122.76:8080)
[10:38:59] + Retrying using alternative port
[10:38:59] - Couldn't send HTTP request to server
[10:38:59] (Got status 503)
[10:38:59] + Could not connect to Work Server (results)
[10:38:59] (171.64.122.76:80)
[10:38:59] Could not transmit unit 04 to Collection server; keeping in queue.
[10:38:59] Project: 5014 (Run 9, Clone 47, Gen 4)
[10:38:59] - Read packet limit of 540015616... Set to 524286976.
[10:38:59] + Attempting to send results [August 23 10:38:59 UTC]
[10:40:56] - Couldn't send HTTP request to server
[10:40:56] + Could not connect to Work Server (results)
[10:40:56] (171.64.65.20:8080)
[10:40:56] + Retrying using alternative port
[10:42:51] - Couldn't send HTTP request to server
[10:42:51] + Could not connect to Work Server (results)
[10:42:51] (171.64.65.20:80)
[10:42:51] - Error: Could not transmit unit 05 (completed August 22) to work server.
[10:42:51] - Read packet limit of 540015616... Set to 524286976.
[10:42:51] + Attempting to send results [August 23 10:42:51 UTC]
[10:42:51] - Couldn't send HTTP request to server
[10:42:51] (Got status 503)
[10:42:51] + Could not connect to Work Server (results)
[10:42:51] (171.64.122.76:8080)
[10:42:51] + Retrying using alternative port
[10:42:51] - Couldn't send HTTP request to server
[10:42:51] (Got status 503)
[10:42:51] + Could not connect to Work Server (results)
[10:42:51] (171.64.122.76:80)
[10:42:51] Could not transmit unit 05 to Collection server; keeping in queue.
[10:42:51] Project: 5015 (Run 2, Clone 960, Gen 5)
[10:42:51] - Read packet limit of 540015616... Set to 524286976.
[10:42:51] + Attempting to send results [August 23 10:42:51 UTC]
[10:47:02] - Couldn't send HTTP request to server
[10:47:02] + Could not connect to Work Server (results)
[10:47:02] (171.64.65.20:8080)
[10:47:02] + Retrying using alternative port
[10:48:15] - Couldn't send HTTP request to server
[10:48:15] + Could not connect to Work Server (results)
[10:48:15] (171.64.65.20:80)
[10:48:15] - Error: Could not transmit unit 06 (completed August 22) to work server.
[10:48:15] - Read packet limit of 540015616... Set to 524286976.
[10:48:15] + Attempting to send results [August 23 10:48:15 UTC]
[10:48:16] - Couldn't send HTTP request to server
[10:48:16] (Got status 503)
[10:48:16] + Could not connect to Work Server (results)
[10:48:16] (171.64.122.76:8080)
[10:48:16] + Retrying using alternative port
[10:48:16] - Couldn't send HTTP request to server
[10:48:16] (Got status 503)
[10:48:16] + Could not connect to Work Server (results)
[10:48:16] (171.64.122.76:80)
[10:48:16] Could not transmit unit 06 to Collection server; keeping in queue.
[10:48:16] Project: 5506 (Run 8, Clone 407, Gen 45)
[10:48:16] - Read packet limit of 540015616... Set to 524286976.
[10:48:16] + Attempting to send results [August 23 10:48:16 UTC]
[10:50:22] - Couldn't send HTTP request to server
[10:50:22] + Could not connect to Work Server (results)
[10:50:22] (171.64.65.106:8080)
[10:50:22] + Retrying using alternative port
[10:52:35] - Couldn't send HTTP request to server
[10:52:35] + Could not connect to Work Server (results)
[10:52:35] (171.64.65.106:80)
[10:52:35] - Error: Could not transmit unit 07 (completed August 23) to work server.
[10:52:35] - Read packet limit of 540015616... Set to 524286976.
[10:52:35] + Attempting to send results [August 23 10:52:35 UTC]
[10:52:36] - Couldn't send HTTP request to server
[10:52:36] (Got status 503)
[10:52:36] + Could not connect to Work Server (results)
[10:52:36] (171.64.122.76:8080)
[10:52:36] + Retrying using alternative port
[10:52:36] - Couldn't send HTTP request to server
[10:52:36] (Got status 503)
[10:52:36] + Could not connect to Work Server (results)
[10:52:36] (171.64.122.76:80)
[10:52:36] Could not transmit unit 07 to Collection server; keeping in queue.
[10:52:36] Project: 5504 (Run 6, Clone 758, Gen 2)
[10:52:36] - Read packet limit of 540015616... Set to 524286976.
[10:52:36] + Attempting to send results [August 23 10:52:36 UTC]
[10:54:53] - Couldn't send HTTP request to server
[10:54:53] + Could not connect to Work Server (results)
[10:54:53] (171.64.65.106:8080)
[10:54:53] + Retrying using alternative port
[11:00:00] - Couldn't send HTTP request to server
[11:00:00] + Could not connect to Work Server (results)
[11:00:00] (171.64.65.106:80)
[11:00:00] - Error: Could not transmit unit 08 (completed August 23) to work server.
[11:00:00] - Read packet limit of 540015616... Set to 524286976.
[11:00:00] + Attempting to send results [August 23 11:00:00 UTC]
[11:00:00] - Couldn't send HTTP request to server
[11:00:00] (Got status 503)
[11:00:00] + Could not connect to Work Server (results)
[11:00:00] (171.64.122.76:8080)
[11:00:00] + Retrying using alternative port
[11:00:00] - Couldn't send HTTP request to server
[11:00:00] (Got status 503)
[11:00:00] + Could not connect to Work Server (results)
[11:00:00] (171.64.122.76:80)
[11:00:00] Could not transmit unit 08 to Collection server; keeping in queue.
[11:00:00] Project: 5506 (Run 8, Clone 434, Gen 59)
[11:00:00] - Read packet limit of 540015616... Set to 524286976.
[11:00:00] + Attempting to send results [August 23 11:00:00 UTC]
[11:01:48] - Couldn't send HTTP request to server
[11:01:48] + Could not connect to Work Server (results)
[11:01:48] (171.64.65.106:8080)
[11:01:48] + Retrying using alternative port
[11:03:36] - Couldn't send HTTP request to server
[11:03:36] + Could not connect to Work Server (results)
[11:03:36] (171.64.65.106:80)
[11:03:36] - Error: Could not transmit unit 09 (completed August 23) to work server.
[11:03:36] - Read packet limit of 540015616... Set to 524286976.
[11:03:36] + Attempting to send results [August 23 11:03:36 UTC]
[11:03:36] - Couldn't send HTTP request to server
[11:03:36] (Got status 503)
[11:03:36] + Could not connect to Work Server (results)
[11:03:36] (171.64.122.76:8080)
[11:03:36] + Retrying using alternative port
[11:03:36] - Couldn't send HTTP request to server
[11:03:36] (Got status 503)
[11:03:36] + Could not connect to Work Server (results)
[11:03:36] (171.64.122.76:80)
[11:03:36] Could not transmit unit 09 to Collection server; keeping in queue.
[11:03:36] - Preparing to get new work unit...
[11:03:36] - Work queue full. Deleting oldest item
[11:03:36] + Attempting to get work packet
[11:03:36] - Connecting to assignment server
[11:03:37] - Successful: assigned to (171.64.65.106).
[11:03:37] + News From Folding@Home: GPU folding beta
[11:03:37] Loaded queue successfully.
[11:03:38] Project: 5015 (Run 9, Clone 640, Gen 6)
[11:03:38] - Read packet limit of 540015616... Set to 524286976.
[11:03:38] + Attempting to send results [August 23 11:03:38 UTC]
[11:05:57] - Couldn't send HTTP request to server
[11:05:57] + Could not connect to Work Server (results)
[11:05:57] (171.64.65.20:8080)
[11:05:57] + Retrying using alternative port
[11:08:10] - Couldn't send HTTP request to server
[11:08:10] + Could not connect to Work Server (results)
[11:08:10] (171.64.65.20:80)
[11:08:10] - Error: Could not transmit unit 01 (completed August 22) to work server.
[11:08:10] - Read packet limit of 540015616... Set to 524286976.
[11:08:10] + Attempting to send results [August 23 11:08:10 UTC]
[11:08:13] - Couldn't send HTTP request to server
[11:08:13] (Got status 503)
[11:08:13] + Could not connect to Work Server (results)
[11:08:13] (171.64.122.76:8080)
[11:08:13] + Retrying using alternative port
[11:08:18] - Couldn't send HTTP request to server
[11:08:18] (Got status 503)
[11:08:18] + Could not connect to Work Server (results)
[11:08:18] (171.64.122.76:80)
[11:08:18] Could not transmit unit 01 to Collection server; keeping in queue.
[11:08:18] Project: 5506 (Run 7, Clone 801, Gen 12)
[11:08:18] - Read packet limit of 540015616... Set to 524286976.
[11:08:18] + Attempting to send results [August 23 11:08:18 UTC]
[11:12:32] - Couldn't send HTTP request to server
[11:12:32] + Could not connect to Work Server (results)
[11:12:32] (171.64.65.106:8080)
[11:12:32] + Retrying using alternative port
[11:16:05] - Couldn't send HTTP request to server
[11:16:05] + Could not connect to Work Server (results)
[11:16:05] (171.64.65.106:80)
[11:16:05] - Error: Could not transmit unit 02 (completed August 22) to work server.
[11:16:05] - Read packet limit of 540015616... Set to 524286976.
[11:16:05] + Attempting to send results [August 23 11:16:05 UTC]
[11:16:05] - Couldn't send HTTP request to server
[11:16:05] (Got status 503)
[11:16:05] + Could not connect to Work Server (results)
[11:16:05] (171.64.122.76:8080)
[11:16:05] + Retrying using alternative port
[11:16:05] - Couldn't send HTTP request to server
[11:16:05] (Got status 503)
[11:16:05] + Could not connect to Work Server (results)
[11:16:05] (171.64.122.76:80)
[11:16:05] Could not transmit unit 02 to Collection server; keeping in queue.
[11:16:05] Project: 5506 (Run 2, Clone 916, Gen 14)
[11:16:05] - Read packet limit of 540015616... Set to 524286976.
[11:16:05] + Attempting to send results [August 23 11:16:05 UTC]
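With a log this long it can help to tally the failures per server instead of reading every line. Below is a rough Python sketch that counts the "Could not connect to Work Server" messages by the address printed on the following line. The regular expression is based only on the line format visible in the logs in this thread, and the function name `count_failures` is made up for this example.

```python
import re
from collections import Counter

# Matches the address line the client prints right after a failed
# connection, e.g. "[10:14:45] (171.64.65.106:8080)".
ADDRESS_LINE = re.compile(
    r"^\[\d\d:\d\d:\d\d\]\s+\((\d+\.\d+\.\d+\.\d+):(\d+)\)\s*$"
)
FAILURE_TEXT = "Could not connect to Work Server"


def count_failures(log_lines):
    """Count failed connections per (host, port) in client log lines."""
    failures = Counter()
    previous_failed = False
    for line in log_lines:
        match = ADDRESS_LINE.match(line.strip())
        # Only count an address that directly follows a failure message.
        if match and previous_failed:
            failures[(match.group(1), int(match.group(2)))] += 1
        previous_failed = FAILURE_TEXT in line
    return failures
```

The function accepts any iterable of lines, so an open log file can be passed in directly. If every configured server and port shows up with failures, as in the log above, that points toward the server side rather than one blocked port.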
Mr.Guvernment
08-23-08, 12:29 PM
If it was #1, would that mean the ISP is blocking ports 80 and 8080?
harlam357
08-24-08, 04:37 PM
Long time no see Ned!!! :beer: Where ya been buddy?
Ok, I think we've figured out that the servers were having issues... again :rolleyes:
But, I found this part of your log to be suspect... possibly instability?
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[20:05:35] Completed 43760 out of 25File work/wudata_05.log has changed since last checkpoint
[cli_1]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
[0]0:Return code = 255
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[20:05:39] CoreStatus = FF (255)
[20:05:39] Client-core communications error: ERROR 0xff
[20:05:39] Deleting current work unit & continuing...
NedClocker
08-25-08, 09:40 AM
garynator, I don't think it's a firewall problem. The machines can get work, just not send it back. This is happening on several machines.
Warrior, yes, it is an Ubuntu machine. Almost all of my machines are Ubuntu, real or virtual! All of the major machines are Ubuntu. Yes, the SMP WUs expire fast just sitting in the queue. It's more than one machine doing this. No, I have not tried deleting the work folder and queue.dat. I'll give that a try. I hate to lose the one sitting in the queue, but I will probably lose it anyway.
Adak, I think it is problem 2b, also, lol. Stanford's server must be bonkers. It's telling me on some of the units that the server does not have a record of this unit.
Harlam, It could be some instability. I set that rig back to stock, but I am still getting that message about the log file having changed. I'm sure that rig could use a good dusting. It's been a long time since I opened it up and cleaned it.
Edit
Adak, about bandwidth: I changed ISPs three days ago, so that's probably not the problem.
I am now on CABLE internet!!! Goodbye DSL. I got tired of videos stopping in the middle for no apparent reason. And my son reports much lower pings now while playing COD2. :)
NedClocker
08-25-08, 09:45 AM
Check out this latest restart.
jeff@jeff-desktop:~/folding$ cd ~/folding/FAH
jeff@jeff-desktop:~/folding/FAH$ ./fah6 -smp -verbosity 9
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.
2 cores detected
--- Opening Log file [August 25 14:12:35]
# SMP Client ################################################## ################
################################################## #############################
Folding@Home Client Version 6.02
http://folding.stanford.edu
################################################## #############################
################################################## #############################
Launch directory: /home/jeff/folding/FAH
Executable: ./fah6
Arguments: -smp -verbosity 9
[14:12:35] - Ask before connecting: No
[14:12:35] - User name: Ned_Clocker (Team 32)
[14:12:35] - Machine ID: 1
[14:12:35]
[14:12:35] Loaded queue successfully.
[14:12:35]
[14:12:35] + Processing work unit
[14:12:35] Core required: FahCore_a2.exe
[14:12:35] Core found.
[14:12:35] - Autosending finished units...
[14:12:35] Trying to send all finished work units
[14:12:35] + Attempting to send results
[14:12:35] - Reading file work/wuresults_06.dat from core
[14:12:36] Working on Unit 07 [August 25 14:12:36]
[14:12:36] + Working ...
[14:12:36] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 30 -verbose -lifeline 4727 -version 602'
[14:12:36] (Read 26368847 bytes from disk)
[14:12:36] Connecting to http://171.64.65.56:8080/
[14:12:36]
[14:12:36] *------------------------------*
[14:12:36] Folding@Home Gromacs SMP Core
[14:12:36] Version 1.91 (2007)
[14:12:36]
[14:12:36] Preparing to commence simulation
[14:12:36] - Ensuring status. Please wait.
[14:12:43] - Couldn't send HTTP request to server
[14:12:43] + Could not connect to Work Server (results)
[14:12:43] (171.64.65.56:8080)
[14:12:43] - Error: Could not transmit unit 06 (completed August 24) to work server.
[14:12:43] - 9 failed uploads of this unit.
[14:12:43] + Attempting to send results
[14:12:43] - Reading file work/wuresults_06.dat from core
[14:12:43] (Read 26368847 bytes from disk)
[14:12:43] Connecting to http://171.64.122.86:8080/
[14:12:53] - Looking at optimizations...
[14:12:53] - Working with standard loops on this execution.
[14:12:53] - Previous termination of core was improper.
[14:12:53] - Going to use standard loops.
[14:12:53] - Files staError: Could not write local fiError: Could not write local fiError: Could not write local file. Exiting.
[14:12:58] - Shutting down core
After shutdown
After shutdown
After shutdown
After shutdown
[14:14:58] Finalizing output
[14:15:02] CoreStatus = 0 (0)
[14:15:02] Client-core communications error: ERROR 0x0
[14:15:02] Deleting current work unit & continuing...
[14:15:39] Posted data.
[14:15:39] Initial: 0000; - Uploaded at ~146 kB/s
[14:15:39] - Averaged speed for that direction ~116 kB/s
[14:15:39] - Server does not have record of this unit. Will try again later.
[14:15:39] Could not transmit unit 06 to Collection server; keeping in queue.
[14:15:39] + Sent 0 of 1 completed units to the server
[14:15:39] - Autosend completed
After shutdown
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[14:19:24] - Warning: Could not delete all work unit files (7): Core file absent
[14:19:24] Trying to send all finished work units
[14:19:24] + Attempting to send results
[14:19:24] - Reading file work/wuresults_06.dat from core
[14:19:24] (Read 26368847 bytes from disk)
[14:19:24] Connecting to http://171.64.65.56:8080/
[14:19:56] - Couldn't send HTTP request to server
[14:19:56] + Could not connect to Work Server (results)
[14:19:56] (171.64.65.56:8080)
[14:19:56] - Error: Could not transmit unit 06 (completed August 24) to work server.
[14:19:56] - 10 failed uploads of this unit.
[14:19:56] + Attempting to send results
[14:19:56] - Reading file work/wuresults_06.dat from core
[14:19:56] (Read 26368847 bytes from disk)
[14:19:56] Connecting to http://171.64.122.86:8080/
[14:19:58] - Couldn't send HTTP request to server
[14:19:58] + Could not connect to Work Server (results)
[14:19:58] (171.64.122.86:8080)
[14:19:58] Could not transmit unit 06 to Collection server; keeping in queue.
[14:19:58] + Sent 0 of 1 completed units to the server
[14:19:58] - Preparing to get new work unit...
[14:19:58] + Attempting to get work packet
[14:19:58] - Will indicate memory of 1002 MB
[14:19:58] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 6
[14:19:58] - Connecting to assignment server
[14:19:58] Connecting to http://assign.stanford.edu:8080/
[14:20:00] Posted data.
[14:20:00] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[14:20:00] + News From Folding@Home: Welcome to Folding@Home
[14:20:00] Loaded queue successfully.
[14:20:00] Connecting to http://171.64.65.56:8080/
[14:20:07] Posted data.
[14:20:07] Initial: 0000; - Receiving payload (expected size: 5001164)
[14:20:50] - Downloaded at ~113 kB/s
[14:20:50] - Averaged speed for that direction ~202 kB/s
[14:20:50] + Received work.
[14:20:50] + Closed connections
[14:20:55]
[14:20:55] + Processing work unit
[14:20:55] Core required: FahCore_a2.exe
[14:20:55] Core found.
[14:20:55] Working on Unit 08 [August 25 14:20:55]
[14:20:55] + Working ...
[14:20:55] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 30 -verbose -lifeline 4727 -version 602'
[14:20:55]
[14:20:55] *------------------------------*
[14:20:55] Folding@Home Gromacs SMP Core
[14:20:55] Version 1.91 (2007)
[14:20:55]
[14:20:55] Preparing to commence simulation
[14:20:55] - Ensuring status. Please wait.
[14:21:13] - Looking at optimizations...
[14:21:13] - Working with standard loops on this execution.
[14:21:13] - Previous termination of core was improper.
[14:21:13] - Going to use standard loops.
[14:21:13] - Files status OK
[14:21:13] Error: Work unit read from disk is invalid
[14:21:13] Finalizing output
[14:21:14] (decompressed 494.7 percent)
[14:21:14] 9 (decompressed 494.7 percent)
[14:21:15] 8, Gen 10)
[14:21:15]
[14:21:15] Entering M.D.
[14:21:15] ne 368, Gen 10)
[14:21:15]
[14:21:15] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=1, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=3, HOSTNAME=jeff-desktop
NNODES=4, MYRANK=2, HOSTNAME=jeff-desktop
NODEID=3 argc=14
NODEID=1 argc=14
NODEID=2 argc=14
NODEID=0 argc=14
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_20070720 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) mdrun (-:
Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
starting mdrun 'HGG with glycosylations'
250000 steps, 500.0 ps.
[14:21:24] Comple
[14:21:24] ed 0 out of 250000 steps (0%)
[14:32:02] Completed 1250 out of 250000 steps (1%)
WarriorII
08-25-08, 10:04 AM
I want to say that Stanford has the problem.
Server Status:
http://fah-web.stanford.edu/serverstat.html
NedClocker
08-25-08, 10:06 AM
This is from a virtual Ubuntu machine.
--- Opening Log file [August 25 14:58:55]
# SMP Client ################################################## ################
################################################## #############################
Folding@Home Client Version 6.02
http://folding.stanford.edu
################################################## #############################
################################################## #############################
Launch directory: /home/nunya/folding/FAH
Executable: ./fah6
Arguments: -smp -verbosity 9
[14:58:55] - Ask before connecting: No
[14:58:55] - User name: Ned_Clocker (Team 32)
[14:58:55] - Machine ID: 1
[14:58:55]
[14:58:55] Loaded queue successfully.
[14:58:55] Unit 9's deadline (August 24 10:56) has passed.
[15:02:00] ***** Got an Activate signal (2)
[15:02:00] Killing all core threads
Folding@Home Client Shutdown.
nunya@nunya-desktop:~/folding/FAH$ ./fah6 -queueinfo
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.
--- Opening Log file [August 25 15:02:12]
# Linux Console Edition ################################################## #####
################################################## #############################
Folding@Home Client Version 6.02
http://folding.stanford.edu
################################################## #############################
################################################## #############################
Launch directory: /home/nunya/folding/FAH
Executable: ./fah6
Arguments: -queueinfo
[15:02:12] - Ask before connecting: No
[15:02:12] - User name: Ned_Clocker (Team 32)
[15:02:12] - User ID: 6FD514F370A61714
[15:02:12] - Machine ID: 1
[15:02:12]
[15:02:12] Loaded queue successfully.
[15:02:12] Printing Queue Information
CURRENT QUEUE:
00 *READY a2 171.64.65.56:8080 August 23 01:22 | August 26 01:22
[ P2662R1C103G9 ]
01 EMPTY
02 EMPTY
03 EMPTY
04 EMPTY
05 EMPTY
06 EMPTY
07 EMPTY
08 EMPTY
09 DONE a2 171.64.65.56:8080 August 21 10:56->August 23 01:21
[ P2662R1C325G16 ]
Folding@Home Client Shutdown.
nunya@nunya-desktop:~/folding/FAH$ After shutdown
After shutdown
After shutdown
[0]3:Return code = 0, signaled with Quit
NedClocker
08-25-08, 10:32 AM
This is from another Ubuntu VM which is NOT overclocked.
--- Opening Log file [August 26 13:06:09]
# SMP Client ################################################## ################
################################################## #############################
Folding@Home Client Version 6.02
http://folding.stanford.edu
################################################## #############################
################################################## #############################
Launch directory: /home/jeff/folding/FAH
Executable: ./fah6
Arguments: -smp -verbosity 9
[13:06:09] - Ask before connecting: No
[13:06:09] - User name: Ned_Clocker (Team 32)
[13:06:09] - Machine ID: 1
[13:06:09]
[13:06:09] Loaded queue successfully.
[13:06:09]
[13:06:09] + Processing work unit
[13:06:09] Core required: FahCore_a2.exe
[13:06:09] Core found.
[13:06:09] - Autosending finished units...
[13:06:09] Trying to send all finished work units
[13:06:09] + Attempting to send results
[13:06:09] - Reading file work/wuresults_05.dat from core
[13:06:09] Working on Unit 06 [August 26 13:06:09]
[13:06:10] + Working ...
[13:06:10] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 5331 -version 602'
[13:06:10] (Read 26785723 bytes from disk)
[13:06:10] Connecting to http://171.64.65.56:8080/
[13:06:10]
[13:06:10] *------------------------------*
[13:06:10] Folding@Home Gromacs SMP Core
[13:06:10] Version 1.91 (2007)
[13:06:10]
[13:06:10] Preparing to commence simulation
[13:06:10] - Ensuring status. Please wait.
[13:06:17] - Couldn't send HTTP request to server
[13:06:17] + Could not connect to Work Server (results)
[13:06:17] (171.64.65.56:8080)
[13:06:17] - Error: Could not transmit unit 05 (completed August 26) to work server.
[13:06:17] - 6 failed uploads of this unit.
[13:06:17] + Attempting to send results
[13:06:17] - Reading file work/wuresults_05.dat from core
[13:06:17] (Read 26785723 bytes from disk)
[13:06:17] Connecting to http://171.64.122.86:8080/
[13:06:18] - Couldn't send HTTP request to server
[13:06:18] + Could not connect to Work Server (results)
[13:06:18] (171.64.122.86:8080)
[13:06:18] Could not transmit unit 05 to Collection server; keeping in queue.
[13:06:18] + Sent 0 of 1 completed units to the server
[13:06:18] - Autosend completed
[13:06:27] - Looking at optimizations...
[13:06:27] - Working with standard loops on this execution.
[13:06:27] - Previous termination of core was improper.
[13:06:27] - Going to use standard loops.
[13:06:27] - Files status OK
[13:06:27] Error: Work unit read from disk is invalid
[13:06:27] Finalizing output
[13:06:30] - Expanded 5000848 -> 24742709 (decompressed 494.7 percent)
[13:06:30]
[13:06:30] Project: 2662 (Run 1, Clone 453, Gen 6)
[13:06:30]
[13:06:31] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=ned-desktop-VM1
NNODES=4, MYRANK=1, HOSTNAME=ned-desktop-VM1
NNODES=4, MYRANK=2, HOSTNAME=ned-desktop-VM1
NNODES=4, MYRANK=3, HOSTNAME=ned-desktop-VM1
NODEID=2 argc=14
NODEID=0 argc=14
NODEID=3 argc=14
NODEID=1 argc=14
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_20070720 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) mdrun (-:
[13:06:37] Will resume from checkpoint file
Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
starting mdrun 'HGG with glycosylations'
250000 steps, 500.0 ps.
[13:06:41] Resuming from checkpoint
old size= 71134 old_crc=4139799
-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_20070720
Source code file: md.c, line: 831
Fatal error:
Checkpoint error on step 1606260
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4
gcq#0: Thanx for Using GROMACS - Have a Nice Day
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[13:06:41] File work/wudata_06.log has changed since last checkpoint
[cli_1]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
[0]0:Return code = 255
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[13:06:48] CoreStatus = FF (255)
[13:06:48] Client-core communications error: ERROR 0xff
[13:06:48] Deleting current work unit & continuing...
After shutdown
[0]0:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[13:11:11] - Warning: Could not delete all work unit files (6): Core file absent
[13:11:11] Trying to send all finished work units
[13:11:11] + Attempting to send results
[13:11:11] - Reading file work/wuresults_05.dat from core
[13:11:12] (Read 26785723 bytes from disk)
[13:11:12] Connecting to http://171.64.65.56:8080/
[13:11:19] - Couldn't send HTTP request to server
[13:11:19] + Could not connect to Work Server (results)
[13:11:19] (171.64.65.56:8080)
[13:11:19] - Error: Could not transmit unit 05 (completed August 26) to work server.
[13:11:19] - 7 failed uploads of this unit.
[13:11:19] + Attempting to send results
[13:11:19] - Reading file work/wuresults_05.dat from core
[13:11:19] (Read 26785723 bytes from disk)
[13:11:19] Connecting to http://171.64.122.86:8080/
[13:11:20] - Couldn't send HTTP request to server
[13:11:20] + Could not connect to Work Server (results)
[13:11:20] (171.64.122.86:8080)
[13:11:20] Could not transmit unit 05 to Collection server; keeping in queue.
[13:11:20] + Sent 0 of 1 completed units to the server
[13:11:20] - Preparing to get new work unit...
[13:11:20] + Attempting to get work packet
[13:11:20] - Will indicate memory of 498 MB
[13:11:20] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 8
[13:11:20] - Connecting to assignment server
[13:11:20] Connecting to http://assign.stanford.edu:8080/
[13:11:20] Posted data.
[13:11:20] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[13:11:20] + News From Folding@Home: Welcome to Folding@Home
[13:11:21] Loaded queue successfully.
[13:11:21] Connecting to http://171.64.65.56:8080/
[13:11:25] Posted data.
[13:11:25] Initial: 0000; - Receiving payload (expected size: 4922117)
[13:11:37] - Downloaded at ~400 kB/s
[13:11:37] - Averaged speed for that direction ~280 kB/s
[13:11:37] + Received work.
[13:11:37] + Closed connections
[13:11:42]
[13:11:42] + Processing work unit
[13:11:42] Core required: FahCore_a2.exe
[13:11:42] Core found.
[13:11:43] Working on Unit 07 [August 26 13:11:43]
[13:11:43] + Working ...
[13:11:43] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 5331 -version 602'
[13:11:43]
[13:11:43] *------------------------------*
[13:11:43] Folding@Home Gromacs SMP Core
[13:11:43] Version 1.91 (2007)
[13:11:43]
[13:11:43] Preparing to commence simulation
[13:11:43] - Ensuring status. Please wait.
[13:12:00] - Looking at optimizations...
[13:12:00] - Working with standard loops on this execution.
[13:12:00] - Previous termination of core was improper.
[13:12:00] - Going to use standard loops.
[13:12:00] - Files status OK
[13:12:00] Error: Work unit read from disk is invalid
[13:12:00] Finalizing output
[13:12:01] (decompressed 494.9 percent)
[13:12:02]
[13:12:02] Project: 2662 (Run 2, Clone 189, Gen 21)
[13:12:02]
[13:12:02] 62 (Run 2, Clone 189, Gen 21)
[13:12:02]
[13:12:02] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=ned-desktop-VM1
NNODES=4, MYRANK=2, HOSTNAME=ned-desktop-VM1
NNODES=4, MYRANK=3, HOSTNAME=ned-desktop-VM1
NNODES=4, MYRANK=0, HOSTNAME=ned-desktop-VM1
NODEID=0 argc=14
NODEID=2 argc=14
NODEID=3 argc=14
NODEID=1 argc=14
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_20070720 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) mdrun (-:
Reading file work/wudata_07.tpr, VERSION 3.3.99_development_20070618 (single precision)
starting mdrun 'HGG in water'
249999 steps, 500.0 ps.
[13:12:13] Completed 0 out of 249999 steps (0%)
Delete the SMP a1 fahcore. I think there was a problem with forcing core updates with a1, and IIRC, the latest WUs need a v2.x fahcore.
Audioaficionado
08-25-08, 02:14 PM
Hi Ned
Make sure you have the latest core. I was having a similar issue until I deleted the old core and the client redownloaded the newest version of FahCore_a2.exe
You could also remove -advmethods and avoid FAHCore_a2 completely for the time being.
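For the core swap, a cautious version is to move the old core aside rather than delete it, so it can be put back if the fresh download misbehaves. This is only a sketch: `retire_core` is a made-up name, the default directory is the one seen in the logs above, and stopping the client before running it is my assumption, not something confirmed here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rename the core binary so the client has to fetch a new one.
# $1 = client directory (defaults to ~/folding/FAH, as seen in the logs above)
retire_core() {
    local dir="${1:-$HOME/folding/FAH}"
    local core="$dir/FahCore_a2.exe"
    if [ ! -f "$core" ]; then
        echo "no core found at $core" >&2
        return 1
    fi
    # Keep a timestamped copy instead of deleting outright
    mv -- "$core" "$core.old.$(date +%Y%m%d%H%M%S)"
}

# Example:
#   retire_core ~/folding/FAH
#   cd ~/folding/FAH && ./fah6 -smp -verbosity 9
```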
NedClocker
08-25-08, 09:44 PM
You could also remove -advmethods and avoid FAHCore_a2 completely for the time being.
How do you remove -advmethods?
NedClocker
08-25-08, 09:47 PM
Hi Ned
Make sure you have the latest core. I was having a similar issue until I deleted the old core and the client redownloaded the newest version of FahCore_a2.exe
I had to download the new client a couple of weeks ago. It was all working fine before that.
Can you delete that core in the middle of a WU?
WarriorII
08-25-08, 09:50 PM
I think you can look in the CFG file to change advmethods there.
And yes, you can change cores in the middle of a WU.
I have done it. (and it worked)
thideras
08-25-08, 09:50 PM
Run the config (change advanced options = yes), say no to the advmethods question.
NedClocker
08-25-08, 09:56 PM
Run the config (change advanced options = yes), say no to the advmethods question.
It's... been... so... long.... :D But I remember now.
Should I delete the core first?
WarriorII
08-25-08, 10:03 PM
doesn't matter.
NedClocker
08-25-08, 10:03 PM
I did both at the same time. I have been away too long.
NedClocker
08-25-08, 10:16 PM
:mad::mad::mad::mad::mad:
:bang head:bang head:bang head:bang head:bang head
harlam357
08-25-08, 10:37 PM
Uh oh... :eek: That doesn't look good... what's the haps after the core delete and advmethods removal Ned?
Audioaficionado
08-25-08, 11:25 PM
Ned, with the latest client in a fresh install, the latest core will be downloaded automatically, and this one is supposed to be able to update itself when needed, unlike the old one.
Just do a fresh start with an empty folder and './fah6 -smp -advmethods -verbosity 9' should work just fine.
I had some major hiccups until I just started over with a cleared out folder, fresh download of FAH6.02-Linux.tgz, untarred it and restarted the client.
Edit: Oh and I've been running in a root terminal just in case some silly permission issues come up. The worst thing to happen would be the destruction of my VM and I have freshly built spares just in case.
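The start-over steps could be scripted roughly like this. Treat it as a sketch under assumptions: `fresh_client` is a made-up name, the tarball is assumed to be downloaded already, and the old folder is renamed rather than wiped so that any finished unit still sitting in the queue is not thrown away.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set the old client folder aside and unpack a fresh copy.
# $1 = client directory, $2 = path to the already-downloaded tarball
#      (keep the tarball outside the client directory)
fresh_client() {
    local dir="$1" tarball="$2"
    [ -f "$tarball" ] || { echo "tarball not found: $tarball" >&2; return 1; }
    if [ -d "$dir" ]; then
        # Keep the old queue and work files under a timestamped name
        mv -- "$dir" "$dir.bak.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    mkdir -p -- "$dir" && tar -xzf "$tarball" -C "$dir"
}

# Example (the download location is an assumption):
#   fresh_client ~/folding/FAH ~/Downloads/FAH6.02-Linux.tgz
#   cd ~/folding/FAH && ./fah6 -smp -verbosity 9
```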
WarriorII
08-25-08, 11:35 PM
:confused:
that doesn't look good.
I wish Stanford would come up with a client they'd keep for more than 6 months.
Core A2 is now released to general FAH. Deleting -advmethods won't help anymore. I picked up 2 on one machine last night and one stalled immediately. I have a feeling that at least some of the problems users are experiencing are related to the latest Linux Kernel (Ubuntu 8.04). It just might be time to experiment with NotFred's distro on VMs and dedicated machines.
Audioaficionado
08-26-08, 10:11 AM
ChasR, I'm running the latest Debian testing (Lenny) in VMs and it's been running fine. I did a net install from the 40MB business card image.
NedClocker
08-26-08, 10:46 AM
Uh oh... :eek: That doesn't look good... what's the haps after the core delete and advmethods removal Ned?
Same thing. It failed to upload the completed unit and deleted the current unit and started over.
So, this time (this morning) I deleted the entire folder, downloaded the client again and started over, as recommended.
It's running. I'll let you know after it finishes a unit!
It just wouldn't be folding without a challenge or two, now would it. :)
That box is Ubuntu 6.10. I have several Ubuntu 7.10, also, but no 8.04.
It's happening on both 6.10 and 7.10 boxes, real and VM.
Audioaficionado
08-26-08, 02:03 PM
Build a Debian Lenny VM from scratch. It ain't that hard.
NedClocker
08-27-08, 03:57 PM
Build a Debian Lenny VM from scratch. It ain't that hard.
:) Shouldn't be any harder than building an Ubuntu VM, should it? :screwy:
Well, it completed a work unit. And, more importantly, IT ACTUALLY MANAGED TO SEND THE WU BACK TO STANFORD!!! :bday:
Today's point total will be much more in line with what I was making before I went on vacation.
Audioaficionado
08-27-08, 08:26 PM
:) Shouldn't be any harder than building an Ubuntu VM, should it? :screwy:
I've done both and it really isn't. The Debian Net installer is pretty good and uses a GUI with menu selections.
WarriorII
08-27-08, 08:31 PM
:) .....Well, it completed a work unit. And, more importantly, IT ACTUALLY MANAGED TO SEND THE WU BACK TO STANFORD!!! :bday:
Today's point total will be much more in line with what I was making before I went on vacation.
GREAT to hear it !!!!