anyone seen this before?

dillinja666

Member
Joined
May 28, 2009
Location
San Diego
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147216 atoms, while the current system consists of 147096 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[07:05:00] CoreStatus = FF (255)
[07:05:00] Sending work to server
[07:05:00] Project: 2671 (Run 10, Clone 49, Gen 195)
[07:05:00] - Error: Could not get length of results file work/wuresults_03.dat
[07:05:00] - Error: Could not read unit 03 file. Removing from queue.
[07:05:00] Trying to send all finished work units

It did it a few times on Monday, so I turned my computer off for the day. Then tonight when I tried to start up again I got this error once, then it reconnected and started folding...
 

Bobnova

Senior Member
Joined
May 10, 2009
If it was a WU that was just downloaded, it was one of the borked WUs that the 2.10 folding core now just dumps.
The previous core ran them, but could only use one core and they didn't complete on time.

Nothing to worry about.


Now if it was part way through folding it and did that, you have issues.
 

orion456

Member
Joined
May 31, 2004
Fatal error:
Checkpoint file is for a system of 147216 atoms, while the current system consists of 147096 atoms...

This is history in the making. You have managed to create something out of nothing, a feat not repeated since the big bang - or was that the big whisper...
 
dillinja666

Member
Joined
May 28, 2009
Location
San Diego
If it was a WU that was just downloaded, it was one of the borked WUs that the 2.10 folding core now just dumps.
The previous core ran them, but could only use one core and they didn't complete on time.

Nothing to worry about.


Now if it was part way through folding it and did that, you have issues.

Nope, I had finished a WU and was attempting to get another when that happened. It happened this morning too when I received my first WU for the day, and again later while I was at work. After both times the next attempt received a WU. I was just curious what was going on.

This is history in the making. You have managed to create something out of nothing, a feat not repeated since the big bang - or was that the big whisper...

So that explains the little people inside my computer using my heatsink as their source of heat and my LED fans as their light source. I thought I cranked my OC up a little too high :attn:
 
dillinja666

Member
Joined
May 28, 2009
Location
San Diego
Great, well, I guess I was wrong. With only 46% of a WU done, it suddenly quit and gave me this error. I didn't notice my VM had failed until after it had scrolled the failed message past the top of my terminal. Now I just get this over and over again... what's going on?

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[01:10:29] CoreStatus = FF (255)
[01:10:29] Sending work to server
[01:10:29] Project: 2671 (Run 5, Clone 64, Gen 198)
[01:10:29] - Error: Could not get length of results file work/wuresults_03.dat
[01:10:29] - Error: Could not read unit 03 file. Removing from queue.
[01:10:29] Trying to send all finished work units
[01:10:29] + No unsent completed units remaining.
[01:10:29] - Preparing to get new work unit...
[01:10:29] + Attempting to get work packet
[01:10:29] - Will indicate memory of 988 MB
[01:10:29] - Connecting to assignment server
[01:10:29] Connecting to http://assign.stanford.edu:8080/
[01:10:29] Posted data.
[01:10:29] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:10:29] + News From Folding@Home: Welcome to Folding@Home
[01:10:29] Loaded queue successfully.
[01:10:29] Connecting to http://171.67.108.24:8080/
[01:10:35] Posted data.
[01:10:35] Initial: 0000; - Receiving payload (expected size: 4845381)
[01:11:09] - Downloaded at ~139 kB/s
[01:11:09] - Averaged speed for that direction ~168 kB/s
[01:11:09] + Received work.
[01:11:09] Trying to send all finished work units
[01:11:09] + No unsent completed units remaining.
[01:11:09] + Closed connections
[01:11:14]
[01:11:14] + Processing work unit
[01:11:14] Core required: FahCore_a2.exe
[01:11:14] Core found.
[01:11:14] Working on queue slot 04 [January 21 01:11:14 UTC]
[01:11:14] + Working ...
[01:11:14] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 15 -verbose -lifeline 11751 -version 624'

[01:11:14]
[01:11:14] *------------------------------*
[01:11:14] Folding@Home Gromacs SMP Core
[01:11:14] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[01:11:14]
[01:11:14] Preparing to commence simulation
[01:11:14] - Ensuring status. Please wait.
[01:11:24] - Looking at optimizations...
[01:11:24] - Working with standard loops on this execution.
[01:11:24] - Files status OK
[01:11:24] - Expanded 4844869 -> 24020689 (decompressed 495.7 percent)
[01:11:24] Called DecompressByteArray: compressed_data_size=4844869 data_size=24020689, decompressed_data_size=24020689 diff=0
[01:11:24] - Digital signature verified
[01:11:24]
[01:11:24] Project: 2671 (Run 5, Clone 64, Gen 198)
[01:11:24]
[01:11:25] Entering M.D.
[01:11:31] Using Gromacs checkpoints
NNODES=4, MYRANK=2, HOSTNAME=ubuntu
NNODES=4, MYRANK=0, HOSTNAME=ubuntu
NODEID=0 argc=23
NNODES=4, MYRANK=3, HOSTNAME=ubuntu
NODEID=2 argc=23
NNODES=4, MYRANK=1, HOSTNAME=ubuntu
NODEID=3 argc=23
NODEID=1 argc=23
Reading file work/wudata_04.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_04.cpt generated: Wed Jan 20 16:54:59 2010


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147270 atoms, while the current system consists of 146994 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[01:11:36] CoreStatus = FF (255)
[01:11:36] Sending work to server
[01:11:36] Project: 2671 (Run 5, Clone 64, Gen 198)
[01:11:36] - Error: Could not get length of results file work/wuresults_04.dat
[01:11:36] - Error: Could not read unit 04 file. Removing from queue.
[01:11:36] Trying to send all finished work units
[01:11:36] + No unsent completed units remaining.
[01:11:36] - Preparing to get new work unit...
[01:11:36] + Attempting to get work packet
[01:11:36] - Will indicate memory of 988 MB
[01:11:36] - Connecting to assignment server
[01:11:36] Connecting to http://assign.stanford.edu:8080/
[01:11:36] Posted data.
[01:11:36] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:11:36] + News From Folding@Home: Welcome to Folding@Home
[01:11:36] Loaded queue successfully.
[01:11:36] Connecting to http://171.67.108.24:8080/
[01:11:42] Posted data.
[01:11:42] Initial: 0000; - Receiving payload (expected size: 4845381)
[01:12:05] - Downloaded at ~205 kB/s
[01:12:05] - Averaged speed for that direction ~175 kB/s
[01:12:05] + Received work.
[01:12:05] Trying to send all finished work units
[01:12:05] + No unsent completed units remaining.
[01:12:05] + Closed connections
[01:12:10]
[01:12:10] + Processing work unit
[01:12:10] Core required: FahCore_a2.exe
[01:12:10] Core found.
[01:12:10] Working on queue slot 05 [January 21 01:12:10 UTC]
[01:12:10] + Working ...
[01:12:10] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -checkpoint 15 -verbose -lifeline 11751 -version 624'

[01:12:10]
[01:12:10] *------------------------------*
[01:12:10] Folding@Home Gromacs SMP Core
[01:12:10] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[01:12:10]
[01:12:10] Preparing to commence simulation
[01:12:10] - Ensuring status. Please wait.
[01:12:20] - Looking at optimizations...
[01:12:20] - Working with standard loops on this execution.
[01:12:20] - Files status OK
[01:12:21] - Expanded 4844869 -> 24020689 (decompressed 495.7 percent)
[01:12:21] Called DecompressByteArray: compressed_data_size=4844869 data_size=24020689, decompressed_data_size=24020689 diff=0
[01:12:21] - Digital signature verified
[01:12:21]
[01:12:21] Project: 2671 (Run 5, Clone 64, Gen 198)
[01:12:21]
[01:12:21] Entering M.D.
[01:12:27] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=ubuntu
NNODES=4, MYRANK=1, HOSTNAME=ubuntu
NNODES=4, MYRANK=3, HOSTNAME=ubuntu
NODEID=3 argc=23
NODEID=0 argc=23
NNODES=4, MYRANK=2, HOSTNAME=ubuntu
NODEID=2 argc=23
NODEID=1 argc=23
Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_05.cpt generated: Sun Jan 17 11:11:18 2010


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 146880 atoms, while the current system consists of 146994 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[01:12:32] CoreStatus = FF (255)
[01:12:32] Sending work to server
[01:12:32] Project: 2671 (Run 5, Clone 64, Gen 198)
[01:12:32] - Error: Could not get length of results file work/wuresults_05.dat
[01:12:32] - Error: Could not read unit 05 file. Removing from queue.
[01:12:32] Trying to send all finished work units
[01:12:32] + No unsent completed units remaining.
[01:12:32] - Preparing to get new work unit...
[01:12:32] + Attempting to get work packet
[01:12:32] - Will indicate memory of 988 MB
[01:12:32] - Connecting to assignment server
[01:12:32] Connecting to http://assign.stanford.edu:8080/
[01:12:33] Posted data.
[01:12:33] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:12:33] + News From Folding@Home: Welcome to Folding@Home
[01:12:33] Loaded queue successfully.
[01:12:33] Connecting to http://171.67.108.24:8080/
[01:12:33] Posted data.
[01:12:33] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[01:12:33] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[01:12:42] + Attempting to get work packet
[01:12:42] - Will indicate memory of 988 MB
[01:12:42] - Connecting to assignment server
[01:12:42] Connecting to http://assign.stanford.edu:8080/
[01:12:42] Posted data.
[01:12:42] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:12:42] + News From Folding@Home: Welcome to Folding@Home
[01:12:42] Loaded queue successfully.
[01:12:42] Connecting to http://171.67.108.24:8080/
[01:12:48] Posted data.
[01:12:48] Initial: 0000; - Receiving payload (expected size: 4839407)
[01:13:11] - Downloaded at ~205 kB/s
[01:13:11] - Averaged speed for that direction ~181 kB/s
[01:13:11] + Received work.
[01:13:11] Trying to send all finished work units
[01:13:11] + No unsent completed units remaining.
[01:13:11] + Closed connections
[01:13:16]
[01:13:16] + Processing work unit
[01:13:16] Core required: FahCore_a2.exe
[01:13:16] Core found.
[01:13:16] Working on queue slot 06 [January 21 01:13:16 UTC]
[01:13:16] + Working ...
[01:13:16] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 11751 -version 624'

[01:13:16]
[01:13:16] *------------------------------*
[01:13:16] Folding@Home Gromacs SMP Core
[01:13:16] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[01:13:16]
[01:13:16] Preparing to commence simulation
[01:13:16] - Ensuring status. Please wait.
[01:13:25] - Looking at optimizations...
[01:13:25] - Working with standard loops on this execution.
[01:13:25] - Files status OK
[01:13:26] - Expanded 4838895 -> 24035457 (decompressed 496.7 percent)
[01:13:26] Called DecompressByteArray: compressed_data_size=4838895 data_size=24035457, decompressed_data_size=24035457 diff=0
[01:13:26] - Digital signature verified
[01:13:26]
[01:13:26] Project: 2671 (Run 19, Clone 63, Gen 197)
[01:13:26]
[01:13:26] Entering M.D.
[01:13:32] Using Gromacs checkpoints
NNODES=4, MYRANK=1, HOSTNAME=ubuntu
NNODES=4, MYRANK=0, HOSTNAME=ubuntu
NODEID=0 argc=23
NNODES=4, MYRANK=2, HOSTNAME=ubuntu
NODEID=2 argc=23
NNODES=4, MYRANK=3, HOSTNAME=ubuntu
NODEID=3 argc=23
NODEID=1 argc=23
Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_06.cpt generated: Sun Jan 17 17:39:40 2010


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 146880 atoms, while the current system consists of 147090 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[01:13:38] CoreStatus = FF (255)
[01:13:38] Sending work to server
[01:13:38] Project: 2671 (Run 19, Clone 63, Gen 197)
[01:13:38] - Error: Could not get length of results file work/wuresults_06.dat
[01:13:38] - Error: Could not read unit 06 file. Removing from queue.
[01:13:38] Trying to send all finished work units
[01:13:38] + No unsent completed units remaining.
[01:13:38] - Preparing to get new work unit...
[01:13:38] + Attempting to get work packet
[01:13:38] - Will indicate memory of 988 MB
[01:13:38] - Connecting to assignment server
[01:13:38] Connecting to http://assign.stanford.edu:8080/
[01:13:38] Posted data.
[01:13:38] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:13:38] + News From Folding@Home: Welcome to Folding@Home
[01:13:39] Loaded queue successfully.
[01:13:39] Connecting to http://171.67.108.24:8080/
[01:13:44] Posted data.
[01:13:44] Initial: 0000; - Receiving payload (expected size: 4839407)
[01:14:07] - Downloaded at ~205 kB/s
[01:14:07] - Averaged speed for that direction ~186 kB/s
[01:14:07] + Received work.
[01:14:08] Trying to send all finished work units
[01:14:08] + No unsent completed units remaining.
[01:14:08] + Closed connections
[01:14:13]
[01:14:13] + Processing work unit
[01:14:13] Core required: FahCore_a2.exe
[01:14:13] Core found.
[01:14:13] Working on queue slot 07 [January 21 01:14:13 UTC]
[01:14:13] + Working ...
[01:14:13] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 11751 -version 624'

[01:14:13]
[01:14:13] *------------------------------*
[01:14:13] Folding@Home Gromacs SMP Core
[01:14:13] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[01:14:13]
[01:14:13] Preparing to commence simulation
[01:14:13] - Ensuring status. Please wait.
[01:14:22] - Looking at optimizations...
[01:14:22] - Working with standard loops on this execution.
[01:14:22] - Files status OK
[01:14:23] - Expanded 4838895 -> 24035457 (decompressed 496.7 percent)
[01:14:23] Called DecompressByteArray: compressed_data_size=4838895 data_size=24035457, decompressed_data_size=24035457 diff=0
[01:14:23] - Digital signature verified
[01:14:23]
[01:14:23] Project: 2671 (Run 19, Clone 63, Gen 197)
[01:14:23]
[01:14:23] Entering M.D.
[01:14:29] Using Gromacs checkpoints
NNODES=4, MYRANK=1, HOSTNAME=ubuntu
NNODES=4, MYRANK=2, HOSTNAME=ubuntu
NNODES=4, MYRANK=3, HOSTNAME=ubuntu
NNODES=4, MYRANK=0, HOSTNAME=ubuntu
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
Reading file work/wudata_07.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_07.cpt generated: Mon Jan 18 22:25:49 2010


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147183 atoms, while the current system consists of 147090 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[01:14:34] CoreStatus = FF (255)
[01:14:34] Sending work to server
[01:14:34] Project: 2671 (Run 19, Clone 63, Gen 197)
[01:14:34] - Error: Could not get length of results file work/wuresults_07.dat
[01:14:34] - Error: Could not read unit 07 file. Removing from queue.
[01:14:34] Trying to send all finished work units
[01:14:34] + No unsent completed units remaining.
[01:14:34] - Preparing to get new work unit...
[01:14:34] + Attempting to get work packet
[01:14:34] - Will indicate memory of 988 MB
[01:14:34] - Connecting to assignment server
[01:14:34] Connecting to http://assign.stanford.edu:8080/
[01:14:35] Posted data.
[01:14:35] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:14:35] + News From Folding@Home: Welcome to Folding@Home
[01:14:35] Loaded queue successfully.
[01:14:35] Connecting to http://171.67.108.24:8080/
[01:14:41] Posted data.
[01:14:41] Initial: 0000; - Receiving payload (expected size: 4839407)
[01:15:04] - Downloaded at ~205 kB/s
[01:15:04] - Averaged speed for that direction ~190 kB/s
[01:15:04] + Received work.
[01:15:04] Trying to send all finished work units
[01:15:04] + No unsent completed units remaining.
[01:15:04] + Closed connections
[01:15:09]
[01:15:09] + Processing work unit
[01:15:09] Core required: FahCore_a2.exe
[01:15:09] Core found.
[01:15:09] Working on queue slot 08 [January 21 01:15:09 UTC]
[01:15:09] + Working ...
[01:15:09] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 11751 -version 624'

[01:15:09]
[01:15:09] *------------------------------*
[01:15:09] Folding@Home Gromacs SMP Core
[01:15:09] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[01:15:09]
[01:15:09] Preparing to commence simulation
[01:15:09] - Ensuring status. Please wait.
[01:15:18] - Looking at optimizations...
[01:15:18] - Working with standard loops on this execution.
[01:15:18] - Files status OK
[01:15:19] - Expanded 4838895 -> 24035457 (decompressed 496.7 percent)
[01:15:19] Called DecompressByteArray: compressed_data_size=4838895 data_size=24035457, decompressed_data_size=24035457 diff=0
[01:15:19] - Digital signature verified
[01:15:19]
[01:15:19] Project: 2671 (Run 19, Clone 63, Gen 197)
[01:15:19]
[01:15:19] Entering M.D.
[01:15:25] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=ubuntu
NNODES=4, MYRANK=3, HOSTNAME=ubuntu
NNODES=4, MYRANK=1, HOSTNAME=ubuntu
NNODES=4, MYRANK=2, HOSTNAME=ubuntu
NODEID=0 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
NODEID=1 argc=23
Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_08.cpt generated: Thu Jan 14 15:47:36 2010


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147105 atoms, while the current system consists of 147090 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[01:15:30] CoreStatus = FF (255)
[01:15:30] Sending work to server
[01:15:30] Project: 2671 (Run 19, Clone 63, Gen 197)
[01:15:30] - Error: Could not get length of results file work/wuresults_08.dat
[01:15:30] - Error: Could not read unit 08 file. Removing from queue.
[01:15:30] Trying to send all finished work units
[01:15:30] + No unsent completed units remaining.
[01:15:30] - Preparing to get new work unit...
[01:15:30] + Attempting to get work packet
[01:15:30] - Will indicate memory of 988 MB
[01:15:30] - Connecting to assignment server
[01:15:30] Connecting to http://assign.stanford.edu:8080/
[01:15:31] Posted data.
[01:15:31] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:15:31] + News From Folding@Home: Welcome to Folding@Home
[01:15:31] Loaded queue successfully.
[01:15:31] Connecting to http://171.67.108.24:8080/
[01:15:31] Posted data.
[01:15:31] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[01:15:32] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
 

Bobnova

Senior Member
Joined
May 10, 2009
I think you might have set a post length record :D

Aside from that, you seem to have found a server that is trying desperately to get its screwed-up WUs calculated.
Failing at 46% is a different issue, I suspect. The errors in the log you posted are classic borked-WU stuff, nothing on your end.
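
For anyone who would rather pull these mismatches out of a long log than scroll through it, here is a rough Python sketch. It assumes the log lines look exactly like the ones quoted in this thread. The function name `find_checkpoint_mismatches` and the regexes are made up for illustration and are not part of the Folding@home client or GROMACS.

```python
import re

# Patterns copied from the log lines quoted in this thread (assumed format).
SLOT_RE = re.compile(r"Working on queue slot (\d+)")
CPT_RE = re.compile(r"Reading checkpoint file (\S+) generated: (.+)")
ATOMS_RE = re.compile(
    r"Checkpoint file is for a system of (\d+) atoms, "
    r"while the current system consists of (\d+) atoms"
)


def find_checkpoint_mismatches(log_text):
    """Return one dict per atom-count mismatch found in the log text.

    Hypothetical helper: it only remembers the most recent queue slot and
    checkpoint line seen before each fatal error.
    """
    results = []
    slot = None
    cpt_file = None
    cpt_date = None
    for line in log_text.splitlines():
        match = SLOT_RE.search(line)
        if match:
            # A new work unit started; forget the previous checkpoint info.
            slot = match.group(1)
            cpt_file = None
            cpt_date = None
            continue
        match = CPT_RE.search(line)
        if match:
            cpt_file = match.group(1)
            cpt_date = match.group(2).strip()
            continue
        match = ATOMS_RE.search(line)
        if match:
            checkpoint_atoms = int(match.group(1))
            system_atoms = int(match.group(2))
            results.append({
                "slot": slot,
                "checkpoint_file": cpt_file,
                "checkpoint_date": cpt_date,
                "checkpoint_atoms": checkpoint_atoms,
                "system_atoms": system_atoms,
                "difference": checkpoint_atoms - system_atoms,
            })
    return results
```

To use it, read the client log into a string (the file name and location depend on your install, so treat any path as an assumption) and print each returned dict. Several of the checkpoints in the log above are dated days before the January 21 slot start times, and having the date next to the atom counts makes that easy to spot.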
 
dillinja666

Member
Joined
May 28, 2009
Location
San Diego
Well, what about the failing WU? My OC is stable; it's been that way for a month now. I can't get the 4.1 stable, so I'm back to 4.0.