
How to fix?


CJ-5

Member
Joined
Jul 27, 2005
This has been happening nearly every time I finish a wu.

Code:
[22:57:21] + Processing work unit
[22:57:21] Core required: FahCore_a2.exe
[22:57:21] Core found.
[22:57:21] Working on Unit 02 [November 14 22:57:21]
[22:57:21] + Working ...
[22:57:21] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 30 -verbose -lifeline 13472 -version 602'

[22:57:21] 
[22:57:21] *------------------------------*
[22:57:21] Folding@Home Gromacs SMP Core
[22:57:21] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:57:21] 
[22:57:21] Preparing to commence simulation
[22:57:21] - Ensuring status. Please wait.
[22:57:22] Called DecompressByteArray: compressed_data_size=4843329 data_size=24004881, decompressed_data_size=24004881 diff=0
[22:57:22] - Digital signature verified
[22:57:22] 
[22:57:22] Project: 2671 (Run 3, Clone 91, Gen 120)
[22:57:22] 
[22:57:22] Assembly optimizations on if available.
[22:57:22] Entering M.D.
[22:57:32] Run 3, Clone 91, Gen 120)
[22:57:32] 
[22:57:32] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=Ubuntu
NNODES=4, MYRANK=1, HOSTNAME=Ubuntu
NNODES=4, MYRANK=2, HOSTNAME=Ubuntu
NNODES=4, MYRANK=3, HOSTNAME=Ubuntu
NODEID=0 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
NODEID=1 argc=20
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
30250000 steps,  60500.0 ps (continuing from step 30000000,  60000.0 ps).

t = 60000.003 ps: Water molecule starting at atom 65638 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 60000.005 ps: Water molecule starting at atom 42760 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 60000.005 ps: Water molecule starting at atom 127297 can not be settled.
Check for bad contacts and/or reduce the timestep.
[22:57:40] Folding@home Core Shutdown: INTERRUPTED
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0

I have to delete the wu, queue and restart. Then it will start folding and finish a new wu but always ends like you see in the log above. It's been doing this for several days now.

The OS is Ubuntu 8.04 with all updates. The client is SMP version 6.02. The computer is an Asus P5W DH Deluxe with a Q6600 @ 3.0GHz.

Is this something I can fix or should I just reinstall?:shrug:
 
OP

CJ-5

Member
Joined
Jul 27, 2005
Memory failed memtest. I've had nothing but problems with this memory (pc2 8500, Crucial Ballistix). It's been replaced under warranty every 6 - 9 months almost like clockwork since they first came out. I've always run them at stock voltage. Crucial sent me a lower voltage replacement set after the first two but they just keep failing. I suppose you could say they are reliable in some sense because they fail at predictable intervals. I have two sets of these and they both suck.

You would think after 2 or 3 years of this that Crucial would offer me something that worked and save themselves the hassle and expense of constant RMAs.
 

Adak

Senior Member
Joined
Jan 9, 2006
What a PITA!

Sounds like they're sending you iffy sticks that barely did (or didn't quite) pass QA testing.

Good luck with the next ones.
 

torin3

Member
Joined
Dec 25, 2004
I've had Ballistix 8500s go bad on me often enough that I finally switched to G-Skill 8500s. I haven't had a memory problem in over a year now. That would be my advice. Good luck! :thup:
 
OP

CJ-5

Member
Joined
Jul 27, 2005
I dropped in some Kingston value ram pc2 4200. I don't think it will hold the overclock. I'm a little surprised it even works in this MB. Just keeping the rig running at any speed is important when you are trying to stay on page 1 with smp, considering all of the power gpu folders around here.

Sorry everyone about my rant, going ballistic on Ballistix, but I do feel better now at least.:D
 

Adak

Senior Member
Joined
Jan 9, 2006
A good product or company should be lauded in the forums, and a poor one should be exposed as such. I've had nothing but good results from Crucial, including RAM for my i7, but that's a grand total of maybe half a dozen purchases, over the years. Hardly statistically significant.

Going ballistic on Ballistix! I like that turn of phrase. :D
 
OP

CJ-5

Member
Joined
Jul 27, 2005
I've never had a problem with any other Crucial ram but the Ballistix pc2 8500. In all fairness to Crucial I have other Crucial/Micron ram that has run for more than 10 years and is still working.

Crucial knows they had a problem with the original pc2 8500 series Ballistix that ran stock at 2.2 volts. They admitted this to me over the phone when they replaced mine with a newer version that runs at 2.0 volts. This will now be my 3rd replacement of the newer version... if that's what they choose to do.
 

harlam357

Senior Fold-a-holic
Joined
Sep 22, 2004
The "first ones" you speak of are D9 based... I had the same issue with mine. I actually ran them at 2.1v and 900MHz (much lower than spec in both categories) and they still died on me in less than a year. I got the 2.0v stuff back in RMA and sold them.

I have sets of PC8000 from G.Skill running in 3 of my rigs... they've been great. Don't see them listed on Newegg anymore... but I've actually got a set of these for backup purposes.

http://www.newegg.com/Product/Product.aspx?Item=N82E16820231166
 
OP

CJ-5

Member
Joined
Jul 27, 2005
It appears that bad memory is not the only problem. After finishing a wu with good memory, I still get this:

Code:
[14:48:58] Thank you for your contribution to Folding@Home
[14:48:58] + Number of Units Completed: 34

[0]3:Return code = 0, signaled with Quit
[14:49:05] - Warning: Could not delete all work unit files (1): Core file absent
[14:49:05] Trying to send all finished work units
[14:49:05] + No unsent completed units remaining.
[14:49:05] - Preparing to get new work unit...
[14:49:05] + Attempting to get work packet
[14:49:05] - Will indicate memory of 2014 MB
[14:49:05] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 7
[14:49:05] - Connecting to assignment server
[14:49:05] Connecting to http://assign.stanford.edu:8080/
[14:49:05] Posted data.
[14:49:05] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[14:49:05] + News From Folding@Home: Welcome to Folding@Home
[14:49:05] Loaded queue successfully.
[14:49:05] Connecting to http://171.67.108.24:8080/
[14:49:11] Posted data.
[14:49:11] Initial: 0000; - Receiving payload (expected size: 4843841)
[14:49:16] - Downloaded at ~946 kB/s
[14:49:16] - Averaged speed for that direction ~810 kB/s
[14:49:16] + Received work.
[14:49:16] Trying to send all finished work units
[14:49:16] + No unsent completed units remaining.
[14:49:16] + Closed connections
[14:49:16] 
[14:49:16] + Processing work unit
[14:49:16] Core required: FahCore_a2.exe
[14:49:16] Core found.
[14:49:16] Working on Unit 02 [November 15 14:49:16]
[14:49:16] + Working ...
[14:49:16] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 30 -verbose -lifeline 7708 -version 602'

[14:49:16] 
[14:49:16] *------------------------------*
[14:49:16] Folding@Home Gromacs SMP Core
[14:49:16] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[14:49:16] 
[14:49:16] Preparing to commence simulation
[14:49:16] - Ensuring status. Please wait.
[14:49:17] Called DecompressByteArray: compressed_data_size=4843329 data_size=24004881, decompressed_data_size=24004881 diff=0
[14:49:17] - Digital signature verified
[14:49:17] 
[14:49:17] Project: 2671 (Run 3, Clone 91, Gen 120)
[14:49:17] 
[14:49:17] Assembly optimizations on if available.
[14:49:17] Entering M.D.
[14:49:26] Run 3, Clone 91, Gen 120)
[14:49:26] 
[14:49:26] Entering M.D.
NNODES=4, MYRANK=3, HOSTNAME=Ubuntu
NNODES=4, MYRANK=0, HOSTNAME=Ubuntu
NODEID=0 argc=20
NNODES=4, MYRANK=1, HOSTNAME=Ubuntu
NODEID=1 argc=20
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=2, HOSTNAME=Ubuntu
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
30250000 steps,  60500.0 ps (continuing from step 30000000,  60000.0 ps).

t = 60000.003 ps: Water molecule starting at atom 65638 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 60000.005 ps: Water molecule starting at atom 42760 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 60000.005 ps: Water molecule starting at atom 127297 can not be settled.
Check for bad contacts and/or reduce the timestep.
[14:49:36] Folding@home Core Shutdown: INTERRUPTED
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[15:49:50] - Autosending finished units...
[15:49:50] Trying to send all finished work units
[15:49:50] + No unsent completed units remaining.
[15:49:50] - Autosend completed
[21:49:50] - Autosending finished units...
[21:49:50] Trying to send all finished work units
[21:49:50] + No unsent completed units remaining.
[21:49:50] - Autosend completed

I'm going to install a new HD tonight or in the morning. I've re-installed the program several times over the past several weeks but the problem comes back after a couple of days of folding.
 

harlam357

Senior Fold-a-holic
Joined
Sep 22, 2004
I ran into a couple WUs that did this recently... on different machines, so I doubt it was a machine issue.

You can try deleting the work folder, queue.dat, and machinedependent.dat files to see if a different WU solves the problem.
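For anyone who wants to script that cleanup, here is a minimal Python sketch. It assumes the three items sit directly in the client directory, with the work folder named `work` as in the `-dir work/` argument in the logs above. The function name `reset_client_state` is made up for illustration, and stopping the client before running it is my assumption, not something the client documents here.

```python
import shutil
from pathlib import Path


def reset_client_state(client_dir):
    """Delete the work folder, queue.dat and machinedependent.dat.

    client_dir is the directory the client runs from (assumed layout).
    Returns the names that were actually found and removed.
    """
    client_dir = Path(client_dir)
    removed = []
    for name in ("work", "queue.dat", "machinedependent.dat"):
        target = client_dir / name
        if target.is_dir():
            # The work folder is a directory, so remove it recursively.
            shutil.rmtree(target)
            removed.append(name)
        elif target.exists():
            # queue.dat and machinedependent.dat are plain files.
            target.unlink()
            removed.append(name)
    return removed
```

Anything else in the directory is left alone, and items that are already missing are skipped, so running it twice is harmless.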
 
OP

CJ-5

Member
Joined
Jul 27, 2005
Past experience says it's going to download a new wu, complete it, turn it in and then error out.

Since I'm not the only one with this issue, I'll try it again just to be sure now that I have changed the memory.

I can always hope for the best.
 

ChasR

Senior Member
Joined
Apr 12, 2004
Location
Atlanta
It's the same r/c/g (Project: 2671 (Run 3, Clone 91, Gen 120)) that errored in the first log, and it's failing at the beginning, not the end, of the WU. Delete the work dir, queue.dat and machinedependent.dat and you're almost guaranteed to get a different WU.
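A quick way to spot a repeat like this without eyeballing the log is to count how often each project/run/clone/gen line shows up. This is only a rough Python sketch: it assumes the log lines look like the `Project: 2671 (Run 3, Clone 91, Gen 120)` lines posted above, and `count_work_units` is a made-up helper name.

```python
import re
from collections import Counter

# Matches lines such as: "[22:57:22] Project: 2671 (Run 3, Clone 91, Gen 120)"
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def count_work_units(log_text):
    """Count how many times each (project, run, clone, gen) appears in the log text."""
    counts = Counter()
    for match in PRCG.finditer(log_text):
        counts[tuple(int(g) for g in match.groups())] += 1
    return counts
```

A work unit that appears more than once, each time followed by a core shutdown, points at that unit being reassigned rather than at a fresh failure on a new unit.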
 
OP

CJ-5

Member
Joined
Jul 27, 2005
ChasR: It's the same r/c/g (Project: 2671 (Run 3, Clone 91, Gen 120))

That's really interesting. I hadn't noticed that it was the same wu as in my first post. Last night I deleted the queue, work folder and unit info of the wu that caused the first error. It restarted and downloaded a new wu, as you said it would. It finished the new wu this morning and then downloaded Project: 2671 (Run 3, Clone 91, Gen 120) again.

I looked over the log file and here is the history of it since this started a couple of days ago.

I finished a wu and downloaded Project 2671 (Run 3, Clone 91, Gen 120)...It failed

I restarted after deleting the wu, queue, and unitinfo. It downloaded a new wu and completed it...then downloaded Project 2671 (Run 3, Clone 91, Gen 120) again which failed.

I restarted the failed wu and, of course, it failed again.

I deleted the wu, queue, and unitinfo and downloaded a different wu. It completed that wu and downloaded another Project 2671 (Run 3, Clone 91, Gen 120) which failed.

I restarted the failed wu and it failed again.

I checked the memory, which failed memtest, and replaced it with good ram.

I deleted the failed wu work folder, queue and unitinfo and downloaded a new wu. It finished that wu then downloaded Project 2671 (Run 3, Clone 91, Gen 120) again...which failed.

It seems that I'm being cursed with having to fold Project 2671 (Run 3, Clone 91, Gen 120) until I can complete it.

I've never deleted machinedependent.dat before but this seems like a good time to try it. I've re-installed smp before to solve issues but this seems much easier if it works.

I'll try it and see what happens in the morning after it's completed the newest wu.
 

Jolly-Swagman

Member
Joined
Sep 20, 2007
Yes, if you delete machinedependent.dat the Stanford servers will think a new computer has been added, which will almost guarantee a new WU.
 
OP

CJ-5

Member
Joined
Jul 27, 2005
The curse has been lifted.:clap:

Deleting machinedependent.dat seems to have worked. It finished the new wu and downloaded another new wu, not Project 2671 (Run 3, Clone 91, Gen 120).

I hope I never see that project again and I pity anyone else they send it to!

Thanks, everyone, for your help and advice.
 