The machine that is putting out most of the points for me at the moment is a 24-core workstation (not currently needed for its usual calculations, so I'm putting it to good use for FaH). However, it stops every couple of days with this error:
Code:
06:09:25:WU00:FS00:0xa7:*********************** Log Started 2020-07-10T06:09:24Z ***********************
06:09:25:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
06:09:25:WU00:FS00:0xa7: Type: 0xa7
06:09:25:WU00:FS00:0xa7: Core: Gromacs
06:09:25:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 706 -lifeline 763228 -checkpoint 15 -np
06:09:25:WU00:FS00:0xa7: 24
06:09:25:WU00:FS00:0xa7:************************************ CBang *************************************
06:09:25:WU00:FS00:0xa7: Date: Nov 5 2019
06:09:25:WU00:FS00:0xa7: Time: 06:06:57
06:09:25:WU00:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
06:09:25:WU00:FS00:0xa7: Branch: master
06:09:25:WU00:FS00:0xa7: Compiler: GNU 8.3.0
06:09:25:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
06:09:25:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
06:09:25:WU00:FS00:0xa7: Bits: 64
06:09:25:WU00:FS00:0xa7: Mode: Release
06:09:25:WU00:FS00:0xa7:************************************ System ************************************
06:09:25:WU00:FS00:0xa7: CPU: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
06:09:25:WU00:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
06:09:25:WU00:FS00:0xa7: CPUs: 24
06:09:25:WU00:FS00:0xa7: Memory: 93.08GiB
06:09:25:WU00:FS00:0xa7:Free Memory: 84.91GiB
06:09:25:WU00:FS00:0xa7: Threads: POSIX_THREADS
06:09:25:WU00:FS00:0xa7: OS Version: 5.4
06:09:25:WU00:FS00:0xa7:Has Battery: false
06:09:25:WU00:FS00:0xa7: On Battery: false
06:09:25:WU00:FS00:0xa7: UTC Offset: 1
06:09:25:WU00:FS00:0xa7: PID: 763232
06:09:25:WU00:FS00:0xa7: CWD: /var/lib/fahclient/work
06:09:25:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
06:09:25:WU00:FS00:0xa7: Version: 0.0.18
06:09:25:WU00:FS00:0xa7: Author: Joseph Coffland <[email protected]>
06:09:25:WU00:FS00:0xa7: Copyright: 2019 foldingathome.org
06:09:25:WU00:FS00:0xa7: Homepage: https://foldingathome.org/
06:09:25:WU00:FS00:0xa7: Date: Nov 5 2019
06:09:25:WU00:FS00:0xa7: Time: 06:13:26
06:09:25:WU00:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
06:09:25:WU00:FS00:0xa7: Branch: master
06:09:25:WU00:FS00:0xa7: Compiler: GNU 8.3.0
06:09:25:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
06:09:25:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
06:09:25:WU00:FS00:0xa7: Bits: 64
06:09:25:WU00:FS00:0xa7: Mode: Release
06:09:25:WU00:FS00:0xa7:************************************ Build *************************************
06:09:25:WU00:FS00:0xa7: SIMD: avx_256
06:09:25:WU00:FS00:0xa7:********************************************************************************
06:09:25:WU00:FS00:0xa7:Project: 16452 (Run 50, Clone 2, Gen 172)
06:09:25:WU00:FS00:0xa7:Unit: 0x000000b5038949075ee14c165a9a0dd3
06:09:25:WU00:FS00:0xa7:Reading tar file core.xml
06:09:25:WU00:FS00:0xa7:Reading tar file frame172.tpr
06:09:25:WU00:FS00:0xa7:Digital signatures verified
06:09:25:WU00:FS00:0xa7:Calling: mdrun -s frame172.tpr -o frame172.trr -x frame172.xtc -cpt 15 -nt 24
06:09:25:WU00:FS00:0xa7:Steps: first=86000000 total=500000
06:09:25:WU00:FS00:0xa7:ERROR:
06:09:25:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
06:09:25:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
06:09:25:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
06:09:25:WU00:FS00:0xa7:ERROR:
06:09:25:WU00:FS00:0xa7:ERROR:Fatal error:
06:09:25:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
06:09:25:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
06:09:25:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
06:09:25:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
06:09:25:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
06:09:25:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
06:09:30:WU00:FS00:0xa7:WARNING:Unexpected exit() call
06:09:30:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
06:09:30:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
06:09:30:WU00:FS00:0xa7:Saving result file md.log
06:09:30:WU00:FS00:0xa7:Saving result file science.log
06:09:30:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
When I first set FaH up, it used 23 cores by default, which Gromacs then reduced to 21 to avoid decomposition by a large prime. I forget how, but I managed to force it to use all 24, and it worked fine for weeks. Now, however, it's misbehaving ...
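To illustrate the "large prime" point, here is a minimal Python sketch that prints the largest prime factor of a few thread counts. It is only my own illustration of why 23 is an awkward count and 21 or 24 are not; `largest_prime_factor` is a hypothetical helper, not the logic FaH or Gromacs actually uses to choose the count.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (for n >= 2) by trial division."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    # Whatever is left above 1 is itself a prime factor, and the largest one.
    if n > 1:
        largest = n
    return largest


if __name__ == "__main__":
    # Counts that appear in this post: 23 (default), 21 (reduced),
    # 24 (forced), 20 (ranks named in the error message).
    for count in (20, 21, 23, 24):
        print(count, "-> largest prime factor", largest_prime_factor(count))
```

23 is itself prime, so it can only be split as 23 x 1 x 1, whereas 21 = 3 x 7 and 24 = 2 x 2 x 2 x 3 leave more ways to divide the box into a grid. Note that the error above mentions 20 ranks even though mdrun was called with `-nt 24`.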
Any ideas for how to get around this? I did a bit of Googling and didn't come up with much that was helpful.
I check my stats a couple of times per day, so I usually lose hours of production each time this happens. To recover, I have to connect to a VPN (the box is in my office at work), ssh in, stop the service, delete the work folder, and restart the service. This suggests to me that specific work units are causing the problem.
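To test the idea that specific work units are responsible, something like the following Python sketch could scan the client log and list which project/run/clone/gen each domain decomposition failure belongs to. The two regular expressions are based only on the log lines pasted above; the function name `find_dd_failures` and the log location in the usage note are my assumptions, not anything provided by FaH.

```python
import re

# Patterns taken from the log excerpt in this post.
PROJECT = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
DD_ERROR = re.compile(r"There is no domain decomposition for (\d+) ranks")


def find_dd_failures(log_text):
    """Return a list of ((project, run, clone, gen), ranks) tuples.

    Each entry pairs a domain decomposition error with the most recent
    'Project:' line seen before it (None if no such line was seen).
    """
    failures = []
    current_unit = None
    for line in log_text.splitlines():
        match = PROJECT.search(line)
        if match:
            current_unit = tuple(int(g) for g in match.groups())
            continue
        match = DD_ERROR.search(line)
        if match:
            failures.append((current_unit, int(match.group(1))))
    return failures
```

Usage would be along the lines of `find_dd_failures(open(path).read())`, where `path` points at the client log (I assume it sits near `/var/lib/fahclient/work`, but have not checked). If the same project keeps showing up, that would support the work-unit theory.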