
Recurring issue: "No domain decomposition"


David

Forums Super Moderator
Joined
Feb 20, 2001
The machine that is putting out most of the points for me at the moment is a 24c workstation (not currently in use for the usual calculations, so putting it to good use for FaH). However, it keeps stopping every couple of days with this error:

Code:
06:09:25:WU00:FS00:0xa7:*********************** Log Started 2020-07-10T06:09:24Z ***********************
06:09:25:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
06:09:25:WU00:FS00:0xa7:       Type: 0xa7
06:09:25:WU00:FS00:0xa7:       Core: Gromacs
06:09:25:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 763228 -checkpoint 15 -np
06:09:25:WU00:FS00:0xa7:             24
06:09:25:WU00:FS00:0xa7:************************************ CBang *************************************
06:09:25:WU00:FS00:0xa7:       Date: Nov 5 2019
06:09:25:WU00:FS00:0xa7:       Time: 06:06:57
06:09:25:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
06:09:25:WU00:FS00:0xa7:     Branch: master
06:09:25:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
06:09:25:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
06:09:25:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
06:09:25:WU00:FS00:0xa7:       Bits: 64
06:09:25:WU00:FS00:0xa7:       Mode: Release
06:09:25:WU00:FS00:0xa7:************************************ System ************************************
06:09:25:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
06:09:25:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
06:09:25:WU00:FS00:0xa7:       CPUs: 24
06:09:25:WU00:FS00:0xa7:     Memory: 93.08GiB
06:09:25:WU00:FS00:0xa7:Free Memory: 84.91GiB
06:09:25:WU00:FS00:0xa7:    Threads: POSIX_THREADS
06:09:25:WU00:FS00:0xa7: OS Version: 5.4
06:09:25:WU00:FS00:0xa7:Has Battery: false
06:09:25:WU00:FS00:0xa7: On Battery: false
06:09:25:WU00:FS00:0xa7: UTC Offset: 1
06:09:25:WU00:FS00:0xa7:        PID: 763232
06:09:25:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
06:09:25:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
06:09:25:WU00:FS00:0xa7:    Version: 0.0.18
06:09:25:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
06:09:25:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
06:09:25:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
06:09:25:WU00:FS00:0xa7:       Date: Nov 5 2019
06:09:25:WU00:FS00:0xa7:       Time: 06:13:26
06:09:25:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
06:09:25:WU00:FS00:0xa7:     Branch: master
06:09:25:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
06:09:25:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
06:09:25:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
06:09:25:WU00:FS00:0xa7:       Bits: 64
06:09:25:WU00:FS00:0xa7:       Mode: Release
06:09:25:WU00:FS00:0xa7:************************************ Build *************************************
06:09:25:WU00:FS00:0xa7:       SIMD: avx_256
06:09:25:WU00:FS00:0xa7:********************************************************************************
06:09:25:WU00:FS00:0xa7:Project: 16452 (Run 50, Clone 2, Gen 172)
06:09:25:WU00:FS00:0xa7:Unit: 0x000000b5038949075ee14c165a9a0dd3
06:09:25:WU00:FS00:0xa7:Reading tar file core.xml
06:09:25:WU00:FS00:0xa7:Reading tar file frame172.tpr
06:09:25:WU00:FS00:0xa7:Digital signatures verified
06:09:25:WU00:FS00:0xa7:Calling: mdrun -s frame172.tpr -o frame172.trr -x frame172.xtc -cpt 15 -nt 24
06:09:25:WU00:FS00:0xa7:Steps: first=86000000 total=500000
06:09:25:WU00:FS00:0xa7:ERROR:
06:09:25:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
06:09:25:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
06:09:25:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
06:09:25:WU00:FS00:0xa7:ERROR:
06:09:25:WU00:FS00:0xa7:ERROR:Fatal error:
06:09:25:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
06:09:25:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
06:09:25:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
06:09:25:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
06:09:25:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
06:09:25:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
06:09:30:WU00:FS00:0xa7:WARNING:Unexpected exit() call
06:09:30:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
06:09:30:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
06:09:30:WU00:FS00:0xa7:Saving result file md.log
06:09:30:WU00:FS00:0xa7:Saving result file science.log
06:09:30:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

When I first set FaH up it was using 23 cores by default, which Gromacs then kicks down to 21 cores to avoid decomposition by a large prime. I forget how, but I managed to force it to use all 24 and it worked fine for weeks. However, now it's misbehaving ...
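To illustrate the "large prime" point above, here is a minimal Python sketch that factors a few of the thread counts mentioned in this thread. It is only an illustration of why 23 is awkward to split into a grid while 21 and 24 are not; it is not GROMACS's actual decomposition logic, which also depends on the box size and the minimum cell size. The function name `prime_factors` is made up for this example.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with repetition."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        # Whatever is left over is itself prime.
        factors.append(n)
    return factors


# Thread/rank counts that appear in this thread.
for threads in (20, 21, 23, 24):
    factors = prime_factors(threads)
    print(threads, factors, "largest prime factor:", max(factors))
```

A count like 23 can only be split as 23 x 1 x 1, whereas 24 offers several grid shapes such as 4 x 3 x 2.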

Any ideas for how to get around this? I did a bit of Googling and didn't come up with much that was helpful.

I check my stats a couple of times per day, so when the slot stalls I usually lose hours of production at a time. I then have to connect to a VPN (the box is in my office at work), ssh in, stop the service, delete the work folder, and restart the service. After that it runs fine again, which suggests to me that specific work units are causing the problem.
 

David

Forums Super Moderator
Joined
Feb 20, 2001
Ok, so my initial googling didn't work so well, but I found this on the Folding@home forum: https://foldingforum.org/viewtopic.php?f=108&t=34821&p=330033&hilit=domain#p330033

Looks like my options at this point are to (i) live with it; (ii) set it to 21 CPUs manually and lose >10% of the performance; or (iii) run 2 x 16 + 1 x 8 instances, which will decrease PPD. I think I'll just have to deal with it and get used to logging into the machine at least once a day!

Once I'm back on site I might try to knock together some sort of hourly cron job/script that will ping me an email if this error pops up.
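A rough sketch of what such a check could look like in Python. The log path `/var/lib/fahclient/log.txt` is an assumption based on the working directory shown in the log above, and the function name `find_decomposition_errors` is made up for this example. The script only prints matches; if it is run from cron, cron will typically mail any output to the job's owner, provided mail delivery is set up on the machine.

```python
import re
from pathlib import Path

# Assumed log location (not confirmed in this thread); adjust for your install.
LOG_PATH = Path("/var/lib/fahclient/log.txt")

# Matches the fatal error text quoted in the log above.
PATTERN = re.compile(r"There is no domain decomposition for (\d+) ranks")


def find_decomposition_errors(log_text: str) -> list[tuple[str, int]]:
    """Return (timestamp, ranks) for each domain decomposition error line."""
    hits = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            timestamp = line[:8]  # log lines start with HH:MM:SS
            hits.append((timestamp, int(match.group(1))))
    return hits


if __name__ == "__main__":
    if LOG_PATH.exists():
        text = LOG_PATH.read_text(errors="replace")
        for timestamp, ranks in find_decomposition_errors(text):
            print(f"{timestamp}: no domain decomposition for {ranks} ranks")
```

Note that this reports every match in the current log on each run, so an hourly job would keep repeating old errors until the log rotates; remembering the last reported line would be a natural refinement.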
 

WhitehawkEQ

Premium Member
Joined
Dec 6, 2010
Why are you using odd #cores? Folding@home has never liked odd #cores. What version of the V7 Folding software are you using? 32 cores is max the current software will allow per folding slot.
There is V7.1.52 on the Folding@home forums in the archives that can allow up to 64 cores but I know you have 24 so no problem there.
 

David

Forums Super Moderator
Joined
Feb 20, 2001
At the moment I am using 7.6.13 set for 24 cores.
 

HayesK

Member
Joined
Oct 11, 2008
The decomp issue is related to the project setup; thread counts with known decomp issues can be excluded by the project owner.

Each 6136 CPU in your Linux host has 12 real cores. For dual 6136 CPUs, a single CPU:24 slot should be good, but you could also try two CPU:12 slots. If hyper-threading is enabled, you could try a single CPU:48 slot, two CPU:24 slots, three CPU:16 slots or four CPU:12 slots. Lots of options are possible, but you may want to leave a few threads available for OS and background activity. Three CPU:12 slots would keep all the real cores working, even if one of the slots was hung for some reason.
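The slot layouts listed above can be enumerated with a short Python sketch. It only lists ways to divide a thread count into equal-sized slots; it says nothing about which layouts a given project will actually accept. The function name `equal_slot_layouts` and the `min_per_slot` cutoff are made up for this example.

```python
def equal_slot_layouts(total_threads: int, min_per_slot: int = 8) -> list[tuple[int, int]]:
    """List (number_of_slots, threads_per_slot) pairs that divide evenly.

    min_per_slot is a hypothetical cutoff to skip layouts with tiny slots.
    """
    layouts = []
    for slots in range(1, total_threads + 1):
        if total_threads % slots == 0:
            per_slot = total_threads // slots
            if per_slot >= min_per_slot:
                layouts.append((slots, per_slot))
    return layouts


# 24 real cores without hyper-threading, 48 threads with it.
print(equal_slot_layouts(24, min_per_slot=12))
print(equal_slot_layouts(48, min_per_slot=12))
```

Unequal splits, such as the 2 x 16 + 1 x 8 arrangement mentioned earlier in the thread, are not covered by this sketch.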

edit: another thread at Folding Forum about decomps.
Re: Project: 17201 Domain Decomposition Errors
https://foldingforum.org/viewtopic.php?f=19&t=35748&view=unread#p339464
 

David

Forums Super Moderator
Joined
Feb 20, 2001
Thanks - will take a look.

I can't enable HT because another application (that this machine was bought for) takes a massive hit if HT is enabled.