Fixes for recent FAH server outage
May 7, 2015 by Vijay Pande
We recently ran into some problems with our assignment server (AS). The AS is responsible for distributing the computational power of Folding@home by sending clients to different work servers (WS), which in turn assign parts of the protein folding simulations to clients. In the interest of transparency, here’s what happened.
Two issues compounded to cause some clients to not get work assignments for many hours. The first problem is an issue we’ve run into before, where the AS exceeds the number of open files allowed by the operating system. When this happens, it continues to run but fails to assign work. To address this problem, our lead developer (Joseph Coffland) has added code to the AS that checks the maximum allowed open files at startup and increases the limit to the highest possible value. If the value is still too low, it prints a warning to the log file. This will help us ensure that our file limit settings are actually being respected.
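For readers curious what such a startup check can look like, here is a minimal sketch in Python using the standard `resource` module (Unix only). It is not the actual AS code: the function name `raise_open_file_limit`, the logger name, and the `min_required` threshold of 4096 are assumptions chosen purely for illustration.

```python
import logging
import resource

log = logging.getLogger("assign")  # hypothetical logger name


def raise_open_file_limit(min_required=4096):
    """Raise the soft open-file limit to the hard limit; warn if still low.

    min_required is an illustrative threshold, not a value used by the AS.
    Returns the soft limit in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError) as exc:
            log.warning("could not raise open-file limit: %s", exc)

    # Re-read the limit so we report what the OS actually granted.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < min_required:
        log.warning("open-file limit is %d, below the required %d",
                    soft, min_required)
    return soft
```

Re-reading the limit after the call, rather than trusting the requested value, is what lets the warning confirm whether the configured setting was actually honored.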
The second issue was that failover to our second AS (assign2) didn’t work for some clients. This was related to how we handle clients that cannot connect to port 8080 and WS that cannot receive connections on port 80. The folding client first attempts to connect to assign.stanford.edu on port 8080; if this fails, it tries assign2.stanford.edu on port 80. The AS assumes that connections on port 80 come from clients that don’t support connections to port 8080, and it only assigns them to WS that support port 80.
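The client-side connection order described above can be sketched roughly as follows. This is an illustration only, not the folding client's real code: the function name `connect_to_as`, the timeout, and the injectable `connect` parameter are assumptions. Only the two host/port pairs come from the description above.

```python
import socket

# Connection order as described above: primary AS on 8080, then assign2 on 80.
AS_ENDPOINTS = [
    ("assign.stanford.edu", 8080),
    ("assign2.stanford.edu", 80),
]


def connect_to_as(endpoints=AS_ENDPOINTS, connect=socket.create_connection,
                  timeout=10.0):
    """Try each assignment server in order and return the first that connects.

    Returns ((host, port), connection). The `connect` parameter is a
    hypothetical hook so the logic can be exercised without network access.
    """
    for host, port in endpoints:
        try:
            return (host, port), connect((host, port), timeout=timeout)
        except OSError:
            # Blocked port, refused connection, timeout, DNS failure: try the next one.
            continue
    raise ConnectionError("no assignment server reachable")
```

Note that a client landing on port 80 this way looks identical, from the AS side, to a client that genuinely cannot use port 8080. The AS cannot tell a firewall restriction from a failover.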
In a failover situation, this assumption is invalid. The result is that far fewer WS are available during a failover. To solve this problem, the AS was modified to prefer, rather than require, WS that support port 80 for connections on port 80. This change can cause client/WS port mismatches, but only when no better match is possible. Yes, it’s a tangled web.
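The difference between requiring and preferring a port match can be shown with a small sketch. The `WorkServer` type and both function names are hypothetical, and the real AS certainly weighs other factors when choosing a WS. The sketch only contrasts a hard filter with a ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkServer:
    """Hypothetical stand-in for a WS record: a name and the ports it accepts."""
    name: str
    ports: frozenset


def require_port(servers, client_port):
    """Old behavior (as described): drop every WS that lacks the client's port."""
    return [ws for ws in servers if client_port in ws.ports]


def prefer_port(servers, client_port):
    """New behavior (as described): rank matching WS first, keep the rest as fallback.

    sorted() is stable, so servers keep their original relative order
    within the matching and non-matching groups.
    """
    return sorted(servers, key=lambda ws: client_port not in ws.ports)
```

With a strict filter, a client arriving on port 80 during failover sees only the WS that accept port 80. With the ranking, those WS are still tried first, and a mismatched WS is offered only after the matching ones are exhausted.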
In addition to these changes, we have plans to implement an early warning system which should help to alert us to such situations sooner. We already get SMS notifications if the AS goes down but we need more thorough reporting for situations where the AS is alive but not assigning. This new notification system will be put in place in the next few months.
Thank you for your patience and for your ongoing contributions to Folding@home!