
Fixes for recent FAH server outage


'Cuda340

Very Welcoming Senior, Premium Member #11
Joined
May 30, 2004
Location
Folding@home
FYI...

Fixes for recent FAH server outage
May 7, 2015 by Vijay Pande

We recently ran into some problems with our assignment server (AS). The AS is responsible for distributing the computational power of Folding@home by sending clients to different work servers (WS), which in turn assign parts of the protein folding simulations to clients. In the interest of transparency, here’s what happened.

Two issues compounded to cause some clients to not get work assignments for many hours. The first problem is an issue we’ve run into before where the AS exceeds the number of open files allowed by the operating system. When this happens it continues to run but fails to assign. To address this problem, our lead developer (Joseph Coffland) has added code to the AS which will check the maximum allowed open files at startup and increase the limit to the highest possible value. If the value is still too low it will print a warning to the log file. This will help us ensure that our file limit settings are actually being respected.
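The open-file fix described above comes down to raising the process's file-descriptor limit at startup and warning if it is still too low. A minimal sketch in Python (the actual AS is not written in Python, and the required value and warning text here are illustrative):

```python
import resource

def raise_open_file_limit(required=65536):
    """Raise the soft RLIMIT_NOFILE to the hard maximum, as the AS
    startup check is described: bump the limit as high as the OS
    allows, then warn if it is still below what we need."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except (ValueError, OSError):
        pass  # keep the old soft limit if the OS refuses the raise
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < required:
        print("WARNING: open-file limit %d is below the required %d"
              % (soft, required))
    return soft
```

An unprivileged process can always raise its soft limit up to the hard limit, which is why doing this at startup catches misconfigured systems before they silently stop assigning.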

The second issue was that failover to our second AS (assign2) didn’t work for some clients. This was related to how we handle clients that cannot connect to port 8080 and WS that cannot receive connections on port 80. The folding client will first attempt to connect to assign.stanford.edu on port 8080; if this fails, it will try assign2.stanford.edu on port 80. The AS assumes that connections on port 80 are from clients which don’t support connections to 8080 and only assigns them to WS which support port 80.

In a failover situation, this assumption is invalid. The result is far fewer WS are available during a failover. To solve this problem the AS was modified to prefer rather than require WS which support port 80 for connections on port 80. This change can cause client/WS port mismatches but only when no better match was possible. Yes, it’s a tangled web.
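The "prefer rather than require" change can be sketched as a small selection function. This is a hypothetical illustration of the matching logic described above, not the actual AS code; the server-list shape is an assumption:

```python
def pick_work_server(servers, client_port):
    """Pick a work server for a client connection.

    servers: list of (name, supports_port80) tuples (hypothetical shape).
    For a client that connected on port 80, prefer WS that accept
    port 80, but fall back to any WS rather than failing outright --
    the "prefer rather than require" change described in the post.
    """
    if client_port == 80:
        preferred = [s for s in servers if s[1]]
        pool = preferred or servers  # prefer, don't require
    else:
        pool = servers
    return pool[0] if pool else None
```

The fallback branch is what allows a port-80 client (which may just be a failed-over 8080 client) to be matched to a port-8080-only WS when nothing better is available.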

In addition to these changes, we have plans to implement an early warning system which should help to alert us to such situations sooner. We already get SMS notifications if the AS goes down but we need more thorough reporting for situations where the AS is alive but not assigning. This new notification system will be put in place in the next few months.
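The key point of the planned early-warning system is that a liveness check alone is not enough: the AS can be up but not assigning. One hedged way to sketch such a check, assuming the monitor can see a timestamp of the last successful assignment (all names and thresholds here are hypothetical):

```python
import time

def assignments_healthy(last_assignment_time, now=None, max_idle=600):
    """Return True if the AS has assigned work recently.

    A plain process-up/SMS check misses the "alive but not assigning"
    failure mode; this also requires a successful assignment within
    the last max_idle seconds.
    """
    now = time.time() if now is None else now
    return (now - last_assignment_time) <= max_idle
```

In practice a monitor like this would page the operators when the function returns False, even though the AS process itself still answers pings.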

Thank you for your patience and for your ongoing contributions to Folding@home!

Source
 

Silver_Pharaoh

Likes the big ones n00b Member
Joined
Sep 7, 2013
Cool!

Thankfully the recent issues haven't had my rig down & not folding for more than 10 minutes.

Glad to see the added code and warnings too. That will help prevent future mess-ups :)
 

orion456

Member
Joined
May 31, 2004
Again I am getting no SMP WUs; it looks like they don't care anymore about those. Perhaps it's time to just shut those boxes down.
 

torin3

Member
Joined
Dec 25, 2004
Quote:
Again I am getting no SMP WUs, it looks like they don't care anymore about those. Perhaps its time to just shut those boxes down.

Yeah, I'm going to shut down my 4P unit and move the 970 over to a 1156 / 6 PCI-E board for dedicated Linux GPU folding.
 

Farwalker2u

Member
Joined
Mar 1, 2003
Location
Georgia
Right now four of my six GPUs are idle.
They have all failed to get a work assignment for hours now.

WU01:FS00:Connecting to assign-GPU.stanford.edu:8080
WARNING:WU01:FS00:Failed to get assignment from 'assign-GPU.stanford.edu:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
ERROR:WU01:FS00:Exception: Could not get an assignment
 

orion456

Member
Joined
May 31, 2004
Pande Group is falling down on the job, it seems. They can't get those servers running reliably.

Mine are getting WUs randomly. Five are down right now, and two are showing signs of hardware failure, so it might be a bigger problem than I think.
 

Silver_Pharaoh

Likes the big ones n00b Member
Joined
Sep 7, 2013
Well, my 270x will be busy for 5 hours, but the 7850 will be idle if everything is down...

Can we return WUs, or is the whole kit and caboodle down?
 

don256us

Uber Folding Senior
Joined
Jul 17, 2003
The good news is that all of my clients are folding, so it's not all bad. I should have one hell of an update when it's fixed.
 

AmbientFiction

Senior Folding Zombie
Joined
Jun 16, 2001
Location
Somewhere in the top 100 folders for team 32
Hey guys I noticed I was having send errors:
Code:
22:51:23:ERROR:WU00:FS00:Exception: Could not get an assignment
23:09:20:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
23:09:20:ERROR:WU00:FS00:Exception: Could not get an assignment
23:38:22:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
23:38:22:ERROR:WU00:FS00:Exception: Could not get an assignment
00:25:21:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
00:25:21:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2015-05-19 *******************************
01:41:21:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
01:41:21:ERROR:WU00:FS00:Exception: Could not get an assignment
03:32:26:WU00:FS00:Connecting to 171.67.108.200:8080
03:32:27:WU00:FS00:Assigned to work server 171.64.65.99
03:32:27:WU00:FS00:Requesting new work unit for slot 00: READY cpu:8 from 171.64.65.99
03:32:27:WU00:FS00:Connecting to 171.64.65.99:8080
03:32:30:WU00:FS00:Downloading 5.43MiB
03:32:34:WU00:FS00:Download complete
03:32:34:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9752 run:3700 clone:0 gen:0 core:0xa4 unit:0x00000004ab4041635541764681ebee05
03:32:34:WU00:FS00:Starting

So I paused my client and killed the core/gui and restarted the client. Sent off without a hitch.