Fixes for recent FAH server outage

'Cuda340 · May 10, 2015

FYI...

Fixes for recent FAH server outage
May 7, 2015 by Vijay Pande

We recently ran into some problems with our assignment server (AS). The AS is responsible for distributing the computational power of Folding@home by sending clients to different work servers (WS), which in turn assign parts of the protein folding simulations to clients. In the interest of transparency, here’s what happened.

Two issues compounded to cause some clients to not get work assignments for many hours. The first problem is an issue we’ve run into before, where the AS exceeds the number of open files allowed by the operating system. When this happens, it continues to run but fails to assign. To address this problem, our lead developer (Joseph Coffland) has added code to the AS which will check the maximum allowed open files at startup and increase the limit to the highest possible value. If the value is still too low, it will print a warning to the log file. This will help us ensure that our file limit settings are actually being respected.
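The startup check described above can be sketched roughly as follows. This is not the actual AS code, only a minimal illustration using Python's standard `resource` module on a Unix-like system; the function name `raise_open_file_limit` and the `min_needed` threshold are assumptions made for the example.

```python
import resource


def raise_open_file_limit(min_needed=4096):
    """Raise the soft open-file limit as far as allowed.

    Returns (soft_limit_after, warning_or_None). `min_needed` is a
    hypothetical threshold, not a value taken from the FAH servers.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    if soft != hard:
        try:
            # The soft limit may be raised up to the hard limit without root.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some systems refuse the request; keep the current limit.
            pass

    # Re-read the limit so we report what is actually in effect.
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)

    warning = None
    if soft != resource.RLIM_INFINITY and soft < min_needed:
        warning = ("WARNING: open file limit is %d, below the %d wanted"
                   % (soft, min_needed))
    return soft, warning


if __name__ == "__main__":
    limit, warning = raise_open_file_limit()
    print("open file limit now:", limit)
    if warning:
        print(warning)
```

Re-reading the limit after the attempt matters: it confirms the setting was actually applied rather than assuming it was.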

The second issue was that failover to our second AS (assign2) didn’t work for some clients. This was related to how we handle clients that cannot connect to port 8080 and WS that cannot receive connections on port 80. The folding client will first attempt to connect to assign.stanford.edu on port 8080; if this fails, it will try assign2.stanford.edu on port 80. The AS assumes that connections on port 80 are from clients which don’t support connections to 8080 and only assigns them to WS which support port 80.
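The client-side fallback order described above might look something like this. It is a sketch, not the FAH client's code: the host names and ports come from the post, while `first_reachable`, `tcp_reachable` and the 5-second timeout are assumptions for illustration.

```python
import socket

# Order taken from the post: primary AS on 8080, then assign2 on 80.
ASSIGNMENT_SERVERS = [
    ("assign.stanford.edu", 8080),
    ("assign2.stanford.edu", 80),
]


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(servers, try_connect=tcp_reachable):
    """Return the first (host, port) that accepts a connection, else None."""
    for host, port in servers:
        if try_connect(host, port):
            return host, port
    return None
```

Note that a client arriving on port 80 this way may be perfectly able to use 8080; it only fell back because the primary AS was unavailable.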

In a failover situation, this assumption is invalid. The result is that far fewer WS are available during a failover. To solve this problem, the AS was modified to prefer rather than require WS which support port 80 for connections on port 80. This change can cause client/WS port mismatches, but only when no better match was possible. Yes, it’s a tangled web.
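The difference between "require" and "prefer" can be illustrated with a small sketch. The data model here (a `WorkServer` with a set of ports) and the function `candidate_work_servers` are hypothetical; the real AS certainly weighs other criteria as well.

```python
from collections import namedtuple

# Hypothetical model: a work server and the ports it accepts connections on.
WorkServer = namedtuple("WorkServer", ["name", "ports"])


def candidate_work_servers(work_servers, client_port, require_match=False):
    """Return candidate WS for a client that connected on `client_port`.

    require_match=True  -> old behaviour: drop WS without a matching port.
    require_match=False -> new behaviour: keep all WS, matching ones first.
    """
    if require_match:
        return [ws for ws in work_servers if client_port in ws.ports]
    # sorted() is stable, so WS keep their relative order within each group;
    # False sorts before True, which puts port-matching WS at the front.
    return sorted(work_servers, key=lambda ws: client_port not in ws.ports)


servers = [
    WorkServer("ws-a", frozenset({8080})),
    WorkServer("ws-b", frozenset({80, 8080})),
]
print([ws.name for ws in candidate_work_servers(servers, 80, require_match=True)])
print([ws.name for ws in candidate_work_servers(servers, 80)])
```

With preference instead of a hard filter, a mismatched WS is only chosen when nothing better is left in the list.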

In addition to these changes, we have plans to implement an early warning system which should help to alert us to such situations sooner. We already get SMS notifications if the AS goes down but we need more thorough reporting for situations where the AS is alive but not assigning. This new notification system will be put in place in the next few months.

Thank you for your patience and for your ongoing contributions to Folding@home!

Source

~(o)-(0)~ · May 10, 2015

About time, hope it works, I've been getting killed

Silver_Pharaoh · May 10, 2015

Cool!

Thankfully the recent issues haven't had me down & not folding for more than 10 minutes.

Glad to see the added code and warning too. That will help prevent future mess-ups.

orion456 · May 14, 2015

Again I am getting no SMP WUs; it looks like they don't care about those anymore. Perhaps it's time to just shut those boxes down.

torin3 · May 14, 2015

orion456 said:
Again I am getting no SMP WUs; it looks like they don't care about those anymore. Perhaps it's time to just shut those boxes down.

Yeah, I'm going to shut down my 4P unit and move the 970 over to a 1156 / 6 PCI-E board for dedicated Linux GPU folding.

Farwalker2u · May 16, 2015

Right now four of my six GPUs are idle.
They have all failed to get a work assignment for hours now.

WU01:FS00:Connecting to assign-GPU.stanford.edu:8080
WARNING:WU01:FS00:Failed to get assignment from 'assign-GPU.stanford.edu:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
ERROR:WU01:FS00:Exception: Could not get an assignment

orion456 · May 17, 2015

Pande Group is falling down on the job, it seems. They can't get those servers running reliably.

Mine are getting WUs randomly. 5 are down right now, and two are showing signs of equipment failure, so it might be a bigger problem than I think.

orion456 · May 17, 2015

Looks like everything is down now.....argh

AmbientFiction · May 18, 2015

Info on the issue. Looks like they are hoping to have it back online by noon their time.
https://folding.stanford.edu/home/blog/

Silver_Pharaoh · May 18, 2015

Well, my 270x will be busy for 5 hours, but the 7850 will be idle if everything is down...

Can we return WU's or is the whole kit and caboodle down??

don256us · May 18, 2015

The good news is that all of my clients are folding, so it's not all bad. I should have one hell of an update when it's fixed.

don256us · May 18, 2015

The server is back up according to the link that AmbientFiction provided.
https://folding.stanford.edu/home/blog/

About an hour ago, at 10 AM PST (1 PM EST).

AmbientFiction · May 18, 2015

Hey guys I noticed I was having send errors:

Code:

22:51:23:ERROR:WU00:FS00:Exception: Could not get an assignment
23:09:20:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
23:09:20:ERROR:WU00:FS00:Exception: Could not get an assignment
23:38:22:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
23:38:22:ERROR:WU00:FS00:Exception: Could not get an assignment
00:25:21:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
00:25:21:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2015-05-19 *******************************
01:41:21:WARNING:WU00:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known. 
01:41:21:ERROR:WU00:FS00:Exception: Could not get an assignment
03:32:26:WU00:FS00:Connecting to 171.67.108.200:8080
03:32:27:WU00:FS00:Assigned to work server 171.64.65.99
03:32:27:WU00:FS00:Requesting new work unit for slot 00: READY cpu:8 from 171.64.65.99
03:32:27:WU00:FS00:Connecting to 171.64.65.99:8080
03:32:30:WU00:FS00:Downloading 5.43MiB
03:32:34:WU00:FS00:Download complete
03:32:34:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9752 run:3700 clone:0 gen:0 core:0xa4 unit:0x00000004ab4041635541764681ebee05
03:32:34:WU00:FS00:Starting

So I paused my client and killed the core/gui and restarted the client. Sent off without a hitch.
