
The Trouble Coming Our Way


Adak — Senior Member, joined Jan 9, 2006
We're going to be in a real struggle to keep our #4 team ranking. TSC! Russia has more than twice the number of active folders we have, AND they have been ramping up the speed of their PCs over the last few years.

[Attached image: T32Trouble.PNG]

RamonetB's big folding exhibition has allowed us to regain nearly a 300-million-point lead, but we know that can't last.

Recruiting for FAH has never been easy, but try to recruit some friends and/or associates to the team when you can.

We're going to be in for a fight. :D
 
Unfortunately, our low number of active folders compared to other top teams is not a new problem.
I think our strength is the dedication level of the folders we do have. :thup:

I was thinking of something the other day, though. When was the last time we had a contest giveaway? I've only been back folding since Feb, so if you all had one in the fall or something, I would have missed it.

Note: The above is always my suggestion because my people skills stink :p
 
Adak, you are correct. When we lose the 5-6 million PPD from RamonetB, we are going to be in a world of hurt. We definitely need more active members. We have less than half as many active members as the other top teams. I have just restarted Folding after a 2-3 year hiatus. In fact, I am selling some of my old equipment in the classifieds so that I can upgrade.
 
I'm frustrated not being able to retool my hardware any time soon. I'm just running my Q6600/9800 GTX+ 24/7 and only getting 12k PPD for my trouble. The kids don't want it running on their machines anymore, and my daughter's 9600GSO burned up running the GPU client. :shrug:
 
Folding creates heat.

Figured I'd chime in and let everyone know the current state of things.

This new cluster puts out heat. A LOT of heat. We knew that going in and have more than enough cooling capacity to handle it. However, stressing the system this week (along with 90+ degree temps) has shown the cracks in our cooling paradigm. The shortcomings of the physical plant's handling of the equipment install six years ago are showing too, but that's another story. Power outages (which corrupt my work unit data!!! :cry:) don't help either.

So the cluster has been down for a few days while we've made repairs to things that should never have been broken to begin with. We're also looking at improving efficiency with room design and enclosures (we use a hot aisle / cold aisle topology).

But there is good news in all this. We're going to extend the burn-in for another week or so as we take thermal readings of the room and the proposed enclosure designs. So that's another 30 - 60 million points for the team before shutdown. In the meantime, it'll be a nice additional gap TSC will need to overcome.

For funsies:

[Attached image: screenshot20120629at100.png — thermal camera view of the hot aisle]



That's a view of the hot aisle. The thing dangling in the center is a temperature probe. The red triangle indicates the hot spot, which happens to be the new cluster. Temperatures there are 122 degrees F. Average hot aisle temp is 98. But the cold aisle is 69. :thup:
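For anyone thinking in metric, those readings convert as follows (just the standard Fahrenheit-to-Celsius formula applied to the numbers quoted above):

```python
def f_to_c(deg_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (deg_f - 32) * 5 / 9

# Readings quoted above
for label, deg_f in [("hot spot", 122), ("hot aisle avg", 98), ("cold aisle", 69)]:
    print(f"{label}: {deg_f} F = {f_to_c(deg_f):.1f} C")
# hot spot comes out to exactly 50.0 C
```

So the cluster's hot spot is sitting at a round 50 °C while the cold aisle stays near 20 °C.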
 
Cool pic, RamonetB! :cool: Thanks for the heads up. You're definitely making a LARGE difference for T32. :)
 
Failure sending completed work units!

I know this is a "common" problem, but I'm encountering a large number of completed work units that are not being sent to the server. I've tried restarting the client, sending manually, "reinstalling," and checking that the work server is up and running. I've got about 40 work units waiting to be picked up, so a significant number of points is at stake. Does anyone have any ideas on something I've missed, or other suggestions? I've poked around on Google but haven't been able to find anything concrete. :shrug:

Additional work units have been downloaded and are running, but when finished they are added to the queue and get stuck.


[19:51:20] + Attempting to send results [July 1 19:51:20 UTC]
[19:51:20] - Reading file work/wuresults_09.dat from core
[19:51:20] (Read 222363491 bytes from disk)
[19:51:20] Connecting to http://130.237.232.237:8080/
[20:08:17] - Couldn't send HTTP request to server
[20:08:17] + Could not connect to Work Server (results)
[20:08:17] (130.237.232.237:8080)
[20:08:17] + Retrying using alternative port
[20:08:17] Connecting to http://130.237.232.237:80/
[20:25:17] - Couldn't send HTTP request to server
[20:25:17] + Could not connect to Work Server (results)
[20:25:17] (130.237.232.237:80)
[20:25:17] - Error: Could not transmit unit 09 (completed June 26) to work server.
[20:25:17] - 20 failed uploads of this unit.
[20:25:17] Keeping unit 09 in queue.
[20:25:17] + Sent 0 of 1 completed units to the server
[20:25:17] - Autosend completed

:cry::cry:
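Since the log shows the upload failing on both ports 8080 and 80, one low-level thing worth checking is whether the machine can open a plain TCP connection to the work server at all. A quick sketch (the IP and ports are the ones from the log above; everything else is generic):

```python
import socket

def port_open(host, port, timeout=2):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Work server and ports from the log above
for port in (8080, 80):
    status = "reachable" if port_open("130.237.232.237", port) else "BLOCKED or down"
    print(f"130.237.232.237:{port} -> {status}")
```

If both come back blocked from the cluster but reachable from another box on the same network, that points at a firewall or switch problem on the cluster side rather than at Stanford.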


Also, I'll get the make of the camera tomorrow. :)
 
I've had a few units stuck in the queue too (v6 clients only)... so I assume there must be a server issue at Stanford.
 
Well, according to the FAH server stats page I linked for you, the server in your FAH log is accepting classic WUs. How is your backed-up WU status, RamonetB?

If they're still stacking up, check for a block on your end, or whether the client is trying to connect to the wrong server.

Try shutting down the client and restarting.

http://fah-web.stanford.edu/logs/130.237.232.237.log.html
 


The server page I was looking at differs a little from the one you sent me. This is the one I was using: http://fah-web.stanford.edu/pybeta/serverstat.html

The server is listed as accepting SMP units, so they should be okay. Currently, the WUs are still stuck. Restarting the client and sending them manually hasn't worked. I've even tried reinstalling the client itself, copying over the queue and work folders. They're still stuck.

I checked to see if there might be a firewall issue, but everything appears in order, and the systems are able to download new work units. I've gone and done a clean install on one node (backing up the old folder) and am running a simple folding job on it to test whether it's an issue with sending out to them as opposed to receiving. I'll know more in an hour.

Assuming the client is trying to connect to the wrong server, is there a way I can manually specify the one it should go to?


Also, for inquiring minds, the camera used is a FLIR T620.
 
Try sending ChasR a PM.

I've gotten in touch with ChasR, but unfortunately he's not certain what's going on. From the logs, it appears that something is blocking the machine from connecting to the work server.

The test units I folded have also failed to send. However, other computers within the same facility don't seem to be troubled by this issue, which suggests the problem lies within the cluster itself. The fact that the firewalls are configured largely identically across the systems is further evidence. We did have a major storm that caused a power bump before all this started happening. I brought the system back up normally, but perhaps something is plaguing the switches. Everything is surge-suppressed (by at least two levels), so I don't fear any lasting damage. But it's possible the switches didn't start up right.

I'll do a full system bump tomorrow once I get in. Hopefully that will cure the problem.

Thanks for all your help.
 
I'm back using my computer instead of my phone. The clip of the log didn't appear to be running -verbosity 9, so much diagnostic info was missing. The fact that it was trying for the 20th time to send the same WU could indicate a problem with the WU. I'm not sure when it downloaded, but unless it is hung sending, it has likely downloaded and completed subsequent WUs, and the machine and FAH installation are OK. The problem is with the WU or with server .237, a problematic server.
 

Hey Chasr,

Thanks for chiming in. Strangely, the client was given the -verbosity 9 flag at startup but doesn't elaborate. I've even used it when sending individual work units, like so:

./fah6 -send 2 -verbosity 9

But I never receive any more detail. Also, other work units are being downloaded and completed, but they hit the same problem when being sent. Some nodes now have three work units that can't be sent. On one system, I've deleted the supposedly stuck/corrupt WU, but new ones still don't send. It's good to know that the server is a problematic one, though. I've run into this issue before, but it's never lasted five days for me.
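When several queue slots are stuck like this, retyping `-send` for each one gets tedious. A small wrapper can walk the slots and report which resends fail (a sketch only; it assumes fah6 sits in the working directory as in the command above, and that slot numbers 0-9 match the v6 queue):

```python
import subprocess

def build_send_cmd(slot, verbosity=9):
    """Build the fah6 command line to resend one queue slot."""
    return ["./fah6", "-send", str(slot), "-verbosity", str(verbosity)]

def resend_all(slots=range(10)):
    """Try to resend every queued slot; return the slots whose resend failed."""
    failed = []
    for slot in slots:
        result = subprocess.run(build_send_cmd(slot))
        if result.returncode != 0:
            failed.append(slot)
    return failed
```

If every slot fails the same way, that again points at the connection rather than at any individual WU.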

I suspect the issue is on my end and, specifically, with the cluster. Other systems here are returning work units as normal (SMP systems as well, though unable to handle the bigadv projects). In a desperate attempt to fix it, I'm taking the entire system down -- master node, switches, and all -- and bringing it back up in sequential order. We'll see if it resolves the problem. :bang head

If it does resolve it, that's important to know. If we ever have a similar situation, I'll need to go through this procedure again just to avoid problems with code when it's in full production.
 
No change.

I think all I can do now is start from scratch and hope that works. Maybe something is wrong with the queue.dat file. Grasping at straws, I know.
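Before starting from scratch, it's worth snapshotting queue.dat and the work/ directory so nothing is lost if the fresh install behaves the same. A generic sketch (the /root/folding layout is taken from the log; the backup destination is whatever you choose):

```python
import shutil
import time
from pathlib import Path

def backup_fah(client_dir, dest_root):
    """Copy queue.dat and the work/ directory to a timestamped backup folder."""
    client = Path(client_dir)
    backup = Path(dest_root) / f"fah-backup-{time.strftime('%Y%m%d-%H%M%S')}"
    backup.mkdir(parents=True)
    for item in ("queue.dat", "work"):
        src = client / item
        if src.is_dir():
            shutil.copytree(src, backup / item)
        elif src.is_file():
            shutil.copy2(src, backup / item)
    return backup

# e.g. backup_fah("/root/folding", "/root")
```

That way the completed results can still be restored and resent later if the server issue clears up.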

:(


For those interested, the log file:




# Linux Console Edition #######################################################
###############################################################################

Folding@Home Client Version 6.34

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/folding
Executable: ./fah6
Arguments: -send 1 -verbosity 9

[15:11:29] - Ask before connecting: No
[15:11:29] - User name: RamonetB (Team 32)
[15:11:29] - User ID: 17F95D2F23A2C5E6
[15:11:29] - Machine ID: 1
[15:11:29]
[15:11:29] Loaded queue successfully.
[15:11:29] Attempting to return result(s) to server...
[15:11:29] Project: 6903 (Run 4, Clone 12, Gen 106)
[15:11:29] - Read packet limit of 540015616... Set to 524286976.


[15:11:29] + Attempting to send results [July 3 15:11:29 UTC]
[15:11:29] - Reading file work/wuresults_01.dat from core
[15:11:31] (Read 222512322 bytes from disk)
[15:11:31] Connecting to http://130.237.232.237:8080/
[15:28:28] - Couldn't send HTTP request to server
[15:28:28] + Could not connect to Work Server (results)
[15:28:28] (130.237.232.237:8080)
[15:28:28] + Retrying using alternative port
[15:28:28] Connecting to http://130.237.232.237:80/
[15:45:30] - Couldn't send HTTP request to server
[15:45:30] + Could not connect to Work Server (results)
[15:45:30] (130.237.232.237:80)
[15:45:30] - Error: Could not transmit unit 01 (completed July 2) to work server.
[15:45:30] - 8 failed uploads of this unit.
[15:45:30] Keeping unit 01 in queue.
[15:45:30] - Failed to send unit 01 to server
[15:45:30] ***** Got a SIGTERM signal (15)
[15:45:30] Killing all core threads

Folding@Home Client Shutdown.
 
If I remember right, there is a flag for sending work units. What I would do is copy the WU files to a flash drive, take it home, and try sending the units from there, and let the servers have a fresh start.
 
You can't sneakernet any more. The server will reject the WU if it is sent by a different machine than the one that downloaded it.
 