
helsyeah...something's wrong?


Joe Camel

Senior Camel Kicker
Joined
Aug 6, 2003
Location
---> NEW HOUSE 7/17/09 !! <---
hate to be the bearer of bad news (if you haven't seen) but....
something doesn't look right here :eh?:

date ............ points ........ WU's

07.04.05 ........ 6,920 ......... 2,516 :eek:
07.05, 6am ...... 0 ............. 162
07.05, 9am ...... 0 ............. 0

stats page

nothing like going back to work after a holiday is there?

GOOD LUCK!!!
 
Power failure at my office got me over the weekend too. All but the rigs on big UPSs shut down. Fortunately, it happened on Monday, so I only lost about 12-14 hours on 20+ clients.
 
I've noticed the odd issues with the WU's, and also what seems to be a good-sized drop in production.

I am investigating, and it appears that one lab was either shut down for the weekend or perhaps the IT guys pulled FAH. Hopefully I will find out shortly.

As far as the WU's are concerned, I think this is a result of a couple of reimages happening about a month ago, so all machines lost their WU's twice, maybe three times, within a week or two. I think these numbers just reflect those WU's expiring... just a guess though..
 
date ............ points ........ WU's
07.05, 12pm ..... 814 ........... 2
07.05, 3pm ...... 2,246 ......... 5

looks MUCH better now!!

know what went wrong?
 
Well, I can definitely confirm that one lab (13 P4 machines, 13 P3 machines) is completely offline. I did manage to get 5 borgs at my "real" job, so that helped lessen the blow a bit. I'm not sure what is up, but it is summer and the IT department may be doing some serious overhaul of the particular lab in question, so I won't stress over it too much. As it is, I still consider myself pretty lucky to have gotten what I have :).

I still can't figure out the 2k+ WUs deal. It really looks like each of my lab borgs managed to get something like 50 WU's assigned to it in a single day??!!! And they all just expired yesterday and today. I'm completely baffled on that one....
 
The Stanford stats and EOC agree on the number of WUs. It looks like all of the 2000+ WUs were p1912 and p1910 turned in for partial credit. Perhaps the IT department limited the amount of memory allowed to be used by a single process in the current image? So as memory use grew preconvergence, it hit the limit and EE'd. Stanford's assignment server would eventually cut you off.
 
ChasR said:
The Stanford stats and EOC agree on the number of WUs. It looks like all of the 2000+ WUs were p1912 and p1910 turned in for partial credit. Perhaps the IT department limited the amount of memory allowed to be used by a single process in the current image? So as memory use grew preconvergence, it hit the limit and EE'd. Stanford's assignment server would eventually cut you off.

Interesting point, although I suspect this isn't the case. The labs in question are primarily solid modeling labs, so I kinda doubt that limiting the memory allowed per process would be a good idea when it's very easy for the modeling software to use well over the 200ish megs that a QMD would use. I'm very curious what caused the WU issue, and will keep digging into my logs and hopefully come up with a definite answer soon. It does the project no good when 2k+ WU's don't get finished right... :(
 
Preconvergence, QMDs use well over 300 MB of RAM. Maybe someone complained about the performance hit, and the IT department, rather than uninstall FAH, limited single process memory usage. As 26 computers started EEing and accessing the net to transmit and, more importantly, download new work units, net traffic would go through the roof and FAH would get uninstalled. At the least, an interesting scenario, even if it's made with no knowledge of the workings of your lab.
 
ChasR said:
Preconvergence, QMDs use well over 300 MB of RAM. Maybe someone complained about the performance hit, and the IT department, rather than uninstall FAH, limited single process memory usage. As 26 computers started EEing and accessing the net to transmit and, more importantly, download new work units, net traffic would go through the roof and FAH would get uninstalled. At the least, an interesting scenario, even if it's made with no knowledge of the workings of your lab.

Yes, interesting...

However, I wrote a custom utility that stops the FAH services when a user logs on to the computer, just to remove any possible hit on performance when someone is actually using the computer. The IT department has been pretty happy with it so far, and I have yet to see any crazy error logs popping up on the machines. As it is, I plan on talking with the IT guys to see what the issue is, and whether there is something I have been neglecting or need to do to help with any problems that might have come up.
 
Do you have access to the logs? It might not have been all of them that had a problem. Over a DSL/Ethernet connection it takes about two minutes to connect, upload results, download a new WU and commence the simulation. If it ends immediately one rig could EE over 700 times in 24 hours. So it had to be more than one but not necessarily all of them. If it's all of them, it had to be something the IT guys did. Since it's more than one, it probably has to be something the IT guys did.
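For anyone who wants to check the numbers above, here is a rough back-of-the-envelope sketch. It assumes a flat two-minute cycle with nothing else slowing the client down, and uses 2000 as a stand-in for the "2000+" WU count; the function name is made up for illustration.

```python
def max_ee_cycles(minutes_per_cycle: float, hours: float = 24.0) -> int:
    """Upper bound on connect/upload/download/fail cycles in the window."""
    return int(hours * 60 // minutes_per_cycle)

# One rig that EEs immediately on every WU, at ~2 minutes per cycle.
cycles_per_rig = max_ee_cycles(2)

# Fewest rigs that could produce 2000 failed WUs in one day (ceiling division).
rigs_needed = -(-2000 // cycles_per_rig)

print(cycles_per_rig, rigs_needed)
```

So a single rig tops out around 720 a day, and a handful of rigs would be enough to account for the whole spike.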
 
Not to rule out the WU EE possibility, but let's add another piece to the puzzle:


These machines have Deep Freeze installed, and rely on the utility I wrote to upload work (and also maintain a constant userid for each machine) to an FTP server on a set schedule and, ideally, when a computer is shut down. After a computer is restarted, it is *supposed* to clear any work in FAH (since it would be frozen from when FAH was initially installed), download the latest backup of the work from the FTP and continue on.

Contributing to this is the fact that most of the network was reworked a while back, and a new subnet was assigned to the labs and the FTP I was using. I don't know for sure, but I would be willing to bet that there was a 24-48 hr period about a month ago where the machines could not access the FTP to back up work. So each time a machine was restarted without being able to access the FTP, it would end up deleting its work and just downloading a new WU instead of resuming what was on the FTP. If this happened multiple times across the two good-sized labs (~45 machines total), then that *may* have resulted in the 2k+ WU's not being completed (although each machine would have to be restarted 50ish times to work out to 2k expired WU's, pretty unlikely actually....)
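A quick sanity check on that "50ish restarts" figure, as a sketch only. It assumes exactly 2000 lost WU's spread evenly over 45 machines, with one WU lost per restart; both numbers are round stand-ins for the "2k+" and "~45" above.

```python
lost_wus = 2000   # stand-in for "2k+"
machines = 45     # stand-in for "~45 machines total"

# One WU thrown away per restart, spread evenly across the labs.
restarts_per_machine = lost_wus / machines

print(round(restarts_per_machine, 1))
```

That lands in the mid-40s per machine inside a day or two, which is why the restart theory looks like a stretch on its own.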

I plan to pull up my ftp logs after work and see how long the ftp was inactive. Beyond that, i can check my current backed up FAH logs from each machine, and see if just before the lab was shut down maybe there were a bunch of EE'd WUs.

As it is, the whole "house of cards" (the whole backup scheme and such) that I have been relying on has been (and is) getting some serious reworking to try to keep this kind of problem from occurring again.
 
In your month-ago scenario, I can see how you could download a lot of WUs, but if the work is deleted prior to upload, there wouldn't be any turned in to Stanford. The fact that 3151 more WUs than your average were turned in on 04 JUL and 05 JUL would tend to lead one away from a month-ago problem. You could ask the Stanford Forum moderators to tell you what they saw from their end. They'll have a record by WU, user name, machine id and IP address. I'll ask for you if you want me to.
 
hmmm, good point. I will contact the Stanford mods myself. I appreciate the offer, but this is most certainly my mess, and I would like to see what has happened. Hopefully they can shed some more light on what's going on.

Edit: Actually, any advice which mods to contact?
 
Nice banks of borgs really put out the PPD/PPW but the price you pay is the constant maintenance. A price most of us would be willing to pay :D

Nice production, as you frequently outpaced Nik.
 
Well, this certainly explains why us Cookie Commanders are smashing the Cheezers at the moment..... I'm starting to think that we're too far ahead for them to catch up. Takes the fun out of it really..... :eh?:
 
it's a contest, and such are the fortunes of war... things could have happened the other way as well...

Update:

I was able to get in touch with **** Howell over at the folding-community forums, and he was able to pull info on what WU's I have turned in since the 3rd of July. It appears that there was NOT a spike in WU's turned in for partial or zero credit on the days where they appeared on the stats.

It does appear that the spike in WU count shows up in both Stanford's stats and the EOC stats. From what **** mentioned, it seems that on rare occasions Stanford's stats will end up counting WU's that had not shown up before, even if the appropriate amount of points WERE counted. This could account for the spike, although I wouldn't be surprised if there is more to the WU spike than that...
 
Very fishy. If you had turned in the number of WUs shown, for full credit on p191x alone, you'd have 2,785,500 points. Obviously that's not the case, so something else is wrong, either in what **** is seeing or in the scoring.
 
Good point.

I finally dug into my 80-odd megs of FTP logs and I am seeing some pretty odd behavior. On the 4th of June, and it looks like part of the 3rd and 5th, each machine I have using my client to back up WU's was connecting to the FTP 5 times per HOUR, and not transferring anything...

Now, i have the machines set on a 2 hour rotation to back up work. The only other time a machine would access the ftp is to check for a new version of the program i wrote, when the machine is initially powered on.
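For picking that pattern out of a big log, something along these lines might do. It is only a sketch: it assumes the log has already been boiled down to (timestamp, host) pairs, since the actual FTP log format isn't shown here, and the host names and timestamps in the sample are made up.

```python
from collections import Counter
from datetime import datetime

def connections_per_hour(events):
    """Count connections per (host, hour) from (iso_timestamp, host) pairs."""
    counts = Counter()
    for stamp, host in events:
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        counts[(host, hour)] += 1
    return counts

# Made-up sample: one machine hammering the FTP, one behaving normally.
sample = [
    ("2005-06-04T10:02:00", "lab1-pc01"),
    ("2005-06-04T10:14:00", "lab1-pc01"),
    ("2005-06-04T10:26:00", "lab1-pc01"),
    ("2005-06-04T10:38:00", "lab1-pc01"),
    ("2005-06-04T10:50:00", "lab1-pc01"),
    ("2005-06-04T10:05:00", "lab1-pc02"),
]

# On a 2 hour backup rotation, more than one connection in an hour is suspicious.
flagged = {key: n for key, n in connections_per_hour(sample).items() if n > 1}
print(flagged)
```

The threshold of one per hour is a guess based on the 2 hour rotation; the power-on version check would add the occasional extra hit.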

I'm not entirely sure what to make of this yet, but I have added some additional checks to my program to just keep FAH OFF if it can't get backed-up work off the FTP (even if the FTP is active), until I can get this resolved...

**** is out of the country for the next few weeks, so I plan to follow up with Bruce over at the folding forums and see if it would be necessary to start digging into the full Stanford logs to track down the issue. I suspect it's most likely my fault for not planning well in the development of my program.

On the other hand, **** seems to think that this may just be an artifact of the stats system. Either way, I will keep a very close eye on things on my borgs and do what I can to figure out what circumstances could have led to a mistake of mine causing the WU count to spike...
 