
All my wu's wiped out during storm.


glasslicker

Member
Joined
Dec 27, 2005
Location
Cactusland
Well, we had an extended power outage at the office caused by Saturday's storm. Only the server and one PC had battery backup, since a power outage normally costs us nothing more than a single note on a file.

All work units corrupted with I/O errors. 1300 points total including 2 404's. This doesn't include a third 404 I was folding for someone else. I even tried moving all of them to another machine. After six hours I managed to save a single 56 pointer. The others are trashed. :bang head

All WU's were replaced with 56 pointers, and that's with big packets on and -advmethods. (Not whining, just commenting.)

Unfortunately I didn't have WU insurance. I'll call State Farm for a quote today! :D

Can someone comment on what happens to the WU, or how the program logic works, when it is interrupted without closing properly?
 

The WU will be reissued to someone else. It's possible the temporary work files were corrupted by the power outage and the client can't distinguish valid work from corruption.
 

So, that means I get ANOTHER week before you pass me??

What a HORRIBLE shame! :p j/k

As the WU progresses, it makes backup files (it calls them checkpoint files) every 15 minutes or so. If the machine somehow cuts off the core or console before it's done, the client will try to reload the data it needs to continue from the checkpoint files.

It doesn't always work, and even when it does, you usually lose at least 10 minutes because of the time it takes to check the data to see if it's able to find a good checkpoint to continue from.
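For anyone curious what "check the data to see if it's able to find a good checkpoint" can look like in practice, here is a minimal Python sketch. It is not the Folding@home client's actual code or file format: the function names, the JSON layout, and the SHA-256 checksum are all assumptions made for illustration. The write-to-a-temp-file-then-rename step is a common general technique for keeping the previous checkpoint intact if the write is interrupted, and I don't know whether the real cores do it.

```python
import hashlib
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state plus a checksum to a temp file, then swap it into place.

    Hypothetical helper: the real client's checkpoint format is not shown here.
    """
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"sha256": digest, "payload": payload}, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    # The old checkpoint stays in place until this rename succeeds.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if the file is missing or damaged."""
    try:
        with open(path) as f:
            record = json.load(f)
        payload = record["payload"]
        if hashlib.sha256(payload.encode()).hexdigest() != record["sha256"]:
            return None  # contents do not match the stored checksum
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None  # unreadable, truncated, or not in the expected shape
```

On restart, a client built this way would call `load_checkpoint` and resume from the returned state, or start the WU over if it gets `None` back.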

Everybody has been getting these little 56 pointers, ad nauseam, but they seem to have started diminishing as of yesterday. Hopefully some WU's with a little more "meat" to them will be the norm for a while.

It certainly slows down the science for Vijay and Co., because they have to complete one generation all the way through, and only then can the next generation be constructed and given out. So it's all VERY sequential, with no redundancy of WU's, according to Vijay, unless they see a problem with the returned data.

Good luck getting all your boxes back up and working right!

Adak
 
Another week? Heck, you have close to 20,000 points. It's going to take a whole lot longer than that.

Right now, I'm just working on stability and making it to page three.

Well, so far I think I have had about a 2000 point education.

Still, I'm folding and that's all that matters.
 
It is very unusual to lose all your WUs during a power failure if the checkpoint files are written to a local drive. My office is in the power failure capital of the world, and my non-UPS-protected machines lose less than 5% of WUs on power failure.
 

Hmmmmmm. That concerns me.

Looks like I'll have to do some more reading and figure out what I'm doing wrong.
 
/Bow heads

Our Program, who art in Memory,
"hello" be Thy name.
The Operating System come,
Thy commands be done,
at the printer as they are on the Screen.

Give us this day, our daily WU's,
and forgive us, our I/O Errors, as we forgive those
whose Logic Circuits are faulty.
Lead us not into Frustration,
and deliver us from Power Surges.

For Thine is the Algorithm,
the Application,
and the Solution,
Folding forever and ever.

Return.


:attn:
 
LOL!

I'm over it. 600 points today so far. Still have some 56 pointers, so no breaking 1000 today.

Folding On!
 
Aww, that power cut sucks glasslicker. Good luck getting the PPD back up to speed :) - I've also had my PPD destroyed (completely) by a short holiday, lol...
 
Some cores are better than others at power failure recovery. Probably half my rigs at the office run Deadlineless WUs (P2 400MHz class machines). The Tinker core almost never loses a WU on power failure. Gromacs was bad about losing WUs on power failure, but version 1.8x improved checkpointing and reduced the losses. The frequency you have set for checkpointing makes a difference. If you checkpoint at some absurd interval like 3 minutes, the likelihood that the power will be interrupted while a checkpoint is being written is ten times greater than with 30 minute checkpoints. If the power goes out during the disk write, the WU is lost.
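The "ten times" figure falls out of simple arithmetic, sketched below in Python. The 2 second write time is a made-up number for illustration, and the model assumes an outage is equally likely at any moment; the ratio comes out the same whatever write time you plug in.

```python
def fraction_of_time_writing(write_seconds, interval_minutes):
    """Share of wall-clock time spent inside a checkpoint write.

    Assumes one write of fixed length per interval (an illustrative model).
    """
    return write_seconds / (interval_minutes * 60)


WRITE_SECONDS = 2  # assumed write duration, not a measured value

short_interval = fraction_of_time_writing(WRITE_SECONDS, 3)
long_interval = fraction_of_time_writing(WRITE_SECONDS, 30)

# Chance that a random outage lands mid-write, 3 minute vs 30 minute interval.
ratio = short_interval / long_interval
print(f"3 min: {short_interval:.4%}, 30 min: {long_interval:.4%}, ratio: {ratio:.1f}x")
```

The flip side is that a longer interval means more work to redo after any crash, since you resume from an older checkpoint.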
 
ChasR said:
The frequency you have set for checkpointing makes a difference. If you checkpoint at some absurd interval like 3 minutes, the likelihood that the power will be interrupted while a checkpoint is being written is ten times greater than with 30 minute checkpoints. If the power goes out during the disk write, the WU is lost.
That makes a lot of sense, I've never thought of that :p
 