
600GB RAID 5 Array dies - dubious circumstances - please help restore!


Growly

New Member
Joined: Jan 8, 2006
Hello my beloved friends,

I have gazed in wonder at you all from afar until now, when, in my darkest hour, I ask for help :(

Before I break out in tears, I thought I might share my anguish with all of you. Who knows, maybe someone knows exactly what I need to fix my little predicament...

Background: I bought a Promise FastTrak S150 SX4-M SATA RAID controller a couple of years ago, and with it four brand-new Seagate 200GB SATA drives. This is not a cheap card - at NZD$400 I expected something more than the average integrated controller, and you'll understand my anger after spending NZD$1400 on what I expected would serve me well.

In fact, I had been enjoying those little advantages and revelling in them until about two days ago. I had created a RAID 5 array that spanned all four disks, totalling 600GB in size. It was full - of everything - of more days and weeks of downloading and hours of work than there are molecules in my foot.

Everything is going fine - the array is healthy, all drives are operational. Then, the server freezes - it stops responding. I restart it (forcefully), and after it reloads Windows I find that my array has vanished.

A look at the array management software reveals that two of the array drives - namely 1 & 2 - are no longer part of the array. The array itself is still there, and it thinks that it's missing two drives - but those two drives also appear as perfectly healthy... the catch is that they're not assigned to any array anymore.

I cannot, it seems, add the drives back into the array for one reason or another (the software won't let me). Whatever means the controller used to identify the drives has disappeared, and I'm left with my life savings in data gone.

No, I don't have backups, because that would imply that I had another 600GB lying around somewhere.

I have one idea and one idea alone to help me out of this situation - delete the array (which wipes the data), re-form an identical array, and use recovery software to scrounge back my countless megabytes of information. The only problem then is that I need to find more drives and a lot of spare time.

What I don't understand is how something like this happens. I have two healthy drives, newer than the ones in my Desktop, and they both magically detach themselves from my RAID 5 array AT ONCE?

What the hell?!

Now this is where it gets crazy. Promise states that an array can be moved between controllers - which reaffirms my belief that if I recreate the array, exactly as it was before, all will be restored to normal.

So I did that, and upon rebooting, Windows Server 2003 started chkdsking the drive. I didn't get time to stop it running (I was busy pleading with people on other forums for guidance at the time), so I thought I should leave it running. It had detected my old array's volume name and the filesystem (NTFS), and so I was far too excited to worry about the 90,000 minor corrections it made to all my files:
(Attached photos of the chkdsk run: DSC03553_web.jpg, DSC03552_web.jpg)

Then when I logged into Windows itself, I was greeted by a drive that had no filesystem, no data, and no formatting. Chkdsk no longer works on the drive either, dying with an "unspecified error".

What the hell?

I try Disk Management, and it tells me nothing I don't already fear. So I pull out Runtime's GetDataBack for NTFS and do an excessive scan of the whole array. It finds my data, in bits and pieces, in different folders from where I originally put it, and it's all corrupt.

So everything turned to crap.

My question is this, my beloved efriends: Could chkdsk have ruined my files completely, or would they already have been ruined, causing it to go through and do all those little fixes in the first place?

How does a RAID controller *forget* that its RAID array member drives are in fact member drives? Can anyone explain to me the low-level implications of this? On a 1s and 0s basis?

Before you all tell me what I should've done: I had no provision to back any of these files up before a month ago, when I bought my first DVD writer. I wasn't intending to back up 600GB to CD-Rs. I know that in future I must be more prudent with external media.

There is only one thing I can possibly blame for this. The day my computer froze, and the day after, we were having funny power problems. The day after, for example, we experienced numerous brownouts that caused all my equipment to do funny things. After I reset the machine in question following one of these brownouts, I found that the controller had once again dropped a drive from the array. The day of the original freeze, our smoke alarms had gone off in response to a fire caused by a power surge (or so my father hypothesised). This is the only pattern I see.

Please, I beg of you, I know it's a lot to read - help! Even if it's an in-depth explanation of what could have happened (with reasoning), so that I can learn something from this and move on!
 
As with any other RAID controller, a glitch can kill data. In this case, the suspected power problems could have affected any number of things. The chkdsk run didn't help either, as it can destroy data without warning.

As to where to go from here, try the demo version of R-Studio from R-TT.com, details here:

http://www.data-recovery-software.net/

It has a RAID recovery solution that I haven't tried yet, but the rest of the program works very well and is less expensive than other options.
 
You don't have any kind of battery backup? It's bad enough when a single drive gets interrupted by a brownout, but it is really bad when it happens to a RAID system. For the future, make sure you always have battery backup on any machine you do not want to lose data from.

I fear you will not have much luck in restoring the RAID. I have tried in the past to restore a RAID where two drives failed out of a three-drive array, to no avail. :(

Also remember to periodically back up all important data - to tape, CD burning, imaging, whatever - just as long as you have some kind of backup somewhere. It's very, very important.
 
Yes, backups were important. It is only in hindsight that I realise that some of the data was more important than the rest (because I just realised there are no other copies) - at the time I wouldn't have thought to back it up, because I would've tried to back up the whole array...

And to CDs? That's not plausible. I only just got a DVD writer, too...

Thanks for the replies though, good to see some people respond to my cries for help :)
 
That happens to the best IT people. I have either forgotten a folder or thought it wasn't important to back up, only to learn when it was lost that I was wrong - and gotten yelled at for it. Live and learn :)
 
Sorry to see this happen to you, but like others have said, RAID controllers can and do suffer glitches that destroy arrays. I've seen this happen a few times at the office and in consulting life. In any case, it's stressful and usually inexplicable - one might encounter a driver bug, power problem, OS bug, or whatever else, and the end result is that the controller will mark a number of the member drives as failed or inoperative. Once that array loses synchronization between drives and parity, it's toast.

The guys who write controller BIOSes and utilities know this, so they'll try to prevent you from going cowboy and recreating arrays from non-synched drives. Should they allow that option, chances are a fair chunk of the data would be corrupt, and some of this would never be discovered unless checksums or copies of the data were kept elsewhere. Since nobody keeps checksum logs of all their files, it is assumed a backup is kept, and in the event of such a failure the array gets wiped, recreated from scratch, then reloaded from backups. On the cynical side of things, not even the most explicit of EULAs, warnings, and wet signatures on waivers would keep the manufacturers from getting sued should they offer a feature that allows arrays to be stitched back together.

As for what's happening at the low level - array info is stored in reserved sectors on the drives. That info mirrors what's in the controller, namely which drives are members of what array, their IDs, what they're used for (data stripe, parity), etc. These few bytes are what allow arrays to be moved to new controllers, as the controller will check its own settings against what's on the drives during initialization. While I've never been able to watch it happen, I've always suspected that as soon as the glitch occurs and member drives are marked as failed, the controller immediately changes this data both in its own memory and on the drives to reflect the failure(s). Once that's done, even a new controller will look at the drives, see the busted array, and refuse to work with it.
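
To put the "1s and 0s" part of the question in concrete terms: RAID 5 keeps one parity block per stripe, computed as the XOR of the data blocks, which is exactly why one missing member can be rebuilt and two cannot. Here's a rough Python sketch of the arithmetic - purely illustrative, it has nothing to do with how the Promise firmware actually lays out stripes or metadata:

```python
# Toy RAID 5 stripe: three data blocks plus one XOR parity block across
# four "drives". No rotating parity, no real sector layout - just the math.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
drives = data + [parity]          # 4 members, like the 4x200GB array

# Lose ONE member: XOR of the survivors rebuilds the missing block.
missing = 1
survivors = [blk for i, blk in enumerate(drives) if i != missing]
print(xor_blocks(survivors) == drives[missing])   # True -> recoverable

# Lose TWO members: the survivors' XOR only gives the combined XOR of the
# two missing blocks, not either one individually -> unrecoverable.
```

Drop a second member and that last comment is the whole story: the controller has no way to split the combined XOR back into two separate blocks, so the stripe - and the array - is toast.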
 
CGR said:
That happens to the best IT people... Live and learn :)
Snugglebear said:
Sorry to see this happen to you, but like others have said, RAID controllers can and do suffer glitches that destroy arrays...
Your words offer immeasurable comfort. Although I don't know why it happened twice, I now know that I'm not the only one in the world with the issue. Although some material can be acquired from other sources, I fear that much (including photos) has been lost - and that the only real solution is to buy two controllers and mirror their arrays independently. Yes, that's how a pro would do it. (Oh, and the power equipment, sure.)

I am slightly comforted by the fact that I now have at least a list of the files I lost (from the MFTs), so I can see what I need to get back from elsewhere.

That's nice.

One last question - is it possible that the same thing that killed my RAID controller's memory also caused the drives to corrupt themselves? Or, as you said, is it more likely that the drives believe they are themselves corrupt, and have essentially given up on themselves? I'm talking about a power surge, because that's the ONLY thing I can plausibly blame.

This is why I like low-level detail - it explains things.
 
If you want a reasonable solution to safeguard the data, just grab an external drive or removable drive kit. Taking a snapshot every day or three is far less expensive than creating an entirely separate array to mirror the data. It also makes the data mobile, which is a good thing if this really is a bad power issue that eventually turns into a fire, or if the failing controller shorts/surges into the rest of the components.
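
If it helps, the "snapshot every day or three" really can be as dumb as a scheduled copy into a dated folder on the external drive. A throwaway Python sketch of the idea - the paths are made-up examples, not anything from a real setup:

```python
# Minimal "snapshot to external drive" job: copy a source tree into a
# folder named after today's date on the backup disk.
import shutil
from datetime import date
from pathlib import Path

SOURCE = Path(r"D:\important")        # hypothetical data you can't lose
BACKUP_ROOT = Path(r"E:\snapshots")   # hypothetical external/removable drive

def take_snapshot():
    dest = BACKUP_ROOT / date.today().isoformat()
    if dest.exists():
        print(f"Snapshot {dest} already exists, skipping.")
        return
    shutil.copytree(SOURCE, dest)
    print(f"Snapshot written to {dest}")

if __name__ == "__main__":
    take_snapshot()
```

Hang it off Task Scheduler (or cron) and it quietly does its thing until the day you need it.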

The alternative is, as you've noted, rather spendy and inefficient. Believe me, it's hard to go before management where everyone has a C in front of their title and ask for tens of thousands of dollars to set up high-availability clusters (and this in a relatively small company). Many people don't grasp the importance of having live equipment, data backups & images of the live equipment (on- & offsite), hot/warm spare equipment (e.g. clusters, standby systems), and then even more offsite cold spare equipment. Infrastructure really is a pain: for everything that gets added at the top (e.g. new capabilities), perhaps several times that needs to be added down at the foundational level.

Anyway, it could have been any number of things that caused the failures. Power is always a good initial suspect and can cause these symptoms. Similarly, the failure may have been some incompatibility between the controller and the board or some other component(s). The controller itself may have been shoddy and blown a cap, shorted, or otherwise experienced a local failure. In that situation, yes, it is entirely possible that it went berserk and started corrupting data both in its cache and on the disks. Perhaps the drives started having ATA errors that forced the controller to downgrade their speed or predict their impending departure from life. Other, less likely culprits include excessive EMI, cosmic radiation (I'm not kidding - while rare at ground level, radiation can flip the states of unshielded transistors, hence aerospace equipment is shielded and the most important data is housed on CRC-enabled systems), or even tin whiskers causing shorts.

In all likelihood, though, the controller probably encountered some sort of unrecoverable error, panicked, and marked a few drives as failed. Maybe the drives were having timeouts; maybe the 5V rail dipped and a couple of drives shut themselves down mid-operation. Either way, there was probably dirty data both in the controller and in the drives (ATA allows write acceleration, where the controller sends a write request to the drive and the drive immediately replies that it has completed, even though the data is really sitting in the drive's cache awaiting write-out to the physical disk; battery modules for controller cache won't help here, since the controller will have moved on thinking the dirty data has been flushed to disk), so whatever it was that occurred, enough data was lost to throw off chkdsk.
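
To illustrate the write-acceleration point with a toy model (this is a cartoon of write-back caching, not the actual ATA command set): the drive acknowledges a write the moment it lands in cache, so anything still sitting there when the power blips never reaches the platters.

```python
# Toy write-back drive: writes are acknowledged once they hit the cache,
# and only flush() moves them to the "platters". A power cut before the
# flush silently loses data the host already believes is safe on disk.
class ToyDrive:
    def __init__(self):
        self.platters = {}   # what is really on disk
        self.cache = {}      # dirty data awaiting write-out

    def write(self, lba, data):
        self.cache[lba] = data
        return "OK"          # acknowledged immediately (write-back)

    def flush(self):
        self.platters.update(self.cache)
        self.cache.clear()

    def power_cut(self):
        self.cache.clear()   # dirty data evaporates

drive = ToyDrive()
drive.write(0, b"filesystem metadata")   # host thinks this is on disk now
drive.power_cut()                        # brownout before any flush
print(drive.platters.get(0))             # None -> the write never landed
```

Scale that up to a few megabytes of NTFS metadata in flight and you get exactly the kind of mess chkdsk chokes on.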
 
Snugglebear said:
If you want a reasonable solution to safeguard the data, just grab an external drive or removable drive kit...
That makes sense. I love that it makes sense.

I hate that it happened. Thank you :)
 
This thread was bad luck for me... I read it last night, and now this morning when I go to turn on my computer, the Intel ICH7R RAID boot screen says volume disk #2 has failed :(

I am so tired of these Hitachis - they worked fine for a number of months, and all of my friends have been having problems with them after 2-3 months themselves... I guess it was my turn to get skullf***ed by my hard drives.

RIP 500GB of data from the past year... at least I have a 120GB with some of the stuff backed up... but the last backup I did was in May... so poop. :bang head
 
Coming from left field dept.

As soon as he noticed the failure, would an immediate run of "System Restore" to an earlier working set have had a snowball's chance in h*** of setting it back up?
 
Sneaky said:
This thread was bad luck for me... RIP 500GB of data from the past year...
I send to you my sincerest apologies, the greatest virtual hug possible, and an invitation to the mass suicide organised for tomorrow.

As far as firmware is concerned - I can't shut the server down yet to check from the card (it runs email, DNS, etc.), but I can tell you that:

Agent version (presumably PAM, I don't know): 4.0.0.90
Driver version: 2.0.0.25

As far as System Restore is concerned, I don't think it would've worked at all - but someone can prove me wrong if they wish :)
 
System restore isn't going to help with corrupted arrays. Once something in your system happily goes and writes garbage to the disk(s), it's pretty much over unless you have a full backup image (ghost, ultrabak, dd, etc.) someplace else. Microsoft is only concerned with imaging the OS and essential data, not the entire system contents or those on data-only volumes. Hence system restore is useful if an installation goes bad, a new driver doesn't work out, or the system picks up some spyware. System restore isn't much use when younger siblings come and delete all your data, coffee/mountain dew/tea gets spilled on the system, or girlfriends/wives do that damn thing where they brush their hair while walking around the house in wool socks and then hop in your lap while some part of either body is in contact with expensive computer equipment.
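
For the "full backup image" route, the whole trick is a raw block-for-block copy of the volume to somewhere else, which is essentially all dd does. A minimal Python equivalent of that idea - the paths below are examples only, and reading a raw device needs admin rights:

```python
# Bare-bones raw image copy in the spirit of `dd if=SOURCE of=IMAGE bs=1M`:
# read the source (a device node or any file) in 1 MiB chunks and append
# them to an image file.
CHUNK = 1024 * 1024  # 1 MiB per read, like bs=1M

def make_image(source_path, image_path):
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as img:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            img.write(chunk)
            copied += len(chunk)
    print(f"Wrote {copied} bytes to {image_path}")

# Example (hypothetical paths): make_image("/dev/sdb", "/mnt/external/array.img")
```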
 
Snugglebear said:
girlfriends/wives do that damn thing where they brush their hair while walking around the house in wool socks and then hop in your lap while some part of either body is in contact with expensive computer equipment.

Sounds as if it's happened before? :eek:
 
In one of my systems, this exact same thing has happened to me TWICE since May 29th. Turned out to be faulty rounded IDE cables. The cables were swapped on May 27th, and the corruption began.

A similar thing happened with a SATA array on a server that I was called in to repair. They had the SATA, IDE, and power cables bundled up and zip-tied together in a huge mess. After proper cable routing, the problem hasn't returned.

Check your cables and cable routing.
 
Garrett_thief2 said:
Turned out to be faulty rounded IDE cables... They had the SATA, IDE, and power cables bundled up and zip-tied together in a huge mess...

Not an OEM machine, I take it?
 
When my RAID 0 of two HDDs failed due to one of the drives dying, I used Easy Recovery Professional - it worked like a charm (took a long time though) and recovered all my data. BTW, Easy Recovery Pro got my data when GetDataBack and R-Studio couldn't even see the RAID array. It booted directly into DOS and performed a BIOS-level recovery; it can do it from Windows, but like I said, Windows didn't see the array anymore.

I would also try File Scavenger - it operates in Windows, but it can also reconstruct RAID 0 and RAID 5 arrays.
 