
Hardware raid partition becomes raw stumped, frustrated, about to cry!


silkshadow (Member, joined Apr 10, 2005, Philippines):
I'm having a PITA issue. It's a bit complex, with lots of possible points of failure, so I will be as thorough as possible here. I just built a new array with WD20EARS hard drives (see my thread on that for the tech details). Here are the highlights:

Windows 2008 x64
24 WD20EARS
8 drives run off a promise ex8350 (in a PCIe 4x slot)
12 drives run off an Areca ARC-1130 (in a PCIe 16x slot)
4 drives off of an Areca ARC-1110 (in a PCIe 16x slot)
Arrays set to GPT

I tested it for a week, as recommended in that thread, and it seemed to be doing OK.

So last week I started the big file copy over. I started with about 4TB of data and left it over the weekend. It finished, but some directories were showing up as corrupt. I rebooted and the partition came up as RAW (in Explorer it asked if I wanted to format; in Disk Management it showed as RAW but colored blue). Crap! I just lost about 2TB of stock video that I was using to test the array with the week before. Oh well, I have it all on discs, so I will just have to reload them at some point. Sucks, but not a big deal.

So I took the array down and checked each disk individually again for errors (plugged into the mobo SATA interface). All disks came up fine. I rebuilt the arrays and copied over only a couple hundred gigs of data to see how it would go. All OK. I set up a 1TB data copy and left it. Checked it before I went to sleep, looked OK. I rebooted the server and all OK. So I set up the rest of the 3TB to copy. Came back a few days later and same problem :(. RAW partition.

Now let me get a bit more descriptive with this. I was copying data to just one of my arrays, the 8-drive Promise one. I am doing this because I have a working 8-drive Promise array in operation, so if I have issues I can test this card with a group of known-good disks, which gives me troubleshooting flexibility. When I come back to the machine, the partition is still readable via Explorer. However, clicking on some directories brings up a "folder is corrupt" error. Trying to run chkdsk on it, chkdsk says it's RAW. When I reboot, the entire volume is RAW and Explorer asks me if I want to format it.

So at this point, I have ruled out the drives being bad, but that's about it. So it's time to do the troubleshooting mambo. First up, basic hardware competencies: a RAM test and a system stress test both come up fine. Next up, testing the disks/array in a working environment and testing the RAID card. I move the array that is currently RAW to my other ex8350, the one that is working and in operation. In that setup it also comes up RAW. Second, I move the working, in-operation 8 drives (1.5TB Samsungs) to the ex8350 in the problem machine. It works fine. I copy 1.2TB of data to it (all the free space on that array). I check it later and it's fine (phew, I was worried about losing data on that array). I swap the disks back (thank heavens for hot-swap bays).

OK, after that I am stuck for what to test, so I copy 2TB of data to the 12-drive Areca array. It seems to be OK. Three reboots and no problems. So I do the remaining 2TB copy (of my original 4TB attempt) and that finished today. RAW partition! :(

I am totally lost. Summing up, here's what I know:

1) The 24 drives are fine.
2) The Promise card is fine.
3) I didn't do exhaustive checks, but the mobo, RAM, video, etc. are fine. Passed a system stress test.
4) The data is a constant. I am using the same ~4TB of data for every copy attempt.

Please help? Ideas, anything at all. What can I do? It's been weeks now and I can't get my LAN back up, which is a terrible thing. I can't get back to work, I can't game, my office is a total disaster area since I started another project assuming I would just have to wait for this copy to finish, and I can't even enjoy my home theater setup. I'm tearing my hair out in frustration!

Thanks!

Edit: More info. The RAID level is RAID 5. Write cache is set to write-back, sector size is 4K, stripe size is 128KB. I don't really know more than what Google tells me about these settings, and I've used most of them before. I specifically chose the 4K sector size because of the "advanced format" EARS drives.

The first Windows format used the default allocation unit size, then I changed it to 64K, and after that I switched between default, 32K, and 64K to see if any of these made a difference. Again, I'm not an expert with any of this, just going off Google.
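For what it's worth, here's the back-of-envelope math I've been using to sanity-check those format settings (just a rough Python sketch based on my reading; the cluster sizes are the ones I tried, everything else is assumed):

```python
# Rough sanity check on the format settings: the NTFS allocation unit
# (cluster) should be a multiple of the drive's 4K physical sector, and
# the 128KB stripe should hold a whole number of clusters.
# These are the values I tried, not anything read back from the controller.

PHYS_SECTOR = 4096            # "advanced format" physical sector size
STRIPE = 128 * 1024           # stripe size set on the RAID card

for cluster in (4096, 32 * 1024, 64 * 1024):   # default, 32K, 64K
    multiple_of_4k = cluster % PHYS_SECTOR == 0
    clusters_per_stripe = STRIPE // cluster
    print(f"cluster {cluster // 1024}K: multiple of 4K = {multiple_of_4k}, "
          f"clusters per 128K stripe = {clusters_per_stripe}")
```

All three cluster sizes come out as clean multiples of 4K, so as far as I can tell the allocation unit choice alone shouldn't be what's corrupting things.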

Is there anything here I should change?
 
Maybe try a different data set to copy over?

Maybe you can only copy over <2TB at a time?

I'd be scared to run a RAID 5 set that large... too much can go wrong, IMO. What are you storing and accessing? Are you accessing the server exclusively over the LAN? Do you even need the speed that RAID 5 can provide (the speed of one drive will come close to maxing gigabit LAN anyway)? There are JBOD+parity solutions available that are much safer if all you're doing is media serving: unRAID and FlexRAID. Rough numbers behind the gigabit point are sketched below.
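(Ballpark figures only, not benchmarks from your build:)

```python
# Ballpark math behind "one drive nearly maxes gigabit LAN".
# The drive throughput figure is a rough estimate, not a measured number.

gbe_theoretical = 1_000_000_000 / 8 / 1_000_000   # 125 MB/s on the wire
gbe_practical = gbe_theoretical * 0.9             # ~112 MB/s after protocol overhead
single_drive = 100                                # MB/s sequential, typical 2TB green drive

print(f"GbE theoretical: {gbe_theoretical:.0f} MB/s")
print(f"GbE practical:   ~{gbe_practical:.0f} MB/s")
print(f"Single drive:    ~{single_drive} MB/s sequential")
```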
 
Are the drives dropping out of the array when you copy the data?

I did some quick looking and it seems these drives lack TLER (Time-Limited Error Recovery). What that means is that when the drive detects a read error, instead of giving up after a few seconds it keeps trying to fix it on its own, potentially for a minute or more. During that time the drive is unresponsive to the SATA/RAID controller, so your RAID controller thinks the drive is disconnected/dead and removes it from the array, which would produce your "RAW" partition if two drives did this before a rebuild finishes (and a rebuild would take 10+ hours on your array). You may be thinking, "The drives are good, why would they go offline to check for errors?" I'm honestly not sure why, but you can ask most users who have combined Western Digital's Green series with a RAID controller. It is nothing but a headache, as the drives will continuously drop out of the array.
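To put rough numbers on that timing (the timeout and recovery values below are illustrative assumptions, not specs for these particular drives or cards):

```python
# Toy model of the TLER problem: if a drive spends longer in error recovery
# than the controller is willing to wait, the controller marks it dead.
# Lose two members of a RAID 5 that way and the volume is gone.
# Values are illustrative, not specs for the Areca/Promise cards or WD Greens.

CONTROLLER_TIMEOUT_S = 8     # assumed controller patience before dropping a drive
TLER_DRIVE_RECOVERY_S = 7    # a TLER-capable drive caps its recovery time
GREEN_DRIVE_RECOVERY_S = 60  # a desktop Green drive can grind away for a minute+

def stays_in_array(recovery_s, timeout_s=CONTROLLER_TIMEOUT_S):
    return recovery_s <= timeout_s

print("TLER drive stays in array: ", stays_in_array(TLER_DRIVE_RECOVERY_S))   # True
print("Green drive stays in array:", stays_in_array(GREEN_DRIVE_RECOVERY_S))  # False
```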

I would suggest staying away from RAID 5 with that many hard drives, as you are just asking for a ton of issues. I'd go with RAID 10 instead.

Either way, you need to monitor the drives to see if they are dropping out of the array. If they are, it's time to return those drives and get ones that don't spend so long in error recovery (Hitachi, for example).
 
He is using WD Greens, Thid, and while I was typing this you responded again with more info... silly ninja.

As an FYI, WD and a few others have pretty much eliminated the ability to use their drives in a RAID array without serious issues (like you are seeing).

Older WD desktop drives (Velociraptor excluded) could not be safely used in RAID because TLER was turned off. A tool called WDTLER, which runs in a DOS PE, lets you switch on TLER on their desktop drives, making them safe to use in RAID. Their enterprise RE drives have TLER turned on by default. However, starting around October to December 2009, WD updated the firmware on all of their desktop drives so that TLER can no longer be enabled.
But the OP knew this, judging by his previous thread: http://www.overclockers.com/forums/showthread.php?t=651185

You might want to experiment with the jumpers and see what happens. Also, since you are getting corrupted data, I'd suggest testing your RAM and PSU. I have seen cases where bad power or bad RAM ends up trashing the data on a drive.
 
Doh, I thought they were Seagates for some reason. If those are Green drives, you are going to run into nothing but issues because of the TLER problem.
 
First of all thanks for the help!

Yeah, I am aware of the TLER issue. :( It's exactly what I was asking about in the thread I linked in the first post. When I bought them I was sure they would be WD20EACS drives (which WDTLER works on). However, they arrived as EARS and I'm pretty much stuck with these drives.

However, the drives are not dropping out of the array. In fact, the array itself is rock solid. It hasn't rebuilt itself once, even though I did pull the drives out of the array to test them individually (machine powered off, of course). Normally, even pulling and replacing drives with the machine off, it wouldn't be unusual to see a rebuild triggered.

Actually, this is not one big RAID 5 array; it's three RAID 5 arrays. Consider it, in a way, three-disk redundancy. And yup, both LAN access and local access.

I did run memtest on the RAM and it came up OK, as did a system stress test. I will give the PSU a look, but it's a redundant rack-mount PSU designed for reliability, I've only been using it for about a year, and its rails are rock solid.

Reading your posts, I am wondering about the "advanced format" situation. For a partition to go RAW, something critical like sector 0 must have gotten corrupted. I read this about the advanced format drives:

Through the jumpering of pins 7 and 8 on an Advanced Format drive, the drive controller will use a +1 offset, resolving Win 5.xx's insistence on starting the first partition at LBA 63 by actually starting it at LBA 64, an aligned position. This is exactly the kind of crude hack it sounds like, since it means the operating system is no longer writing to the sector it thinks it's writing to, but it's simple to activate and effective in solving the issue so long as only a single partition is being used. If multiple partitions are being used, then this offset cannot be used, as it can negatively impact the later partitions. The offset also cannot be removed without repartitioning the drive, as removing the offset would break the partition table.

While I don't fully understand this, considering the jumper adds a +1 offset, how does this affect Windows and sector 0 of the disk? I am also confused as to whether I should enable the jumper on all disks in the array or just the first one.
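As far as I can tell, the math behind that jumper is just this (a quick sketch of my understanding, not something I've verified on these drives):

```python
# The drive emulates 512-byte sectors, so a partition starting at LBA 63
# lands in the middle of a 4K physical sector, while LBA 64 starts right
# on a 4K boundary. The jumper silently shifts everything by one LBA so an
# XP-style LBA 63 start physically becomes LBA 64. Illustrative math only.

LOGICAL_SECTOR = 512
PHYS_SECTOR = 4096

for lba in (63, 64):
    offset = lba * LOGICAL_SECTOR
    print(f"LBA {lba}: byte offset {offset}, "
          f"4K aligned: {offset % PHYS_SECTOR == 0}")
```

If that reading is right, the jumper only papers over the old LBA 63 starting point, and since Server 2008 already starts partitions at 1MB (which is 4K-aligned), I'm not sure it would even do anything here.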

I don't know, I am all confused. Advanced format drives suck a$$. I saved almost a thousand dollars by buying these drives, but they were supposed to be EACS, not EARS. :(

I will start a new file copy with a new group of files and do 2TB at a time to see if that helps, but I am not hopeful that this is the problem. I've done file copies in the 7TB range to this very machine before without a problem.

If you have any other ideas, I would be grateful for them.

Thanks!
 
Well, I am ready to give up! :( Unless someone has something I can try or test to figure out why this is happening? Is there another forum that deals specifically with storage issues that I might try? I'm pretty desperate.

The only way out of this I can see is to JBOD the drives and use Windows dynamic disks to create a software RAID array. It would be a huge waste of all these nice RAID cards, though. :(
 