Max0r said:
JCLW, you raise a point that I find impossible to ignore. I MUST FIND OUT MORE! Any info/links for me (regarding the error recovery and how certain error recovery scenarios unfold)?
Over time, all hard drives develop bad sectors. Modern drives have "spare" good sectors that data can be transferred to when a used sector starts to go bad.
IIRC most modern drives ship with 512 spare sectors, but your mileage may vary.
You can see how many sectors have been reallocated by looking at the SMART data. The raw value of the "Reallocated Sector Count" attribute is the number of sectors that have been reallocated.
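If you'd rather pull that number out with a script, here is a rough Python sketch. It assumes you have saved the attribute table printed by `smartctl -A` (from smartmontools) as text, and that Reallocated Sector Count is attribute ID 5 with the raw value in the tenth column. The sample table is made up for illustration, and the function name is my own; column layout can differ between tools and versions, so treat this as a starting point only.

```python
from typing import Optional

# Made-up sample in the style of a `smartctl -A` attribute table (illustrative only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4821
"""


def reallocated_sector_count(smart_text: str) -> Optional[int]:
    """Return the raw value of attribute 5, or None if the row is missing."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Assumption: ID is the first column and the raw value is the tenth.
        if len(fields) >= 10 and fields[0] == "5":
            return int(fields[9])
    return None


if __name__ == "__main__":
    print(reallocated_sector_count(SAMPLE))
```

A count that keeps climbing between checks is the thing to watch for, more than any single reading.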
But back to the point...
The drive's firmware controls how the drive behaves when it has trouble reading a particular sector.
A long time ago drives were dumb, and you'd get the dreaded "Abort, Retry, Fail?" whenever the drive couldn't read a sector. Then you'd have to drag out your Norton Disk Doctor floppy and let it do its magic. It would try to read the bad sector(s), move the data to unused good sector(s), and mark the bad ones as bad.
Now we have more intelligent drives, and the drive itself will detect the bad sector, attempt to recover the data and move it to a spare sector, and then hide the bad sector. This is all transparent to both the users and the OS.
When using a drive alone or in a RAID 0 configuration we want to give the drive as much time as possible to try to recover the data in order to maximize our chances of getting it all back. Note that while the drive is recovering data it does not respond to any controller commands. Some higher-end RAID controllers (3ware in particular) will mark a drive as "failed" if the drive does not respond to a command within a certain time period (usually around 10 seconds). So by the time the drive has finished recovering the bad sector, the RAID controller has already marked the drive as failed, and you're left with a degraded array. The only way to fix it would be to physically power cycle the drive (unplug it and plug it back in) and then rebuild the array (which can take hours, plus put a lot of stress on all the drives). And if your server is located in a different building it can be a real PITA.
So WD released a special series of hard drives (RE or RAID Edition) that spend a maximum of 7 seconds trying to recover your data, which keeps them under the controller's timeout. WD calls this TLER (Time-Limited Error Recovery). The firmware is also optimized for higher queue depths.
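To make the timing argument concrete, here is a toy Python model of the race between the drive's recovery attempt and the controller's timeout. The 7 second limit and the roughly 10 second controller timeout come from the description above; the function name, the outcome labels, and the idea that a time-limited drive simply reports a read error once its limit expires are my own simplifications, not anything taken from WD's firmware or a real controller.

```python
from typing import Optional


def recovery_outcome(
    recovery_needed_s: float,
    drive_limit_s: Optional[float],
    controller_timeout_s: float = 10.0,
) -> str:
    """Toy model: what happens when a drive hits a hard-to-read sector.

    recovery_needed_s    -- time the drive would need to recover the sector
    drive_limit_s        -- firmware cap on recovery time (None = no cap)
    controller_timeout_s -- how long the controller waits for a response
    """
    # The drive is unresponsive until it recovers the sector or gives up.
    if drive_limit_s is None:
        busy_s = recovery_needed_s
    else:
        busy_s = min(recovery_needed_s, drive_limit_s)

    if busy_s >= controller_timeout_s:
        return "dropped"          # controller marks the drive as failed
    if drive_limit_s is None or recovery_needed_s <= drive_limit_s:
        return "recovered"        # drive finished before anyone gave up
    return "error_reported"       # drive gave up in time and reported the error


if __name__ == "__main__":
    for needed in (3.0, 30.0):
        print(needed, "desktop:", recovery_outcome(needed, None))
        print(needed, "RE:", recovery_outcome(needed, 7.0))
```

In this model a desktop drive that needs 30 seconds gets dropped from the array, while the time-limited drive answers within 7 seconds and stays in it. That is also why the time limit works against you on a standalone drive: there is no redundancy to fall back on when the drive gives up early.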
edit: this is what I was looking for earlier:
Western Digital said:
It is important to realize TLER-capable hard drives should not be used in non-RAID environments.
http://www.wdc.com/en/library/sata/2579-001098.pdf