
Current_Pending_Sector woe (WD30EFRX)


HankB

Member
Joined
Jan 27, 2011
Location
Beautiful Sunny Winfield
:(
(NB: I'm running Linux so questions and answers are in terms of that.)
I found that a drive had been dropped from a RAID1 array a couple of days ago. At that time the drive no longer responded to SMART queries ('smartctl -a /dev/sdc'). Following a reboot the drive responds again, but it has not been added back to the MD RAID1 array. (I'm not sure if this is automatic or if it requires manual intervention - this is the first time I've had a drive dropped from an MD RAID.)
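(For what it's worth, my understanding is that MD won't re-add a dropped member on its own. I expect it will take something along these lines once I trust the drive again - assuming the array is /dev/md0 and the dropped member is /dev/sdc1, which are stand-ins for the real names:)

Code:
cat /proc/mdstat                    # confirm which array is degraded
mdadm --detail /dev/md0             # check member states
mdadm /dev/md0 --re-add /dev/sdc1   # try to re-add the dropped member
# if --re-add is refused, a plain --add forces a full resync instead:
mdadm /dev/md0 --add /dev/sdc1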

Looking at the SMART diagnostics I find:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       201
  3 Spin_Up_Time            0x0027   196   178   021    Pre-fail  Always       -       5158
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       70
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   156   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       33491
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       70
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   194   194   000    Old_age   Always       -       19264
194 Temperature_Celsius     0x0022   123   106   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

I performed a 'long' self test ('smartctl -t long /dev/sdc') and it failed quickly. Results of all self tests are:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     33491         239126392
# 2  Extended offline    Completed without error       00%     30627         -
# 3  Short offline       Completed without error       00%     30619         -
Normally the self test takes hours on a drive of this size (3TB).

I decided to perform a read/write test using 'badblocks' ('badblocks -v -s -n /dev/sdc')

Code:
root@oak:/home/hbarta# badblocks -v -s -n /dev/sdc
Checking for bad blocks in non-destructive read-write mode
From block 0 to 2930266583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern:   0.13% done, 58:30 elapsed. (0/0/0 errors)

If my arithmetic is right, it's going to take about a month to complete the badblocks scan. :( Of course, if it hits the bad block sooner, it may not need to finish.
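(Rough check on that arithmetic, using the "0.13% done, 58:30 elapsed" figure from the badblocks output above:)

Code:
# 58.5 minutes for 0.13% of the drive => total run time in days
echo 'scale=1; (58.5 / 0.0013) / 60 / 24' | bc
# prints 31.2, i.e. about a month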

My understanding of the 'Current_Pending_Sector' count is that it counts sectors that could not be read. A pending sector will be remapped when it is written. (Or, if it is written and can then be read back, it will not be remapped and the pending count is cleared.)
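(If it comes to that, my understanding is that the sector can be forced to rewrite or remap by writing directly to the LBA the self test reported. Something like the following with hdparm - note that it zeroes that sector, which I'd only consider acceptable because the other RAID1 member still has a good copy; the LBA and device here are just the ones from my logs above:)

Code:
# try to read the suspect LBA (expected to fail if it really is unreadable)
hdparm --read-sector 239126392 /dev/sdc
# overwrite that sector with zeros so the firmware can rewrite or remap it
hdparm --yes-i-know-what-i-am-doing --write-sector 239126392 /dev/sdc
# then re-check the pending/reallocated counts
smartctl -A /dev/sdc | grep -i -e pending -e reallocated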

I'm curious whether anyone has a better suggestion. I've purchased a copy of 'Spinrite' and could use that to check out the drive, though I haven't taken anything apart to do that yet. (I'd have to remove the drive from the server.) I'm also wondering whether this is a minor issue or if I should be ordering a replacement.

The drive is a WD 3TB Red NAS (model WD30EFRX). And of course it's out of warranty. ;)

There's a lot more information in the full SMART output; if anyone is interested, it can be viewed at https://pastebin.com/cs78mkMQ

Thanks!

Edit.01: 'badblocks' seems to make the drive unresponsive (as when I originally found it). That seemed to happen about the time I expected it to hit the bad block (again, assuming my calculations were correct). I will need to pull the drive to perform further diagnostics. I'm also looking at a replacement drive: I'm thinking about an HGST 8TB helium-filled drive, model HUH728080ALE600. Backblaze is seeing good results with this drive. (Go big or go home!) Part of the reason for upping the size is a potential migration to ZFS. Due to heavy hard linking, my backups nearly double in size when migrated to a new FS; 'rsync' can't preserve hard links for data sets this large. That drive is $190US on Amazon at present.
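(To be clear, rsync does have an option to preserve hard links - roughly the command below, with made-up mount points - but it has to track every linked inode it has seen, which is what becomes impractical on a backup tree with this many links.)

Code:
# -a archive mode, -H preserve hard links; rsync must remember every
# hard-linked inode, which bogs down on huge link-heavy backup trees
rsync -aH /mnt/old_backups/ /mnt/new_backups/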

Edit.02: Spinrite doesn't seem to work with a 3TB drive on the PC I have available to run it. :( Support request submitted.

Edit.02.1: I asked the vendor of the HGST drive listed above about the warranty and they replied with "30 day return policy." I did some searching and found that WD has a site where a user can check warranty status, and some drives come up as 'warranty provided by seller.' I've never had a drive fail within the warranty period, but I'm reluctant to rely on the seller. And now I need to check the warranty status of some other cheap HGST drives purchased in the past.

Edit.03: Reply from GRC. Spinrite doesn't grok GPT-partitioned drives or drives bigger than 2TB. :mad: I withdraw my previous recommendation for their product.
 
"It's dead Jim"

I've gone through several read/write cycles. Writing clears the pending bit and reading sets it again. Most recently I tried the SMART long self test a couple of times; it stops at the first error, at 10% done. The pending bit remains clear, but now the Multi_Zone_Error_Rate is climbing. I'm writing this drive off. Luckily it was part of a RAID1 array (which is backed up to two other RAID1 arrays at the moment, since I'm getting ready to commission another backup server), so no bytes were lost.
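(For the record, retiring the member is just the usual mdadm steps - again assuming /dev/md0 and /dev/sdc1 as stand-in names:)

Code:
mdadm /dev/md0 --fail /dev/sdc1       # mark the member failed, if MD hasn't already
mdadm /dev/md0 --remove /dev/sdc1     # drop it from the array
mdadm --zero-superblock /dev/sdc1     # optional: wipe the MD metadata before the drive leaves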
 
At nearly 4 years powered on it had a good run. I didn't see this thread earlier, but I've kinda taken to thinking that on the first error I'd replace a drive in a RAID as a matter of routine, rather than try to revive it.

I had a WD Red, within warranty, just drop out once. No explanation. All tests showed nothing wrong, but my confidence in that drive was already gone. It now acts as a game storage drive in a desktop, where there haven't been problems. I've even gone further the other way and no longer pay a premium for so-called NAS drives. Backblaze style: I just get the cheapest per capacity.
 
At nearly 4 years powered on it had a good run. I didn't see this thread earlier, but I've kinda taken to thinking that on the first error I'd replace a drive in a RAID as a matter of routine, rather than try to revive it.
I'm inclined to agree, except that I prefer to confirm that there is really a problem before discarding a drive. I have a 200GB Seagate Barracuda that reported errors and developed a few remapped sectors a few months in. I had a pretty good idea that it was the result of line power issues - the kind that cause the lights to dim and flicker before going out. I've run it for about 7 years now without further issue. I think it's worth spending a little time to differentiate a momentary glitch from a true problem. I also had a 2TB Seagate Barracuda that grew remapped sectors over a period of years before I replaced it. At replacement it had about 3500 remapped sectors. It was useful to see what would really happen when ZFS identifies a failed drive, because it would not survive a scrub. ;) In both cases these drives were RAID members, so complete loss of a drive would not have been catastrophic.
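(For anyone curious, watching that happen is just a matter of scrubbing the pool and checking the status - 'tank' being a placeholder pool name:)

Code:
zpool scrub tank        # walk every block in the pool and verify checksums
zpool status -v tank    # per-device read/write/checksum error counts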

At present I'm using either RAID1 or RAIDZ2, depending on drive size and count. In the former I can lose half the drives (1 of 2 ;) ) and the latter allows the loss of 2 drives (2 of 6) before I lose data. And these are backed up elsewhere, so I'd have to lose at least three RAIDs before I actually lost data.
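(For reference, the 6-drive RAIDZ2 layout is what you get from something like this - pool and device names made up:)

Code:
# a single 6-disk raidz2 vdev: any 2 of the 6 disks can fail without data loss
zpool create tank raidz2 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
    /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6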

I guess 4 years isn't too bad.
 
Hmmm... Just about any metal except gold (AFAIK) is going to corrode; that property of gold is why it's used to plate contacts. Also, I didn't see what problem the author saw with his drives.

Here is a portion of the PCB on mine that he seemed to focus on.

[Attached photo: DSC_3599-PP.JPG - the area of the PCB in question]
 