Resurrecting a failed SSD

HankB · Jun 23, 2019

I've done it! This is a Crucial M4 that began reporting errors a while ago. I tried running self tests on it, expecting the drive to simply remap a bad sector and go on, but the self test terminated with 90% to go. Remapping is standard operation on many of the HDDs I've used.

Symptoms were failing disk operations. I don't recall the details; it's been a while. Other symptoms were logged SMART errors and some questionable SMART statistics. And finally, it took a l-o-n-g time for 'smartctl' to read the SMART statistics from the drive. I felt the problem was compounded by faulty drive firmware, which should have handled the problem and continued normal operation.

Then I ran across this page: https://www.smartmontools.org/wiki/BadBlockHowto
There's a lot of detail there about how to figure out just what offsets to pass to 'dd' to rewrite the ailing sector. I didn't bother with that. First, the calculations made my head hurt.

Second, I wanted to write the entire drive to reveal any other bad spots. I think I did something like `dd if=/dev/urandom of=/dev/sdd`. It would have been smarter to use /dev/zero as the source. (*) Maybe I'll repeat this. Wouldn't hurt to repeat the check.
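For anyone wanting to repeat the full-drive write with /dev/zero, here is a rough sketch rather than the exact command that was run. The helper names `zero_fill` and `count_nonzero` are made up for illustration and `/dev/sdX` is a placeholder. The optional size argument exists only so the function can be tried safely on a scratch file first.

```shell
# zero_fill TARGET [MIB]
# Overwrite TARGET with zeroes, 1 MiB at a time.
# With MIB omitted, dd keeps writing until the target is full. On a whole
# block device that is the intent, and dd then exits non-zero with
# "No space left on device", which is expected.
zero_fill() {
    target=$1
    mib=${2:-}
    if [ -n "$mib" ]; then
        dd if=/dev/zero of="$target" bs=1048576 count="$mib" conv=notrunc
    else
        dd if=/dev/zero of="$target" bs=1048576 conv=notrunc
    fi
}

# count_nonzero TARGET BYTES
# Print how many of the first BYTES bytes of TARGET are not zero.
count_nonzero() {
    head -c "$2" "$1" | tr -d '\000' | wc -c | tr -d ' '
}

# Destructive, so left commented out. /dev/sdX is a placeholder:
# zero_fill /dev/sdX
```

Any write or read error dd hits along the way shows up on stderr and in the kernel log, which is the point of touching every sector.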

Following this, I created a ZFS filesystem. The reason for this choice is that ZFS checksums all data written to the drive and has a 'scrub' operation that reads all data back to verify integrity. I then filled the disk with files and initiated a scrub. The scrub just finished with no errors reported.
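The ZFS check described above might look roughly like this. It is a sketch under assumptions, not the commands actually used: the pool name `testpool`, the device `/dev/sdX`, the fill file name and both function names are hypothetical, and the fill step uses a single large file instead of many files. It needs root and a ZFS install, so nothing is invoked here.

```shell
# zfs_fill_and_scrub POOL DEVICE
# Create a pool on DEVICE (destroys its contents), fill it, start a scrub.
zfs_fill_and_scrub() {
    pool=$1
    dev=$2
    # Writes the partition table, creates the pool and mounts it at /POOL.
    zpool create "$pool" "$dev" || return 1
    # Fill the pool with random data; dd stops with an error once it is full.
    dd if=/dev/urandom of="/$pool/fill.bin" bs=1048576
    # Starts the scrub and returns; the scrub itself runs in the background.
    zpool scrub "$pool"
}

# no_known_errors
# Read `zpool status` text on stdin; succeed only if it reports a clean pool.
no_known_errors() {
    grep -q '^errors: No known data errors'
}

# Usage (placeholders, left commented out):
# zfs_fill_and_scrub testpool /dev/sdX
# zpool status testpool | no_known_errors && echo "scrub clean"
```

Run the status check only after `zpool status` shows the scrub has finished, since the scrub continues after the `zpool scrub` command returns.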

Nice!

It's not a particularly fast or large drive, and I wouldn't put it in anything I considered mission critical, but it will be useful as a boot drive for any of a number of lab systems that I fool around with. Oh, and now fetching SMART stats is no longer delayed. What I see is

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       13
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8192
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       27436
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1384
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       2
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   098   098   010    Pre-fail  Always       -       65
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       81
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       13 0 13
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2727
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       131
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       12898
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       38
202 Perc_Rated_Life_Used    0x0018   098   098   001    Old_age   Offline      -       2
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       0
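To watch just a few of these attributes over time, something like the following could pull them out of `smartctl -A` output. The function name `smart_raw` is made up for this sketch and `/dev/sdX` is a placeholder. It prints only the first field of the raw value, so an attribute like 181 with a multi-part raw value would be cut short.

```shell
# smart_raw ID...
# Read `smartctl -A` output on stdin and print "ATTRIBUTE_NAME RAW_VALUE"
# for each requested attribute ID.
smart_raw() {
    ids=" $* "
    # Attribute rows start with a numeric ID; the raw value is field 10.
    awk -v ids="$ids" '$1 ~ /^[0-9]+$/ && index(ids, " " $1 " ") { print $2, $10 }'
}

# Usage (needs root and smartmontools; /dev/sdX is a placeholder):
# smartctl -A /dev/sdX | smart_raw 5 196 197
```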

I'll comment on "5 Reallocated_Sector_Ct". 8192 decimal is 0x2000 in hexadecimal, in other words a single bit set (bit 13). I don't think this is accurately reported. I put more stock in there being no pending sectors and in 196 Reallocated_Event_Count being only 2.
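The hex conversion and the "one bit set" observation are easy to check from a shell. The helper names `to_hex` and `bits_set` are just for this example.

```shell
# Print a decimal value in hexadecimal.
to_hex() {
    printf '0x%X\n' "$1"
}

# Count how many bits are set in a non-negative integer.
bits_set() {
    n=$1
    count=0
    while [ "$n" -gt 0 ]; do
        count=$((count + (n & 1)))
        n=$((n >> 1))
    done
    echo "$count"
}

to_hex 8192      # prints 0x2000
bits_set 8192    # prints 1
```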

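(*) As I understand it, writing zeroes to an SSD lets some controllers treat the sector as unused, so it need not be erased before being overwritten. Or something like that. Strictly, it is the TRIM (discard) command that tells the controller a sector is unused, and whether zero-filled sectors get the same treatment depends on the controller. Either way, the SSD performs better when its wear leveling algorithms have empty sectors to work with.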

EarthDog · Jun 24, 2019

So you secure erased the drive and it came back?

HankB · Jun 24, 2019

EarthDog said:
So you secure erased the drive and it came back?

For some definition of secure erase. I don't want this to be confused with the drive's ability (if supported by the M4) to erase itself securely. In some cases, that just causes the drive to regenerate an internal encryption key and that would probably not achieve this result. But writing the entire drive should accomplish a secure erase and that is what I did.

EarthDog · Jun 24, 2019

Yep... that is a fairly typical step to secure erase the drive. Most all drives have software which will SE the drive where it writes out zeros and wipes partitions. You need to initialize the drive and drop a partition etc on it.

HankB · Jun 24, 2019

EarthDog said:
... You need to initialize the drive and drop a partition etc on it.

ZFS is spoiling me.

A single 'zpool create ...' writes the partition table, creates a partition, formats it and mounts it under the name of the pool, all in one command.
