
SSD system area rewrite death (no wear levelling there)-- is that a thing?


Max0r

So apparently a big reason SSDs like to die suddenly is that they have on their storage media a reserved area used to manage the rest of the space with things such as translation tables. This area, or parts of it, cannot be wear-levelled/remapped and there is no health indicator for it. As a result, regardless of the state of the rest of the media, if this system area ends up getting rewritten in some spots too much, as soon as any part of that area fails, it's game over for the entire device. I will quote this blog post


Why SSD Drives Fail with no SMART Errors

SSD drives are designed to sustain multiple overwrites of their entire capacity. Manufacturers warrant their drives for hundreds or even thousands of complete overwrites. The Total Bytes Written (TBW) parameter grows with each generation, yet we’ve seen multiple SSD drives fail significantly sooner than expected. We’ve seen SSD drives fail with as much as 99% of their rated lifespan remaining, with clean SMART attributes. This would be difficult to attribute to manufacturing defects or bad NAND flash, as those typically account for around 2% of devices. Manufacturing defects aside, why can an SSD fail prematurely with clean SMART attributes?

Each SSD drive has a dedicated system area. The system area contains SSD firmware (the microcode to boot the controller) and system structures. The size of the system area is in the range of 4 to 12 GB. In this area, the SSD controller stores system structures called “modules”. Modules contain essential data such as translation tables, parts of microcode that deal with the media encryption key, SMART attributes and so on.

If you have read our previous article, you are aware of the fact that SSD drives actively remap addresses of logical blocks, pointing the same logical address to various physical NAND cells in order to level wear and boost write speeds. Unfortunately, in most (all?) SSD drives the physical location of the system area must remain constant. It cannot be remapped; wear leveling is not applicable to at least some modules in the system area. This in turn means that a constant flow of individual write operations, each modifying the content of the translation table, will write into the same physical NAND cells over and over again. This is exactly why we are not fully convinced by endurance tests such as those performed by 3DNews. Such tests rely on a stream of data being written onto the SSD drive in a constant flow, which loads the SSD drive in an unrealistic manner. On the other side of the spectrum are users whose SSD drives are exposed to frequent small write operations (sometimes several hundred operations per second). In this mode, there is very little data actually written onto the SSD drive (and thus very modest TBW values). However, system areas are stressed severely, being constantly overwritten.

Such usage scenarios will cause premature wear on the system area without any meaningful indication in any SMART parameters. As a result, a perfectly healthy SSD with 98-99% of remaining lifespan can suddenly disappear from the system. At this point, the SSD controller cannot perform successful ECC corrections of essential information stored in the system area. The SSD disappears from the computer’s BIOS or appears as empty/uninitialized/unformatted media.

If the SSD drive does not appear in the computer’s BIOS, it may mean its controller is in a bootloop. Internally, the following cyclic process occurs. The controller attempts to load microcode from NAND chips into the controller’s RAM; an error occurs; the controller retries; an error occurs; etc.

However, the most frequent point of failure is errors in the translation module that maps physical blocks to logical addresses. If this error occurs, the SSD will be recognized as a device in the computer’s BIOS. However, the user will be unable to access information; the SSD will appear as uninitialized (raw) media, or will advertise a significantly smaller storage capacity (e.g. 2MB instead of the real capacity of 960GB). At this point, it is impossible to recover data using any methods available at home (e.g. the many undelete/data recovery tools).
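If I'm reading that right, the core of it is that the user data gets remapped but the bookkeeping about the remapping does not. Here's a toy sketch of a page-mapped translation layer (purely illustrative, made-up structures, nothing like a real controller's firmware) just to show how a pile of tiny writes produces almost no TBW while constantly dirtying the mapping table:

```python
# Toy page-mapped FTL sketch -- purely illustrative, NOT how any real
# controller is implemented. It only shows why every small host write
# also dirties the logical-to-physical mapping table, independent of
# how many bytes of user data were actually written.

PAGE_SIZE = 4096  # assume 4 KiB mapping granularity

class ToyFTL:
    def __init__(self):
        self.l2p = {}             # logical page -> physical page
        self.next_phys = 0        # naive "always write to a fresh page"
        self.host_bytes = 0
        self.mapping_updates = 0  # how often the table itself changed

    def write(self, offset, data):
        """User pages get wear-leveled by remapping, but each write
        also mutates the mapping table."""
        for i in range(0, len(data), PAGE_SIZE):
            lpn = (offset + i) // PAGE_SIZE
            self.l2p[lpn] = self.next_phys   # remap to a fresh physical page
            self.next_phys += 1
            self.mapping_updates += 1        # table entry rewritten
        self.host_bytes += len(data)

ftl = ToyFTL()
# 100k tiny writes, all hitting the same few logical pages (log/cache churn)
for n in range(100_000):
    ftl.write(offset=(n % 8) * PAGE_SIZE, data=b"x" * 512)

print(f"host data written : {ftl.host_bytes / 1e6:.1f} MB")   # ~51 MB of TBW
print(f"mapping updates   : {ftl.mapping_updates:,}")         # 100,000 table edits
```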

If indeed this is the case, it would certainly motivate me to take steps to slow the march toward such failures on OS drives or important work drives, to avoid unexpected downtime; for example, by moving, as much as possible, temporary file/cache operations either to separate throw-away SSDs or to an HDD.

I was bothered quite a bit by just how much data is constantly being written to the OS drive by all sorts of programs (kind of silly that they'd need to do that all day long, if you ask me), but when I did the math, it would take a large number of years before it came anywhere close to the supposed rewrite capacity. But that sounded just too good to be true... and indeed, it appears it WAS too good to be true, because general storage media wear-down isn't where the true danger lies.
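For the record, the math I mean is just this sort of back-of-envelope thing (the numbers below are placeholder assumptions, not my actual figures):

```python
# Back-of-envelope TBW math -- the numbers are illustrative assumptions only.
daily_host_writes_gb = 30      # assumed average host writes per day
tbw_rating_tb = 600            # assumed endurance rating of the drive (TBW)

days = tbw_rating_tb * 1000 / daily_host_writes_gb
print(f"~{days / 365:.0f} years to exhaust the rated TBW")   # roughly 55 years
```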

Accurate?
 
Glad you did the math, as writes haven't been a concern for several years/multiple generations of SSDs. I was wondering where this was going, lol.


If indeed this is the case, it would certainly motivate me to take steps to slow the march toward such failures on OS drives or important work drives, to avoid unexpected downtime; for example, by moving, as much as possible, temporary file/cache operations either to separate throw-away SSDs or to an HDD.
I recall reading about this a few years back, but it never tickled me to do anything about it. I shrugged my shoulders and moved on. Like for any drive failure, always have a good backup and when something fails, replace.........rebuild. I've been 'lucky' (lol) and not had one fail in this manner/prematurely (read: before its warranty).

are users whose SSD drives are exposed to frequent small write operations (sometimes several hundred operations per second).
So, this is a worst case... but whose system is doing this constantly... to the tune of multiple hundreds of ops/s? Is this your usage scenario, or is paranoia swaying the decision to act?

So while the article may be accurate (why some drives die without warning), I'm not concerned (or, IMO, nor should most people) and it doesn't lead me to take any action. Use your M.2 drive... have good backups... don't put your temp files on a slow arse HDD... stop playing in the minutia. :soda:
 
it never tickled me to do anything about it. I shrugged my shoulders and moved on.
I can't help but notice that seems to be your approach to... everything? :chair:

Like for any drive failure, always have a good backup and when something fails, replace.........rebuild
As soon as my disposable income becomes high enough, I will have replacement hardware in stock at all times, not to mention probably replacement systems, so that this won't matter. Until then, the financial and/or operational wrench of such failures pushes me toward... delaying them as far as possible, if the efforts involved are easy enough, which... they seem very much to be.

So, this is a worst case... but whose system is doing this constantly... to the tune of multiple hundreds of ops/s? Is this your usage scenario, or is paranoia swaying the decision to act?
Unfortunately, I don't even know what constitutes a write being one operation or another. Is it demarcated by file, i.e. one file is one op and another file is another? Clearly it isn't demarcated by size, at least not in any major way. Is it demarcated by start/completion, i.e. a write is triggered, and once that task completes, a new write to the same file counts as a new operation, and if so, would each one potentially update a translation table? There's no way for me to know. What I can say is that silly programs are constantly doing writes, and not in some large, sequential batch-job way it seems. Whether they constitute many operations or only a few, I could not begin to fathom. And whether every operation is roughly equal in how it hammers certain sections of the system area, I again could not begin to fathom. This is an unknown, and when it comes to hedging against problems, unknowns suggest leaning toward the worst case, so long as the hedge itself is significantly less costly than the potential cost of failing to hedge. The potential cost is experiencing OS SSD failure months or even years earlier than I otherwise would have. I would reckon that as long as this drive can go 2 years before failure, I'll be in a much better position to not give a ****. However, if it were to fail within 1 to 1.5 years, it would be a far greater disruption. Not a disaster, but a great annoyance. I figure by just offloading a little bit off the drive, I can ensure, should the drive itself not be relatively defective, that the failure will be pushed beyond that timeframe.
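The closest I can get to counting these from my side is watching per-process write calls at the OS level; here's a rough sketch using psutil (these are OS-level write operations, which may or may not correspond to whatever the controller internally counts as an operation):

```python
# Rough per-process write-op monitor using psutil (pip install psutil).
# It counts OS-level write calls over an interval; this is NOT the same as
# NAND program operations or FTL/table updates, but it shows the order of
# magnitude of "small write" traffic being discussed in this thread.
import time
import psutil

INTERVAL = 10  # seconds to sample

def snapshot():
    counts = {}
    for p in psutil.process_iter(['pid', 'name']):
        try:
            io = p.io_counters()
            counts[p.info['pid']] = (p.info['name'], io.write_count, io.write_bytes)
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            pass
    return counts

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

rows = []
for pid, (name, wc, wb) in after.items():
    if pid in before:
        d_ops = wc - before[pid][1]
        d_bytes = wb - before[pid][2]
        if d_ops:
            rows.append((d_ops / INTERVAL, d_bytes / INTERVAL / 1024, name))

# Top 15 writers by operations per second
for ops, kib, name in sorted(rows, reverse=True)[:15]:
    print(f"{ops:8.1f} writes/s  {kib:10.1f} KiB/s  {name}")
```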
Use your M.2 drive... have good backups... don't put your temp files on a slow arse HDD...
Shifting my browser cache to the HDD had zero effect on browser performance/responsiveness. In fact, shifting many applications' work space to the HDD had no effect either. On the contrary, having the HDD be the OS drive, or the game drive, application drive, etc. would have a terrible effect. I remember the last time I ran an old copy of WoW off its HDD backup location just to check something in an old config. I think I had time for a shower and a shave before the damn game loaded. Normally it loads almost instantly.
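For anyone wanting to do the same, this is roughly how I'd script the cache move; the paths are just examples for Chrome, the browser has to be fully closed first, and Chrome can alternatively be launched with --disk-cache-dir= (Firefox has a browser.cache.disk.parent_directory pref) if you'd rather avoid the junction:

```python
# Sketch: move a browser cache directory to another drive and leave an
# NTFS junction behind so the application never notices. The paths are
# hypothetical examples -- adjust to your own profile, and make sure the
# browser is fully closed before running this.
import os
import shutil
import subprocess

cache  = r"C:\Users\Me\AppData\Local\Google\Chrome\User Data\Default\Cache"
target = r"D:\offload\ChromeCache"

os.makedirs(os.path.dirname(target), exist_ok=True)
shutil.move(cache, target)   # relocate the existing cache to the HDD
# Leave a junction behind so the browser keeps using its usual path.
subprocess.run(["cmd", "/c", "mklink", "/J", cache, target], check=True)
```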

For the time being, none of my application usage requires a level of performance that would necessitate an SSD for a temporary-file work space. All the misc. storage operations are numerous but not really hard for any device to keep up with. In fact, I was even shocked that encoding/cutting some videos didn't seem to be affected, but I'm not doing hardcore stuff with that, nor am I messing with super-high-resolution, super-high-bitrate material. I've tried opening/navigating through video files from both sources. No difference. The only real difference is if I have thumbnails on and a huge number of video files in one folder. In that case, it will take some time for all the thumbnails to load from the HDD. However, this is a use scenario so uncommon for me that it wouldn't be a bottleneck; if I needed to quickly process certain videos, they'd already have a special place anyway where there aren't hundreds of videos in one folder.

I suspect even when I'm recording videos of stuff happening on my screens, using the HDD will have no adverse effect on performance. However I will test both, as even the slightest loss of either responsiveness or video smoothness would be unacceptable, and I wouldn't be surprised if using the HDD might have some subtle effect on those things, in higher GPU/CPU usage scenarios. I suspect doing some video recording onto the SSD won't really generate a huge amount of discrete "operations" anyway, and I would always maintain a high amount of free space on that SSD.

If I really do require an SSD workspace for some very heavy usage, I can always just add a work SSD in and split up the wear between them.

stop playing in the minutia. :soda:
No.

I don't think potentially adding many years to the life of an SSD without running into performance problems for myself is minutia.

Not everyone is prepared to just keep buying new **** all the time when it fails. 50-150 bucks for some people is a big deal. Given the choice between spending it every 5-15 years or every 1-5 years I think many are in a position that the difference can be significant, especially if it fans out to multiple parts.

Do you consider turning something into e-waste years sooner than necessary to be minutia? Maybe you do. I don't think it is.
 
I can't help but notice that seems to be your approach to... everything? :chair:
I've been messing with PCs for so long, things like this don't tend to bother me, correct. I'd rather replace a drive after its warranty (I expect this when purchasing...) than continuously edit my system and make tweaks that (I deem) are unnecessary. So many other things can happen. When you get into one-offs, they add up and it gets more complicated.

I don't think potentially adding many years to the life of an SSD without running into performance problems for myself is minutia.
You say years, but in reality, neither of us knows. Your system is more than likely not pumping IOs to your OS drive like that worst-case scenario they describe anyway. In the end, we hedge our bets differently. :)

Given the choice between spending it every 5-15 years or every 1-5 years
Your timetable is a guess. And if you're under warranty...
I would reckon that as long as this drive can go 2 years before failure, I'll be in a much better position to not give a ****. However, if it were to fail within 1 to 1.5 years, it would be a far greater disruption.
....many of these devices are warrantied for 3-5 years in the first place, so although the time/effort to rebuild is still there (mitigated with proper backups - it takes me 20 mins to restore my OS image), you'll get a new drive and don't have to pay for it when within the warranty period. I don't know what drive you have, however, or where you're at in the product lifecycle. My OS drive is still under warranty (5 years, 3 left) but my secondary M.2's are past theirs I believe. Since they are secondary drives, I would guess this doesn't apply since there's no OS on the drive? I understand the System Area would still be there, but since there's no OS, I don't know why it would have excessive writes to that area.

I've got a spare 500GB SSD from several years back (OCZ Trion/Arc I reviewed?) I always keep in case of emergencies. You can buy, brand new, a 250GB SSD for an OS backup drive for $33. If your OS and games are on the same drive that won't be ideal of course, but you're up and running at least. Surely there are some used 2.5" SSDs worth considering as a backup if uptime is a consideration.

Also, I would think, if this issue were rampant, that these storage vendors wouldn't warranty their drives for so long. If they were eating warranty losses because of something like this, I wonder if things would change on the warranty front? Or... maybe they'd work with the OS vendors to NOT have that happen?

Do you consider turning something into e-waste years sooner than necessary to be minutia? Maybe you do. I don't think it is.
I don't consider not making those changes as 'accelerating SSD death' in the first place...more like natural causes. :p

This article didn't do much to convince me... unless you're the worst-case scenario they describe... which I don't know how we'd know. I try not to act on hunches (try, lol) but am driven more by facts. There are too many unknowns here to do anything, regardless of whether the mitigation efforts seem to have no performance impact on your system.

You asked, and this is my thinking after reading the article. To each their own in what they do with all of this info. That said, I understand that money doesn't grow on trees and a mitigating effort is a mitigating effort. I just want to make sure it's actually a problem and the mitigating efforts work before jumping in head first. :)



EDIT: Here's my logic...... from your article...and some info I googled.
Each SSD drive has a dedicated system area. The system area contains SSD firmware (the microcode to boot the controller) and system structures. The size of the system area is in the range of 4 to 12 GB. In this area, the SSD controller stores system structures called “modules”. Modules contain essential data such as translation tables, parts of microcode that deal with the media encryption key, SMART attributes and so on.
...and one I just found...
....has its own OS and file system that is different than that which the user interacts with when using the computer.

The file system used by the controller chip contains files specific to drive functions, such as firmware, the translator table, the defect table and others. When a user first presses the power button on an SSD device, the controller’s OS must walk through each of these files in order for the computer to boot up.

Your page file is part of the OS and lives under C:\; the System Area is not. The browser cache files are also a C:\ thing and wouldn't write to the part of the SSD that's needed to boot. With that in mind, would disabling that stuff only help with writes in the NAND area that has wear leveling, and not in the System Area modules that don't use it (wear leveling)?


Nothing terribly concrete in my links either, but they should shed some light on why...
A. ...changes to the page file and browser cache don't appear to help this concern (is that logical?).
B. ...I am not doing it. If that's right, then fixing the brakes doesn't make the air conditioner run cooler. :burn:

I guess from here, my questions are.......

1. How can one test to see if the System Area is being hit with excessive writes for the space?
2. If it is, what's causing the excessive activity? Can w/e it is get disabled? Is there a problem with the drive that has nothing to do with writes to those non-wear-leveled modules?
3. If it's not something I can disable, what are mitigating efforts that actually work within that space?


LOL, I've been editing this on and off all day... lol.. sorry. Donezo. :)
 
See, here's the problem. Given that no technical insight is being brought to this thread, the thread itself has already led to dramatically more investment of effort than that expended on making a few small adjustments to hedge against a potential system area wear problem. There's just nothing left to say here that isn't completely subjective/temperament based.
 
I'd apologize for adding subjective posts (I did provide a link from the readings supporting my assertion...), but I feel the skepticism over the issue has brought to light some real questions that should get answered before changing things (regardless how easy that may be) on one's system.

Do you agree that changing the page file and browser cache isn't helping this problem out since all of that sits in C:\ and not in the System Area of the SSD though? What made you make those specific changes (did you read something - link us!)? After thinking critically, I suppose those can potentially be a bunch o' writes... but everyone has those (everyone who doesn't change it, anyway), and you don't see failures like this commonly.

We're all striving to learn here. If this is a big deal, we want to figure out how to actually mitigate the issue. If it isn't a big deal we should know that too. :)
 
I'd apologize for adding subjective posts (I did provide a link from my readings...), but I feel the skepticism over the issue has brought to light some real questions that should get answered before changing things (regardless how easy that may be) on one's system.
The subjectivity I refer to is our differences in tendencies to emphasize how to deal with certain unknowns in hardware reliability stuff. It is truly, in my opinion, purely how we're wired. If you want to apologize for anything, apologize for waging a campaign to impose your sensibilities upon my sensibilities. JK please don't :rofl:

I feel the skepticism over the issue has brought to light some real questions that should get answered before changing things (regardless how easy that may be) on one's system.
Last I checked, moving a browser cache location wasn't exactly a... revolutionary.... alteration ;) Granted, if it is TOTALLY USELESS, it certainly would be dumb to be doing it. I do not believe it useless to do so, in light of my findings. It is a PROBABILISTIC move, just like many things are in this game, whether we want to admit it or not.


Do you agree that changing the page file and browser cache isn't helping this problem out since all of that sits in C:\ and not in the System Area of the SSD though? What made you believe those changes would do anything (did you read something)?
First of all I did not change the location of the pagefile. I doubt it gets much action to begin with, and if for some reason it did, I'd rather it be on the SSD.

Second of all, the fact that all these things are outside of the SSD system area is completely irrelevant. You missed the entire point. What is relevant is that when there are constant write operations, potentially lots of small ones to many different files, potentially creating/deleting/replacing these files all throughout the day, these write operations can end up updating tables/etc in the system area constantly. These things in the system area cannot be wear-levelled, and thus the potential for unpredictable sudden death is increased through this vector. Death through conventional storage media wear is predictable and predictably slow. Death through the system area storage media wear vector is more unpredictable and could happen much faster than anticipated, without any type of forewarning, as there is neither wear levelling nor is there tracking of the wear.

Just to reiterate, if it wasn't already obvious: minimizing unnecessary write operations on the SSD, especially when doing so comes at no performance cost, can reasonably be expected to, if not prevent a system area wear-down death event outright, at least push its timetable back significantly. Due to lack of further insight, the best I can do is leave it at that; maybe my adjustments work toward this, maybe they don't. I lean on the side of reasonable probability, rather than just throwing up my hands and not trying any strategy. The effort necessary to know for sure may end up being great, time-consuming, and laborious, so it is fine if this remains an unknown and just a probabilistic bet. I wish I had the time to pursue my interests more and really learn about this. Sadly, our material conditions are what they are.
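To put numbers on why I treat it as an unknown: depending on guesses nobody outside the vendor can verify (how often the table actually hits NAND, how many pages those updates cycle through, what the endurance of those cells is), a projected lifetime swings from days to centuries. Every value in the sketch below is a made-up assumption, which is exactly the point:

```python
# Sensitivity sketch: projected system-area lifetime under different guesses.
# EVERY number here is an assumption for illustration -- none come from a
# datasheet. The spread between results, not any single result, is the point.
PE_CYCLES = 30_000        # assumed P/E endurance of the system-area NAND
PAGES_PER_FLUSH = 4       # assumed pages programmed each time the table is saved

for region_pages in (1_024, 65_536):      # pages the table updates cycle through
    for flushes_per_sec in (0.1, 10):     # how often the table actually hits NAND
        daily_programs = flushes_per_sec * PAGES_PER_FLUSH * 86_400
        days = PE_CYCLES * region_pages / daily_programs
        print(f"region={region_pages:>6} pages, {flushes_per_sec:>4} flushes/s "
              f"-> ~{days / 365:8.2f} years")
```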

We're all striving to learn here. If this is a big deal, we want to figure out how to actually mitigate the issue. If it isn't a big deal we should know that too. :)
Are we? You seem to be fine just calling things not big deals and disregarding them, without even knowing the truth. And myself? I seem to be fine with just posting low effort questions and, if not finding answers, potentially just leaving it at that. If I had more time/energy, I'd certainly be putting more effort into learning. As it stands, that will be filed under the "would be nice for later" category. If I really wanted to escalate the situation, I could go to a few other places online and hope someone who knows about this answers. If it were a bigger issue for me, I might consider doing that. But, as it stands, it's actually a small issue for me. We have both agreed to remain ignorant on this topic in our own ways, for our own reasons. So in a way, we've both agreed it is no big deal.
 
Second of all, the fact that all these things are outside of the SSD system area is completely irrelevant. You missed the entire point. What is relevant is that when there are constant write operations, potentially lots of small ones to many different files, potentially creating/deleting/replacing these files all throughout the day, these write operations can end up updating tables/etc in the system area constantly. These things in the system area cannot be wear-levelled, and thus the potential for unpredictable sudden death is increased through this vector. Death through conventional storage media wear is predictable and predictably slow. Death through the system area storage media wear vector is more unpredictable and could happen much faster than anticipated, without any type of forewarning, as there is neither wear levelling nor is there tracking of the wear.
This is what I missed more than anything and why I questioned the action of w/e you did with w/e to mitigate this issue. :)

Worth noting, according to the article I posted, SOME modules in the System Area have wear leveling. Which ones, not sure.. but I think that's worth understanding if only to identify scope and severity.

Are we? You seem to be fine just calling things not big deals and disregarding them, without even knowing the truth. And myself? I seem to be fine with just posting low effort questions and, if not finding answers, potentially just leaving it at that.
We are. I didn't deem it actionable from reading the info provided, is all. We took different approaches, and I supported my thoughts on why (you did ask if it was accurate in the first post, initiating a conversation, no?). I thought I posted a few low effort questions that need to be answered too.

I still would like my first question answered... I wonder if there is a tool to see what kind of activity is going on in the System Area. If we can see that, we can figure out, factually, if any problems exist/changes are worth doing. :thup:
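The closest proxy I can think of is comparing host writes against FTL/NAND writes in SMART; some Crucial/Micron drives expose separate host-program and FTL-program page counters, though attribute names vary by vendor and plenty of drives expose nothing useful at all. A rough sketch with smartmontools installed (the device path is an example; adjust for your system):

```python
# Rough proxy: compare host writes to FTL/NAND writes via SMART, using
# smartmontools (smartctl must be installed and on PATH). Attribute names
# vary by vendor -- some Crucial/Micron drives expose separate host-program
# and FTL-program page counters; many drives expose nothing comparable.
import subprocess

DEVICE = "/dev/sda"   # example device; adjust for your system

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout

host = ftl = None
for line in out.splitlines():
    parts = line.split()
    if len(parts) >= 10 and parts[0].isdigit():
        name, raw = parts[1], parts[9]
        print(f"{name:32} raw={raw}")        # eyeball what your drive exposes
        if raw.isdigit():
            if "Host_Program" in name:
                host = int(raw)
            elif "FTL_Program" in name:
                ftl = int(raw)

if host and ftl:
    # Write amplification factor: total NAND programs per host program.
    print(f"approx WAF: {(host + ftl) / host:.2f}")
```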

EDIT: I just noticed that link in the first post was written by a data recovery/forensics service.

EDIT2: FTR, I reached out to some reviewer peers who have engineer contacts in the field who should be able to help with the technical information needed. I'll post back when I hear something, if only for those who run across the thread in the future. :)
 
I still would like my first question answered... I wonder if there is a tool to see what kind of activity is going on in the System Area
That is something I want answered as well, but I'm guessing nobody in the industry has any reason to want that to be revealed or possible. They've achieved the perfect balance between enough reliability to sell units and enough unit-killing to sell a lot more units.
 
Surely the data forensics company from the first post won't provide it... they're selling a service for data recovery. Doom and gloom in their advertising! :p

Still waiting to hear back, fyi. :)
 
Ok, so, high-level response here.....

*it varies by SSD, how they do it, how they handle it, etc.
*not a widespread concern by any means
*it certainly isn't among the most common causes of failure
*it would be an edge case for it to cause failure. Most of the new SSDs use SLC for this area, so it has uber write endurance


I'll say again... your article is from a forensics/data recovery business. I'd think a picture from them is painted a certain way, while perhaps others, who aren't selling recovery services, view the criticality (or lack thereof) quite differently.

Thanks for bringing this up. I have to say I learned quite a bit by looking around and asking. In the end, there weren't enough facts posted here (for me) to act.
 
I just found this topic ... and it's too long to read (I checked it quickly), so this is how I see it:
1. SSDs die without warning even when SMART isn't showing any errors, but it's a very low % of all SSDs. It's only electronics and it can fail like anything else. We can talk about reasons, but there is no one clear reason, and most people don't care and won't fix it or recover the data anyway.
2. If it fails and you have no idea why, then it's good to leave it idle, with no operations, for a while; in the case of a SATA SSD, connect only the power cable. That's because the firmware area and some other things can be recovered/fixed by TRIM and other operations that run in the background. On the older MX500, support recommended leaving it for at least 5-6 hours.
3. If the above doesn't help, then run diskpart and use its clean all command (see the scripted sketch after this list), as it cleans not only the visible area but also hidden parts of the SSD (it was a tip from Crucial support that I got a long time ago, and it actually fixed 2 SSDs). Of course, you can't do that when the SSD is totally dead and not visible anywhere. Btw, even if the BIOS can't see the SSD, there is a chance that Windows/Linux or other tools will see it.
4. Always make a backup of the important data, no matter if you use an SSD, HDD or anything else. Data recovery costs much more than a spare drive for a data copy.
5. Try to buy only SSDs with a 5-year+ warranty. Even if one dies, you get a new one or your money back. It's a kind of investment, as SSDs with a longer warranty usually cost a bit more.
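For point 3, since the syntax trips people up and clean all wipes the entire selected disk, here is a scripted sketch; the disk number is a placeholder you must verify yourself with "list disk" first, and diskpart needs an elevated prompt:

```python
# Scripted version of tip 3. diskpart's "clean all" zeroes EVERY sector on
# the selected disk (all partitions, including hidden ones), so triple-check
# the disk number with "diskpart" -> "list disk" before running.
import subprocess
import tempfile

DISK_NUMBER = 2   # <-- placeholder; verify with "list disk" first!

script = f"select disk {DISK_NUMBER}\nclean all\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(script)
    path = f.name

# diskpart /s runs the commands from a script file; requires admin rights.
subprocess.run(["diskpart", "/s", path], check=True)
```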
 
On one hand I'm disappointed that most of this stuff is little more than rumor/innuendo (what else can you expect from industry insiders talking to outsiders) but on the other hand, what little is trickling out sounds "promising." On the other other hand though, it sounds a lot like "Just be a good little consumer and keep buying and using products indiscriminately. Everything is going to be ooooookaaaaaay."

Yea whatever. The good news is, I don't know what it is, maybe the CPU, maybe the chipset, maybe something else, but holy jesus, initial testing of a piece of one of my backup batch files... the backup of thousands of files to two hard drives happened so fast I couldn't believe it compared to the previous 7700K system. Although, I was using the slower Skyhawk drives for this job on the previous system, and here I'm using the HGST Ultrastars, which might be especially suited for those kinds of operations.

I suspect it's at least partially the CPU, however, because when I ran the batch file, and I was specifically watching for this, all 4 e-cores suddenly went from mostly idle to fully saturated, and even 2 p-cores were like 40-70% saturated out of nowhere. Just ran it again for the lawlz: e-cores fully saturated, 1 p-core fully saturated. Then once the batch file closed, in the aftermath, I see some increased load on the e-cores compared to idle, and 1 p-core hovering around 30-50%. It's slowly dying down annnnnd now it's back to idle. So evidently, after the batch file closed and had already completed its tasks, at least on its end, there was some stuff finishing up on the back end for even longer than the batch file ran. Right now it's running at 3.6 GHz max. I wonder if this would go even faster at default speeds.

Oh well, whatever, interesting stuff but no matter for this thread.
 