DSHR's Blog: Storage Update: Part 2

This is part 2 of my latest update on storage technology. Part 1, covering developments in DNA as a storage medium, is here. This part was sparked by a paper at Usenix's File And Storage Technologies conference from Bianca Schroeder's group at U. Toronto and NetApp on the performance of SSDs at scale. It followed on from their 2020 FAST "Best Paper" that I discussed in Enterprise SSD Reliability, and it prompted me to review the literature in this area. The result is below the fold.
There have been a number of studies of the reliability of SSDs at scale in the field, including:

2015's A Large-Scale Study of Flash Memory Failures in the Field by Justin Meza et al from C-MU and Facebook.
2016's Flash reliability in production: The expected and the unexpected by Bianca Schroeder et al from U. Toronto and Google.
2016's SSD Failures in Datacenters: What? When? And Why? by Iyswarya Narayanan et al from Penn State and Microsoft.
2019's Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures by Erci Xu et al from Ohio State, Iowa State and Alibaba.
2020's A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al from Bianca Schroeder's group at U. Toronto and NetApp.
2022's Operational Characteristics of SSDs in Enterprise Storage Systems: A Large-Scale Field Study by Stathis Maneas et al from Bianca Schroeder's group at U. Toronto and NetApp.
2022's The SSD Edition: 2021 Drive Stats Review by Andy Klein of Backblaze.

Meza et al (2015)

They:

examined the majority of flash-based SSDs in Facebook’s server fleet, which have operational lifetimes extending over nearly four years and comprising many millions of SSD-days of usage.

And concluded:

SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected.

the effects of read disturbance errors are not prevalent in the field.
sparse logical data layout across an SSD’s physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate

higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures.

data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller and buffering employed in the system software.

Schroeder et al (2016)

This was "based on data collected over 6 years of production use in Google’s data centers". Their study's results are organized into Sections 3-8.

Section 3 discusses final read and write errors:

Depending on the model, between 20-63% of drives experience at least one [final read] error and between 2-6 out of 1,000 drive days are affected. ... Depending on the model, 1.5-2.5% of drives and 1-4 out of 10,000 drive days experience a final write error, i.e. a failed write operation that did not succeed even after retries.
Final read errors are attributed to bit corruptions beyond the limit of ECC. Final write errors are rarer because a failure to write a block will be re-tried at a different block.
Section 4 discusses the drives' raw bit error rates:

The standard metric to evaluate flash reliability is the raw bit error rate (RBER) of a drive, defined as the number of corrupted bits per number of total bits read (including correctable as well as uncorrectable corruption events).
They observe:

large differences in the RBER across different drive models, ranging from as little as 5.8e-10 to more than 3e-08 for drives of the first generation. The differences are even larger when considering the 95th or 99th percentile RBER, rather than the median. For example, the 99th percentiles of RBER ranges from 2.2e-08 for model SLC-B to 2.7e-05 for MLC-D. Even within drives of the same model, there are large differences: the RBER of a drive in the 99th percentile tends to be at least an order of magnitude higher than the RBER of the median drive of the same model.

The difference in RBER between models can be partially explained by differences in the underlying flash technology. RBER rates for the MLC models are orders of magnitudes higher than for the SLC models, so the higher price point for the SLC models pays off with respect to RBER.
And:

age, as measured by days in the field, has a significant effect on RBER, independently of cell wear-out due to PE cycles. That means there must be other aging mechanisms at play, such as silicon aging.
And:

the two SLC models with a 34nm lithography (models SLC-A and SLC-D) have RBER that are an order of magnitude higher than the two 50nm models (models SLC-B and SLC-C). For the MLC models, the only 43nm model (MLC-B) has a median RBER that is 50% higher than that of the other three models, which are all 50nm. Moreover, this difference in RBER increases to 4X with wear-out, as shown in Figure 2. Finally, their smaller lithography might explain the higher RBER for the eMLC drives compared to the MLC drives.
Of course, devices with smaller lithography can afford a greater proportion of error-correcting bits, so the increase in RBER can be masked.
Section 5 discusses the uncorrectable bit error rate (UBER) and its relationship to RBER. They point out that the metric:

makes the implicit assumption that the number of uncorrectable errors is in some way tied to the number of bits read. ... We find no evidence for a correlation between the number of reads and the number of uncorrectable errors. ... also write and erase operations are uncorrelated with uncorrectable errors, so an alternative definition of UBER, which would normalize by write or erase operations instead of read operations, would not be any more meaningful either.

We therefore conclude that UBER is not a meaningful metric, ... UBER will artificially decrease the error rates for drives with high read count and artificially inflate the rates for drives with low read counts, as UEs occur independently of the number of reads.
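
The normalization problem is easy to see with a little arithmetic. The following is a minimal sketch, using made-up drive numbers rather than anything from the paper, of how dividing by bits read makes two drives with identical uncorrectable error counts look very different:

```python
def uber(uncorrectable_errors: int, bits_read: int) -> float:
    """UBER as conventionally defined: uncorrectable errors per bit read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return uncorrectable_errors / bits_read


# Hypothetical drives: the same number of uncorrectable errors,
# but read volumes that differ by a factor of 100.
light_reader = uber(uncorrectable_errors=5, bits_read=10**14)
heavy_reader = uber(uncorrectable_errors=5, bits_read=10**16)

# If errors really are independent of reads, the lightly read drive
# looks 100x worse even though both drives failed equally often.
ratio = light_reader / heavy_reader
print(f"light: {light_reader:.1e}  heavy: {heavy_reader:.1e}  ratio: {ratio:.0f}x")
```

If uncorrectable errors are uncorrelated with reads, as the authors found, the ratio reflects only the workload, not the drive.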
Schroeder et al Fig 7
Section 6 discusses hardware failures and observes significant temporal correlations:

most drives with bad blocks experience only a small number of them: the median number of bad blocks for drives with bad blocks is 2-4, depending on the model. ... We observe, in particular for MLC drives, a sharp increase after the second bad block is detected, when the median number of total bad blocks jumps to close to 200, i.e. 50% of those drives that develop two bad blocks will develop close to 200 or more bad blocks in total.
...
the chance of experiencing an uncorrectable error in a month following another uncorrectable error is nearly 30%, compared to only a 2% chance of seeing an uncorrectable error in a random month. But also final write errors, meta errors and erase errors increase the UE probability by more than 5X. In summary, prior errors, in particular prior uncorrectable errors, increase the chances of later uncorrectable errors by more than an order of magnitude.
Section 7 compares SLC and MLC drives and concludes:

SLC drives do not perform better for those measures of reliability that matter most in practice: SLC drives don’t have lower repair or replacement rates, and don’t typically have lower rates of non-transparent errors.
...
we conclude that SLC drives are not generally more reliable than MLC drives.
Section 8 compares SSDs and HDDs and concludes:

that the flash drives in our study experience significantly lower replacement rates (within their rated lifetime) than hard disk drives. On the downside, they experience significantly higher rates of uncorrectable errors than hard disk drives.

Narayanan et al (2016)

They collected data on "over half a million SSDs" in "five large and several small" datacenters over nearly three years and concluded:

The observed Annualized Failure Rate (AFR) in these production datacenters for some models is significantly higher (as much as 70%) than that quoted in SSD specifications, reiterating the need for this kind of field study.

Four symptoms Data Errors (Uncorrectable and CRC), Sector Reallocations, Program/Erase Failures and SATA Downshift experienced by SSDs at the lower levels are the most important (in that order) of those captured by the SMART attributes.

Even though Uncorrectable Bit Errors in our environment are not as high as in a prior study [24], it is still at least an order of magnitude higher than the target rates [26].

There is a higher likelihood of the symptoms (captured by SMART) preceding SSD failures, with an intense manifestation preventing their survivability beyond a few months. However, our analysis shows that these symptoms are not a sufficient indicator for diagnosing failures.

Other provisioning (what model? where deployed? etc.) and operational parameters (write rates, write amplification, etc.) all show some correlation with SSD failures. This motivates the need for not just a relative ordering of their influence (to be useful to a datacenter operator), but also a systematic multi-factor analysis of them all to better answer the what, when and why of SSD failures.

They used machine learning models to show that (my reformatting):

Failed devices can be differentiated from healthy ones with high precision (87%) and recall (71%) using failure signatures from tens of important factors and their threshold values;

Top factors used in accurate identification of failed devices include: Failure symptoms of data errors and reallocated sectors, device and server level workload factors such as total NAND writes, total reads and writes, memory utilization, etc.;

Devices are more likely to fail in less than a month after their symptoms match failure signatures, but, they tend to survive longer if the failure signature is entirely based on workload factors;

Causal analysis suggests that symptoms and the device model have direct impact on failures, while workload factors tend to impact failures via media wear-out.
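
For readers unfamiliar with the two metrics, here is a minimal sketch of how precision and recall are computed for a failure classifier. The counts are hypothetical, chosen only so that the results round to the 87% and 71% quoted above; they are not from the paper.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Precision: share of drives flagged as failing that really failed.
    Recall: share of drives that really failed that were flagged."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical population: 1,000 drives actually failed. The model
# flagged 816 drives, of which 710 were among the real failures.
precision, recall = precision_recall(true_pos=710, false_pos=106, false_neg=290)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In other words, at these rates roughly one flagged drive in eight would be healthy, and nearly three in ten failures would arrive unflagged.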

Xu et al (2019)

They studied more than three years of usage of nearly half a million SSDs in Alibaba's cloud data centers, taking:

a holistic view to analyze both device errors and system failures to better understand the potential casual relations. Particularly, we focus on failures that are Reported As "SSD-Related" (RASR) by system status monitoring daemons. Through log analysis, field studies, and validation experiments, we identify the characteristics of RASR failures in terms of their distribution, symptoms, and correlations. Moreover, we derive a number of major lessons and a set of effective methods to address the issues observed.

As with Jiang et al's 2008 HDD study, they find many errors not caused by the media as such:

5.6% are RASR failures (i.e., about 10K instances), which manifested in five symptoms: Node Unbootable, File System Unmountable, Drive Unfound, Buffer IO Error, and Media Error. By correlating the RASR failures with the repair logs, we find that a significant number (34.4%) of RASR failures are not caused by the SSD device. For example, plugging SSDs into wrong drive slots, a typical human mistake, accounts for 20.1% of RASR failures. Moreover, for RASR failures caused by SSDs, we find that both the location of devices (i.e., in different datacenters) and the type of cloud services may affect SSD failure rates.

Xu et al Figure 5

As with previous studies, Xu et al's Figure 5 showed that errors increased with usage, as activity heated the SSDs, but this was not the only thermal effect they found.

Xu et al also studied the effect of environmental passive heating on idle SSDs, for example because idle SSDs were positioned alongside highly active SSDs, or because of inadequate cooling of the whole rack. Their Figure 3 shows this effect:

Xu et al Figure 3

poor rack architecture can increase the temperature of idle SSDs by up to 28°C, resulting in 57% more device errors after 128 hours of passive heating. ... scanning the entire device to trigger the FTL internal read refresh ... every 4 hours can offset most negative impact of passive heating.

This could well be a significant cause of correlated errors.

Maneas et al (2020)

Bianca Schroeder's group at U. Toronto continued their work on SSDs by shifting from studying their use in cloud data centers to enterprise storage, with:

the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in distributed data center storage systems). The study is based on a very comprehensive set of field data, covering 1.4 million SSDs of a major storage vendor (NetApp). The drives comprise three different manufacturers, 18 different models, 12 different capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allows us to study a large number of factors that were not studied in previous works, including the effect of firmware versions, the reliability of TLC NAND, and correlations between drives within a RAID system. This paper presents our analysis, along with a number of practical implications derived from it.

Their main findings were:

"One third of replacements are associated with one of the most severe reason types (i.e., SCSI errors), but on the other hand, one third of drive replacements are merely preventative based on predictions."
"We observe a very drawn-out period of infant mortality, which can last more than a year and see failure rates 2-3X larger than later in life."
"Overall, the highest replacement rates in our study are associated with 3D-TLC SSDs. However, no single flash type has noticeably higher replacement rates than the other flash types studied in this work, indicating that other factors, such as capacity or lithography, can have a bigger impact on reliability."
"Drives with very large capacities not only see a higher replacement rate overall, but also see more severe failures and fewer of the (more benign) predictive failures."
"In contrast to previous work, higher density drives do not always see higher replacement rates. In fact, we observe that, although higher density eMLC drives have higher replacement rates, this trend is reversed for 3D-TLC."
"Earlier firmware versions can be correlated with significantly higher replacement rates, emphasizing the importance of firmware updates."
"SSDs with a non-empty defect list have a higher chance of getting replaced, not only due to predictive failures, but also due to other replacement reasons as well."
"SSDs that make greater use of their over-provisioned space are quite likely to be replaced in the future."
"While large RAID groups have a larger number of drive replacements, we find no evidence that the rate of multiple failures per group (which is what can create potential for data loss) is correlated with RAID group size. The reason seems to be that the likelihood of a follow-up failure after a first failure is not correlated with RAID group size."

Maneas et al (2022)

The latest from Bianca Schroeder's group at U. Toronto and NetApp is Operational Characteristics of SSDs in Enterprise Storage Systems: A Large-Scale Field Study by Stathis Maneas et al. Their abstract reads:

As we increasingly rely on SSDs for our storage needs, it is important to understand their operational characteristics in the field, in particular since they vary from HDDs. This includes operational aspects, such as the level of write amplification experienced by SSDs in production systems and how it is affected by various factors; the effectiveness of wear leveling; or the rate at which drives in the field use up their program-erase (PE) cycle limit and what that means for the transition to future generations of flash with lower endurance. This paper presents the first large-scale field study of key operational characteristics of SSDs in production use based on a large population of enterprise storage systems covering almost 2 million SSDs of a major storage vendor (NetApp).

They divide their SSDs into two classes:

one that uses SSDs as a write-back cache layer on top of HDDs (referred to as WBC), and another consisting of flash-only systems, called AFF (All Flash Fabric-Attached-Storage (FAS)). An AFF system uses either SAS or NVMe SSDs, and is an enterprise end-to-end all-flash storage array. In WBC systems, SSDs are used as an additional caching layer that aims to provide low read latency and increased system throughput.

In this study they do not report failures, but rather parameters of production operation including read vs. write rates, the Write Amplification Factor (WAF), Drive Writes Per Day (DWPD), NAND Usage Rates (the proportion of the drive's program-erase (PE) cycle limit used in a year) and the effectiveness of wear-leveling. They summarize their key findings in Table 4 as follows (with my reformatting):

§3.1.2: The majority of SSDs in our data set consume PE cycles at a very slow rate. Our projections indicate that the vast majority of the population (~95%) could move toward QLC without wearing out prematurely.
§3.1, 3.2: The host write rates for SSDs used as caches are significantly higher than for SSDs used as persistent storage. Yet, they do not see higher NAND write rates as they also experience lower WAF. It is thus not necessarily required to use higher endurance drives for cache workloads (which is a common practice).
§3.2: WAF varies significantly (orders of magnitude) across drive families and manufacturers. We conclude that the degree to which a drive’s firmware affects its WAF can be surprisingly high, compared to other factors also known to affect WAF.
§3.2: We identify as the main contributor to WAF, for those drive families with the highest WAF, the aggressive rewriting of blocks to avoid retention issues. This is surprising, as other maintenance tasks (e.g., garbage collection, wear-leveling) generally receive more attention; common flash simulators and emulators (e.g., FEMU) do not even model rewriting to avoid retention issues.
§3.2: The WAF of our drives is higher than values reported in various academic studies based on trace-driven simulation. This demonstrates that it is challenging to recreate the real-world complexities of SSD internals and workloads in simulation.
§3.3: Wear leveling is not perfect. For instance, 5% of all SSDs report an erase ratio above 6, i.e., there are blocks in the drive which will wear out six times as fast as the average block. This is a concern not only because of early wear-out, but also because those blocks are more likely to experience errors and error correction contributes to tail latencies.
§3.4: AFF systems are on average 43% full. System fullness increases faster during the first couple of years in production, and after that increases only slowly. Systems with the largest capacity are fuller than smaller systems.
§4.3, §4.4: We find that over-provisioning and fullness have little impact on WAF in practice, unlike commonly assumed.
§5: The vast majority of workloads (94%) associated with SSDs in our systems are read-dominant, with a median R/W ratio of 3.62:1, highlighting the differences in usage between SSD-based and HDD-based systems. Many widely-used traces from HDD-based systems see more writes than reads, raising concerns when using these traces for SSD research, as is common in practice.
§5: The read and write rates for the drives in our enterprise storage systems are an order of magnitude higher than those reported for data center drives (comparing same-capacity drives).
§5: The read/write ratio reported by SSDs that act as caches decreases significantly over their lifetime. This might indicate a decreasing effectiveness of the SSD cache over time.
§3.2, §5: The differences between some of our results and those reported based on the analysis of widely used HDD-based storage traces emphasize the importance for us as a community to bring some representative SSD-based traces into the public domain.
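
The erase ratio in the §3.3 finding can be illustrated with a short sketch. It assumes, based on how the quoted sentence reads, that the ratio is the highest per-block erase count divided by the mean erase count; the block counts below are invented for illustration.

```python
def erase_ratio(erase_counts: list[int]) -> float:
    """Maximum per-block erase count divided by the mean erase count.
    A perfectly wear-leveled drive would score 1.0."""
    if not erase_counts:
        raise ValueError("need at least one block")
    mean_count = sum(erase_counts) / len(erase_counts)
    return max(erase_counts) / mean_count


# Hypothetical per-block erase counts.
even_wear = erase_ratio([50, 50, 50, 50])    # every block worn equally
uneven_wear = erase_ratio([10, 10, 10, 90])  # one hot block
print(even_wear, uneven_wear)
```

A drive whose ratio is 6 has at least one block consuming its PE cycles six times faster than the drive-wide average would suggest.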

Let's look in detail at a few of their many interesting observations. First, as regards the rate at which drives use up their PE limit, the first factor is DWPD, controlled by the application. They write:

The DWPD varies widely across drives: the median DWPD of the population is only 0.36, well below the limit that today’s drives can sustain. However, there is a significant fraction of drives that experiences much higher DWPD. More than 7% of drives see DWPD above 3, higher than what many of today’s drive models guarantee to support. Finally, 2% of drives see DWPD above 10, pushing the limits even of today’s drive models with the highest endurance.

When separating the data into AFF and WBC systems, we observe (probably not surprisingly) that WBC systems experience significantly higher DWPD. Only 1.8% of AFF drives see DWPD above 3 compared to a quarter of all WBC drives. The median DWPD is 3.4× higher for WBC than AFF, while the 99th percentile is 10.6× higher.

We note vast differences in DWPD across the WBC systems, including a long tail in the distribution. While the median is equal to 1, the drives in the 99th and the 99.9th %-ile experience DWPD of 40 and 79, respectively.
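
To make the DWPD figures concrete, here is a minimal sketch of the usual calculation: host bytes written per day divided by drive capacity. The 960 GB capacity and the write volume are hypothetical, picked so the result matches the population median of 0.36 quoted above.

```python
def dwpd(host_bytes_written: float, capacity_bytes: float, days_in_service: float) -> float:
    """Drive Writes Per Day: full-capacity overwrites per day of service."""
    if capacity_bytes <= 0 or days_in_service <= 0:
        raise ValueError("capacity and days must be positive")
    return host_bytes_written / capacity_bytes / days_in_service


# Hypothetical 960 GB drive that absorbed about 126 TB of host writes in a year.
median_like = dwpd(host_bytes_written=126.144e12, capacity_bytes=960e9, days_in_service=365)
print(f"DWPD = {median_like:.2f}")
```

By the same arithmetic, a drive at a DWPD of 40 would be overwritten in full roughly every 36 minutes.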

The second factor is the NAND Usage Rate, controlled by the drive's firmware. They write:

Maneas et al (2022) Fig2

Annualized NAND Usage Rates are generally low. The majority of drives (60% across the entire population) report a NAND usage rate of zero, indicating that they use less than 1% of their PE cycle limit per year. At this rate, these SSDs will last for more than 100 years in production without wearing out.

There is a huge difference in NAND Usage Rates across drive families. In particular, drive families I-C, I-D, and I-E experience much higher NAND usage rates compared to the remaining population. These drive families do not report higher numbers of host writes ..., so the difference in NAND usage rates cannot be explained by higher application write rates for those models. We therefore attribute the extremely high NAND usage rates reported by I-C/I-D drives to other housekeeping operations which take place within the device (e.g., garbage collection, wear leveling, and data rewrite mechanisms to prevent retention errors ...

There is little difference in NAND usage rates of AFF systems and WBC systems. This is surprising given that we have seen significantly higher host write rates for WBC systems than for AFF systems. At first, we hypothesized that WBC systems more commonly use drives with higher PE cycle limits, so higher DWPD could still correspond to a smaller fraction of the PE cycle limit. However, we observe similar NAND usage rates for WBC systems and AFF systems, even when comparing specific drive families and models with the same PE cycle limit. Interestingly, ... the reason is that WBC systems experience lower WAF, which compensates for the higher host write rates.
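
The relationship between host writes, WAF and NAND usage rate can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the PE cycle limit, cycle counts and write volumes are all hypothetical, and the WBC/AFF figures are chosen only to show how a lower WAF can cancel out higher host writes.

```python
import math


def waf(nand_bytes_written: float, host_bytes_written: float) -> float:
    """Write Amplification Factor: physical NAND writes per host write."""
    return nand_bytes_written / host_bytes_written


def nand_usage_rate(pe_cycles_used: float, pe_cycle_limit: float, years: float) -> float:
    """Fraction of the drive's PE cycle limit consumed per year."""
    return pe_cycles_used / pe_cycle_limit / years


def years_to_wear_out(usage_rate: float) -> float:
    """Projected lifetime at a constant usage rate."""
    return math.inf if usage_rate == 0 else 1.0 / usage_rate


# Hypothetical drive: 60 of 3,000 rated PE cycles used in 2 years,
# i.e. 1% of the limit per year.
rate = nand_usage_rate(pe_cycles_used=60, pe_cycle_limit=3000, years=2)
lifetime = years_to_wear_out(rate)

# Hypothetical cache (WBC) vs. all-flash (AFF) drives: the cache drive
# takes 3.4x the host writes, but its lower WAF yields equal NAND writes.
aff_nand = 100e12 * 3.4  # 100 TB of host writes at WAF 3.4
wbc_nand = 340e12 * 1.0  # 340 TB of host writes at WAF 1.0
print(rate, lifetime, waf(aff_nand, 100e12), waf(wbc_nand, 340e12))
```

A drive that reports a usage rate below 1% per year therefore projects to more than 100 years, which is the basis of the quoted lifetime claim.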

Maneas et al (2022) Fig 6

Their Figure 2 shows the "huge difference" clearly. Their 2020 paper showed that "Earlier firmware versions can be correlated with significantly higher replacement rates", and in this paper their Figure 6 shows not merely that different manufacturers' firmware takes very different approaches to controlling WAF, but even that earlier firmware versions can have significantly worse WAF than later versions. This could be a major factor explaining the higher replacement rates, in addition to the later versions fixing bugs that caused replacements.

Maneas et al (2022) Fig 11

A second important observation is the striking difference between the data center workloads that previous work has studied, and the enterprise workloads in their 2020 and 2022 papers. They write:

First, the workloads associated with the SSDs in our data set are significantly more intensive: the corresponding read and write rates are at least one order of magnitude higher than the ones in the other two studies (note the log scale on both axes). Keeping in mind that our rates involve host reads and writes, while those of the two data center studies report physical reads and writes, the actual differences are even larger.

Second, in contrast to the drives at Facebook and Alibaba, which report a comparable number of reads and writes, our systems see a larger number of reads than writes.

Klein (2022)

Backblaze has extended their invaluable transparency about hard disk reliability to SSDs with Andy Klein's The SSD Edition: 2021 Drive Stats Review. Their experience shows that:

For all of 2021, all three drives have had cumulative AFR rates below 1%.

This compares to the cumulative AFR for all SSD drives as of Q4 2021 which was 1.07% (from the previous chart).

Extending the comparison, the cumulative (lifetime) AFR for our hard drives was 1.40% as noted in our 2021 Drive Stats report. But, as we have noted in our comparison of HDDs and SSDs, the two groups (SSDs and HDDs) are not at the same point in their life cycles. As promised, we’ll continue to examine that dichotomy over the coming months.

The model (ZA250CM10002) represented by the red line seems to be following the classic bathtub failure curve, experiencing early failures before settling down to an AFR below 1%. On the other hand, the other two drives showed no signs of early drive failure and have only recently started failing. This type of failure pattern is similar to that demonstrated by our HDDs which no longer fit the bathtub curve model.

It is important to note that Backblaze's data only records drive failures, not the much more detailed data recording pre-failure diagnostics in the studies above.
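
For reference, the AFR figures above follow the annualized failure rate calculation Backblaze describes in its Drive Stats reports: failures divided by drive-years of service, expressed as a percentage. A minimal sketch, with invented fleet numbers:

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR in percent: failures per drive-year of service, times 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Hypothetical fleet: 10 failures over 365,000 drive-days (1,000 drive-years).
afr = annualized_failure_rate(failures=10, drive_days=365_000)
print(f"AFR = {afr:.2f}%")
```

Because the denominator is drive-days rather than drive count, a model deployed only recently contributes little exposure, which is one reason young and old populations are hard to compare.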

Discussion

Each of these studies measures somewhat different parameters, so their data are not directly comparable. But the important takeaways are clear:

SSDs suffer fewer field replacements than HDDs, but their error rate is higher than that of HDDs and higher than their specifications suggest. For long-term data storage, SSDs' lower replacement rate is nowhere close to compensating for their higher cost.
SSD firmware causes a significant proportion of the replacements and errors.
Different manufacturers' firmware contributes to significant differences in the NAND usage rates, the write amplification factor, and the effectiveness of wear leveling.
Patching SSD firmware is important both to reduce errors and to reduce the NAND usage rate.

There are significant correlations among SSD errors. Firmware is likely one cause.
Thermal effects are a significant cause of errors, both active thermal effects as activity heats the device, and passive thermal effects from adjacent active devices and poor rack configuration. Thermal effects probably contribute significantly to correlated errors.
Simulations based on outdated HDD-based traces, and those that lack a model of the SSD's firmware, are likely to be misleading.
The workloads of cloud data centers and enterprise storage systems are distinct. Separate traces are needed to simulate each of them.

Tuesday, March 22, 2022