A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al. from Bianca Schroeder's group at U. Toronto and NetApp was one of two awarded "Best Paper":
This paper presents the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in distributed data center storage systems). The study is based on a very comprehensive set of field data, covering 1.4 million SSDs of a major storage vendor (NetApp). The drives comprise three different manufacturers, 18 different models, 12 different capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allows us to study a large number of factors that were not studied in previous works, including the effect of firmware versions, the reliability of TLC NAND, and correlations between drives within a RAID system. This paper presents our analysis, along with a number of practical implications derived from it.

The authors set out the lessons they learned from their study in Section 8. My comments on some of them are:
- We tend to think of HDDs and SSDs as "media", which obscures the fact they are themselves complex computers containing large amounts of software. And, of course, this software has bugs, which need patching to minimize the risk of software-induced failure:
Our observations emphasize the importance of firmware updates, as earlier firmware versions can be correlated with significantly higher failure rates. Yet, we observe that 70% of drives in our study remain at the same firmware version throughout the length of our study.

Alas, many enterprise sysadmins favor "the devil they know", being reluctant to take the risk of a software upgrade failing or introducing new bugs to diagnose and work around. This study should help convince sysadmins that the risk of upgrading is less than the risk of doing nothing.
I've been pointing out the importance of the correlation between failures at least since our 2006 EuroSys paper A Fresh Look at the Reliability of Long-term Digital Storage. The authors concur:
Our results highlight the occurrence of temporally correlated failures within the same RAID group. This observation indicates that ... realistic data loss analysis certainly has to consider correlated failures.

It is notoriously difficult to model correlated failures in storage systems because the data is sparse and the correlations can be between diverse components, not just media. Jiang et al.'s 2008 paper Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics showed that only about half of failures in the field could be attributed to the media.
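The effect of correlation on data loss can be illustrated with a toy Monte Carlo model. This is purely a sketch with made-up probabilities, not an analysis from the paper: it compares RAID groups whose drives fail independently against groups that occasionally suffer a shared "shock" (a firmware bug, a bad batch) raising every member's failure probability at once.

```python
import random

def sim_data_loss(n_groups, drives=8, p_fail=0.01, p_shock=0.0, seed=0):
    """Fraction of single-parity RAID groups (tolerating one failure)
    that suffer two or more failures in a period. All probabilities
    here are illustrative, not taken from the paper."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(n_groups):
        # A group-wide shock (shared firmware bug, same-batch infant
        # mortality) raises every drive's failure probability together,
        # correlating failures within the group.
        p = p_fail + (0.2 if rng.random() < p_shock else 0.0)
        failures = sum(rng.random() < p for _ in range(drives))
        if failures >= 2:  # more failures than parity can absorb
            losses += 1
    return losses / n_groups

independent = sim_data_loss(100_000)                # no correlation
correlated = sim_data_loss(100_000, p_shock=0.05)   # 5% of groups shocked
print(f"independent: {independent:.4f}  correlated: {correlated:.4f}")
```

Even though the shock touches only a few percent of groups, the loss rate rises by roughly an order of magnitude, which is why an analysis assuming independent failures badly underestimates risk.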
- The authors also observe that wear-out is not the threat it is often assumed to be:

for the vast majority of enterprise users, a move towards QLC’s PE cycle limits poses no risks, as 99% of systems use at most 15% of the rated life of their drives.
| Cell type | Ratio |
|-----------|-------|
| 5Xnm      | 32%   |
| 3Xnm      | 28%   |
| 2Xnm      | 20%   |
| 1Xnm      | 14%   |
- In 2017's Storage Failures In The Field I commented on Backblaze's observation that the 10TB and 12TB HDD generations showed much reduced infant mortality:
devoting engineering effort to reducing infant mortality can have a significant return on investment. A drive that fails early will be returned under warranty, costing the company money. A drive that fails after the warranty expires cannot be returned. Warranty costs must be reserved against in the company's accounts. Any reduction in the rate of early failures goes straight to the company's bottom line.
In contrast to the “bathtub” shape assumed by classical reliability models, we observe no signs of failure rate increases at end of life and also a very drawn-out period of infant mortality, which can last more than a year and see failure rates 2-3X larger than later in life.

The low failure rate at end-of-life is encouraging for archival use, but the infant mortality is a problem. The authors write:
we observe for both 3D-TLC and eMLC drives, a long period (12–15 months) of increasing failure rates, followed by a lengthy period (another 6–12 months) of slowly decreasing failure rates, before rates finally stabilize.

That means that, given typical drive lifetimes of 5 years, drives spend 30-45% of their life (18-27 of 60 months) in infant mortality. Eyeballing the eMLC curve in the upper part of Figure 2, it looks like, if the drives have a 5-year warranty, 2.6% of the drives will be returned, a rather expensive proposition.
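To put that return rate in context, here is a back-of-the-envelope sketch. The 2.6% is my eyeball estimate from Figure 2, the fleet size is the study's 1.4 million drives, and the per-drive replacement cost is a purely hypothetical number:

```python
# Back-of-the-envelope warranty exposure. fleet_size is the study's
# drive count; return_rate is the eyeballed estimate from Figure 2;
# cost_per_return is a hypothetical illustrative figure.
fleet_size = 1_400_000
return_rate = 0.026
cost_per_return = 200  # USD, hypothetical

returns = int(fleet_size * return_rate)
exposure = returns * cost_per_return
print(f"{returns:,} returned drives -> ${exposure:,} warranty exposure")
# -> 36,400 returned drives
```

Tens of thousands of warranty returns across a fleet this size is real money, which is why reducing infant mortality goes straight to the vendor's bottom line.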
- Because it was assumed that SSDs followed a bathtub failure curve, the authors write:
There has been a fear that the limited PE cycles of NAND SSDs can create a threat to data reliability in the later part of a RAID system’s life due to correlated wear-out failures, as the drives in a RAID group age at the same rate. Instead we observe that correlated failures due to infant mortality are likely to be a bigger threat.

Because, absent replacements, the drives in a RAID are typically all the same age, the infant mortality and the unpatched firmware together may be driving the relatively high correlation shown in Figure 7 above.