Tuesday, March 10, 2020

Enterprise SSD Reliability

I couldn't attend this year's USENIX FAST conference. Because of the COVID-19 outbreak, the normally high level of participation from Asia was greatly reduced, with many registrants and even some presenters unable to make it. But I've been reading the papers, and below the fold I have commentary on an extremely interesting one about the reliability of SSD media in enterprise applications.

A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al. from Bianca Schroeder's group at U. Toronto and NetApp was one of the two papers awarded "Best Paper":
This paper presents the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in distributed data center storage systems). The study is based on a very comprehensive set of field data, covering 1.4 million SSDs of a major storage vendor (NetApp). The drives comprise three different manufacturers, 18 different models, 12 different capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allows us to study a large number of factors that were not studied in previous works, including the effect of firmware versions, the reliability of TLC NAND, and correlations between drives within a RAID system. This paper presents our analysis, along with a number of practical implications derived from it.
The authors set out the lessons they learned from their study in Section 8. My comments on some of them are:
  • We tend to think of HDDs and SSDs as "media", which obscures the fact they are themselves complex computers containing large amounts of software. And, of course, this software has bugs, which need patching to minimize the risk of software-induced failure:
    Our observations emphasize the importance of firmware updates, as earlier firmware versions can be correlated with significantly higher failure rates. Yet, we observe that 70% of drives in our study remain at the same firmware version throughout the length of our study.
    Alas, many enterprise sysadmins favor "the devil they know", being reluctant to take the risk of a software upgrade failing or introducing new bugs to diagnose and work around. This study should help convince them that the risk of upgrading is less than the risk of doing nothing.
  • I've been pointing out the importance of correlations between failures at least since our 2006 EuroSys paper A Fresh Look at the Reliability of Long-term Digital Storage. The authors concur:
    Our results highlight the occurrence of temporally correlated failures within the same RAID group. This observation indicates that ... realistic data loss analysis certainly has to consider correlated failures.
    It is notoriously difficult to model correlated failures in storage systems because the data is sparse and the correlations can be between diverse components, not just the media. Jiang et al.'s 2008 paper Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics showed that only about half of failures in the field could be attributed to the media. The toy simulation after this list illustrates why the correlations matter.
  • I included in 2016's QLC Flash on the horizon the table below, showing the greatly reduced Program/Erase (PE) cycle limits of quad-level cell (QLC) flash, and predicted that they would cripple QLC's use in many non-archival applications. Surprisingly, the authors point out that:
    for the vast majority of enterprise users, a move towards QLC’s PE cycle limits poses no risks, as 99% of systems use at most 15% of the rated life of their drives.
    Cell type    Ratio (QLC PE limit as % of TLC PE limit)
    5Xnm         32%
    3Xnm         28%
    2Xnm         20%
    1Xnm         14%
    Although the low usage of PE cycles is striking, I'm not sure I agree with this conclusion. The table above gives QLC's PE limit as a percentage of TLC's PE limit for each cell technology. So in 1Xnm technology, the 1% of systems that used more than 15% of their drives' TLC PE limit would exceed the corresponding QLC PE limit of only 14% (see the back-of-the-envelope sketch after this list).
  • In 2017's Storage Failures In The Field I commented on Backblaze's observation that the 10TB and 12TB HDD generations showed much reduced infant mortality:
    devoting engineering effort to reducing infant mortality can have a significant return on investment. A drive that fails early will be returned under warranty, costing the company money. A drive that fails after the warranty expires cannot be returned. Warranty costs must be reserved against in the company's accounts. Any reduction in the rate of early failures goes straight to the company's bottom line.
    But it seems this investment hasn't been made for enterprise SSDs. The authors note that:
    In contrast to the “bathtub” shape assumed by classical reliability models, we observe no signs of failure rate increases at end of life and also a very drawn-out period of infant mortality, which can last more than a year and see failure rates 2-3X larger than later in life.
    The low failure rate at end-of-life is encouraging for archival use, but the infant mortality is a problem. The authors write:
    we observe for both 3D-TLC and eMLC drives, a long period (12–15 months) of increasing failure rates, followed by a lengthy period (another 6–12 months) of slowly decreasing failure rates, before rates finally stabilize. That means that, given typical drive lifetimes of 5 years, drives spend 20-40% of their life in infant mortality.
    Eyeballing the eMLC curve in the upper part of the paper's Figure 2, it looks as though, if the drives have a 5-year warranty, about 2.6% of them will be returned, a rather expensive proposition.
  • Because it was assumed that SSDs followed a bathtub failure curve, the authors write:
    There has been a fear that the limited PE cycles of NAND SSDs can create a threat to data reliability in the later part of a RAID system’s life due to correlated wear-out failures, as the drives in a RAID group age at the same rate. Instead we observe that correlated failures due to infant mortality are likely to be a bigger threat.
    Because, absent replacements, the drives in a RAID group are typically all the same age, the infant mortality and the unpatched firmware together may be driving the relatively high correlation shown in the paper's Figure 7.
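On the correlated-failure point above, here is a minimal Monte Carlo sketch of why realistic data loss analysis has to consider correlations. It is a toy model, not the paper's analysis: a RAID-6-like group of eight drives loses data if more than two fail within one repair window, and every probability in it is hypothetical, chosen only to illustrate the effect.

```python
import random

# Toy model, not the paper's analysis: a RAID-6-like group of 8 drives loses
# data if more than two drives fail within a single repair window. All the
# probabilities below are hypothetical.
GROUP_SIZE = 8
TOLERATED = 2           # concurrent failures the group can survive
TRIALS = 200_000
P_MARGINAL = 0.01       # per-drive failure probability per window (hypothetical)

def window_failures(per_drive_p):
    """Count how many drives in one group fail during one repair window."""
    return sum(random.random() < per_drive_p for _ in range(GROUP_SIZE))

def loss_rate(sample_group_p):
    """Fraction of simulated windows in which the group loses data."""
    losses = 0
    for _ in range(TRIALS):
        if window_failures(sample_group_p()) > TOLERATED:
            losses += 1
    return losses / TRIALS

# Independent model: every drive fails with the same marginal probability.
independent = loss_rate(lambda: P_MARGINAL)

# Correlated model: 10% of groups (a bad batch, a shared firmware version,
# drives all at the same point in infant mortality) see a much higher
# per-drive rate; the mixture keeps the marginal per-drive probability at 1%.
W_BAD, P_BAD = 0.10, 0.07
P_GOOD = (P_MARGINAL - W_BAD * P_BAD) / (1 - W_BAD)
correlated = loss_rate(lambda: P_BAD if random.random() < W_BAD else P_GOOD)

print(f"independent drives: group data-loss rate {independent:.2e} per window")
print(f"correlated drives:  group data-loss rate {correlated:.2e} per window")
```

Both populations have identical per-drive statistics, but the correlated one loses data in a group far more often, which is why models that assume independent failures underestimate the risk.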
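On the QLC PE-limit point above, here is a back-of-the-envelope sketch of my reservation, applying the ratios from the table to the paper's observation that 99% of systems use at most 15% of their drives' rated life:

```python
# QLC's PE limit as a fraction of TLC's, per cell technology (the ratios
# from the table above).
QLC_TO_TLC = {"5Xnm": 0.32, "3Xnm": 0.28, "2Xnm": 0.20, "1Xnm": 0.14}

# The paper reports that 99% of systems use at most this fraction of their
# drives' rated (TLC) PE limit; the remaining 1% use more.
TLC_LIFE_USED = 0.15

for cell, ratio in QLC_TO_TLC.items():
    # The same workload on a QLC drive would consume this fraction of its limit.
    qlc_life_used = TLC_LIFE_USED / ratio
    verdict = "exceeds" if qlc_life_used > 1 else "within"
    print(f"{cell}: {qlc_life_used:.0%} of the QLC PE limit ({verdict} the limit)")
```

Only the 1Xnm case comes out above 100%: a system that had consumed 15% of a TLC drive's rated life would already be past the equivalent QLC drive's limit.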
I've only discussed a few of the paper's findings and lessons. All the others are interesting; the paper will repay close reading.
