Tuesday, March 16, 2021

Correlated Failures

The invaluable statistics published by Backblaze show that, despite being built from technologies close to the physical limits (Heat-Assisted Magnetic Recording, 3D NAND Flash), modern digital storage media are extraordinarily reliable. However, I have long believed that the models that attempt to project the reliability of digital storage systems from the statistics of media reliability are wildly optimistic. They ignore foreseeable causes of data loss such as Coronal Mass Ejections and ransomware attacks, which cause correlated failures among the media in the system. No matter how many replicas there are, if all of them are destroyed or corrupted the data is irrecoverable.

Modelling these "black swan" events is clearly extremely difficult, but much less dramatic causes are important in practice too. It has been known at least since Talagala's 1999 Ph.D. thesis that media failures in storage systems are significantly correlated, and at least since Jiang et al's 2008 Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics that only about half the failures in storage systems are traceable to media failures. The rest happen in the pipeline from the media to the CPU. Because this pipeline typically aggregates data from many media components, its failures naturally cause correlations.

As I wrote in 2015's Disk reliability, discussing Backblaze's experience of a 40% Annual Failure Rate (AFR) in over 1,100 Seagate 3TB drives:
Alas, there is a long history of high failure rates among particular batches of drives. An experience similar to Backblaze's at Facebook is related here, with an AFR over 60%. My first experience of this was nearly 30 years ago in the early days of Sun Microsystems. Manufacturing defects, software bugs, mishandling by distributors, vibration resonance: there are many causes for these correlated failures.
Despite plenty of anecdotes, there is little useful data on which to base models of correlated failures in storage systems. Below the fold I summarize and comment on an important paper by a team from the Chinese University of Hong Kong and Alibaba that helps remedy this.

An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers by Shujie Han et al is an important contribution to the study of storage systems. Their abstract reads:
Flash-based solid-state drives (SSDs) are increasingly adopted as the mainstream storage media in modern data centers. However, little is known about how SSD failures in the field are correlated, both spatially and temporally. We argue that characterizing correlated failures of SSDs is critical, especially for guiding the design of redundancy protection for high storage reliability. We present an in-depth data-driven analysis on the correlated failures in the SSD-based data centers at Alibaba. We study nearly one million SSDs of 11 drive models based on a dataset of SMART logs, trouble tickets, physical locations, and applications. We show that correlated failures in the same node or rack are common, and study the possible impacting factors on those correlated failures. We also evaluate via trace-driven simulation how various redundancy schemes affect the storage reliability under correlated failures. To this end, we report 15 findings. Our dataset and source code are now released for public use.

Summary

The paper is in two main parts. In the first, they analyze the data collected from the field, looking for correlations among the failures. They report twelve findings, listed below. To summarize, they found very significant correlations among the failures of drives in the same node or in the same rack, as shown in their Figures 1 and 2. This validates the arguments I've been making about the optimism of the models.

They also found significant failure correlations among drives of the same model and age. The obvious conclusion is that configuring nodes and racks with homogeneous drives is asking for trouble; diversity of drives enhances reliability.

Two additional observations are consistent with earlier research: that more writes mean more errors, and that "SMART attributes have limited correlations with intra-node or intra-rack failures".

In the second part of the paper, the authors describe how they took the data from the first part and built models incorporating the correlations they observed. They simulated three types of fault tolerance:
  • Replication as Rep(2) for two copies and Rep(3) for three copies.
  • Reed-Solomon Coding in three variants used by major cloud services, i.e. RS(6,3), RS(10,4) and RS(12,4).
  • Local Reconstruction Coding in the variant used by Azure, LRC(12,2,2).
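For orientation, here is a rough comparison of the raw storage overhead and worst-case fault tolerance of these schemes, using the standard definitions: an RS(k,m) group stores k data chunks plus m parity chunks and survives any m chunk losses, while LRC(12,2,2) adds two local and two global parities to twelve data chunks and survives any three losses but not every combination of four. The figures below are my back-of-the-envelope sketch, not taken from the paper:

    # Rough storage overhead and worst-case tolerance of the simulated schemes.
    # Rep(n): n full copies. RS(k,m): k data + m parity chunks, survives any m losses.
    # LRC(12,2,2): 12 data + 2 local + 2 global parities, survives any 3 losses
    # (and some, but not all, combinations of 4).
    schemes = {
        "Rep(2)":      (2.0,     1),
        "Rep(3)":      (3.0,     2),
        "RS(6,3)":     (9 / 6,   3),
        "RS(10,4)":    (14 / 10, 4),
        "RS(12,4)":    (16 / 12, 4),
        "LRC(12,2,2)": (16 / 12, 3),
    }
    for name, (overhead, tolerated) in schemes.items():
        print(f"{name:11s} {overhead:.2f}x raw storage, survives any {tolerated} chunk losses")

Note that Rep(3) consumes twice the raw storage of RS(6,3) while tolerating one fewer chunk loss, which is relevant to the findings below.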
They compare the reliability of the various techniques using two metrics:
  • Probability of data loss (PDL). It measures the likelihood that (unrecoverable) data loss occurs in a data center (i.e., the number of chunk failures in a coding group exceeds the tolerable limit).
  • Normalized magnitude of data loss (NOMDL). It measures the amount of (unrecoverable) data loss (in bytes) normalized to the storage capacity. Back in 2010 I reviewed Greenan et al's excellent Mean time to meaningless: MTTDL, Markov models, and storage system reliability, which introduced the NOMDL concept.
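To make the two metrics concrete, here is a minimal sketch of how one might estimate them from a set of Monte Carlo trials. It is my own illustration rather than the authors' simulator, and run_trial and total_capacity_bytes are hypothetical names:

    # Estimate PDL and NOMDL over a number of Monte Carlo trials.
    # run_trial() is a hypothetical function that simulates one mission of the
    # data center and returns the number of bytes irrecoverably lost (0 if no
    # coding group ever exceeded its tolerable number of chunk failures).
    def estimate_pdl_nomdl(run_trial, trials, total_capacity_bytes):
        loss_events = 0
        lost_bytes = 0
        for _ in range(trials):
            lost = run_trial()
            if lost > 0:
                loss_events += 1
                lost_bytes += lost
        pdl = loss_events / trials                             # probability of any data loss
        nomdl = (lost_bytes / trials) / total_capacity_bytes   # expected loss per byte stored
        return pdl, nomdl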
They report three findings:
  1. Erasure coding shows higher reliability than replication based on the failure patterns in our dataset:
    Figure 18 shows that erasure coding achieves lower PDL and NOMDL (i.e., higher reliability) than replication. In particular, Rep(2) has the highest PDL (59.9%), indicating that two chunk copies are insufficient to tolerate failures. Also, Rep(3) is not good enough with a PDL of 10.1%. In contrast, RS(10,4) has the lowest PDL and NOMDL among all RS codes, since it tolerates more failures than RS(6,3) and has less repair bandwidth than RS(12,4). LRC(12,2,2) has slightly higher PDL and NOMDL than RS(12,4), since it cannot tolerate four chunks at any time.
    In the simulated Rep(2) a second failure in a group caused loss, so correlations had a big impact. Rep(3)'s NOMDL was around 10 times better, though still not as good as RS(6,3)'s, despite Rep(3) consuming about twice the storage.
  2. Redundancy schemes that are sufficient for tolerating independent failures may be insufficient for tolerating the correlated failures as shown in our dataset:
    The PDL and NOMDL under only independent failures for Rep(3), RS(6,3), RS(10,4), RS(12,4), and LRC(12,2,2) are zero. However, the reliability of these redundancy schemes degrades under the failure patterns in our dataset. The reason is that some correlated failures occur within a short time period (Finding 3) and additional failures are likely to occur in a short time with the existing correlated failures on the same node or rack (Finding 2), leading to the competition for network bandwidth resources and a slowdown of the repair process. This increases the likelihood of data loss.
  3. Lazy recovery is less suitable than eager recovery for tolerating correlated failures in our dataset. They compare eager recovery (starting repair as soon as a failure is detected) with lazy recovery (postponing repair until a second, third, ... failure is detected):
    The reason of the reliability degradation of lazy recovery under the failures in our dataset is that when the number of failed chunks reaches a larger threshold of chunk failures, additional correlated failures are also more likely to occur in a short time (Findings 2 and 3). Thus, the most proper threshold number of chunk failures is one,
    For example, an RS(10,4) system can tolerate 4 failures. If failures were independent, postponing repair until 3 failures had been detected would seem safe. But failures are correlated, so the likelihood of failures 4 and 5 occurring before the repair can complete is high, as the toy sketch below illustrates.
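To see why correlation punishes lazy recovery, here is a toy sketch of a single RS(10,4) group in which a failure is sometimes followed immediately by a correlated burst. The failure rate, burst model and repair time are invented for illustration and are not calibrated to the paper's dataset:

    # Toy comparison of eager vs lazy recovery for one RS(10,4) group, which
    # tolerates 4 concurrent chunk losses. All parameters are invented.
    import random

    def simulate(threshold, hours=1_000_000, fail_prob=1e-3, burst_prob=0.3,
                 burst_size=2, repair_hours=24, tolerance=4):
        failed, repair_timer, losses = 0, None, 0
        for _ in range(hours):
            if random.random() < fail_prob:        # a chunk fails this hour ...
                failed += 1
                if random.random() < burst_prob:   # ... sometimes with correlated companions
                    failed += burst_size
            if failed > tolerance:                 # too many concurrent failures: data loss
                losses += 1
                failed, repair_timer = 0, None     # rebuild the group and carry on
                continue
            if failed >= threshold and repair_timer is None:
                repair_timer = repair_hours        # start repair once `threshold` chunks are down
            if repair_timer is not None:
                repair_timer -= 1
                if repair_timer == 0:              # repair completes, all chunks restored
                    failed, repair_timer = 0, None
        return losses

    random.seed(0)
    for threshold in (1, 3):                       # 1 = eager, 3 = lazy
        print(f"repair threshold {threshold}: {simulate(threshold)} loss events")

In this toy model eager recovery (threshold 1) starts rebuilding while the group still has plenty of margin, whereas lazy recovery (threshold 3) lets failed chunks accumulate unrepaired, so a correlated burst arriving on top of them can exceed the tolerance.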

Related work

Commendably, the authors include a comprehensive bibliography and overview of related work, only some of which I have previously discussed. Highlights include, in date order:
  • In 2006's Efficient replica maintenance for distributed storage systems Chun et al detected correlated failures in a year's worth of data from PlanetLab.
  • And in a second paper from the same NSDI conference, Subtleties in tolerating correlated failures in wide-area storage systems, Nath et al write (my emphasis):
    In reality, the assumption of failure independence is rarely true. Node failures are often correlated, with multiple nodes in the system failing (nearly) simultaneously. The size of these correlated failures can be quite large. For example, Akamai experienced large distributed denial-of-service (DDoS) attacks on its servers in May and June 2004 that resulted in the unavailability of many of its client sites ... The PlanetLab experienced four failure events during the first half of 2004 in which more than 35 nodes (≈ 20%) failed within a few minutes. Such large correlated failure events may have numerous causes, including system software bugs, DDoS attacks, virus/worm infections, node overload, and human errors. The impact of failure correlation on system unavailability is dramatic (i.e., by orders of magnitude) ... As a result, tolerating correlated failures is a key issue in designing highly available distributed storage systems. Even though researchers have long been aware of correlated failures, most systems ... are still evaluated and compared under the assumption of independent failures.
    Incidentally, the same year my co-authors and I discussed the impact of correlated failures on archival storage systems in Section 4.2 of A fresh look at the reliability of long-term digital storage.
  • In 2007's Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? Schroeder and Gibson looked at hard disk replacements in a large storage system (a minimal sketch of the test they describe appears after this list):
    we determine the correlation of the number of disk replacements observed in successive weeks or months by computing the correlation coefficient between the number of replacements in a given week or month and the previous week or month. For data coming from a Poisson process we would expect correlation coefficients to be close to 0. Instead we find significant levels of correlations, both at the monthly and the weekly level.
  • 2010's Availability in globally distributed storage systems by Ford et al reported on a year-long study of unavailability in Google's storage system. In Section 4 they:
    • Apply a clustering heuristic for grouping failures which occurs almost simultaneously and show that a large fraction of failures happen in bursts.
    • Quantify how likely a failure burst is associated with a given failure domain. We find that most large bursts of failures are associated with rack- or multirack level events.
    They write "the critical element in models of availability is their ability to account for the frequency and magnitude of correlated failures."
  • The same year Schroeder et al studied spatial rather than temporal correlations in Understanding latent sector errors and how to protect against them:
    When trying to protect against LSEs, it is important to understand the distribution of the lengths of error bursts. By an error burst we mean a series of errors that is contiguous in logical block space. The effectiveness of intra-disk redundancy schemes, for example, depends on the length of bursts, as a large number of contiguous errors likely affects multiple sectors in the same parity group preventing recovery through intra-disk redundancy.
  • 2015's A large-scale study of flash memory failures in the field by Meza et al from Carnegie-Mellon and Facebook identified temporal correlations in failures of individual SSDs:
    An explanation for the relatively large differences in errors per machine could be that error events are correlated. Examining the data shows that this is indeed the case: during a recent two weeks, 99.8% of the SSDs that had an error during the first week also had an error during the second week. We therefore conclude that an SSD that has had an error in the past is highly likely to continue to have errors in the future.
  • One of the two "Best Paper" awards at Usenix's 2020 FAST conference went to A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al from Bianca Schroeder's group at U. Toronto and NetApp. They write:
    Our results highlight the occurrence of temporally correlated failures within the same RAID group. This observation indicates that ... realistic data loss analysis certainly has to consider correlated failures.
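Returning to the Schroeder and Gibson item above, the week-to-week correlation test they describe is easy to sketch. This is my own minimal illustration, with a hypothetical input file containing one replacement count per week; for a memoryless (Poisson) failure process the coefficient should be close to zero:

    # Lag-1 correlation of weekly disk replacement counts, as Schroeder & Gibson describe.
    # "replacements_per_week.txt" is a hypothetical file with one count per line.
    import numpy as np
    weekly = np.loadtxt("replacements_per_week.txt")
    r = np.corrcoef(weekly[:-1], weekly[1:])[0, 1]
    print(f"correlation between successive weeks: {r:.2f} (a Poisson process would give ~0)")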
As you can see, correlations have been a recurring topic for a decade and a half. They have been detected at all levels from individual sectors to drives to RAID groups and now, thanks to Han et al, at the level of nodes and racks in data centers.

Findings from the data

  1. A non-negligible fraction of SSD failures belong to intra-node and intra-rack failures (12.9% and 18.3% in our dataset, respectively). Also, the intra-node and intra-rack failure group size can exceed the tolerable limit of some typical redundancy protection schemes.
  2. The likelihood of having an additional intra-node (intra-rack) failure in an intra-node (intra-rack) failure group depends on the already existing intra-node (intra-rack) failures.
  3. A non-negligible fraction of intra-node and intra-rack failures occur within a short period of time, even within one minute.
  4. The relative percentages of intra-node and intra-rack failures vary across drive models. Putting too many SSDs from the same drive model in the same nodes (racks) leads to a high percentage of intra-node (intra-rack) failures. Also, the AFR and environmental factors (e.g., temperature) affect the relative percentages of intra-node and intra-rack failures.
  5. There exist non-negligible fractions of intra-node and intra-rack failures with a short failure time interval for most drive models (e.g., up to 33.4% and 37.1% with a failure time interval of within one minute in our dataset, respectively).
  6. MLC SSDs with higher densities generally have lower relative percentages of intra-node and intra-rack failures.
  7. The relative percentages of intra-node and intra-rack failures increase with age. The intra-node and intra-rack failures at an older age are more likely to occur within a short time due to the increasing rated life used.
  8. The relative percentages of intra-node and intra-rack failures vary significantly across the capacity. There is no clear trend between the relative percentages of intra-node (or intra-rack) failures for different thresholds of failure time intervals and the capacity.
  9. The SMART attributes have limited correlations with intra-node and intra-rack failures, and the highest SRCC values (from S187) are only 0.23 for both intra-node and intra-rack failures. Thus, SMART attributes are not good indicators for detecting the existence of intra-node and intra-rack failures. Also, intra-node and intra-rack failures have no significant difference of the absolute values of SRCC for each SMART attribute.
  10. Write-dominant workloads lead to more SSD failures overall, but are not the only impacting factor on the AFRs. Other factors (e.g., drive models) can affect the AFRs.
  11. The applications with more SSDs per node (rack) and write-dominant workloads tend to have a high percentage of intra-node (intra-rack) failures.
  12. Among individual applications, the intra-node and intra-rack failures at an older age and with more write-dominant workloads tend to occur in a short time.
