DSHR's Blog: More On Failures From FAST 2020

A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al, which I discussed in Enterprise SSD Reliability, wasn't the only paper at this year's Usenix FAST conference about storage failures. Below the fold I comment on one specifically about hard drives rather than SSDs, making it more relevant to archival storage.

Because HDDs and SSDs are in fact computers, not just media, their software is capable of reporting a good deal of diagnostic information via the SMART API. It has always been an attractive idea to use this information to predict device failures and enable proactive replacement. But past efforts to do so haven't been as effective as one might have hoped.

Now, Making Disk Failure Predictions SMARTer! by Sidi Lu et al applies machine learning to a very comprehensive dataset. Their abstract reads:

Disk drives are one of the most commonly replaced hardware components and continue to pose challenges for accurate failure prediction. In this work, we present analysis and findings from one of the largest disk failure prediction studies covering a total of 380,000 hard drives over a period of two months across 64 sites of a large leading data center operator. Our proposed machine learning based models predict disk failures with 0.95 F-measure and 0.95 Matthews correlation coefficient (MCC) for 10-days prediction horizon on average.

Lu: Figures 1 & 2

Previous work showed that SMART attributes did predict failure, but only so close to the actual failure that proactive replacement was impractical. Their Figure 1 shows this - the red line for failed drives only becomes distinct in the last day or so of the drives' lives.

On the other hand, Figure 2 shows that the performance metrics of failed disks are distinguishable much earlier. This matches FAST 2018's Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems, by Haryadi Gunawi and 20 other authors from 12 companies, national labs and university data centers, which observed that slow performance attributable to various components is common:

Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 12 institutions. We show that all hardware types such as disk, SSD, CPU, memory and network components can exhibit performance faults.

In the left half of Figure 2, note that the failed drives are the slowest for the entire period of the graph.

To find cases where a disk's performance leading up to failure was distinguishable from normal behavior, they used the difference between the failed disk's measured value and the average for the healthy disks on the same server:

If there is only one failed disk on a specific failed server, we keep the raw value of the failed disk (RFD) and calculate the average value of all healthy disks (AHD) for every time point. Then, we get the difference between RFD and AHD, which indicates the real-time difference between the signatures of failed disks and healthy disks on the same server. If there are N (N ≥ 2) failed disks, then for each failed disk, we calculate the difference between RFD and AHD for every time point.
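To make the RFD minus AHD calculation concrete, here is a minimal Python sketch. It is my illustration, not the authors' code: the function name, the dictionary layout (disk id mapped to a list of samples of one metric, all the same length) and the sample values are assumptions.

```python
from statistics import mean

def rfd_minus_ahd(server_series, failed_ids):
    """Per-time-point difference between each failed disk's raw value (RFD)
    and the average of the healthy disks (AHD) on the same server.

    server_series: hypothetical layout, {disk_id: [sample_t0, sample_t1, ...]}
    failed_ids:    ids of the disks on this server that eventually failed
    """
    failed = set(failed_ids)
    healthy = [s for disk, s in server_series.items() if disk not in failed]
    if not healthy:
        raise ValueError("need at least one healthy disk on the server")
    # AHD: average over the healthy disks at each time point.
    ahd = [mean(samples) for samples in zip(*healthy)]
    # One difference series per failed disk, covering the N >= 2 case too.
    return {
        disk: [r - a for r, a in zip(server_series[disk], ahd)]
        for disk in failed
    }

# Made-up latency samples for three disks on one server; "c" is the failed one.
series = {"a": [10.0, 10.0, 10.0], "b": [12.0, 12.0, 12.0], "c": [11.0, 20.0, 5.0]}
diff = rfd_minus_ahd(series, ["c"])
```

A series that stays near zero means the failed disk was behaving like its neighbors; excursions away from zero are the kind of signature the paper goes on to look for.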

Lu Figure 5

Using the difference between RFD and AHD, the authors observed a range of behavior patterns leading up to failures, as shown in Figure 5:

The top two graphs of Figure 5 illustrate that some failed disks have a similar value to healthy disks at first, but then their behavior becomes unstable as the disk nears the impending failure. The bottom two graphs of Figure 5 show that some failed disks report a sharp impulse before they fail, as opposed to a longer erratic behavior. These sharp impulses may even repeat multiple times. We did not find such patterns for SMART attributes so far before the failure of this selected example. The diversity of patterns demonstrates that disk failure prediction using performance metrics is non-trivial.

Using machine learning techniques focused on performance metrics, with the interesting addition of location data, they were able to predict disk failures 10 days ahead with high confidence:

We discover that performance metrics are good indicators of disk failures. We also found that location markers can improve the accuracy of disk failure prediction. Lastly, we trained machine learning models including neural network models to predict disk failures with 0.95 F-measure and 0.95 MCC for 10 days prediction horizon.
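For readers unfamiliar with the two scores quoted, the sketch below computes F-measure and MCC from confusion-matrix counts using their standard definitions. The counts are invented for illustration and are not from the paper; they are chosen only to show why these scores suit rare events like disk failures better than plain accuracy.

```python
from math import sqrt

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    if tp == 0:
        return 0.0  # no true positives: define the score as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical fleet: 100 disks that fail, 9,900 that do not.
good = dict(tp=95, tn=9895, fp=5, fn=5)     # a useful predictor
lazy = dict(tp=0, tn=9900, fp=0, fn=100)    # always predicts "healthy"

good_f = f_measure(good["tp"], good["fp"], good["fn"])
good_mcc = mcc(**good)
lazy_acc = accuracy(**lazy)
lazy_f = f_measure(lazy["tp"], lazy["fp"], lazy["fn"])
lazy_mcc = mcc(**lazy)
```

With these made-up counts the lazy predictor is 99% accurate but scores zero on both F-measure and MCC, because both scores take the failure class seriously even when it is a small fraction of the population.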

Disks in the same location in a rack are subject to the same vibration and thermal environment. The relevance of this to failure was noted in Nisha Talagala's 1999 Ph.D. thesis Characterizing Large Storage Systems: Error Behavior and Performance Benchmarks, which identified it as a prime cause of correlated failures:

The time correlation data in Section 3.5.4 showed that several machines showed bursts of SCSI errors, escalating over time. This data suggests that a sequence of error messages from the same device will suggest imminent failure. A single message is not enough; as Section 3.5 showed, components report hardware errors from time to time without failing entirely. ... the SCSI parity errors were relatively localized, appearing in only three of the 16 machines.

Back in 2015, RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures by Ao Ma et al showed that even simpler methods can work quite well. In EMC's environment they could effectively predict SATA disk failures by observing the reallocated sector count and greatly reduce RAID failures by proactively replacing drives whose counts exceed a threshold.

Ma Figure 6

From their abstract:

We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from 6 disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures.
...
we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures.

The graphs in Figure 6 show clearly the strong signal of impending failure that reallocated sector count provides.
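The simple policy described above, replacing drives whose reallocated sector count exceeds a threshold, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the drive names, the counts and the threshold of 200 are all made up, and none of it reflects the specific threshold or mechanism in EMC's product. Collecting the counts (reported by drives as SMART attribute 5, Reallocated Sectors Count) is left out.

```python
def drives_to_replace(reallocated_sectors, threshold):
    """Return the ids of drives whose reallocated sector count exceeds
    the threshold, sorted for stable output.

    reallocated_sectors: hypothetical mapping {drive_id: count}
    threshold:           site-chosen limit (an assumption, not from the paper)
    """
    return sorted(
        drive for drive, count in reallocated_sectors.items()
        if count > threshold
    )

# Made-up counts for four drives in one RAID group.
counts = {"sda": 0, "sdb": 250, "sdc": 1200, "sdd": 200}
flagged = drives_to_replace(counts, threshold=200)
```

Choosing the threshold is the real work: too low and healthy drives are replaced needlessly, too high and the warning comes too late to protect the RAID group.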


Tuesday, March 24, 2020
