Tuesday, March 26, 2019

FAST 2019

I wasn't able to attend this year's FAST conference in Boston and, reading through the papers, I didn't miss much that was relevant to long-term storage. Below the fold are a couple of quick notes and a look at the one really relevant paper.

Erasure coding is a very important technique for long-term storage, as it can significantly reduce the amount of raw storage needed to provide a given level of reliability. There were two papers on erasure coding:
  • Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques by Zhou et al studies ways to accelerate the computations of erasure coding:
    Various techniques have been proposed in the literature to improve erasure code computation efficiency, including optimizing bitmatrix design, optimizing computation schedule, common XOR operation reduction, caching management techniques, and vectorization techniques. These techniques were largely proposed individually previously, and in this work, we seek to use them jointly. In order to accomplish this task, these techniques need to be thoroughly evaluated individually, and their relation better understood. Building on extensive test results, we develop methods to systematically optimize the computation chain together with the underlying bitmatrix.
    This is useful work, but the computational load of erasure codes isn't a factor in long-term storage.
  • OpenEC: Toward Unified and Configurable Erasure Coding Management in Distributed Storage Systems by Li et al addresses the issue of how improved erasure codes can be deployed in practice:
    integrating new erasure coding solutions into existing distributed storage systems is a challenging task and requires non-trivial re-engineering of the underlying storage workflows. We present OpenEC, a unified and configurable framework for readily deploying a variety of erasure coding solutions into existing distributed storage systems. OpenEC decouples erasure coding management from the storage workflows of distributed storage systems, and provides erasure coding designers with configurable controls of erasure coding operations through a directed-acyclic-graph-based programming abstraction.
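To make the raw-storage point about erasure coding concrete, here is a minimal sketch of the arithmetic. The function names are mine, not from either paper, and a k-of-n scheme is taken to mean that any k of the n stored chunks suffice to recover the data, so replication is the special case k = 1.

```python
def raw_storage_factor(k: int, n: int) -> float:
    """Raw bytes stored per byte of user data for a k-of-n scheme."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n / k


def failures_tolerated(k: int, n: int) -> int:
    """Number of chunk losses a k-of-n scheme survives without data loss."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n - k


# 3-replication is 1-of-3; the 10-of-14 code is the one quoted later in the post.
for name, (k, n) in {"3-replication": (1, 3), "10-of-14": (10, 14)}.items():
    print(f"{name}: {raw_storage_factor(k, n):.2f}x raw storage, "
          f"survives {failures_tolerated(k, n)} lost chunks")
```

The erasure code stores less than half the raw bytes of 3-replication while surviving more chunk losses, which is why it matters for long-term storage.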
I've often praised Backblaze for their exemplary transparency. One aspect of this is their open data on failure rates of the various hard disks they deploy. Cluster storage systems gotta have HeART: improving storage efficiency by exploiting disk-reliability heterogeneity by Kadekodi et al makes use of this data by observing that different drive models have considerably different Annual Failure Rates (AFR), as shown in the graph.

The major difficulty in modeling and designing long-term storage systems is that the failures against which protection is required are often correlated in ways that are hard to discover or predict. If, as the Backblaze data show, different drive models exhibit different failure patterns, the correlation between them will be low. Kadekodi et al observe that:
Despite such differences, the degree of redundancy employed in cluster storage systems for the purpose of long term data reliability (e.g., the degree of replication or erasure code parameters) is generally configured as if all of the devices have the same reliability. Unfortunately, this approach leads to configurations that are overly resource-consuming, overly risky, or a mix of the two. For example, if the redundancy settings are configured to achieve a given data reliability target (e.g., a specific mean time to data loss (MTTDL)) based on the highest AFR of any device make/model (e.g., S-4 from Fig. 1), then too much space will be used for redundancy associated with data that is stored fully on lower AFR makes/models (e.g., H-4A). Continuing this example, our evaluations show that the overall wasted capacity can be up to 16% compared to uniform use of erasure code settings stated as being used in real large-scale storage clusters [13, 25, 26, 28] and up to 33% compared to using 3-replication for all data — the direct consequence is increased cost, as more disks are needed. If redundancy settings for all data are based on lower AFRs, on the other hand, then data stored fully on higher-AFR devices is not sufficiently protected to achieve the data reliability target.
Their HeART (Heterogeneity-Aware Redundancy Tuner) system is:
an online tool for guiding exploitation of reliability heterogeneity among disks to reduce the space overhead (and hence the cost) of data reliability. HeART uses failure data observed over time to empirically quantify each disk group’s reliability characteristics and determine minimum-capacity redundancy settings that achieve specified target data reliability levels. For the Backblaze dataset of 100,000+ HDDs over 5 years, our analysis shows that using HeART’s settings could achieve data reliability targets with 11–33% fewer HDDs, depending on the baseline one-scheme-for-all settings. Even when the baseline scheme is a 10-of-14 erasure code whose space-overhead is already low, HeART further reduces disk space used by up to 14%.
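The idea of picking "minimum-capacity redundancy settings" per disk group can be sketched in a few lines. This is not HeART's algorithm, just an illustration under loud assumptions: disk failures are independent (exactly what is hard to guarantee in practice), each disk fails within a repair window with probability AFR × window / 365, and a stripe is lost if more than n − k of its chunks fail in one window. The AFRs, the 7-day window, the target and the candidate list are all made-up inputs.

```python
from math import comb


def stripe_loss_probability(k, n, afr, repair_days=7.0):
    """Chance that a k-of-n stripe loses more than n - k chunks in one
    repair window, ASSUMING independent failures (a simplification)."""
    p = afr * repair_days / 365.0  # per-disk failure probability in the window
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(n - k + 1, n + 1)
    )


def cheapest_scheme(afr, target, candidates, repair_days=7.0):
    """Return the (k, n) candidate with the lowest raw-storage factor n/k
    whose loss probability meets the target, or None if none does."""
    ok = [
        (k, n)
        for (k, n) in candidates
        if stripe_loss_probability(k, n, afr, repair_days) <= target
    ]
    return min(ok, key=lambda kn: kn[1] / kn[0]) if ok else None


# Hypothetical candidates, AFRs and target, chosen only for illustration.
CANDIDATES = [(1, 3), (6, 9), (10, 12), (10, 13), (10, 14)]
TARGET = 1e-10

for label, afr in [("low-AFR model", 0.01), ("high-AFR model", 0.10)]:
    print(label, cheapest_scheme(afr, TARGET, CANDIDATES))
```

With these toy numbers the high-AFR group needs 10-of-14 while the low-AFR group gets by with 10-of-13, about 7% less raw capacity. Sizing every group for the worst model would waste that space, which is the effect the paper quantifies with real data.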
The details in the paper are fascinating; you should go read it. The related work section cites many important studies in this area. One caveat I have is that the Backblaze data include one disk model which showed anomalously rapid wear-out at around 3 years, as shown in the graph. The other disk models in the dataset all show a fairly mild bathtub curve of failures. This atypical behavior may have decreased the observed correlation, and enhanced the space savings that the authors compute. I believe the anomalous model carried the normal 5-year warranty, so these failures were very expensive for the manufacturer. Avoiding failures during warranty is a high priority for the drive engineering teams, so this rapid wear-out behavior should be rare.
