Thursday, March 2, 2017

Injecting Faults in Distributed Storage

I'll record my reactions to some of the papers at the 2017 FAST conference in a subsequent post. But one of them has significant implications for digital preservation systems using distributed storage, and deserves a post to itself. Follow me below the fold as I try to draw out these implications.

Four of the 27 papers at this year's FAST conference, and both the Best Paper awards, were from UW Madison. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions by Aishwarya Ganesan et al. from UW Madison wasn't a Best Paper, but I would have voted for it. And kudos to the primary author for her very clear presentation. The abstract reads:
We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous bugs related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability.
Previous work studied the response of local file systems to media faults. At the 2005 SOSP a team from UW Madison presented IRON File Systems; they injected device faults below state-of-the-art file systems, studied the response, and concluded that:
commodity file system failure policies are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures.
The experiments described in the 2017 paper injected realistic media-caused faults into the local file systems underlying the distributed storage systems in question. To do so, the authors inserted a FUSE-based stacked file system between each local file system and the distributed storage system being studied. This code is being released for others to use.

The faults they injected each affected only a single data block in the local file system of a single server in the distributed system; they did not inject faults into local file system metadata. The injected faults were corrupted data, read errors, and write errors. It is notable that a single fault of this kind at a single node, which did not itself affect the replicas at other nodes, could still cause significant problems, including corrupting those replicas.
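To make the methodology concrete, here is a minimal sketch of the kind of fault-injecting stacked file system the paper describes, written against the fusepy bindings. It is my own illustration, not the authors' released tool: the class, option names, and block-targeting scheme are invented. It passes operations through to an underlying directory, except that reads of one configured (file, block) pair either return flipped bits or fail with EIO, and writes to it fail with EIO:

```python
# Sketch of a fault-injecting stacked (passthrough) file system in the
# spirit of the paper's methodology. Illustrative only; requires fusepy.
import errno
import os

from fuse import FUSE, FuseOSError, Operations

BLOCK = 4096  # inject faults at 4 KB block granularity

class FaultyFS(Operations):
    def __init__(self, root, target, block, mode):
        self.root = root          # real directory holding the data
        self.target = target      # path (relative to root) to affect
        self.block = block        # which block of that file to affect
        self.mode = mode          # 'corrupt', 'read-error' or 'write-error'

    def _full(self, path):
        return os.path.join(self.root, path.lstrip('/'))

    def _hits_target(self, path, offset, length):
        if path.lstrip('/') != self.target or length == 0:
            return False
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        return first <= self.block <= last

    # --- passthrough plumbing -------------------------------------------
    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {k: getattr(st, k) for k in dir(st) if k.startswith('st_')}

    def readdir(self, path, fh):
        return ['.', '..'] + os.listdir(self._full(path))

    def open(self, path, flags):
        return os.open(self._full(path), flags)

    def release(self, path, fh):
        return os.close(fh)

    # --- fault injection --------------------------------------------------
    def read(self, path, size, offset, fh):
        if self._hits_target(path, offset, size):
            if self.mode == 'read-error':
                raise FuseOSError(errno.EIO)       # simulate a read error
            if self.mode == 'corrupt':
                os.lseek(fh, offset, os.SEEK_SET)
                data = bytearray(os.read(fh, size))
                data[0] ^= 0xFF                    # flip bits: corruption
                return bytes(data)
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def write(self, path, data, offset, fh):
        if self.mode == 'write-error' and self._hits_target(path, offset, len(data)):
            raise FuseOSError(errno.EIO)           # simulate a write error
        os.lseek(fh, offset, os.SEEK_SET)
        return os.write(fh, data)

if __name__ == '__main__':
    import sys
    root, mountpoint, target = sys.argv[1:4]
    FUSE(FaultyFS(root, target, block=0, mode='corrupt'),
         mountpoint, foreground=True)
```

Point the distributed storage node's data directory at the mount point, and a single block of a single file on that one node misbehaves while everything else is untouched.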

The distributed storage systems studied were Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, and CockroachDB. None looked good; all failed in some ways and none effectively exploited the redundancy they maintained to recover from the failures. The paper drew five general conclusions:
  1. Systems employ diverse data integrity strategies, ranging from trusting the underlying file system (RethinkDB and Redis) to using checksums (ZooKeeper, MongoDB, CockroachDB).
  2. Local behavior: faults are often undetected; even if detected, crashing is the most common local reaction.
  3. Redundancy is underutilized: a single fault can have disastrous cluster-wide effects.
  4. Crash handling and corruption handling are entangled.
  5. Nuances in commonly used distributed protocols can spread corruption or data loss.
This work has two kinds of implications for digital preservation systems:
  • These systems frequently use stored checksums to detect corruption, which they then repair from a replica (a naive version of this verify-and-repair loop is sketched after this list). They may therefore exhibit some of the bad behaviors of the similar processes in the distributed storage systems the paper studied.
  • Systems using a distributed storage layer for long-term storage frequently place some level of trust in it, for example they may trust it to store the checksums. They do so because they believe the replication of the distributed storage system renders it fault-tolerant. This paper destroys that belief.
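Here is a minimal sketch of the verify-and-repair pattern the first bullet refers to. The fetch_replica() helper and the on-disk layout are hypothetical; the point is that the repair path must itself verify what it copies, otherwise a single corrupt copy can propagate in exactly the way the paper describes:

```python
# Sketch of checksum-based detection and repair from replicas. The
# fetch_replica() callable is an assumed interface that yields candidate
# copies of the object from peer nodes; everything here is illustrative.
import hashlib
import os

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def audit_and_repair(path, stored_checksum, fetch_replica):
    """Check one stored object; on mismatch, try each replica in turn."""
    if sha256_of(path) == stored_checksum:
        return 'ok'
    for replica_data in fetch_replica(path):           # iterate over peers
        # Verify the replica against the stored checksum BEFORE using it.
        if hashlib.sha256(replica_data).hexdigest() == stored_checksum:
            tmp = path + '.repair'
            with open(tmp, 'wb') as f:                  # write, sync, rename
                f.write(replica_data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)                       # atomic swap
            return 'repaired'
    return 'unrepairable'                               # all replicas bad
```

Note that this loop silently trusts the stored checksum itself; if the checksum lives in the same distributed storage layer, the second bullet applies to it too.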
There was related work in the Work In Progress session. On Fault Resilience of File System Checkers by Om Rameshwar Gatla and Mai Zheng of New Mexico State was inspired by a real data loss incident at a high-performance computing center. A power failure triggered a run of their Lustre file system checker, which would take several days. During the check, a second power failure interrupted it. The Lustre checker invokes the underlying local file system checker, in this case e2fsck; the interrupted e2fsck runs had already made changes to their file systems that a subsequent e2fsck could not repair.

The authors used similar fault injection techniques to replicate and study the problem. The take-away is that file system checkers need to be transactional, a conclusion which seems obvious once it has been stated. This is especially the case since the probability of a power outage is highest immediately after power restoration. Until the checkers are transactional, their vulnerability to crashes and power failures, and the inability of the distributed storage systems above them to respond correctly, pose a significant risk.
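"Transactional" here amounts to logging each intended repair, together with what it overwrites, before touching the file system, so an interrupted run can be rolled back and restarted from a consistent state. A toy sketch of the idea follows; the log format and block-level repair model are invented for illustration, and a real checker such as e2fsck would need far more care:

```python
# Toy sketch of a transactional repair pass for a checker: every repair is
# recorded in an undo log with the block's original contents before the
# block is modified, so an interrupted run can be rolled back on restart.
import json
import os

LOG = 'fsck-undo.log'
BLOCK = 4096

def apply_repairs(image_path, repairs):
    """repairs: list of (block_number, new_bytes_hex) pairs."""
    with open(image_path, 'r+b') as img, open(LOG, 'w') as log:
        for blockno, new_hex in repairs:
            img.seek(blockno * BLOCK)
            old = img.read(BLOCK)
            # 1. Log the undo record and force it to stable storage first.
            log.write(json.dumps({'block': blockno, 'old': old.hex()}) + '\n')
            log.flush()
            os.fsync(log.fileno())
            # 2. Only then overwrite the block in place.
            img.seek(blockno * BLOCK)
            img.write(bytes.fromhex(new_hex))
            img.flush()
            os.fsync(img.fileno())
    os.remove(LOG)   # commit: a missing log means the pass completed

def rollback(image_path):
    """On restart after a crash, undo any partially applied repairs."""
    if not os.path.exists(LOG):
        return
    with open(image_path, 'r+b') as img, open(LOG) as log:
        for line in log:
            rec = json.loads(line)
            img.seek(rec['block'] * BLOCK)
            img.write(bytes.fromhex(rec['old']))
        img.flush()
        os.fsync(img.fileno())
    os.remove(LOG)
```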
