User rblandau at MIT-Informatics has a high-level simulation of distributed preservation that looks like an interesting way of exploring these questions. Below the fold, my commentary.
rblandau's conclusions from the first study using the simulation are:
- Across a wide range of error rates, maintaining multiple copies of documents improves the survival rate of documents, much as expected.
- For moderate storage error rates, in the range that one would expect from commercial products, small numbers of copies suffice to minimize or eliminate document losses.
- Auditing document collections dramatically improves the survival rate of documents using substantially fewer copies (than required without auditing).
- Auditing is expensive in bandwidth. We should work on (cryptographic) methods of auditing that do not require retrieving the entire document.
- Auditing does not need to be performed very frequently.
- Glitches increase document loss more or less in proportion to their frequency and impact. They cannot be distinguished from overall increases in error rate.
- Institutional failures are dangerous in that they remove entire collections and expose client collections to higher risks of permanent document loss.
- Correlated failures of institutions could be particularly dangerous in this regard by removing more than one copy from the set of copies for long periods.
- We need more information on plausible ranges of document error rates and on institutional failure rates.
Taking these conclusions in turn:
- Auditing document collections dramatically improves the survival rate - no kidding! If you never find out that something has gone wrong you will never fix it, so you need a lot more copies to compensate.
- Auditing is expensive in bandwidth - not if you do it right. There are several auditing systems that do not require retrieving the entire document, including LOCKSS, ACE and a system from Mehul Shah et al at HP Labs. None of them is ideal in every case, but where they fit, their bandwidth use is insignificant. And note the beneficial effects of combining local and networked detection of damage. (A sketch of the underlying nonce-based approach follows this list.)
- Auditing does not need to be performed very frequently - it depends. Oversimplifying, the critical parameters are MeanTimeToFailure (MTTF), MeanTimeToDetection (MTTD) and MeanTimeToRepair (MTTR); the probability that the system is in a state with an un-repaired failure is (MTTD+MTTR)/MTTF, where MTTD is the inverse of the rate at which auditing occurs. A system with an un-repaired failure is at higher risk because its replication level is reduced by one. (See the worked example after this list.)
- Institutional failures are dangerous - yes, because repairs are not instantaneous. At scale, MTTR is proportional to the amount of damage that needs to be repaired. The more data a replica loses, the longer it takes to repair, and thus the longer the system is at increased risk. And the bandwidth that repair uses competes with whatever bandwidth the audit process needs.
- Correlated failures of institutions could be particularly dangerous - yes! Correlated failures are the elephant in the room in simulations of system reliability because, instead of decrementing the replication factor of the entire collection by one, they can reduce it by an arbitrary number, perhaps even to zero. If it gets to zero, it's game over. (The last sketch after this list illustrates why a shared cause dominates the loss probability.)
- We need more information - yes, but we probably won't get much. There are three kinds of information that would improve our ability to simulate the reliability of digital preservation:
- Failure rates of storage media. The problem here is that storage media are (a) very reliable, but (b) less reliable in the field than their specifications claim. So we need experiments, but to gather meaningful data they need to be at an enormous scale. Google, NetApp and even Backblaze can do these experiments; preservation systems can't, simply because they aren't big enough. It isn't clear how representative of preservation systems these experiments are, and in any case it is known that media cause only about half the failures in the field.
- Failure rates of storage systems from all causes, including operator error and organizational failure. Research shows that media failure is the root cause of only about half of storage system failures, but the other causes are rare enough that collecting data on them also requires operating at large scale.
- Correlation probabilities between these failures. Getting meaningful data on the full range of possible correlations requires collecting vastly more data than for individual media reliability.
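To make the "audit without retrieval" point concrete, here is a minimal sketch of a nonce-based (salted-hash) audit of the kind LOCKSS-style systems rely on. It is illustrative only: the function names and structure are my own invention, not the API of LOCKSS, ACE or the HP Labs scheme. Each replica hashes a fresh random nonce followed by its local copy, so only a small digest crosses the network, and a precomputed hash of the bare document cannot satisfy the challenge.

```python
import hashlib
import os
from collections import Counter

def audit_challenge() -> bytes:
    """Auditor picks a fresh random nonce for each round, so a replica
    cannot answer from a stale, precomputed hash of the document."""
    return os.urandom(32)

def audit_response(nonce: bytes, document_path: str) -> str:
    """Run at the replica: hash the nonce followed by the local copy.
    Only this digest is sent back, not the document itself."""
    h = hashlib.sha256()
    h.update(nonce)
    with open(document_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def evaluate_audit(responses: dict) -> dict:
    """Auditor side: with several replicas answering the same nonce,
    majority agreement flags the damaged copies without the auditor
    ever holding the document."""
    consensus, _ = Counter(responses.values()).most_common(1)[0]
    return {replica: digest == consensus for replica, digest in responses.items()}
```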
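And here is the back-of-the-envelope calculation behind the (MTTD+MTTR)/MTTF approximation. The numbers are invented purely for illustration:

```python
def fraction_at_reduced_replication(mttf_years: float,
                                    audit_interval_years: float,
                                    mttr_years: float) -> float:
    """Approximate fraction of time a replica carries an undetected or
    un-repaired failure: (MTTD + MTTR) / MTTF, taking MTTD to be the
    audit interval (the inverse of the audit rate)."""
    return (audit_interval_years + mttr_years) / mttf_years

# Illustrative numbers only: failures every 10 years on average,
# quarterly audits, about a week to repair.
print(fraction_at_reduced_replication(10.0, 0.25, 7 / 365))    # ~0.027
# Monthly audits instead of quarterly reduce the exposure substantially.
print(fraction_at_reduced_replication(10.0, 1 / 12, 7 / 365))  # ~0.010
```

Note that the repair term matters too: if a large institutional failure pushes MTTR from a week to several months, the exposure is dominated by repair time no matter how often you audit.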
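Finally, a toy calculation of why correlation is the elephant in the room. The parameters are invented and not calibrated to rblandau's model; the point is only that even a small probability of a common-cause event wiping out every copy swamps the probability of all copies failing independently.

```python
def p_collection_lost(n_copies: int, p_fail: float, p_shared: float) -> float:
    """Probability of losing every copy in one period: either a shared
    common-cause event takes out all replicas at once, or each copy
    fails independently in the same period."""
    return p_shared + (1 - p_shared) * p_fail ** n_copies

# Illustrative numbers: 4 copies, each with a 1% independent chance of loss per period.
print(p_collection_lost(4, 0.01, 0.0))      # 1e-08: independent failures alone
print(p_collection_lost(4, 0.01, 0.001))    # ~1e-03: the shared cause dominates
```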
This is a depressing yet incomplete list of possible causes for a failing integrity check (fixity check): https://twitter.com/benfinoradin/status/641693053061853184. I wonder how many integrity check failures are false alarms caused by a temporary glitch that would pass when checked again once the network is back. That raises the question of what the actual repair procedure is and how its cost should be factored into the calculation, since repairs can range from "plug the network cable back in" to "rebuild the entire backup site after an asteroid strike".