Tuesday, June 14, 2022

Where Did The Number 3 Come From?

The Keepers Registry, which tracks the preservation of academic journals by various "keepers" (preservation agencies), currently says:
20,127 titles are being ‘kept safe’ by 3 or more Keepers
The registry backs this up with this page, showing the number of journals being preserved by N keepers.
The NDSA Levels of Digital Preservation: An Explanation and Uses from 2013 is still in wide use as a guide to preserving digital content. It specifies the number of independent copies as 2 for "Level 1" and 3 for "Levels 2-4".

Alicia Wise of CLOCKSS asked "where did the number 3 come from?" Below the fold I discuss the backstory.

Back in 2011 I gave a talk at Netherlands Beeld en Geluid in Hilversum entitled How Few Copies?. The Beeld en Geluid (B&G) is the Dutch audiovisual archive. The large majority of the bytes they preserve are video, so their average content unit is big. The cost of preserving bytes is more than linear in the number of bytes, so the answer to the question "how few copies do we need to be safe?" is an important input to their budget. The same is true for many other digital preservation services.

In the real world, perfect digital preservation would need an infinite budget. With a finite budget, some stuff will get lost. I explained the theoretical ideal:
We can spend more to keep more copies, or be more careful with each copy, and hopefully lose less stuff. It's a trade-off. In theory, for each technology there is a curve like this, relating the number of copies to the probability of loss. Each additional copy improves reliability, subject to the law of diminishing returns.

In theory, for each number of copies there is a curve like this, relating the cost per copy to the probability of loss. Each additional dollar improves reliability, subject to the law of diminishing returns.

What we'd like to do is to put these two sets of graphs together, like this:
Then, given a target loss probability, we could work out the most cost-effective technology and replication factor. Or, given a budget, we could work out the best achievable loss probability.
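The shape of such a curve can be illustrated with a deliberately unrealistic toy model, my own assumption rather than anything from the talk: suppose each of n copies is lost independently with a fixed probability p over the period of interest. Then content is lost only if every copy is lost, and each extra copy helps by an ever-smaller absolute amount:

```python
def loss_probability(p: float, n: int) -> float:
    """Toy model: content is lost only if all n independent copies
    are lost, each with probability p over the period of interest."""
    return p ** n

p = 0.01  # assumed per-copy loss probability; purely illustrative
previous = 1.0
for n in range(1, 5):
    current = loss_probability(p, n)
    # each extra copy helps, but by a smaller absolute amount than the last
    print(f"{n} copies: P(loss) = {current:.2e}, improvement = {previous - current:.2e}")
    previous = current
```

The independence assumption is exactly what the talk goes on to undermine; real copies fail in correlated ways, which is why the real curve is so hard to measure.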
My talk made two main points. First:
Why don't we have such a graph? There are two main reasons. The first is that, although they aren't nearly reliable enough, storage systems are very, very reliable. Measuring a system's loss rate involves watching a huge amount of data for a long time, so gathering the data for a graph like this is far too expensive to be feasible. The second is that the loss rate depends critically on the details of the internal design and implementation of every component of every system. We would have to run this infeasibly expensive experiment for every single system; we can't do it once and generalize.
We don't even have the data upon which to base realistic simulations of the data loss in real systems, because we can't model correlations:
Why are correlations between causes of data loss important? Suppose we have two replicas, and at some point one replica fails. Some time after that, the failure is detected and a repair process is initiated, copying from the other replica. If the second replica fails after the initial failure and before the copy is completed the data will be lost. Thus the critical number is not the probability of a replica failing, but the probability of a second replica failing during the detection and repair interval after a first failure.
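This repair-window dynamic can be sketched with a small Monte Carlo simulation. Everything numeric here is an assumption for illustration only — exponential failure times, a chosen mean time between failures, a fixed detection-and-repair window — not a measurement of any real system:

```python
import random

def p_loss_two_replicas(mtbf_years: float, repair_days: float,
                        horizon_years: float = 10.0,
                        trials: int = 50_000, seed: int = 1) -> float:
    """Estimate the probability that, at some point in the horizon, the
    surviving replica fails inside the detection-and-repair window opened
    by the first replica's failure (exponential failure times assumed)."""
    rng = random.Random(seed)
    window_years = repair_days / 365.0
    losses = 0
    for _ in range(trials):
        t = 0.0
        while True:
            # time until either of the two replicas fails
            t += rng.expovariate(2.0 / mtbf_years)
            if t >= horizon_years:
                break  # survived the horizon
            # does the surviving replica fail before repair completes?
            if rng.expovariate(1.0 / mtbf_years) < window_years:
                losses += 1
                break
            # otherwise the failed replica is repaired; carry on
    return losses / trials

# Illustrative, assumed numbers: 50-year MTBF per replica
slow = p_loss_two_replicas(mtbf_years=50, repair_days=60)
fast = p_loss_two_replicas(mtbf_years=50, repair_days=1)
print(f"60-day repair: {slow:.2e}, 1-day repair: {fast:.2e}")
```

The point the simulation makes is the one in the paragraph above: shrinking the detection-and-repair window shrinks the interval in which a second failure is fatal, so repair speed directly drives the loss probability.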
And thus, second:
We'll just have to make the best decisions we can without the graph. Even though we can't draw the exact graph, we know the shape it will have, and thus we can formulate some rules of thumb, all of which are subject to diminishing returns:
  • The more copies the safer.
  • The less correlated the copies the safer.
  • The more reliable each copy the safer.
  • The faster failures are detected and repaired the safer.
In the normal case, these rules are in descending order of importance: extra copies matter more than reduced correlation, which matters more than increased copy reliability, which matters more than faster detection and repair.
I expect my talk was less helpful than they hoped.

All this is to say that there is no strong theoretical justification for the criterion for safety being three, or any other number. In theoretical computer science there is a very strong theoretical result dating from 1982, called Byzantine Fault Tolerance (BFT). This proves that 3f+1 replicas can in the worst case survive f simultaneous failures. This would lead to the idea that, to be safe against one failure, four replicas are needed. Or to be safe against two failures, seven replicas are needed.

So why do the Keepers Registry and the NDSA use three, not at least seven? There are two reasons:
  • BFT is sound computer science, showing that the system behaves perfectly provided it encounters no more than f simultaneous failures. But from the engineering perspective the issue is that if it does encounter more than f simultaneous failures, it fails completely. So the engineering question is "what is the probability of f+1 simultaneous failures during the system's designed life?".
  • Seven would imply a system 233% as expensive. In practice there is rarely enough money even for three high-quality replicas, so choosing seven means spending 57% less on each replica. The result would be that each replica would fail much more often, because of the law of diminishing returns, greatly increasing the probability of two simultaneous failures.
Theory provides a clear answer, engineering provides a question about probabilities.
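The budget arithmetic behind the second reason, assuming a fixed total budget split evenly across replicas:

```python
# Assumed: a fixed total budget, split evenly across replicas.
cost_ratio = 7 / 3           # seven replicas at three-replica quality
per_replica_cut = 1 - 3 / 7  # per-replica spend cut when the budget is fixed

print(f"seven replicas at the same quality: {cost_ratio:.0%} of the cost")
print(f"same budget spread over seven replicas: {per_replica_cut:.0%} less per replica")
```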

The Keepers Registry is right to focus on the number of different services preserving a unit of content. Let's look at the rules of thumb given a target of three:
  • The more copies the safer. Presumably, each of the three services maintains more than one copy, so the number of copies is likely several times three.
  • The less correlated the copies the safer. Presumably, each of the three services is running different software, is under different administration, and is geographically distant from the others.* So although the correlations among the copies at each service will be quite high, the correlation among the different services will be low.
  • The more reliable each copy the safer. Each of the three services is striving to be reliable, so the level of reliability of each of them will be relatively high.
  • The faster failures are detected and repaired the safer. Presumably, each of the three services is running fixity checks on a different, uncorrelated schedule.
Overall the assessment is that, despite a number that appears to have no theoretical justification, in the real world content preserved at three independent services is probably safe against anything short of economic collapse or a major Coronal Mass Ejection. While both these threats are plausible, as I discussed in Seeds Or Code?, if they happen society is likely to have bigger problems than re-reading old academic papers.

*) Note that it is more and more likely that the services are using Cloud For Preservation, and it is quite possible that they are all using the same cloud provider. The Keepers Registry should be recording whether, and if so which, cloud providers underlie each of the services they are tracking.

1 comment:

Greg Janée said...

My argument: If you have two copies and lose one due to failure, then you're down to one copy, which is a precarious position to be in. If you have three copies and lose one, then you're in a degraded state of having only two copies, but still, you've got a backup during the interim period when you're reconstructing your third copy. In this way three copies is qualitatively different from (and better than) two copies.