In a talk at the San Diego Supercomputer Center in September 2006 I started arguing (pdf) that one of the big problems in digital preservation is that we don't know how to measure how well we are doing it, which makes it difficult to improve. Because supercomputer people like large numbers, I used the example of keeping a petabyte of data for a century to illustrate the problem. This post expands on that argument.
Let's start by assuming an organization has a petabyte of data that will be needed in 100 years. They want to buy a preservation system good enough that there is a 50% chance that, at the end of the 100 years, every bit in the petabyte will have survived undamaged. This requirement sounds reasonable, but it is actually very challenging. A petabyte is 8×10^15 bits, so they want 0.8 exabit-years of preservation with a 50% chance of success. Suppose the system they buy suffers from bit rot, a process with a very small probability of flipping any given bit at random. By analogy with the radioactive decay of atoms, they need the half-life of bits in the system to be at least 0.8 exa-years, or roughly 100,000,000 times the age of the universe.
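The required half-life follows from a short calculation. Here is a sketch in Python, under the simplifying assumption that each bit decays independently, exactly like a radioactive atom:

```python
PETABYTE_BITS = 8e15       # one petabyte is 8 * 10**15 bits
YEARS = 100
AGE_OF_UNIVERSE = 1.38e10  # years, approximate

# Each bit survives the century with probability (1/2)**(YEARS / half_life).
# For a 50% chance that *all* bits survive, that probability raised to the
# power PETABYTE_BITS must be at least 0.5. Taking logs, this simplifies to
# half_life >= YEARS * PETABYTE_BITS.
required_half_life = YEARS * PETABYTE_BITS  # years
print(f"required half-life: {required_half_life:.1e} years")  # 8.0e+17
print(f"multiple of the age of the universe: "
      f"{required_half_life / AGE_OF_UNIVERSE:.1e}")          # 5.8e+07
```

The exact multiple comes out nearer 6×10^7 than 10^8, but either way the order of magnitude makes the point.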
In order to be confident that they are spending money wisely, the organization commissions an independent test lab to benchmark the competing preservation systems. The goal is to measure the half-life of bits in each system to see whether it meets the 0.8 exa-year target. The contract for the testing specifies that results are needed in a year. What does the test lab have to do?
The lab needs to assemble a big enough test system so that, if the half-life is exactly 0.8 exa-year, it will see enough bit flips to be confident that the measurement is good. Say it needs to see 5 bit flips or fewer to claim that the half-life is long enough. Then the lab needs to test an exabyte of data for a year.
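To check the sizing, the expected number of flips in an exabyte held for one year at exactly the target half-life follows from the exponential-decay rate ln(2)/half-life. A quick sketch, again assuming independent exponential decay of each bit:

```python
import math

EXABYTE_BITS = 8e18      # one exabyte in bits
TARGET_HALF_LIFE = 8e17  # years, i.e. 0.8 exa-years
TEST_YEARS = 1

# With exponential decay each bit flips at rate ln(2) / half_life per year,
# so the expected flip count over the whole test is:
expected_flips = EXABYTE_BITS * math.log(2) / TARGET_HALF_LIFE * TEST_YEARS
print(f"expected flips in one year: {expected_flips:.1f}")  # ~6.9
```

About seven expected events is the same ballpark as the handful of flips the lab needs to see; with less data under test, the expected count drops proportionally and the statistics get correspondingly worse.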
The test consists of writing an exabyte of data into the system at the start of the year and reading it back several times, let's say nine times, during the year to compare the bits that come out with the bits that went in. One write plus nine reads of 8 exabits each is 80 exabits of I/O in one year, or roughly 10 petabits/hour, which is an I/O rate of about 3 terabits/sec. That is 3,000 gigabit Ethernet interfaces running at full speed continuously for the whole year.
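The I/O arithmetic can be checked directly; a minimal sketch:

```python
EXABYTE_BITS = 8e18
PASSES = 10  # one write at the start plus nine read-back comparisons

SECONDS_PER_YEAR = 365.25 * 24 * 3600
total_bits = PASSES * EXABYTE_BITS        # 80 exabits of I/O over the year
rate_bps = total_bits / SECONDS_PER_YEAR  # sustained bits per second
print(f"sustained rate: {rate_bps / 1e12:.1f} Tbit/s")                # ~2.5
print(f"gigabit Ethernet links at line rate: {rate_bps / 1e9:.0f}")   # ~2500
```

The unrounded figure is about 2.5 Tbit/s and 2,500 links; the round numbers in the text are the same calculation with looser rounding at each step.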
At current storage prices the storage for the test system alone will cost hundreds of millions of dollars. Adding the cost of the equipment to sustain the I/O and do the comparisons, plus the software, staff, power and so on, it's clear that a test to discover whether a system could keep a petabyte of data for a century with a 50% chance of success would cost in the billion-dollar range. That is of the order of 1,000 times the purchase price of the system being tested, so the test isn't feasible.
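A back-of-envelope version of the storage-cost claim, assuming an illustrative raw media price of $100 per terabyte (my number for the sake of the sketch, not one from the text):

```python
EXABYTE_TB = 1e6      # one exabyte in terabytes
PRICE_PER_TB = 100.0  # dollars per terabyte; illustrative assumption

storage_cost = EXABYTE_TB * PRICE_PER_TB
print(f"raw storage alone: ${storage_cost / 1e6:.0f} million")  # $100 million
```

Even at aggressively low media prices the raw storage runs into nine figures before any of the I/O gear, software, staff, or power is counted, so a billion-dollar total for the whole test is plausible.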
I'm not an expert on experimental design, and this is obviously a somewhat simplistic thought-experiment. But suppose the purchasing organization were prepared to spend 1% of the purchase price per system on such a test. The test would then have to cost roughly 100,000 times less than my thought-experiment to be affordable. I leave this 100,000-fold improvement as an exercise for the reader.