In a comment on my "Petabyte for a Century" post, Chris Rushbridge argues that the Petabyte for a Century challenge is not a big deal. His reasoning is that his machine holds 100GB of data, and he expects much higher reliability than a 50% chance of it surviving undamaged for a year, which would correspond to a bit half-life of the order of 100 times the age of the universe.
It is true that disk, tape and other media are remarkably reliable, and that we can not only construct systems with a bit half-life of the order of 100 times the age of the universe, but also conduct experiments that show that we have done so. Watching a terabyte of data for a year is clearly a feasible experiment, and at a bit half-life of 100 times the age of the universe one would expect to see roughly 5 bit flips.
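The arithmetic behind these figures can be sketched as follows. This is a rough model, not a measurement: it assumes bits fail independently with a constant half-life, takes the age of the universe as about 13.8 billion years, and uses decimal units (100GB as 10^11 bytes). The function names are my own, chosen for illustration.

```python
import math

AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumed value, roughly 13.8 billion years


def bit_half_life_years(n_bytes, years, survival_probability=0.5):
    """Bit half-life at which n_bytes survive `years` with no flipped bit
    with the given probability, assuming independent bit failures."""
    n_bits = 8 * n_bytes
    # P(no flips) = 2 ** (-n_bits * years / half_life); solve for half_life.
    return n_bits * years * math.log(2) / -math.log(survival_probability)


def expected_bit_flips(n_bytes, years, half_life_years):
    """Expected number of flipped bits after `years` at the given half-life."""
    n_bits = 8 * n_bytes
    # expm1 keeps precision when the per-bit flip probability is tiny.
    per_bit = -math.expm1(-years * math.log(2) / half_life_years)
    return n_bits * per_bit


# 100GB with a 50% chance of surviving one year undamaged.
chris = bit_half_life_years(100e9, 1)
print(chris / AGE_OF_UNIVERSE_YEARS)

# One terabyte watched for a year at 100 times the age of the universe.
print(expected_bit_flips(1e12, 1, 100 * AGE_OF_UNIVERSE_YEARS))
```

Under these assumptions the first figure comes out at roughly 60 times the age of the universe and the second at about 4 bit flips, both consistent with the order-of-magnitude numbers quoted above.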
Nevertheless, it is important to note that this is an experiment very few people actually do. Does Chris maintain checksums of every bit of his 100GB? Does he check them regularly? How certain is he that at the end of the year every single bit is the same as it was at the start? I suspect that Chris assumes that, because he has 100GB of data, most of it over a year old, and he hasn't noticed anything bad, the problem isn't that hard. Even if all these assumptions were correct, the petabyte for a century problem is one million times harder: 10,000 times the data for 100 times as long. Chris' argument amounts to saying "I have a problem one-millionth the size of the big one, and because I haven't looked very carefully I believe that it is solved. So the big problem isn't scary after all."
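For anyone who does want to run the experiment on their own data, a minimal sketch of a checksum audit might look like the following. It assumes the data lives in ordinary files under one directory and that SHA-256 is an acceptable digest; the function names and the in-memory manifest are hypothetical choices for illustration, not a description of any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Compare the files under root with an earlier manifest.

    Returns (changed, missing): names whose digest differs, and names
    that were in the manifest but are no longer present.
    """
    current = build_manifest(root)
    changed = sorted(
        name for name, digest in manifest.items()
        if name in current and current[name] != digest
    )
    missing = sorted(name for name in manifest if name not in current)
    return changed, missing
```

One would build the manifest at the start of the year, store it somewhere separate from the data, and rerun `audit` at intervals. A changed digest does not distinguish silent corruption from a deliberate edit, so the audit is only meaningful for files that are supposed to be static.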
The few people who have actually measured silent data corruption in large operational data storage systems have reported depressing results. For example, the excellent work at CERN described in a paper (pdf) and summarized at StorageMojo showed that the error rate delivered to applications from a state-of-the-art storage farm is of the order of ten million times worse than the quoted bit error rate of the disks it uses.
We know that assembling large numbers of components into a system normally results in a system much less reliable than the components. And we have evidence from CERN, Google and elsewhere (pdf) that this is what actually happens when you assemble large numbers of disks, controllers, busses, memories and CPUs into a storage system. And we know that these systems contain large amounts of software, which contains large (pdf) numbers (pdf) of bugs. And we know that it is economically and logistically impossible to do the experiments that would be needed to certify a system as delivering a bit error rate low enough to provide a 50% probability of keeping a petabyte uncorrupted for a century.
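To get a feel for the scale of such an experiment, here is a back-of-the-envelope sketch. It assumes independent bit failures with a constant half-life, a one-year observation period, and that seeing about 5 bit flips is the minimum needed for a measurement; all three are simplifying assumptions of mine, and the function names are illustrative.

```python
import math


def required_bit_half_life_years(n_bytes, years, survival_probability=0.5):
    """Bit half-life needed for n_bytes to survive `years` with no flipped
    bit with the given probability, assuming independent bit failures."""
    return 8 * n_bytes * years * math.log(2) / -math.log(survival_probability)


def bytes_to_watch(half_life_years, expected_flips, watch_years=1.0):
    """Amount of data to monitor for watch_years in order to expect
    `expected_flips` bit flips at the given bit half-life."""
    # expm1 keeps precision when the per-bit flip probability is tiny.
    per_bit = -math.expm1(-watch_years * math.log(2) / half_life_years)
    return expected_flips / per_bit / 8


PETABYTE = 1e15  # decimal petabyte, an assumption

target = required_bit_half_life_years(PETABYTE, 100)
sample = bytes_to_watch(target, expected_flips=5)
print(f"required bit half-life: {target:.1e} years")
print(f"data to watch for one year: {sample / PETABYTE:.0f} petabytes")
```

Under these assumptions the required bit half-life is about 8 x 10^17 years, and expecting even 5 flips in a year means watching roughly 720 petabytes, every bit of it checked. A handful of observed flips would still give only a crude estimate of the rate, so a real certification would need considerably more.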
The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.