Wednesday, October 6, 2010

"Petabyte for a Century" Goes Main-Stream

I started writing about the insights to be gained from the problem of keeping a Petabyte for a century four years ago in September 2006. More than three years ago in June 2007 I blogged about them. Two years ago in September 2008 these ideas became a paper at iPRES 2008 (PDF). After an unbelievable 20-month delay from the time it was presented at iPRES, the International Journal of Digital Preservation finally published almost exactly the same text (PDF) in June 2010.

Now, an expanded and improved version of the paper, including material from my 2010 JCDL keynote, has appeared in ACM Queue.

Alas, I'm not quite finished writing on this topic. I was too busy when I was preparing this article and so I failed to notice an excellent paper by Kevin Greenan, James Plank and Jay Wylie, Mean time to meaningless: MTTDL, Markov models, and storage system reliability.

They agree with my point that MTTDL is a meaningless measure of storage reliability, and that bit half-life isn't a great improvement on it. They propose instead NOMDL (NOrmalized Magnitude of Data Loss), i.e. the expected number of bytes that the storage will lose in a specified interval divided by its usable capacity. As they point out, it is possible to compute this using Monte Carlo simulation based on distributions of component failures that experiments have shown to fit the real world. These simulations produce estimates that are relatively credible, especially compared to the ludicrous estimates I pillory in the article.
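As a rough illustration of the NOMDL definition above, here is a minimal Monte Carlo sketch. It is not Greenan, Plank and Wylie's simulator. It assumes a toy system of two mirrored disks, Weibull-distributed times to failure, a fixed rebuild time, and total loss of the usable capacity if the second disk fails before the rebuild of the first completes. The function names and all parameter values are hypothetical, chosen only to show the shape of the computation.

```python
import random


def simulate_mirror_loss(mission_hours, capacity_bytes, shape, scale_hours,
                         repair_hours, rng):
    """Run one trial and return the bytes lost by a two-disk mirror.

    Toy assumptions: disk lifetimes are Weibull(scale_hours, shape), a failed
    disk is replaced after a fixed repair_hours, and all data is lost if the
    surviving disk fails before the replacement is rebuilt.
    """
    # Absolute time at which each disk next fails.
    t_fail = [rng.weibullvariate(scale_hours, shape) for _ in range(2)]
    while True:
        first = 0 if t_fail[0] <= t_fail[1] else 1
        other = 1 - first
        t = t_fail[first]
        if t >= mission_hours:
            return 0  # no further failures inside the interval of interest
        if t_fail[other] < min(t + repair_hours, mission_hours):
            return capacity_bytes  # second failure during rebuild: data lost
        # Rebuild succeeded; the replacement disk gets a fresh lifetime.
        t_fail[first] = t + repair_hours + rng.weibullvariate(scale_hours, shape)


def estimate_nomdl(trials, mission_hours, capacity_bytes, shape, scale_hours,
                   repair_hours, seed=0):
    """Estimate NOMDL: expected bytes lost in the interval / usable capacity."""
    rng = random.Random(seed)
    total_lost = sum(
        simulate_mirror_loss(mission_hours, capacity_bytes, shape,
                             scale_hours, repair_hours, rng)
        for _ in range(trials)
    )
    return total_lost / trials / capacity_bytes


if __name__ == "__main__":
    # Hypothetical parameters: 10-year interval, 1 TB usable, 24-hour rebuild.
    print(estimate_nomdl(trials=100_000, mission_hours=10 * 8760,
                         capacity_bytes=10**12, shape=1.2,
                         scale_hours=500_000, repair_hours=24))
```

A realistic simulation would replace the Weibull parameters with distributions fitted to field data and would model partial losses such as latent sector errors, which this sketch ignores.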

NOMDL is a far better measure than MTTDL. Greenan, Plank and Wylie are to be congratulated for proposing it. However, it is not a panacea. It is still the result of models based on data, rather than experiments on the system in question. The major points of my article still stand:
  • That the reliability we need is so high that benchmarking systems to assure that they exceed it is impractical.

  • That projecting the reliability of storage systems from simulations driven by component reliability distributions is likely to be optimistic, given both the observed auto- and long-range correlations between failures, and the inability of the models to capture the major causes of data loss, such as operator error.
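The first point can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a deliberately simple model in which each bit decays independently with a common half-life, a decimal petabyte of 8×10^15 bits, and a target of a 50% chance that no bit is damaged after a century. The function names are hypothetical.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15  # assumes a decimal petabyte (10**15 bytes)


def required_half_life_years(n_bits, years, p_no_damage=0.5):
    """Bit half-life H needed so that all n_bits survive `years`.

    With independent bits, P(all survive) = 2 ** (-n_bits * years / H).
    Solving for H gives n_bits * years * ln(2) / -ln(p_no_damage).
    """
    return n_bits * years * math.log(2) / -math.log(p_no_damage)


def expected_flips(n_bits, years, half_life_years):
    """Expected number of damaged bits when watching n_bits for `years`."""
    # Each bit is damaged within `years` with probability 1 - 2**(-years/H);
    # expm1 keeps precision when that probability is tiny.
    p_flip = -math.expm1(-math.log(2) * years / half_life_years)
    return n_bits * p_flip


if __name__ == "__main__":
    h = required_half_life_years(BITS_PER_PETABYTE, 100)
    print(f"required bit half-life: {h:.3g} years")
    print(f"expected flips in a one-petabyte, one-year benchmark: "
          f"{expected_flips(BITS_PER_PETABYTE, 1, h):.3g}")
```

Under these assumptions the required half-life is 8×10^17 years, and a benchmark watching a whole petabyte for a year would expect to see about 0.007 damaged bits at exactly the target reliability. Confirming that a system exceeds the target would take far more data, far more time, or both.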

Further, there is still a use for bit half-life. Careful readers will note subtle changes in the discussion of bit half-life between the iPRES and ACM versions. These are due to incisive criticism of the earlier version by Tsutomu Shimomura. The ACM version describes the use of bit half-life thus:
"Even if we are sublimely confident that every source of data loss other than bit rot has been totally eliminated, we still have to run a benchmark of the system’s bit half-life to confirm that it is longer than [required]"
However good simulations of the kind Greenan et al. propose may be, at some point we need to compare them to the reliability that the systems actually deliver.

1 comment:

David. said...

Communications of the ACM has published the ACM Queue version of this paper as the version of record.

Note that the drum-beat of hints that the $/byte curve for disk storage will flatten continues.