Spring CNI Plenary: The Video

CNI has now posted the video of Cliff Lynch's introduction, my plenary presentation, and the questions.

I gave a significantly shortened version of this talk at the Sun PASIG meeting in Malta on June 26.

Hard Disk Drives: The Good, the Bad and the Ugly

Jon Elerath just published a wonderful paper in the June 2009 Communications of the ACM entitled "Hard Disk Drives: The Good, the Bad and the Ugly". Everyone, especially anyone who believes bit preservation is a solved problem, should read it. He clearly communicates the incredible complexity of the technology inside the familiar 3.5" drive form factor.

Elerath reviews the range of hard disk failure modes, and shows how difficult it will be for disk manufacturers to keep drive reliability constant as disks get bigger. And even if they succeed, the bit reliability they deliver goes down. He says:
"Multi-terabyte capacity drives using perpendicular recording will be available soon, increasing the probability of both correctable and uncorrectable errors by virtue of the narrowed track widths, lower flying heads, and susceptibility to scratching by softer particle contaminants."
Thus, as I have been saying for a while, just as we are trying to preserve larger and larger numbers of bits, the technologies we use to make those bits reliable are not keeping pace. Elerath concludes:
"Only when these high-probability [failure] events are included in the optimization of the RAID operation will reliability improve. Failure to address them is a recipe for disaster."
I agree that RAID technology needs to adapt to the decreasing bit reliability and longer time to repair of newer disk drives. But, as I argued in my iPRES2008 paper (pdf), even if we do a good job of adapting RAID to cope with these problems we will still be many orders of magnitude below the reliability levels digital preservation needs.

Sheila Morrissey's comment

Spring CNI Plenary: The Remix

This post provides the text of the slides, sources and commentary for the opening plenary that I just gave at the CNI Spring Task Force meeting. The actual slides are available here (PDF). Follow me below the fold for the full details.

Spring CNI Plenary

I can finally reveal the mysterious talk I referred to in this comment; it is the opening plenary at CNI's Spring Task Force meeting one week from today. In essence, the talk is a look back at Jeff Rothenberg's 1995 Scientific American article "Ensuring the Longevity of Digital Documents". It asks:
  • What led Jeff to his dire predictions?
  • Would one make the same dire predictions now?
  • If not, what dire predictions would one make, and why?
This talk arose because Cliff Lynch invited me to give a talk at UC Berkeley's School of Information last November. He liked it enough to ask me to give it at CNI and I agreed, thinking I had already written it. But once I thought about it more, and considered the very different audience, I realized that it needed to be almost completely rewritten. I'm still revising it, based on feedback from giving it to the Stanford Library staff.

CNI will post the slides after the talk, and I plan to post here a commentary on them providing links to sources and additional details. You will be able to see how the discussions here were a very valuable resource. Thank you all.

2009 FAST conference

I attended the 2009 FAST Conference in San Francisco almost a month ago. This post was delayed because, as I mentioned in this comment, I've been working on an important talk which draws from recent discussions on this blog. The papers are on the web following Usenix's commendable open access policy. Follow me below the fold for my highlights.

Postel's Law

In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:

"Be conservative in what you do; be liberal in what you accept from others."

It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.

On the Digital Curation Centre Associates mail list, Adil Hasan started a discussion by asking:

"Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"

No-one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers into one of three categories: not well-formed, well-formed but not valid, or well-formed and valid. A significant proportion were classified as either not well-formed or well-formed but not valid.

These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems to develop techniques for extracting links from HTML that are very tolerant of malformed input: an application of Postel's Law.
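
To make the "accept" side of Postel's Law concrete, here is a minimal sketch of tolerant link extraction. It is an illustration only, not the LOCKSS crawler's code: it assumes Python's standard html.parser module, and the names TolerantLinkExtractor and extract_links and the example URLs are hypothetical. The point is that the extractor never asks whether the page is valid; it simply collects whatever link attributes it can recognize.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TolerantLinkExtractor(HTMLParser):
    """Collect link targets without requiring the markup to be valid.

    Hypothetical helper for illustration; not the LOCKSS implementation.
    """

    # Attributes that commonly carry URLs (an illustrative, incomplete set).
    URL_ATTRS = {"href", "src"}

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and accepts
        # unquoted attribute values, so sloppy markup still yields links.
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.links.append(urljoin(self.base_url, value.strip()))


def extract_links(html, base_url):
    """Return absolute link targets found in html, in document order."""
    parser = TolerantLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Unclosed tags, an unquoted attribute, upper-case names, stray end tags.
    malformed = (
        "<html><body><p>Unclosed paragraph "
        "<a href=page2.html>next<img src=\"/img/a.png\">"
        "<A HREF='http://example.org/x'>other</b></i>"
    )
    for link in extract_links(malformed, "http://example.com/dir/index.html"):
        print(link)
```

A validator would reject this input outright; the extractor above still recovers all three links, which is the behavior a preservation crawler needs.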

Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.

Are format specifications important for preservation?

On the Digital Curation Centre Associates mail list, Steven Ranking pointed to the release of Microsoft's specifications for the Office formats under their Open Specification Promise. This sparked a discussion in which two topics were confused: the suitability of Microsoft Office formats for preservation, and the value of the specifications for preservation. As regards the first, I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed. I'm a skeptic about the second; I don't think preserving the specifications contributes anything to practical digital preservation, as I explain below the fold.