Thursday, January 15, 2009

Postel's Law

In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:

"Be conservative in what you do; be liberal in what you accept from others."

It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.

On the Digital Curation Centre Associates mail list, Adil Hasan started a discussion by asking:

"Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"


No one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers into one of three categories: not well-formed, well-formed but not valid, and well-formed and valid. A significant proportion were classified as either not well-formed or well-formed but not valid.

These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason, LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems on developing techniques for extracting links from HTML that are very tolerant of malformed input: an application of Postel's Law.
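
To make "tolerant of malformed input" concrete, here is a minimal sketch in Python. It is not the LOCKSS crawler's code and not Gosling's technique; it just uses the standard library's html.parser, which does not stop on unclosed tags, stray end tags or unquoted attributes. The names TolerantLinkExtractor and extract_links are made up for illustration, and treating only href and src attributes as links is a simplifying assumption.

```python
from html.parser import HTMLParser


class TolerantLinkExtractor(HTMLParser):
    """Collect link targets without caring whether the markup is valid."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names, so <A HREF=...>
        # is handled the same way as <a href=...>.
        for name, value in attrs:
            # Assumption for this sketch: only href and src carry links.
            if name in ("href", "src") and value:
                self.links.append(value.strip())


def extract_links(html_text):
    """Return every href/src value found, in document order."""
    parser = TolerantLinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Unclosed <p> and <a>, an unquoted attribute, stray end tags,
    # upper-case names: none of this would pass a validator.
    messy = (
        "<p>Unclosed <a href=\"a.html\">one<a href=b.html>two</b></i>"
        "<img src='c.png'><A HREF=\"d.html\">x"
    )
    print(extract_links(messy))
```

The extractor never asks whether the document is well-formed; it takes whatever links it can find and moves on, which is the "accept" side of Postel's Law in miniature.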

Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.

Sunday, January 4, 2009

Are format specifications important for preservation?

On the Digital Curation Centre Associates mail list, Steven Ranking pointed to the release of Microsoft's specifications for the Office formats under their Open Specification Promise. This sparked a discussion in which two topics were confused: the suitability of Microsoft Office formats for preservation, and the value of the specifications for preservation. As regards the first, I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed. I'm a skeptic about the second; I don't think preserving the specifications contributes anything to practical digital preservation, as I explain below the fold.