"Be conservative in what you do; be liberal in what you accept from others."
Its important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.
On the Digital Curation Centre Associates mail list, Adil Hasan started a discussion by asking:
"Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"
No-one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers as one of not well formed, well-formed and not valid, and well-formed and valid. A significant proportion were classified as either not well formed or well-formed and not valid.
These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems to develop techniques for extracting links from HTML that are very tolerant of malformed input; an application of Postel's Law.
Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.