"Be conservative in what you do; be liberal in what you accept from others."
It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.
On the Digital Curation Centre Associates mailing list, Adil Hasan started a discussion by asking:
"Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"
No-one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers as one of not well-formed, well-formed but not valid, or well-formed and valid. A significant proportion were classified as either not well-formed or well-formed but not valid.
These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems on techniques for extracting links from HTML that are very tolerant of malformed input, an application of Postel's Law.
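This is not the LOCKSS crawler's actual code, but a minimal sketch of what this kind of tolerant link extraction looks like, using Python's forgiving html.parser; it harvests href and src attributes and simply keeps going over markup that a validator would reject:

```python
# A sketch of Postel's-Law-style link extraction: html.parser keeps parsing
# malformed markup instead of rejecting the document.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect anything that looks like an outgoing reference.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links

# Works even on markup that would fail W3C validation:
print(extract_links('<p><a href=index.html>home<img src="logo.gif">'))
```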
Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.
Why, in a forum devoted to digital curation, would anyone ask about the proportion of PDF files that don't conform to the standards? After all, the PDF files they are asking about are generated by tools. No-one writes PDF by hand. So if they don't conform to the standards, it is because the tool that generated them had a bug in it. Why not report the bug to the tool creator? Because even if the tool creator fixed the bug, the files the tool generated before the fix was propagated would still be wrong. There's no way to recall and re-create them, so digital curators simply have to deal with them.
The saving grace in this situation is that the software, such as Adobe Reader, that renders PDF is constructed according to Postel's Law. It does the best it can to render even non-standard PDF legibly. Because it does so, it is very unlikely that a bug in the generation tool will have a visible effect. And if the bug doesn't have a visible effect, it is very unlikely to be detected, reported and fixed.
Thus we see that a substantial proportion of non-conforming PDF files is to be expected. And it is also to be expected that the non-conforming files will render correctly, since they will have been reviewed by at least one human (the author) for legibility.
Is the idea to report the bugs, which don't have visible effects, to the appropriate tool vendors? This would be a public-spirited effort to improve tool quality, but a Sisyphean task. And it wouldn't affect digital curation of PDF files since, as we have seen, it would have no effect on the existing population of PDF files.
Is the idea to build a PDF repair tool that takes non-conforming PDF files as input and outputs conforming PDF files with identical visual renderings? That would be an impressive feat of programming, but futile. After all, the non-conforming file is highly likely to render correctly without modification. And if it doesn't, how would the repair tool know what rendering the author intended?
Is the idea to reject non-conforming files for preservation or curation? This is the possibility that worried me, as it would be a violation of Postel's Law. To see why I was worried, substitute HTML for PDF. It is well-known that a proportion, perhaps the majority, of web sites contain HTML that fails the W3C conformance tests but that is perfectly legible when rendered by all normal browsers. This isn't a paradox; the browsers are correctly observing Postel's Law. They are doing their best with whatever they are given, and are to be commended for doing so. Web crawls by preservation institutions such as national libraries and the Internet Archive would be very badly advised to run the W3C tests on the HTML they collect and reject any that failed. Such nit-picking would be a massive waste of resources and would cause them to fail in their mission of preserving the Web as a cultural artifact.
And how would an archive reject non-conforming files? By returning them to the submitter with a request to fix the problem? In almost all cases there's nothing the submitter can do to fix the problem. It was caused by a bug in a tool he used, not by error on his part. All the submitter could do would be to transmit the error report to the tool vendor and wait for an eventual fix. This would not be a very user-friendly archive.
So why do digital curators think it is important to use tools such as JHOVE to identify and verify the formats of files? Identifying the format is normally justified on the basis of knowing what formats are being preserved (interesting) and flagging those thought to be facing obsolescence (unlikely to happen in the foreseeable future to the formats we're talking about). But why do curators care that the file conforms to the format specification rather than whether it renders legibly?
The discussion didn't answer this question but it did reveal some important details:
First, although it is true that JHOVE flags a certain proportion of PDF files as not conforming to the standards, it is known that in some cases these are false positives. It is not known what JHOVE's rate of false negatives is, that is, cases in which it fails to flag a file that does not in fact conform. It is hoped that JHOVE2 (PDF), the successor to JHOVE that is currently under development, will have lower error rates. But there don't appear to be any plans to measure these error rates, so it'll be hard to be sure that JHOVE2 is actually doing better.
Second, no-one knows what proportion of files that JHOVE flags as not conforming are not legible when rendered using standard tools such as Adobe Reader or Ghostscript. There are no plans to measure this proportion, either for JHOVE or for JHOVE2. So there is no evidence that the use of these tools contributes to future readers' ability to read the files, which is, after all, the goal of curation. Wouldn't it be a good idea to choose a random sample from the Portico PDFs that JHOVE flags, render them with Ghostscript, print the results, and have someone examine them to see whether they are legible?
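Such an experiment would be straightforward. Here is a minimal sketch of it, assuming a directory of the flagged PDFs and Ghostscript's gs command on the path (the paths and sample size are my assumptions, not Portico's workflow); it picks a random sample and rasterizes each file to page images for a human reviewer to examine:

```python
# A sketch of the legibility check: take a random sample of the PDFs that
# JHOVE flagged, render each with Ghostscript, and leave the page images
# for a human reviewer. Directory names and sample size are assumptions.
import random
import subprocess
from pathlib import Path

FLAGGED_DIR = Path("flagged_pdfs")     # hypothetical directory of flagged files
OUTPUT_DIR = Path("rendered_sample")   # page images land here for review
SAMPLE_SIZE = 50

def render_sample():
    OUTPUT_DIR.mkdir(exist_ok=True)
    pdfs = sorted(FLAGGED_DIR.glob("*.pdf"))
    sample = random.sample(pdfs, min(SAMPLE_SIZE, len(pdfs)))
    for pdf in sample:
        out_pattern = OUTPUT_DIR / (pdf.stem + "-%03d.png")
        # Ghostscript does its best with non-conforming input, per Postel's Law.
        result = subprocess.run(
            ["gs", "-dBATCH", "-dNOPAUSE", "-dQUIET",
             "-sDEVICE=png16m", "-r150",
             "-sOutputFile=" + str(out_pattern), str(pdf)],
            capture_output=True, text=True)
        status = "rendered" if result.returncode == 0 else "failed"
        print(f"{pdf.name}: {status}")

if __name__ == "__main__":
    render_sample()
```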
Third, although Portico classifies the PDF files it receives into the three JHOVE categories, it apparently observes Postel's Law by accepting PDF files for preservation irrespective of the category they are in. If so, they are to be commended.
Fourth, there doesn't seem to be much concern about the inevitable false positives and false negatives in the conformance testing process. The tool that classifies the files isn't magic; it is just a program that purports to implement the specification, which, as I pointed out in a related post, is not perfect. And why would we believe that the programmer writing the conformance tester was capable of a flawless implementation of the specification when his colleagues writing the authoring tools that generated the non-conformances were clearly not? Lastly, absence of evidence is not evidence of absence. If the program announces that the file does not conform, it presumably identifies the non-conforming elements, which can be checked to confirm that the program is correct. Otherwise, it presumably says OK. But what it is really saying is "I didn't find any non-conforming elements." So the estimate of non-conformance from running the program is likely to be an under-estimate; there will be false negatives, non-conforming files that the program fails to detect.
The real question for people who think that JHOVE-like tools are important, either as gatekeepers or as generators of metadata, is "what if the tool is wrong?" There are two possible answers. Something bad happens. That makes the error rate of the tool a really important, but unknown, number. Alternatively, nothing bad happens. That makes the tool irrelevant, since not using it can't be worse than using it and having it give wrong answers.
Thus, to be blunt for effect, we have a part of the ingest pipeline that is considered to be important which classifies files into three categories with some unknown error rate. There is no evidence that these categories bear any relationship to the current or eventual legibility of these files by readers. And the categories are ignored in subsequent processing. Why are we bothering to do this?