Thursday, January 15, 2009

Postel's Law

In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:


"Be conservative in what you do; be liberal in what you accept from others."

It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.

On the Digital Curation Centre Associates mail list, Adil Hasan started a discussion by asking:

"Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"


No-one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers as one of not well-formed, well-formed but not valid, or well-formed and valid. A significant proportion were classified as either not well-formed or well-formed but not valid.

These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems to develop techniques for extracting links from HTML that are very tolerant of malformed input; an application of Postel's Law.
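As a rough illustration of what tolerance of malformed input looks like in practice, here is a minimal sketch using the `html.parser` module from Python's standard library. It is not the LOCKSS crawler's code; the names `LinkExtractor` and `extract_links` are invented for this example, and a real crawler would also have to resolve relative URLs and look in many more places for links.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href/src values without caring whether the markup is valid."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html_text):
    """Return every link found, in document order, doing the best we can."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Unclosed tags, unquoted attributes and mismatched nesting: none of this validates.
messy = '<html><body><p><a href="a.html">one<b><a href=b.html>two</p><img src=c.png>'
print(extract_links(messy))
```

The extractor never asks whether the page is valid; it simply reports whatever links it can find, which is the "accept" side of Postel's Law.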

Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.

Why, in a forum devoted to digital curation, would anyone ask about the proportion of PDF files that don't conform to the standards? After all, the PDF files they are asking about are generated by tools. No-one writes PDF by hand. So if they don't conform to the standards, it is because the tool that generated them had a bug in it. Why not report the bug to the tool creator? Because even if the tool creator fixed the bug, the files the tool generated before the fix was propagated would still be wrong. There's no way to recall and re-create them, so digital curators simply have to deal with them.

The saving grace in this situation is that the software, such as Adobe Reader, that renders PDF is constructed according to Postel's Law. It does the best it can to render even non-standard PDF legibly. Because it does so, it is very unlikely that a bug in the generation tool will have a visible effect. And if the bug doesn't have a visible effect, it is very unlikely to be detected, reported and fixed.

Thus we see that a substantial proportion of non-conforming PDF files is to be expected. And it is also to be expected that the non-conforming files will render correctly, since they will have been reviewed by at least one human (the author) for legibility.

Is the idea to report the bugs, which don't have visible effects, to the appropriate tool vendors? This would be a public-spirited effort to improve tool quality, but a Sisyphean task. And it wouldn't affect digital curation of PDF files since, as we have seen, it would have no effect on the existing population of PDF files.

Is the idea to build a PDF repair tool, which takes non-conforming PDF files as input and generates a conforming PDF file that has an identical visual rendering as output? That would be an impressive feat of programming, but futile. After all, the non-conforming file is highly likely to render correctly without modification. And if it doesn't, how would the repair tool know what rendering the author intended?

Is the idea to reject non-conforming files for preservation or curation? This is the possibility that worried me, as it would be a violation of Postel's Law. To see why I was worried, substitute HTML for PDF. It is well-known that a proportion, perhaps the majority, of web sites contain HTML that fails the W3C conformance tests but that is perfectly legible when rendered by all normal browsers. This isn't a paradox; the browsers are correctly observing Postel's Law. They are doing their best with whatever they are given, and are to be commended for doing so. Web crawls by preservation institutions such as national libraries and the Internet Archive would be very badly advised to run the W3C tests on the HTML they collect and reject any that failed. Such nit-picking would be a massive waste of resources and would cause them to fail in their mission of preserving the Web as a cultural artifact.

And how would an archive reject non-conforming files? By returning them to the submitter with a request to fix the problem? In almost all cases there's nothing the submitter can do to fix the problem. It was caused by a bug in a tool he used, not by error on his part. All the submitter could do would be to transmit the error report to the tool vendor and wait for an eventual fix. This would not be a very user-friendly archive.

So why do digital curators think it is important to use tools such as JHOVE to identify and verify the formats of files? Identifying the format is normally justified on the basis of knowing what formats are being preserved (interesting) and flagging those thought to be facing obsolescence (unlikely to happen in the foreseeable future to the formats we're talking about). But why do curators care that the file conforms to the format specification rather than whether it renders legibly?

The discussion didn't answer this question but it did reveal some important details:

First, although it is true that JHOVE flags a certain proportion of PDF files as not conforming to the standards, it is known that in some cases these are false positives. It is not known what JHOVE's rate of false negatives is, which would be cases in which it did not flag a file that in fact did not conform. It is hoped that JHOVE2 (PDF), the successor to JHOVE which is currently under development, will have lower error rates. But there don't appear to be any plans to measure these error rates, so it'll be hard to be sure that JHOVE2 is actually doing better.

Second, no-one knows what proportion of files that JHOVE flags as not conforming are not legible when rendered using standard tools such as Adobe Reader or Ghostscript. There are no plans to measure this proportion, either for JHOVE or for JHOVE2. So there is no evidence that the use of these tools contributes to future readers' ability to read the files which is, after all, the goal of curation. Wouldn't it be a good idea to choose a random sample among the Portico PDFs that JHOVE flags, render them with Ghostscript, print the results and have someone examine them to see if they were legible?
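The experiment suggested above could start from something like the following sketch. It assumes Ghostscript is installed and available as `gs` on the PATH; the function names, the default seed and the output file pattern are hypothetical choices made for this example. A clean exit from Ghostscript says nothing about legibility, so a person still has to look at the rendered pages.

```python
import random
import subprocess


def pick_sample(flagged_paths, sample_size, seed=0):
    """Choose a reproducible random sample of the files the tool flagged."""
    rng = random.Random(seed)
    paths = sorted(flagged_paths)  # fixed order so the seed alone decides the sample
    return rng.sample(paths, min(sample_size, len(paths)))


def ghostscript_command(pdf_path, out_pattern, dpi=72):
    """Build a Ghostscript command line that renders each page to a PNG file."""
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",  # e.g. "page-%03d.png", one file per page
        pdf_path,
    ]


def render(pdf_path, out_pattern):
    """Return True if Ghostscript exited cleanly; a human must still judge legibility."""
    result = subprocess.run(ghostscript_command(pdf_path, out_pattern),
                            capture_output=True)
    return result.returncode == 0
```

Calling `render(path, path + "-%03d.png")` for each path returned by `pick_sample` would produce the page images for someone to inspect.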

Third, although Portico classifies the PDF files it receives into the three JHOVE categories, it apparently observes Postel's Law by accepting PDF files for preservation irrespective of the category they are in. If so, they are to be commended.

Fourth, there doesn't seem to be much concern about the inevitable false positives and false negatives in the conformance testing process. The tool that classifies the files isn't magic; it is just a program that purports to implement the specification which, as I pointed out in a related post, is not perfect. And why would we believe that the programmer writing the conformance tester was capable of flawless implementation of the specification when his colleagues writing the authoring tools generating the non-conformances were clearly not? Lastly, absence of evidence is not evidence of absence. If the program announces that the file does not conform, it presumably identifies the non-conforming elements. They can be checked to confirm that the program is correct. Otherwise, it presumably says OK. But what it is really saying is "I didn't find any non-conforming elements". So the estimate from running the program is likely to be an under-estimate: there will be false negatives, non-conforming files that the program fails to detect.
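To make the two error rates concrete, here is a minimal sketch of how they could be computed if someone took a sample of files and established by hand whether each one really conforms. That hand-checked ground truth is an assumption; as far as the discussion revealed, no such data set exists. The function name and the input layout are invented for this example.

```python
def error_rates(results):
    """Compute a checker's error rates from hand-verified samples.

    results: iterable of (flagged, actually_nonconforming) boolean pairs, where
    "flagged" is the tool's verdict and the second value is the hand-checked truth.
    """
    false_pos = false_neg = true_pos = true_neg = 0
    for flagged, actually_nonconforming in results:
        if flagged and actually_nonconforming:
            true_pos += 1
        elif flagged and not actually_nonconforming:
            false_pos += 1   # tool complained about a conforming file
        elif not flagged and actually_nonconforming:
            false_neg += 1   # tool said OK but the file does not conform
        else:
            true_neg += 1

    conforming = false_pos + true_neg
    nonconforming = false_neg + true_pos
    return {
        # Rates are None when the sample has no files of the relevant kind.
        "false_positive_rate": false_pos / conforming if conforming else None,
        "false_negative_rate": false_neg / nonconforming if nonconforming else None,
    }
```

Without the hand-checked column neither rate can be computed, which is exactly why running the tool alone cannot tell us how often it is wrong.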

The real question for people who think that JHOVE-like tools are important, either as gatekeepers or as generators of metadata, is "what if the tool is wrong?" There are two possible answers. Something bad happens. That makes the error rate of the tool a really important, but unknown, number. Alternatively, nothing bad happens. That makes the tool irrelevant, since not using it can't be worse than using it and having it give wrong answers.

Thus, to be blunt for effect, we have a part of the ingest pipeline that is considered to be important which classifies files into three categories with some unknown error rate. There is no evidence that these categories bear any relationship to the current or eventual legibility of these files by readers. And the categories are ignored in subsequent processing. Why are we bothering to do this?


Sunday, January 4, 2009

Are format specifications important for preservation?

On the Digital Curation Centre Associates mail list, Steven Ranking pointed to the release of Microsoft's specifications for the Office formats under their Open Specification Promise. This sparked a discussion in which two topics were confused: the suitability of Microsoft Office formats for preservation, and the value of the specifications for preservation. As regards the first, I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed. I'm a skeptic about the second; I don't think preserving the specifications contributes anything to practical digital preservation, as I explain below the fold.


Nearly a quarter-century ago, James Gosling and I and a small team at Sun cloned Adobe's PostScript language for the NeWS system. Adobe had just published the PostScript language specification in the "Red Book". We started from this book, but we also had an Apple LaserWriter running Adobe's implementation of the language. When we found something obscure or missing in the book, we could run experiments on the LaserWriter to figure out what our implementation was supposed to do. This is close to what Silicon Valley refers to as a "clean-room" implementation, ensuring that the implementors have access only to public information. Since then, others have repeated the process with even greater fidelity.

So I'm someone with actual experience of implementing a renderer for a format from its specification. Based on this, I'm sure that no matter how careful or voluminous the specification is, there will always be things that are missing or obscure. There is no possibility of specifying formats as complex as Microsoft Office's so comprehensively that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same product, as there are between, for example, Office on the PC and Office on the Mac.

Those who argue that depositing format specifications in format registries is essential to, or even useful for, digital preservation seem to have in mind a scenario like this. Some time after a format is obsolete and no renderer for it is any longer available, some poor sucker is assigned to retrieve the specification from the format registry and use it to create a brand-new one. How likely is this to happen?

The pre-condition for the preserved format specification to be useful is that there is no renderer for the format. That necessarily implies that there is no open source renderer for the format. Logically, there are six possible explanations for this absence. They are quite revealing:

1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for. That's likely to mean that the content in the format isn't worth the effort of writing a new renderer from scratch on the basis of the preserved specifications.

2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent third-party renderers being written. The preserved specifications are not going to change this.

3. An open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken. The preserved specifications are not going to change that.

4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers. It is possible that the (much smaller) digital preservation community would be able to recruit programmers who were better enough to handle the task without access to a working renderer, but it isn't likely.

5. An open source renderer was written but in the interim was lost. I argue below that open source is far better preserved than the content we are talking about. If open source code is being lost we're unlikely to be able to preserve the content, or even the format specifications.

6. An adequate open source renderer was written, but in the interim stopped working. I have argued elsewhere that the structure of open source makes this unlikely, and history supports this. But even if it did, the cure is not to throw away a once-working renderer and create a new one afresh from the format specification; it would be a far easier task to fix the reason the renderer stopped working. The preserved format specifications are useless for this purpose. What is needed is information about the changes to the operating system that stopped the renderer working. For an open source operating system, this is available from the source code control system, which is also incidentally capable of reconstructing the operating system as it was in the days when the renderer worked.

This analysis doesn't look encouraging for the proponents of preserving the specifications. But let's blithely ignore these problems and press on with the assumption that somehow a poor sucker has to create a renderer from the specification. How realistic is this task?

First, we actually know how much work it is to do a clean-room implementation of Microsoft Office's formats. Several open source products have done a credible job of doing so, including Open Office. In the nature of open source development, successive products are able to build on the work done by others, so the total amount of work is greater than any individual effort committed, although less than the total of all efforts. The history of Open Office reveals a very large investment; it was originally developed as a commercial product, and its development continues to be subsidized by Sun Microsystems as a basis for a commercial product. To achieve its current functionality has taken a significant, salaried team more than a decade. It is not credible to expect that this level of effort could be justified by digital preservation activities alone.

Second, the task envisaged is actually far more difficult than a simple clean-room implementation of the format. The whole justification for the task is that there is no functional renderer for the format available. Thus there is no way for the poor sucker to test his interpretation of the specifications against the original. The effort needed to achieve a fidelity of rendering equivalent to Open Office's would therefore be much greater than was required by the Open Office team, who could test their interpretations against Microsoft's code.

Third, the digital preservation world often complains that even Open Office's level of fidelity is inadequate. Many of these criticisms are beside the point; they refer to inaccuracies in Open Office's rendering of the latest Microsoft Office formats. But from the perspective of digital preservation, the relevant criterion is Open Office's rendering of old, in fact obsolete, formats. After all, the precondition for the task of creating a clean-room renderer is that the formats are so obsolete that no functional renderer is available. In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents. And note that the most recent case of Microsoft Office format obsolescence was caused by Microsoft's deliberate decision to remove support for old formats. This was so unpopular that it was rapidly rescinded. No-one is arguing for Open Office to remove support for old formats, and it appears that even Microsoft's ability to do so has expired.

Many of the criticisms of Open Office's fidelity in rendering Microsoft Office documents relate to layout changes between the two renderings. These are beside the point for another reason. The changes are typically caused by small differences between the fonts available in Microsoft Office and in Open Office. They exist not because Open Office incorrectly interprets the Office document format, nor because the Open Office developers were incompetent. They would plague the poor sucker's renderer just as much. Fonts, and in particular the font spacing tables that drive the layout process, are protected by copyright. If the Open Office developers had copied the font spacing tables so exactly that there were no layout changes they may well have been breaking the law.

Just because a document format has gone obsolete does not mean that the fonts used by documents encoded in that format have gone out of copyright. The poor sucker is likely to face even worse intellectual property hurdles than the Open Office developers did. He will probably be faced with the orphan font problem; wanting to get permission to use a copyright font but being unable to find the copyright owner to ask for it. The need to preserve the fonts used by a document as well as the text motivates the ability of PDF to embed the fonts it uses into the document itself.

Fourth, there is behind this discussion an unrealistically black-and-white view of the world. Renderers are software. They all have flaws. Some are better than others, but none is perfect. If we plot the quality achieved by a newly created renderer for a format against the cost of creating it we will get an S curve. A certain amount of money is needed to get to a barely functional renderer. Beyond that, quality increases rapidly at first but after a while the law of diminishing returns sets in. Getting from 99% to 99.9% is very expensive; the cost of getting to 100% is infinite. Emulation of the entire original hardware and software environment is the only way to guarantee 100% fidelity. Anything else means that preserved content will be rendered with flaws. The only real question is how much to spend to get to how close a rendering.

Fifth, we have a way to greatly reduce the cost of getting to a given level of fidelity. As we see with Open Office, creating an open source renderer for a format before it goes obsolete is much less costly than doing so afterwards. This is especially true because the open source community will almost always do this on their own initiative, without needing resources from the digital preservation community. They want to access documents in the format here and now; a much more powerful motivator.

Even better, they will then preserve the resulting renderer far better than most digital preservation systems preserve the content entrusted to them. Open source code is in ASCII, so there is no risk of format obsolescence. Just as Creative Commons licenses do for copyright content, open source licenses permit all the activities needed to preserve the code, without negotiation with the copyright owner. Open source code is already preserved in large, well-funded, independently managed repositories such as SourceForge. Further, open source teams maintain many copies of their work, both in the form of nightly backups of their part of the repository, and in the form of working copies of the code. Finally, just like internet protocols (90K PPT), open source development is so decentralized that flag days or changes that break applications are very difficult and time-consuming, and thus very unlikely.

It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format. If, during the currency of the format, it acquires an open source renderer there is no significant risk of ever ending up without a functional renderer. The need for a new one to be created from the specification is extremely unlikely ever to arise. If that unlikely event ever happened, it is hard to believe that resources on the scale needed to do the job would be available. And in the unlikely event that they were, it is unreasonable to believe that the combination of the preserved specification and the available resources would be enough to create a renderer that would satisfy those who reject Open Office because of minor rendering flaws.

Don't let the perfect be the enemy of the good.

Clearly, formats with open source renderers are, for all practical purposes, immune from format obsolescence. Equally, preserving the specifications for formats which lack an open source renderer is likely to be ineffective in assuring future access to content in those formats. Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now. In addition, national libraries should consider collecting and preserving open source repositories such as SourceForge. They are essential to the library's efforts to preserve other important content, such as Web crawls. There are no legal or technical barriers to preservation. And who is to say that the corpus of open source is a less important cultural and historical artifact than, say, romance novels?
