On the Digital Curation Centre Associates mailing list, Steven Ranking pointed to the release of Microsoft's specifications for the Office formats under their Open Specification Promise. This sparked a discussion in which two topics were confused: the suitability of Microsoft Office formats for preservation, and the value of the specifications for preservation. As regards the first, I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed. I'm a skeptic about the second; I don't think preserving the specifications contributes anything to practical digital preservation, as I explain below the fold.
Nearly a quarter-century ago, James Gosling and I and a small team at Sun cloned Adobe's PostScript language for the NeWS system. Adobe had just published the PostScript language specification in the "Red Book". We started from this book, but we also had an Apple LaserWriter running Adobe's implementation of the language. When we found something obscure or missing in the book, we could run experiments on the LaserWriter to figure out what our implementation was supposed to do. This is close to what Silicon Valley refers to as a "clean-room" implementation, ensuring that the implementors have access only to public information. Since then, others have repeated the process with even greater fidelity.
So I'm someone with actual experience of implementing a renderer for a format from its specification. Based on this, I'm sure that no matter how careful or voluminous the specification is, there will always be things that are missing or obscure. There is no possibility of specifying formats as complex as Microsoft Office's so comprehensively that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same product, for example between Office on the PC and Office on the Mac.
Those who argue that depositing format specifications in format registries is essential to, or even useful for, digital preservation seem to have in mind a scenario like this. Some time after a format is obsolete and no renderer for it is any longer available, some poor sucker is assigned to retrieve the specification from the format registry and use it to create a brand-new renderer. How likely is this to happen?
The pre-condition for the preserved format specification to be useful is that there is no renderer for the format. That necessarily implies that there is no open source renderer for the format. Logically, there are six possible explanations for this absence. They are quite revealing:
1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for. That's likely to mean that the content in the format isn't worth the effort of writing a new renderer from scratch on the basis of the preserved specifications.
2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent third-party renderers being written. The preserved specifications are not going to change this.
3. An open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken. The preserved specifications are not going to change that.
4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers. It is possible that the (much smaller) digital preservation community would be able to recruit programmers sufficiently better to handle the task without access to a working renderer, but it isn't likely.
5. An open source renderer was written but in the interim was lost. I argue below that open source is far better preserved than the content we are talking about. If open source code is being lost we're unlikely to be able to preserve the content, or even the format specifications.
6. An adequate open source renderer was written, but in the interim stopped working. I have argued elsewhere that the structure of open source makes this unlikely, and history supports this. But even if it did, the cure is not to throw away a once-working renderer and create a new one afresh from the format specification; it would be a far easier task to fix the reason the renderer stopped working. The preserved format specifications are useless for this purpose. What is needed is information about the changes to the operating system that stopped the renderer working. For an open source operating system, this is available from the source code control system, which is also incidentally capable of reconstructing the operating system as it was in the days when the renderer worked.
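The repair route in the last case can be sketched in code. This is a minimal illustration, not a description of any particular tool: it assumes the operating system's history is available as an ordered list of revisions, and that we have some way to check whether the old renderer still works at a given revision. The name `first_breaking_revision`, the `renders_ok` check and the revision labels are all hypothetical.

```python
from typing import Callable, Sequence


def first_breaking_revision(
    revisions: Sequence[str],
    renders_ok: Callable[[str], bool],
) -> str:
    """Find the earliest revision at which the renderer stops working.

    Assumptions (hypothetical setup): revisions are listed oldest first,
    the renderer works at the first one, fails at the last one, and once
    broken it stays broken.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")
    good, bad = 0, len(revisions) - 1
    if not renders_ok(revisions[good]) or renders_ok(revisions[bad]):
        raise ValueError("expected first revision to work and last to fail")
    # Binary search: keep a known-good and a known-bad revision and
    # halve the gap between them until they are adjacent.
    while bad - good > 1:
        mid = (good + bad) // 2
        if renders_ok(revisions[mid]):
            good = mid
        else:
            bad = mid
    return revisions[bad]


# Toy history: eight revisions, with the renderer broken from r6 onwards.
history = [f"r{i}" for i in range(1, 9)]
culprit = first_breaking_revision(history, lambda rev: int(rev[1:]) < 6)
print(culprit)  # r6
```

The point of the sketch is that the search needs only the revision history and a working test, both of which come from the source code control system; the format specification plays no part in it.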
This analysis doesn't look encouraging for the proponents of preserving the specifications. But let's blithely ignore these problems and press on with the assumption that somehow a poor sucker has to create a renderer from the specification. How realistic is this task?
First, we actually know how much work it is to do a clean-room implementation of Microsoft Office's formats. Several open source products have done a credible job of doing so, including Open Office. In the nature of open source development, successive products are able to build on the work done by others, so the total amount of work is greater than any individual effort committed, although less than the total of all efforts. The history of Open Office reveals a very large investment; it was originally developed as a commercial product, and its development continues to be subsidized by Sun Microsystems as a basis for a commercial product. To achieve its current functionality has taken a significant, salaried team more than a decade. It is not credible to expect that this level of effort could be justified by digital preservation activities alone.
Second, the task envisaged is actually far more difficult than a simple clean-room implementation of the format. The whole justification for the task is that there is no functional renderer for the format available. Thus there is no way for the poor sucker to test his interpretation of the specifications against the original. The effort needed to achieve a fidelity of rendering equivalent to Open Office's would therefore be much greater than was required by the Open Office team, who could test their interpretations against Microsoft's code.
Third, the digital preservation world often complains that even Open Office's level of fidelity is inadequate. Many of these criticisms are beside the point; they refer to inaccuracies in Open Office's rendering of the latest Microsoft Office formats. But from the perspective of digital preservation, the relevant criterion is Open Office's rendering of old, in fact obsolete, formats. After all, the precondition for the task of creating a clean-room renderer is that the formats are so obsolete that no functional renderer is available. In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents. And note that the most recent case of Microsoft Office format obsolescence was caused by Microsoft's deliberate decision to remove support for old formats. This was so unpopular that it was rapidly rescinded. No-one is arguing for Open Office to remove support for old formats, and it appears that even Microsoft's ability to do so has expired.
Many of the criticisms of Open Office's fidelity in rendering Microsoft Office documents relate to layout changes between the two renderings. These are beside the point for another reason. The changes are typically caused by small differences between the fonts available in Microsoft Office and in Open Office. They exist not because Open Office incorrectly interprets the Office document format, nor because the Open Office developers were incompetent. They would plague the poor sucker's renderer just as much. Fonts, and in particular the font spacing tables that drive the layout process, are protected by copyright. If the Open Office developers had copied the font spacing tables so exactly that there were no layout changes they might well have been breaking the law.
Just because a document format has gone obsolete does not mean that the fonts used by documents encoded in that format have gone out of copyright. The poor sucker is likely to face even worse intellectual property hurdles than the Open Office developers did. He will probably be faced with the orphan font problem: wanting to get permission to use a copyrighted font but being unable to find the copyright owner to ask for it. The need to preserve the fonts used by a document as well as the text motivates the ability of PDF to embed the fonts it uses into the document itself.
Fourth, there is behind this discussion an unrealistically black-and-white view of the world. Renderers are software. They all have flaws. Some are better than others, but none is perfect. If we plot the quality achieved by a newly created renderer for a format against the cost of creating it we will get an S curve. A certain amount of money is needed to get to a barely functional renderer. Beyond that, quality increases rapidly at first but after a while the law of diminishing returns sets in. Getting from 99% to 99.9% is very expensive; the cost of getting to 100% is infinite. Emulation of the entire original hardware and software environment is the only way to guarantee 100% fidelity. Anything else means that preserved content will be rendered with flaws. The only real question is how much to spend to get to how close a rendering.
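To make the shape of that curve concrete, here is a minimal sketch that models it as a logistic function. The logistic form, the function names and the `midpoint` and `steepness` parameters are illustrative assumptions, not measurements of any real renderer; the cost units are arbitrary.

```python
import math


def renderer_quality(cost: float, midpoint: float = 10.0, steepness: float = 0.8) -> float:
    """Fraction of perfect fidelity reached for a given cost (assumed logistic S curve)."""
    return 1.0 / (1.0 + math.exp(-steepness * (cost - midpoint)))


def cost_for_quality(quality: float, midpoint: float = 10.0, steepness: float = 0.8) -> float:
    """Invert the curve: the cost needed to reach a given quality.

    Only defined strictly between 0 and 1; no finite cost reaches 1.0.
    """
    if not 0.0 < quality < 1.0:
        raise ValueError("quality must be strictly between 0 and 1")
    return midpoint + math.log(quality / (1.0 - quality)) / steepness


# Compare the extra cost of each step with the fidelity it buys.
for start, end in [(0.50, 0.90), (0.90, 0.99), (0.99, 0.999)]:
    extra = cost_for_quality(end) - cost_for_quality(start)
    print(f"{start:.3f} -> {end:.3f}: extra cost {extra:.2f}, fidelity gained {end - start:.3f}")
```

Under these assumptions each additional "nine" costs roughly the same again while buying about a tenth as much improvement as the one before, and `cost_for_quality(1.0)` has no finite answer.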
Fifth, we have a way to greatly reduce the cost of getting to a given level of fidelity. As we see with Open Office, creating an open source renderer for a format before it goes obsolete is much less costly than doing so afterwards. This is especially true because the open source community will almost always do this on their own initiative, without needing resources from the digital preservation community. They want to access documents in the format here and now; a much more powerful motivator.
Even better, they will then preserve the resulting renderer far better than most digital preservation systems preserve the content entrusted to them. Open source code is in ASCII, so there is no risk of format obsolescence. Just as Creative Commons licenses do for copyright content, open source licenses permit all the activities needed to preserve the code, without negotiation with the copyright owner. Open source code is already preserved in large, well-funded, independently managed repositories such as SourceForge. Further, open source teams maintain many copies of their work, both in the form of nightly backups of their part of the repository, and in the form of working copies of the code. Finally, just like internet protocols, open source development is so decentralized that flag days or changes that break applications are very difficult and time-consuming, and thus very unlikely.
It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format. If, during the currency of the format, it acquires an open source renderer there is no significant risk of ever ending up without a functional renderer. The need for a new one to be created from the specification is extremely unlikely ever to arise. If that unlikely event ever happened, it is hard to believe that resources on the scale needed to do the job would be available. And in the unlikely event that they were, it is unreasonable to believe that the combination of the preserved specification and the available resources would be enough to create a renderer that would satisfy those who reject Open Office because of minor rendering flaws.
Don't let the perfect be the enemy of the good.
Clearly, formats with open source renderers are, for all practical purposes, immune from format obsolescence. Equally, preserving the specifications for formats which lack an open source renderer is likely to be ineffective in assuring future access to content in those formats. Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now. In addition, national libraries should consider collecting and preserving open source repositories such as SourceForge. They are essential to the library's efforts to preserve other important content, such as Web crawls. There are no legal or technical barriers to preservation. And who is to say that the corpus of open source is a less important cultural and historical artifact than, say, romance novels?