Sunday, April 29, 2007

Format Obsolescence: Scenarios

This is the first of a series of posts in which I'll argue that much of the discussion of digital preservation, which focuses on the problem of format obsolescence, has failed to keep up with the evolution of the market and the technology. The result is that the bulk of the investment in the field is going to protecting content that is not at significant risk from events that are unlikely to occur, while at-risk content is starved of resources.

There are several format obsolescence "horror stories" often used to motivate discussion of digital preservation. I will argue that they are themselves now obsolete. The community of funders and libraries is currently investing primarily in preserving academic journals and related materials published on the Web. Are there realistic scenarios in which the formats of this content would become obsolete?

The most frequently cited "horror story" is that of the BBC Domesday Project. In 1986 the BBC created a pair of video disks, hardware enhancements and software for the Acorn-based BBC Micro home computer, which together formed a virtual exhibition celebrating the 900th anniversary of the Domesday Book. By 2002 the hardware was obsolete and the video disks were decaying. In a technical tour de force, the CAMiLEON project, a collaboration among Leeds University, the University of Michigan and the UK National Archives, rescued it by capturing the video from the media and building an emulator for the hardware that ran on a Windows PC.

The Domesday Book example shares certain features with almost all the "horror stories" in that it involves (a) off-line content, (b) in little-used, proprietary formats, (c) published for a limited audience and (d) a long time ago. The market has moved on since these examples; the digital preservation community now focuses mostly on on-line content published in widely-used, mostly open formats for a wide audience. This is the content that, were it on paper, would be in library collections. It matches the Library of Congress collection practice, which is the "selection of best editions as authorized by copyright law. Best editions are generally considered to be works in their final state." By analogy with libraries' paper collections, the loss or unreadability of this content would severely impact our culture. Mitigating these risks surely justifies significant investment.

How might this content be lost? Experience starting with the Library of Alexandria shows that the way to ensure that content survives is to distribute copies across a range of independent repositories. This was the way printed paper worked for hundreds of years, but the advent of the Web changed the ground rules. Now, readers gain temporary access to the original publisher's copy; there is no distribution of long-lived copies as a side-effect of providing access to the content. As we have seen with music, and as we are seeing with video, once this mechanism becomes established its superior economics rapidly supplant any distribution channel involving physical artefacts. Clearly, no matter how careful web publishers intend to be with their content, the risk of loss is greater than with a proliferation of physical copies. Simply keeping the bits from being lost is the sine qua non of digital preservation, and it's not as easy as people think (a subject of future posts).
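To make "keeping the bits" a little more concrete, here is a minimal sketch in Python, with hypothetical replica paths and file names, of the kind of fixity check a repository might run to detect a damaged or missing copy by comparing digests across independent replicas:

    import hashlib
    import os

    # Hypothetical locations of three independent replicas of the same collection.
    REPLICAS = ["/replica/a", "/replica/b", "/replica/c"]

    def digest(path):
        """Return the SHA-1 digest of a file, read in 1 MB chunks."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def check(name):
        """Compare the digests of one file across all replicas and report disagreement."""
        digests = {r: digest(os.path.join(r, name)) for r in REPLICAS}
        if len(set(digests.values())) > 1:
            print("MISMATCH:", name, digests)

    check("e-journal/vol1/issue1/article.pdf")

Real systems do much more than this (scheduling, repair, tamper resistance), which is part of why it is harder than it looks.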

Let's assume we succeed in avoiding loss of the bits; how might those bits become unreadable? Let's look at how they can be rendered now, and try to construct a scenario in which this current rendering process would become impossible.

I'm writing this on my desktop machine. It runs the Ubuntu version of Linux, with the Firefox browser. Via the Stanford network and Stanford's subscriptions I have access to a vast range of e-journals and other web resources, as well as to the huge variety of open access resources. I've worked this way for several years, since I decided to eliminate Microsoft software from my life. Apart from occasionally lower rendering quality than on my PowerBook, I don't have problems reading e-journals or other web resources. Almost all formats are rendered using open source software in the Ubuntu distribution; for a few, such as Adobe's Flash, the browser uses a closed-source binary plugin.

Let's start by looking at the formats for which an open source renderer exists (HTML, PDF, the Microsoft Office formats, and so on). The source code for an entire software stack capable of rendering each of these formats, from the BIOS through the boot loader, the operating system kernel and the browser to the PostScript and PDF interpreters and the Open Office suite, is in ASCII, a format that will not itself become obsolete. The code is carefully preserved in a range of source code repositories. The developers of the various projects don't rely solely on these repositories; they also keep regular backups. The LOCKSS program is typical: we keep multiple backup copies of our SourceForge repository, synchronized nightly, and could switch to any one of them at a moment's notice. All the tools needed to build a working software stack are also preserved in the same way, and regularly exercised (most open source projects have automatic build and test processes that are run at least nightly).
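As an illustration of the kind of nightly synchronization described above, here is a minimal sketch, assuming Subversion's svnsync mirroring tool and hypothetical repository locations, of a job that keeps a read-only backup copy of a project's repository current:

    import subprocess

    # Hypothetical mirror; it must have been prepared once with "svnsync init",
    # pointing it at the master repository (e.g. a SourceForge-hosted project).
    MIRROR = "file:///backups/project-mirror"

    def sync_mirror():
        """Pull any revisions committed since the last run into the mirror."""
        subprocess.run(["svnsync", "sync", MIRROR], check=True)

    if __name__ == "__main__":
        sync_mirror()

Run from cron each night, a handful of such jobs is all it takes to maintain several independent, up-to-date copies of a project's entire history.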

As if this wasn't safe enough, in most cases there are multiple independent implementations of each layer of functionality in the stack. For example, at the kernel layer there are at least 5 independent open source implementations capable of supporting this stack (Linux, FreeBSD, NetBSD, OpenBSD and Solaris). As if even this wasn't safe enough, this entire stack can be built and run on a large number of different CPU architectures (NetBSD supports 16 of them). Even if the entire base of Intel architecture systems stopped working overnight, in which case format obsolescence would be the least of our problems, this software stack would still be able to render the formats just as it always did, although on a much smaller total number of computers. In fact, almost all the Windows software would continue to run (albeit a bit slower) since there are open source emulations of the Intel architecture. Apple used similar emulation technology during their transitions from the Motorola 68000 to PowerPC, and PowerPC to Intel architectures.
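To make the emulation point concrete, the sketch below, with a hypothetical disk image and machine settings, boots an open source stack built for a non-Intel architecture under the QEMU emulator on an ordinary PC:

    import subprocess

    # Hypothetical disk image containing an operating system built for PowerPC.
    DISK_IMAGE = "netbsd-powerpc.img"

    def boot_emulated_machine():
        """Boot the PowerPC disk image on a software-emulated machine."""
        subprocess.run([
            "qemu-system-ppc",   # QEMU's PowerPC system emulator
            "-m", "256",         # 256 MB of emulated RAM
            "-hda", DISK_IMAGE,  # the guest's hard disk image
            "-nographic",        # use a serial console instead of an emulated display
        ], check=True)

    boot_emulated_machine()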

What's more, the source code is preserved in source code control systems, such as Subversion. These systems ensure that the state of the source tree as it was at any point in the past can be reconstructed. Since all the code is handled this way, the exact state of the entire stack at the time that some content was rendered correctly can be recreated.
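Subversion, for example, can produce the tree exactly as it stood on a given date. Here is a minimal sketch, with a hypothetical repository URL and date, of recreating the source of a renderer as it was when some content was known to render correctly:

    import subprocess

    # Hypothetical repository URL and a date at which rendering was known to work.
    REPO = "https://example.org/svnroot/renderer/trunk"
    DATE = "{2007-04-29}"  # Subversion syntax: the last revision on or before this date

    def checkout_as_of(workdir):
        """Check out the source tree exactly as it stood on the given date."""
        subprocess.run(["svn", "checkout", "-r", DATE, REPO, workdir], check=True)

    checkout_as_of("renderer-as-of-2007-04-29")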

But what of the formats for which there is no open source renderer, only a closed-source binary plugin? Flash is the canonical example, but in fact there is an open source Flash player; it is just some years behind Adobe's current one. This is very irritating for partisans of open source, who are forced to use Adobe's plugin to view recent content, but it may not be critical for digital preservation. After all, if preservation ever needs an open source renderer, that need will, by definition, arise many years after the original release of the new format. There will be time for the open source renderer to emerge. But even if it doesn't, and even if subsequent changes to the software into which the plugin is plugged make it stop working, we have seen that the entire software stack as it was at a time when the plugin worked can be recreated. So provided that the binary plugin itself survives, the content can still be rendered.

Historically, the open source community has developed rendering software for almost all proprietary formats that have achieved wide use, if only after a significant delay. The Microsoft Office formats are a good example. Several sustained and well-funded efforts, including Open Office, have resulted in adequate, if not pixel-perfect, support for these formats. The Australian National Archives' preservation strategy is based on using these tools to preemptively migrate content from proprietary formats to open formats before it is preserved. Indeed, the availability of open source alternatives is now making it difficult for Microsoft to continue imposing proprietary formats on their customers.
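As a hedged illustration of such preemptive migration, the sketch below drives an office suite's command-line converter (the flags shown are those of LibreOffice, a present-day descendant of Open Office; the file name is hypothetical) to turn a proprietary word-processing file into the open ODF format:

    import subprocess

    # Hypothetical input file in a proprietary format.
    INPUT = "report.doc"

    def migrate_to_open_format(outdir="migrated"):
        """Convert a proprietary word-processing file to OpenDocument text."""
        subprocess.run([
            "soffice", "--headless",  # run the office suite without a GUI
            "--convert-to", "odt",    # target the open ODF text format
            "--outdir", outdir,
            INPUT,
        ], check=True)

    migrate_to_open_format()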

Even the formats which pose the greatest problems for preservation, those protected by DRM technology, typically have open source renderers, normally released within a year or two of the DRM-ed format's release. The legal status of a preservation strategy that used such software, or software arguably covered by patents such as MP3 players, would be in doubt. Until the legal issues are clarified, no preservation system can make well-founded claims as to its ability to preserve these formats against format obsolescence. However, in most but not all cases these formats are supported by binary plugins for open source web browsers. If these binary plugins are preserved, we have seen that the software stack into which they plug could be recreated in order to render content in those formats.

It is safe to say that the software environment needed to support rendering of most current formats is preserved much better than the content being rendered.

If we ask "what would have to happen for these formats no longer to be renderable?", we are forced to invent implausible scenarios in which not just all the independent repositories holding the source code of the independent implementations of one layer of the stack are lost, but also all the backup copies of the source code held by the developers of all these projects, and all the much larger number of copies of the binaries of that layer.

What has happened to make the predictions of the impending digital dark ages less menacing, at least as regards published content? First, off-line content on hardware-specific media has come to be viewed simply as a temporary backup for the primary on-line access copy. Second, publishing information on-line in arcane, proprietary formats is self-defeating. The point of publishing is to get the content to as many readers as possible, so publishers use popular formats. Third, open source environments have matured to the point where, with their popular and corporate support, only the most entrenched software businesses can refuse to support their use. Fourth, experience has shown that, even if a format is proprietary, if it is popular enough the open source community will support it effectively.

The all-or-nothing question that has dominated discussion of digital preservation has been how to deal with format obsolescence, whether by emulating the necessary software environment or by painstakingly collecting "preservation metadata" in the hope that it will make future format migration possible. It turns out that the "preservation metadata" that is really needed for a format is an open source renderer for that format, and that the community is creating these renderers for reasons that have nothing to do with preservation.

Of course, one must admit that reconstructing the entire open source software stack is not very convenient for the eventual reader, and could be expensive. Thus the practical questions about the obsolescence of the formats used by today's readers are really how convenient it will be for the eventual reader to access the content, and how much will have to be spent, and when, in order to reach that level of convenience. The next post in this series will take up these questions.

These ideas have evolved from those in a paper called Transparent Format Migration of Preserved Web Content that we published in 2005. It explained the approach the LOCKSS program takes to format migration. LOCKSS is a trademark of Stanford University.
