Monday, May 7, 2007

Format Obsolescence: the Prostate Cancer of Preservation

This is the second post in a series on format obsolescence. In the first I argued that it is hard to find a plausible scenario in which it would no longer be possible to render a format for which there is an open-source renderer.

In the long run, we are all dead. In the long run, all digital formats become obsolete. Broadly, reactions to this dismal prospect have taken two forms:

- The aggressive form has been to do as much work as possible as soon as possible in the expectation that when format obsolescence finally strikes, the results of this meticulous preparation will pay dividends.

- The relaxed form has been to postpone doing anything until it is absolutely essential, in the expectation that the onward march of technology will mean that the tools available for performing the eventual migration will be better than those available today.

I will argue that format obsolescence is the prostate cancer of digital preservation. It is a serious and ultimately fatal problem. If you were to live long enough you would eventually be diagnosed with it, and a long time later you would die from it. Once it is diagnosed there is as yet no certain cure for it. No prophylactic measures are known to be effective in preventing its onset. All prophylactic measures are expensive and have side-effects. But it is highly likely that something else will kill you first, so "watchful waiting", another name for the "relaxed" approach, is normally the best course of action.

The most important threat to digital preservation is not the looming but distant and gradual threat of format obsolescence. It is rather the immediate and potentially fatal threat of economic failure. No-one has enough money to preserve the materials that should be preserved with the care they deserve. That is why current discussions of digital preservation prominently feature the problem of "sustainability".

The typical aggressive approach involves two main activities that take place while a repository ingests content:

- An examination of the incoming content to extract and validate preservation metadata which is stored along with the content. The metadata includes a detailed description of the format. The expectation is that detailed knowledge of the formats being preserved will provide the repository with advance warning as they approach obsolescence, and assist in responding to the threat by migrating the content to a less obsolete format.

- The preemptive use of format migration tools to normalize the content by creating and preserving a second version of it in a format the repository considers less likely to suffer obsolescence.

The PREMIS dictionary of preservation metadata is a 100-page report defining 84 different entities, although only about half of them are directly relevant to format migration. Although it was initially expected that humans would provide much of this metadata, the volume and complexity of the content to be preserved meant that human-generated preservation metadata was too expensive and too unreliable to be practical. Tools such as JHOVE were developed to extract and validate preservation metadata automatically. Similarly, tools are needed to perform the normalization. These tools are typically open source. Is there a plausible scenario in which it is no longer possible to run these tools? If not, what is the benefit of running them now and preserving their output, as opposed to running them whenever the output is needed?
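
To make the question concrete, here is a toy sketch of what such a tool does at ingest: match a file's leading bytes against a table of known signatures and emit a small metadata record. The field names and the signature table below are my illustrative assumptions, not JHOVE's actual output or PREMIS semantics. Notice that nothing in the computation depends on when it runs; given the preserved bits and the preserved tool, running it at access time produces exactly the same record as running it at ingest.

```python
# A minimal, hypothetical stand-in for a format-identification tool
# such as JHOVE. The record fields and magic-number table are
# illustrative assumptions, not any real tool's output.

import hashlib
import json
import sys
from pathlib import Path

# A few well-known magic numbers; a real tool knows hundreds of formats.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF8": "image/gif",
    b"PK\x03\x04": "application/zip",  # also the container for OOXML and ODF
}

def identify(path: Path) -> str:
    """Guess a MIME type from the file's leading bytes."""
    with path.open("rb") as f:
        head = f.read(16)
    for magic, mime in MAGIC.items():
        if head.startswith(magic):
            return mime
    return "application/octet-stream"

def describe(path: Path) -> dict:
    """Build a minimal metadata record for one preserved file."""
    data = path.read_bytes()
    return {
        "file": path.name,
        "format": identify(path),
        "size": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

if __name__ == "__main__":
    # e.g. python identify.py report.pdf figure.png
    for name in sys.argv[1:]:
        print(json.dumps(describe(Path(name)), indent=2))
```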

Since none of the proponents of the "aggressive" approach are sufficiently confident to discard the original bits, their storage costs are more than double those of the "relaxed" approach: the normalized copy will be about the same size as the original, and storage is also needed for the preservation metadata. Further, the operational costs of the "aggressive" ingest pipeline are significantly higher, since it undertakes much more work, and humans are needed to monitor progress and handle exceptions. The best example of a "relaxed" ingest pipeline is the Internet Archive, which has so far ingested over 85 billion web pages with minimal human intervention.

Is there any hard evidence that either preservation metadata or normalization actually increases the chance of content surviving format obsolescence by enough to justify the increased costs it imposes? Even the proponents of the "aggressive" approach would have to admit that the answer is "not yet". None of the formats in wide use when serious efforts at digital preservation of published content started a decade ago have become obsolete. Nor, as I argued in the previous post, is there any realistic scenario in which they will become obsolete in the near future. Thus both preservation metadata and normalization will remain speculative investments for the foreseeable future.

A large speculative investment in preparations of this kind might be justified if it were clear that format obsolescence was the most significant risk facing the content being preserved. Is that really the case? In the previous post I argued that for the kinds of published content currently being preserved format obsolescence is not a plausible threat, because all the formats being used to publish content have open source renderers. There clearly are formats for which obsolescence is the major risk, but content in those formats is not being preserved. For example, console games use encryption and other DRM techniques (see Bunnie Huang's amazing book for the Xbox example) that effectively prevent both format migration and emulation. Henry Lowood at Stanford Library is working to preserve these games, but using very different approaches.

Many digital preservation systems define levels of preservation; the higher the level assigned to a format, the stronger the "guarantee" of preservation the system offers. For example, PDF gets a higher level than Microsoft Word. Essentially, the greater the perceived difficulty of migrating a format, the lower the effort that will be devoted to preserving it. But the easier a format is to migrate, the lower the risk of obsolescence it faces. So investment, particularly in the "aggressive" approach, concentrates on the low-hanging fruit, which is neither at significant risk of loss nor at significant risk of format obsolescence. A risk-based approach would surely prefer the "relaxed" approach, minimizing up-front and storage costs and thereby freeing up resources to preserve more, and higher-risk, content.

In my next post, I plan to look at what a risk-based approach to investing in preservation would look like, and whether it would be feasible.