Monday, January 17, 2011

Why Migrate Formats? The Debate Continues

I am grateful for two recent contributions to the debate about whether format obsolescence is an exception, or the rule, and whether migration is a viable response to it. I respond to Andy below the fold. Responding to Rob involves some research to clear up what appears to be confusion on my part, so I will postpone that to a later post.

Andy gives up the position that format migration is essential for preservation and moves the argument to access, correctly quoting an earlier post of mine saying that the question about access is how convenient it is for the eventual reader. As Andy says:
What is the point of keeping the bits safe if your user community cannot use the content effectively?
In this shift Andy ends up actually agreeing with much, but not quite all, of my case.

He says, quite correctly, that I argue that a format with an open source renderer is effectively immune from format obsolescence. But that isn't all I'm saying. Rather, the more important observation is that formats are not going obsolete; they continue to be easily renderable by the normal tools that readers use. Andy and I agree that reconstructing the entire open source stack as it was before the format went obsolete is an imposition on an eventual reader. That isn't actually what would have to happen if obsolescence happened, but the more important point is that obsolescence isn't going to happen.

The digital preservation community has failed to identify a single significant format that has gone obsolete in the 15+ years since the advent of the Web, which is one quarter of the entire history of computing. I have put forward a theory that explains why format obsolescence ceased; I have yet to see any competing theory that both explains the lack of format obsolescence since the advent of the Web and, as it would have to in order to support the case for format migration, predicts a resumption in the future. There is unlikely to be any reason for a reader to do anything but use the tools they have to hand to render the content, and thus no need to migrate it to a different format to provide "sustainable access".

Andy agrees with me that the formats of the bulk of the British Library's collection are not going obsolete in the foreseeable future:
The majority of the British Library's content items are in formats like PDF, TIFF and JP2, and these formats cannot be considered 'at risk' on any kind of time-scale over which one might reasonably attempt to predict. Therefore, for this material, we take a more 'relaxed' approach, because provisioning sustainable access is not difficult.
This relaxed approach to format obsolescence, preserving the bits and dealing with format obsolescence if and when it happens, is the one I have argued for since we started the LOCKSS program.

Andy then goes on to discuss the small proportion of the collection whose problem is not that its formats are expected to go obsolete in the future, but that they are hard to render with current tools:
Unfortunately, a significant chunk of our collection is in formats that are not widely used, particularly when we don't have any way to influence what we are given (e.g. legal deposit material).
The BL eases access to this content by using migration tools on ingest to create an access surrogate and, as the proponents of format migration generally do, keeping the original.
Naturally, we wish to keep the original file so that we can go back to it if necessary,
Thus, Andy agrees with me that it is essential to preserve the bits. Preserving the bits will ensure that these formats stay as hard to render as they are right now. Creating an access surrogate in a different format may be a convenient thing to do, but it isn't a preservation activity.

Where we may disagree is on the issue of whether it is necessary to preserve the access surrogate. It isn't clear whether the BL does, but there is no real justification for doing so. Unlike the original bits, the surrogate can be re-created at any time by re-running the tool that created it in the first place. If you argue for preserving the access surrogate, you are in effect saying that you don't believe that you will be able to re-run the tool in the future. The LOCKSS strategy for handling format obsolescence, which was demonstrated and published more than 6 years ago, takes advantage of the transience of access surrogates; we create an access surrogate if a reader ever accesses content that is preserved in an original format that the reader regards as obsolete. Note that this approach has the advantage of being able to tailor the access surrogate to the reader's actual capabilities; there is no need to guess which formats the eventual reader will prefer. These access surrogates can be discarded immediately, or cached for future readers; there is no need to preserve them.
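The idea of creating access surrogates on demand can be illustrated with a rough sketch. This is not the LOCKSS implementation; the class `OnDemandAccess`, the converter registry keyed by pairs of MIME types, and the use of MIME types to describe what a reader can render are all assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical converter type: takes original bits, returns surrogate bits.
Converter = Callable[[bytes], bytes]


class OnDemandAccess:
    """Keep the original bits; create access surrogates only when a reader asks."""

    def __init__(self, converters: Dict[Tuple[str, str], Converter]):
        # (source MIME type, target MIME type) -> converter (assumed registry)
        self.converters = converters
        # url -> (MIME type, original bits); this is what gets preserved
        self.originals: Dict[str, Tuple[str, bytes]] = {}
        # (url, target MIME type) -> surrogate bits; disposable, never preserved
        self.cache: Dict[Tuple[str, str], bytes] = {}

    def preserve(self, url: str, mime: str, bits: bytes) -> None:
        """Store the original bits unchanged."""
        self.originals[url] = (mime, bits)

    def access(self, url: str, acceptable: List[str]) -> Tuple[str, bytes]:
        """Return content in a format the reader says it can render.

        `acceptable` lists the reader's formats in order of preference.
        """
        mime, bits = self.originals[url]
        if mime in acceptable:
            # The reader can render the original, so no surrogate is needed.
            return mime, bits
        for target in acceptable:
            cached = self.cache.get((url, target))
            if cached is not None:
                return target, cached
            convert = self.converters.get((mime, target))
            if convert is not None:
                surrogate = convert(bits)
                # Caching is optional; the surrogate can always be re-created.
                self.cache[(url, target)] = surrogate
                return target, surrogate
        raise LookupError(f"no converter from {mime} to any of {acceptable}")

    def discard_surrogates(self) -> None:
        """Drop every cached surrogate; the originals are untouched."""
        self.cache.clear()
```

In this sketch a reader that lists the original format among its acceptable formats gets the preserved bits back unchanged, and any other reader gets a surrogate built for the first of its formats that a converter can produce. Calling `discard_surrogates()` loses nothing, because the next access simply re-runs the converter against the preserved original.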

The distinction between preservation and access is valuable, in that it makes clear that applying preservation techniques to access surrogates is a waste of resources.

One of the most interesting features of this debate has been detailed examinations of claims that this or the other format is obsolete; the claims have often turned out to be exaggerated. Andy says:
The original audio 'master' submitted to us arrives in one of a wide range of formats, depending upon the make, model and configuration of the source device (usually a mobile phone). Many of these formats may be 'exceptional', and thus cannot be relied upon for access now (never mind the future!).
But in the comments he adds:
The situation is less clear-cut in case of the Sound Map, partly because I'm not familiar enough with the content to know precisely how wide the format distribution really is.
The Sound Map page says:
Take part by publishing recordings of your surroundings using the free AudioBoo app for iPhone or Android smartphones or a web browser.
This implies that, contra Andy, the BL is in control of the formats used for recordings. It would be useful if someone with actual knowledge would provide a complete list of the formats ingested into Sound Map, and specifically identify those which are so hard to render as to require access surrogates.

1 comment:

euanc said...

This post on a paper funded by Microsoft that states that open source software is more expensive than proprietary software has some interesting comments about incompatibilities of OOXML
http://news.slashdot.org/story/11/01/19/1613258/Open-Source-More-Expensive-Says-MS-Report

This post on the Australian Government mandating OOXML within government draws similar comments
http://it.slashdot.org/story/11/01/19/0059209/Australia-Mandates-Microsofts-Office-Open-XML

Many reinforce the points you are making.

Euan