Friday, January 28, 2011

Threats to preservation

More than 5 years ago we published the LOCKSS threat model, the set of threats against which the LOCKSS system was designed to preserve content. We encouraged other digital preservation systems to do likewise; it is hard to judge how effective a system is in achieving its goal of preserving content unless you know what it is intended to preserve content against. We said:
We concur with the recent National Research Council recommendations to the National Archives that the designers of a digital preservation system need a clear vision of the threats against which they are being asked to protect their system's contents, and those threats under which it is acceptable for preservation to fail.
I don't recall any other system rising to the challenge; I'd be interested in any examples of systems that have documented their threat model that readers could provide in comments.

This lack of clarity as to the actual threats involved is a major reason for the misguided focus on format obsolescence that consumes such a large proportion of digital preservation attention and resources. As I write this, two ongoing examples illustrate the kinds of real threats on which attention should be focused instead.

In an attempt to damp down anti-government protests, the Egyptian government shut down the Internet in their country. One copy of the Internet Archive's Wayback Machine is hosted at the Bibliotheca Alexandrina. As I write it is accessible, but the risk is clear. But, you say, the US government would never do such a thing, so the Internet Archive is quite safe. Think again. Senators Joe Lieberman and Susan Collins are currently pushing a bill, the Protecting Cyberspace as a National Asset Act of 2010, to give the US government the power to do exactly that whenever it feels like doing so.

Also as I write this, SourceForge is unavailable, shut down in the aftermath of a compromise. The LOCKSS software, in common with many other digital preservation technologies, is preserved in SourceForge's source code control system. Other systems essential to digital preservation use one of a small number of other similar repositories. When SourceForge comes back up, we will have to audit the copy it contains of our source code against our backups and working copies to be sure that the attackers did not tamper with it.
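An audit of this kind can be sketched as a comparison of cryptographic digests between a trusted copy and the suspect copy. The following is a minimal illustration in Python, not a description of the actual LOCKSS audit procedure. The function names are hypothetical, and it assumes both copies are available as checked-out directory trees; a real audit would also need to cover the revision history held in the repository.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(trusted_root, suspect_root):
    """Report files that differ between a trusted copy and a suspect copy."""
    trusted = tree_digests(trusted_root)
    suspect = tree_digests(suspect_root)
    common = trusted.keys() & suspect.keys()
    return {
        # Present in the trusted copy but absent from the suspect copy.
        "missing": sorted(trusted.keys() - suspect.keys()),
        # Present only in the suspect copy.
        "added": sorted(suspect.keys() - trusted.keys()),
        # Present in both, but with different content.
        "modified": sorted(k for k in common if trusted[k] != suspect[k]),
    }
```

Any non-empty list in the result is a reason to inspect the corresponding files by hand before trusting the restored repository.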

I have argued for years, again with no visible effect, that national libraries should preserve these open source repositories. Not merely because, as the SourceForge compromise illustrates, their contents are the essential infrastructure for much of digital preservation, and because there are no economic, technical or legal barriers to doing so, but, even more importantly, because they are major cultural achievements, just as worthy of future scholars' attention as books, movies and even tweets.

Monday, January 17, 2011

Why Migrate Formats? The Debate Continues

I am grateful for two recent contributions to the debate about whether format obsolescence is an exception, or the rule, and whether migration is a viable response to it: one from Andy and one from Rob. I respond to Andy below the fold. Responding to Rob involves some research to clear up what appears to be confusion on my part, so I will postpone that to a later post.

Andy gives up the position that format migration is essential for preservation and moves the argument to access, correctly quoting an earlier post of mine saying that the question about access is how convenient it is for the eventual reader. As Andy says:
What is the point of keeping the bits safe if your user community cannot use the content effectively?
In this shift Andy ends up actually agreeing with much, but not quite all, of my case.

He says, quite correctly, that I argue that a format with an open source renderer is effectively immune from format obsolescence. But that isn't all I'm saying. Rather, the more important observation is that formats are not going obsolete; they continue to be easily rendered by the normal tools that readers use. Andy and I agree that reconstructing the entire open source stack as it was before the format went obsolete is an imposition on an eventual reader. That isn't actually what would have to happen if obsolescence happened, but the more important point is that obsolescence isn't going to happen.

The digital preservation community has failed to identify a single significant format that has gone obsolete in the 15+ years since the advent of the Web, which is one quarter of the entire history of computing. I have put forward a theory that explains why format obsolescence ceased; I have yet to see any competing theory that both explains the lack of format obsolescence since the advent of the Web and, as it would have to in order to support the case for format migration, predicts a resumption in the future. There is unlikely to be any reason for a reader to do anything but use the tools they have to hand to render the content, and thus no need to migrate it to a different format to provide "sustainable access".

Andy agrees with me that the formats of the bulk of the British Library's collection are not going obsolete in the foreseeable future:
The majority of the British Library's content items are in formats like PDF, TIFF and JP2, and these formats cannot be considered 'at risk' on any kind of time-scale over which one might reasonably attempt to predict. Therefore, for this material, we take a more 'relaxed' approach, because provisioning sustainable access is not difficult.
This relaxed approach to format obsolescence, preserving the bits and dealing with format obsolescence if and when it happens, is the one I have argued for since we started the LOCKSS program.

Andy then goes on to discuss the small proportion of the collection that is in formats which he does not expect to go obsolete in the future, but which are hard to render with current tools:
Unfortunately, a significant chunk of our collection is in formats that are not widely used, particularly when we don't have any way to influence what we are given (e.g. legal deposit material).
The BL eases access to this content by using migration tools on ingest to create an access surrogate and, as the proponents of format migration generally do, keeping the original.
Naturally, we wish to keep the original file so that we can go back to it if necessary,
Thus, Andy agrees with me that it is essential to preserve the bits. Preserving the bits will ensure that these formats stay as hard to render as they are right now. Creating an access surrogate in a different format may be a convenient thing to do, but it isn't a preservation activity.

Where we may disagree is on the issue of whether it is necessary to preserve the access surrogate. It isn't clear whether the BL does, but there is no real justification for doing so. Unlike the original bits, the surrogate can be re-created at any time by re-running the tool that created it in the first place. If you argue for preserving the access surrogate, you are in effect saying that you don't believe that you will be able to re-run the tool in the future. The LOCKSS strategy for handling format obsolescence, which was demonstrated and published more than 6 years ago, takes advantage of the transience of access surrogates; we create an access surrogate if a reader ever accesses content that is preserved in an original format that the reader regards as obsolete. Note that this approach has the advantage of being able to tailor the access surrogate to the reader's actual capabilities; there is no need to guess which formats the eventual reader will prefer. These access surrogates can be discarded immediately, or cached for future readers; there is no need to preserve them.
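The idea of creating access surrogates only on demand can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the LOCKSS implementation: the class name, the converter table and the media type strings are all hypothetical, and the reader's capabilities are assumed to arrive as an ordered list of acceptable media types, such as might be derived from an HTTP Accept header.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A converter turns the preserved original bytes into surrogate bytes.
Converter = Callable[[bytes], bytes]


class OnDemandAccess:
    """Serve preserved originals, converting only when a reader needs it."""

    def __init__(self, converters: Dict[Tuple[str, str], Converter]):
        # Keyed by (original media type, surrogate media type).
        self.converters = converters
        # Surrogates are disposable: this cache can be emptied at any time.
        self.cache: Dict[Tuple[str, str], bytes] = {}

    def serve(self, original: bytes, original_type: str,
              accepted: List[str]) -> Tuple[str, bytes]:
        # The reader can render the original, so no surrogate is needed.
        if original_type in accepted:
            return original_type, original
        # Otherwise use the first acceptable format we know how to produce.
        for target in accepted:
            convert = self.converters.get((original_type, target))
            if convert is None:
                continue
            key = (hashlib.sha256(original).hexdigest(), target)
            if key not in self.cache:
                self.cache[key] = convert(original)
            return target, self.cache[key]
        # No suitable converter: fall back to the preserved original.
        return original_type, original
```

The only state that matters for preservation is the original bytes passed to `serve`; clearing `cache` loses nothing that cannot be regenerated.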

The distinction between preservation and access is valuable, in that it makes clear that applying preservation techniques to access surrogates is a waste of resources.

One of the most interesting features of this debate has been detailed examinations of claims that this or the other format is obsolete; the claims have often turned out to be exaggerated. Andy says:
The original audio 'master' submitted to us arrives in one of a wide range of formats, depending upon the make, model and configuration of the source device (usually a mobile phone). Many of these formats may be 'exceptional', and thus cannot be relied upon for access now (never mind the future!).
But in the comments he adds:
The situation is less clear-cut in case of the Sound Map, partly because I'm not familiar enough with the content to know precisely how wide the format distribution really is.
The Sound Map page says:
Take part by publishing recordings of your surroundings using the free AudioBoo app for iPhone or Android smartphones or a web browser.
This implies that, contra Andy, the BL is in control of the formats used for recordings. It would be useful if someone with actual knowledge would provide a complete list of the formats ingested into Sound Map, and specifically identify those which are so hard to render as to require access surrogates.

Tuesday, January 4, 2011

Apology to Safari Users

I'm a Firefox user, so I have only just noticed that in Safari the "front page" of this blog does not render correctly. The first part of the material "below the fold" correctly does not appear, but mysteriously as soon as I use "blockquote", "ol" or "ul" tags, it reappears. So in Safari most of the post appears on the "front page" but with a chunk in the middle elided. Fortunately, clicking on the headline gets you a properly rendered version. Sorry about this. I'm looking into the problem.

Monday, January 3, 2011

Memento & the Marketplace for Archiving

In a recent post I described how Memento allows readers to access preserved web content, and how, just as accessing current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes from original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators (which will, I predict, end up being called something snappier and less obscure).
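The kind of index involved can be illustrated with a toy example in Python. This is only a sketch under assumptions of my own, not how Memento or any real Aggregator is implemented: the class and method names are hypothetical, the index lives in memory, and "best match" is taken to mean the capture closest in time to the one requested.

```python
import bisect
from datetime import datetime


class AggregatorIndex:
    """Toy index from (original URL, time) to the URL of a preserved copy."""

    def __init__(self):
        # original URL -> list of (capture time, preserved URL), kept sorted.
        self._captures = {}

    def add(self, original_url, captured_at, preserved_url):
        entries = self._captures.setdefault(original_url, [])
        bisect.insort(entries, (captured_at, preserved_url))

    def lookup(self, original_url, wanted):
        """Return the preserved URL captured closest to the wanted time."""
        entries = self._captures.get(original_url)
        if not entries:
            return None
        times = [captured_at for captured_at, _ in entries]
        i = bisect.bisect_left(times, wanted)
        # Only the captures just before and just after can be closest.
        candidates = entries[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda e: abs(e[0] - wanted))[1]
```

An Aggregator would populate such an index from many archives, so that a single lookup can point a reader at whichever archive holds the most suitable copy.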

As we know, a complex ecology of competition, advertising, optimization and spam has grown up around search engines, and we can expect something similar to happen around Aggregators. Below the fold I use an almost-real-life example to illustrate my ideas about how this will play out.