Wednesday, January 30, 2008

Does Preserving Context Matter?

As a Londoner, I really appreciate the way The Register brings some of the great traditions of Fleet Street to technology. In a column that appeared there just before Christmas, Guy Kewney asks his version of Provost O'Donnell's question, "Who's archiving IT's history?" and raises the important issue of whether researchers need only the "intellectual content" to survive, or whether they need the context in which it originally appeared.

This is an unusually good moment to discuss the issue, because the same content has been preserved both by a technique that preserves the context and by one that does not, and both copies have been made available in the wake of a trigger event. Some people, but not everyone, will be able to compare them directly.

Kewney writes:
One of my jobs recently has been to look back into IT history and apply some 20-20 hindsight to events five years ago and ten years ago.
Temporarily unable to get to his library of paper back issues of IT Week for inspiration, he turned to the Internet Archive's Wayback Machine to look back five years at his NewsWireless site:
I won't hear a word against the WayBackMachine. But I will in honesty have to say a few words against it: it's got holes.
What it's good at is holding copies of "That day's edition" just the way a newspaper archive does. I can, for example, go back to NewsWireless by opening up this link; and there, I can find everything that was published on December 6th 2002 - five years ago! - more or less. I can even see that the layout was different, if I look at the story of how NewsWireless installed a rogue wireless access point in the Grand Hotel Palazzo Della Fonte in Fiuggi, ...

Now, have a look at the same story, as it appears on NewsWireless today. The words are there, but it looks nothing like it used to look.

Unusually, NewsWireless does give you the same page you would have seen five years ago. When you're reading the Fiuggi story, the page shows you contemporary news... It's the week's edition, in content at least.

Most websites don't do this.

You can, sometimes, track back a particular five-year-old story (though sadly you'll often find it's been deleted), but if you go to the original site you're likely to find that the page you see is surrounded by modern stories. It's not a five-year-old edition. Take, for example Gordon Laing's Christmas 2002 article ... and you'll find exactly no stories at all relating to Christmas 2002. They were published, yes, but they aren't archived together anywhere - except the WayBackMachine.
Look at the two versions of the Fiuggi story linked from the quote above - although the words are the same the difference is striking. It reveals a lot about the changes in the Web over the past five years.

A much more revealing example than Kewney's is now available. SAGE publishes many academic journals. Some succeed, others fail. One of the failures was Graft: Organ and Cell Transplantation, of which SAGE published three volumes from 2001 to 2003. SAGE participates in both the major e-journal archiving efforts, CLOCKSS and Portico, and both preserve the content of these three volumes. SAGE decided to cease publishing these volumes, and has allowed both CLOCKSS and Portico to trigger the content, i.e. to go through the process each defines for making preserved content available.

The Graft content in CLOCKSS is preserved using LOCKSS technology, which uses the same basic approach as the Internet Archive. The system carefully crawls the e-journal website, collecting the content of every URL that it thinks of as part of the journal. After the trigger event all these collected URLs are reassembled to re-constitute the e-journal website, which is made freely available to all under a Creative Commons license.

You can see the result at the CLOCKSS web site. The page at that link is an introduction, but if you follow the links on that page to the Graft volumes, you will be seeing preserved content extracted from the CLOCKSS system via a script that arranges it in a form suitable for Apache to serve. Please read the notes on the introductory page describing ways in which content preserved in this way may surprise you.
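To make the crawl-and-reassemble idea concrete, here is a minimal sketch in Python. It is not the LOCKSS code: the function names, the example.org-style URLs, and the in-memory stand-in for a web site are all assumptions made for illustration. It collects every URL under a journal's base URL, ignores links that lead elsewhere (such as an ad server), and then maps each preserved URL to a file path that a web server such as Apache could serve.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def in_scope(url, base):
    """Treat a URL as part of the journal if it sits under the base URL."""
    return url.startswith(base)


def crawl(start, base, fetch, extract_links):
    """Collect the content of every in-scope URL reachable from start.

    fetch and extract_links are supplied by the caller; both are
    hypothetical helpers here, not part of any real preservation system.
    """
    preserved = {}
    queue = [start]
    while queue:
        url = queue.pop()
        if url in preserved or not in_scope(url, base):
            continue
        body = fetch(url)
        if body is None:
            continue
        preserved[url] = body
        queue.extend(extract_links(body))
    return preserved


def url_to_path(url):
    """Map a preserved URL to a relative file path for a static web server."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"
    return str(PurePosixPath(parts.netloc) / path.lstrip("/"))


# A made-up three-page "web": a table of contents, one article, one ad.
FAKE_SITE = {
    "http://journal.example/graft/":
        "toc http://journal.example/graft/v1/a1.html http://ads.example/banner",
    "http://journal.example/graft/v1/a1.html":
        "article text http://journal.example/graft/",
    "http://ads.example/banner": "an advertisement",
}


def fake_fetch(url):
    return FAKE_SITE.get(url)


def fake_links(body):
    return [word for word in body.split() if word.startswith("http://")]


preserved = crawl(
    "http://journal.example/graft/",
    "http://journal.example/graft/",
    fake_fetch,
    fake_links,
)
layout = {url_to_path(url): body for url, body in preserved.items()}
```

Because the preserved unit is the URL and whatever the publisher served at it, the links between pages and the look of each page survive along with the words. That is the sense in which this approach keeps the context.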

The Graft content in Portico is preserved by a technique that aims only to preserve the "intellectual content", not the context. Content is obtained from the publisher as source files, typically the SGML markup used to generate the HTML, PDF and other formats served by the e-journal web site. It undergoes a process of normalization that renders it uniform. In this way the same system at Portico can handle content from many publishers consistently, because the individual differences such as branding have been normalized away. The claim is that this makes the content easier to preserve against the looming crisis of format obsolescence. It does, however, mean that the eventual reader sees the "intellectual content" as published by Portico's system now, not as originally published by SAGE's system. Since the trigger event, readers at institutions which subscribe to Portico can see this version of Graft for themselves. Stanford isn't a subscriber, so I can't see it; I'd be interested in comments from those who can make the comparison.
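A toy illustration of what normalization means in practice, again in Python. I have no knowledge of how Portico's pipeline is implemented, so the field names, the two publisher mappings and the sample records below are all invented. The sketch maps each publisher's source fields onto one shared schema and discards anything that has no place in that schema.

```python
def normalize(record, field_map):
    """Map a publisher-specific record onto a shared schema.

    Fields that the mapping does not mention (branding, layout hints)
    are dropped. Both the schema and the mappings are hypothetical.
    """
    return {
        common: record[source]
        for source, common in field_map.items()
        if source in record
    }


# Invented mappings for two imaginary publishers.
PUBLISHER_A_MAP = {"art-title": "title", "body-text": "body"}
PUBLISHER_B_MAP = {"Headline": "title", "Content": "body"}

# The same article as two publishers might deliver it.
record_a = {
    "art-title": "Example article",
    "body-text": "Example text.",
    "banner-logo": "publisher-a.gif",
}
record_b = {
    "Headline": "Example article",
    "Content": "Example text.",
    "SiteTheme": "blue",
}

normalized_a = normalize(record_a, PUBLISHER_A_MAP)
normalized_b = normalize(record_b, PUBLISHER_B_MAP)
```

After normalization the two records are identical, which is what lets one system handle many publishers. It is also why nothing of either publisher's original presentation remains for the eventual reader to see.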

It is pretty clear that Kewney is on the LOCKSS side of this issue:
Once upon a time, someone offered me all the back numbers of a particular tech magazine I had contributed to. He said: "I don't need it anymore. If I want to search for something I need to know, I Google it."

But what if you don't know you need to know it? What sort of records of the present are we actually keeping? What will historians of the future get to hear about contemporary reactions to stories of the day, without the benefit of hindsight?

Maybe, someone in the British Library ought to be solemnly printing out all the content on every news website every day, and storing them in boxes, labelled by date?
The LOCKSS technology can in some respects do better than that, but in other respects it can't. For example, every reader of a Web page containing advertisements may see a different ad. Printing the page gets one of them. The LOCKSS technology has to exclude the ads. But, as you can see, it does a reasonable job of capturing the context in which the "intellectual content" appeared. Notice, for example, the difference between the headline bar of a typical table of contents page extracted from an Edinburgh University CLOCKSS node and a Stanford University CLOCKSS node. This is an artifact of the two institutions' different subscriptions to SAGE journals.

This isn't a new argument. The most eloquent case for the importance of preserving what the publisher published was made by Nicholson Baker in Double Fold: Libraries and the Assault on Paper. He recounts how microfilm vendors convinced librarians of a looming crisis. Their collections of newspapers were rapidly decaying. It was urgently necessary to microfilm them or their "intellectual content" would be lost to posterity. Since the microfilm would take up much less space, they would save money in the long run. The looming crisis turned out to be a bonanza for the microfilm companies but a disaster for posterity. Properly handled newspapers were not decaying; improperly handled ones were. Although properly handled microfilm would not decay, improperly handled microfilm decayed as badly as paper. The process of microfilming destroyed both "intellectual content" and context.

I'd urge anyone tempted to believe that the crisis of format obsolescence looms so menacingly that it can be solved only through the magic of "normalization" to read Nicholson Baker.


  1. interesting and thought provoking

  2. Hmmm. Does preserving context matter? I think the answer is "it depends..." My favourite example of this is Walter Scott's novel, Kenilworth (the example applies to all his novels, but Kenilworth is where I live). This was published anonymously in 3 volumes, in leather bindings and typography of its time. If I want to read the story, I will probably go to a modern, paperback edition, which may or may not have a scholarly introduction explaining how the multitude of errors introduced by Scott's attempts to ensure his anonymity were "corrected". This modern volume will not be anonymous, but will accurately indicate the author.

    My point is, the original edition is important for a scholar, containing much information revealing of its time and context. The modern information, shorn of much of that context but containing the "intellectual content" is most useful for me, a general reader. So, it depends...