I've been warning for some time that one of the fundamental problems facing digital preservation is the evolution of content from static to dynamic. What, exactly, does it mean to preserve something that is different every time you look at it?
When the dynamic elements in Web pages were things like advertisements or news headlines it was possible to argue that preserving these elements was not important. Indeed, some preservation services argued that their mission was limited to preserving the intellectual content (PDF).
Even if you don't believe that preserving context matters (if so, see the digression below), the evolution of the Web has meant that more and more of the intellectual content is delivered as dynamic elements. It started with streaming video; no one can understand the 2008 US election without YouTube. The improved user interface capabilities of AJAX mean that it is now used even by academic journal publishers such as the Royal Society of Chemistry.
The evolution of user interface protocols from static to programmable has a history. Bob Sproull and Elaine Thomas published the first Network Graphics Protocol in 1974. It was developed at Xerox PARC in the early days of the ARPAnet, and was designed to adapt to various types of display hardware, including both storage tubes and high-performance vector graphics terminals. Nearly a decade later, James Gosling & I based the Andrew Window Manager on a similar fixed-function protocol between applications and the process managing the display. The X Window System continues this approach to this day.
Things aren't as bad as they could be; the service has to deliver an interpreter for the proprietary, encrypted format before it can deliver the content in that format. But obtaining and decrypting the content requires running this interpreter in the browser or, more to the point for preservation, in the Web crawler. Although a crawler could collect the interpreter, then run it and collect any content it accessed, this does not guarantee that re-running the interpreter later with the preserved content will work. For example, the interpreter could be programmed to stop working a few minutes after it was delivered to the browser (or crawler).
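This failure mode can be made concrete with a toy model. All names below are hypothetical (a real interpreter would arrive as obfuscated JavaScript, and the "decoding" here is just a stand-in): the delivered interpreter consults the clock and refuses to decode content once a short window after delivery has passed, so a preserved copy replayed long after the crawl is useless.

```python
# Toy model (hypothetical names) of the time-bomb failure mode: the
# delivered interpreter only decodes content within a short window
# after delivery, so replaying a preserved copy later fails.
class TimeBombedInterpreter:
    def __init__(self, delivered_at, ttl_seconds=300):
        self.delivered_at = delivered_at
        self.ttl_seconds = ttl_seconds

    def decode(self, obfuscated_content, now):
        if now - self.delivered_at > self.ttl_seconds:
            raise RuntimeError("interpreter expired; content unrecoverable")
        # Stand-in for real decryption: just reverse the string.
        return obfuscated_content[::-1]

interp = TimeBombedInterpreter(delivered_at=0.0)
print(interp.decode("tnetnoc", now=60.0))   # within the window: prints "content"
try:
    # A preservation system replaying the crawl months later...
    interp.decode("tnetnoc", now=1e7)
except RuntimeError as err:
    print(err)                              # interpreter expired; content unrecoverable
```

The crawler faithfully preserved both the interpreter and the content it fetched, yet the combination no longer works at replay time; nothing in the preserved bits reveals why.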
Another way of expressing the same thought is that HTML5 allows content owners to implement a semi-effective form of DRM for the Web. This is a major reason Roger McNamee is so excited about it. Content owners have been attracted by the walled-garden app approach because, although it restricts the potential market and may require paying taxes to the platform, it makes the content harder to steal. So, for example, Kindle was not just a hardware device but also an app for iPhone, iPad and Android. But now that browsers support enough HTML5 capabilities, we also have Kindle in the browser, which looks and feels just like an app but needs no installation. Amazon loves the HTML5 approach because if you buy a book in Apple's walled app garden, Apple takes 30% of the gross; if you buy a book in your browser, Apple gets 0%. Moonalice loves HTML5 because you can watch their shows on your phone without needing a Moonalice app. For content owners, these advantages make up for DRM that is somewhat less effective than an app's.
First with AJAX and now with HTML5, the Web is evolving in a way that makes preserving Web content much harder. Traditional Web crawlers, such as the Internet Archive's widely used Heritrix, will become less and less effective at collecting it. A major effort to develop techniques capable of collecting and preserving AJAX and HTML5 content is urgently needed. Working with students at Carnegie Mellon's Silicon Valley campus, the LOCKSS team has started to investigate techniques for collecting AJAX content.
A digression about preserving context.
In 2007, this now-removed page at the JSTOR website used to say:
“If, in keeping with our mission to function as a trusted archive, JSTOR is to serve as a substitute for the journal volumes on the shelves, it must offer an electronic version that is a faithful replication of the original. An image-based approach ensures the integrity of the materials in the archive, while also retaining the appearance and “look and feel” of the journal in its original presentation. This is central to our mission and a key basis upon which JSTOR was founded.”

This agrees with the LOCKSS philosophy, but it contrasts with the "intellectual content" approach espoused by Portico's Eileen Fenton. The importance of context can be illustrated by Andrei Codrescu's Exquisite Corpse. Here are three screen-grabs of its home page at different times:
Would preserving the words alone without the typography, layout and images be useful?
Would preserving all of Shakespeare except the First Folio be useful?
I think the answer depends on your point of view. For some, the "work" or the "story" (the intellectual content) is what is important. For some, the object, the experience of the original, is essential.
Preserving the work, if possible, can be (I think) much cheaper than preserving the original, the experience, the full context. You have often argued that if we take the expensive preservation option we preserve less.
So, I hope your new venture can indeed find a cheap way to preserve the experience with full context. If not, I would settle for more content and a bit less context, accepting that some content may be irreparably lost in the process!
On the Web, it has typically been cheaper to preserve the context than to construct a complex ingest pipeline to extract the "intellectual content" and preserve it alone. Compare web archiving approaches such as the Internet Archive and LOCKSS with Portico's complex and expensive ingest processes.
I believe this will remain true through the transition to HTML5, although preserving context will require developing new Web preservation techniques. Extracting and preserving the "intellectual content" in an HTML5 world will require new techniques too, and they are likely to remain more complex, difficult and expensive.
I'm happy to see more effort being put into capturing the dynamic web, but I don't think HTML5 is all bad. The new video tag makes archiving video content much easier: instead of the kind of complex incantations required to archive Flash video, we can get the payload directly over HTTP.
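To illustrate how direct this is, here is a minimal sketch using only Python's standard-library HTML parser (the page markup is a made-up example): pull the src URLs out of a video element and its source children, after which a crawler can fetch each payload with an ordinary HTTP GET.

```python
from html.parser import HTMLParser

class VideoSrcExtractor(HTMLParser):
    """Collect the URLs a <video> element (or its <source> children) points at."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            for name, value in attrs:
                if name == "src" and value:
                    self.urls.append(value)

# Made-up example page; a real crawler would feed in the fetched HTML.
page = """
<html><body>
  <video controls>
    <source src="http://example.com/show.webm" type="video/webm">
    <source src="http://example.com/show.mp4" type="video/mp4">
  </video>
</body></html>
"""
extractor = VideoSrcExtractor()
extractor.feed(page)
print(extractor.urls)
# ['http://example.com/show.webm', 'http://example.com/show.mp4']
```

Each extracted URL is just a plain HTTP resource; compare that with scraping the parameters out of a Flash player's embed code and reverse-engineering its streaming protocol.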
Also, I think the case of hash-bang URIs is very informative; it illustrates how the need to ensure websites remain discoverable (indexable by Google) will push against the tendency to obscure the data, if at some cost in increased crawler complexity.
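The discoverability pressure took concrete form in Google's AJAX crawling scheme: a site using hash-bang URIs is expected to answer a rewritten "_escaped_fragment_" query with a static snapshot a crawler can index. A minimal sketch of the URL mapping (the real scheme also mandates percent-encoding rules this simplifies):

```python
from urllib.parse import quote

def escaped_fragment_url(url):
    """Map a hash-bang URL to the crawlable form defined by Google's
    AJAX crawling scheme: the '#!' fragment becomes an
    '_escaped_fragment_' query parameter the server answers statically.
    Simplified sketch; the full scheme specifies exact encoding rules."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="=")

print(escaped_fragment_url("http://example.com/page#!state=42"))
# http://example.com/page?_escaped_fragment_=state=42
```

A web archive's crawler could exploit the same convention: whatever a site exposes for Google's indexer, it exposes for preservation too.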
Two well-taken points. Thanks for the links.
The Economist writes about the transition to HTML5.