Monday, December 27, 2010

The Importance of Discovery in Memento

There is now an official Internet Draft of Memento, the important technique by which preserved versions of web sites may be accessed. The Memento team deserve congratulations not just for getting to this stage of the RFC process, but also for being awarded the 2010 Digital Preservation Award on December 1st. Follow me below the fold for an explanation of one detail of the specification which, I believe, will become very important.

Memento cleverly extends the HTTP content negotiation mechanism, which we used to implement transparent on-access format migration, into the time dimension. The basic mechanism works like this:

  • A browser wanting to access past versions of a URL issues a GET request to the URL's server with an Accept-Datetime header specifying the time in the past in which it is interested.
  • If the web server knows about past versions of its URLs, it sends back a response with a Link header directing the browser to a TimeGate.
  • The TimeGate can be thought of as a time index, mapping between preserved versions of the URL and the times at which they were collected. The browser then sends its request to the TimeGate, again with the Accept-Datetime header.
  • The TimeGate's response redirects the browser to the most appropriate version it knows about, and also provides a link to a TimeMap describing the full sequence of versions it knows about, in case its idea of the most appropriate one doesn't match the browser's.
  • The browser either follows the redirect, or requests the TimeMap and chooses the appropriate URL itself, then displays the content.
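The first and last steps of this exchange can be sketched in Python. This is only an illustration, not code from the specification: the date and URLs below are made up, the Accept-Datetime value follows HTTP's RFC 1123 date syntax, and the Link-header parse is deliberately simplified (a real client would need a full Link-header parser).

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import re

def accept_datetime_header(dt):
    """Format a datetime as an RFC 1123 date, as Accept-Datetime requires."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def find_timegate(link_header):
    """Pull the rel="timegate" target out of an HTTP Link header, if any.
    A simplified parse: assumes no commas inside URLs or parameter values."""
    for match in re.finditer(r'<([^>]+)>\s*;\s*([^,]+)', link_header):
        target, params = match.groups()
        if re.search(r'rel\s*=\s*"?timegate"?', params):
            return target
    return None

# Hypothetical values for illustration only.
dt = datetime(2010, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
headers = {"Accept-Datetime": accept_datetime_header(dt)}
link = '<http://example.org/timegate/page>; rel="timegate"'
print(headers["Accept-Datetime"])   # Tue, 15 Jun 2010 12:00:00 GMT
print(find_timegate(link))          # http://example.org/timegate/page
```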
Some web sites, for example content management systems or blog platforms, can be their own TimeGate because they themselves preserve the chain of versions of each of their URLs. But most web sites will be preserved by separate archives, such as the Internet Archive. Each of these archives will need to implement a TimeGate indexing their own content.

This mechanism works if the web site whose previous versions the browser wants to access (a) still exists, (b) knows about the various archives preserving its history, and (c) wants browsers to be able to access previous versions. In the vast majority of cases some or all of these conditions won't hold. The browser will send the request with Accept-Datetime, and it will get back a 404 or the current version of the site, neither of which is helpful. How does the browser find the right TimeGate if the original web server doesn't point to it?

To handle the case of the missing or un-cooperative web site, the Memento plugin for Firefox is configured with a list of TimeGates to try in turn. Clearly, it isn't going to be practical for this list to contain the TimeGates for every archive that might possibly hold some of the history of all the web sites that the browser might at some point want to access.
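The plugin's fallback strategy amounts to walking the configured list until one TimeGate answers. A minimal sketch, in which the TimeGate endpoints are invented and the `probe` function is a stub standing in for a real HTTP round trip:

```python
def first_working_timegate(url, timegates, probe):
    """Try each configured TimeGate in turn; return the first for which
    probe (an injected fetch function returning an HTTP status code)
    reports a memento for `url`, or None if none of them has one."""
    for gate in timegates:
        if probe(gate, url) == 200:
            return gate
    return None

# Hypothetical TimeGate endpoints and a stubbed probe for illustration.
gates = ["http://archive-a.example/timegate/",
         "http://archive-b.example/timegate/"]
known = {("http://archive-b.example/timegate/", "http://example.org/")}
probe = lambda gate, url: 200 if (gate, url) in known else 404
print(first_working_timegate("http://example.org/", gates, probe))
# http://archive-b.example/timegate/
```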

To address this problem, Memento uses Aggregators. An Aggregator behaves just like a TimeGate but, instead of indexing the content of a single archive, aggregates the contents of many TimeGates, using a bulk query mechanism to capture the TimeMaps that summarize each TimeGate's content. Aggregators are in effect search engines, indexing the preserved versions in the archives so that they can answer queries about their content.
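To make the TimeMap idea concrete, here is a rough sketch, not the Aggregator's actual code, of choosing the memento closest to a requested datetime from a TimeMap fragment in link format. The archive URLs are invented, and the parse is simplified to entries of the exact shape shown:

```python
from datetime import datetime, timezone
import re

def closest_memento(timemap, target):
    """Given a TimeMap in link format, return the memento URL whose
    datetime is closest to the target -- the choice a TimeGate (or an
    Aggregator acting as one) makes on the browser's behalf."""
    best_url, best_gap = None, None
    pattern = r'<([^>]+)>;[^,]*rel="memento";[^,]*datetime="([^"]+)"'
    for url, stamp in re.findall(pattern, timemap):
        dt = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S GMT")
        gap = abs((dt.replace(tzinfo=timezone.utc) - target).total_seconds())
        if best_gap is None or gap < best_gap:
            best_url, best_gap = url, gap
    return best_url

# A tiny hypothetical TimeMap fragment with two preserved versions.
timemap = ('<http://archive.example/2009/http://a.example/>; rel="memento"; '
           'datetime="Wed, 01 Jul 2009 00:00:00 GMT",'
           '<http://archive.example/2010/http://a.example/>; rel="memento"; '
           'datetime="Fri, 01 Jan 2010 00:00:00 GMT"')
target = datetime(2010, 3, 1, tzinfo=timezone.utc)
print(closest_memento(timemap, target))
# http://archive.example/2010/http://a.example/
```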

The next question is how Aggregators know where TimeGates are. One way would be to require TimeGates to actively register themselves with all the Aggregators they know about. But there will always be some Aggregators that they don't know about. So, by analogy with search engines, we need a way for Aggregators to find passive TimeGates on the web using a web crawler. This is the importance of Section 3.2.3 of the draft, which explains how to recognize that something is a TimeGate, and Section 4.2, which explains some ways to embed hints in the Web that will guide crawlers to TimeGates.
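As I read the draft, the recognition rule boils down to a header check: a TimeGate negotiates on Accept-Datetime, so its responses advertise that fact in the Vary header, and a crawler can look for that token. A minimal sketch, assuming response headers are available as a simple dict:

```python
def looks_like_timegate(headers):
    """Heuristic: a TimeGate's response varies on the Accept-Datetime
    request header, so 'accept-datetime' appears among the Vary tokens.
    `headers` maps response header names to values."""
    vary = headers.get("Vary", "")
    return "accept-datetime" in [v.strip().lower() for v in vary.split(",")]

print(looks_like_timegate({"Vary": "negotiate, accept-datetime"}))  # True
print(looks_like_timegate({"Vary": "Accept-Encoding"}))             # False
```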