Monday, January 3, 2011

Memento & the Marketplace for Archiving

In a recent post I described how Memento allows readers to access preserved web content, and how, just as access to current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes mapping original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators (which will, I predict, end up being called something snappier and less obscure).
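To make the shape of such an index concrete, here is a minimal sketch in Python of the mapping an Aggregator might maintain. The data structure, the example capture times and the lookup function are my own illustration, not anything the Memento specification prescribes.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical in-memory index: original URL -> time-ordered list of
# (capture datetime, URL of a preserved copy). A real Aggregator would
# build something like this from information harvested from many archives;
# the entries below are invented for illustration.
INDEX = {
    "http://gft.sagepub.com/": [
        (datetime(2003, 6, 1),
         "http://web.archive.org/web/20030601000000/http://gft.sagepub.com/"),
        (datetime(2008, 2, 15),
         "http://web.archive.org/web/20080215000000/http://gft.sagepub.com/"),
    ],
}

def closest_memento(original_url, when):
    """Return the preserved URL whose capture time is nearest to `when`."""
    versions = INDEX.get(original_url)
    if not versions:
        return None
    times = [t for t, _ in versions]
    i = bisect_left(times, when)
    candidates = versions[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda v: abs(v[0] - when))[1]

print(closest_memento("http://gft.sagepub.com/", datetime(2007, 1, 1)))
```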

As we know, a complex ecology of competition, advertising, optimization and spam has grown up around search engines, and we can expect something similar to happen around Aggregators. Below the fold I use an almost-real-life example to illustrate my ideas about how this will play out.


The example is the now-defunct medical journal Graft which, while it was being published, appeared at http://gft.sagepub.com/. When it ceased publication, this URL started redirecting to http://online.sagepub.com/site/moved, a list of relocated and defunct journals. For the sake of clarity, however, I will assume in what follows that the DNS name gft.sagepub.com no longer resolves.

At least four archives claim to preserve part or all of Graft:

  1. Portico, at http://www.portico.org/Portico/browse/access/vols.por?journalId=ISSN_15221628. The content is accessible only to readers at institutions with a current access lease from Portico.
  2. The Koninklijke Bibliotheek, the National Library of the Netherlands, but only for their readers (e.g.):
    [Graft] is protected under national copyright laws and international copyright treaties and may only be accessed, searched, browsed, viewed, printed and downloaded on the KB premises for personal or internal use by authorised KB visitors and is made available under license between the Koninklijke Bibliotheek and the publisher.
  3. The Internet Archive, at http://web.archive.org/web/*/http://gft.sagepub.com. The content is open access, but consists of the table of contents pages, the abstracts of individual articles and, for each full-text article, a login page such as this.
  4. CLOCKSS, at http://www.clockss.org/clockss/Graft. The content is open access, under Creative Commons licenses, and consists of (two copies of) the entire Graft web site as it was shortly before it vanished.
As (we are assuming) gft.sagepub.com no longer resolves, a browser wishing to access the content somehow has to find a TimeGate for it without the cooperation of the original server. For reasons discussed in the earlier post, this TimeGate will be at one of Memento's search-engine-like Aggregators. As with search engines, the reader will have a choice of competing Aggregators, but it is likely that the reader's preferred Aggregator will be configured into their browser. It is also likely, for the reasons described by Brian Arthur and exemplified by Google (and Baidu), that a substantial majority of readers will prefer one particular Aggregator.
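At the protocol level, the browser's side of this transaction is simple: it asks the Aggregator's TimeGate for the original URL, states the datetime it wants in an Accept-Datetime header, and follows the redirect to the chosen version. The sketch below shows this in Python; the Aggregator hostname and URL pattern are placeholders, and the header names follow the Memento proposal.

```python
import urllib.request

# Hypothetical Aggregator TimeGate; the hostname and path pattern are
# placeholders, not a real service.
TIMEGATE = "http://aggregator.example.org/timegate/"
ORIGINAL = "http://gft.sagepub.com/"

req = urllib.request.Request(TIMEGATE + ORIGINAL)
# Memento datetime negotiation: ask for the version closest to this instant.
req.add_header("Accept-Datetime", "Thu, 15 Jun 2006 00:00:00 GMT")

# The TimeGate answers with a redirect to the memento it selects, plus Link
# headers pointing at the original resource and at a full TimeMap; urllib
# follows the redirect for us.
with urllib.request.urlopen(req) as resp:
    print(resp.geturl())                         # URL of the selected memento
    print(resp.headers.get("Memento-Datetime"))  # capture time of that memento
    print(resp.headers.get("Link"))              # original, timemap, etc.
```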

The first thing we learn from this example is that Memento is correct in assuming that there will be many archives preserving any given content, and thus that some Aggregator-like mechanism is necessary to provide readers convenient access to them. The second thing, which Memento doesn't address, is that the usefulness of these archives to the reader is likely to vary widely. Some will be buggy, some will work correctly, some will be wrong about what they claim to hold, and some will require payment for access.

Search engines have developed sophisticated techniques, such as Google's PageRank, to rate pages according to their likely usefulness to the reader, presenting the reader with a list of pages in decreasing order of predicted usefulness. Unfortunately, Memento for obvious reasons specifies that TimeGates, including Aggregators, return their list of versions in time order, not usefulness order. Aggregators are going to have to develop ways of both estimating the usefulness of preserved content, and of conveying these estimates to readers.
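One way an Aggregator could reconcile the time-ordered list the protocol requires with its own judgement is to keep the ordering but attach a score to each entry. The sketch below is purely illustrative: the factors, the weights and the idea of exposing a score at all are my assumptions, not part of Memento.

```python
# Hypothetical usefulness scoring for mementos an Aggregator knows about.
# The factors and weights are invented for illustration; Memento itself
# says nothing about how (or whether) such scores are computed or conveyed.
def usefulness(memento):
    score = 0.0
    if memento.get("open_access"):
        score += 0.5          # no paywall between the reader and the content
    if memento.get("full_text"):
        score += 0.4          # actual articles, not just login pages
    if memento.get("verified"):
        score += 0.1          # archive's claims have been spot-checked
    return score

mementos = [
    {"archive": "Internet Archive", "open_access": True,  "full_text": False, "verified": True},
    {"archive": "CLOCKSS",          "open_access": True,  "full_text": True,  "verified": True},
    {"archive": "Portico",          "open_access": False, "full_text": True,  "verified": True},
]

# Keep the TimeMap in time order for the protocol, but expose the score so
# a client (or the Aggregator's own redirect logic) can prefer the most
# useful copy among those close to the requested datetime.
for m in mementos:
    m["score"] = usefulness(m)
    print(m["archive"], m["score"])
```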

A third thing we can learn from the example is that the archives will have a range of business models. In fact, given the current scarcity of sustainable archive business models, we are likely to see a much bigger range in future, probably including advertiser-supported and pay-per-view. These archives compete with each other for eyeballs, just as the sites indexed by search engines compete with each other. In the search space, this has led to a massive arms race as Search Engine Optimization (SEO) techniques try to game the usefulness-estimation algorithms. We can expect a similar arms race in the archive space, as Aggregator Optimization (AO) techniques try to direct as many eyeballs as possible to their archive.

Unfortunately, there is one AO technique that is immediately obvious and hard for Aggregators to combat. Typically, the Aggregator redirects the browser to the preserved version closest in time to the browser's requested datetime, together with headers pointing to earlier and later versions. The higher the proportion of the Aggregator's time index taken up by links to my archive, the higher the probability that my archive will get the redirect. This proportion depends on how many versions my archive tells the Aggregator it has. If my archive lies, saying for example that it has a version of the site every 5 seconds, it will be pretty much guaranteed to get the redirect, at least until a competing archive figures this out and claims to have a version every 2 seconds.
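A back-of-the-envelope simulation makes the payoff from this lie obvious. The intervals below are invented, and the model assumes the Aggregator blindly redirects to whichever claimed snapshot is nearest in time to a uniformly random requested datetime.

```python
import random

# Two archives claim snapshots of the same page at different intervals
# (in seconds, over one year). The Aggregator redirects to whichever
# claimed snapshot is closest in time to the requested datetime.
YEAR = 365 * 24 * 3600
honest_interval = 30 * 24 * 3600   # one genuine crawl a month
lying_interval = 5                 # claims a version every 5 seconds

def nearest(t, interval):
    """Distance from time t to the nearest claimed snapshot on this grid."""
    r = t % interval
    return min(r, interval - r)

wins = 0
trials = 100000
for _ in range(trials):
    t = random.uniform(0, YEAR)
    if nearest(t, lying_interval) < nearest(t, honest_interval):
        wins += 1

print("Lying archive gets the redirect %.1f%% of the time" % (100.0 * wins / trials))
```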

This example shows that Aggregators are mis-named. They cannot simply aggregate what archives tell them; trust of this kind is too easily abused on the Internet. They will need to assess the credibility of the information they collect, and the usefulness of the content to which it points.

This verification won't be easy, as the example of the Internet Archive's collection of Graft shows. The reason the Internet Archive thinks it has the full-text content is that the Graft web server, in common with almost all subscription content servers, did not conform to the HTTP standard. Conforming would have required it to respond to requests for the full-text content with a 401 or 402 error status, demanding authentication or payment. Instead, the Graft server lied, returning a page asking for authentication or payment with a 200 success status. The Internet Archive's crawler fell for the lie, believing that this was the page it was supposed to get by following the full-text link. The Internet Archive is not supposed to collect subscription content, but the LOCKSS system is; it had to implement custom per-site login-page detectors to determine whether what it was collecting was actual content or a login page. This is a very simple example; as the AO arms race heats up, much more complex verification techniques will be required.
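To give a flavour of what such a detector does, here is a much-simplified, generic sketch. The real LOCKSS detectors are written per publisher site and are considerably more specific; the patterns below are illustrative only.

```python
import re

# Crude heuristic: does a page that came back with a 200 status look like
# a login or paywall page rather than the article it claims to be?
LOGIN_SIGNS = [
    "type=[\"']password[\"']",           # a password field in a form
    r"\bsign\s*in\b",
    r"\binstitutional\s+login\b",
    r"\bpurchase\s+(this\s+)?article\b",
]

def looks_like_login_page(html):
    """Return True if the page matches any of the login/paywall patterns."""
    text = html.lower()
    return any(re.search(p, text) for p in LOGIN_SIGNS)

page = '<form action="/auth"><input type="password" name="pw"></form> Sign in to view this article'
print(looks_like_login_page(page))  # True: a 200 response, but not real content
```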

The big lesson of the last decade's work on digital preservation is the difficulty of finding a sustainable business model for it. A problem doesn't get a Blue Ribbon Task Force unless it is difficult, and their report doesn't solve it. Bill Bowen, someone with experience of finding archiving business models, is more pessimistic than the report:
"it has been more challenging for Portico to build a sustainable model than parts of the report suggest."
The Aggregators will be substantial pieces of Internet infrastructure and, as we have seen, they are likely to need continual R&D to combat AO techniques. They will also need a sustainable business model. As yet, I have seen no suggestions for one.

There is a particular problem in finding a sustainable Aggregator business model. The Memento designers deserve plaudits for making Aggregators transparent to the reader. As we found with the LOCKSS system, making archive services transparent to readers may be an impressive technological feat, and a triumph of discreet user interface design, but it poses business model problems. It is hard to get users to pay for an invisible service. You can't brand it, or sell ads on it, or implement pay-per-view. Of course, there are many ways to build a non-transparent Aggregator. Instead of redirecting to the preferred version, it could insert interstitial ads, or show the preferred version in a frame surrounded by ads. All of these degrade the user experience, but one or more of them may be needed.
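As a sketch of that last design choice: rather than issuing a redirect, a non-transparent Aggregator could wrap the preferred version in a page of its own. Everything below, from the ad slot to the wrapper markup, is hypothetical and not anything the Memento specification describes.

```python
# A non-transparent Aggregator could answer a TimeGate request with a wrapper
# page instead of a 302 redirect, keeping its own branding and ad space on
# screen while the preserved content loads in a frame.
def wrapped_response(memento_url, ad_html):
    return """<!DOCTYPE html>
<html>
  <body>
    <div class="aggregator-banner">%s</div>
    <iframe src="%s" style="width:100%%; height:800px; border:0"></iframe>
  </body>
</html>""" % (ad_html, memento_url)

print(wrapped_response(
    "http://web.archive.org/web/2008/http://gft.sagepub.com/",
    "<em>Your ad here</em>"))
```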

1 comment:

David. said...

The New York Times reveals a little corner of the byzantine Search Engine Optimization market. Think of the archive market with even one percent of this kind of cut-throat competition, and no equivalent of Google pushing back.