Tuesday, March 5, 2013

Re-thinking Memento Aggregation

A bit more than two years ago, in my second post about Memento, I described some issues with its concept of Aggregators. These are the search-engine-like services that guide browsers to preserved content. As part of the work we are doing to enhance the LOCKSS daemon software under a grant from the Mellon Foundation, we have implemented the basic Memento mechanisms, so I'm now having to face some of these issues.

I have come to believe that the problems with the Aggregator concept are more fundamental than I originally described, and require a significant re-think. Below the fold I set out my view of the problems, and an outline of my proposed solution.

My list of problems with the current concept of Aggregators is:
  • Conditional Access: In my earlier post, and again more recently, I pointed out that even if we assume that an archive is correct in announcing that it contains a valid copy of the resource at a particular URL at a particular time, that does not imply that it is willing to satisfy a browser's request for that copy. The canonical example I use in these posts is the four archives (CLOCKSS, Internet Archive, Portico and the Dutch KB) preserving the content of the defunct journal Graft. The first two will deliver their copy to any browser, whereas the last two will deliver content only to Portico subscribers or to browsers on the KB's internal network, respectively. In the current model, an Aggregator responding to a request from a browser has no way of knowing whether an archive will deliver content to that browser, so it must always include every archive that reports a copy. Thus readers will continually encounter archives refusing to satisfy their requests, which is not a way to encourage use of Memento.
  • "Soft 403": In the Graft example, the Internet Archive's copy consists of a login page refusing to supply the actual content, which is not useful. The Aggregator cannot know whether the reader found the content to which it was redirected useful. Even if it did know, it is constrained to return results in time order, not preference order, so unlike search engines it has no way to communicate this knowledge to the browser.
  • Scale: The two alternatives for Aggregators in the current model are either to poll each archive on each request, which won't scale well, or to maintain and search an index of metadata about every preserved version of every URI in every archive. For each version of each URI an Aggregator needs to store a DateTime (say 8 bytes) and a URI (say on average 77 bytes). For the Internet Archive's quarter-trillion URI instances alone this would be roughly 21TB of data (see the calculation after this list). Clearly, building, operating and updating an Aggregator at this scale is a major undertaking.
  • Business Model: Aggregators, like search engines, have increasing returns to scale, so the market for aggregation will be captured by one large service. Given the costs for bandwidth, hardware, power, cooling, staff and so on involved in running a service at this scale there has to be a viable business model to support it. No-one is going to run something at this scale just for fun. So the current Aggregator model implies some way in which accesses to preserved content are monetized by the Aggregator as well as potentially by the archive. The whole point of the Memento model, and one of its most valuable features, was that it made access to preserved content transparent to the reader. No-one that I know of has come up with a way to monetize transparency. Certainly the LOCKSS Program found that being completely transparent was not conducive to economic sustainability.
  • Abuse: As I pointed out, the analogy between search engines and Aggregators strongly suggests the potential for abuse and outright fraud as Aggregators and archives seek to monetize accesses. In the search engine space combating abuse has become an intense and expensive arms race, raising the bar for a sustainable Aggregator business model.
  • Reputation: The techniques for combating abuse depend on developing a reputation for the information sources. Just as with the conditional access and "soft 403" problems, because the Aggregator has to supply its results in time order, and has no way to know whether the reader obtained any content, or found it satisfactory, it is in a poor position to perform ranking calculations to improve the quality of its service and combat abuse.
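
To make the Scale point concrete, here is a back-of-the-envelope calculation in Python; the quarter-trillion capture count and the per-entry byte sizes are simply the round numbers assumed in that bullet, not measured values.

    # Rough size of a centralized Aggregator index holding one entry per
    # preserved capture, using the round numbers from the Scale bullet.
    CAPTURES = 250e9        # ~ a quarter-trillion URI instances at the Internet Archive
    DATETIME_BYTES = 8      # one DateTime per capture
    URI_BYTES = 77          # assumed average URI length

    index_bytes = CAPTURES * (DATETIME_BYTES + URI_BYTES)
    print(f"Per-capture index: {index_bytes / 1e12:.1f} TB")  # ~21 TB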
These problems all arise because in the current model the responsibility for merging the TimeMaps from the individual archives into a unified time order, and redirecting to the most suitable version, lies with the Aggregator. This is the wrong place for it to be; the Aggregator has to provide a one-size-fits-all response because it doesn't have information about the browser's view of the world.

The problem here is that the Memento protocol operates only at the level of individual URIs. Consider instead a model in which the archives and the Aggregator can also supply hints, allowing the merge and the redirection to happen in the browser:
  • If the server for the URL in question knows about a TimeGate for its content, the mechanism works as before.
  • Otherwise, the browser redirects to an Aggregator requesting the URI with a DateTime.
  • The Aggregator does not maintain an index of versions of URIs or redirect to a preferred version of the URI. Instead, it returns a TimeMap-like hint that lists the TimeGates of the archives that have at least some content from a prefix of the URI, with the earliest and latest DateTime of this content. TimeGates whose content time range does not include the requested DateTime are excluded.
  • This index is built from similar TimeMap-like hints provided to the Aggregator by the archives, via a suitable request. The prefix will normally be just the host part of the URI, although for some large sites it will also include some leading components of the path.
  • The browser receives the hint list from the Aggregator. Based on its own experience of requesting content from archives, it filters the list to exclude those which refuse to supply content and those whose content, such as login pages, is not useful. It then requests actual, not hint, TimeMaps for the specific URI from the remaining archives, merges them, and redirects to the most appropriate version (a sketch of this browser-side logic follows the list).
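Here is a minimal Python sketch of the browser-side logic this list describes. The hint record layout and all the names in it (archive, timegate, earliest, latest, filter_hints, pick_memento, and the example domains) are invented for illustration; a real deployment would presumably express hints as TimeMap-style link relations and fetch actual TimeMaps over HTTP.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class Hint:
        """One entry in the Aggregator's hint list (field names are illustrative)."""
        archive: str        # which archive the hint refers to
        timegate: str       # that archive's TimeGate URI
        earliest: datetime  # oldest capture the archive reports for the URI prefix
        latest: datetime    # newest capture

    @dataclass
    class Memento:
        """One entry in an archive's real TimeMap for a specific URI."""
        uri: str
        when: datetime

    def filter_hints(hints, wanted, blocked_archives):
        """Keep hints whose time range covers the requested DateTime and whose
        archive has not previously refused this browser or served it junk."""
        return [h for h in hints
                if h.earliest <= wanted <= h.latest
                and h.archive not in blocked_archives]

    def pick_memento(timemaps, wanted):
        """Merge the per-archive TimeMaps and pick the capture closest in time
        to the requested DateTime (one simple selection policy)."""
        merged = [m for tm in timemaps for m in tm]
        return min(merged, key=lambda m: abs((m.when - wanted).total_seconds()),
                   default=None)

    # The browser wants http://graft.example/ as of June 2005 and has learned
    # from experience that "subscribers-only.example" refuses to serve it.
    wanted = datetime(2005, 6, 1, tzinfo=timezone.utc)
    hints = [
        Hint("open-archive.example", "http://open-archive.example/timegate/",
             datetime(1998, 1, 1, tzinfo=timezone.utc),
             datetime(2010, 1, 1, tzinfo=timezone.utc)),
        Hint("subscribers-only.example", "http://subscribers-only.example/tg/",
             datetime(2000, 1, 1, tzinfo=timezone.utc),
             datetime(2012, 1, 1, tzinfo=timezone.utc)),
    ]
    usable = filter_hints(hints, wanted, blocked_archives={"subscribers-only.example"})

    # The browser would now fetch the real TimeMap for the specific URI from
    # each usable TimeGate; a faked one stands in here to show the final step.
    timemap = [Memento("http://open-archive.example/2004/http://graft.example/",
                       datetime(2004, 11, 3, tzinfo=timezone.utc)),
               Memento("http://open-archive.example/2006/http://graft.example/",
                       datetime(2006, 2, 14, tzinfo=timezone.utc))]
    best = pick_memento([timemap], wanted)  # the capture closest to June 2005

The point of the sketch is that the filtering state (blocked_archives) lives in the browser, the only party that knows which archives have refused it content or served it junk.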
Here is some data from the Internet Archive. The underlying table has three columns: the year (1995 to 2010), the number of top-level hosts, and the average number of unique captures, where a unique capture is a unique [url, checksum] tuple. For 2010 there are about 56M top-level hosts, and on average the Internet Archive holds about 275 URLs per host. If we plausibly assume that the overlap among the hosts for each year is close to 100%, the hint database would be about 15GB, or more than 1000 times smaller than the Aggregator's index for the same collection.
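
For comparison with the earlier calculation, here is one plausible way to arrive at a hint database of roughly that size; the assumption that a handful of archives each contribute one [prefix, earliest, latest] record per host is mine, not part of the data above.

    # Rough size of a per-host hint index (assumed layout, see above).
    HOSTS = 56e6           # ~top-level hosts at the Internet Archive in 2010
    ARCHIVES = 3           # assume a handful of archives each report every host
    PREFIX_BYTES = 77      # reuse the average URI length for the host prefix
    RANGE_BYTES = 2 * 8    # earliest and latest DateTime per entry

    hint_bytes = HOSTS * ARCHIVES * (PREFIX_BYTES + RANGE_BYTES)
    print(f"Per-host hint index: {hint_bytes / 1e9:.0f} GB")  # ~16 GB vs ~21 TB above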

The advantages of this model for Aggregators include:
  • An Aggregator's demand for storage and computation is reduced by a factor of something like 1000, transforming an Aggregator into an easily manageable service that would not require obtrusive monetization to be sustainable.
  • An Aggregator's demand for bandwidth is reduced because it needs to receive much less data to stay up-to-date.
  • An Aggregator's demand for bandwidth is reduced because the data transmitted in response to browser requests is smaller.
  • An Aggregator sees many fewer requests. Browsers will typically request many preserved URLs from the same host in sequence. The browser can cache the response from the Aggregator and apply the hints it contains to every request for a URL from that host. A browser cache of this kind will have a high hit rate and will thus greatly reduce the load on, and the bandwidth demand of, the Aggregator (a minimal cache sketch follows this list).
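Here is a minimal sketch of such a per-host hint cache; the time-to-live and the host-only keying are illustrative assumptions, not part of the proposal.

    import time
    from urllib.parse import urlsplit

    class HintCache:
        """Cache the Aggregator's hint response per host, so repeated requests
        for preserved URLs on the same host never go back to the Aggregator."""

        def __init__(self, ttl_seconds=3600):
            self.ttl = ttl_seconds
            self.entries = {}                       # host -> (expires_at, hints)

        def get(self, url):
            host = urlsplit(url).hostname
            entry = self.entries.get(host)
            if entry and entry[0] > time.time():
                return entry[1]                     # hit: reuse the cached hints
            return None                             # miss: ask the Aggregator

        def put(self, url, hints):
            host = urlsplit(url).hostname
            self.entries[host] = (time.time() + self.ttl, hints)

    # On a miss the browser queries the Aggregator once; every further
    # preserved URL on the same host is then answered from the cache.
    cache = HintCache()
    if cache.get("http://www.example.com/page/1") is None:
        cache.put("http://www.example.com/page/1", ["...hint records..."])
    assert cache.get("http://www.example.com/page/2") is not None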
The advantages of this model for the reader include:
  • Their browser can accumulate and apply knowledge of the individual reader's situation and preferences. The preserved resource to which the reader is eventually redirected is much more likely to be useful; the reader is much less likely to encounter a refusal to supply content, or bogus content masquerading as the real thing.
  • The scope for multiple competing Aggregators is much greater; browsers can request and merge hints from all the Aggregators they know about.
To sum up, the two alternative models for Aggregators have analogies with two successful Internet services:
  • The current Aggregator model is analogous to Google, a centralized service. Google would never have succeeded had it been transparent to the user, as Memento aims to be. Google succeeded because it was good at ranking search results in order of usefulness to the reader, something the current Aggregator can't do.
  • The alternative model is analogous to the Domain Name System, a distributed service which gradually directs enquiries to a place they can be satisfied. DNS would never have succeeded had it been non-transparent. Imagine if every access to a DNS server popped up an advertisement and waited for you to click on it before proceeding to the next name server.
I am discussing these issues with the Memento team, and we are agreed that further development of support for both centralized and distributed aggregation is needed. Indeed, as I pointed out in my first post on Memento, aggregation also requires a discovery mechanism, and the sections on discovery in the original draft have been removed from the current draft as being premature.

1 comment:

Chris Adams said...

We recently had a Twitter conversation about using some flavor of DHT to distribute this kind of data – perhaps either adapting something like CoralCDN or using a generic data store like Cassandra.

One of the ideas I've been intrigued by is whether you could simply use BitTorrent to basically publish the list of domains archived by a particular organization, making it trivial to distribute very large lists and allow new players to volunteer their content without a gatekeeper. Obviously you'd still need to tackle the question of trustworthiness, but something as simple as a ZIP file containing a text file listing domains and a PGP signature would make it easy to allow users to decide who to trust and still make it very straightforward to protect against a single point of failure. It'd also have the nice characteristic of making minimal demands on the implementation stack and bringing a lot of battle-tested network code, with significant precedent for browser integration (e.g. Opera), which would be really nice for closing the final loop by allowing clients to deeply integrate Memento support.

(There's some precedent for deeper integration into a server stack, too: Etsy famously uses BitTorrent to distribute Solr indices more efficiently: http://codeascraft.com/2012/01/23/solr-bittorrent-index-replication/ This could be really interesting from the perspective of how quickly one could bring a new TimeGate server online.)