Tuesday, April 23, 2013

Making Memento Successful

I gave a talk at the IIPC General Assembly on the problems facing Memento as it attempts the transition from a technology to a ubiquitous part of the Web's infrastructure. It was based on my earlier posts on Memento, my talk at the recent CNI and discussions with the Memento team, and intended to provide the background for subsequent talks from Herbert van de Sompel and Michael Nelson. Below the fold is an edited text with links to the sources.

I'm David Rosenthal from the LOCKSS program at the Stanford Libraries, although I'm actually here under the auspices of the Internet Archive. For those of you who don't know me, I'm a Londoner. I moved to the US 30 years ago to work on the Andrew project at C-MU. It was run by Jim Morris, who was very wise about what we were getting into: "We're either going to have a disaster, or a success disaster". He often added: "And I know which I prefer!". Jim was exaggerating to make two points:
  • Being a success didn't mean the problems were over, it meant a different set of problems.
  • Ignoring the problems of success was a good way of preventing it.
The rest of the first half of my time in the US was taken up with three startups, Sun Microsystems, NVIDIA and Vitria Technology, all of which IPO-ed. The second half has been taken up with the LOCKSS program, which does a very specialized kind of Web preservation dealing with e-journals and e-books.

I'm a big fan of the work the Memento team have done defining and implementing a mechanism for accessing preserved versions of Web content. We have implemented it in the LOCKSS daemon. I would love to see it become a success. But my startup experience tells me that what makes the difference between the disaster and the success disaster is not the technology so much as all the stuff that goes around the basic technology. So this talk is in two parts:
  • Disaster - things about the user experience of Memento that might prevent it being a success in the first place.
  • Success disaster - things that will become problems if Memento avoids the disasters and becomes a success.
I'm focused on problems here. I believe Herbert and Michael will talk about potential solutions to many of these problems. Developing solutions or showing that I'm wrong about the problems would make this session a success disaster.

But before I get into the problems, I want to look at what success might look like. By success, I mean that Memento becomes a normal, widely used part of the everyday Web user's experience. I have two diametrically opposed examples of services that have achieved this kind of success.

Google is a centralized service that was successful with users because it was effective at communicating the relevance of its results to a user's search. It was effective as a business because it was not transparent: it was able to insert ads and charge advertisers. And because this was a viable business model, it was able to fund the huge investment needed to supply an essential service to the entire Internet.

DNS is a distributed service that was successful with users precisely because it was transparent. Imagine ads popping up every time you resolve a host name! Because it was transparent there was no viable business model for the service of DNS resolution. But, fortunately, precisely because it was distributed, the cost of providing the service at each site was negligible. There was no need for massive spending on centralized infrastructure.

The first problem I want to point to is marketing. Memento needs either support in the client or a gateway service:
  • Support in the client means a plugin. Unlike, say, Flash, which is fairly ubiquitous, Memento doesn't provide support for a particular MIME type, so its installation isn't triggered by visiting an appropriate link. The plugin requires some awareness by the end user that they need to install it to access preserved content. Apparently only 555 users have so far installed Mementofox.
  • Similarly, using a gateway such as the BL's requires awareness that it exists and will get you to preserved content.
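For context on what the plugin or gateway does on the user's behalf, the core of Memento is datetime negotiation: the client sends an Accept-Datetime header to a TimeGate for the URI and is redirected to the closest memento. Here is a minimal sketch in Python; the TimeGate URL shown follows the Wayback Machine's pattern and is purely an illustrative assumption.

```python
import requests

# Illustrative TimeGate URL (this pattern is the Wayback Machine's; any
# Memento-compliant TimeGate behaves the same way).
TIMEGATE = "http://web.archive.org/web/http://www.bbc.co.uk/news/"

# Datetime negotiation: ask for the memento closest to a desired date.
headers = {"Accept-Datetime": "Tue, 23 Apr 2013 12:00:00 GMT"}
response = requests.get(TIMEGATE, headers=headers)

# The TimeGate redirects to the selected memento, whose response carries
# its capture time and Link relations (original, timemap, etc.).
print(response.url)
print(response.headers.get("Memento-Datetime"))
print(response.headers.get("Link"))
```

A browser without the plugin never sends Accept-Datetime, which is why either a plugin or a gateway has to do this on the user's behalf.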

The marketing problem is that the mind-share about access to preserved Web content is already occupied by the Wayback Machine. Memento's plugins and gateways are solving a problem most Web users think is already solved. The Wayback Machine is collecting and preserving the Web, so that's where they go to access preserved content. 35 unique IPs do so every second, making it the 230th most-visited site.

Now, we know that the problem isn't solved. Web archives, even the Wayback Machine, are fallible and incomplete. But to create mind-share, Memento has to deliver a large, visible value add over the Wayback Machine. And this brings us to the experience the user will encounter even after the marketing problem has been overcome. The first thing the average Memento user will notice right now is that almost all the results are in the Wayback Machine, so where is the value add?

This isn't a problem of Memento, but it is a problem for Memento. Research by Ahmed AlSum and his co-authors shows that, as usual on the Web, the contribution of archives to TimeMaps is long-tailed. Potentially, Memento can aggregate the long tail so that it delivers comparable value to the Wayback Machine, but to do so it needs to recruit a lot of archives.

What other problems will the user encounter? First, the archive may be wrong about what it contains. My canonical example of this is the defunct journal Graft. Four archives claim to preserve Graft: the Internet Archive, Portico, the KB, and CLOCKSS.

Not a Graft article
If you visit the Internet Archive's version, each article looks like this. An archive full of pages refusing to supply the requested page without payment may accurately reflect the state of the Web, but it isn't a positive user experience.

If you visit the CLOCKSS Archive's version of Graft, the same page looks like this, which is what the user wanted. The problem here is "soft 403s": web sites returning a refusal to supply content with 200 OK rather than 403 Forbidden, or perhaps 401 Unauthorized, or, probably what they really meant, 402 Payment Required.
A Graft article

It is really hard for an archive to detect that this 200 OK page isn't actually real content worth preserving. The CLOCKSS crawler got the real page, not the refusal-to-supply-content page, because we had an agreement with the publisher to allow us access. The LOCKSS system is designed to preserve subscription content, so this is an issue we deal with every day. Our code includes lots of custom, per-site login page detectors, not something that a general Web archive can handle.
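To give a flavour of what such a detector involves, here is a toy sketch in Python. It is not our actual code; the host name and the tell-tale phrases are invented for illustration.

```python
import re

# Per-site rules: each publisher words its refusal-to-supply page differently,
# so every site the archive crawls needs its own hand-written patterns.
LOGIN_PAGE_PATTERNS = {
    "graft.example.com": [
        re.compile(r"sign in to access", re.I),
        re.compile(r"purchase short-term access", re.I),
        re.compile(r"subscription required", re.I),
    ],
}

def looks_like_soft_403(host, status, body):
    """Return True if a 200 OK response is probably a refusal to supply
    content (a "soft 403") rather than the real article."""
    if status != 200:
        return False  # genuine error codes are the easy case
    return any(p.search(body) for p in LOGIN_PAGE_PATTERNS.get(host, []))

# A crawler would run this on every fetched page and decline to preserve
# pages that match.
print(looks_like_soft_403("graft.example.com", 200,
                          "<html>Please sign in to access this article.</html>"))
```

The maintenance burden is the point: every publisher needs its own list, and the lists rot as publishers redesign their pages.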

The incentives for a Web site to return the correct error code are pretty much non-existent. So this problem isn't confined to 403s; it happens for 404s too.

Even if the archive actually contains the real content you want, it may refuse to give it to you. When I visited the KB's version of Graft I got a page that said:
[Graft] is protected under national copyright laws and international copyright treaties and may only be accessed, searched, browsed, viewed, printed and downloaded on the KB premises for personal or internal use by authorised KB visitors and is made available under license between the Koninklijke Bibliotheek and the publisher.
This isn't special to the KB; most national library copyright deposit archives have similar restrictions. So there are two choices:
  • They can decline to publish their TimeMaps. This avoids frustrating users, but it increases the dominance of the Wayback Machine and obscures the extent to which the Web is being preserved.
  • They can publish their TimeMaps, in which case they degrade the experience of the vast majority of users who aren't at their national library, even those who are at some other country's national library.
There is a strong argument for even restricted access archives to publish their TimeMaps. This transparency of collection policies allows the coverage of Web preservation to be measured, and these measurements can make the case for improved collection policies and fewer access restrictions. But, returning to the marketing issue, adding a lot of archives in the long tail whose content the user cannot access doesn't help Memento add value. Only accessible content adds value.

Suppose Memento does become a success. Billions of browsers include it, millions of Web sites implement it, hundreds of Web archives implement it. The first thing that happens is that it becomes impossible to know all the archives that might have a preserved version of the URI you are looking for. You need the same kind of help in the time dimension that search engines provide in the URI dimension. Memento calls the services that guide you to the appropriate preserved version of a URI Aggregators, and they will be very important in the success of Memento.

Aggregators do two things:
  • They know where archives are. How do they do this? At present, they just know. This works because there aren't many archives. If Memento is a success, they will need to crawl the Web and find them. This Discovery mechanism was in earlier versions of the draft standard, but it's been removed. That doesn't mean it isn't needed.
  • They aggregate the TimeMaps each archive returns describing its individual content into a single composite TimeMap that the client can use to find the preserved versions of the URI it is looking for, wherever they are (a minimal sketch of this merging follows the list). The draft standard doesn't discuss Aggregators; at the protocol level requesting a TimeMap from an Aggregator is exactly the same as requesting one from an archive or a web site. That doesn't mean Aggregators aren't needed.
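Assuming each archive's TimeMap has already been fetched and parsed into (datetime, URI) pairs, the merging itself is simple; the archive names and URIs below are invented for illustration.

```python
from datetime import datetime, timezone

# Hypothetical parsed TimeMaps: (capture time, memento URI) pairs per archive.
timemaps = {
    "archive-a.example.org": [
        (datetime(2010, 6, 1, tzinfo=timezone.utc),
         "http://archive-a.example.org/web/20100601/http://example.com/"),
        (datetime(2012, 3, 15, tzinfo=timezone.utc),
         "http://archive-a.example.org/web/20120315/http://example.com/"),
    ],
    "archive-b.example.org": [
        (datetime(2011, 1, 9, tzinfo=timezone.utc),
         "http://archive-b.example.org/memento/20110109/http://example.com/"),
    ],
}

def aggregate(timemaps):
    """Merge per-archive TimeMaps into one time-ordered composite TimeMap."""
    composite = [(when, uri, archive)
                 for archive, mementos in timemaps.items()
                 for when, uri in mementos]
    composite.sort(key=lambda entry: entry[0])
    return composite

for when, uri, archive in aggregate(timemaps):
    print(when.isoformat(), archive, uri)
```

The hard parts are everything around this merge: knowing which archives exist, fetching their TimeMaps fast enough, and trusting what they claim.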
What problems will arise for these Aggregators? There are two architectures for centralized Aggregators:
  • The current Aggregator polls each archive on each user URI request, modulo some help from caching. This simply isn't going to work if there are a large number of archives; cache hit rates aren't going to be high enough.
  • An Aggregator could hold an index of every URI in every archive, together with a DateTime for each. Just for the Internet Archive alone this index would be 25TB of data (a rough back-of-envelope follows the list), so the only way this could survive success would be to have an enormously successful business model to fund the necessary infrastructure.
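The 25TB figure is of the right order. Here is a rough back-of-envelope using my own assumed numbers (roughly 240 billion Wayback Machine captures and on the order of 100 bytes per URI-plus-DateTime index entry), not anyone's published design.

```python
# Back-of-envelope only; both inputs are assumptions for illustration.
captures = 240e9         # approximate Wayback Machine captures, circa 2013
bytes_per_entry = 100    # URI + DateTime + index overhead per capture

index_size_tb = captures * bytes_per_entry / 1e12
print(f"{index_size_tb:.0f} TB")  # about 24 TB, in line with the ~25TB figure
```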
Is there a viable architecture for a distributed, transparent Aggregator? That's a good question.

A book everyone thinking about Disaster vs. Success Disaster should read is Increasing Returns and Path Dependence in the Economy by W. Brian Arthur. He points out that markets with increasing returns to scale are captured by a single player, or are at most an oligopoly of a few large players. The market for Aggregation probably isn't big enough for an oligopoly, so if we go with a centralized architecture there will be one Aggregator. If Memento is a success, running this Aggregator will be expensive, so the Aggregator will need to monetize user requests for URIs. It can't do that if Memento is transparent, because there is no place to put the ads. And if it tries to insert interstitial ads, you can pretty much guarantee that Memento isn't going to be a big enough success to matter.

This part is based on bitter experience. The original design for the LOCKSS system (PDF) made it completely transparent. It was a really neat exploitation of the Web infrastructure. But it proved impossible to get people to pay for a service that didn't deliver visible value but instead "just worked" to make access to Web content better. We had to implement a non-transparent access model in order to get people to pay us, and the result is that LOCKSS content isn't accessed nearly as much as it could be.

There is a further problem. Running a Web archive at scale is an expensive business too. Even not-for-profit free archives like the Internet Archive actually monetize accesses by showing funders how much their content is used. But now the Aggregator wants to monetize exactly the same accesses, so it is competing with its own data sources. Normally a hard trick to pull off, but we are assuming that Memento is a success and that there is only one Aggregator, so it is the source of the majority of the traffic for the archives. This gives it some negotiating leverage with the archives.

This situation is ripe for abuse, as we see in the Search Engine Optimization world. And Memento has no real defenses against abuse. For example, BlackArchive.org can spam the TimeMap by claiming to have a version of the URI every minute since the Web began and capture all but a tiny fraction of the traffic to serve it "Simple Trick" ads. How is the Aggregator going to know the archive's claim is wrong? After all, the BBC website used to say "updated every minute of every day". And maybe the archive is actually a content management system that really can reproduce its state at any moment in the past.
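To get a sense of the scale of such spam, here is a quick calculation; the start date for "since the Web began" is my assumption.

```python
from datetime import datetime, timezone

# Assumed span: from roughly the Web's public debut to the date of this talk.
start = datetime(1991, 8, 6, tzinfo=timezone.utc)
end = datetime(2013, 4, 23, tzinfo=timezone.utc)

bogus_mementos = int((end - start).total_seconds() // 60)
print(f"{bogus_mementos:,} claimed mementos for a single URI")  # roughly 11.4 million
```

Eleven million plausible-looking entries for one URI would swamp the handful of genuine mementos in the composite TimeMap.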

Worse, the result of abuse is a need for the search engine, or the Aggregator in our case, to engage in a costly, never-ending arms race with the black hats. This pushes up the cost of running the one true Aggregator significantly.

Google, social media, and other Web services depend on establishing reputations for the entities they deal with and providing each user with results in reputation order. The Aggregator is hampered in doing this in many ways:
  • It doesn't know whether any of the versions it supplied was accessed, or even whether the requesting user could have accessed them, because the TimeMap points to the archive not back to the Aggregator.
  • It doesn't know whether the content, if it was accessed, was useful. Memento is transparent, so there is no place to put +1/-1 buttons. So, for example, it can't get feedback that some archive is full of "soft 403s".
  • It is hard to see how the Aggregator, even if it did have reputation knowledge, could signal it to the user. The user almost certainly wants results in time order, so even if some protocol were developed for transmitting reputation along with the entries in the TimeMaps, there is a UI problem in presenting this information to the user.
Although the BL's gateway doesn't work this way, a gateway service could be designed to frame the preserved content, similar to the way the Wayback Machine inserts a navigation bar. Such an implementation could avoid many of these issues by being a lot less transparent and more intrusive.
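Purely as an illustration of that less transparent design, and not a description of any existing gateway, here is a sketch using Flask (my choice of framework) that keeps the user on the gateway's own pages, where feedback buttons or ads could live:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/view")
def view_memento():
    # The gateway frames the archive's content under its own banner, so it
    # can add navigation, +1/-1 feedback buttons, or advertising.
    # (A real service would validate and escape the URI parameter.)
    memento_uri = request.args.get("uri", "")
    return f"""
    <div style="border-bottom:1px solid #ccc;padding:4px">
      Viewing an archived page. Was this useful?
      <button>+1</button> <button>-1</button>
    </div>
    <iframe src="{memento_uri}" style="width:100%;height:90vh;border:0"></iframe>
    """

if __name__ == "__main__":
    app.run()
```

The cost is exactly the intrusiveness that transparency was meant to avoid.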

I think the Memento team has been wise to focus the draft standard so that it describes only a mechanism by which a client can access preserved versions of an individual URI.

Nevertheless, if Memento is to be the kind of success I would like it to be, it needs ways to deal with the meta-level issues around discovery and aggregation, and the business and scale issue of dealing with the real world of the Web.

2 comments:

Anonymous said...

Aggregators need to know not only that a repo holds an item, but if there are access restrictions on it that may impede access.

There should not only be the choice "publish TimeMap" or "don't publish TimeMap". There needs to be a way to publish the TimeMap, with metadata, probably on an item-by-item basis, saying that access is restricted.

This problem is endemic in the realm of libraries trying to deal with scholarly publishing, especially with regard to aggregation -- it's not unique to memento/timemap.

I run into it with OAI-PMH. The OAISter archive started out with promise, but became nearly useless for many potential uses (especially automated ones), because there is no way for a software user agent to tell if a given item is actually going to be accessible to the user. The user just has to click on it (and then maybe get a paywall with a 200 HTTP response, yeah, definitely a related issue).

Aggregators, whether of 'live' content like OAI-PMH, or 'archival' content like memento -- need to know if the content is public access or may be restricted.

We can imagine some controlled vocabulary for this, and it gets tricky expressing all the possibilities in a machine-accessible way. But to begin with, a HUGE step forward would simply be a controlled vocabulary with two possible values: "Publicly accessible to anyone", and "Restricted access to only certain users."

If the TimeMap protocol lacks a way to advertise this sort of thing, that probably needs to be corrected. As ideally it would be in OAI-PMH too. Of course, then there's the difficult challenge of getting content providers to actually _provide_ this 'access restriction' value where appropriate, and to make sure the relevant software supports providing it, etc. But if there isn't even a standardized way to provide this info supported by the standards, you're lost from the start.

David. said...

The problem with putting this talk on my blog is that it was only one of three talks in the session. Mine was specifically about the problems with Memento as it currently exists, not about potential solutions.

Right now, an archive with restricted access has only two choices: publish or don't publish its TimeMaps. Something needs to be done about restricted access content, but that is only one of the problems that needs to be addressed, and kludging stuff into the current clean and simple protocol for accessing content to address each in turn is almost certainly a bad idea. The URI-level protocol needs to be accompanied by a meta-level protocol to address these problems. Designing the meta-level protocol is one of the necessary next steps.
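Purely as a hypothetical illustration of the kind of item-level annotation such a meta-level protocol might carry (the attribute name and its values below are invented, not part of any draft), consider a composite TimeMap that tags each memento with its access status:

```python
# Hypothetical only: neither the "access" attribute nor its vocabulary exists
# in the draft standard; this just illustrates item-level access metadata.
composite_timemap = [
    {"memento": "http://web.archive.org/web/20080514/http://graft.example.com/",
     "datetime": "Wed, 14 May 2008 00:00:00 GMT",
     "access": "public"},        # accessible to anyone
    {"memento": "http://deposit.example.nl/memento/20080514/http://graft.example.com/",
     "datetime": "Wed, 14 May 2008 00:00:00 GMT",
     "access": "restricted"},    # e.g. on-premises only at a national library
]

# A client or Aggregator could then rank or filter by accessibility.
accessible = [m for m in composite_timemap if m["access"] == "public"]
print(len(accessible), "of", len(composite_timemap), "mementos are publicly accessible")
```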