DSHR's Blog: How Much Of The Web Is Archived?

MIT's Technology Review has a nice article about Scott Ainsworth et al.'s important paper How Much Of The Web Is Archived? (readable summary here). The paper reports an important initial step in measuring the effectiveness of Web archiving, and Scott and his co-authors deserve much credit for it. Below the fold I summarize the paper and raise some caveats as to the interpretation of the results. Tip of the hat to the authors for comments on a draft of this post.

About two years ago Scott and his co-authors constructed four samples of URIs from DMOZ, Delicious, Bitly and search engines. Each sample contained 1000 URIs. For each URI in each sample they investigated the current status, and the status in a number of major Web archives. They were able to do so because these archives all supported Memento, the standard for accessing preserved Web resources. The goals were:

The primary purpose of this experiment was to estimate the percentage of all publicly-visible URIs that have archive copies available in public archives such as the Internet Archive. A secondary purpose was to evaluate the quality of Web archiving.

The results are interesting:

URIs from DMOZ and Delicious have a very high probability of being archived at least once. On the other hand, URIs from search engine sampling have about 2/3 chance of being archived and Bitly URIs just under 1/3.

However, these results need careful interpretation. At least three caveats spring immediately to mind, two of which suggest that the paper's numbers are inflated, and one of which suggests that they are deflated. None of this is to take away from the importance of this paper; it is to point out that measuring the effectiveness of Web archiving is a very hard problem. Exploiting Memento as Scott and his co-authors did is an important start.

What does the paper mean by a URI being archived?

The answer is that their Aggregator was able to find at least one Memento for the URI. Where was the Aggregator looking for Mementos? Table 6 shows the list; it includes 9 sources that everyone would agree are archives, but three sources that almost everyone would agree are not archives in any meaningful sense. These are the caches of the Yahoo, Bing and Google search engines. The paper points out:

Based on our observations, Google and Bing cached copies are kept for a maximum of one month. Yahoo provides cached version without date, we used a new technique to estimate the cached version age for Yahoo which may be several years.
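To make this definition of "archived" concrete, here is a minimal sketch of the test, assuming the Aggregator's answer arrives as a TimeMap in the link-format serialization that Memento later standardized in RFC 7089. The function names, the sample TimeMap and the example.* hosts are all hypothetical; this is not the authors' code.

```python
import re
from urllib.parse import urlsplit

# One link-value: <uri> followed by zero or more ;name="value" parameters.
# Quoted values may contain commas (datetimes do), so we never split on ",".
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def memento_uris(timemap_text):
    """Return the URIs whose rel attribute includes the 'memento' token."""
    found = []
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        if "memento" in attrs.get("rel", "").split():
            found.append(uri)
    return found

def is_archived(timemap_text, excluded_hosts=()):
    """True if at least one Memento remains after dropping excluded hosts."""
    return any(urlsplit(uri).hostname not in excluded_hosts
               for uri in memento_uris(timemap_text))

# Hypothetical TimeMap: one Memento in an archive, one in a search engine cache.
SAMPLE = '''<http://example.com/page>;rel="original",
<http://archive.example.org/web/20100101000000/http://example.com/page>;rel="first memento";datetime="Fri, 01 Jan 2010 00:00:00 GMT",
<http://cache.example.net/search?q=cache:example.com/page>;rel="last memento";datetime="Sat, 01 Jan 2011 00:00:00 GMT"'''

print(is_archived(SAMPLE))
print(is_archived(SAMPLE, excluded_hosts={"archive.example.org"}))
```

Passing the cache hosts as excluded_hosts gives the stricter reading of "archived", in which a copy held only by a search engine cache does not count.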

The three non-archives appear to contribute a significant proportion of the archive coverage. As I interpret Table 7, the proportion of original resources for which the search engine caches return at least one Memento is DMOZ 52%, Delicious 68%, Bitly 68% and SE 74%. In the authors' JCDL 2011 presentation slide 18 shows a table that I reproduce here:

	Including SE	Excluding SE
DMOZ	90%	79%
Delicious	97%	68%
Bitly	35%	16%
SE	88%	19%

Thus the proportion of all URIs in the sample for which the search engine caches are the only "archive" is DMOZ 11%, Delicious 29%, Bitly 19% and SE 69%. It is more interesting to look only at the URIs that are archived somewhere; the proportion that is archived only by the search engine caches is DMOZ 12%, Delicious 30%, Bitly 54% and SE 78%.
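The arithmetic behind these two sets of numbers can be checked with a short sketch. It uses only the percentages from the slide 18 table; the variable and function names are mine.

```python
# Percentages from slide 18: (including SE caches, excluding SE caches).
coverage = {
    "DMOZ": (90, 79),
    "Delicious": (97, 68),
    "Bitly": (35, 16),
    "SE": (88, 19),
}

def cache_only_shares(including, excluding):
    """Return (share of all URIs, share of archived URIs) held only by caches."""
    only_cache = including - excluding
    return only_cache, round(100 * only_cache / including)

for sample, (inc, exc) in coverage.items():
    of_all, of_archived = cache_only_shares(inc, exc)
    print(f"{sample}: {of_all}% of all URIs, {of_archived}% of archived URIs")
```

The first figure is the difference between the two columns; the second divides that difference by the "Including SE" column, since that column is the share of URIs archived anywhere.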

In any case, the question of whether to treat the search engine caches as archives is now moot. Michael Nelson informs me that, since the research was done, the search engines have monetized their APIs, so the Memento Aggregators can no longer use them as a source.

What does it mean for a URI to be publicly-visible?

The answer appears to be that fetching the page returns some payload with a 200 return code. The fact that an archive supplies a Memento for such a publicly-visible URI does not mean that the archive actually has useful content for that URI. I pointed out this problem in a post two years ago, using as an example the content of the defunct e-journal Graft. At least four archives claim to preserve Graft:

Portico, at http://www.portico.org/Portico/browse/access/vols.por?journalId=ISSN_15221628. The content is accessible only to readers at institutions with a current access lease from Portico. If I attempt to access the full text of a typical article I get a page that says:
You do not have access to this article.
Access to articles in Portico is available at participating institutions. If your institution does participate and you cannot access the full text of an article, try using Portico at your campus library or talk with your librarian about off-campus access to Portico. Information on participating in Portico can be found in the For Libraries section of our website.
The Koninklijke Bibliotheek, the National Library of the Netherlands, but only for their readers (e.g.):
[Graft] is protected under national copyright laws and international copyright treaties and may only be accessed, searched, browsed, viewed, printed and downloaded on the KB premises for personal or internal use by autorised KB visitors and is made available under license between the Koninklijke Bibliotheek and the publisher.
The Internet Archive, at http://web.archive.org/web/*/http://gft.sagepub.com. The content is open access, but consists of the table of contents pages, the abstracts of individual articles and, for each full-text article, a login page such as this.
CLOCKSS, at http://www.clockss.org/clockss/Graft. The content is open access, under Creative Commons licenses, and consists of (two copies of) the entire Graft web site as it was shortly before it vanished.

Thus, for each full-text article we would have four Mementos, but for most readers only one of them would point to a usable archived copy. There are two problems here, even if we assume that all parties are acting in good faith (not a safe assumption on the Internet):

Memento provides a way for archives to announce that they contain a copy of a URI as of a certain date, but not a way for them to announce that they would deliver their copy to the browser issuing the request.
The endemic practice of serving login pages at the requested URI with a 200 return code makes a significant proportion of the Web which is actually private appear public.

Preserving large numbers of pages refusing, in their various ways, to supply the requested content may be an accurate reflection of the state of the Web at that time, but it isn't a great contribution to what most people would regard as Web archiving.
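To illustrate the second problem, here is a rough sketch of how a crawler or a measurement study might flag a login page served with a 200 return code. The heuristic (treating any page with a password input field as a login wall) is my assumption, not something the paper or any archive is known to do, and it would certainly miss some cases and misfire on others.

```python
from html.parser import HTMLParser

class _PasswordFieldFinder(HTMLParser):
    """Records whether the document contains an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. a bare "type"), hence the "or".
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.has_password_field = True

def looks_like_login_page(status_code, html_text):
    """Heuristic: a 200 response whose body asks for a password."""
    if status_code != 200:
        return False
    finder = _PasswordFieldFinder()
    finder.feed(html_text)
    return finder.has_password_field

# Hypothetical responses for the same requested URI.
login = '<form><input type="text" name="user"><input type="password" name="pw"></form>'
article = "<html><body><h1>Graft</h1><p>Full text of the article.</p></body></html>"
print(looks_like_login_page(200, login))
print(looks_like_login_page(200, article))
```

A check like this only looks at what was served; it cannot tell whether an archive holding the real content would deliver it to a given reader, which is the first problem above.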

When is a URI archived?

The ingest processes of archives typically impose a delay between capturing the URI and making it available for readers to access. This is a particular problem for the Internet Archive, which is the most significant source of Mementos. They, correctly, publish Mementos only when the Wayback Machine is able to provide access to the preserved content, not when they collect it. The authors note that this delay is significant:

IA has a delay from 6-24 months between the crawling and appearance on the archive web interface [9], but the results showed that this (6-24 months) period may be longer as less than 0.1% of the IA archived versions appeared after 2008.

Thus, at least 2 years' worth of URIs are in the Internet Archive's ingest pipeline. These URIs have been preserved. They will not be lost, and will eventually become accessible. I would argue that these URIs have been "archived" in the sense that most people would use.

This is important in light of my first caveat. The SE sample is biased toward recent URIs. The URIs that the Internet Archive advertises are older, at least 2 years old. However, it also contains, but does not yet advertise, many of the URIs that the search engine caches contain and advertise. If it were again possible (or even desirable) to treat search engine caches as archives, the 78% number above for the proportion of archived URIs for which search engine caches are the only archive would be likely to decrease through time as the Internet Archive's ingest pipeline continues to pump out content.


Monday, January 7, 2013
