Thursday, November 19, 2015

You get what you get and you don't get upset

The title is a quote from Coach Junior, who teaches my elder grand-daughter soccer. It comes in handy when, for example, the random team selection results in a young lady being on the opposite team to her best friend. It came to mind when I read Kalev Leetaru's How Much Of The Internet Does The Wayback Machine Really Archive? documenting the idiosyncratic and evolving samples the Internet Archive collects of the Web, and the subsequent discussion on the IIPC mail alias. Below the fold, my take on this discussion.

Leetaru analyzed the Wayback Machine's contents using its CDX API to look at the distribution of collections of sites' home pages and wrote:
Taken together, these findings suggest that far greater understanding of the Internet Archive’s Wayback Machine is required before it can be used for robust reliable scholarly research on the evolution of the web. Historical documentation on the algorithms and inputs of its crawlers is absolutely imperative, especially the workflows and heuristics that control its archival today. One possibility would be for the Archive to create a historical archive where it preserves every copy of the code and workflows powering the Wayback Machine over time, making it possible to look back at the crawlers from 1997 and compare them to 2007 and 2015.

More detailed logging data is also clearly a necessity, especially of the kinds of decisions that lead to situations like the extremely bursty archival of or why the homepage was not archived until 2000. If the Archive simply opens its doors and releases tools to allow data mining of its web archive without conducting this kind of research into the collection’s biases, it is clear that the findings that result will be highly skewed and in many cases fail to accurately reflect the phenomena being studied.
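The CDX API that Leetaru used for this analysis is publicly queryable. As a rough sketch of the kind of tally involved (the endpoint and query parameters are the documented public CDX interface, but the helper functions and the example URL are my own illustration):

```python
import json
import urllib.request
from collections import Counter

# Public endpoint of the Wayback Machine's CDX API.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def fetch_captures(url: str):
    """Fetch CDX rows for a URL: a header row, then one row per capture.

    fl=timestamp asks only for the capture timestamp; collapse=digest
    folds runs of identical captures into one row.
    """
    query = (f"{CDX_ENDPOINT}?url={url}"
             "&output=json&fl=timestamp&collapse=digest")
    with urllib.request.urlopen(query) as resp:
        return json.load(resp)

def captures_per_year(rows) -> Counter:
    """Count captures per year, skipping the header row.

    Timestamps are 14-digit strings (YYYYMMDDhhmmss), so the first
    four characters are the year.
    """
    return Counter(row[0][:4] for row in rows[1:])
```

Calling, say, `captures_per_year(fetch_captures("example.com"))` yields a year-by-year count for one home page; plotting such counts across many sites is what exposes the bursty, idiosyncratic coverage Leetaru describes.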
Niels Brügger of Aarhus University wrote:
what we need as researchers when using web archives — the Internet Archive and any other web archive — is as much documentation as possible. Basically, we’d like to know why did this web element/page/site end up in the web archive, and how? And this documentation can be anything between documentation on the collection level down to each individual web entity. This documentation will probably have to be collected from a variety of sources, and it will probably also have to be granulated to fit the different phases of the research process: we need documentation about the collection before we start, and we may need other types of documentation as we move along. And the need for establishing documentation becomes more and more imperative as web archives grow older, because those who created the archive will not continue to be around.
Michael Neubert of the Library of Congress wrote:
Even though I have only heard from a very small sample of researchers, the message that they want to know about how and why the items in the archive were selected and made part of the archive is a clear one.
I sympathize with the researchers' wish to be on the same team as their BFF. The world would be a better place if they were, but sometimes life doesn't turn out that way. There are three big reasons why the utopia of clean, consistently collected Web archives complete with detailed metadata describing collection policies is never going to happen.

First, the reason the Internet works as well as it does is that it is a best-efforts network. There are no guarantees at the network level, and thus there are no guarantees further up the stack. You get what you get and you don't get upset. The Web is not reliable, so even if you perfectly understood the collection policy that caused the crawler to request a page, you do not know the many possible reasons the page may not be in the archive. The network may have failed or been congested, the Web server may have been down, or overloaded, or buggy, or now have a robots.txt exclusion. The crawler may have been buggy. I could go on.

You can't deduce anything from the absence of a page in the archive. So you cannot treat the samples of the Web in a Web archive as though they were the result of a consistently enforced collection policy. This should not be a surprise. Paper archives are the result of unreliable human-operated collection policies both before and during accessioning, so you cannot deduce anything from the absence of a document. You don't know whether it never existed, or existed but didn't make it across the archive's transom, or was discarded during the acquisition process.

The LOCKSS technology attempts to deal with this unreliability by making multiple independent collections of the e-journal pages it preserves, and then comparing them to identify differences. This works reasonably well for relatively small, mostly-static e-journals, but it wouldn't work for general Web archives. It multiplies the cost per page of collection, it can't deal with rapidly-changing or highly personalized sites, and resolving the cases where the comparison fails would be far too time-consuming.

Second, the Web is a moving target. Web archives are in a constant struggle to keep up with its evolution, in order that they maintain the ability to collect and preserve anything at all. I've been harping on this since at least 2012. Given the way the Web is evolving, the important question is not how well researchers can use archives to study the evolution of the Web, but whether the evolution of the Web will make it impossible to archive in the first place. And one of the biggest undocumented biases in Web archives is the result of Web archiving technology lagging behind the evolution of the Web. It is the absence of sites that use bleeding-edge technology in ways that defeat the crawlers. You get what you can get and you don't get upset.

Third, the fact is that metadata costs money. It costs money to generate, it costs money to document, it costs money to store. Web archives, and the Internet Archive in particular, are not adequately funded for the immense scale of their task, as I pointed out in The Half-Empty Archive. So better metadata means less data. It is all very well for researchers to lay down the law about the kind of metadata that is "absolutely imperative", "a necessity" or "more and more imperative" but unless they are prepared to foot the bill for generating, documenting and storing this metadata, they get what they get and they don't get upset.

The Internet Archive did not become, as I write this, the 235th most visited site on the Internet by catering to researchers. There aren't enough Web archive researchers to be noticeable in its audience. The mass audience the Wayback Machine serves doesn't care whether it has too many snapshots of "a Russian autoparts website", or whether the crawl metadata is accurate. The audience cares whether it has snapshots of the site they are looking for. And it has, enough of the time to make it far more useful than any national library's website ( is currently ranked 4650th). The Internet Archive's goal of "universal access to knowledge" makes it more like a public library than a research library.

Despite that, it is clear, for example from the talks by Andy Jackson, Jane Winters, Helen Hockx-Yu and Josh Cowls about research on the BL's UK domain crawls at the IIPC's 2015 GA, that interesting research can be done on the current crawls with their limited metadata. Would the quality of this research be increased more by spending an incremental pound on a small amount more metadata or a larger amount more data? In any case, this is an academic question. The BL's budget for Web archiving is vastly more likely to be cut than increased.


Unknown said...

Words of wisdom from someone who truly understands and accepts how messy the web is and how costly it is to try to organise it.

NielsB said...

Dear David,

Many thanks for your long blog post. Since you include my email text in your comment, I thought that I’d better comment on it.

Apparently you read my comment as if I were advocating for “the utopia of clean, consistently collected Web archives complete with detailed metadata describing collection policies”.

This is not the case. Let’s take the words of your sentence one by one:

‘clean’: researchers who have used web archives know very well that web archives are messy, for a number of reasons; but that they are messy or ‘unclean’ does not mean that it’s impossible to get information out of them of relevance for researchers — what we would like to know is just a little bit more about HOW they are messy.

‘consistently collected’: again, many researchers are well aware that web archives are not consistently collected (I have written a number of texts about this).

‘complete with detailed metadata’: well, some web archives actually have metadata (e.g. the Australian Pandora), but web archives based on bulk archiving do not; and, again, researchers know that when you scale up, metadata is not an option. But there are other ways of providing relevant documentation.

‘describing collection policies’: some — in fact most — web archives have written collection policies (or strategies, as they are also called), but the IA does not, at least not to my knowledge. Thus, on a collection level documentation exists in many web archives, but in many cases it is not made available to researchers.

And then the three reasons:
1) The internet is a best-efforts network. Nothing new here. No matter how well the collection strategies are described one never knows what gets into the web archive — so many things can go wrong (I have described all this in a book from 2005, and in publications before that).

2) The Web is a moving target. It sure is, and again this is not new, nor is it new to researchers (old publications of mine are also about this). The web is evolving and the web archiving technology is lagging behind, yes, but this does not exonerate web archives from the task of documenting all this. For a scholar it can be very valuable to know that in the period between x and y it was impossible to archive phenomenon z. If web archives know that in 2000 it was not possible to archive Flash, for instance, they should tell researchers.

3) Metadata costs money. Yes it does, and the IA is doing an outstanding job in collecting the web. But not all web archives have the same business model, some are national web archives with national funding, and one of the advantages of this business model is that it’s not impossible to get funding for improving documentation. And initiatives are going on trying to do this. As time goes by, the money not spent now on making documentation available will increase the problem, and, ultimately, diminish the usefulness of the web archive.

Most scholars using web archives know that they get what they can get and, in fact, they don't get upset. We’re quite happy with most web archives, for the simple reason that they’re the only ones we have. But if researchers could find the same collection in a web archive with no documentation, and in a well-documented web archive — guess which one they would choose?

We don’t ask for a utopia, and your three reasons do not exclude that web archives should strive to provide as much documentation about their collections as possible. As mentioned in my comment, the information exists, ranging from crawl logs, seed lists, etc. to crawler settings, curator information, and collecting policies and strategies. The task is to make this information accessible to researchers so that it is useful. And just a bit is better than nothing.

We do not ask for clean and consistent collections. What we ask for is information about why and how web archive collections are not clean and consistent. And why do we need this? We need this to make as informed choices as possible, so that we can substantiate the studies we make.


Andy Jackson said...

For what it's worth, I tried to summarise our position here:

David. said...

It turns out that Andy Jackson's summary is worth a great deal. Go read it.

David. said...

Kalev Leetaru continues to want a less messy world, more to his and other scholars' liking. But, as before, he doesn't suggest any way the inadequate budgets of web archives could be increased to fund the work he wants done, or that this would be the best use of incremental funding. If he could find incremental funding, it might have better results from his point of view if it were spent educating researchers about how to work within the limits of the available data.

David. said...

Kalev Leetaru continues his examination of the Internet Archive in The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web and figures out:

- That the Wayback Machine provides access to only a part of the Internet Archive's holdings.

- That these holdings are the result of running a wide variety of evolving collection policies in parallel, some driven by the Archive, some by external donors, some by Archive-It partners, some in reaction to world events.

- And that the result of the partial view and the range of policies is that the view presented by the Wayback Machine is much messier than scholars would like.

Welcome to the real world, in which the Archive has to sell services and accept donations to fund its activities. It would have been wonderful if, at the birth of the Web, some far-sighted billionaire had provided an archive with an endowment sufficient to ensure that every version of every Web page was preserved for posterity. Or even to run a single consistent collection policy.