Leetaru analyzed the Wayback Machine's contents using its CDX API to look at the distribution of collections of sites' home pages and wrote:
Taken together, these findings suggest that far greater understanding of the Internet Archive’s Wayback Machine is required before it can be used for robust reliable scholarly research on the evolution of the web. Historical documentation on the algorithms and inputs of its crawlers are absolutely imperative, especially the workflows and heuristics that control its archival today. One possibility would be for the Archive to create a historical archive where it preserves every copy of the code and workflows powering the Wayback Machine over time, making it possible to look back at the crawlers from 1997 and compare them to 2007 and 2015.
More detailed logging data is also clearly a necessity, especially of the kinds of decisions that lead to situations like the extremely bursty archival of savy.lt or why the CNN.com homepage was not archived until 2000. If the Archive simply opens its doors and releases tools to allow data mining of its web archive without conducting this kind of research into the collection’s biases, it is clear that the findings that result will be highly skewed and in many cases fail to accurately reflect the phenomena being studied.
Niels Brügger of Aarhus University wrote:
what we need as researchers when using web archives — the Internet Archive and any other web archive — is as much documentation as possible. Basically, we’d like to know why did this web element/page/site end up in the web archive, and how? And this documentation can be anything between documentation on the collection level down to each individual web entity. This documentation will probably have to be collected from a variety of sources, and it will probably also have to be granulated to fit the different phases of the research process, we need documentation about the collection before we start, and we may need other types of documentation as we move along. And the need for establishing documentation becomes more and more imperative as web archives grow older, because those who created the archive will not continue to be around.
Michael Neubert of the Library of Congress wrote:
Even though I have only heard from a very small sample of researchers, the message that they want to know about how and why the items in the archive were selected and made part of the archive is a clear one.
I sympathize with the researchers' wish to be on the same team as their BFF. The world would be a better place if they were, but sometimes life doesn't turn out that way. There are three big reasons why the utopia of clean, consistently collected Web archives complete with detailed metadata describing collection policies is never going to happen.
First, the reason the Internet works as well as it does is that it is a best-efforts network. There are no guarantees at the network level, and thus there are no guarantees further up the stack. You get what you get and you don't get upset. The Web is not reliable, so even if you perfectly understood the collection policy that caused the crawler to request a page, you do not know the many possible reasons the page may not be in the archive. The network may have failed or been congested, the Web server may have been down, or overloaded, or buggy, or now have a robots.txt exclusion. The crawler may have been buggy. I could go on.
You can't deduce anything from the absence of a page in the archive. So you cannot treat the samples of the Web in a Web archive as though they were the result of a consistently enforced collection policy. This should not be a surprise. Paper archives are the result of unreliable human-operated collection policies both before and during accessioning, so you cannot deduce anything from the absence of a document. You don't know whether it never existed, or existed but didn't make it across the archive's transom, or was discarded during the acquisition process.
The LOCKSS technology attempts to deal with this unreliability by making multiple independent collections of the e-journal pages it preserves, and then comparing them to identify differences. This works reasonably well for relatively small, mostly-static e-journals, but it wouldn't work for general Web archives. It multiplies the cost per page of collection, it can't deal with rapidly-changing or highly personalized sites, and resolving the cases where the comparison fails would be far too time-consuming.
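To make the idea of comparing independent collections concrete, here is a minimal sketch in Python. It is not the LOCKSS protocol, which is far more elaborate. It only illustrates the general approach of hashing each independently collected copy of a page and looking for a majority. The function names, the collector names and the strict-majority rule are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Tuple


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying one collected copy."""
    return hashlib.sha256(content).hexdigest()


def compare_copies(copies: Dict[str, bytes]) -> Tuple[Optional[str], List[str]]:
    """Compare independently collected copies of one URL.

    `copies` maps a (hypothetical) collector name to the bytes it collected.
    Returns (agreed_hash, dissenters). If no strict majority of collectors
    agree, agreed_hash is None and every collector is listed, meaning the
    comparison failed and would need some other form of resolution.
    """
    if not copies:
        return None, []
    hashes = {name: fingerprint(body) for name, body in copies.items()}
    top_hash, top_count = Counter(hashes.values()).most_common(1)[0]
    if top_count * 2 <= len(copies):
        # No strict majority: the copies cannot vouch for one another.
        return None, sorted(hashes)
    dissenters = sorted(name for name, h in hashes.items() if h != top_hash)
    return top_hash, dissenters


# A mostly-static page: two of three collectors agree, one copy differs.
static_page = {
    "box_a": b"<html>Vol. 1</html>",
    "box_b": b"<html>Vol. 1</html>",
    "box_c": b"<html>Vol. 1 </html>",
}
agreed, dissenters = compare_copies(static_page)

# A personalized page: every collector sees different bytes.
personalized_page = {
    "box_a": b"Hello, visitor 1",
    "box_b": b"Hello, visitor 2",
    "box_c": b"Hello, visitor 3",
}
no_agreement, unresolved = compare_copies(personalized_page)
```

In the first case the odd copy out is easy to spot. In the second no two copies match, which is the situation a rapidly-changing or personalized site produces on every fetch, and each collector has paid the full cost of fetching the page.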
Second, the Web is a moving target. Web archives are in a constant struggle to keep up with its evolution, in order that they maintain the ability to collect and preserve anything at all. I've been harping on this since at least 2012. Given the way the Web is evolving, the important question is not how well researchers can use archives to study the evolution of the Web, but whether the evolution of the Web will make it impossible to archive in the first place. And one of the biggest undocumented biases in Web archives is the result of Web archiving technology lagging behind the evolution of the Web. It is the absence of sites that use bleeding-edge technology in ways that defeat the crawlers. You get what you can get and you don't get upset.
Third, the fact is that metadata costs money. It costs money to generate, it costs money to document, it costs money to store. Web archives, and the Internet Archive in particular, are not adequately funded for the immense scale of their task, as I pointed out in The Half-Empty Archive. So better metadata means less data. It is all very well for researchers to lay down the law about the kind of metadata that is "absolutely imperative", "a necessity" or "more and more imperative" but unless they are prepared to foot the bill for generating, documenting and storing this metadata, they get what they get and they don't get upset.
The Internet Archive did not become, as I write this, the 235th most visited site on the Internet by catering to researchers. There aren't enough Web archive researchers to notice in their audience. The mass audience the Wayback Machine serves doesn't care whether it has too many snapshots of "a Russian autoparts website", or whether the crawl metadata is accurate. The audience cares whether it has snapshots of the site they are looking for. And it has, enough of the time to make it far more useful than any national library's website (loc.gov is currently ranked 4650th). The Internet Archive's goal of "universal access to knowledge" makes it more like a public library than a research library.
Despite that, it is clear, for example from the talks by Andy Jackson, Jane Winters, Helen Hockx-Yu and Josh Cowls about research on the BL's UK domain crawls at the IIPC's 2015 GA, that interesting research can be done on the current crawls with their limited metadata. Would the quality of this research be increased more by spending an incremental pound on a small amount more metadata or a larger amount more data? In any case, this is an academic question. The BL's budget for Web archiving is vastly more likely to be cut than increased.