- Part 3 will assess how practical some suggested improvements might be.
- Part 4 will look in some detail at the Web DRM problem introduced in Part 1.
- Part 5 will discuss a "counsel of despair" approach that I've hinted at in the past.
In my talk at the Fall 2014 CNI, based on The Half-Empty Archive, I noted some estimates of how likely a given URI was to be archived:
- The Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
- Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
  - tried to access the URL from the live Web.
  - used Memento to ask the major Web archives whether they had at least one copy of that URL.
  Their results are somewhat difficult to interpret, but for their two more random samples they report:
  "URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3."
I then pointed out a number of reasons why these estimates were likely to be optimistic. So, back in 2014 more than, probably much more than, half the Web wasn't archived. The fundamental reason why so little was archived was cost, and in particular the cost of ingesting Web content, which is the largest single component of Web archiving cost.
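The archive-lookup step of the Ainsworth et al. procedure can be sketched roughly as follows. This is a minimal illustration, not their code: the function names are hypothetical, the aggregator prefix is an assumed endpoint that may not be reachable, and the test for "at least one copy" is a crude string match on a link-format TimeMap. The live-Web check is left out.

```python
import urllib.request
from typing import Callable, Iterable

# Assumption: a Memento aggregator that serves link-format TimeMaps at
# <prefix><uri>. This prefix is illustrative and may not be reachable.
AGGREGATOR_TIMEMAP_PREFIX = "http://timetravel.mementoweb.org/timemap/link/"


def has_memento(uri: str, prefix: str = AGGREGATOR_TIMEMAP_PREFIX,
                timeout: float = 30.0) -> bool:
    """Return True if the aggregator's TimeMap lists at least one memento."""
    try:
        with urllib.request.urlopen(prefix + uri, timeout=timeout) as resp:
            timemap = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # HTTP errors (such as 404) and network failures are counted as
        # "not archived", which can undercount.
        return False
    # Crude check: memento links carry rel values such as "memento",
    # "first memento" or "last memento", all of which end in memento".
    return 'memento"' in timemap


def archived_fraction(uris: Iterable[str],
                      is_archived: Callable[[str], bool] = has_memento) -> float:
    """Fraction of the sampled URIs for which is_archived() returns True."""
    sample = list(uris)
    if not sample:
        raise ValueError("empty sample")
    hits = sum(1 for uri in sample if is_archived(uri))
    return hits / len(sample)
```

Passing a stub for `is_archived` (for example `lambda uri: uri in known_archived`) exercises the estimate without touching the network. The result is only as good as the randomness of the sample, which is one reason the published figures are hard to interpret.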
I've written many times about one of the major reasons for the high cost of ingest, for example in a 2011 post inspired by a concert Roger McNamee's band Moonalice played in Palo Alto's Rinconada Park:
Last month Kalev Leetaru added a third installment to his accurate but somewhat irritating series of complaints about the state of Web archiving:
- How Much Of The Internet Does The Wayback Machine Really Archive? showed that the Internet Archive's Wayback Machine contained a rather strange selection of Web pages, and called for better metadata to explain why.
- The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web "revealed" that the Internet Archive runs many different collection policies, contributing to the strangeness.
- Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web describes how the evolution of the Web has made simple crawling techniques ineffective.
Web archives, and the Internet Archive in particular, are not adequately funded for the immense scale of their task, ... So better metadata means less data. It is all very well for researchers to lay down the law about the kind of metadata that is "absolutely imperative", "a necessity" or "more and more imperative" but unless they are prepared to foot the bill for generating, documenting and storing this metadata, they get what they get and they don't get upset.
John Berlin at Old Dominion University has a fascinating detailed examination of why CNN.com has been unarchivable since November 1st, 2016:
"CNN.com has been unarchivable since 2016-11-01T15:01:31, at least by the common web archiving systems employed by the Internet Archive, archive.is, and webcitation.org. The last known correctly archived page in the Internet Archive's Wayback Machine is 2016-11-01T13:15:40, with all versions since then producing some kind of error (including today's; 2017-01-20T09:16:50). This means that the most popular web archives have no record of the time immediately before the presidential election through at least today's presidential inauguration."
The TL;DR is that:
Clearly, scaling this kind of detailed analysis to all major Web sites that use CDNs (content delivery networks) is infeasible. Among these Web sites are the major academic publishing platforms, so the problem Berlin describes is very familiar to the LOCKSS team. It isn't as severe for us; there are a limited number of publishing platforms and changes they make tend to affect all the journals using them. But it still imposes a major staff cost.
It is important to note that there are browser-based Web archiving technologies that are capable of collecting CNN in a replay-able form:
"the web archiving community is still stuck in a quarter-century-old mindset of how the web works and has largely failed to adapt to the rapidly evolving world of video, social media walled gardens, dynamic page generation and the mobile web"
This is ludicrous. The LOCKSS team published our work on crawling AJAX-based academic journals in 2015, and it started more than three years earlier. Even then, we were building on work already done by others in the Web archiving community.
The reason these efforts haven't been more widely applied isn't because Web archives are "stuck in a quarter-century-old mindset". It's because applying them costs money.
The Internet Archive's budget is in the region of $15M/yr, about half of which goes to Web archiving. The budgets of all the other public Web archives might add another $20M/yr. The total worldwide spend on archiving Web content is probably less than $30M/yr, for content that cost hundreds of billions to create. The idea that these meager resources would stretch to archiving sites the size of YouTube or Facebook to Leetaru's satisfaction is laughable.
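As a back-of-the-envelope check on those figures, the arithmetic can be written out. The variable names are mine, and the $100B creation cost is an assumed lower bound standing in for "hundreds of billions", so the resulting ratio is if anything too generous to the archives.

```python
# Figures from the paragraph above, in US dollars per year.
IA_BUDGET = 15e6              # Internet Archive total, roughly
IA_WEB_SHARE = 0.5            # "about half" goes to Web archiving
OTHER_ARCHIVES = 20e6         # all other public Web archives combined

# Assumption: $100B as a lower bound for "hundreds of billions".
CREATION_COST_LOWER_BOUND = 100e9

total_spend = IA_BUDGET * IA_WEB_SHARE + OTHER_ARCHIVES
spend_ratio = total_spend / CREATION_COST_LOWER_BOUND

print(f"Estimated worldwide Web archiving spend: ${total_spend / 1e6:.1f}M/yr")
print(f"Share of assumed creation cost: {spend_ratio:.4%}")
```

On these assumptions the total comes to $27.5M/yr, consistent with "less than $30M/yr", and to less than three hundredths of one percent of the assumed cost of creating the content.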
It is true that the current state of Web archiving isn't good. But Leetaru's suggestion for improving it:
"greater collaboration is needed between the archiving community and the broader technology industry, especially companies that build the state-of-the-art crawling infrastructures that power modern web services."
doesn't begin to address the lack of funding. Nor does it address the problem of the motivations of web sites like CNN:
- Being archived doesn't do anything for a site's bottom line. It may even be a negative if it exposes machinations such as the notorious "Mission Accomplished" press release.
- They don't see any of the costs they impose on Web archives, so those costs don't weigh in the scale against the benefits of "optimizing the user experience". Even if they did weigh in the scale, they'd be insignificant.
Hmm, I wonder if the much-reviled AMP format is easier to archive because it reduces/standardizes the AJAX BS.
A good way to understand the craziness of the modern Web is to install the uMatrix extension which blocks cross-domain requests; most sites show no content.
Ben Tausig tweets:
"Turns out it's difficult to cite periodicals in countries w/o a free press. Many of the archives of Thailand's @MatichonOnline are just gone ... Tons of great reporting, articles I really leaned on, are no longer available. I have no idea how to cite them."
Fortunately, it appears that the Internet Archive knew about Matichon Online. But using oldweb.today it looks like no other Web archives did.