Wednesday, March 8, 2017

The Amnesiac Civilization: Part 2

Part 1 of The Amnesiac Civilization predicted that the state of Web archiving would soon get much worse. How bad is it right now, and why? Follow me below the fold for Part 2 of the series. I'm planning at least three more parts:
  • Part 3 will assess how practical some suggested improvements might be.
  • Part 4 will look in some detail at the Web DRM problem introduced in Part 1.
  • Part 5 will discuss a "counsel of despair" approach that I've hinted at in the past.

In my talk at the Fall 2014 CNI, based on The Half-Empty Archive, I noted some estimates of how likely a given URI was to be archived:
  • The Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL.
    Their results are somewhat difficult to interpret, but for their two more-random samples they report:
    URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
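The core of Ainsworth et al.'s estimation can be sketched in a few lines: for each sampled URI, fetch its Memento TimeMap (RFC 7089's application/link-format) and check whether any archive reports at least one memento (capture). This is my illustration of the idea, not their actual code; the TimeMap strings below are hand-made examples, not real archive output.

```python
import re

# A link in a TimeMap describing a capture carries rel="memento"
# (possibly decorated, e.g. rel="first last memento").
MEMENTO_REL = re.compile(r'rel="[^"]*\bmemento\b[^"]*"')

def has_memento(timemap_text: str) -> bool:
    """True if a link-format TimeMap lists at least one capture."""
    return bool(MEMENTO_REL.search(timemap_text))

def fraction_archived(timemaps: list) -> float:
    """Estimated probability that a sampled URI is archived at least once."""
    return sum(has_memento(tm) for tm in timemaps) / len(timemaps)

# One URI with a single capture in the Wayback Machine, one with none:
archived_uri = (
    '<http://example.com/>; rel="original",'
    '<http://web.archive.org/web/20140101000000/http://example.com/>;'
    ' rel="first last memento"; datetime="Wed, 01 Jan 2014 00:00:00 GMT"'
)
unarchived_uri = '<http://example.org/nowhere>; rel="original"'

print(fraction_archived([archived_uri, unarchived_uri]))  # 0.5
```

Run over a large sample of "random" URIs, the resulting fraction is the kind of estimate Ainsworth et al. report; the hard part, as they note, is getting a sample that is actually random.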
I then pointed out a number of reasons why these estimates were likely to be optimistic. So, back in 2014, more than half (probably much more than half) of the Web wasn't archived. The fundamental reason why so little was archived was cost, and in particular the cost of ingesting Web content, which is the largest single component of Web archiving cost.

I've written many times about one of the major reasons for the high cost of ingest, for example in a 2011 post inspired by a concert Roger McNamee's band Moonalice played in Palo Alto's Rinconada Park:
I've been warning for some time that one of the fundamental problems facing digital preservation is the evolution of content from static to dynamic.
The evolution of the Web from interlinked static documents to a JavaScript programming environment has greatly increased the cost of ingesting the surface Web, and decreased both its reliability and its representative nature. This process has continued since 2014.

Last month Kalev Leetaru added a third installment to his accurate but somewhat irritating series of complaints about the state of Web archiving.
My irritation with Leetaru isn't because his descriptions of the deficiencies of current Web archives aren't accurate. As I described in You Get What You Get And You Don't Get Upset, it stems from his unwillingness to acknowledge the economics of Web archiving:
Web archives, and the Internet Archive in particular, are not adequately funded for the immense scale of their task, ... So better metadata means less data. It is all very well for researchers to lay down the law about the kind of metadata that is "absolutely imperative", "a necessity" or "more and more imperative" but unless they are prepared to foot the bill for generating, documenting and storing this metadata, they get what they get and they don't get upset.
John Berlin at Old Dominion University has a fascinating, detailed examination of why CNN.com has been unarchivable since November 1st, 2016:
CNN.com has been unarchivable since 2016-11-01T15:01:31, at least by the common web archiving systems employed by the Internet Archive, archive.is, and webcitation.org. The last known correctly archived page in the Internet Archive's Wayback Machine is 2016-11-01T13:15:40, with all versions since then producing some kind of error (including today's; 2017-01-20T09:16:50). This means that the most popular web archives have no record of the time immediately before the presidential election through at least today's presidential inauguration.
The TL;DR is that:
the archival failure is caused by changes CNN made to their CDN; these changes are reflected in the JavaScript used to render the homepage.
The detailed explanation takes about 4400 words and 15 images. The changes CNN made appear intended to improve the efficiency of their publishing platform. From CNN's point of view the benefits of improved efficiency vastly outweigh the costs of being unarchivable (which in any case CNN doesn't see).
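The underlying failure mode is common to many modern sites: the HTML a crawler fetches is only a JavaScript bootstrap shell, and the visible content arrives via requests those scripts make at render time. A crawler that doesn't execute JavaScript archives an empty shell that replays as a blank or broken page. Here is a crude, hypothetical heuristic (my illustration, not Berlin's analysis) for flagging such shells:

```python
from html.parser import HTMLParser

class ShellDetector(HTMLParser):
    """Crude heuristic: compare visible text volume to <script> usage."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Only count text outside <script> elements as visible content.
        if not self.in_script:
            self.text_chars += len(data.strip())

def looks_like_js_shell(html: str) -> bool:
    """True if the page has scripts but almost no static visible text."""
    d = ShellDetector()
    d.feed(html)
    return d.script_count > 0 and d.text_chars < 100

shell = ('<html><body><div id="root"></div>'
         '<script src="/bundle.js"></script></body></html>')
static = "<html><body><p>" + "Article text. " * 20 + "</p></body></html>"
print(looks_like_js_shell(shell), looks_like_js_shell(static))  # True False
```

A real crawler would need something far more robust, but even this toy shows why "fetch the HTML" is no longer the same thing as "capture the page".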

Clearly, scaling this kind of detailed analysis to all major Web sites that use CDNs (content delivery networks) is infeasible. Among these Web sites are the major academic publishing platforms, so the problem Berlin describes is very familiar to the LOCKSS team. It isn't as severe for us; there are a limited number of publishing platforms and changes they make tend to affect all the journals using them. But it still imposes a major staff cost.

It is important to note that there are browser-based Web archiving technologies that are capable of collecting CNN.com in replayable form:
The solution is not as simple as one may hope, but a preliminary solution (albeit band-aid) would be to archive the page using tools such as WARCreate, Webrecorder or Perma.cc. These tools are effective since they preserve a fully rendered page along with all network requests made when rendering the page. This ensures that the JavaScript requested content and rendered sections of the page are replayable. Replaying of the page without the effects of that line of code is possible but requires the page to be replayed in an iframe. This method of replay is employed by Ilya Kreymer's PyWb (Python implementation of the Wayback Machine) and is used by Webrecorder and Perma.cc.
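The iframe technique Berlin mentions can be sketched very simply. This is my illustration of the isolation idea, not pywb's actual code, and the archive URL scheme below is hypothetical: the replay UI lives in the top frame, while the rewritten archived page runs inside an iframe, so its JavaScript executes in a separate browsing context and cannot clobber the archive's own banner and chrome.

```python
from html import escape

def replay_wrapper(archive_prefix: str, timestamp: str, target_uri: str) -> str:
    """Build a top-level page that replays a capture inside an iframe.

    archive_prefix, timestamp and the URL layout are illustrative
    assumptions, not any particular archive's real scheme.
    """
    replay_url = f"{archive_prefix}/{timestamp}/{target_uri}"
    return f"""<!DOCTYPE html>
<html>
<body>
  <div id="banner">Archived capture of {escape(target_uri)}
       at {escape(timestamp)}</div>
  <!-- The capture's scripts run inside this frame, isolated from the banner -->
  <iframe id="replay" src="{escape(replay_url)}"
          style="border:none;width:100%;height:90vh"></iframe>
</body>
</html>"""

page = replay_wrapper("https://example-archive.test/web",
                      "20161101130000", "http://cnn.com/")
```

The isolation doesn't make the capture itself any better; it only ensures that whatever the captured JavaScript does at replay time, it does it inside the frame.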
But these techniques are too expensive for general Web archives to apply to all Web sites, and it is difficult to distinguish between sites that need them and those that don't. For general Web archives this is an increasingly serious problem that is gradually eroding the coverage they can afford to maintain of the popular (and thus important for future scholars) Web.

Leetaru concludes:
the web archiving community is still stuck in a quarter-century-old mindset of how the web works and has largely failed to adapt to the rapidly evolving world of video, social media walled gardens, dynamic page generation and the mobile web
This is ludicrous. The LOCKSS team published our work on crawling AJAX-based academic journals in 2015, and it started more than three years earlier. Even then, we were building on work already done by others in the Web archiving community.
The reason these efforts haven't been more widely applied isn't because Web archives are "stuck in a quarter-century-old mindset". It's because applying them costs money.

The Internet Archive's budget is in the region of $15M/yr, about half of which goes to Web archiving. The budgets of all the other public Web archives might add another $20M/yr. The total worldwide spend on archiving Web content is probably less than $30M/yr, for content that cost hundreds of billions to create. The idea that these meager resources would stretch to archiving sites the size of YouTube or Facebook to Leetaru's satisfaction is laughable.

It is true that the current state of Web archiving isn't good. But Leetaru's suggestion for improving it:
greater collaboration is needed between the archiving community and the broader technology industry, especially companies that build the state-of-the-art crawling infrastructures that power modern web services.
doesn't begin to address the lack of funding. Nor does it address the problem of the motivations of web sites like CNN:
  • Being archived doesn't do anything for a site's bottom line. It may even be a negative if it exposes machinations such as the notorious "Mission Accomplished" press release.
  • They don't see any of the costs they impose on Web archives, so those costs don't weigh in the scale against the benefits of "optimizing the user experience". Even if they did weigh in the scale, they'd be insignificant.
There is no way to greatly improve Web archiving without significantly increased resources. Library and archive budgets have been under sustained attack for years. Neither I nor Leetaru has any idea where an extra $30-50M/yr would come from. Much less isn't going to stop the rot.

1 comment:

Wes Felter said...

Hmm, I wonder if the much-reviled AMP format is easier to archive because it reduces/standardizes the AJAX BS.

A good way to understand the craziness of the modern Web is to install the uMatrix extension which blocks cross-domain requests; most sites show no content.