Tuesday, June 18, 2019

Michael Nelson's CNI Keynote: Part 2

My "lengthy disquisition" on Michael Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides). continues here. Part 1 had an introduction and discussion of two of my issues with Nelson's big picture.
Below the fold I address my remaining issues with Nelson's big picture of the state of the art. Part 3 will compare his and my views of the path ahead.

A House Built On Sand (Matthew 7:26)

My most recent post to make the point that the Web is inherently untrustworthy was Trust In Digital Content. After making essentially the same point by recounting examples of violated trust from Twitter, Facebook and Google, Nelson asks rhetorically (Slide 47):
Why do we expect things to be different for web archives?
He goes on to discuss some possible attacks on Web archives, for example the theoretical possibility that Brewster Kahle could subvert the Wayback Machine (Slide 51). Clearly, because Web archives are themselves Web sites they are vulnerable to all the attacks that plague Web sites, such as insider attacks.

But Nelson doesn't address the fundamental issue that Web archive replay interfaces are just Web pages. Thus expecting Web archive replay to be more reliable than the medium from which it is constructed is unrealistic, especially since, unlike most other Web sites, the function of a Web archive is to hold and disseminate any and all externally generated content.

The archive has no control over its content in the sense that platforms do. Despite their massive resources, control over content or "content moderation" doesn't work at scale for Facebook, YouTube or Twitter. For Web archives it would not merely be technically and economically infeasible, it would be counter to their mission. As Nelson shows, they are thus vulnerable to additional attacks from which "content moderation" and similar policies would protect sites like Facebook, if they worked.

Diversity Is A Double-Edged Sword

Nelson's discussion of the Joy Reid blog firestorm was entitled Why we need multiple web archives: the case of blog.reidreport.com:
Until the Internet Archive begins serving blog.reidreport.com again, this is a good time to remind everyone that there are web archives other than the Internet Archive.  The screen shot above shows the Memento Time Travel service, which searches about 26 public web archives.  In this case, it found mementos (i.e., captures of web pages) in five different web archives: Archive-It (a subsidiary of the Internet Archive), Bibliotheca Alexandrina (the Egyptian Web Archive), the National Library of Ireland, the archive.is on-demand archiving service, and the Library of Congress.
Multiple Web archives matter because, in the wake of their spurious allegations that the Internet Archive had been compromised, blog.reidreport.com had specifically excluded the Internet Archive's crawler in their robots.txt file. In accordance with the Wayback Machine's policy, this made the entire archive of the site inaccessible. Thus it was only these other archives that could supply the information Nelson needed to confront the allegations.

Diversity in Web archiving is a double-edged sword. In Slide 67, Nelson asks "What if we start out with > (n/2) + 1 archives compromised?" and cites our Requirements for Digital Preservation Systems: A Bottom-Up Approach. This illustrates two things:
  • Web archives are just Web sites. Just as it is a bad idea to treat all Web sites as equally trustworthy, it is a bad idea to treat all Web archives as equally trustworthy (Slide 52). This places the onus on operators of Memento aggregators to decide on an archive's trustworthiness.
  • The "> (n/2) + 1 archives compromised" situation is a type of Sybil attack. Ways to defend against Sybil attacks have been researched since the early 90s. We applied them to Web archiving in the LOCKSS protocol, albeit in a way that wouldn't work at Internet Archive scale. But at that scale n=1.
Nelson claims that LOCKSS assumes all nodes are initially trustworthy (video 43:30), but this is not the case. He is correct that the LOCKSS technology is not directly applicable to general Web content, for two reasons:
  • The content is so dynamic.
  • There aren't enough Web archives to have Lots Of Copies and Keep Stuff Safe.
Of course, to the extent to which Web archiving technology is a monoculture, a zero-day vulnerability could in principle compromise all of the archives simultaneously, not a recoverable situation.

It turns out that the Web's "increasing returns to scale" are particularly powerful in Web archiving. Web archiving is important to the Internet Archive, but it is only one of its many activities:
Adding hardware and staff, the Internet Archive spends $3,400,000 each year on Web archiving, or 21% of the total.
It isn't clear what proportion of the global total spend on Web archiving the Internet Archive's $3.4M represents, but the Internet Archive appears to preserve vastly more content than all other Web archives combined. Each dollar spent supporting another archive is a dollar that would preserve more content if spent at the Internet Archive.

What Is This "Page" Of Which You Speak?

On Slide 14 Nelson says:
We need significant and continuous investment today to be able to say a page "used to look like this"
I certainly agree that we need significant and continuous investment in Web archiving technology, but no matter how much we invest we will never be able to say "the page looked like this".

At least since the very first US web page, brought up by Paul Kunz at SLAC around 6th Dec. 1991, some Web pages have been dynamic in the sense of Nelson's Slides 24-26. Requests to the same URL at different times may return different content.

Because of the way Nelson talks about the problems, implicit throughout is the idea that the goal of Web archiving is to accurately reproduce "the page" or "the past". But even were the crawler technology perfect, there could be no canonical "the page" or "the past". "The page" is in the mind of the beholder. All that the crawler can record is that, when this URL was requested from this IP address with this set of request headers and this set of cookies, this Memento is the bitstream that was returned.
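To make that concrete, here is a minimal sketch (the field names are mine, not any archive's schema) of what a single capture actually pins down: one request context and the bitstream returned for it, nothing more.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class Capture:
        """One observation of a URL: a Memento records this, not 'the page'."""
        url: str
        requested_at: datetime
        client_ip: str
        request_headers: dict
        cookies: dict
        status: int
        response_headers: dict
        body: bytes  # the returned bitstream, byte for byte

    obs = Capture(
        url='http://example.com/',
        requested_at=datetime(2019, 6, 18, tzinfo=timezone.utc),
        client_ip='192.0.2.1',                      # placeholder crawler address
        request_headers={'User-Agent': 'examplebot/1.0'},
        cookies={},
        status=200,
        response_headers={'Content-Type': 'text/html'},
        body=b'<html>...</html>',
    )

Nothing in such a record says what a different client, at a different address, with different cookies, would have received at the same instant.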

As I wrote more than two years ago in The Amnesiac Civilization: Part 3:
There are about 3.4*10^9 Internet users from about 200 countries, so there are about 6.8*10^11 possible versions of every Web page for each browser and device combination. Say there are 100 of these combinations, and the average Web page is about 2.3*10^6 bytes. So storing a single Web page could take up to about 1.6*10^20 bytes, or 160 exabytes.

But storage isn't actually the problem, since deduplication and compression would greatly reduce the storage needed. The problem is that in order to be sure the archive has found all the versions, it has to download them all before it can do the deduplication and compression.

I believe the Internet Archive's outbound bandwidth is around 2*10^9 byte/s. Assuming the same inbound bandwidth to ingest all those versions of the page, it would take about 8*10^10 seconds, or about 2.5*10^3 years, to ingest a single page. And that assumes that the Web site being archived would be willing to devote 2GB/s of outbound bandwidth for two-and-a-half millennia to serving the archive rather than actual users.

The point here is to make it clear that, no matter how much resource is available, knowing that an archive has collected all, or even a representative sample, of the versions of a Web page is completely impractical. This isn't to say that trying to do a better job of collecting some versions of a page is pointless, but it is never going to provide future researchers with the certainty they crave.
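The back-of-the-envelope arithmetic in the quoted passage can be reproduced in a few lines; the inputs are the estimates quoted above, not measurements.

    # Estimates from the quoted passage, not measurements.
    users        = 3.4e9    # Internet users
    countries    = 200      # countries a request might appear to come from
    combinations = 100      # browser/device combinations
    page_bytes   = 2.3e6    # average Web page size, bytes
    bandwidth    = 2e9      # assumed archive ingest bandwidth, bytes/s

    versions = users * countries * combinations    # ~6.8e13 possible versions
    storage  = versions * page_bytes               # ~1.6e20 bytes, before dedup
    seconds  = storage / bandwidth                 # ~8e10 s to ingest one page
    years    = seconds / (365.25 * 24 * 3600)      # ~2.5e3 years
    print(f"{versions:.1e} versions, {storage:.1e} bytes, {years:,.0f} years")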
Clearly, only a minuscule proportion of the 6.8*10^13 possible versions of "the page" will ever be seen by a human. But it is highly likely that the set of versions ever seen by humans has no overlap with the set of versions ever collected by any Web archive's crawler. Even were both crawler and replay technology perfect, what would it mean to say the page "used to look like this" when no-one had ever seen the page looking like that?

We assume that the differences between visits at similar times, from different browsers at different IP addresses with different cookies and headers are immaterial, but we cannot know that. The fact that this isn't true of, for example, Facebook is at the heart of the current disinformation campaigns. They use behavioral targeting and recommendation engines to present intentionally different content to different visits.

Nelson and his co-authors have published in this area. For example in A Framework for Aggregating Private and Public Web Archives they developed techniques by which a user who had collected their personalized content in a private archive could seamlessly replay both it and content from public archives. In A Method for Identifying Personalized Representations in Web Archives they discuss the problem of canonicalizing multiple representations of "the page".

Comparing content from multiple archives would increase our confidence that the differences are immaterial, but see the argument of the previous section.

"The essence of a web archive is ...

... to modify its holdings" (Slide 16).

I find this a very odd thing for Nelson to say, and the oddity is compounded on the next slide when he cites our 2005 paper Transparent Format Migration of Preserved Web Content. At the time we wrote it the consensus was that the preservation function of digital archives required them to modify their holdings by migrating files from doomed formats to less doomed ones. The paper argues that, in the case of Web archives, this is neither necessary nor effective.

We argued that Web content should be preserved in its original format. If and when browsers no longer support that format, the archive can create an access copy in a supported format when a request for that content is received from a reader. The paper reports on a proof-of-concept implementation, now a decade and a half old. Even back then we would have said "the essence of a Web archive is to collect and preserve its holdings unchanged for future access".
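A minimal sketch of the idea, with purely hypothetical helper functions: the preserved bytes are never altered, and an access copy is produced only when the reader's browser can no longer render the original format.

    def choose_target_format(accept_header: str) -> str:
        # Hypothetical: pick a format the reader's browser says it supports.
        return accept_header.split(',')[0].strip()

    def convert(raw: bytes, source_mime: str, target_mime: str) -> bytes:
        # Hypothetical converter; a real archive would plug in a migration tool.
        return raw  # placeholder: no real conversion performed here

    def serve_memento(raw: bytes, original_mime: str, accept_header: str):
        """Return (body, mime type), migrating on access only if needed."""
        if original_mime in accept_header or '*/*' in accept_header:
            return raw, original_mime              # original format still supported
        target = choose_target_format(accept_header)
        return convert(raw, original_mime, target), target

    print(serve_memento(b'GIF89a...', 'image/gif', 'image/png,image/webp'))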

More recent work by Ilya Kreymer has shown that even transparent on-access format migration is not needed. oldweb.today shows how emulation technology can be used to replay preserved Web content in its original format with no modification whatsoever.

Kreymer's work was one of several efforts showing that Web archives needed both a Wayback Machine-like replay interface, and also an interface providing access to the unmodified content as originally collected, the raw Mementos. In March 2016 Kreymer summarized discussions he and I had had about a problem he'd encountered building oldweb.today thus:
a key problem with Memento is that, in its current form, an archive can return an arbitrarily transformed object and there is no way to determine what that transformation is. In practice, this makes interoperability quite difficult.
By August 2016 Nelson, Herbert Van de Sompel and others had published a very creative solution to the problem of how access to unmodified content in Web archives could be requested. So nearly three years ago Nelson believed that the essence of a Web archive included access to unmodified content.

I believe that what Nelson intended to say is "The essence of web archive replay is to modify the requested content". But even if oldweb.today were merely a proof of concept, it proves that this statement applies only to the Wayback Machine and similar replay technologies.

Who Uses This Stuff Anyway?

In my view the essential requirement for a site to be classified as a Web archive is that it provide access to the raw Mementos. There are two reasons for this. First, my approach to the problems Nelson describes is to enhance the transparency of Web archives. An archive unable to report the bitstream it received from the target Web site is essentially saying "trust me". It cannot provide the necessary transparency.

The second and more important one is that Web archives serve two completely different audiences. The replay UI addresses human readers. But increasingly researchers want to access Web archives programmatically, and for this they need an API providing access to raw Mementos, either individually or by querying the archive's holdings with a search, and receiving the Mementos that satisfy the query in bulk as a WARC file. Suitable APIs were developed in the WASAPI project, a 2015 IMLS-funded project involving the Internet Archive, Stanford Libraries, UNT and Rutgers.
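As an illustration of what such programmatic access looks like today, here is a sketch against the Internet Archive's publicly documented CDX index, using the Wayback Machine's "id_" URL modifier to fetch raw Mementos rather than rewritten replay pages (WASAPI itself delivers bulk WARC files rather than individual responses).

    import requests

    # Query the Wayback Machine's CDX index for captures of a URL.
    resp = requests.get('http://web.archive.org/cdx/search/cdx',
                        params={'url': 'example.com', 'output': 'json', 'limit': 5})
    rows = resp.json()
    header, captures = rows[0], rows[1:]

    for row in captures:
        fields = dict(zip(header, row))
        # The 'id_' modifier asks for the raw Memento, without replay rewriting.
        raw_url = (f"http://web.archive.org/web/{fields['timestamp']}id_/"
                   f"{fields['original']}")
        raw = requests.get(raw_url)
        print(fields['timestamp'], fields['original'], len(raw.content), 'bytes')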

Web Archives As Fighting Cats

I agree with Nelson that the state of Web archive interoperability leaves much to be desired. But he vastly exaggerates the problem using a highly misleading strawman demonstration. His "generational loss" demo (Slides 88-93) is a game of telephone tag involving pushing the replay UI of one archive into another, then pushing the resulting replay UI into another, and so on.

Transferring content in this way cannot possibly work. One of the major themes of Nelson's entire talk is that the "essence of Web archive [replay] is to modify its holdings". Thus it is hard to see why the fact that successive modifications lose information is relevant to Web archive interoperation.

Web archives, such as those Nelson used in his telephone tag pipeline, that modify their content on replay must interoperate using raw Mementos. That is why for a long time I have been active in promoting ways to do so. Many years ago I implemented the ability for LOCKSS boxes to export and import raw Mementos in the WARC file standard, which has long been used to transfer content from Archive-It accounts to LOCKSS boxes for preservation. I was active in pushing for the WASAPI project, and took part in the design discussions. I started work back in 2014 on what became the Mellon-funded LAAWS (LOCKSS Architected As Web Services) project. The designed-for-interoperation result is now being released, a collection of Web services for ingest, preservation and replay based on the WARC standard for representing raw Mementos.
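As a minimal sketch of what interoperating on raw Mementos looks like, using Ilya Kreymer's warcio library (the URL and payload below are placeholders): one archive writes its raw captures into a WARC file, and another reads back the identical bitstream.

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.archiveiterator import ArchiveIterator

    # Export: write a raw Memento into a WARC file.
    with open('export.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders('200 OK', [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record('http://example.com/', 'response',
                                           payload=BytesIO(b'<html>unmodified</html>'),
                                           http_headers=http_headers)
        writer.write_record(record)

    # Import: the receiving archive reads back exactly the bytes that were collected.
    with open('export.warc.gz', 'rb') as stream:
        for rec in ArchiveIterator(stream):
            if rec.rec_type == 'response':
                print(rec.rec_headers.get_header('WARC-Target-URI'),
                      rec.content_stream().read())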

Despite the upside-down tortoise joke, Nelson's references to LOCKSS in this context are also misleading. LOCKSS boxes interoperate among themselves and with other Web archives not on the replayed Web content but on the raw Mementos (see CLOCKSS: Ingest Pipeline and LOCKSS: Polling and Repair Protocol).

A Voight-Kampff Test?

Unlike Nelson, I'm not obsessed with Do Androids Dream Of Electric Sheep? and Blade Runner. One thing that puzzled me was the need for the Voight-Kampff machine. It is clearly essential to the story that humans can write to replicant brains, to implant the fake memories. So why isn't this a read/write interface, like the one Mary Lou Jepsen's Openwater is now working on in real life? If the replicant brain could be read, its software could be verified.

Unlike replicants, Web archives have a read interface, ideally the one specified by Nelson and his co-authors. So the raw collected Mementos can be examined; we don't need a Web Voight-Kampff machine questioning what we see via the replay interface to decide if it correctly describes "the past". As I understand it, examining raw Mementos is part of the Old Dominion group's research methodology for answering this question.

The broader question of whether the content an archive receives from a Web site is "real" or "fake" isn't the archive's job to resolve (Slide 58). The reason is the same as the reason archives should not remove malware from the content they ingest. Just as malware is an important field of study, so is disinformation. Precautions need to be taken in these studies, but depriving them of the data they need is not the answer.

To Be Continued

The rest of the post is still in draft. The third part will address the way forward:

Part 3:

  • What Can't Be Done?
  • What Can Be Done?
    • Source Transparency
    • Temporal Transparency
    • Fidelity Transparency 1
    • Fidelity Transparency 2
  • Conclusion

