Thursday, June 13, 2019

Michael Nelson's CNI Keynote: Part 1

Michael Nelson and his group at Old Dominion University have made major contributions to Web archiving. Among them are a series of fascinating papers on the problems of replaying archived Web content. I've blogged about several of them, most recently in All Your Tweets Are Belong To Kannada and The 47 Links Mystery. Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides) understandably focuses on recounting much of this important research. I'm a big fan of this work, and there is much to agree with in the rest of the talk.

But I have a number of issues with the big picture Nelson paints. Part of the reason for the gap in posting recently was that I started on a draft that discussed both the big picture issues and a whole lot of minor nits, and I ran into the sand. So I finally put that draft aside and started this one. I tried to restrict myself to the big picture, but despite that it is still too long for a single post. Follow me below the fold for the first part of a lengthy disquisition.


Taken as a whole, Nelson seems to be asking:
Can we take what we see from Web archives at face value?
Nelson is correct that the answer is "No", but there was no need to recount all the detailed problems in his talk to arrive at this conclusion. All he needed to say was:
Web archive interfaces such as the Wayback Machine are Web sites like any other. Nothing that you see on the Web can be taken at face value.
As we have seen recently, an information environment allowing every reader to see different, individually targeted content allows malign actors to mount powerful, sophisticated disinformation campaigns. Nelson is correct to warn that Web archives will become targets of, and tools for, these disinformation campaigns.

I'm not saying that the problems Nelson illustrates aren't significant, and worthy of attention. But the way they are presented seems misleading and somewhat counter-productive. Nelson sees it as a vulnerability that people believe the Wayback Machine is a reliable source for the Web's history. He is right that malign actors can exploit this vulnerability. But people believe what they see on the live Web, and malign actors exploit this too. The reason is that most people's experience of both the live Web and the Wayback Machine is that they are reasonably reliable for everyday use.

The structure I would have used for this talk would have been to ask these questions:
  • Is the Wayback Machine more or less trustworthy as to the past than the live web? Answer: more.
  • How much more trustworthy? Answer: significantly but not completely.
  • How can we make the Wayback Machine, and Web archives generally, more trustworthy? Answer: make them more transparent.

What Is This Talk About?

My first big issue with Nelson's talk is that, unless you pay very close attention, you will think that it is about "web archives" and their problems. But it is actually almost entirely about the problems of replaying archived Web content. The basic problem is that the process of replaying content from Web archives frequently displays pages that never actually existed. Although generally each individual component (Memento) of the replayed page existed on the Web at some time in the past, no-one could ever have seen a page assembled from those Mementos.
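To make the point about non-contemporaneous Mementos concrete, here is a minimal sketch that measures how far apart in time the components of one replayed page were captured. The URLs and capture times are invented for illustration; the only assumption about real archives is the 14-digit YYYYMMDDhhmmss timestamp format used in Wayback Machine style URLs.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit Wayback-style timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def temporal_spread(memento_timestamps: dict[str, str]) -> timedelta:
    """Return the gap between the earliest and latest capture in a page."""
    times = [parse_ts(ts) for ts in memento_timestamps.values()]
    return max(times) - min(times)


# Hypothetical replayed page: the HTML and its embedded resources
# were each captured at a different time.
page = {
    "http://example.com/": "20190301120000",
    "http://example.com/style.css": "20181115083000",
    "http://example.com/logo.png": "20190420235959",
}

spread = temporal_spread(page)
print(f"Components captured up to {spread.days} days apart")
```

A spread of zero would mean every component was captured at the same instant. Anything larger means the replayed page is a composite that may never have existed in that form on the live Web.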

Nelson hardly touches on the problems of ingest that force archives to assemble replayed pages from non-contemporaneous sets of Mementos. His discussion of the threats to the collected Mementos between ingest and dissemination is superficial. The difference between these issues and the problems of replay that Nelson describes in depth is that inadequate collection and damage during storage are irreversible whereas replay can be improved through time (Slide 18). Nelson even mentions a specific recent case of fixed replay problems, "zombies" (Slide 30). If more resources could be applied, more of them could be fixed more quickly.

Many of my problems with Nelson's talk would have gone away if, instead of saying "web archives" he had said "web archive replay" or, in most cases "Wayback Machine style replay".

Winston Smith Is In The House!

Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". Each stage of digital preservation is vulnerable to attack, but only one to "Winston Smith" attacks:
  • Ingest is vulnerable to two kinds of attack, in which Mementos unrepresentative of the target Web site end up in the archive's holdings at the time of collection:
    • A malign target Web site can detect that the request is coming from an archive and respond with content that doesn't match what a human would have seen. Of course, a simple robots.txt file can prevent archiving completely.
    • A malign "man-in-the-middle" can interpose between the target Web site and the archive, substituting content in the response. This can happen even if the archive's crawler uses HTTPS, via DNS tampering and certificate spoofing.
  • Preservation is vulnerable to a number of "Winston Smith" attacks, in which the preserved contents are modified or destroyed after being ingested but before dissemination is requested. In 2005 we set out a comprehensive list of such attacks in Requirements for Digital Preservation Systems: A Bottom-Up Approach, and ten years later applied them to the CLOCKSS Archive in CLOCKSS: Threats and Mitigations.
  • Dissemination is vulnerable to a number of disinformation attacks, in which the Mementos disseminated do not match those stored Mementos responsive to the user's request. Nelson uses the case of Joy Reid's blog (Slide 33) to emphasize that "Winston Smith" attacks aren't necessary for successful disinformation campaigns. All that is needed is to sow Fear, Uncertainty and Doubt (FUD) as to the veracity of Web archives (Slide 41). In the case of Joy Reid's blog, all that was needed to do this was to misinterpret the output of the Wayback Machine; an attack wasn't necessary.
Nelson then goes on to successfully cast this FUD on Web archives by mentioning the risks of, among others:
  • Insider attacks (Slide 51, Slide 64), which can in principle perform "Winston Smith" rewriting attacks. But because of the way preserved content is stored in WARC files with hashes, this is tricky to do undetectably. An easier insider attack is to tamper with the indexes that feed the replay pipeline so they point to spurious added content instead.
  • Using Javascript to tamper with the replay UI to disguise the source of fake content (Slide 59). Although Nelson shows a proof-of-concept (Slide 61), this attack is highly specific to the replay technology. Using an alternate replay technology, for example, it would be an embarrassing failure.
  • Deepfakes (Slide 56). This is a misdirection on Nelson's part, because deepfakes are an attack on the trustworthiness of the Web, not specifically on Web archives. It is true that they could be used as the content for attacks on Web archives, but there is nothing that Web archives can do to specifically address attacks with deepfake content as opposed to other forms of manipulated content. In Slide 58 Nelson emphasizes that it isn't the archive's job to detect or suppress fake content from the live Web.
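The remark in the insider-attack bullet about WARC files and hashes can be illustrated with a small sketch. It assumes a record's payload digest is recorded in the common "sha1:" plus Base32 form; the function names and sample payload are hypothetical, and a real archive may use a different algorithm or encoding.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Compute a digest string in the assumed 'sha1:<Base32>' form."""
    sha1 = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(sha1).decode("ascii")


def is_intact(payload: bytes, recorded_digest: str) -> bool:
    """Check a stored payload against the digest recorded at ingest."""
    return payload_digest(payload) == recorded_digest


# Hypothetical payload as captured, and the digest recorded with it.
original = b"<html><body>original content</body></html>"
recorded = payload_digest(original)

# A "Winston Smith" rewrite of the stored bytes no longer matches.
tampered = original.replace(b"original", b"rewritten")

print(is_intact(original, recorded))
print(is_intact(tampered, recorded))
```

The check only helps if the recorded digest is kept somewhere the attacker cannot also rewrite. It also says nothing about the index-tampering attack, because in that case the stored records themselves are untouched.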
In Slide 62, Nelson points to a group of resources from two years ago including Jack Cushman and Ilya Kreymer's Thinking like a hacker: Security Considerations for High-Fidelity Web Archives and my post about it. He says "fixing this, preventing web archives being an attack vector, is going to be a great deal of work". It isn't "going to be", it is an ongoing effort. At least some of the attacks identified two years ago have already been addressed; for example the Wayback Machine uses the Content-Security-Policy header. Clearly, like all Web technologies, vulnerabilities in Web archiving technologies will emerge through time and need to be addressed.
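For readers unfamiliar with the Content-Security-Policy header, the sketch below splits a policy string into its directives. The policy shown is hypothetical, written only to illustrate the syntax; it is not the Wayback Machine's actual policy, and `replay.example.org` is a made-up host.

```python
def parse_csp(header_value: str) -> dict[str, list[str]]:
    """Split a Content-Security-Policy value into directive -> sources."""
    directives: dict[str, list[str]] = {}
    for part in header_value.split(";"):
        tokens = part.split()
        if tokens:
            # Directive names are case-insensitive; if a directive is
            # repeated, only the first occurrence is kept.
            directives.setdefault(tokens[0].lower(), tokens[1:])
    return directives


# Hypothetical policy for a replay service (an assumption, not a real one).
hypothetical_policy = (
    "default-src 'self'; "
    "script-src 'self' https://replay.example.org; "
    "frame-ancestors 'none'"
)

policy = parse_csp(hypothetical_policy)
for name, sources in policy.items():
    print(name, sources)
```

A policy of this kind tells the browser which origins may supply scripts and other resources, which limits what archived Javascript can load or contact during replay.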

Slide 69 provides an example of one of the two most credible ways Web archives can be attacked, by their own governments. This is a threat about which we have been writing for at least five years in the context of academic journals. Nelson underplays the threat, because it is far more serious than simply disinformation.

Governments, for example the Harper administration in Canada and the Trump administration, have been censoring scientific reports and the data on which they depend wholesale. Web archives have been cooperating in emergency efforts to collect and safeguard this information, primarily by archiving it in other jurisdictions. But both the US with the CLOUD Act and the EU claim extraterritorial jurisdiction over data in servers. A recent example was documented in Chris Butler's post Official EU Agencies Falsely Report More Than 550 URLs as Terrorist Content on the Internet Archive's blog:
In the past week, the Internet Archive has received a series of email notices from French Internet Referral Unit (French IRU) falsely identifying hundreds of URLs on as “terrorist propaganda”. At least one of these mistaken URLs was also identified as terrorist content in a separate take down notice sent under the authority of the French government’s L’Office Central de Lutte contre la Criminalité liée aux Technologies de l’Information et de la Communication (OCLCTIC).
It would be bad enough if the mistaken URLs in these examples were for a set of relatively obscure items on our site, but the French IRU’s lists include some of the most visited pages on and materials that obviously have high scholarly and research value.
The alleged terrorist content included archives of the Grateful Dead's music, CSPAN, Rick Prelinger's industrial movies and scientific preprints from

The other most credible way Web archives can be attacked is, as the Joy Reid case illustrates, by the target Web site abusing copyright or robots.txt. I wrote a detailed post about this last December entitled Selective Amnesia. The TL;DR is that if sites don't want to be archived they can easily prevent it, and if they subsequently regret it they can easily render the preserved content inaccessible via robots.txt, or via a DMCA takedown. And if someone doesn't want someone else's preserved content accessed, they have many legal avenues to pursue, starting with a false claim of copyright:
The fundamental problem here is that, lacking both a registry of copyright ownership, and any effective penalty for false claims of ownership, archives have to accept all but the most blatantly false claims, making it all too easy for their contents to be censored.
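As a rough illustration of how little a site needs to do to exclude a crawler via robots.txt, the sketch below uses Python's standard `urllib.robotparser` to evaluate a two-rule file. The user-agent name `ia_archiver`, the browser name and the URLs are assumptions for illustration, and how any particular archive actually honors robots.txt is a policy question this code does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: exclude one archive crawler, allow everyone else.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "http://example.com/blog/post.html"
blocked = parser.can_fetch("ia_archiver", url)   # archive crawler
allowed = parser.can_fetch("SomeBrowser", url)   # any other agent

print("archive crawler may fetch:", blocked)
print("other agents may fetch:", allowed)
```

Two lines of configuration are enough to exclude the crawler from the whole site.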

I haven't even mentioned the "right to be forgotten", the GDPR, the Australian effort to remove any judicial oversight from takedowns, or the EU's Article 13 effort to impose content filtering. All of these enable much greater abuse of takedown mechanisms.

To Be Continued

The rest of the post is still in draft. It will probably consist of two more parts:

Part 2:

  • A House Built On Sand
  • Diversity Is A Double-Edged Sword
  • What Is This "Page" Of Which You Speak?
  • The essence of a web archive is ...
  • A Voight-Kampff Test?

Part 3:

  • What Can Be Done?
    • Source Transparency
    • Temporal Transparency
    • Fidelity Transparency 1
    • Fidelity Transparency 2
  • Conclusion
