Tuesday, September 19, 2017

Attacking (Users Of) The Wayback Machine

Right from the start, nearly two decades ago, the LOCKSS system assumed that:
Alas, even libraries have enemies. Governments and corporations have tried to rewrite history. Ideological zealots have tried to suppress research of which they disapprove.
The LOCKSS polling and repair protocol was designed to make it as difficult as possible for even a powerful attacker to change content preserved in a decentralized LOCKSS network, by exploiting excess replication and the lack of a central locus of control.

Just like libraries, Web archives have enemies. Jack Cushman and Ilya Kreymer's (CK) talk at the 2017 Web Archiving Conference identified seven potential vulnerabilities of centralized Web archives that an attacker could exploit to change or destroy content in the archive, or mislead an eventual reader as to the archived content.

Now, Rewriting History: Changing the Archived Web from the Present by Ada Lerner et al (L) identifies four attacks that, without compromising the archive itself, caused browsers using the Internet Archive's Wayback Machine to display pages that look different from the originally archived content. The title is misleading: these attacks are less serious than those that compromise the archive. Problems with replaying archived content are fixable; loss of or damage to archived content is not.

Below the fold I examine L's four attacks and relate them to CK's seven vulnerabilities.

To review, CK's seven vulnerabilities are:
  1. Archiving local server files, in which resources local to the crawler end up in the archive.
  2. Hacking the headless browser, in which vulnerabilities in the execution of JavaScript by the crawler are exploited.
  3. Stealing user secrets during capture, a vulnerability of user-driven crawlers which typically violate cross-domain protections.
  4. Cross site scripting to steal archive logins:
    When replaying preserved content, the archive must serve all preserved content from a different top-level domain from the one that users use to log in to the archive, and that the archive uses to serve the parts of a replay page (e.g. the Wayback Machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe.
  5. Live web leakage on playback:
    Especially with JavaScript in archived pages, it is hard to make sure that all resources in a replayed page come from the archive, not from the live Web. If live Web JavaScript is executed, all sorts of bad things can happen. Malicious JavaScript could exfiltrate information from the archive, track users, or modify the content displayed.
  6. Show different page contents when archived:
    It is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternatively, the page can detect that it is being replayed, and display different content or attack the replayer.
  7. Banner spoofing:
    When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page.
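The live Web leakage described in CK5 can be illustrated with a static check on a replayed page. The following is a minimal sketch, not any archive's actual tooling: the function name `find_live_web_leaks`, the `REPLAY_HOSTS` set, and the example hostnames are all assumptions made for illustration. It only inspects `src` and `href` attributes present in the HTML, so requests generated dynamically by JavaScript would not be caught.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed replay host(s); substitute the archive's actual replay domain(s).
REPLAY_HOSTS = {"web.archive.org"}

# Tags whose URL attribute causes the browser to load a resource.
RESOURCE_TAGS = {"script", "img", "iframe", "link"}


class ResourceCollector(HTMLParser):
    """Collect the URLs of embedded resources in a replayed page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        wanted = "href" if tag == "link" else "src"
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the replay URL.
                self.urls.append(urljoin(self.base_url, value))


def find_live_web_leaks(html, base_url, replay_hosts=REPLAY_HOSTS):
    """Return embedded http(s) URLs that are not served by a replay host."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    leaks = []
    for url in collector.urls:
        parts = urlparse(url)
        # Inline schemes such as data: never reach the network, so skip them.
        if parts.scheme in ("http", "https") and parts.hostname not in replay_hosts:
            leaks.append(url)
    return leaks
```

Any URL this returns would be fetched from the live Web at replay time, which is the opening that the first of L's attacks, described below, depends on.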
Vulnerabilities CK1 through CK4 are attacks on the archive itself, possibly leading to corruption and loss. The remaining three are attacks on the eventual reader, similar to L's four. You need to read the paper to get the full details of their attacks, but in summary they are:
  1. Archive-Escape Abuse: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner, which they show is common. The resource was not present in the archive, so when they obtained control of the domain they were able to serve from it malicious JavaScript that the page served from the Wayback Machine would include. This is a version of vulnerability CK5.
  2. Same-Origin Escape Abuse: The attackers identified an archived victim page that, in an iframe from a third-party domain, included malicious JavaScript. On the live Web the Same-Origin policy prevented it from executing, but when served from the Wayback Machine the page and the iframe had the same origin. This is related to vulnerability CK4. It requires foresight, since the iframe code must be present at ingest time.
  3. Same-Origin Escape + Archive-Escape: The attackers combined L1 and L2 by including in the iframe code that deliberately generated archive escapes. It again requires foresight, since the escape-generating code must be present at ingest time.
  4. Anachronism-Injection: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner. The resource was not present in the archive, so when they obtained control of the domain they could use the Wayback Machine's "Save Page Now" facility to create an archived version of the resource. Now when the Wayback Machine served the page, the attackers' version of the resource would be served from the archive. The only way to defend against this attack, since the attacker's version of the resource will always be the closest in time to the victim page, would be to restrict searches for nearest-in-time resources to a small time range.
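The defense suggested for L4, restricting nearest-in-time lookups to a small time range, can be sketched as follows. This is an illustration under stated assumptions, not the Wayback Machine's actual lookup code: `nearest_capture` and its `max_skew` parameter are hypothetical names, and the dates in the usage note are invented.

```python
from datetime import datetime, timedelta


def nearest_capture(capture_times, page_time, max_skew):
    """Return the capture time closest to page_time.

    Returns None when there are no captures, or when even the closest
    capture is further than max_skew away from page_time.
    """
    if not capture_times:
        return None
    closest = min(capture_times, key=lambda t: abs(t - page_time))
    if abs(closest - page_time) > max_skew:
        return None
    return closest
```

With a victim page captured in 2011 and an attacker's resource saved in 2017, an unrestricted search returns the attacker's capture because it is the only one, while `max_skew=timedelta(days=30)` returns `None` and the resource is simply missing from the replayed page. The cost is that legitimately late captures of a resource are also excluded.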
Unlike L, CK note that Web archives could prevent leaks to the live Web:
Injecting the Content-Security-Policy (CSP) header into replayed content could mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).
Web archives should; browsers have supported the CSP header for at least four years. The version of the Wayback Machine used by the Internet Archive's Archive-It service uses CSP to prevent live Web leakage, but the main Wayback Machine currently doesn't. If it did, L1 through L3 would be ineffective.
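As a rough illustration of what CK propose, a replay system could attach a header like the one built below to every replayed response. This is a minimal sketch: `build_replay_csp` is a hypothetical helper, the origin in the usage note is only an example, and the choice to permit inline scripts and `data:`/`blob:` URLs is an assumption (archived pages often depend on them), not a description of the policy any archive actually deploys.

```python
def build_replay_csp(replay_origins, allow_inline=True):
    """Build a (header name, header value) pair for replayed responses.

    replay_origins: origins the archive serves replayed content from.
    allow_inline:   assumed default; archived pages commonly rely on
                    inline scripts and styles, data: and blob: URLs.
    """
    sources = ["'self'", *replay_origins]
    if allow_inline:
        sources += ["'unsafe-inline'", "'unsafe-eval'", "data:", "blob:"]
    # default-src is the fallback for every fetch directive (scripts,
    # images, frames, ...), so one directive covers all resource types.
    return "Content-Security-Policy", "default-src " + " ".join(sources)
```

For example, `build_replay_csp(["https://web.archive.org"], allow_inline=False)` yields the value `default-src 'self' https://web.archive.org`. A compliant browser given that header would refuse to load a script from an abandoned third-party domain, which is why the policy blunts archive escapes.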

All this being said, there are some important caveats that users of preserved Web content should bear in mind. It is extremely likely that the payload of a URL delivered by the Wayback Machine is the same as that its crawler collected at the specified time. However, this does not mean that the rendered page in your browser looks the same as it would have had you visited the page when the Wayback Machine's crawler did:
  • If the Web archive's replay system does not use CSP, all bets are off.
  • Browsers evolve, rendering pages differently. Using oldweb.today can mitigate, but not eliminate, this problem, as I wrote in The Internet Is for Cats.
  • The embedded resources, such as images, CSS files, and JavaScript libraries, may not have been collected at the same time as the page itself, so may be different, as in the L4 attack.
  • At collection time, the owner of the page's domain, or the domain of any of the embedded resources, or even someone who had compromised the Web servers of the page or any of its embedded resources, could be malicious. As in the CK6 vulnerability, they could detect that the page was being archived and deliver to the crawler a payload different from that they would have delivered to a browser.
The bottom line is that all critical uses of preserved Web content, such as legal evidence, should be based on the source of the payload, not on a rendered page image.
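The caveat about embedded resources collected at a different time from the page can be made concrete by comparing the capture timestamps carried in replay URLs. The sketch below assumes the familiar replay URL layout of `/web/` followed by a 14-digit timestamp, an optional two-letter modifier such as `im_`, and the original URL. The helper names `capture_time` and `capture_skew`, and the example URLs in the usage note, are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: /web/<YYYYMMDDhhmmss><optional modifier like im_>/<original URL>
_CAPTURE_RE = re.compile(r"/web/(\d{14})(?:[a-z]{2}_)?/")


def capture_time(replay_url):
    """Extract the capture time encoded in a replay URL."""
    match = _CAPTURE_RE.search(replay_url)
    if match is None:
        raise ValueError(f"no 14-digit capture timestamp in {replay_url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def capture_skew(page_url, resource_url):
    """Return resource capture time minus page capture time.

    A negative result means the resource was captured before the page.
    """
    return capture_time(resource_url) - capture_time(page_url)
```

Given a page at `https://web.archive.org/web/20170919120000/http://victim.example/` and an image at `https://web.archive.org/web/20170916120000im_/http://victim.example/a.png`, `capture_skew` returns a `timedelta` of minus three days. A skew of years rather than days is a reason to distrust the rendered page.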

7 comments:

Unknown said...

Thank you for this David.

One update/correction.

Dr. Ada Lerner (the lead author of the paper you reference) reached out to me a few weeks ago and, as a result, we added CSP headers to the "main" Wayback Machine (web.archive.org).

As such we have mitigated the first condition in the paper.

We will soon add a feature to allow users of the Wayback Machine to easily view the archive time/date of all "page" elements, mitigating the fourth condition.

- Mark Graham, Director, the Wayback Machine at the Internet Archive

David. said...

Mark Graham has a post at the Internet Archive blog describing the new "Timestamps" feature of the Wayback Machine:

"The Wayback Machine has an exciting new feature: it can list the dates and times, the Timestamps, of all page elements compared to the date and time of the base URL of a page. This means that users can see, for instance, that an image displayed on a page was captured X days before the URL of the page or Y hours after it. Timestamps are available via the “About this capture” link on the right side of the Wayback Toolbar."

This is the mitigation of the fourth condition mentioned in Mark's comment.

David. said...

The possibility of attacks on the content of the Internet Archive has exploded into public consciousness with Joy-Ann Reid's claim that:

"she was the victim of “hackers”: somehow, nefarious disinformation agents managed to hack not her blog (which is now deleted), but rather the Wayback Machine and its digital archive. They penetrated the Wayback Machine and then, according to Reid, added some anti-gay content."

The Internet Archive's response was skeptical:

"When we reviewed the archives, we found nothing to indicate tampering or hacking of the Wayback Machine versions. At least some of the examples of allegedly fraudulent posts provided to us had been archived at different dates and by different entities."

In any case, Reid mounted her own "attack" on the Wayback Machine:

"At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog. Also, the blog URL which previously pointed to an msnbc.com page now points to a generic parked page."

Anyone can make their website's content unavailable to users of the Wayback Machine in this way.
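To show what such a request looks like, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the exclusion names the `ia_archiver` user agent, the token long associated with the Internet Archive, and `blog.example` is a placeholder hostname. The actual rules placed on Reid's blog are not quoted above, so this file is illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that excludes one crawler and leaves others alone.
# The ia_archiver token is an assumption about how the request was worded.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_excluded(robots_txt, user_agent, url):
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

Here `is_excluded(ROBOTS_TXT, "ia_archiver", "http://blog.example/post")` is true while the same call for another user agent is false, which matches the description of a request "specific to the Wayback Machine".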

David. said...

Hayley Miller examined Joy-Ann Reid's claims and found:

"The Library of Congress, which uses a local installation of the Wayback Machine, contains the disputed posts, CNN reported Tuesday. Archive.today, another archiving site, also contained the posts, HuffPost discovered."

The idea of an attack on the Wayback Machine is pretty much ruled out.

tassiechick said...

It doesnt matter what Haley found. She's just a journalist reporting what other people are saying. She didnt investigate a thing. The idea of an attack on Wayback Machine is pretty much NOT ruled out at all. SO far all that's sure is the initial screen caps alleging she made the posts turned out to be doctored via photoshop, the alleged posts have no comments on them which is not at all plausible, the posts were not seen by anyone who visited the blog regularly at the time, the posts were not there when Joy apologized to Christ for another post some months ago & have suddenly appeared. There is ample evidence that they are fabricated.

David. said...

I let the comment above through moderation to support my assertion that the possibility of attacks on the Wayback Machine's content has become controversial. However, some observations are in order:

1. The comment provides no sources, let alone ones that Wikipedia would regard as "reliable", for its claims.

2. The comment fails to address Miller's reporting, backed by links and supported by the Internet Archive, that two other Web archives also contained the posts Reid disclaimed. So a putative attacker would have had to compromise the Library of Congress (which uses the same capture technology as the Wayback Machine but different storage technology) and archive.today (which uses different technology for both capture and storage). This makes the idea of an attack on the Wayback Machine improbable.

tassiechick's profile is sparse to say the least and her blog has only one post, dating from 2012 and making implausible claims while asking for donations to what looks like a non-existent fan site.

Here is a list of other relatively reliable sources debunking Reid's claims:

1. The Daily Beast.

2. Paste.

3. Vox.

4. New York Magazine.

5. Spin.

No similar comments will in future survive moderation.

David. said...

The Joy-Ann Reid story continues with Michael Nelson's Why we need multiple web archives: the case of blog.reidreport.com and his appearance on CNN.