Tuesday, September 19, 2017

Attacking (Users Of) The Wayback Machine

Right from the start, nearly two decades ago, the LOCKSS system assumed that:
Alas, even libraries have enemies. Governments and corporations have tried to rewrite history. Ideological zealots have tried to suppress research of which they disapprove.
The LOCKSS polling and repair protocol was designed to make it as difficult as possible for even a powerful attacker to change content preserved in a decentralized LOCKSS network, by exploiting excess replication and the lack of a central locus of control.

Just like libraries, Web archives have enemies. Jack Cushman and Ilya Kreymer's (CK) talk at the 2017 Web Archiving Conference identified seven potential vulnerabilities of centralized Web archives that an attacker could exploit to change or destroy content in the archive, or mislead an eventual reader as to the archived content.

Now, Rewriting History: Changing the Archived Web from the Present by Ada Lerner et al (L) identifies four attacks that, without compromising the archive itself, caused browsers using the Internet Archive's Wayback Machine to view pages that look different to the originally archived content. It is important to observe that the title is misleading, and that these attacks are less serious than those that compromise the archive. Problems with replaying archived content are fixable, loss or damage to archived content is not fixable.

Below the fold I examine L's four attacks and relate them to CK's seven vulnerabilities.

To review, CK's seven vulnerabilities are:
  1. Archiving local server files, in which resources local to the crawler end up in the archive.
  2. Hacking the headless browser, in which vulnerabilities in the execution of Javascript by the crawler are exploited.
  3. Stealing user secrets during capture, a vulnerability of user-driven crawlers which typically violate cross-domain protections.
  4. Cross site scripting to steal archive logins:
    When replaying preserved content, the archive must serve all preserved content from a different top-level domain from that used by users to log in to the archive and for the archive to serve the parts of a replay page (e.g. the Wayback machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe.
  5. Live web leakage on playback:
    Especially with Javascript in archived pages, it is hard to make sure that all resources in a replayed page come from the archive, not from the live Web. If live Web Javascript is executed, all sorts of bad things can happen. Malicious Javascript could exfiltrate information from the archive, track users, or modify the content displayed.
  6. Show different page contents when archived:
    it is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternately, the page can detect that it is being replayed, and display different content or attack the replayer.
  7. Banner spoofing:
    When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page.
Vulnerabilities CK1 through CK4 are attacks on the archive itself, possibly leading to corruption and loss. The remaining three are attacks on the eventual reader, similar to of L's four. You need to read the paper to get the full details of their attacks, but in summary they are are:
  1. Archive-Escape Abuse: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner, which they show is common. The resource was not present in the archive, so when they obtained control of the domain they were able to serve from it malicious JavaScript that the page served from the Wayback Machine would include. This is a version of vulnerability CK5.
  2. Same-Origin Escape Abuse: The attackers identified an archived victim page that, in an iframe from a third-party domain, included malicious JavaScript. On the live Web the Same-Origin policy prevented it from executing, but when served from the Wayback Machine the page and the iframe had the same origin. This is related to vulnerability CK4. It requires foresight, since the iframe code must be present at ingest time.
  3. Same-Origin Escape + Archive-Escape: The attackers combined L1 and L2 by including in the iframe code that deliberately generated archive escapes. It again requires foresight, since the escape-generating code must be present at ingest time.
  4. Anachronism-Injection: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner. The resource was not present in the archive, so when they obtained control of the domain they could use the Wayback Machine's "Save Page Now" facility to create an archived version of the resource. Now when the Wayback Machine served the page, the attackers' version of the resource would be served from the archive. The only way to defend against this attack, since the attacker's version of the resource will always be the closest in time to the victim page, would be to restrict searches for nearest-in-time resources to a small time range.
Unlike L, CK note that Web archives could prevent leaks to the live Web:
Injecting the Content-Security-Policy (CSP) header into replayed content could mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).
Web archives should; browsers have supported the CSP header for at least 4 years. The version of the Wayback Machine used by the Internet Archive's ArchiveIt service uses CSP to prevent live Web leakage, but the main Wayback Machine currently doesn't. If it did, L1 through L3 would be ineffective.

All this being said, there are some important caveats that users of preserved Web content should bear in mind. It is extremely likely that the payload of a URL delivered by the Wayback Machine is the same as that its crawler collected at the specified time. However, this does not mean that the rendered page in your browser looks the same as it would have had you visited the page when the Wayback Machine's crawler did:
  • If the Web archive's replay system does not use CSP, all bets are off.
  • Browsers evolve, rendering pages differently. Using oldweb.today can mitigate, but not eliminate this problem, as I wrote in The Internet Is for Cats.
  • The embedded resources, such as images, CSS files, and JavaScript libraries, may not have been collected at the same time as the page itself, so may be different, as in the L4 attack.
  • At collection time, the owner of the page's domain, or the domain of any of the embedded resources, or even someone who had compromised the Web servers of the page or any of its embedded resources, could be malicious. As in the CK6 vulnerability, they could detect that the page was being archived and deliver to the crawler a payload different from that they would have delivered to a browser.
The bottom line is that all critical uses of preserved Web content, such as legal evidence, should be based on the source of the payload, not on a rendered page image.

2 comments:

Mark Graham said...

Thank you for this David.

One update/correction.

Dr. Ada Lerner (the lead author of the paper you reference) reached out to me a few weeks ago and, as a result, we added CSP headers to the "main" Wayback Machine (web.archive.org).

As such we have mitigated the first condition in the paper.

We will soon add a feature to allow users of the Wayback Machine to easily view the archive time/date of all "page" elements, mitigating the fourth condition.

- Mark Graham, Director, the Wayback Machine at the Internet Archive

David. said...

Mark Graham has a post at the Internet Archive blog describing the new "Timestamps" feature of the Wayback Machine:

"The Wayback Machine has an exciting new feature: it can list the dates and times, the Timestamps, of all page elements compared to the date and time of the base URL of a page. This means that users can see, for instance, that an image displayed on a page was captured X days before the URL of the page or Y hours after it. Timestamps are available via the “About this capture” link on the right side of the Wayback Toolbar."

This is the mitigation of the fourth condition mentioned in Mark's comment.