Alas, even libraries have enemies. Governments and corporations have tried to rewrite history. Ideological zealots have tried to suppress research of which they disapprove. The LOCKSS polling and repair protocol was designed to make it as difficult as possible for even a powerful attacker to change content preserved in a decentralized LOCKSS network, by exploiting excess replication and the lack of a central locus of control.
Just like libraries, Web archives have enemies. Jack Cushman and Ilya Kreymer's (CK) talk at the 2017 Web Archiving Conference identified seven potential vulnerabilities of centralized Web archives that an attacker could exploit to change or destroy content in the archive, or mislead an eventual reader as to the archived content.
Now, Rewriting History: Changing the Archived Web from the Present by Ada Lerner et al (L) identifies four attacks that, without compromising the archive itself, caused browsers using the Internet Archive's Wayback Machine to display pages that looked different from the originally archived content. It is important to observe that the title is misleading, and that these attacks are less serious than those that compromise the archive. Problems with replaying archived content are fixable; loss of or damage to archived content is not.
Below the fold I examine L's four attacks and relate them to CK's seven vulnerabilities.
To review, CK's seven vulnerabilities are:
- Archiving local server files, in which resources local to the crawler end up in the archive.
- Stealing user secrets during capture, a vulnerability of user-driven crawlers which typically violate cross-domain protections.
- Cross site scripting to steal archive logins:
When replaying preserved content, the archive must serve all of it from a different top-level domain than the one used for users to log in to the archive and for the archive to serve the parts of a replay page (e.g. the Wayback Machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe.
- Live web leakage on playback:
- Show different page contents when archived:
It is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternatively, the page can detect that it is being replayed, and display different content or attack the replayer.
- Banner spoofing:
When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page.
- Same-Origin Escape + Archive-Escape: The attackers combined L1 and L2 by including in the iframe code that deliberately generated archive escapes. It again requires foresight, since the escape-generating code must be present at ingest time.
Injecting the Content-Security-Policy (CSP) header into replayed content could mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s). Web archives should do this; browsers have supported the CSP header for at least 4 years. The version of the Wayback Machine used by the Internet Archive's ArchiveIt service uses CSP to prevent live Web leakage, but the main Wayback Machine currently doesn't. If it did, L1 through L3 would be ineffective.
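To make the CSP mitigation concrete, here is a minimal sketch of how a replay system might attach the header to a replayed response. It is not the code of the Wayback Machine or ArchiveIt. The function names, the origin `https://replay.example.org`, and the exact policy string are all assumptions chosen for illustration; in particular, allowing `'unsafe-inline'` and `'unsafe-eval'` is my guess at what archived pages would need in order to render at all.

```python
def build_replay_csp(replay_origins):
    """Build a CSP value that limits resource loads to the replay origin(s).

    Assumption: inline scripts/styles and data:/blob: URLs are allowed so that
    archived pages still render; a stricter archive could drop them.
    """
    sources = " ".join(replay_origins)
    return f"default-src {sources} 'unsafe-inline' 'unsafe-eval' data: blob:"


def inject_csp(headers, replay_origins):
    """Return a new header list carrying the archive's CSP.

    Any CSP header captured with the original page is dropped, since it
    describes the live site's origins rather than the archive's.
    """
    kept = [(name, value) for name, value in headers
            if name.lower() != "content-security-policy"]
    kept.append(("Content-Security-Policy", build_replay_csp(replay_origins)))
    return kept


if __name__ == "__main__":
    archived = [("Content-Type", "text/html"),
                ("content-security-policy", "default-src *")]
    for name, value in inject_csp(archived, ["https://replay.example.org"]):
        print(f"{name}: {value}")
```

With a policy like this, a compliant browser refuses to fetch a script or image from the live Web while rendering the replayed page, which is the behavior the archive-escape attacks depend on.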
All this being said, there are some important caveats that users of preserved Web content should bear in mind. It is extremely likely that the payload of a URL delivered by the Wayback Machine is the same as the one its crawler collected at the specified time. However, this does not mean that the rendered page in your browser looks the same as it would have had you visited the page when the Wayback Machine's crawler did:
- If the Web archive's replay system does not use CSP, all bets are off.
- Browsers evolve, rendering pages differently. Using oldweb.today can mitigate, but not eliminate, this problem, as I wrote in The Internet Is for Cats.
- At collection time, the owner of the page's domain, or the domain of any of the embedded resources, or even someone who had compromised the Web servers of the page or any of its embedded resources, could be malicious. As in the CK6 vulnerability, they could detect that the page was being archived and deliver to the crawler a payload different from the one they would have delivered to a browser.
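For the first caveat, a reader or archive operator can get a rough idea of whether a replayed page escapes the archive by scanning its HTML for resource URLs that do not point back at the replay host. The sketch below uses only the Python standard library. The names `EscapeFinder` and `find_archive_escapes`, the host `replay.example.org`, and the short list of tags checked are assumptions for illustration. It inspects static markup only, so it will miss requests generated by scripts at replay time, which is exactly what the deliberately constructed escapes rely on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class EscapeFinder(HTMLParser):
    """Collect resource URLs in replayed HTML that leave the replay host."""

    # Assumption: a small, non-exhaustive set of resource-loading attributes.
    RESOURCE_ATTRS = {("script", "src"), ("img", "src"),
                      ("iframe", "src"), ("link", "href")}

    def __init__(self, page_url, replay_host):
        super().__init__()
        self.page_url = page_url
        self.replay_host = replay_host
        self.escapes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.RESOURCE_ATTRS:
                # Resolve relative references against the replayed page's URL.
                absolute = urljoin(self.page_url, value)
                if urlsplit(absolute).hostname != self.replay_host:
                    self.escapes.append(absolute)


def find_archive_escapes(html, page_url, replay_host):
    """Return the statically visible resource URLs outside the replay host."""
    finder = EscapeFinder(page_url, replay_host)
    finder.feed(html)
    return finder.escapes


if __name__ == "__main__":
    page = ('<img src="/web/2017/logo.png">'
            '<script src="http://evil.example/x.js"></script>')
    print(find_archive_escapes(
        page, "https://replay.example.org/web/2017/page.html",
        "replay.example.org"))
```

An empty result does not show that a page is safe; it only shows that nothing in the static markup points at the live Web.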