- Archiving local server files
- Hacking the headless browser
- Stealing user secrets during capture
- Cross site scripting to steal archive logins
- Live web leakage on playback
- Show different page contents when archived
- Banner spoofing
- First, view the slides.
- Second, visit http://warc.games., which is a sandbox with
a local version of Webrecorder that has not been patched to fix known exploits, and a number of challenges for you learn how they might apply to web archives in general.
Archiving local server filesA page being archived might have links that, when interpreted in the context of the crawler, point to local resources that should not end up in public Web archives. Examples include:
Hacking the headless browserNowadays collecting many Web sites requires executing the content in a headless browser such as PhantomJS. They all have vulnerabilities, only some of which are known at any given time. The same is true of the virtualization infrastructure. Isolating the crawler in a VM or a container does add another layer of complexity for the attacker, who now needs exploits not just for the headless browser but also for the virtualization infrastructure. But it requires that both need to be kept up-to-date. This isn't a panacea, just risk reduction.
Stealing user secrets during captureUser-driven Web recorders place user data at risk, because they typically hand URLs to be captured to the recording process as suffixes to a URL for the Web recorder, thus vitiating the normal cross-domain protections. Everything, login pages, third-party ads, etc. is regarded as part of the Web recorder domain.
Cross site scripting to steal archive loginsSimilarly, the URLs used to replay content must be carefully chosen to avoid the risk of cross-site scripting attacks on the archive. When replaying preserved content, the archive must serve all preserved content from a different top-level domain from that used by users to log in to the archive and for the archive to serve the parts of a replay page (e.g. the Wayback machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe. For example:
- Archive domain: https://perma.cc/
- Content domain: https://perma-archives.org/
Injecting the Content-Security-Policy (CSP) header into replayed content can mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).
Show different page contents when archivedI wrote previously about the fact that these days the content of almost all web pages depends not just on the browser, but also the user, the time, the state of the advertising network and other things. Thus it is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternately, the page can detect that it is being replayed, and display different content or attack the replayer.
This is another reason why both the crawler and the replayer should be run in isolated containers or VMs. The bigger question of how crawlers can be configured to obtain representative content from personalized, geolocated, advert-supported web-sites is unresolved, but out of scope for Cushman and Kreymer's talk.