Thursday, October 3, 2019

Guest post: Ilya Kreymer's Client-Side Replay Technology

Ilya Kreymer gave a brief description of his recent development of client-side replay for WARC-based Web archives in this comment on my post Michael Nelson's CNI Keynote: Part 3. It uses Service Workers, which Matt Gaunt describes in Google's Web Fundamentals thus:
A service worker is a script that your browser runs in the background, separate from a web page, opening the door to features that don't need a web page or user interaction. Today, they already include features like push notifications and background sync. In the future, service workers might support other things like periodic sync or geofencing. The core feature discussed in this tutorial is the ability to intercept and handle network requests, including programmatically managing a cache of responses.
Client-side replay was clearly an important advance, so I asked him for a guest post with the details. Below the fold, here it is.

Introducing wabac.js: Viewing Web Archives Directly in the Browser

As the web has evolved, many types of digital content can be opened and viewed directly in modern web browsers. A user can view images, watch video, listen to audio, read a book or publication, and even interact with 3D models, directly in the web browser. All of this content can be delivered as discrete (or streaming) files and rendered neatly by the browser. This lends itself well to web-based digital repositories, which can store these objects (and associated metadata) and serve them via the web.

But what about archiving web content itself and presenting it in a browser? The web is oriented around a network protocol (HTTP), not particular media formats. Thus, web archiving involves the capture of web (HTTP) traffic from multiple web servers, and the later replay of that HTTP traffic in a different context. Web archiving has always been a form of ‘self-emulation’ of the web in the same medium. Accurately replaying previously captured web content requires not just a web browser, but also a web server that can ‘emulate’ any number of original web servers, and a system that can ‘emulate’ any number of sites when rendered in the browser. For this reason, a browser’s ‘Save Page As...’ functionality never really quite worked for anything but the simplest of sites: the browser does not store the network traffic and does not have the ability to emulate itself!
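To make the raw material concrete: captured HTTP traffic is conventionally stored in WARC files, each a sequence of records pairing capture metadata with the verbatim HTTP exchange. A simplified, abbreviated sketch of a single ‘response’ record (field values here are illustrative):

```
WARC/1.1
WARC-Type: response
WARC-Target-URI: https://www.example.com/
WARC-Date: 2018-04-28T14:04:59Z
WARC-Record-ID: <urn:uuid:...>
Content-Type: application/http; msgtype=response
Content-Length: 1536

HTTP/1.1 200 OK
Content-Type: text/html

<html>...</html>
```

Replay means turning records like this back into live-looking responses inside a browser.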

But what if it were possible to ‘render’ an arbitrary web archive, directly in the browser, just as easily as it is to view a video or a PDF?

Thanks to a technology originally developed for offline browsing, called Service Workers, this is now possible! Service workers allow a browser to also act as a web server for a given domain. This feature was designed for offline caching to speed up complex web applications, but turns out to have a significant impact on web archives. Using service workers, it is now possible to “emulate” a web server in the browser, thereby loading a web archive using the browser itself.
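As a minimal sketch of the underlying mechanism (this is illustrative, not wabac.js’s actual code, and `loadFromArchive` is a hypothetical helper that would look up a URL in a parsed WARC), a service worker intercepts each request a page makes and answers it from the archive instead of the network:

```javascript
// sw.js -- answer every request from the archive instead of the network.
// loadFromArchive() is a hypothetical helper that looks up the URL in a
// parsed WARC file and returns { body, status, headers }.
self.addEventListener('fetch', (event) => {
  event.respondWith(
    loadFromArchive(event.request.url)
      .then((record) => new Response(record.body, {
        status: record.status,
        headers: record.headers,
      }))
      // If the URL was never captured, report that instead of failing.
      .catch(() => new Response('Not in archive', { status: 404 }))
  );
});
```

A page registers such a worker with `navigator.serviceWorker.register('/sw.js')`, after which the browser routes that origin’s requests through it.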

Initial research on using service workers for web archive replay has been published by Sawood Alam et al. of ODU's WS-DL group (Client-Side Reconstruction of Composite Mementos Using ServiceWorker). I have taken the idea a step further, porting aspects of Webrecorder's existing Python-based software to run directly in the browser as a service worker.

The result is the following prototype, Web Archive Browsing Advanced Client (WABAC), available on our GitHub as wabac.js and as a static site, https://wab.ac/. Much like opening a PDF, https://wab.ac/ allows opening and rendering WARC or HAR files directly in the browser (the files are not uploaded anywhere and never leave the user's machine).

The implications of this technology could be far reaching for web archives.

First, the infrastructure necessary to support web archives could be reduced primarily to the cost of storage. Any system that can store and serve static files over HTTP, including GitHub, S3, or any institutional repository, can thus function as a web archive, making storing web archives no different from storing other types of digital content.

For example, the following links load a web archive (via a WARC file) on-demand from GitHub, and render a blog post or Twitter feed.

Since GitHub provides free hosting for small files, the archived pages are available entirely for free. The trade-off currently is a slightly longer load time, as the full WARC needs to be downloaded, and more CPU processing in the user's browser. (The download time can likely be reduced by using a pre-computed lookup index, which has not yet been implemented.)
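For illustration, a hypothetical entry in such an index, in the CDXJ style used by tools like pywb: a canonicalized URL key and timestamp map to the byte offset and length of the matching record inside a WARC file, so a client could fetch only the records it needs:

```
com,example)/ 20180428140459 {"url": "https://www.example.com/", "mime": "text/html", "status": "200", "offset": "2048", "length": "1536", "filename": "example.warc.gz"}
```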

Second, this approach could increase trust in web archives by increasing transparency. Current web archive replay infrastructures involve complex server-side operations, such as URL rewriting and banner insertion, which can raise questions about the reliability of web archives.

Due to how web archives currently work, replay requires modifying the original content (to ‘emulate’ the original context) and serving a modified version in order for it to appear as expected, or with the archive's branding, in the browser. These modifications happen on the web archive's server and are thus opaque to the user: one could always claim the web archive has been improperly modified.

The service worker approach does not eliminate the complexity, but it increases transparency by moving all of the modifications necessary for rendering into the user's browser. The browser must therefore receive the raw content and perform the rendering itself. With the web archive replay happening in the browser and not on a remote server, it is possible to verify the rendering process, especially if the replay software is fully open source.
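To make the kind of modification involved concrete, here is a minimal, hypothetical sketch (not the actual wabac.js logic) of the most common rewriting step, now relocated from the server into the browser: mapping a URL found in an archived page onto a replay URL, so the browser requests the archived copy rather than the live site:

```javascript
// Hypothetical client-side rewriting step. REPLAY_PREFIX is an assumed
// replay path of the form /web/<timestamp>/<url>.
const REPLAY_PREFIX = '/web/';

function rewriteUrl(originalUrl, timestamp) {
  return `${REPLAY_PREFIX}${timestamp}/${originalUrl}`;
}

// rewriteUrl('https://www.cnn.com/styles.css', '20180428140459')
// => '/web/20180428140459/https://www.cnn.com/styles.css'
```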

Of course, trusting the raw content becomes even more imperative, and more work is needed there. The WARC format, the standard format for archived web content, does not itself contain a way to verify that the data has not been tampered with. There have been various suggestions for how to verify that raw web archive data received from another server has not been tampered with, including also storing certificates, signing the WARC files themselves, or possibly exploring the new signed exchange standard that is being proposed. Similar to other media, some form of verification of WARC data will be necessary to avoid ‘deep-fake’ WARCs that misrepresent the past, and much more work in this area is still needed.
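WARC records do conventionally carry a WARC-Payload-Digest header (typically a Base32-encoded SHA-1), but since anyone who alters a record can simply recompute it, this catches corruption rather than tampering. A minimal Node.js sketch of checking such a digest, for illustration:

```javascript
// Verify a WARC record's payload against its WARC-Payload-Digest header.
// Digests are conventionally "sha1:" followed by Base32-encoded SHA-1.
// Note: this detects corruption, not tampering -- anyone can recompute it.
const crypto = require('crypto');

const BASE32 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

// A 20-byte SHA-1 encodes to exactly 32 Base32 characters, no padding.
function toBase32(buf) {
  let bits = 0, value = 0, out = '';
  for (const byte of buf) {
    value = (value << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      out += BASE32[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
  }
  return out;
}

function verifyPayloadDigest(payloadBuffer, digestHeader) {
  const sha1 = crypto.createHash('sha1').update(payloadBuffer).digest();
  return digestHeader === 'sha1:' + toBase32(sha1);
}
```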

However, the client-side rendering approach clearly separates the rendering of a web archive from the raw data itself.

The clear distinction between the raw web archive data (e.g. WARC files) and the software needed to replay that data has a number of useful properties for digital preservation, aligning web archives with other types of content.

For example, it certainly makes sense that a PDF file or video is separate from the PDF viewer or video player that is used to access or view it (there can be various ways of expressing the association between the two). When a video or PDF does not work in a particular player or viewer, we can try a different one to determine whether the issue is with the raw content or with the software.

But no such distinction currently exists for web archives, which obfuscates issues and reduces trust in web archives. Web archives are usually referenced by a URL, such as:

https://web.archive.org/web/20180428140459/https://www.cnn.com/

but this simultaneously implies a reference to a particular archive of https://www.cnn.com/ from 2018-04-28 available from the Internet Archive AND the Internet Archive's rendering of the archived page using a particular version of its (non-open-source) web archive replay software. In this case, the rendering does not work. Does that mean the capture is invalid, or that the rendering software does not work? We cannot tell from the results, because the raw content is intermingled with the replay software, and so trust in web archives is lessened. If IA has issues with this page, what other issues does it have? Maybe it isn't able to capture the data at all?

It turns out, the issue in this case is with the rendering, not the capture. IA does indeed have a capture of this page, but their current replay software is unable to render it properly. Fortunately, using the service worker replay system, it is possible to apply an ‘alternative’ rendering system to the same content!

The service-worker based replay system itself consists simply of HTML and JavaScript files, and thus can be added to an existing web archive and then used to provide an alternative replay. It turns out that it is possible, as a proof-of-concept, to add an experimental, alternative client-side replay system directly to the Internet Archive's Wayback Machine. The following example should yield a better replay of the same complex page:

https://web.archive.org/web/20191001020034/https://wab.ac/#/20180428140459|https://www.cnn.com/

This approach works because the client-side, service-worker based replay system is itself just a web page, and thus can itself be archived and versioned as needed. The improved replay reconstructs the page in the client browser, without costing IA anything, although browser CPU and memory usage may increase and it may not always work (again, this is a prototype!).

Just like PDFs or videos, a web archive could be associated with particular replay software, and if a better version of the replay software is developed, the archive could be associated with the improved version. This approach should make web archives more comprehensible and future-proof, as particular versions of replay software can be associated with existing raw content, while new web archives could be required to use a new version.

In the above example, the first timestamp (20191001020034) is the timestamp of the replay software, while the second timestamp is the actual timestamp of the content to be replayed (20180428140459). In fact, all of the versions of the client-side replay can be viewed by loading:

https://web.archive.org/web/*/https://wab.ac/

This is something of an elegant hack on top of an existing system, and not necessarily the ideal way to do this, but it is one way of illustrating the distinction! For better reliability, the replay software should be versioned and stored separately from the archived content.
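To make the nested structure explicit, here is a small, hypothetical sketch that splits such a URL (in the fragment format shown above) into the replay-software timestamp and the content timestamp plus target URL:

```javascript
// Decompose a nested replay URL of the form:
//   https://web.archive.org/web/<swTs>/https://wab.ac/#/<contentTs>|<url>
// Hypothetical sketch based on the example above.
function parseNestedReplayUrl(nestedUrl) {
  const u = new URL(nestedUrl);
  // Outer Wayback path: /web/<swTs>/<archived-url>
  const softwareTs = u.pathname.split('/')[2];
  // Inner fragment: #/<contentTs>|<original-url>
  const [contentTs, originalUrl] = u.hash.slice(2).split('|');
  return { softwareTs, contentTs, originalUrl };
}

// parseNestedReplayUrl('https://web.archive.org/web/20191001020034/https://wab.ac/#/20180428140459|https://www.cnn.com/')
// => { softwareTs: '20191001020034', contentTs: '20180428140459',
//      originalUrl: 'https://www.cnn.com/' }
```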

Applying an alternative replay system to IA's existing replay works because IA's Wayback Machine provides an ‘unofficial’ way of retrieving the raw content. While IA does not make all the WARCs publicly accessible for download, it is possible to load the raw content by using a special modifier in the URL. The ability to load raw unmodified content was also discussed as a possible extension to the Memento protocol, along with a more general preference system.
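For example, in Wayback-style replay systems the `id_` (‘identity’) modifier appended to the timestamp conventionally requests the capture without any rewriting, so for the page above the raw content would be fetched from:

https://web.archive.org/web/20180428140459id_/https://www.cnn.com/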

However, the final Memento extension proposal, with multiple 'dimensions of rawness', was perhaps overly complex: what is needed is simply a way to transmit the raw contents of a WARC, either as a full WARC or per WARC record (memento), without any modifications. Then, all necessary transformations and rewriting can be applied in the browser.

Why is per-memento access to raw content still useful if the WARC files can be downloaded directly? Ideally, the system would always work with raw WARC files, available as static files. In practice, large archives rarely make the raw WARC files available, for security and access control reasons. For example, a typical WARC from a crawl might be 1GB in size and contain multiple resources, but an archive may choose to exclude or embargo a particular resource, say a 1MB file 200MB into the WARC. This would not be possible if the entire WARC were made available for download. For archives that require fine-grained access control, providing per-memento access to raw data is the best option for browser-based replay.
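Conversely, when raw WARCs are exposed as static files, a client does not even need to download the whole file: given a lookup index entry like the one sketched earlier (with a byte offset and length), a hypothetical client can fetch a single record with an HTTP Range request:

```javascript
// Fetch a single WARC record from a static file using an HTTP Range
// request, given the byte offset and length from a lookup index.
// Assumes the server honors Range requests, as static hosts typically do.
async function fetchWarcRecord(warcUrl, offset, length) {
  const resp = await fetch(warcUrl, {
    headers: { 'Range': `bytes=${offset}-${offset + length - 1}` },
  });
  if (resp.status !== 206) {
    throw new Error('Server did not honor the Range request');
  }
  // The body is one WARC record, ready to be decompressed (if gzipped)
  // and parsed by the replay software.
  return new Uint8Array(await resp.arrayBuffer());
}
```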

Thus far, the service-worker browser-based replay proof-of-concept has been tested in two different ways: with small-scale, object-bound archives, such as a single blog post or a social media page, where the entire WARC is available and small enough to be downloaded at once; and embedded into very large existing web archives, such as IA's Wayback Machine, where the client-side replay system sits on top of an existing replay mechanism capable of serving raw archived content or mementos.

While more research and testing is needed, the approach presents a compelling path to browser-based web archive rendering for archives of all sizes and could significantly improve web archive transparency and trust, reduce operational costs of running web archives, and improve digital preservation workflows.
