Wednesday, January 13, 2016

Guest post: Ilya Kreymer on oldweb.today

Recently, the remarkably productive Ilya Kreymer put up an emulation-based system for displaying archived Web pages using contemporary browsers at http://oldweb.today/. I mentioned it in my talk at the last CNI meeting, but I had misunderstood the details, so Ilya had to correct me.

Ilya's work is much more important than I originally realized. It isn't just a very good example of the way that emulation can layer useful services over archived content. It is also a different approach to delivering emulations, leveraging the current trend towards containers and thus less dependent on specialized, preservation-only technology.

I asked Ilya to write a guest post explaining how it works, which is below the fold.

In this guest post, I would like to talk a bit about http://oldweb.today, a new system designed to combine web archives with emulated, or virtual, web browsers, allowing users to browse old websites using a selection of old (and new) browsers.

In the report on emulation, the idea of the Internet Emulator is presented, and oldweb.today is one approach to such a system. The general idea of oldweb.today is to present users with virtual browsers running on remote machines, and to also augment the traditional web browsing experience by adding a time dimension. Since browsers do not traditionally have a time dimension, the time selector is presented “outside” the virtual browser, in the oldweb.today containing page.

The oldweb.today system is designed to interact with web archives, over HTTP, and also extends previous work on using the Memento protocol, also over HTTP. To be quite fair, it is really a web emulator, rather than a general Internet/network emulator.

As such, it consists of two important systems: a web browser emulator and a Memento aggregation system.

Emulators + Docker

The emulation component is possible due to several key technologies, the first of which is Docker.

Docker provides a container system on top of Linux kernel APIs that allows running applications in isolated containers on a host Linux machine (https://docs.docker.com/engine/introduction/understanding-docker/).

Amongst the features most relevant to oldweb.today, Docker containers provide a sandboxed file system, CPU scheduling and virtual network(s). Docker containers can be launched and stopped quickly. They are built from Docker images, which are stored in a custom layered filesystem format but are defined through an easy-to-use text format called a Dockerfile. Docker images can also extend other images and can be shared publicly via Docker’s public registry.

Docker containers are limited to running on Linux (at least for now). To support non-Linux browsers, a Docker container can run one of several emulator applications (that run on Linux).

Currently oldweb.today includes the following open-source emulators:

  • WINE - a “compatibility” layer that runs Windows applications on Linux. It is used to run Windows versions of IE4, IE5 and Netscape. It is not a perfect emulation, but was chosen to be able to run Windows browsers without actually including full versions of Windows.

  • Basilisk - a MacOS 68k emulator, used to run Netscape 3.04, Netscape 4.08 and IE 4.01 on System 7.5.5.

  • Sheepshaver - a PowerPC emulator, used to run Netscape 4.8 and IE 5.1.7 on Mac OS System 7.5.5.

  • Previous - a NeXTSTEP 68k emulator, used to run Tim Berners-Lee's first browser, WWW, on a NeXTSTEP system.

It may be interesting to point out that there are different layers of ‘emulation’ here. There are full emulators (Basilisk, Sheepshaver, Previous), which run complete operating systems; for the Mac emulators, that is System 7.5.5 with UI extensions. (System 7.5.3 was released publicly by Apple, and 7.5.5 is a free update.)

Then there is WINE, which is really a Windows ‘compatibility layer’ for Linux and avoids running or distributing Windows itself. Finally, there are Linux browsers (Netscape 4.79, Mosaic, Lynx, and modern Firefox and Chrome) which run directly in the Docker container, and aren’t “emulated” in the traditional sense, but perhaps “virtualized” for the user. Of course, other emulators and Linux browsers can be added as well if needed.

Docker supports many other useful features relevant to oldweb.today. One is automated port mapping, allowing a new port on the host machine to be mapped directly to a specific port in a container. For each browser container, two ports, one for the VNC connection and one for control messages, are assigned and sent to the user so that the user's page can communicate directly with the remote browser.
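To make the port mapping concrete, here is a minimal sketch in Python of how a launch command for one browser container might be assembled. The image name and the two container port numbers are hypothetical placeholders, not the values oldweb.today actually uses. The only Docker behaviour relied on is that `-p` with just a container port asks Docker to choose a free host port.

```python
def docker_run_command(image, vnc_port=5900, control_port=6082):
    """Build a `docker run` command that publishes two container ports.

    Passing only a container port to `-p` lets Docker pick a free host
    port for it. The default port numbers here are illustrative only.
    """
    return [
        "docker", "run", "-d",      # run the container in the background
        "-p", str(vnc_port),        # VNC connection
        "-p", str(control_port),    # control messages
        image,
    ]


# Hypothetical image name, used only for illustration.
command = docker_run_command("example/netscape")
print(" ".join(command))
```

After the container starts, the host ports Docker chose can be looked up (for example with `docker port`) and passed to the user's page.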

On the client side, noVNC, a VNC client written in HTML5, provides the key client-side communication: it receives screen captures from the remote browser and sends input events back.

Finally, Docker Swarm is a new Docker clustering system, which allows multiple machines to be treated as a single machine from Docker’s perspective. Swarm is used in the oldweb.today deployment to enable scaling across multiple machines, although at this time the scaling has been done manually. At peak load, it has been used with 5+ virtual machines, with requests for new browser containers distributed amongst them.

Based on empirical evidence (at least on Amazon EC2 vCPUs), each CPU can support 5 to 6 containers before performance starts to degrade. By default, Docker splits the CPU time equally across all containers, so no one browser should be able to use more than its fair share. To avoid too many containers slowing things down, oldweb.today limits the maximum number of simultaneous browsers as a function of the available CPUs. Users beyond this limit enter a queue and wait for an empty spot to become available.
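The limit-and-queue behaviour can be sketched in a few lines of Python. This is an illustration of the idea only, not the oldweb.today implementation: the class name `BrowserSlots`, its methods, and the default of 5 containers per CPU are assumptions chosen to match the figures above.

```python
from collections import deque


class BrowserSlots:
    """Cap running browsers at cpus * per_cpu and queue everyone else."""

    def __init__(self, cpus, per_cpu=5):
        self.capacity = cpus * per_cpu   # limit as a function of CPUs
        self.running = set()             # users with a live browser
        self.waiting = deque()           # first-in, first-out queue

    def request(self, user):
        """Start a browser if a slot is free; otherwise queue the user.

        Returns True if the browser starts immediately.
        """
        if len(self.running) < self.capacity:
            self.running.add(user)
            return True
        self.waiting.append(user)
        return False

    def release(self, user):
        """Free a slot and give it to the longest-waiting user, if any.

        Returns the user who was promoted from the queue, or None.
        """
        self.running.discard(user)
        if self.waiting and len(self.running) < self.capacity:
            promoted = self.waiting.popleft()
            self.running.add(promoted)
            return promoted
        return None
```

With `BrowserSlots(cpus=2)`, the first ten requests start at once; the eleventh waits until some earlier user's browser is released.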

Memento Aggregation

Another key aspect of oldweb.today is the embedded Memento Aggregator, using the MemGator aggregator created by Sawood Alam. While oldweb.today could of course work with a single archive, the intent was to create the most accurate replay of “the old web” by querying multiple memento endpoints at the same time, and selecting a combination of the closest available Mementos. MemGator was used in order to have more flexibility over which archives are included in the aggregation and to allow for adjusting of various settings related to timeouts.
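As a rough illustration of "closest available Memento", the Python sketch below picks, for a single URL, the capture nearest in time to the requested datetime from captures reported by several archives. The tuple layout and the archive names are hypothetical; MemGator's actual response format is not shown here.

```python
from datetime import datetime, timezone


def closest_memento(mementos, target):
    """Return the (archive, url, captured_at) tuple nearest in time to target.

    `mementos` is a non-empty list of captures of one URL gathered from
    several archives; `captured_at` and `target` are aware datetimes.
    """
    return min(mementos, key=lambda m: abs(m[2] - target))


# Hypothetical captures of one URL from two made-up archives.
captures = [
    ("archive-a", "http://example.com/", datetime(1997, 1, 10, tzinfo=timezone.utc)),
    ("archive-b", "http://example.com/", datetime(1996, 12, 20, tzinfo=timezone.utc)),
    ("archive-a", "http://example.com/", datetime(2001, 6, 1, tzinfo=timezone.utc)),
]
target = datetime(1996, 12, 25, tzinfo=timezone.utc)
best = closest_memento(captures, target)
print(best[0], best[2].date())
```

Repeating this choice for every resource a page loads yields a page assembled from a combination of archives.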

This approach builds upon LANL’s Memento Aggregator and Time Travel API, and the collaboration with LANL on creating Memento Reconstruct, which used a similar approach of querying multiple archives to “reconstruct” the most accurate available aggregated Memento.

While Reconstruct used the traditional URL rewriting (a.k.a. Wayback Machine) approach to web archive replay, oldweb.today avoids URL rewriting altogether. Instead, the original HTTP traffic is served directly to the emulated browsers, essentially unaltered. (A couple of exceptions: some cleanup is done on the headers for the WWW browser and Mosaic to avoid overwhelming them with unknown headers or gzip-encoded data.)

HTTP Proxy

This is possible due to a wonderful feature of the HTTP protocol: the concept of the proxy server, which has been included in HTTP since 1.0 and possibly earlier. The proxy server was designed as a way to allow firewalled systems to access the web: all connections would go through the designated proxy.

The presence of HTTP proxy support provides an elegant solution for web archives, entirely at the HTTP protocol level. By using HTTP proxy mode, all the complexities of URL rewriting traditionally found in web archive replay can be eliminated. Unfortunately, configuring proxy mode manually is cumbersome and impractical for someone who only occasionally browses a web archive. For an emulated browser, these difficulties can be eliminated, as all the necessary settings can be preconfigured automatically.

The browsers launched by oldweb.today in Docker containers are already configured with a proxy server, pointing to another Docker container running the pywb software, which acts as an HTTP/S proxy server. When a user changes the time dimension, this information is sent to the proxy server container and associated with the Docker container of that user's emulated browser. The proxy later applies it as the Memento request datetime when querying the aggregator.
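The bookkeeping on the proxy side might look something like the following Python sketch. It is an assumption-laden illustration rather than pywb's code: the `DatetimeRegistry` class is hypothetical, and identifying each browser container by its IP address is a guess at how the mapping could be keyed. `Accept-Datetime` is the request header the Memento protocol defines for datetime negotiation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


class DatetimeRegistry:
    """Remember the datetime each browser container has selected."""

    def __init__(self):
        self._by_ip = {}   # container IP (assumed key) -> aware UTC datetime

    def set_datetime(self, browser_ip, when):
        """Record the datetime chosen in the time selector for a browser."""
        self._by_ip[browser_ip] = when.astimezone(timezone.utc)

    def headers_for(self, browser_ip):
        """Build Memento request headers for a proxied request.

        Returns an empty dict if no datetime has been set for the browser.
        """
        when = self._by_ip.get(browser_ip)
        if when is None:
            return {}
        # Accept-Datetime carries an HTTP-date expressed in GMT.
        return {"Accept-Datetime": format_datetime(when, usegmt=True)}


registry = DatetimeRegistry()
registry.set_datetime("172.17.0.5", datetime(1996, 12, 1, tzinfo=timezone.utc))
print(registry.headers_for("172.17.0.5"))
```

When a request arrives from a browser container, the proxy would look up that container's datetime and attach the header to its query to the aggregator.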

The web emulator system provided by oldweb.today allows users to browse archived web data, in its original unaltered form, with a variety of different browsers running in a controlled environment. This is all made possible by several factors: the relative stability of the HTTP protocol and features such as HTTP proxy, the adoption of Memento protocol, along with the maturity of Linux container systems such as Docker. There are a lot more improvements that can be made and edge cases to address, but this should provide a solid basis for further work.

Updated with corrections from Ilya 1/13/16 0940.