Wednesday, April 13, 2016

The Architecture of Emulation on the Web

In the program, my talk at the IIPC's 2016 General Assembly in Reykjavík was entitled Emulation & Virtualization as Preservation Strategies. But after a meeting called by the Mellon Foundation to review my report, I changed the title to The Architecture of Emulation on the Web. Below the fold, an edited text of the talk with an explanation for the change, and links to the sources.

Title

It's a pleasure to be here and I'm grateful to the organizers for inviting me to talk today. As usual, you don't need to take notes or ask for the slides; the text of the talk with links to the sources will go up on my blog shortly.

Thanks to funding from the Mellon Foundation, I spent last summer researching and writing a report, on behalf of the Mellon and Sloan Foundations and IMLS, entitled Emulation & Virtualization as Preservation Strategies. Jeff Rothenberg's 1995 Ensuring the Longevity of Digital Documents identified emulation and migration as the two possible techniques and came down strongly in favor of emulation. Despite this, migration has been overwhelmingly favored until recently. What has changed is that emulation frameworks have been developed that present emulations as a normal part of the Web.

Last month there was a follow-up meeting at the Mellon Foundation. In preparing for it, I realized that there was an important point that the report identified but didn't really explain properly. I'm going to try to give a better explanation today, because it is about how emulations of preserved software appear on the Web, and thus how they can become part of the Web that we collect, preserve and disseminate. I'll start by describing how the three emulation frameworks I studied appear on the Web, then illustrate the point with an analogy, and suggest how it might be addressed.

When I gave a talk about the report at CNI I included live demos. It was a disaster; Olive was the only framework that worked via hotel WiFi. I have pre-recorded the demos using Kazam and a Chromium browser on my Ubuntu 14.04 system.

Theresa Duncan's CD-ROMs

The Theresa Duncan CD-ROMs.
From 1995 to 1997 Theresa Duncan produced three seminal feminist CD-ROM games, Chop Suey, Smarty and Zero Zero. Rhizome, a project hosted by the New Museum in New York, has put emulations of them on the Web. You can visit http://archive.rhizome.org/theresa-duncan-cdroms/, click any of the "Play" buttons and have an experience very close to that of playing the CD on MacOS 7.5. This has proved popular. For several days after their initial release they were being invoked on average every 3 minutes.

What Is Going On?

What happened when I clicked Smarty's Play button?
  • The browser connects to a session manager in Amazon's cloud, which notices that this is a new session.
  • Normally it would authenticate the user, but because this CD-ROM emulation is open access it doesn't need to.
  • It assigns one of its pool of running Amazon instances to run the session's emulator. Each instance can run a limited number of emulators. If no instance is available when the request comes in it can take up to 90 seconds to start another.
  • It starts the emulation on the assigned instance, supplying metadata telling the emulator what to run.
  • The emulator starts. After a short delay the user sees the Mac boot sequence, and then the CD-ROM starts running.
  • At intervals, the emulator sends the session manager a keep-alive signal. Emulators that haven't sent one in 30 seconds are presumed dead, and their resources are reclaimed to avoid paying the cloud provider for unused resources.
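
The bookkeeping in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not bwFLA's or Rhizome's actual code: the `SessionManager` class, its method names and the per-instance slot model are all assumptions made for the example. Only the 30-second keep-alive timeout comes from the description above. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

KEEPALIVE_TIMEOUT = 30.0  # seconds without a keep-alive before a session is presumed dead


@dataclass
class Session:
    instance: str      # cloud instance running this session's emulator
    last_seen: float   # time of the most recent keep-alive


@dataclass
class SessionManager:
    # Hypothetical capacity model: instance name -> free emulator slots.
    free_slots: Dict[str, int]
    sessions: Dict[str, Session] = field(default_factory=dict)

    def start(self, session_id: str, now: float) -> Optional[str]:
        """Assign a running instance with a free slot, or None if all are full."""
        for name, free in self.free_slots.items():
            if free > 0:
                self.free_slots[name] = free - 1
                self.sessions[session_id] = Session(name, now)
                return name
        return None  # the caller would have to start another instance

    def keep_alive(self, session_id: str, now: float) -> None:
        """Record a keep-alive signal from a running emulator."""
        if session_id in self.sessions:
            self.sessions[session_id].last_seen = now

    def reap(self, now: float) -> List[str]:
        """Reclaim the slots of sessions that have been silent for too long."""
        dead = [sid for sid, s in self.sessions.items()
                if now - s.last_seen > KEEPALIVE_TIMEOUT]
        for sid in dead:
            session = self.sessions.pop(sid)
            self.free_slots[session.instance] += 1
        return dead
```

A real session manager would also handle authentication for restricted content and would launch new instances when `start` finds none free; both are omitted here.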

bwFLA architecture
bwFLA

Rhizome, and others such as Yale, the DNB and ZKM Karlsruhe, use technology from the bwFLA team at the University of Freiburg to provide Emulation As A Service (EAAS). Their GPLv3 licensed framework runs in "the cloud" to provide comprehensive management and access facilities wrapped around a number of emulators. It can also run as a bootable USB image or via Docker. bwFLA encapsulates each emulator so that the framework sees three standard interfaces:
  • Data I/O, connecting the emulator to data sources such as disk images, user files, an emulated network containing other emulators, and the Internet.
  • Interactive Access, connecting the emulator to the user using standard HTML5 facilities.
  • Control, providing a Web Services interface that bwFLA's resource management can use to control the emulator.
The communication between the emulator and the user takes place via standard HTTP on port 80; there is no need for a user to install software, or browser plugins, and no need to use ports other than 80. Both of these are important for systems targeted at use by the general public.

bwFLA's preserved system images are stored as a stack of overlays in QEMU's "qcow2" format. Each overlay on top of the base system image represents a set of writes to the underlying image. For example, the base system image might be the result of an initial install of Windows 95, and the next overlay up might be the result of installing WordPerfect into the base system. Or the next overlay up might be the result of redaction. Each overlay contains only those disk blocks that differ from the stack of overlays below it. The stack of overlays is exposed to the emulator as if it were a normal file system via FUSE.
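
The overlay idea can be modeled conceptually in Python. This sketch is not the qcow2 on-disk format and uses no QEMU code; overlays are represented as plain dictionaries from block number to contents, and the names `read_block` and `make_overlay` and the block size are assumptions chosen for illustration.

```python
from typing import Dict, List

BLOCK_SIZE = 512  # assumed block size for this illustration

# An image layer maps block numbers to block contents.
Layer = Dict[int, bytes]


def read_block(stack: List[Layer], block_no: int) -> bytes:
    """Read a block from a stack where stack[0] is the base image.

    The topmost layer containing the block wins; a block that no layer
    contains reads as zeros.
    """
    for layer in reversed(stack):
        if block_no in layer:
            return layer[block_no]
    return bytes(BLOCK_SIZE)


def make_overlay(stack: List[Layer], writes: Dict[int, bytes]) -> Layer:
    """Build a new overlay holding only the blocks that differ from the stack below."""
    return {n: data for n, data in writes.items()
            if read_block(stack, n) != data}
```

With a base layer standing in for the Windows 95 install, `make_overlay([base], writes)` would produce the WordPerfect overlay, and reads against `[base, overlay]` see the combined disk without the base ever being modified.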

The technical metadata that encapsulates the system disk image is described in a paper presented to the iPres conference in November 2015, using the example of emulating CD-ROMs. Broadly, it falls into two parts, describing the software and hardware environments needed by the CD-ROM in XML. The XML refers to the software image components via the Handle system, providing a location-independent link to access them.

TurboTax

TurboTax97 on Windows 3.1
I can visit https://olivearchive.org/launch/11/ and get 1997's TurboTax running on Windows 3.1. The pane in the browser window has top and bottom menu bars, and between them is the familiar Windows 3.1 user interface.

What Is Going On?

The top and bottom menu bars come from a program called VMNetX that is running on my system. Chromium invoked it via a MIME-type binding, and VMNetX then created a suitable environment in which it could invoke the emulator that is running Windows 3.1, and TurboTax. The menu bars include buttons to power-off the emulated system, control its settings, grab the screen, and control the assignment of the keyboard and mouse to the emulated system.

The interesting question is "where is the Windows 3.1 system disk with TurboTax installed on it?"

Olive

The answer is that the "system disk" is actually a file on a remote Apache Web server. The emulator's disk accesses are being demand-paged over the Internet using standard HTTP range queries to the file's URL.

This system is Olive, developed at Carnegie Mellon University by a team under my friend Prof. Mahadev Satyanarayanan, and released under GPLv2. VMNetX uses a sophisticated two-level caching scheme to provide good emulated performance even over slow Internet connections. A "pristine cache" contains copies of unmodified disk blocks from the "system disk". When a program writes to disk, the data is captured in a "modified cache". When the program reads a disk block, it is delivered from the modified cache, the pristine cache or the Web server, in that order. One reason this works well is that successive emulations of the same preserved system image are very similar, so pre-fetching blocks into the pristine cache is effective in producing YouTube-like performance over 4G cellular networks.
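
The read path described above can be sketched as follows. This is an illustration of the lookup order only, not VMNetX's implementation: the `TwoLevelCache` class, the 4096-byte block size and the `fetch` callback (which stands in for an HTTP range request to the image's URL) are all assumptions. The `range_header` helper shows the standard `Range: bytes=start-end` form such a request would carry.

```python
from typing import Callable, Dict, Iterable

BLOCK_SIZE = 4096  # assumed block size for this illustration


def range_header(block_no: int, block_size: int = BLOCK_SIZE) -> Dict[str, str]:
    """HTTP Range header selecting one block; byte ranges are inclusive."""
    start = block_no * block_size
    end = start + block_size - 1
    return {"Range": f"bytes={start}-{end}"}


class TwoLevelCache:
    def __init__(self, fetch: Callable[[int], bytes]) -> None:
        self.fetch = fetch                    # stands in for the Web server
        self.pristine: Dict[int, bytes] = {}  # unmodified blocks from the server
        self.modified: Dict[int, bytes] = {}  # blocks written by the emulated system

    def read(self, block_no: int) -> bytes:
        """Serve from the modified cache, then the pristine cache, then the server."""
        if block_no in self.modified:
            return self.modified[block_no]
        if block_no not in self.pristine:
            self.pristine[block_no] = self.fetch(block_no)
        return self.pristine[block_no]

    def write(self, block_no: int, data: bytes) -> None:
        """Capture writes locally; the remote image is never changed."""
        self.modified[block_no] = data

    def prefetch(self, block_nos: Iterable[int]) -> None:
        """Warm the pristine cache with blocks that earlier runs needed."""
        for n in block_nos:
            if n not in self.pristine:
                self.pristine[n] = self.fetch(n)
```

Keeping writes in a separate cache means the pristine blocks stay valid across sessions, which is what makes pre-fetching from the access pattern of earlier runs worthwhile.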

VisiCalc

VisiCalc on Apple ][
You can visit https://archive.org/details/VisiCalc_1979_SoftwareArts and run Dan Bricklin and Bob Frankston's VisiCalc from 1979 on an emulated Apple ][. It was the world's first spreadsheet. Some of the key-bindings are strange to users conditioned by decades of Excel, but once you've found the original VisiCalc reference card, it is perfectly usable.

What Is Going On?

The Apple ][ emulator isn't running in the cloud, as bwFLA's does, nor is it running as a process on my machine, as Olive's does. Instead, it is running inside my browser. The emulators have been compiled into JavaScript, using emscripten. When I clicked on the link to the emulation, metadata describing the emulation including the emulator to use was downloaded into my browser, which then downloaded the JavaScript for the emulator and the system image for the Apple ][ with VisiCalc installed.

Emularity

This is the framework underlying the Internet Archive's software library, which currently holds nearly 36,000 items, including more than 7,300 for MS-DOS, 3,600 for Apple, 2,900 console games and 600 arcade games. Some can be downloaded, but most can only be streamed.

The oldest is an emulation of a PDP-1 with a DEC 30 display running the Space War game from 1962, more than half a century ago. As I can testify, having played this and similar games on Cambridge University’s PDP-7 with a DEC 340 display seven years later, this emulation works well.

The quality of the others is mixed. Resources for QA and fixing problems are limited; with a collection this size problems are to be expected. Jason Scott crowd-sources most of the QA; his method is to see if the software boots up and if so, put it up and wait to see whether visitors who remember it post comments identifying problems, or whether the copyright owner objects. The most common problem is the sound.

It might be thought that the performance of running the emulator locally by adding another layer of emulation (the JavaScript virtual machine) would be inadequate, but this is not the case for two reasons. First, the user’s computer is vastly more powerful than an Apple ][ and, second, the performance of the JavaScript engine in a browser is critical to its success, so large resources are expended on optimizing it.

The movement supported by major browser vendors to replace the JavaScript virtual machine with a byte-code virtual machine called WebAssembly has borne fruit. Last month four major browsers announced initial support, all running the same game, a port of Unity's Angry Bots. This should greatly reduce the pressure for multi-core and parallelism support in JavaScript, which was always likely to be a kludge. Improved performance for in-browser emulation is also likely to make in-browser emulation more competitive with techniques that need software installation and/or cloud infrastructure, reducing the barrier to entry.

The PDF Analogy

Let's make an analogy between emulation and something that everyone would agree is a Web format, PDF. Browsers lack built-in support for rendering PDF. They used to depend on external PDF renderers, such as Adobe Reader, via a MimeType binding. Now, they download pdf.js and render the PDF internally even though it's a format for which they have no built-in support. The Webby, HTML5 way to provide access to formats that don't qualify for built-in support is to download a JavaScript renderer. We don't preserve PDFs by wrapping them in a specific PDF renderer, we preserve them as PDF plus a MimeType. At access time the browser chooses an appropriate renderer, which used to be Adobe Reader and is now pdf.js.

Landing Pages

ACM landing page
There's another interesting thing about PDFs on the web. In many cases the links to them don't actually get you to the PDF. The canonical, location-independent link to the LOCKSS paper in ACM ToCS is http://dx.doi.org/10.1145/1047915.1047917, which currently redirects to http://dl.acm.org/citation.cfm?doid=1047915.1047917 which is a so-called "landing page", not the paper but a page about the paper, on which if you look carefully you can find a link to the PDF.

The fact that it is very difficult for a crawler to find this link makes it hard for archives to collect and preserve scholarly papers. Herbert Van de Sompel and Michael Nelson's Signposting proposal addresses this problem, as to some extent do W3C activities called Packaging on the Web and Portable Web Publications for the Open Web Platform.

Like PDFs, preserved system images, the disk image for a system to be emulated and the metadata describing the hardware it was intended for, are formats that don't qualify for built-in support. The Webby way to provide access to them is to download a JavaScript emulator, as Emularity does. So is the problem of preserving system images solved?

Problem Solved? NO!

No, it isn't. We have a problem that is analogous to, but much worse than, the landing page problem. The analogy would be that, instead of a link on the landing page leading to the PDF, embedded in the page was a link to a rendering service. The metadata indicating that the actual resource was a PDF, and the URI giving its location, would be completely invisible to the user's browser or a Web crawler. At best all that could be collected and preserved would be a screenshot.

All three frameworks I have shown have this problem. The underlying emulation service, the analogy of the PDF rendering service, can access the system image and the necessary metadata, but nothing else can. Humans can read a screenshot of a PDF document, but a screenshot of an emulation is useless. Wrapping a system image in an emulation like this makes it accessible in the present, not preservable for the future.

If we are using emulation as a preservation strategy, shouldn't we be doing it in a way that is itself able to be preserved?

A MimeType for Emulations?

What we need is a MimeType definition that allows browsers to follow a link to a preserved system image and construct an appropriate emulation for it in whatever way suits them. This would allow Web archives to collect preserved system images and later provide access to them.

The linked-to object that the browser obtains needs to describe the hardware that should be emulated. Part of that description must be the contents of the disks attached to the system. So we need two MimeTypes:
  • A metadata MimeType, say Emulation/MachineSpec, that describes the architecture and configuration of the hardware, which links to one or more resources of:
  • A disk image MimeType, say DiskImage/qcow2, with the contents of each of the disks.
Emulation/MachineSpec is pretty much what the hardware part of bwFLA's internal metadata format does, though from a preservation point of view there are some details that aren't ideal. For example, using the Handle system is like using a URL shortener or a DOI: it works well until the service dies. When it does, as for example last year when doi.org's domain registration expired, all the identifiers become useless.

I suggest DiskImage/qcow2 because QEMU's qcow2 format is a de facto standard for representing the bits of a preserved system's disk image.
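
To make the proposal concrete, here is a sketch of what a crawler could do if such a description existed. Nothing here is a defined format: the JSON layout, the field names (`architecture`, `memory_mb`, `disks`, `href`), the example values and the `disk_image_urls` function are all hypothetical, invented only to show that explicit, typed links would make the disk images discoverable.

```python
import json
from typing import List
from urllib.parse import urljoin

MACHINE_SPEC_TYPE = "Emulation/MachineSpec"  # proposed above, not a registered type
DISK_IMAGE_TYPE = "DiskImage/qcow2"          # proposed above, not a registered type

# A hypothetical machine description; every field name is an assumption.
EXAMPLE_SPEC = """
{
  "architecture": "i386",
  "memory_mb": 16,
  "disks": [
    {"type": "DiskImage/qcow2", "href": "images/system.qcow2"},
    {"type": "DiskImage/qcow2", "href": "images/data.qcow2"}
  ]
}
"""


def disk_image_urls(spec_url: str, spec_text: str) -> List[str]:
    """Return absolute URLs of the disk images a machine description links to."""
    spec = json.loads(spec_text)
    return [urljoin(spec_url, disk["href"])
            for disk in spec.get("disks", [])
            if disk.get("type") == DISK_IMAGE_TYPE]
```

A crawler that fetched a resource served as `Emulation/MachineSpec` could call `disk_image_urls` on it and queue the results for collection, much as it follows links in HTML today.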

And binding to "emul.js"

Then, just as with pdf.js, the browser needs a binding to a suitable "emul.js" which knows, in this browser's environment, how to instantiate a suitable emulator for the specified machine configuration and link it to the disk images. This would solve both problems:
  • The emulated system image would not be wrapped in a specific emulator; the browser would be free to choose appropriate, up-to-date emulation technology.
  • The emulated system image and the necessary metadata would be discoverable and preservable because there would be explicit links to them.
The details need work but the basic point remains. Unless there are MimeTypes for disk images and system descriptions, emulations cannot be first-class Web objects that can be collected, preserved and later disseminated.
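
The binding idea can be modeled as a small registry that maps a MimeType to whatever handler the client currently prefers. This is a toy model in Python rather than browser code; the `bind` decorator, the `open_resource` function and the handler bodies are hypothetical, and "emul.js" is the placeholder name used above, not an existing library.

```python
from typing import Callable, Dict

# MimeType (lower-cased) -> handler chosen by this client.
handlers: Dict[str, Callable[[str], str]] = {}


def bind(mime_type: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Register a handler for a MimeType; the client is free to rebind it later."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        handlers[mime_type.lower()] = fn
        return fn
    return register


@bind("application/pdf")
def render_pdf(url: str) -> str:
    return f"pdf.js renders {url}"


@bind("Emulation/MachineSpec")
def run_emulation(url: str) -> str:
    return f"emul.js emulates {url}"


def open_resource(url: str, content_type: str) -> str:
    """Dispatch on the Content-Type value, ignoring parameters such as charset."""
    mime = content_type.split(";")[0].strip().lower()
    handler = handlers.get(mime)
    if handler is None:
        raise LookupError(f"no binding for {mime}")
    return handler(url)
```

The point of the indirection is that the preserved object carries only its type and its links; which emulator runs it is decided at access time, and can change as emulation technology improves.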

3 comments:

David. said...

One of the problems my report on emulation highlighted was the mismatch between the hardware much of the software being emulated was designed for (the PC or even the tablet) and the hardware it will be emulated on in the future (the phone). The reason is that the PC and tablet market has been in free-fall for at least the last 6 quarters, according to Canalys. The PC market is back to levels of 5 years ago.

David. said...

And IDC just increased its projections for PC market shrinkage:

"The firm now says PC shipments “... are forecast to decline by 7.3% year over year”. That's “roughly two per cent below earlier projections as conditions have been weaker than expected.” The firm names “weak currencies, depressed commodity prices, political uncertainty, and delayed projects” as the weaker conditions impacting sales."

Part of the reason is also:

"Windows 10 isn't helping matters either, because lots of people are availing themselves of free Windows 10 upgrades rather than buying a new PC. The firm also says that “while a large share of enterprises are evaluating Windows 10, the pace of new PC purchases has not yet stabilized commercial PC shipments.”"

David. said...

My emulation report noted that there are issues with QEMU's support, including regressions in its emulation of older hardware. There have also been a number of security vulnerabilities, highlighted by the decision to remove QEMU from Google Cloud.