Wednesday, July 6, 2016

Talk at JISC/CNI Workshop

I was invited to give a talk at a workshop convened by JISC and CNI in Oxford. Below the fold, an edited text with links to the sources.

Thanks to funding from the Mellon Foundation I spent last summer on behalf of the Mellon and Sloan Foundations, and IMLS researching and writing a report entitled Emulation & Virtualization as Preservation Strategies. Time allows only a taste of what is in the report and subsequent developments, which you can find on my blog linked from the text of this talk.

Migration and emulation were the two techniques identified in Jeff Rothenberg's seminal 1995 Ensuring the Longevity of Digital Documents. He came down strongly in favor of emulation. Despite this, migration has been overwhelmingly favored. The emulators were never a problem, they have been part of the mainstream since the early days of IBM computers. But emulation was thought to be restricted to hackers such as Nick Lee, who put MacOS on his Apple Watch, and Hacking Jules, who put Nintendo64 on his Android Wear. What has changed is that emulation frameworks have been developed that present emulations as a normal part of the Web. You don't even need to know you're seeing an emulation.

Theresa Duncan CD-ROMs

From 1995 to 1997 Theresa Duncan produced three seminal feminist CD-ROM games, Chop Suey, Smarty and Zero Zero. Rhizome, a project hosted by the New Museum in New York, has put emulations of them on the Web. You can visit http://archive.rhizome.org/theresa-duncan-cdroms/, click any of the "Play" buttons and have an experience very close to that of playing the CD on MacOS 7.5 . This has proved popular. For several days after their initial release they were being invoked on average every 3 minutes.

These demos were pre-recorded using Kazam and a Chromium browser on my Acer C720 Chromebook running Ubuntu 14.04.

What Happened?

What happened when I clicked Smarty's Play button?
  • The browser connects to a session manager in Amazon's cloud, which notices that this is a new session.
  • Normally it would authenticate the user, but because this CD-ROM emulation is open access it doesn't need to.
  • It assigns one of its pool of running Amazon instances to run the session's emulator.
  • Each instance can run a limited number of emulators. If no instance is available when the request comes in it can take up to 90 seconds to start another.
  • It starts the emulation on the assigned instance, supplying metadata telling the emulator what to run.
  • The emulator starts.
  • After a short delay the user sees the Mac boot sequence, and then the CD-ROM starts running.
  • At intervals, the emulator sends the session manager a keep-alive signal. Emulators that haven't sent one in 30 seconds are presumed dead, and their resources are reclaimed to avoid paying the cloud provider for unused resources.

bwFLA

Rhizome, and others such as Yale, the DNB and ZKM Karlsruhe use technology from the bwFLA team at the University of Freiburg to provide Emulation As A Service (EAAS). Their GPLv3 licensed framework runs in "the cloud" to provide comprehensive management and access facilities wrapped around a number of emulators. It can also run as a bootable USB image or via Docker. bwFLA encapsulates each emulator so that the framework sees three standard interfaces:
  • Data I/O, connecting the emulator to data sources such as disk images, user files, an emulated network containing other emulators, and the Internet.
  • Interactive Access, connecting the emulator to the user using standard HTML5 facilities.
  • Control, providing a Web Services interface that bwFLA's resource management can use to control the emulator.
The communication between the emulator and the user takes place via standard HTTP on port 80; there is no need for a user to install software, or browser plugins, and no need to use ports other than 80. Both of these are important for systems targeted at use by the general public.

VisiCalc

In 1979 Dan Bricklin and Bob Frankston launched VisiCalc for the Apple ][, the world's first spreadsheet. You can run it on an emulated Apple ][ by visiting https://archive.org/details/VisiCalc_1979_SoftwareArts and clicking the power button. Some of the key-bindings are strange to users conditioned by decades of Excel, but once you've found the original VisiCalc reference card, it is perfectly usable.

What Happened?

The Apple ][ emulator isn't running in the cloud, as bwFLA's does. Instead, it is running inside my browser. The emulators have been compiled into JavaScript, using emscripten. When I clicked on the link to the emulation, metadata describing the emulation including the emulator to use was downloaded into my browser, which then downloaded the JavaScript for the emulator and the system image for the Apple ][ with VisiCalc installed.

Emularity

This is Emularity, the framework underlying the Internet Archive's software library, which currently holds nearly 36,000 items, including more than 7,300 for MS-DOS, 3,600 for Apple, 2,900 console games and 600 arcade games. Some can be downloaded, but most can only be streamed.

The oldest is an emulation of a PDP-1 with a DEC 30 display running the Space War game from 1962, more than half a century ago. As I can testify having played this and similar games on Cambridge University's PDP-7 with a DEC 340 display seven years later, this emulation works well.

The quality of the others is mixed. Resources for QA and fixing problems are limited; with a collection this size problems are to be expected. Jason Scott crowd-sources most of the QA; his method is to see if the software boots up and if so, put it up and wait to see whether visitors who remember it post comments identifying problems, or whether the copyright owner objects. The most common problem is the sound.

It might be thought that the performance of running the emulator locally by adding another layer of emulation (the JavaScript virtual machine) would be inadequate, but this is not the case for two reasons. First, the user's computer is vastly more powerful than an Apple ][ and, second, the performance of the JavaScript engine in a browser is critical to its success, so large resources are expended on optimizing it.

The Internet is for Cats

The Internet is for cats. Well, no, the Internet is for porn. But after porn, it is for cats. Among the earliest cats to colonize the Internet were Nijinski and Pavlova, who were in charge of Mark Weiser and Vicky Reich. On 11 Jan 1995 Mark put up their Web page, and here it is from the Wayback Machine. The text and images are all there and the links work. Pretty good preservation.

The Internet was for Cats

But when Mark put it up, it looked different.

Here is the same page from the Wayback Machine viewed with NCSA Mosaic 2.7, a nearly contemporary browser on a nearly contemporary Linux system, courtesy of Ilya Kreymer's oldweb.today. The background and the fonts are quite different. In some cases this can be important, so this is even better preservation.

oldweb.today

Here is the BBC News front page from 1999 in Internet Explorer 4.01 on Windows. oldweb.today uses Docker to assemble an appropriate OS and browser combination, emulate them, and uses Memento (RFC7089) to aggregate the contents of now about 15 Web archives, for each resource in the page choosing to retrieve it from the archive which has the version collected closest to the requested time.

Use cases

I've shown you three different emulation frameworks with three different use cases, implemented in three different ways:
  • Old CD-ROMs, emulated via a Web service framework.
  • Historically important software, emulated in your browser using JavaScript.
  • Preserved Web content, emulated using Docker container technology.
I'm sure many of you will have other use cases. For us, the use case we're looking at arrived late last year, when the feed the CLOCKSS archive ingests from one of the major publishers suddenly got much bigger. We looked into it, and discovered that among the supplementary material attached to some papers were now virtual machines. If that content is ever triggered, we need to be able to run those VMs under emulation.

If you care about reproducibility of in silico science, it isn't enough to archive the data, or even the data plus the source code of the analysis software. The results depend on the entire stack of software, all the libraries and the operating system.

The Olive project at CMU has the data and source code for CHASTE 3.1, a simulation package for biology and physiology from 2013. But the system can only run on a specific version, 12.04, of the Ubuntu version of Linux. Even recent scientific software has complex dependencies that require archiving the binaries and emulating them.

How do you use emulation?

How do you go about creating a Web page containing an emulation like the ones I've shown you? At a high level, the stages are:
  • Create a bootable disk image in the format your emulation framework needs, which is typically QEMU's "qcow2". It should contain the binaries you want to run installed in the operating system they need.
  • Configure suitable hardware to boot the image by specifiying the CPU type, the amount of memory, the periherals and their contents, such as CD-ROM .iso images. Express this configuration in the metadata format used by your emulation framework.
  • Add the disk image and the configuration metadata to a Web server.
  • Embed the necessary links to connect them into a "landing page" for the emulation such as the ones I've shown you.

Technical Issues

This sequence sounds pretty simple, and it soon should be. But right now there are some technical issues:
  • You need tools to create disk images, and they aren't currently that easy to use.
  • You need tools to create the configuration metadata. The bwFLA team and the DNB have had considerable success automating the process for CD-ROMs, but for other use cases the tools need a lot of work.
  • The way each framework embeds its emulations in a Web page is different and incompatible. The links are to specific emulation instances. Over time emulation technology will improve, and these links will break, rendering the landing pages useless. We need a standard way to embed emulations that leaves the browser to figure out how best to do the emulation, an emulation mime-type and an "emul.js" by analogy with "pdf.js".
Following on from my report, efforts are under way to address these and some other technical issues such as the very tricky issues that crop up when you connect an emulation of old software to the Internet.

Legal Issues

The big barrier to widespread adoption of emulation for preservation is legal. Open source software is not a problem, but proprietary software is protected in two overlapping ways, by copyright and by the End User License Agreement. In theory copyright eventually expires, but the EULA never does. Copyright controls whether you can make and keep copies, such as those in disk images. EULAs vary, but probably control not merely copying but also running the software. And, since the software stack consists of multiple layers each with its own copyright and EULA, you are restricted to the intersection of them all.

There are a few rays of hope. Microsoft academic site licenses these days allow old Microsoft software to be copied for preservation and to be run for internal use. UNESCO's PERSIST is trying to engage major software vendors in a discussion of these legalities. The Internet Archive's massive software collection operates similarly to the DMCA's "safe harbor" provision, in that if the copyright owner objects the emulation is taken down. Objections have been rare, but this is really old software and mostly games. In theory, companies do not lose money because someone preserves and lets people run really old software. In practice, there are two reasons why their lawyers are reluctant to agree to this, the "not one inch" copyright maximalist ethos, and the risk for individual lawyers of making a career-limiting move.

Conclusion

Especially as container technology takes over the mainstream of IT, it is likely that over the next few years it will become evident that migration-based preservation strategies are obsolete.

1 comment:

David. said...

Cameron Neylon has a post based on his talk at this workshop. The talk was really good, I need to blog about it in detail.