Tuesday, January 31, 2017

Preservable emulations

This post is an edited extract from my talk at last year's IIPC meeting. This part was the message I was trying to get across, but I buried the lede at the tail end. So I'm repeating it here to try and make the message clear.

Emulation technology will evolve through time. The way we expose emulations on the Web right now means that this evolution will break them. We're supposed to be preserving stuff, but the way we're doing it isn't preservable. We need to expose emulations to the Web in a future-proof way, a way whereby they can be collected, preserved and reanimated using future emulation technologies. Below the fold, I explain what is needed using the analogy of PDFs.

The PDF Analogy

Let's make an analogy between emulation and something that everyone would agree is a Web format, PDF. Browsers lack built-in support for rendering PDF. They used to depend on external PDF renderers, such as Adobe Reader, via a MimeType binding. Now they download pdf.js and render the PDF internally, even though it's a format for which they have no built-in support. The Webby, HTML5 way to provide access to formats that don't qualify for built-in support is to download a JavaScript renderer. We don't preserve PDFs by wrapping them in a specific PDF renderer; we preserve them as PDF plus a MimeType. At access time the browser chooses an appropriate renderer, which used to be Adobe Reader and is now pdf.js.
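
The binding described above amounts to a lookup from MimeType to renderer, made at access time. Here is a minimal sketch in Python. The registry and the two renderer functions are hypothetical stand-ins, not any browser's real internals. The only point is that the preserved bytes stay the same while the binding changes.

```python
from typing import Callable, Dict

# Hypothetical registry: media type -> renderer. Illustrative only.
RENDERERS: Dict[str, Callable[[bytes], str]] = {}


def bind(media_type: str, renderer: Callable[[bytes], str]) -> None:
    """Bind a media type to the renderer to use at access time."""
    RENDERERS[media_type.lower()] = renderer


def render(media_type: str, body: bytes) -> str:
    """Pick the renderer at access time from the media type alone."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    key = media_type.split(";", 1)[0].strip().lower()
    if key not in RENDERERS:
        raise LookupError(f"no renderer bound for {key}")
    return RENDERERS[key](body)


def adobe_reader(body: bytes) -> str:
    # Stand-in for handing the bytes to an external application.
    return f"external reader rendered {len(body)} bytes"


def pdf_js(body: bytes) -> str:
    # Stand-in for a downloaded JavaScript renderer.
    return f"pdf.js rendered {len(body)} bytes"


# The preserved object never changes; only the binding does.
preserved_pdf = b"%PDF-1.4 ..."
bind("application/pdf", adobe_reader)
then = render("application/pdf", preserved_pdf)
bind("application/pdf", pdf_js)
now = render("application/pdf", preserved_pdf)
```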

Landing Pages

[Image: ACM landing page]
There's another interesting thing about PDFs on the web. In many cases the links to them don't actually get you to the PDF. The canonical, location-independent link to the LOCKSS paper in ACM ToCS is http://dx.doi.org/10.1145/1047915.1047917, which currently redirects to http://dl.acm.org/citation.cfm?doid=1047915.1047917 which is a so-called "landing page", not the paper but a page about the paper, on which if you look carefully you can find a link to the PDF.

Like PDFs, preserved system images (the disk image for a system to be emulated, plus the metadata describing the hardware it was intended for) are formats that don't qualify for built-in support. The Webby way to provide access to them is to download a JavaScript emulator, as Emularity does. So is the problem of preserving system images solved?

Problem Solved? NO!

No, it isn't. We have a problem that is analogous to, but much worse than, the landing page problem. The analogy would be a landing page that, instead of linking to the PDF, embedded a link to a rendering service. The metadata indicating that the actual resource was a PDF, and the URI giving its location, would be completely invisible to the user's browser or a Web crawler. At best all that could be collected and preserved would be a screenshot.

All three frameworks, bwFLA, Olive and Emularity, have this problem. The underlying emulation service, the analogy of the PDF rendering service, can access the system image and the necessary metadata, but nothing else can. Humans can read a screenshot of a PDF document, but a screenshot of an emulation is useless. Wrapping a system image in an emulation like this makes it accessible in the present, not preservable for the future.
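
To see what a crawler can and cannot discover, here is a small Python sketch that uses the standard library's HTML parser. Both pages and all their URLs are made up for illustration. The first page carries an explicit, typed link to the resource. The second hands everything to a rendering service, so the resource's location and type never appear in the markup.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class LinkCollector(HTMLParser):
    """Collect (href, type) pairs from <a> and <link> tags, as a crawler might."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            attr = dict(attrs)
            href = attr.get("href")
            if href:
                self.links.append((href, attr.get("type") or ""))


def discover(html: str) -> List[Tuple[str, str]]:
    """Return the typed links a simple crawler would find in the page."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical landing page with an explicit, typed link to the resource.
LANDING_PAGE = """
<html><body>
  <h1>About the paper</h1>
  <a href="https://example.org/paper.pdf" type="application/pdf">Full text</a>
</body></html>
"""

# Hypothetical page wrapped around a rendering service: no link, no type.
WRAPPED_PAGE = """
<html><body>
  <script src="https://example.org/render-service.js"></script>
  <canvas id="screen"></canvas>
</body></html>
"""

explicit = discover(LANDING_PAGE)
wrapped = discover(WRAPPED_PAGE)
```

A real crawler does much more than this, but the asymmetry is the same: the first page gives it something to collect and the second gives it nothing.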

If we are using emulation as a preservation strategy, shouldn't we be doing it in a way that is itself able to be preserved?

A MimeType for Emulations?

What we need is a MimeType definition that allows browsers to follow a link to a preserved system image and construct an appropriate emulation for it in whatever way suits them. This would allow Web archives to collect preserved system images and later provide access to them.

The linked-to object that the browser obtains needs to describe the hardware that should be emulated. Part of that description must be the contents of the disks attached to the system. So we need two MimeTypes:
  • A metadata MimeType, say Emulation/MachineSpec, that describes the architecture and configuration of the hardware, which links to one or more resources of:
  • A disk image MimeType, say DiskImage/qcow2, with the contents of each of the disks.
Emulation/MachineSpec is pretty much what the hardware part of bwFLA's internal metadata format does, though from a preservation point of view there are some details that are workable but not ideal. For example, using the Handle system is like using a URL shortener or a DOI: it works well until the service dies. When it does, as for example last year when doi.org's domain registration expired, all the identifiers become useless.
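
To make the two-MimeType idea concrete, here is a sketch of what a machine description and its links might look like. The MimeType names are the ones suggested above. The JSON layout and every field name are assumptions invented for illustration, not bwFLA's format or any existing standard.

```python
import json
from dataclasses import dataclass
from typing import List

# Media type names as proposed in the text; everything else is hypothetical.
MACHINE_SPEC_TYPE = "Emulation/MachineSpec"
DISK_IMAGE_TYPE = "DiskImage/qcow2"

EXAMPLE_SPEC = """
{
  "architecture": "x86",
  "memory_mb": 64,
  "disks": [
    {"href": "https://archive.example.org/images/system.qcow2",
     "type": "DiskImage/qcow2"}
  ]
}
"""


@dataclass
class Disk:
    href: str
    media_type: str


@dataclass
class MachineSpec:
    architecture: str
    memory_mb: int
    disks: List[Disk]


def parse_machine_spec(text: str) -> MachineSpec:
    """Parse the hypothetical JSON form of a machine description."""
    raw = json.loads(text)
    disks = [Disk(d["href"], d["type"]) for d in raw["disks"]]
    if not disks:
        raise ValueError("a machine spec must link to at least one disk image")
    return MachineSpec(raw["architecture"], int(raw["memory_mb"]), disks)


def links_to_collect(spec: MachineSpec) -> List[str]:
    """URLs a crawler would need to fetch to preserve the whole system."""
    return [d.href for d in spec.disks if d.media_type == DISK_IMAGE_TYPE]


spec = parse_machine_spec(EXAMPLE_SPEC)
to_fetch = links_to_collect(spec)
```

Because the disk images are reached through ordinary links, a Web archive could collect the description and the disks with the same machinery it uses for any other page and its resources.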

I suggest DiskImage/qcow2 because QEMU's qcow2 format is a de facto standard for representing the bits of a preserved system's disk image.
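
One practical consequence of choosing a concrete format is that an archive can check what it has collected. A qcow2 file opens with the four magic bytes "QFI" followed by 0xfb, then a 4-byte big-endian version number. The sketch below checks only those first 8 bytes. The function name is my own, and the header it is run on is synthetic, not a real disk image.

```python
import struct

# Magic bytes shared by the qcow family; qcow2 images carry version 2 or 3.
QCOW_MAGIC = b"QFI\xfb"


def qcow2_version(header: bytes) -> int:
    """Return the qcow2 version from the first 8 bytes of an image.

    Raises ValueError if the bytes do not look like a qcow2 header.
    """
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of header")
    magic, version = struct.unpack(">4sI", header[:8])
    if magic != QCOW_MAGIC:
        raise ValueError("not a qcow2 image: bad magic")
    if version not in (2, 3):
        raise ValueError(f"unsupported qcow2 version {version}")
    return version


# A synthetic header for illustration, not a real disk image.
fake_header = QCOW_MAGIC + struct.pack(">I", 3) + b"\x00" * 64
version = qcow2_version(fake_header)
```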

And binding to "emul.js"

Then, just as with pdf.js, the browser needs a binding to a suitable "emul.js" which knows, in this browser's environment, how to instantiate a suitable emulator for the specified machine configuration and link it to the disk images. This would solve both problems:
  • The emulated system image would not be wrapped in a specific emulator; the browser would be free to choose appropriate, up-to-date emulation technology.
  • The emulated system image and the necessary metadata would be discoverable and preservable because there would be explicit links to them.
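
A rough sketch of what such a binding has to do, again in Python for brevity. The table of available emulators, the emulator names and the URL are all placeholders I have made up. A real "emul.js" would have far more to decide than which emulator matches the architecture.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Emulation:
    emulator: str
    architecture: str
    disks: List[str] = field(default_factory=list)


# Hypothetical table of emulators available in this browser's environment.
# The emulator names are placeholders, not real products.
AVAILABLE: Dict[str, str] = {
    "x86": "emulator-a",
    "m68k": "emulator-b",
}


def instantiate(architecture: str, disk_urls: List[str]) -> Emulation:
    """Choose an emulator for the architecture and attach the disks."""
    try:
        emulator = AVAILABLE[architecture]
    except KeyError:
        raise LookupError(f"no emulator available for {architecture}") from None
    return Emulation(emulator, architecture, list(disk_urls))


session = instantiate("x86", ["https://archive.example.org/images/system.qcow2"])

# A later environment swaps in newer technology; the preserved description
# and disk images are untouched.
AVAILABLE["x86"] = "emulator-c"
later = instantiate("x86", session.disks)
```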
The details need work but the basic point remains. Unless there are MimeTypes for disk images and system descriptions, emulations cannot be first-class Web objects that can be collected, preserved and later disseminated.
