Tuesday, November 26, 2013

In-browser emulation

Jeff Rothenberg's ground-breaking 1995 article Ensuring the Longevity of Digital Documents described and compared two techniques to combat format obsolescence: format migration and emulation. It concluded that emulation was the preferred approach. As time went by and successive digital preservation systems went into production, it became clear that almost all of them rejected Jeff's conclusion, planning to use format migration as their preferred response to format obsolescence. Follow me below the fold for a discussion of why this happened and whether it still makes sense.

Most practitioners obsessively collected and attempted to verify format metadata; they implemented elaborate metadata schemas to record the successive migrations that content would undergo; they set up registries of format converters to perform the migrations when they became necessary; they established format watches to warn of looming format obsolescence; and, when it failed to appear, they studied how to assess the potential future vulnerability of formats to obsolescence. Faced with this rejection, Jeff's views evolved. In a talk last year, he endorsed a balanced approach of preparing for both migration and emulation.

Although I believe that the Web-driven transition from formats private to an application to formats as a publication medium invalidated Jeff's view of the likely incidence of format obsolescence, he was correct to prefer emulation as a response should it ever occur. So why didn't the practitioners agree? I believe there were a number of reasons. It was thought that:
  • The creation of suitable emulators was a task for digital preservation alone.
  • The skills required would be arcane.
  • The effort involved in each one would be substantial.
  • A large number of emulators would be required.
  • Setting up the emulation environments required by obsolete formats would be difficult for readers.
Just as, in 1995, Jeff was writing as a phase change was transforming the prospect of format obsolescence, he was also writing as a phase change was transforming the environment for emulation. Looking back, he saw that:
  • Except for the mainframe world, emulation was not a mainstream computing technology. (Apple was then using the technology as they moved from 68000 to PowerPC).
  • Thus there were very few emulation implementors.
  • Thus the creation of each emulator would consume a large chunk of digital preservation resources.
  • The diversity of competing computer architectures would require a large number of emulators.
  • The lack of a uniform, widely deployed graphical user interface would make deploying emulation to readers complex and fragile.
Since 1995, the picture has reversed completely:
  • Emulation has become an essential part of the computing mainstream:
    • VMware and others have made virtualization of even low-end physical hardware ubiquitous.
    • Languages such as Java have made emulation of abstract virtual machines ubiquitous.
    • The efforts of enthusiasts for preserving early computer games have made emulations of them ubiquitous.
  • The techniques for implementing emulators are well-understood and widely known.
  • The dominance of the x86 and ARM architectures means that the number of emulators needed is essentially fixed at the number that were needed in 1995.
  • The advent of the Web has provided a uniform, widely-deployed graphical user interface with the potential to deliver emulations in a transparent, easy-to-use way.
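As a rough illustration of why these techniques count as well understood: at the core of most emulators is a fetch-decode-execute loop over the state of the guest machine. The sketch below implements that loop for an invented four-instruction accumulator machine. The opcodes, the instruction encoding, and the `run` function are all assumptions made up for this example; they do not correspond to MESS or to any other real emulator.

```python
# Toy fetch-decode-execute loop for a HYPOTHETICAL 8-bit accumulator machine.
# The opcodes and the (opcode, operand) encoding are invented for illustration.
LOAD, ADD, JNZ, HALT = 0, 1, 2, 3


def run(program, max_steps=10_000):
    """Interpret `program`, a list of (opcode, operand) pairs.

    Returns (accumulator, steps_executed) when HALT is reached.
    """
    acc, pc, steps = 0, 0, 0
    while steps < max_steps:
        opcode, operand = program[pc]  # fetch
        pc += 1
        steps += 1
        if opcode == LOAD:             # decode and execute
            acc = operand & 0xFF
        elif opcode == ADD:
            acc = (acc + operand) & 0xFF  # 8-bit wrap-around
        elif opcode == JNZ:
            if acc != 0:
                pc = operand
        elif opcode == HALT:
            return acc, steps
        else:
            raise ValueError(f"unknown opcode {opcode}")
    raise RuntimeError("step limit exceeded")


# Count down from 3 to 0: adding 255 is the same as subtracting 1 modulo 256.
COUNTDOWN = [(LOAD, 3), (ADD, 255), (JNZ, 1), (HALT, 0)]
```

Calling `run(COUNTDOWN)` steps the guest program until it halts. A real emulator adds memory, devices and timing around the same loop, which is where most of the effort goes.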
We now have two examples of transparently delivering emulations of obsolete software environments to Web browsers. First, at IDCC13 and iPRES2013, the team from Freiburg University described their implementation of emulation as a cloud service, in which they instantiate obsolete hardware and software environments as virtual machines in a cloud computing service, using VNC or RDP to deliver the user interface to the reader's browser.

Second, at the Internet Archive, Jason Scott and others instantiate the obsolete hardware and software environment as a virtual machine inside the reader's browser. Jason writes:
Today, the Internet Archive announces the Historical Software Archive, a collection of prominent and historically notable pieces of software, able to be run immediately in your browser.  They range from pioneering applications to obscure forgotten utilities, and from peak-of-perfection designs to industry-crashing classics.
They can do this because the advent of HTML5 has changed the language of the Web from HTML to JavaScript:
JSMESS is a Javascript port of the MESS emulator, a mature and breathtakingly flexible computer and console emulator that has been in development for over a decade and a half by hundreds of volunteers. The MESS emulator runs in a large variety of platforms, but is now able to run embedded in most modern browsers, including Firefox, Chrome, Safari and Internet Explorer.
It isn't just ancient computer games that can run in your browser. At Slashdot, warmflatsprite writes:
"It seems that there have been a rash of JavaScript virtual machines running Linux lately (or maybe I just travel in really weird circles). However until now none of them had network support, so they weren't too terribly useful. Sebastian Macke's jor1k project uses asm.js to produce a very fast emulation of the OpenCores OpenRISC processor (or1k) along with a HTML5 canvas framebuffer for graphics support. Recently Ben Burns contributed an emulated OpenCores ethmac ethernet adapter to the project. This sends ethernet frames to a gateway server via websocket where they are switched and/or piped into TAP virtual ethernet adapter. With this you can build whatever kind of network appliance you'd like for the myriad of fast, sandboxed VMs running in your users' browsers. For the live demo all VMs connect to a single private LAN (subnet The websocket gateway also NATs traffic from that LAN out to the open Internet."
In all these cases the reader clicks on a link and the infrastructure delivers an emulated environment that looks just like any other web page. The idea that emulation is too complex and difficult for readers is exploded.

Given the ability to run even modern operating systems with network and graphics in the reader's browser or in cloud virtual machines, it is hard to argue that format migration is essential, or even important, to delivering the reader's original experience. Instead, what is essential is a time-sequence library of binary software environments ready to be instantiated as readers request contemporary content.
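One way to picture such a time-sequence library is as a catalogue of environment images ordered by release date, from which the newest environment that already existed when the content was created is selected. The following is a minimal sketch under that assumption. The catalogue entries, the image names, and the `environment_for` function are hypothetical and do not describe any existing service.

```python
from bisect import bisect_right
from datetime import date

# HYPOTHETICAL catalogue of (release date, image name), sorted by date.
ENVIRONMENTS = [
    (date(1990, 1, 1), "env-1990.img"),
    (date(1995, 1, 1), "env-1995.img"),
    (date(2001, 1, 1), "env-2001.img"),
]


def environment_for(content_date, environments=ENVIRONMENTS):
    """Return the newest image released on or before `content_date`."""
    release_dates = [released for released, _ in environments]
    index = bisect_right(release_dates, content_date)
    if index == 0:
        raise LookupError(f"no environment as old as {content_date}")
    return environments[index - 1][1]
```

For example, `environment_for(date(1996, 6, 1))` selects the 1995 image, because it is the most recent entry that predates the content. A production catalogue would also need to key on hardware platform and installed applications, not just on date.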

Notice the document-centric assumption above, namely that the goal of digital preservation is to reproduce the original reader's experience. In 1995 this was the obvious goal, but now it is not the only goal. As I pointed out in respect of the Library of Congress' Twitter collection, current scholars increasingly want not to read individual documents but to data-mine from the dataset formed by the collection as a whole. It is true that in many cases emulation as described above does not provide the kind of access needed for this. It might be argued that format migration is thus still needed. But there is a world of difference between extracting useful information from a format (for example, bibliographic metadata and text from PDF) and delivering a pixel-perfect re-rendering of it. Extraction is far easier and far less likely to be invalidated by minor format version changes.


johankbdd said...

I found the data-mining use case that you mention in the final section particularly interesting. However, the fact that emulation may not provide the kind of access that is needed for this doesn't necessarily imply that migration is the answer here. Having a software library and/or API (preferably an open source one) that is able to deal with the source data would also give you this kind of extraction functionality, without any need for any format migration whatsoever. This is something you have previously addressed yourself, e.g. in this 2009 blog post where you state how “formats with open source renderers are, for all practical purposes, immune from format obsolescence”.

I'm often puzzled by the number of people who, almost 20 years after Rothenberg's original paper, still seem to be working under the assumption that migration and emulation are the only (or even best) available techniques for ensuring long-term access. There are many more examples where neither would be ideal. Just one example: following some work I recently did on preservation risks of the PDF format, someone commented that emulation of the original environment (hardware, OS, viewer stack) would solve all of these problems. My own take on this: yes, from a purely technical point of view emulation will definitely allow a future user to view today's PDFs, but I highly doubt that the future user will be comfortable interacting with today's software and user interface design. I suspect both will be so dated and arcane by that time that they will be a serious barrier to actually using those files (by way of analogy, few people today would be comfortable using this 30-year-old version of WordStar).

Migration of PDF is notoriously difficult as well, so our best bet is that the format continues to be supported by future viewers. Fortunately PDF is ubiquitous and well supported by a number of open source libraries, so the odds of becoming dependent on either emulation or migration look pretty slim to me (there may be some edge cases, e.g. some of PDF's multimedia features aren't well supported outside Adobe's products).

By no means is this meant as an argument against emulation (or migration, or any other technique in our toolbox for that matter). Rather, I'm just trying to illustrate that the best preservation strategy is highly context-dependent, and that this context may also change over time (e.g. data-mining vs "original look and feel"). The nature of the content (e.g. file format) and this context will determine whether the best solution is based on migration, emulation, (open source) libraries, or any combination of these.

Jason Scott said...

I'm going to tweak this conclusion of yours, subtle genius, to say that while I agree that emulation's maturity has brought a brand new positive player to the experience of digital preservation, its current usefulness is primarily to demonstrate a feasible endgame.

The more widely accessible network emulators, be they in-browser solutions like my group's or the remote access solutions by others, simply take away one facet of concern or disagreement about the preservation of these materials. We are all proving very rapidly that the big concern is no longer that somebody will have a magnetic waveform duplicate of a vintage piece of storage medium and have nothing to do with it. They will definitely have a place to show it, demonstrate it, or interact with it.

But the race is on to track down and understand all of the programmatic format elements of all of these pieces of data that we are building such wonderful museum cases for. We have nightmares like Microsoft's constant shifting of format elements, which they still don't properly document, and obscure tape and diskette format approaches that are not self-evident to later authors.

Finally, we have a lot of work to do in the realm of machine-accurate, and not just documentation-accurate, emulation. Closing the loop on accessibility as we are doing will speed it up, but it is definitely work that still needs to be done.

Format migration is still useful! It just has a better endgame in the future.

David. said...

See also this.

David. said...

Now Google gets into the emulation act, with a Native Client emulation of the Amiga 500 and its games.

David. said...

The Internet Archive strikes back with a huge collection of vintage games playable in your browser at the Console Living Room.

David. said...

Not in-browser, but an astonishing achievement nevertheless. An 80s PC in 4043 bytes of obfuscated C.