Most practitioners obsessively collected and attempted to verify format metadata; they implemented elaborate metadata schemas to record the successive migrations that content would undergo; they set up registries of the format converters that would perform the migrations when they became necessary; they established format watches to warn of looming format obsolescence and, when obsolescence failed to appear, they studied how to assess the potential future vulnerability of formats to it. In the face of this rejection, Jeff's views evolved; in a talk last year he endorsed a balanced approach of preparing for both migration and emulation.
Although I believe that the Web-driven transition from formats private to an application to formats as a publication medium invalidated Jeff's view of the likely incidence of format obsolescence, he was correct to prefer emulation as a response should it ever occur. So why didn't the practitioners agree? I believe there were a number of reasons. It was thought that:
- The creation of suitable emulators was a task for digital preservation alone.
- The skills required would be arcane.
- The effort involved in each one would be substantial.
- A large number of emulators would be required.
- Setting up the emulation environments required by obsolete formats would be difficult for readers.
- Except for the mainframe world, emulation was not a mainstream computing technology. (Apple was then using the technology as it moved from the 68000 to the PowerPC.)
- Thus there were very few emulation implementors.
- Thus the creation of each emulator would consume a large chunk of digital preservation resources.
- The diversity of competing computer architectures would require a large number of emulators.
- The lack of a uniform, widely deployed graphical user interface would make deploying emulation to readers complex and fragile.
None of these concerns remains valid today:
- Emulation has become an essential part of the computing mainstream:
  - VMware and others have made virtualization of even low-end physical hardware ubiquitous.
  - Languages such as Java have made emulation of abstract virtual machines ubiquitous.
  - The efforts of enthusiasts for preserving early computer games have made emulations of them ubiquitous.
- The techniques for implementing emulators are well-understood and widely known (a toy example of the core technique is sketched after this list).
- The dominance of the x86 and ARM architectures means that the number of emulators needed is essentially fixed at the number that were needed in 1995.
- The advent of the Web has provided a uniform, widely-deployed graphical user interface with the potential to deliver emulations in a transparent, easy-to-use way.
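To see why the implementation techniques are no longer arcane, note that the heart of any emulator is just a fetch-decode-execute loop over the guest's instruction set. Here is a toy sketch in Python for a made-up three-instruction machine; the opcodes and register file are invented for illustration and correspond to no real architecture:

```python
# Toy emulator core: fetch-decode-execute over a made-up instruction set.
# Hypothetical opcodes: 0 = LOAD reg, imm; 1 = ADD dst, src; 2 = HALT.

def run(program):
    regs = [0] * 4                  # guest registers
    pc = 0                          # guest program counter
    while True:
        op = program[pc]            # fetch
        if op == 0:                 # LOAD reg, imm
            regs[program[pc + 1]] = program[pc + 2]
            pc += 3
        elif op == 1:               # ADD dst, src
            regs[program[pc + 1]] += regs[program[pc + 2]]
            pc += 3
        elif op == 2:               # HALT: return the final register state
            return regs
        else:
            raise ValueError("unknown opcode %d" % op)

# LOAD r0,2; LOAD r1,3; ADD r0,r1; HALT  ->  r0 == 5
print(run([0, 0, 2, 0, 1, 3, 1, 0, 1, 2]))
```

A production emulator adds memory and device models and usually dynamic translation for speed, but the core technique is no more exotic than this loop.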
At the Internet Archive, Jason Scott and others instantiate the obsolete hardware and software environment as a virtual machine inside the reader's browser. Jason writes:
"jor1k project uses asm.js to produce a very fast emulation of the OpenCores OpenRISC processor (or1k) along with a HTML5 canvas framebuffer for graphics support. Recently Ben Burns contributed an emulated OpenCores ethmac ethernet adapter to the project. This sends ethernet frames to a gateway server via websocket where they are switched and/or piped into TAP virtual ethernet adapter. With this you can build whatever kind of network appliance you'd like for the myriad of fast, sandboxed VMs running in your users' browsers. For the live demo all VMs connect to a single private LAN (subnet 10.5.0.0/16). The websocket gateway also NATs traffic from that LAN out to the open Internet."

In all these cases the reader clicks on a link and the infrastructure delivers an emulated environment that looks just like any other web page. The idea that emulation is too complex and difficult for readers is exploded.
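To make the gateway side of the quoted architecture concrete, here is a minimal sketch of a websocket-to-TAP relay in Python. It is not the jor1k gateway itself: it handles a single browser VM, does no switching or NAT, and assumes Linux, root privileges, the third-party websockets package (version 10 or later handler signature), and placeholder names for the TAP interface and port:

```python
# Minimal websocket <-> TAP relay (single client, no switching or NAT).
# Assumes Linux, root privileges, and the third-party "websockets" package.
import asyncio, fcntl, os, struct
import websockets

TUNSETIFF, IFF_TAP, IFF_NO_PI = 0x400454CA, 0x0002, 0x1000

def open_tap(name=b"tap0"):
    # Create/attach a TAP device; frames written here reach the host network stack.
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, struct.pack("16sH", name, IFF_TAP | IFF_NO_PI))
    return fd

async def handle(ws):               # websockets >= 10 passes only the connection
    tap = open_tap()
    loop = asyncio.get_running_loop()

    async def ws_to_tap():
        async for frame in ws:      # each binary message is one ethernet frame
            os.write(tap, frame)

    async def tap_to_ws():
        while True:
            frame = await loop.run_in_executor(None, os.read, tap, 4096)
            await ws.send(frame)

    try:
        await asyncio.gather(ws_to_tap(), tap_to_ws())
    finally:
        os.close(tap)

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8080):
        await asyncio.Future()      # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())
```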
Given the ability to run even modern operating systems with network and graphics in the reader's browser or in cloud virtual machines, it is hard to argue that format migration is essential, or even important, to delivering the reader's original experience. Instead, what is essential is a time-sequenced library of binary software environments, ready to be instantiated when readers request content contemporary with them.
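As a sketch of what such a time-sequenced library might look like at its simplest, the snippet below maps content dates to preserved disk images and boots the matching environment with QEMU. The catalog entries, image names and QEMU options are illustrative assumptions, not any actual archive's holdings or API:

```python
# Pick the preserved environment contemporary with a piece of content and boot it.
# The catalog entries and the QEMU invocation are illustrative assumptions.
import datetime, subprocess

ENVIRONMENTS = [
    # (first_date, last_date, disk image containing OS + contemporary applications)
    (datetime.date(1995, 1, 1), datetime.date(1998, 12, 31), "win95-office95.qcow2"),
    (datetime.date(1999, 1, 1), datetime.date(2003, 12, 31), "win98-office2000.qcow2"),
    (datetime.date(2004, 1, 1), datetime.date(2009, 12, 31), "winxp-office2003.qcow2"),
]

def environment_for(content_date):
    """Return the disk image contemporary with the content's creation date."""
    for first, last, image in ENVIRONMENTS:
        if first <= content_date <= last:
            return image
    raise LookupError("no preserved environment covers %s" % content_date)

def boot(image):
    # Instantiate the preserved environment as a virtual machine (discarding changes).
    subprocess.run(["qemu-system-i386", "-m", "256", "-hda", image, "-snapshot"])

if __name__ == "__main__":
    boot(environment_for(datetime.date(1997, 6, 1)))
```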
Notice the document-centric assumption above, namely that the goal of digital preservation is to reproduce the original reader's experience. In 1995 this was the obvious goal, but now it is not the only goal. As I pointed out in respect of the Library of Congress' Twitter collection, current scholars increasingly want not to read individual documents but to data-mine the dataset formed by the collection as a whole. It is true that in many cases emulation as described above does not provide the kind of access needed for this, so it might be argued that format migration is still needed. But there is a world of difference between extracting useful information from a format (for example, bibliographic metadata and text from PDF) and delivering a pixel-perfect re-rendering of it. Extraction is far easier and far less likely to be invalidated by minor format version changes.
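To illustrate how much simpler extraction is than re-rendering, the sketch below pulls bibliographic metadata and text out of a PDF with an off-the-shelf library; nothing about the original rendering stack is needed. It assumes the third-party pypdf package, and the file name is a placeholder:

```python
# Extract metadata and text from a PDF for data-mining; no attempt at
# pixel-perfect re-rendering. Assumes the third-party "pypdf" package.
from pypdf import PdfReader

reader = PdfReader("example.pdf")           # placeholder file name
info = reader.metadata                      # may be None if the PDF has no info dict
print("Title: ", info.title if info else None)
print("Author:", info.author if info else None)

# Concatenate the extracted text of every page.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])                           # first 500 characters of extracted text
```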