20 years ago, Jeff Rothenberg's seminal Ensuring the Longevity of Digital Documents compared migration and emulation as strategies for digital preservation, strongly favoring emulation. Emulation was already a long-established technology; as Rothenberg wrote, Apple was using it as the basis for their transition from the Motorola 68K to the PowerPC. Despite this, the strategy of almost all digital preservation systems since has been migration. Why was this?

Below the fold is the text of the talk with links to the sources. The demos in the talk were crippled by the saturated hotel network; please click on the linked images below for Smarty, oldweb.today and VisiCalc to experience them for yourself. The Olive demo of TurboTax is not publicly available, but it is greatly to Olive's credit that it worked well even on a heavily-loaded network.
Preservation systems using emulation have recently been deployed for public use by the Internet Archive and the Rhizome Project, and for restricted use by the Olive Archive at Carnegie Mellon and others. What are the advantages and limitations of current emulation technology, and what are the barriers to more general adoption?
Title
Once again, I need to thank Cliff Lynch for inviting me to give this talk, and for letting me use the participants in Berkeley iSchool's "Information Access Seminars" as guinea-pigs to debug it. This one is basically "what I did on my summer vacation", writing a report under contract to the Mellon Foundation entitled Emulation and Virtualization as Preservation Strategies. As usual, you don't have to take notes or ask for the slides; an expanded text with links to the sources will go up on my blog shortly. The report itself is available from the Mellon Foundation and from the LOCKSS website.
I'm old enough to know that giving talks that include live demos over the Internet is a really bad idea, so I must start by invoking the blessing of the demo gods.
History
Emulation and virtualization technologies have been a feature of the information technology landscape for a long time, going back at least to the IBM 709 in 1958, but their importance for preservation was first brought to public attention in Jeff Rothenberg's seminal 1995 Scientific American article Ensuring the Longevity of Digital Documents. As he wrote, Apple was using emulation in the transition of the Macintosh from the Motorola 68000 to the PowerPC. The experience he drew on was the rapid evolution of digital storage media such as tapes and floppy disks, and of applications such as word processors, each with their own incompatible format.
His vision can be summed up as follows: documents are stored on off-line media which decay quickly, whose readers become obsolete quickly, as do the proprietary, closed formats in which they are stored. If this isn't enough, operating systems and hardware change quickly in ways that break the applications that render the documents.
Rothenberg identified two techniques by which digital documents could survive in this unstable environment, contrasting the inability of format migration to guarantee fidelity with emulation's ability to precisely mimic the behavior of obsolete hardware.
Rothenberg's advocacy notwithstanding, most digital preservation efforts since have used format migration as their preservation strategy. The isolated demonstrations of emulation's feasibility, such as the collaboration between the UK National Archives and Microsoft, had little effect. Emulation was regarded as impractical because it was thought (correctly at the time) to require more skill and knowledge to both create and invoke emulations than scholars wanting access to preserved materials would possess.
|MacOS7 on Apple Watch|
|Nintendo 64 on Android Wear|
Recently, teams at the Internet Archive, Freiburg University and Carnegie Mellon University have shown frameworks that can make emulations appear as normal parts of Web pages; readers need not be aware that emulation is occurring. Some of these frameworks have attracted substantial audiences and demonstrated that they can scale to match. This talk is in four parts:
- First I will show some examples of how these frameworks make emulations of legacy digital artefacts, those from before about the turn of the century, usable for unskilled readers.
- Next I will discuss some of the issues that are hampering the use of these frameworks for legacy artefacts.
- Then I will describe the changes in digital technologies over the last two decades, and how they impact the effectiveness of emulation and migration in providing access to current digital artefacts.
- I will conclude with a look at the single biggest barrier that has and will continue to hamper emulation as a preservation strategy.
Emulation frameworks for preservation have three major components:
- One or more emulators capable of executing preserved system images.
- A collection of preserved system images, together with the metadata describing which emulator configured in which way is appropriate for executing them.
- A framework that connects the user with the collection and the emulators so that the preserved system image of the user's choice is executed with the appropriately configured emulator connected to the appropriate user interface.
Theresa Duncan's CD-ROMs
|The Theresa Duncan CD-ROMs.|
What Is Going On?
What happened when I clicked Smarty's Play button?
- The browser connects to a session manager in Amazon's cloud, which notices that this is a new session.
- Normally it would authenticate the user, but because this CD-ROM emulation is open access it doesn't need to.
- It assigns one of its pool of running Amazon instances to run the session's emulator. Each instance can run a limited number of emulators. If no instance is available when the request comes in it can take up to 90 seconds to start another.
- It starts the emulation on the assigned instance, supplying metadata telling the emulator what to run.
- The emulator starts. After a short delay the user sees the Mac boot sequence, and then the CD-ROM starts running.
- At intervals, the emulator sends the session manager a keep-alive signal. Emulators that haven't sent one in 30 seconds are presumed dead, and their resources are reclaimed to avoid paying the cloud provider for unused resources.
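Taken together, the session lifecycle above amounts to simple pool-and-timeout bookkeeping. Here is a minimal sketch in Python; the class and method names are hypothetical, with only the 30-second timeout taken from the description above:

```python
import time

KEEPALIVE_TIMEOUT = 30  # seconds without a keep-alive before a session is presumed dead

class SessionManager:
    """Toy model of the session manager's keep-alive bookkeeping."""

    def __init__(self):
        self.last_seen = {}  # session id -> timestamp of the last keep-alive

    def keepalive(self, session_id):
        """Record a keep-alive signal from a running emulator."""
        self.last_seen[session_id] = time.monotonic()

    def reap_dead_sessions(self):
        """Return sessions whose emulators are presumed dead, so their
        cloud resources can be reclaimed rather than paid for idly."""
        now = time.monotonic()
        dead = [s for s, t in self.last_seen.items()
                if now - t > KEEPALIVE_TIMEOUT]
        for s in dead:
            del self.last_seen[s]
        return dead
```

Using a monotonic clock rather than wall-clock time avoids spuriously reaping live sessions when the system clock is stepped.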
bwFLA's framework from Freiburg University encapsulates each emulator behind three interfaces:
- Data I/O, connecting the emulator to data sources such as disk images, user files, an emulated network containing other emulators, and the Internet.
- Interactive Access, connecting the emulator to the user using standard HTML5 facilities.
- Control, providing a Web Services interface that bwFLA's resource management can use to control the emulator.
bwFLA's preserved system images are stored as a stack of overlays in QEMU's "qcow2" format. Each overlay on top of the base system image represents a set of writes to the underlying image. For example, the base system image might be the result of an initial install of Windows 95, and the next overlay up might be the result of installing Word Perfect into the base system. Or, as Cal Lee mentioned yesterday, the next overlay up might be the result of redaction. Each overlay contains only those disk blocks that differ from the stack of overlays below it. The stack of overlays is exposed to the emulator as if it were a normal file system via FUSE.
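The copy-on-write behavior of the overlay stack can be modelled in a few lines. This is a toy in-memory sketch, not QEMU's actual qcow2 format; the block contents are invented:

```python
class Overlay:
    """One qcow2-style overlay: holds only the blocks written at this layer."""

    def __init__(self, base=None):
        self.base = base      # the underlying image, or None for the base image
        self.blocks = {}      # block number -> data written at this layer

    def write(self, n, data):
        self.blocks[n] = data  # writes never modify the layers below

    def read(self, n):
        # Search this layer first, then each layer of the stack below it.
        layer = self
        while layer is not None:
            if n in layer.blocks:
                return layer.blocks[n]
            layer = layer.base
        return b"\x00"         # an unallocated block reads as zeros

# Example: a base install with an application overlay on top.
base = Overlay()
base.write(0, b"win95")            # stands in for the Windows 95 install
apps = Overlay(base)
apps.write(1, b"wordperfect")      # stands in for installing Word Perfect
```

A read falls through to the deepest layer that holds the block, which is why each overlay only needs to store the blocks that differ from the stack below it.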
The technical metadata that encapsulates the system disk image is described in a paper presented to the iPRES conference in November 2015, using the example of emulating CD-ROMs. Broadly, it falls into two parts, describing the software and hardware environments needed by the CD-ROM in XML. The XML refers to the software image components via the Handle system, providing a location-independent link to access them.
|BBC News via oldweb.today|
|TurboTax97 on Windows 3.1|
What Is Going On?
The top and bottom menu bars come from a program called VMNetX that is running on my system. Chromium invoked it via a MIME-type binding, and VMNetX then created a suitable environment in which it could invoke the emulator that is running Windows 3.1, and TurboTax. The menu bars include buttons to power off the emulated system, control its settings, grab the screen, and control the assignment of the keyboard and mouse to the emulated system.
The interesting question is "where is the Windows 3.1 system disk with TurboTax installed on it?"
This system is Olive, developed at Carnegie Mellon University by a team under my friend Prof. Mahadev Satyanarayanan, and released under GPLv2. VMNetX uses a sophisticated two-level caching scheme to provide good emulated performance even over slow Internet connections. A "pristine cache" contains copies of unmodified disk blocks from the "system disk". When a program writes to disk, the data is captured in a "modified cache". When the program reads a disk block, it is delivered from the modified cache, the pristine cache or the Web server, in that order. One reason this works well is that successive emulations of the same preserved system image are very similar, so pre-fetching blocks into the pristine cache is effective in producing YouTube-like performance over 4G cellular networks.
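The read path described above (modified cache, then pristine cache, then the Web server) can be sketched as follows. The class and the fetch callback are hypothetical stand-ins for VMNetX's internals:

```python
class TwoLevelCache:
    """Toy model of VMNetX's two-level block caching scheme."""

    def __init__(self, fetch_from_server):
        self.modified = {}              # blocks the emulated program has written
        self.pristine = {}              # unmodified blocks fetched from the server
        self.fetch = fetch_from_server  # callback standing in for the Web server

    def write(self, n, data):
        """Writes are captured in the modified cache; the system disk
        image on the server is never changed."""
        self.modified[n] = data

    def read(self, n):
        """Deliver a block from the modified cache, the pristine cache,
        or the Web server, in that order."""
        if n in self.modified:
            return self.modified[n]
        if n not in self.pristine:
            self.pristine[n] = self.fetch(n)  # demand-fetch, then cache
        return self.pristine[n]
```

Because successive emulations of the same preserved image read similar block sequences, a real implementation can also pre-fetch likely blocks into the pristine cache before they are requested.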
|VisiCalc on Apple ][|
Internet Archive
This is the framework underlying the Internet Archive's software library, which currently holds nearly 36,000 items, including more than 7,300 for MS-DOS, 3,600 for Apple, 2,900 console games and 600 arcade games. Some can be downloaded, but most can only be streamed.
The oldest is an emulation of a PDP-1 with a DEC 30 display running the Spacewar! game from 1962, more than half a century ago. As I can testify, having played this and similar games on Cambridge University's PDP-7 with a DEC 340 display seven years later, this emulation works well.
Concerns: Emulators
All three groups share a set of concerns about emulation technology. The first is about the emulators themselves. There are a lot of different emulators out there, but the open source emulators used for preservation fall into two groups:
- QEMU is well-supported, mainstream open source software, part of most Linux distributions. It emulates or virtualizes a range of architectures including x86, x86-64, ARM, MIPS and SPARC. It is used by both bwFLA and Olive, but both groups have encountered irritating regressions in its emulations of older systems, such as Windows 95. It is hard to get the QEMU developers to prioritize fixing these, since emulating current hardware is the project's primary focus. The recent SOSP workshop featured a paper from the Technion and Intel describing their use of the tools Intel uses to verify chips to verify QEMU. They found and mostly fixed 117 bugs.
- Enthusiast-supported emulators for old hardware including MAME/MESS, Basilisk II, SheepShaver, and DOSBox. These generally do an excellent job of mimicking the performance of a wide range of obsolete CPU architectures, but have some issues mapping the original user interface to modern hardware. Jason Scott at the Internet Archive has done great work encouraging the retro-gaming community to fix problems with these emulators but, for long-term preservation, their support causes concerns.
Concerns: Metadata
Emulations of preserved software such as those I've demonstrated require not just the bits forming the image of a CD-ROM or system disk, but also several kinds of metadata:
- Technical metadata, describing the environment needed in order for the bits to function. Tools for extracting technical metadata for migration such as JHOVE and DROID exist, as do the databases on which they rely such as PRONOM, but they are inadequate for emulation. The DNB and bwFLA teams' iPRES 2015 paper describes an initial implementation of a tool for compiling and packaging this metadata which worked quite well for the restricted domain of CD-ROMs. But much better, broadly applicable tools and databases are needed if emulation is to be affordable.
- Bibliographic metadata, describing what the bits are so that they can be discovered by potential "readers".
- Usability metadata, describing how to use the emulated software. An example is the VisiCalc reference card, describing the key bindings of the first spreadsheet.
- Usage metadata, describing how the emulations get used by "readers", which is needed by cloud-based emulation systems for provisioning, and for "page-rank" type assistance in discovery. The Web provides high-quality tools in this area, although a balance has to be maintained with user privacy. The Internet Archive's praiseworthy policy of minimizing logging does make it hard to know how much their emulations are used.
Concerns: Fidelity
In a Turing sense all computers are equivalent, so it is possible and indeed common for an emulator to precisely mimic the behavior of a computer's CPU and memory. But physical computers are more than a CPU and memory. They have I/O devices whose behavior in the digital domain is more complex than Turing's model. Some of these devices translate between the digital and analog domains to provide the computer's user interface.
|PDP1 front panel by fjarlq / Matt.|
Licensed under CC BY 2.0.
Concerns: Loads & Scaling
|Daily emulation counts|
Their experience led Rhizome to deploy their infrastructure on Amazon's highly scalable ElasticBeanstalk infrastructure. Klaus Rechert computes:
Amazon EC2 charges for an 8 CPU machine about €0.50 per hour. In case of [Bomb Iraq], the average session time of a user playing with the emulated machine was 15 minutes, hence, the average cost per user is about €0.02 if a machine is fully utilized.

In the peak, this would have been about €10/day, ignoring Amazon's charges for data out to the Internet. Nevertheless, automatically scaling to handle unpredictable spikes in demand always carries budget risks, and rate limits are essential for cloud deployment.
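Rechert's figure can be checked with back-of-the-envelope arithmetic, assuming (hypothetically, since the talk does not say) that a fully utilized 8-CPU instance runs one emulation session per CPU:

```python
# Back-of-the-envelope check of the per-user cost quoted above.
hourly_rate = 0.50        # EUR per hour for an 8 CPU EC2 machine
session_minutes = 15      # average time a user played with the emulation
concurrent_sessions = 8   # assumption: one session per CPU when fully utilized

cost_per_session = hourly_rate * (session_minutes / 60) / concurrent_sessions
print(round(cost_per_session, 3))  # roughly EUR 0.02 per user
```

The answer is sensitive to the concurrency assumption: halve the number of sessions the machine can run at once and the per-user cost doubles.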
Why Mostly Games?
Using emulation for preservation was pioneered by video game enthusiasts. This reflects a significant audience demand for retro gaming which, despite the easy informal availability of free games, is estimated to be a $200M/year segment of the $100B/year video games industry. Commercial attention to the value of the game industry's back catalog is increasing; a company called Digital Eclipse aspires to become the Criterion Collection of gaming, selling high-quality re-issues of old games. Because preserving content for scholars lacks the business model and fan base of retro gaming, it is likely that it will continue to be a minority interest in the emulation community.
There are relatively few preserved system images other than games for several reasons:
- The retro gaming community has established an informal modus vivendi with the copyright owners. Most institutions require formal agreements covering preservation and access and, just as with academic journals and books, identifying and negotiating individually with every copyright owner in the software stack is extremely expensive.
- If a game is to be successful enough to be worth preserving, it must be easy for an unskilled person to install, execute and understand, and thus easy for a curator to create a preserved system image. The same is not true for artefacts such as art-works or scientific computations, and thus the cost per preserved system image is much higher.
- A large base of volunteers is interested in creating preserved game images, and there is commercial interest in doing so. Preserving other genres requires funding.
- Techniques have been developed for mass preservation of, for example, Web pages, academic journals, and e-books, but no such mass preservation technology is available for emulations. Until it is, the cost per artefact preserved will remain many orders of magnitude higher.
Artefact Evolution
As we have seen, emulation can be very effective at re-creating the experience of using the kinds of digital artefacts that were being created two decades ago. But the artefacts being created now are very different, in ways that have a big impact on their preservation, whether by migration or emulation.
Before the advent of the Web, digital artefacts had easily identified boundaries. They consisted of a stack of components, starting at the base with some specified hardware, an operating system, an application program and some data. In typical discussions of digital preservation, the bottom two layers were assumed and the top two instantiated in a physical storage medium such as a CD.
The connectivity provided by the Internet and subsequently by the Web makes it difficult to determine where the boundaries of a digital object are. For example, the full functionality of what appear on the surface to be traditional digital documents such as spreadsheets or PDFs can invoke services elsewhere on the network, even if only by including links. The crawlers that collect Web content for preservation have to be carefully programmed to define the boundaries of their crawls. Doing so imposes artificial boundaries, breaking what appears to the reader as a homogeneous information space into discrete digital "objects".
Indeed, what a reader thinks of as "a web page" typically now consists of components from dozens of different Web servers, most of which do not contribute to the reader's experience of the page. They are deliberately invisible, implementing the Web's business model of universal fine-grained surveillance.
Sir Tim Berners-Lee's original Web was essentially an implementation of Vannevar Bush's Memex hypertext concept, an information space of passive, quasi-static hyper-linked documents. The content a user obtained by dereferencing a link was highly likely to be the same as that obtained by a different user, or by the same user at a different time.
The fact that the artefacts to be preserved are now active makes emulation a far better strategy than migration, but it increases the difficulty of defining their boundaries. One invocation of an object may include a different set of components from the next invocation, so how do you determine which components to preserve?
In 1995, a typical desktop 3.5'' hard disk held 1-2GB of data. Today, the same form factor holds 4-10TB, say 4-5 thousand times as much. In 1995, there were estimated to be 16 million Web users. Today, there are estimated to be over 3 billion, nearly 200 times as many. At the end of 1996, the Internet Archive estimated the total size of the Web at 1.5TB, but today they ingest that much data roughly every 30 minutes.
The technology has grown, but the world of data has grown much faster, and this has transformed the problems of preserving digital artefacts. Take an everyday artefact such as Google Maps. It is simply too big and worth too much money for any possibility of preservation by a third party such as an archive, and its owner has no interest in preserving its previous states.
Infrastructure Evolution
While the digital artefacts being created were evolving, the infrastructure they depend on was evolving too. For preservation, the key changes were:
- GPUs: As Rothenberg was writing, PC hardware was undergoing a major architectural change. The connection between early PCs and their I/O devices was the ISA bus, whose bandwidth and latency constraints made it effectively impossible to deliver multimedia applications such as movies and computer games. This was replaced by the PCI bus, with much better performance, and multimedia became an essential ingredient of computing devices. This forced a division of system architecture into a Central Processing Unit (CPU) and what became known as Graphics Processing Units (GPUs). The reason was that CPUs were essentially sequential processors, incapable of performing the highly parallel task of rendering the graphics fast enough to deliver an acceptable user experience. Now, much of the silicon in essentially every device with a user interface implements a massively parallel GPU whose connection to the display is both very high bandwidth and very low latency. Most high-end scientific computation now also depends on the massive parallelism of GPUs rather than traditional super-computer technology. Partial para-virtualization of GPUs was recently mainstreamed in Linux 4.4, but its usefulness for preservation is strictly limited.
- Smartphones: Both desktop and laptop PC sales are in free-fall, and even tablet sales are no longer growing. Smartphones are the hardware of choice. They, and tablets, amplify interconnectedness; they are designed not as autonomous computing resources but as interfaces to the Internet. The concept of a stand-alone "application" is no longer really relevant to these devices. Their "App Store" supplies custom front-ends to network services, as these are more effective at implementing the Web's business model of pervasive surveillance. Apps are notoriously difficult to collect and preserve. Emulation can help with their tight connection to their hardware platform, but not with their dependence on network services. The user interface hardware of mobile devices is much more diverse. In some cases the hardware is technically compatible with traditional PCs, but not functionally compatible. For example, mobile screens typically are both smaller and have much smaller pixels, so an image from a PC may be displayable on a mobile display but it may be either too small to be readable, or if scaled to be readable may be clipped to fit the screen. In other cases the hardware isn't even technically compatible. The physical keyboard of a laptop and the on-screen virtual keyboard of a tablet are not compatible.
- Moore's Law: Gordon Moore predicted in 1965 that the number of transistors per unit area of a state-of-the-art integrated circuit would double about every two years. For about the first four decades of Moore's Law, what CPU designers used the extra transistors for was to make the CPU faster. This was advantageous for emulation; the modern CPU that was emulating an older CPU would be much faster. The computational cost of emulating the old hardware in software would be swamped by the faster hardware being used to do it. Although Moore's Law continued into its fifth decade, each extra transistor gradually became less effective at increasing CPU speed. Further, as GPUs took over much of the intense computation, customer demand evolved from maximum performance per CPU, to processing throughput per unit power. Emulation is a sequential process, so the fact that the CPUs are no longer getting rapidly faster is disadvantageous for emulation.
- Architectural Consolidation: W. Brian Arthur's 1994 book Increasing Returns and Path Dependence in the Economy described the way the strongly increasing returns to scale in technology markets drove consolidation. Over the past two decades this has happened to system architectures. Although it is impressive that MAME/MESS emulates nearly two thousand different systems from the past, going forward emulating only two architectures (Intel and ARM) will capture the overwhelming majority of digital artefacts.
- Threats: Although the Morris Worm took down the Internet in 1988, the Internet environment two decades ago was still fairly benign. Now, Internet crime is one of the world's most profitable activities, as can be judged by the fact that the price for a single zero-day iOS exploit is $1M. Because users are so bad at keeping their systems up-to-date with patches, once a vulnerability is exploited it becomes a semi-permanent feature of the Internet. For example, the 7-year-old Conficker worm was recently found infecting brand-new police body-cameras. This threat persistence is a particular concern for emulation as a preservation strategy. Familiarity Breeds Contempt by Clark et al. shows that the interval between discoveries of new vulnerabilities in released software decreases through time. Thus the older the preserved system image, the (exponentially) more vulnerabilities it will contain, and the more likely it is to be compromised as soon as its emulation starts.
Legal Issues
Warnings: I Am Not A Lawyer, and this is US-specific.
Most libraries and archives are very reluctant to operate in ways whose legal foundations are less than crystal clear. There are two areas of law that affect using emulation to re-execute preserved software, copyright and, except for open-source software, the end user license agreement (EULA), a contract between the original purchaser and the vendor.
Software must be assumed to be copyright, and thus absent specific permission such as a Creative Commons or open source license, making persistent copies such as are needed to form collections of preserved system images is generally not permitted. The Digital Millennium Copyright Act (DMCA) contains a "safe harbor" provision under which sites that remove copies if copyright owners send "takedown notices" are permitted; this is the basis upon which the Internet Archive's collection operates. Further, under the DMCA it is forbidden to circumvent any form of copy protection or Digital Rights Management (DRM) technology. These constraints apply independently to every component in the software stack contained in a preserved system image, thus there may be many parties with an interest in an emulation's legality.
The Internet Archive and others have repeatedly worked through the "Section 108" process to obtain an exemption to the circumvention ban for programs and video games "distributed in formats that have become obsolete and that require the original media or hardware as a condition of access, when circumvention is accomplished for the purpose of preservation or archival reproduction of published digital works by a library or archive." This exemption appears to cover the Internet Archive's circumvention of any DRM on their preserved software, and its subsequent "archival reproduction" which presumably includes execution. It does not, however, exempt the archive from taking down preserved system images if the claimed copyright owner objects, and the Internet Archive routinely does so. Neither does the DMCA exemption cover the issue of whether the emulation violates the EULA.
Streaming media services such as Spotify, which do not result in the proliferation of copies of content, have significantly reduced although not eliminated intellectual property concerns around access to digital media. "Streaming" emulation systems should have a similar effect on access to preserved digital artefacts. The success of the Internet Archive's collections, much of which can only be streamed, and of Rhizome's, is encouraging in this respect. Nevertheless, it is clear that institutions will not build, and provide access even on a restricted basis to, collections of preserved system images at the scale needed to preserve our cultural heritage unless the legal basis for doing so is clarified.
Negotiating with copyright holders piecemeal is very expensive and time-consuming. Trying to negotiate a global agreement that would obviate the need for individual agreement would in the best case, take a long time. I predict the time would be infinite rather than long. If we wait to build collections until we have permission in one of these ways much software will be lost.
An alternative approach worth considering would separate the issues of permission to collect from the issues of permission to provide access. Software is copyright. In the paper world, many countries had copyright deposit legislation allowing their national library to acquire, preserve and provide access (generally restricted to readers physically at the library) to copyright material. Many countries, including most of the major software producing countries, have passed legislation extending their national library's rights to the digital domain.
The result is that most of the relevant national libraries already have the right to acquire and preserve digital works, although not the right to provide unrestricted access to them. Many national libraries have collected digital works in physical form. For example, the DNB's CD-ROM collection includes half a million items. Many national libraries are crawling the Web to ingest Web pages relevant to their collections.
It does not appear that national libraries are consistently exercising their right to acquire and preserve the software components needed to support future emulations, such as operating systems, libraries and databases. A simple change of policy by major national libraries could be effective immediately in ensuring that these components were archived. Each national library's collection could be accessed by emulations on-site in "reading-room" conditions, as envisaged by the DNB. No time-consuming negotiations with publishers would be needed.
If national libraries stepped up to the plate in this way, the problem of access would remain. One idea that might be worth exploring as a way to address it is lending. The Internet Archive has successfully implemented a lending system for their collection of digitized books. Readers can check a book out for a limited period; each book can be checked out to at most one reader at a time. This has not encountered much opposition from copyright holders. A similar system for emulation would be feasible; readers would check out an emulation for a limited period, and each emulation could be checked out to at most one reader at a time. One issue would be dependencies. An archive might have, say, 10,000 emulations based on Windows 3.1. If checking out one blocked access to all 10,000 that might be too restrictive to be useful.
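The one-reader-at-a-time rule would work like the Internet Archive's book lending. A minimal sketch, with invented names and no handling of loan expiry:

```python
class LendingLibrary:
    """Toy model of one-reader-at-a-time lending for emulations."""

    def __init__(self):
        self.checked_out = {}  # emulation id -> reader currently holding it

    def check_out(self, emulation, reader):
        """Lend the emulation to this reader, unless someone already has it."""
        if emulation in self.checked_out:
            return False       # at most one reader at a time
        self.checked_out[emulation] = reader
        return True

    def check_in(self, emulation, reader):
        """Return the emulation, making it available to the next reader."""
        if self.checked_out.get(emulation) == reader:
            del self.checked_out[emulation]
```

The dependency problem in the paragraph above corresponds to the choice of key: locking on the emulation is workable, but locking on a shared component such as the Windows 3.1 base image would block every emulation built on it.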