Tuesday, November 10, 2015

Follow-up to the Emulation Report

So much has happened while my report on emulation was in the review process that, although I announced its release only last week, I already have enough material for a follow-up post. Below the fold, the details, including a really important paper from the recent SOSP.

First, a few links to reinforce points that I made in the report:

One important assumption behind the use of emulation for preservation is that future hardware will be much more powerful than the hardware that originally ran the preserved digital artefact. Moore's Law used to make this a no-brainer for CPU performance and memory size. Although it has recently slowed, the long time scales implicit in preservation mean that these are still good bets. But CPU and memory are not the only capabilities emulation needs; they also include the I/O resources needed for communication with the user. The report points out that here the bet is no longer a good one. Desktop and laptop sales are in free-fall and, as The Register reports, even tablet sales have been cratering over the last year. The hardware future users will use to interact with emulations will be a smartphone. It won't have a physical keyboard, and its display and its pixels will be much smaller. Most current emulations are unusable on a smartphone.

The report starts with an image of a Mac emulator running on an Apple watch. Nick Lee started a trend. Hacking Jules has Nintendo 64 and PSP emulators running on his Android Wear. Not, of course, that these emulated games really recreate the experience of playing on a Nintendo 64 or a PSP. But, as with Nick Lee's Mac, they show that simply running an emulation is not that hard.

One thing that surprised me during the research for the report was that retro-gaming is a $200M/yr business. The industry just held a convention in Portland, complete with a keynote by Al Alcorn.

Some papers at iPRES2015 addressed issues that were raised in the report:
  • Functional Access to Forensic Disk Images in a Web Service by Kam Woods et al. describes using Freiburg's emulation-as-a-service on a collection of forensic disk images.
  • Characterization of CDROMs for Emulation-based Access by Klaus Rechert et al. is a paper I cited in the report, thanks to a pre-print from Klaus. It describes the DNB's efforts using Freiburg's EAAS to provide access to their collection of CD-ROM images. In particular, it describes an automated workflow for extracting the necessary technical metadata.
  • Getting to the Bottom Line: 20 Digital Preservation Cost Questions. Cost is the single most important cause of the Half-Empty Archive. One concern the report raises is that, absent better ingest tools, the per-artefact cost of emulation is too high. Matt Schultz et al. describe a resource to help institutions identify the full range of costs that might be associated with any particular digital preservation service.
  • Dragan Espenschied's beautiful poster about the Theresa Duncan CD-ROMs is worth a look.
The Freiburg team have also continued to make progress.

Based on the facts that cloud services depend heavily on virtualization, and that preserved system images generally work well, the report is cautiously enthusiastic about the fidelity with which emulators execute their target's instruction set. But it does flag several concerns in this area, such as an apparent regression in QEMU's ability to run Windows 95.

A paper at the recent SOSP by Nadav Amit et al. entitled Virtual CPU Validation casts light on the causes and cures of fidelity failures in emulators. They observed that the problem of verifying virtualized or emulated CPUs is closely related to the problem of verifying a real CPU. Real CPU vendors sink huge resources into verifying their products, and this team from the Technion and Intel were able to base their research into x86 emulation on the tools that Intel uses to verify its CPU products.
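
As I understand it, the approach is essentially differential testing: run generated instruction sequences on the virtual CPU and on a trusted reference (for Intel, the real silicon and its validation tooling), then compare the resulting architectural state. As a rough illustration only, and nothing like the scale of Intel's tools, here is a toy Python sketch of the idea, comparing a deliberately buggy "emulator" of a made-up two-register machine against a reference model:

# Toy sketch only: differential testing of an "emulator" against a reference
# model of a made-up two-register machine. The real work compares a virtual
# CPU's architectural state against a physical CPU over generated test cases.
import random

def run_reference(program):
    """Reference semantics: (opcode, operand) pairs acting on 8-bit registers A and B."""
    a, b = 0, 0
    for op, n in program:
        if op == "LOADA":
            a = n & 0xFF
        elif op == "LOADB":
            b = n & 0xFF
        elif op == "ADD":            # A := (A + B) mod 256
            a = (a + b) & 0xFF
        elif op == "SWAP":
            a, b = b, a
    return a, b

def run_emulator(program):
    """The 'emulator' under test, with a deliberate fidelity bug in ADD."""
    a, b = 0, 0
    for op, n in program:
        if op == "LOADA":
            a = n & 0xFF
        elif op == "LOADB":
            b = n & 0xFF
        elif op == "ADD":
            a = a + b                # bug: forgets to wrap the result at 8 bits
        elif op == "SWAP":
            a, b = b, a
    return a, b

def random_program(length=8):
    """Generate a random test case, standing in for a real test-case generator."""
    ops = ["LOADA", "LOADB", "ADD", "SWAP"]
    return [(random.choice(ops), random.randint(0, 255)) for _ in range(length)]

if __name__ == "__main__":
    random.seed(0)
    for i in range(1000):
        program = random_program()
        ref, emu = run_reference(program), run_emulator(program)
        if ref != emu:               # architectural state diverged: a fidelity bug
            print(f"divergence on test {i}: reference={ref} emulator={emu}")
            print("program:", program)
            break
    else:
        print("no divergence found in 1000 tests")

Finding the divergence is the easy part; as the paper's conclusion below makes clear, the hard part is having an accurate reference, which only the CPU vendors really have.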

Although QEMU running on an x86 tries hard to virtualize rather than emulate, it is capable of emulating, and the team were able to force it into emulation mode. Using their tools, they found and analyzed 117 bugs in QEMU, and fixed most of them. Their testing also triggered a bug in the VM BIOS:
But the VM BIOS can also introduce bugs of its own. In our research, as we addressed one of the disparities in the behavior of VCPUs and CPUs, we unintentionally triggered a bug in the VM BIOS that caused the 32-bit version of Windows 7 to display the so-called blue screen of death.
Their conclusion is worth quoting:
Hardware-assisted virtualization is popular, arguably allowing users to run multiple workloads robustly and securely while incurring low performance overheads. But the robustness and security are not to be taken for granted, as it is challenging to virtualize the CPU correctly, notably in the face of newly added features and use cases. CPU vendors invest a lot of effort—hundreds of person years or more—to develop validation tools, and they exclusively enjoy the benefit of having an accurate reference system. We therefore speculate that effective hypervisor validation could truly be made possible only with their help. We further contend that it is in their interest to provide such help, as the majority of server workloads already run on virtual hardware, and this trend is expected to continue. We hope that open source hypervisors will be validated on a regular basis by Intel Open Source Technology Center.
Having Intel validate the open source hypervisors, especially doing so by forcing them to emulate rather than virtualize, would be a big step forward. But note the focus on current uses of virtualization. It is uncertain to what extent the validation process would test the emulation of the legacy CPU hardware features that matter for preservation, though the fact that their testing caught a bug relevant only to 32-bit Windows 7 is encouraging.

19 comments:

IlyaK said...

Hi David,

I would like to add that I am continuing work on Netcapsule (https://github.com/ikreymer/netcapsule), which I think is a next step in the 'Internet Emulator' effort. It is a fully open source Docker-based system, currently supporting 13 browsers, each running in its own Docker container on-demand, and allowing browsing across 10+ Memento-enabled archives.

Ilya

euanc said...

David,

Your report is excellent, thank you!

I do have one piece of feedback. One thing you seem to have missed, in both of your blog posts and in the report, is how the use case you mentioned us having at Yale relates to the cost calculations for using emulation long term and at large scale.

One of my ongoing concerns is the problem of software-dependent content, which I consider to be extremely prevalent and problematic. For this reason, and due to the cost considerations outlined below, we are exploring at Yale the option of using emulation to enable interaction with every born-digital object in our archive (and potentially everything digital) using software that is contemporaneous with the objects. On first hearing, this can sound daunting and expensive. Fortunately, much of it can be achieved quite inexpensively, as a huge number of our files can best be interacted with using a relatively small set of environments. For example, if we set up each version of Microsoft Office and/or WordPerfect Office on one disk image each and made them available via Emulation as a Service for use with our content, we would be able to enable interaction with many millions of files. The per-file cost of enabling this would be minimal. Furthermore, the cost could mostly be borne at the point of access, i.e. just in time rather than just in case (which I'd argue applies to migrating everything - it is a relatively large and recurring expense incurred "just in case" the content ever gets used).

The team in Freiburg have also made great progress with implementing something similar to the approach used in the KEEP project, enabling the above use case in an even more automated and therefore cost-effective way. They have mapped PRONOM IDs to a number of preconfigured environments, so that the individual files to be accessed can be analyzed and a set of environments that can interact with them automatically identified. Under their current implementation a default environment is automatically selected and booted, but users can optionally choose another "compatible" environment.
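
To make that concrete, here is a hypothetical sketch of the lookup (my own illustration, not the actual bwFLA code; the PRONOM IDs and environment names are made up): a file's format ID is looked up in a table of preconfigured environments, the default is booted automatically, and the user can override it with any other compatible environment.

# Hypothetical illustration only - not the bwFLA/EaaS implementation.
# Map PRONOM format IDs to preconfigured emulation environments, boot a
# default, and let users pick another compatible environment if they prefer.

# Example registry: the IDs and environment names below are illustrative.
PRONOM_TO_ENVIRONMENTS = {
    "fmt/40":   ["win98-office97", "winxp-office2003"],         # illustrative
    "x-fmt/44": ["win311-wordperfect6", "win98-wordperfect8"],   # illustrative
}

def compatible_environments(pronom_id):
    """All preconfigured environments registered as able to open this format."""
    return PRONOM_TO_ENVIRONMENTS.get(pronom_id, [])

def select_environment(pronom_id, preferred=None):
    """The environment to boot: the user's choice if compatible, else the default."""
    environments = compatible_environments(pronom_id)
    if not environments:
        raise LookupError(f"no emulation environment registered for {pronom_id}")
    if preferred in environments:
        return preferred
    return environments[0]   # default: the first (e.g. contemporaneous) environment

if __name__ == "__main__":
    print(select_environment("fmt/40"))                                # default
    print(select_environment("fmt/40", preferred="winxp-office2003"))  # override

The expensive part - building and configuring the environments - is done once; the per-file cost is essentially just the characterization lookup.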

Overall this use case is quite different from, e.g., the CD-ROM use case, as there is minimal initial effort, and therefore cost, per item to be preserved. Its "just in time" rather than "just in case" nature makes it quite attractive as an option over the long term, especially compared with the just-in-case migration alternative.

One other update: my student workers in the Library at Yale are currently processing about 500 CD-ROMs and floppy disks from our general collections per week. We have around 8,000, probably 5-6,000 of which will eventually go into the EaaS framework. These will take some effort to process (unlike the use case I described above), but I'm hoping the characterization service the bwFLA team have implemented may help with that.

Again, the report is excellent, thank you!

Euan Cochrane

David. said...

Apple is pushing the idea that their new iPad Pro will kill off the PC:

“I think if you’re looking at a PC, why would you buy a PC anymore? No really, why would you buy one?”, asks Tim Cook, Apple’s chief executive.

David. said...

The report points out that the persistence of malware in the Internet is a significant problem for the use of emulation in preservation. It uses 2008's Conficker worm as an illustration. Today, we get a reminder that these threats never go away. Conficker was just discovered in brand-new factory shipments of police body cameras.

David. said...

The report discusses the problems GPUs pose for emulation (Section 3.2.1) and the efforts to provide paravirtualized GPU support in QEMU (Section 4.2.1). This limited but valuable support is now mainstreamed in the Linux 4.4 kernel.

David. said...

Sebastian Anthony at Ars Technica reports that Sony has stealthily released a PS2 emulator for the PS4. According to a Sony spokesperson:

"We are working on utilising PS2 emulation technology to bring PS2 games forward to the current generation. We have nothing further to comment at this point in time."

David. said...

Via Rob Beschizza's report on running the original IBM 5150 PC and Adventure in his browser, I found the PCjs project:

"The goals of the JavaScript Machines project are to create fast, full-featured simulations of classic computer hardware, help people understand how these early machines worked, make it easy to experiment with different machine configurations, and provide a platform for running and analyzing old computer software."

David. said...

Zerodium has published a list of prices at which it will buy zero-day exploits. Up to $500K for iOS.

David. said...

The Software Freedom Conservancy, of which QEMU is a member, supported Christoph Hellwig's lawsuit against VMware for GPL violations. As a result, it is apparently seeing corporate support evaporate, placing its finances in jeopardy. An illustration of the difficulties even apparently well-funded open-source efforts can encounter.

David. said...

Love Hultén has a KickStarter for a handmade game emulation system in a walnut case that can store and emulate 10,000 games.

David. said...

Ilya's system for viewing the Web through old browsers is up at oldweb.today.

David. said...

Oya Rieger et al from Cornell have just released a report entitled Preserving and Emulating Digital Art Objects which looks very interesting. I will write more once I've had a chance to study it.

David. said...

With a little work you can emulate old games on an Apple TV.

David. said...

The report concludes that widespread use of emulation depends on a solution to the obstacles posed by copyright. Anyone in doubt about how hard this will be should read Zachary Crockett's How Mickey Mouse Evades the Public Domain at Priceonomics:

"Disney has done everything in its power to make sure it retains the copyright on Mickey -- even if that means changing federal statutes. Every time Mickey’s copyright is about to expire, Disney spends millions lobbying Congress for extensions, and trading campaign contributions for legislative support. With crushing legal force, they’ve squelched anyone who attempts to disagree with them."

David. said...

Shira Ovide's Smartphones Aren't PC's Only Nemesis is an interesting take on the now four-year-long collapse in sales of PCs:

"Microsoft is trying to change its business model so it can in theory make money even if no one ever buys a new PC again. Meanwhile, Intel and the PC makers still generate sales from each new PC sold and therefore want personal computers to fly off the shelves. And all of the PC companies are trying everything they can to get out of the PC business -- or at least become less dependent on selling computers."

David. said...

Those of us who grew up using Control Data systems now have an emulator for the CDC6400 and much of the Cyber series (which was after my time with CDC machines).

David. said...

Warren Toomey is working on a really challenging emulation project, reviving PDP-7 Unix, the starting point for much of today's computing infrastructure including Linux, Android and iOS.

David. said...

Kyle Orland at Ars Technica has a report on Frank Cifaldi's presentation to the Game Developers Conference. My report refers to his company, Digital Eclipse:

"These movies have always been in print," Cifaldi said. "Games could have been the same way, except we demonized emulation, and devalued our heritage. We've relegated a majority of our past to piracy."

The whole post is worth a read.

David. said...

Some idea of the scale of the retro-gaming industry can be gathered from Sam Machkovech's report on the Portland Retro Gaming Expo.