Tuesday, October 4, 2016

Panel on Software Preservation at iPRES

I was one of five members of a panel on Software Preservation at iPRES 2016, moderated by Maureen Pennock. We each had three minutes to answer the question "what have you contributed towards software preservation in the past year?" Follow me below the fold for my answer.

So, what have I contributed towards software preservation in the past year? In my case, the answer won't take much of my three minutes. I published a 37-page report, funded by the Andrew W. Mellon Foundation, entitled Emulation and Virtualization as Preservation Strategies, wrote 15 blog posts on emulation, and gave 4 talks on the topic.

Obviously, you need to read each and every word of them, but for now I will condense this mass of words into six key points for you:
  1. The barriers to delivering emulations of preserved software are no longer technical. Rhizome's Theresa Duncan CD-ROMs, the Internet Archive's Software Library, and Ilya Kreymer's oldweb.today are examples of emulations transparently embedded in the Web. Fabrice Bellard's v86 JavaScript emulator allows you not merely to run Linux or OpenBSD in your browser, but even to boot your own floppy, CD-ROM or disk image in it.
  2. The cost of preparing such emulations is still too high. Some progress has been made towards tools for automatically extracting the necessary technical metadata from CD-ROMs, but overall the state of ingest tooling is inadequate.
  3. Most ways in which emulations are embedded in Web pages are not themselves preservable. The Web page embeds not merely the software and hardware environment to be emulated, which should remain the same, but also the particular technology to be used to emulate it, which will change. That's not the only problem. One-size-fits-all emulation delivers some users a miserable experience. The appropriate emulation technology to use depends on the user's device, browser, latency, bandwidth, etc. What's needed is a standard for representing preserved system images and metadata, an emulation MIME type, and an analog of pdf.js: code that is downloaded to work out an appropriate emulation at dissemination time.
  4. Emulation of software that connects to the Internet is a nightmare, for two reasons. First, it will almost certainly use some network services, which will likely not be there when needed. Second, as the important paper Familiarity Breeds Contempt shows, the software will contain numerous vulnerabilities. It will be compromised as soon as it connects.
  5. Except for open source environments, the legal framework in which software is preserved and run in emulation is highly problematic. It is governed by both the copyright on, and the end user license agreement for, every component in the software stack, from the BIOS through the operating system to the application. Because software is copyright, national libraries with copyright deposit should be systematically collecting it to enable future emulations. The only systematic collection I'm aware of is by NIST on behalf of the FBI for forensic purposes.
  6. Even if the cost of ingest could be greatly reduced, a sustainable business model for ingesting, preserving and disseminating software is nowhere to be seen.
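
To make point 3 a little more concrete, here is a minimal sketch in Python of what the downloaded "analog of pdf.js" might do at dissemination time. Everything in it is an assumption for illustration, not an existing standard or tool: the `EmulationManifest` and `ClientProfile` fields, the three strategy names, and the latency and download-time thresholds are all hypothetical. The point is only that the preserved page carries a technology-neutral description of the environment, and the choice of emulation technology is deferred until the user's circumstances are known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmulationManifest:
    """Hypothetical technology-neutral description of a preserved environment."""
    image_url: str        # where the preserved system image lives
    architecture: str     # e.g. "x86"
    image_size_mb: float  # size of the system image in megabytes


@dataclass(frozen=True)
class ClientProfile:
    """Hypothetical facts the chooser learns about the user's situation."""
    can_emulate_in_browser: bool  # device and browser are capable enough
    latency_ms: float             # round-trip time to the emulation service
    bandwidth_mbps: float         # available download bandwidth


def choose_strategy(manifest: EmulationManifest,
                    client: ClientProfile,
                    max_latency_ms: float = 100.0,
                    max_download_s: float = 60.0) -> str:
    """Pick an emulation strategy; the thresholds are illustrative guesses."""
    if client.bandwidth_mbps > 0:
        # megabytes -> megabits, divided by megabits per second
        download_s = manifest.image_size_mb * 8 / client.bandwidth_mbps
    else:
        download_s = float("inf")

    # Prefer running in the user's browser if the image arrives quickly enough.
    if client.can_emulate_in_browser and download_s <= max_download_s:
        return "in-browser"
    # Otherwise run remotely and stream the display, if latency is tolerable.
    if client.latency_ms <= max_latency_ms:
        return "remote-streamed"
    # Fall back to offering the image for use in a locally installed emulator.
    return "download-for-local-use"


manifest = EmulationManifest("https://example.org/images/demo.img", "x86", 40.0)
fast_desktop = ClientProfile(True, latency_ms=30.0, bandwidth_mbps=50.0)
weak_tablet = ClientProfile(False, latency_ms=30.0, bandwidth_mbps=50.0)
remote_user = ClientProfile(False, latency_ms=300.0, bandwidth_mbps=2.0)

print(choose_strategy(manifest, fast_desktop))
print(choose_strategy(manifest, weak_tablet))
print(choose_strategy(manifest, remote_user))
```

Because the manifest names only the environment and not the emulator, the same preserved page could be served by whatever emulation technology is current when it is accessed.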
As usual, the full text of these remarks with links to the sources will go up on my blog shortly after this session.

I was also asked to prepare a response to one of the questions to be used to spark debate:
Economic sustainability: What evidence is required to determine commercial viability for software preservation services? Can cultural heritage institutions make a business case to rights holders that preserving software can co-exist with commercial viability?
I decided to re-formulate the two questions:
  • Can closed-source software vendors be persuaded to allow their old software to be run in emulation? The picture here is somewhat encouraging. Some major vendors are cooperating to some extent. For example, Microsoft's educational licenses allow their old software to be run in emulation. Microsoft has not objected to the Internet Archive's Windows 3.x Showcase. I've had encouraging conversations with senior people at other major vendors. But this is all about use of old software without payment. It is not seen as depriving the vendor of income.
  • Is there a business model to support emulation services? This picture is very discouraging. Someone is going to have to pay the cost of emulation. As Rhizome found when the Theresa Duncan CD-ROMs went viral, if the end user doesn't pay there can be budget crises. If the end user does pay, it's a significant barrier to use, and it starts looking like it is depriving the vendor of income. Some vendors might agree to cost-recovery charging. But typically there are multiple vendors involved. Consider emulating an old Autodesk on Windows environment. That is two vendors. Do they both agree to the principle of cost-recovery, and to the amount of cost-recovery?

    Update: On the fly, I pointed out the analogy between software preservation and e-journal preservation, the other area of preservation that handles expensive commercial content. Three approaches have emerged in e-journal preservation:
    • LOCKSS implements a model in which each purchaser preserves their own purchase. This does not look like it deprives the vendor of income.
    • CLOCKSS implements a model analogous to software escrow, in which software that is no longer available from the vendor is released from escrow. This does not look like it deprives the vendor of income.
    • Portico implements a subscription model, which could be seen as diverting income from the vendor to the archive. Critical to Portico's origin was agreement to this model by Elsevier, the dominant publisher. Other publishers then followed suit.
    This suggests that persuading dominant software publishers to accept a business model is critical.
