Tuesday, September 8, 2015

Infrastructure for Emulation

I've been writing a report about emulation as a preservation strategy. Below the fold, a discussion of one of the ideas I've been thinking about as I write: the unique position national libraries are in to help build the infrastructure emulation needs to succeed.

Less and less of the digital content that forms our cultural heritage consists of static documents; more and more of it is dynamic. Static digital documents have traditionally been preserved by migration. Dynamic content is generally not amenable to migration and must instead be preserved by emulation.

Successful emulation requires that the entire software stack be preserved. Not just the bits the content creator generated, and over which the creator presumably has rights allowing preservation, but also the operating system, libraries, databases and services upon which the execution of those bits depends. The creator presumably has no preservation rights over this software, which is nevertheless necessary for the realization of their work. A creator wishing to ensure that future audiences can access their work has no legal way to do so. In fact, creators cannot even legally sell their work in any durably accessible form. They do not own an instance of the infrastructure upon which it depends; they merely have a (probably non-transferable) license to use an instance of it.
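
To make the point concrete: even running a single 1990s CD-ROM title under emulation needs far more than the creator's bits. A minimal sketch, assuming the open-source QEMU emulator is installed and using placeholder image names, might look like the following; only the .iso holds the creator's work, and everything else in the stack must also have been preserved:

    # Minimal sketch (placeholder image names) of one emulated environment.
    # The creator's bits are only the .iso; the emulator and the disk image
    # holding the OS and its libraries must also have been preserved.
    import subprocess

    subprocess.run([
        "qemu-system-i386",                  # the emulator itself
        "-m", "256",                         # memory for the emulated machine, in MB
        "-hda", "win9x-plus-libraries.img",  # placeholder: preserved OS + libraries
        "-cdrom", "title.iso",               # placeholder: the creator's own content
    ])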

Thus a key to future scholars' ability to access the cultural heritage of the present is that all these software components be collected, preserved, and made accessible now. One way to do this would be for some international organization to establish and operate a global archive of software. In an initiative called PERSIST, UNESCO is considering setting up such a Global Repository of software. The technical problems of doing so are manageable, but the legal and economic difficulties are formidable.

The intellectual property frameworks, primarily copyright and the contract law underlying End User License Agreements (EULAs), under which software is published differ from country to country. At least in the US, where much software originates, these frameworks make collecting, preserving and providing access to collections of software impossible except with the specific permission of every copyright holder. The situation in other countries is similar. International trade negotiations such as the TPP are being used by copyright interests to make these restrictions even more onerous.

For the hypothetical operator of the global software archive to identify the current holder of the copyright on every software component that should be archived, and negotiate permission with each of them for every country involved, would be enormously expensive. Research has shown that the resources devoted to current digital preservation efforts, such as those for e-journals, e-books and the Web, suffice to collect and preserve less than half of the material in their scope. Absent major additional funding, diverting resources from these existing efforts to fund the global software archive would be robbing Peter to pay Paul.

Worse, the fact that the global software archive would need to obtain permission before ingesting each publisher's software means that there would be significant delays before the collection could be formed, let alone become effective in supporting scholars' access.

An alternative approach worth considering would separate the issue of permission to collect from the issue of permission to provide access. Software is subject to copyright. In the paper world, many countries had copyright deposit legislation allowing their national library to acquire, preserve and provide access (generally restricted to readers physically at the library) to copyright material. Many countries, including most of the major software-producing countries, have passed legislation extending their national library's rights to the digital domain.

The result is that most of the relevant national libraries already have the right to acquire and preserve digital works, although not the right to provide unrestricted access to them. Many national libraries have collected digital works in physical form. For example, the German National Library's CD-ROM collection includes half a million items. Many national libraries are crawling the Web to ingest Web pages relevant to their collections.

It does not appear that national libraries are consistently exercising their right to acquire and preserve the software components needed to support future emulations, such as operating systems, libraries and databases. A simple change of policy by major national libraries could be effective immediately in ensuring that these components were archived. Each national library's collection could be accessed by emulations on-site. No time-consuming negotiations with publishers would be needed.

An initial step would be for national libraries to assess the set of software components that would be needed to provide the basis for emulating the digital artefacts already in their collections, which of them were already to hand, and what could be done to acquire the missing pieces. The German National Library is working on a project of this kind with the bwFLA team at the University of Freiburg, which will be presented at iPRES2015.
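
In outline such an assessment is a gap analysis: the union of the components the collection's artefacts depend on, minus what the library already holds. A toy sketch, with entirely hypothetical data:

    # Toy sketch (hypothetical data) of the gap analysis: which software
    # components do the collection's artefacts need, and which are missing?
    needs = {
        "cdrom-0001": {"MS-DOS 6.22", "SoundBlaster driver"},
        "cdrom-0002": {"Windows 95", "QuickTime 2.x"},
    }
    held = {"MS-DOS 6.22", "Windows 95"}      # components already in the collection

    required = set().union(*needs.values())   # everything any artefact depends on
    missing = required - held                 # the acquisition to-do list
    print("Still to acquire:", sorted(missing))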

The technical infrastructure needed to make these diverse national software collections accessible as a single homogeneous global software archive is already in place. Existing emulation frameworks access their software components via the Web, and the Memento protocol can aggregate disparate collections into what appears to be a single resource.
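
The Memento protocol (RFC 7089) already provides the plumbing for this: a client asks a TimeGate for a resource "as of" a given date, and an aggregator's TimeGate can answer from whichever archive holds a copy. A minimal sketch, using Python's requests library and purely hypothetical URLs for the aggregator and the software component:

    # Illustrative sketch: the aggregator URL and component URI are hypothetical.
    import requests

    component = "http://example.org/downloads/some-runtime-1.2.exe"
    timegate = "https://aggregator.example.org/timegate/"

    resp = requests.get(
        timegate + component,
        # ask for the component as it existed at (or near) this date
        headers={"Accept-Datetime": "Tue, 08 Sep 2015 00:00:00 GMT"},
        allow_redirects=True,  # the TimeGate redirects to the best-matching memento
    )
    print(resp.url)                              # the memento actually served
    print(resp.headers.get("Memento-Datetime"))  # when that copy was captured

The emulation framework neither knows nor cares which national library's collection answers the request; the aggregator resolves that.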

Of course, absent publisher agreements it would not be legal for national libraries to make their software collections accessible in this way. But negotiations about the terms of access could proceed in parallel with the growth of the collections. Global agreement would not be needed; national libraries could strike individual, country-specific agreements which would be enforced by their access control systems.

Incremental partial agreements would be valuable. For example, agreements allowing scholars at one national library to access preserved software components at another would reduce duplication of effort and storage without posing additional risk to publisher business models.

By breaking the link that makes building collections dependent on permission to provide access, by basing collections on the existing copyright deposit legislation, and by making success depend on the accumulation of partial, local agreements instead of a few comprehensive global agreements, this approach could cut the Gordian knot that has so far prevented the necessary infrastructure for emulation from being established.

1 comment:

David. said...

More than a year ago I wrote about the implications for digital preservation of the long-running struggle between the US Dept. of Justice and Microsoft about access to e-mails stored in Ireland. Today's arguments in the 2nd Circuit Court of Appeals emphasize that these implications also apply to a hypothetical global archive of software.

The Irish government has offered to cooperate with the US under the terms of their "Mutual Legal Assistance Treaty", but the US has not taken up the offer. This shows that the goal of the struggle is not to get timely access to the e-mails in question, but to establish a legal precedent that the US has jurisdiction over data anywhere in the world in the custody of companies with US operations.

Under whose jurisdiction would the global software archive be established? And how many other governments would claim jurisdiction over it?