Tuesday, July 9, 2013

The Library of Congress' "Preserving.exe" meeting

In May the Library of Congress held a meeting entitled Preserving.exe: Toward a National Strategy for Preserving Software. I couldn't be there and I've only just managed to read the presentations and other materials. Three quick reactions below the fold.

First, Ben Balter of GitHub gave a presentation he called The Next Cultural Commons which (if my web search found the right one) is an excellent discussion of the advantages of using the open source methodology and tools in government. However, from the preservation point of view it seems to miss the point. Software, and in particular open source software, is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society.

The good news is that open source software is much easier to collect and preserve than all these other things:
  • There are no legal barriers to doing so. Open source software is guaranteed to carry with it a copyright license that explicitly allows all the actions necessary for ingest, preservation and dissemination. Other genres of content should be so lucky.
  • There are no technical barriers to doing so. Open source software is maintained and developed using source code control systems such as git, and stored in repositories such as GitHub. These repositories allow, indeed positively encourage, others to copy their contents either selectively or in bulk. An open source project would be delinquent if it did not use these facilities to maintain its own backup copy of its part of the repository. Other genres of content should be so lucky.
The open source software canon includes the essential infrastructure used by almost all efforts to preserve what is regarded as our digital heritage, the other things that are much harder to preserve. Examples include the Heritrix crawler, the DROID and PRONOM file format tools, the LOCKSS preservation system and Hadoop.
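
To illustrate how low the technical barrier to bulk copying is, here is a minimal sketch in Python of mirroring a single git repository. It assumes git is installed and on the PATH. The function names, the URL and the destination directory are all placeholders invented for this illustration, and a real collection effort would need much more (discovery of repositories, scheduling, integrity checks).

```python
import subprocess
from pathlib import Path


def mirror_command(repo_url, dest):
    """Command that creates a bare mirror: all branches, tags and other refs."""
    return ["git", "clone", "--mirror", repo_url, str(dest)]


def refresh_command(dest):
    """Command that updates an existing mirror and prunes refs deleted upstream."""
    return ["git", "-C", str(dest), "remote", "update", "--prune"]


def mirror(repo_url, dest):
    """Create the mirror if it is missing, otherwise refresh it in place."""
    dest = Path(dest)
    cmd = refresh_command(dest) if dest.exists() else mirror_command(repo_url, dest)
    subprocess.run(cmd, check=True)


# Hypothetical usage (placeholder URL and path, not a real project):
# mirror("https://example.org/project.git", "mirrors/project.git")
```

Running the same call again later refreshes the copy rather than re-cloning it, which is roughly what a project keeping its own backup would want.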

The Internet Archive, national libraries and other institutions are so keenly aware that digital content, especially on the Web, is evanescent that they devote considerable resources to collecting and preserving it. The bad news is that they appear to be so confident that open source repositories such as GitHub and SourceForge will persist "for the life of the republic" that they don't devote any resources to collecting and preserving the important cultural productions they contain, and in particular the very tools institutions use to preserve the rest of the digital cultural record. For example, the Wayback Machine contains the LOCKSS section of the SourceForge website, but only the surrounding content, neither the binaries nor the source code. This is not to downplay the importance of the Wayback Machine; crawling the website would not be the best way to collect these resources. But as far as I know no institution is systematically collecting and preserving the open source software canon, although I have been advocating that they should do so for many years.

In the meeting, the only reference to this issue I could find was this (PDF) from the National Endowment for the Humanities:
The history of software development is an area of potential – perhaps emergent – research interest in the context of the history of science and technology; primary source information on this topic could be of increasing value to computer historians.
Ya think? How on earth could a future historian of technology tell the story of our times without access to primary sources in terms of software?

Second, as far as I can tell, none of the presentations pointed out that the evolution of the language of the Web from HTML to JavaScript means that the Web is now fundamentally software. Since the Web is the medium for most of today's digital cultural heritage, this poses huge problems for our ability to preserve it. These problems are even worse than the problems the meeting discussed, since the Web as software is inherently distributed.

Third, the presentations don't appear to have addressed the important distinction between the two forms of software, source and binary. The issues around the preservation of the two forms are quite different. Indeed, most seemed simply to assume that source was unavailable.

In this context, I want to draw attention to an important blog post by Jos van den Oever on the subject of whether it is possible to link the binary back to the source form of the software:
I've been looking into how easy it is to confirm that a binary package corresponds to a source package. It turns out that it is not easy at all. ... I think that the topic of reproducible builds is one that is of fundamental importance to the free software and larger community; the trustworthiness of binaries based on source code is a topic quite neglected. ... What is not appreciated in sufficient measure is that parties can, quite unchecked, distribute binaries that do not correspond to the alleged source code. ... Can a person rely on binaries or should we all compile from source? I hope to raise awareness about the need for a reproducible way to create binaries from source code.
Jos ran an experiment building the tar utility from the source on Debian, Fedora and OpenSUSE, and comparing the result with the tar package in the distribution that built it. He concludes:
A cherished characteristic of computers is their deterministic behaviour: software gives the same result for the same input. This makes it possible, in theory, to build binary packages from source packages that are bit for bit identical to the published binary packages. In practice however, building a binary package results in a different file each time. This is mostly due to timestamps stored in the builds. In packages built on OpenSUSE and Fedora differences are seen that are harder to explain. They may be due to any number of differences in the build environment. If these can be eliminated, the builds will be more predictable. Binary package would need to contain a description of the environment in which they were built.
Compiling software is resource intensive and it is valuable to have someone compile software for you. Unless it is possible to verify that compiled software corresponds to the source code it claims to correspond to, one has to trust the service that compiles the software. Based on a test with a simple package, tar, there is hope that with relatively minor changes to the build tools it is possible to make bit perfect builds.
The comments add further suggestions for improving the repeatability of builds. This is encouraging news and one might hope that Linux distributions, and other open source projects, will pay attention.
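
The timestamp effect Jos describes can be reproduced in miniature. The sketch below is not his experiment and does not build real packages: it packs the same "source" into a tar archive twice with different embedded timestamps, then twice with the timestamp pinned, and compares SHA-256 digests. The file name, contents and timestamp values are made up for illustration, and the helper names are my own.

```python
import hashlib
import io
import tarfile


def build_archive(files, mtime):
    """Pack files (name -> bytes) into an uncompressed tar archive in memory,
    stamping every entry with the given modification time."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):  # fixed ordering, so order cannot vary between runs
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def digest(blob):
    """Hex SHA-256 digest, used here as the bit-for-bit comparison."""
    return hashlib.sha256(blob).hexdigest()


# Made-up input standing in for a source package.
source = {"hello.c": b"int main(void) { return 0; }\n"}

# Two "builds" a minute apart: identical input, different embedded timestamps.
first = build_archive(source, mtime=1373328000)
second = build_archive(source, mtime=1373328060)

# Two "builds" with the timestamp pinned to a fixed value.
pinned_a = build_archive(source, mtime=0)
pinned_b = build_archive(source, mtime=0)

print("unpinned builds identical:", digest(first) == digest(second))
print("pinned builds identical:  ", digest(pinned_a) == digest(pinned_b))
```

Real build systems have more sources of variation than this (paths, locale, tool versions), which is consistent with the harder-to-explain differences Jos saw on OpenSUSE and Fedora.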


Jason Scott said...

Wrong on the Internet Archive front, David. As the software guy over there, I'm overseeing an enormous amount of software, vintage, open source and the rest, that does not leave things on github and sourceforge and makes copies available in various versions for greater persistence.

The materials on the event are incomplete.

David. said...

Jason, I'm sure the Internet Archive contains a lot of software. What I am saying (perhaps not as clearly as I could have) is that as far as I can tell, neither the Internet Archive nor national libraries are systematically collecting and preserving the open source canon as represented in the open source repositories. If you know different, please point me to where it is being done.

The tools that these institutions use to collect and preserve the stuff they care about are to a large extent open source software hosted in these repositories. So if they go away, there will be significant problems.

I hope I am wrong, but the attitude seems to be that these repositories won't go away, so there's no problem. This isn't the attitude that these institutions take to other kinds of content. Why are the contents of the open source repositories different?

David. said...

As regards the software used in science, Victoria Stodden gave an interesting keynote at Open Repositories 2013 (PDF) on the policies and practices around sharing it, but judging by the slides didn't address its long-term preservation. Of course, aspects of sharing such as the copyright license are essential for preservation, so the easier the sharing the easier the preservation.

David. said...

I caught up with Jason Scott at Digital Preservation 2013 and I need to make a correction. Jason and the Internet Archive are mirroring github.com, sourceforge.net and soon kernel.org.

David. said...

Excellent piece by Matthew Kirschenbaum at Slate on this meeting.