Thursday, April 10, 2025

Cliff Lynch's festschrift

Vicky and I were invited to contribute to a festschrift celebrating Cliff Lynch's retirement from the Coalition for Networked Information. We decided to focus on his role in the long-running controversy over how digital information was to be preserved for the long haul.

Below the fold is our contribution, before it was copy-edited for portal: Libraries and the Academy.

Lots Of Cliff Keeps Stuff Safe

Vicky Reich
David S. H. Rosenthal

Abstract

A long time ago in a Web far, far away it is a period of civil war between two conceptions of how digital information could be preserved for posterity. On one side is the mighty Empire, concerned with the theoretical threat of format obsolescence. On the other are the Rebels, devoted to the practical problem of collecting the bits and ensuring that they survive. Among the rebels are the Internet Archive and the LOCKSS Program. This is the story of how the rebels won, thanks in no small part to Cliff Lynch's sustained focus on the big picture.

Thirty Years Ago

It all started just thirty years ago. In January 1995 Jeff Rothenberg's Scientific American article Ensuring the longevity of digital documents popularized the idea that the long-term survival of digital information was a significant problem. Rothenberg's concept of a "digital document" was of things like Microsoft Word files on a CD, individual objects encoded in a format private to a particular application. His concern was with format obsolescence: the idea that the rapid evolution of these applications would, over time, make it impossible to access the content of objects encoded in an obsolete format.

Rothenberg was concerned with interpreting the bits; he essentially assumed that the bits would survive. Given the bits, he identified two possible techniques for accessing the content:
  • Format migration: translating the content into a less obsolete format to be accessed by a different application.
  • Emulation: using a software implementation of the original computer's hardware to access the content using the same application.
Emulation was a well-established technique, dating from the early days of IBM computers.

The Web

But just five months later an event signalled that Rothenberg's concerns had been overtaken by events. Stanford pioneered the transition of academic publishing from paper to the Web when Vicky was part of the HighWire Press team that put the Journal of Biological Chemistry on the Web. By then it was clear that, going forward, the important information would be encoded in Web formats such as HTML and PDF. Because each format with which Rothenberg was concerned was defined by a single application, it could evolve quickly. But Web formats were open standards, implemented in multiple applications. In effect they were network protocols.

The deployment of IPv6, introduced in December 1995, shows that network protocols are extraordinarily difficult to evolve, because of the need for timely updates to many independent implementations. Format obsolescence implies backwards incompatibility; this is close to impossible in network protocols because it would partition the network. As David discussed in 2012's Formats Through Time, the first two decades of the Web showed that Web formats essentially don't go obsolete.

The rapid evolution of Rothenberg's "digital documents" had effectively stopped, because they were no longer being created and distributed in that way. Going forward, there would be a legacy of a static set of static documents in these formats. Libraries and archives would need tools for managing those they acquired, and eventually emulation, the technique Rothenberg favored, would provide them. But by then it turned out that, unless information was on the Web, almost no-one cared about it.

Integrity of Digital Information

Thus the problem for digital preservation was the survival of the bits, not of their format, aggravated by the vast scale of the content to be preserved. In May the following year Brewster Kahle established the Internet Archive to address the evanescence of Web pages. This evanescence takes two forms: link rot, when links no longer resolve, and content drift, when they resolve to different content.

This is where Cliff Lynch enters the story. As he did in many fields, he focused on the big picture. He understood the importance to the big digital preservation picture of simply collecting the content and ensuring its integrity. Already in 1994's The Integrity of Digital Information: Mechanics and Definitional Issues he had written:
A system of information distribution that preserves integrity should also provide the user with a reasonable expectation of correct attribution and source of works. Even if deliberate attempts at fraud, misdirection, or covert revision may sometimes slip through the routine processes of the system these problems can be adjudicated by a formal challenge and examination system ... The expectation should be that violations of integrity cannot be trivially accomplished.
And he had noted that even in the print world this expectation was fading:
We assume that print is difficult to alter, that print authorship and source attribution are relatively trustworthy, and that printed works are normally mass-produced in identical copies. In fact, current technology trends undermine these assumptions. Printed publications are becoming increasingly tailored to very narrow audiences, and it has become easy to imitate the format of well-known and professionally presented publications.
Lynch discussed how the survival of the bits could be confirmed using digital hashes, the potential for digital signatures to confirm authenticity, and why such signatures were not used in practice.
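As a concrete illustration of the first of those mechanisms, a fixity check can be as simple as recomputing a cryptographic hash of the stored bits and comparing it with a digest recorded at ingest. The sketch below is ours, not Lynch's; the file path, the recorded digest, and the choice of SHA-256 are illustrative assumptions.

```python
# Minimal sketch of a fixity check: recompute a file's digest and compare it
# with the digest recorded when the file was ingested. The path, the recorded
# digest, and the choice of SHA-256 are illustrative assumptions.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def bits_survived(path: Path, recorded_digest: str) -> bool:
    """True if the bits on disk still match the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest.lower()

if __name__ == "__main__":
    stored = Path("archive/example-article.pdf")  # hypothetical preserved file
    recorded = "0" * 64                           # placeholder ingest-time digest
    if stored.exists():
        print("intact" if bits_survived(stored, recorded) else "integrity check failed")
```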

LOCKSS

In October 1998 we proposed to Michael Keller, Stanford's Librarian, a decentralized system whereby libraries could cooperate to collect the academic journals to which they subscribed, and preserve them against the three threats we saw: technological, economic, and legal. He gave us three instructions:
  • Don't cost me any money.
  • Don't get me into trouble.
  • Do what you want.
Thus was the LOCKSS (Lots Of Copies Keep Stuff Safe) Program born. The prototype was funded first by two small grants from Michael Lesk at the NSF, and then by Donald Waters at the Mellon Foundation, both of whom, like Lynch, understood the importance of assuring the survival of, and access to, the bits. Development of the first production system was mostly funded by a significant grant from the NSF and by Sun Microsystems. We didn't cost Keller any money; quite the reverse, considering Stanford's overhead on grants!

The LOCKSS system, like the Internet Archive, was a system for ensuring the survival of, and access to, the bits in their original format. This was a problem; somehow, despite Rothenberg's advocacy of emulation, the conventional wisdom in the digital preservation community rapidly became that the sine qua non of digital preservation was defending against format obsolescence by using format migration based upon collecting preservation metadata.

Actually, the sine qua non of digital preservation is ensuring that the bits survive. Neither Kahle nor we saw any return on investing in preservation metadata or format migration. We both saw scaling up to capture more than a tiny fraction of the at-risk content as the goal. Future events showed we were right, but at the time the digital preservation community viewed LOCKSS with great skepticism, as "not real digital preservation".

Paper Library Analogy

In his 1994 paper Lynch had described how the paper world's equivalent of ensuring the bits survive works, in effect "Lots Of Copies Keep Stuff Safe":
When something is published in print, legitimate copies ... are widely distributed to various organizations, such as libraries, which maintain them as public record. These copies bear a publication date, and the publisher essentially authenticates the claims of authorship ... By examining this record, control of which is widely distributed ... it is possible, even years after publication, to determine who published a given work and when it was published. It is very hard to revise the published record, since this involves all of the copies and somehow altering or destroying them.
Compare this with how we summarized libraries' role in our first major paper on LOCKSS, Permanent Web Publishing:
Acquire lots of copies. Scatter them around the world so that it is easy to find some of them and hard to find all of them. Lend or copy your copies when other librarians need them.
From a system engineering viewpoint, we wrote:
Libraries' circulating collections form a model fault-tolerant distributed system. It is highly replicated, and exploits this to deliver a service that is far more reliable than any individual component. There is no single point of failure, no central control to be subverted. There is a low degree of policy coherence between the replicas, and thus low systemic risk. The desired behavior of the system as a whole emerges as the participants take actions in their own local interests and cooperate in ad-hoc, informal ways with other participants.

If librarians are to have confidence in an electronic system, it will help if the system works in a familiar way.

Threats

Lynch's focus on the big picture meant he also understood that economic and legal threats were at least as significant as technological ones. For example, in 1996's Integrity Issues in Electronic Publishing he wrote:
In the networked information environment, the act of publication is ill defined, as is the responsibility for retaining and providing long-term access to various "published" versions of a work. Because of the legal framework under which electronic information is typically distributed, matters are much worse than they are generally perceived to be. Even if the act of publication is defined and the responsibility for the retention of materials is clarified, the integrity of the record of published works is critically compromised by the legal constraints that typically accompany the dissemination of information in electronic formats.
He discussed some electronic journal pilots in a 1996 talk:
One key question Lynch identified was how acceptable transactional pricing systems would be to end users or to producers, suppliers, and rights holders. Would such models cause streams of income and expenditures to become unworkably erratic?
Now that there are two lawsuits from the copyright cartels aimed at destroying the Internet Archive, it is easy to understand that the most critical threats to preserved content are legal. A quarter-century ago this was less obvious in general. But even then, facing the oligopoly of academic publishers, it was obvious to us that LOCKSS had to be designed around copyright law.

Lynch continued to remind the library community of the economic and legal threats, and the broader issues impeding preservation of our digital heritage. Early examples include:
  • 1999's Experiential Documents and the Technologies of Remembrance:
    The retention, reuse, management, and control of this new cornucopia of recorded experience and synthesized content in the digital environment will, I expect, become a matter of great controversy. This will include, but not be limited to, privacy, accountability and intellectual property rights in their broadest senses. And these materials will hopefully become an essential and growing part of our library and archival collections in the 21st century - particularly as we sort through these controversies.
  • 1999's On the Threshold of Discontinuity: The New Genres of Scholarly Communication and the Role of the Research Library:
    It is unclear how to finance archiving and preservation of these materials. Their volume is no longer driven by acquisitions budgets or by the scholarly publishing system, but by activities that may take place largely beyond the control of the library. And, of course, costs are open ended and unpredictable for digital preservation, unlike the costs associated with preserving modern printed materials (on acid-free paper).
  • 2001's When documents deceive: Trust and provenance as new factors for information retrieval in a tangled web:
    Digital documents in a distributed environment may not behave consistently; because they are presented both to people who want to view them and software systems that want to index them by computer programs, they can be changed, perhaps radically, for each presentation. Each presentation can be tailored for a specific recipient.
  • 2003's The Coming Crisis in Preserving Our Digital Cultural Heritage:
    Preservation of digital materials is a continuous, active process (requiring steady funding), rather than a practice of benignly neglecting artifacts stored in a hospitable environment, perhaps punctuated by interventions every few decades for repairs.
    And:
    It is probably not an exaggeration to say that the most fundamental problem facing cultural heritage institutions is the ability to obtain digital materials together with sufficient legal rights to be able to preserve these materials and make them available to the public over the long term. Without explicit and affirmative permissions from the rights-holders, this is likely to be impossible.
    And:
    What is threatening us today is not an abuse of centralized power, but rather a low-key, haphazard deterioration of the intellectual and cultural record that is driven primarily by economic motivations and the largely unintended and unforeseen consequences of new intellectual property laws that were enacted at the behest of powerful commercial interests and in the context of new and rapidly evolving technologies.
There were many others.

The "Standard Model"

The LOCKSS team repeatedly made the case that preserving Web content was a different problem from preserving Rothenberg's digital documents, and thus that applying the entire apparatus of "preservation metadata", PREMIS, FITS, JHOVE, and format normalization to Web content was an ineffective waste of scarce resources. Despite this, the drumbeat of criticism that LOCKSS wasn't "real digital preservation" continued unabated.

After six years, the LOCKSS team lost patience and devoted the necessary effort to implement a capability they were sure would never be used in practice. The team implemented, demonstrated, and in 2005 published transparent, on-demand format migration of Web content preserved in the LOCKSS network. This was possible because the specification of the HTTP protocol that underlies the Web supports the format metadata needed to render Web content; if it lacked such metadata, Web browsers wouldn't be possible.
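To make the point concrete, here is a minimal sketch, ours rather than the LOCKSS implementation, of the decision such a system can make at dissemination time using only the format metadata HTTP already carries: the Content-Type stored with the preserved bits and the formats the requesting client accepts. The converter table and the format pairs are hypothetical, chosen purely for illustration.

```python
# Minimal sketch of the dissemination-time decision, keyed off the format
# metadata HTTP already carries: the Content-Type stored with the preserved
# bits and the formats the requesting client accepts. The converter table and
# the format pairs are hypothetical, chosen purely for illustration.

# (source type, target type) pairs for which a converter is assumed to exist.
CONVERTERS = {
    ("image/gif", "image/png"),
    ("text/sgml", "text/html"),
}

def dissemination_plan(stored_type: str, accepted: set) -> str:
    """Decide whether to serve the preserved bits as-is or migrate on the fly."""
    if stored_type in accepted:
        return "serve the original bits unchanged"
    for src, dst in CONVERTERS:
        if src == stored_type and dst in accepted:
            return f"migrate {src} to {dst} transparently, then serve"
    return "serve the original bits (no suitable converter registered)"

if __name__ == "__main__":
    # A client that cannot render GIF but accepts PNG would get a migrated copy.
    print(dissemination_plan("image/gif", {"image/png", "text/html"}))
```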

Unsurprisingly, this demonstration failed to silence the proponents of the "standard model of digital preservation". So after five more years David published Format Obsolescence: Assessing the Threat and the Defenses, a detailed exposition and critique of the standard model's components, which were:
  • Before obsolescence occurs, a digital format registry collects information about the target format, including a description of how content can be identified as being in the target format, and a specification of the target format from which a renderer can be created.
  • Based on this information, format identification and verification tools are enhanced to allow them to extract format metadata from content in the target format, including the use of the format and the extent to which the content adheres to the format specification. This metadata is preserved with the content.
  • The format registry regularly scans the computing environment to determine whether the formats it registers are obsolescent, and issues notifications.
  • Upon receiving these notifications, preservation systems review their format metadata to determine whether they hold content in an obsolescent format.
  • If they do, they commission an implementor to retrieve the relevant format specification from the format registry and use it to create a converter from the now-obsolescent target format to some less doomed format.
  • The preservation systems then use this converter and their format metadata to convert the preserved content into the less doomed format.
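For readers who have not met these tools, the sketch below gives the flavor of the identification step that tools such as JHOVE or FITS perform before format metadata is recorded: inspect the leading bytes of a file and map them to a format identifier. The signature table is a tiny, hypothetical subset, not a real format registry.

```python
# Illustrative sketch of format identification by "magic" bytes, the kind of
# check that precedes recording format metadata. The signature table is a
# tiny, hypothetical subset of what a real format registry would hold.
from pathlib import Path

SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF89a", "image/gif"),
    (b"GIF87a", "image/gif"),
]

def identify_format(path: Path) -> str:
    """Return a media type based on the file's leading bytes, or a generic
    fallback if no signature matches."""
    header = path.read_bytes()[:16]
    for magic, media_type in SIGNATURES:
        if header.startswith(magic):
            return media_type
    return "application/octet-stream"
```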
The critique included pointing out that creating a format specification for a proprietary format and then implementing a renderer from it was almost impossible, that the existence of open-source renderers made doing so redundant, that most HTML on the Web failed validation (a consequence of Postel's Law), that there were no examples of widely used formats going obsolete, and that Microsoft's small step in that direction in 2008 met with universal disdain and was abandoned. It also noted that:
the standard model is based on format migration, a technique of which Rothenberg’s article disapproves:
Finally, [format migration] suffers from a fatal flaw. ... Shifts of this kind make it difficult or impossible to translate old documents into new standard forms.
The critique was awarded the 2011 Outstanding Paper Award for Library Hi Tech, but again failed to silence the standard model's proponents. Although we no longer follow the digital preservation literature closely, it is our impression that over the intervening 15 years advocacy of the standard model has died down, thanks in no small part to Lynch's sustained focus on the big picture.