Wednesday, August 17, 2011

A Brief History of E-Journal Preservation

The workshop on the Future of Research Communication opened with a set of talks about the past, providing background for the workshop discussions. My talk covered the history of e-journal preservation, and the lessons that can be drawn from it. An edited text of the talk, with links to the sources, is below the fold.


If we were starting with a blank sheet of paper to design a mechanism for communicating about research, what would be our requirements? Here is my list:

  • Repeatability
  • Reusability
  • Immediacy
  • Transparency
  • Openness
  • Sustainability
  • Permanence
  • Authenticity
For nearly the last 13 years I've been working in the LOCKSS program at the Stanford Libraries on sustainability, permanence and authenticity in the context of archiving e-journals. In the next few minutes I'm going to review the history of e-journal archiving, to provide some background for the upcoming discussions. George Santayana said 'Those who cannot remember the past are condemned to repeat it', and I'm hoping this doesn't apply to research communication.

In May 1995 Stanford pioneered the transition of academic publishing from paper to the web when HighWire Press put the Journal of Biological Chemistry on-line. Two things rapidly became obvious:
  • The new medium had capabilities, such as links and search, that made it far more useful than paper, so that the transition would be rapid and complete.
  • The new medium forced libraries to switch from purchasing a copy of the material to which they subscribed and adding it to their collection, to leasing access to the publisher's copy and no longer building a collection.
Librarians had two concerns about leasing materials:
  • If they decided to cancel their subscription, they would lose access not just to future materials, but also to the materials for which they had paid. This problem is called post-cancellation access.
  • If the publisher stopped publishing the materials, future readers would lose access to them entirely. This problem is called preserving the record.
The LOCKSS program started as an attempt to solve both problems (PDF). It tried to restore the purchase model by building a system that would work for the Web the way libraries had always worked with paper. Libraries would continue to build collections containing copies of materials they purchased, held in LOCKSS boxes, the digital equivalent of the stacks. Legal permission for them to do so would be a simple clause added to the content on-line. A peer-to-peer network would allow LOCKSS boxes to collaborate to detect and repair any loss or damage to the content to which they subscribed, the digital analog of the way paper libraries collaborate. The focus was on minimizing the cost of ownership of the content.

While the LOCKSS program was building a prototype with NSF funding, the Mellon Foundation listened to the concerns of librarians and started an initial program at six major libraries to investigate solutions. The LOCKSS program was one effort; each of the other five libraries proposed to set up a centralized archive. Their idea was to obtain from the publishers and hold one backup copy of everything published. Libraries that subscribed to the archive would be given post-cancellation or post-publisher-failure access from the archive. With funding from the Mellon Foundation, the NSF and Sun Microsystems, the LOCKSS program went into production. The other efforts all failed to develop viable systems, because:
  • Libraries such as Harvard were reluctant to outsource a critical function to a competing library such as Yale. On the other hand, funders were reluctant to pay for more than one archive.
  • Publishers were reluctant to deliver their content to a library for it to make money by re-publishing the content to others. This made the contract negotiations necessary to obtain content from the publishers time-consuming and expensive.
  • The concept of a subscription archive was not a solution to the problem of post-cancellation access; it was merely a second instance of exactly the same problem.
The concept of a third-party subscription archive seemed dead until it was resurrected by Bill Bowen, previously President of Princeton University and by this point at the end of his term as head of the Mellon Foundation. He set up, and moved to become Chairman of the Board of, a not-for-profit organization called Ithaka. It included about $55M of the Mellon Foundation's cash, a new organization called Portico, a subscription archive of e-journals, and an "independent research" organization called Ithaka S+R that has in many cases functioned as a marketing arm, persuading libraries to outsource their collections. It is important to note that, just as with JSTOR, Portico's business model depends on charging for access to content, even open access content.

Bill Bowen appointed Kevin Guthrie, the board chair of JSTOR who has close personal connections with Elsevier, as Ithaka’s president. Elsevier was persuaded to deposit their content with Portico. Bill Bowen leveraged his close connections with University presidents and persuaded them to subscribe to Portico. Having the most expensive content, and thus the content for which librarians were most worried about post-cancellation access, and the imprimatur of major Universities and the Mellon Foundation, Portico rapidly became the solution librarians would never get fired for choosing.

Despite these advantages, Portico has failed to achieve economic sustainability on its own. As Bill Bowen said discussing the Blue Ribbon Task Force Report:
"it has been more challenging for Portico to build a sustainable model than parts of the report suggest."
Libraries proved unwilling to pay enough to cover its costs. It was folded into a single organization with JSTOR, in whose $50M+ annual cash flow Portico's losses could be buried.

One might think that preserving the record was the task of national libraries, which in most cases have the responsibility for copyright deposit. Unfortunately, the legal and technological barriers to them actually performing this task for web-published materials have in most cases proved so formidable that we are still waiting for it to happen. I believe it is still the case that the only national library currently accepting copyright deposit of e-journals (PDF) is the Dutch KB. It has no authority under Dutch law to do so, but has to negotiate the arrangements for deposit with the publishers. Libraries, such as the BL and the DNB, that have a legal mandate have found these negotiations difficult.

The big publishers were unhappy at the lack of progress and in any case preferred a system that they controlled to a system imposed on them by law. A group of big publishers and research libraries set up the not-for-profit CLOCKSS archive, probably at least partly in response to the delay in copyright deposit. CLOCKSS is funded and governed jointly by publishers and libraries. It uses LOCKSS boxes at about a dozen libraries world-wide to implement a dark archive. It holds full runs of publishers' content, to be re-published under Creative Commons licenses if they ever become unavailable from any other source. Examples of journals re-published in this way can be found at the CLOCKSS website.

Thanks to a matching grant from the Mellon Foundation, the LOCKSS program successfully transitioned to economic sustainability in 2007. It has remained sustainable since. It runs the Red Hat model of free, open-source software and payment for support. In addition it performs work on contract to CLOCKSS, operating the network of CLOCKSS boxes, and to the Library of Congress, developing technology for the NDIIPP. The LOCKSS program has no grant funding and no support from Stanford. Sustainability has been possible only by a consistent focus on driving costs out of the system, and by running a very lean organization. Nevertheless, it is hard to believe that the program would have survived had libraries not found LOCKSS boxes also useful for preserving a wide range of non-journal web-published content, including government documents, special collections, datasets and e-books.

What value do the funders of these efforts receive? Most taxpayers have yet to see value from digital copyright deposit of e-journals, which isn't yet happening. In fact, in some cases it has been so difficult for national libraries to implement this function that they have decided to outsource it to Portico.

The libraries paying for Portico have in only a few cases obtained post-cancellation access from Portico. It turns out that librarians were wrong; publishers prefer to deliver post-cancellation access from their own sites rather than have it delivered by a third party. In the Web world hits to your site are valuable even if they are not from subscribers, as we see with the New York Times' porous paywall. Publishers also prefer to deliver their content with their branding rather than, as Portico does, with Portico's branding substituted. Libraries do, however, get value from their Portico subscription. It provides them with four username/password pairs that give access to Portico's entire content, not just the content to which the library subscribed, for the purpose of "auditing" the service. If they don't use it too much, libraries can deliver subscription content to readers without paying the publisher. In effect, Portico leaks content, but only slowly.

I believe Ithaka regards Portico's losses as an investment in a future in which they will own the only database of all the world's academic content, which they can monetize as they currently do with JSTOR. Neither CLOCKSS, which only releases content no longer available from any publisher, nor national libraries, which typically have very restrictive conditions on access to copyright deposit materials (an example), would provide effective competition to this monopoly.

The value libraries receive from running LOCKSS boxes is both from post-cancellation access and from preserving the record. An early design decision was to make the operation of LOCKSS totally transparent to readers. If configured as originally designed, a library's LOCKSS box ensures that the library's readers continue to access preserved content at its original URL. A casual reader cannot tell that the LOCKSS box has done anything. Persuading libraries to pay for a system that delivered no visible value to end users was hard. Eventually, we sacrificed our user interface design principles and allowed LOCKSS boxes to re-publish their content at their own URLs, thus at least showing that the box existed and was doing something useful. A similar problem faces the excellent work of the Memento team.

Another problem is that the value of post-cancellation access dominates the value of preserving the record. What librarians want their LOCKSS box to collect is the expensive content from the major publishers. Precisely because it is expensive, it is not at risk of being lost. The content that is at risk of being lost from the record is from the smallest publishers or from government sources. The content is largely open access, and thus generates no post-cancellation access value. Although LOCKSS is being used to preserve open access content, such as government documents, funding this preservation is much harder. Digital preservation resources are thus massively mis-directed.

Some lessons we can draw from this history include:
  • Funding for the long-term preservation of content needs to be buried inside something that delivers value in the short term.
  • Brian Arthur's "increasing returns", aka "network effects", are so powerful that once one institution captures mind-share in a market niche it will become a monopoly in that niche.
  • Not-for-profit monopolies are even harder to dislodge than for-profit monopolies, since they are deemed to operate in the public interest and are thus effectively immune from anti-trust considerations.
  • Content is copyright. Copyright is broken. Broken copyright controls the design of systems to publish, preserve and manipulate content (PDF). Thus systems that publish, preserve or manipulate content are broken.
