Sunday, June 10, 2007

Why Preserve E-Journals? To Preserve The Record

In the previous post in this series I examined one of the two answers to the question "why preserve e-journals?". Libraries subscribing to e-journals want post-cancellation access for their readers to the content they purchased. Given that the problem only arises for subscription journals that don't operate a moving wall, that cancellation is a fairly rare event, and that canceled content can't have been that important anyway (or why was it canceled?), it is hard to see this reason for preservation justifying current levels of investment in e-journal preservation.

I will now turn to the other answer to the question: preserving the integrity of the record of scholarship. This is, or should be, a concern not of individual libraries, but of libraries working together in the interests of society as a whole.

To understand what is needed to preserve the integrity of the record of scholarship as it migrates to the digital world, we need to understand the threats to the integrity. The LOCKSS team set out our model of the threats to which digital content is subject in a paper in D-Lib. Briefly, they are:

  • Media Failure or loss.

  • Hardware Failure.

  • Software Failure.

  • Communication Errors.

  • Failure of Network Services.

  • Media and Hardware Obsolescence.

  • Software and Format Obsolescence.

  • Operator Error.

  • Natural Disaster.

  • External Attack.

  • Internal Attack.

  • Economic Failure.

  • Organizational Failure.

All digital content is subject to all these threats, but different threats affect different content differently. The major publishers have been in business a long time, most well over a century and Oxford University Press well over 5 centuries. They understand that their content is their core business asset, and are highly motivated to ensure that these threats don't destroy its value. Irrespective of external efforts, they already devote significant resources (which ultimately come from their subscribers, the libraries) to preserving these assets. E-journals in their custody are not likely to die of media, hardware, software or format obsolescence or failure. Smaller publishers, whether for-profit or not, as long as they survive the consolidation of the publishing market, will also understand that both their chances of survival and their value as an acquisition depend on maximizing the value of their content, and will be motivated to preserve it.

Yet much of the external money being invested in preserving e-journals targets precisely this low-risk content, from the largest to the medium-sized publishers. The first question librarians ask of an e-journal preservation system is "do you have Elsevier content?" Elsevier has been in business since 1880, longer than many of the libraries asking the question. Their content would not be at significant risk even if copies of it were not, as they actually are, already on-site at about a dozen libraries around the world. This investment is merely adding to the already substantial existing investment from the libraries via the publishers in preserving their content. Are these sensible investment policies?

The content that is at the most risk from these threats is that of the smallest publishers. In the humanities, this is the content that is being preserved by the LOCKSS Humanities Project. This is the content that, were it on paper, research libraries would be collecting to provide future scholars with the material they need to study the culture of our times. The publishers and their content are visibly at risk; among those initially selected by a group of humanities collection specialists at major research libraries in 2002, several disappeared from the Web before the project could collect them, and others have disappeared since. For example, World Haiku Review's content was lost by the journal's hosting platform and had to be recovered from the LOCKSS system.

The e-journals from these tiny publishers are almost all open access. Libraries have no financial incentive, in the form of concern about post-cancellation access, to preserve them. Some of this content is not being collected by the Internet Archive (e.g. Princeton Report on Knowledge) and some is, although the Archive's collection and preservation (pdf) is somewhat haphazard. In many fields communication has moved almost completely to the Web. Poetry, dance, media criticism and many other areas find the Web, with its low cost, multimedia capability and rapid, informal exchanges a much more congenial forum than print. There seem to be three main reasons why research libraries are no longer collecting the record of these fields:

  • Faculty have little awareness of the evanescent nature of Web content.

  • Thus there is little demand on the collections specialists, whose skills have been eroded as the "big deal" lease licenses take over.

  • The "big deal" licenses have consumed more and more of the budget.

Both the LOCKSS system and the Internet Archive's Heritrix crawler provide tools libraries could use, if they wanted to. The provisions of the DMCA mean that more systematic collection and preservation than the Internet Archive can manage requires either a Creative Commons license or specific permission from the publisher. Awareness and use of Creative Commons licenses is limited in these areas. Experience in the LOCKSS Humanities Project has shown that, although permission is always granted, the process of tracking down the publisher, having the necessary conversation, and getting them to add the permission statement to their web site is time-consuming. A small fraction of the resources going into preserving low-risk e-journals from large publishers could preserve a huge amount of high-risk content for future scholars, by encouraging and supporting librarians in using the available tools.

And yet, simply trusting even the biggest publishers to preserve their own content is not a responsible approach. To make this clear, I'll start by looking at the problem of preserving the integrity of the record in a different, easier to explain area, that of US federal government documents.

Since 1813 federal documents have been printed by an executive branch agency, the Government Printing Office (GPO), and distributed to a network of over 1,000 libraries around the US under the Federal Depository Library Program (FDLP). These are university, state and public libraries, and each collects a subset of the documents matching the interests of its readers. The documents remain the property of the US government, but copies may be made as they are not copyrighted. Under various reciprocal arrangements, many documents are also supplied to the national libraries of other countries.

The goal of the FDLP was to provide citizens with ready access to their government's information. But, even though this wasn't the FDLP's primary purpose, it provided a remarkably effective preservation system. It created a large number of copies of the material to be preserved; the more important the material, the more copies. These copies were on low-cost, durable, write-once, tamper-evident media. They were stored in a large number of independently administered repositories, some in different jurisdictions. They were indexed in such a way that it is easy to find some of the copies, but hard to be sure that you have found them all.

Preserved in this way, the information was protected from most of the threats to which stored information is subject. The FDLP's massive degree of replication protected against media decay, fire, flood, earthquake, and so on. The independent administration of the repositories protected against human error, incompetence and many types of process failures. But, perhaps most important, the system made the record tamper evident.

Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception. Government documents are routinely recalled from the FDLP, and some are re-issued after alteration.

An illustration is Volume XXVI of Foreign Relations of the United States, the official history of the US State Department. It covers Indonesia, Malaysia, Singapore and the Philippines between 1964 and 1968. It was completed in 1997 and underwent a 4-year review process. Shortly after publication in 2001, the fact that it included official admissions of US complicity in the murder of at least 100,000 Indonesian "communists" by Suharto's forces became an embarrassment, and the CIA attempted to prevent distribution. This effort became public, and was thwarted when the incriminating material was leaked to the National Security Archive and others.

The important property of the FDLP is that in order to suppress or edit the record of government documents, the administration of the day has to write letters, or send US Marshals, to a large number of libraries around the country. It is hard to do this without attracting attention, as happened with Volume XXVI. Attracting attention to the fact that you are attempting to suppress or re-write history is self-defeating. This deters most attempts to do it, and raises the bar of desperation needed to try. It also ensures that, without really extraordinary precautions, even if an attempt succeeds it will not do so without trace. That is what tamper-evident means. It is almost impossible to make the record tamper-proof against the government in power, but the paper FDLP was a very good implementation of a tamper-evident record.

It should have become evident by now that I am using the past tense when describing the FDLP. The program is ending and being replaced by FDSys. This is in effect a single huge web server run by the GPO on which all government documents will be published. The argument is that through the Web citizens have much better and more immediate access to government information than through an FDLP library. That's true, but FDSys is also Winston Smith's dream machine, providing a point-and-click interface to instant history suppression and re-writing.

It isn't just official government documents with which governments may have problems. In recent testimony (pdf) to the House Committee on Oversight and Government Reform entitled "Political Interference with Government Climate Change Science", NASA climate scientist James Hansen described the lengths to which the Bush administration was prepared to go in its attempts to suppress or adjust the speech and writings of scientists. These included suppression or editing of the testimony of individuals to Congress, press releases, conference presentations, press interviews and web postings. The administration also used budgetary and public pressure (pdf) to persuade scientists to self-censor papers before submission. Although we don't have evidence that the government has changed journal papers after they have been published, it must be likely that if this administration thought they could get away with it they would be tempted to do so.

Just as it seems imprudent for all copies of each element of the record of government information to be in the custody of a single government agency, it seems imprudent for all copies of each element of the record of scholarship to be in the custody of a single publisher. Governments are far from alone in potentially being tempted to tamper with the record. Consider drug and other companies involved in patent and other lawsuits, for example. Arcane details of the published record can be essential evidence in these cases. Just as it would be easy for the government to change information on FDSys, a very small percentage of the lawyers' fees, let alone the potential settlement, would be more than enough to bribe or coerce the system administrators of the publisher's web platform to correct "flaws" in the record. As these systems are currently implemented, the probability that the change would not be detected is overwhelming. Post-publication changes to e-journal content are already routine, typically to correct errors such as typos, so a malicious change would not stand out.

What lessons can we take from the FDLP and the paper library system that has protected the record of scholarship until recently?

The key architectural features we need to copy from paper systems to the digital systems are massive replication in many independent repositories that do not trust each other and are implemented and operated transparently. The one architectural feature that is needed in the digital system that the paper system lacks is mutual audit among the replicas. This is necessary because, unlike paper, digital content is stored on media that is easily and tracelessly rewritable, and because technology exists (pdf) that could potentially re-write many of the on-line copies in a short period of time.
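To make the idea of mutual audit concrete, here is a minimal sketch in Python. It is not the LOCKSS polling protocol, only an illustration under simplifying assumptions: every replica honestly reports a SHA-256 digest of its copy, and a strict majority of replicas is undamaged. The function name `audit` and the replica names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return the SHA-256 digest of one replica's copy as a hex string."""
    return hashlib.sha256(content).hexdigest()


def audit(replicas: dict[str, bytes]) -> dict[str, bool]:
    """Report, for each replica, whether its copy matches the majority digest.

    Assumption: replicas report their digests honestly. A real system would
    also have to cope with replicas that lie about what they hold.
    """
    digests = {name: digest(data) for name, data in replicas.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        raise ValueError("no majority: the replicas disagree too much to audit")
    return {name: d == majority_digest for name, d in digests.items()}


# Hypothetical example: three independent copies agree, one has been altered.
replicas = {
    "library_a": b"Volume XXVI, original text",
    "library_b": b"Volume XXVI, original text",
    "library_c": b"Volume XXVI, original text",
    "publisher": b"Volume XXVI, quietly edited text",
}
result = audit(replicas)
for name, agrees in result.items():
    print(name, "agrees" if agrees else "DIFFERS from the majority")
```

The point of the sketch is that a change to one copy becomes visible as soon as the copies are compared, which is the digital analogue of tamper evidence. With a single copy there is nothing to compare against.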

To sum up both this and the preceding post, investment in preserving e-journals is going primarily into establishing a very few, centralized, monolithic "trusted third party" repositories, implemented using expensive, proprietary, enterprise-scale technology and incurring high staff costs. These expensive facilities, quite naturally, focus on preserving premium, high-cost e-journals. Doing so seems natural, because a single negotiation with a large publisher brings a vast amount of content, which makes the repository look attractive to the librarians whose serials budget must in the end bear the cost. Although the LOCKSS system's distributed, peer-to-peer architecture has the features needed, market pressures mean that it too is being targeted at low-risk content.

This is a poor use of resources. The investment in technology and staff replicates investments already made by the publishers and paid for, via subscription costs, by the same librarians who are now making it. The content being preserved is at very low risk of loss through accident or incompetence. It is even at low risk of cancellation, since these publishers use the "big deal" bundled lease approach to make cancellation extremely painful. The investments are not effective at preventing the record of scholarship being tampered with, since the "trusted third party" architecture lacks the essential tamper-evident features of the paper system. The investments are not even particularly effective at ensuring post-cancellation access, since a significant proportion of publishers won't allow the repositories to provide it, not to mention the untested legal and operational obstacles in the path of third party repositories providing such access. Meanwhile, the content future scholars will need that is actually at serious risk of loss through accident, incompetence or economic failure is not being effectively collected or preserved.

Why Preserve E-Journals? Post-Cancellation Access

Much of the investment currently going into digital preservation concentrates on preserving e-journals. The LOCKSS technology was originally developed in response to librarians' concerns about e-journals. It is now being used for a range of other content types, but the bulk of the system's use is for e-journals, both in the worldwide LOCKSS network and in the CLOCKSS program. Other efforts target e-journals specifically, such as Ex Libris' JOS (pdf), Portico, the Koninklijke Bibliotheek's e-Depot, the British Library, and others.

The reasons why e-journals became the target of choice include history, economics and technical convenience. In this post I will analyze these reasons in the light of what is now almost a decade of experience, and argue that they make less sense than they should.

There are two main answers to the question "why preserve e-journals?":

  • Post-cancellation access to subscription material.

  • Maintaining the integrity of the record of scholarship.

In this post I'll look at post-cancellation access. I'll return to the problem of maintaining the integrity of the record in a subsequent post.

Many libraries' interest in preserving e-journals arose when it became obvious that a side-effect of the transition of academic publishing to the Web was to change what the libraries were buying with their subscription dollars.

In the paper world the library purchased a physical copy of the content. Their readers' continued access to the content did not depend on continuing subscription payments, only on the library's decision whether or not to de-accession it. In the Web world, the library's subscription leased access to the publisher's copy of the content. Their readers' continued access to the content is perpetually at the mercy of the publisher's pricing policy.

This uncertainty didn't make the librarians happy, and since they write the checks that keep the publishers churning out content, they had various ways to communicate their unhappiness to the publishers. The first, immediate response was to insist on receiving paper copies as well as web access. It rapidly became obvious that this wasn't an acceptable solution to anyone. The libraries' readers rapidly found that they were vastly more productive working with web content. On-line use greatly outpaced use of paper. The publishers soon realized, not just that their readers preferred the Web, but more importantly that it was much cheaper to publish on the Web than on paper. Could librarians be persuaded to accept electronic-only publishing, while maintaining the same subscription pricing?

The major impediment to this tempting prospect was the librarians' insecurity about future access to the content to which they subscribed. Even publishers' promises that they would provide ex-subscribers free access to the content they had paid for weren't convincing; librarians were rightly skeptical of promises whose costs weren't covered.

Two broad approaches to post-cancellation access have been tried. One is to restore the paper model by preserving local copies, in which libraries pay to receive a copy of the content which they can keep and use to satisfy future requests for access. The other is to devise and implement an escrow service: a third party which receives a copy of the content from the publisher and which, subject to the agreement of the publisher, can provide ex-subscribers with access to it.

After about a decade of concern about post-cancellation access, we have a small number of partial solutions to the problem. Some are local copy solutions and have been in production use for some years. A few libraries including University of Toronto (.ppt) and Los Alamos use a commercial system from Elsevier (now sold to Ex Libris) called JOS (Journals On Site) to preserve local copies of journals from Elsevier and some other major publishers. About 200 libraries use the LOCKSS system to preserve content from a wide range of publishers, largely disjoint from those in JOS. Others are escrow services, including some copyright deposit schemes at national libraries, and the Portico system. None has yet achieved anything approaching full coverage of the field; all are at a nascent stage. None is routinely providing readers with post-cancellation access.

As time has gone by with no simple, affordable system for post-cancellation access that fits all publishers and libraries, the world has changed.

First, paper journals are no longer the version of record; for many of the most cited, highest impact journals the version delivered over the network has more information and more highly valued functions. The paper version is incomplete.

Second, the various ways publishers have tried to deliver physical copies of e-journal content, for example on CD-ROM, have proved to be so much trouble to deal with that they have been discredited as a means of post-cancellation access.

Third, the continual increase in subscription costs and the availability of cheap Web publishing platforms is driving a movement for open access to scholarship. It isn't certain that this will continue, and the effect varies greatly from field to field, but to the extent to which the trend continues it again reduces the importance of a solution to post-cancellation access. There is no subscription to cancel.

Fourth, the pressure for open access to the scientific literature has led many subscription journals to adopt a moving wall. Access to the content is restricted to subscribers for a period of time after it is first published, ranging from a few months to five years. After that, access is opened to anyone. The idea is that researchers active in a field will need immediate access to new content, and will justify the subscription to their librarians. Thus librarians will believe that, when their future readers want access, the moving wall will still be in effect to satisfy them. Thus they will be satisfied with a Web-only subscription.

Fifth, some other publishers have decided that charging for their back content on a pay-per-view basis is an important revenue source. These publishers are unlikely to participate in any solution for post-cancellation access.

Sixth, big publishers increasingly treat their content not as separate journals but as a single massive database. Subscription buys access to the whole database. If a library cancels their subscription, they lose access to the whole database. This bundling, or "big deal", leverages a small number of must-have journals to ensure that cancellation of even low-value journals, the vast majority in the bundle, is very unlikely. It is more expensive to subscribe individually to the few high-value journals than to take the "big deal". Thus cancellation of large publisher journals is a low risk, which is the goal of the "big deal" scheme.

Publishers who charge for back content typically do not allow their journals to be preserved using the LOCKSS system. They may provide their content to the nascent schemes for electronic copyright deposit at national libraries, but under very restrictive terms for access. For example, the Koninklijke Bibliotheek and the forthcoming British Library schemes both provide full access only to readers physically at the library itself; others get no or minimal access. National libraries are not a realistic solution to providing post-cancellation access to readers at subscribing libraries. Again, although these publishers may deposit content in the Portico system they're unlikely to sign the:
"rider to the agreement that a participating publisher signs if they choose to name Portico as a mechanism to fill post-cancellation access claims submitted by participating libraries." (emphasis added)

The rider in question is as follows:
"Perpetual Access. [Publisher] agrees that Portico shall provide access to the [content] to [Publisher]'s former purchasers or subscribers. Participating [Libraries] may submit perpetual access claims to Portico by certifying, either electronically or in writing, that they were a purchaser or subscriber to [the content] to which they are seeking access. ... Portico may Deliver the requested [content] if [publisher] has not notified Portico and the [library] of its objection ... in writing within thirty (30) days."

Thus for each library and each publisher, post-cancellation access is subject to the agreement of the publisher after the subscription has been canceled. Despite having a current subscription to Portico after they cancel their subscription to the publisher's content, and despite the publisher's having signed the rider, libraries can't be fully confident of receiving post-cancellation access. For example, suppose that a publisher signs the rider and is then sold to another that regards charging for post-cancellation access as important to its business model. The new owner could simply institute a policy of objecting to all perpetual access claims.

About 1/3 of Portico's publishers currently have not signed the rider. The only access a library obtains to their content is described here:
"The participating Library may designate up to four staff members per campus or system branch that will be provided password protected full access to the Portico archive for verification and testing purposes only." (emphasis added)

It is clear that a scrupulous library cannot look on Portico as a universal, robust solution for post-cancellation access.

There are two fundamental contradictions in the attempt to solve the problem of access to content after a subscription to a service (the publisher) is canceled by subscribing to a service (the preservation system) which prevents access after its subscription is canceled. First, it is not a solution, it is another instance of the same problem. Second, to the extent to which the subscription to the second service is regarded as insurance, it suffers from the same moral hazard as someone who takes out fire insurance then burns down the building himself. Insurance is being purchased against the direct consequences of voluntary actions by the insured. In other areas claims against such policies are treated as insurance fraud.

So we see that no matter how ingenious the proponents of digital preservation for e-journals, there is no realistic prospect of a single solution that provides post-cancellation access for 100% of subscription content the way that paper did. Generally speaking, the smaller publishers will be more likely to allow one or more preservation systems to provide post-cancellation access, and the larger for-profit publishers will be less likely. There will always be some level of uncertainty as to whether access will actually be available when it is needed.

The following post looks at the second reason for preserving e-journals, maintaining the integrity of the record of scholarship.

Saturday, June 9, 2007

A Petabyte For A Century

In a talk at the San Diego Supercomputer Center in September 2006 I started arguing (pdf) that one of the big problems in digital preservation is that we don't know how to measure how well we are doing it, which makes it difficult to improve. Because supercomputer people like large numbers, I started using the example of keeping a petabyte of data for a century to illustrate the problem. This post expands on my argument.

Let's start by assuming an organization has a petabyte of data that will be needed in 100 years. They want to buy a preservation system good enough that there will be a 50% chance that at the end of the 100 years every bit in the petabyte will have survived undamaged. This requirement sounds reasonable, but it is actually very challenging. They want 0.8 exabit-years of preservation with a 50% chance of success. Suppose the system they want to buy suffered from bit rot, a process that had a very small probability of flipping a bit at random. By analogy with the radioactive decay of atoms, they need the half-life of bits in the system to be at least 0.8 exa-years, or roughly 100,000,000 times the age of the universe.
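The half-life figure can be checked with a few lines of Python. This is a rough sketch under the same simplifying model as the text: each bit decays independently, like a radioactive atom, so a bit survives t years with probability 2^(-t/T) and all N bits survive with probability 2^(-N·t/T). The age of the universe used here, about 1.38 × 10^10 years, is my assumption and not a figure from the post.

```python
import math

PETABYTE_BITS = 8 * 10**15        # 1 petabyte = 10^15 bytes = 8 * 10^15 bits
YEARS = 100
AGE_OF_UNIVERSE_YEARS = 1.38e10   # assumption: approximate current estimate


def required_half_life_years(bits: int, years: float, p_all_survive: float = 0.5) -> float:
    """Half-life T such that every one of `bits` survives `years` with probability p.

    All bits survive with probability 2**(-bits * years / T); solving for T gives
    T = bits * years * ln(2) / -ln(p).
    """
    return bits * years * math.log(2) / -math.log(p_all_survive)


half_life = required_half_life_years(PETABYTE_BITS, YEARS)
print(f"bit-years to preserve: {PETABYTE_BITS * YEARS:.1e}")
print(f"required half-life:    {half_life:.1e} years")
print(f"ages of the universe:  {half_life / AGE_OF_UNIVERSE_YEARS:.1e}")
```

With a 50% target the required half-life is simply the number of bit-years, 8 × 10^17 years. Under the assumed age of the universe that is about 6 × 10^7 universe lifetimes, the same order of magnitude as the rounded figure in the text.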

In order to be confident that they are spending money wisely, the organization commissions an independent test lab to benchmark the competing preservation systems. The goal is to measure the half-life of bits in each system to see whether it meets the 0.8 exa-year target. The contract for the testing specifies that results are needed in a year. What does the test lab have to do?

The lab needs to assemble a big enough test system so that, if the half-life is exactly 0.8 exa-year, it will see enough bit flips to be confident that the measurement is good. Say it needs to see 5 bit flips or fewer to claim that the half-life is long enough. Then the lab needs to test an exabyte of data for a year.
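As a sanity check on the size of the test system, the sketch below estimates how many bit flips an exabyte would show in one year if the half-life were exactly 0.8 exa-years. It assumes the same independent-decay model as above and treats the flip count as Poisson-distributed, which is my simplification rather than anything stated in the post.

```python
import math

HALF_LIFE_YEARS = 8e17        # the 0.8 exa-year target from the text
EXABYTE_BITS = 8 * 10**18     # 1 exabyte = 10^18 bytes = 8 * 10^18 bits


def expected_flips(bits: int, years: float, half_life_years: float) -> float:
    """Expected number of bits that decay within `years`.

    expm1 is used because the per-bit probability is far too small to
    survive a naive `1 - 2**(-t/T)` in double precision.
    """
    p_flip = -math.expm1(-math.log(2) * years / half_life_years)
    return bits * p_flip


def prob_at_most(k: int, mean: float) -> float:
    """Poisson probability of observing k or fewer events (assumed model)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


mean_flips = expected_flips(EXABYTE_BITS, 1, HALF_LIFE_YEARS)
print(f"expected flips in an exabyte-year: {mean_flips:.2f}")
print(f"chance of seeing 5 or fewer:       {prob_at_most(5, mean_flips):.2f}")
```

The expected count comes out at roughly 7 flips per exabyte-year, so an exabyte tested for a year is about the smallest experiment in which a threshold of a handful of flips means anything.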

The test consists of writing an exabyte of data into the system at the start of the year and reading it back several times, let's say 9 times, during the year to compare the bits that come out with the bits that went in. So we have 80 exabits of I/O to do in one year, or roughly 10 petabits/hour, which is an I/O rate of about 3 terabits/sec. That is 3,000 gigabit Ethernet interfaces running at full speed continuously for the whole year.
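The I/O arithmetic can be reproduced as follows. This is only a back-of-the-envelope sketch: it assumes a 365-day year and that each gigabit Ethernet interface sustains exactly 10^9 bits per second with no overhead.

```python
EXABYTE_BITS = 8 * 10**18      # 1 exabyte = 10^18 bytes = 8 * 10^18 bits
PASSES = 1 + 9                 # one initial write plus nine read-backs
HOURS_PER_YEAR = 365 * 24      # assumption: a 365-day year
GIGABIT_PER_SECOND = 1e9       # assumption: ideal gigabit Ethernet throughput

total_bits = EXABYTE_BITS * PASSES
bits_per_hour = total_bits / HOURS_PER_YEAR
bits_per_second = bits_per_hour / 3600
links_needed = bits_per_second / GIGABIT_PER_SECOND

print(f"total I/O:       {total_bits / 1e18:.0f} exabits")
print(f"per hour:        {bits_per_hour / 1e15:.1f} petabits")
print(f"per second:      {bits_per_second / 1e12:.1f} terabits")
print(f"gigabit links:   {links_needed:.0f}")
```

The exact figures are a little over 9 petabits/hour and about 2.5 terabits/sec, which round to the "roughly 10" and "about 3" used in the text.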

At current storage prices just the storage for the test system will cost hundreds of millions of dollars. When you add on the cost of the equipment to sustain the I/O and do the comparisons, and the cost of the software, staff, power and so on, it's clear that the test to discover whether a system would be good enough to keep a petabyte of data for a century with a 50% chance of success would cost in the billion-dollar range. This is of the order of 1,000 times the purchase price of the system, so the test isn't feasible.

I'm not an expert on experimental design, and this is obviously a somewhat simplistic thought-experiment. But, suppose that the purchasing organization was prepared to spend 1% of the purchase price per system on such a test. The test would then have to cost roughly 100,000 times less than my thought-experiment to be affordable. I leave this 100,000-fold improvement as an exercise for the reader.