I will now turn to the other answer to the question: preserving the integrity of the record of scholarship. This is, or should be, a concern not of individual libraries, but of libraries working together in the interests of society as a whole.
To understand what is needed to preserve the integrity of the record of scholarship as it migrates to the digital world, we need to understand the threats to that integrity. In a paper in D-Lib, the LOCKSS team set out our model of the threats to which digital content is subject. Briefly, they are:
- Media Failure or Loss.
- Hardware Failure.
- Software Failure.
- Communication Errors.
- Failure of Network Services.
- Media and Hardware Obsolescence.
- Software and Format Obsolescence.
- Operator Error.
- Natural Disaster.
- External Attack.
- Internal Attack.
- Economic Failure.
- Organizational Failure.
All digital content is subject to all these threats, but different threats affect different content differently. The major publishers have been in business a long time, most well over a century, and Oxford University Press well over five centuries. They understand that their content is their core business asset, and are highly motivated to ensure that these threats don't destroy its value. Irrespective of external efforts, they already devote significant resources (which ultimately come from their subscribers, the libraries) to preserving these assets. E-journals in their custody are not likely to die of media, hardware, software or format obsolescence or failure. Smaller publishers, whether for-profit or not, as long as they survive the consolidation of the publishing market, will also understand that both their chances of survival and their value as an acquisition depend on maximizing the value of their content, and will be motivated to preserve it.
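The threat list reads naturally as a small data structure. Purely as an illustration (the taxonomy is from the D-Lib paper above, but the mapping from publisher class to dominant threats is my own rough characterization, not data from the paper), a sketch in Python:

```python
from enum import Enum, auto

class Threat(Enum):
    """The threat taxonomy from the LOCKSS team's D-Lib paper."""
    MEDIA_FAILURE = auto()
    HARDWARE_FAILURE = auto()
    SOFTWARE_FAILURE = auto()
    COMMUNICATION_ERRORS = auto()
    NETWORK_SERVICE_FAILURE = auto()
    MEDIA_HARDWARE_OBSOLESCENCE = auto()
    SOFTWARE_FORMAT_OBSOLESCENCE = auto()
    OPERATOR_ERROR = auto()
    NATURAL_DISASTER = auto()
    EXTERNAL_ATTACK = auto()
    INTERNAL_ATTACK = auto()
    ECONOMIC_FAILURE = auto()
    ORGANIZATIONAL_FAILURE = auto()

# Rough, illustrative risk profiles.  A large publisher's business
# depends on its content surviving the technical threats; what
# remains are deliberate attacks.  For a tiny publisher the
# existential threats are economic and organizational.
DOMINANT_THREATS = {
    "large publisher": {Threat.INTERNAL_ATTACK,
                        Threat.EXTERNAL_ATTACK},
    "tiny publisher":  {Threat.ECONOMIC_FAILURE,
                        Threat.ORGANIZATIONAL_FAILURE,
                        Threat.OPERATOR_ERROR},
}
```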
Yet much of the external money being invested in preserving e-journals targets precisely this low-risk content, from the largest to the medium-sized publishers. The first question librarians ask of an e-journal preservation system is "do you have Elsevier content?" Elsevier has been in business since 1880, longer than many of the libraries asking the question. Their content would not be at significant risk even if copies of it were not already on-site at about a dozen libraries around the world, as in fact they are. This investment merely adds to the already substantial investment the libraries make, via the publishers, in preserving this content. Are these sensible investment policies?
The content most at risk from these threats is that of the smallest publishers. In the humanities, this is the content being preserved by the LOCKSS Humanities Project. This is the content that, were it on paper, research libraries would be collecting to provide future scholars with the material they need to study the culture of our times. The publishers and their content are visibly at risk; among those initially selected by a group of humanities collection specialists at major research libraries in 2002, several disappeared from the Web before the project could collect them, and others have disappeared since. For example, World Haiku Review's content was lost by the journal's hosting platform and had to be recovered from the LOCKSS system.
The e-journals from these tiny publishers are almost all open access. Libraries have no financial incentive, in the form of concern about post-cancellation access, to preserve them. Some of this content is not being collected by the Internet Archive (e.g. the Princeton Report on Knowledge) and some is, although the Archive's collection and preservation (pdf) are somewhat haphazard. In many fields communication has moved almost completely to the Web. Poetry, dance, media criticism and many other areas find the Web, with its low cost, multimedia capability and rapid, informal exchanges, a much more congenial forum than print. There seem to be three main reasons why research libraries are no longer collecting the record of these fields:
- Faculty have little awareness of the evanescent nature of Web content.
- Thus there is little demand on the collection specialists, whose skills have been eroded as the "big deal" lease licenses take over.
- The "big deal" licenses have consumed more and more of the budget.
Both the LOCKSS system and the Internet Archive's Heritrix crawler provide tools libraries could use, if they wanted to. The provisions of the DMCA mean that more systematic collection and preservation than the Internet Archive can manage requires either a Creative Commons license or specific permission from the publisher. Awareness and use of Creative Commons licenses is limited in these areas. Experience in the LOCKSS Humanities Project has shown that, although permission is always granted, the process of tracking down the publisher, having the necessary conversation, and getting them to add the permission statement to their web site is time-consuming. A small fraction of the resources going into preserving low-risk e-journals from large publishers could preserve a huge amount of high-risk content for future scholars, by encouraging and supporting librarians in using the available tools.
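As an aside on how lightweight the technical half of this could be: once a publisher has added a permission statement, checking for it can be automated. A minimal sketch, assuming a hypothetical manifest URL and paraphrasing the statement text (the real LOCKSS statement wording and placement rules are defined by the protocol, not by this sketch):

```python
import urllib.request

# Hypothetical wording; the actual LOCKSS permission statement and
# the rules for where it must appear are defined by the protocol.
PERMISSION_TEXT = ("LOCKSS system has permission to collect, "
                   "preserve, and serve this Archival Unit")

def has_permission(manifest_url: str) -> bool:
    """Fetch a publisher's manifest page and check whether it
    grants permission to collect and preserve the content."""
    with urllib.request.urlopen(manifest_url) as response:
        page = response.read().decode("utf-8", errors="replace")
    return PERMISSION_TEXT in page

# Usage (hypothetical URL):
#   has_permission("https://example-journal.org/lockss-manifest.html")
```

The expensive part is not the check; it is the human conversation needed before the statement appears on the publisher's site at all.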
And yet, simply trusting even the biggest publishers to preserve their own content is not a responsible approach. To make this clear, I'll start by looking at the problem of preserving the integrity of the record in a different, easier-to-explain area: US federal government documents.
Since 1813 federal documents have been printed by an executive branch agency, the Government Printing Office (GPO), and distributed to a network of over 1,000 libraries around the US under the Federal Depository Library Program (FDLP). These are university, state and public libraries, and each collects a subset of the documents matching the interests of its readers. The documents remain the property of the US government, but since they are not subject to copyright, copies may be made freely. Under various reciprocal arrangements, many documents are also supplied to the national libraries of other countries.
The goal of the FDLP was to provide citizens with ready access to their government's information. But, even though this wasn't its primary purpose, it also provided a remarkably effective preservation system. It created a large number of copies of the material to be preserved; the more important the material, the more copies. These copies were on low-cost, durable, write-once, tamper-evident media. They were stored in a large number of independently administered repositories, some in different jurisdictions. And they were indexed in such a way that it is easy to find some of the copies, but hard to be sure you have found them all.
Preserved in this way, the information was protected from most of the threats to which stored information is subject. The FDLP's massive degree of replication protected against media decay, fire, flood, earthquake, and so on. The independent administration of the repositories protected against human error, incompetence and many types of process failure. But, perhaps most important, the system made the record tamper-evident.
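A back-of-the-envelope calculation shows why replication on this scale is so effective against loss. Assuming, purely for illustration, that each repository independently loses a given document with 1% probability over some period, the chance that every copy is lost falls exponentially with the number of copies:

```python
# Illustrative only: assumes losses at different repositories are
# independent, which correlated disasters make only approximately true.
p_loss = 0.01  # assumed chance that any one library loses its copy

for n_copies in (1, 3, 10, 100):
    print(f"{n_copies:4d} copies: P(all lost) = {p_loss ** n_copies:.3g}")

#    1 copies: P(all lost) = 0.01
#    3 copies: P(all lost) = 1e-06
#   10 copies: P(all lost) = 1e-20
#  100 copies: P(all lost) = 1e-200
# With the FDLP's 1,000+ repositories the figure would be around
# 1e-2000, smaller than a double-precision float can even represent.
```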
Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception. Government documents are routinely recalled from the FDLP, and some are re-issued after alteration.
An illustration is Volume XXVI of Foreign Relations of the United States, the official history of the US State Department. It covers Indonesia, Malaysia, Singapore and the Philippines between 1964 and 1968. It was completed in 1997 and underwent a 4-year review process. Shortly after publication in 2001, the fact that it included official admissions of US complicity in the murder of at least 100,000 Indonesian "communists" by Suharto's forces became an embarrassment, and the CIA attempted to prevent distribution. This effort became public, and was thwarted when the incriminating material was leaked to the National Security Archive and others.
The important property of the FDLP is that in order to suppress or edit the record of government documents, the administration of the day has to write letters, or send US Marshals, to a large number of libraries around the country. It is hard to do this without attracting attention, as happened with Volume XXVI. Attracting attention to the fact that you are attempting to suppress or re-write history is self-defeating. This deters most attempts to do it, and raises the bar of desperation needed to try. It also ensures that, without really extraordinary precautions, even if an attempt succeeds it will not do so without trace. That is what tamper-evident means. It is almost impossible to make the record tamper-proof against the government in power, but the paper FDLP was a very good implementation of a tamper-evident record.
It should have become evident by now that I am using the past tense when describing the FDLP. The program is ending and being replaced by FDSys. This is in effect a single huge web server run by the GPO on which all government documents will be published. The argument is that through the Web citizens have much better and more immediate access to government information than through an FDLP library. That's true, but FDSys is also Winston Smith's dream machine, providing a point-and-click interface to instant history suppression and re-writing.
It isn't just official government documents with which governments may have problems. In recent testimony (pdf) to the House Committee on Oversight and Government Reform entitled "Political Interference with Government Climate Change Science", NASA climate scientist James Hansen described the lengths to which the Bush administration was prepared to go in its attempts to suppress or adjust the speech and writings of scientists. These included suppression or editing of individuals' testimony to Congress, press releases, conference presentations, press interviews and web postings. The administration also used budgetary and public pressure (pdf) to persuade scientists to self-censor papers before submission. Although we have no evidence that the government has changed journal papers after publication, it seems likely that, if this administration thought it could get away with it, it would be tempted to do so.
Just as it seems imprudent for all copies of each element of the record of government information to be in the custody of a single government agency, it seems imprudent for all copies of each element of the record of scholarship to be in the custody of a single publisher. Governments are far from alone in potentially being tempted to tamper with the record. Consider, for example, drug and other companies involved in patent and other lawsuits. Arcane details of the published record can be essential evidence in these cases. Just as it would be easy for the government to change information on FDSys, a very small percentage of the lawyers' fees, let alone the potential settlement, would be more than enough to bribe or coerce the system administrators of the publisher's web platform to correct "flaws" in the record. Post-publication changes to e-journal content are already routine, typically to correct errors such as typos; as these systems are currently implemented, the probability that one more change would go undetected is overwhelming.
What lessons can we take from the FDLP and the paper library system that has protected the record of scholarship until recently?
The key architectural features we need to copy from the paper system into digital systems are massive replication in many independent repositories that do not trust each other, implemented and operated transparently. The one architectural feature the digital system needs that the paper system lacks is mutual audit among the replicas. This is necessary because, unlike paper, digital content is stored on media that are easily and tracelessly rewritable, and because technology exists (pdf) that could potentially re-write many of the on-line copies in a short period of time.
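To make "mutual audit" concrete, here is a deliberately simplified sketch. It is not the LOCKSS polling protocol, which uses nonces and sampled voting to resist far more sophisticated attacks; it shows only the core idea that replicas compare cryptographic hashes of their copies and flag any that disagree with the majority:

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    """Cryptographic hash of one replica's copy of a document."""
    return hashlib.sha256(content).hexdigest()

def audit(copies: dict[str, bytes]) -> set[str]:
    """Return the replicas whose copy disagrees with the majority.
    A real protocol would include nonces so that a damaged or
    malicious peer cannot simply replay a remembered hash."""
    hashes = {repo: digest(content) for repo, content in copies.items()}
    majority, _count = Counter(hashes.values()).most_common(1)[0]
    return {repo for repo, h in hashes.items() if h != majority}

# Three replicas hold the article; one has been silently rewritten.
copies = {
    "library-a": b"original text of the article",
    "library-b": b"original text of the article",
    "library-c": b"rewritten text of the article",  # tampered copy
}
print(audit(copies))  # {'library-c'} -- the tampering is evident
```

The point is not the twenty lines of code but the architecture: because the comparison happens among mutually distrustful peers, a Winston Smith must subvert most of them at once to escape detection.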
To sum up both this and the preceding post, investment in preserving e-journals is going primarily into establishing a very few, centralized, monolithic "trusted third party" repositories, implemented using expensive, proprietary, enterprise-scale technology and incurring high staff costs. These expensive facilities, quite naturally, focus on preserving premium, high-cost e-journals. Doing so seems natural, because a single negotiation with a large publisher brings a vast amount of content, which makes the repository look attractive to the librarians whose serials budget must in the end bear the cost. Although the LOCKSS system's distributed, peer-to-peer architecture has the features needed, market pressures mean that it too is being targeted at low-risk content.
This is a poor use of resources. The investment in technology and staff replicates investments the publishers have already made, and which the same librarians have already paid for via their subscriptions. The content being preserved is at very low risk of loss through accident or incompetence. It is even at low risk of cancellation, since these publishers use the "big deal" bundled lease approach to make cancellation extremely painful. The investments are not effective at preventing the record of scholarship being tampered with, since the "trusted third party" architecture lacks the essential tamper-evidence features of the paper system. And the investments are not even particularly effective at ensuring post-cancellation access, since a significant proportion of publishers won't allow the repositories to provide it, quite apart from the untested legal and operational obstacles in the path of third-party repositories providing such access. Meanwhile, the content future scholars will need that is actually at serious risk of loss through accident, incompetence or economic failure is not being effectively collected or preserved.