Sunday, June 10, 2007

Why Preserve E-Journals? To Preserve The Record

In the previous post in this series I examined one of the two answers to the question "why preserve e-journals?". Libraries subscribing to e-journals want post-cancellation access for their readers to the content they purchased. Given that the problem only arises for subscription journals that don't operate a moving wall, that cancellation is a fairly rare event, and that canceled content can't have been that important anyway (or why was it canceled?), it is hard to see this reason for preservation justifying current levels of investment in e-journal preservation.

I will now turn to the other answer to the question, that is in order to preserve the integrity of the record of scholarship. This is, or should be, a concern not of individual libraries, but of libraries working together in the interests of society as a whole.

To understand what is needed to preserve the integrity of the record of scholarship as it migrates to the digital world, we need to understand the threats to that integrity. The LOCKSS team set out our model of the threats to which digital content is subject in a paper in D-Lib. Briefly, they are:

  • Media Failure or loss.

  • Hardware Failure.

  • Software Failure.

  • Communication Errors.

  • Failure of Network Services.

  • Media and Hardware Obsolescence.

  • Software and Format Obsolescence.

  • Operator Error.

  • Natural Disaster.

  • External Attack.

  • Internal Attack.

  • Economic Failure.

  • Organizational Failure.

All digital content is subject to all these threats, but different threats affect different content differently. The major publishers have been in business a long time, most for well over a century, and Oxford University Press for well over five centuries. They understand that their content is their core business asset, and are highly motivated to ensure that these threats don't destroy its value. Irrespective of external efforts, they already devote significant resources (which ultimately come from their subscribers, the libraries) to preserving these assets. E-journals in their custody are not likely to die of media, hardware, software or format failure or obsolescence. Smaller publishers, whether for-profit or not, as long as they survive the consolidation of the publishing market, will also understand that both their chances of survival and their value as an acquisition depend on maximizing the value of their content, and will be motivated to preserve it.

Yet much of the external money being invested in preserving e-journals targets precisely this low-risk content, from the largest to the medium-sized publishers. The first question librarians ask of an e-journal preservation system is "do you have Elsevier content?" Elsevier has been in business since 1880, longer than many of the libraries asking the question. Their content would not be at significant risk even if copies of it were not, as they actually are, already on-site at about a dozen libraries around the world. This investment is merely adding to the already substantial existing investment from the libraries via the publishers in preserving their content. Are these sensible investment policies?

The content at most risk from these threats is that of the smallest publishers. In the humanities, this is the content that is being preserved by the LOCKSS Humanities Project. This is the content that, were it on paper, research libraries would be collecting to provide future scholars with the material they need to study the culture of our times. The publishers and their content are visibly at risk; among those initially selected by a group of humanities collection specialists at major research libraries in 2002, several disappeared from the Web before the project could collect them, and others have disappeared since. For example, World Haiku Review's content was lost by the journal's hosting platform and had to be recovered from the LOCKSS system.

The e-journals from these tiny publishers are almost all open access. Libraries have no financial incentive, in the form of concern about post-cancellation access, to preserve them. Some of this content is not being collected by the Internet Archive (e.g. Princeton Report on Knowledge) and some is, although the Archive's collection and preservation (pdf) is somewhat haphazard. In many fields communication has moved almost completely to the Web. Poetry, dance, media criticism and many other areas find the Web, with its low cost, multimedia capability and rapid, informal exchanges a much more congenial forum than print. There seem to be three main reasons why research libraries are no longer collecting the record of these fields:

  • Faculty have little awareness of the evanescent nature of Web content.

  • Thus there is little demand on the collections specialists, whose skills have been eroded as the "big deal" lease licenses take over.

  • The "big deal" licenses have consumed more and more of the budget.

Both the LOCKSS system and the Internet Archive's Heritrix crawler provide tools libraries could use, if they wanted to. The provisions of the DMCA mean that more systematic collection and preservation than the Internet Archive can manage requires either a Creative Commons license or specific permission from the publisher. Awareness and use of Creative Commons licenses is limited in these areas. Experience in the LOCKSS Humanities Project has shown that, although permission is always granted, the process of tracking down the publisher, having the necessary conversation, and getting them to add the permission statement to their web site is time-consuming. A small fraction of the resources going into preserving low-risk e-journals from large publishers could preserve a huge amount of high-risk content for future scholars, by encouraging and supporting librarians in using the available tools.

And yet, simply trusting even the biggest publishers to preserve their own content is not a responsible approach. To make this clear, I'll start by looking at the problem of preserving the integrity of the record in a different, easier to explain area, that of US federal government documents.

Since 1813 federal documents have been printed by an executive branch agency, the Government Printing Office (GPO), and distributed to a network of over 1,000 libraries around the US under the Federal Depository Library Program (FDLP). These are university, state and public libraries, and each collects a subset of the documents matching the interests of their readers. The documents remain the property of the US government, but copies may be freely made, as they are not subject to copyright. Under various reciprocal arrangements, many documents are also supplied to the national libraries of other countries.

The goal of the FDLP was to provide citizens with ready access to their government's information. But, even though this wasn't the FDLP's primary purpose, it provided a remarkably effective preservation system. It created a large number of copies of the material to be preserved; the more important the material, the more copies. These copies were on low-cost, durable, write-once, tamper-evident media. They were stored in a large number of independently administered repositories, some in different jurisdictions. They were indexed in such a way that it was easy to find some of the copies, but hard to be sure that you had found them all.

Preserved in this way, the information was protected from most of the threats to which stored information is subject. The FDLP's massive degree of replication protected against media decay, fire, flood, earthquake, and so on. The independent administration of the repositories protected against human error, incompetence and many types of process failures. But, perhaps most important, the system made the record tamper-evident.

Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception. Government documents are routinely recalled from the FDLP, and some are re-issued after alteration.

An illustration is Volume XXVI of Foreign Relations of the United States, the official history of the US State Department. It covers Indonesia, Malaysia, Singapore and the Philippines between 1964 and 1968. It was completed in 1997 and underwent a 4-year review process. Shortly after publication in 2001, the fact that it included official admissions of US complicity in the murder of at least 100,000 Indonesian "communists" by Suharto's forces became an embarrassment, and the CIA attempted to prevent distribution. This effort became public, and was thwarted when the incriminating material was leaked to the National Security Archive and others.

The important property of the FDLP is that in order to suppress or edit the record of government documents, the administration of the day has to write letters, or send US Marshals, to a large number of libraries around the country. It is hard to do this without attracting attention, as happened with Volume XXVI. Attracting attention to the fact that you are attempting to suppress or re-write history is self-defeating. This deters most attempts to do it, and raises the bar of desperation needed to try. It also ensures that, without really extraordinary precautions, even if an attempt succeeds it will not do so without trace. That is what tamper-evident means. It is almost impossible to make the record tamper-proof against the government in power, but the paper FDLP was a very good implementation of a tamper-evident record.

It should have become evident by now that I am using the past tense when describing the FDLP. The program is ending and being replaced by FDsys, in effect a single huge web server run by the GPO on which all government documents will be published. The argument is that through the Web citizens have much better and more immediate access to government information than through an FDLP library. That's true, but FDsys is also Winston Smith's dream machine, providing a point-and-click interface to instant suppression and re-writing of history.

It isn't just official government documents with which governments may have problems. In recent testimony (pdf) to the House Committee on Oversight and Government Reform entitled "Political Interference with Government Climate Change Science", NASA climate scientist James Hansen described the lengths to which the Bush administration was prepared to go in its attempts to suppress or adjust the speech and writings of scientists. These included suppression or editing of the testimony of individuals to congress, press releases, conference presentations, press interviews and web postings. The administration also used budgetary and public pressure (pdf) to persuade scientists to self-censor papers before submission. Although we don't have evidence that the government has changed journal papers after they have been published, it must be likely that if this administration thought they could get away with it they would be tempted to do so.

Just as it seems imprudent for all copies of each element of the record of government information to be in the custody of a single government agency, it seems imprudent for all copies of each element of the record of scholarship to be in the custody of a single publisher. Governments are far from alone in potentially being tempted to tamper with the record. Consider drug and other companies involved in patent and other lawsuits, for example. Arcane details of the published record can be essential evidence in these cases. Just as it would be easy for the government to change information on FDsys, a very small percentage of the lawyers' fees, let alone the potential settlement, would be more than enough to bribe or coerce the system administrators of the publisher's web platform to correct "flaws" in the record. As these systems are currently implemented, the probability that the change would not be detected is overwhelming. Post-publication changes to e-journal content are routine, typically to correct errors such as typos.

What lessons can we take from the FDLP and the paper library system that has protected the record of scholarship until recently?

The key architectural features we need to copy from paper systems to the digital systems are massive replication in many independent repositories that do not trust each other and are implemented and operated transparently. The one architectural feature that is needed in the digital system that the paper system lacks is mutual audit among the replicas. This is necessary because, unlike paper, digital content is stored on media that is easily and tracelessly rewritable, and because technology exists (pdf) that could potentially re-write many of the on-line copies in a short period of time.
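The idea of mutual audit among mutually distrustful replicas can be made concrete with a minimal sketch. This is not the actual LOCKSS polling protocol (which uses nonced challenges to defeat replay and a more subtle voting scheme); it simply shows the core mechanism, with hypothetical replica names, of comparing content hashes across independent repositories and flagging any copy that disagrees with the majority for repair:

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    """Hash one replica's copy of a document."""
    return hashlib.sha256(content).hexdigest()

def audit(replicas: dict[str, bytes]) -> list[str]:
    """Compare every replica's hash against the majority hash;
    return the names of replicas that disagree and need repair."""
    hashes = {name: digest(content) for name, content in replicas.items()}
    majority_hash, _ = Counter(hashes.values()).most_common(1)[0]
    return [name for name, h in hashes.items() if h != majority_hash]

# Three honest copies and one copy altered after publication:
# the audit flags the outlier without needing a trusted master copy.
replicas = {
    "lib-a": b"original article text",
    "lib-b": b"original article text",
    "lib-c": b"original article text",
    "lib-d": b"edited article text",
}
print(audit(replicas))  # ['lib-d']
```

The point of doing this peer-to-peer rather than against a "trusted third party" is that no single administrator can rewrite the record undetected; a successful attack has to subvert a majority of independently operated repositories.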

To sum up both this and the preceding post, investment in preserving e-journals is going primarily into establishing a very few, centralized, monolithic "trusted third party" repositories, implemented using expensive, proprietary, enterprise-scale technology and incurring high staff costs. These expensive facilities, quite naturally, focus on preserving premium, high-cost e-journals. Doing so seems natural, because a single negotiation with a large publisher brings a vast amount of content, which makes the repository look attractive to the librarians whose serials budget must in the end bear the cost. Although the LOCKSS system's distributed, peer-to-peer architecture has the features needed, market pressures mean that it too is being targeted at low-risk content.

This is a poor use of resources. The investment in technology and staff replicates investments the publishers have already made, and which the same librarians have already paid for through their subscriptions. The content being preserved is at very low risk of loss through accident or incompetence. It is even at low risk of cancellation, since these publishers use the "big deal" bundled lease approach to make cancellation extremely painful. The investments are not effective at preventing the record of scholarship from being tampered with, since the "trusted third party" architecture lacks the essential tamper-evidence of the paper system. The investments are not even particularly effective at ensuring post-cancellation access, since a significant proportion of publishers won't allow the repositories to provide it, and there are untested legal and operational obstacles in the path of third-party repositories providing such access. Meanwhile, the content future scholars will need that is actually at serious risk of loss through accident, incompetence or economic failure is not being effectively collected or preserved.


Anonymous said...

Correction: The US Government Printing Office is in the Legislative Branch not the Executive.

David. said...

Thank you for the correction. The Legislative Branch's regard for the integrity of the record was exemplified by the actions of Senators Graham, Kyl and Brownback (revealed in testimony in the Hamdan case). Thus I don't see that the correction affects my argument much.

David. said...

In a comment on the post about Post-Cancellation Access, Portico describes their approach to preserving the integrity of the record.

"Portico provides [subscribers] access to archived content when specific trigger events occur ... Trigger events include:

A publisher stops operations; or

A publisher ceases to publish a title; or

A publisher no longer offers back issues; or

Upon catastrophic and sustained failure of a publisher's delivery platform."

The CLOCKSS program uses exactly the same trigger events; the difference is that after the trigger event access to the affected content is provided to everyone by CLOCKSS but only to subscribers by Portico.

Why would the major publishers agree that after certain trigger events their core business asset would be rendered worthless by being made open access? The only plausible explanation is that they believe the probability of a trigger event happening approaches zero. The publishers have figured out that, in the new electronic publishing environment of all-or-nothing access to a comprehensive database of a publisher's content, even the back issues of low-quality journals have a value vastly greater than the cost of keeping them online. If a journal is part of a large publisher's catalog, there is a strong incentive to keep it available. If a journal comes from a small publisher which runs into financial trouble, it will be bought by a large publisher to realise this value.

Although the failure of a publisher is a risk to the integrity of the record, it is not a significant risk in the publishing market that the Web has created. If one happens, it will be in the context of a much larger social or economic failure in which, publishers agree, restricting access will be counter-productive.

Ian Hough said...

We must preserve e-literature, be it email or blogs or whatever. Imagine all the letters of great thinkers through the ages that have served as an inspiration to people...if that tradition is suddenly taken away, the later generations (who, let's face it, are losing touch with the older stuff as the world changes due to the very technology being discussed) will collapse under the weight of their own ignorance of who they are and where they came from, etc. Soon everything will be electronic, and the danger then will lie in the choice we have to delete or save everything we write or create electronically. My website I see as a kind of testament to the fact that I even exist, and if I hadn't made it I wouldn't have anything at all on the web to remember myself by when I turn on the computer. It is a great medium, but we have to respect its throwaway nature and help ourselves to preserve the thoughts we have today so they're here tomorrow.

lady macleod said...

Did you know The British Museum now has a project for preserving a number of "typical" blogs?

Congratulations from North Africa on being a Blog of Note. Interesting Blog.

Anthropositor said...

Well David,
I am not one of the professionals to whom I think your blog is directed, and rarely subscribe to anything if I can help it. I am lamentably not even particularly computer conversant. And of course I have no notion what may be held to be unsuitable or off-topic.

But it has struck me that some of the Search Engines are reputedly in the process of developing means of storing virtually forever virtually everything that goes over the web. The idea of my Emails and blog contents and forum contributions being around into the wee hours of the human species’ possible future, mixed in with the incredible volume of nonsense does not fill me with satisfaction.

Of course, I AM a bit of a packrat. I do save a lot of stuff I have no current use for, on the theory that I am simply not yet imaginative enough to devise a use for it, and my libraries in various locations, of many thousands of books each, will continue to gather dust and provide me with greater choices should I ever have an occasion when I run out of interesting reading material. As for all the other things I save, those things really do become very useful, very often.

But with regard to internet content, whether for subscriber use or for the rest of us plebian riffraff, won’t the ultimate key be the refinement of the search criteria that ultimately sifts it? There is great interest in making our political frameworks more democratic. Perhaps it is more to the point to make information as democratically available as possible. 95% of my education has been in libraries and used book stores.

I realize that there is a certain amount of inevitable obsolescence of storage media and retrieval mechanisms, and that the materials stored in obsolete ways will ultimately be lost if perhaps expensive efforts are not made to transfer it to the more modern storage media. It strikes me that this is a pretty important problem.

I think I would also pay attention to the possibility of massive electromagnetic pulse wave attack. I will not detail how easy it would be for that to occur. Information needs to be alternatively stored in such a fashion that it will be retrievable even in the event of an aggressive massive electromagnetic event. There are a lot of well-funded fanatics out there.

no more said...

e-journals, good or bad, is really the one time in history when everyone regardless of age/race/sex/religion/status/location can say to the world "I was here". Thought provoking, humorous, or just a day to day entry of life experience.

In the grand scheme of things I wonder if it is possible or necessary to preserve all or a selection. I often think when a person dies their blog lives on, and speaks volumes more than a tombstone ever could about "who I was".

Irene Grumman said...

Dear David,

I found your blog only because you made the Blog-of-the-Month list. I'm not working for a library, publisher or the Government. Yet your theme is one that I often wonder about.

The danger of rewriting or deleting documents is very grave. I agree that replication in many places is vital, even if natural disaster, war, or electromagnetic pulse were the only threats.

I know a professor of Catalan language and literature who found in uncataloged piles of old books and papers a 16th Century page of music which apparently is an unprecedented bit of evidence for language and history in that era. Will such a browse, and such a find, be possible with digital materials?

Publishing the personal journal or diary of an Oregon pioneer woman is considered to add a great deal to our understanding of the era, and of the contribution of women to the American story. Will blogs or email letters survive to tell such stories?

I no longer have access to some stories and essays I wrote in Word with Windows 95. I wrote them in 1999-2003. I hear that I can pay an unknown sum to have them transdigitized for Windows XP, which is already close to obsolete. I have paper copies of most of them, but the point is that the individual user may be out of luck on preservation.

I believe the concerns you raise are very, very important. Preservation, and findability, need to be explained, understood, and supported.

David. said...

What I wrote could be interpreted as suggesting that GPO was driving the replacement of the paper FDLP system by the electronic FDsys. That is an excessively paranoid view. The agencies that generate government documents are driving the transition, because they are producing more and more on the web and less and less on paper. There are thus fewer and fewer paper documents for GPO to funnel into FDLP.

It is also true that one of GPO's goals for FDsys is strong authentication for the documents that they distribute. They plan to use digital signatures for this purpose. This is undoubtedly a good thing, capable of detecting the effects of many of the threats above. But it is important to observe that the signature tells the reader that this is official government information, not necessarily that this is the same official government information that it used to be. Winston Smith was a government official.
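The distinction can be made concrete with a sketch. Here an HMAC stands in for a real digital-signature scheme (the key, document text and function names are all hypothetical): an altered document that has simply been re-signed by the same official key verifies just as cleanly as the original, so verification proves the signer, not the continuity of the record.

```python
import hmac
import hashlib

GOVT_KEY = b"official-signing-key"  # hypothetical signing key

def sign(document: bytes) -> bytes:
    # HMAC stands in here for a real digital-signature scheme.
    return hmac.new(GOVT_KEY, document, hashlib.sha256).digest()

def verify(document: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(document), signature)

v1 = b"version as originally published"
sig1 = sign(v1)

# The record is edited and simply re-signed with the same key.
v2 = b"version after quiet revision"
sig2 = sign(v2)

print(verify(v1, sig1))  # True: the original verifies
print(verify(v2, sig2))  # True: so does the rewritten version
```

Detecting the rewrite requires something outside the signer's control, such as independently held copies to compare against, which is exactly what the replicated paper FDLP provided.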

Anonymous said...

Almost makes you long for the writing on cave walls method of recording representations of day to day life. At least the technology there was never in danger of becoming obsolete.

The Sports Satirist said...

Are E-Journals like e-books?

David. said...

Reading comments from funnyman and others it seems I should clarify the term e-journals. I'm using it to refer to scholarly journals such as New England Journal of Medicine or Early Modern Literary Studies. They used to be published on paper and formed an important part of library collections. Now their content is increasingly published only on the Web. Systems developed for preserving this content, such as the LOCKSS system, could also be used to preserve e-journals in the broader sense, such as blogs, and also other types of content such as e-books.

Optimum Companies said...

Thank you for your post; I have been extremely frustrated on occasion in my search for certain journal articles that my library no longer stores and can't be tracked down elsewhere online. The preservation of such documents is crucial.