Thursday, March 10, 2016

Talk on Private LOCKSS Networks at PASIG

I stood in for Vicky Reich to give an overview of Private LOCKSS Networks to the PASIG meeting. Below the fold, an edited text of the talk with links to the sources.

Good morning. Vicky Reich should be standing here, but she sends her regrets. Our responsibilities back in California meant that only one of us could be here, so you get me. Vicky and I have very different styles of presentation; this is my overview of the world of Private LOCKSS Networks, not hers.

Beginnings

Nearly seventeen and a half years ago, Vicky and I were hiking in the East Bay hills. For the previous three years, she had been in effect the first marketing person for Stanford Libraries' HighWire Press, which pioneered the transition of academic journals from paper to the Web. She was just back from a conference at which librarians had been expressing much concern about the permanence of these costly, new-fangled e-journals. So I asked "how is permanence assured in the paper system?"

Translating Vicky's answer from librarian-speak to computer-speak, the paper system achieved distributed fault-tolerance through massive replication on durable, somewhat tamper-evident media together with a peer-to-peer anti-entropy protocol. I had been interested in fault-tolerance and peer-to-peer technology, so I saw pretty quickly how the paper system could be replicated in the Web world. The next day Vicky & I pitched the idea to Michael Keller, the Stanford Librarian and her boss, who said:
  • Don't cost me any money.
  • Don't get me into trouble.
  • Do what you like.

Technology

You can't ask better than that. So we built the LOCKSS (Lots Of Copies Keep Stuff Safe) system, which worked for e-journals the way library stacks do for paper journals. The analog of the stacks in each library is a LOCKSS box, an inexpensive computer that:
  • Collects content to which the library subscribes from the publisher's Web site.
  • Disseminates it to the library's readers whenever they can't get it from the publisher's Web site.
  • Preserves it by:
Modelling the system on paper libraries was not the obvious way to tackle e-journal preservation. Somewhat later, several other teams tried to set up centralized, third-party archives without success. Our approach had two big advantages:
  • Publishers were used to the idea that libraries held copies of the publisher's e-journal content under long-established restrictions as to what they could do with them. Because the LOCKSS system worked the same way it was easy to get publishers, except for the very largest, to agree to let libraries use it.
  • The system was cheap and easy for libraries to use. Because each library held its own copy, there were a lot of copies. Having a lot of copies meant that a system of many cheap nodes was very reliable; libraries didn't need expensive, enterprise-grade hardware.
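To make the "lots of copies" idea concrete, here is a minimal sketch of how a set of replicas might detect and repair damage by comparing content hashes and deferring to the majority. It is purely illustrative and is not the actual LOCKSS polling protocol, which is considerably more elaborate; the function names and the in-memory `replicas` dictionary are assumptions made for this example.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying this copy of the content."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(replicas: Dict[str, bytes]) -> List[str]:
    """Compare all copies by hash and overwrite any copy that disagrees
    with the strict majority. Returns the names of the repaired boxes.

    Hypothetical helper: a real network would exchange hashes over the
    wire rather than share one dictionary.
    """
    votes = Counter(digest(c) for c in replicas.values())
    winner, count = votes.most_common(1)[0]
    if count * 2 <= len(replicas):
        # Without a strict majority we cannot tell which copy is good.
        raise ValueError("no majority agreement among replicas")

    good_copy = next(c for c in replicas.values() if digest(c) == winner)
    repaired = []
    for name, content in list(replicas.items()):
        if digest(content) != winner:
            replicas[name] = good_copy  # repair from an agreeing copy
            repaired.append(name)
    return repaired
```

For instance, with four boxes of which one holds a damaged copy, `poll_and_repair` would restore that one box from its peers. The more independent copies there are, the less any single cheap node's failure matters.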

Networks

This open source technology forms the Global LOCKSS Network (GLN), in which libraries around the world preserve e-journal content to which they subscribe. It provides them at low cost with the key functions that digital preservation systems need:
  • Ingest, which for subscription e-journal content is complex and continually evolving.
  • Geographic distribution, self-healing storage, and monitoring for preservation.
  • Dissemination, including integration with OpenURL and DOI resolution.
The LOCKSS team has been economically self-sustaining since 2007, from support fees from the libraries of the LOCKSS Alliance, with no grant funding whatsoever until 2012. Then we received a small grant from the Mellon Foundation to fund some specific infrastructure improvements. The results were published in D-Lib a few months ago.

But, to quote Arlo Guthrie, that's not what I came to tell you about. I came to tell you about what happened when librarians realized that there was an easy, cheap, off-the-shelf way to build distributed preservation networks. They started building Private LOCKSS Networks (PLNs). A community decides that some genre of digital content needs to be preserved, and a group of six or more institutions steps up to the plate to each run a LOCKSS box. Here are some examples, first some preserving other genres, and then some preserving e-journals.

Other Genres

The job of the protagonist of Orwell's 1984, Winston Smith, was to rewrite history to conform to the current and ever-changing ideology. All governments are tempted to do this, and most succumb to the temptation. In the paper world rewriting or suppressing government information once published was hard. Because there were lots of copies scattered around it was hard to be sure you had found all of them.

In the US this scattering was the job of the Federal Depository Library Program (FDLP). It was set up before the days of rapid transportation, let alone instantaneous networks, to ensure that citizens across the continent could access information about their government. The transition of Federal information from paper to the Web meant a transition from hundreds of paper copies to a single copy on a government web-server. Winston Smith's job couldn't have been made easier.

So government documents librarians from 21 institutions across the US and Canada came together and, with cooperation from the Government Printing Office, set up a PLN that collects and preserves Federal documents. Included in this collection are Congressional reports, Budget of the US government, congressional hearings, economic reports, documents from US courts, presidential papers, and more.

The winners of last year's CLA/OCLC Award for Innovative Technology were a group of Canadian government documents librarians from 10 Canadian institutions who, with help from Stanford, set up a PLN in a desperate struggle to save Canadian government information from the ravages of the Harper administration. These ravages were documented by Anne Kingston at Maclean's in a terrifying article, Vanishing Canada: Why we’re all losers in Ottawa’s war on data, about the Harper administration's crusade to prevent anyone finding out what was happening as they strip-mined the nation. They didn't even bother rewriting; they just deleted, and prevented further information being gathered.

The MetaArchive Cooperative is run by the non-profit Educopia Institute. This fast-growing international membership organization caters to cultural memory organizations that are collaborating to preserve very high-value, locally created digital materials. Examples include Purdue's PURR data repository and Greene County Public Library's collection of oral history videos about the 1974 tornado in Xenia, Ohio. Their network is now 40TB, making it the largest Private LOCKSS Network implementation by content size. They published the comprehensive report A Guide to Distributed Digital Preservation.

In SAFE (SAFE Archiving FEderation), five institutions in Belgium, Germany and Canada are preserving their locally published open-access collections of electronic theses and dissertations, scientific publications and research data.

e-journals

As I said, almost the only publishers reluctant to let libraries use the LOCKSS technology were the very large ones. When a third-party e-journal archiving service finally went live in 2007, they signed up but were worried about being captive to a monopoly supplier. A group of large publishers approached Stanford to set up CLOCKSS, a dark archive of e-journal content that is a large PLN. It preserves content from almost all large e-journal publishers in a network of 12 large (30TB+) LOCKSS boxes scattered around the world. There are five in the US (Stanford, Rice, Indiana, Virginia, OCLC) and one each in Australia, Canada, Italy, Japan, Hong Kong, Germany and Scotland. If the content is ever unavailable for an extended period, the board can vote to trigger it. This happens a couple of times a year, and triggered content is made available to all under a Creative Commons license. You can find the triggered titles on the CLOCKSS web site, currently 23 of them.

In 2014 the CLOCKSS archive was audited against the Trusted Repository Audit Criteria, the predecessor of ISO 16363, and received the first-ever perfect score for "Technologies, Technical Infrastructure, Security" and equalled the previous best overall score. All non-confidential materials submitted to the auditors are posted on documents.clockss.org for you to read.

Ingest is the largest component of almost all digital preservation system costs, and this is certainly true of e-journal preservation. This is especially a problem for the "long tail" of smaller publishers. Many of them use the Open Journal Systems (OJS) platform, and thanks to the Public Knowledge Project there is now a PLN providing "one-button" preservation of journals published via OJS.

A large slice of the "long tail" is published in local languages primarily for local scholars, typically as open access. This is certainly true in Brazil, and libraries there have come together at a national level to run a PLN called Cariniana that preserves their national open access academic literature.

Somehow, the definition of "small publisher" has come to be one that publishes ten or fewer journals. This seems pretty big to me, but if we adopt this definition the LOCKSS Program just passed an important milestone. Two weeks ago we sent out a press release announcing that the various networks using LOCKSS technology now preserve content from over 1,000 long-tail publishers. There is still a long way to go, but as the press release says:
"there are tens of thousands of long tail publishers worldwide, which makes preserving the first 1,000 publishers an important first step to a larger endeavor to protect vulnerable digital content."
Brazil is not the only country addressing e-journal preservation at a national level. As subscription costs increase and library budgets are increasingly stressed, safeguarding not just a nation's open access research output but also the nation's investment in subscription e-journals becomes a major concern. JISC in the UK has funded a project to develop a National Hosting Network (NHN) using LOCKSS technology, and discussions of similar efforts are at various stages with half-a-dozen other countries.

Interestingly, the USA has long had a National Hosting program. Los Alamos National Labs collects all e-journals Federal researchers could access and re-hosts them. The goal is to ensure that details of what their researchers, possibly working on classified projects, look at is not visible outside the Federal government.

Benefiting from exchange of expertise and best practices, an international community is rapidly growing around the requirement for national or consortial custody of web-published content that individual institutions no longer provide. The main reasons other countries are interested in National Hosting Networks are:
  • It is becoming clear that the sweet spot for providing post-cancellation access is at a national level, matching the level at which most countries negotiate with publishers, and the source of most of the subscription funds. A global third-party archive poses essentially the same problem as the original publishers do: access is contingent on continued subscription payment.
  • It is tempting to assume that once the subscription has been paid the content will be available from the publisher, but there are many reasons this may not always be the case:
    • Publishers' subscription management and access systems are typically separate, and errors that deny subscribers access can occur as information is transferred between them.
    • Publishers' access systems are not completely reliable. Among the publishers who have recently suffered significant outages are JSTOR, Elsevier, Springer, Nature, IEEE, Taylor & Francis, and Oxford.
    • The Domain Name System that links readers via URLs to publishers is brittle, subject to failures such as the one that took down DOI resolution and to attacks that enable journal websites to be hijacked.
    • Publishers often fail to maintain their URL space consistently, leading to the reference rot and content drift that affect at least 20% of journal articles.
    • The Internet that connects readers to publishers' Web sites is a best-efforts network passing through many links and routers on the way. The LOCKSS system's crawlers constantly detect complete or partial failure to deliver content.
  • The on-going legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. Thus the access rights notionally conveyed by subscription agreements are actually subject to a complex, opaque and evolving legal framework to which the subscriber, and indeed the government footing the bill, are not parties. The US is claiming the unilateral right to control access to content held anywhere by any organization with ties to the US, such as the major e-journal publishers.
  • As governments, and thus taxpayers, are the ultimate source of most of the approximately €7.6B/yr spent on academic publishing, concern for the value obtained in return is natural. Many countries, such as the UK, negotiate nationally with publishers. Institutions obtain access by opting-in to the national license terms and paying the publisher. The institution can obtain COUNTER usage statistics from the publisher, so can determine its local cost per usage. But the opt-in license and direct institution-to-publisher interaction make it difficult to aggregate this information at a national level to track value-for-money.
To address these issues, the goals of a National Hosting system are to ensure that:
  • Copies of subscribed content are maintained by national institutions within the national boundaries.
  • Using software that is open source, or of national origin.
  • Upon terms allowing access by institutions to content to which they subscribed if, for any reason, it is not available to them from the publisher.
Accomplishing these goals requires:
  1. A database (the entitlement registry) tracking the content to which, at any given time, each institution subscribes. The JISC project has implemented this.
  2. A system for collecting and preserving for future use the content to which any institution subscribes. The LOCKSS technology provides this.
  3. A system for delivering content to readers that they cannot access from the publisher. The LOCKSS technology provides this.
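As a rough illustration of how the first and third requirements fit together, the sketch below models an entitlement registry as a list of subscription records and uses it to decide whether a reader's request should be served from the national copy. Every class, field and function name here is hypothetical; this is not the JISC implementation or any LOCKSS API, only a sketch under the assumption that entitlements are recorded per institution, journal and range of years.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entitlement:
    """One subscription record (hypothetical schema)."""
    institution: str
    journal: str
    first_year: int
    last_year: int  # inclusive


class EntitlementRegistry:
    """Tracks the content each institution subscribed to, and when."""

    def __init__(self) -> None:
        self._entries: List[Entitlement] = []

    def add(self, entry: Entitlement) -> None:
        self._entries.append(entry)

    def is_entitled(self, institution: str, journal: str, year: int) -> bool:
        # Entitled if any record covers this journal volume year.
        return any(
            e.institution == institution
            and e.journal == journal
            and e.first_year <= year <= e.last_year
            for e in self._entries
        )


def should_serve_locally(registry: EntitlementRegistry, institution: str,
                         journal: str, year: int,
                         publisher_available: bool) -> bool:
    """Serve the preserved copy only when the publisher cannot deliver
    and the institution paid for the content in question."""
    return (not publisher_available) and registry.is_entitled(
        institution, journal, year)
```

Under these assumptions, an institution that subscribed to a journal from 2010 to 2014 and later cancelled would still be served the 2012 volume from the national copy whenever the publisher's site does not deliver it, but would not be served the 2015 volume.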
We would like to talk with other countries interested in a National Hosting Network. Much of the cost in setting up and operating these networks can be shared among countries; the more that join in, the cheaper they get. LOCKSS applies just as much at a community level as at a technical level: having many institutions, nations and networks is a better preservation strategy than the consolidated risk of a single provider.
