Why National Hosting?

The details of how universities, research institutes and libraries acquire subscription electronic resources vary considerably between countries. But in most cases they share an important feature: the money may flow through the system in different ways, but it starts from the government. This leads governments to focus on a number of issues in the system.
Post-Cancellation Access

In the paper world, a library's subscription to a journal purchased a copy of the content. It provided the library's readers with access until the library decided to de-accession it, and it could be used to assist other libraries via inter-library loan. The advent of the Web made journal content vastly more accessible and useful, but it forced libraries to switch from purchasing a copy to leasing access to the publisher's copy. If the library stopped paying the subscription, its readers lost access not just to content published in the future, but also to content published in the past while the subscription was being paid. This problem became known as post-cancellation access (PCA), and elicited over time a number of different responses:
- Some publishers promised to provide PCA themselves, providing access to paid-for content without further payment, but librarians were rightly skeptical of these unfunded promises.
- The enthusiasm for open access led HighWire Press to pioneer and some publishers to adopt the moving wall concept. Because the value of content decays through time, providing open access to content older than, for example, 12 months does little to reduce the motivation for libraries to subscribe, while rendering PCA less of an issue.
- Several efforts took place to establish e-journal archives, in order that PCA not be at the whim of the publisher. I posted a historical overview of e-journal archiving to my blog back in 2011.
Fault-Tolerant Access

It is tempting to assume that once the subscription has been paid the content will be available from the publisher, but there are many reasons this may not always be the case:
- Publishers' subscription-management and access systems are typically separate, and errors in transferring information between them can deny subscribers access.
- Publishers' access systems are not completely reliable. Among the publishers who have recently suffered significant outages are JSTOR, Elsevier, Springer, Nature, IEEE, Taylor & Francis, and Oxford.
- The Domain Name System that links readers via URLs to publishers is brittle, subject to failures such as the one that took down DOI resolution, and to weaknesses that enable journal websites to be hijacked.
- Publishers often fail to maintain their URL space consistently, leading to the reference rot and content drift that affect at least 20% of journal articles.
- The Internet that connects readers to publishers' Web sites is a best-effort network passing through many links and routers on the way. There are no guarantees that requested content will be delivered completely, or at all. The LOCKSS system's crawlers constantly detect complete or partial failure to deliver content. These failures could be caused by errors at the publisher, or along the network path to the crawler. Experience suggests that the fewer hops along the route the fewer errors, so at least some errors are network problems.
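One simple check a crawler can make for the last failure mode is to compare the length the server declared against the bytes actually received. The sketch below is purely illustrative (LOCKSS's real validation also compares content across boxes), and assumes a response whose Content-Length header may be absent:

```python
def delivery_status(declared_length, body):
    """Classify a fetched response body against the server's declared
    Content-Length (pass None when the header was absent)."""
    if declared_length is None:
        return "unknown"     # chunked/streamed: no declared length to check
    received = len(body)
    if received == declared_length:
        return "complete"
    return "truncated" if received < declared_length else "overlong"
```

A "truncated" result alone cannot distinguish a publisher error from a network one; as the bullet above notes, that distinction only emerges statistically, from where along the route the errors cluster.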
Extra-territoriality

For most countries, the publishers of the vast majority of the content to which they subscribe are located in, or have substantial business interests in, the USA. Interestingly, the USA has long had a National Hosting program. Los Alamos National Labs collects all e-journals Federal researchers could access and re-hosts them. The goal is to ensure that details of what their researchers, possibly working on classified projects, look at is not visible outside the Federal government.
Edward Snowden's revelations make it clear that the NSA and GCHQ devote massive efforts to ensuring that they can capture Internet traffic, especially traffic that crosses national borders. This includes traffic to e-journal publishers. Details of almost everything your national researchers look at are visible to US and UK authorities, who are not above using this access for commercial advantage. And, of course, many academic publishers are based, or have much of their infrastructure, in the US or the UK, so they can be compelled to turn over access information even if the traffic is encrypted.
An even greater concern is that the recent legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. The EU has also passed laws claiming extra-territorial jurisdiction over data, so is in a poor position to object to US claims. Note that software is also data in this context.
Thus the access rights notionally conveyed by subscription agreements are actually subject to a complex, opaque and evolving legal framework to which the subscriber, and indeed the government footing the bill, are not parties. The US is claiming the unilateral right to terminate access to content held anywhere by any organization with significant business in the US. This clearly includes all major e-journal publishers, such as EBSCO, Elsevier, Wiley, Taylor & Francis, Nature, AAAS and JSTOR.
Tracking Value For Money

As governments, and thus taxpayers, are the ultimate source of most of the approximately €7.6B/yr spent on academic publishing, concern for the value obtained in return is natural. Many countries, such as the UK, negotiate nationally with publishers. Institutions obtain access by opting in to the national license terms and paying the publisher. The institution can obtain COUNTER usage statistics from the publisher, so can determine its local cost per usage. But the opt-in license and direct institution-to-publisher interaction make it difficult to aggregate this information at a national level.
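Given COUNTER usage totals the local arithmetic is trivial; the difficulty the paragraph describes is assembling the per-institution inputs nationally. A minimal sketch, with invented figures and a deliberately simplified data layout:

```python
def cost_per_use(annual_cost, counter_uses):
    """Local cost per use from a COUNTER usage total (e.g. full-text
    requests). Returns None when no uses were recorded."""
    return None if counter_uses == 0 else annual_cost / counter_uses

def national_cost_per_use(institutions):
    """Aggregate [(annual_cost, counter_uses), ...] across institutions
    into one national figure -- the aggregation that opt-in licenses and
    direct institution-to-publisher reporting make hard in practice."""
    total_cost = sum(cost for cost, _ in institutions)
    total_uses = sum(uses for _, uses in institutions)
    return cost_per_use(total_cost, total_uses)
```

The point of the sketch is that the national figure needs every institution's (cost, usage) pair in one place, which is exactly the data the current arrangement scatters.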
Preserving National Open-Access Output

Governments also fund most of the research that appears in smaller, locally published open-access journals. Much of this is, for content or language reasons, of little interest outside the country of origin. It is thus unlikely to be preserved unless NHNs, such as Brazil's Cariniana, take on the task.
What Does National Hosting Need To Do?

To address these issues, the goals of a National Hosting system are to ensure that:
- Copies of subscribed content are maintained by national institutions within the national boundaries:
  - using software that is open source, or of national origin;
  - upon terms allowing institutions access to content to which they subscribed if, for any reason, it is not available to them from the publisher.

This implies three requirements:
1. A database (the entitlement registry) tracking the content to which, at any given time, each institution subscribes.
2. A system for collecting and preserving for future use the content to which any institution subscribes.
3. A system for delivering content to readers that they cannot access from the publisher.
How Does LOCKSS Implement National Hosting?

The LOCKSS Program at the Stanford Libraries started in 1998. It was the first to demonstrate practical technology for e-journal archiving, the first to enter production, and the first to achieve economic sustainability without grant funding (in 2007). The basic idea was to restore libraries' ability to purchase rather than lease content, by building technology that worked for the Web as the stacks worked for paper.
Libraries run a "LOCKSS box", a low-cost computer that uses open-source software. Just as in the paper world, each library's box individually collects the subscribed content by crawling the publisher's Web site. Boxes extract Dublin Core and DOI metadata so that the library's readers (but no-one else) can access the box's content if for any reason it is ever not available from the publisher. Boxes holding the same content cooperate to:
- Detect and correct any content that a box failed to collect completely.
- Detect and repair any damage to content over time.
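The cooperation between boxes can be pictured as a hash vote: boxes holding the same content unit compare cryptographic hashes, and a box whose hash disagrees with the majority knows its copy needs repair. This is a deliberately simplified sketch; the real LOCKSS polling protocol is far more elaborate, using nonced, sampled polls to resist attack:

```python
import hashlib
from collections import Counter

def vote(copies):
    """copies maps box name -> that box's bytes for one content unit.
    Returns (majority hash, list of boxes whose copy disagrees and
    should fetch a repair from a box in the majority)."""
    hashes = {box: hashlib.sha256(data).hexdigest()
              for box, data in copies.items()}
    majority, _ = Counter(hashes.values()).most_common(1)[0]
    damaged = [box for box, h in hashes.items() if h != majority]
    return majority, damaged
```

The same comparison serves both bullets above: a box that failed to collect a unit completely, and a box whose stored copy later decayed, both show up as disagreeing with the majority.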
Full details of how the CLOCKSS archive works are in the documentation that supported its recent successful TRAC audit. Briefly, CLOCKSS ingests content in two ways:
- A network of 4 LOCKSS boxes harvests content from publishers' Web sites and collaborates to detect and correct collection failures.
- Some publishers supply content via file transfer to a machine which adds metadata and packages it.
Requirement #1, the entitlement registry, is a database of quads [Institution, Journal ID, start date, end date]. The UK's SafeNet project has set one up, and is working to populate it with subscription information. They have defined an API to this database. Countries that already have a suitable database can implement this API; countries that don't can use the UK's open-source implementation.
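A registry of such quads, and the entitlement check an access system would run against it, can be sketched as follows. The class and field names are illustrative, not SafeNet's actual schema or API:

```python
from datetime import date

class EntitlementRegistry:
    """Minimal in-memory registry of
    [Institution, Journal ID, start date, end date] quads."""

    def __init__(self):
        self.quads = []

    def add(self, institution, journal_id, start, end):
        self.quads.append((institution, journal_id, start, end))

    def entitled(self, institution, journal_id, pub_date):
        """Was the institution subscribed to the journal when the
        requested item was published?"""
        return any(inst == institution and jid == journal_id
                   and start <= pub_date <= end
                   for inst, jid, start, end in self.quads)
```

For example, an institution whose quad covers 2000-2010 is entitled to a 2005 article but not a 2012 one; PCA is exactly the case where the quad's date range covers the item but the current subscription does not.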
The LOCKSS software is being enhanced to query the entitlement registry via this API before supplying content to readers, thus satisfying requirement #3 above. Readers can access the content via their institution's OpenURL resolver, or via a national DOI resolver.
A Typical National Hosting Network
The file transfer ingest pipeline for CLOCKSS is shown in this diagram; the pipeline for an NHN would be similar. For security reasons, content that the publisher uploads to a CLOCKSS FTP server goes to an isolated machine dedicated to that purpose, and is then downloaded from it to the main ingest machine via rsync. Content that publishers make available on their own FTP servers, or via rsync, is downloaded to the main ingest machine directly. Two backup copies, one on-site and one off-site, are maintained via rsync to ensure that the content is safe while in transit to the production LOCKSS boxes.
The CLOCKSS archive ingests harvest content using a separate ingest network of 4 machines which crawl the publishers' sites, and are then crawled by the 12 production CLOCKSS boxes. The boxes in an NHN would typically harvest content directly, by crawling the publishers' sites, so an ingest network would not be needed.
The CLOCKSS Archive is dark; the content in production CLOCKSS boxes is never accessed by readers. If content is ever triggered, the CLOCKSS team executes a trigger process to copy content out of the network to a set of triggered content servers, which make it openly accessible. NHNs need to make their content accessible only to authorized users, so they need an access path similar to that for LOCKSS. It must consult the entitlement registry to determine whether the requested content was covered by the requester's institutional subscription.
The LOCKSS software can disseminate harvest content in four ways:
- By acting as an HTTP proxy, in which case the reader whose browser is proxied via their institution's LOCKSS box will access the content at the publisher's URL but, if the publisher does not supply the content, it will be supplied by the LOCKSS box.
- By acting as an HTTP server, in which case the reader will access the content at a URL pointing to the LOCKSS box but including the publisher's URL.
- By resolving OpenURL queries, using either the LOCKSS daemon's internal resolver, or an institutional resolver such as SFX. LOCKSS boxes can output KBART reports detailing their holdings, which can be input to an institutional resolver. The LOCKSS box will then appear as an alternate source for the resolved content. Access will be at a URL pointing to the LOCKSS box.
- By resolving DOI queries, using the LOCKSS daemon's internal resolver. Access will be at a URL pointing to the LOCKSS box.
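At their core, methods 3 and 4 both amount to matching citation metadata against the box's holdings and returning a local URL. A toy sketch of such a resolver, with invented field names and holdings (not the LOCKSS daemon's actual resolver):

```python
def resolve(params, holdings):
    """params: citation keys from an OpenURL or DOI query,
    e.g. {"issn": ..., "volume": ..., "spage": ...} or {"doi": ...}.
    holdings: list of dicts describing content units on the box.
    Returns the local URL of the first matching unit, or None."""
    for item in holdings:
        if all(item.get(key) == value for key, value in params.items()):
            return item["url"]
    return None
```

An institutional resolver fed a KBART holdings report does essentially this lookup on the institution's behalf, listing the LOCKSS box as one source among several.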
Methods 1 & 2 rely on knowing the original publisher's URL for the content. For file transfer content this is generally not available, so method 3 or 4 must be used. The format in which a publisher supplies file transfer content varies, but it generally includes metadata sufficient to resolve OpenURL or DOI queries, and a PDF that can be returned as a result.