Tuesday, December 8, 2015

National Hosting with LOCKSS Technology

For some years now the LOCKSS team has been working with countries to implement National Hosting of electronic resources, including subscription e-journals and e-books. JISC's SafeNet project in the UK is an example. Below the fold I look at the why, what and how of these systems.

Why National Hosting?

The details of how universities, research institutes and libraries acquire subscription electronic resources vary considerably between countries. But in most cases they share an important feature: the money may flow through the system in different ways, but it starts from the government. This leads governments to focus on a number of issues in the system.

Post-Cancellation Access

In the paper world, a library's subscription to a journal purchased a copy of the content. It provided the library's readers with access until the library decided to de-accession it, and it could be used to assist other libraries via inter-library loan. The advent of the Web made journal content vastly more accessible and useful, but it forced libraries to switch from purchasing a copy to leasing access to the publisher's copy. If the library stopped paying the subscription, its readers lost access not just to content published in the future, but also to content published in the past while the subscription was being paid. This problem became known as post-cancellation access (PCA), and elicited over time a number of different responses:
  • Some publishers promised to provide PCA themselves, providing access to paid-for content without further payment, but librarians were rightly skeptical of these unfunded promises.
  • The enthusiasm for open access led HighWire Press to pioneer and some publishers to adopt the moving wall concept. Because the value of content decays through time, providing open access to content older than, for example, 12 months does little to reduce the motivation for libraries to subscribe, while rendering PCA less of an issue.
  • Several efforts took place to establish e-journal archives, in order that PCA not be at the whim of the publisher. I posted a historical overview of e-journal archiving to my blog back in 2011.
It is becoming clear that the sweet spot for providing PCA is at a national level, matching the source of most of the subscription funds. A global third-party archive such as ITHAKA's Portico poses essentially the same problem as the original publishers do: access is contingent on continued subscription payment. National archives match national publisher licensing, and are large enough to operate efficiently.

Fault-Tolerant Access

It is tempting to assume that once the subscription has been paid the content will be available from the publisher, but there are many reasons this may not always be the case:
  • Publishers' subscription management and access systems are typically separate, and errors can occur as information is transferred between them that deny subscribers access.
  • Publishers' access systems are not completely reliable. Among the publishers who have recently suffered significant outages are JSTOR, Elsevier, Springer, Nature, IEEE, Taylor & Francis, and Oxford.
  • The Domain Name System that links readers via URLs to publishers is brittle, subject to failures such as that which took down DOI resolution, and which enable journal websites to be hijacked.
  • Publishers often fail to maintain their URL space consistently, leading to the reference rot and content drift that impacts at least 20% of journal articles.
  • The Internet that connects readers to publishers' Web sites is a best-efforts network passing through many links and routers on the way. There are no guarantees that requested content will be delivered completely, or at all. The LOCKSS system's crawlers constantly detect complete or partial failure to deliver content. These could be caused by errors at the publisher, or along the network path to the crawler. Experience suggests that the fewer hops along the route the fewer errors, so at least some errors are network problems.
The ability to fail over from the publisher to a National Hosting Network (NHN), which is closer to the readers, and operates independently of the publisher, can significantly enhance the availability of the content.


For most countries, the publishers of the vast majority of the content to which they subscribe are located in, or have substantial business interests in, the USA. Interestingly, the USA has long had a National Hosting program. Los Alamos National Labs collects all e-journals Federal researchers could access and re-hosts them. The goal is to ensure that details of what their researchers, possibly working on classified projects, look at are not visible outside the Federal government.

Edward Snowden's revelations make it clear that the NSA and GCHQ devote massive efforts to ensuring that they can capture Internet traffic, especially traffic that crosses national borders. This includes traffic to e-journal publishers. Details of almost everything your national researchers look at are visible to US and UK authorities, who are not above using this access for commercial advantage. And, of course, many academic publishers are based, or have much of their infrastructure, in the US or the UK, so they can be compelled to turn over access information even if the traffic is encrypted.

An even greater concern is that the recent legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. The EU has also passed laws claiming extra-territorial jurisdiction over data, so is in a poor position to object to US claims. Note that software is also data in this context.

Thus the access rights notionally conveyed by subscription agreements are actually subject to a complex, opaque and evolving legal framework to which the subscriber, and indeed the government footing the bill, are not parties. The US is claiming the unilateral right to terminate access to content held anywhere by any organization with significant business in the US. This clearly includes all major e-journal publishers, such as EBSCO, Elsevier, Wiley, Taylor & Francis, Nature, AAAS and JSTOR.

Tracking Value For Money

As governments, and thus taxpayers, are the ultimate source of most of the approximately €7.6B/yr spent on academic publishing, concern for the value obtained in return is natural. Many countries, such as the UK, negotiate nationally with publishers. Institutions obtain access by opting-in to the national license terms and paying the publisher. The institution can obtain COUNTER usage statistics from the publisher, so can determine its local cost per usage. But the opt-in license and direct institution-to-publisher interaction make it difficult to aggregate this information at a national level.

Preserving National Open-Access Output

Governments also fund most of the research that appears in smaller, locally published open-access journals. Much of this is, for content or language reasons, of little interest outside the country of origin. It is thus unlikely to be preserved unless NHNs, such as Brazil's Cariniana, take on the task.

What Does National Hosting Need To Do?

To address these issues, the goals of a National Hosting system are to ensure that:
  • Copies of subscribed content are maintained by national institutions within the national boundaries.
  • Using software that is open source, or of national origin.
  • Upon terms allowing access by institutions to content to which they subscribed if, for any reason, it is not available to them from the publisher.
Accomplishing these goals requires:
  1. A database (the entitlement registry) tracking the content to which, at any given time, each institution subscribes.
  2. A system for collecting and preserving for future use the content to which any institution subscribes.
  3. A system for delivering content to readers that they cannot access from the publisher.
A number of countries are at various stages of implementing NHNs using the LOCKSS technology to satisfy these requirements.

How Does LOCKSS Implement National Hosting?

The LOCKSS Program at the Stanford Libraries started in 1998. It was the first to demonstrate practical technology for e-journal archiving, the first to enter production, and the first to achieve economic sustainability without grant funding (in 2007). The basic idea was to restore libraries' ability to purchase rather than lease content, by building technology that worked for the Web as the stacks worked for paper.

Libraries run a "LOCKSS box", a low-cost computer that uses open-source software. Just as in the paper world, each library's box individually collects the subscribed content by crawling the publisher's Web site. Boxes extract Dublin Core and DOI metadata so that the library's readers (but no-one else) can access the box's content if for any reason it is ever not available from the publisher. Boxes holding the same content cooperate to:
  • Detect and correct any content that a box failed to collect completely.
  • Detect and repair any damage to content over time.
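The cooperation between boxes can be illustrated with a deliberately simplified sketch. The real LOCKSS polling protocol is considerably more elaborate; the code below only shows the underlying idea of comparing hashes of each box's copy and repairing a disagreeing box from the majority. The function names and the in-memory `copies` dictionary are assumptions made for illustration, not part of the LOCKSS software.

```python
import hashlib
from collections import Counter


def content_hash(data: bytes) -> str:
    """Return a SHA-256 digest of one box's copy of a URL's content."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Compare the copies held by each box and repair the minority.

    `copies` maps a (hypothetical) box name to the bytes it holds for one
    URL; an empty bytes value can stand in for a failed collection.
    Returns the names of the boxes whose copy was replaced.
    """
    hashes = {box: content_hash(data) for box, data in copies.items()}
    winner, votes = Counter(hashes.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        # No strict majority agrees, so no copy can be trusted for repair.
        return []
    good = next(data for box, data in copies.items() if hashes[box] == winner)
    repaired = []
    for box in list(copies):
        if hashes[box] != winner:
            copies[box] = good  # overwrite the damaged or incomplete copy
            repaired.append(box)
    return repaired
```

A strict majority is required before any repair so that a single damaged copy cannot overwrite good ones.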
Most publishers allowed libraries to use LOCKSS but, for various reasons, some of the large publishers were not happy with this model. They took the initiative to set up the CLOCKSS archive, which uses the same technology to implement a dark archive holding each publisher's total output in a network of currently 12 boxes around the world. A National Hosting Network would be similar, but with perhaps half as many boxes.

Full details of how the CLOCKSS archive works are in the documentation that supported its recent successful TRAC audit. Briefly, CLOCKSS ingests content in two ways:
  • A network of 4 LOCKSS boxes harvests content from publishers' Web sites and collaborates to detect and correct collection failures.
  • Some publishers supply content via file transfer to a machine which adds metadata and packages it.
The LOCKSS daemon software on each of the 12 production boxes then collects both kinds of content from the ingest machines by crawling them, under control of the appropriate plugin. Subject to publisher agreement, the same ingest technology can be used to ingest content into NHNs. This would satisfy requirement #2 above.

Requirement #1, setting up an entitlement registry, involves setting up a database of quads [Institution, Journal ID, start date, end date]. The UK's SafeNet project has done so, and is working to populate it with subscription information. They have defined an API to this database. Countries that already have a suitable database can implement this API; countries that don't can use the UK's open-source implementation.
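As a rough illustration of the quad structure, here is a minimal in-memory sketch of such a registry. It is not the SafeNet API, whose details are not given here; the class and method names are hypothetical, and treating a missing end date as "subscription still current" is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Entitlement:
    """One quad: [Institution, Journal ID, start date, end date]."""
    institution: str
    journal_id: str
    start: date
    end: Optional[date]  # assumption: None means the subscription is ongoing


class EntitlementRegistry:
    """Hypothetical in-memory stand-in for a national entitlement database."""

    def __init__(self) -> None:
        self._quads: list[Entitlement] = []

    def add(self, quad: Entitlement) -> None:
        self._quads.append(quad)

    def is_entitled(self, institution: str, journal_id: str,
                    published: date) -> bool:
        """True if the institution subscribed when the content was published."""
        return any(
            q.institution == institution
            and q.journal_id == journal_id
            and q.start <= published
            and (q.end is None or published <= q.end)
            for q in self._quads
        )
```

A query then asks whether a given institution's subscription covered the publication date of the requested content, which is the question post-cancellation access turns on.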

The LOCKSS software is being enhanced to query the entitlement registry via this API before supplying content to readers, thus satisfying requirement #3 above. Readers can access the content via their institution's OpenURL resolver, or via a national DOI resolver.

A Typical National Hosting Network

This diagram shows the configuration of the ingest and preservation aspects of a typical Private LOCKSS Network (PLN). This one has a configuration server controlling a network of 6 LOCKSS boxes, each collecting its content from a content source, and communicating with the other boxes to detect and correct any missing or damaged content. LOCKSS-based NHNs are a type of PLN, and typically would look exactly like the diagram, with 6 or more boxes scattered around the country managed by a central configuration server.

Just as with the CLOCKSS Archive, some content for an NHN would be available via harvest, and some via file transfer. The file transfer ingest pipeline for CLOCKSS is shown in this diagram; the pipeline for an NHN would be similar. For security reasons, content that the publisher uploads to a CLOCKSS FTP server goes to an isolated machine used for that purpose only, and is then downloaded from it to the main ingest machine via rsync. Content that publishers make available on their FTP servers, or via rsync, is downloaded to the main ingest machine directly. Two backup copies, one on-site and one off-site, are maintained via rsync to ensure that the content is safe while in transit to the production LOCKSS boxes.

The CLOCKSS archive ingests harvest content using a separate ingest network of 4 machines which crawl the publishers' sites, and are then crawled by the 12 production CLOCKSS boxes. The boxes in an NHN would typically harvest content directly, by crawling the publishers' sites, so an ingest network would not be needed.

The CLOCKSS Archive is dark; the content in production CLOCKSS boxes is never accessed by readers. If content is ever triggered, the CLOCKSS team executes a trigger process to copy content out of the network to a set of triggered content servers, which make it openly accessible. NHNs need to make their content accessible only to authorized users, so they need an access path similar to that for LOCKSS. It must consult the entitlement registry to determine whether the requested content was covered by the requester's institutional subscription.

The LOCKSS software can disseminate harvest content in four ways:
  1. By acting as an HTTP proxy, in which case the reader whose browser is proxied via their institution's LOCKSS box will access the content at the publisher's URL but, if the publisher does not supply the content, it will be supplied by the LOCKSS box.
  2. By acting as an HTTP server, in which case the reader will access the content at a URL pointing to the LOCKSS box but including the publisher's URL.
  3. By resolving OpenURL queries, using either the LOCKSS daemon's internal resolver, or an institutional resolver such as SFX. LOCKSS boxes can output KBART reports detailing their holdings, which can be input to an institutional resolver. The LOCKSS box will then appear as an alternate source for the resolved content. Access will be at a URL pointing to the LOCKSS box.
  4. By resolving DOI queries, using the LOCKSS daemon's internal resolver. Access will be at a URL pointing to the LOCKSS box.
In each case, the LOCKSS box is configured by default to forward the request to the publisher's Web site and supply content only if the publisher does not.
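The default "publisher first, box second" behavior can be sketched as follows. This is an illustration of the fail-over logic only, not the LOCKSS daemon's code; the `serve` function and the two fetcher callables are hypothetical names, and treating an `OSError` or a `None` result as "the publisher did not supply the content" is an assumption.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns the content, or None if it has none.
Fetcher = Callable[[str], Optional[bytes]]


def serve(url: str,
          fetch_from_publisher: Fetcher,
          fetch_from_box: Fetcher) -> tuple[str, Optional[bytes]]:
    """Forward the request to the publisher; fall back to the preserved copy.

    Returns a (source, content) pair where source is "publisher" or "lockss".
    """
    try:
        body = fetch_from_publisher(url)
    except OSError:
        # Assumption: network-level failures surface as OSError.
        body = None
    if body is not None:
        return ("publisher", body)
    # The publisher did not supply the content, so use the box's copy.
    return ("lockss", fetch_from_box(url))
```

Passing the fetchers in as parameters keeps the sketch independent of any particular HTTP library and makes the fail-over easy to exercise without a network.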

Methods 1 & 2 rely on knowing the original publisher's URL for the content. For file transfer content, this is generally not available, so method 3 or 4 must be used. The format in which a publisher supplies file transfer content varies, but it generally includes metadata sufficient to resolve OpenURL or DOI queries, and a PDF that can be returned as a result.


David. said...

The issue of extra-territoriality is getting attention. On December 7 T-Systems, a subsidiary of Deutsche Telekom, announced a public cloud service hosted exclusively in Germany called Intercloud:

"Intercloud’s security architecture is also in line with strict German and European data privacy laws and in compliance with the highest standards. Now that the European Court of Justice has declared the Safe Harbor Agreement between Europe and the US null and void, DSI Intercloud by Cisco and T-Systems is the secure German IaaS service from the cloud."

David. said...

Andrew Orlowski at The Register reports some interesting comments from Brad Smith:

Europeans should sit up and take more notice of Microsoft’s lawsuit against the US government over secret access to their data.

Why? Because it affects much more of their data than the Safe Harbour case, according to Microsoft president and lead counsel Brad Smith.

“The Department of Justice does not need to wait for data to come to the United States to examine it,” he explained. “It can force countries to give it your data without disclosing that access to government, or complying with any European law.”

Smith said 90 per cent of Europeans' data is affected by the Irish warrant case; far more data than is affected by the transatlantic flows governed by safe harbour rules, which Austrian Max Schrems exploded in a European court ruling last year.

David. said...

If you don't understand why the US government is fighting the Microsoft case so hard, you need to read Martin Luther King Jr., Subversives, and the PATRIOT Dragnet by the invaluable Marcy Wheeler. They can't store "all that stuff", so (a) they need to be selective about what they do store:

"I’ve been trying to explain, even to civil liberties supporters, why the current 2-degree targeted dragnet is still too invasive of privacy. We’ve been having this discussion for 2.5 years, and yet still most people don’t care that completely innocent people 2 degrees — 3, until 2014 — away from someone the government has a traffic-stop level of suspicion over will be subjected to the NSA’s “full analytic tradecraft.”"

and (b) they need a way to recover if the selection process failed by getting it from someone who has stored it, in this case Microsoft in Ireland.

David. said...

Andrew Orlowski at The Register has an interview with John Frank, Microsoft's legal strategist for the case of the Dublin e-mails.

David. said...

Add Portico to the list of sites suffering outages; it was experiencing some technical problems this morning.

David. said...

OUP is down again today.

David. said...

For the second time in the last few months the UK's JANET academic network was DDOS-ed, meaning that for 5 days access to journals was problematic if they were hosted outside the UK.

David. said...

The Second Circuit has ruled in Microsoft's favor in the case of the e-mails in Ireland:

"United States Court of Appeals for the Second Circuit reversed a lower court’s ruling that Microsoft must turn over email communications for a suspect in a narcotics investigation stored in a Microsoft data center in Dublin."

The precedent that the Justice Dept wanted was:

"that Microsoft’s status as a company based in the United States gave it authority to obtain its data, even if the data was stored outside the country."

Given the determination the Justice Dept. has shown so far:

"The government is likely to appeal the ruling, legal experts said."

David. said...

Not just appeal, but also propose legislation to overturn the court's decision.

David. said...

Yes, the US government has appealed the Second Circuit's decision:

"Federal prosecutors in New York late Thursday asked a federal appeals court to reconsider its July decision that allowed Microsoft to successfully claim that authorities had no legal right to access data stored on its servers outside the country, even with a warrant from a federal judge.

A three-judge panel of the 2nd US Circuit Court of Appeals had ruled that federal law, notably the Stored Communications Act, allows US authorities to seize content in US-based servers, but not in overseas servers—in this case, Dublin, Ireland."

The brief is here.