Tuesday, December 8, 2015

National Hosting with LOCKSS Technology

For some years now the LOCKSS team has been working with countries to implement National Hosting of electronic resources, including subscription e-journals and e-books. JISC's SafeNet project in the UK is an example. Below the fold I look at the why, what and how of these systems.

Why National Hosting?

The details of how universities, research institutes and libraries acquire subscription electronic resources vary considerably between countries. But in most cases they share an important feature: however the money flows through the system, it starts from the government. This leads governments to focus on a number of issues in the system.

Post-Cancellation Access

In the paper world, a library's subscription to a journal purchased a copy of the content. It provided the library's readers with access until the library decided to de-accession it, and it could be used to assist other libraries via inter-library loan. The advent of the Web made journal content vastly more accessible and useful, but it forced libraries to switch from purchasing a copy to leasing access to the publisher's copy. If the library stopped paying the subscription, its readers lost access not just to content published in the future, but also to content published in the past while the subscription was being paid. This problem became known as post-cancellation access (PCA), and over time it elicited a number of different responses:
  • Some publishers promised to provide PCA themselves, providing access to paid-for content without further payment, but librarians were rightly skeptical of these unfunded promises.
  • The enthusiasm for open access led HighWire Press to pioneer and some publishers to adopt the moving wall concept. Because the value of content decays through time, providing open access to content older than, for example, 12 months does little to reduce the motivation for libraries to subscribe, while rendering PCA less of an issue.
  • Several efforts took place to establish e-journal archives, in order that PCA not be at the whim of the publisher. I posted a historical overview of e-journal archiving to my blog back in 2011.
It is becoming clear that the sweet spot for providing PCA is at a national level, matching the source of most of the subscription funds. A global third-party archive such as ITHAKA's Portico poses essentially the same problem as the original publishers do: access is contingent on continued subscription payment. National archives match national publisher licensing, and are large enough to operate efficiently.

Fault-Tolerant Access

It is tempting to assume that once the subscription has been paid the content will be available from the publisher, but there are many reasons this may not always be the case:
  • Publishers' subscription management and access systems are typically separate, and errors in transferring information between them can deny subscribers access.
  • Publishers' access systems are not completely reliable. Among the publishers who have recently suffered significant outages are JSTOR, Elsevier, Springer, Nature, IEEE, Taylor & Francis, and Oxford.
  • The Domain Name System that links readers via URLs to publishers is brittle, subject to failures such as the one that took down DOI resolution, and to attacks that enable journal websites to be hijacked.
  • Publishers often fail to maintain their URL space consistently, leading to the reference rot and content drift that impact at least 20% of journal articles.
  • The Internet that connects readers to publishers' Web sites is a best-effort network passing through many links and routers on the way. There are no guarantees that requested content will be delivered completely, or at all. The LOCKSS system's crawlers constantly detect complete or partial failure to deliver content. These could be caused by errors at the publisher, or along the network path to the crawler. Experience suggests that the fewer hops along the route the fewer errors, so at least some errors are network problems.
The ability to fail over from the publisher to a National Hosting Network (NHN), which is closer to the readers, and operates independently of the publisher, can significantly enhance the availability of the content.

Extra-territoriality

For most countries, the publishers of the vast majority of the content to which they subscribe are located in, or have substantial business interests in, the USA. Interestingly, the USA has long had a National Hosting program of its own: Los Alamos National Laboratory collects the e-journals to which Federal researchers have access and re-hosts them. The goal is to ensure that details of what those researchers, possibly working on classified projects, look at are not visible outside the Federal government.

Edward Snowden's revelations make it clear that the NSA and GCHQ devote massive efforts to ensuring that they can capture Internet traffic, especially traffic that crosses national borders. This includes traffic to e-journal publishers. Details of almost everything your national researchers look at are visible to US and UK authorities, who are not above using this access for commercial advantage. And, of course, many academic publishers are based, or have much of their infrastructure, in the US or the UK, so they can be compelled to turn over access information even if the traffic is encrypted.

An even greater concern is that the recent legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. The EU has also passed laws claiming extra-territorial jurisdiction over data, so is in a poor position to object to US claims. Note that software is also data in this context.

Thus the access rights notionally conveyed by subscription agreements are actually subject to a complex, opaque and evolving legal framework to which the subscriber, and indeed the government footing the bill, are not parties. The US is claiming the unilateral right to terminate access to content held anywhere by any organization with significant business in the US. This clearly includes all major e-journal publishers, such as EBSCO, Elsevier, Wiley, Taylor & Francis, Nature, AAAS and JSTOR.

Tracking Value For Money

As governments, and thus taxpayers, are the ultimate source of most of the approximately €7.6B/yr spent on academic publishing, concern for the value obtained in return is natural. Many countries, such as the UK, negotiate nationally with publishers. Institutions obtain access by opting in to the national license terms and paying the publisher. The institution can obtain COUNTER usage statistics from the publisher, so can determine its local cost per usage. But the opt-in license and direct institution-to-publisher interaction make it difficult to aggregate this information at a national level.

Preserving National Open-Access Output

Governments also fund most of the research that appears in smaller, locally published open-access journals. Much of this is, for content or language reasons, of little interest outside the country of origin. It is thus unlikely to be preserved unless NHNs, such as Brazil's Cariniana, take on the task.

What Does National Hosting Need To Do?

To address these issues, the goals of a National Hosting system are to ensure that:
  • Copies of subscribed content are maintained by national institutions within the national boundaries.
  • The software used is open source, or of national origin.
  • The terms allow institutions access to content to which they subscribed if, for any reason, it is not available to them from the publisher.
Accomplishing these goals requires:
  1. A database (the entitlement registry) tracking the content to which, at any given time, each institution subscribes.
  2. A system for collecting and preserving for future use the content to which any institution subscribes.
  3. A system for delivering content to readers that they cannot access from the publisher.
A number of countries are at various stages of implementing NHNs using the LOCKSS technology to satisfy these requirements.

How Does LOCKSS Implement National Hosting?

The LOCKSS Program at the Stanford Libraries started in 1998. It was the first to demonstrate practical technology for e-journal archiving, the first to enter production, and the first to achieve economic sustainability without grant funding (in 2007). The basic idea was to restore libraries' ability to purchase rather than lease content, by building technology that worked for the Web as the stacks worked for paper.

Libraries run a "LOCKSS box", a low-cost computer that uses open-source software. Just as in the paper world, each library's box individually collects the subscribed content by crawling the publisher's Web site. Boxes extract Dublin Core and DOI metadata so that the library's readers (but no one else) can access the box's content if for any reason it is ever not available from the publisher. Boxes holding the same content cooperate, as sketched below, to:
  • Detect and correct any content that a box failed to collect completely.
  • Detect and repair any damage to content over time.
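
Here is a toy sketch of that cooperation: boxes holding the same archival unit compare hashes of their copies, and any box in the minority repairs from a peer that voted with the majority. The real LOCKSS polling protocol is a far more sophisticated nonce-based voting scheme designed to resist attack, so everything below, names included, is a simplification for illustration:

    import hashlib
    from collections import Counter

    def digest(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def poll_and_repair(boxes: dict[str, bytes]) -> dict[str, bytes]:
        """boxes maps a box's name to its copy of one archival unit."""
        # Each box "votes" with the hash of its copy; the majority wins.
        tally = Counter(digest(copy) for copy in boxes.values())
        winner = tally.most_common(1)[0][0]
        good = next(c for c in boxes.values() if digest(c) == winner)
        # Boxes in the minority fetch a repair from a majority peer.
        return {name: (copy if digest(copy) == winner else good)
                for name, copy in boxes.items()}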
Most publishers allowed libraries to use LOCKSS but, for various reasons, some of the large publishers were not happy with this model. They took the initiative to set up the CLOCKSS archive, which uses the same technology to implement a dark archive holding each publisher's total output in a network of currently 12 boxes around the world. A National Hosting Network would be similar, but with perhaps half as many boxes.

Full details of how the CLOCKSS archive works are in the documentation that supported its recent successful TRAC audit. Briefly, CLOCKSS ingests content in two ways:
  • A network of 4 LOCKSS boxes harvests content from publishers' Web sites and collaborates to detect and correct collection failures.
  • Some publishers supply content via file transfer to a machine which adds metadata and packages it.
The LOCKSS daemon software on each of the 12 production boxes then collects both kinds of content from the ingest machines by crawling them, under control of the appropriate plugin. Subject to publisher agreement, the same ingest technology can be used to ingest content into NHNs. This would satisfy requirement #2 above.
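
The plugin is what tells the daemon which URLs a crawl may collect. The sketch below reduces a plugin to a first-match-wins list of include/exclude rules, which is the essence of LOCKSS crawl rules; real plugins are XML definitions interpreted by the daemon, and the publisher URLs here are invented:

    import re

    # Hypothetical crawl rules for one publisher; first match wins.
    INCLUDE, EXCLUDE = "include", "exclude"
    CRAWL_RULES = [
        (EXCLUDE, r".*\.(css|js|png|gif)$"),  # skip site furniture
        (INCLUDE, r"^https?://publisher\.example/lockss-manifest/"),
        (INCLUDE, r"^https?://publisher\.example/toc/2015/"),
        (INCLUDE, r"^https?://publisher\.example/article/2015/"),
    ]

    def should_collect(url: str) -> bool:
        """URLs matching no rule are excluded from the crawl."""
        for action, pattern in CRAWL_RULES:
            if re.match(pattern, url):
                return action == INCLUDE
        return False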

Requirement #1, the entitlement registry, involves setting up a database of quads [Institution, Journal ID, start date, end date]. The UK's SafeNet project has done so, and is working to populate it with subscription information. They have defined an API to this database. Countries that already have a suitable database can implement this API; countries that don't can use the UK's open-source implementation.
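
The core of such a registry is small. Here is a minimal sketch of the quad store and the lookup it must support; SafeNet's actual schema and API differ, and the identifiers below are invented:

    from datetime import date

    # Each quad: (institution, journal ID, start date, end date).
    ENTITLEMENTS = [
        ("inst:example-u", "issn:1234-5678", date(2003, 1, 1), date(2015, 12, 31)),
        ("inst:another-u", "issn:1234-5678", date(2010, 1, 1), date(2012, 6, 30)),
    ]

    def is_entitled(institution: str, journal: str, pub_date: date) -> bool:
        """Did the institution's subscription cover the publication date?"""
        return any(inst == institution and jid == journal
                   and start <= pub_date <= end
                   for inst, jid, start, end in ENTITLEMENTS)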

The LOCKSS software is being enhanced to query the entitlement registry via this API before supplying content to readers, thus satisfying requirement #3 above. Readers can access the content via their institution's OpenURL resolver, or via a national DOI resolver.

A Typical National Hosting Network

This diagram shows the configuration of the ingest and preservation aspects of a typical Private LOCKSS Network (PLN). This one has a configuration server controlling a network of 6 LOCKSS boxes, each collecting its content from a content source, and communicating with the other boxes to detect and correct any missing or damaged content. LOCKSS-based NHNs are a type of PLN, and would typically look much like the diagram, with 6 or more boxes scattered around the country managed by a central configuration server.

Just as with the CLOCKSS Archive, some content for an NHN would be available via harvest, and some via file transfer. The file transfer ingest pipeline for CLOCKSS is shown in this diagram; the pipeline for an NHN would be similar. For security reasons, content that the publisher uploads to a CLOCKSS FTP server goes to an isolated machine dedicated to that purpose, and is then downloaded from it to the main ingest machine via rsync. Content that publishers make available on their FTP servers, or via rsync, is downloaded to the main ingest machine directly. Two backup copies, one on-site and one off-site, are maintained via rsync to ensure that the content is safe while in transit to the production LOCKSS boxes.
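
Expressed as the transfers involved, the pipeline reduces to a few rsync steps. This is a sketch under the assumption that rsync is used throughout; all host and path names are invented, and the real CLOCKSS pipeline is driven by its own ingest scripts:

    import subprocess

    # Each step is one rsync transfer in the hypothetical pipeline.
    STEPS = [
        # Pull publisher uploads from the isolated FTP landing machine.
        ["rsync", "-a", "ftp-landing:/incoming/", "/ingest/staging/"],
        # Keep on-site and off-site backups while content is in transit
        # to the production boxes.
        ["rsync", "-a", "/ingest/staging/", "backup-onsite:/ingest/"],
        ["rsync", "-a", "/ingest/staging/", "backup-offsite:/ingest/"],
    ]

    for cmd in STEPS:
        subprocess.run(cmd, check=True)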

The CLOCKSS archive ingests harvest content using a separate ingest network of 4 machines which crawl the publishers' sites, and are then crawled by the 12 production CLOCKSS boxes. The boxes in an NHN would typically harvest content directly, by crawling the publishers' sites, so an ingest network would not be needed.

The CLOCKSS Archive is dark; the content in production CLOCKSS boxes is never accessed by readers. If content is ever triggered, the CLOCKSS team executes a trigger process to copy content out of the network to a set of triggered content servers, which make it openly accessible. NHNs need to make their content accessible only to authorized users, so they need an access path similar to that for LOCKSS. It must consult the entitlement registry to determine whether the requested content was covered by the requester's institutional subscription.

The LOCKSS software can disseminate harvest content in four ways:
  1. By acting as an HTTP proxy, in which case a reader whose browser is proxied via their institution's LOCKSS box accesses the content at the publisher's URL; if the publisher does not supply the content, the LOCKSS box will.
  2. By acting as an HTTP server, in which case the reader will access the content at a URL pointing to the LOCKSS box but including the publisher's URL.
  3. By resolving OpenURL queries, using either the LOCKSS daemon's internal resolver, or an institutional resolver such as SFX. LOCKSS boxes can output KBART reports detailing their holdings, which can be input to an institutional resolver. The LOCKSS box will then appear as an alternate source for the resolved content. Access will be at a URL pointing to the LOCKSS box.
  4. By resolving DOI queries, using the LOCKSS daemon's internal resolver. Access will be at a URL pointing to the LOCKSS box.
In each case, the LOCKSS box is configured by default to forward the request to the publisher's Web site and supply content only if the publisher does not.
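
Here is a minimal sketch of this default behaviour, combining the entitlement check of requirement #3 with the fail-over of method 1. The entitled() and local_copy() hooks are hypothetical stand-ins for the registry lookup and the LOCKSS daemon's content store:

    import urllib.error
    import urllib.request
    from typing import Callable

    def serve(url: str, institution: str,
              entitled: Callable[[str, str], bool],
              local_copy: Callable[[str], bytes]) -> bytes:
        # Refuse content the institution's subscription never covered.
        if not entitled(institution, url):
            raise PermissionError("no subscription covers this content")
        try:
            # Forward the request to the publisher first.
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()  # the publisher supplied it
        except (urllib.error.URLError, TimeoutError):
            # The publisher did not; fall back to the preserved copy.
            return local_copy(url)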

Methods 1 & 2 rely on knowing the original publisher's URL for the content. For file transfer content, this is generally not available, so method 3 or 4 must be used. The format in which a publisher supplies file transfer content varies, but it generally includes metadata sufficient to resolve OpenURL or DOI queries, and a PDF that can be returned as a result.

32 comments:

David. said...

The issue of extra-territoriality is getting attention. On December 7 T-Systems, a subsidiary of Deutsche Telekom, announced a public cloud service hosted exclusively in Germany called Intercloud:

"Intercloud’s security architecture is also in line with strict German and European data privacy laws and in compliance with the highest standards. Now that the European Court of Justice has declared the Safe Harbor Agreement between Europe and the US null and void, DSI Intercloud by Cisco and T-Systems is the secure German IaaS service from the cloud."

David. said...

Andrew Orlowski at The Register reports some interesting comments from Brad Smith:

Europeans should sit up and take more notice of Microsoft’s lawsuit against the US government over secret access to their data.

Why? Because it affects much more of their data than the Safe Harbour case, according to Microsoft president and lead counsel Brad Smith.

“The Department of Justice does not need to wait for data to come to the United States to examine it,” he explained. “It can force countries to give it your data without disclosing that access to government, or complying with any European law.”

Smith said 90 per cent of Europeans' data is affected by the Irish warrant case; far more data than is affected by the transatlantic flows governed by safe harbour rules, which Austrian Max Schrems exploded in a European court ruling last year.

David. said...

If you don't understand why the US government is fighting the Microsoft case so hard, you need to read Martin Luther King Jr., Subversives, and the PATRIOT Dragnet by the invaluable Marcy Wheeler. They can't store "all that stuff", so (a) they need to be selective about what they do store:

"I’ve been trying to explain, even to civil liberties supporters, why the current 2-degree targeted dragnet is still too invasive of privacy. We’ve been having this discussion for 2.5 years, and yet still most people don’t care that completely innocent people 2 degrees — 3, until 2014 — away from someone the government has a traffic-stop level of suspicion over will be subjected to the NSA’s “full analytic tradecraft.”"

and (b) they need a way to recover if the selection process failed by getting it from someone who has stored it, in this case Microsoft in Ireland.

David. said...

Andrew Orlowski at The Register has an interview with John Frank, Microsoft's legal strategist for the case of the Dublin e-mails.

David. said...

Add Portico to the list of sites suffering outages; it was experiencing some technical problems this morning.

David. said...

OUP is down again today.

David. said...

For the second time in the last few months the UK's JANET academic network was DDoS-ed, meaning that for 5 days access to journals hosted outside the UK was problematic.

David. said...

The Second Circuit has ruled in Microsoft's favor in the case of the e-mails in Ireland:

"United States Court of Appeals for the Second Circuit reversed a lower court’s ruling that Microsoft must turn over email communications for a suspect in a narcotics investigation stored in a Microsoft data center in Dublin."

The precedent that the Justice Dept wanted was:

"that Microsoft’s status as a company based in the United States gave it authority to obtain its data, even if the data was stored outside the country."

Given the determination the Justice Dept. has shown so far:

"The government is likely to appeal the ruling, legal experts said."

David. said...

Not just appeal, but also propose legislation to overturn the court's decision.

David. said...

Yes, the US government has appealed the Second Circuit's decision:

"Federal prosecutors in New York late Thursday asked a federal appeals court to reconsider its July decision that allowed Microsoft to successfully claim that authorities had no legal right to access data stored on its servers outside the country, even with a warrant from a federal judge.

A three-judge panel of the 2nd US Circuit Court of Appeals had ruled that federal law, notably the Stored Communications Act, allows US authorities to seize content in US-based servers, but not in overseas servers—in this case, Dublin, Ireland."

The brief is here.

David. said...

The Second Circuit has declined to revisit its decision in favor of Microsoft:

"An evenly split federal appeals court ruled Tuesday that it won't revisit its July decision that allowed Microsoft to squash a US court warrant for e-mail stored on its servers in Dublin, Ireland. The 4-4 vote by the 2nd US Circuit Court of Appeals sets the stage for a potential Supreme Court showdown over the US government's demands that it be able to reach into the world's servers with the assistance of the tech sector."

David. said...

The Department of Justice hasn't been deterred from pursuing extra-territorial jurisdiction:

"A federal magistrate judge handed down an opinion this afternoon, In re Search Warrant No. 16-960-M-01 to Google, ordering Google to comply with a search warrant to produce foreign-stored emails. The magistrate judge disagrees with the U.S. Court of Appeals for the 2nd Circuit’s Microsoft Ireland warrant case, recently denied rehearing by an evenly divided court. Although the new decision is only a single opinion by a single magistrate judge, the decision shows that the Justice Department is asking judges outside the Second Circuit to reject the Second Circuit’s ruling — and that at least one judge has agreed."

David. said...

More details of the magistrate judge's ruling and how the case differs from Microsoft's from Tim Cushing at TechDirt.

David. said...

Alexander J. Martin's Dublin court to decide EU's future relationship with Trump's America reports on another case involving Ireland and the US that is about to go to trial, this time in Dublin:

"The DPC's complaint ... relates to the US National Security Agency's en-masse pilfering of data held by American corporations on EU citizens. This led to the collapse of the Safe Harbor agreement, which allowed American businesses to affirm they were upholding European privacy laws even outside of Europe, after a European Court of Justice ruling in favour of Austrian lawyer Max Schrems in 2015.

A replacement agreement was quickly assembled and dubbed Privacy Shield to the outrage of critics, but during the interim Schrems had complained that companies such as Facebook were continuing to shuffle data westwards using "model contracts" or "standard contractual clauses", which Facebook, Microsoft, and Salesforce claimed allowed customers to practically ignore the judgment."

David. said...

Andrew Orlowski's Judge green lights Microsoft vs Uncle Sam gag order case reports on another Microsoft vs. DoJ case:

"Microsoft is clear to sue the US government for gagging the company from telling users when their data has been accessed by the State. The lawsuit, filed last April, jumped another legal hurdle this week, ... It's Microsoft's fourth legal broadside against the US government on data protection rights for users of cloud services. Microsoft argues that the laws purportedly protecting customers privacy are now outdated and ineffective, and need to be modernised if the public is to trust the cloud."

David. said...

Alexander J. Martin at The Register reports on the Google warrant case:

"In a series of amicus briefs, corporations including Microsoft, Apple, ... Yahoo!, Amazon and Cisco wrote to complain about the February ruling in the Google case."

David. said...

Shaun Nichols at The Register reports that:

"Google has been ordered by a US court to cough up people's private Gmail messages stored overseas – because if that information can be viewed stateside, it is subject to American search warrants, apparently.

During a hearing on Wednesday in California, magistrate judge Laurel Beeler rejected [PDF] the advertising giant's objections to a US government search warrant seeking data stored on its foreign servers. The Mountain View goliath had filed a motion to quash the warrant, and was denied."

David. said...

More detail on Google's and other companies' pushback against extra-territorial warrants is behind Forbes' annoying ad-block defenses. A company that has been caught supplying malvertising should be less arrogant.

David. said...

The Microsoft case has been appealed to the Supreme Court:

The Justice Department on Friday petitioned the US Supreme Court to step into an international legal thicket, one that asks whether US search warrants extend to data stored on foreign servers. The US government says it has the legal right, with a valid court warrant, to reach into the world's servers with the assistance of the tech sector, no matter where the data is stored.

David. said...

Extraterritoriality is catching on. See Joe Mullin's Google must alter worldwide search results, per orders from Canada’s top court:

"Some 345 Datalink webpages have been de-indexed from Google's Canadian site, per various Canadian court orders. But a court in British Columbia issued a broader order, insisting that Google stop listing Datalink's entire website anywhere in the world. Today, Canada's Supreme Court upheld that order."

Which is precisely what writers such as Tim Cushing have been warning would happen as a result of the DOJ's suit. For an update see his DOJ Asks The Supreme Court To Give It Permission To Search Data Centers Anywhere In The World:

"What the DOJ doesn't seem to understand (or genuinely just doesn't care about) is a decision granting it the power to seize communications from anywhere in the world would result in foreign governments expecting the same treatment when requesting communications stored in the US."

David. said...

A federal judge has upheld the magistrate judge's order requiring Google to hand over data from servers overseas.

David. said...

David Kravets reports that Google stops challenging most US warrants for data on overseas servers but:

'Microsoft, ... according to the Justice Department filing, "continues to rely" on the 2nd Circuit's decision on a nationwide basis and is "refusing to produce communications that previously would have been disclosed as a matter of course."'

David. said...

More from David Kravets on the Google vs. DoJ warrant fight.

David. said...

Another major publisher, this time Springer Nature, suffers a significant outage:

"Unfortunately, a core component in one of our data centres became unresponsive, which brought down a large part of our infrastructure. Our teams have worked around the clock in multiple time zones to restore systems and services and, as a result, most of our sites are now functioning as normal."

David. said...

The Microsoft vs. DoJ fight about extraterritoriality of warrants will be argued before the Supreme Court, reports David Kravets at Ars Technica:

"The Supreme Court on Monday agreed to decide whether law enforcement authorities, armed with a valid search warrant from a federal judge, can demand that the US tech sector hand over data that is stored on overseas servers. In this case, which is now one of the biggest privacy cases on the high court's docket, the justices will review a lower court's ruling that US warrants don't apply to data housed on foreign servers, in this instance, a Microsoft server in Ireland."

David. said...

"Microsoft should not be able to “shield evidence” held on Irish servers from US prosecutors, a group of 35 US state attorneys general has argued.

The group – which represents Vermont, New Jersey, Illinois, Florida among other states – submitted an amicus brief to the US Supreme Court backing the US Department of Justice's appeal in its long-running battle against the Windows giant." reports Rebecca Hill at The Register.

David. said...

Wiley was down from early Saturday until Monday afternoon. Wiley's "apology":

"this was not an optimal experience for libraries and their users"

David. said...

"Tech giants including Microsoft, Google and Apple have given a proposed US law on overseas data sharing the thumbs-up.

The bipartisan Clarifying Lawful Overseas Use of Data Act (PDF), introduced to the Senate yesterday, aims to iron out confusion around which laws apply when governments want access to data stored in the cloud." writes Rebecca Hill at The Register. It does include some safeguards:

"These include a motion to quash or modify the legal process if it believes the customer isn't a US citizen and that disclosure "creates a material risk" that the firm would violate the laws of another government."

But if it passes, it does clarify that the basic position is that data in the custody of companies with significant business in the US is subject to the whims of the US government.

David. said...

The EFF analyses the CLOUD Act and points out its many troubling implications.

David. said...

"Supreme Court justices on Tuesday wrestled with Microsoft Corp’s dispute with the U.S. Justice Department over whether prosecutors can force technology companies to hand over data stored overseas, with some signaling support for the government and others urging Congress to pass a law to resolve the issue." reports Reuters.

David. said...

Tim Cushing's Supreme Court Hears Oral Arguments In Microsoft Email Case reports in some detail on the arguments.

David. said...

Bloomberg reports that the EU is just starting to come to terms with the implications of the US CLOUD Act.