LOCKSS: Collaborative Distributed Web Archiving For Libraries
The academic record, in the form of journal articles, monographs and books, is an important part of our cultural heritage. For centuries, libraries have both provided access to and preserved that record by distributing large numbers of copies on durable, reasonably tamper-evident media among repositories around the world, and making up for missing or damaged copies via a cooperative system of inter-library loan.
Twenty years ago next month, in May 1995, Stanford Libraries pioneered the transition of academic publishing from this paper system to the Web when HighWire Press put the Journal of Biological Chemistry on-line. Two things rapidly became obvious:
- The new medium had capabilities, such as links and search, that made it far more useful than paper, so that the transition would be rapid and complete.
- The new medium forced libraries to switch from purchasing a copy of the material to which they subscribed and adding it to their collection, to leasing access to the publisher's copy and no longer building a collection.
This leasing model created three serious problems for libraries:
- If they decided to cancel their subscription, they would lose access not just to future materials but also to the materials for which they had already paid. This problem is called post-cancellation access.
- If the publisher stopped publishing the materials, future readers would lose access to them entirely. This problem is called preserving the record.
- If the only copy was the publisher's, changes to the copy by the publisher, or even by a bad guy who broke into the publisher's systems, would be unlikely to be detected. This problem is called tamper-evidence.
The materials we needed to collect, preserve and disseminate were not merely copyrighted, and thus subject to the draconian provisions of the Digital Millennium Copyright Act (DMCA), but also extremely valuable. These assets generated large profits for the publishers, who owned the copyrights (or at least behaved as if they did). In 2013, Reed Elsevier alone reported over $1.1B in profit from STEM publishing. If you're handling assets that generate a billion dollars a year on the bottom line, you need to be on very safe legal ground.
Clearly, following the example of the Internet Archive by depending on the "safe harbor" provision of the DMCA wasn't a good solution to post-cancellation access. The cancelled publisher could issue a take-down notice, preventing such a system from providing access and rendering it useless. The only alternative was a system that obtained explicit permission from the copyright owner to preserve a copy of the content, and to use it after cancellation just as a paper copy would have been used. After all, doing essentially what we do to journal content, but without permission, led to Aaron Swartz's death.
The LOCKSS Program restored the purchase model by building a system that worked for the Web the way libraries had always worked with paper. Libraries continued to build collections containing copies of materials they purchased, held in LOCKSS boxes, the digital equivalent of the stacks. Each box obtained its content from the publisher under its host institution's subscription agreement. Legal permission for the box to do so was granted by a simple statement added to the on-line content.
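To give a feel for how lightweight this permission mechanism is, here is a minimal sketch (not the actual LOCKSS plugin code) of a box checking a publisher's manifest page for a permission statement before collecting an AU. The manifest URL is hypothetical and the statement wording shown is only approximately the conventional one.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: check a publisher's manifest page for a LOCKSS permission
// statement before collecting an Archival Unit. Not the actual daemon code;
// the URL and statement wording are illustrative assumptions.
public class PermissionCheck {
    // Approximate wording of the conventional permission statement.
    private static final String PERMISSION =
        "LOCKSS system has permission to collect, preserve, and serve this Archival Unit";

    public static boolean hasPermission(String manifestUrl)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(manifestUrl)).GET().build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        // Only a 200 response containing the statement counts as permission.
        return response.statusCode() == 200 && response.body().contains(PERMISSION);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical manifest URL; real plugins know where to look for each publisher.
        System.out.println(hasPermission("https://publisher.example/lockss-manifest/vol12"));
    }
}
```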
So we have a network of boxes, each with the content to which that library subscribes. Important content is in many boxes, less important content in fewer. On average, there are Lots Of Copies of each item. What can be done to Keep Stuff Safe, while keeping the focus on minimizing the cost of ownership of the content?
A peer-to-peer network allows LOCKSS boxes to collaborate to detect and repair any loss or damage to the content in their collection, the digital analog of the way paper libraries collaborate via inter-library loan and copy. This reduces the cost of ownership; boxes can use low-cost hardware.
The protocol by which the peers cooperate turned out to be an interesting research topic, which earned us a Best Paper award at the 2003 SOSP. The protocol performs five functions (a greatly simplified sketch follows the list):
- Location: enabling a box in the network to find at least some of the other boxes in the network that are preserving the same Archival Unit (AU). An AU is typically a volume of a journal, or an e-book.
- Verification: enabling a box to confirm that the content it has collected for an AU is the same (matching URLs, with matching content) as other boxes preserving the same AU have collected, thus identifying and correcting errors in the collection process.
- Authorization: enabling a box A that is preserving an AU to provide content from that AU to another box B preserving the same AU in order to repair damage, because box B has proved to box A that in the past it collected matching content from the publisher for that AU. Box A is a willing repairer for box B.
- Detection: of random loss or damage to content in an AU at a box, so that the box can request repairs to the content from other boxes preserving the same AU.
- Prevention: of attempts at deliberate modification of content in an AU at multiple boxes in the network, by detecting that non-random change has occurred at multiple boxes.
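To make the Verification and Detection functions concrete, here is a greatly simplified sketch of the underlying idea: hash an AU's URLs and content in a canonical order, and compare the local hash with the hashes reported by peers preserving the same AU. The real protocol uses nonced, sampled polls and voting with much more machinery; the class and method names below are purely illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Greatly simplified sketch of verification/detection: hash the local copy of
// an AU and compare against hashes reported by peers preserving the same AU.
// The real LOCKSS protocol uses nonces, sampling and voting; names here are
// illustrative only.
public class AuVerificationSketch {

    static String hashAu(Map<String, byte[]> urlToContent) throws NoSuchAlgorithmException {
        // Hash URLs and content in a canonical (sorted) order so that
        // matching collections produce matching hashes.
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        urlToContent.keySet().stream().sorted().forEach(url -> {
            digest.update(url.getBytes(StandardCharsets.UTF_8));
            digest.update(urlToContent.get(url));
        });
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }

    // If most peers disagree with us, our copy is probably damaged and we
    // should request repairs from a willing repairer.
    static boolean needsRepair(String localHash, List<String> peerHashes) {
        long agreeing = peerHashes.stream().filter(localHash::equals).count();
        return agreeing < (peerHashes.size() + 1) / 2;
    }
}
```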
The LOCKSS software is a large, open-source daemon that runs on Linux. It is implemented as more than 200K lines of Java, over 1000 classes, using over 100 open-source Java libraries. It includes:
- A highly configurable Web crawler, with the special capabilities needed for academic journals including sophisticated login page handling, license detection, crawler trap avoidance, crawl scheduling and rate-limiting.
- The peer-to-peer preservation system.
- A version of the Jetty web server that acts as both a Web proxy and a Web server to replay the preserved content, and also provides the administrative Web interface for the daemon.
- Services above the Jetty server that support Memento.
- Transparent, on-access format migration driven by HTTP content negotiation (a sketch of the negotiation follows this list).
- Import and export capability for multiple types of archive files including WARC (ISO 28500) files.
- Technology for extracting bibliographic metadata, storing it in a metadata database, and using it to support access to preserved content via both DOI and OpenURL resolution.
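As an illustration of the dissemination side, on-access format migration is driven by the HTTP Accept header: if the reader's browser no longer accepts the preserved format, the content is converted on the fly. The sketch below shows only the negotiation decision, with a hypothetical migration target; it is not the daemon's actual implementation.

```java
import java.util.Arrays;

// Sketch of on-access format migration driven by HTTP content negotiation.
// The migration target is hypothetical; only the negotiation logic is shown.
public class FormatNegotiationSketch {

    // Decide whether the preserved format can be served as-is, given the
    // client's Accept header, or must be migrated to a format it accepts.
    static String chooseDeliveryFormat(String preservedMimeType, String acceptHeader) {
        boolean acceptable = Arrays.stream(acceptHeader.split(","))
            .map(s -> s.split(";")[0].trim())
            .anyMatch(t -> t.equals(preservedMimeType)
                || t.equals("*/*")
                || (t.endsWith("/*")
                    && preservedMimeType.startsWith(t.substring(0, t.indexOf('/') + 1))));
        if (acceptable) {
            return preservedMimeType;  // serve the preserved bytes unchanged
        }
        // Hypothetical migration target; a real system consults a registry of converters.
        return "image/png";
    }

    public static void main(String[] args) {
        // Prints "image/png": the preserved image/x-xbitmap would be migrated,
        // because this Accept header does not admit it.
        System.out.println(chooseDeliveryFormat("image/x-xbitmap",
                                                "text/html,image/png,image/webp"));
    }
}
```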
Included in the code base is our test code, all of which is run on every build and every night. For the generic daemon code there are over 5,000 unit tests, and over 1,000 for plugin code. There is also a functional test framework that runs an entire network of LOCKSS daemons preserving synthetic content to exercise detection of and recovery from damage, among other functions.
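For flavor, a unit test for the hashing sketch shown earlier might look like the following; this is illustrative JUnit code, not one of the daemon's actual tests.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotEquals;

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

import org.junit.Test;

// Illustrative JUnit test for the AU-hashing sketch above; not one of the
// daemon's actual tests.
public class TestAuVerificationSketch {

    private Map<String, byte[]> au(String content) {
        Map<String, byte[]> m = new TreeMap<>();
        m.put("http://publisher.example/vol12/art1",
              content.getBytes(StandardCharsets.UTF_8));
        return m;
    }

    @Test
    public void matchingContentHashesMatch() throws Exception {
        assertEquals(AuVerificationSketch.hashAu(au("same bytes")),
                     AuVerificationSketch.hashAu(au("same bytes")));
    }

    @Test
    public void damagedContentHashesDiffer() throws Exception {
        assertNotEquals(AuVerificationSketch.hashAu(au("original bytes")),
                        AuVerificationSketch.hashAu(au("damaged bytes")));
    }
}
```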
In its more than 16-year history the LOCKSS Program has succeeded in implementing this technology, deploying it to about 150 libraries around the world, and persuading over 600 publishers to permit libraries to use it to preserve their content.
As usual, people found uses for the system other than the Global LOCKSS Network (GLN) it was built for. There are now about 15 separate Private LOCKSS Networks (PLNs) using the technology to collect and preserve library special collections, social science datasets, and other types of content. For example, one of our PLNs just won an award for innovative use of technology. It involves a group of Canadian university libraries and Stanford Libraries using Archive-It to collect Canadian government documents, which these days are at considerable risk, and preserving them in a PLN that ensures there are copies at multiple locations, including some not under Canadian jurisdiction.
The LOCKSS team operates one large PLN, the CLOCKSS Archive, on behalf of a not-for-profit jointly managed by the large academic publishers and research libraries. The CLOCKSS PLN currently has 12 boxes in countries around the globe. They ingest the complete output of almost all major academic publishers, both e-journals and e-books. If any of this content is no longer available on the Web from any publisher, the board can vote to "trigger" it, at which point the CLOCKSS Archive makes it public under Creative Commons licenses. On average, some content is triggered about every six months.
Last year, CRL certified the CLOCKSS Archive under TRAC, the predecessor to ISO 16363. We were awarded an overall score matching the previous best, and their first-ever perfect score in the "Technologies, Technical Infrastructure, Security" category. All the non-confidential documentation submitted to the auditors is available, as is a description of the process and a summary of the lessons we learned.
We believe certification is an extremely valuable process that all archives should undertake. There are three increasingly rigorous levels of certification available to data archives: the Data Seal of Approval (DSA), NESTOR's DIN 31644, and TRAC/ISO 16363. Going straight to the highest level, as CLOCKSS did, is a huge amount of work and rather disruptive; we recommend moving up the levels gradually.
The starting level, the DSA, is awarded after a process of self-audit and peer review based on 16 guidelines. It typically takes 2-3 months elapsed time. Re-certification is required after 2 years; DANS, the Dutch Data Archiving and Networked Services organization, estimated that their recent re-certification against the DSA took 250 hours of staff time. DANS reported benefits similar to those we received:
- Moving the archive to a higher level of professionalism.
- Forcing the archive to make its work processes more explicit and decide unresolved issues.
- Building trust from funders, users, etc.
The Red Hat model, in which institutions pay an annual subscription to support free, open-source software, demands continual incremental improvements that deliver value to the subscribers by addressing their immediate concerns. This makes it difficult to devote resources to longer-term developments that, if not undertaken, will become immediate concerns in a few years. In 2012, the Mellon Foundation gave us a grant that, over its three years, allowed an increase in expenditure of about 10% to address a set of longer-term issues. The grant has just concluded; its results included:
- Development of AJAX crawling capability using Crawljax and proxies including INA's Live Archiving Proxy and WarcMITMProxy.
- Memento (RFC7089) support.
- Shibboleth support.
- Several distinct but synergistic performance improvements to the LOCKSS P2P protocol.
- Vastly improved data collection and analysis tools to monitor the performance of the P2P protocol.
Our current Crawljax-based implementation works well but, fortunately, AJAX crawling remains essential for only a relatively small number of sites. We believe the reason is that making a site unfriendly to crawlers tends to reduce its search ranking. We observe that many publishers "enhancing their readers' experience" via AJAX also provide a non-AJAX route to their content, precisely to stay crawler-friendly. Where these alternate routes are available we use them, because AJAX crawls are more complex and expensive than conventional crawls.
I've long been an enthusiastic supporter of the work of the Memento team in enabling uniform, federated access to archived Web collections; Herbert van de Sompel talked about it on Tuesday morning. The underlying HTTP extensions (RFC 7089) are very effective at providing Wayback Machine-style access to a Web archive's collection. We have layered services that support them above the LOCKSS daemon's Jetty web server.
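Concretely, RFC 7089 works through extra HTTP headers: a TimeGate request carries Accept-Datetime, and a Memento response carries Memento-Datetime plus Link headers pointing at the original resource and a TimeMap. The sketch below merely assembles such response headers; the URLs are hypothetical and this is not the LOCKSS implementation.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the RFC 7089 headers a Memento response carries; URLs are
// hypothetical and this is not the LOCKSS daemon's implementation.
public class MementoHeadersSketch {

    // HTTP-date format used by Memento-Datetime (and Accept-Datetime).
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);

    static Map<String, String> mementoHeaders(String originalUri, String timeMapUri,
                                              ZonedDateTime captured) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Memento-Datetime",
            captured.withZoneSameInstant(ZoneOffset.UTC).format(HTTP_DATE));
        // The Link header relates this Memento to the original resource and its TimeMap.
        headers.put("Link",
            "<" + originalUri + ">; rel=\"original\", "
          + "<" + timeMapUri + ">; rel=\"timemap\"; type=\"application/link-format\"");
        return headers;
    }

    public static void main(String[] args) {
        mementoHeaders("http://publisher.example/vol12/art1",
                       "http://lockss.example/timemap/http://publisher.example/vol12/art1",
                       ZonedDateTime.now())
            .forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```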
But the overall goal of federated access among disparate Web archives requires aggregation, which can be thought of as the time domain version of a Web search engine. There are a number of unresolved issues in implementing Memento aggregation. They include:
- Scaling issues in describing collections to aggregators. Sawood Alam will discuss work the IIPC has supported to address these in the next talk.
- Non-uniform access. Different Web archives have different policies about access to their collections; for example, most national libraries permit access only on site. Whether a Memento actually provides access thus depends on the relationship between the reader and the archive, and the aggregator can't know the answer. Mementos currently have two uses, providing access and compiling statistics about coverage. If limited-access archives export their Mementos, the statistics will be right but access will be broken; if they don't, the statistics will be wrong but access will work correctly for most readers.
- Soft 404s. René Voorburg recently added a fuzzing technique to robustify.js to detect pages that should return 404 but actually return 200 (a sketch of the idea follows this list).
- Soft 403s. For example, the Internet Archive claims to have Mementos for many subscription journal articles, but what it actually has are pages returned with a 200 success code refusing access to the article. This is a worse problem than soft 404s; fuzzing doesn't fix it.
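The fuzzing idea behind soft-404 detection mentioned above is simple: request a URL on the same site that almost certainly does not exist and see whether the server still answers 200. The sketch below (in Java, whereas robustify.js is JavaScript) assumes that heuristic; it is illustrative, not production code. Soft 403s defeat it, because the page being tested really does exist; it just isn't the article.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

// Sketch of soft-404 detection by fuzzing: if a deliberately bogus URL on the
// same site returns 200, the site probably serves "not found" pages with a
// success code, so a 200 for the real URL proves little. Heuristic only.
public class Soft404Sketch {

    static boolean looksLikeSoft404Site(String siteRoot)
            throws IOException, InterruptedException {
        // A random path that should not exist anywhere on the site.
        String bogus = siteRoot + "/" + UUID.randomUUID();
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL).build();
        HttpResponse<Void> response = client.send(
            HttpRequest.newBuilder(URI.create(bogus)).GET().build(),
            HttpResponse.BodyHandlers.discarding());
        return response.statusCode() == 200;  // a 200 here is a bad sign
    }
}
```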
Non-uniform access is a particular issue for subscription e-journal archives such as LOCKSS. A LOCKSS box is permitted to provide access to subscription content only to its host institution's readers, so it would be misleading to advertise the box's collection to general Memento aggregators. But readers from the box's host institution need an aggregator that knows about the box's collection. So LOCKSS boxes need to act as their own aggregators, merging the results of queries to the public aggregators with their own content.
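Acting as its own aggregator essentially means merging TimeMap entries for an original URL: the Mementos the box holds locally plus whatever the public aggregators report, de-duplicated and sorted by capture time. A minimal sketch of that merge, with hypothetical types, is below.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a LOCKSS box acting as its own Memento aggregator: merge the
// Mementos it holds locally (including subscription content it may serve only
// to its own institution's readers) with those reported by public aggregators.
// Types and field names are hypothetical; all entries are assumed to be
// captures of the same original URL.
public class SelfAggregatorSketch {

    record Memento(String uri, Instant captured) {}

    static List<Memento> merge(List<Memento> local, List<Memento> fromPublicAggregators) {
        // Local entries first so that, for captures both sources know about,
        // the box's own copy is the one kept.
        Map<Instant, Memento> byCaptureTime = new LinkedHashMap<>();
        for (Memento m : local) byCaptureTime.putIfAbsent(m.captured(), m);
        for (Memento m : fromPublicAggregators) byCaptureTime.putIfAbsent(m.captured(), m);
        List<Memento> merged = new ArrayList<>(byCaptureTime.values());
        merged.sort(Comparator.comparing(Memento::captured));
        return merged;
    }
}
```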
Historically, e-journals and thus LOCKSS have authorized access based on the IP address of the request. This less-than-secure method is gradually being replaced by Shibboleth, so we have implemented Shibboleth access control for the Memento-based dissemination path. This will allow us to advertise Mementos of open access content to the aggregators, while providing access to Mementos of subscription content only to readers from the box's host institution. Eventually we believe we will need to authenticate the LOCKSS crawler using Shibboleth, but there are significant unresolved issues in doing so at present.
Compare the two graphs below, labeled "Before" and "After", which show preliminary data from monitoring about a third of the boxes in the Global LOCKSS Network. The graphs are histograms of the time it takes for a newly released unit of content to acquire one additional willing repairer; the better the protocol is working, the shorter this time will be.
["Before" and "After" histograms of time for new content to acquire an additional willing repairer]
Based on a decade of successfully running the Red Hat model, we believe it can form a sustainable basis for long-term digital preservation. In practice, it needs to be supplemented by occasional small grants to address long-term issues.
An essential part of remaining sustainable is a consistent focus on reducing the cost of operations. Our costs are in three main areas:
- Maintaining and developing the daemon software.
- Supporting the boxes in the field.
- Content processing and quality assurance.
[Diagram: Future LOCKSS Architecture]
- Ingest: our AJAX support already involves multiple crawlers coexisting behind a collection proxy. We expect there to be more specialized crawlers in our future.
- Preservation: we are working to replace our current repository with WARC files in Hadoop, probably with Warcbase (a sketch of a WARC record follows this list).
- Dissemination: we can already output content for export via OpenWayback, but we expect to completely replace the Jetty webserver for dissemination with OpenWayback.
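To make the preservation target concrete, here is a hand-rolled sketch of a single WARC (ISO 28500) response record of the kind such a repository would store. A real deployment would use an established WARC library rather than formatting records by hand; the target URI and payload are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

// Hand-rolled sketch of a single WARC (ISO 28500) response record; real
// systems use an established WARC library. URI and payload are illustrative.
public class WarcRecordSketch {

    static byte[] responseRecord(String targetUri, byte[] httpResponseBytes) {
        String date = ZonedDateTime.now(ZoneOffset.UTC)
            .format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'"));
        String headers =
              "WARC/1.0\r\n"
            + "WARC-Type: response\r\n"
            + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
            + "WARC-Date: " + date + "\r\n"
            + "WARC-Target-URI: " + targetUri + "\r\n"
            + "Content-Type: application/http; msgtype=response\r\n"
            + "Content-Length: " + httpResponseBytes.length + "\r\n"
            + "\r\n";
        byte[] head = headers.getBytes(StandardCharsets.UTF_8);
        byte[] trailer = "\r\n\r\n".getBytes(StandardCharsets.UTF_8);  // record separator
        byte[] record = new byte[head.length + httpResponseBytes.length + trailer.length];
        System.arraycopy(head, 0, record, 0, head.length);
        System.arraycopy(httpResponseBytes, 0, record, head.length, httpResponseBytes.length);
        System.arraycopy(trailer, 0, record,
                         head.length + httpResponseBytes.length, trailer.length);
        return record;
    }
}
```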
One reality of this evolution is that deploying a complex collection of Web services is far more challenging than deploying a single Java program. Supporting remote libraries using the system is a significant cost. We are addressing this by using state-of-the-art deployment technologies; my colleague Daniel Vargas talked about this yesterday, and I hope you found it interesting.
Two good blog posts giving personal reactions to the IIPC GA come from Tom Cramer and Jefferson Bailey. Sawood Alam has a long and detailed post including a vast array of tweets describing the GA.