Friday, May 1, 2015

Talk at IIPC General Assembly

The International Internet Preservation Consortium's General Assembly brings together those involved in Web archiving from around the world. This year's was held at Stanford and the Internet Archive. I was asked to give a short talk outlining the LOCKSS Program, explaining how and why it differs from most Web archiving efforts, and how we plan to evolve it in the near future to align it more closely with the mainstream of Web archiving. Below the fold, an edited text with links to the sources.

LOCKSS: Collaborative Distributed Web Archiving For Libraries

The academic record in the form of journal articles, monographs and books is an important part of our cultural heritage. For centuries, libraries have both provided access to and preserved that record by distributing large numbers of copies on durable, reasonably tamper-evident media among repositories around the world, and making up for missing or damaged copies via a cooperative system of inter-library loan.

Twenty years ago next month, in May 1995, Stanford Libraries pioneered the transition of academic publishing from this paper system to the Web when HighWire Press put the Journal of Biological Chemistry on-line. Two things rapidly became obvious:
  • The new medium had capabilities, such as links and search, that made it far more useful than paper, so that the transition would be rapid and complete.
  • The new medium forced libraries to switch from purchasing a copy of the material to which they subscribed and adding it to their collection, to leasing access to the publisher's copy and no longer building a collection.
Librarians had three concerns about leasing materials:
  • If they decided to cancel their subscription, they would lose access not just to future materials, but also to the materials for which they had paid. This problem is called post-cancellation access.
  • If the publisher stopped publishing the materials, future readers would lose access to them entirely. This problem is called preserving the record.
  • If the only copy was the publisher's, changes to that copy, whether made by the publisher or by a bad guy who broke in to the publisher's systems, would be unlikely to be detected. This problem is called tamper-evidence.
The LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford Libraries started 16.5 years ago as an attempt to solve these problems (PDF). It is rather unlike the Web archiving that most of you in the room do. It is collaborative, distributed, highly targeted, and designed around a single major constraint, which is not technical but legal.

The materials we needed to collect, preserve and disseminate were not merely in copyright, and thus subject to the draconian provisions of the Digital Millennium Copyright Act (DMCA), but also extremely valuable. These assets generated large profits for the publishers, who owned the copyrights (or at least behaved as if they did). In 2013, Reed Elsevier alone reported over $1.1B in profit from STEM publishing. If you're handling assets that generate a billion a year on the bottom line, you need to be on very safe legal ground.

Clearly, following the example of the Internet Archive by depending on the "safe harbor" provision of the DMCA wasn't a good solution to post-cancellation access. The cancelled publisher could issue a take-down notice, preventing such a system from providing access and rendering it useless. The only alternative was a system that obtained explicit permission from the copyright owner to preserve a copy of the content, and to use it after cancellation just as a paper copy would have been used. After all, doing essentially what we do to journal content but without permission led to Aaron Swartz's death.

The LOCKSS Program restored the purchase model by building a system that worked for the Web the way libraries had always worked with paper. Libraries continued to build collections containing copies of materials they purchased, held in LOCKSS boxes, the digital equivalent of the stacks. Each box obtained its content from the publisher under its host institution's subscription agreement. Legal permission for the box to do so was granted by a simple statement added to the on-line content.

So we have a network of boxes, each with the content to which that library subscribes. Important content is in many boxes, less important content in fewer. On average, there are Lots Of Copies of each item. What can be done to Keep Stuff Safe, while keeping the focus on minimizing the cost of ownership of the content?

A peer-to-peer network allows LOCKSS boxes to collaborate to detect and repair any loss or damage to the content in their collection, the digital analog of the way paper libraries collaborate via inter-library loan and copy. This reduces the cost of ownership; boxes can use low-cost hardware.

The protocol by which the peers cooperate turned out to be an interesting research topic, which earned us a Best Paper award at the 2003 SOSP. The protocol performs five functions:
  • Location: enabling a box in the network to find at least some of the other boxes in the network that are preserving the same Archival Unit (AU). An AU is typically a volume of a journal, or an e-book.
  • Verification: enabling a box to confirm that the content it has collected for an AU is the same (matching URLs, with matching content) as other boxes preserving the same AU have collected, thus identifying and correcting errors in the collection process.
  • Authorization: enabling a box A that is preserving an AU to provide content from that AU to another box B preserving the same AU in order to repair damage, because box B has proved to box A that in the past it collected matching content from the publisher for that AU. Box A is a willing repairer for box B.
  • Detection: of random loss or damage to content in an AU at a box in order that it might request repairs to the content from other boxes preserving the same AU.
  • Prevention: of attempts at deliberate modification of content in an AU at multiple boxes in the network, by detecting that non-random change had occurred at multiple boxes.
There isn't time to go into the protocol details, but in essence, at random intervals each box chooses one of its AUs and a random subset of the other boxes. It calls a poll, challenging the other boxes to prove that they have the same content for that AU. If the box calling the poll agrees with the majority of the boxes voting in the poll, all is well. If it disagrees with the majority, it chooses one of the boxes it disagrees with, asks it for a repair, and then confirms that the repair brings it into agreement with the majority. Boxes receiving a request for a repair supply it only if they remember agreeing with the requesting box about that AU in the past. This mechanism ensures that copyright content does not leak: boxes must initially obtain their content from the publisher, and only after proving that they have it can they get a copy from another box.
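To make the poll-and-repair cycle concrete, here is a much-simplified sketch in Java. It is not the LOCKSS implementation: the real protocol uses nonced hashes, vote solicitation and careful rate limiting, and the Box interface and its methods below are hypothetical.

```java
// Much-simplified sketch of the poll-and-repair cycle described above.
// The Box interface is hypothetical, not the LOCKSS daemon's API.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

interface Box {
    byte[] contentFor(String auId);                           // this box's copy of the AU
    boolean previouslyAgreedWith(Box requester, String auId); // willing-repairer check
    void storeRepair(String auId, byte[] content);
    void rememberAgreement(String auId, Collection<Box> peers);
}

class PollSketch {
    static String hashAu(Box b, String auId) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(b.contentFor(auId));
        return Base64.getEncoder().encodeToString(digest);
    }

    /** Called by each box at random intervals, for a randomly chosen AU and peer subset. */
    static void callPoll(Box me, String auId, List<Box> randomPeers) throws NoSuchAlgorithmException {
        String myHash = hashAu(me, auId);
        Map<String, List<Box>> votes = new HashMap<>();
        for (Box peer : randomPeers) {
            votes.computeIfAbsent(hashAu(peer, auId), k -> new ArrayList<>()).add(peer);
        }
        // The majority opinion is the hash value with the most votes.
        String majority = Collections.max(votes.entrySet(),
                Comparator.comparingInt((Map.Entry<String, List<Box>> e) -> e.getValue().size())).getKey();
        if (majority.equals(myHash)) {
            // We agree with the majority: all is well, and the agreeing peers
            // are boxes we would supply repairs to in the future.
            me.rememberAgreement(auId, votes.get(majority));
            return;
        }
        // We disagree with the majority: ask a box we disagree with for a repair.
        // It will supply one only if it remembers agreeing with us about this AU.
        for (Box repairer : votes.get(majority)) {
            if (!repairer.previouslyAgreedWith(me, auId)) continue;
            me.storeRepair(auId, repairer.contentFor(auId));
            // Confirm that the repair brings us into agreement with the majority.
            if (hashAu(me, auId).equals(majority)) break;
        }
    }
}
```

The key point is the repair condition: a box acts as a repairer only for peers it remembers agreeing with, so content flows only to boxes that have already proved they obtained it from the publisher.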

The LOCKSS software is a large, open-source daemon that runs on Linux. It is implemented in more than 200K lines of Java, comprising over 1,000 classes and using over 100 open-source Java libraries. It includes:
  • A highly configurable Web crawler, with the special capabilities needed for academic journals including sophisticated login page handling, license detection, crawler trap avoidance, crawl scheduling and rate-limiting.
  • The peer-to-peer preservation system.
  • A version of the Jetty web server that acts as both a Web proxy and a Web server to replay the preserved content, and also provides the administrative Web interface for the daemon.
  • Services above the Jetty server that support Memento.
  • Transparent, on-access format migration driven by HTTP content negotiation (a sketch follows this list).
  • Import and export capability for multiple types of archive files including WARC (ISO 28500) files.
  • Technology for extracting bibliographic metadata, storing it in a metadata database, and using it to support access to preserved content via both DOI and OpenURL resolution.
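To illustrate the on-access format migration item above, here is a minimal sketch of content negotiation in a dissemination handler, using the JDK's built-in HTTP server. The obsolete MIME type, the converter and the repository lookup are hypothetical placeholders; the real daemon does this inside its Jetty-based dissemination path.

```java
// Sketch: serve preserved content in its original format if the client accepts it,
// otherwise migrate it on access to a format the client can render.
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class MigratingHandler implements HttpHandler {
    public void handle(HttpExchange ex) throws IOException {
        byte[] preserved = loadPreserved(ex.getRequestURI().getPath()); // stored bytes, e.g. image/x-old-format
        String accept = ex.getRequestHeaders().getFirst("Accept");
        byte[] body;
        String type;
        if (accept != null && accept.contains("image/x-old-format")) {
            body = preserved;                 // the client can handle the preserved format
            type = "image/x-old-format";
        } else {
            body = convertToPng(preserved);   // migrate on access to an acceptable format
            type = "image/png";
        }
        ex.getResponseHeaders().set("Content-Type", type);
        ex.sendResponseHeaders(200, body.length);
        try (OutputStream os = ex.getResponseBody()) { os.write(body); }
    }

    // Placeholders for repository access and the format converter.
    private byte[] loadPreserved(String path) { return new byte[0]; }
    private byte[] convertToPng(byte[] original) { return original; }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", new MigratingHandler());
        server.start();
    }
}
```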
This code is generic; it must be adapted to work with each publishing platform (Open Journal Systems is an example of a platform). This is done via a publisher plugin, a set of Java classes that understand, for example, how, when and how fast to crawl the publisher's web site, how to divide the site into AUs, and how to extract bibliographic metadata from the content. The plugin must then be configured to work with the specific e-journals or e-books; this is done by means of an XML database of bibliographic information. The set of plugins includes over 500 classes.
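The responsibilities a plugin takes on can be pictured as an interface like the following. The names are purely illustrative, not the daemon's actual plugin API, and the parameter map stands in for the XML title database mentioned above.

```java
// Hypothetical sketch of what a publisher plugin supplies to the generic daemon.
import java.util.List;
import java.util.Map;

interface PublisherPlugin {
    /** Crawl rules: which URLs belong to an AU, which to skip (login pages, crawler traps). */
    boolean shouldFollow(String auBaseUrl, String candidateUrl);

    /** Rate limiting the publisher's platform requires. */
    long minMillisBetweenRequests();

    /** Divide the site into Archival Units, e.g. one per journal volume,
        given bibliographic parameters from the title database. */
    List<String> startUrlsFor(Map<String, String> auParams); // e.g. {"journal_issn": "...", "volume": "12"}

    /** Extract bibliographic metadata (DOI, ISSN, authors, ...) from collected content. */
    Map<String, String> extractMetadata(String url, byte[] content);
}
```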

Included in the code base is our test code, all of which is run on every build, including the nightly builds. For the generic daemon code there are over 5,000 unit tests, and over 1,000 for plugin code. There is also a functional test framework that runs an entire network of LOCKSS daemons preserving synthetic content to exercise detection and recovery from damage and other functions.

In its more than 16 year history the LOCKSS Program has succeeded in implementing this technology, deploying it to about 150 libraries around the world, and persuading over 600 publishers to permit libraries to use it to preserve their content.

As usual, people found uses for the system other than the Global LOCKSS Network (GLN) it was built for. There are now about 15 separate Private LOCKSS Networks (PLNs) using the technology to collect and preserve library special collections, social science datasets, and other types of content. For example, one of our PLNs just won an award for innovative use of technology. It involves a group of Canadian university libraries and Stanford Libraries using Archive-It to collect Canadian government documents, which these days are at considerable risk, and preserving them in a PLN that ensures there are copies at multiple locations, including some not under Canadian jurisdiction.

The LOCKSS team operates one large PLN, the CLOCKSS Archive, on behalf of a not-for-profit jointly managed by the large academic publishers and research libraries. The CLOCKSS PLN currently has 12 boxes in countries around the globe. They ingest the complete output of almost all major academic publishers, both e-journals and e-books. If any of this content is no longer available on the Web from any publisher, the board can vote to "trigger" it, at which point the CLOCKSS Archive makes it public under Creative Commons licenses. On average, some content is triggered about every six months.

Last year, CRL certified the CLOCKSS Archive under TRAC, the predecessor to ISO16363. We were awarded an overall score matching the previous best, and their first-ever perfect score in the "Technologies, Technical Infrastructure, Security" category. All the non-confidential documentation submitted to the auditors is available, as is a description of the process and a summary of the lessons we learned.

We believe certification is an extremely valuable process that all archives should undertake. There are three increasingly rigorous levels of certification available to data archives: the Data Seal of Approval (DSA), NESTOR's DIN31644, and TRAC/ISO16363. Going straight to the highest level, as CLOCKSS did, is a huge amount of work and rather disruptive; we recommend moving up the levels gradually.

The starting level, the DSA, is awarded after a process of self-audit and peer-review based on 16 guidelines. It typically takes 2-3 months elapsed time. Re-certification is required after 2 years; DANS, the Dutch Data Archiving and Networked Services organization, estimated that their recent re-certification against the DSA took 250 hours of staff time. DANS reported benefits similar to those we received:
  • Moving the archive to a higher level of professionalism.
  • Forcing the archive to make its work processes more explicit and decide unresolved issues.
  • Building trust from funders, users, etc.
An important part of the TRAC audit is financial sustainability. Organizationally, the LOCKSS team is a part of the Stanford Libraries. We raise all our own money; we pay Stanford overhead and rent for our off-campus space. Although the initial development of the LOCKSS system was funded by the NSF, the Andrew W. Mellon Foundation and Sun Microsystems, such grant funding is not a sustainable basis for long-term digital preservation. Instead, the LOCKSS Program runs the "Red Hat" model of free, open-source software and paid support from the LOCKSS Alliance. In 2005 the Mellon Foundation gave the LOCKSS Program a grant which we had to match from LOCKSS Alliance subscriptions. At the end of the grant we had to be completely off grant funding, and we have been in the black ever since. From 2007 to 2012 we received no grant funds whatsoever.

The Red Hat model demands continual incremental improvements to deliver value to the subscribers by addressing their immediate concerns. This makes it difficult to devote resources to longer-term developments that, if not undertaken, will be immediate concerns in a few years. In 2012, the Mellon Foundation gave us a grant that, over its three years, allowed an increase in expenditure of about 10% to address a set of longer-term issues. This grant has just concluded, with results that included the following.

Long before the grant started we had observed a few e-journal publishers using AJAX to "enhance the reader's experience" in ways that made it very difficult for the LOCKSS crawler to collect a usable version of the site. We expected this trend to accelerate and thus make an AJAX crawler essential. Together with Master's students at CMU's Silicon Valley campus, we had done proof-of-concept work based on the Selenium testing framework.

Our current Crawljax-based implementation works well, but fortunately it remains essential for only a relatively small number of sites. We believe the reason is that making a site unfriendly for crawlers tends to reduce its search ranking. We observe that many publishers "enhancing their reader's experience" via AJAX also provide a non-AJAX route to their content, keeping their sites crawler-friendly. Where these alternate routes are available we use them, because AJAX crawls are more complex and expensive than conventional crawls.
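For a sense of what the Selenium-based proof-of-concept involved, here is a minimal sketch of letting a real browser execute a page's AJAX before capturing the resulting DOM. The URL, the fixed wait and the output file are illustrative assumptions; this is not our Crawljax-based crawler.

```java
// Sketch: drive a browser so the page's scripts run, then save the rendered DOM.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AjaxSnapshot {
    public static void main(String[] args) throws Exception {
        WebDriver driver = new FirefoxDriver();                      // a real browser, not a simple fetch
        try {
            driver.get("https://publisher.example/article/12345");   // hypothetical article URL
            Thread.sleep(5000);                                      // crude wait for AJAX to settle
            String renderedDom = driver.getPageSource();             // DOM after scripts have run
            Files.write(Paths.get("article-12345.html"),
                        renderedDom.getBytes(StandardCharsets.UTF_8));
        } finally {
            driver.quit();
        }
    }
}
```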

I've long been an enthusiastic supporter of the work of the Memento team in enabling uniform, federated access to archived Web collections. Herbert van de Sompel talked about it on Tuesday morning. The underlying HTTP extensions are very effective at providing Wayback Machine style access to a Web archive's collection. We layered services above the LOCKSS daemon's Jetty web server that support them.
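For those unfamiliar with the HTTP extensions, here is a small client-side illustration of Memento datetime negotiation (RFC 7089) against a TimeGate. The TimeGate URL is hypothetical; the headers are the standard ones.

```java
// Sketch: ask a TimeGate for the archived version of a URL closest to a given datetime.
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeGateExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical TimeGate for an original resource preserved in a LOCKSS box.
        URL timegate = new URL("https://lockss.example.edu/timegate/http://publisher.example/article1");
        HttpURLConnection conn = (HttpURLConnection) timegate.openConnection();
        conn.setInstanceFollowRedirects(false);
        // Accept-Datetime asks for the Memento nearest this moment in the past.
        conn.setRequestProperty("Accept-Datetime", "Tue, 05 May 2015 00:00:00 GMT");
        System.out.println("Status:   " + conn.getResponseCode());          // typically a redirect to the Memento
        System.out.println("Location: " + conn.getHeaderField("Location")); // the selected Memento
        System.out.println("Vary:     " + conn.getHeaderField("Vary"));     // should include accept-datetime
        System.out.println("Link:     " + conn.getHeaderField("Link"));     // rel="original", "timemap", ...
        conn.disconnect();
    }
}
```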

But the overall goal of federated access among disparate Web archives requires aggregation, which can be thought of as the time domain version of a Web search engine. There are a number of unresolved issues in implementing Memento aggregation. They include:
  • Scaling issues in describing collections to aggregators. Sawood Alam will discuss work the IIPC has supported to address these in the next talk.
  • Non-uniform access. Different Web archives have different policies about access to their collections. For example, most national libraries permit access only on site. Whether a Memento actually provides access therefore depends on the relationship between the reader and the archive, and the aggregator can't know the answer. Mementos currently serve two purposes, coverage statistics and access, and these pull in opposite directions: if limited-access archives export their Mementos, the statistics will be right but access will be broken; if they don't, the statistics will be wrong but access will work correctly for most readers.
  • Soft 404s. René Voorburg recently added a fuzzing technique to robustify.js to detect pages that should return 404 but actually return 200.
  • Soft 403s. For example, the Internet Archive claims to have Mementos for many subscription journal articles, but what it actually has are pages returned with a 200 success code refusing access to the article. This is a worse problem than soft 404s; fuzzing doesn't fix it.
The last two issues are causes of "archive spam", the cluttering of the space of preserved content with garbage.
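The fuzzing idea behind soft-404 detection can be sketched as follows: fetch the target URL and a deliberately garbled sibling URL that cannot exist; if both "succeed" with near-identical bodies, the 200 is really an error page. This is only an illustration of the technique, with a crude length-based similarity test, not robustify.js itself.

```java
// Sketch: detect a page that should return 404 but actually returns 200.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class Soft404Check {
    static String fetch(String url) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        if (c.getResponseCode() != 200) return null;            // a real 404/403 is not "soft"
        try (InputStream in = c.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    static boolean looksLikeSoft404(String url) throws Exception {
        String page = fetch(url);
        if (page == null) return false;
        // Garble the URL so it almost certainly does not exist on the site.
        String junkPage = fetch(url + "-" + UUID.randomUUID());
        // If the junk URL also "succeeds" and the bodies are about the same size,
        // the site is returning its error page with a 200 status.
        return junkPage != null
            && Math.abs(page.length() - junkPage.length()) < page.length() * 0.05;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(looksLikeSoft404("http://example.com/some/page"));
    }
}
```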

Non-uniform access is a particular issue for subscription e-journal archives such as LOCKSS. A LOCKSS box is permitted to provide access to subscription content only to its host institution's readers. Thus it would be misleading to advertise the box's collection to general Memento aggregators. But readers from the box's host institution need an aggregator that knows about the box's collection. So LOCKSS boxes need to act as their own aggregators, merging the results of queries to the public aggregators with their own content.
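A sketch of this box-as-its-own-aggregator idea follows. The data structures and the two query sources are illustrative assumptions, not a LOCKSS interface: the point is simply that local Mementos are merged in only for the institution's own readers.

```java
// Sketch: merge publicly aggregated Mementos with those held in the local box.
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LocalAggregator {
    static class Memento {
        final String archiveUri;
        final Instant datetime;
        Memento(String archiveUri, Instant datetime) { this.archiveUri = archiveUri; this.datetime = datetime; }
    }

    /** Mementos the public aggregators know about for this URL (fetched elsewhere). */
    static List<Memento> queryPublicAggregators(String originalUrl) { return new ArrayList<>(); }

    /** Mementos of subscription content held in this box, visible only to local readers. */
    static List<Memento> queryLocalRepository(String originalUrl) { return new ArrayList<>(); }

    static List<Memento> aggregate(String originalUrl, boolean readerIsLocal) {
        List<Memento> merged = new ArrayList<>(queryPublicAggregators(originalUrl));
        if (readerIsLocal) {
            merged.addAll(queryLocalRepository(originalUrl)); // only local readers may see these
        }
        merged.sort(Comparator.comparing((Memento m) -> m.datetime)); // one consistent TimeMap ordering
        return merged;
    }
}
```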

Historically, e-journals and thus LOCKSS have authorized access based on the IP address of the request. This less-than-secure method is gradually being replaced by Shibboleth, so we have implemented Shibboleth access control for the Memento-based dissemination path. This will allow us to advertise Mementos of open access content to the aggregators, while providing access to Mementos of subscription content only to readers from the box's host institution. Eventually we believe we will need to authenticate the LOCKSS crawler using Shibboleth, but there are significant unresolved issues in doing so at present.
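As a sketch of what Shibboleth-protected dissemination might look like, assume a Shibboleth SP in front of the dissemination path that exposes the authenticated user's scoped affiliation as a request attribute. The attribute name, URL prefix and institutional scope below are illustrative, and this is not our actual implementation.

```java
// Sketch: allow Mementos of subscription content only to the host institution's readers.
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class SubscriptionAccessFilter implements Filter {
    private static final String HOST_INSTITUTION_SCOPE = "example.edu"; // hypothetical

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        boolean subscriptionContent = request.getRequestURI().startsWith("/memento/subscription/");
        // e.g. "staff@example.edu", set by the Shibboleth SP after authentication
        Object affiliation = request.getAttribute("eduPersonScopedAffiliation");
        boolean localReader = affiliation != null
                && affiliation.toString().endsWith("@" + HOST_INSTITUTION_SCOPE);
        if (subscriptionContent && !localReader) {
            response.sendError(HttpServletResponse.SC_FORBIDDEN, "Subscription content");
            return;
        }
        chain.doFilter(req, res);   // open access content, or a reader from the host institution
    }

    public void init(FilterConfig config) {}
    public void destroy() {}
}
```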
[Graph: Before]
Using the improved data collection and analysis tools, we showed that the changes we made to the P2P protocol made a very significant improvement to its performance.

Compare these two graphs, which show preliminary data from monitoring about a third of the boxes in the Global LOCKSS Network. The graphs are histograms of the time it takes for a newly released unit of content to acquire one additional willing repairer. The better the protocol is working, the shorter this time will be.

The "before" graph was computed from almost a million changes in willing repairer relationships in the network before the first of the changes was turned on. We call this change symmetric polling, it acts to increase the average number of willing repairer relationships that a single poll can generate.

[Graph: After]
The "after" graph was computed from over a third of a million changes in willing repairer relationships after symmetric polling was turned on. The most likely median time to an additional willing repairer improved from 50 days to 30 days per repairer.

Based on a decade of successfully running the Red Hat model, we believe it can form a sustainable basis for long-term digital preservation. In practice, it needs to be supplemented by occasional small grants to address long-term issues.

An essential part of remaining sustainable is a consistent focus on reducing the cost of operations. Our costs are in three main areas:
  • Maintaining and developing the daemon software.
  • Supporting the boxes in the field.
  • Content processing and quality assurance.
When we started, we had to build almost everything from scratch; there wasn't much relevant open-source technology available for us to use. That's no longer the case. Looking forward, our goal is to reduce the first of these costs by evolving the LOCKSS system architecture to be more in line with the Web archiving mainstream.

[Diagram: Future LOCKSS Architecture]
This architecture diagram shows the direction in which we are evolving:
  • Ingest: our AJAX support already involves multiple crawlers coexisting behind a collection proxy. We expect there to be more specialized crawlers in our future.
  • Preservation: we are working to replace our current repository with WARC files in Hadoop, probably with Warcbase.
  • Dissemination: we can already output content for export via OpenWayback, but we expect to completely replace the Jetty webserver for dissemination with OpenWayback. 
All the interfaces in this diagram will be Web services; the reason for this is summed up in slides 9 and 10 from the terrific Krste Asanović keynote at the 2014 FAST conference. This, and the fact that we will replace large sections of code we currently maintain with code the open-source community maintains, should significantly reduce the effort we need to spend, and the effort it takes others to contribute to the LOCKSS technology. Most importantly, it should also make it easier for others to use parts of the LOCKSS technology, for example the P2P protocol.
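To make the service boundaries concrete, here is one purely illustrative way the interfaces in the diagram might be expressed. These are not the actual service definitions, which have yet to be designed; every name here is hypothetical.

```java
// Sketch: each architectural component hides behind a small service interface.
import java.io.InputStream;
import java.util.List;

/** Repository service: crawlers store collected content, dissemination services fetch it. */
interface RepositoryService {
    String storeArtifact(String auId, String url, InputStream content);  // returns an artifact id
    InputStream getArtifact(String artifactId);
    List<String> listArtifacts(String auId);
}

/** Dissemination service (e.g. OpenWayback) replays artifacts fetched from the repository. */
interface DisseminationService {
    InputStream replay(String originalUrl, String datetime);   // Memento-style access
}
```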

One reality of this evolution is that deploying a complex collection of Web services is far more challenging than deploying a single Java program. Supporting remote libraries using the system is a significant cost. We are addressing this by using state of the art deployment technologies. My colleague Daniel Vargas talked about this yesterday, and I hope you found it interesting.

2 comments:

  1. Two good blog posts giving personal reactions to IIPC GA from Tom Cramer and Jefferson Bailey.

  2. Sawood Alam has a long and detailed post including a vast array of tweets describing the GA.
