Sunday, June 7, 2015

Brief talk at Columbia

I gave a brief talk during the meeting at Columbia on Web Archiving Collaboration: New Tools and Models to introduce the session on Tools/APIs: integration into systems and standardization. The title was "Web Archiving APIs: Why and Which?" An edited text is below the fold.

This talk is based on a paper I've been working on with Nicholas Taylor and Jefferson Bailey. The talk is in three parts:
  • Why APIs?
  • Which APIs?
  • What is the LOCKSS Program doing?

Why APIs?

APIs enable interoperation among Web archiving technologies, and this fosters cooperation among Web archivers. We need enhanced cooperation for five basic reasons:
  • Efficient use of limited resources. Research has shown that less than half the surface Web is preserved. Clearly, the resources devoted to Web archiving are not going to double, so effective use of the resources that are available is critical.
  • Collaborative collection development. One way to increase the effective use of resources is to exchange metadata to support coordinated collection development, so that institutions can focus on parts of the Web of particular interest secure in the knowledge that other parts are being preserved elsewhere. The Keepers Registry for academic journals is an example, though its effect so far in avoiding duplication is debatable.
  • Avoiding monoculture. Other than economics, the most significant risk to collection and preservation of Web content is monoculture. Abby Smith Rumsey stressed this in her talk to the recent IIPC GA. All organizations, policies and technologies have flaws. If only one organization, policy or technology is in play, its flaws are irrecoverable. For example, currently the "preservable Web" is pretty much defined as "the Web as seen by Heritrix"; Web resources that by design or by bug are not visible to Heritrix are unlikely to be preserved.
  • Exchange of collections. One important aspect of avoiding monoculture is ensuring that important collections exist in multiple replicas in multiple locations using multiple technologies. To achieve this it must be easy for institutions to swap collections; doing so will also aid dissemination.
  • Community involvement. Few Web archiving tools have the kind of broad, distributed base of committers that provides long-term stability. In most cases the reason is that the tools are so large and specialized that the investment in becoming a committer is disproportionate to the potential benefits. Breaking these tools up into smaller components communicating via stable interfaces greatly reduces the barrier to entry for potential committers.
The reasons for using APIs and a service-oriented architecture to foster cooperation are summed up in these two slides, which I stole from Krste Asanović's keynote at the 2014 FAST conference. The first shows that the odds of a programming project's success decay rapidly with size, based on a study of over 50,000 projects over 8 years by the Standish Group. We need to reduce the size of our projects.

The second shows Jeff Bezos' way of dealing with this problem, which he announced in a famous, if never officially published, memo from 2002. His way has worked for Amazon; we need to follow his advice.

Which APIs?

In the paper we try to exhaustively enumerate the APIs that would be useful. There isn't time here to cover them all, so please consult the paper. I'm just going to discuss selected examples in the three functional areas of Web archiving: ingest, preservation and dissemination.

Ingest

Discussions at the recent IIPC GA showed strong support for the kind of ingest architecture pioneered by Institut national de l'audiovisuel (INA), with multiple "crawlers" cooperating behind a collection proxy. In this architecture there are three critical APIs:
  • The crawler needs an interface to the collection proxy to communicate metadata about the crawl process, such as a crawl ID. For example, INA's LAP uses custom HTTP headers on the requests it sends to do this; a sketch of this approach follows the list.
  • At present, each crawler uses a different user interface to target and control the crawl, and to report on its progress. Cooperation between crawlers would be greatly assisted if a common user interface could manage them.
  • Once the collection proxy has accumulated URLs in one or more WARC files, it needs to submit them to a repository for preservation. Standardizing this interface as a Web service would be valuable. SWORD is an example of an interface of this kind that repositories have implemented. Web crawls are typically very large, so the Web API for submission needs to be supplemented with protocols more suitable for very large transfers than HTTP, such as BitTorrent or GridFTP.
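As a concrete illustration of the first of these interfaces, here is a minimal sketch (in Python) of a crawler fetching through a collection proxy while passing crawl metadata in custom HTTP headers. The proxy address and the header names (X-Crawl-Id, X-Crawl-Job) are hypothetical; INA's LAP defines its own headers, so this only shows the shape of the interaction.

# Sketch: crawler passes crawl metadata to the collection proxy via
# custom HTTP headers. Header names and proxy address are hypothetical.
import requests

COLLECTION_PROXY = "http://collection-proxy.example.org:8080"  # hypothetical

def fetch_via_proxy(url, crawl_id, job_name):
    headers = {
        "X-Crawl-Id": crawl_id,   # lets the proxy tag the resulting WARC records
        "X-Crawl-Job": job_name,  # e.g. the collection or seed-list name
    }
    proxies = {"http": COLLECTION_PROXY, "https": COLLECTION_PROXY}
    # The proxy records the request/response pair into a WARC file;
    # the crawler just sees a normal HTTP response.
    return requests.get(url, headers=headers, proxies=proxies, timeout=30)

if __name__ == "__main__":
    resp = fetch_via_proxy("http://example.com/", "crawl-2015-06-07", "news-sites")
    print(resp.status_code, len(resp.content))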

Preservation

Preservation has three main functions, each of which needs to be supported by one or more APIs:
  • Acquiring content, via the other side of the submission interface described under Ingest earlier.
  • Maintaining content integrity.
  • Extracting content, via the other side of the dissemination interfaces described later.
The interfaces associated with maintaining integrity can be divided into three areas:
  • Detecting bit rot. ACE and other Merkle-tree-based techniques can be used by repositories to detect the possibility that the stored hashes used to verify integrity have themselves changed. The LOCKSS challenge-hash-based voting protocol is an alternative. A sketch of the Merkle-tree idea follows the list.
  • Repairing bit rot. Detection is not much use if detected failures cannot be repaired. The LOCKSS protocol implements repair from a peer using a private interface to transfer the repairs, in order to avoid copyright violations. DPN is implementing an API for copying content among repositories, and handling the copyright issues out-of-band. The submission and extraction APIs are not efficient for this purpose, since it is likely that only a small part of a collection needs to be repaired. So an API allowing for the export and import of specific URIs within a collection is needed.
  • Auditing bit rot. Since the cost of storage increases rapidly with increasing reliability, it is necessary to audit repositories to determine whether they are actually delivering the reliability they are being paid to provide. An audit interface is needed because a repository that promised high reliability but delivered less, and thus incurred lower costs, would, if it faced a low risk of detection, take over the market and eliminate the honest competition before it was caught. The LOCKSS polling mechanism allows peers to audit each other, but true third-party auditing at scale is a hard problem.
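To make the detection case concrete, here is a minimal sketch of the Merkle-tree idea: hash each stored object, combine the hashes pairwise into a single root, and keep only the root for later comparison. It illustrates the general technique, not the ACE or LOCKSS implementations, and the objects are in-memory stand-ins for WARC files.

# Sketch: Merkle-tree integrity check over a collection's objects.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes):
    """Combine leaf hashes pairwise until a single root remains."""
    level = list(leaf_hashes)
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# At ingest time, hash each stored object (e.g. each WARC file) and keep
# the Merkle root alongside the collection.
objects = [b"warc record 1", b"warc record 2", b"warc record 3"]  # stand-ins
stored_root = merkle_root([sha256(obj) for obj in objects])

# At audit time, recompute the root from the current bytes and compare;
# a mismatch means some object (or stored hash) has changed and needs
# repair, e.g. from a peer holding a good copy.
current_root = merkle_root([sha256(obj) for obj in objects])
print("intact" if current_root == stored_root else "bit rot detected")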

Dissemination

Web archives' customers require three types of access:
  • Browsing: querying the preserved Web resources for the content of a resource named by a URI and, via Memento, a datetime.
  • Searching: issuing queries to the archive content that return lists of preserved Web resources matching specific text or metadata criteria.
  • Data Mining: performing more complex, customer-specified query processing against the text and metadata of the collection of preserved Web resources.
These queries can be either:
  • Intra-archive: issued directly to and returning results from a single archive.
  • Inter-archive: issued via a front-end, such as the Time Travel Service, and returning results from all available Web archives.
So we have six API areas to consider:
  • Intra-archive browsing. Memento and its implementations in OpenWayback and pywb pretty much cover these requirements; a minimal example of Memento browsing follows the list.
  • Inter-archive browsing. The Memento team provides a Memento Aggregator that resolves [URI,datetime] pairs to the appropriate preserved Web resource in any of the Web archives it knows about. Ilya Kreymer has demonstrated a prototype page reconstructor that, based on the same technology, attempts to present an entire Web page whose component resources are the closest preserved resources in time across the entire set of Web archives it knows about. The IIPC is funding research into the best way to connect such inter-archive browsing services to the archives they need; the API to do so is being defined.
  • Intra-archive searching. Apache Solr is increasingly being used to provide Web archives with text and metadata search capability, although as the size of the collections being indexed increases, the resource implications of doing so are a concern.
  • Inter-archive searching. If archives supported a Web service interface to text and metadata searches, an aggregator analogous to the Time Travel Service could be implemented that would fan the search out to the participating archives and merge their results for the user. This would be a considerable improvement for the user but the resource implications at the archives would be very significant.
  • Intra-archive data mining. There are three ways this can be done:
    • Move the data to the query. This requires an API allowing the archive to export the whole of a collection, or more likely that part of a collection meeting specific criteria. For example, Stanford might want to acquire and mine the parts of a number of archives' collections that come from the stanford.edu domain. This would be, in effect, a generalized version of the export part of the repair API. The data involved might be very large, so note the previous discussion of non-HTTP protocols.
    • Move the query to the data. Moving large datasets around is slow, error-prone, and in some cases expensive. Web archives have been reduced to shipping NAS boxes. The query is typically much smaller than its result, so providing compute resources alongside the data can make the process much quicker and more efficient. This can be costly, but the Internet Archive is experimenting with the concept.
    • Move both to the cloud. Amazon, among other cloud providers, does not charge for moving data into its storage, and it has almost unlimited compute resources connected to the storage. It may make sense to implement data-mining of Web collections by creating a temporary copy in the cloud and allowing researchers access to it, by analogy with Common Crawl. Expensive, reliable cloud storage is not needed for this; the copy in the cloud is not the preservation copy. I did a cost analysis of this for the Library of Congress' Twitter collection; it was surprisingly affordable.
  • Inter-archive data mining. If archives support a move-the-query-to-the-data API, a query service could be implemented that would allow federated data mining across all participating archives. This would be great for the researchers but the resource implications are intimidating.
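To illustrate the browsing case, the sketch below issues a Memento request: a GET to a TimeGate with an Accept-Datetime header, which answers with a redirect to the memento closest to the requested datetime (RFC 7089). The aggregator endpoint shown is assumed to be the Time Travel service's TimeGate; any Memento-compliant archive or aggregator works the same way.

# Sketch: ask a Memento TimeGate for the capture of a URI closest to a
# given datetime. The endpoint URL is an assumption.
import requests

TIMEGATE = "http://timetravel.mementoweb.org/timegate/"  # assumed endpoint

def closest_memento(uri, accept_datetime):
    resp = requests.get(
        TIMEGATE + uri,
        headers={"Accept-Datetime": accept_datetime},
        allow_redirects=False,  # the TimeGate answers with a redirect
        timeout=30,
    )
    # The Location header points at the chosen memento; the Link header
    # lists related mementos (first, last, prev, next) and the original.
    return resp.headers.get("Location"), resp.headers.get("Link")

if __name__ == "__main__":
    location, links = closest_memento(
        "http://example.com/", "Sun, 07 Jun 2015 00:00:00 GMT")
    print(location)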

What Is LOCKSS Doing?

Looking forward, our goal is to reduce our costs by evolving the LOCKSS system architecture from a monolithic Java program to be more in line with the Web archiving mainstream.

Future LOCKSS Architecture
This conceptual architecture diagram shows the direction in which we are evolving. The green boxes are LOCKSS-specific, the blue boxes are generic:
  • Ingest: our AJAX support already involves multiple crawlers coexisting behind a collection proxy. We expect there to be more specialized crawlers in our future.
  • Preservation: we are working to replace our current repository with WARC files in Hadoop, probably with Warcbase.
  • Dissemination: we can already output content for export via OpenWayback, but we expect to completely replace the Jetty web server we currently use for dissemination with OpenWayback. 
All the interfaces in this diagram will be Web services. As described by my colleague Daniel Vargas at the IIPC GA, we will be using state-of-the-art deployment technologies to manage this collection of components.
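For a sense of what "Web services" means here, the sketch below shows the kind of HTTP submission interface a repository component could expose. It is purely illustrative: the endpoint path, parameters and behaviour are assumptions, not the actual LOCKSS interface.

# Sketch: a hypothetical Web-service endpoint for submitting a WARC file
# to a repository component. Not the actual LOCKSS API.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/collections/<collection_id>/warcs", methods=["POST"])
def submit_warc(collection_id):
    """Accept a WARC file for ingest into the named collection."""
    warc = request.files.get("warc")
    if warc is None:
        return jsonify(error="no WARC file in request"), 400
    data = warc.read()
    # A real repository would hand the bytes to the storage layer
    # (e.g. HDFS/Warcbase) and queue them for indexing; here we just
    # acknowledge receipt.
    return jsonify(collection=collection_id, bytes_received=len(data)), 201

if __name__ == "__main__":
    app.run(port=8080)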

2 comments:

IlyaK said...

Hi David,

Thanks for mentioning pywb and the Memento Reconstruct.

As far as APIs go, I also wanted to point out the CDX Server API, which is a url and limited-metadata level query API that is part of OpenWayback and pywb. For example, the API allows one to answer questions such as how many urls there are under a certain prefix, domain, or subdomain, filtering by status code, mime type, or other information contained in the CDX index, and collapsing and sorting the results. The API is part of OpenWayback (https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp) and pywb (https://github.com/ikreymer/pywb/wiki/CDX-Server-API) and they are mostly in sync (still need to add a few things to the pywb version).

Some of the largest public web archives, including Internet Archive Wayback and now the recently released CommonCrawl Index server (built using pywb), support this API.
The API is designed to allow a user to query the index in bulk, including support for pagination and splitting the query into multiple chunks.
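For illustration, a bulk query of this kind against the Internet Archive's public CDX endpoint might look like the following; the matchType, filter, collapse, limit and output parameters are as documented for the wayback CDX server, and the specific values are just examples.

# Example: query the Internet Archive CDX server for captures under a domain.
import requests

CDX = "http://web.archive.org/cdx/search/cdx"

params = {
    "url": "stanford.edu",
    "matchType": "domain",       # everything under *.stanford.edu
    "filter": "statuscode:200",  # only successful captures
    "collapse": "urlkey",        # one row per distinct URL
    "limit": 10,
    "output": "json",            # first row is the list of field names
}

rows = requests.get(CDX, params=params, timeout=60).json()
fields, captures = rows[0], rows[1:]
for capture in captures:
    record = dict(zip(fields, capture))
    print(record["timestamp"], record["original"])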

My hope is that this API can become a more fully developed standard for url index access, at least for url and metadata (not text) querying. I think such an API is necessary because other options, such as Memento, are limited to a single url query or are tied to a specific third-party product (Apache Solr).

Although it was designed for a single archive, I think the CDX Server API idea could be adapted to multiple archives as well. Just thinking about it now, I think bulk querying multiple archives and merging the results could be an interesting extension to the CDX Server API, taking advantage of the pagination and complex filtering options that are already there.

David. said...

Thanks, Ilya. I will add this to the paper. And I agree with your last point. In a world where Memento aggregation allows the content of multiple archives to be treated as a single resource, inter-archive CDX Server aggregation would be a useful capability.