Friday, January 8, 2016

Aggregating Web Archives

Starting five years ago, I've posted many times about the importance of Memento (RFC 7089), and in particular about the way Memento Aggregators in principle allow the contents of all Web archives to be treated as a single, homogeneous resource. I'm part of an effort by Sawood Alam and others to address some of the issues in turning this potential into reality. Sawood has a post on the IIPC blog, Memento: Help Us Route URI Lookups to the Right Archives, that reveals two interesting aspects of this work.

First, Ilya Kreymer's oldweb.today shows there is a significant demand for aggregation:
We learned in the recent surge of oldweb.today (that uses MemGator to aggregate mementos from various archives) that some upstream archives had issues handling the sudden increase in the traffic and had to be removed from the list of aggregated archives.
Second, the overlap between the collections at different Web archives is low, as shown in Sawood's diagram. This means that the contribution of even small Web archives to the effectiveness of the aggregated whole is significant. This is important in an environment where the Internet Archive has by far the biggest collection of preserved Web pages. It can be easy to think that the efforts of other Web archives add little. But Sawood's research shows that, if they can be effectively aggregated, even small Web archives can make a contribution.
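To make the point about low overlap concrete, here is a minimal Python sketch of what an aggregator does conceptually: it merges each archive's list of Mementos for a URI into one timeline, then picks the capture closest to a requested date. This is an illustration under stated assumptions, not MemGator's implementation. The archive names, URIs and dates are made up, and a real aggregator would fetch TimeMaps over HTTP as described in RFC 7089 rather than read them from a dictionary.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Build a timezone-aware UTC datetime for a capture date."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical per-archive Memento lists for one original URI.
# Archive names, URIs and dates are invented for illustration.
TIMEMAPS = {
    "large-archive": [
        (utc(2010, 3, 1), "https://large.example/2010/http://example.org/"),
        (utc(2015, 6, 1), "https://large.example/2015/http://example.org/"),
    ],
    "small-archive": [
        (utc(2012, 9, 15), "https://small.example/2012/http://example.org/"),
    ],
}


def aggregate(timemaps):
    """Merge per-archive lists into one list sorted by capture time."""
    merged = [
        (captured, archive, uri)
        for archive, mementos in timemaps.items()
        for captured, uri in mementos
    ]
    return sorted(merged)


def closest(merged, target):
    """Return the merged entry whose capture time is nearest to target."""
    return min(merged, key=lambda entry: abs(entry[0] - target))


merged = aggregate(TIMEMAPS)
best = closest(merged, utc(2012, 10, 1))
print(best[1], best[2])
```

With this made-up data, the capture nearest to October 2012 comes from the small archive, even though the large archive holds more captures overall. When collections overlap little, leaving an archive out of the aggregate removes captures that nothing else can supply.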

1 comment:

Ed Summers said...

Well said! Thanks for your continued work and writing on this topic, David. I think you are right that the future for Web archives (and ultimately the Web itself) is tied up with our ability to work together to collect it. As you've pointed out elsewhere, attention to the diversity of our preservation systems and organizational failure is key.

Relatedly, I think these archival profiles that you and Sawood are working on, and the tooling around them, could also be useful in helping archivists decide what needs to be collected. It might be possible to easily identify resources and sets of resources (websites) that are in need of preservation. Has that use case been discussed at all?