Tuesday, July 7, 2015

IIPC Preservation Working Group

The Internet Archive has by far the largest archive of Web content but its preservation leaves much to be desired. The collection is mirrored between San Francisco and Richmond in the Bay Area, both uncomfortably close to the same major fault systems. There are partial copies in the Netherlands and Egypt, but they are not synchronized with the primary systems.

Now, Andrea Goethals and her co-authors from the IIPC Preservation Working Group have a paper entitled Facing the Challenge of Web Archives Preservation Collaboratively that reports on a survey of Web archives' preservation activities in the following areas: Policy, Access, Preservation Strategy, Ingest, File Formats and Integrity. They conclude:
"This survey also shows that long term preservation planning and strategies are still lacking to ensure the long term preservation of web archives. Several reasons may explain this situation: on one hand, web archiving is a relatively recent field for libraries and other heritage institutions, compared for example with digitization; on the other hand, web archives preservation presents specific challenges that are hard to meet."
I discussed the problem of creating and maintaining a remote backup of the Internet Archive's collection in The Opposite of LOCKSS. The Internet Archive isn't alone in having less-than-ideal preservation of its collection. It's clear that the major challenges are the storage and bandwidth requirements for Web archiving, and their rapid growth. Given the limited resources available, and the inadequate reliability of current storage technology, prioritizing collecting more content over preserving the content already collected is appropriate.


brewsterkahle said...

Thank you David for working in this area, it is extremely important.

You start off with: "The Internet Archive has by far the largest archive of Web content but its preservation leaves much to be desired." Ouch.

After recovering somewhat, I think the framing of the problem might not be the best for where we are as a field.

Just as a research library tends to have many collections, and different access and preservation strategies for these collections, maybe it is time to think of it that way in the digital world.

The Internet Archive has thousands of collections it maintains, and over 3,000 web collections alone. These are often quite distinct, have different provenance, different access conditions, preservation strategies, and the like.

Many of these collections are the results of collaborations with other organizations, which in many circumstances means there are multiple copies in these organizations. Some are on tape, some are in different repositories, some are on offline hard drives, some are spinning in other places.

We actively encourage people to make copies of the collections we collaborate on. We have tools to help, offer making hard drive copies as a service, and are working on new APIs to make it even easier.

It might have been my fault for not putting an "s" on the end of Internet Archive, but we do not think of a singular archive.

As someone said: Lots of Copies Keeps Stuff Safe :)


Tobias Steinke said...

But it is not just about lots of copies. Digital preservation is also about enabling future generations to access content that was originally made for other hardware and software. This problem might not be so obvious for web archiving, but it should be something to think about and plan for nevertheless. Our article and work in the IIPC PWG are focused on this problem.