The two most important questions to ask when designing a digital preservation system are:
- What digital information am I supposed to preserve? The bane of digital preservation has been the idea that somewhere out there a one-size-fits-all system can be found. The LOCKSS Program published our answer to this question in 2000 (PDF). We designed the system to preserve information, primarily copyright material such as academic journals, published on the Web. Limiting our ambition in this way made it feasible to build a complete, working system rather than a demo.
- What am I supposed to preserve the digital information against? Or in technical jargon, "what is the threat model?" We published our answer to the second question in 2005, as a list of the causes of data loss we thought significant, and how we mitigated them.
  - Media failure
  - Hardware failure
  - Software failure
  - Network failure
  - Obsolescence
  - Natural Disaster
  - Operator error
  - External Attack
  - Insider Attack
  - Economic Failure
  - Organization Failure
- The customers were paper libraries; we wanted a system they could understand.
- Paper libraries were used to handling copyright material. Working around the copyright law was our single most important design goal.
- Paper libraries had evolved over millennia into a remarkably effective system for preserving information.
The only part of this that wasn't practical to emulate in the digital world was the durable, tamper-evident media. We had to make up for that with technology. The way the system works is simple in essence:
- The publisher grants permission on their Web site, either via a Creative Commons license, or by an explicit statement that LOCKSS has permission to collect and preserve the content. For subscription content, this statement must be in some place that only subscribers can see.
- Each LOCKSS box independently collects the content whose permission statement it can see, by crawling the publisher's Web site. This isn't a completely reliable process, but the last step fixes any discrepancies.
- The LOCKSS boxes can serve their content to readers via the Memento HTTP protocol, which provides seamless WayBack Machine-like access to preserved Web content. Memento allows content from a URI (such as a journal) that is preserved at some other URI (such as at the WayBack Machine) to be retrieved from the original URI, even if the original URI knows nothing about Memento.
- The LOCKSS boxes cooperate in a peer-to-peer network to detect and repair damage to their contents. The protocol they use to talk to each other is complex in detail as it has to defend against numerous possible attacks (PDF), but simple if you ignore these defenses. At intervals each box (the poller) creates a random sample of the other boxes (the voters) with the same content, and gets them to vote on its hash. If the poller agrees with the consensus of the voters, all is well. If not, it fetches a repair from one of the boxes that do agree with the consensus, or from the publisher if it is still available.
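The poll-and-repair step, with all the attack defenses left out, can be sketched in a few lines of Python. This is a toy illustration, not the LOCKSS protocol: the names `content_hash` and `run_poll`, the use of SHA-256, and the "most common hash wins" rule for consensus are all assumptions made for the example, and each box is modeled as nothing more than an entry in a dictionary.

```python
import hashlib
import random
from collections import Counter


def content_hash(content: bytes) -> str:
    """Hash a box's copy of the content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def run_poll(poller: str, boxes: dict[str, bytes],
             sample_size: int, rng: random.Random) -> str:
    """Run one simplified poll; `boxes` maps a box name to its copy."""
    # The poller picks a random sample of the other boxes as voters.
    others = [name for name in boxes if name != poller]
    voters = rng.sample(others, min(sample_size, len(others)))
    if not voters:
        return "no voters"

    # Each voter "votes" with the hash of its own copy.
    votes = Counter(content_hash(boxes[v]) for v in voters)
    consensus, _ = votes.most_common(1)[0]

    # If the poller agrees with the consensus, all is well.
    if content_hash(boxes[poller]) == consensus:
        return "agree"

    # Otherwise fetch a repair from a voter that matches the consensus.
    donor = next(v for v in voters if content_hash(boxes[v]) == consensus)
    boxes[poller] = boxes[donor]
    return "repaired"
```

For example, with three good copies and one damaged copy, polling from the damaged box repairs it, and a later poll from any box agrees:

```python
boxes = {"a": b"good", "b": b"good", "c": b"good", "d": b"bad!"}
run_poll("d", boxes, sample_size=3, rng=random.Random(0))  # "repaired"
run_poll("a", boxes, sample_size=3, rng=random.Random(0))  # "agree"
```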
Smaller, private LOCKSS networks (PLNs) also work. For example, the CLOCKSS PLN currently has 12 boxes. Each is configured to contain all the content preserved by the CLOCKSS Archive, so they all have the same content. The PLN takes other defensive measures to make up for this correlation.
Bringing up a physical or virtual LOCKSS box is simple, using a custom Linux Kickstart image. Running a network of boxes requires some management. Some PLNs do this for themselves, and others have it done for them by the LOCKSS team.
Preserving new content involves writing a "plugin" that describes the content to the system, including details such as where it is on the Web, what its boundaries are, when the box is allowed to crawl it, how to detect and filter out personalizations, and so on. Plugins are XML with, in some cases, embedded Java classes. You can have the LOCKSS team write one for you, or do it yourself.
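To make the list of details concrete, here is a rough Python model of the kind of information a plugin carries. It is deliberately not the real plugin format (which, as noted, is XML with optional Java classes): the class `PluginSketch`, its field names, and the example URLs are all hypothetical, chosen only to mirror the four details mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass
class PluginSketch:
    """Hypothetical stand-in for a plugin; not the LOCKSS XML schema."""
    start_url: str        # where the content is on the Web
    boundary: str         # regex a URL must match to count as in scope
    crawl_hours: range    # hours of the day when crawling is allowed
    personalization: str  # regex for per-reader markup to strip out

    def in_bounds(self, url: str) -> bool:
        """True if the URL falls inside the content's boundaries."""
        return re.match(self.boundary, url) is not None

    def may_crawl(self, hour: int) -> bool:
        """True if the box is allowed to crawl at this hour."""
        return hour in self.crawl_hours

    def filter(self, html: str) -> str:
        """Remove personalized fragments from a fetched page."""
        return re.sub(self.personalization, "", html)


plugin = PluginSketch(
    start_url="https://journal.example/vol1/",
    boundary=r"https://journal\.example/vol1/",
    crawl_hours=range(0, 6),
    personalization=r'<span class="user">.*?</span>',
)
```

In this sketch, `plugin.filter('<p>Hi</p><span class="user">Alice</span>')` drops the per-reader greeting and leaves only `<p>Hi</p>`. Filtering of this kind presumably matters because copies collected by different boxes need to compare equal when they are hashed.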
There are even networks you can join, such as the MetaArchive, that arrange the whole preservation process for you.