I have long argued, for example in my 2009 CNI keynote, that while Jeff correctly diagnosed the problems of digital preservation in the pre-Web era, the transition to the Web that started in the mid-90s made those problems largely irrelevant. Jeff's presentation is frustrating, in that it shows how little his thinking has evolved to grapple with the most significant problems facing digital preservation today. Below the fold is my critique of Jeff's keynote.
First, I agree with Jeff when he says (slide 12) that:
In fact, every digital artifact is a program
I've been pointing to the issues raised by the evolution of the Web from a document model to a programming environment for some time; I helped run a workshop at the Library of Congress on this problem last May. Some progress is being made in this area, for example see Dirk von Suchodoletz's paper at IDCC2013 on delivering emulations as a cloud service. However, Jeff fails to point out the problem faced, for example, by the Workflow4Ever project. No matter how faithful the emulator may be, almost all programs these days execute in a context of network services, which themselves execute in a context of network services, and so on ad infinitum. Simply preserving the bits and the ability to re-execute them is not enough if the network services the bits call on are no longer available, or return different results. So while emulation may be one important preservation strategy, it is far from a panacea. Every execution of an instance of the kind of digital artifact that readers commonly encounter, such as web pages, is now a unique experience, impossible to repeat or preserve with the fidelity Jeff envisages.
But there is a more fundamental problem with Jeff's presentation. It is exemplified by this quote from slide 41:
- Most so-called “archiving” efforts ignore preservation
- LOCKSS, Portico (journal archiving) offer no real preservation
- Internet Archive seems based on wishful thinking
It is completely untrue that LOCKSS, Portico and the Internet Archive "ignore preservation" and "reject emulation out-of-hand". Jeff would have been more accurate to say "I don't like the approaches of these archiving efforts". He would have been even more accurate had he said "I don't understand the approaches of these archiving efforts".
Jeff is asking "how best to do digital preservation?" He quotes an ALA definition (slide 5):
The goal of digital preservation is the accurate rendering of authenticated content over time.
He doesn't understand that LOCKSS, Portico, the Internet Archive and others actually preserving stuff in the real world are facing a different question, "how can we use our limited resources to maximize the value delivered to future readers?"
In Jeff's ideal world, cost is an afterthought relegated to slides 46 and 47 out of 48:
- Perform serious cost and process analyses
- Based on viable technological approaches
Jeff approves of the approach taken by the KB (slide 41):
- KB may still be in the lead
- eDepot designed to address long-term preservation
- Using a two-pronged migration/emulation approach
- As regards migration, more than 8 years ago LOCKSS demonstrated and published the technique by which these archives can transparently migrate content on-access using HTTP's content negotiation capabilities. These allow a preservation system to detect that the reader's browser cannot render the format in which the content was preserved and create a temporary access copy in a format that it can render. More recently, Memento uses content negotiation to provide readers with uniform access to preserved Web content no matter where it is preserved. Thus if and when migration is appropriate, it will be available; the capability is now embedded in the infrastructure of the Web. This is precisely the strategy of preserving the original and creating "vernacular versions" that Jeff endorses early in his talk.
- As regards emulation, all three archives depend for both their preservation and dissemination environments on a stack that is open source, carefully preserved in ASCII (and thus itself effectively immune from format obsolescence) as multiple copies in source code repositories capable of reconstructing the entire stack as it was at any time in the past. These environments include multiple emulations of the hardware environment itself, also in ASCII and capable of being reconstructed as they were at any date in the past. I pointed this out 6 years ago in the second post to this blog. No-one since has shown me a credible scenario in which this approach fails for the vast majority of the content these archives are devoted to preserving. Thus if and when emulation is appropriate, it will be available; the capability is now embedded in the infrastructure of computing.
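The migrate-on-access idea described above can be sketched in a few lines. This is a minimal illustration, not the LOCKSS or Memento implementation: the function names, the `converters` table and the GIF-to-PNG pairing are all hypothetical, and the `Accept` header parsing is deliberately simplified (it ignores specificity ordering beyond exact type, `type/*` and `*/*`).

```python
def parse_accept(header: str) -> dict[str, float]:
    """Parse an HTTP Accept header into {media_type: q-value} (simplified)."""
    prefs: dict[str, float] = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # per HTTP, a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


def accepts(prefs: dict[str, float], media_type: str) -> bool:
    """True if the browser's stated preferences allow media_type (q > 0)."""
    main = media_type.split("/")[0]
    # Most specific match wins: exact type, then type/*, then */*.
    for candidate in (media_type, f"{main}/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate] > 0
    return False


def choose_format(accept_header: str, preserved_type: str,
                  converters: dict[str, list[str]]) -> str:
    """Pick the media type to serve for one request.

    The preserved original is served whenever the browser can render it.
    Otherwise the first migration target the browser accepts is chosen,
    and the archive would create a temporary access copy in that format.
    `converters` is a hypothetical table of available migrations.
    """
    prefs = parse_accept(accept_header)
    if accepts(prefs, preserved_type):
        return preserved_type
    for target in converters.get(preserved_type, []):
        if accepts(prefs, target):
            return target
    return preserved_type  # nothing better available; serve the original


# Hypothetical example: content preserved as GIF, converter to PNG available.
converters = {"image/gif": ["image/png"]}
print(choose_format("image/png, text/html;q=0.9", "image/gif", converters))
print(choose_format("image/gif, image/png", "image/gif", converters))
```

The point of the design is that the preserved original is never modified; the converted copy exists only for the duration of the request, so no migration cost is paid until a reader actually needs it.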
It is true that these archives are not currently investing large resources into migration. The framework for doing so is simple, and it was demonstrated long ago. Recent research at the BL and INA has shown that for the content these archives preserve format obsolescence is not imminent. Nor are they investing large resources into emulation. Doing so would simply waste resources duplicating work that is being done much better by others for reasons having nothing to do with digital preservation. And, again, obsolescence of the hardware and software stack is not imminent. As I noted 6 years ago in the third post to this blog, there are fundamental economic reasons why it is very difficult to justify investing heavily in preparing for hypothetical obsolescence events in the distant future. Recent research in Discounted Cash Flow has greatly reinforced this observation. Jeff's advocacy of perfectionist, expensive, pre-emptive approaches ignores basic economics.
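The economic argument can be made concrete with a textbook discounting calculation. The sketch below is only an illustration: the function names are mine, and the cost, probability, discount rate and time horizon are made-up numbers, not figures from any of the archives or from the research mentioned above.

```python
def present_value(future_cost: float, annual_rate: float, years: float) -> float:
    """Standard discounted-cash-flow formula: the value today of a cost paid later."""
    return future_cost / (1 + annual_rate) ** years


def expected_present_cost(future_cost: float, probability: float,
                          annual_rate: float, years: float) -> float:
    """Present value of a cost that is incurred only with the given probability."""
    return probability * present_value(future_cost, annual_rate, years)


# Hypothetical numbers, chosen only to show the shape of the trade-off:
# a format migration costing 100 units if obsolescence strikes in 30 years,
# a 20% chance that it ever does, and a 5% annual discount rate.
deferred = expected_present_cost(100.0, probability=0.2, annual_rate=0.05, years=30)

# Pre-emptive preparation is paid today, with certainty, for every object.
preemptive = 20.0

print(f"expected present cost of waiting: {deferred:.2f}")
print(f"cost of preparing now:            {preemptive:.2f}")
```

Under these assumed numbers, waiting is several times cheaper in present-value terms than preparing now, and the gap widens as the event becomes more distant or less likely. Money not spent on pre-emptive preparation can instead be spent on collecting more content today.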
Unlike LOCKSS and the Internet Archive, but more like Portico, the KB has invested heavily in preparing for eventual obsolescence. The result is that their approach is considerably more expensive per object than the LOCKSS or Internet Archive approaches. On 1st April, 2009 the KB's eDepot and the Directory of Open Access Journals (DOAJ) announced a pilot program to preserve open access journals. The importance of the program was explained thus:
The composition of the DOAJ collection (currently 4000 journals) is characterized by a very large number of publishers (2.000+), each publishing a very small number of journals on different platforms, in different formats and in more than 50 different languages. Many of these publishers are – with a number of exceptions – fragile when it comes to financial, technical and administrative sustainability.
<correction> I was told that nothing significant came of this partnership, but Marcel Ras in the comments informs us that this isn't true. KB is preserving articles from 650 DOAJ publishers. But this is less than 1/3 of the total number of DOAJ publishers, and the publishers providing content are likely to be the larger and less fragile ones.
The LOCKSS approach from the very start in 1998 stressed low cost:
Libraries have to trade off the cost of preserving access to old material against the cost of acquiring new material. They tend to favor acquiring new material. To be effective, subscription insurance must cost much less than the subscription itself.
Over 9,000 journals from 520 publishers have adopted the LOCKSS approach. It is true that larger publishers provide the bulk of the journals. But the LOCKSS approach has been successful with precisely the kind of small, fragile publisher that populates the DOAJ, not least because (unlike Portico and CLOCKSS) the Global LOCKSS Network does not charge publishers for preservation. It is elementary business thinking to ensure that your potential customers can afford to buy your product. Nevertheless, it is clear that, just as with the KB's approach, the LOCKSS system is reaching only a small fraction of the at-risk content.
Jeff says that LOCKSS, Portico and the Internet Archive "ignore preservation". But he offers no specifics as to what we should be doing instead to preserve the content we target. He does praise the KB's approach which, like Portico and LOCKSS, has turned out to be too expensive even to ingest most of the content at the greatest risk, let alone preserve it. </correction>
My challenge to Jeff is to do one of the following:
- Tell us exactly what we should be doing to deliver more value to future readers within our current budgets, or:
- Tell us how to get enough additional money to do "real preservation", or:
- Stop making the best be the enemy of the good.
- Scale: Jeff's slides 24-30 are a catalog of software archaeology, artifacts from the pre-Web era. Most are unique, such as the Domesday book and ErlKing, and have been preserved thanks to enormously expensive hand-crafted emulations. These feats may be inspiring, but they are completely irrelevant to the problems facing the future of our digital heritage. No-one can devote these levels of attention or resource to the billions of web sites, millions of journal articles and centuries of video that form our heritage. Jeff's thinking is stuck on the problem of preserving an individual digital artifact. There are cases, such as museum collections of digital artworks, where this is appropriate. But they are a minute fraction of the content that needs preservation. Practical solutions need to handle collections of millions of objects and many terabytes of data affordably.
- Bit preservation: Jeff blithely assumes that keeping the bits safe is a solved problem, not worthy of discussion. But doing so is the sine qua non of Jeff's approved approaches, and at the scale that is needed the technical and economic challenges involved in keeping these enormous volumes of data safe are formidable. Indeed, the one safe prediction is that no matter what we do some bits will get lost.
- Intellectual Property: Jeff completely ignores the immense difficulties that copyright law causes for digital preservation. To take but a single example, the approach Jeff favors at the KB results in a collection the vast bulk of which he may consider well-preserved but is inaccessible except to authorized researchers physically at the KB. Vast resources have been spent to preserve this content, but to all but a tiny fraction of potential future readers they will have been completely ineffective at providing access. By contrast, however imperfect Jeff may consider its preservation strategy, the Internet Archive is as I write the 231st most visited site on the Web, serving about 0.4% of all Web users, about 3M unique IPs daily, from its store of over a quarter of a trillion URLs. Which archive is more effective at delivering value to future readers?
- Risk: Pretty much the only thing everyone actually preserving significant digital collections agrees on is that they don't have enough money to provide future readers with access to the content they would like. This is mostly because they can't afford to collect and preserve everything they want to, but in some cases because even though they can collect it they can't afford to provide access to it. Thus future readers are going to be deprived of access to some content for economic reasons. The more expensive the chosen approach is per-object, the more objects readers won't be able to access. In this light, the question becomes how much resource devoted to mitigating which risk will deliver the most value to future readers. We need to trade off the cost of mitigation against the predicted likelihood of the risk. So, for example, the LOCKSS system's decision to invest only limited resources in mitigating the risk of format obsolescence is not a decision to "ignore preservation"; it is a rational resource allocation decision based on a sophisticated analysis of the likelihood of format obsolescence affecting the content to be preserved, and the technical and operational costs of the various mitigation strategies, all assessed against the same parameters for all the risks in the LOCKSS system's threat model. This analysis, first performed more than a decade ago, has been validated by events since then and by current research.
Criticizing efforts at digital preservation for failing to conform to an abstract view of "real preservation" without considering the trade-offs appropriate to the content in question and the available resources is not helpful. It encourages institutions to lavish resources on mitigating threats that may happen only in the far future while failing now to collect and preserve content that those resources could deliver to future readers in a usable form.