Tuesday, February 12, 2013

Rothenberg still wrong

Last March Jeff Rothenberg gave a keynote entitled Digital Preservation in Perspective: How far have we come, and what's next? to the Future Perfect 2012 conference at the wonderful, must-visit Te Papa Tongarewa museum in Wellington, New Zealand. The video is here. The talk only recently came to my attention, for which I apologize.

I have long argued, for example in my 2009 CNI keynote, that while Jeff correctly diagnosed the problems of digital preservation in the pre-Web era, the transition to the Web that started in the mid-90s made those problems largely irrelevant. Jeff's presentation is frustrating, in that it shows how little his thinking has evolved to grapple with the most significant problems facing digital preservation today. Below the fold is my critique of Jeff's keynote.

First, I agree with Jeff when he says (slide 12) that:
In fact, every digital artifact is a program
I've been pointing to the issues raised by the evolution of the Web from a document model to a programming environment for some time; I helped run a workshop at the Library of Congress on this problem last May. Some progress is being made in this area, for example see Dirk von Suchodoletz's paper at IDCC2013 on delivering emulations as a cloud service. However, Jeff fails to point out the problem faced, for example, by the Workflow4Ever project. No matter how faithful the emulator may be, almost all programs these days execute in a context of network services, which themselves execute in a context of network services, and so on ad infinitum. Simply preserving the bits and the ability to re-execute them is not enough if the network services the bits call on are no longer available, or return different results. So while emulation may be one important preservation strategy, it is far from a panacea. Every execution of an instance of the kind of digital artifact that readers commonly encounter, such as web pages, is now a unique experience, impossible to repeat or preserve with the fidelity Jeff envisages.

But there is a more fundamental problem with Jeff's presentation. It is exemplified by this quote from slide 41:
  • Most so-called “archiving” efforts ignore preservation
    • LOCKSS, Portico (journal archiving) offer no real preservation
    • Internet Archive seems based on wishful thinking
Despite saying he "doesn't want to denigrate anyone in particular", in his talk (about 30 minutes into the video) Jeff calls these efforts "misguided" because they "tend to focus on short-term preservation", "don't really have any long-term preservation elements", and "reject emulation out-of-hand" because "it sounds too much like smoke and mirrors so we're not even going to consider that".

It is completely untrue that LOCKSS, Portico and the Internet Archive "ignore preservation" and "reject emulation out-of-hand". Jeff would have been more accurate to say "I don't like the approaches of these archiving efforts". He would have been even more accurate had he said "I don't understand the approaches of these archiving efforts".

Jeff is asking "how best to do digital preservation?" He quotes an ALA definition (slide 5):
The goal of digital preservation is the accurate rendering of authenticated content over time.
He doesn't understand that LOCKSS, Portico, the Internet Archive and others actually preserving stuff in the real world are facing a different question, "how can we use our limited resources to maximize the value delivered to future readers?"

In Jeff's ideal world, cost is an afterthought relegated to slides 46 and 47 out of 48:
  • Perform serious cost and process analyses
    • Based on viable technological approaches
whereas in the real world where the Internet Archive, Portico and the various LOCKSS networks operate, cost is the most important factor driving the system design.

Jeff approves of the approach taken by the KB (slide 41):
  • KB may still be in the lead
    • eDepot designed to address long-term preservation
    • Using a two-pronged migration/emulation approach
In fact, the approach taken by LOCKSS, Portico and the Internet Archive is precisely a two-pronged migration/emulation approach (I believe I'm representing Portico and the Internet Archive correctly here; they are welcome to correct me in the comments):
  • As regards migration, more than 8 years ago LOCKSS demonstrated and published the technique by which these archives can transparently migrate content on-access using HTTP's content negotiation capabilities. These allow a preservation system to detect that the reader's browser cannot render the format in which the content was preserved and create a temporary access copy in a format that it can render. More recently, Memento uses content negotiation to provide readers with uniform access to preserved Web content no matter where it is preserved. Thus if and when migration is appropriate, it will be available; the capability is now embedded in the infrastructure of the Web. This is precisely the strategy of preserving the original and creating "vernacular versions" that Jeff endorses early in his talk.
  • As regards emulation, all three archives depend for both their preservation and dissemination environments on a stack that is open source, carefully preserved in ASCII (and thus itself effectively immune from format obsolescence) as multiple copies in source code repositories capable of reconstructing the entire stack as it was at any time in the past. These environments include multiple emulations of the hardware environment itself, also in ASCII and capable of being reconstructed as they were at any date in the past. I pointed this out 6 years ago in the second post to this blog. No-one since has shown me a credible scenario in which this approach fails for the vast majority of the content these archives are devoted to preserving. Thus if and when emulation is appropriate, it will be available; the capability is now embedded in the infrastructure of computing.
So, Jeff, what exactly is wrong with this two-pronged approach that is not also wrong with the KB's approach?
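
The on-access migration described in the first bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LOCKSS implementation: the function names, the CONVERTERS table and the GIF-to-PNG pairing are hypothetical, and the Accept header parsing handles only media types and q-values.

```python
# Hypothetical table of (preserved format, access format) conversions
# that the archive is assumed to be able to perform.
CONVERTERS = {("image/gif", "image/png")}


def parse_accept(header):
    """Parse an HTTP Accept header into a list of (media_type, q) pairs."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client gives no q parameter
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    return prefs


def accepts(prefs, media_type):
    """True if the browser accepts media_type; the most specific match wins."""
    main = media_type.split("/")[0]
    for pattern in (media_type, main + "/*", "*/*"):
        for candidate, q in prefs:
            if candidate == pattern:
                return q > 0
    return False


def choose_format(accept_header, preserved_type, converters=CONVERTERS):
    """Return (format to serve, whether a temporary access copy is needed)."""
    prefs = parse_accept(accept_header)
    if accepts(prefs, preserved_type):
        return preserved_type, False  # browser can render the original bits
    for source, target in sorted(converters):
        if source == preserved_type and accepts(prefs, target):
            return target, True  # migrate on access; the original is kept
    # Assumption: with no usable converter, fall back to the original.
    return preserved_type, False
```

The preserved original is never modified; the access copy exists only to satisfy a reader whose browser has dropped support for the preserved format.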

It is true that these archives are not currently investing large resources into migration. The framework for doing so is simple, and it was demonstrated long ago. Recent research at the BL and INA has shown that for the content these archives preserve format obsolescence is not imminent. Nor are they investing large resources into emulation. Doing so would simply waste resources duplicating work that is being done much better by others for reasons having nothing to do with digital preservation. And, again, obsolescence of the hardware and software stack is not imminent. As I noted 6 years ago in the third post to this blog, there are fundamental economic reasons why it is very difficult to justify investing heavily in preparing for hypothetical obsolescence events in the distant future. Recent research in Discounted Cash Flow has greatly reinforced this observation. Jeff's advocacy of perfectionist, expensive, pre-emptive approaches ignores basic economics.
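
To make the economic point concrete, here is the standard present-value formula as a small sketch. The interest rate, time horizon and cost in the usage note are illustrative assumptions, not figures from any of the archives discussed.

```python
def present_value(future_cost, annual_rate, years):
    """Value today of a cost incurred `years` from now, discounted at `annual_rate`."""
    return future_cost / (1.0 + annual_rate) ** years


def expected_present_value(future_cost, annual_rate, years, probability):
    """Present value weighted by the assumed probability that the cost is ever incurred."""
    return probability * present_value(future_cost, annual_rate, years)
```

Under these assumptions, a cost of 100 incurred 20 years from now at a 5% discount rate is worth less than 38 today, and less still if the obsolescence event it pays for is unlikely. Spending more than that now to avoid the future cost is hard to justify.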

Unlike LOCKSS and the Internet Archive, but more like Portico, the KB has invested heavily in preparing for eventual obsolescence. The result is that their approach is considerably more expensive per object than the LOCKSS or Internet Archive approaches. On 1st April, 2009 the KB's eDepot and the Directory of Open Access Journals (DOAJ) announced a pilot program to preserve open access journals. The importance of the program was explained thus:
The composition of the DOAJ collection (currently 4000 journals) is characterized by a very large number of publishers (2.000+), each publishing a very small number of journals on different platforms, in different formats and in more than 50 different languages. Many of these publishers are – with a number of exceptions – fragile when it comes to financial, technical and administrative sustainability.
<correction> I was told that nothing significant came of this partnership, but Marcel Ras in the comments informs us that this isn't true. KB is preserving articles from 650 DOAJ publishers. But this is less than 1/3 of the total number of DOAJ publishers, and the publishers providing content are likely to be the larger and less fragile ones. The LOCKSS approach from the very start in 1998 stressed low cost:
Libraries have to trade off the cost of preserving access to old material against the cost of acquiring new material. They tend to favor acquiring new material. To be effective, subscription insurance must cost much less than the subscription itself.
Over 9,000 journals from 520 publishers have adopted the LOCKSS approach. It is true that larger publishers provide the bulk of the journals. But the LOCKSS approach has been successful with precisely the kind of small, fragile publisher that populates the DOAJ, not least because (unlike Portico and CLOCKSS) the Global LOCKSS Network does not charge publishers for preservation. It is elementary business thinking to ensure that your potential customers can afford to buy your product. Nevertheless, it is clear that, just as with the KB's approach, the LOCKSS system is reaching only a small fraction of the at-risk content.

Jeff says that LOCKSS, Portico and the Internet Archive "ignore preservation". But he offers no specifics as to what we should be doing instead to preserve the content we target. He does praise the KB's approach which, like Portico and LOCKSS, has turned out to be too expensive even to ingest most of the content at the greatest risk, let alone preserve it. </correction>

My challenge to Jeff is to do one of the following:
  • Tell us exactly what we should be doing to deliver more value to future readers within our current budgets, or:
  • Tell us how to get enough additional money to do "real preservation", or:
  • Stop making the best be the enemy of the good.
Some other factors entirely missing from Jeff's slides are:
  • Scale: Jeff's slides 24-30 are a catalog of software archaeology, artifacts from the pre-Web era. Most are unique, such as the Domesday book and ErlKing, and have been preserved thanks to enormously expensive hand-crafted emulations. These feats may be inspiring, but they are completely irrelevant to the problems facing the future of our digital heritage. No-one can devote these levels of attention or resource to the billions of web sites, millions of journal articles and centuries of video that form our heritage. Jeff's thinking is stuck on the problem of preserving an individual digital artifact. There are cases, such as museum collections of digital artworks, where this is appropriate. But they are a minute fraction of the content that needs preservation. Practical solutions need to handle collections of millions of objects and many terabytes of data affordably.
  • Bit preservation: Jeff blithely assumes that keeping the bits safe is a solved problem, not worthy of discussion. But doing so is the sine qua non of Jeff's approved approaches, and at the scale that is needed the technical and economic challenges involved in keeping these enormous volumes of data safe are formidable. Indeed, the one safe prediction is that no matter what we do some bits will get lost.
  • Intellectual Property: Jeff completely ignores the immense difficulties that copyright law causes for digital preservation. To take but a single example, the approach Jeff favors at the KB results in a collection the vast bulk of which he may consider well-preserved but is inaccessible except to authorized researchers physically at the KB. Vast resources have been spent to preserve this content, but to all but a tiny fraction of potential future readers they will have been completely ineffective at providing access. By contrast, however imperfect Jeff may consider its preservation strategy, the Internet Archive is as I write the 231st most visited site on the Web, serving about 0.4% of all Web users, about 3M unique IPs daily, from its store of over a quarter of a trillion URLs. Which archive is more effective at delivering value to future readers?
  • Risk: Pretty much the only thing everyone actually preserving significant digital collections agrees on is that they don't have enough money to provide future readers with access to the content they would like. This is mostly because they can't afford to collect and preserve everything they want to, but in some cases because even though they can collect it they can't afford to provide access to it. Thus future readers are going to be deprived of access to some content for economic reasons. The more expensive the chosen approach is per-object, the more objects readers won't be able to access. In this light, the question becomes how much resource devoted to mitigating which risk will deliver the most value to future readers. We need to trade off the cost of mitigation against the predicted likelihood of the risk. So, for example, the LOCKSS system's decision to invest only limited resources in mitigating the risk of format obsolescence is not a decision to "ignore preservation"; it is a rational resource allocation decision based on a sophisticated analysis of the likelihood of format obsolescence affecting the content to be preserved, and the technical and operational costs of the various mitigation strategies, all assessed against the same parameters for all the risks in the LOCKSS system's threat model. This analysis, first performed more than a decade ago, has been validated by events since then and by current research.
No one-size-fits-all approach to digital preservation is viable. Each collection must trade off between its available resources and the threats to its particular types of content. At any time there will be content whose preservation would be too expensive to be justified. The major cause of digital objects not being available to future readers is economic; no-one could afford to preserve them. The more spent per-object, the fewer objects can be preserved.
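
The trade-off can be put as back-of-the-envelope arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the budget, per-object costs and survival probabilities in the usage note are invented numbers, not measurements from any archive.

```python
def expected_usable_objects(budget, cost_per_object, p_usable):
    """Objects affordable within a fixed budget, weighted by the assumed
    probability that each preserved object is still usable by future readers."""
    affordable = int(budget // cost_per_object)
    return affordable * p_usable
```

With a budget of 1000 units, an approach costing 1 unit per object with a 90% chance of each object remaining usable is expected to deliver about 900 usable objects. An approach costing 5 units per object delivers 200 even if every one of them remains usable.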

Criticizing efforts at digital preservation for failing to conform to an abstract view of "real preservation" without considering the trade-offs appropriate to the content in question and the available resources is not helpful. It encourages institutions to lavish resources on mitigating threats that may happen only in the far future while failing now to collect and preserve content that those resources could deliver to future readers in a usable form.


Unknown said...

As program manager of the International e-Depot of the KB, the National Library of the Netherlands, I would like to respond to this post.

The arguments of Jeff Rothenberg concerning the different preservation approaches and archives doing real life preservation are questioned. As Jeff also talks often about the KB, you pointed several arrows at the KB in the blog. I would like to respond to some incorrect observations of you in relation to the international e-Depot.

I certainly do agree with you David that one-size does not fit all. Each collection and each organization needs a different approach and has different resources to be managed and threats to its collections to be kept safe. Jeff argues that the KB approach is still the most viable approach. This of course is a very encouraging conclusion which certainly gives credit to the work the KB has been doing for many years now. However, I do agree with you that the approaches of Portico, LOCKSS/CLOCKSS and the Internet Archive are equally viable. I am very sure that active, and real, preservation is the highest priority of these initiatives. I certainly hope it is, as KB is working closely together with all these initiatives on solutions for large scale preservation problems.

The first point I would like to make is that I fully agree on the observation that the economic situation is an important issue for all of us, libraries and archiving solutions. An international comparison of costs and cost drivers would be very interesting to make. There are surveys on this: APARSEN, 4C and Enumerate are just a few of them. What might be interesting is a real life comparison between the costs and cost models of the archiving solutions like e-Depot, Portico, Internet Archive, LOCKSS/CLOCKSS.

Incorrect however is the observation on DOAJ. It is correct, KB carried out a pilot project with DOAJ in 2009. Following on this pilot, the KB signed an archiving agreement with the DOAJ in which we agreed that KB e-Depot will serve as long term deposit for the publishers participating in DOAJ. In contrast with what you stated in your blog, this cooperation is working fine. DOAJ is providing the e-Depot with a constant stream of Open Access journal titles. The KB is, as far as I know, the only archive which is preserving such an amount of OA titles in a structured way. Just recently governance of DOAJ has been taken over by Infrastructure Services for Open Access C.I.C. (IS4OA). This will give DOAJ a new and stronger base to work. Cooperation between KB and DOAJ continues and we both have the ambition to raise coverage on open access journals to be preserved. In contrast to your observation, the open access publishers (indeed small publishers with fragile business models) do not have to pay for this service. This is arranged between DOAJ and KB.

One last response to your blog is the issue of accessibility. Indeed e-Depot content is available for authorized users who have to come to The Hague and physically visit the library. But I have to stress that the international e-Depot is a dedicated service to guarantee permanent access to scientific publications. That means that it is not developed to provide access today (this is taken care of by the publishers), but to guarantee access for the users of tomorrow. The same goes for Portico and CLOCKSS. The international e-Depot is a service that will provide access in case of emergency; so in case of a trigger event and in case of a post-cancellation. Access within the reading rooms of the library is a surplus we agreed upon with the publishers. As the KB is a national library, we are very keen on providing access to our collections for our users. Access to the content of the international e-Depot, which is as I mentioned for long-term usage, today offers an extra service for researchers. But this is certainly not the core of the international e-Depot.

Marcel Ras, Program manager international e-Depot, KB Netherlands. marcel.ras@kb.nl

David. said...

Marcel, thank you for your comment. I'm glad we agree that one-size-fits-all doesn't work, and that economics are key to practical preservation.

As regards access to preserved content, I think we can agree that in a perfect world we would not be spending our limited resources accumulating vast collections never to be accessed except in case of disaster. That the KB, Portico, CLOCKSS and others are doing so is a regrettable consequence of copyright law. That makes it striking that in a 45-minute talk Jeff didn't mention copyright, which was the point I was making.

As regards DOAJ, I am happy to correct the post but to do so I need data. How many DOAJ publishers and journals is the KB preserving? More to the point of the paragraph, while I am glad that the publishers are not charged, money is not the only obstacle between agreement in principle and actual preservation. How many of them are the small, fragile, single-journal publishers who are most at risk?

Unknown said...

Dear David,
I fully agree with you that Copyright legislation will not make our job to preserve digital information easier. But that is what we have to deal with. KB experience is that it is possible to reach arrangements with publishers on access other than in case of emergency only. I realise that on-site access is a very limited type of access in a digital world, but it is a step.
Regarding DOAJ: KB is preserving 92,000 open access articles from 650 publishers, supplying 900 titles.
These are all small, many of them even single-journal publishers. As for the amount of titles available, it is a start, but certainly not an insignificant one. In my opinion and experience it is a perfect way to preserve a part of the long-tail and the available open access titles.

David. said...

I have at last corrected the post to match the data Marcel kindly supplied. I apologize for the delay in doing so; I was on a much-needed vacation. Look for <correction>.

David. said...

My figures are wrong because the DOAJ journal count was out-of-date. As of today the DOAJ claims 8917 journals, so the 900 titles the KB is preserving represent about 10% of the journals. The DOAJ claims 1,061,803 articles, so the KB's 92,000 articles are about 8.7% of them.

I believe these numbers substantiate my point, which is that the KB and other approaches have been unable to even ingest much of the content they should be preserving. This means that arguing about "real preservation" is irrelevant to the major cause of loss of value to future readers, namely that the content was never ingested in the first place because no-one could afford to preserve it.