Wednesday, May 6, 2009

Sheila Morrissey's comment

Portico's Sheila Morrissey posted a valuable comment on the post that provided the background and sources for my CNI plenary. It set out the conventional wisdom against which I was arguing, but at such length that I felt it was inhibiting discussion. It was also difficult to respond to by adding a comment, among other reasons because there was no easy way to connect my responses to their targets in the comment. I therefore saved the text of Sheila's comment, deleted it from the original post, and reproduced it below the fold, together with my responses. Portico has posted a version of her comment here.

My talk was in three parts. They argued:

  • that Jeff Rothenberg, whose Scientific American article drew attention to the problem of preserving digital documents, naturally saw format obsolescence as the major problem because the desktop publishing paradigm that was predominant as he wrote in 1995 saw documents and their formats as the property of applications,

  • that developments since, primarily the Web, have transformed documents and their formats into publishing infrastructure, and that we have many years of experience showing that economic and technical forces prevent infrastructure, such as network protocols, file systems and operating system kernels, from changing in incompatible ways, thus diminishing the threat of format obsolescence,

  • and that our current techniques, based on Jeff's analysis, are inadequate to society's needs for digital preservation, in part because they are too expensive, and in part because they do not address the technical and even conceptual problems of the effect of the Web on publishing.

To argue against the second part, Sheila would need to show that format obsolescence has in fact been causing documents to be lost. To argue against the third part, Sheila would need to show that our current techniques are in fact affordable and adequate to the technical problems. Sheila does neither of these.

In my overview of the talk I compared it to Stephen Sondheim's Into The Woods, because in both cases the happy ending is in the middle. Anyone who was there, as Sheila was not, would tell you that the part that looked forward was at least as pessimistic about the future as Jeff's article. Sheila writes as if this invited CNI plenary were a sales pitch for a particular system, rather than an indictment of all current systems. For example,
LOCKSS is exactly co-extensive with what you propose as the totally sufficient preservation strategy: just get the bits and save the bits, and use open-source software to do so.
Sheila seems unable to address the actual text of my post, erecting instead a straw-man created from manipulated quotes.

Sheila, it must be pointed out, works for an organization whose business model depends on persuading libraries that the problems of digital preservation are so complex and difficult that they cannot and should not attempt to tackle them for themselves, but should pay experts to handle them instead. The techniques they use to perform this persuasion can be divided into three groups:

  • Denigration: downplaying libraries' capabilities,

  • Exaggeration: making the problems look more difficult than they are, primarily by claiming that many tasks are essential which are in fact optional,

  • Obfuscation: confusing the uninitiated reader by using undefined jargon and failing to cite definitions and sources, thus making the arguments accessible only to experts.

Note in particular that Sheila provides only three links, two to a comment she posted earlier, and one to a paper I wrote 6 years ago. She makes many sweeping assertions but fails to back any of them up with evidence, a common way to Obfuscate, and especially ironic in a comment on a post supplying the sources for the assertions I made in my talk.

Hello David,

Just some thoughts on your slides and comments, as together we all grapple with the challenges we face to ensure the legacy of digital artifacts --

In the words of that profound cultural historian, Yogi Berra,

“If you don't know where you're going, you'll wind up somewhere else.” And, as he might have added: “If you don't know where you are, you'll have a hard time getting there.”
Sheila starts with Obfuscation. Or is it just because I'm an Englishman that the relevance of this escapes me?
The lesson of what you call modern history (1995 to 2009), you say, is that all those preservation risks we anticipated in the bad old days of ancient history (the pre-1995 assumptions that we'll have “documents in app-specific formats, hardware & O/S will change rapidly in ways that break applications, apps for rendering formats have a short life”) have been conjured away by the market's invisible hand.
Well, no, I did not say that. The idea that "all those preservation risks ... have been conjured away by the market's invisible hand" is Sheila's Exaggeration. The slide from which she takes the quote was describing Jeff Rothenberg's dystopian vision of the future for documents, not my views. Further, she omits two key bullets. The full quote is:
  • Documents are in app-specific formats,

    • Typical formats are proprietary

    • Attempts to standardize formats will fail

  • Hardware & O/S will change rapidly

    • In ways that break applications

  • Apps for rendering formats have a short life
I provided examples and sources to back up the contention that these predictions have not come true.
Sadly, it's not so.

It was not so in 2003, as you have described elsewhere, when it was found necessary to send old Ethernet cards to enthusiastic early adopters of LOCKSS, because the Ethernet chips in the inexpensive but new hardware they purchased were incompatible with the older OpenBSD drivers in the LOCKSS software distribution.
Implying that a few support calls more than 6 years ago (resolved, if memory serves, at a cost of $2.99 plus postage and packing) show that libraries can't do digital preservation is an example of Denigration. But in any case, Sheila is also using Exaggeration here. Her example is of forwards incompatibility: the old software was incapable of driving the new chips. The digital preservation problem is one of backwards, not forwards, incompatibility: new software incapable of rendering old formats. One release later, and OpenBSD could drive those chips.
And it's not so in 2009, either. We still have multiple hardware architectures and operating systems – not just mainframes and servers and desktops, not just Z10 and Windows XP and Windows Vista and Solaris and the many flavors of Linux – but also cell phones and PDAs, iPods and Walkmans and Zunes, navigation appliances like TomTom, eBook readers like Kindle, game boxes like Xbox and Wii. And we have a whole zoo of new applications and new formats to go with them. Some are merely delivery formats, but some are actual content formats. Some are open; some are proprietary. We have the many variants of geospatial data (GDF, CARiN, S-Dal, PSF, and the myriad sub-species of ShapeFiles). We have formats for eBook readers (proprietary ones like Kindle's MobiPocket-based AZW format, in addition to open ones like OPF).
Are you feeling Obfuscated yet? The implication is that special, expert digital preservation techniques are needed for each and every one of the devices and operating systems named. But there is no explanation of, or cited sources for, why this should be so. Note for example that she points to various MP3 players without noting that the very success of MP3 players as a product is based on their sharing a format which, because of their success, is now very hard to change.
The many proprietary CAD tools and formats currently in use, which have no satisfactory non-proprietary open-source substitutes, already complicate the present communication, not to say the future preservation, of architectural design artifacts.
Mackenzie Smith raised this very point in the question period after my talk. I answered that the CAD market is still comparatively immature, and that one can expect that the customers in this market will eventually rebel against the costs of proprietary lock-in, as Microsoft's have done. But whether or not they do is actually irrelevant to the problem of preservation. All these systems run on hardware for which there are open source emulators. If the binaries of the CAD tools and their O/S are preserved, the tools can be run using emulation. As I said in the talk, Jeff was right about emulation, just wrong about the reason for it.
The fact is, new content formats are being created all the time – some of them proprietary; some with only proprietary rendition tools; many in wide use in both online and offline content repositories of interest to one digital preservation community or another. The technology market does not stand still. The market will always recapitulate exactly the same process that W. Brian Arthur analyzed and that you cite in your talk, describing the "office suite wars": get out there with your own product, do what you have to do to suck up market share, and crowd out whomever you can, whether they got there before you, or came in after you.
Again, Sheila Exaggerates by confusing the introduction of new formats (necessarily accompanied by renderers) with the obsolescence of formats, which is the problem Jeff identified as key to digital preservation.

I pointed out in my post that the market does not "always recapitulate exactly the same process that W. Brian Arthur analyzed", using the Boeing/Airbus, Intel/AMD, and Nvidia/ATI duopolies as examples. Arthur's book did explain the behavior of markets such as desktop publishing, but more sophisticated customers have since developed more sophisticated strategies.
And after the market finishes strip-mining a particular application or content domain, what then? The digital preservation community will be left, then as now, to clean up the mess. There will always be artifacts in some defunct format, for which there will have been no sufficient market to ensure freely available open-source renderers. And there will be artifacts nominally in the format that prevails that will be defective. So, for example, we’ll likely be dealing with "crufty" self-published eBooks that fly below Google's content-acquisition radar, or with “crufty” geospatial databases, in whatever dominant format, just as today web harvesters have to contend with "crufty" HTML.
The talk argues that our current digital preservation techniques are massively inadequate to the task at hand. One major reason is that the focus in research and development of digital preservation has been on problems that are not relevant to the overwhelming majority of the content to be preserved. Clearly, there will always be some amount of content in formats so far from the mainstream that the cost of preserving them outweighs the benefits of doing so. But that fact doesn't invalidate the argument. If we spend all our R&D solving the problems of the cruft and by doing so fail to address the big problems set out in the third part of the talk we will have achieved nothing useful.
Simply depositing contemporary open source renderers in a Source-Forge-like depository is no warrant those renderers will perform as needed in the future, even assuming your proposed comprehensive emulation infrastructure. And if those tools fail then, when they are called on “just in time”, where will the living knowledge of these formats be found to make up the deficit?
Sheila here fails to explain any mechanism by which open source renderers would become obsolete. She also fails to explain how the "living knowledge", by which she appears to mean format specifications, would be used to make up the deficit when it could not do so by creating an open source renderer before the format went obsolete. This topic was covered in these posts.
The market is not going to provide us with a silver bullet that will solve all the problems the market itself creates. Nor will the market alone afford us a way around the unhappy fact that preservation assets have to be managed. And that means, especially for scholarly artifacts, that we're still going to need metadata: technical metadata, descriptive metadata, rights metadata, provenance metadata.
Sheila Exaggerates again, implying that I believe "The market is ... going to provide us with a silver bullet" but failing to cite anywhere I said anything like that. And she Obfuscates by providing a list of metadata types with no indication of what value they add to the preservation process and whether it justifies the cost of acquiring and preserving them.

Also, I am not proposing a "comprehensive emulation infrastructure". I am merely observing that providing a "comprehensive emulation infrastructure" is now a necessary part of the IT mainstream.
An earlier post explains why I don't think an open-source renderer, even if one can be found, is a sufficient substitute for technical metadata.
Sheila Exaggerates her comment to which she links, which doesn't explain that at all. It instead recommends that preservation systems should either (somehow) force publishers to produce content that conforms more closely to format specifications, or do as Portico does in some cases and expensively store multiple versions of the content.
I think the library community, and the academic community in general, would at least want to consider how much less effective search tools will be without descriptive metadata (which, incidentally, need not be, and are not currently, necessarily hand-crafted). Even in an ideal world of open access, it is possible to imagine categories of digital assets for which we will be obliged to track rights metadata. And given, as you say, that the Web is Winston Smith's dream machine, perhaps provenance metadata, event metadata, collection management metadata, all still have important roles to play in digital preservation, too.

So we have to ask: Is the lesson of history, ancient and modern, that “formats and metadata are non-problems”? That “all we have to do is just collect and keep the bits, and all will be well”? Or does history caution us, with respect to formats and metadata, that it might be a little early yet to be standing in our flight suits on the deck of the good ship Digital Preservation, under a banner reading “Mission Accomplished”?
Once again, Sheila Exaggerates by being a bit careless with quotes. The actual quotes in the slides are:

  • What are the non-problems?

    • Or rather, the problems not big enough to matter


  • Formats

    • Any format with an open-source renderer is not at risk

  • Metadata (at least for documents)

    • Hand-generated metadata

      • Too expensive, search is better & more up-to-date

    • Program-generated metadata

      • Why save the output? You can save the program!

  • Just collect and keep the bits

    • Not collecting is the major reason for stuff being lost

  • If you keep the bits, all will be well

    • Current tools will let you access them for a long time

Sheila hasn't provided any actual evidence to rebut the history of open source stability I provided, nor any showing that the collection and preservation of all of the multifarious types of metadata adds enough value to the preservation process to justify the costs involved. Nor has she refuted my contention that the major reason stuff is being lost is because it isn't being collected. Nor has she provided any evidence that anyone, least of all me, is "standing in our flight suits on the deck of the good ship Digital Preservation, under a banner reading 'Mission Accomplished'". She cannot have read the last third of the talk if she believes I think the mission is accomplished.
Are there perhaps some other lessons we can learn from a wider scan of the past and present of digital preservation?
There certainly are, and the entire third part of the talk was devoted to them. Unfortunately, Sheila proceeds to ignore most of them, such as the problems of intellectual property, and the problems caused by the shift from static documents to dynamic, interlinked collections of services.
Digital preservation is not free. This is one of the key points of your talk, and it's a really crucial insight. We need more and better numbers to help us ensure that we get the biggest possible preservation bang for our always-too-scarce preservation bucks.

So I am sorry, given that you are perhaps the person best positioned to have the necessary data to hand, that, along with giving your estimates of the costs of the Internet Archive and Portico, you did not take the occasion to detail, or indeed even to mention, the cost of C|LOCKSS as a preservation solution. This is all the more a pity as LOCKSS is exactly co-extensive with what you propose as the totally sufficient preservation strategy: just get the bits and save the bits, and use open-source software to do so.
The entire last third of my talk was devoted to reasons why there is currently no such thing as a "totally sufficient preservation strategy". One major reason is that all current approaches are far too expensive to meet society's needs. To show that I took two examples, the Internet Archive and Portico. The Internet Archive is clearly the lowest-cost system I could have chosen. I am not claiming that Portico is the highest-cost system; indeed it is probably the most efficient system using the conventional approach. I used these two examples to show two things:

  • That even the Internet Archive's costs are far too high to meet society's need with funds that could reasonably be expected to be available.

  • That the additional tasks required by the conventional approach Portico implements so efficiently are very large, and thus that they should either be eliminated, or made the subject of aggressive cost-reduction.
Adding other examples would have been redundant; the talk was already an hour long.
Open source projects after all are not magically exempt from economic laws. This is as true of the open-source tools that Portico uses, for example, as it is of the fascinating open-source emulators under development in the Dioscuri project, or, as you have elsewhere observed, of open-source software originally intended as a commercial product (such as Open Office, which, as you noted, entailed a 10-year development effort, and significant on-going subsidy from Sun Microsystems). Free software, as the saying goes, means “free as in speech, not free as in beer”. All vital, useful open source projects entail costs to get them going, and costs to keep them alive. If we are to judge how economical a preservation solution is, or to amortize the cost of preserved content over its lifetime, we need to know

  • what the sunk costs are

  • what the institutional subsidy in the form of what might be termed charge-backs was and is and is projected to be

  • what the ongoing costs are

  • what the projected costs are

  • what the incremental costs of adding both content nodes and subscribers to the network are

  • what fiscal reserves have to be set aside to ensure the continuance of the solution

Obfuscation in action - of course it would be good if the experts knew all these things and could decide which systems were sustainable in advance. But it doesn't address the point the talk was making, which is that we know for certain that all our current approaches, even the Open Source ones such as the Internet Archive's, are far too expensive.
Digital preservation is not one thing. It is many use cases, and many tiers of many solutions. It is many, and many different kinds of, participants, making many different kinds of preservation choices. Different content, different needs, different valuations, different scenarios dictate different cost/benefit analyses, different actions, different solutions.
Obfuscation again - the subject is so diverse that there is really nothing definite that can be said, even by an expert such as Sheila.
Of course we have to consider the challenges of scale, as you say in your talk. And of course this means we have to find ways to employ the economies of scale, and then to consider any risks the inevitable centralization to employ those economies entails. But we would do well to remember that, in Arthur's analysis of the convergence of the high technology market to a single market “solution”, that single solution is often neither the “best” nor the most “efficient” one.

The values of the marketplace, and the solutions of the marketplace, are not the values and solutions of the digital preservation community. The digital preservation community cannot afford a preservation monoculture, which could well be a single point of failure, whether that means a monolithic technology that could fail, or malignant political control of intellectual and cultural assets. We need CLOCKSS and Portico, the Stanford Digital Library and Florida's DAITSS, the British Library and the Koninklijke Bibliotheek and the German National Library, the Internet Archive, LOCKSS private networks, consortia collections, “retail” preservation-in-the-cloud.
Sheila Exaggerates as usual, attacking the straw man of the market. Nowhere in the talk do I argue that the market will solve the digital preservation problem, or that a single solution is either good or inevitable. But I would point out that, as Sheila says, "Digital preservation is not free", and thus it cannot ignore the market by simply assuming that anything anyone wants to try will be funded by magic. She also Obfuscates, citing a slew of systems without linking to them so that the uninitiated could easily determine what their differences and similarities might be.
Digital preservation is not a technology.

It is a human activity. It is a collective action. It is people fulfilling the social compact to preserve and to pass along the socially meaningful digital creations of our time, our place, our cultures, our communities. It means people making choices – choices about what to preserve and what not to preserve; choices about what level of investment is appropriate for what categories of objects; choices about how to “divide and conquer” the universe of potentially preservable digital objects; choices about what entities and what technologies are trustworthy agents of preservation.

The market has its place. But surely if we have learned anything from recent events, it is to distrust the glamour of “the new new thing”, the can’t-fail digital preservation appliance that will hoover up all the bits in the world, and collect them in a preservation dust bag. The complexity of the digital preservation solution will have to match the complexity of the problem.

I agree with Sheila that digital preservation is "fulfilling the social compact to preserve and to pass along the socially meaningful digital creations of our time, our place, our cultures, our communities." I'm just pointing out that the techniques we currently use to do the job, mostly based on a decade-and-a-half-old analysis of the problem which history has largely invalidated, are completely inadequate to the task. And that relentlessly adding complexity is not going to help solve that problem.
Best regards,


Sheila Morrissey
Senior Research Developer, Portico


Gary McGath said...

This is all getting a bit heated -- and long. To the extent that you're arguing that digital preservation shouldn't be the activity of an elite which is apart from the institutions with content to preserve, I agree with you. But I'm in one of the few library organizations big enough to have its own ongoing digital preservation activity, which makes it easy for me to say that. Preservation needs its own expertise, though it should work with the institutions that have, not see itself as something above the market.

As I've argued in my own blog, reconstructability should be a primary metric of preservation. This depends on several factors (oh, noes! Obfuscation!), including documentation of the format, long-term availability of software to render it, and ease of extracting some content in the absence of complete information. Some formats are bad on all these counts (e.g., most variants of Word), yet are still widely used. A push in the right direction can get content managers to make more reconstructable choices, just as a push from security experts can get them to create more secure websites.

I'd be cautious about the long-term value of open-source renderers and converters. If an application was written for a 1985 system and hasn't been updated since, what are the chances its source code will compile on a modern machine? Maybe the fixes will be easy, maybe not. But source code obsolescence should be a concern, along with format obsolescence.

Unknown said...

Digital preservation is necessary, and costly. I agree that it shouldn't be the activity of the elite. However, given the amount of data that is produced, digital preservation has to evolve, as there are extreme sociopolitical pressures to do so. I also suggest there would be natural selection to some degree. Data that is popular will become a meme, negating the argument of outdated software somewhat. I recognize it is not possible to preserve all data, and digital data will always deteriorate to some degree. All this said, surely what is valuable enough to be stored in the long term is down to what is deemed valuable, and every human on this planet would have a differing opinion...if they were asked.

Ben - Kindle Case