Wednesday, November 20, 2013

Patio Perspectives at ANADP II: Preserving the Other Half

Vicky Reich and I moderated a session at ANADP II entitled Patio Perspectives Session 2: New Models of Collaborative Preservation. The abstract for the session said:
This session will explore how well current preservation models keep our evolving scholarly communication products accessible for the short and long term. Library and publisher practices are changing in response to scholars' needs and market constraints. Where are the holes in our current approaches, and how can they be filled? Or are completely new models required?
I gave a brief introductory talk; an edited text with links to the sources is below the fold.

I'm David Rosenthal from the LOCKSS Program at the Stanford University Libraries, which 6 weeks ago celebrated its 15th birthday. My job here is to set the scene, based on that experience, recent talks and blog posts, and the report of ANADP I, especially Cliff Lynch's closing remarks. As usual, after the session I will post an edited version of my text with links to the sources on my blog, so there is no need to take notes. And, as usual, I hope to say some things that people will disagree with, to spark discussion. Our abstract for this session posed three questions - I will address them in turn.

How well do current preservation models keep our evolving scholarly communication products accessible for the short and long term?

We should start by asking "how much of the scholarly record is being preserved?" In my talk to the Preservation at Scale workshop at iPRES2013, Diversity and Risk at Scale, I cited estimates that less than half was preserved anywhere:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K serials "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials (at most 31.5K of 80K, about 39%) are at any stage of preservation.
  • Luis Faria and co-authors (PDF) at iPRES2013 compare information extracted from publishers' web sites with the Keepers Registry and conclude:
    We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
Somewhat less than half sounds as though we have made good progress. Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.

First, the assessment isn't risk-adjusted. Librarians, who are concerned with post-cancellation access, not with preserving the record of scholarship, have directed resources to subscription rather than open-access content and, within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from the content at high risk of loss. Preserving Elsevier's content makes it look as though a huge part of the record is safe, because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation, so what have those resources achieved for future readers?

Second, the assessment isn't adjusted for difficulty. A similar risk aversion is manifest in the idea that different formats deserve different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence. The same effect occurs in the negotiations for permission to preserve content. Negotiating once with a large publisher gains a large amount of low-risk content, whereas negotiating once with a small publisher gains a small amount of high-risk content. Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking. It looks only at the traditional forms of scholarly communication, books and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are all at much higher risk of loss than the traditional forms being preserved, because they lack well-established and robust business models, and are also much more difficult to preserve, because the legal framework is unclear and the content is much larger, much more dynamic, or in some cases both.

Where are the holes in our current approaches, and how can they be filled?

Clearly, if even on a wildly optimistic assessment our current approach is preserving well under half the content it should, there are significant holes to be filled:
  • We need to move a significant slice of our limited resources from content that is at low risk to content that is at high risk. In particular, we need to devote more resources to content that is:
    • open access rather than subscription
    • from small rather than large publishers
  • We need to move a significant slice of our limited resources from content that is technically easy to content that is technically hard. In particular:
    • Reduce the cost of preserving small, diverse publisher content by providing tools they can use to deliver content in standard forms.
    • Work with publishing platforms to provide archives with access to archival versions of content in standard forms.
  • We need to move a significant slice of our limited resources from traditional forms of content to the new forms that are taking over scholarly communication:
    • Publishing and peer-reviewing the data from which the paper was derived in a form allowing both replication and re-use.
    • Publishing the executable workflows used to analyze the raw data.
    • Transitioning from pre- to post-publication review.
    • The use of social media to communicate and rate scholarship.
It would be naive to expect a large increase in resources for digital preservation, so we need to much more than double the content preserved per dollar. Or, to look at it another way, we need to figure out how to do what we are currently doing on much less than half the dollars we are currently spending.

One popular approach is to harvest alleged "economies of scale". Here, for example, is Bill Bowen (a well-known economist, but with an axe to grind) enthusing about the prospects for Portico in 2005:
Fortunately, the economies of scale are so pronounced that if the community as a whole steps forward, cost per participant (scaled to size) should be manageable.
The idea that there are large economies of scale also leads to the recent enthusiasm for using cloud storage for preservation. Here is DuraCloud, from 2012:
give Duracloud users the ability to be in control of their content ... taking advantage of cloud economies of scale
But this can be a trap. As Cliff pointed out:
When resources are very scarce, there's a great tendency to centralize, to standardize, to eliminate redundancy in the name of cost effectiveness. This can be very dangerous; it can produce systems that are very brittle and vulnerable, and that are subject to catastrophic failure.
Cliff also highlighted one area in particular in which creating this kind of monoculture causes a serious problem:
I'm very worried that as we build up very visible instances of digital cultural heritage that these collections are going to become subject [to] attack in the same way the national libraries, museums, ... have been subject to deliberate attack and destruction throughout history.
...
Imagine the impact of having a major repository ... raided and having a Wikileaks type of dump of all of the embargoed collections in it. ... Or imagine the deliberate and systematic modification or corruption of materials.
But monoculture is not the only problem. As I pointed out at the Preservation at Scale workshop, the economies of scale are often misleading. Typically they follow an S-curve, and the steep part of the curve comes at a fairly moderate scale. The bulk of the economies end up with commercial suppliers operating well above that scale rather than with their customers, and those vendors have large marketing budgets with which to mislead customers about the economies.

The resources to be moved will have to be found by identifying things we are already doing, and decreasing or eliminating the resources devoted to them. I think one of the most valuable things this session could do is to think about ways to free up resources for these challenges. Here is my list:
  • Stop making the best be the enemy of the good.
  • Get real about the threats and the defenses.
  • Centralized access, distributed preservation.
  • Less metadata, more content.
Stop making the best be the enemy of the good. As I've been pointing out now for more than six years, every dollar spent gold-plating the preservation of some content is a dollar not spent preserving some other content. Let's compare two hypothetical approaches with equal budgets: one that has a zero probability of loss over the next decade, and another that is half as costly and has a 10% probability of loss over the next decade. At the end of the decade the cheaper approach has collected twice as much stuff and lost 10% of it, so it has preserved 80% more stuff for future readers. We need to stop kidding ourselves that there is no possibility ever of losing any of the content we have collected, and pay attention to the content we are guaranteed to lose because we have failed to collect it in the first place.
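A back-of-the-envelope check of that arithmetic (a sketch; the budget, costs and loss rate are the hypothetical numbers above, not measurements of any real system):

import sys

# Two hypothetical approaches sharing the same total budget.
budget = 1.0

cost_gold = 1.0      # "perfect" preservation: zero loss over the decade
cost_cheap = 0.5     # half the cost per item, 10% loss over the decade

collected_gold = budget / cost_gold       # 1.0 unit of content collected
collected_cheap = budget / cost_cheap     # 2.0 units of content collected

preserved_gold = collected_gold * 1.0     # 1.0 unit survives the decade
preserved_cheap = collected_cheap * 0.9   # 1.8 units survive the decade

# 0.8, i.e. the cheaper approach preserves 80% more content for future readers.
sys.stdout.write(f"{preserved_cheap / preserved_gold - 1.0:.1f}\n")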

Get real about the threats and the defenses. Compare the vast digital preservation resources, both R&D and operational, devoted to the threat of format obsolescence with the minimal attention devoted to threats such as operator error, external attack and insider abuse. Theoretical analysis predicted, and experimental evidence confirmed, that popular formats from the last couple of decades are at no practical risk of obsolescence, and thus that losses from this cause are negligible. And even if obsolescence eventually occurs, recent developments in emulation-as-a-service and in-browser emulation suggest that migration is unlikely to be the preferred way to deal with it. Experience elsewhere in the data storage industry, on the other hand, suggests that operator error, external attack and insider abuse are major causes of data loss.

Centralized access, distributed preservation. A preservation system that must be ready for readers to access its content randomly at an instant's notice is both more expensive and more vulnerable than a system like the CLOCKSS Archive, which is designed around a significant delay between access being requested and content being delivered. Amazon demonstrates this: Glacier is at least 6 times cheaper than S3. For most preserved content there is no point in having the preservation system replicate the access capabilities that the original publishing system provides. On the other hand, providing access is the only effective justification for funding preservation; CLOCKSS, as a dark archive, may be the exception that proves this rule.
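A rough illustration of that cost gap; the per-GB-month figures below are my recollection of approximate late-2013 list prices, which vary by region, tier and usage, so treat them as assumptions rather than quotes:

# Approximate late-2013 list prices, USD per GB-month (assumptions, not quotes).
S3_STANDARD = 0.095
GLACIER = 0.010

gb = 100 * 1000   # a hypothetical 100TB dark archive
print(f"S3:      ${gb * S3_STANDARD:10,.2f}/month")
print(f"Glacier: ${gb * GLACIER:10,.2f}/month")
print(f"Ratio:   {S3_STANDARD / GLACIER:.1f}x")   # ~9.5x, consistent with "at least 6 times"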

Less metadata, more content. Digital preservation cost studies generally agree that ingest is the biggest cost, perhaps about half. Much of that cost comes from metadata generation and validation. Two questions need to be asked:
  • When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
  • How important is the metadata to the task of preservation? Generating metadata because it's possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if error-prone format validation tools are used to reject allegedly non-conforming incoming content, the effect is counter-productive.
Or are completely new models required?

I want to seed the discussion with some questions:
  • As ingest is the major cost component, can we just not do it? In some cases, for example large science databases and perhaps news, "preservation in place" may be the only viable option because it is too expensive, too time-consuming or simply impractical to replicate the content in an archive.
  • If we are going to ingest stuff, can we make doing so much cheaper? For example, can we generate only fixity metadata on ingest, and use search for all other purposes (a minimal sketch appears after this list)? Metadata in a growing archive is a dynamic, not a static, concept, so generating and storing it may be a mistake.
  • Is lossy preservation OK? Perfect preservation is a myth, as I have been saying for at least 7 years using "A Petabyte for a Century" as a theme (the arithmetic is sketched after this list). Some stuff is going to get lost. Losing a small proportion of a large dataset is not really a problem, because the larger the dataset, the more its use involves sampling anyway.
  • What else can we decide not to do?
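To make the fixity-only idea above concrete, here is a minimal sketch; the file name and record layout are hypothetical, and the point is only that ingest records a checksum, a size and a timestamp, deferring all other metadata work to background tasks or to search:

import hashlib
import json
from datetime import datetime, timezone

def ingest_fixity(path, block_size=1 << 20):
    """Record only fixity metadata at ingest: size, SHA-256 and a timestamp."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
            size += len(block)
    return {
        "path": path,
        "size": size,
        "sha256": digest.hexdigest(),
        "ingested": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage: one small record per ingested file. Later audits simply
# recompute the hash and compare; no format identification or descriptive
# metadata is generated at ingest time.
print(json.dumps(ingest_fixity("example.pdf"), indent=2))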
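And a sketch of the arithmetic behind "A Petabyte for a Century", assuming the simple model in which each bit independently decays with half-life $T$; keeping $N$ bits intact for time $t$ with even odds then requires

\[
P(\text{no loss}) = \left(2^{-t/T}\right)^{N} = 2^{-Nt/T} = \tfrac12
\;\Longrightarrow\; T = Nt .
\]

For a petabyte ($N = 8\times10^{15}$ bits) over a century ($t = 100$ years), $T \approx 8\times10^{17}$ years, roughly $6\times10^{7}$ times the age of the universe. Half-lives of that order are far beyond anything we can engineer, or even measure, so some loss is inevitable; the only question is how much.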
There is another issue that we need to address. All the ways of making preservation cheaper can be viewed as "doing worse preservation". We live in a marketplace of competing preservation solutions. Making the tradeoff of preserving more stuff using worse preservation would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than the other pathetic systems" would win. Thus private interests work against the public interest in preserving more content.

Discussion

The discussion that took place during and after this talk, and subsequently at lunch, was lively. There was considerable reluctance to acknowledge a need for radical changes in what is currently being done, which, when combined with a general agreement that significant increases in funding are unlikely, struck me as having our heads in the sand. Among the issues I remember were:
  • The looming change in access patterns, which I have blogged about but forgot to include in this talk. Researchers are increasingly looking at the digital scholarly record as a large dataset to be processed, in whole or as a set of samples, rather than a collection of objects to be accessed individually. This has the potential to raise the proportion of the total cost devoted to access significantly and, in the absence of a suitable business model (such as Amazon's Free Public Datasets has for moderate-sized data), undo any savings in ingest.
  • The possibility of implementing preservation-in-place via a change to copyright deposit legislation using memoranda of understanding similar to those between NARA and federal agencies regarding datasets too large for NARA to ingest.
  • The similarity between the problem of the long tail of small digital publishers and the University Press market. That market led in effect to a mutual non-aggression pact between the Presses that turned out to be stable. People were skeptical that an analog with a broken business model was useful.
  • The similarity between the problem of the long tail of small digital publishers and the long tail of paper publishers, which led to divide-and-conquer strategies between libraries backed up by last-copy agreements.

2 comments:

Peter B said...

I've been mulling the question about 'whose responsibility?'. Is there a new set of answers for the digital? In days of print the author may have had an interest in longevity of what s/he wrote but hardly the responsibility. Similarly, the publisher complied with (legal) deposit but might not feel the responsibility. And now every library wants an e-connection without the burden of having to maintain an e-collection. Does the publisher now have the de facto archival responsibility - and does that need some de jure? But as ever, you shine a light in dark corners ...

David. said...

At CNI, Martin Klein reported initial results from Hiberlink, a new research project into reference rot of web-at-large resources (i.e. not references to other scholarly articles). This includes both link rot and content change.

In their PubMed sample, about 40% of the web-at-large references were archived, supporting the numbers I quoted above.

This research is extremely interesting.