Tuesday, March 21, 2017

The Amnesiac Civilization: Part 5

Part 2 and Part 3 of this series established that, for technical, legal and economic reasons, there is much Web content that cannot be ingested and preserved by Web archives. Part 4 established that much Web content that can currently be ingested and preserved by public Web archives will, in the near future, become inaccessible, because it will be subject to Digital Rights Management (DRM) technologies which will, at least in most countries, be illegal to defeat. Below the fold I look at some ways, albeit unsatisfactory ones, to address these problems.

There is a set of assumptions that underlies much of the discussion in Rick Whitt's "Through A Glass, Darkly": Technical, Policy, and Financial Actions to Avert the Coming Digital Dark Ages. For example, they are made explicit in this paragraph (page 195):
Kirchhoff has listed the key elements of a successful digital preservation program: an independent organization with a mission to carry out preservation; a sustainable economic model to support preservation activities over targeted timeframes; clear legal rights to preserve content; relationships with the content owners, and the content users; a preservation strategy and supporting technological infrastructure; and transparency about the key decisions.
The assumption that there is a singular "independent organization with a mission to carry out preservation", to which content is transferred so that it may be preserved, is also at the heart of the OAIS model. It underlies almost all discussions of digital preservation, so it is not surprising to see it here.

There are three essential aspects: the singular organization, its independence, and the transfer of content. They are related to, but not quite the same as, the three options Whitt sets out on page 209:
Digital preservation should be seen not as a commercial threat, but as a new marketplace opportunity, and even advantage. Some voluntary options include persuading content owners to (1) preserve the materials in their custody, (2) cede the rights to preserve to another entity; and/or (3) be willing to assume responsibility for preservation, through "escrow repositories" or "archives of last resort."
Let's look at each in turn.

Not Singular

If the preservation organization isn't singular, at least some of it will be independent and there will be transfer of content. The LOCKSS system was designed to eliminate the single point of failure created by a singular organization. The LOCKSS Program provided software that enabled the transfer of content to multiple independent libraries, each taking custody of the content it purchased. This has had some success in the fairly simple case of academic journals and related materials, but it is fair to say that there are few other examples of similarly decentralized preservation systems in production use (Brian Hill at Ars Technica points to an off-the-wall exception).
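To make the idea concrete, here is a minimal sketch of decentralized audit-and-repair among independent replicas. It is emphatically not the actual LOCKSS polling protocol, which is sampled, rate-limited and designed to resist adversaries; every name and threshold below is illustrative only:

```python
import hashlib
from collections import Counter

# Toy majority-vote audit in the spirit of decentralized preservation.
# NOT the real LOCKSS polling protocol; all names here are hypothetical.

def digest(content: bytes) -> str:
    """Fingerprint one replica's copy of some preserved content."""
    return hashlib.sha256(content).hexdigest()

def audit_and_repair(replicas: dict) -> dict:
    """Compare independent replicas; repair dissenters from the majority."""
    votes = Counter(digest(copy) for copy in replicas.values())
    consensus_digest, _ = votes.most_common(1)[0]
    # A replica that disagrees with the consensus fetches a good copy
    # from any peer that holds the consensus version.
    good_copy = next(c for c in replicas.values()
                     if digest(c) == consensus_digest)
    return {name: copy if digest(copy) == consensus_digest else good_copy
            for name, copy in replicas.items()}

# Five independent libraries each hold a copy; one has suffered damage.
replicas = {f"library-{i}": b"the content as published" for i in range(4)}
replicas["library-4"] = b"the content, damaged in storage"
repaired = audit_and_repair(replicas)
assert len({digest(c) for c in repaired.values()}) == 1  # all copies agree
```

The point of the design is that no single organization's failure, defunding or change of mission can destroy the content; any surviving majority of replicas can detect and repair damage.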

Not-singular solutions have several disadvantages to set against their lack of a single point of failure. They still need permission from the content owners, which, except for the special case of LOCKSS, tends to mean individual negotiation between each component and each publisher, raising costs significantly. And managing the components into a coherent whole can be like herding cats.

Not Independent

The CLOCKSS Archive is a real-world example of an "escrow repository". It ingests content from academic publishers and holds it in a dark archive. If the content ever becomes unavailable, it is triggered and made available under Creative Commons licenses. The content owners agree up-front to this contingency. It isn't really independent because, although in theory publishers and libraries share equally in the governance, in practice the publishers control and fund it. Experience suggests that content owners would not use escrow repositories that they don't in practice control.

"Escrow repositories" solve the IP and organizational problems, but still face the technical and cost problems. How would the "escrow repositories" actually ingest the flow of content from the content owners, and how would they make it accessible if it were ever to be triggered? How would these processes be funded? The CLOCKSS Archive is economically and technically feasible only because of the relatively limited scale of academic publishing. Doing the same for YouTube, for example, would be infeasible.

No Transfer

I was once in a meeting with major content owners and the Library of Congress at which it became clear to me that (a) hell would freeze over before these owners would hand a copy of their core digital assets to the Library, and (b) even after hell froze the Library would lack the ability or the resources to do anything useful with them. The Library's handling of the feed that Twitter donated is an example of (b). Whitt makes a related point on page 209:
In particular, some in the content community may perceive digital obsolescence not as a flaw to be fixed, but a feature to be embraced. After all, selling a single copy of content that theoretically could live on forever in a variety of futuristic incarnations does not appear quite as financially remunerative as leasing a copy of content that must be replaced, over and over, as technological innovation marches on.
The Web era has seen few successful pay-per-view models. Content that isn't advertiser-supported, ranging from academic journals to music to news and TV programs, is much more likely to be sold as an all-you-can-eat bundle. The more content available only as part of the bundle, the more valuable the bundle. Thus the obsession of content owners with maintaining control over the only accessible version of each item of content (see, for example, Sci-Hub), no matter how rarely it is accessed.

The scale of current Web publishing platforms, the size and growth rates of their content, and the enormous cash flow they generate all militate against the idea that their content, the asset that generates the cash flow, would be transferred to some third party for preservation. In this imperfect world the least bad solution may be some form of "preservation in place". As I wrote in The Half-Empty Archive discussing ways to reduce the cost of ingest, which is the largest cost component of preservation:
It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In these cases where we simply can't ingest it, preserving it in place may be the best we can do; creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.
This approach has many disadvantages. It has a single point of failure. In effect, preservation is at the whim of the content owner, because no one will have the standing, resources and motivation to sue if the owner fails to deliver on their commitment. And note the connection between these ideas and Whitt's discussion of bankruptcy in Section III.C.2:
Bankruptcy laws typically treat tangible assets of a firm or individual as private property. This would include, for example, the software code, hardware, and other elements of an online business. When an entity files for bankruptcy, those assets would be subject to claims by creditors. The same arguably would be true of the third party digital materials stored by a data repository or cloud services provider. Without an explicit agreement in place that says otherwise, the courts may treat the data as part of the estate, or corporate assets, and thus not eligible to be returned to the content "owner."
But to set against these disadvantages there are two major advantages:
  • As the earlier parts of this series show, there may be no technical or legal alternative for much important content.
  • Preservation in place allows for the survival of the entire publishing system, not just the content. Thus it mitigates the multiple version problem discussed in Part 3. Future readers can access the versions they are interested in by emulating the appropriate browser, device, person and location combinations, as the sketch after this list illustrates.
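Here is a hypothetical sketch of why preserving the publishing system matters for the multiple version problem; every name in it is invented for illustration. An archive of captures must store one copy per context it happened to crawl, while a system preserved in place can regenerate whichever combination a future reader emulates:

```python
from typing import NamedTuple

# Hypothetical sketch of the multiple version problem; all names here
# are invented. render() stands in for the live publishing system,
# whose output depends on who is asking, from where, and with what.

class Context(NamedTuple):
    browser: str
    device: str
    person: str     # e.g. logged-in identity or advertising profile
    location: str

def render(url: str, ctx: Context) -> str:
    """Stand-in for the preserved-in-place publishing system."""
    return (f"<html>{url} as served to {ctx.person} on {ctx.device} "
            f"in {ctx.location} via {ctx.browser}</html>")

# A future reader emulates whichever combination they are interested in;
# a capture-based archive would instead need one stored copy per context.
reader = Context("Firefox 52", "desktop", "anonymous", "Germany")
print(render("https://example.com/story", reader))
```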
I would argue that an urgent task should be to figure out the best approach we can to "preservation in place". A place to start might be the "preservation easement" approach taken by land trusts, such as the Peninsula Open Space Trust in Silicon Valley. A viable approach would preserve more content at lower cost than any other.
