Wednesday, January 25, 2017

Rick Whitt on Digital Preservation

Google's Rick Whitt has published "Through A Glass, Darkly" Technical, Policy, and Financial Actions to Avert the Coming Digital Dark Ages (PDF), a very valuable 114-page review of digital preservation aimed at legal and policy audiences. Below the fold, some encomia and some quibbles (but much less than 114 pages of them).

Whitt's abstract aptly summarizes the problem:
The digital preservation challenge is multidimensional, requiring us not just to develop and implement technical solutions — such as proposed migration and emulation techniques—but also to address the relevant public policy components. Legal frameworks, notably copyright and contract laws, pose significant hurdles to the usability and accessibility of preserved content. Moreover, the misalignment of financial incentives undermines the prospects of creating viable and economically sustainable solutions.
First the encomia and then the quibbles, for which I need to apologize. I had the opportunity to review a draft, but it came at a time when I was too busy to give the detailed reading it needed. So many of the quibbles are my fault for not taking the time and providing a timely response.

Below, I use the page numbers from the original. Subtract 115 to get the page number in the PDF.

Encomia

Part I (pages 124-151)

Whitt starts with a wide-ranging overview of the need for, and difficulty of, preserving digital information for future access, forming the necessary motivation for the rest of the work. This is a target-rich environment, and he hits many of the targets.

Part II (pages 151-166)

In this part Whitt provides a lay person's overview of the technical aspects of digital preservation. I'm impressed with how little I can find with which to quibble. Of course, that isn't to say that I believe each of the techniques and standards he lists is equally valuable, but I'm aware that my views are idiosyncratic.

Part III (pages 166-176)

Here, Whitt provides an extremely useful overview of the legal aspects of preservation. III.A is a brief note that preservation activities are covered by a mish-mash of private, public and criminal law (page 167):
Myriad laws and regulations can profoundly affect both the initial preservation, and subsequent re-use of, and access to, documents, data, metadata, and software. Often these laws and regulations are adopted in complete ignorance of their potential impact on digital preservation. The inadvertent impact is only heightened as the laws change, regulations are revised, and licenses expire. Among other drawbacks, this makes it more difficult to effect alignment between national responses to the legal issues arising from digital preservation.
III.B is a useful discussion of copyright, the eternal bugaboo of preservation. Whitt writes (page 169):
So, the crux of the problem becomes: How can one preserve something that one does not own? Under what circumstances does the preserving organization have the right or permission to ingest the protected content into the preservation system? And then, to provide user access on the other end of the process? Does one need a presumptive authorization to preserve something?
The LOCKSS Program, other efforts to preserve the academic literature including Portico and the Dutch KB, and some national library Web collections have operated under explicit authorization from the publisher. Other national libraries have authorization under national copyright legislation.

The Internet Archive and some other Web collections have operated under the DMCA's "safe harbor" protection or its equivalent elsewhere. But, as Whitt points out, private (contract) law can override these (page 175):
In the online context, website owners often will seek to employ so-called “click-through” or “click-wrap” licenses, essentially nonnegotiable contracts that require users to give their assent to specific stipulated uses of the digital material.
As Jefferson Bailey and I discussed in My Web Browser's Terms of Service, this area of the law is potentially troubling and very far from settled.

III.C is a valuable overview of other laws that are usually ignored in discussions of preservation, including patent and trademark, bankruptcy, privacy and data protection, and content liability. Bankruptcy is a particular problem for third-party preservation services, especially if they use "the cloud". Whitt writes (page 176):
When an entity files for bankruptcy, those assets would be subject to claims by creditors. The same arguably would be true of the third party digital materials stored by a data repository or cloud services provider. Without an explicit agreement in place that says otherwise, the courts may treat the data as part of the estate, or corporate assets, and thus not eligible to be returned to the content “owner.”
And what if the "owner" is a preservation service operating with authorization from the publisher? They are not the owner of the content, they are in effect agents of the owner. Do they get the content, or does the actual owner? And, since the purpose of the preservation service is probably to provide access after the failure of the owner, what if the actual owner is no longer in existence to claim their content?

Part IV (Pages 177-186)

This part covers the economic aspects of digital preservation, an area on which I have focused through the years. Whitt writes (page 177), and I agree, that the question is:
Who precisely is responsible, both morally and financially, for the long-term stewardship of digital content? After all, “economic sustainability—generating and allocating the resources necessary to support long-term preservation activities—is fundamental for the success of long-term digital preservation programs . . . . And yet, this fundamental point has not received the attention or the analysis it deserves.” Compared to the substantial literature on the technical and policy aspects of digital preservation, the economic aspects until recently have been “relatively neglected.”
Also "relatively neglected" has been the imperative of developing preservation technology with much lower costs. We are not preserving much of the content that deserves it. We are not going to get a large increase in funding. Thus we need to do more with less.

Part V (pages 186-202) and VI (pages 203-229)

I will treat these two forward-looking parts together. Part VI is a valuable brainstorming of possible things to do. Whitt starts part V thus (page 186):
Hopefully, the foregoing discussion leads to but one inevitable conclusion: the world must get better organized, and quickly, to preserve our digital present and future. ... The technology and laws and finances must come together, across time and space, to render our digital heritage. Not as a one-time silver bullet solution, however, but as a persistent, ever-evolving process.
[Image: Nemeth's OSI Stack]
One cannot but agree. He makes an analogy between the preservation problem and the relative success of Internet policymaking (page 189):
One way to simplify this approach, as demonstrated in the Internet policymaking context, is to separate out the three dimensions of Code (the target activity), Rules (the institutional tools), and Players (the organizational entities). Or, Code is what we should be doing, Rules is how we should be doing it, and Players is who should be doing it.
This provides some interesting insights. One I particularly appreciate is his use of my late, lamented friend Evi Nemeth's brilliant extension of the 7-layer OSI stack to include Financial and Political layers (page 190). This is definitely as appropriate in preservation as it is in communication.

[Image: Whitt's grid]
Whitt shows the application of a modified form of this stack to the stages of the preservation life-cycle as a grid (page 191), which he explains thus (page 192):
The illustrative digital life-cycles/system-layers mapping ... is an example of taking a Code approach to digital preservation. Together these related movements in time and space constitute the digital landscape for all preservation and access activities.
I agree with Whitt when he writes, citing Cliff Lynch (page 192):
Much as the focus should be on process rather than outcome, we should move from project-driven activities to a fundamental program of core activity worldwide. This approach will not be easy. Many digital preservation activities consist of short-term research projects, and/or institution-specific focus, and/or genre-specific focus. We also need to accept a long learning curve. As Clifford Lynch notes, “we need to acknowledge that we don’t really know how to do long-term digital preservation.” Perhaps instead “in a hundred years the community will really know about preserving over long periods of time.” And while “we have never preserved everything; we need to start preserving something.”
and when he returns to the theme introducing part VI (page 203):
To a certain extent, the focus on researching the various preservation techniques also may be misplaced. It may well be that only by actually using preservation strategies for a number of years, can we conclude which ones might work best in practice. Under this approach, documents should be stored now in the most promising formats, and then immediately begin testing preservation strategies systematically.
This reflects our attitude when, more than 18 years ago, we started the LOCKSS Program. Looking at the digital preservation landscape of the time, we saw many short-term research projects hoping that their results would form components of a future complete generic digital preservation system suitable for all kinds of digital content. Our view was that such a generic system was so difficult to implement that it would never succeed.

Instead, we aimed to build and operate in production a complete "soup-to-nuts" preservation system specialized to a single type of digital object, the academic journal. Our hope was that by doing so we would gain experience that could be applied to preservation systems for other types of object. We adopted Cliff's view - "we need to start preserving something."

Speaking only for myself, I would say that too much scarce resource still goes into research, which is relatively easy to fund via grants, and too little into production systems actually preserving content, which have to navigate the difficult path to a sustainable business model. And that, in the search for sustainability, systems designed for specific kinds of digital objects too often fall into the trap of trying to be all things to all kinds of object. Whitt seems to agree when he writes (page 203):
The wide variety of digital formats and applications makes it impossible to select a one-size-fits-all solution for preservation. Indeed, Hedstrom warns us that “the search for the Holy Grail of digital archiving is premature, unrealistic, and possibly counter-productive.”

Quibbles

Now for the quibbles, none of which should detract from the overall value of Whitt's comprehensive survey of the past, and interesting suggestions for the future.

Part I (pages 124-151)

Whitt exaggerates the problem and triggers a pet peeve of mine when he writes (page 130):
IBM estimates that “90 percent of the data in the world today has been created in the last two years alone.” IDC estimates that the amount of digital information in the world is doubling every eighteen months. And, the amount of digital content created every year is more than all of the cloud-based data storage capacity in the world. Some eighty to ninety percent of that stored material — and growing — is so-called “unstructured data,” which runs the gamut from ordinary emails and other text documents to more rich data types such as photographs, music, and movies.
This paragraph contradicts itself when it says "that stored material". IBM and IDC are talking about data created. Vastly more data is created than is ever stored. Vastly more data is stored than ever needs to be preserved. We do ourselves a disservice by our lazy reliance on these eye-popping but irrelevant numbers. Our problems are bad enough without exaggerating them by orders of magnitude. For details see my post Where Did All Those Bits Go?.

On the other hand, Whitt joins almost all writers on digital preservation by understating one of the fundamental problems of preserving the Web. He acknowledges that Web resources are evanescent (page 132):
Indeed, some fifty percent of Web resources archived by the British Library had links that were unrecognizable or gone after just one year.
citing Andy Jackson's work at the BL, and that they are mutable (page 132):
a Web document is not inherently fixed, but comprises many dynamic links that shift and change over time. A Web page is not actually a single page. What indeed is the one “authentic” version? Or are they all?
But, despite citing me (page 149):
Rosenthal asks what it means to preserve an artifact that changes every time it is examined.205
he doesn't really address the problem caused by personalization, geolocation, ad insertion, and other forms of dynamic content. The problem is that these days most Web resources are deliberately implemented to ensure that different users, browsers and IP addresses, and even the same ones at different times, see different content. There is no way to collect an authentic or even representative version of the resource. Suppose a Web crawler collects a Web site for preservation. In most cases no human could ever have seen the preserved version of the site. How can this be regarded as authentic, or representative?
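
The point is easy to demonstrate for yourself. Here is a minimal sketch in Python (the URL and headers are placeholders, not a real experiment) that fetches the same resource under two different request contexts and hashes what comes back; on most commercial, ad-supported sites the two digests will differ:

    # A minimal sketch of why no single crawl is "the" page: fetch the same URL
    # under two different request contexts and compare digests of the responses.
    # The URL and headers below are placeholders chosen for illustration.
    import hashlib
    import urllib.request

    URL = "https://example.org/"  # substitute a real, ad-supported page

    def fetch(url, headers):
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    desktop = fetch(URL, {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
                          "Accept-Language": "en-US"})
    mobile = fetch(URL, {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X)",
                         "Accept-Language": "de-DE"})

    print(hashlib.sha256(desktop).hexdigest())
    print(hashlib.sha256(mobile).hexdigest())
    # If the digests differ, which capture is the "authentic" one? Personalization,
    # geolocation and ad insertion mean there is no canonical answer.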

Just for the advertising view of this, see Maciej Cegłowski's wonderful What Happens Next Will Amaze You. And consider that, despite the facts that advertising underpins the economics of the Web, and that advertisements and tracking code form perhaps 35% of the bytes a user's browser downloads, in most cases the advertisements are not collected and thus not preserved. Future scholars' view of the Web will be quite unlike ours.

Incidentally, reference 205 is one of many in the PDF that I downloaded that read like this:
205. Rosenthal, supra note Error! Bookmark not defined., at 20.

Part II (pages 151-166)

The one significant lapse in this otherwise excellent overview of the technical aspects is on page 161:
Digital signatures combine all three techniques to authenticate both the document and the creator. A hash is created and encrypted, usually with a public key, and the creator’s identity certified through digital IDs issued by a third party. A digital signature currently is a common technique for ensuring the authenticity of documents, although its chief drawback is its inability to specify how a particular document may have been altered.
Actually, the hash is encrypted with the private part of a public-private key pair. It is true that, because it is based on a hash, a digital signature can reveal that a digital object has changed, but not how it has changed. But to call this its chief drawback is a stretch. As I wrote in 2011, there are a number of problems with the use of digital signatures over the long term:
First, it is important to distinguish between detecting corruption or tampering, i.e. tamper-evident storage, and recovering from tampering, i.e. tamper-proof storage. Digital signatures are intended to provide tamper-evident storage. They do so by allowing a later reader to verify the signature. To do so, the readers need the public half of the key used to sign the document. Without it, or with a corrupted public key, the signature is useless. Thus, strictly speaking, digital signatures do not solve the problem of tamper-evidence, they reduce it to the harder problem of tamper-proof storage, applied to a smaller set of bits (the public key). And even if they did solve it, maintaining integrity requires not tamper-evident but tamper-proof storage for documents.

Even tamper-proof storage of the public key alone is not enough. The signature's testimony as to the integrity of the document depends not just on the availability of the public key but also on the secrecy of the private key. As we see from recent compromises at RSA, Comodo, DigiNotar and others, maintaining the secrecy of private keys is hard. In fact, over the long term it is effectively impossible. So keys have a limited life. They are created and eventually revoked. In order to verify a signature over the long term, a reader needs access to a tamper-proof database of keys and the date ranges over which they were valid. Implementing such a database is an extraordinarily hard problem; for details see Petros Maniatis' Ph.D. thesis (PS).

Even if the practical problems of implementing a tamper-proof database could be overcome, the reader would know only the span of time over which the creator of the key believed that the secret had not leaked. Secrets don't ring a bell when they leak, so the creator might be unduly optimistic. And, of course, truly tamper-proof databases are a utopian concept, the best we can do in the real world is to make them tamper-resistant.
The long timescale of digital preservation poses another problem for digital signatures; they fade over time. Like other integrity check mechanisms, the signature attests not to the digital object, but to the hash of the digital object. The goal of hash algorithm design is to make it extremely difficult with foreseeable technology to create a different digital object with the same hash, a collision. They cannot be designed to make this impossible, merely very difficult. So, as technology advances with time, it becomes easier and easier for an attacker to substitute a different object without invalidating the signature. Because over time hash algorithms become vulnerable and obsolete, preservation systems depending for integrity on preserving digital signatures, or even just hashes, must routinely re-sign, or re-hash, with a more up-to-date algorithm.
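
To make the mechanics concrete, here is a minimal sketch of the sign/verify cycle using Python's cryptography package (an illustration only, not anything Whitt or any particular preservation system uses). The signer needs the private key; a future reader needs only the public key, which is why the long-term problem reduces to keeping an intact public key and knowing when the private key was still secret:

    # Minimal sign/verify sketch using the "cryptography" package, for illustration.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    document = b"content to be preserved"
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    # Signing: the document is hashed and the hash is signed with the PRIVATE key.
    signature = private_key.sign(document, pss, hashes.SHA256())

    # Verification: a later reader needs an intact copy of the PUBLIC key.
    try:
        public_key.verify(signature, document, pss, hashes.SHA256())
        print("signature verifies")
    except InvalidSignature:
        print("document or signature has been altered")
    # Note what verification cannot tell the reader: whether the private key had
    # leaked before signing, or whether SHA-256 was still collision-resistant when
    # the verification was performed.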

Moreover, many events since 2011, for example last week's case of Symantec issuing bogus certificates, have shown that the idea of trusting "digital IDs issued by a third party" is also problematic.

Part IV (Pages 177-186)

Whitt writes (page 181):
One bit of good news is that the costs of keeping digital data continues to fall over time, a very special and convenient property. Storage costs in particular are “dropping by 50% every 18 months.”
Would that it were still so! 50% in 18 months is a Kryder rate of about 35%/yr, which hasn't been the case since 2010. As my modeling of the economics of long-term preservation has shown, the lifetime cost of storing data increases rapidly as the Kryder rate falls below about 20%, where it has been for the last 7 years.
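
A toy version of the endowment calculation shows why (deliberately simplified, with made-up parameters, not the actual economic model): the lump sum needed now to pay for storage over the long term grows rapidly as the annual rate of price decline falls:

    # Toy endowment calculation, NOT the full economic model: the lump sum needed
    # now to pay for storing one unit of data for `years` years, when storage
    # prices fall by `kryder` per year and invested money earns `discount` per
    # year. All parameter values are illustrative assumptions.
    def endowment(kryder, discount=0.04, initial_cost=100.0, years=100):
        total = 0.0
        for t in range(years):
            yearly_cost = initial_cost * (1.0 - kryder) ** t
            total += yearly_cost / (1.0 + discount) ** t
        return total

    for k in (0.35, 0.20, 0.10, 0.0):
        print(f"Kryder rate {k:>4.0%}: endowment = {endowment(k):8.1f}")
    # Dropping the Kryder rate from 35%/yr to 10%/yr roughly triples the required
    # endowment in this toy model; at 0%/yr it is nearly ten times larger.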

Whitt is far from alone in this outdated view of storage costs, and also in ignoring the importance of the time value of money (aka Discounted Cash Flow or Net Present Value) in the economics of digital preservation. Whitt is correct when he cites me thus (page 182):
David Rosenthal has examined the costs of digital preservation in the context of the information lifecycle, and concluded that about one-half are due to ingest, one-third for preservation and one-sixth for access. Resource allocations tend to be driven towards “low hanging fruit — content from larger publishers at low risk of loss (due to relative ease of discovery, migration, and collection) — and away from content at higher risk of loss. Perversely, as Rosenthal finds, the more difficult it is to find, collect, and migrate content, the less likely it will be funded.
But he doesn't note the implication of this heavily front-loaded cost structure. The ingest half of life-cycle costs must be paid immediately, as a lump sum. The other costs are incurred gradually over time. Contrast this with the attractive economics of "the cloud". Unlike on-premises technology, with its up-front equipment investments, costs in "the cloud" are paid gradually over time, with no large initial outlay. Rampant short-termism makes justifying large up-front payments extremely difficult.
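
The arithmetic behind that short-termism is easy to show with another toy calculation (made-up figures, purely illustrative): the same nominal spend is worth less in present-value terms when it is spread over time than when it must all be paid up front, as ingest costs must be:

    # Toy present-value comparison: a lump-sum ingest payment versus the same
    # nominal amount spread evenly over twenty years. Figures are made up.
    def present_value(payments, discount=0.04):
        return sum(p / (1.0 + discount) ** t for t, p in enumerate(payments))

    nominal = 1000.0                    # total nominal spend, arbitrary units
    up_front = [nominal] + [0.0] * 19   # ingest-style: all paid in year 0
    spread = [nominal / 20.0] * 20      # cloud-style: paid evenly over 20 years

    print(present_value(up_front))      # 1000.0
    print(present_value(spread))        # about 707
    # At a 4% discount rate the spread payments cost roughly 30% less in
    # present-value terms, even though the nominal totals are identical.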

I would argue that Whitt is similarly behind the times when he follows Brian Lavoie by writing (page 183):
while emulation best preserves the original form and functionality, supporting that most ambitious preservation objective consequently makes it the most expensive option. Emulation will require developing and maintaining a library of emulators, preserving the software and hardware environments along with the digital object, and maintaining a constant supply of new emulators as new environments emerge.
In my report on emulation (and subsequent blog posts) I point out that, while the idea that digital preservation via emulation required investment in "maintaining a constant supply of new emulators" was once true, this has not been the case for some considerable time, for a number of reasons:
  • Increasing returns to scale have meant that information technology has converged on a small number of hardware architectures (Intel, ARM and some also-rans), and that these architectures face extremely strong backwards compatibility constraints. Thus a small number of relatively stable emulators cover the vast majority of cases.
  • Both hardware and software development require emulators, which are thus developed for reasons having nothing to do with preservation. The development cost of keeping the emulators current doesn't come out of the preservation budget.
  • Mainstream IT now depends heavily on virtual machines. Virtualization is, in effect, a partial emulation, the technologies are closely linked and, given the small number of popular hardware architectures, largely functionally equivalent.
  • Programming technology has migrated to interpreted languages, such as Java and JavaScript, whose virtual machines hide the underlying hardware architecture.
It seems likely that, at least for newly created content, emulation will be the least expensive option.

Whitt is correct that digital preservation initiatives reliant on grant funding are at high risk (page 184):
Many digital preservation initiatives are funded by “soft money,” such as grants, one-time donations, or other one-time expenditures. ... And so far, nonprofit organizations have not demonstrated an ability to earn enough money to operate independently.
This is too pessimistic. The LOCKSS Program provides one counter-example. For the last decade the program has covered all operations and routine development costs, in a market with extremely high staff costs, without grant funding. And it has accumulated adequate reserves. In 2012 we did receive a small grant to accelerate some "future-proofing" software development, but we would argue that investing to reduce future costs in this way is an appropriate use of one-time funds.

Although there is some reference to the possibility of commercial preservation services (e.g. page 185), Whitt doesn't acknowledge that some such services, for example Arkivum, have already achieved significant market share and appear as sustainable as companies in other IT sectors.
