Thursday, May 26, 2016

Abby Smith Rumsey's "When We Are No More"

Back in March I attended the launch of Abby Smith Rumsey's book When We Are No More. I finally found time to read it from cover to cover, and can recommend it. Below the fold are some notes.

There are four main areas where I have comments on Rumsey's text. On page 144, in the midst of a paragraph about the risks to our personal digital information she writes:
The documents on our hard disks will be indecipherable in a decade.
The word "indecipherable" implies not data loss but format obsolescence. As I've written many times, Jeff Rothenberg was correct to identify format obsolescence as a major problem for documents published before the advent of the Web in the mid-90s. But the Web caused documents to evolve from being the private property of a particular application to being published. On the Web, published documents don't know what application will render them, and are thus largely immune to format obsolescence.

It is true that we're currently facing a future in which most current browsers will not render preserved Flash, not because they don't know how to but because it isn't safe to do so. But emulation shows that the technological fix for this problem is already in place. Format obsolescence, were it to occur, would be hard for individuals to mitigate. Especially since it isn't likely to happen, it isn't helpful to lump it in with threats they can do something about by, for example, keeping local copies of their cloud data.

On page 148 Rumsey discusses the problem of the scale of the preservation effort needed and the resulting cost:
We need to keep as much as we can as cheaply as possible. ... we will have to invent ways to essentially freeze-dry data, to store data at some inexpensive low level of curation, and at some unknown time in the future be able to restore it. ... Until such a long-term strategy is worked out, preservation experts focus on keeping digital files readable by migrating data to new hardware and software systems periodically. Even though this looks like a short-term strategy, it has been working well ... for three decades and more.
Yes, it has been working well and will continue to do so provided the low level of curation manages to find enough money to keep the bits safe. Emulation will ensure that if the bits survive we will be able to render them, and it does not impose significant curation costs along the way.

The aggressive (and therefore necessarily lossy) compression Rumsey envisages would reduce storage costs, and I've been warning for some time that Storage Will Be Much Less Free Than It Used To Be. But it is important not to lose sight of the fact that ingest, not storage, is the major cost in digital preservation. We can't keep it all; deciding what to keep and putting it some place safe is the most expensive part of the process.

On page 163 Rumsey switches to ignoring the cost and assuming that, magically, storage supply will expand to meet the demand:
Our appetite for more and more data is like a child's appetite for chocolate milk: ... So rather than less, we are certain to collect more. The more we create, paradoxically, the less we can afford to lose.
Alas, we can't store everything we create now, and the situation isn't going to get better.

On page 166 Rumsey writes:
Other than the fact that preservation yields long-term rewards, and most technology funding goes to creating applications that yield short-term rewards, it is hard to see why there is so little investment, either public or private, in preserving data. The culprit is our myopic focus on short-term rewards, abetted by financial incentives that reward short-term thinking. Financial incentives are matters of public policy, and can be changed to encourage more investment in digital infrastructure.
I completely agree that the culprit is short-term thinking, but the idea that "incentives ... can be changed" is highly optimistic. The work of, among others, Andrew Haldane at the Bank of England shows that short-termism is a fundamental problem in our global society. Inadequate investment in infrastructure, both physical and digital, is just a symptom, and is far less of a problem than society's inability to curb carbon emissions.

Finally, some nits to pick. On page 7 Rumsey writes of the Square Kilometer Array:
up to one exabyte (10^18 bytes) of data per day
I've already had to debunk another "exabyte a day" claim. It may be true that the SKA generates an exabyte a day, but it could not store that much data. An exabyte a day is most of the world's production of storage. Like the Large Hadron Collider, which throws away all but one byte in a million before it is stored, the SKA actually stores only(!) a petabyte a day (according to Ian Emsley, who is responsible for planning its storage). A book about preserving information for the long term should be careful to maintain the distinction between the amount of data generated and the amount stored. Only the stored data is relevant.
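
The gap between those two rates is easy to check with back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes decimal (SI) units and a 365-day year, the function name is my own, and the only inputs are the exabyte-a-day and petabyte-a-day figures quoted above.

```python
# Back-of-the-envelope comparison of the "generated" and "stored" rates.
# Assumptions: SI units (1 PB = 10**15 bytes, 1 EB = 10**18 bytes)
# and a 365-day year. The rates are the ones quoted in the text.

PETABYTE = 10**15
EXABYTE = 10**18


def bytes_per_year(bytes_per_day: int, days: int = 365) -> int:
    """Total bytes accumulated in a year at a constant daily rate."""
    return bytes_per_day * days


generated = bytes_per_year(EXABYTE)  # the "exabyte a day" claim
stored = bytes_per_year(PETABYTE)    # the petabyte a day actually stored

print(f"generated: {generated / EXABYTE:.0f} EB/year")
print(f"stored:    {stored / EXABYTE:.3f} EB/year")
print(f"ratio:     {generated // stored}x")
```

At a petabyte a day the archive grows by about a third of an exabyte a year, a thousandth of what keeping the full exabyte a day would require.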

On page 46 Rumsey writes:
our recording medium of choice, the silicon chip, is vulnerable to decay, accidental deletion and overwriting
Our recording medium of choice is not, and in the foreseeable future will not be, the silicon chip. It will be the hard disk, which is of course equally vulnerable, as any read-write digital medium would be. Write-once media would be somewhat less vulnerable, and they definitely have a role to play, but they don't change the argument.

David. said...

Via Mike Masnick at Techdirt, the Hoover Institution's Russ Roberts has a podcast interview with Abby Smith Rumsey.