Tuesday, February 17, 2015

Vint Cerf's talk at AAAS

Vint Cerf gave a talk entitled Digital Vellum at the AAAS meeting last Friday that has received a lot of attention in the media, including follow-up pieces by other writers, and has even drawn the attention of Dave Farber's famed IP list. I have some doubts about how accurately the press has reported his talk, which isn't available via the AAAS meeting website, so I am commenting on the reports, not the talk. But, as The Register points out, Cerf has been making similar points for some time. I did find a TEDx talk he titled Bit Rot on YouTube, uploaded a year ago. Below the fold is my take.

Cerf's talk was the first in a session devoted to Information-Centric Networks:
Vinton Cerf's talk discusses the desirable properties of a "Digital Vellum" — a system that is capable of preserving the meaning of the digital objects we create over periods of hundreds to thousands of years. This is not about preserving bits; it is about preserving meaning, much like the Rosetta Stone. Information Centric Networking may provide an essential element to implement a Digital Vellum. This long-term thinking will serve as a foundation and context for exploring ICNs in more detail.
ICN is a generalization of the Content-Centric Networks about which I blogged two years ago. I agree with Cerf that these concepts are probably very important for long-term digital preservation, but not about why they are important. ICNs make it easy for Lots Of Copies to Keep Stuff Safe, and thus make preserving bits easier, but I don't see that they affect the interpretation of the bits.
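To make the distinction concrete, here is a minimal sketch in Python of the content-addressing idea at the heart of ICN and CCN. It is illustrative only - the in-memory store and function names are hypothetical, not any real ICN implementation. Naming an object by the hash of its bits lets any node holding a copy prove it is serving the right bits, which is exactly what makes replication and fixity checking easy; it says nothing about how to render those bits.

```python
import hashlib

# Hypothetical in-memory stand-in for a network of ICN caches.
# Illustration only - not an implementation of any real ICN/CCN protocol.
store = {}

def publish(data: bytes) -> str:
    """Name an object by the hash of its bits and keep a copy."""
    name = hashlib.sha256(data).hexdigest()
    store[name] = data
    return name

def fetch(name: str) -> bytes:
    """Retrieve a copy by name and verify that the bits match the name."""
    data = store[name]
    if hashlib.sha256(data).hexdigest() != name:
        raise IOError("this copy is corrupt - try another replica")
    return data

# The bits can be copied anywhere and verified by anyone holding them...
name = publish(b"%PDF-1.4 example document bytes")
bits = fetch(name)
# ...but nothing above helps interpret those bits; that still needs a renderer.
```

Any cache can hold and serve the object, and any client can check it, which is why Lots Of Copies Keep Stuff Safe gets easier; interpreting the bits once retrieved is a separate problem.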

There's more to disagree with Cerf about. What he calls "bit rot" is not what those in the digital preservation field mean by the term. In his 1995 Scientific American article, Jeff Rothenberg analyzed the reasons digital information might not reach future readers:
  • Media Obsolescence - you might not be able to read bits from the storage medium, for example because a reader for that medium might no longer be available.
  • Bit Rot - you might be able to read bits from the medium, but they might be corrupt.
  • Format Obsolescence - you might be able to read the correct bits from the medium, but they might no longer be useful because software to render them into an intelligible form might no longer be available.
Media Obsolescence was a big problem, but as Patrick Gilmore pointed out on Farber's list it is much less of a problem now that most data is on-line and thus easily copied to replacement media.

Bit Rot (not in Cerf's sense) is an inescapable problem - no real-world storage system can be perfectly reliable. In the TEDx talk Cerf simply assumes it away.
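A hedged back-of-the-envelope calculation shows why that assumption is unsafe; the corruption rate below is an assumed, deliberately optimistic illustration, not a measurement of any real system.

```python
import math

# Illustrative numbers only - the per-bit corruption rate is an assumption
# chosen to be wildly optimistic, not a measured property of any storage system.
petabyte_bits = 8 * 10**15             # one petabyte of data
years = 100
p_flip = 1e-18                         # assumed annual silent-corruption rate per bit

expected_flips = petabyte_bits * years * p_flip
p_intact = math.exp(-expected_flips)   # Poisson approximation for rare, independent events

print(f"Expected corrupted bits over a century: {expected_flips:.1f}")
print(f"Probability the petabyte survives a century unscathed: {p_intact:.2f}")
```

Even with this implausibly generous error rate the chance of a perfect outcome is roughly a coin flip, which is why preservation systems must detect and repair damage, via replication and regular fixity audits, rather than assume it away.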

Format Obsolescence is what Cerf was discussing. There is no doubt that it is a real problem, and that in the days before the Web it was rampant. However, the advent of the Web forced a change. Pre-Web, most formats were the property of the single application that both wrote and read the data. In the Web world, the software that writes a format and the software that reads it are different and unrelated.

Google is famously data-driven, and there is data about the incidence of format obsolescence - for example, the Institut National de l'Audiovisuel surveyed its collection of audiovisual content from the early Web, which would be expected to have been very vulnerable to format obsolescence. They found an insignificant amount of it. I predicted this finding, on two theoretical grounds, three years before their study:
  • The Web is a publishing medium. The effect is that formats in the Web world are effectively network protocols - the writer has no control over the reader. Experience shows protocols are the hardest things to change in incompatible ways (cf. Postel's Law, "no flag day on the Internet", IPv6, etc.).
  • Almost all widely used formats have open source renderers, preserved in source code repositories. It is very difficult to construct a plausible scenario by which a format with an open source renderer could become uninterpretable.
Even The Guardian's Samuel Gibbs is skeptical of Cerf's worry about format obsolescence:
That is the danger of closed, proprietary formats and something consumers should be aware of. However, it is much less of an issue for most people because the majority of the content they collect as they move through life will be documented in widely supported, more open formats.
While format obsolescence is a problem, it is neither significant nor pressing for most digital resources.

However, there is a problem that is both significant and pressing that affects the majority of digital resources. By far the most important reason that digital information will fail to reach future readers is not technical, or even the very real legal issues that Cerf points to. It is economic. Every study of the proportion of content that is being preserved comes up with numbers of 50% or less. The institutions tasked with preserving our digital heritage, the Internet Archive and national libraries and archives, have nowhere close to the budget they would need to get that number even up to 90%.

Note that, increasingly, people's and society's digital heritage is in the custody of a small number of powerful companies, Google prominent among them. All the examples from the TEDx talk are of this kind. Experience shows that the major cause of lost data in this case is the company shutting the service down, as Google does routinely. Jason Scott's heroic Archive Team has tried to handle many such cases.

These days, responsibility for ensuring that the bits survive and can be interpreted rests primarily on Cerf's own company and its peers.

5 comments:

  1. Video of Vint's talk is available online at https://www.youtube.com/watch?v=K_DIwiSDaT8&feature=youtu.be

  2. Even open source may not save correct interpretability of bits after 1000 years. Please see the OLIVE project at CMU.

    vint

  3. Thank you, Vint. I am not claiming that any technology will ensure the interpretability, or even the survival, of bits for 1000 years. I would be interested in a plausible scenario by which a format with an open source renderer would become uninterpretable in a less speculative time-frame such as 100 years.

    I'm quite familiar with the Olive project and have pointed to it from this blog. I've blogged about emulation as a preservation strategy since 2009 and in particular about the encouraging recent progress in delivering emulation in a user-friendly way at CMU, Freiburg, the Internet Archive and elsewhere. But I'm very skeptical that these efforts will have a detectable impact on the survival of usable information for 1000 years.

    It's a sad fact that very little information survives from 1000 years ago. The same will be said 1000 years from now. In order to survive 1000 years, information has to survive the first 100. The vast majority of information generated today will not survive 100 years for reasons that have nothing to do with the interpretability of the bits.

    We have a choice. We can deploy the limited resources society makes available for preservation to efforts that might or might not have an impact 1000 years from now, or to efforts that have a high probability of increasing the resources available to readers in the next few decades. I made my choice more than a decade and a half ago.

  4. If you doubt the risk that, despite the efforts of this team, Vint's company poses to your data, click here.
