Cerf's talk was the first in a session devoted to Information-Centric Networks:
Vinton Cerf’s talk discusses the desirable properties of a "Digital Vellum" - a system that is capable of preserving the meaning of the digital objects we create over periods of hundreds to thousands of years. This is not about preserving bits, It is about preserving meaning, much like the Rosetta Stone. Information Centric Networking may provide an essential element to implement a Digital Vellum. This long-term thinking will serve as a foundation and context for exploring ICNs in more detail.
ICN is a generalization of the Content-Centric Networks about which I blogged two years ago. I agree with Cerf that these concepts are probably very important for long-term digital preservation, but I disagree about why. ICNs make it easy for Lots Of Copies to Keep Stuff Safe, and thus make preserving bits easier, but I don't see that they affect the interpretation of the bits.
There's more to disagree with Cerf about. What he calls "bit rot" is not what those in the digital preservation field mean by the term. In his 1995 Scientific American article, Jeff Rothenberg analyzed the reasons digital information might not reach future readers:
- Media Obsolescence - you might not be able to read bits from the storage medium, for example because a reader for that medium might no longer be available.
- Bit Rot - you might be able to read bits from the medium, but they might be corrupt.
- Format Obsolescence - you might be able to read the correct bits from the medium but they might no longer be useful because software to render them into an intelligible form might no longer be available.
Bit Rot (not in Cerf's sense) is an inescapable problem - no real-world storage system can be perfectly reliable. In the TEDx talk Cerf simply assumes it away.
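To make the point concrete, here is a minimal sketch of how keeping several copies lets Bit Rot in Rothenberg's sense be detected and repaired rather than assumed away. It is only an illustration under stated assumptions: the function name `audit_and_repair` is hypothetical, the copies are held in memory rather than on separate storage systems, and a simple majority vote over SHA-256 digests stands in for whatever audit protocol a real preservation system would use.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a copy as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Compare copies by digest and overwrite minority copies with the majority.

    Hypothetical helper for illustration only. Returns the repaired copies
    and the indices of the copies that disagreed with the majority.
    """
    digests = [digest(c) for c in copies]
    winner, votes = Counter(digests).most_common(1)[0]
    # Without a strict majority there is no basis for choosing a "good" copy.
    if votes <= len(copies) // 2:
        raise ValueError("no majority among copies; cannot repair")
    good = copies[digests.index(winner)]
    damaged = [i for i, d in enumerate(digests) if d != winner]
    repaired = [good if i in damaged else c for i, c in enumerate(copies)]
    return repaired, damaged


original = b"Digital Vellum"
# Simulate bit rot: flip the lowest bit of the first byte of one copy.
rotted = bytes([original[0] ^ 0x01]) + original[1:]
copies = [original, rotted, original]

repaired, damaged = audit_and_repair(copies)
print(damaged)  # [1]
```

With only two copies that disagree, the sketch raises an error instead of guessing, which is one reason more copies are better than fewer.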
Format Obsolescence is what Cerf was discussing. There is no doubt that it is a real problem, and that in the days before the Web it was rampant. However, the advent of the Web forced a change. Pre-Web, most formats were the property of the application that both wrote and read the data. In the Web world, the application that writes the data and the application that reads it are different and unrelated.
Google is famously data-driven, and there is data about the incidence of format obsolescence. For example, the Institut National de l'Audiovisuel surveyed their collection of audiovisual content from the early Web, which would be expected to have been very vulnerable to format obsolescence. They found an insignificant amount of it. I predicted this finding on two theoretical grounds three years before their study:
- The Web is a publishing medium. The effect is that formats in the Web world are effectively network protocols - the writer has no control over the reader. Experience shows protocols are the hardest things to change in incompatible ways (cf. Postel's Law, "no flag day on the Internet", IPv6, etc.).
- Almost all widely used formats have open source renderers, preserved in source code repositories. It is very difficult to construct a plausible scenario by which a format with an open source renderer could become uninterpretable.
That is the danger of closed, proprietary formats and something consumers should be aware of. However, it is much less of an issue for most people because the majority of the content they collect as they move through life will be documented in widely supported, more open formats.
While format obsolescence is a problem, it is neither significant nor pressing for most digital resources.
However, there is a problem that is both significant and pressing that affects the majority of digital resources. By far the most important reason that digital information will fail to reach future readers is not technical, or even the very real legal issues that Cerf points to. It is economic. Every study of the proportion of content that is being preserved comes up with numbers of 50% or less. The institutions tasked with preserving our digital heritage, the Internet Archive and national libraries and archives, have nowhere close to the budget they would need to get that number even up to 90%.
Note that increasingly people's and society's digital heritage is in the custody of a small number of powerful companies, Google prominent among them. All the examples from the TEDx talk are of this kind. Experience shows that the major cause of lost data in this case is the company shutting the service down, as Google does routinely. Jason Scott's heroic Archive Team has tried to handle many such cases.
These days, responsibility for ensuring that the bits survive and can be interpreted rests primarily on Cerf's own company and its peers.