This post provides the text of the slides, sources and commentary for the opening plenary that I just gave at the CNI Spring Task Force meeting. The actual slides are available here (PDF). Follow me below the fold for the full details.
Kirk McKusick's IEEE Award
Kirk McKusick was awarded the 2009 IEEE Reynold B. Johnson Information Storage Systems Award at Usenix's 2009 FAST conference.
Shifting Sands
"... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats."
As Jeff wrote this, Kirk's file system was 16 years old, with no incompatible changes to the API or on-disk format.
The quotation is from Jeff Rothenberg's original article "Ensuring the Longevity of Digital Documents", Scientific American Vol. 272, No. 1, 1995. A 1999 update is here, but the update doesn't change the argument of the talk.
The Meme
Incompatibility is not inevitable; it is a choice someone made. If they are rational, they assessed the costs and the benefits. Incompatible changes to widely used software impose costs on each user; if there are many users, aggregating these costs overwhelms any possible benefit. This is especially true when the benefits, even if large, accrue only to a few users.
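The aggregation argument can be made concrete with a little arithmetic. What follows is only a minimal sketch: the function name and every figure in it are hypothetical, chosen to show the shape of the argument rather than to describe any real piece of software.

```python
def net_benefit(users, cost_per_user, beneficiaries, benefit_per_beneficiary):
    """Aggregate benefit minus aggregate cost of an incompatible change."""
    total_cost = users * cost_per_user                      # every user pays
    total_benefit = beneficiaries * benefit_per_beneficiary  # only a few gain
    return total_benefit - total_cost

# Hypothetical figures: a small cost spread over many users, set against
# a large benefit that accrues to only a few of them.
example = net_benefit(
    users=1_000_000, cost_per_user=5,
    beneficiaries=100, benefit_per_beneficiary=10_000,
)
print(example)
```

With these made-up numbers the result is negative: a small per-user cost multiplied across a large user base outweighs a large benefit confined to a few.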
Talk in 3 Parts
Ancient History
"History is not what you thought. It is what you can remember. All other history defeats itself."
1066 And All That is a classic of English humor. Anyone baffled by it should consult the po-faced Wikipedia entry.
Jeff Rothenberg's Scenario
The first two threats are easy to explain and defend against by regularly migrating the bits from older to newer media. The third threat was harder to explain and defend against, so it dominates the article.
Jeff on Format Obsolescence
Note how, because the hardware specifications are themselves digital documents to be preserved, Jeff has deftly reduced the emulation strategy to a previously unsolved problem.
Jeff's Dystopian Vision
Two Words: Desktop Publishing
It is evident reading Jeff's article that the way the document describing his hidden fortune got on to the CD was via a desktop publishing system. If you think back to 1995, desktop publishing was all the rage.
IT in 1995
Modern History
"A preoccupation with the future not only prevents us from seeing the present as it is but often prompts us to rearrange the past."
Impacts of Jeff's Vision
It is somewhat odd that, despite Jeff's preference for emulation, many more of the existing systems use format migration.
The Web
The graph is from Netcraft. It shows that Netcraft didn't even start tracking the Web until after Jeff's article had been published, and that the real explosive growth of the Web didn't start until after Jeff's update appeared in 1999.
Off-line or On-Line
To be sure, some material worth keeping is not on-line, at least not in the sense of being accessible via the Web. For example, the Stanford Digital Repository contains material that has been deposited on condition that it not be made accessible. Some of this represents preservation masters for content that is on-line in a presentation format. In other cases, it is content that ideally would be on-line if only that were permitted, for example content embargoed for a period, or material that would be on-line if only the resources to put it on-line were available.
Microsoft vs. its Users
Two books about Microsoft's anti-trust struggle with the US Justice Dept. are Ken Auletta's World War 3.0 and John Heilemann's Pride Before The Fall.
Note that format obsolescence happens when support for a format is removed, not when support for a successor format is added. Microsoft's business model depended on adding support for new formats not on removing support for old formats; making the new version of Office incapable of reading documents produced by its predecessor would have been self-defeating.
Evidence that Microsoft can no longer remove support for old formats, as opposed to adding support for new formats, is in this post from last year.
Documents or Content
Virtual Machines
Open Source
For a discussion of the importance of open source for preservation, see this post.
This argument may not apply to console games and other forms of content protected by Digital Rights Management (DRM). In practice, though, most forms of DRM have been cracked (for a particularly revealing description of the necessary reverse-engineering process, see Bunnie Huang's fascinating book Hacking the Xbox). Thus, although in most cases it is technically possible to preserve access to DRM-protected content, the legality of doing so is often challenged. Presumably, the challenges wouldn't be mounted if the open source renderers didn't render the content. There is more on DRM in this post.
20/20 Hindsight
The Big Picture
W. Brian Arthur's book Increasing Returns and Path Dependence in the Economy is an important description of the behavior of technology markets. It explains how, as illustrated in the graph that I created, they are initially fragmented, with multiple products competing with comparable shares of a small market. At some point, for random reasons, one product gains a large enough market share for the increasing returns to scale (or network effects) to take over. Once they do, that product rapidly gains share in a rapidly expanding market. The others initially benefit from the growing market even as they lose market share, but rapidly start losing their existing customers to the winner.
At this point, as shown by the arrow on the graph, it is in the interest of the winning product to make switching from its competitors' products as easy as possible.
This analysis works very well for markets with large numbers of relatively unsophisticated customers. Markets with a small number of sophisticated customers have figured out strategies for fighting back. For example, in the airliner business the airlines have understood that it is in their long-term interest to buy from both Boeing and Airbus; allowing either to fail would impose unacceptable monopoly costs. Similar behavior can be seen in the market for CPU chips (Intel vs. AMD) and graphics chips (NVIDIA vs. ATI).
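The increasing-returns dynamic described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Arthur's own model and not the source of my graph: each new customer picks a product with probability proportional to its current customer count raised to a power `alpha`, and the values of `alpha`, the number of products, and the number of adopters are all hypothetical.

```python
import random

def simulate_adoption(n_products=3, n_adopters=10_000, alpha=2.0, seed=0):
    """Toy increasing-returns market; returns final market shares.

    Assumption: a product's attractiveness grows faster than linearly
    with its installed base (alpha > 1), standing in for network effects.
    """
    rng = random.Random(seed)
    customers = [1] * n_products          # fragmented start: equal shares
    for _ in range(n_adopters):
        weights = [c ** alpha for c in customers]
        winner = rng.choices(range(n_products), weights=weights)[0]
        customers[winner] += 1            # early random leads compound
    total = sum(customers)
    return [c / total for c in customers]

shares = simulate_adoption()
print(sorted(shares, reverse=True))
```

Which product wins depends on the random seed, which is the point: the early lead arises for random reasons, and once it exists the feedback loop makes it very hard to reverse.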
Yes We Can!
The Future
"Prediction is very difficult, especially about the future."
The Real Problems Were ...
Scale
Storage cost issues are addressed in the series of posts on A Petabyte For A Century and the resulting iPRES paper (190K PDF).
Metcalfe's Law
The two extremes are archive.org and Portico. I should stress that both systems are well engineered to meet their different goals using their chosen techniques. I am not criticizing them, I'm simply using them as bounds on the costs of operating at scale.
Scale Implies Cost
My cost numbers for archive.org come from a recent article in The Economist's Technology Quarterly, and for Portico from a guesstimate based on their tax returns.
How Many $ Do We Need?
Much less $100B/yr. The point is that, even if we could do adequate quality preservation with archive.org's cost structure, we'd still be much too expensive to address society's need for preservation. With the cost structures more normally associated with preservation at scale, we're much, much further away from addressing it.
Intellectual Property
The real problem is that the need to talk to the copyright owner's lawyers, which applies even if the content is open access, motivates preserving content for which a single conversation with a lawyer obtains permission for a great deal of content. So, for example, even if it takes a lot of lawyer time to talk to Elsevier, the cost per unit of content preserved is small. Whereas even if the cost to talk to a small open access publisher is small, the cost per unit of content will be prohibitive. Once again, the economic forces push towards preservation of the content that is not at risk of loss.
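The per-unit arithmetic behind this can be sketched as follows. The function name and all the figures are hypothetical, chosen only to show how a fixed negotiation cost is spread over very different amounts of content; they are not estimates for Elsevier or any actual publisher.

```python
def legal_cost_per_item(negotiation_cost, items_covered):
    """Fixed cost of obtaining permission, spread over the items it covers."""
    return negotiation_cost / items_covered

# Hypothetical large publisher: an expensive negotiation, a huge corpus.
large_publisher = legal_cost_per_item(negotiation_cost=100_000,
                                      items_covered=10_000_000)

# Hypothetical small open access publisher: a cheap negotiation, a tiny corpus.
small_publisher = legal_cost_per_item(negotiation_cost=1_000,
                                      items_covered=500)

print(large_publisher, small_publisher)
```

Even though the small publisher's negotiation is a hundred times cheaper in this made-up example, its cost per item is far higher, which is why the economics favor the large publisher's content.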
Looking Forwards
Non-Problems
There are extended discussions of the usefulness of format metadata in this post, and of the relative value of open source renderers as against format specifications in this post. For a discussion of the questionable value of format metadata for preservation see this post.
Services not Documents
For a discussion of the importance of context in preserving the Web see this post.
Things Worth Preserving
Do you remember Myst from 1993? It was a beautiful virtual world that you explored. Pretty soon you figured out that you were the only person there. Some time after that you figured out that the goal of the game was to figure out why you were the only person there. We've come a long way since then; Myst would not make it against World of Warcraft or Second Life.
For a discussion of the problem of preserving the materials future scholars will need to study elections, see this post.
Economics
Bytes are a lot more vulnerable to disruptions in the money supply than paper. They are like divers in old-fashioned diving suits, dependent on air continuously pumped down from the surface. We need to make preserved bytes more like SCUBA divers, carrying their own tank of air with them that only needs to be refilled at intervals. Endowing data is discussed in this post.
Digital Preservation Difficult
Alyssa Henry's FAST keynote, in which she offered numbers for availability but pointedly not for reliability, is discussed in this post.
Digital Preservation Important
FDsys is discussed in this post.
Practical Next Steps
Additional Material
Here is some additional material I prepared but which I cut to get down to the time allowed.
Did Documents Get Lost?
I was expecting a question asserting that I was wrong to suggest that formats in wide use in 1995 had not gone obsolete.
The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986).
I am sure that there are many formats that were in use in 1995 that are now difficult to render because current tools lack support for them. I have argued for a long time that there are few, if any, formats in wide use in 1995 that are difficult to render with current tools. I'm still looking for counter-examples.
But even if there were counter-examples, it wouldn't invalidate my case. It is easy to emulate 1995 PCs, and quite possible to emulate most other architectures current in 1995 using virtual machine technology. See, for example, this BBC story about a collaboration between Microsoft, the British Library and the British National Archives to access old formats by running virtual instances of old Microsoft operating systems and the relevant applications.
The only question is, did someone keep the bits for the operating system and the application as well as the document?
As regards media, the media in wide use in 1995 that are less common today are 3.5" floppies (still on the shelves at Fry's), ZIP drives (as I write this there are 306 of the original ZIP drives on eBay), and DAT tape (40 drives on eBay).
Friday, April 10, 2009
Spring CNI Plenary: The Remix
Posted by David. at 11:39 AM
Labels: CNI2009spring, digital preservation, format migration, format obsolescence