This post provides the text of the slides, sources and commentary for the opening plenary that I just gave at the CNI Spring Task Force meeting. The actual slides are available here (PDF). Follow me below the fold for the full details.
Kirk McKusick's IEEE Award
- 30 years of the Unix file system
- Disks 1,000,000x bigger
- Code 4x bigger, much faster, more reliable
- Reads every disk it ever wrote
- No incompatible change to on-disk format
- No incompatible change to API
- For widely used software
- Costs of incompatibility outweigh benefits
- Strict compatibility makes Kirk's life easier
Kirk McKusick was awarded the 2009 IEEE Reynold B. Johnson Information Storage Systems Award at Usenix's 2009 FAST conference.
Shifting Sands
"... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats."- Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995
As Jeff wrote this, Kirk's file system was 16 years old, with no incompatible changes to the API or on-disk format.
The quotation is from Jeff Rothenberg's original article "Ensuring the Longevity of Digital Documents", Scientific American, Vol. 272, No. 1, 1995. A 1999 update is here, but the update doesn't change the argument of the talk.
The Meme
- Incompatibility is inevitable, a force of nature
- Why did Jeff think this in 1995?
- Is it true in 2009?
- If this meme isn't true
- What causes incompatibility?
- Are these causes operating now?
Incompatibility is not inevitable, it is a choice someone made. If they are rational, they assessed the costs and the benefits. Incompatible changes to widely used software impose costs on each user; if there are many users, aggregating these costs overwhelms any possible benefit. This is especially true when the benefits, even if large, accrue only to a few users.
Talk in 3 Parts
- Ancient History: before 1995
- Jeff Rothenberg's 50-year look forward from 1995
- What he predicted & why
- Modern History: from 1995 to 2009
- Impacts of Jeff's article
- What else happened
- How Jeff rates as a prophet & why
- The Future: following Jeff's example
- Looking forward to identify the real problems
Ancient History
"History is not what you thought. It is what you can remember. All other history defeats itself."- From the Compulsory Preface to 1066 And All That, W. C. Sellar & R. J. Yeatman
1066 And All That is a classic of English humor. Anyone baffled by it should consult the po-faced Wikipedia entry.
Jeff Rothenberg's Scenario
- In 2045, descendants find a CD
- Try to recover document from it leading to Jeff's fortune
- Threat: Media degradation
- Bits on the CD suffer "bit rot"
- Threat: Media obsolescence
- No hardware capable of reading the bits available
- Threat: Format obsolescence
- No software capable of rendering the bits available
The first two threats are easy to explain and defend against by regularly migrating the bits from older to newer media. The third threat was harder to explain and defend against, so it dominates the article.
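As a minimal sketch of what "migrating the bits" involves, the Python below copies a file to new media and checks that the copy is bit-for-bit identical. The function names (`sha256_of`, `migrate`) and the choice of SHA-256 are assumptions made for illustration, not something Jeff's article or the talk prescribes.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src: Path, dst: Path) -> str:
    """Copy src to dst (the newer medium) and verify the copy matches.

    Returns the digest of the verified copy, or raises IOError on mismatch.
    """
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        raise IOError(f"copy of {src} does not match the original")
    return after
```

Recording the returned digest alongside the copy lets the next migration detect "bit rot" that happened in between.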
Jeff on Format Obsolescence
- Defenses
- Format Migration
- Emulation
- Format migration disapproved
- "Finally, [format migration] suffers from a fatal flaw. ... Shifts of this kind make it difficult or impossible to translate old documents into new standard forms."
- Emulation approved subject to caveat
- "specifications for the outdated hardware ... must be saved in a digital form independent of ... software"
Note how, because the hardware specifications are themselves digital documents to be preserved, Jeff has deftly reduced the emulation strategy to a previously unsolved problem.
Jeff's Dystopian Vision
- Documents survive in off-line media
- The media have a short lifetime
- The media readers have a short lifetime
- Documents are in app-specific formats
- Typical formats are proprietary
- Attempts to standardize formats will fail
- Hardware & O/S will change rapidly
- In ways that break applications
- Apps for rendering formats have a short life
Two Words: Desktop Publishing
- The publishing medium was paper
- Design goal of Word & WordPerfect files:
- Save the state of the word processor
- Formats - exclusive property of applications
- Other apps interpreting them - threat to biz model
- Then people started e-mailing the files:
- Got there quicker, could be edited & returned
It is evident reading Jeff's article that the way the document describing his hidden fortune got on to the CD was via a desktop publishing system. If you think back to 1995, desktop publishing was all the rage.
IT in 1995
- Many hardware architectures
- Several operating systems
- Fragmented applications market
Modern History
"A preoccupation with the future not only prevents us from seeing the present as it is but often prompts us to rearrange the past."
Impacts of Jeff's Vision
- Scientific American article = lots of attention
- Governments, foundations started funding
- Mellon Foundation
- NSF, Library of Congress, National Archives ...
- Now have systems in production
- Using both strategies Jeff identified
- Internet Archive started the next year
It is somewhat odd that, despite Jeff's preference for emulation, many more of the existing systems use format migration.
The Web

- May 1995: HighWire puts JBC on-line
- Pioneers academic e-journals
The graph is from Netcraft. It shows that Netcraft didn't even start tracking the Web until after Jeff's article had been published, and that the real explosive growth of the Web didn't start until after Jeff's update appeared in 1999.
Off-line or On-Line
- In Jeff's vision documents survived off-line
- Coming on-line for occasional manipulation or copying
- Copy-ability was extrinsic to the medium
- Now, if it is worth keeping, it is on-line
- Off-line backups are temporary
- Copy-ability is intrinsic to the on-line medium
- No-one cares what the physical medium is
- Disk, flash memory, RAM, ...
- Just that it obeys the access protocols
To be sure, some material worth keeping is not on-line, at least not in the sense of being accessible via the Web. For example, the Stanford Digital Repository contains material that has been deposited on condition that it not be made accessible. Some of this represents preservation masters for content that is on-line in a presentation format. In other cases, it is content that ideally would be on-line if only that were permitted, for example content embargoed for a period, or material that would be on-line if only the resources to put it on-line were available.
Microsoft vs. its Users
- MSFT Office biz model has to drive upgrades
- Introduce gratuitous format incompatibility by default
- New machine writes document old machine can't read
- Old machine buys upgrade, MSFT happy
- Users carry the cost of incompatibility
- Unhappy - anti-trust probe ('90) & consent decree ('94)
- Users ('02-'05) force ODF standard for documents
- MSFT ('07) does OOXML, but concedes the basic point
- Experience with MSFT misled Jeff
- Even MSFT's ability to obsolete formats now limited
Two books about Microsoft's anti-trust struggle with the US Justice Dept. are Ken Auletta's World War 3.0 and John Heilemann's Pride Before The Fall.
Note that format obsolescence happens when support for a format is removed, not when support for a successor format is added. Microsoft's business model depended on adding support for new formats, not on removing support for old formats; making the new version of Office incapable of reading documents produced by its predecessor would have been self-defeating.
Evidence that Microsoft can no longer remove support for old formats, as opposed to adding support for new ones, is in this post from last year.
Documents or Content
- Jeff's documents were property of a program
- A Word file is data to be manipulated (only) by Word
- Proprietary format changeable on a whim
- Now documents are content to be published
- Charge to upgrade browser so it can't read old content?
- Browser free, content free, Office biz model dead
- Goal of publishing: reach as many readers as you can
- Gratuitous incompatibility is now self-defeating
- Publishing IE-only pages gets you flamed
Virtual Machines
- H/W virtualization has long history (VM/370!)
- In 1995 it wasn't mainstream
- Intel was just putting necessary stuff into X86
- Now virtual hardware is mainstream
- Old hardware can be emulated easily with open source
- Mainstream software now written for VMs
- Jeff was right about emulation
- But preservation wasn't the reason for doing it
Open Source
- In 1995 Open Source wasn't mainstream
- Now it's basic strategy for all but 2 big IT companies
- Open Source renderers for all major formats
- Even those with DRM! (Legal status obscure)
- Open Source is best preserved of all
- ASCII, source code control, can rebuild stack as it was
- Open Source isn't backwards incompatible
- For same reason as "no flag day on the Internet"
- Format with Open Source renderer is safe
- Executable "preservation metadata"
For a discussion of the importance of open source for preservation, see this post.
This argument may not apply to console games and other forms of content protected by Digital Rights Management (DRM). In practice most forms of DRM have been cracked (for a particularly revealing description of the necessary reverse-engineering process, see Bunnie Huang's fascinating book Hacking the Xbox). Thus, although in most cases it is technically possible to preserve access to DRM-protected content, the legality of doing so is often challenged. Presumably, the challenges wouldn't be mounted if the open source renderers didn't render the content. There is more on DRM in this post.
20/20 Hindsight
- Documents survive on-line, on the Web
- Off-line used only for temporary backups
- Migration between on-line media is inherent
- Readers are bundled with storage technology
- Formats are standard & app-independent
- Proprietary formats get open-source renderers
- Format obsolescence never happens
- No flag day on the Internet
- I.e: Jeff wrong in every particular
The Big Picture

- IT markets have increasing returns
- Usually called "network effects" - Metcalfe's Law
- IT markets have path dependence
- Many players early
- Randomly one gets bigger, network effects take over
- IT markets subject to capture (MSFT, INTC)
- Captured markets slow change down (e.g. Vista)
- History misled Jeff to overestimate change
W. Brian Arthur's book "Increasing Returns and Path Dependence in the Economy" is an important description of the behavior of technology markets. It explains how, as illustrated in the graph that I created, they are initially fragmented, with multiple products competing with comparable shares of a small market. At some point, for random reasons, one product gains enough extra market share for the increasing returns to scale (or network effects) to take over. Once they do, that product rapidly gains share in a rapidly expanding market. Others initially benefit from the growing market even as they lose market share, but rapidly start losing their existing customers to the winner.
At this point, as shown by the arrow on the graph, it is in the interest of the winning product to make switching from its competitors' products as easy as possible.
This analysis works very well for markets with large numbers of relatively unsophisticated customers. Markets with a small number of sophisticated customers have figured out strategies for fighting back. For example, in the airliner business the airlines have understood that it is in their long-term interest to buy from both Boeing and Airbus; allowing either to fail would impose unacceptable monopoly costs. Similar behavior can be seen in the market for CPU chips (Intel vs. AMD) and graphics chips (NVIDIA vs. ATI).
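The dynamic described above can be sketched as a toy simulation. This is not Arthur's own model, just an illustrative urn process under stated assumptions: every product starts with one customer, and each new customer picks a product with probability proportional to its installed base raised to a `feedback` exponent (values above 1 stand in for increasing returns). The function name and all parameters are hypothetical.

```python
import random
from typing import List


def simulate_market(products: int = 5, customers: int = 10_000,
                    feedback: float = 2.0, seed: int = 0) -> List[float]:
    """Return final market shares after `customers` sequential purchases."""
    rng = random.Random(seed)
    installed = [1] * products  # a fragmented market: equal starting shares
    for _ in range(customers):
        # The bigger a product's installed base, the more attractive it is.
        weights = [n ** feedback for n in installed]
        choice = rng.choices(range(products), weights=weights, k=1)[0]
        installed[choice] += 1
    total = sum(installed)
    return [n / total for n in installed]


if __name__ == "__main__":
    for seed in range(3):
        shares = simulate_market(seed=seed)
        print(seed, [round(s, 3) for s in shares])
```

Running it with several seeds typically shows one product ending up with most of the market, and a different product winning on different runs, which is the path dependence the slide refers to.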
Yes We Can!
- Jeff being wrong is Good News!
- Collections that survive aren't as hard as we thought
- Just collect and keep the bits
- Not collecting is the major reason for stuff being lost
- If you keep the bits, all will be well
- Current tools will let you access them for a long time
- Just go do it!
The Future
"Prediction is very difficult, especially about the future."
The Real Problems Were ...
- Scale
- Not individual documents but vast collections of them
- Cost
- Preservation not by individuals but large organizations
- Intellectual Property
- If content worth saving someone is making money from it
Scale
- Jeff looked at micro-level preservation
- A single document on a single CD
- Society needs macro-level preservation
- Information is now industrial scale
- Data centers the size of car factories
- As much power as an aluminum smelter
- 1 copy of 1 important database = $1M/yr
- Document-at-a-time preservation impractical
- Curators must get huge collections per day's work
Storage cost issues are addressed in the series of posts on A Petabyte For A Century and the resulting iPRES paper (190K PDF).
Metcalfe's Law
- The lesson of Google
- More value in connections than in documents themselves
- Preserving individual documents loses this value
- Need to preserve collections including the connections
- Another instance of Metcalfe's Law
- Value of a network goes as # of nodes squared
- Isolated document is a network of 1 node
- Google's other lesson - it's expensive
- We lack good cost data for digital preservation at scale
- Use two extremes to get a ballpark estimate
The two extremes are archive.org and Portico. I should stress that both systems are well engineered to meet their different goals using their chosen techniques. I am not criticizing them; I'm simply using them as bounds on the costs of operating at scale.
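To make the Metcalfe's Law point on the slide concrete, here is a small sketch comparing the value of documents preserved in isolation with the same documents preserved as a connected collection. The "value" units are arbitrary and the function names are made up for illustration; only the n-squared rule comes from the slide.

```python
def metcalfe_value(nodes: int) -> int:
    """Metcalfe's Law: value goes as the number of nodes squared."""
    return nodes ** 2


def value_in_isolation(documents: int) -> int:
    """Each isolated document is a network of 1 node."""
    return documents * metcalfe_value(1)


def value_as_collection(documents: int) -> int:
    """The same documents preserved with their connections intact."""
    return metcalfe_value(documents)


n = 1_000
print(value_in_isolation(n))                            # 1000
print(value_as_collection(n))                           # 1000000
print(value_as_collection(n) // value_in_isolation(n))  # 1000
```

Under this rule the gap between the two grows linearly with the size of the collection.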
Scale Implies Cost
- Internet Archive:
- contains 2PB, growing 240TB/yr
- Google collects the Web monthly then discards it
- archive.org collects the Web monthly then keeps it
- 2 snapshot copies + 1 coming up
- $10-14M/yr operation so ~$5 per GB per year
- Portico:
- All academic literature ~50TB, growing ~5TB/yr
- Portico still working on ingesting back content
- $6-8M/yr operation so >$100 per GB per year
My cost numbers for archive.org come from a recent article in The Economist's Technology Quarterly, and for Portico from a guesstimate based on their tax returns.
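The per-GB figures can be re-derived from the totals on the slide. The sketch below assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000,000 GB) and charges the whole annual budget to the stored bytes, which is a simplification; the function name is hypothetical.

```python
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000


def cost_per_gb_year(budget_low: float, budget_high: float, size_gb: float):
    """Return the (low, high) annual cost in dollars per GB stored."""
    return budget_low / size_gb, budget_high / size_gb


# Internet Archive: $10-14M/yr for about 2PB
internet_archive = cost_per_gb_year(10e6, 14e6, 2 * GB_PER_PB)
# Portico: $6-8M/yr for about 50TB
portico = cost_per_gb_year(6e6, 8e6, 50 * GB_PER_TB)

print(internet_archive)  # (5.0, 7.0)
print(portico)           # (120.0, 160.0)
```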
How Many $ Do We Need?
- archive.org should be cheaper than Portico
- It isn't doing all that "preservation" stuff
- Better bit preservation than archive.org important
- But does all the other stuff justify 20x cost per byte?
- How much do we need to save? An exabyte?
- 0.3% of the data generated in 2007, 0.05% of 2011
- @ archive.org = $5B/yr, @ Portico = $100B/yr
- The world doesn't have even $5B/yr to spend on this
Much less $100B/yr. The point is that, even if we could do adequate quality preservation with archive.org's cost structure, we'd still be much too expensive to address society's need for preservation. With the cost structures more normally associated with preservation at scale, we're much, much further away from addressing it.
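The exabyte extrapolation is simple scaling. As a sketch, assuming decimal units (1 EB = 1,000,000,000 GB) and the round per-GB rates implied by the slide's totals ($5 and $100 per GB per year):

```python
GB_PER_EB = 1_000_000_000


def annual_cost(size_gb: float, dollars_per_gb_year: float) -> float:
    """Annual cost in dollars of keeping size_gb at the given rate."""
    return size_gb * dollars_per_gb_year


at_archive_org = annual_cost(GB_PER_EB, 5.0)
at_portico = annual_cost(GB_PER_EB, 100.0)

print(at_archive_org / 1e9)         # 5.0   (billions of dollars per year)
print(at_portico / 1e9)             # 100.0 (billions of dollars per year)
print(at_portico / at_archive_org)  # 20.0
```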
Intellectual Property
- Most content worth saving is making money
- Lawyers won't risk that; don't want you to keep a copy
- They have massaged the law to their ends
- You must get permission, so you must talk to lawyers
- Or you are vulnerable to DMCA take-down like IA
- 1 hour of 1 lawyer ~ 5TB of disk
- 10 hours of 1 lawyer could store the academic literature
- For preservation, much uncertainty
- Effort devoted to high byte/lawyer-hour content
- Please use Creative Commons licenses!
The real problem is that the need to talk to the copyright owner's lawyers, which applies even if the content is open access, motivates preservation of content for which a single conversation with a lawyer obtains permission for a great deal of content. So, for example, even if it takes a lot of lawyer time to talk to Elsevier, the cost per unit of content preserved is small. Whereas even if the cost of talking to a small open access publisher is small, the cost per unit of content will be prohibitive. Once again, the economic forces push towards preservation of the content that is not at risk of loss.
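The "byte/lawyer-hour" metric from the slide can be illustrated with a small calculation. The hours and collection sizes below are invented purely for illustration; they are not estimates for Elsevier or for any real publisher.

```python
def gb_per_lawyer_hour(content_gb: float, lawyer_hours: float) -> float:
    """Content cleared for preservation per hour of legal negotiation."""
    return content_gb / lawyer_hours


# Hypothetical large publisher: long negotiation, huge collection.
large_publisher = gb_per_lawyer_hour(content_gb=20_000, lawyer_hours=100)
# Hypothetical small open access publisher: short negotiation, tiny collection.
small_publisher = gb_per_lawyer_hour(content_gb=2, lawyer_hours=2)

print(large_publisher)                    # 200.0
print(small_publisher)                    # 1.0
print(large_publisher / small_publisher)  # 200.0
```

With these made-up numbers an hour of negotiation with the large publisher clears 200 times as much content, even though the negotiation itself takes 50 times longer.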
Looking Forwards
- What are the non-problems?
- Or rather, the problems not big enough to matter
- What are the big problems?
- Preserving the world the way it is now
- Not the way it used to be
- Finding enough money
- And working out how much that is
- Surviving not having enough money
- By turning more things into non-problems
Non-Problems
- Formats
- Any format with an open-source renderer is not at risk
- Metadata (at least for documents)
- Hand-generated metadata
- Too expensive, search is better & more up-to-date
- Program-generated metadata
- Why save the output? You can save the program!
There are extended discussions of the usefulness of format metadata in this post, and of the relative value of open source renderers as against format specifications in this post. For a discussion of the questionable value of format metadata for preservation see this post.
Services not Documents
- "Preservation" implies static, isolated object
- Web 0.9 is like reading a printed book
- Web 1.0 dynamically inserts personalized adverts
- No-one preserves the adverts, but they're important
- Web 2.0 is dynamic, interconnected
- Each page view is unique, mash-ed up from services
- Pages change as you watch them
- What does it mean to preserve a unique, dynamic page?
For a discussion of the importance of context in preserving the Web see this post.
Things Worth Preserving
- User Generated Content
- To understand 2008 election you need to save blogs
- To do that you need to save YouTube, photo sites, ...
- So that the links to them keep working ...
- Technical, legal, scale obstacles almost insuperable
- Multi-player games & virtual worlds
- Even if you could get the data and invest in the servers
- They're dead without the community - Myst (1993)
- Dynamic databases & links to them
- e.g. Google Earth mash-ups - is Google Earth forever?
Do you remember Myst from 1993? It was a beautiful virtual world that you explored. Pretty soon you figured out that you were the only person there. Some time after that you figured out that the goal of the game was to figure out why you were the only person there. We've come a long way since then; Myst would not make it against World of Warcraft or Second Life.
For a discussion of the problem of preserving the materials future scholars will need to study elections, see this post.
Economics
- 2008 Preservation Buzzword: Sustainability
- We can't afford to preserve the stuff we know how to
- Future stuff will be much more expensive
- There'll be a lot more bytes of it
- Each byte will be more difficult & more expensive
- Bytes vulnerable to money supply glitches
- Data needs to be endowed if it is to survive hard times
- Endowing up front means preserving less
- Collection development: what must be kept?
- But it has really bad scaling problems
Bytes are a lot more vulnerable to disruptions in the money supply than paper. They are like divers in old-fashioned diving suits, dependent on air continuously pumped down from the surface. We need to make preserved bytes more like SCUBA divers, carrying their own tank of air with them that only needs to be refilled at intervals. Endowing data is discussed in this post.
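One way to get a feel for what endowing data means is a discounted-cost calculation. This is only a sketch under simple assumptions that are not taken from the talk: storage cost per year falls at a constant rate, the endowment earns a constant return, and costs are paid at the start of each year. The function name and the example parameter values are hypothetical.

```python
def endowment_needed(first_year_cost: float, years: int,
                     annual_cost_decline: float, annual_return: float) -> float:
    """Up-front sum that pays for `years` of storage.

    Each year's cost shrinks by `annual_cost_decline`, and money left in
    the endowment grows by `annual_return`, so later years are discounted.
    """
    total = 0.0
    cost = first_year_cost
    discount = 1.0
    for _ in range(years):
        total += cost * discount
        cost *= 1.0 - annual_cost_decline
        discount /= 1.0 + annual_return
    return total


# Illustrative only: $1,000 in year one, 100 years, with and without
# a 10%/yr decline in storage cost, at a 3%/yr return.
print(round(endowment_needed(1_000, 100, 0.10, 0.03)))
print(round(endowment_needed(1_000, 100, 0.00, 0.03)))
```

The size of the "tank of air" is very sensitive to the assumed rate at which storage gets cheaper, which is one reason the up-front sum is hard to pin down.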
Digital Preservation Difficult
- Conceptually
- What does it mean to preserve dynamic content?
- Technically
- Need to preserve services not content. How?
- Legally
- Preservation requires permission
- How do you even find everyone you need to ask?
- Economically
- Just storing the bits needs industrial infrastructure
- Beyond resources of universities, national libraries
- Are services like S3 reliable enough?
Alyssa Henry's FAST keynote, in which she offered numbers for availability but pointedly not for reliability, is discussed in this post.
Digital Preservation Important
- Paper's attributes built in to society
- Durable, write-once, tamper-evident, highly replicated, ...
- Society needs fixed, tamper-evident record
- E.g. laws, contracts, evidence, ...
- Paper provides this as a side-effect
- The Web is Winston Smith's dream machine
- All govt. information on a single web server (FDsys)
- Point-&-click to rewrite history
FDsys is discussed in this post.
Practical Next Steps
- Everyone - just go collect the bits:
- Not hard or costly to do a good enough job
- Please use Creative Commons licenses
- Preserve Open Source repositories:
- Easy & vital: no legal, technical or scale barriers
- Support Open Source renderers & emulators
- Support research into preservation tech:
- How to preserve bits adequately & affordably?
- How to preserve this decade's dynamic web of services?
- Not just last decade's static web of pages
Additional Material
Here is some additional material I prepared but which I cut to get down to the time allowed.
Did Documents Get Lost?
I was expecting a question asserting that I was wrong to suggest that formats in wide use in 1995 had not gone obsolete.
The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986).
I am sure that there are many formats that were in use in 1995 that are now difficult to render because current tools lack support for them. I have argued for a long time that there are few, if any, formats in wide use in 1995 that are difficult to render with current tools. I'm still looking for counter-examples.
But even if there were counter-examples, it wouldn't invalidate my case. It is easy to emulate 1995 PCs, and quite possible to emulate most other architectures current in 1995 using virtual machine technology. See, for example, this BBC story about a collaboration between Microsoft, the British Library and the British National Archives to access old formats by running virtual instances of old Microsoft operating systems and the relevant applications.
The only question is, did someone keep the bits for the operating system and the application as well as the document?
As regards media, the media in wide use in 1995 that are less common today are 3.5" floppies (still on the shelves at Fry's), ZIP drives (as I write this there are 306 of the original ZIP drives on eBay), and DAT tape (40 drives on eBay).