Friday, April 10, 2009

Spring CNI Plenary: The Remix

This post provides the text of the slides, sources and commentary for the opening plenary that I just gave at the CNI Spring Task Force meeting. The actual slides are available here (PDF). Follow me below the fold for the full details.


Kirk McKusick's IEEE Award

  • 30 years of the Unix file system
    • Disks 1,000,000x bigger
    • Code 4x bigger, much faster, more reliable
  • Reads every disk it ever wrote
    • No incompatible change to on-disk format
    • No incompatible change to API
  • For widely used software
    • Costs of incompatibility outweigh benefits
    • Strict compatibility makes Kirk's life easier

Kirk McKusick was awarded the 2009 IEEE Reynold B. Johnson Information Storage Systems Award at Usenix's 2009 FAST conference.

Shifting Sands
"... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats."
  • Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995
As Jeff wrote this, Kirk's file system was 16 years old, with no incompatible changes to the API or on-disk format.

The quotation is from Jeff Rothenberg's original article "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272, No. 1, 1995. A 1999 update is here, but the update doesn't change the argument of the talk.

The Meme
  • Incompatibility is inevitable, a force of nature
    • Why did Jeff think this in 1995?
    • Is it true in 2009?
  • If this meme isn't true
    • What causes incompatibility?
    • Are these causes operating now?
Incompatibility is not inevitable; it is a choice someone made. If they are rational, they assessed the costs and the benefits. Incompatible changes to widely used software impose costs on each user; if there are many users, aggregating these costs overwhelms any possible benefit. This is especially true when the benefits, even if large, accrue only to a few users.

Talk in 3 Parts
  • Ancient History: before 1995
    • Jeff Rothenberg's 50-year look forward from 1995
    • What he predicted & why
  • Modern History: from 1995 to 2009
    • Impacts of Jeff's article
    • What else happened
    • How Jeff rates as a prophet & why
  • The Future: following Jeff's example
    • Looking forward to identify the real problems


Ancient History
"History is not what you thought. It is what you can remember. All other history defeats itself."
  • From the Compulsory Preface to 1066 And All That, W. C. Sellar & R. J. Yeatman
1066 And All That is a classic of English humor. Anyone baffled by it should consult the po-faced Wikipedia entry.

Jeff Rothenberg's Scenario
  • In 2045, descendants find a CD
    • Try to recover document from it leading to Jeff's fortune
  • Threat: Media degradation
    • Bits on the CD suffer "bit rot"
  • Threat: Media obsolescence
    • No hardware capable of reading the bits available
  • Threat: Format obsolescence
    • No software capable of rendering the bits available
The first two threats are easy to explain and defend against by regularly migrating the bits from older to newer media. The third threat was harder to explain and defend against, so it dominates the article.

Jeff on Format Obsolescence
  • Defenses
    • Format Migration
    • Emulation
  • Format migration disapproved
    • "Finally, [format migration] suffers from a fatal flaw. ... Shifts of this kind make it difficult or impossible to translate old documents into new standard forms."
  • Emulation approved subject to caveat
    • "specifications for the outdated hardware ... must be saved in a digital form independent of ... software"
Note how, because the hardware specifications are themselves digital documents to be preserved, Jeff has deftly reduced the emulation strategy to a previously unsolved problem.

Jeff's Dystopian Vision
  • Documents survive in off-line media
  • The media have a short lifetime
  • The media readers have a short lifetime
  • Documents are in app-specific formats
    • Typical formats are proprietary
    • Attempts to standardize formats will fail
  • Hardware & O/S will change rapidly
    • In ways that break applications
  • Apps for rendering formats have a short life


Two Words: Desktop Publishing
  • The publishing medium was paper
  • Design goal of Word & WordPerfect files:
    • Save the state of the word processor
  • Formats - exclusive property of applications
    • Other apps interpreting them - threat to biz model
  • Then people started e-mailing the files:
    • Got there quicker, could be edited & returned
It is evident reading Jeff's article that the way the document describing his hidden fortune got on to the CD was via a desktop publishing system. If you think back to 1995, desktop publishing was all the rage.

IT in 1995


Modern History
"A preoccupation with the future not only prevents us from seeing the present as it is but often prompts us to rearrange the past."
  • Eric Hoffer


Impacts of Jeff's Vision
  • Scientific American article = lots of attention
  • Governments, foundations started funding
    • Mellon Foundation
    • NSF, Library of Congress, National Archives ...
  • Now have systems in production
    • Using both strategies Jeff identified
  • Internet Archive started the next year
    • Using neither of them
It is somewhat odd that, despite Jeff's preference for emulation, many more of the existing systems use format migration.

The Web
  • May 1995: HighWire puts JBC on-line
    • Pioneers academic e-journals
The graph is from Netcraft. It shows that Netcraft didn't even start tracking the Web until after Jeff's article had been published, and that the real explosive growth of the Web didn't start until after Jeff's update appeared in 1999.

Off-line or On-Line
  • In Jeff's vision documents survived off-line
    • Coming on-line for occasional manipulation or copying
    • Copy-ability was extrinsic to the medium
  • Now, if it is worth keeping, it is on-line
  • Off-line backups are temporary
  • Copy-ability is intrinsic to the on-line medium
  • No-one cares what the physical medium is
    • Disk, flash memory, RAM, ...
    • Just that it obeys the access protocols
To be sure, some material worth keeping is not on-line, at least not in the sense of being accessible via the Web. For example, the Stanford Digital Repository contains material that has been deposited on condition that it not be made accessible. Some of this represents preservation masters for content that is on-line in a presentation format. In other cases, it is content that ideally would be on-line if only that were permitted, for example content embargoed for a period, or material that would be on-line if only the resources to put it on-line were available.

Microsoft vs. its Users
  • MSFT Office biz model has to drive upgrades
    • Introduce gratuitous format incompatibility by default
    • New machine writes document old machine can't read
    • Old machine buys upgrade, MSFT happy
  • Users carry the cost of incompatibility
    • Unhappy - anti-trust probe ('90) & consent decree ('94)
    • Users ('02-'05) force ODF standard for documents
    • MSFT ('07) does OOXML, but concedes the basic point
  • Experience with MSFT misled Jeff
    • Even MSFT's ability to obsolete formats now limited
Two books about Microsoft's anti-trust struggle with the US Justice Dept. are Ken Auletta's World War 3.0 and John Heilemann's Pride Before The Fall.

Note that format obsolescence happens when support for a format is removed, not when support for a successor format is added. Microsoft's business model depended on adding support for new formats, not on removing support for old formats; making the new version of Office incapable of reading documents produced by its predecessor would have been self-defeating.

Evidence that Microsoft can no longer remove support for old formats, as opposed to adding support for new formats, is in this post from last year.

Documents or Content
  • Jeff's documents were property of a program
    • A Word file is data to be manipulated (only) by Word
    • Proprietary format changeable on a whim
  • Now documents are content to be published
    • Charge to upgrade browser so it can't read old content?
    • Browser free, content free, Office biz model dead
  • Goal of publishing: reach as many readers as you can
    • Gratuitous incompatibility is now self-defeating
    • Publishing IE-only pages gets you flamed


Virtual Machines
  • H/W virtualization has long history (VM/370!)
    • Software too (Basic!)
  • In 1995 it wasn't mainstream
    • Intel was just putting necessary stuff into X86
  • Now virtual hardware is mainstream
    • Old hardware can be emulated easily with open source
  • Mainstream software now written for VMs
    • Java, C#, ...
  • Jeff was right about emulation
    • But preservation wasn't the reason for doing it


Open Source
  • In 1995 Open Source wasn't mainstream
    • Now it's basic strategy for all but 2 big IT companies
  • Open Source renderers for all major formats
    • Even those with DRM! (Legal status obscure)
  • Open Source is best preserved of all
    • ASCII, source code control, can rebuild stack as it was
  • Open Source isn't backwards incompatible
    • For same reason as "no flag day on the Internet"
  • Format with Open Source renderer is safe
    • Executable "preservation metadata"
For a discussion of the importance of open source for preservation, see this post.

This argument may not apply to console games and other forms of content protected by Digital Rights Management (DRM). In practice most forms of DRM have been cracked (for a particularly revealing description of the necessary reverse-engineering process, see Bunnie Huang's fascinating book Hacking the Xbox). Thus, although in most cases it is technically possible to preserve access to DRM-protected content, the legality of doing so is often challenged. Presumably, the challenges wouldn't be mounted if the open source renderers didn't render the content. There is more on DRM in this post.

20/20 Hindsight
  • Documents survive on-line, on the Web
    • Off-line used only for temporary backups
  • Migration between on-line media is inherent
    • Readers are bundled with storage technology
  • Formats are standard & app-independent
    • Proprietary formats get open-source renderers
  • Format obsolescence never happens
    • No flag day on the Internet
  • I.e: Jeff wrong in every particular


The Big Picture

  • IT markets have increasing returns
    • Usually called "network effects" - Metcalfe's Law
  • IT markets have path dependence
    • Many players early
      • Randomly one gets bigger, network effects take over
    • IT markets subject to capture (MSFT, INTC)
      • Captured markets slow change down (e.g. Vista)
    • History misled Jeff to overestimate change
W. Brian Arthur's book Increasing Returns and Path Dependence in the Economy is an important description of the behavior of technology markets. It explains how, as illustrated in the graph that I created, they are initially fragmented, with multiple products competing with comparable shares of a small market. At some point, for random reasons, one gets enough bigger market share for the increasing returns to scale (or network effects) to take over. Once they do, one product rapidly gains share in a rapidly expanding market. Others initially benefit from the growing market even as they lose market share, but rapidly start losing their existing customers to the winner.

At this point, as shown by the arrow on the graph, it is in the interest of the winning product to make switching from its competitors' products as easy as possible.

This analysis works very well for markets with large numbers of relatively unsophisticated customers. Markets with a small number of sophisticated customers have figured out strategies for fighting back. For example, in the airliner business the airlines have understood that it is in their long-term interest to buy from both Boeing and Airbus; allowing either to fail would impose unacceptable monopoly costs. Similar behavior can be seen in the market for CPU chips (Intel vs. AMD) and graphics chips (NVIDIA vs. ATI).
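
The dynamic described above can be illustrated with a toy simulation. This is only a sketch, not Arthur's actual model: the function name, the starting state of one customer per product, and the exponent `alpha` (values above 1 stand in for increasing returns) are all assumptions made for illustration.

```python
import random

def simulate_market(n_products=5, n_customers=10_000, alpha=2.0, seed=0):
    """Toy increasing-returns market (illustrative assumptions only).

    Each arriving customer picks a product with probability proportional
    to (that product's current customers) ** alpha. Returns final shares.
    """
    rng = random.Random(seed)          # seeded so runs are repeatable
    customers = [1] * n_products       # assumed start: one customer each
    for _ in range(n_customers):
        weights = [c ** alpha for c in customers]
        winner = rng.choices(range(n_products), weights=weights, k=1)[0]
        customers[winner] += 1
    total = sum(customers)
    return [c / total for c in customers]

if __name__ == "__main__":
    shares = simulate_market()
    print([round(s, 3) for s in shares])
```

Varying `seed` changes which product ends up ahead, which is the path dependence point: the early random choices matter, not the products' merits.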

Yes We Can!
  • Jeff being wrong is Good News!
    • Collections that survive aren't as hard as we thought
  • Just collect and keep the bits
    • Not collecting is the major reason for stuff being lost
  • If you keep the bits, all will be well
    • Current tools will let you access them for a long time
  • Just go do it!


The Future
"Prediction is very difficult, especially about the future."
  • Niels Bohr


The Real Problems Were ...
  • Scale
    • Not individual documents but vast collections of them
  • Cost
    • Preservation not by individuals but large organizations
  • Intellectual Property
    • If content worth saving someone is making money from it


Scale
  • Jeff looked at micro-level preservation
    • A single document on a single CD
  • Society needs macro-level preservation
    • Information is now industrial scale
    • Data centers the size of car factories
    • As much power as an aluminum smelter
  • 1 copy of 1 important database = $1M/yr
    • In storage costs alone
  • Document-at-a-time preservation impractical
    • Curators must get huge collections per day's work
Storage cost issues are addressed in the series of posts on A Petabyte For A Century and the resulting iPRES paper (190K PDF).

Metcalfe's Law
  • The lesson of Google
    • More value in connections than in documents themselves
    • Preserving individual documents loses this value
    • Need to preserve collections including the connections
  • Another instance of Metcalfe's Law
    • Value of a network goes as # of nodes squared
    • Isolated document is a network of 1 node
  • Google's other lesson - it's expensive
    • We lack good cost data for digital preservation at scale
    • Use two extremes to get a ballpark estimate
The two extremes are archive.org and Portico. I should stress that both systems are well engineered to meet their different goals using their chosen techniques. I am not criticizing them, I'm simply using them as bounds on the costs of operating at scale.
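
To make the Metcalfe's Law bullet concrete, here is a minimal sketch comparing a collection preserved with its connections against the same documents preserved in isolation. The function names are hypothetical, and the value units are arbitrary; only the "nodes squared" rule comes from the slide.

```python
def metcalfe_value(n_nodes: int) -> int:
    """Value of a network under Metcalfe's Law: nodes squared."""
    return n_nodes ** 2

def isolated_value(n_docs: int) -> int:
    """n documents each preserved alone: n networks of one node each."""
    return n_docs * metcalfe_value(1)

def connected_value(n_docs: int) -> int:
    """The same documents preserved as one collection with its links."""
    return metcalfe_value(n_docs)

if __name__ == "__main__":
    n = 1_000
    print(isolated_value(n), connected_value(n))
```

Under this rule the gap between the two grows linearly with the size of the collection, which is why preserving documents one at a time loses most of the value.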

Scale Implies Cost
  • Internet Archive:
    • contains 2PB, growing 240TB/yr
    • Google collects the Web monthly then discards it
    • archive.org collects the Web monthly then keeps it
    • 2 snapshot copies + 1 coming up
    • $10-14M/yr operation so ~$5 per GB per year
  • Portico:
    • All academic literature ~50TB, growing ~5TB/yr
    • Portico still working on ingesting back content
    • $6-8M/yr operation so >$100 per GB per year
My cost numbers for archive.org come from a recent article in The Economist's Technology Quarterly, and for Portico from a guesstimate based on their tax returns.

How Many $ Do We Need?
  • archive.org should be cheaper than Portico
    • It isn't doing all that "preservation" stuff
    • Better bit preservation than archive.org important
  • But does all the other stuff justify 20x cost per byte?
  • How much do we need to save? An exabyte?
    • 0.3% of the data generated in 2007, 0.05% of 2011
    • @ archive.org = $5B/yr, @ Portico = $100B/yr
    • The world doesn't have even $5B/yr to spend on this
Much less $100B/yr. The point is that, even if we could do adequate quality preservation with archive.org's cost structure, we'd still be much too expensive to address society's need for preservation. With the cost structures more normally associated with preservation at scale, we're much, much further away from addressing it.
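
The arithmetic behind these ballpark figures can be sketched as follows. It assumes decimal units (1 PB = 10^6 GB), divides each organization's whole annual budget by the size of a single copy of its collection, and ignores replicas and growth; those simplifications are mine, not a statement of how either organization accounts for its costs.

```python
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000

def cost_per_gb_year(budget_low, budget_high, size_gb):
    """Range of annual cost per GB: whole budget over collection size."""
    return budget_low / size_gb, budget_high / size_gb

def exabyte_cost(rate_per_gb_year):
    """Annual cost of keeping one exabyte at the given per-GB rate."""
    return rate_per_gb_year * GB_PER_EB

# Budgets and sizes as quoted on the slides.
archive = cost_per_gb_year(10e6, 14e6, 2 * GB_PER_PB)   # (5.0, 7.0)
portico = cost_per_gb_year(6e6, 8e6, 50 * GB_PER_TB)    # (120.0, 160.0)

if __name__ == "__main__":
    print("archive.org $/GB/yr:", archive)
    print("Portico     $/GB/yr:", portico)
    print("1 EB @ archive.org: $%.0fB/yr" % (exabyte_cost(archive[0]) / 1e9))
    print("1 EB @ Portico:     $%.0fB/yr" % (exabyte_cost(portico[0]) / 1e9))
```

Taking the low end of each budget range gives about $5B/yr and $120B/yr for an exabyte, the same order as the figures on the slide.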

Intellectual Property
  • Most content worth saving is making money
    • Lawyers won't risk that; don't want you to keep a copy
  • They have massaged the law to their ends
    • You must get permission, so you must talk to lawyers
      • Or you are vulnerable to DMCA take-down like IA
  • 1 hour of 1 lawyer ~ 5TB of disk
    • 10 hours of 1 lawyer could store the academic literature
  • For preservation, much uncertainty
    • Effort devoted to high byte/lawyer-hour content
  • Please use Creative Commons licenses!
The real problem is that the need to talk to the copyright owner's lawyers applies even if the content is open access. This motivates preservation of content for which a single conversation with a lawyer obtains permission for a great deal of content. So, for example, even if it takes a lot of lawyer time to talk to Elsevier, the cost per unit of content preserved is small. Whereas even if the cost to talk to a small open access publisher is small, the cost per unit of content will be prohibitive. Once again, the economic forces push towards preservation of the content that is not at risk of loss.
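
A small sketch of the byte/lawyer-hour reasoning. The only number taken from the slide is the equivalence of one lawyer-hour to about 5TB of disk; the two publishers' lawyer-hours and collection sizes below are invented purely for illustration, as is the function name.

```python
DISK_GB_PER_LAWYER_HOUR = 5_000   # slide: 1 hour of 1 lawyer ~ 5TB of disk

def legal_overhead(lawyer_hours, content_gb):
    """GB of disk forgone on lawyers per GB of content actually preserved."""
    return lawyer_hours * DISK_GB_PER_LAWYER_HOUR / content_gb

# Hypothetical cases, not real data:
big_publisher = legal_overhead(lawyer_hours=100, content_gb=20_000)  # 25.0
small_open_access = legal_overhead(lawyer_hours=2, content_gb=1)     # 10000.0

if __name__ == "__main__":
    print("large publisher:", big_publisher)
    print("small OA journal:", small_open_access)
```

Even with far fewer lawyer-hours, the small publisher's overhead per GB is hundreds of times worse in this made-up example, which is the pressure towards high byte/lawyer-hour content.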

Looking Forwards
  • What are the non-problems?
    • Or rather, the problems not big enough to matter
  • What are the big problems?
    • Preserving the world the way it is now
      • Not the way it used to be
    • Finding enough money
      • And working out how much that is
    • Surviving not having enough money
      • By turning more things into non-problems


Non-Problems
  • Formats
    • Any format with an open-source renderer is not at risk
  • Metadata (at least for documents)
    • Hand-generated metadata
      • Too expensive, search is better & more up-to-date
    • Program-generated metadata
      • Why save the output? You can save the program!
There are extended discussions of the usefulness of format metadata in this post, and of the relative value of open source renderers as against format specifications in this post. For a discussion of the questionable value of format metadata for preservation see this post.

Services not Documents
  • "Preservation" implies static, isolated object
  • Web 2.0 is dynamic, interconnected
    • Each page view is unique, mash-ed up from services
    • Pages change as you watch them
  • What does it mean to preserve a unique, dynamic page?
For a discussion of the importance of context in preserving the Web see this post.

Things Worth Preserving
  • User Generated Content
    • To understand 2008 election you need to save blogs
    • To do that you need to save YouTube, photo sites, ...
      • So that the links to them keep working ...
    • Technical, legal, scale obstacles almost insuperable
  • Multi-player games & virtual worlds
    • Even if you could get the data and invest in the servers
    • They're dead without the community - Myst (1993)
  • Dynamic databases & links to them
    • e.g. Google Earth mash-ups - is Google Earth forever?
Do you remember Myst from 1993? It was a beautiful virtual world that you explored. Pretty soon you figured out that you were the only person there. Some time after that you figured out that the goal of the game was to figure out why you were the only person there. We've come a long way since then; Myst would not make it against World of Warcraft or Second Life.

For a discussion of the problem of preserving the materials future scholars will need to study elections, see this post.

Economics
  • 2008 Preservation Buzzword: Sustainability
    • We can't afford to preserve the stuff we know how to
  • Future stuff will be much more expensive
    • There'll be a lot more bytes of it
    • Each byte will be more difficult & more expensive
  • Bytes vulnerable to money supply glitches
    • Data needs to be endowed if it is to survive hard times
    • Endowing up front means preserving less
  • Collection development: what must be kept?
    • But it has really bad scaling problems
Bytes are a lot more vulnerable to disruptions in the money supply than paper. They are like divers in old-fashioned diving suits, dependent on air continuously pumped down from the surface. We need to make preserved bytes more like SCUBA divers, carrying their own tank of air with them that only needs to be refilled at intervals. Endowing data is discussed in this post.
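
One way to think about the size of the "tank of air" is as the present value of future storage bills. The sketch below is a simple discounted-sum model with hypothetical parameters (annual decline in storage cost, real return on the endowment); it is an assumption for illustration, not the model from the post referred to above.

```python
def endowment(first_year_cost, years, cost_decline, real_return):
    """Up-front sum whose principal plus returns pays each year's bill.

    first_year_cost: storage bill in year 0
    cost_decline:    assumed fractional drop in the bill each year
    real_return:     assumed annual real return on the invested money
    """
    total = 0.0
    for t in range(years):
        bill = first_year_cost * (1 - cost_decline) ** t   # bill in year t
        total += bill / (1 + real_return) ** t             # discounted to today
    return total

if __name__ == "__main__":
    # Hypothetical: $100/yr today, costs falling 20%/yr, 5% real return.
    print(round(endowment(100.0, 100, 0.20, 0.05), 2))
```

The faster storage costs fall and the better the return, the smaller the endowment; if either assumption fails in hard times, the endowment has to be larger, which means preserving less.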

Digital Preservation Difficult
  • Conceptually
    • What does it mean to preserve dynamic content?
  • Technically
    • Need to preserve services not content. How?
  • Legally
    • Preservation requires permission
    • How do you even find everyone you need to ask?
  • Economically
    • Just storing the bits needs industrial infrastructure
      • Beyond resources of universities, national libraries
    • Are services like S3 reliable enough?
Alyssa Henry's FAST keynote, in which she offered numbers for availability but pointedly not for reliability is discussed in this post.

Digital Preservation Important
  • Paper's attributes built in to society
    • Durable, write-once, tamper-evident, highly replicated, ...
  • Society needs fixed, tamper-evident record
    • E.g. laws, contracts, evidence, ...
      • Paper provides this as a side-effect
  • The Web is Winston Smith's dream machine
    • All govt. information on a single web server (FDsys)
    • Point-&-click to rewrite history
FDsys is discussed in this post.

Practical Next Steps
  • Everyone - just go collect the bits:
    • Not hard or costly to do a good enough job
    • Please use Creative Commons licenses
  • Preserve Open Source repositories:
    • Easy & vital: no legal, technical or scale barriers
  • Support Open Source renderers & emulators
  • Support research into preservation tech:
    • How to preserve bits adequately & affordably?
    • How to preserve this decade's dynamic web of services?
      • Not just last decade's static web of pages


Additional Material

Here is some additional material I prepared but which I cut to get down to the time allowed.

Did Documents Get Lost?

I was expecting a question asserting that I was wrong to suggest that formats in wide use in 1995 had not gone obsolete.

The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986).

I am sure that there are many formats that were in use in 1995 that are now difficult to render because current tools lack support for them. I have argued for a long time that there are few, if any, formats in wide use in 1995 that are difficult to render with current tools. I'm still looking for counter-examples.

But even if there were counter-examples, it wouldn't invalidate my case. It is easy to emulate 1995 PCs, and quite possible to emulate most other architectures current in 1995 using virtual machine technology. See, for example, this BBC story about a collaboration between Microsoft, the British Library and the British National Archives to access old formats by running virtual instances of old Microsoft operating systems and the relevant applications.

The only question is, did someone keep the bits for the operating system and the application as well as the document?

As regards media, the media in wide use in 1995 that are less common today are 3.5" floppies (still on the shelves at Fry's), ZIP drives (as I write this there are 306 of the original ZIP drives on eBay), and DAT tape (40 drives on eBay).

      7 comments:

      Sheila Morrissey said...
      This comment has been removed by a blog administrator.
      David. said...

      I'm grateful to Portico's Sheila Morrissey for setting out the conventional wisdom I was arguing against in my plenary talk. Unfortunately, the length of her comment appears to be inhibiting others from contributing to the discussion the talk was intended to spark.

      I have deleted Sheila's comment but have preserved the text, which will shortly appear as a post, interspersed with my responses to her points. I hope others will now feel freer to comment at less length on this post.

      Clearly, this is a blog where relatively lengthy comments add value. But the urge to expand one's thoughts in a comment needs to be balanced against the imposition on readers to scroll past one's contribution to find others' comments that they may find equally valuable. If the Daily Kos FAQ had been accessible from my present remote location, I would have linked to what I remember as valuable advice on the length to which a comment should grow before it should mutate into a post lest it disrupt the discussion.

      Sheila Morrissey said...

      Readers who wish to see the response to David’s talk can find it here.

      David. said...

      The promised post of Sheila Morrissey's comment with my responses is now available. I'm sorry for the delay; I had less Internet access and less time to work on it in my holiday hotels in Tasmania than I expected.

      David. said...

      CNI has posted the video of Cliff's introduction, my talk, and the questions here.

      Matt said...

      As someone who has worked in digital preservation at the UK National Archives for several years, I would like to correct a few misconceptions in the linked BBC story.

      The volume of digital data managed by the UK National Archives is simply wrong by several orders of magnitude (this has done the rounds on the internet until it has now become a virtual fact!).

      The formats we hold digital data in are mostly open standards, and even the proprietary formats are still fully readable using current software. We have never had lots of data in unreadable formats.

      Microsoft did donate virtual machines with older and current versions of its operating system for us to experiment with in our preservation laboratory. However, the primary digital preservation strategy of the UK National Archives is still, and always has been, format migration.

      JKF said...

      There are some interesting thoughts expressed in regards to lawyers. I ran across your post while doing some research on e-discovery for a case I'm currently working on and it really got me thinking about other issues as well.

      But first back to the e-discovery aspect. As an attorney in this new digital age it has become more and more important that I'm able to track down records that may or may not be hanging out in cyberspace. For me preservation of these records is vital. Our cases often hinge on what we find during this discovery, so of course I am wanting companies to spend whatever is needed to make sure that they have secure storage of documents with adequate capacity for all needed storage.

      The second issue this made me think of is the new rage of SaaS, or software as a service. I know that I use two different providers to keep my client contact information, files, billing and basically everything in the "cloud". As more and more companies move toward this and more safe storage is needed, how will this affect the current landscape?

      And finally a very small issue, but one which is important to firms such as mine: how do we make sure that everything we ever post, comment on or make a note about in our professional lives on the internet is safely stored and retrievable? Again, as a lawyer I'm supposed to keep records of everything I ever do on my website. On my Memphis personal injury website/blog for example I have hundreds of pages of content that I'm changing up all the time to reflect new changes in the law. Under my state bar association's rule I'm supposed to keep a copy of each and every page, each and every time it is changed. These regulations are ridiculous and cause me a huge headache. But it's just another example of why reliable and affordable storage is so important.

      Oh well, I'm rambling. But from the point of view of someone who is definitely not an expert in this field, I really enjoyed browsing through several of your posts.

      Jami Ferrell - (just a simple lawyer trying to enjoy life, and better himself if at all possible)