DSHR's Blog: Spring CNI Plenary: The Remix

This post provides the text of the slides, sources and commentary for the opening plenary that I just gave at the CNI Spring Task Force meeting. The actual slides are available here (PDF). Follow me below the fold for the full details.

Kirk McKusick's IEEE Award

30 years of the Unix file system

Disks 1,000,000x bigger

Code 4x bigger, much faster, more reliable

Reads every disk it ever wrote

No incompatible change to on-disk format

No incompatible change to API

For widely used software

Costs of incompatibility outweigh benefits

Strict compatibility makes Kirk's life easier

Kirk McKusick was awarded the 2009 IEEE Reynold B. Johnson Information Storage Systems Award at Usenix's 2009 FAST conference.

Shifting Sands

"... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats."

Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995

As Jeff wrote this, Kirk's file system was 16 years old, with no incompatible changes to the API or on-disk format.

The quotation is from Jeff Rothenberg's original article "Ensuring the Longevity of Digital Documents", Scientific American Vol. 272, No. 1, 1995. A 1999 update is here, but the update doesn't change the argument of the talk.

The Meme

Incompatibility is inevitable, a force of nature

Why did Jeff think this in 1995?

Is it true in 2009?

If this meme isn't true

What causes incompatibility?

Are these causes operating now?

Incompatibility is not inevitable, it is a choice someone made. If they are rational, they assessed the costs and the benefits. Incompatible changes to widely used software impose costs on each user; if there are many users, aggregating these costs overwhelms any possible benefit. This is especially true when the benefits, even if large, accrue only to a few users.

Talk in 3 Parts

Ancient History: before 1995

Jeff Rothenberg's 50-year look forward from 1995

What he predicted & why

Modern History: from 1995 to 2009

Impacts of Jeff's article

What else happened

How Jeff rates as a prophet & why

The Future: following Jeff's example

Looking forward to identify the real problems

Ancient History

"History is not what you thought. It is what you can remember. All other history defeats itself."

From the Compulsory Preface to 1066 And All That, W. C. Sellar & R. J. Yeatman

1066 And All That is a classic of English humor. Anyone baffled by it should consult the po-faced Wikipedia entry.

Jeff Rothenberg's Scenario

In 2045, descendants find a CD

Try to recover document from it leading to Jeff's fortune

Threat: Media degradation

Bits on the CD suffer "bit rot"

Threat: Media obsolescence

No hardware capable of reading the bits available

Threat: Format obsolescence

No software capable of rendering the bits available

The first two threats are easy to explain and defend against by regularly migrating the bits from older to newer media. The third threat was harder to explain and defend against, so it dominates the article.

Jeff on Format Obsolescence

Defenses

Format Migration

Emulation

Format migration disapproved

"Finally, [format migration] suffers from a fatal flaw. ... Shifts of this kind make it difficult or impossible to translate old documents into new standard forms."

Emulation approved subject to caveat

"specifications for the outdated hardware ... must be saved in a digital form independent of ... software"

Note how, because the hardware specifications are themselves digital documents to be preserved, Jeff has deftly reduced the emulation strategy to a previously unsolved problem.

Jeff's Dystopian Vision

Documents survive in off-line media

The media have a short lifetime

The media readers have a short lifetime

Documents are in app-specific formats

Typical formats are proprietary

Attempts to standardize formats will fail

Hardware & O/S will change rapidly

In ways that break applications

Apps for rendering formats have a short life

Two Words: Desktop Publishing

The publishing medium was paper

Design goal of Word & WordPerfect files:

Save the state of the word processor

Formats - exclusive property of applications

Other apps interpreting them - threat to biz model

Then people started e-mailing the files:

Got there quicker, could be edited & returned

It is evident reading Jeff's article that the way the document describing his hidden fortune got on to the CD was via a desktop publishing system. If you think back to 1995, desktop publishing was all the rage.

IT in 1995

Many hardware architectures

X86, SPARC, MIPS, 680X0, PowerPC, ...

PC split between ISA bus and PCI bus

Several operating systems

Windows 3.X, Windows 95, OS/2, System 7, Solaris, ...

Linux (1.2.0) was barely functional

Fragmented applications market

MSFT Word vs. WordPerfect ...

Lotus 1-2-3 vs. Excel

No standard for PC graphics, so no 3D PC games

Modern History

"A preoccupation with the future not only prevents us from seeing the present as it is but often prompts us to rearrange the past."

Eric Hoffer

Impacts of Jeff's Vision

Scientific American article = lots of attention

Governments, foundations started funding

Mellon Foundation

NSF, Library of Congress, National Archives ...

Now have systems in production

Using both strategies Jeff identified

Internet Archive started the next year

Using neither of them

It is somewhat odd that, despite Jeff's preference for emulation, many more of the existing systems use format migration.

The Web

May 1995: HighWire puts JBC on-line

Pioneers academic e-journals

The graph is from Netcraft. It shows that Netcraft didn't even start tracking the Web until after Jeff's article had been published, and that the real explosive growth of the Web didn't start until after Jeff's update appeared in 1999.

Off-line or On-Line

In Jeff's vision documents survived off-line

Coming on-line for occasional manipulation or copying

Copy-ability was extrinsic to the medium

Now, if it is worth keeping, it is on-line

Off-line backups are temporary

Copy-ability is intrinsic to the on-line medium

No-one cares what the physical medium is

Disk, flash memory, RAM, ...

Just that it obeys the access protocols

To be sure, some material worth keeping is not on-line, at least not in the sense of being accessible via the Web. For example, the Stanford Digital Repository contains material that has been deposited on condition that it not be made accessible. Some of this represents preservation masters for content that is on-line in a presentation format. In other cases, it is content that ideally would be on-line if only that were permitted, for example content embargoed for a period, or material that would be on-line if only the resources to put it on-line were available.

Microsoft vs. its Users

MSFT Office biz model has to drive upgrades

Introduce gratuitous format incompatibility by default

New machine writes document old machine can't read

Old machine buys upgrade, MSFT happy

Users carry the cost of incompatibility

Unhappy - anti-trust probe ('90) & consent decree ('94)

Users ('02-'05) force ODF standard for documents

MSFT ('07) does OOXML, but concedes the basic point

Experience with MSFT misled Jeff

Even MSFT's ability to obsolete formats now limited

Two books about Microsoft's anti-trust struggle with the US Justice Dept. are Ken Auletta's World War 3.0 and John Heilemann's Pride Before The Fall.

Note that format obsolescence happens when support for a format is removed, not when support for a successor format is added. Microsoft's business model depended on adding support for new formats, not on removing support for old formats; making the new version of Office incapable of reading documents produced by its predecessor would have been self-defeating.

Evidence that Microsoft can no longer remove support for old formats, as opposed to adding support for new formats, is in this post from last year.

Documents or Content

Jeff's documents were property of a program

A Word file is data to be manipulated (only) by Word

Proprietary format changeable on a whim

Now documents are content to be published

Charge to upgrade browser so it can't read old content?

Browser free, content free, Office biz model dead

Goal of publishing: reach as many readers as you can

Gratuitous incompatibility is now self-defeating

Publishing IE-only pages gets you flamed

Virtual Machines

H/W virtualization has long history (VM/370!)

Software too (Basic!)

In 1995 it wasn't mainstream

Intel was just putting necessary stuff into X86

Now virtual hardware is mainstream

Old hardware can be emulated easily with open source

Mainstream software now written for VMs

Java, C#, ...

Jeff was right about emulation

But preservation wasn't the reason for doing it

Open Source

In 1995 Open Source wasn't mainstream

Now it's basic strategy for all but 2 big IT companies

Open Source renderers for all major formats

Even those with DRM! (Legal status obscure)

Open Source is best preserved of all

ASCII, source code control, can rebuild stack as it was

Open Source isn't backwards incompatible

For same reason as "no flag day on the Internet"

Format with Open Source renderer is safe

Executable "preservation metadata"

For a discussion of the importance of open source for preservation, see this post.

This argument may not apply to console games and other forms of content protected by Digital Rights Management (DRM). In practice most forms of DRM have been cracked; for a particularly revealing description of the necessary reverse-engineering process, see Bunnie Huang's fascinating book Hacking the Xbox. Thus, although in most cases it is technically possible to preserve access to DRM-protected content, the legality of doing so is often challenged. Presumably, the challenges wouldn't be mounted if the open source renderers didn't render the content. There is more on DRM in this post.

20/20 Hindsight

Documents survive on-line, on the Web

Off-line used only for temporary backups

Migration between on-line media is inherent

Readers are bundled with storage technology

Formats are standard & app-independent

Proprietary formats get open-source renderers

Format obsolescence never happens

No flag day on the Internet

I.e: Jeff wrong in every particular

The Big Picture

IT markets have increasing returns

Usually called "network effects" - Metcalfe's Law

IT markets have path dependence

Many players early

Randomly one gets bigger, network effects take over

IT markets subject to capture (MSFT, INTC)

Captured markets slow change down (e.g. Vista)

History misled Jeff to overestimate change

W. Brian Arthur's book Increasing Returns and Path Dependence in the Economy is an important description of the behavior of technology markets. It explains how, as illustrated in the graph that I created, they are initially fragmented, with multiple products competing with comparable shares of a small market. At some point, for random reasons, one gains a big enough lead in market share for the increasing returns to scale (or network effects) to take over. Once they do, that product rapidly gains share in a rapidly expanding market. Others initially benefit from the growing market even as they lose market share, but rapidly start losing their existing customers to the winner.

At this point, as shown by the arrow on the graph, it is in the interest of the winning product to make switching from its competitors' products as easy as possible.

This analysis works very well for markets with large numbers of relatively unsophisticated customers. Markets with a small number of sophisticated customers have figured out strategies for fighting back. For example, in the airliner business the airlines have understood that it is in their long-term interest to buy from both Boeing and Airbus; allowing either to fail would impose unacceptable monopoly costs. Similar behavior can be seen in the market for CPU chips (Intel vs. AMD) and graphics chips (NVIDIA vs. ATI).
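The lock-in dynamic described above can be illustrated with a toy simulation. This is only a minimal sketch, not Arthur's actual model: it assumes that each new adopter picks a product with probability proportional to the square of its current user count (borrowing the "nodes squared" rule of thumb as a stand-in for network effects), and the function name, parameters and starting conditions are all hypothetical.

```python
import random


def simulate_market(products=4, adopters=10_000, seed=0):
    """Return final market shares after a run of reinforced random adoption.

    Assumption (not from the talk): a product's attractiveness to the next
    adopter is proportional to the square of its current user count.
    """
    rng = random.Random(seed)
    users = [1] * products  # fragmented start: equal shares of a tiny market
    for _ in range(adopters):
        weights = [u * u for u in users]
        chosen = rng.choices(range(products), weights=weights)[0]
        users[chosen] += 1
    total = sum(users)
    return [u / total for u in users]


if __name__ == "__main__":
    # Different seeds stand in for the "random reasons" that pick a winner.
    for seed in range(5):
        shares = simulate_market(seed=seed)
        print(seed, [round(s, 3) for s in shares])
```

Running it with several seeds shows the path dependence: which product ends up ahead depends on early random choices rather than on any built-in advantage, since all products start out identical.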

Yes We Can!

Jeff being wrong is Good News!

Collections that survive aren't as hard as we thought

Just collect and keep the bits

Not collecting is the major reason for stuff being lost

If you keep the bits, all will be well

Current tools will let you access them for a long time

Just go do it!

The Future

"Prediction is very difficult, especially about the future."

Niels Bohr

The Real Problems Were ...

Scale

Not individual documents but vast collections of them

Cost

Preservation not by individuals but large organizations

Intellectual Property

If content worth saving someone is making money from it

Scale

Jeff looked at micro-level preservation

A single document on a single CD

Society needs macro-level preservation

Information is now industrial scale

Data centers the size of car factories

As much power as an aluminum smelter

1 copy of 1 important database = $1M/yr

In storage costs alone

Document-at-a-time preservation impractical

Curators must get huge collections per day's work

Storage cost issues are addressed in the series of posts on A Petabyte For A Century and the resulting iPRES paper (190K PDF).

Metcalfe's Law

The lesson of Google

More value in connections than in documents themselves

Preserving individual documents loses this value

Need to preserve collections including the connections

Another instance of Metcalfe's Law

Value of a network goes as # of nodes squared

Isolated document is a network of 1 node

Google's other lesson - it's expensive

We lack good cost data for digital preservation at scale

Use two extremes to get a ballpark estimate

The two extremes are archive.org and Portico. I should stress that both systems are well engineered to meet their different goals using their chosen techniques. I am not criticizing them; I'm simply using them as bounds on the costs of operating at scale.

Scale Implies Cost

Internet Archive:

contains 2PB, growing 240TB/yr

Google collects the Web monthly then discards it

archive.org collects the Web monthly then keeps it

2 snapshot copies + 1 coming up

$10-14M/yr operation so ~$5 per GB per year

Portico:

All academic literature ~50TB, growing ~5TB/yr

Portico still working on ingesting back content

$6-8M/yr operation so >$100 per GB per year

My cost numbers for archive.org come from a recent article in The Economist's Technology Quarterly, and for Portico from a guesstimate based on their tax returns.

How Many $ Do We Need?

archive.org should be cheaper than Portico

It isn't doing all that "preservation" stuff

Better bit preservation than archive.org important

But does all the other stuff justify 20x cost per byte?

How much do we need to save? An exabyte?

0.3% of the data generated in 2007, 0.05% of 2011

@ archive.org = $5B/yr, @ Portico = $100B/yr

The world doesn't have even $5B/yr to spend on this

Much less $100B/yr. The point is that, even if we could do adequate quality preservation with archive.org's cost structure, we'd still be much too expensive to address society's need for preservation. With the cost structures more normally associated with preservation at scale, we're much, much further away from addressing it.
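The ballpark arithmetic can be reproduced from the budget and collection-size figures on the slides. The sketch below is a rough check, not a cost model: it assumes decimal units (1 PB = 1,000,000 GB), attributes each organization's whole annual budget to the bytes it currently holds, and scales linearly to an exabyte. The function and variable names are illustrative.

```python
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000

# (low annual budget in $, high annual budget in $, stored GB), from the slides
ARCHIVES = {
    "archive.org": (10e6, 14e6, 2 * GB_PER_PB),
    "Portico": (6e6, 8e6, 50 * GB_PER_TB),
}


def cost_per_gb_year(budget_low, budget_high, stored_gb):
    """Low and high estimates of dollars per GB per year."""
    return budget_low / stored_gb, budget_high / stored_gb


def exabyte_cost(name):
    """Annual cost of an exabyte at the named archive's cost per GB (linear scaling)."""
    low, high = cost_per_gb_year(*ARCHIVES[name])
    return low * GB_PER_EB, high * GB_PER_EB


if __name__ == "__main__":
    for name in ARCHIVES:
        low, high = cost_per_gb_year(*ARCHIVES[name])
        eb_low, eb_high = exabyte_cost(name)
        print(f"{name}: ${low:,.0f}-{high:,.0f} per GB per year, "
              f"${eb_low / 1e9:,.0f}-{eb_high / 1e9:,.0f}B per exabyte per year")
```

On these assumptions the exabyte totals come out at the same order of magnitude as the $5B/yr and $100B/yr quoted on the slide. Linear scaling ignores any economies of scale, so this is at best an upper-bound style estimate.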

Intellectual Property

Most content worth saving is making money

Lawyers won't risk that; don't want you to keep a copy

They have massaged the law to their ends

You must get permission, so you must talk to lawyers

Or you are vulnerable to DMCA take-down like IA

1 hour of 1 lawyer ~ 5TB of disk

10 hours of 1 lawyer could store the academic literature

For preservation, much uncertainty

Effort devoted to high byte/lawyer-hour content

Please use Creative Commons licenses!

The real problem is that the need to talk to the copyright owner's lawyers, which applies even if the content is open access, motivates preservation of content for which a single conversation with a lawyer obtains permission for a great deal of content. So, for example, even if it takes a lot of lawyer time to talk to Elsevier, the cost per unit of content preserved is small. Whereas even if the cost to talk to a small open access publisher is small, the cost per unit of content will be prohibitive. Once again, the economic forces push towards preservation of the content that is not at risk of loss.
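To make the byte/lawyer-hour argument concrete, here is a small sketch that expresses legal effort in the slide's own currency, terabytes of disk. Only the conversion rate (1 lawyer-hour ~ 5TB of disk) comes from the slide; the hours and collection sizes for the two publishers are invented purely for illustration and do not describe any real negotiation.

```python
TB_PER_LAWYER_HOUR = 5.0  # rule of thumb from the slide


def disk_equivalent_tb(lawyer_hours):
    """TB of disk that the same money would have bought."""
    return lawyer_hours * TB_PER_LAWYER_HOUR


def legal_overhead_ratio(lawyer_hours, content_tb):
    """TB of disk forgone on legal work per TB of content actually preserved."""
    return disk_equivalent_tb(lawyer_hours) / content_tb


if __name__ == "__main__":
    # Slide's example: 10 lawyer-hours buys disk for ~50TB of academic literature.
    print("10 lawyer-hours ~", disk_equivalent_tb(10), "TB of disk")

    # Hypothetical inputs, for illustration only.
    big = legal_overhead_ratio(lawyer_hours=100, content_tb=20)
    small = legal_overhead_ratio(lawyer_hours=2, content_tb=0.01)
    print(f"large publisher: {big:.0f} TB of disk forgone per TB preserved")
    print(f"small publisher: {small:.0f} TB of disk forgone per TB preserved")
```

Even though the hypothetical small publisher needs far fewer lawyer-hours in total, its overhead per terabyte preserved is much higher, which is the effect the paragraph above describes.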

Looking Forwards

What are the non-problems?

Or rather, the problems not big enough to matter

What are the big problems?

Preserving the world the way it is now

Not the way it used to be

Finding enough money

And working out how much that is

Surviving not having enough money

By turning more things into non-problems

Non-Problems

Formats

Any format with an open-source renderer is not at risk

Metadata (at least for documents)

Hand-generated metadata

Too expensive, search is better & more up-to-date

Program-generated metadata

Why save the output? You can save the program!

There are extended discussions of the usefulness of format metadata in this post, and of the relative value of open source renderers as against format specifications in this post. For a discussion of the questionable value of format metadata for preservation see this post.

Services not Documents

"Preservation" implies static, isolated object

Web 0.9 is like reading a printed book

Web 1.0 dynamically inserts personalized adverts

No-one preserves the adverts, but they're important

With the Night Mail Rudyard Kipling (1905)

The Who Sell Out The Who (1967)

A Prairie Home Companion Garrison Keillor (1974-)

Web 2.0 is dynamic, interconnected

Each page view is unique, mash-ed up from services

Pages change as you watch them

What does it mean to preserve a unique, dynamic page?

For a discussion of the importance of context in preserving the Web see this post.

Things Worth Preserving

User Generated Content

To understand 2008 election you need to save blogs

To do that you need to save YouTube, photo sites, ...

So that the links to them keep working ...

Technical, legal, scale obstacles almost insuperable

Multi-player games & virtual worlds

Even if you could get the data and invest in the servers

They're dead without the community - Myst (1993)

Dynamic databases & links to them

e.g. Google Earth mash-ups - is Google Earth forever?

Do you remember Myst from 1993? It was a beautiful virtual world that you explored. Pretty soon you figured out that you were the only person there. Some time after that you figured out that the goal of the game was to figure out why you were the only person there. We've come a long way since then; Myst would not make it against World of Warcraft or Second Life.

For a discussion of the problem of preserving the materials future scholars will need to study elections, see this post.

Economics

2008 Preservation Buzzword: Sustainability

We can't afford to preserve the stuff we know how to

Future stuff will be much more expensive

There'll be a lot more bytes of it

Each byte will be more difficult & more expensive

Bytes vulnerable to money supply glitches

Data needs to be endowed if it is to survive hard times

Endowing up front means preserving less

Collection development: what must be kept?

But it has really bad scaling problems

Bytes are a lot more vulnerable to disruptions in the money supply than paper. They are like divers in old-fashioned diving suits, dependent on air continuously pumped down from the surface. We need to make preserved bytes more like SCUBA divers, carrying their own tank of air with them that only needs to be refilled at intervals. Endowing data is discussed in this post.

Digital Preservation Difficult

Conceptually

What does it mean to preserve dynamic content?

Technically

Need to preserve services not content. How?

Legally

Preservation requires permission

How do you even find everyone you need to ask?

Economically

Just storing the bits needs industrial infrastructure

Beyond resources of universities, national libraries

Are services like S3 reliable enough?

Alyssa Henry's FAST keynote, in which she offered numbers for availability but pointedly not for reliability, is discussed in this post.

Digital Preservation Important

Paper's attributes built in to society

Durable, write-once, tamper-evident, highly replicated, ...

Society needs fixed, tamper-evident record

E.g. laws, contracts, evidence, ...

Paper provides this as a side-effect

The Web is Winston Smith's dream machine

All govt. information on a single web server (FDsys)

Point-&-click to rewrite history

FDsys is discussed in this post.

Practical Next Steps

Everyone - just go collect the bits:

Not hard or costly to do a good enough job

Please use Creative Commons licenses

Preserve Open Source repositories:

Easy & vital: no legal, technical or scale barriers

Support Open Source renderers & emulators

Support research into preservation tech:

How to preserve bits adequately & affordably?

How to preserve this decade's dynamic web of services?

Not just last decade's static web of pages

Additional Material

Here is some additional material I prepared but which I cut to get down to the time allowed.

Did Documents Get Lost?

I was expecting a question asserting that I was wrong to suggest that formats in wide use in 1995 had not gone obsolete.

The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986).

I am sure that there are many formats that were in use in 1995 that are now difficult to render because current tools lack support for them. I have argued for a long time that there are few, if any, formats in wide use in 1995 that are difficult to render with current tools. I'm still looking for counter-examples.

But even if there were counter-examples, it wouldn't invalidate my case. It is easy to emulate 1995 PCs, and quite possible to emulate most other architectures current in 1995 using virtual machine technology. See, for example, this BBC story about a collaboration between Microsoft, the British Library and the British National Archives to access old formats by running virtual instances of old Microsoft operating systems and the relevant applications.

The only question is, did someone keep the bits for the operating system and the application as well as the document?

As regards media, the media in wide use in 1995 that are less common today are 3.5" floppies (still on the shelves at Fry's), ZIP drives (as I write this there are 306 of the original ZIP drives on eBay), and DAT tape (40 drives on eBay).

7 comments:

Sheila Morrissey said...: This comment has been removed by a blog administrator.; April 15, 2009 at 8:43 AM
David. said...: I'm grateful to Portico's Sheila Morrissey for setting out the conventional wisdom I was arguing against in my plenary talk. Unfortunately, the length of her comment appears to be inhibiting others from contributing to the discussion the talk was intended to spark.

I have deleted Sheila's comment but have preserved the text, which will shortly appear as a post, interspersed with my responses to her points. I hope others will now feel freer to comment at less length on this post.

Clearly, this is a blog where relatively lengthy comments add value. But the urge to expand one's thoughts in a comment needs to be balanced against the imposition on readers to scroll past one's contribution to find others' comments that they may find equally valuable. If the Daily Kos FAQ had been accessible from my present remote location, I would have linked to what I remember as valuable advice on the length to which a comment should grow before it should mutate into a post lest it disrupt the discussion.; April 23, 2009 at 3:29 AM
Sheila Morrissey said...: Readers who wish to see the response to David’s talk can find it here.; May 1, 2009 at 2:08 PM
David. said...: The promised post of Sheila Morrissey's comment with my responses is now available. I'm sorry for the delay; I had less Internet access and less time to work on it in my holiday hotels in Tasmania than I expected.; May 6, 2009 at 10:28 AM
David. said...: CNI has posted the video of Cliff's introduction, my talk, and the questions here.; July 13, 2009 at 4:09 PM
Matt said...: As someone who has worked in digital preservation at the UK National Archives for several years, I would like to correct a few misconceptions in the linked BBC story.

The volume of digital data managed by the UK National Archives is simply wrong by several orders of magnitude (this has done the rounds on the internet until it has now become a virtual fact!).

The formats we hold digital data in are mostly open standards, and even the proprietary formats are still fully readable using current software. We have never had lots of data in unreadable formats.

Microsoft did donate virtual machines with older and current versions of its operating system for us to experiment with in our preservation laboratory. However, the primary digital preservation strategy of the UK National Archives is still, and always has been, format migration.; July 29, 2009 at 7:15 AM
JKF said...: There are some interesting thoughts expressed in regards to lawyers. I ran across your post while doing some research on e-discovery for a case I'm currently working on and it really got me thinking about other issues as well.

But first back to the e-discovery aspect. As an attorney in this new digital age it has become more and more important that I'm able to track down records that may or may not be hanging out in cyberspace. For me preservation of these records is vital. Our cases often hinge on what we find during this discovery, so of course I am wanting companies to spend whatever is needed to make sure that they have secure storage of documents with adequate capacity for all needed storage.

The second issue this made me think of is the new rage of SAAS or software as a service. I know that I use two different providers to keep my client contact information, files, billing and basically everything in the "cloud". As more and more companies move toward this and more safe storage is needed, how will this affect the current landscape?

And finally a very small issue, but one which is important to firms such as mine, is how do we make sure that everything we ever post, comment on or make a note about in our professional lives on the internet is safely stored and retrievable? Again, as a lawyer I'm supposed to keep records of everything I ever do on my website. On my Memphis personal injury website/blog for example I have hundreds of pages of content that I'm changing up all the time to reflect new changes in the law. Under my state bar association's rule I'm supposed to keep a copy of each and every page, each and every time it is changed. These regulations are ridiculous and cause me a huge headache. But it's just another example of why reliable and affordable storage is so important.

Oh well, I'm rambling. But from the point of view of someone who is definitely not an expert in this field, I really enjoyed browsing through several of your posts.

Jami Ferrell - (just a simple lawyer trying to enjoy life, and better himself if at all possible); July 27, 2010 at 4:49 PM

Friday, April 10, 2009
