What is digital preservation and why is it interesting to work on? For millennia, society has relied on paper as its archival memory medium, its way to preserve future generations' access to information. Paper has many advantages for this task. It is cheap, needs no special equipment to read and, best of all, survives benign neglect very well. Put it in a box in the basement and it is good for 100 years.
Rothenberg's Dystopian Vision
But less and less of today's culture and science ever makes it to paper. In 1995 Jeff Rothenberg wrote an article in Scientific American that first drew public attention to the fact that digital media have none of these durable properties of paper. The experience he drew on was the rapid evolution of digital storage media such as tapes and floppy disks, and of applications such as word processors, each with their own incompatible format. His vision can be summed up as follows: documents are stored on off-line media that decay quickly and whose readers quickly become obsolete, as do the proprietary, closed formats in which the documents are written. As if this weren't enough, operating systems and hardware change quickly in ways that break the applications that render the documents.
Distrust of digital storage continues to this day. Cathy Marshall, a researcher at Microsoft, vividly describes the attitudes of ordinary users to the loss of their digital memories in a talk called "It's Like A Fire, You Just Have To Move On".
The Digital Dark Ages
The phrase "digital dark ages" describes the idea that future generations will lack information about our age because the bits did not survive. It gets about 8.4 million hits on Google. Rothenberg didn't use the phrase, but it appeared soon afterwards as a useful shorthand for the problems he was describing. Wikipedia, the first of the hits, traces it to a paper by Terry Kuny at the 1997 IFLA conference. Nowadays, it is the universal shorthand for journalists writing about digital preservation, for example in the recent American Scientist article by Kurt Bollacker.
Fifteen years is a long time in information technology. Today I'll be looking at the problems Rothenberg identified, asking whether they turned out to be as daunting as he predicted, or whether the real problems lay elsewhere. But first, let's look at the historical Dark Ages and ask whether this is a good analogy to start with.
The Dark Ages
To oversimplify greatly, the Dark Ages started with the collapse of the Roman Empire and ended with the Renaissance. The period is dark because it lacks a written record. This is odd, because one of the iconic images of the period is of a scriptorium, a room full of monks laboriously copying gloriously illustrated manuscripts. But another iconic image is of these monks fleeing as the invading Vikings put their monastery to fire and the sword.
The monks had two reasons for copying the manuscripts. First, to disseminate them; to make their contents available to scholars without the need for laborious travel. Second, to preserve them for posterity. The monks were keenly aware of the history that had already been lost, for example in the multiple fires that destroyed the Library of Alexandria, and of the risks to which their own library's manuscripts were exposed. They understood that making multiple copies and scattering them across the known world was the best way to guard against loss of information. Unfortunately, the process of copying and distribution was slow and expensive, limiting its effectiveness. Not as slow and expensive as the similar process by which Korean monks engraved the Buddhist Tripitaka into wooden blocks in response to the Mongol invasion; 52 million characters in 16 years without a single error.
Why Were The Ages Dark?
Notice that in both cases the monks were copying documents from times past. They were in many cases effective in ushering them through the dark ages. We still have many of these manuscripts. The reason the dark ages are considered dark is not because they obscured what came before them, but because they are themselves obscure. There is a lack of contemporary written materials. This is partly because the chronicles that were written were seen as less valuable than the works of the ancients, so less was invested in ensuring their survival by creating copies, but also because in the chaotic social conditions writers simply wrote fewer new works.
Although the improved social stability of the Renaissance meant there were fewer Vikings sacking libraries, that wasn't the reason that works from previous centuries survived better. Early in the Renaissance, Gutenberg radically reduced the cost and increased the bandwidth of the copying process - his press could print a page every 15 seconds. We focus on the ability of Gutenberg's press to create copies, but a pile of copies in Gutenberg's shop would have been hardly any safer than a single copy in a monastery library. There were still wars a-plenty. What made the difference was the distribution channel of publishers, booksellers and libraries that came into existence to spread the newly abundant copies across the world. Lots of copies at lots of libraries kept books safe for posterity.
This distribution channel provided writers with a very powerful motive to create new works, namely profit. The reason we know much more about the Renaissance than the Dark Ages is not primarily because the probability of a book written then surviving to today was greater, but more because writers had a better business model, so they wrote many more books.
Jeff Rothenberg's Scenario
Rothenberg is an excellent communicator, so he based his explanation of the problems of preserving digital information on a story, in which his descendants in 2045 find a CD supposedly containing clues to the location of his fortune. He described three problems they would encounter in their quest:
- Media Decay: By this he meant that although the medium itself survived, the information it carried would no longer be legible. Scribes went to great lengths to avoid this problem by using durable media such as vellum, papyrus, wood blocks and clay tablets, and using durable inks or engraving.
- Media Obsolescence: By this he meant that although both the medium and the information it carried survived, there was no device available that could read the information from the medium into a contemporary computer. The monks' reading device was the human eye, so they would not have recognized this exact problem. But a close analogy would be if the manuscript survived but the script in which it was written was no longer understood. Until it was deciphered, the Linear B script used to write Greek was an example. The related Linear A script still is.
- Format Obsolescence: By this he meant that although the medium and the information both survived, and a device was available to recover the information as a bit stream, no software could be found to render the bit stream legible to a reader. The analogy for the monks would be that the manuscript survived and the script could be read, but the language in which it was written was no longer understood.
Rothenberg described cures for each of the problems he identified:
- Media Decay could be cured simply by regularly refreshing the data, copying the bits from the old to a new instance of the medium.
- Media Obsolescence could be cured simply by regularly migrating the data, copying the bits from the medium facing obsolescence to a newer medium.
- Format Obsolescence could in principle be cured by migrating the document from the doomed format to a successor format, but each such conversion risks altering or losing information.

Because of this risk, Rothenberg preferred an alternative way to address the threat of format obsolescence, emulation. If both the bits of the document and the bits of the software that rendered it legible to a reader survived, then all that would be needed to access the document would be to run the software on the document. In order to do this in the absence of the original or compatible hardware, Rothenberg suggested using an emulator. This is a program that mimics the original hardware using currently available hardware.
Of course, creating an emulator requires the specifications of the computer to be emulated. Rothenberg proposed storing these specifications in electronic form, thus neatly reducing the problem to a previously unsolved one.
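The idea behind emulation can be shown with a minimal sketch. Everything below, including the instruction set of the "legacy machine" and the document it renders, is invented for illustration; real emulators model real processors, memory and I/O devices at far greater fidelity. The point is only that an ordinary program running on current hardware can stand in for vanished hardware:

```python
# Toy emulator: a Python program mimicking a hypothetical legacy machine.
# The EMIT/JMP/HALT instruction set is invented for illustration only.

def emulate(program, memory):
    """Fetch-decode-execute loop for the imaginary machine."""
    pc, output = 0, []
    while True:
        op, arg = program[pc]
        if op == "HALT":
            break
        elif op == "EMIT":          # copy one byte of the document to the "screen"
            output.append(chr(memory[arg]))
            pc += 1
        elif op == "JMP":           # transfer control to another instruction
            pc = arg
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return "".join(output)

# The preserved bits: the document itself, plus the software that renders it.
document = [ord(c) for c in "HELLO 2045"]
renderer = [("EMIT", i) for i in range(len(document))] + [("HALT", 0)]

print(emulate(renderer, document))  # HELLO 2045
```

If the document's bits and the renderer's bits both survive, an emulator like this is all a future reader needs; the catch, as the next paragraph notes, is specifying the machine to be emulated.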
Collecting Digital Content
In the wake of both Rothenberg's paper and the Web revolution, the combination of the threat of the digital dark ages and the importance of the on-line world to our culture has driven an explosion of interest in collecting and preserving access to digital information. There are programs, such as SHAMAN and PLANETS, researching details of the technology. Universities are setting up "institutional repositories" to collect the digital works of their scholars for posterity. National and research libraries are collecting everything from tweets to political web sites and legal blogs. Libraries and Google are digitizing books, laws and maps. New institutions, such as the Internet Archive, CLOCKSS and Portico have arisen to tackle tasks, such as preserving the history of the Web, and academic e-journals, that are beyond the capabilities of libraries acting alone. Science is evolving a "fourth paradigm" in which research consists of collecting and investigating vast datasets.
It is striking that, in most cases, these preservation efforts are based on Rothenberg's less-favored technique of format migration. Indeed, some are so concerned at the prospect of format obsolescence that they preemptively migrate the content they collect to a format they believe is less doomed than the original.
Rothenberg's article assumed a fifty-year interval between the creation of the digital document and the need to access it, in which the problems could arise. One would expect that if the problems were to be so significant on a fifty-year time scale, some of them would have happened to some of these collections in the fifteen years since he wrote, and thus that the cures would have been employed. What actually happened?
To illustrate the need for media to be refreshed, Rothenberg discussed attempting to read a fifty-year-old recordable CD, then a relatively new medium. Experience has shown that, in common with most digital media, the initial projections of their durability were greatly exaggerated. The CD-R would have had to be refreshed every 2-3 years, and this turns out to be the norm for digital collections. Although most are stored on disk or tape, these media have a long history of rapidly increasing density. Thus, even if it is not necessary for data reliability, data is copied to new media every few years to reduce non-media costs such as power and data center space. Each time content is copied to the latest media generation, it allows for storing perhaps four times the data with the same space, power, and media cost.
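The arithmetic behind that last point can be sketched as follows. The doubling time and refresh interval below are illustrative assumptions, not figures from the talk; with storage density doubling roughly every 18 months and a refresh every 3 years, each migration lands on media about four times denser:

```python
# Toy model of media refresh under exponential density growth.
# Assumptions (illustrative): density doubles every 18 months,
# and collections are refreshed to new media every 3 years.

def capacity_multiple(refresh_years: float, doubling_years: float) -> float:
    """How much denser the new media generation is at each refresh."""
    return 2 ** (refresh_years / doubling_years)

def media_cost_per_byte(generations: int, refresh_years: float,
                        doubling_years: float) -> float:
    """Relative per-byte media cost after N refresh cycles,
    normalized to 1.0 for the first generation."""
    return 1.0 / capacity_multiple(refresh_years, doubling_years) ** generations

m = capacity_multiple(3.0, 1.5)
print(f"each refresh lands on media {m:.0f}x denser")                    # 4x
print(f"per-byte media cost after 3 refreshes: "
      f"{media_cost_per_byte(3, 3.0, 1.5):.4f}")                         # 0.0156
```

Under these assumptions, refreshing is not a preservation burden but an economic opportunity: the same space, power and media budget holds four times the data each cycle.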
Based on experience with analog media such as 8-track tapes and early digital media such as 1/2" tape and 8" floppy disks, Rothenberg was skeptical that after 50 years a drive capable of reading the CD-R would be available. With hindsight, we might not agree. Precisely because the CD was the first successful consumer digital medium, fifteen years later compatible drives are universal. Successor media, such as DVD and Blu-Ray, have seen that the costs of providing CD compatibility were low and the benefits high.
In the pre-Web world digital information lived off-line, in media like CD-Rs. The copy-ability needed for media refresh and migration was provided by a reader extrinsic to the medium, such as a CD drive. In the Web world, information lives on-line. Copy-ability is intrinsic to on-line media; media migration is routine but insignificant. The details of the storage device currently in use may change at any time without affecting readers' access. Access is mediated not by physical devices such as CD readers but by network protocols such as TCP/IP and HTTP. These are the most stable parts of cyberspace. Changing them in incompatible ways is effectively impossible; even changing them in compatible ways is extremely hard. This is because they are embedded in huge numbers of software products which it is impossible to update synchronously and whose function is essential to the Internet's operation.
When Rothenberg proposed emulation as a preservation strategy, he assumed that the necessary emulators would have to be written by preservationists themselves. Instead, what has happened is that emulation, or more properly virtual machines, has become an integral part of the IT mainstream for reasons entirely unrelated to preservation. Thus we now have a wide range of emulators for every computer architecture of note, including in most cases an open-source emulator. Emulation as a preservation strategy looks even better than it did 15 years ago, because it now leverages mainstream IT technologies. If the bits of the application survive together with the data for the application, an emulation can easily be created to run it. In fact, the emulation can even be delivered to and run inside your browser.
Given the attention paid to the problem of format migration since 1995, it is striking that there is no record of any migration of a current format being necessary in that time. Format migrations have, of course, taken place, but they have all been either of a format that was already obsolete in 1995, or of a more recent format by way of a test, demonstration or preemption. The reason is that no formats in wide use in 1995 have gone obsolete since then; they are all still easily legible.
The Web Revolution
Rothenberg's article was published in January 1995, at the very start of the Web revolution. He was writing from the perspective of desktop publishing, observing that the personal computer had provided everyone with a printing press like Gutenberg's; an easy way to get their thoughts on to paper. In the long term, storing these desktop publishing files, which were what he thought of as "digital documents", instead of the paper output was the cause of the problems he identified.
What Rothenberg could not see was that the documents lacked a distribution channel; the digital equivalent of the service publishers, booksellers and libraries provided of getting the books Gutenberg printed into readers' hands. The revolution of the Web was not the means of turning keystrokes into bits and then into print, but the means to get those bits to interested readers worldwide without ever meeting paper.
The audience for which desktop publishing files were intended was the desktop publishing software itself. Their raison d'être was to save the state of the software to enable future editing, not to communicate with future readers. They were like the diaries Samuel Pepys wrote by hand in a secret shorthand, never intending them to be read by others.
In contrast, information on the Web is published, intended to be read by others using whatever software they like, not solely the software that created it. It resembles the way we read Pepys' diaries, transcribed into Latin script, printed, published and distributed via booksellers and libraries. Note that Pepys was a notable patron of this distribution channel, amassing through his life a wonderful library of 3000 books, which survives to this day in the bookcases he had built for it, in a room in Magdalene College, Cambridge. Make sure to visit it on your next trip to Cambridge.
Software Evolution & The Web
The key thing that changes once information is published in the Web sense is that no-one, least of all the publisher, controls the software with which it will be read. Making incompatible changes to a single version of the reading software doesn't render the information inaccessible; readers will simply use another version, or even the original, pre-change version. What it does do is reduce the value of the changed software. We see this in practice. In January 2008 Microsoft announced that it would remove Office's support for some really old formats. Customers rebelled, and less than a week later Microsoft retracted the announcement.
This ability to publish information has greatly increased the flow of open-source software. This is important for digital preservation in a number of ways:
- The distributed nature of open-source software development means that, for the same reasons that published document formats tend not to change in incompatible ways, neither does open source software.
- If a proprietary, closed-source document format becomes popular, there are strong incentives for an open-source product to reverse-engineer it. Thus it is rare for a popular format to lack an open-source interpreter. This is true even of formats protected by digital rights management technologies. The legality of such interpreters may be questionable, but if they didn't work the content owners would not waste time pursuing them in the courts.
- Open source software is itself immune from format obsolescence, being written in ASCII. It is carefully preserved in source code control systems such as SourceForge, from which the state of the software at any time in the past can be reconstructed. It is, therefore, hard to create a plausible scenario in which any format with an open-source interpreter could possibly go obsolete.
In February 2009 my friend Kirk McKusick received a prestigious award from the IEEE for his 30 years' stewardship of the Unix File System. After 30 years disks are a million times bigger, and the code is four times bigger, much faster and more reliable. But it would still read every disk it has ever written, assuming that the disk itself had survived. This isn't a heroic effort of digital preservation; it is an illustration of the fact that making incompatible changes to widely used software is impractical, so it almost never happens.
So No Digital Dark Ages?
Digital formats in current use are very unlikely ever to go obsolete. Even if they do, they have open-source interpreters that can be used to resurrect them. In the unlikely event that this fails to restore legibility, the entire computing environment that was current at the time the format was legible can easily be emulated. Do these three layers of defense mean that there is no prospect of a "digital dark age"?
One way of looking at it is that there was a digital dark age, but that it was in the past before the Gutenberg-like transformation wrought by the Web. It is true that a good deal of the pre-1995 digital world can only be accessed via heroic acts of software archeology. The digital monks hand-crafting this early software lacked a distribution channel, so were apt to change things in incompatible ways. Rothenberg quite reasonably projected this experience into the future.
Another way of looking at it is that those who followed in Rothenberg's footsteps simply failed to notice many of the threats which cause loss or damage to stored digital information. Here, from a 2005 paper, is our list in two halves. First, the threats that are usually invoked when digital preservation is discussed, and second the threats that people who actually run large-scale storage facilities will tell you are the ones that actually cause the majority of data loss.
The Real Problems Were ...
Instead of media and format obsolescence of off-line information leading to a digital dark ages, experience has shown that most of the real difficulties in collecting digital content and preserving it against this broad range of threats stem from the on-line nature of the Web and the extraordinarily powerful business models it enables:
- Scale - the monks of the Dark Ages didn't create much new content. Today we have the opposite problem, the amounts of new content are so large as to overwhelm our ability to select and curate. Only highly automated processes, such as those used by the Internet Archive and Google, can begin to keep up.
- Cost - the vast scale of the problem means that even efforts to address small parts of the whole problem need major funding.
- Copyright - the vast sums to be made on the Web together with the ease of copying digital information have led content owners to demand, and legislatures to grant, excessively strong intellectual property rights. These rights often interfere with the historic mission of libraries and archives to preserve cultural history.
- Dynamism - the one truly new aspect of the Web, enabled by its nature as an effectively instantaneous bi-directional communications medium, is that its content can be dynamic and individualized, with each reader having a different experience at each access. What does it mean to preserve something that is different every time you look at it?
It was not the direct effects of documents in digital form that caused the difficulties in preserving them, it was the indirect effects of the distribution and communications channel that digital documents enabled.
Rothenberg's focus was on micro-level preservation; a single document on a single CD. But, even if his approach worked perfectly, we can't simply apply it one at a time to each of the billions of web pages and other digital documents we would like to preserve. After all, digital documents nowadays live not on CDs but in the "cloud", or rather in data centers the size of car factories using as much power as an aluminum smelter. Keeping one copy of one important database on-line costs $1M/year. We can't afford handcrafted, artisan digital preservation. We need industrial scale preservation, with tools such that a single digital curator can collect and preserve millions of documents in a day's work.
It isn't enough to preserve billions of individual documents. The number one lesson from Google is that there is more value in the connections between documents than in the documents themselves. This is another instance of Metcalfe's Law; the value of a network goes as the square of the number of nodes. An isolated document is a network of one node. If we preserve the documents in a way that loses the connections between them, we have failed to preserve the more important part of the works. We have to preserve not individual documents, but vast collections of documents with their inter-connections intact.
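The arithmetic of Metcalfe's Law makes the stakes concrete. This is an illustrative toy calculation, not a measurement, and the million-document collection size is an assumption:

```python
# Metcalfe's Law: a network's value grows as the square of its node count.
# Preserving documents while discarding the links between them therefore
# destroys almost all of the value.

def metcalfe_value(nodes: int) -> int:
    """Value of a network under Metcalfe's Law (proportional to n^2)."""
    return nodes ** 2

n = 1_000_000                       # a million inter-linked documents (assumed)
linked = metcalfe_value(n)          # preserved as one connected collection
isolated = n * metcalfe_value(1)    # preserved as n one-node "networks"

print(f"value retained without links: {isolated / linked:.0e}")  # 1e-06
```

On this crude model, stripping the links from a million-document collection retains only a millionth of its value, which is why collections must be preserved with their inter-connections intact.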
The other lesson from Google is, it's expensive. How expensive are we talking here? Let's do a back-of-the-envelope using two major preservation efforts whose approaches are opposite extremes.
Scale Implies Cost
The Internet Archive is the lowest cost per byte preservation approach we know. It crawls the Web monthly and keeps it. Google crawls the Web at least monthly and throws the previous copy away, so you can see the Internet Archive has a big problem. It is currently over 2PB and growing about a quarter of a petabyte a year. Its costs are $10-14M/yr or about 50 cents per GB per year.
Portico collects the academic literature. They have yet to collect it all, but if they had it would be about 50TB and growing about 5TB per year. They are a $6-8M/yr operation, so their costs are more than $10 per GB per year.
Thus there is a 20-fold range of costs per GB per year for current digital preservation technologies, from 50 cents to $10.
How Many $ Do We Need?
This range goes from keeping a couple of copies to obsessively verifying and preemptively migrating the formats. Portico should be more expensive than the Internet Archive, it's doing a lot more stuff. The question is whether the benefits of the extra stuff it is doing are worth paying 20 times as much for every byte.
How much data do we need to save? An exabyte? That would be one in every 200,000 bytes that will be generated next year. With the Internet Archive's cost structure it would cost five billion dollars a year. With Portico's cost structure it would cost one hundred billion dollars a year. No-one believes that the world has $100B/yr to spend on this, or even $5B/yr. We can say definitely that less than one in a million bytes is going to be preserved.
The downside of the enormously effective distribution channel of the Web is that most content that anyone would think worth giving the one-in-a-million chance to survive is somehow making someone money, whether by subscription or advertising.

Lawyers won't risk that income, and they have massaged the law so that even if the content isn't making money, the mere possibility that it could make money at some time in the future leaves you with only two options. You can take the "so, sue me" approach of the Internet Archive, which means that anyone claiming to be the copyright owner can remove the content simply by writing a letter. Or, even if the content is open access, you must get permission from the copyright owner, which means talking to their lawyers. Talking to lawyers is expensive; 5 hours of 1 lawyer would pay for enough disk to hold the entire academic literature.

The bad effect of this is that preservation effort is focused on content with a high ratio of bytes to lawyer hours. It takes a lot of lawyer hours to get permission to preserve a big publisher's content, but in return you get a lot of bytes. The big publisher isn't going away any time soon, so this effort preserves stuff that isn't at risk. The stuff at risk is from small or informal publishers. Generally, if you can find them, they easily agree that their stuff should be preserved. But they don't have much stuff, so their bytes per lawyer hour are low.
You can do your part to help solve this aspect of the problem. Please, for anything you publish, use a Creative Commons license. All the Creative Commons licenses allow everything needed for preservation.
Services not Documents
The one truly new thing about the Web is that documents are no longer static objects that can be copied once so that the copy preserves the original. This stopped being true in the very early days of the Web as pages started to contain advertisements. Each viewer saw a different ad. But no-one preserving Web pages preserves the ads. They aren't considered important. But here are examples of works of art containing advertisements. Everyone would agree that preserving these requires preserving the ads. So why are ads important only when they aren't advertising real products?
Nowadays, the Web is far more dynamic than static pages with ads inserted. Pages are created on the fly by layers of dynamic services.
Here, for example, is the real estate site Zillow. It mashes up Microsoft's Bing, the real estate listing service MLS, county tax and land registry data, ad services, and its own proprietary information. Note that the ads are an essential part of the service. What you are seeing is a small part of Palo Alto as of a day in late March. I am certainly the only person ever to see this page. Collecting this page doesn't begin to preserve Zillow.
Services & IP
Even if your lawyers persuaded Zillow's lawyers to allow you to preserve their internal database, it changes every day. And it is only a small part of the service. Zillow can't give you permission to preserve data from Bing, or MLS. The IP issues around dynamic services are at least as bad as the technical ones. You can't legally preserve even a simple web page with ads. The owner of the web page typically can't tell you who owns the ads, because he has no way of knowing what ads you would see if you visited the page. After visiting the page, you probably can't tell who owns the ads.
Web 2.0 pages mashed up from multiple services are even worse. Each of the layers on layers of services is a separate business relationship. Layer N+1 probably knows nothing about layer N-1. And none of the terms and conditions implementing the relationships envisages the possibility of preservation.
Things Worth Preserving
But we can't just ignore dynamic content. It isn't just ads. Future scholars won't understand today's politics without blogs. But blogs typically leave ownership of posts and comments with their authors, who are in many cases anonymous. And the blogs point to other services, such as YouTube, Twitter and so on, each posing its own legal and technical problems.
Efforts are being made to collect blogs, YouTube and Twitter. But even to the extent that they succeed, they are isolated; the links in the preserved blogs won't point to the preserved YouTube videos. Other important content isn't being collected. The 2008 elections featured political ads and debates in multi-player games and virtual worlds. Does anyone here remember Myst? It was a beautifully rendered virtual world that you could explore. Pretty soon, you figured out that you were alone in the virtual world, and a bit later you figured out that the point of the game was to figure out why you were alone in it. Myst can be preserved, but preserving other worlds as a static snapshot, without their community, misses the point by reducing them to Myst-like worlds.
Many sites these days base themselves on Google services such as Blogger or Google Earth. Google isn't going to give you permission to preserve them, or access to the technology that runs them, but even if it did you couldn't afford the infrastructure to do the job. But if you preserve the site that uses the services, are they going to be there when a future scholar wants to visit the preserved site?
Digital dark ages turns out to be a poor analogy for the situation we face today. It may be that the digital dark ages are behind us, the pre-Web era of disappearing formats and off-line media. The advent of the Web was the equivalent of Gutenberg's role in a digital renaissance. No-one can claim that our age isn't producing enough material for future scholars.
Of course, the vast majority of these documents will not survive for future readers. The reason is not that their formats or their media will decay. Rather, it is an even worse version of the reason that only a small proportion of material from the Renaissance did. No-one can afford to keep everything. Because bits, unlike paper, don't survive benign neglect well, and because of the enormous scale of digital publishing, a byte has much less than a one-in-a-million chance of being chosen for preservation.
Looking ahead, the real problem we're facing is a fundamental change in the publishing paradigm that has been with us since Gutenberg. Documents are no longer static objects that can be copied in such a way that any reader will find any of the copies equivalent. Each reader sees a unique object that is created on the fly, and that changes continually as the reader looks at it. The problems that may lead to a future digital dark age are much more fundamental than details of formats or storage media; they lie in a transformation of the very nature of documents.