I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford Libraries. We've been sustainably preserving digital information for a reasonably long time, and I'm here to talk about some of the lessons we've learned along the way that are relevant for research data.
In May 1995 Stanford Libraries' HighWire Press pioneered the shift of academic journals to the Web by putting the Journal of Biological Chemistry on-line. Almost immediately librarians, who pay for this extraordinarily expensive content, saw that the Web was a far better medium than paper for their mission of getting information to current readers. But they have a second mission, getting information to future readers. There were both business and technical reasons why, for this second mission, the Web was a far worse medium than paper:
- The advent of the Web forced libraries to change from purchasing a copy of the content to renting access to the publisher's copy. If the library stopped paying the rent, it would lose access to the content.
- Because in the Web the publisher stored the only copy of the content, and because it was on short-lived, easily rewritable media, the content was at great risk of loss and damage.
The LOCKSS Program started in October 1998 with the goal of replicating the paper library system for the Web. We built software that allowed libraries to deploy a PC, a LOCKSS box, that was the analog for the Web of the paper library's stacks. By crawling the Web, the box collected a copy of the content to which the library subscribed and stored it. Readers could access their library's copy if for any reason they couldn't get to the publisher's copy. Boxes at multiple libraries holding the same content cooperated in a peer-to-peer network to detect and repair any loss or damage.
The program was developed and went into early production with initial funding from the NSF, and then major funding from the Mellon Foundation, the NSF and Sun Microsystems. But grant funding isn't a sustainable business model for digital preservation. In 2005, the Mellon Foundation gave us a grant with two conditions; we had to match it dollar-for-dollar and by the end of the grant in 2007 we had to be completely off grant funding. We made both conditions, and we have (with one minor exception which I will get to later) been off grant funding and in the black ever since. The LOCKSS Program has two businesses:
- We develop, and support libraries that use, our open-source software for digital preservation. The software is free, libraries pay for support. We refer to this as the "Red Hat" business model
- Under contract to a separate not-for-profit organization called CLOCKSS run jointly by publishers and libraries, we use our software to run a large dark archive of e-journals and e-books. This archive has recently been certified as a "Trustworthy Repository" after a third-party audit which awarded it the first-ever perfect score in the Technologies, Technical Infrastructure, Security category.
- Do nothing. In that case we can stop worrying about bit rot, format obsolescence, operator error and all the other threats digital preservation systems are designed to combat. These threats are dwarfed by the threat of can't afford to preserve. It is going to mean that more than 50% of the stuff that should be available to future readers isn't.
- Double the budget for digital preservation. This is so not going to happen. Even if it did, it wouldn't solve the problem because, as I will show, the cost per unit content is going to rise.
- Halve the cost per unit content of current systems. This can't be done with current architectures. Yesterday morning I gave a talk at the Library of Congress describing a radical re-think of long-term storage architecture that might do the trick. You can find the text of the talk on my blog.
As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the cost research is that ingest takes half the lifetime cost, preservation takes one third, and access takes one sixth.
Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.
Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.
This graph, from Preeti Gupta of UC Santa Cruz plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.
Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken in to account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.
You may think that cloud storage is a competitive market; in fact it is dominated by Amazon. When Google recently started to get serious about competing, they pointed out that Amazon's margins on S3 may have been minimal at introduction, by then they were extortionate:
cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.
In fact commercial cloud storage is a trap. It is free to put data in to a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years. If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.
The storage part of preservation isn't the only on-going cost that will be much higher than people expect, access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this is a difficulty, the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.
In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.
But the advent of "Big Data" techniques mean that, going forward, scholars increasingly want not to access a few individual items in a collection, but to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library of Congress can afford to do with the feed is to write two copies to tape. There's no way they can afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.
The real problem here is that scholars are used to having free access to library collections, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.
For smaller collections, Amazon provides Free Public Datasets, Amazon stores a copy of the data with no charge, charging scholars accessing the data for the computation rather than charging the owner of the data for storage.
Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requests would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost, the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud. they aren't locked-in.
One thing it should be easy to agree on about digital preservation is that you have to do it with open-source software; closed-source preservation has the same fatal "just trust me" aspect that closed-source encryption (and cloud storage) suffer from. Sustaining open source preservation software is interesting, because unlike giants like Linux, Apache and so on it is a niche market with little commercial interest.
We have managed to sustain open-source preservation software well for 7 years, but have encountered one problem. This brings me to the exception I mentioned earlier. To sustain the free software, paid support model you have to deliver visible value to your customers regularly and frequently. We try to release updated software every 2 months, and new content for preservation weekly. But this makes it difficult to commit staff resources to major improvements to the infrastructure. These are needed to address problems that don't impact customers yet, but will in a few years unless you work on them now.
The Mellon Foundation supports a number of open-source initiatives, and after discussing this problem with them they gave us a small grant specifically to work on enhancements to the LOCKSS system such as support for collecting websites that use AJAX, and for authenticating users via Shibboleth. Occasional grants of this kind may be needed to support open-source preservation infrastructure generally, even if pay-for-support can keep it running.
Unfortunately, economics aren't the only hard problem facing the long-term storage of data. There are serious technical problems too. Lets start by examining the technical problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.
Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10-11. If the average object is 10K bytes, the bit half-life is about a million years, way too short to meet the requirement but still really hard to measure.
Note that the 10-11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.
Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
- Media failure
- Hardware failure
- Software failure
- Network failure
- Natural Disaster
- Operator error
- External Attack
- Insider Attack
- Economic Failure
- Organizational Failure
Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.
However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 150th most visited site, whereas loc.gov is the 1519th. For UK users archive.org is currently the 131st most visited site, whereas bl.uk is the 2744th.
Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.
To sum up, the good news is that sustainable preservation of digital content such as research data is possible, and the LOCKSS Program is an example.
The bad news is that people's expectations are way out of line with reality. It isn't possible to preserve nearly as much as people assume is already being preserved, nearly as reliably as they assume it is already being done. This mismatch is going to increase. People don't expect more resources yet they do expect a lot more data. They expect that the technology will get a lot cheaper but the experts no longer believe it will.
Research data, libraries and archives are a niche market. Their problems are technologically challenging but there isn't a big payoff for solving them, so neither industry nor academia are researching solutions. We end up cobbling together preservation systems out of technology intended to do something quite different, like backups.