Monday, April 7, 2014

What Could Possibly Go Wrong?

I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.

I'm David Rosenthal and I'm an engineer. I'm about two-thirds of a century old. I wrote my first program almost half a century ago, in Fortran for an IBM 1401. Eric Allman invited me to talk; I've known Eric for more than a third of a century. About a third of a century ago Bob Sproull recruited me for the Andrew project at C-MU, where I worked on the user interface with James Gosling. I followed James to Sun to work on window systems, both X, which you've probably used, and a more interesting one called NeWS that you almost certainly haven't. Then I worked on operating systems with Bill Shannon, Rob Gingell and Steve Kleiman. More than a fifth of a century ago I was employee #4 at NVIDIA, helping Curtis Priem architect the first chip. Then I was an early employee at Vitria, the second company of JoMei Chang and Dale Skeen, founders of the company now called Tibco. One seventh of a century ago, after doing 3 companies, all of which IPO-ed, I was burnt out and decided to ease myself gradually into retirement.

Academic Journals and the Web

It was a total failure. I met Vicky Reich, the wife of the late Mark Weiser, CTO of Xerox PARC. She was a librarian at Stanford, and had been part of the team which, nearly a fifth of a century ago, started Stanford's HighWire Press and pioneered the transition of academic journals from paper to the Web.

In the paper world, librarians saw themselves as having two responsibilities, to provide current scholars with the materials they needed, and to preserve their accessibility for future scholars. They did this through a massively replicated, loosely coupled, fault-tolerant, tamper-evident, system of mutually untrusting but cooperating peers that had evolved over centuries. Libraries purchased copies of journals, monographs and books. The more popular the work, the more replicas were stored in the system. The storage of each replica was not very reliable; libraries put them in the stacks and let people take them away. Most times the replicas came back, sometimes they had coffee spilled on them, and sometimes they vanished. Damage could be repaired via inter-library loan and copy. There was a market for replicas; as the number of replicas of a work decreased, the value of a replica in this market increased, encouraging librarians who had a replica to take more care of it, by moving it to more secure storage. The system resisted attempts at censorship or re-writing of history precisely because it was a loosely coupled peer-to-peer system; although it was easy to find a replica, it was hard to find all the replicas, or even to know exactly how many there were. And although it was easy to destroy a replica, it was fairly hard to modify one undetectably.

The transition of academic journals from paper to the Web destroyed two of the pillars of this system, ownership of copies, and massive replication. In the excitement of seeing how much more useful content on the Web was to scholars, librarians did not think through the fundamental implications of the transition. The system that arose meant that they no longer purchased a copy of the journal, they rented access to the publisher's copy. Renting satisfied their responsibility to current scholars, but it couldn't satisfy their responsibility to future scholars.

Librarians' concerns reached the Mellon Foundation, who funded exploratory work at Stanford and five other major research libraries. In what can only be described as a serious failure of systems analysis, the other five libraries each proposed essentially the same system, in which they would take custody of the journals. Other libraries would subscribe to this third-party archive service. If they could not get access from the original publisher and they had a current subscription to the third-party archive they could access the content from the archive. None of these efforts led to a viable system because they shared many fundamental problems including:
  • Libraries such as Harvard were reluctant to outsource a critical function to a competing library such as Yale. On the other hand, funders were reluctant to pay for more than one archive.
  • Publishers were reluctant to deliver their content to a library in order that the library might make money by re-publishing the content to others. This made the contract negotiations necessary to obtain content from the publishers time-consuming and expensive.
  • The concept of a subscription archive was not a solution to the problem of post-cancellation access; it was merely a second instance of exactly the same problem.
One of the problems I had been interested in at Sun and then again at Vitria was fault-tolerance. To a computer scientist, it was a solved problem. Byzantine Fault Tolerance (BFT) could prove that 3f+1 replicas could survive f simultaneous faults. To an engineer, it was not a solved problem. Two obvious questions were:
  • What is the probability that my system will encounter f simultaneous faults?
  • How could my system recover if it did?
There's a very good reason why suspension bridges use stranded cables. A solid rod would be cheaper, but the bridge would then have the same unfortunate property as BFT. It would work properly up to the point of failure, which would be sudden, catastrophic and from which recovery would be impossible.
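The cliff-edge behaviour of the BFT bound can be made concrete in a few lines of Python. This is only a sketch of the 3f+1 arithmetic quoted above; the function names are mine for illustration and do not come from any BFT library.

```python
def replicas_needed(f: int) -> int:
    """Replicas required by the 3f+1 bound to survive f simultaneous faults."""
    return 3 * f + 1


def faults_tolerated(n: int) -> int:
    """Largest f such that 3f+1 <= n (0 if there are too few replicas)."""
    return max((n - 1) // 3, 0)


for n in (4, 7, 10):
    f = faults_tolerated(n)
    # The proof covers f faults; fault number f+1 gets no guarantee at all.
    print(f"{n} replicas: guaranteed up to {f} faults, nothing promised at {f + 1}")
```

Nothing in the bound says how likely fault number f+1 is, or what happens after it, which is exactly the engineer's complaint.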

I have long thought that the fundamental challenge facing system architects is to build systems that fail gradually, progressively, and slowly enough for remedial action to be effective, all the while emitting alarming noises to attract attention to impending collapse. In a post-Snowden world it is perhaps superfluous to say that these properties are especially important for failures caused by external attack or internal subversion.

The LOCKSS System

As Vicky explained the paper library system to me, I came to see two things:
  • It was a system in the physical world that had a very attractive set of fault-tolerance properties.
  • An analog of the paper system in the Web world could be built that retained those properties.
With a small grant from Michael Lesk, then at the NSF, I built a prototype system called LOCKSS (Lots Of Copies Keep Stuff Safe), modelled on the paper library system. By analogy with the stacks, libraries would run what you can think of as a persistent Web cache with a Web crawler which would pre-load the cache with the content to which the library subscribed. The contents of each cache would never be flushed, and would be monitored by a peer-to-peer anti-entropy protocol. Any damage detected would be repaired by the Web analog of inter-library copy. Because the system was an exact analog of the existing paper system, the copyright legalities were very simple.

The Mellon Foundation, and then Sun and the NSF funded the work to throw my prototype away and build a production-ready system. The interesting part of this started when we discovered that, as usual with my prototypes, the anti-entropy protocol had gaping security holes. I worked with Mary Baker and some of her students in CS, Petros Maniatis, Mema Roussopoulos and TJ Giuli, to build a real P2P anti-entropy protocol, for which we won Best Paper at SOSP a tenth of a century ago.

The interest in this paper is that it shows a system, albeit in a restricted area of application, that has a high probability of failing slowly and gradually, and of generating alarms in the case of external attack, even from a very powerful adversary. It is a true P2P system with no central control, because that would provide a focus for attack. It uses three major defensive techniques:
  • Effort-balancing, to ensure that the computational cost of requesting a service from a peer exceeds the computational cost of satisfying the request. If this condition isn't true in a P2P network, the bad guy can wear the good guys down.
  • Rate-limiting, to ensure that the rate at which the bad guy can make bad things happen can't make the system fail quickly.
  • Lots of copies, so that the anti-entropy protocol can work with samples of the population of copies. Randomly sampling the peers makes it hard for the bad guy to know which peers are involved in which operations.
Recent DDoS attacks, such as the 400Gbps NTP Reflection attack on CloudFlare, have made clear the importance of rate-limiting to services such as DNS and NTP.
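To give a feel for the rate-limiting idea, here is a generic per-peer limiter in Python. It is a minimal sketch and not the LOCKSS protocol: the class name, the fixed interval and the peer identifiers are all assumptions made for illustration, and a real system also has to cope with an adversary who controls many identities.

```python
class PerPeerRateLimiter:
    """Accept at most one request per peer every `min_interval` seconds."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_accepted = {}  # peer id -> time of last accepted request

    def allow(self, peer: str, now: float) -> bool:
        last = self._last_accepted.get(peer)
        if last is not None and now - last < self.min_interval:
            return False  # too soon: refusing costs the defender almost nothing
        self._last_accepted[peer] = now
        return True


limiter = PerPeerRateLimiter(min_interval=60.0)
# One peer hammering us at t = 0, 1, 30 and 61 seconds.
decisions = [limiter.allow("peer-a", t) for t in (0.0, 1.0, 30.0, 61.0)]
print(decisions)
```

However fast the peer asks, the damage it can do per unit time is bounded by the interval, which is what buys time for alarms to be noticed.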

Now, our free, open source, peer-to-peer digital preservation system is in use at around 150 libraries worldwide. The program has been economically self-supporting for nearly 7 years using the "RedHat" model of free software and paid support. In addition to our SOSP paper, the program has published research into many aspects of digital preservation.

The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at current prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk. The median data item in the Global LOCKSS network has copies at a couple of dozen peers.

I doubt that copyright is high on your list of design problems. You may be wrong about that, but I'm not going to argue with you. So, the rest of this talk will not be about the LOCKSS system as such, but about the lessons we've learned in the last 15 years that are applicable to everyone who is trying to store digital information for the long term. The title of this talk is the question that you have to keep asking yourself over and over again as you work on digital preservation, "what could possibly go wrong?" Unfortunately, once I started writing this talk, it rapidly grew far too long for lunch. Don't expect a comprehensive list, you're only getting edited low-lights.

Stuff is going to get lost

Let's start by examining the problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
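The half-life figure can be checked with a short calculation. The sketch below assumes a Petabyte of 10^15 bytes and an age of the universe of about 13.8 billion years; both are my assumptions, and the variable names are purely illustrative.

```python
import math

BITS = 8 * 10**15            # assumption: 1 Petabyte = 10**15 bytes
YEARS = 100                  # the century spent in the black box
UNIVERSE_AGE_YEARS = 13.8e9  # assumption: commonly quoted estimate

# Each bit survives t years with probability 2 ** (-t / half_life).
# All BITS bits surviving with probability 1/2 means
#   2 ** (-YEARS * BITS / half_life) == 1/2,  so  half_life = YEARS * BITS.
half_life_years = YEARS * BITS
ratio = half_life_years / UNIVERSE_AGE_YEARS

# Sanity check: plug the half-life back in and recover the 50% target.
p_all_survive = math.exp(-math.log(2) * YEARS * BITS / half_life_years)

print(f"half-life ~ {half_life_years:.1e} years")
print(f"that is ~ {ratio / 1e6:.0f} million times the age of the universe")
```

Under these assumptions the ratio comes out in the tens of millions, in line with the "about 60 million" quoted above.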

At scale, storing realistic amounts of data for human timescales is an unsolvable problem. Some stuff is going to get lost. This shouldn't be a surprise; even in the days of paper, stuff got lost. But the essential information needed to keep society running, to keep science progressing, to keep the populace entertained was stored very robustly, with many copies on durable, somewhat tamper-evident media in a fault-tolerant, peer-to-peer, geographically and administratively diverse system.

This is no longer true. The Internet has, in the interest of reducing costs and speeding communication, removed the redundancy, the durability and the tamper-evidence from the system that stores society's critical data. It's now all on spinning rust, with hopefully at least one backup on tape covered in rust.

Two weeks ago, researchers at Berkeley co-authored a paper in which they reported that:
a rapid succession of coronal mass ejections ... sent a pulse of magnetized plasma barreling into space and through Earth’s orbit. Had the eruption come nine days earlier, when the ignition spot on the solar surface was aimed at Earth, it would have hit the planet, potentially wreaking havoc with the electrical grid, disabling satellites and GPS, and disrupting our increasingly electronic lives. ... A study last year estimated that the cost of a solar storm like [this] could reach $2.6 trillion worldwide.
Most of the information needed to recover from such an event exists only in digital form on magnetic media. These days, most of it probably exists only in "the cloud", which is this happy place immune from the electromagnetic effects of coronal mass ejections and very easy to access after the power grid goes down.

How many of you have read the science fiction classic The Mote In God's Eye by Larry Niven and Jerry Pournelle? It describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles, the best they can do is to try to ensure that each collapse starts from a higher level than the one before by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.

Larry and Jerry were writing a third of a century ago, but in the light of this week's IPCC report, they are starting to look uncomfortably prophetic. The problem we face is that, with no collective memory of a societal collapse, no-one is willing to pay either to fend it off or to build the museums to pass knowledge to the successor society.

Why is stuff going to get lost?

One way to express the "what could possibly go wrong?" question is to ask "against what threats are you trying to preserve data?" The threat model of a digital preservation system is a very important aspect of the design which is, alas, only rarely documented. In 2005 we did document the LOCKSS threat model. Unfortunately, we didn't consider coronal mass ejections or societal collapse from global warming.

We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure

How much stuff is going to get lost?

The more we spend per byte, the safer the bytes are going to be. Unfortunately, this is subject to the Law of Diminishing Returns; each successive nine of reliability is exponentially more expensive than the last. We don't have an unlimited budget, so we're going to have to trade off cost against the probability of data loss. To do this we need models to predict the cost of storing data using a given technology, and models to predict the probability of that technology losing data. I've worked on both kinds of model and can report that they're both extremely difficult.

Models of Data Loss

There's quite a bit of research, from among others Google, C-MU and BackBlaze, showing that failure rates of storage media in service are much higher than the rates claimed by the manufacturers' specifications. Why is this? For example, the Blu-Ray disks Facebook is experimenting with for cold storage claim a 50-year data life. No-one has seen a 50-year-old DVD disk, so how do they know?

The claims are based on a model of the failure mechanisms and data from accelerated life testing, in which batches of media are subjected to unrealistically high temperature and humidity. The model is used to extrapolate from these unrealistic conditions to the conditions to be encountered in service. There are two problems: the conditions in service typically don't match those assumed by the models, and the models only capture some of the failure mechanisms.

These problems are much worse when we try to model not just failures of individual media, but of the entire storage system. Research has shown that media failures account for less than half the failures encountered in service; other components of the system such as buses, controllers, power supplies and so on account for the rest. But even models that include these components exclude many of the threats we identified, from operator errors to coronal mass ejections.

Even more of a problem is that the threats, especially the low-probability ones, are highly correlated. Operators are highly likely to make errors when they are stressed coping with, say, an external attack. The probability of economic failure is greatly increased by, say, insider abuse. Modelling these correlations is a nightmare.

It turns out that economics are by far the largest cause of data failing to reach future readers. A month ago I gave a seminar in the I-school entitled The Half-Empty Archive, in which I pulled together the various attempts to measure how much of the data that should be archived is being collected by archives, and assessed that it was much less than half. No-one believes that archiving budgets are going to double, so we can be confident that the loss rate from being unable to afford to collect is at least 50%. This dwarfs all other causes of data loss.

Let's Keep Everything For Ever!

Digital preservation has three cost areas: ingest, preservation and dissemination. In the seminar I looked at the prospects for radical cost decreases in all three, but I assume that the one you are interested in is storage, which is the main cost of preservation. Everyone knows that, if not actually free, storage is so cheap that we can afford to store everything for ever. For example, Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."
Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely.

Kryder's Law

They believe this because they lived through a third of a century of Kryder's Law, the analog of Moore's Law for disks. Kryder's Law predicted that the bit density on the platters of disk drives would more than double every 18 months, leading to a consistent 30-40%/yr drop in cost per byte. Thus, long-term storage was effectively free. If you could afford to store something for a few years, you could afford to store it for ever. The cost would have become negligible.

As Randall Munroe points out, in the real world exponential growth can't continue for ever. It is always the first part of an S-curve. One of the things that most impressed me about Krste Asanović's keynote on the ASPIRE Project at this year's FAST conference was that their architecture took for granted that Moore's Law was in the past. Kryder's Law is also flattening out.

Here's a graph, from Preeti Gupta at UCSC, showing that in 2010, even before the floods in Thailand doubled $/GB overnight, the Kryder curve was flattening. Currently, disk is about 7 times as expensive as it would have been had the pre-2010 Kryder's Law continued. Industry projections are for 10-20%/yr going forward - the red lines on the graph show that in 2020 disk is now expected to be 100-300 times more expensive than pre-2010 expectations.

Industry projections have a history of optimism, but if we believe that data grows at IDC's 60%/yr, disk density grows at IHS iSuppli's 20%/yr, and IT budgets are essentially flat, the annual cost of storing a decade's accumulated data is 20 times the first year's cost. If at the start of the decade storage is 5% of your budget, at the end it is more than 100% of your budget. So the Big Data zealot has an affordability problem.

Why Is Kryder's Law Slowing?

It is easy to, and we often do, conflate Kryder's Law, which describes the increase in the areal density of bits on disk platters, with the cost of disk storage in $/GB. We wave our hands and say that it roughly mapped one-for-one into a decrease in the cost of disk drives. We are not alone in using this approximation, Mark Kryder himself does (PDF):
Density is viewed as the most important factor ... because it relates directly to cost/GB and in the HDD marketplace, cost/GB has always been substantially more important than other performance parameters. To compare cost/GB, the approach used here was to assume that, to first order, cost/GB would scale in proportion to (density)^-1
My co-author Daniel Rosenthal (no relation) has investigated the relationship between bits/in^2 and $/GB over the last couple of decades. Over that time, it appears that about 3/4 of the decrease in $/GB can be attributed to the increase in bits/in^2. Where did the rest of the decrease come from? I can think of three possible causes:
  • Economies of scale. For most of the last two decades the unit shipments of drives have been increasing, resulting in lower fixed costs per drive. Unfortunately, unit shipments are currently declining, so this effect has gone into reverse. In 2005 Mark Kryder was quoted as predicting "In a few years the average U.S. consumer will own 10 to 20 disk drives in devices that he uses regularly," but what is in those devices now is flash. The remaining market for disks is the cloud; they are no longer a consumer technology.
  • Manufacturing technology. The technology to build drives has improved greatly over the last couple of decades, resulting in lower variable costs per drive. Unfortunately HAMR, the next generation of disk drive technology has proven to be extraordinarily hard to manufacture, so this effect has gone into reverse.
  • Vendor margins. Over the last couple of decades disk drive manufacturing was a very competitive business, with numerous competing vendors. This gradually drove margins down and caused the industry to consolidate. Before the Thai floods, there were only two major manufacturers left, with margins in the low single digits. Unfortunately, the lack of competition and the floods have led to a major increase in margins, so this effect has gone into reverse.
But these factors only account for 1/4 of the missing cost decrease. Where did the other 3/4 go? Here is a 2008 graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology. Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. Instead, the industry has resorted to stretching PMR by shingling (which increases the density) and helium (which increases the number of platters).

Each technology generation has to stay in the market long enough to earn a return on the cost of the transition from its predecessor. There are two problems:
  • The return it needs to earn is, in effect, the margins the vendors enjoy. The higher the margins, the longer the technology needs to be in the market. Margins have increased.
  • As technology advances, the easier problems get solved first, so each technology transition involves solving harder and harder problems and costs more. The transition from PMR to HAMR has turned out to be vastly more expensive than the industry expected. Getting the laser and the magnetics in the head assembly to cooperate is very hard, the transition involves a huge increase in the production of the lasers, and so on.
According to Dave's 6-year-old graph, we should now be almost done with HAMR and starting the transition to Bit Patterned Media (BPM). It is already clear that the HAMR-BPM transition will be even more expensive and thus even more delayed than the PMR-HAMR transition. So the projected 20%/yr Kryder rate is unlikely to be realized. The one good thing, if you can call it that, about the slowing of the Kryder rate for disk is that it puts off the day when the technology hits the superparamagnetic limit. This is when the shrinking magnetic domains become unstable at the temperatures encountered inside an operating disk, which are quite warm.

We'll Just Use Tape Instead of Disk

About 70% of all bytes of storage produced each year is disk, the rest being tape and solid state. Tape has been the traditional medium for long-term storage. Its recording technology lags about 8 years behind disk; it is unlikely to run into the problems plaguing disk for some years. We can expect its relative cost per byte advantage over disk to grow in the medium term. But tape is losing ground in the market. Why is this?

In the past, the access patterns to archived data were stable. It was rarely accessed, and accesses other than integrity checks were sparse. But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until last week it was 5.5 times.

An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the considerable costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year. Recently, Twitter issued a call for a "small number of proposals to receive free datasets", but even Twitter can't support 400.

Thus future archives will need to keep at least one copy of their content on low-latency, high-bandwidth storage, not tape.

We'll Just Use Flash Instead

Flash memory's advantages, including low power, physical robustness and low access latency have overcome its higher cost per byte in many markets, such as tablets and servers. But there is no possibility of flash replacing disk in the bulk storage market; that would involve trebling the number of flash fabs. Even if we ignore the lead time to build the new fabs, the investment to do so would not pay dividends. Everyone understands that shrinking flash cells much further will impair their ability to store data. Increasing levels, stacking cells in 3D and increasingly desperate signal processing in the flash controller will keep density going for a little while, but not long enough to pay back the investment in the fabs.

We'll Just Use Non-volatile RAM Instead

There are many technologies vying to be the successor to flash, and most can definitely keep scaling beyond the end of flash provided the semiconductor industry keeps on its road-map. They all have significant advantages over flash; in particular they are byte- rather than block-addressable. But analysis by Mark Kryder and Chang Soo Kim (PDF) at Carnegie-Mellon is not encouraging about the prospects for either flash or the competing solid state technologies beyond the end of the decade.

We'll Just Use Metal Tape, Stone DVDs, Holographic DVDs or DNA Instead

Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
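The two figures above follow from compounding the Kryder rate over a decade. The sketch below reads a Kryder rate of r as "the space needed per byte shrinks by r each year"; that reading, and the function name, are my assumptions.

```python
def space_fraction(kryder_rate: float, years: int) -> float:
    """Fraction of the original space needed for the same data after `years`,
    assuming space per byte shrinks by `kryder_rate` every year."""
    return (1.0 - kryder_rate) ** years


slow = space_fraction(0.10, 10)        # fraction of space still needed
fast = 1.0 / space_fraction(0.30, 10)  # how many times more data fits

print(f"10%/yr for 10 years: same data in {slow:.2f} of the space")
print(f"30%/yr for 10 years: {fast:.0f} times as much data in the same space")
```

On this reading the results land close to the one-third and the thirty-fold figures quoted above; a slightly different compounding convention would shift the second number without changing the argument.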

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than Kryder's or Moore's Laws. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the cold storage problem. But DNA access will always be slow enough that it can't store the only copy.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)
Note that this analysis assumes that the drives fail under warranty. One thing the drive vendors did to improve their margins after the floods was to reduce the length of warranties.
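The arithmetic in the BackBlaze quote is easy to reproduce. The sketch below uses only the numbers given in the quote (fleet size, swap time, fleet cost and the rough salary estimate); the variable and function names are mine.

```python
DRIVES = 30_000
MINUTES_PER_SWAP = 15
FLEET_COST = 4_000_000   # dollars, from the quote
SALARY_SAVED = 5_000     # dollars, the quote's own rough estimate


def replacement_hours(failure_rate: float) -> float:
    """Hours of work needed to swap every drive that fails at the given rate."""
    return DRIVES * failure_rate * MINUTES_PER_SWAP / 60


hours_at_2pct = replacement_hours(0.02)
hours_saved = hours_at_2pct - replacement_hours(0.01)
premium_worth_paying = SALARY_SAVED / FLEET_COST

print(f"{hours_at_2pct:.0f} hours at 2%, {hours_saved:.0f} hours saved at 1%")
print(f"worth a {premium_worth_paying:.3%} higher purchase price")
```

Halving the failure rate saves labour worth only about a tenth of a percent of the purchase price, which is why the cheapest drives win.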

Does Kryder's Law Slowing Matter?

Figures from SDSC suggest that media cost is about 1/3 of the lifecycle cost of storage, although figures from BackBlaze suggest a much higher proportion. As a rule of thumb, the research into digital preservation costs suggests that ingesting the content costs about 1/2 the total lifecycle costs, preserving it costs about 1/3 and disseminating it costs about 1/6. So why are we worrying about a slowing of the decrease in 1/9 of the total cost?

Different technologies with different media service lives involve spending different amounts of money at different times during the life of the data. To make apples-to-apples comparisons we need to use the equivalent of Discounted Cash Flow to compute the endowment needed for the data. This is the capital sum which, deposited with the data and invested at prevailing interest rates, would be sufficient to cover all the expenditures needed to store the data for its life.
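A minimal sketch of the endowment idea, assuming fixed interest and Kryder rates and counting media purchases only (no running costs). This is not the economic model described in this post; the function names, prices and rates are hypothetical and chosen only to show how two technologies with different service lives can be put on the same footing.

```python
def endowment(cash_flows, interest_rate):
    """Present value of (year, cost) payments at a fixed annual interest rate."""
    return sum(cost / (1 + interest_rate) ** year for year, cost in cash_flows)


def media_purchases(first_cost, service_life, kryder_rate, horizon):
    """Buy media in year 0 and again every `service_life` years.

    The price of the same capacity is assumed to fall by `kryder_rate` each year.
    """
    return [(year, first_cost * (1 - kryder_rate) ** year)
            for year in range(0, horizon, service_life)]


# Hypothetical inputs: 20%/yr Kryder rate, 2% interest, 100-year horizon.
# One medium costs 100 and lasts 5 years; the other costs 150 and lasts 10.
short_lived = endowment(media_purchases(100.0, 5, 0.20, 100), 0.02)
long_lived = endowment(media_purchases(150.0, 10, 0.20, 100), 0.02)

print(round(short_lived, 2), round(long_lived, 2))
```

Each result is a single capital sum, so the two purchase schedules can be compared directly even though the money is spent at different times.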

We built an economic model of the cost of long-term storage. Here it is from 15 months ago plotting the endowment needed for 3 replicas of a 117TB dataset to have a 98% chance of not running out of money over 100 years, against the Kryder rate, using costs from Backblaze. Each line represents a policy of keeping the drives for 1, 2 ... 5 years before replacing them.

In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
  • The endowment needed increases sharply.
  • The endowment needed becomes harder to predict, because it depends strongly on the precise Kryder rate.
The reason to worry is that the cost of storing data for the long term depends strongly on the Kryder rate if it falls much below 20%, which it has. Everyone's storage expectations, and budgets, are based on their pre-2010 experience, and on a belief that the effect of the floods was a one-off glitch; the industry will quickly get back to historic Kryder rates. It wasn't, and they won't.
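The shape of that dependence can be illustrated with a much simpler calculation than the model described above. The sketch below assumes deterministic rates, a hypothetical 2% interest rate and a 4-year replacement cycle, and counts media purchases only; it expresses the endowment as a multiple of the first purchase.

```python
def endowment_multiple(kryder_rate, interest_rate=0.02, service_life=4, horizon=100):
    """Endowment for repeated media purchases, as a multiple of the first purchase.

    Assumes the price of the same capacity falls by `kryder_rate` each year and
    that the endowment earns `interest_rate` each year (both hypothetical).
    """
    total = 0.0
    for year in range(0, horizon, service_life):
        price = (1 - kryder_rate) ** year
        total += price / (1 + interest_rate) ** year
    return total


for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    multiple = endowment_multiple(rate)
    print(f"Kryder rate {rate:.0%}: endowment = {multiple:.2f}x initial cost")
```

Under these assumptions the step from 40% to 30% changes the result far less than the step from 20% to 10%, which is the flat-then-steep behaviour described above.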

Does Losing Stuff Matter?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
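The decade figure can be reproduced with a few lines, assuming a flat annual budget, one unit of content per year for the lossless system and two for the cheaper one, and a 1% loss applied to everything held at the end of each year. Those modelling choices are assumptions made here for illustration, and the sketch covers only the ten-year comparison.

```python
def preserved(units_per_year, annual_loss, years):
    """Content surviving after `years`: add new content, then lose a fraction of the total."""
    total = 0.0
    for _ in range(years):
        total += units_per_year
        total *= 1 - annual_loss
    return total


reliable = preserved(1.0, 0.00, 10)  # zero loss rate
cheap = preserved(2.0, 0.01, 10)     # half the cost per byte, so twice the bytes
ratio = cheap / reliable

print(round(ratio, 2))  # 1.89
```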

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2752nd.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.

Can We Do Better?

In the short term, the inertia of manufacturing investment means that things aren't going to change much. Bulk data is going to be on disk; it can't compete with other uses for the higher-value space on flash. But looking out to the end of the decade and beyond, we're going to be living in a world of much lower Kryder rates. What does this mean for storage system architectures?

The reason disks have a five-year service life isn't an accident of technology. Disks are engineered to have a five-year service life because, with a 40%/yr Kryder rate, it is uneconomic to keep the data on the drive for longer than 5 years. After 5 years the data will take up about 8% of the drive's replacement.
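The 8% figure follows directly from compounding the Kryder rate, as this short check shows (the variable names are illustrative):

```python
kryder_rate = 0.40  # annual increase in capacity per dollar, from the text
years = 5

# Fraction of a same-priced replacement drive that the old drive's data occupies.
fraction = (1 - kryder_rate) ** years

print(f"{fraction:.1%}")  # 7.8%
```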

At lower Kryder rates the media, whatever they are, will be in service longer. That means that running cost will be a larger proportion of the total cost. It will be worthwhile to spend more on purchasing the media to spend less on running them. Three years ago Ian Adams, Ethan Miller and I were inspired by the FAWN paper from Carnegie Mellon to do an analysis we called DAWN: Durable Array of Wimpy Nodes. In it we showed that, despite the much higher capital cost, a storage fabric consisting of a very large number of very small nodes each with a very low-power system-on-chip and a small amount of flash memory would be competitive with disk.

The reason was that DAWN's running cost would be so much lower, and its economic media life so much longer, that it would repay the higher initial investment. The more the Kryder rate slows, the better our analysis looks. DAWN's better performance was a bonus. To the extent that successors to flash behave like RAM, and especially if they can be integrated with the system-on-chip, they strengthen the case further with lower costs and an even bigger performance edge.


Summing Up

Expectations for future storage technologies and costs were built up during three decades of extremely rapid cost per byte decrease. We are now 4 years into a period of much slower cost decrease, but expectations remain unchanged. Some haven't noticed the change, some believe it is temporary and the industry will return to the good old days of 40%/yr Kryder rates.

Industry insiders are projecting no more than 20%/yr rates for the rest of the decade. Technological and market forces make it likely that, as usual, they are being optimistic. Lower Kryder rates greatly increase both the cost of long-term storage, and the uncertainty in estimating it.

Lower Kryder rates mean that the economic service life of media will be longer, placing more emphasis on lower running cost than on lower purchase cost. This is particularly true since bulk storage media are no longer a consumer product; businesses are better placed to make this trade-off. But they may not do so (see the work of Andrew Haldane and Richard Davies at the Bank of England, and Doyne Farmer of the Santa Fe Institute and John Geanakoplos of Yale).

The idea that archived data can live on long-latency, low-bandwidth media no longer holds. Future archival storage architectures must deliver adequate performance to sustain data-mining as well as low cost. Bundling computation into the storage medium is the way to do this.

Discussion

As usual, I was too busy answering questions to remember most of them. Here are the ones I remember, rephrased, with apologies to the questioners whose contributions slipped my memory:
  • Won't the evolution of flash technology drive its price down more quickly than disk? The problem is that the manufacturing capacity doesn't, and won't, exist for flash to displace disk in the bulk storage space. Flash is a better technology than disk for many applications, so it is likely always to command a premium over disk.
  • Isn't DNA a really noisy technology to build long-term memory from? At the raw media level, all storage technologies are unpleasantly noisy. The signal processing that goes on inside your disk or flash controller is amazing. DNA has the advantage that the signal processing has a vast number of replicas to work with.
  • Doesn't experience with flash suggest that it isn't capable of storing data reliably for the long term? The way current flash controllers use the raw medium optimizes things other than data retention, such as performance (for SSDs) and low cost (for SD cards, see Bunnie Huang and xobs' talk at the Chaos Communication Congress). That doesn't mean it isn't possible, with alternate flash controller technology, to optimize for data retention.

1 comment:

Eric Lease Morgan said...

We are in the midst of creating a digital dark age. And I sincerely believe that if we -- the "owners" of data and information -- do not actively preserve our content, then it will disappear into Big Byte Heaven. It is only a matter of time. --ELM