Below the fold is an edited text with links to the sources.
Storage Will Be Much Less Free Than It Used To Be
Who Am I?
I'm David Rosenthal and I'm an engineer. I'm about two-thirds of a century old. I wrote my first program almost half a century ago, in Fortran for an IBM 1401, at high school. My high school is three and a half centuries old, my undergraduate school is over eight centuries old, my graduate school is about a century and two-thirds old. About a third of a century ago I was recruited for the Andrew project at C-MU, where I worked on the user interface with James Gosling. I followed James to Sun to work on window systems, both X, which became the Linux standard, and a more interesting one called NeWS that you almost certainly haven't heard of. Then I worked on operating systems and graphics chips. More than a fifth of a century ago I was employee #4 at NVIDIA, helping architect the first chip. Then I was an early employee at Vitria, the second company of some founders of the company now called Tibco. One seventh of a century ago, after doing 3 companies, all of which IPO-ed, I was burnt out and decided to ease myself gradually into retirement.
The LOCKSS Program
It was a total failure. I met Vicky Reich, the wife of the late Mark Weiser, CTO of Xerox PARC. She was a librarian at Stanford, and had been part of the team which, now nearly a fifth of a century ago, started Stanford's HighWire Press and pioneered the transition of academic journals from paper to the Web.
Librarians are the people who buy academic journals. They were, and are, worried about the long-term availability of the Web journals to which they subscribe. Vicky and I started the LOCKSS (Lots Of Copies Keep Stuff Safe) program at the Stanford Library to address this problem, with seed funding from NSF, and then major support from NSF, the Mellon Foundation and Sun Microsystems. We've been working on it ever since. For the last 7 years we have been running without grant funding, covering our costs using the "Red Hat" model of free, open source software and paid support.
Long-term Storage Is Free
Or Is It?
I grew up in the days when people expected nuclear power to become "too cheap to meter". My Ph. D. is in nuclear power, so I'm naturally skeptical of projections like this. Think about the business model for digital preservation:
- It could be an advertising model, like Gmail, where Google funds storing your mail by selling ads on your accesses to it. But accesses to archival data are rare, so this isn't likely to work.
- It could be a rental model, like Amazon's S3, where you pay each month for storage. But future budgets are uncertain, and it only takes one month of budget crunch to lose all the data.
- It could be an endowment model, in which the data is deposited together with a capital sum that is invested to generate sufficient income to pay for the storage.
If you think storage costs will drop rapidly, the endowment model is very attractive, and it has features that make it very suitable for the academic and research world, where grants are for a few years but data is supposed to live for ever. So I was very interested to hear a 2010 talk about Princeton's endowed data service, based on an economic model predicting that if you changed media every 4 years and $/GB dropped 20%/yr the endowment should be double the initial storage cost. I was interested enough to look the service up on the Web, and I found that what they were actually charging for each disk copy was $3000/TB, or 30 times the then retail cost of the drives. Clearly, someone else was skeptical of the endowment analysis.
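The arithmetic behind a prediction of this kind can be sketched as a geometric series. This is a minimal illustration, not the Princeton model itself: it assumes the endowment earns no interest and pays only for media replacement, and the function name `endowment_multiple` is made up for this example.

```python
def endowment_multiple(kryder_rate, replace_every_years, horizon_years=100):
    """Total media spend over the horizon, as a multiple of the first purchase.

    Simplifying assumptions (not from the talk): no interest is earned on
    the endowment, there are no running costs, and $/GB falls by a constant
    fraction `kryder_rate` every year.
    """
    total = 0.0
    for year in range(0, horizon_years, replace_every_years):
        # Each replacement costs the initial price scaled by the price drop so far.
        total += (1.0 - kryder_rate) ** year
    return total


# Media replaced every 4 years, $/GB dropping 20%/yr.
print(round(endowment_multiple(0.20, 4), 2))
# The same policy if $/GB drops only 5%/yr.
print(round(endowment_multiple(0.05, 4), 2))
```

With a 20%/yr drop the series converges quickly to well under twice the initial purchase, which leaves some room for running costs inside a "double the initial cost" endowment. With a 5%/yr drop the same policy needs several times the initial purchase for media alone.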
Modelling Long-Term Storage Costs
So, with initial funding from the Library of Congress, I started to build an economic model of long-term storage. My funding ran out after about a year as the Library got hit with sequestration and other budget problems, but the work continues thanks to Ethan Miller's students in the Storage Systems Research Center at UC Santa Cruz. I've just been kibitzing.
For the purpose of building a model of long-term storage economics, the endowment approach is essential; it effectively computes the net present value of the expenditures that data incurs through its history and allows apples to apples comparisons between different technologies.
Endowment vs. Kryder Rate
The graph is based on data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years, against the Kryder rate (the annual percentage drop in $/GB). The different curves represent policies of keeping the drives for 1, 2, 3, 4 or 5 years. Up to 2010, we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everyone believed that long-term storage was effectively free. But suppose the Kryder rate were to drop below about 20%/yr. We would be in the steep part of the graph, where the endowment needed is both much higher and also strongly dependent on the exact Kryder rate.
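The flat and steep regions can be reproduced qualitatively with a much cruder model than the one behind the graph. The sketch below is deterministic, so it has no notion of a 95% probability, and every parameter value in it (media cost, running cost, interest rate) is a hypothetical placeholder rather than Backblaze, SDSC or Google data. It also assumes that running cost per byte falls in step with the media price at each replacement.

```python
def endowment(kryder_rate, service_life, horizon=100,
              media_cost=1.0, running_cost_per_year=0.1, interest_rate=0.02):
    """Present value of all spending over `horizon` years.

    The result is in units of the initial media purchase. All default
    parameter values are hypothetical, chosen only for illustration.
    """
    total = 0.0
    purchase_factor = 1.0  # price level when the current drives were bought
    for year in range(horizon):
        spend = 0.0
        if year % service_life == 0:
            # Replace the drives at this year's (lower) price.
            purchase_factor = (1.0 - kryder_rate) ** year
            spend += media_cost * purchase_factor
        # Assumption: running cost scales with the density of the drives in service.
        spend += running_cost_per_year * purchase_factor
        # Discount back to today at a constant interest rate.
        total += spend / (1.0 + interest_rate) ** year
    return total


# One row per Kryder rate, one column per service life of 1..5 years.
for rate_pct in range(5, 45, 5):
    row = [endowment(rate_pct / 100.0, life) for life in range(1, 6)]
    print(f"{rate_pct:2d}%/yr", " ".join(f"{value:6.2f}" for value in row))
```

Reading down any column, the values change little between 30%/yr and 40%/yr and rise sharply as the rate falls toward 5%/yr, which is the shape described above.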
The Kryder Rate Slowed
Actually, we don't have to suppose. About a year after our first model, the Thai floods almost doubled $/GB.
Preeti's graph shows that disk is now about 7 times as expensive as it would have been had the industry maintained its pre-2010 Kryder rate. The red lines show the range of industry projections for the Kryder rate going forward, between 20%/yr and 10%/yr. If these projections pan out, disk in 2020 will be between 100 and 300 times as expensive as it would have been had the industry maintained its pre-2010 Kryder rate. I don't think many organizations appreciate the impact this will have on the cost of storing data for the long term.
Slowing of the Kryder rate should not have been a surprise. Here is Randall Munroe's explanation. In the real world exponential growth can't go on forever; it is always just the steep part of an S-curve. It is now widely accepted that Moore's Law has slowed. Although transistors will continue to shrink for a few more generations, this will no longer result in transistors getting cheaper or CPUs getting faster.
Kryder Rate vs. Service Life
Disks have a 5-year service life not because that is how long they can last; they last 5 years because the historic Kryder rate means that a longer service life is economically unjustifiable. It isn't hard to engineer drives for a much longer life, but unless customers expect the Kryder rate to slow dramatically they won't expect a return from even a small cost increment. They will plan to replace the drives before their life is up.
Once the customer realizes that the Kryder rate has slowed they should plan to keep the media longer, which increases the relative importance of running cost against capital cost. Three years ago Ian Adams, Ethan Miller & I published a technical report on a storage architecture we called DAWN, for Durable Array of Wimpy Nodes. DAWN was a large number of small storage nodes, each with an extremely low-power system-on-chip, some flash memory, and an Ethernet interface. We showed that on reasonable assumptions DAWN's very low running cost and long media life could provide archival storage cheaper than disk despite its much higher capital cost while retaining low access latency.
Unfortunately, in practice few if any organizations can make the tradeoff to pay more up-front in order to realize lower running costs and thus lower total cost of ownership. The problem is not unique to long-term storage; in its various forms it is called short-termism, or the planning horizon, or the budget cycle. Fundamentally, it is the issue of a discount rate. Discounted Cash Flow (DCF) is the standard technique for computing the Net Present Value (NPV) of future income or expenditures so that different cash flow scenarios can be compared. It does so by applying a constant interest rate, the discount rate, chosen by the organization. Recent research has identified both theoretical and practical problems with this technique:
- Haldane and Davies of the Bank of England showed that investors using DCF systematically used discount rates that were too high (PDF), raising unjustified barriers to future investments.
- Farmer and Geanakoplos showed that the use of a constant discount rate, which averages out the effects of periods of very high or (as now) very low interest rates, produced invalid results in the long term.
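To make the role of the discount rate concrete, here is a minimal sketch of a constant-rate DCF calculation applied to the kind of tradeoff described above: pay more up-front for lower running costs, or the reverse. The two cost streams and both discount rates are invented for illustration and are not taken from the DAWN report.

```python
def npv(cash_flows, discount_rate):
    """Net present value of yearly cash flows, year 0 first, at a constant rate."""
    return sum(flow / (1.0 + discount_rate) ** year
               for year, flow in enumerate(cash_flows))


YEARS = 20
# Hypothetical cost streams (costs, so a lower NPV is better).
cheap_to_buy = [100.0] + [30.0] * (YEARS - 1)  # low capital cost, high running cost
cheap_to_run = [300.0] + [5.0] * (YEARS - 1)   # high capital cost, low running cost

for rate in (0.02, 0.15):
    buy = npv(cheap_to_buy, rate)
    run = npv(cheap_to_run, rate)
    winner = "cheap_to_run" if run < buy else "cheap_to_buy"
    print(f"discount rate {rate:.0%}: cheap_to_buy={buy:.0f} "
          f"cheap_to_run={run:.0f} -> prefer {winner}")
```

With these made-up numbers the low-running-cost option wins at a 2% discount rate and loses at 15%. The choice of rate alone flips the decision, which is why systematically excessive discount rates raise barriers to this kind of investment.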
A Simpler Analysis