Monday, October 1, 2012

Storage Will Be A Lot Less Free Than It Used To Be

I presented our paper The Economics of Long-Term Digital Storage (PDF) at UNESCO's "Memory of the World in the Digital Age" Conference in Vancouver, BC. It pulls together the modeling work we did up to mid-summer. The theme of the talk was, in a line that came to me as I was answering a question, "storage will be a lot less free than it used to be". Below the fold is an edited text of my talk with links to the sources.

Money turns out to be the major problem facing the future of our digital heritage. Paper survives benign neglect very well, but bits are very vulnerable to interruptions in the money supply. No-one has enough money to preserve even a fraction of the content worthy of preservation. Broadly speaking, the extensive research on the cost history of preservation concludes that about half the money has been spent ingesting an object, about a third storing it and about a sixth disseminating it. If storage has been only a third of the cost, why are we building a model of it?

The answer lies in Kryder's Law, the analog of Moore's Law for disk. There is a 30-year history of disk prices dropping about 40% per year. Figures from the San Diego Supercomputer Center show that media is about 1/3 of the total storage cost, the rest being power, cooling, space, staff and so on. But these costs are almost completely per-drive, not per-byte, so the total per-byte cost drops in line with media costs, meaning that customers got nearly three times the capacity for the same price every two years. Thus the cost of storing a given digital object rapidly becomes negligible. The perception was that the delta between storing an object for a few years and storing it forever was too small to worry about. Kryder's Law has held for three decades; surely it is good for another decade or two?
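To see why "forever" looked almost free, note that if each year's storage cost is a fixed fraction of the previous year's, the yearly costs form a geometric series. The sketch below is illustrative only: it assumes a hypothetical first-year cost of 1 unit, ignores interest, and the function names are mine, not from the paper.

```python
def cumulative_cost(first_year_cost, annual_drop, years):
    """Total paid over `years` if the yearly cost falls by `annual_drop` each year."""
    ratio = 1.0 - annual_drop
    return sum(first_year_cost * ratio ** y for y in range(years))


def forever_cost(first_year_cost, annual_drop):
    """Limit of the geometric series: cost / (1 - ratio) = cost / annual_drop."""
    return first_year_cost / annual_drop


# Share of the "forever" cost already paid after five years, at two Kryder rates.
for drop in (0.40, 0.20):
    share = cumulative_cost(1.0, drop, 5) / forever_cost(1.0, drop)
    print(f"{drop:.0%}/yr drop: first 5 years are {share:.0%} of the cost of forever")
```

Under these assumptions, at a 40%/yr drop the first five years account for most of the cost of storing forever, while at 20%/yr a much larger share of the bill is still to come.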

Here is XKCD's explanation. It is always tempting to think that exponential curves will continue, but in the real world they are always just the steep part of an S-curve.

Here is a graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology.

Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. If we had stayed on the Kryder's Law curve we should have had 4TB 3.5" SATA drives in 2010. Instead, in late 2012 the very first 4TB drives are just hitting the market.

It was clear by mid-2011 that the industry had fallen off the Kryder curve. That was before the floods in Thailand destroyed 40% of the world's disk manufacturing capacity and doubled disk prices almost overnight. Prices are still about 60% more than they were before the floods and they are not expected to return to pre-flood levels until 2014. By then they should have been 50% lower. The latest industry projections are for no more than 20% per year improvement in bit density over the next 5 years. In our paper you will find a long list of reasons why even if this is correct, it may not result in a 20%/yr drop in price. These include industry consolidation, and the shift from a 3.5" to a 2.5" form factor.

Bill McKibben's Rolling Stone article Global Warming's Terrifying New Math uses three numbers to illustrate the looming climate crisis. Here are three numbers that illustrate the looming crisis in long-term storage, its cost:
• According to IDC, the demand for storage each year grows about 60%.
• According to IHS iSuppli, the bit density on the platters of disk drives will grow no more than 20%/year for the next 5 years.
• According to computereconomics.com, IT budgets in recent years have grown between 0%/year and 2%/year.
This graph projects these three numbers out for the next 10 years. The red line is Kryder's Law, at 20%/yr. The blue line is the IT budget, at 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget. It's going to consume 100% of the budget in much less than 10 years.
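The projection can be roughly reproduced in a few lines. This is a sketch under my own reading of the three numbers, not the code behind the graph: it assumes year 0 stores one unit of data, that the amount added each year grows by 60%, that nothing is deleted, and that cost per unit falls 20%/yr. The function and parameter names are hypothetical.

```python
def project(years=10, demand_growth=0.60, kryder_rate=0.20, budget_growth=0.02):
    """Return (relative_cost, relative_budget), two lists indexed by year.

    Both are relative to their year-0 values. Assumes all data added in
    earlier years is kept and re-costed at the current per-unit price.
    """
    stored = 0.0   # data accumulated so far
    added = 1.0    # data added in the current year
    cost, budget = [], []
    for year in range(years + 1):
        stored += added
        cost_per_unit = (1.0 - kryder_rate) ** year
        cost.append(stored * cost_per_unit)
        budget.append((1.0 + budget_growth) ** year)
        added *= 1.0 + demand_growth
    return cost, budget


cost, budget = project()
share_now = 0.05  # storage as a fraction of this year's IT budget
share_in_10 = share_now * cost[10] / budget[10]
print(f"cost multiple after 10 years: {cost[10]:.1f}")
print(f"storage share of budget after 10 years: {share_in_10:.0%}")
```

With these assumptions the cost multiple comes out above 20 and the 5% share grows past 100% of the budget, in line with the graph; different assumptions about how demand accumulates would move the exact figures.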

Although about 70% of all bytes of storage produced each year is disk, both tape and solid state are alternatives for preservation. Tape's recording technology lags about 8 years behind disk; it is unlikely to run into the problems plaguing disk for some years. We can expect its relative cost advantage over disk to grow in the medium term.

Flash memory's advantages, including low power, physical robustness and low access latency have overcome its higher cost per byte in many markets, such as tablets and servers. Properly exploited, they could result in enough lower running costs to justify use for long-term storage too. But analysis by Mark Kryder and Chang Soo Kim (PDF) at Carnegie-Mellon is not encouraging about the prospects for flash and the range of alternate solid state technologies beyond the end of the decade.

Based on recent history and projections of future trends we can be fairly confident that the period when storage costs dropped rapidly is over at least for the medium term. This has two effects on the cost of preservation. First, the proportion of the total cost attributable to storage will rise. Second, the total cost of preservation will be higher than projected by current models, which assume Kryder's law continues as it did in the past.

Thus, as a component of overall models of the cost of preservation, we need a more sophisticated model of storage costs, one that doesn't simply assume Kryder's Law continues at 40%/yr but allows us to investigate the effects of varying rates through time. I'm going to describe some results from one of the preliminary models we have built; others are in the paper.

There are three different business models for long-term storage:
• It can be rented, as for example with Amazon's S3 which charges an amount per GB per month.
• It can be monetized, as with Google's Gmail, which sells ads against your accesses to your e-mail.
• Or it can be endowed, as with Princeton's DataSpace, which requires data to be deposited together with a capital sum thought to be enough to fund its storage "for ever".
Comparing different technologies for long-term storage requires comparing different sets of expenditures at different times. Economists' standard technique for doing so is called Discounted Cash Flow (DCF), and it corresponds to working out how big the endowment should be to fund data storage. It assumes an interest rate, the discount rate, and for each future expenditure computes the sum which, deposited now, would with the accumulated interest amount to the expenditure when it occurs.
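The DCF calculation described above amounts to dividing each future expenditure by the interest it would accumulate before it falls due. A minimal sketch, assuming a constant annual discount rate and a hypothetical schedule of media purchases (the amounts are made up for illustration):

```python
def present_value(expenditures, discount_rate):
    """Sum that must be deposited now to meet every (year, amount) expenditure.

    Each amount is discounted by (1 + discount_rate) ** year, i.e. the
    interest a deposit made today would earn before the payment is due.
    """
    return sum(amount / (1.0 + discount_rate) ** year
               for year, amount in expenditures)


# Hypothetical plan: buy media now, replace it in years 4 and 8 at falling prices.
plan = [(0, 100.0), (4, 40.0), (8, 16.0)]
endowment = present_value(plan, discount_rate=0.02)
print(f"endowment needed at 2%: {endowment:.2f}")
```

Raising `discount_rate` shrinks the present value of the later purchases, which is why a rate that is too high makes long-term commitments look cheaper than they are.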

Recent research has cast doubt on both the theoretical and practical basis of DCF. Haldane and Davies of the Bank of England showed that investors using DCF systematically used discount rates that were too high (PDF), raising unjustified barriers to future investments.

Farmer and Geanakoplos showed that the use of a constant discount rate, which averages out the effects of periods of very high or (as now) very low interest rates, produced invalid results in the long term.

We built two prototype models. The second includes storage media, which are replaced when their service life is over or when newer media have costs low enough to justify migrating out of the old media into them. The media have running costs and costs for moving in and out. It uses a model of interest rates based on the 20-year history of inflation-protected US treasury bonds. An initial endowment earns interest and pays for purchase, running and media migration costs.

Here is the result of a typical run of the second model assuming a constant 25%/yr Kryder's Law. The endowment is stepped from 4 to 7 times the initial purchase cost and, for each value, 100 different 100-year histories are simulated to compute the probability of running out of money before 100 years are up. As expected, it is an S-curve. If the endowment is too low, running out of money is certain. If it is large enough, survival is certain. One insight from this graph is that the transition from 0% to 100% survival happens over about 10% of the endowment value. The 25% Kryder rate dominates the much lower interest rates.
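The shape of this experiment can be illustrated with a toy Monte Carlo simulation. It is emphatically not the model from the paper: the interest rate here is a simple Gaussian draw each year rather than a model fitted to bond history, media are replaced only at the end of a fixed service life, migration costs are omitted, and every parameter value is an uncalibrated assumption of mine.

```python
import random


def survives(endowment, kryder_rate, rng, years=100, service_life=4,
             running_cost=0.1, mean_rate=0.02, rate_sd=0.01):
    """Simulate one history; True if the endowment never goes negative.

    All costs are in units of the initial media purchase price.
    """
    balance = endowment
    for year in range(years):
        price = (1.0 - kryder_rate) ** year   # media price this year
        if year % service_life == 0:
            balance -= price                  # buy replacement media
        balance -= running_cost * price       # running cost tracks media price
        if balance < 0:
            return False
        balance *= 1.0 + rng.gauss(mean_rate, rate_sd)  # interest earned
    return True


def survival_probability(endowment, kryder_rate, runs=100, seed=0):
    """Fraction of simulated histories in which the money lasts."""
    rng = random.Random(seed)
    return sum(survives(endowment, kryder_rate, rng) for _ in range(runs)) / runs


# Step the endowment and watch survival move from 0 towards 1.
for endowment in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(endowment, survival_probability(endowment, kryder_rate=0.25))
```

Because the toy parameters are not calibrated, the endowment values at which the transition occurs will not match the 4 to 7 range in the graph; only the qualitative step from certain failure to certain survival carries over.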

This heatmap shows the result of repeating this run for the range of Kryder rates from 5% to 45%. Note that the transition is sharp for high rates and more gradual for rates more in line with historic interest rates.

Taking the 98% survival contour of the heatmap gives us this graph of endowment against Kryder rate. In the past, on which current models of preservation costs are based, we were in the flat-ish region to the right of the graph, where the endowment needed was low and not much affected by the precise values of the Kryder rate or interest rates. In the future we will be in the steep region to the left of the graph, where the endowment needed is much larger and depends strongly on the precise Kryder rate and interest rates. Since we will only be able to guess at both rates, this means that, even with better models, future estimates of preservation costs will necessarily have larger margins of error than current ones.

Here is a graph showing the effect on the endowment needed if a spike in costs similar to that caused by the Thai floods happens after 1, 2, 3, and so on years. The zero-year graph has no spike, for comparison. As one would expect, if the Kryder rate is high a spike has little effect; if it is low the effect is large. This simulation assumed a 4-year service life for the media. The ridge shows that a spike is especially unpleasant if it occurs just when the media need to be replaced.

Here is a history of the prices charged by some major cloud storage services. As you can see, they have hardly dropped at all.
• Amazon's S3 launched March '06 at \$0.15/GB/mo and is now \$0.125/GB/mo, a 3%/yr drop.
• Rackspace launched May '08 at \$0.15/GB/mo and reduced prices to \$0.10/GB/mo on 1st June 2012, about a 9%/yr drop.
• Azure launched November '09 at \$0.15/GB/mo and is now \$0.14/GB/mo, about a 2%/yr drop.
• Google launched October '11 at \$0.13/GB/mo and has not changed.
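The per-year figures in this list are compound annual rates of decline. A minimal sketch of the calculation, where the elapsed times are my approximations from each launch date to the date of this post:

```python
def annual_drop(start_price, end_price, years):
    """Compound annual rate of price decline, as a fraction per year."""
    return 1.0 - (end_price / start_price) ** (1.0 / years)


# (service, launch price, current price, approximate years since launch)
services = [
    ("S3", 0.15, 0.125, 6.58),
    ("Rackspace", 0.15, 0.10, 4.42),
]
for name, start, end, years in services:
    print(f"{name}: {annual_drop(start, end, years):.1%}/yr")
```

Compare these with the 40%/yr historic Kryder rate for disk, or even the 20%/yr now projected by the industry.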

Here we apply the model to compare Amazon's S3 storage service to local storage, based on the cost figures published by the Backblaze PC backup service. To make the comparison fair, we assume that three geographically separate copies are maintained in Backblaze hardware, and, based on the San Diego Supercomputer Center study, that over 3 years non-hardware costs are double the hardware costs.

The model suggests that S3 is not competitive with local storage at any Kryder rate. But they don't have the same Kryder rates. If S3 continues its historic 3%/yr rate and Backblaze experiences the industry projection of a 20%/yr drop, the endowment needed in S3 is more than 5 times larger.

Why is cloud storage so expensive? For the majority of customers, it isn't. Amazon prices S3 against the value it delivers to the majority of customers, not against its cost. That value is largely the flexibility to cope with spikes in demand. But digital preservation is the canonical example of an application with a stable, predictable demand for storage. S3's pricing model is inappropriate for this, as Amazon has acknowledged with their recent announcement of Glacier, a different service with a different pricing model that is aimed at the digital preservation market. Its headline pricing is 5-12 times lower than S3's.

Why isn't cloud storage getting cheaper? Two reasons:
• Amazon has the vast majority of the market and is under no competitive pressure to reduce prices. Note that S3's competitors charge more than S3 does.
• Bandwidth charges and the hassles of getting large amounts of data out of S3 in order to move to a competitor provide a very effective customer lock-in.
I will leave you with this thought experiment. Suppose we wanted to keep all the world's data in S3. Each year, we would need to endow that year's data. 2011's endowment would be about \$11.4T, or 14% of the gross world product. According to IDC, the data to be stored grows by 60%/yr. Gross world product is growing about 5%/yr. The endowment needed for 2018's data would exceed the gross world product.

Trey Duskin said...

Your statement about Rackspace pricing is incorrect. It has been falling and is currently \$0.10/GB/month.

http://www.rackspace.com/cloud/public/files/pricing/

Of course that doesn't change your points about digital preservation in the cloud. However, what if you rolled your own OpenStack object storage (the backing tech for Rackspace and others)? According to some (see http://www.buildcloudstorage.com/2012/01/can-openstack-swift-hit-amazon-s3-like.html) the cost to run an OpenStack Swift cluster starts at \$0.045/GB/month for small clusters and goes down (to \$0.0165/GB/month) for larger clusters.

Chris Rusbridge said...

David, I'm guessing you didn't mean this quite as written: "[storage is] going to consume 100% of the budget in much less than 10 years". This is as much a reductio ad absurdum as the XKCD cartoon you pointed to, and means, I guess, that something else is going to happen. As far as archiving is concerned, I can only guess that that something is greater selection.

David. said...

Trey, thank you for the correction. I will correct the post. It appears that Rackspace reduced prices 33% 1st June 2012. I should have checked before submitting the paper.

Chris, the problem with depending on "greater selection" to reduce the growth of the collections to a level that can fit with the budget is that "greater selection" itself costs money. And it does so up-front along with the ingest costs, making a bad situation worse.

David. said...

Trey, thank you also for the link to Swift's costs. I need to look at them closely but at first sight they appear to reinforce Backblaze's message that building it yourself can result in considerable savings as against commercial cloud services.

But the real message from our long-term model is that the key factor is not so much the initial cost, but how quickly the cost drops through time. Buying from a commercial cloud storage service means you are at the mercy of the service's decision as to how quickly to drop prices. The history shows they don't pass the savings along to their customers.

David. said...

Another instance of uncritically assuming that Kryder's Law is bound to continue, and that therefore we can afford to save everything for ever, is Recording Everything: Digital Storage as an Enabler of Authoritarian Governments by John Villasenor of Brookings, linked from Data Storage Could Expand Reach of Surveillance, a post by Scott Shane on the New York Times Caucus blog.

While I agree that low-cost storage is a powerful enabler of government control of the population, its effects have been seen in the US already. To restrict the threat to governments the US labels "authoritarian" is simplistic. Even more simplistic are the paper's assumptions that Kryder's law will continue unabated, when the storage industry itself has halved the projected rate of growth in bit density, and that only the cost of the media is relevant.

David. said...

This post attracted attention from Dave Feinleib, who blogs at Forbes.

Unknown said...

It would be great to hear if the conclusions in this piece still hold now (late 2015). AWS S3 is now \$0.030/GB/month for the first 1 TB. Here's the price history: https://docs.google.com/spreadsheets/d/11-9Iz701NTvsWv-LJGcbcT7WXBzRce178Kg_bczXCdQ/edit?pli=1#gid=0. AWS Glacier is now \$0.007.

It seems unlikely that it'll keep dropping this fast. Perhaps the competition from Azure is squeezing margins for Amazon. Regardless, even at a low Kryder rate, cloud storage now looks much more competitive with on-prem.

Andrew said...

Hello David,
First I'd like to thank you for having shared your expertise so extensively here online - thank you! I stumbled upon this fine blog because I've been thinking about the economics of "better compression algorithms."

I've got a question for you but I'm having a slightly difficult time articulating it so I'll try my best. In short: "At what point does the cost of compression outrun the cost savings of that compression?" Or, said differently: is the cost of long term storage so much higher than the cost of better compression that it would be difficult to envision a day when the benefit of "gaining a unit more compression" would fall short of the benefit of "storing the data without that additional unit of compression"?

The longer form:
Compression algorithms exist because they make storing and sharing data more economical than storing and sharing uncompressed data. If it were ever true that a compression algorithm cost more to use (say by the power draw or time of coding/decoding a file over that file's expected code/decode life) than some other compression algo, or no compression at all, then we would store those files in a less compressed or uncompressed state. For example, consider the following 20 year expense (toy) scenarios for accessing a digital photo:

With H (high) compression:
\$6, time/power to compress a RAW photo file into a JPEG (via the JFIF codec) and to decode it x times
\$1, power to access it x times
\$10, storage
Total: \$17

With M (medium) compression:
\$1, time/power to compress a RAW photo file into a JPEG (via the JFIF codec) and to decode it x times
\$1, power to access it x times
\$14, storage
Total: \$16

With no compression:
\$1, power to access it x times
\$25, storage
Total: \$26

All considered, it would make more sense to store the photo with M compression because it saves \$1 (in time and power) over H compression.

So, if this inflection point of "diminishing returns" via compression were to be reached in real life (perhaps it already has?), would that be because the cost of compression was power bound or time bound? I.e., in my example, H compression is 6 times more expensive than M compression. If real life pricing dynamics were concerned, would most of that increased cost be due to the electrical power to achieve the added compression or the inconvenience of additional time?

I can clarify where needed.