Paying For Long-Term Storage
The LOCKSS Program at the Stanford Library builds and supports tools libraries can use to collect and preserve material published on the Web, such as academic journals, e-books and government documents, so we face somewhat different problems from personal digital archives. But I'm here to talk about a problem we all share, paying for long-term storage.
I'm sure you've all seen this graph, called Kryder's Law: It shows the now 30-year history of exponential decrease in dollars-per-byte of hard disks. There are three main business models for long term digital storage, and two of them depend heavily on Kryder's Law's continuing validity.
Amazon's S3 is an example of the simplest of these business models, namely charging rent for the space occupied. The rent per byte can be adjusted over time to match whatever the underlying cost turns out to be; this model will work just fine even if the price of storage goes up.
Google's Gmail doesn't charge for the space. Instead, it makes money by serving ads around accesses to the stored information. How often do you access mail a decade old? So how much money is Google going to make serving ads against your decade-old mail? Google is betting that Kryder's Law will continue, so that the money they make serving ads against the frequently accessed recent mail will be more than enough to pay for storing the accumulated, much larger, body of rarely accessed old mail.
This is one reason Gmail imposes a limit on how much mail you can store at any one time. They can modulate this limit in response to changes in storage costs, although actually shrinking the space in response to an increase in cost would be rather unpopular.
The third model is the one I'll be talking about. The obvious conclusion to draw from Kryder's Law is that it is possible to deposit data in a preservation service together with an endowment, a capital sum sufficient to pay for its preservation indefinitely. Or, as Serge Goldstein of Princeton calls it, "Pay Once, Store Endlessly".
Princeton is actually operating an endowed data service, so I will use Goldstein's PowerPoint from last fall's CNI meeting as an example of the typical simple analysis. He computes that if you replace the storage every 4 years and the price drops 20%/yr, you can keep the data forever for twice the initial storage cost.
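Goldstein's arithmetic is easy to check, since the replacement costs form a geometric series. A quick sketch in Python, using his figures (20%/yr price drop, 4-year replacement cycle):

```python
# Buy storage for cost I now, then replace it every r years while the
# price drops by a fraction d each year. Each replacement costs (1-d)**r
# times the previous one, so the total spend is the geometric series
# I * sum_k ((1-d)**r)**k = I / (1 - (1-d)**r).

def endowment_multiple(d=0.20, r=4, generations=1000):
    """Total lifetime cost as a multiple of the initial storage cost."""
    ratio = (1 - d) ** r   # relative cost of each successive replacement
    return sum(ratio ** k for k in range(generations))

print(endowment_multiple())   # ~1.69
print(1 / (1 - 0.8 ** 4))     # closed form, also ~1.69
```

On these assumptions the series converges to about 1.69 times the initial cost, comfortably under the factor of two Goldstein charges.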
Benefits of Endowment
There are some important benefits of endowing long-term storage in this way. If you are mandated to preserve data, for example under an NSF Data Management Plan, it allows the cost of doing so to be rolled up and funded as part of the term-limited funding of the research. The other models require a separate, continuing funding stream, which is difficult to arrange. Most importantly, unlike paper, which survives benign neglect well, bits are very vulnerable to interruptions in the money supply. Endowment is the only way to protect against this threat.
Endowment provides a relatively predictable money supply, but how well can we really predict the money demand, and will they match? In practice, we will need some margin of supply over predicted demand in order to be safe.
What Goldstein is actually charging for Princeton's service is $3000/TB for a single on-line copy. A single tape backup is an extra $2000/TB. This would imply that he pays $3000 each for 2TB SATA drives. They are on the shelves at Fry's for $100. So why is the proponent of endowing data so reluctant to follow his own analysis that he charges a 3000% margin?
The simple analysis rests on a set of assumptions. Here are some of them:
- Storage is the only significant cost.
- Kryder's Law will continue.
- The service will do in the future what you paid for in the past.
Storage is the Major Cost
Figures from Vijay Gill, who runs Google's farms, show that space, power and cooling are 58% of the three-year cost of a server, whereas the cost of the hardware itself is only 26%. Figures from the San Diego Supercomputer Center (abstract only) show that the media is only about 1/3 the total cost of storage.
Some of these other costs are insensitive to the amount of data being stored, so the cost-per-byte drops with the media. But in most cases, such as power, staff, software licenses, and so on, the absolute cost is subject to inflation. The total cost of storage has been dropping, just not as fast as the raw cost of hard disk. As this trend continues, the effect of Kryder's Law on the total cost decreases, since it affects a smaller proportion of the total.
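A toy model makes the effect visible (a sketch: the one-third media share comes from the SDSC figure above, while the 20%/yr media drop and the 3%/yr inflation on everything else are my illustrative assumptions):

```python
# Total cost = media (dropping per Kryder's Law) + everything else
# (staff, power, licenses, assumed here to inflate 3%/yr).
# Media starts at 1/3 of the total; watch the effective annual
# drop in total cost shrink as media becomes a smaller share.

media, other = 1.0, 2.0
for year in range(1, 6):
    prev = media + other
    media *= 0.80                 # Kryder-style 20%/yr drop
    other *= 1.03                 # inflation on everything else
    drop = 100 * (1 - (media + other) / prev)
    print(f"year {year}: total {media + other:.2f}, drop {drop:.1f}%")
```

The first year's total drops by under 5% despite the 20% drop in media cost, and the rate of decline keeps shrinking.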
There's another exponential curve I'm sure you're familiar with, namely Moore's Law: It says that the number of transistors on a chip doubles every 2 years. This curve is expected to continue for at least another 5 years. For many years, Moore's Law delivered faster and faster CPUs, so people started believing the faster CPUs were a consequence of the law. But a few years ago the increase in CPU speed stopped, because it turned out that there were more profitable things to use the smaller and smaller transistors for, such as multiple cores and lower power.
Kryder's Law Will Continue
There are a number of reasons to believe something similar is about to happen to storage. First, laptops, netbooks and now tablets are destroying the market for desktop PCs, which is where the bulk of the high-capacity, low-cost consumer drives go. As the market switches from 3.5" to 2.5" drives, the cost per byte goes up; right now at Fry's a 3.5" drive is 5c/GB whereas a 2.5" drive is 16c/GB. As 3.5" volumes decrease, their cost per byte will drop more slowly; 2.5" volumes are already high, so their prices are already on the normal exponential drop. Eventually the two curves will cross, but until then the cost curve will drop more slowly.
Second, although the Kryder's Law curve looks like a smooth exponential, it is actually the result of overlaying a series of exponentials, one for each successive disk technology. The current technology, Perpendicular Magnetic Recording (PMR), has been through 5 generations. It was expected already to have been replaced by either Heat-Assisted Magnetic Recording (HAMR) or Bit-Patterned Media (BPM). The costs of transitioning to these technologies have turned out to be vastly higher than expected. The industry is having to stretch PMR into a sixth generation. Delaying adoption of these higher-density technologies will slow the cost decrease for a while, and their higher costs will also reduce their initial impact on the cost curve.
Third, solid-state storage, such as Flash and soon Phase Change Memory and Memristors, is eroding the market for hard drives. These technologies need much less power, cooling and rack space. In archival use, they can be expected to last much longer. Their initial cost is much, much higher but the savings can nevertheless be significant because space, power and cooling is such a large part of the total.
All these factors seem likely to slow the exponential price drop, thus increasing the margin above the current cost of storage that must be charged to endow data.
You Will Get What You Paid For
Once you have deposited the data you want preserved, and paid the service, you no longer have much leverage if the service doesn't do a good job. That is why insurance, which has similar characteristics, is a heavily regulated industry. One obvious refinement is to make sure that your money goes not to the service directly up front, but to a third party escrow agent. The agent's job is to audit the preservation service at intervals and, if the service is actually preserving the data it is supposed to, hand over the payment for the interval.
If it turns out that the service you chose is not doing a good job, the escrow agent will have to arrange for your data to be transferred to another, successor service. The endowment will have to include a reserve to cover the costs of such transfers.
Of course, the escrow and audit service has to be paid too, and their costs may well not drop exponentially, so this again increases the margin over the current cost of storage that needs to be charged.
An alternative would be to set up a trust with the capital and simply pay for cloud storage. Using S3's "reduced redundancy storage" with a design goal of 4 nines reliability, and assuming the same 20% per year cost decrease with 5% interest you would need a $4.7K endowment. Using Amazon's full redundancy storage for a goal of 11 nines, the trust would need over $7K. It isn't likely that Princeton is buying storage much cheaper than Amazon, so we see that Princeton isn't charging enough to provide even 4 nines reliability, let alone the much higher levels needed for long-term preservation.
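The arithmetic behind these endowment figures is a discounted geometric series. A sketch (the per-GB-month S3 prices are my assumed circa-2011 list prices, roughly $0.093 for reduced redundancy and $0.14 for standard storage; the 20% drop and 5% interest are as above):

```python
# Present value of paying for 1TB of cloud storage forever, when the
# price drops 20%/yr and the endowment earns 5%/yr interest:
#   PV = sum_t annual_cost * (0.80/1.05)**t = annual_cost / (1 - 0.80/1.05)

def endowment(price_per_gb_month, drop=0.20, interest=0.05):
    annual_cost = price_per_gb_month * 1000 * 12   # dollars per TB-year
    return annual_cost / (1 - (1 - drop) / (1 + interest))

print(round(endowment(0.093)))   # reduced redundancy: ~$4.7K
print(round(endowment(0.14)))    # standard storage: over $7K
```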
Endowing data has some significant advantages over the competing business models when applied to long-term data preservation. But the assumptions behind the simple analysis are optimistic. Real endowed data services, such as Princeton's, need to charge a massive markup over the cost of the raw storage to insulate themselves from this optimism. The perceived mismatch this causes between cost and value may make the endowed data model hard to sell.
Thanks David. I hadn't heard the name of the storage law, but Bryan Lawrence first alerted me to its effects. You suggest the exponential growth may slow down, citing higher costs of 2.5" drives and Flash; but the Kryder graph you included shows easy factors of 10 for each time point, so surely this is just another generational blip?
Strictly, the graph shows the history of the capacity of raw hard disk, which certainly does not increase by a factor of 10 every year. Kryder's Law actually says that the areal density on the platters doubles every year, but in recent years this hasn't been achieved; if it had been, we would have 4TB drives by now. So the flattening of the curve is not a projection, but a fact. The question is how long it will last.
More importantly, as I point out in the talk it isn't the raw cost of the disk but the total cost of storage that determines how much you need to pay to endow data. The total cost has been dropping, but not as fast as the cost of the raw disk. I discuss the reasons for this in the talk.
The question Chris raises is the effect of a blip. Let's use Serge Goldstein's numbers as an example. He projects total storage costs dropping 20% a year, and a 4-year life for the hardware. Thus for every $2 he charges, he spends $1 on storage and keeps $1 for 4 years. Then he spends about $0.40 on new storage and keeps $0.60 for the next 4 years.
Suppose the total cost of storage stays constant for the first 3 years, then starts dropping as expected 20% a year. Now Serge spends $1 and keeps $1 for 4 years. At this point it costs him $0.80 to replace the storage, leaving him $0.20 for the next 4 years. But at the end of those 4 years he can't afford to pay the $0.32 that replacing the storage will cost him.
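This shortfall is easy to reproduce (a sketch of the scenario just described: $2 charged per $1 of initial storage, a 4-year replacement cycle, prices flat for 3 years and then dropping 20%/yr):

```python
# Price of the storage needed, relative to year 0: flat for the first
# 3 years, then dropping 20% per year.
def price(year, flat_years=3, drop=0.20):
    return (1 - drop) ** max(0, year - flat_years)

fund = 2.0                        # the $2 endowment per $1 of storage
for year in range(0, 13, 4):      # replacements at years 0, 4, 8, ...
    cost = price(year)
    if cost > fund:
        print(f"year {year}: need ${cost:.2f} but only ${fund:.2f} left")
        break
    fund -= cost
    print(f"year {year}: spent ${cost:.2f}, ${fund:.2f} remaining")
```

The fund pays $1.00 at year 0 and $0.80 at year 4, leaving $0.20, which is not enough for the third replacement at year 8.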
Serge is right that the cost series converges, but it only does so if a non-zero proportion of the years sees a drop in cost. And his computation that twice the cost of storage is enough holds only if every year sees a 20% drop. Even a few years of flat costs somewhere in the future will push the value to which the series converges above the $2, and the endowed data will run out of money.
Some endowed data proposals charge higher margins than Princeton's 3000%, and some charge less. But margins of this magnitude or even greater are needed to insure against the possibility of the curve flattening. The alternative would be a futures market in storage, so that trusts endowing data could hedge against the volatility. We've seen how well derivatives of this kind work when the unexpected happens.
Thanks for a very insightful and useful analysis. I wanted to comment on a couple of issues that you raise, and how the model might address them.
You point out that, while it is true that the total cost of storage has been declining, such a decline may not be sustained, and if storage costs stay flat for a few years, then the endowment model will fail. I am hesitant to bet against technology, and the declines in storage costs (all costs, not just disk drives) have been so steady and dramatic over the past 40 years that I think basing a model on this assumption is a pretty safe bet. However, the model does have a way to address changes in the rate of decrease of storage costs: the cost factor the model charges can be adjusted from year to year. Does this involve robbing Peter to pay Paul? Yes, in a way it does, but this is how banks and many other organizations operate. (The 4% 40-year fixed-term mortgage can end up looking bad when interest rates rise above 4%, but the bank makes up for this by charging higher rates on other loans.) As long as storage costs decline over the long haul, and we continue to attract new customers (people keep doing research that generates data which needs to be saved), the model does have a mechanism for adjusting to "blips" in storage costs.
The second area where I think some comment is needed has to do with the 3000% surcharge we are imposing on our customers. I believe this is a much more serious problem than the possibility of flat storage costs. Your analysis assumes that we are doing this to provide us with a "buffer" against blips in storage costs. In fact, we are charging our customers exactly 2x what we are paying ourselves for disk space. The problem here is that University data centers are not designed to provide low-cost long-term storage. Our sysadmins do not buy USB disk drives at Best Buy. They buy enterprise-level storage, designed to sustain high use and high reliability. I would love to find a supplier who would provide us with reliable, scalable storage at $100/terabyte (and no usage or bandwidth fees), but that hasn't happened. Maybe someone in the research community will create such a center. I think there is a business opportunity here for an enterprising group of graduate students.
Serge Goldstein, OIT
Thank you, Serge. I agree that you can "rob Peter to pay Paul" in the long term but if costs flatten in the first few years you do need a substantial margin over your initial costs to be safe, as my reply to Chris points out.
And in order to "rob Peter to pay Paul" you need a continuing flow of new customers, which I argue is going to be a hard sell. I am not computing the 3000% margin against USB disks, but against the one-off cost of bare 3.5" 2TB SATA drives at retail. These are the same drives that large data centers use for bulk storage. Clearly, there are other costs on top, for the servers, switches, bandwidth, power and cooling. I point that out. But the marketing problem remains. The perception among your potential customers is that this endowed storage is 3000% more expensive than the off-the-shelf storage they can buy themselves. Do they perceive the added value of long-term storage as worth a factor of 30x in cost?
You've hit the nail on the head. If DataSpace fails, it will be because faculty are unwilling to pay even a one-time charge if they perceive that charge to be unreasonably high. Because the granting agencies require a data plan, what I suspect will happen is something along the lines of:
a) Faculty will ask their departmental sysadmins to write the data management plan.
b) The departmental sysadmins, most of whom are graduate students, will say "hey, we can do this for a whole lot less than those DataSpace folks are charging", and will go out and buy the $100 Best Buy terabyte drive.
c) The plan will be submitted (it will say "reliable departmental servers", not "cheap Best Buy drives") and the reviewers will say, that's fine, that solution should last a few years, which is all that's needed anyway.
If people really want "indefinite", reliable, accessible storage, then $100 Best Buy drives won't do it. On the other hand, I agree that we are charging too much. I am cautiously optimistic that we can architect a storage solution here that is a lot cheaper than what we currently have for use with DataSpace.
USC is charging $70/TB/month for a copy on disk and $1K/TB for a copy on tape for 20 years both for internal and external customers, with no bandwidth charges.
The Library of Congress is concerned with the price they are paying for disk storage, and have instituted a policy that limits the cost of new storage systems (media, enclosure, switches, ...) to 10x the cost of raw consumer disk storage.
Anyone thinking about endowment models for storage should read the recent speech by Andrew Haldane and Richard Davies of the Bank of England, which shows that decision makers systematically and increasingly use unrealistically high discount rates. This means that the marketing headwind endowment suffers is even greater than the raw numbers suggest.
David, please forgive me for commenting on an old post. My math skills are somewhat outdated, but it seems to me that the formula on slide 16 of Serge Goldstein's presentation that you cite is mistaken: he holds that the series converges on I/(1-d)^r, but I think the equation should read I/(1-(1-d)^r), per the formula for a geometric series. Please correct me if I'm wrong, but at least this way the results make sense.
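The correction checks out numerically (a sketch; I, d and r are the slide's symbols for the initial cost, annual price drop and replacement period):

```python
# The replacement costs are I * ((1-d)**r)**k for k = 0, 1, 2, ...
# Their sum should match I / (1 - (1-d)**r), the standard geometric
# series formula, not I / (1-d)**r.

I, d, r = 1.0, 0.20, 4
series = sum(I * ((1 - d) ** r) ** k for k in range(1000))
print(series)                     # ~1.69
print(I / (1 - (1 - d) ** r))     # corrected formula: ~1.69, matches
print(I / (1 - d) ** r)           # slide's formula: ~2.44, does not
```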