This talk describes joint work with my colleague Daniel Vargas, and relies on modelling work I've been doing with Daniel Rosenthal (no relation), Ethan Miller and Ian Adams of UC Santa Cruz, Erez Zadok of Stony Brook University and Mark Storer of NetApp.
In the session before this one we saw universities working through the implications of being asked by research funders to store data for the long haul. One unpleasant implication is the financial commitment. How big is this commitment?
The paper examines whether using cloud storage can reduce the cost of long-term digital preservation. It reports on an experiment in which we ran a LOCKSS box in Amazon's cloud, recorded detailed costs, and compared them with local costs.
Normally, this talk would be about the experiment, but much has happened in the world of cloud storage since we did it. Although these developments do not alter our conclusions, they are, I'm sure, more interesting to this audience. I will cover the motivation for our work, briefly discuss the experiment, and then focus on developments including Amazon's Glacier.
Money turns out to be the major problem facing the future of our digital heritage. Broadly speaking, the extensive research on the costs of preservation concludes that about half the money to preserve an object is spent ingesting it, about a third storing it and about a sixth disseminating it. If storage has been only a third of the cost, why are we worrying about it?
Historically we didn't need to, because of Kryder's Law, the analog of Moore's Law for disk. There is a 30-year history of per-byte costs dropping about 40% per year. Figures from the San Diego Supercomputer Center show that media is about 1/3 of the total storage cost, the rest being power, cooling, space, staff and so on. But these costs are almost completely per-drive, not per-byte, so the total per-byte cost drops in line with media costs, meaning that customers got roughly double the capacity for the same price every two years. Thus the cost of storing a given digital object rapidly becomes negligible.
The perception was that if you could afford to store it for a few years, you could afford to store it forever. Kryder's Law has held for three decades; surely it is good for another decade or two?
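The reasoning behind that perception can be made concrete with a small sketch. It assumes, purely for illustration, that the yearly cost of storing an object falls by a constant fraction forever and that interest is ignored; under those assumptions the yearly costs form a geometric series. The function name and the slower comparison rates are hypothetical.

```python
def forever_cost_multiple(kryder_rate: float) -> float:
    """Cost of storing an object forever, as a multiple of the first year's cost.

    Assumes the yearly cost falls by the constant fraction `kryder_rate`,
    so the costs are 1 + (1-k) + (1-k)^2 + ... = 1/k. Interest is ignored.
    """
    if not 0.0 < kryder_rate < 1.0:
        raise ValueError("kryder_rate must be strictly between 0 and 1")
    return 1.0 / kryder_rate


if __name__ == "__main__":
    # 40% is the historical rate quoted above; the others are for comparison.
    for rate in (0.40, 0.20, 0.10):
        multiple = forever_cost_multiple(rate)
        print(f"{rate:.0%}/year drop: forever costs {multiple:.1f}x the first year")
```

At the historical rate, the whole future costs only a few times the first year, which is why a few years' funding looked like enough. The multiple grows quickly as the rate slows.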
XKCD has an explanation here. It is always tempting to think that exponential curves will continue, but in the real world they are always just the steep part of an S-curve.
very first 4TB drives are just hitting the market.
It was clear by mid-2011 that the industry had fallen off the Kryder curve. That was before the floods in Thailand destroyed 40% of the world's disk manufacturing capacity and doubled disk prices almost overnight. Prices are not expected to return to pre-flood levels until 2014. By then they should have been 50% lower. The latest industry projections are for no more than 20% per year improvement over the next 5 years. There are many reasons why even this may be optimistic, such as industry consolidation, and the shift from a 3.5" to a 2.5" form factor.
Bill McKibben's Rolling Stone article "Global Warming's Terrifying New Math" uses three numbers to illustrate the looming crisis. Here are three numbers that illustrate the looming crisis in the cost of long-term storage:
- According to IDC, the demand for storage each year grows about 60%.
- According to IHS iSuppli, the bit density on the platters of disk drives will grow no more than 20%/year for the next 5 years. The bit density doesn't affect the per-byte cost one-for-one, but they are closely related.
- According to computereconomics.com, IT budgets in recent years have grown between 0%/year and 2%/year.
We need to focus on the triangle between the blue and green lines. We need some combination of an increase in the IT budget growth by an order of magnitude, a radical reduction in the rate at which we store new data, or a radical reduction in the cost of storage media.
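To see how quickly these three numbers diverge, here is a rough sketch. It assumes that per-byte cost falls 20% per year (that is, that cost tracks bit density one-for-one, which the text above notes is only approximately true), that all data is kept, and that budgets grow at the top of the quoted range. The constant and function names are illustrative.

```python
DEMAND_GROWTH = 0.60   # IDC: yearly growth in storage demand
COST_DROP = 0.20       # assumption: per-byte cost falls as fast as bit density grows
BUDGET_GROWTH = 0.02   # computereconomics.com: top of the 0-2% range


def relative_spend(years: int) -> float:
    """Cost of meeting demand after `years`, relative to today's cost."""
    return ((1.0 + DEMAND_GROWTH) * (1.0 - COST_DROP)) ** years


def relative_budget(years: int) -> float:
    """IT budget after `years`, relative to today's budget."""
    return (1.0 + BUDGET_GROWTH) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"year {years:2d}: spend x{relative_spend(years):.2f}, "
              f"budget x{relative_budget(years):.2f}")
```

Under these assumptions the cost of keeping up with demand grows 28% per year while the budget grows 2%, so the gap compounds rather than closing.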
Many people believe there is a fourth alternative, because there is some magic in "the cloud" that makes storage much cheaper. I've spent the last year looking at the numbers behind the "affordable cloud storage" hype. We ran a LOCKSS box in Amazon's cloud and collected detailed cost numbers. I looked at the pricing history of cloud storage services. And I used our prototype economic model to compare cloud and local storage costs.
The detailed cost numbers we recorded are in the paper, and in our report to the Library of Congress, which funded the experiment. We scaled up these numbers to match the median LOCKSS box in the Global LOCKSS Network. We then projected the numbers over 3 years to compute a 3-year total cost of ownership.
Depending on the exact assumptions, the projected 3-year TCO ranged from (A) $11.6K to (B) $19.1K. We would expect in practice to incur costs closer to the A case. The hardware cost of the median GLN LOCKSS box with the B assumptions is less than $1.5K, so even if we compare it to the cloud box with the A assumptions, running costs of $280/month would be needed to make the local box more expensive. This is vastly higher than the running costs incurred in practice. The conclusion that local storage is cheaper than S3 is supported by our modeling work.
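The break-even figure follows directly from the numbers above. This short calculation spreads the gap between the cheaper cloud case and the local hardware cost over the 36 months of the projection; the variable names are illustrative.

```python
CLOUD_TCO_A = 11_600      # dollars: projected 3-year cloud TCO, case A
LOCAL_HARDWARE = 1_500    # dollars: upper bound on the local box's hardware cost
MONTHS = 36               # length of the 3-year projection

# Monthly running cost at which the local box would match the cloud box.
break_even_per_month = (CLOUD_TCO_A - LOCAL_HARDWARE) / MONTHS

if __name__ == "__main__":
    print(f"Break-even local running cost: ${break_even_per_month:.0f}/month")
```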
We model a chunk of data through time as it migrates from one generation of storage media to its successors. The goal is to compute the endowment, the capital needed to fund the chunk's preservation for, in our case, 100 years. The price per byte of each media generation is set by a Kryder's Law parameter. Each technology also has running costs, and costs for moving in and moving out. Interest rates are set each year using a model based on the last 20 years of inflation-protected Treasuries. I should add the caveat that this is still a prototype, so the numbers it generates should not be relied on. But the shapes of the graphs and the relative costs seem highly plausible.
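To give a feel for the structure of such a model, here is a heavily simplified sketch. It is not our prototype: it assumes a fixed interest rate rather than one that varies each year, a fixed service life per media generation, and running and migration costs expressed as fractions of each generation's purchase price. All parameter names and default values are illustrative assumptions.

```python
def endowment(initial_price: float, kryder_rate: float, interest_rate: float,
              service_life: int = 3, running_ratio: float = 2.0,
              migration_fraction: float = 0.2, horizon: int = 100) -> float:
    """Present value of storing one chunk of data for `horizon` years.

    Every `service_life` years a new media generation is bought at a price
    that has fallen by `kryder_rate` per year. Running costs over a
    generation's life total `running_ratio` times its purchase price, and
    moving in and out adds `migration_fraction` of the purchase price.
    """
    total = 0.0
    yearly_running = 0.0
    for year in range(horizon):
        cost = 0.0
        if year % service_life == 0:
            # Buy the next generation at the price Kryder's Law gives for this year.
            price = initial_price * (1.0 - kryder_rate) ** year
            cost += price * (1.0 + migration_fraction)
            yearly_running = price * running_ratio / service_life
        cost += yearly_running
        # Discount this year's cost back to the present.
        total += cost / (1.0 + interest_rate) ** year
    return total


if __name__ == "__main__":
    for rate in (0.40, 0.20, 0.10):
        print(f"Kryder rate {rate:.0%}: endowment {endowment(1.0, rate, 0.02):.2f}x "
              f"the initial purchase price")
```

Even in this toy version, the endowment rises steeply as the Kryder rate falls, because later generations stop being almost free.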
We need a baseline, the cost of local storage, to compare with cloud costs. It should lean over backwards to be fair to cloud storage. I don't know of a lot of good data to base this on; I used numbers from Backblaze, a PC backup service which publishes detailed build and ownership costs for their 4U 117TB storage pods. I took their 2011 build cost, and increased it to reflect the then 60% increase in disk cost since the Thai floods. Based on numbers from San Diego Supercomputer Center and Google, I added running costs so that the hardware cost is only 1/3 of the total 3-year cost of ownership. Note that this is much more expensive than Backblaze's published running cost. I added move-in and move-out costs of 20% of the purchase price in each generation. Then I multiplied the total by three to reflect three geographically separate replicas.
- The cost increases sharply.
- The cost becomes harder to predict, because it depends strongly on the precise Kryder rate, which we are not going to know.
Here is a table showing the history of the prices for several major cloud storage services since their launch. Their Kryder rates are below 10%, except for the recent entrants, which made major pricing mistakes when they launched their services.
| Service | Launch | Launch (c/GB/mo) | Dec 2012 (c/GB/mo) | Annual % Drop |
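One plausible way to compute an "Annual % Drop" column from a launch price and a current price is as a compound annual rate. The sketch below assumes that definition; the prices and time span in the example are hypothetical and are not taken from the table.

```python
def annual_drop(launch_price: float, current_price: float, years: float) -> float:
    """Compound annual fractional price drop between launch and now."""
    if launch_price <= 0 or current_price <= 0 or years <= 0:
        raise ValueError("prices and years must be positive")
    return 1.0 - (current_price / launch_price) ** (1.0 / years)


if __name__ == "__main__":
    # Hypothetical service: launched at 15 c/GB/mo, now 10 c/GB/mo, 5 years later.
    rate = annual_drop(15.0, 10.0, 5.0)
    print(f"Effective Kryder rate: {rate:.1%} per year")
```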
If local storage's Kryder rate matches IHS' 20%, and if S3's is their historic 7%, the endowment needed in S3 is more than 5 times larger than in local storage, and depends much more strongly on the Kryder rate. This raises two obvious questions.
First, why don't S3's prices drop as the cost of the underlying storage drops? The answer is that they don't need to. Their customers are locked-in by bandwidth charges. S3 has the bulk of the market with their current prices. Their competitors match or even exceed their prices. Why would Amazon cut prices?
Second, why is S3 so much more expensive than local storage? After all, even using S3's Reduced Redundancy Storage to store 117TB, you would pay in the first month almost enough to buy the hardware for one of Backblaze's storage pods. The answer is that, for the vast majority of S3's customers, it isn't that expensive. First, they are not in the business of long-term storage. Their data has a shelf-life much shorter than the life of the drives, so they cannot amortize across the full life of the media. Second, their demand for storage has spikes. By using S3, they avoid paying for the unused capacity to cover the spikes.
Long-term storage has neither of these characteristics, and this makes S3's business model inappropriate for long-term storage. Amazon recently admitted as much when they introduced Glacier, a product aimed specifically at long-term storage, with headline pricing between 5 and 12 times cheaper than S3.
To make sure that Glacier doesn't compete with S3, Amazon gave it two distinguishing characteristics. First, there is an unpredictable delay between requesting data and getting it. Amazon says this will average about 4 hours, but they don't commit to either an average or a maximum time. Second, the pricing for access to the data is designed to discourage access. There is a significant per-request charge, to motivate access in large chunks. Although you are allowed to access 5% of your data each month with no per-byte charge, the details are complex and hard to model.
As I understand it, if on any day of the month you exceed the pro-rated free allowance (i.e. about 0.17% depending on the month), you are charged for the whole month as if you had sustained your peak hourly rate in that month for the entire month. Thus, to model Glacier I had to make some fairly heroic assumptions:
- No accesses to the content other than for integrity checks.
- Accesses to the content for integrity checking are generated at a precisely uniform rate.
- Each request is for 1GB of content.
- One reserved AWS instance used for integrity checks.
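To put the pro-rated free allowance in concrete terms for the 117TB example, here is a small calculation based on my reading of the rules above. It assumes decimal units (1TB = 1000GB) and the 1GB request size from the assumptions listed; the function and constant names are illustrative.

```python
MONTHLY_FREE_FRACTION = 0.05   # 5% of stored data per month, no per-byte charge
GB_PER_TB = 1000               # assumption: decimal units


def daily_free_allowance_gb(archive_tb: float, days_in_month: int) -> float:
    """Data retrievable per day without exceeding the pro-rated free allowance."""
    return archive_tb * GB_PER_TB * MONTHLY_FREE_FRACTION / days_in_month


if __name__ == "__main__":
    for days in (28, 30, 31):
        allowance = daily_free_allowance_gb(117, days)
        fraction = MONTHLY_FREE_FRACTION / days
        # With 1GB requests, the allowance in GB is also the daily request budget.
        print(f"{days}-day month: {allowance:.0f} GB/day ({fraction:.2%} of the archive)")
```

An integrity checker that stays inside this daily budget needs about 20 months to read the whole archive once, since 5% per month covers 100% in 20 months.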
But they won't be. One cent per gigabyte per month is a great price point; Amazon won't reduce it for a long time. So Glacier's Kryder rate will be 0%.
But even this is not an apples-to-apples comparison. Both local storage and S3 provide adequate access to the data. Glacier's long latency and severe penalties for unplanned access mean that, except for truly dark archives, it isn't feasible to use Glacier as the only repository. Even for dark archives, Glacier's access charges provide a very powerful lock-in. Getting data out of Glacier to move it to a competitor in any reasonable time-frame would be very expensive, easily as much as a year's storage.
Providing adequate access to justify preserving the content, and avoiding getting locked-in to Amazon, requires maintaining at least one copy outside Glacier. If we maintain one copy of our 117TB example in Glacier with 20-month integrity checks experiencing a 7% Kryder rate, and one copy in local storage experiencing a 20% Kryder rate (instead of the three in our earlier local storage examples), the endowment needed would be $517K. The endowment needed for three copies in local storage at a 20% Kryder rate would be $486K. Given the preliminary state of our economic model, this is not a significant difference.
Replacing two copies in local storage with one copy in Glacier would not significantly reduce costs; it might even increase them slightly. Its effect on robustness would be mixed, with 4 versus 3 total copies (effectively triplicated in Glacier, plus local storage) and greater system diversity, but at the cost of less frequent integrity checks.
- It is pretty clear that services like S3 are simply too expensive to use for digital preservation. The reasons for this are mostly business rather than technical.
- Access and lock-in considerations make it very difficult for a digital preservation system to use Glacier as its only repository.
- Even with generous assumptions, it isn't clear that using Glacier to replace all but one local store reduces costs or enhances overall reliability.
- Systems that combine Glacier with local storage, or with other cloud storage systems, will need to manage accesses to the Glacier copy very carefully if they are not to run up large access costs.
- Everything I have been discussing is about commercial cloud storage providers such as Amazon. This is not to say that a private cloud cannot be economically viable. It is to say that, in setting up a private cloud, it is essential to economic viability that its prices follow the Kryder rate of the underlying storage.