Tuesday, January 22, 2013

Talk at IDCC2013

At IDCC2013 in Amsterdam I presented the paper Distributed Digital Preservation in the Cloud in which Daniel Vargas and I described an experiment in which we ran a LOCKSS box in Amazon's cloud. Or rather, I gave a talk that briefly motivated and summarized the paper and then focused on subsequent developments in cloud storage services, such as Glacier. Below the fold is an edited text of the talk with links to the resources. I believe that video of the talk (and, I hope, the interesting question-and-answer session that followed) will be made available eventually.

This talk describes joint work with my colleague Daniel Vargas, and relies on modelling work I've been doing with Daniel Rosenthal (no relation), Ethan Miller and Ian Adams of UC Santa Cruz, Erez Zadok of Stony Brook University and Mark Storer of NetApp.

In the session before this one we saw Universities working through the implications of being asked by research funders to store data for the long haul. One unpleasant implication is the financial commitment. How big is this commitment?

The paper examines whether the use of cloud storage can reduce the cost of long-term digital preservation. It reports on an experiment in which we ran a LOCKSS box in Amazon's cloud, recorded detailed costs, and compared them with local costs.

Normally, this talk would be about the experiment, but much has happened in the world of cloud storage since we did it. Although these developments do not alter our conclusions, they are, I'm sure, more interesting to this audience. I will cover the motivation for our work, briefly discuss the experiment, and then focus on developments including Amazon's Glacier.

Money turns out to be the major problem facing the future of our digital heritage. Broadly speaking, the extensive research on the costs of preservation concludes that about half the money to preserve an object is spent ingesting it, about a third storing it and about a sixth disseminating it. If storage has been only a third of the cost, why are we worrying about it?

The answer lies in this graph, which illustrates Kryder's Law, the analog of Moore's Law for disk. There is a 30-year history of per-byte costs dropping about 40% per year. Figures from the San Diego Supercomputer Center show that media is about 1/3 of the total storage cost, the rest being power, cooling, space, staff and so on. But these costs are almost completely per-drive, not per-byte, so the total per-byte cost drops in line with media costs, meaning that customers got roughly double the capacity for the same price every two years. Thus the cost of storing a given digital object rapidly becomes negligible.

The perception was that if you could afford to store it for a few years, you could afford to store it forever. Kryder's Law has held for three decades; surely it is good for another decade or two?
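The arithmetic behind that perception can be sketched in a few lines. This is a minimal illustration, not the economic model discussed later in the talk: it assumes that the annual cost of storing an object falls at a constant Kryder rate forever, and it ignores interest. The function name `share_of_lifetime_cost` is my own.

```python
def share_of_lifetime_cost(kryder_rate: float, years: int) -> float:
    """Fraction of the cost of storing an object forever that is spent
    in the first `years` years.

    Assumption (hypothetical simplification): the annual cost shrinks by
    `kryder_rate` every year and interest is ignored, so the annual costs
    form a geometric series c, c*(1-k), c*(1-k)**2, ...
    The first `years` terms make up 1 - (1-k)**years of the infinite sum.
    """
    if not 0 < kryder_rate < 1:
        raise ValueError("kryder_rate must be between 0 and 1")
    return 1 - (1 - kryder_rate) ** years


for rate in (0.40, 0.20, 0.10):
    share = share_of_lifetime_cost(rate, 3)
    print(f"Kryder rate {rate:.0%}: first 3 years = {share:.0%} of the all-time cost")
```

Under these assumptions the first three years account for about 78% of the all-time cost at a 40% Kryder rate, but for less than half of it at 20%.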

XKCD has an explanation here. It is always tempting to think that exponential curves will continue, but in the real world they are always just the steep part of an S-curve.

Here is a 2008 graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology. Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. If we had stayed on the Kryder's Law curve we should have had 4TB 3.5" SATA drives in 2010. Instead, in late 2012 the very first 4TB drives are just hitting the market.

It was clear by mid-2011 that the industry had fallen off the Kryder curve. That was before the floods in Thailand destroyed 40% of the world's disk manufacturing capacity and doubled disk prices almost overnight. Prices are not expected to return to pre-flood levels until 2014. By then they should have been 50% lower. The latest industry projections are for no more than 20% per year improvement over the next 5 years. There are many reasons why even this may be optimistic, such as industry consolidation, and the shift from a 3.5" to a 2.5" form factor.

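Bill McKibben's Rolling Stone article "Global Warming's Terrifying New Math" uses three numbers to illustrate the looming crisis. Three numbers likewise illustrate the looming crisis in the cost of long-term storage: the rate at which storage gets cheaper, the rate at which IT budgets grow, and the rate at which the data to be stored grows.
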
Here's a graph that projects these three numbers out for the next 10 years. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at computereconomics.com's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget.
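The projection can be approximated with a short script. This is a sketch of one possible reading of the graph, not the code that produced it: it assumes that the amount of new data arriving each year grows by 60%, that everything accumulated since year 0 is kept, and that all three curves are normalized to 1 in year 0. The function and parameter names are illustrative.

```python
def project(years=10, kryder=0.20, budget_growth=0.02, data_growth=0.60):
    """Return a list of (year, unit_cost, budget, storage_cost) tuples,
    each value relative to year 0.

    Assumptions (mine): new data arriving in year t is (1 + data_growth)**t,
    all data accumulated since year 0 is kept, and the cost per byte
    falls by `kryder` each year.
    """
    rows = []
    stored = 0.0
    for t in range(years + 1):
        stored += (1 + data_growth) ** t       # accumulate this year's new data
        unit_cost = (1 - kryder) ** t          # red line: cost per byte
        budget = (1 + budget_growth) ** t      # blue line: IT budget
        rows.append((t, unit_cost, budget, stored * unit_cost))  # green line
    return rows


rows = project()
_, _, budget_10, storage_10 = rows[-1]
print(f"Year 10 storage cost relative to year 0: {storage_10:.1f}x")
print(f"Storage share of budget if it starts at 5%: {0.05 * storage_10 / budget_10:.0%}")
```

Under these assumptions the year-10 storage cost is well over 20 times the year-0 cost, and a 5% starting share of the budget grows past 100%.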

We need to focus on the triangle between the blue and green lines. We need some combination of an increase in the IT budget growth by an order of magnitude, a radical reduction in the rate at which we store new data, or a radical reduction in the cost of storage media.

Many people believe there is a fourth alternative, because there is some magic in "the cloud" that makes storage much cheaper. I've spent the last year looking at the numbers behind the "affordable cloud storage" hype. We ran a LOCKSS box in Amazon's cloud and collected detailed cost numbers. I looked at the pricing history of cloud storage services. And I used our prototype economic model to compare cloud and local storage costs.

The detailed cost numbers we recorded are in the paper, and in our report to the Library of Congress, which funded the experiment. We scaled up these numbers to match the median LOCKSS box in the Global LOCKSS Network. We then projected the numbers over 3 years to compute a 3-year total cost of ownership.

Depending on the exact assumptions, the projected 3-year TCO ranged from (A) $11.6K to (B) $19.1K. We would expect in practice to incur costs closer to the A case. The hardware cost of the median GLN LOCKSS box with the B assumptions is less than $1.5K, so even if we compare it to the cloud box with the A assumptions, local running costs of about $280/month would be needed to make the local box more expensive. This is vastly higher than the running costs incurred in practice. The conclusion that local storage is cheaper than S3 is supported by our modeling work.
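The break-even figure follows from simple arithmetic, sketched below. The variable and function names are illustrative; the inputs are the rounded figures quoted above, and the hardware cost is an upper bound, so the true break-even is at least this high.

```python
MONTHS = 36                 # 3-year total cost of ownership
CLOUD_TCO_LOW = 11_600      # case (A): cheapest projected cloud TCO, in dollars
LOCAL_HARDWARE = 1_500      # upper bound on local hardware cost, in dollars


def breakeven_monthly_running_cost(cloud_tco, local_hardware, months=MONTHS):
    """Monthly local running cost at which local and cloud TCO are equal."""
    return (cloud_tco - local_hardware) / months


breakeven = breakeven_monthly_running_cost(CLOUD_TCO_LOW, LOCAL_HARDWARE)
print(f"Local running costs would need to exceed ${breakeven:.0f}/month")
```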

We model a chunk of data through time as it migrates from one generation of storage media to its successors. The goal is to compute the endowment, the capital needed to fund the chunk's preservation for, in our case, 100 years. The price per byte of each media generation is set by a Kryder's Law parameter. Each technology also has running costs, and costs for moving in and moving out. Interest rates are set each year using a model based on the last 20 years of inflation-protected Treasuries. I should add the caveat that this is still a prototype, so the numbers it generates should not be relied on. But the shapes of the graphs and the relative costs seem highly plausible.

We need a baseline, the cost of local storage, to compare with cloud costs. It should lean over backwards to be fair to cloud storage. I don't know of a lot of good data to base this on; I used numbers from Backblaze, a PC backup service which publishes detailed build and ownership costs for their 4U 117TB storage pods. I took their 2011 build cost, and increased it to reflect the then 60% increase in disk cost since the Thai floods. Based on numbers from San Diego Supercomputer Center and Google, I added running costs so that the hardware cost is only 1/3 of the total 3-year cost of ownership. Note that this is much more expensive than Backblaze's published running cost. I added move-in and move-out costs of 20% of the purchase price in each generation. Then I multiplied the total by three to reflect three geographically separate replicas.
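To make the structure of this baseline concrete, here is a heavily simplified, deterministic sketch. It is not our prototype model: it uses a fixed real interest rate instead of the Treasury-based model, has no randomness (so it says nothing about a 98% confidence level), assumes every media generation lasts exactly 3 years, treats the 20% migration cost as a single charge per generation, and pays each generation's costs up front. All names and default values other than the 1/3 hardware share, the 20% migration cost, the three replicas and the 100-year horizon are my assumptions.

```python
def endowment(purchase_cost, kryder_rate, real_interest=0.02,
              service_life=3, running_ratio=2.0, migration_ratio=0.2,
              replicas=3, horizon=100):
    """Capital needed today to fund `replicas` copies for `horizon` years.

    purchase_cost   -- hardware cost of the first generation
    kryder_rate     -- annual fractional drop in hardware price
    real_interest   -- fixed real return on the endowment (assumption)
    service_life    -- years each media generation is kept (assumption)
    running_ratio   -- running cost over a generation's life, as a multiple
                       of its purchase price (2.0 makes hardware 1/3 of TCO)
    migration_ratio -- move-in plus move-out cost as a fraction of purchase
    """
    total = 0.0
    for year in range(0, horizon, service_life):
        price = purchase_cost * (1 - kryder_rate) ** year
        generation_cost = price * (1 + running_ratio + migration_ratio)
        # Discount the generation's cost back to today.
        total += generation_cost / (1 + real_interest) ** year
    return replicas * total


for rate in (0.40, 0.30, 0.20, 0.10):
    print(f"Kryder rate {rate:.0%}: endowment = "
          f"{endowment(1.0, rate):.1f}x the initial hardware cost")
```

Even this crude version shows the endowment changing little between 40% and 30% Kryder rates and rising much faster as the rate falls toward 10%.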

The result is this graph, plotting the endowment needed to have a 98% chance of not running out of money in 100 years against the Kryder rate. In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph, where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
  • The cost increases sharply.
  • The cost becomes harder to predict, because it depends strongly on the precise Kryder rate, which we are not going to know.
Now we do the same thing for S3, setting the purchase cost to 0 and the running cost from S3's published prices. The only additional cost is the running cost of a reserved AWS virtual machine to do integrity checks.

The result is this graph, showing that S3 is not competitive with local storage at any Kryder rate. But this comparison is misleading: it assumes that local storage and S3 experience the same Kryder rate.

Here is a table showing the history of the prices for several major cloud storage services since their launch. Their Kryder rates are below 10%, except for the recent entrants, which made major pricing mistakes when they launched their services.
Service    Launch  Launch ($/GB/mo)  Dec 2012 ($/GB/mo)  Annual % Drop
S3         03/06   0.15              0.095               7
Rackspace  05/08   0.15              0.10                9
Azure      11/09   0.15              0.095               14
Google     10/11   0.13              0.095               24
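
The "Annual % Drop" column can be approximated as a compound annual rate of decline between the launch price and the December 2012 price. The sketch below is my reconstruction, not necessarily the exact method used for the table; it counts whole months between the launch month and December 2012, and the results agree with the table to within about one percentage point.

```python
def annual_drop_percent(launch_price, current_price, launch_year, launch_month,
                        end_year=2012, end_month=12):
    """Compound annual percentage drop in price between launch and the end date.

    Only the ratio of the two prices matters, so the price unit is irrelevant.
    """
    months = (end_year - launch_year) * 12 + (end_month - launch_month)
    years = months / 12
    return 100 * (1 - (current_price / launch_price) ** (1 / years))


services = [
    # (name, launch price, Dec 2012 price, launch year, launch month)
    ("S3", 0.15, 0.095, 2006, 3),
    ("Rackspace", 0.15, 0.10, 2008, 5),
    ("Azure", 0.15, 0.095, 2009, 11),
    ("Google", 0.13, 0.095, 2011, 10),
]

for name, launch, current, year, month in services:
    drop = annual_drop_percent(launch, current, year, month)
    print(f"{name}: {drop:.1f}% per year")
```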

If local storage's Kryder rate matches IHS' 20%, and if S3's is their historic 7%, the endowment needed in S3 is more than 5 times larger than in local storage, and depends much more strongly on the Kryder rate. This raises two obvious questions.

First, why don't S3's prices drop as the cost of the underlying storage drops? The answer is that they don't need to. Their customers are locked-in by bandwidth charges. S3 has the bulk of the market with their current prices. Their competitors match or even exceed their prices. Why would Amazon cut prices?

Second, why is S3 so much more expensive than local storage? After all, even using S3's Reduced Redundancy Storage to store 117TB, you would pay in the first month almost enough to buy the hardware for one of Backblaze's storage pods. The answer is that, for the vast majority of S3's customers, it isn't that expensive. First, they are not in the business of long-term storage. Their data has a shelf-life much shorter than the life of the drives, so they cannot amortize across the full life of the media. Second, their demand for storage has spikes. By using S3, they avoid paying for the unused capacity to cover the spikes.

Long-term storage has neither of these characteristics, and this makes S3's business model inappropriate for long-term storage. Amazon recently admitted as much when they introduced Glacier, a product aimed specifically at long-term storage, with headline pricing between 5 and 12 times cheaper than S3.

To make sure that Glacier doesn't compete with S3, Amazon gave it two distinguishing characteristics. First, there is an unpredictable delay between requesting data and getting it. Amazon says this will average about 4 hours, but they don't commit to either an average or a maximum time. Second, the pricing for access to the data is designed to discourage access. There is a significant per-request charge, to motivate access in large chunks. Although you are allowed to access 5% of your data each month with no per-byte charge, the details are complex and hard to model.

As I understand it, if on any day of the month you exceed the pro-rated free allowance (i.e. about 0.17% depending on the month), you are charged for the whole month as if you had sustained your peak hourly rate in that month for the entire month. Thus, to model Glacier I had to make some fairly heroic assumptions:
  • No accesses to the content other than for integrity checks.
  • Accesses to the content for integrity checking are generated at a precisely uniform rate.
  • Each request is for 1GB of content.
  • One reserved AWS instance used for integrity checks.
This graph shows the result for Glacier with no integrity checks, with checks every 20 months (the most frequent schedule possible while staying within the 5% free allowance) and every 4 months. Note the huge impact of the access charges incurred to do integrity checks every 4 months. It looks as though Glacier with 20-month checks is a little cheaper than local storage, and with 4-month checks is a little more expensive, provided the Kryder rates are the same.
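
The allowance figures used above follow from simple arithmetic, sketched here. This reflects only my understanding of the 5% monthly allowance as described in this talk, not Amazon's full pricing rules, and the function names are illustrative.

```python
import calendar

MONTHLY_FREE_FRACTION = 0.05  # 5% of stored data per month, as described above


def daily_free_fraction(year, month, monthly=MONTHLY_FREE_FRACTION):
    """Pro-rated daily free allowance as a fraction of the stored data."""
    days_in_month = calendar.monthrange(year, month)[1]
    return monthly / days_in_month


def months_per_full_check(monthly=MONTHLY_FREE_FRACTION):
    """Months needed to read every byte once without exceeding the allowance."""
    return 1 / monthly


print(f"Daily allowance in a 30-day month: {daily_free_fraction(2013, 4):.2%}")
print(f"Shortest free integrity-check cycle: {months_per_full_check():.0f} months")
```

A 4-month check cycle means reading 25% of the data each month, five times the free allowance, which is why the access charges dominate in that case.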

But they won't be. One cent per gigabyte per month is a great price point; Amazon won't reduce it for a long time. So Glacier's Kryder rate will be 0%.

But even this is not an apples-to-apples comparison. Both local storage and S3 provide adequate access to the data. Glacier's long latency and severe penalties for unplanned access mean that, except for truly dark archives, it isn't feasible to use Glacier as the only repository. Even for dark archives, Glacier's access charges provide a very powerful lock-in. Getting data out of Glacier to move it to a competitor in any reasonable time-frame would be very expensive, easily as much as a year's storage.

Providing adequate access to justify preserving the content, and avoiding getting locked-in to Amazon, requires maintaining at least one copy outside Glacier. If we maintain one copy of our 117TB example in Glacier with 20-month integrity checks experiencing a 7% Kryder rate, and one copy in local storage experiencing a 20% Kryder rate (instead of the three in our earlier local storage examples), the endowment needed would be $517K. The endowment needed for three copies in local storage at a 20% Kryder rate would be $486K. Given the preliminary state of our economic model, this is not a significant difference.

Replacing two copies in local storage with one copy in Glacier would not significantly reduce costs; it might even increase them slightly. Its effect on robustness would be mixed, with 4 versus 3 total copies (effectively triplicated in Glacier, plus local storage) and greater system diversity, but at the cost of less frequent integrity checks.

  • It is pretty clear that services like S3 are simply too expensive to use for digital preservation. The reasons for this are mostly business rather than technical.
  • Access and lock-in considerations make it very difficult for a digital preservation system to use Glacier as its only repository.
  • Even with generous assumptions, it isn't clear that using Glacier to replace all but one local store reduces costs or enhances overall reliability.
  • Systems that combine Glacier with local storage, or with other cloud storage systems, will need to manage accesses to the Glacier copy very carefully if they are not to run up large access costs.
  • Everything I have been discussing is about commercial cloud storage providers such as Amazon. It isn't to say that a private cloud isn't economically viable. But it is to say that in setting up a private cloud it is essential to economic viability that its prices follow the Kryder rate of the underlying storage.
I'll leave you with this thought experiment. Suppose we decided to keep everything in S3 forever. What would it cost? Each year, we would need to endow that year's data. 2011's endowment would be about $11.4T, or 14% of the gross world product. According to IDC, the data to be stored grows by 60%/yr. Gross world product is growing about 5%/yr. The endowment needed for 2018's data would exceed the gross world product.


  1. More evidence of their long-term strategy to lock-in customers and, once they are locked-in, extract rent from them comes from Amazon's behavior in the book market:

    "Amazon sells about one in four printed books, ... a level of market domination with little precedent in the book trade.

    It is an achievement built on superior customer service, a vast range of titles and, most of all, rock-bottom prices that no physical store could hope to match. Even as Amazon became one of the largest retailers in the country, it never seemed interested in charging enough to make a profit. Customers celebrated and the competition languished.

    Now, with Borders dead, Barnes & Noble struggling and independent booksellers greatly diminished, for many consumers there is simply no other way to get many books than through Amazon. And for some books, Amazon is, in effect, beginning to raise prices.

    ... “You lower your prices until the competition is out of the picture, and then you raise your prices and get your money back,” [Stephen Blake Mettee] said."