Sunday, July 15, 2007

Update to "Petabyte for a Century"

In a paper (abstract only) at the Archiving 2007 conference, Richard Moore and his co-authors report that the San Diego Supercomputer Center's cost to sustain one disk plus three tape replicas is $3K per terabyte per year. The rapidly decreasing disk media cost is only a small part of this, so the overall cost is not expected to drop rapidly. Consider our petabyte of data example. Simply keeping it on-line with bare-bones backup, ignoring all access and update costs, will cost $3M per year. The only safe funding mechanism is endowment. Endowing the petabyte at a 7% rate of return takes about $43M ($3M per year divided by 0.07).
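The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: it treats the endowment as a simple perpetuity whose yearly return exactly covers a constant annual cost, and it assumes decimal units (1 PB = 1,000 TB). The function name is made up for this example.

```python
# Figures quoted in the post; the function name is illustrative only.
COST_PER_TB_YEAR = 3_000   # USD, SDSC figure for one disk plus three tape replicas
TB_PER_PB = 1_000          # decimal units (an assumption)
RATE_OF_RETURN = 0.07      # annual return earned by the endowment


def perpetuity_endowment(annual_cost: float, rate: float) -> float:
    """Principal whose yearly return exactly covers a constant annual cost."""
    return annual_cost / rate


annual_cost = COST_PER_TB_YEAR * TB_PER_PB
endowment = perpetuity_endowment(annual_cost, RATE_OF_RETURN)

print(f"annual cost: ${annual_cost:,.0f}")
print(f"endowment:   ${endowment:,.0f}")
```

The result is a little under $42.9M, which rounds to the $43M quoted above.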

There are probably already many fields of study for which the cost of generating a petabyte of useful data is less than $43M. The trend in the cost per byte of generating data is down, in part because of the increased productivity of scholarship based on data rather than directly on experiment. Thus the implied and unacknowledged cost of the data generated may in many cases overwhelm the acknowledged cost of the project that generated it.

Further, if all the data cannot be saved, a curation process is needed to determine what should be saved and add metadata describing (among other things) what has been discarded. This process is notoriously hard to automate, and thus expensive. The curation costs are just as unacknowledged as the storage costs. The only economically feasible thing to do with the data may be to discard it.

An IDC report sponsored by EMC (pdf) estimates that the world created 161 exabytes of data in 2006. Using SDSC's figures, it would cost almost half a trillion dollars per year to keep one on-line copy and three tape backup copies. Endowing this amount of data for long-term preservation would take nearly seven trillion dollars in cash. It's easy to see that a lot of data isn't going to survive.
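Scaling the same per-terabyte figure up to the IDC estimate is a quick check. The sketch below assumes decimal units (1 EB = 1,000,000 TB) and the same constant-cost perpetuity at 7%. The variable names are illustrative.

```python
EXABYTES_CREATED_2006 = 161   # IDC estimate quoted in the post
TB_PER_EB = 1_000_000         # decimal units (an assumption)
COST_PER_TB_YEAR = 3_000      # USD, SDSC figure
RATE_OF_RETURN = 0.07         # annual return earned by the endowment

world_annual_cost = EXABYTES_CREATED_2006 * TB_PER_EB * COST_PER_TB_YEAR
world_endowment = world_annual_cost / RATE_OF_RETURN

print(f"annual cost: ${world_annual_cost / 1e12:.3f} trillion")
print(f"endowment:   ${world_endowment / 1e12:.2f} trillion")
```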

The original "Petabyte for a Century" post is here.

[Edited to correct broken link]


Phil Wilson said...

If the media cost is trivial, then where does the money go?

David. said...

According to SDSC, media cost is 36% of the total for disk and 20% for tape. The rest of the costs are categorized (as disk/tape percentages) as "other capital" 15/33%, "maintenance and license" 15/22%, "facilities" 11/5% and "sysadmin labor" 23/20%.

SDSC projects that this breakdown will remain roughly constant. Thus they expect the overall cost per TB per year to fall in step with media costs, since the same fixed cost will cover more data over time. I think this is too optimistic, although I agree that the cost per TB per year will decrease over time. That decrease would make my $43M number too high. But the SDSC numbers omit the costs of access and of actually doing the backups, which would have the opposite effect.

It is important to note that the SDSC paper is mostly concerned with the methodological issues involved in making these estimates. I'm sure they would agree that these issues make any actual numbers highly uncertain.

David. said...

I refined my economic model. Using a 7% interest rate and agreeing with SDSC that overall cost per byte halves every two years, the endowment would need to be about $10M. Assuming that the media cost per byte halves every two years but the rest of the cost per byte remains constant, the endowment would need to be almost $30M. The $43M figure is too high because it incorrectly assumes that the media cost does not decrease.
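For readers who want to experiment, here is a minimal sketch of this kind of calculation. It is not the model used for the numbers above. It assumes costs are paid at the start of each year, that the declining share of the cost falls smoothly (halving every two years), and that a 500-year horizon is close enough to forever. The function name, the parameters and the 36% media share used in the example call are all assumptions for illustration.

```python
def endowment_for_declining_cost(
    annual_cost: float,
    rate: float,
    declining_fraction: float,
    halving_years: float = 2.0,
    horizon_years: int = 500,
) -> float:
    """Present value of yearly costs, each paid at the start of its year.

    `declining_fraction` of the initial cost halves every `halving_years`
    (smooth exponential decline, an assumption); the rest stays constant.
    """
    total = 0.0
    for year in range(horizon_years):
        decline = 0.5 ** (year / halving_years)
        cost = annual_cost * (
            declining_fraction * decline + (1.0 - declining_fraction)
        )
        total += cost / (1.0 + rate) ** year  # discount back to today
    return total


# Whole cost declines (the SDSC-style assumption).
all_declining = endowment_for_declining_cost(3_000_000, 0.07, 1.0)

# Only the media share declines; 0.36 is the disk media share, used here
# purely as an illustrative value.
media_only = endowment_for_declining_cost(3_000_000, 0.07, 0.36)

print(f"all costs decline: ${all_declining:,.0f}")
print(f"media only:        ${media_only:,.0f}")
```

The outputs depend heavily on the timing and media-share assumptions, so they need not reproduce the $10M and $30M figures exactly. What the sketch does show is the ordering: the faster the total cost falls, the smaller the endowment needed.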

There is further scope for refinement over the question of whether the model accounts for inflation. A real (post-inflation) return of 7% is unrealistically high. Disk cost per byte does not drop as fast in real terms as it does in nominal terms. I believe that extending the model with inflation will show that both the $10M and $30M numbers above are too low. Stay tuned for more.

Anonymous said...

For sure the numbers are too low.

Tim McMahon said...

Assuming a 3.39% average annual inflation rate (per ) and a nominal return of 9%, which has been considered the long-term average return (although not this decade), you are looking at a real endowment return of 5.61%. So you are right: the estimate of 7% post-inflation is a bit high unless you have some really good professional money managers, like Jack R. Meyer, the former manager of Harvard University's $23 billion in assets.
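The 5.61% figure is the nominal return minus inflation. A short sketch compares that approximation with the exact Fisher relation, (1 + nominal) / (1 + inflation) - 1, and shows what each rate implies for the petabyte. It assumes, purely for illustration, that the $3M annual cost stays constant in real terms.

```python
nominal_return = 0.09   # long-term nominal return quoted in the comment
inflation = 0.0339      # average annual inflation quoted in the comment
annual_cost = 3_000_000 # USD per year, assumed constant in real terms

approx_real = nominal_return - inflation                 # simple subtraction
exact_real = (1 + nominal_return) / (1 + inflation) - 1  # Fisher relation

for label, real_rate in [("approximate", approx_real), ("exact", exact_real)]:
    endowment = annual_cost / real_rate
    print(f"{label} real return {real_rate:.2%}: endowment ${endowment:,.0f}")
```

Either way the real return is below 7%, and under this constant-cost assumption the endowment comes out above $50M rather than $43M.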