In a paper (abstract only) at the Archiving 2007 conference, Richard Moore and his co-authors report that the San Diego Supercomputer Center's cost to sustain one disk plus three tape replicas is $3K per terabyte per year. The rapidly decreasing cost of disk media is only a small part of this, so the overall cost is not expected to drop rapidly. Consider our example of a petabyte of data. Simply keeping its 1,000 terabytes on-line with bare-bones backup, ignoring all access and update costs, will cost $3M per year. The only safe funding mechanism is endowment. Endowing the petabyte at a 7% rate of return is a $43M investment.
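The arithmetic behind those two figures can be sketched in a few lines of Python. This is only a rough model, assuming decimal units (1 petabyte = 1,000 terabytes) and treating the endowment as a simple perpetuity whose yearly return exactly covers the yearly cost; it ignores inflation, fees, and any change in storage costs. The function names are illustrative and not taken from the SDSC paper.

```python
# Assumed inputs, taken from the figures quoted above.
COST_PER_TB_YEAR = 3_000   # dollars per terabyte per year (SDSC figure)
RATE_OF_RETURN = 0.07      # assumed constant annual return on the endowment
TB_PER_PB = 1_000          # decimal units (assumption)


def annual_cost(terabytes, cost_per_tb_year=COST_PER_TB_YEAR):
    """Yearly cost in dollars of keeping `terabytes` of data."""
    return terabytes * cost_per_tb_year


def endowment(yearly_cost, rate=RATE_OF_RETURN):
    """Principal whose return at `rate` pays `yearly_cost` forever."""
    return yearly_cost / rate


yearly = annual_cost(1 * TB_PER_PB)
principal = endowment(yearly)
print(f"${yearly / 1e6:.0f}M per year, ${principal / 1e6:.0f}M endowment")
# prints "$3M per year, $43M endowment"
```

Because the model is linear, the endowment scales directly with the amount of data and inversely with the rate of return: a 5% return would require $60M for the same petabyte.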
There are probably already many fields of study for which the cost of generating a petabyte of useful data is less than $43M. The trend in the cost per byte of generating data is down, in part because of the increased productivity of scholarship based on data rather than directly on experiment. Thus the implied and unacknowledged cost of preserving the data a project generates may in many cases overwhelm the acknowledged cost of the project itself.
Further, if all the data cannot be saved, a curation process is needed to determine what should be saved and add metadata describing (among other things) what has been discarded. This process is notoriously hard to automate, and thus expensive. The curation costs are just as unacknowledged as the storage costs. The only economically feasible thing to do with the data may be to discard it.
An IDC report sponsored by EMC (pdf) estimates that the world created 161 exabytes of data in 2006. Using SDSC's figures, it would cost almost half a trillion dollars per year to keep one on-line copy and three tape backup copies. Endowing this amount of data for long-term preservation would take nearly seven trillion dollars in cash. It's easy to see that a lot of data isn't going to survive.
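As a quick check on the scale, the same back-of-the-envelope calculation can be applied to the IDC estimate. The sketch below assumes decimal units (1 exabyte = 1,000,000 terabytes), SDSC's $3K per terabyte per year, and a simple perpetuity at a 7% return; the variable names are illustrative.

```python
# Assumed inputs: IDC's 2006 estimate and SDSC's per-terabyte cost.
DATA_EXABYTES = 161
TB_PER_EB = 1_000_000      # decimal units (assumption)
COST_PER_TB_YEAR = 3_000   # dollars per terabyte per year
RATE_OF_RETURN = 0.07      # assumed constant annual return

# Yearly cost of keeping everything created in 2006.
world_annual = DATA_EXABYTES * TB_PER_EB * COST_PER_TB_YEAR

# Principal whose yearly return covers that cost indefinitely.
world_endowment = world_annual / RATE_OF_RETURN

print(f"${world_annual / 1e9:.0f}B per year")
print(f"${world_endowment / 1e12:.1f}T endowment")
# prints "$483B per year" then "$6.9T endowment"
```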
The original "Petabyte for a Century" post is here.
[Edited to correct broken link]