Sunday, July 15, 2007

Update to "Petabyte for a Century"

In a paper (abstract only) at the Archiving 2007 conference Richard Moore and his co-authors report that the San Diego Supercomputer Center's cost to sustain one disk plus three tape replicas is $3K per terabyte per year. The rapidly decreasing disk media cost is only a small part of this, so the overall cost is not expected to drop rapidly. Consider our petabyte of data example. A petabyte is 1,000 terabytes, so simply keeping it on-line with bare-bones backup, ignoring all access and update costs, will cost $3M per year. The only safe funding mechanism is endowment. Endowing the petabyte at a 7% rate of return requires a $43M investment.
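A quick back-of-the-envelope check of this arithmetic, as a Python sketch (the $3K/terabyte/year figure and the 7% return are the numbers quoted above, not independent estimates):

```python
# Sketch: reproduce the petabyte endowment arithmetic from the figures above.
cost_per_tb_year = 3000       # SDSC: one disk copy plus three tape replicas, $/TB/year
petabyte_in_tb = 1000         # 1 PB = 1,000 TB
rate_of_return = 0.07         # assumed endowment rate of return

annual_cost = cost_per_tb_year * petabyte_in_tb
endowment = annual_cost / rate_of_return  # perpetuity whose return covers the annual cost

print("Annual cost:    ${:,.0f}".format(annual_cost))  # $3,000,000
print("Endowment size: ${:,.0f}".format(endowment))    # ~$42,857,143, i.e. about $43M
```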

There are probably already many fields of study for which the cost of generating a petabyte of useful data is less than $43M. The cost per byte of generating data is trending down, in part because of the increased productivity of scholarship based on data rather than directly on experiment. Thus the implied but unacknowledged cost of preserving the data may in many cases overwhelm the acknowledged cost of the project that generated it.

Further, if not all the data can be saved, a curation process is needed to determine what should be saved and to add metadata describing (among other things) what has been discarded. This process is notoriously hard to automate, and thus expensive. The curation costs are just as unacknowledged as the storage costs. The only economically feasible thing to do with the data may be to discard it.

An IDC report sponsored by EMC (pdf) estimates that the world created 161 exabytes of data in 2006. Using SDSC's figures, keeping one on-line copy and three tape backups of all of it would cost almost half a trillion dollars per year. Endowing this amount of data for long-term preservation would take nearly seven trillion dollars in cash. It's easy to see that a lot of data isn't going to survive.
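The same sketch scaled up to IDC's 2006 total (161 exabytes, per the report) gives the trillion-dollar figures:

```python
# Sketch: the 161 exabyte case, using the same SDSC cost and 7% return as above.
cost_per_tb_year = 3000           # $/TB/year for one disk copy plus three tape replicas
exabyte_in_tb = 1000 * 1000       # 1 EB = 1,000,000 TB
rate_of_return = 0.07

annual_cost = 161 * exabyte_in_tb * cost_per_tb_year
endowment = annual_cost / rate_of_return

print("Annual cost: ${:.2f} trillion".format(annual_cost / 1e12))  # ~$0.48 trillion/year
print("Endowment:   ${:.2f} trillion".format(endowment / 1e12))    # ~$6.9 trillion
```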

The original "Petabyte for a Century" post is here.

[Edited to correct broken link]

Friday, July 13, 2007

Update on Post-Cancellation Access

In my June 10 post on post-cancellation access to e-journals I said:
big publishers increasingly treat their content not as separate journals but as a single massive database. Subscription buys access to the whole database. If a library cancels their subscription, they lose access to the whole database. This bundling, or "big deal", leverages a small number of must-have journals to ensure that cancellation of even low-value journals, the vast majority in the bundle, is very unlikely. It is more expensive to subscribe individually to the few high-value journals than to take the "big deal". Thus cancellation of large publisher journals is a low risk, which is the goal of the "big deal" scheme.
On July 5 Elsevier mailed their subscribers about 2008 pricing. The mail confirmed that both individual print journals and individual e-journal subscriptions are history.

On July 6 the Association of Subscription Agents (an interested party) issued a press release that clarified the impact of Elsevier's move:
Libraries face a choice between Science Direct E-Selects (a single journal title purchased in electronic format only on a multiple password basis rather than a site licence), or a Science Direct Standard or Complete package (potentially a somewhat more expensive option but with the virtue of a site licence).
Libraries must now pay the full rate for both the print and the E-Select (electronic) option if they require both formats. This separation of the print from the electronic also leaves European customers open to Value Added Tax on the E-Select version which, depending on the EU country involved, could add substantially to the cost (17.5% in the UK, 19% in Germany for example).
It is clear that Elsevier, at least, is determined to ensure that its customers subscribe only to the electronic "big deal" because, once subscribed, libraries will find cancellation effectively impossible.