Thursday, June 13, 2013

Brief talk at ElPub 2013

I was on the panel entitled Setting Research Data Free: Problems and Solutions at the ElPub 2013 conference. Below the fold is the text of my introductory remarks with links to the sources.

One of the few things that most people actually trying to preserve large amounts of digital content agree on is that the number 1 problem they face is not technical but economic. Unlike paper, bits are very vulnerable to interruptions in the money supply. To survive, or in the current jargon to "be sustainable", a digital collection needs an assured stream of funds for the long term. Very few have it. You can tell people are worried about a topic when they appoint a "Blue Ribbon Task Force" to study it. We had such a task force. It reported 2 years ago that, yes, sustainable economics was a big problem. But the panel conspicuously failed to come up with credible solutions.

You, or at least I, often hear people say something similar to what Dr. Fader of the Wharton School Customer Analytics Initiative attributes to Big Data zealots:
"Save it all - you never know when it might come in handy for a future data-mining expedition."
Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is on average probably small. The reason the Big Data zealot gets away with saying things like this is that he, and his audience, believe that this small value outweighs the cost of keeping the data indefinitely. They believe that storage is, in effect, free.

So how free does storage turn out to be? This concern has motivated a good deal of research into the costs of digital preservation, efforts such as CMDP (PDF), LIFE, KRDS, PrestoPrime, ENSURE, and others. Their conclusions differ, but broadly we can say that typically about half the total cost is ingest, about one-third is preservation, mostly storage, and about one-sixth is dissemination.

It is easy to understand why ingesting content is expensive, at least if you have ever tried to do it on a production scale. There is a lot of stuff to ingest. In the real world it is diverse and messy. People want not just the content, but also metadata. This has to be either manually generated, which is expensive, or extracted automatically, which is a great way of revealing the messy nature of the real world.

It is easy to understand why disseminating content is a small part of the total, because preserved content is, on average, very rarely accessed.

Why has storage, an on-going cost that must be paid for the life of the collection, been such a small part of the total in the past?

The reason is this graph, showing Kryder's Law, which says that the areal density of bits on disk platters has increased 30-40%/year for the last 30 years. The areal density doesn't have a one-to-one relationship with the cost per GB of disk, but they are closely correlated. The effect has been, for the last 30 years, that consumers got roughly double the storage at the same price every two years or so.
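The link between a 30-40%/year growth rate and "double every two years or so" can be checked with a few lines of arithmetic. This is a minimal sketch that assumes steady compound growth and, as a simplification the text itself warns about, treats cost per GB as tracking areal density exactly; the function name is made up for illustration.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a steady compound annual growth rate.

    annual_growth is a fraction, e.g. 0.40 for 40%/year.
    """
    return math.log(2) / math.log(1 + annual_growth)


# Growth rates mentioned in the talk: the historical 30-40%/year range,
# and the 20%/year projection cited later on.
for rate in (0.40, 0.30, 0.20):
    print(f"{rate:.0%}/year -> doubles every {doubling_time_years(rate):.1f} years")
```

At 40%/year the doubling time is a little over two years, at 30%/year it is about two and a half, and at 20%/year it stretches to nearly four.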

If something goes on steadily for 30 years or so it gets built into people's models of the world. For digital preservation, the model of the world into which it gets built is that, if you can afford to store something for a few years, you can afford to store it forever. The price per byte of the storage will have become negligible. Thus, the breakdown that has storage costs being one-third of the total has built into it the idea that storage media costs drop so fast that the one-third has only to pay for a few years of storage.
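The "a few years pays for forever" intuition is just a geometric series. The sketch below assumes, purely for illustration, that the cost per byte is inversely proportional to areal density and that density grows at a constant rate; it ignores everything else (power, staff, replacement cycles, interest). The function name and parameters are hypothetical.

```python
def lifetime_cost_multiple(density_growth, years=None):
    """Total cost of keeping a fixed amount of data, in units of the
    first year's storage cost.

    Assumes each year's cost is the previous year's divided by
    (1 + density_growth). With years=None the sum runs forever.
    """
    ratio = 1.0 / (1.0 + density_growth)  # year-over-year cost ratio
    if years is None:
        return 1.0 / (1.0 - ratio)  # sum of the infinite geometric series
    return (1.0 - ratio ** years) / (1.0 - ratio)  # finite partial sum


print(lifetime_cost_multiple(0.40))  # historical Kryder rate
print(lifetime_cost_multiple(0.20))  # slower rate
```

Under these assumptions, storing data forever at 40%/year costs 3.5 times the first year's bill, while at 20%/year it costs 6 times as much. The slower the decline, the less true it is that a few years of funding covers the whole future.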

If you look on my blog, for example at the talk I gave at the UNESCO "Memory of the World" meeting last year, you will find a lot of detailed explanation of the technological and economic reasons why Kryder's Law has slowed, and will continue to slow, and what this means for cost of storing data for the long term. But there's a much simpler argument to convey the basic idea.

Bill McKibben's Rolling Stone article Global Warming's Terrifying New Math uses three numbers to illustrate the looming climate crisis. Here are three numbers that illustrate the looming crisis in the cost of long-term storage:
  • According to IDC, the demand for storage each year grows about 60%.
  • According to IHS iSuppli, the bit density on the platters of disk drives will grow no more than 20%/year for the next 5 years.
  • According to, IT budgets in recent years have grown between 0%/year and 2%/year.
This graph projects these three numbers out for the next 10 years. The red line is Kryder's Law, at 20%/yr. The blue line is the IT budget, at 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget. It's going to consume 100% of the budget in much less than 10 years.
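A projection of this kind is easy to reproduce in outline. The sketch below is one possible reading of the three numbers, not necessarily the model behind the graph: it assumes that the amount of new data arriving each year grows 60%, that everything is kept, that cost per byte is inversely proportional to a bit density growing 20%/year, and that the budget grows 2%/year. All names are invented for illustration.

```python
def project(years=10, demand_growth=0.60, density_growth=0.20, budget_growth=0.02):
    """Return (year, storage_cost, budget) tuples, each relative to year 0."""
    rows = []
    stored = 0.0  # data accumulated so far, in units of year-0 intake
    for year in range(years + 1):
        stored += (1 + demand_growth) ** year          # this year's new data is kept
        price = 1.0 / (1 + density_growth) ** year     # assumed cost per byte
        budget = (1 + budget_growth) ** year
        rows.append((year, stored * price, budget))
    return rows


def storage_share(initial_share, cost, budget):
    """Fraction of the budget consumed by storage, given its year-0 share."""
    return initial_share * cost / budget


for year, cost, budget in project():
    share = storage_share(0.05, cost, budget)
    print(f"year {year:2d}: cost x{cost:6.1f}  budget x{budget:4.2f}  share {share:6.1%}")
```

With these particular assumptions the year-10 cost is well over 20 times the year-0 cost, and a 5% starting share passes 100% of the budget before year 10. Other reasonable readings of "accumulated" or of the Kryder rate give different multiples, but the three growth rates pull so far apart that the qualitative picture does not change.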

Let's look at the economics of each of the three components.

Ingest: This is a one-time, up-front cost, so it can in principle be grant-funded. The big cost is generating and validating metadata that is good enough to allow sharing. This is hard to automate, so it is expensive. The cost falls on the owner of the data, but the benefits accrue to the re-user of the data. This makes it hard to motivate the data owner to fill the gap between metadata that is good enough for their own use, and good enough for sharing and re-use.

Worse, the potential beneficiaries of this effort are competitors for recognition and funding. The mechanisms for getting credit for re-use of data don't work well, precisely because they depend on the competitor to assign the credit, which isn't in their interest.

Thus, for the data owner, the costs of re-use are likely to exceed the benefits. And, unless the data owner takes pro-active steps to market the data to competitors, re-use isn't likely to happen.

Dissemination: If the data owner doesn't provide good metadata and doesn't market their data well, it won't be accessed much and thus the access costs will be low. It isn't hard to pay for the outcome we don't want.

But if data, especially large data, gets popular the access costs can be significant. The market price, just for data transfer, is roughly $120/TB. These costs are borne by the data owner, so the better job of marketing they do the more costs they incur.
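To get a feel for the scale, here is a back-of-the-envelope sketch using the roughly $120/TB transfer price quoted above. The dataset size and download counts are hypothetical, chosen only to show how the bill scales with popularity.

```python
def transfer_cost_usd(dataset_tb, downloads, price_per_tb=120.0):
    """Data-transfer bill for serving a whole dataset `downloads` times.

    price_per_tb defaults to the approximate market price cited in the
    talk; dataset_tb and downloads are illustrative inputs.
    """
    return dataset_tb * downloads * price_per_tb


# A hypothetical 10 TB dataset, at three levels of popularity.
for downloads in (1, 50, 500):
    print(f"{downloads:3d} downloads -> ${transfer_cost_usd(10, downloads):,.0f}")
```

The cost is linear in the number of downloads, which is exactly the quantity the data owner cannot predict in advance.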

Further, these costs are unpredictable, so they are hard to budget for. And they're an on-going cost that can't be grant funded.

We all want the data to be open, but that implies that the access costs can't be recovered from the readers. And selling ads on data isn't a viable business model. Worse, much of the data is either burdened with rosy projections of the IP that can be generated from it, or contains personally identifiable information, so cannot be made open.

Storage: This is an on-going cost, but unlike access it is somewhat predictable. This makes the endowment model possible, where data is deposited together with a capital sum thought to be adequate to pay for its storage "for ever". The endowment model enables grant-funding of long-term data storage.

Unfortunately, my joint research with UC Santa Cruz, Stony Brook and NetApp shows that the endowment needed is rather large. The market price right now is probably around $8K/TB. Few institutions have been willing to take the risk of such projections being wrong; I only know of Princeton (PDF) and the Internet Archive. Both are asking for endowments that appear too low.

This price is so high that funders, used to Kryder's Law and thus assuming that "storage will be free", are unlikely to agree to fund storing everything. But allowing data owners to select the data to be stored for re-use is a very bad idea. We see that from the dire effects of selective publishing of the results of drug trials. Thus either no data should be shared, or all data should be shared.

Given these economic hurdles, one can expect that data sharing will continue to be the exception rather than the rule, no matter how much society might benefit.

1 comment:

  1. Video of the whole first day is here. Click on "Spela Presentationen" to play - my panel is the third chapter.