Tuesday, May 6, 2014

On the Economics of Throwing Stuff Away

I've been arguing for some time that storing bits will be a lot less free than it used to be. The Big Data zealots who say:
Save it all—you never know when it might come in handy for a future data-mining expedition.
will have to adapt to this new reality. Below the fold I look at possible adaptations.

As I wrote a couple of years ago:
Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely. 
The cost of keeping the data indefinitely has gone up a lot. So one of four things needs to happen:
  • The Big Data zealot gets his budget increased by a lot.
  • The Big Data zealot gets a lot better at turning dross into gold.
  • The Big Data zealot decides that the data he collects doesn't need to be nearly as big.
  • The Big Data zealot starts throwing away lots of the stuff he has collected.
I think the first three are all unlikely, so it is time to look at the economics of throwing stuff away.

It turns out that the endowment model provides useful insights into throwing stuff away as well as into storing stuff. This may have something to do with the fact that, just as people like the Big Data zealot think long-term storage is effectively free, they think that throwing stuff away is free too.

The net present value of throwing some stuff away now is the as-yet-unexpended portion of the stuff's endowment. The costs of throwing the stuff away are in two parts:
  • The (hypothetical) future values that can no longer be extracted from the stuff, discounted back to now.
  • The cost of taking the decision now to throw this stuff, rather than some other stuff or nothing, away.
The first thing this analysis makes clear is that, provided that the Kryder rate stays significantly above interest rates, the Big Data zealot is comparing apples to oranges. The bulk of the costs for storing the data are incurred in the near future. For a fair analysis, the future benefits from data-mining need to be discounted back to now. Unless they are also accrued in the near future they will be much lower than they seem.
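To make the comparison concrete, here is a minimal sketch of the arithmetic in Python. It is an illustration under stated assumptions, not the endowment model itself: the function names, the fixed annual discount rate, and the idea of a constant Kryder rate applied to a first-year cost are all simplifications introduced here.

```python
def present_value(cash_flows, rate):
    """Discount (year, amount) pairs back to year 0 at a fixed annual rate.

    Assumption: a single constant discount rate, which is a simplification.
    """
    return sum(amount / (1 + rate) ** year for year, amount in cash_flows)


def storage_costs(first_year_cost, kryder_rate, years):
    """Hypothetical yearly storage costs that fall by kryder_rate each year."""
    return [(year, first_year_cost * (1 - kryder_rate) ** year)
            for year in range(years)]


def net_gain_from_discarding(unspent_endowment, future_values, rate, decision_cost):
    """Unspent endowment recovered, minus the two costs of throwing stuff away:
    the discounted future values forgone and the cost of making the decision."""
    return unspent_endowment - present_value(future_values, rate) - decision_cost
```

With a 5% discount rate, `present_value([(20, 100.0)], 0.05)` is far smaller than `present_value([(2, 100.0)], 0.05)`, which is the point about distant data-mining benefits. Meanwhile `storage_costs` with a Kryder rate well above the interest rate puts most of the cost in the first few years.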

It is true that rm -rf / is fairly cheap, but it isn't free. It takes some I/O and some time. Being more selective than that gets expensive. Worse, you have to spend the resources to take the decision at a time when most of the endowment has been spent, so you have very little capital to invest in what librarians call the de-accessioning decision.

So de-accessioning technology needs to have two characteristics:
  • It must preserve as much as possible of the (hypothetical) future value that could be extracted from the stuff.
  • It must be as cheap as possible to apply, which means it must involve as little human attention as possible.
The need therefore is for some form of lossy compression at the semantic level.
