Those who say "Save it all—you never know when it might come in handy for a future data-mining expedition" will have to adapt to this new reality. Below the fold I look at possible adaptations.
As I wrote a couple of years ago:
Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely.

The cost of keeping the data indefinitely went up a lot. So one of four things needs to happen:
- The Big Data zealot gets their budget increased by a lot.
- The Big Data zealot gets a lot better at turning dross into gold.
- The Big Data zealot decides that the data they collect doesn't need to be nearly as big.
- The Big Data zealot starts throwing away lots of the stuff they have collected.
It turns out that the endowment model provides useful insights into throwing stuff away as well as into storing stuff. This may have something to do with the fact that, just as people like the Big Data zealot think long-term storage is effectively free, they think that throwing stuff away is free too.
The net present value of throwing some stuff away now is the as-yet-unexpended portion of the stuff's endowment. The costs of throwing the stuff away are in two parts:
- The (hypothetical) future values that can no longer be extracted from the stuff, discounted back to now.
- The cost of deciding now to throw away this stuff, rather than some other stuff or nothing at all.
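
To make the arithmetic concrete, here is a minimal sketch of that comparison. The function names, the single constant discount rate, and the example figures are all assumptions of mine for illustration; in practice the future values are guesses and the decision cost is hard to measure.

```python
def present_value(amount, years_from_now, discount_rate):
    """Discount a future amount back to now at a constant annual rate (an assumption)."""
    return amount / (1.0 + discount_rate) ** years_from_now


def net_benefit_of_discarding(unexpended_endowment, future_values,
                              decision_cost, discount_rate):
    """Benefit of throwing stuff away now, minus the two costs listed above.

    unexpended_endowment: what is left of the stuff's endowment today.
    future_values: (years_from_now, amount) pairs, hypothetical value that
        could have been extracted from the stuff had it been kept.
    decision_cost: cost of choosing this stuff rather than other stuff or nothing.
    """
    lost_value = sum(present_value(amount, years, discount_rate)
                     for years, amount in future_values)
    return unexpended_endowment - lost_value - decision_cost


if __name__ == "__main__":
    # Hypothetical numbers, purely for illustration.
    benefit = net_benefit_of_discarding(
        unexpended_endowment=1000.0,
        future_values=[(5, 200.0), (10, 500.0)],
        decision_cost=50.0,
        discount_rate=0.05,
    )
    # A positive result argues for discarding; a negative one for keeping.
    print(f"net benefit of discarding now: {benefit:.2f}")
```

Note how the decision cost is subtracted at full value, because it is paid now, while the lost future value shrinks with the discount rate.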
It is true that rm -rf / is fairly cheap, but it isn't free. It takes some I/O and some time. Being more selective than that gets expensive. Worse, you have to spend the resources to take the decision at a time when most of the endowment has been spent, so you have very little capital to invest in what librarians call the de-accessioning decision.

So de-accessioning technology needs to have two characteristics:
- It must preserve as much as possible of the (hypothetical) future value that could be extracted from the stuff.
- It must be as cheap as possible to apply, which means it must involve as little human attention as possible.