Monday, May 14, 2012

Let's Just Keep Everything Forever In The Cloud

Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:
Dr Fader ... coins the terms "data fetish" and "data fetishist" to describe the belief that people and organisations need to capture and hold on to every scrap of data, just in case it might be important down the road. (I recently completed a Big Data survey in which a large proportion of respondents said they intend to keep their data “forever”. Great news for the tech industry, for sure.)
The full interview is worth reading, but I want to focus on one comment, which is similar to things I hear all the time:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."
Follow me below the fold for some thoughts on data hoarding.

Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The Big Data zealot gets away with saying things like this because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely. They believe that because they expect Kryder's Law to continue.

Let's imagine that everyone thought that way, and decided to keep everything forever. The natural place to put it would be in S3. According to IDC, in 2011 the world stored 1.8 Zettabytes (1.8 billion TB) of data. If we decided to keep it all for the long term in the cloud, we would be effectively endowing it. How big would the endowment be? Applying our model, starting with S3's current highest-volume price of $0.055/GB/mo and assuming that price continues to drop at the 10%/yr historic rate for S3's largest tier, we need an endowment of about $6.3K/TB. So the net present value of the cost of keeping all the world's 2011 data in S3 would be about $11.4 trillion. The 2011 Gross World Product (GWP) at purchasing power parity is almost $80 trillion. So keeping 2011's data would consume 14% of 2011's GWP. The world would be writing S3 a check each month of the first year for almost $100 billion, unless the world got a volume discount.
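
The arithmetic in this paragraph can be checked with a short Python sketch. It does not reimplement the endowment model: the $6.3K/TB figure is taken as given from the post, and the use of decimal units (1000 GB per TB) is an assumption. The constant and function names are made up for illustration.

```python
# Figures quoted in the post (2011/2012 values).
WORLD_DATA_TB = 1.8e9          # 1.8 Zettabytes expressed in TB (IDC, 2011)
ENDOWMENT_PER_TB = 6300.0      # dollars per TB; the model's output, taken as given
S3_PRICE_PER_GB_MONTH = 0.055  # dollars; S3's highest-volume tier
GWP_2011 = 80e12               # dollars; "almost $80 trillion"
GB_PER_TB = 1000               # assumption: decimal units


def endowment_summary():
    """Return (total endowment, share of GWP, first-year monthly bill) in dollars."""
    total = WORLD_DATA_TB * ENDOWMENT_PER_TB
    share = total / GWP_2011
    monthly = WORLD_DATA_TB * GB_PER_TB * S3_PRICE_PER_GB_MONTH
    return total, share, monthly


total, share, monthly = endowment_summary()
print(f"Endowment:    ${total / 1e12:.1f} trillion")
print(f"Share of GWP: {share:.0%}")
print(f"Monthly bill: ${monthly / 1e9:.0f} billion")
```

With the rounded $6.3K/TB input the total comes out a little under the $11.4 trillion quoted above; the difference is rounding in the per-TB figure, and the 14% share and the roughly $100 billion monthly bill are unaffected.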

IDC estimates that 2011's data was 50% larger than 2010's; I believe their figure for the long-run annual growth of data is 57%/yr. Even if it is only 50%, compare that with even the most optimistic Kryder's Law projections of around 30%. But we're using S3, and a 10% rate of cost decrease. So 2012's endowment will be roughly (50-10)=40% bigger than 2011's, and so on into the future. The World Bank estimates that in 2010 GWP grew 5.1%. Assuming this growth continues, endowing 2012's data will consume 19% of GWP. On these trends, endowing 2018's data will consume more than the entire GWP for the year.
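
A minimal Python sketch of this projection follows. It assumes the starting share is 11.4/80 of 2011's GWP, that each year's endowment grows by a fixed rate, and that GWP grows by 5.1% every year. The function name and its parameters are hypothetical.

```python
def first_year_over_gwp(start_year, start_share, endowment_growth, gwp_growth):
    """Project the endowment's share of GWP forward one year at a time.

    Returns the first year in which the share exceeds 100%, plus the
    share computed for every year along the way.
    """
    year, share = start_year, start_share
    shares = {year: share}
    while share <= 1.0:
        year += 1
        # The endowment grows faster than GWP, so the share compounds upward.
        share *= (1 + endowment_growth) / (1 + gwp_growth)
        shares[year] = share
    return year, shares


# The post's inputs: 40%/yr endowment growth, 5.1%/yr GWP growth.
year, shares = first_year_over_gwp(2011, 11.4 / 80, 0.40, 0.051)
print(f"2012 share of GWP: {shares[2012]:.0%}")
print(f"First year above 100% of GWP: {year}")
```

Under these inputs the sketch gives about 19% for 2012 and crosses 100% in 2018, matching the figures above. The 40% is a subtraction of the two rates; compounding them instead (1.5 × 0.9 = 1.35, or 35%/yr) delays the crossing by about a year without changing the conclusion.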

So, we're going to have to throw stuff away. Even if we believe keeping stuff is really cheap, it's still too expensive. The bad news is that deciding what to keep and what to throw away isn't free either. Ignoring the problem incurs the costs of keeping the data; dealing with the problem incurs the costs of deciding what to throw away. We may be in the bad situation of being unable to afford either to keep or to throw away the data we generate. Perhaps we should think more carefully before generating it in the first place. Of course, thought of that kind isn't free either ...


  1. I like to think of myself as a Data Therapist, helping some overcome their data fetish.

    But most of the time I suspect I am just enabling.

    A really important question for an archive to ask itself over and over again is, "Why do we need to keep this?"

    Nice post, David. Thanks for the link to the article.

  2. Amazon designs S3 for 11 nines of reliability, or a 10^-11 chance that an object will be unrecoverable in a given year. The average size of an object is a megabyte. Thus 2011's 1.8 Zettabytes would be 1.8*10^15 objects, and S3 would lose 18,000 of 2011's objects each year, or 18GB/yr of data.

  3. Commentary on this issue from the New South Wales Archives with some interesting links to the work of Barclay T Blair is here.

  4. In June 2012 S3 stored over a trillion objects. If S3 achieves its 11 nines design goal for object durability, it will lose 10 objects of 1MB each every year, or 10MB/yr.