Thursday, October 18, 2018

Betteridge's Law Violation

Erez Zadok points me to Wasim Ahmed Bhat's Is a Data-Capacity Gap Inevitable in Big Data Storage? in IEEE Computer. It is a violation of Betteridge's Law of Headlines because the answer isn't no. But what, exactly, is this gap? Follow me below the fold.

Bhat's introduction draws heavily on IDC's Digital Universe report which:
forecasts that the amount of data generated globally will reach 44 zettabytes (ZBs) in 2020 and 163 ZBs in 2025. Even the estimates are increasing, as earlier it was forecast to be 35 ZBs in 2020 instead of 44.
[Figure: Seagate's projections]
And on Seagate's marketing materials based upon it:
Seagate ... subscribes to IDC’s estimate that around 13 ZBs of 44 ZBs generated in 2020 would be critical and should be stored. ... Seagate also anticipates that the storage capacity available in 2020 will not be able to fulfill the minimum required storage demand, and will lead to a data-capacity gap of at least 6 ZBs
[Figure: areal density growth, derived from INSIC data]
Bhat goes on to point out the slowing of Kryder's Law that I've been writing about since 2011, with a nice graph based on data from the Information Storage Industry Consortium. It shows that the rate of increase of areal density from 1991 to 2009 was never less than 39%, and that since then it has been 16%, nicely between the two projections (10% and 20%) Preeti Gupta and I made in 2014.


[Figure: Preeti Gupta's 2014 graph]
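
To see how much this slowdown matters, here is a minimal sketch (the 10-year horizon and the assumption of constant rates are mine, purely for illustration) of how the Kryder rates discussed above compound into areal density, and thus roughly into cost per bit:

```python
# Illustrative only: compound the annual areal-density growth rates discussed
# above over a 10-year horizon, assuming each rate stays constant.

def density_multiple(annual_rate, years=10):
    """Factor by which areal density grows at a constant annual rate."""
    return (1 + annual_rate) ** years

for rate in (0.39, 0.20, 0.16, 0.10):
    growth = density_multiple(rate)
    print(f"{rate:.0%}/yr for 10 years -> {growth:5.1f}x density, "
          f"~{1 / growth:.2f}x cost per bit")
```

At 39%/yr a bit costs roughly 1/27th as much a decade later; at 16%/yr it still costs roughly a quarter as much, which is why the slowdown matters so much for storage economics.
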
Bhat continues with an analysis of the prospects for improvements in storage technology that tracks closely (and cites) my The Medium-Term Prospects for Long-term Storage Systems from nearly two years ago. He also cites our 2012 paper The Economics of Long-Term Digital Storage and our 2014 paper An Economic Perspective of Disk vs. Flash Media in Archival Storage. It is always nice when one's work is cited!

Unfortunately, Bhat neither cites nor seems to have read my 2016 post Where Did All Those Bits Go? in which I point out a number of flaws in IDC's reports, and in analyses, such as Seagate's, that are based on them. The most important of these flaws is the implicit assumption that the demand for storage is independent of the price of storage:
Seagate ... subscribes to IDC’s estimate that around 13 ZBs of 44 ZBs generated in 2020 would be critical and should be stored. ... the storage capacity available in 2020 will not be able to fulfill the minimum required storage demand, and will lead to a data-capacity gap of at least 6 ZB
Note the lack of any concept of the price of storing the 13ZB. Since it is evident that neither IDC, nor Seagate, nor Bhat believes that the 6ZB of additional media "required" would be available at any price, something has to give. But what?

In practice, big (and indeed any) data storage users compare a prediction of the value to be realized by storing the data with the cost of doing so. Data whose potential value does not justify its storage doesn't get stored, which is what will happen to the 6ZB.
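
As a toy illustration of that comparison (the function and all the numbers are hypothetical, not anything from IDC or Seagate):

```python
# Hypothetical sketch of the store-or-discard decision: data gets stored only
# if its predicted value exceeds the predicted cost of storing it.

def worth_storing(predicted_value, size_tb, cost_per_tb):
    """True if the predicted value justifies the predicted storage cost."""
    return predicted_value > size_tb * cost_per_tb

# Made-up numbers purely for illustration: 100TB at $20/TB.
print(worth_storing(predicted_value=500.0,  size_tb=100, cost_per_tb=20.0))  # False
print(worth_storing(predicted_value=5000.0, size_tb=100, cost_per_tb=20.0))  # True
```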

IDC, Seagate and Bhat suffer from the collision of two ideas, both of which are wrong:
  • The "Big Data" hype, which is that the value of keeping everything is almost infinite.
  • The "storage is free" idea left over from the long-gone days of 40+% Kryder rates.
If storage is free and the value to be extracted from stored data is non-zero, of course the extra 6ZB "should be stored". But Storage Will Be A Lot Less Free Than It Used To Be, and the value to be extracted from some data will be a lot greater than from others. The more valuable data is more likely to be stored than the less valuable. Typically, the value of data decays with time, in most cases quite rapidly. Another flaw in the IDC analysis is that there is no concept of how long the data is to be stored, and thus of how quickly the media storing it can be re-used. Again, the more valuable data is likely to be stored longer than the less valuable.
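
A sketch of the same decision extended over time (the exponential-decay model, the half-life and all the numbers are my own illustrative assumptions, not anything from IDC or Bhat): once the remaining value of the data has decayed below the cost of keeping it for another month, the media it occupies is better re-used.

```python
# Illustrative only: data value decays exponentially while storage cost accrues
# per month. Keep the data while the remaining value still exceeds the cost of
# storing it for one more month, then free the media for re-use.

def months_worth_keeping(initial_value, half_life_months, size_tb,
                         cost_per_tb_month, max_months=1200):
    monthly_cost = size_tb * cost_per_tb_month
    for month in range(1, max_months + 1):
        remaining_value = initial_value * 0.5 ** (month / half_life_months)
        if remaining_value <= monthly_cost:
            return month - 1   # last month it was still worth keeping
    return max_months

# Made-up numbers: $10,000 of value halving every 6 months, 100TB at $5/TB/month.
print(months_worth_keeping(10_000, 6, 100, 5.0))   # -> 25
```

Under these made-up numbers the data stops being worth its keep after about two years, at which point the 100TB can hold something more valuable.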

Thus the gap to which Bhat refers is between what data centers would store if storage were free, and what they will store given the actual cost of storing it. This gap would only be something new or unexpected if storage were free. This hasn't been the perception, let alone the reality, since very early in the history of Big Data centers.

2 comments:

  1. This is something I repeatedly run up against myself, though I didn't think about the notion of data lifetime and media re-use (stealing, with attribution to you ;))

    Broadly, I tend to think:
    1) The forecasts are inaccurate, or at the very least, very misleading. Leading to
    2) The data isn't actually that valuable to begin with / we haven't figured out how to get sufficient value out of it.

  2. Chris Mellor at The Register reports on quarterly drive shipments:

    "Total capacity shipped reached 230EB, up 31 per cent annually. Total enterprise capacity shipped was 113.5EB, up 71 per cent."

    Two points:

    1) That is adding data center storage capacity at nearly half a Zettabyte a year, which puts Bhat's 6ZB gap into perspective (rough arithmetic in the sketch below).

    2) Nearline data center drives are half of total drive production. The transition of the hard drive market to data centers is progressing rapidly.
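
    Rough arithmetic behind those two points (figures from the quoted Register report; annualising a single quarter by simply multiplying by four is my own simplification):

```python
# Figures from the quoted Register report; the x4 annualisation is a rough estimate.
total_eb_quarter = 230.0        # total HDD capacity shipped in the quarter, EB
enterprise_eb_quarter = 113.5   # enterprise/nearline capacity shipped, EB

print(f"Enterprise capacity per year: ~{enterprise_eb_quarter * 4 / 1000:.2f} ZB")
print(f"Enterprise share of the quarter's shipments: "
      f"{enterprise_eb_quarter / total_eb_quarter:.0%}")
```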
