Bhat's introduction draws heavily on IDC's Digital Universe report which:
forecasts that the amount of data generated globally will reach 44 zettabytes (ZBs) in 2020 and 163 ZBs in 2025. Even the estimates are increasing, as earlier it was forecast to be 35 ZBs in 2020 instead of 44.
Seagate's projections
Seagate ... subscribes to IDC’s estimate that around 13 ZBs of 44 ZBs generated in 2020 would be critical and should be stored. ... Seagate also anticipates that the storage capacity available in 2020 will not be able to fulfill the minimum required storage demand, and will lead to a data-capacity gap of at least 6 ZBs
Derived from Preeti Gupta's 2014 graph
Unfortunately, Bhat does not cite or seem to have read my 2016 post Where Did All Those Bits Go? in which I point out a number of flaws in IDC's reports, and in the analyses such as Seagate's based on them. The most important of these flaws is the implicit assumption that the demand for storage is independent of the price of storage:
Seagate ... subscribes to IDC’s estimate that around 13 ZBs of 44 ZBs generated in 2020 would be critical and should be stored. ... the storage capacity available in 2020 will not be able to fulfill the minimum required storage demand, and will lead to a data-capacity gap of at least 6 ZB
Note the lack of any concept of the price of storing the 13ZB. Since it is evident that neither IDC nor Seagate nor Bhat believes that the 6ZB of additional media "required" would be available at any price, something has to give. But what?
In practice every data storage user, Big Data or otherwise, compares a prediction of the value to be realized by storing the data with the cost of doing so. Data whose potential value does not justify the cost of its storage doesn't get stored, which is what will happen to the 6ZB.
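The comparison can be sketched in a few lines of Python. This is only an illustration: the `Dataset` class, the `split_by_cost` helper, and every number in it are hypothetical, chosen so that 13ZB of candidate data splits into data worth storing and data that is not. None of the figures come from IDC, Seagate or Bhat.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_zb: float       # size in zettabytes
    value_per_zb: float  # predicted value of storing it, arbitrary units per ZB


def split_by_cost(datasets, cost_per_zb):
    """Return (stored, discarded); data is stored only if its predicted value covers the cost."""
    stored = [d for d in datasets if d.value_per_zb >= cost_per_zb]
    discarded = [d for d in datasets if d.value_per_zb < cost_per_zb]
    return stored, discarded


# Hypothetical figures: 13 ZB of candidate data in three value classes.
datasets = [
    Dataset("high-value", 4.0, 10.0),
    Dataset("medium-value", 3.0, 4.0),
    Dataset("low-value", 6.0, 1.0),
]

# Hypothetical cost of storage, in the same arbitrary units per ZB.
stored, discarded = split_by_cost(datasets, cost_per_zb=3.0)
stored_zb = sum(d.size_zb for d in stored)
discarded_zb = sum(d.size_zb for d in discarded)
print(f"stored {stored_zb} ZB, not stored {discarded_zb} ZB")
```

With `cost_per_zb=0.0` nothing is discarded, which is the "storage is free" case. Any positive cost opens a "gap" that is simply the data not worth keeping at that price.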
IDC, Seagate and Bhat suffer from the collision of two ideas, both of which are wrong:
- The "Big Data" hype, which is that the value of keeping everything is almost infinite.
- The "storage is free" idea left over from the long-gone days of 40+% Kryder rates.
Storage Will Be A Lot Less Free Than It Used To Be
The value to be extracted from some data will be a lot more than from others. The more valuable data is more likely to be stored than the less valuable. Typically, the value of data decays with time, in most cases quite rapidly. Another flaw in the IDC analysis is that there is no concept of how long the data is to be stored, and thus how quickly the media storing it can be re-used. Again, the more valuable data is likely to be stored longer than the less valuable.
Thus the gap to which Bhat refers is between what data centers would store if storage were free, and what they will store given the actual cost of storing it. This gap would only be something new or unexpected if storage were free. This hasn't been the perception, let alone the reality, since very early in the history of Big Data centers.
This is something I repeatedly run up against myself, though I didn't think about the notion of data lifetime and media re-use (stealing, with attribution to you ;))
Broadly, I tend to think:
1) The forecasts are inaccurate, or at the very least, very misleading. Leading to
2) The data isn't actually that valuable to begin with, or we haven't figured out how to get sufficient value out of it.
Chris Mellor at The Register reports on quarterly drive shipments:
"Total capacity shipped reached 230EB, up 31 per cent annually. Total enterprise capacity shipped was 113.5EB, up 71 per cent."
Two points:
1) That is adding data center storage capacity at nearly half a Zettabyte a year, which puts Bhat's 6ZB gap into perspective.
2) Nearline data center drives are half of total drive production. The transition of the hard drive market to data centers is progressing rapidly.
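The arithmetic behind these two points can be checked with a short sketch. It assumes that the quarterly figures quoted from The Register hold steady for four quarters, and that 1ZB = 1000EB. The variable names are mine, and the share computed is of shipped capacity, not of drive units.

```python
EB_PER_ZB = 1000  # decimal units: 1 ZB = 1000 EB

total_eb_per_quarter = 230.0       # total capacity shipped in the quarter
enterprise_eb_per_quarter = 113.5  # enterprise capacity shipped in the quarter

# Assumption: the quarterly shipment rate holds for a full year.
enterprise_zb_per_year = enterprise_eb_per_quarter * 4 / EB_PER_ZB
enterprise_share = enterprise_eb_per_quarter / total_eb_per_quarter

print(f"enterprise capacity: {enterprise_zb_per_year:.3f} ZB/year")
print(f"enterprise share of shipped capacity: {enterprise_share:.1%}")
```

At that rate enterprise shipments come to about 0.45ZB a year and just under half of the total capacity shipped.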