Thursday, September 22, 2016

Where Did All Those Bits Go?

Lay people reading the press about storage, and even some "experts" writing in the press about storage, believe two things:
  • per byte, storage media are getting cheaper very rapidly (Kryder's Law), and
  • the demand for storage greatly exceeds the supply.
These two things cannot both be true. Follow me below the fold for an analysis of a typical example, Lauro Rizatti's article in EE Times entitled Digital Data Storage is Undergoing Mind-Boggling Growth.

Why can't these statements both be true? If the demand for storage greatly exceeded the supply, the price would rise until supply and demand were in balance.

In 2011 we actually conducted an experiment to show that this is what happens. We nearly halved the supply of disk drives by flooding large parts of Thailand including the parts where disks were manufactured. This flooding didn't change the demand for disks, because these parts of Thailand were not large consumers of disks. What happened? As shown in this 2013 graph from Preeti Gupta, the price of disks immediately nearly doubled, choking off demand to match the available supply, and then fell slowly as supply recovered.
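One rough way to read the Thailand numbers is as a measurement of how sharply demand responds to price. The sketch below assumes a constant-elasticity demand curve, Q = k * P**(-e). That model is an illustrative assumption, not something claimed in this post, and the function names are hypothetical. The round ratios 0.5 and 2.0 stand in for "nearly halved" and "nearly doubled".

```python
import math

def implied_elasticity(supply_ratio: float, price_ratio: float) -> float:
    """Elasticity e in the ASSUMED model Q = k * P**(-e).

    supply_ratio: quantity after the shock divided by quantity before.
    price_ratio:  price after the shock divided by price before.
    """
    return -math.log(supply_ratio) / math.log(price_ratio)

def clearing_price_ratio(supply_ratio: float, elasticity: float) -> float:
    """Price change needed to shrink demand to the reduced supply."""
    return supply_ratio ** (-1.0 / elasticity)

# Rounded stand-ins for "nearly halved" supply and "nearly doubled" price.
e = implied_elasticity(supply_ratio=0.5, price_ratio=2.0)
print(f"implied elasticity: {e:.2f}")
print(f"price ratio if supply halves: {clearing_price_ratio(0.5, e):.2f}")
```

Under that assumption the implied elasticity is about 1: buyers cut their purchases roughly in proportion to the price increase, which is the opposite of what insatiable demand would look like.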

So we have two statements. The first is "per byte, storage media are getting cheaper very rapidly". We can argue about exactly how rapidly, but there are decades of factual data recording the drop in cost per byte of disk and other storage media (see Preeti's graph). So it is reasonable to believe the first statement. Anyone who has been buying computers for a few years can testify to it.

The second is "the demand for storage greatly exceeds the supply". The first statement is true, so this has to be false. Why do people believe it? The evidence for the excess of demand over supply in Rizatti's article is this graph. Where does the graph come from? What exactly do the two bars on the graph show?

The orange bars are labeled "output", which I believe represents the total number of bytes of storage media manufactured each year. This number should be fairly accurate, but it overstates the amount of newly created information stored each year for many reasons:
  • Newly manufactured media does not instantly get filled. There are delays in the distribution pipeline - for example I have nearly half a terabyte of unwritten DVD-R media sitting on a shelf. This is likely to be a fairly small percentage.
  • Some media that gets filled turns out to be faulty and gets returned under warranty. This is likely to be a fairly small percentage.
  • Some of the newly manufactured media replaces obsolete media, so isn't available to store newly created information.
  • Because of overhead from file systems and so on, newly created information occupies more bytes of storage than its raw size. This is typically a small percentage.
  • If newly created information does actually get written to a storage medium, several copies of it normally get written. This is likely to be a factor of about two.
  • Some newly created information exists in vast numbers of copies. For example, my iPhone 6 claims to have 64GB of storage. That corresponds to the amount of newly manufactured storage medium (flash) it consumes. But about 8.5GB of that is consumed by a copy of iOS, the same information that consumes 8.5GB in every iPhone 6. Between October 2014 and October 2015 Apple sold 222M iPhones, so those 8.5GB of information are replicated 222M times, consuming about 1.9EB of the storage manufactured in that year.
The mismatch between the blue and orange bars is much greater than it appears.
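The iPhone arithmetic in the last bullet can be checked in a few lines. This is a minimal sketch using only the figures quoted above; it assumes decimal units (1GB = 10^9 bytes, 1EB = 10^18 bytes), and the variable names are just for illustration.

```python
GB = 10**9   # decimal gigabyte (assumed)
EB = 10**18  # decimal exabyte (assumed)

ios_copy_bytes = 8.5 * GB       # iOS footprint per phone, from the text
phone_flash_bytes = 64 * GB     # advertised capacity of the iPhone 6
iphones_sold = 222_000_000      # October 2014 to October 2015, from the text

# Flash consumed worldwide by identical copies of iOS in that year.
replicated_bytes = ios_copy_bytes * iphones_sold
# Share of each phone's flash that holds no newly created information.
ios_share = ios_copy_bytes / phone_flash_bytes

print(f"iOS copies: {replicated_bytes / EB:.2f} EB")
print(f"share of each phone: {ios_share:.1%}")
```

So roughly 13% of the flash in each of those phones, and about 1.9EB in total, stores the same bytes over and over.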

What do the blue bars represent? They are labeled "demand" but, as we have seen, the demand for storage depends on the price. There's no price specified for these bars. The caption of the graph says "Source: Recode", which I believe refers to this 2014 article by Rocky Pimentel entitled Stuffed: Why Data Storage Is Hot Again. (Really!). Based on the IDC/EMC Digital Universe report, Pimentel writes:
The total amount of digital data generated in 2013 will come to 3.5 zettabytes (a zettabyte is 1 with 21 zeros after it, and is equivalent to about the storage of one trillion USB keys). The 3.5 zettabytes generated this year will triple the amount of data created in 2010. By 2020, the world will generate 40 zettabytes of data annually, or more than 5,200 gigabytes of data for every person on the planet.
The operative words are "data generated". Not "data stored permanently", nor "bytes of storage consumed". The numbers projected by IDC for "data generated" have always greatly exceeded the numbers actually reported for storage media manufactured in a given year, which in turn as discussed above exaggerate the capacity added to the world's storage infrastructure. So where were the extra projected bytes stored?
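As an aside, the unit conversions in the quoted passage can be sanity-checked. The sketch below assumes decimal units (1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes) and uses only the figures Pimentel gives; the variable names are hypothetical.

```python
ZB = 10**21        # "1 with 21 zeros after it"
GB = 10**9         # decimal gigabyte (assumed)
TRILLION = 10**12

# 1 ZB spread over a trillion USB keys: implied size of one key.
usb_key_gb = ZB / TRILLION / GB

# 40 ZB at 5,200 GB per person: implied world population in 2020.
implied_population = 40 * ZB / (5200 * GB)

# Projected growth in annual "data generated" from 2013 to 2020.
growth_factor = 40 / 3.5

print(f"implied USB key size: {usb_key_gb:.0f} GB")
print(f"implied population: {implied_population / 1e9:.2f} billion")
print(f"2013 to 2020 growth: {growth_factor:.1f}x")
```

The figures are internally consistent (1GB keys, about 7.7 billion people), so the problem is not the arithmetic. It is the leap from "data generated" to demand for storage.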

The assumption behind "demand exceeds supply" is that every byte of "data generated" in the IDC report is a byte of demand for permanent storage capacity. In a world where storage was free there would still be much data generated that was never intended to be stored for any length of time, and would thus not represent demand for storage media.

In the real world, where Storage Will Be Much Less Free Than It Used To Be, there are at least two answers to the question Why Not Store It All? Data costs money to store and, as Maciej Cegłowski's Haunted By Data points out, it costs a whole lot more when it leaks.

WD results
There's another way of looking at the idea that Digital Data Storage is Undergoing Mind-Boggling Growth. What does it mean for an industry to have Mind-Boggling Growth? It means that the companies in the industry have rapidly increasing revenues and, normally, rapidly increasing profits.

Seagate results
The graphs show the results for the two companies that manufacture the bulk of the storage bytes each year. For both companies, revenues are flat or decreasing and profits are decreasing. These do not look like companies faced by insatiable demand for their products; they look like mature companies facing increasing difficulty in scaling their technology.

For a long time, discussions of storage have been bedevilled by the confusion between IDC's projections for "data generated" and the actual demand for storage media. Don't get fooled. Any articles using the IDC/EMC numbers as storage demand can be ignored.

1 comment:

Wes Felter said...

This is a great example of the difference between what people say they want and what they're willing to pay for.

It also reminds me of Odlyzko's work debunking the myth that the Internet was growing 10X per year when it was obvious that no ISP was ordering 10X as many routers and no router vendor was getting 10X revenue growth.