Wednesday, May 16, 2018

Longer talk at MSST2018

I was invited to give both a longer and a shorter talk at the 34th International Conference on Massive Storage Systems and Technology at Santa Clara University. Below the fold is the text, with links to the sources, of the longer talk, which was updated from, and shares its title with, The Medium-Term Prospects for Long-Term Storage Systems.

You don't need to take notes or photograph the slides, because the whole text of my talk with links to the sources will go up on my blog shortly after I sit down. For almost the last two decades I've been working at the Stanford Libraries on the problem of keeping data safe for the long term. The most important lesson I've learned is that this is fundamentally an economic problem; we know how to do it but we don't want to pay enough to have it done.
  • How far into the future should we be looking?
  • What do the economics of storing data for that long look like?
  • How long should the media last?
  • How reliable do the media need to be?

How far into the future?

Source: Disks for Data Centers
Discussions of storage tend to focus on the sexy, expensive, high-performance market. Those systems are migrating to flash. The data in those systems is always just a cache. In the long term, that data lives further down the hierarchy. What I'm talking about is the next layer down, the capacity systems where all the cat videos, snapshots, old blog posts and the scientific data live. And, below them, the archival layer where data goes to hibernate. These are where data lives for the long term.

The Square Kilometer Array plans to generate a Petabyte a day of data that is expected to be used for 50 years. How hard a problem is planning with this long a horizon? Let's go back 50 years or so and see.


IBM2314s (source)
In 1966, as I was writing my first program, disk technology was about 10 years old; the IBM 350 RAMAC was introduced in 1956. The state of the art was the IBM 2314. Each removable disk pack stored 29MB on 11 platters with a 310KB/s data transfer rate, roughly equivalent to 60MB/rack. The SKA would have needed to add nearly 17M, or about 10 square kilometers, of racks each day.
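
The rack arithmetic can be checked with a short Python sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) and takes the 60MB/rack and 10 square kilometer figures from the text; the variable names are illustrative only.

```python
# Assumption: decimal units (1 PB = 10**15 bytes, 1 MB = 10**6 bytes).
PETABYTE = 10**15
MEGABYTE = 10**6

ska_bytes_per_day = 1 * PETABYTE   # SKA's planned daily output (from the text)
rack_capacity = 60 * MEGABYTE      # rough 1966 capacity per rack (from the text)

racks_per_day = ska_bytes_per_day / rack_capacity
print(f"{racks_per_day / 1e6:.1f} million racks per day")

# Floor area per rack implied by the text's "about 10 square kilometers".
floor_area_m2 = 10 * 1_000_000
print(f"{floor_area_m2 / racks_per_day:.2f} square meters per rack")
```

The result is a little under 17 million racks a day, with an implied footprint of well under a square meter per rack.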

R. M. Fano's 1967 paper The Computer Utility and the Community reports that for MIT's IBM 7094-based CTSS:
the cost of storing in the disk file the equivalent of one page of single-spaced typing is approximately 11 cents per month.
It would have been hard to believe a projection that in 2016 it would be more than 7 orders of magnitude cheaper.

Your 1966 alter-ego's data management plan would be correct in predicting that 50 years later the dominant media would be "disk" and "tape", and that disk's lower latency would carry a higher cost per byte. But it's hard to believe that any more detailed predictions about the technology would be correct. The extraordinary 30-year history of 30-40% annual density increase and thus cost per byte decrease, the Kryder rate, had yet to start.

Although disk is a 60-year-old technology, a 50-year time horizon may seem too long to be useful. But a 10-year time horizon is definitely too short to be useful. Storage is not just a technology, but also a multi-billion dollar manufacturing industry dominated by a few huge businesses, with long, hard-to-predict lead times.

Seagate 2008 roadmap
To illustrate the lead times, here is a Seagate roadmap slide from 2008 predicting that perpendicular magnetic recording (PMR) would be replaced in 2009 by heat-assisted magnetic recording (HAMR), which would in turn be replaced in 2013 by bit-patterned media (BPM).

In 2016, the trade press reported that:
Seagate plans to begin shipping HAMR HDDs next year.
ASTC 2016 roadmap
Here is a 2016 roadmap from ASTC showing HAMR starting in 2017 and BPM in 2021. So in 8 years HAMR went from next year to next year, and BPM went from 5 years out to 5 years out. The reason for this real-time schedule slip is that as technologies get closer and closer to the physical limits, the difficulty and above all cost of getting from lab demonstration to shipping in volume increases exponentially.

This schedule slip phenomenon isn't peculiar to hard disks; it is happening with Moore's Law, in pharmaceuticals, and in many other industries. Kelvin Stott argues, using drug discovery as an example, that it is a natural consequence of the Law of Diminishing Returns; because R&D naturally explores the avenues with the best bang for the buck first, later avenues perform worse.

Seagate 2018 roadmap
The most recent Seagate roadmap pushes HAMR shipments into 2020, so they are now slipping faster than real-time. Western Digital has given up on HAMR and is promising that Microwave Assisted Magnetic Recording (MAMR) is only a year out. BPM has dropped off both companies' roadmaps.
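
To make "real-time" and "faster than real-time" slip concrete, here is a minimal Python sketch using only the HAMR dates quoted above. The function names are hypothetical, and treating the 2018 roadmap as published in 2018 is an assumption.

```python
# (year the roadmap was published, year it predicted HAMR would ship),
# as quoted in the text. The 2018 publication year is an assumption.
hamr_roadmaps = [(2008, 2009), (2016, 2017), (2018, 2020)]

def years_out(roadmap):
    """How far in the future a roadmap places the ship date."""
    published, predicted = roadmap
    return predicted - published

def slip_rate(earlier, later):
    """Years of ship-date slip per calendar year between two roadmaps."""
    return (later[1] - earlier[1]) / (later[0] - earlier[0])

for roadmap in hamr_roadmaps:
    print(roadmap[0], "roadmap: HAMR is", years_out(roadmap), "year(s) out")

print("2008 to 2016 slip rate:", slip_rate(hamr_roadmaps[0], hamr_roadmaps[1]))
print("2016 to 2018 slip rate:", slip_rate(hamr_roadmaps[1], hamr_roadmaps[2]))
```

A slip rate of 1 means the ship date recedes one year for every year that passes; anything above 1 means the date is getting further away.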


Historically, tape was the medium of choice for long-term storage. Its basic recording technology is around 8 years behind hard disk, so it has a much more credible technology road-map than disk. But its importance is fading rapidly. There are several reasons:
  • Tape is a very small market in unit terms:
    Just under 20 million LTO cartridges were sent to customers last year. As a comparison let's note that WD and Seagate combined shipped more than 350 million disk drives in 2015; the tape cartridge market is less than 5.7 per cent of the disk drive market in unit terms
  • In effect there is now a single media supplier, raising fears of price gouging and supply vulnerability. The disk market has consolidated too, but there are still two very viable suppliers.
  • The advent of data-mining and web-based access to archives makes the long access latency of tape less tolerable.
  • To maximize the value of the limited number of slots in the robots it is necessary to migrate data to new, higher-capacity cartridges as soon as they appear. This has two effects. First, it makes the long data life of tape media irrelevant. Second, it consumes a substantial fraction of the available bandwidth, up to a quarter in some cases.


Flash as a data storage technology is almost 30 years old. Eli Harari filed the key enabling patent in 1988, describing multi-level cell, wear-leveling and the Flash Translation Layer. Flash has yet to make a significant impact on the capacity storage market. Perhaps at some point in the future it will displace hard disk as the medium for this level of the hierarchy. There are two contrasting views as to how long this will take.

First, the conventional wisdom as expressed by the operators of cloud services and the disk industry.

Proportion of shipments by media type
This graph, based on historical data from Robert Fontana and Gary Decad of IBM, shows the small and slowly rising proportion of total bytes shipped as NAND flash. At current rates of change, it won't be until around 2040 that flash forms the majority of bytes shipped. And it is likely that both flash and hard disk will have run into their physical limits by then.

Courtesy Tom Coughlin
Tom Coughlin's graph of exabytes shipped includes industry projections for the next 5 years. It sends the same message, that the proportion of flash in the mix will grow only slowly.

Although flash is displacing disk from markets such as PCs, laptops and servers, Eric Brewer's fascinating keynote at the 2016 FAST conference started from the assertion that the only feasible medium for bulk data storage in the cloud was spinning disk. The argument is that flash, despite its many advantages, is and will remain too expensive for the capacity layer.

NAND vs. HDD capex/TB
Why is this? The graph of the ratio of capital expenditure per TB of flash and hard disk shows that each exabyte of flash contains about 50 times as much capital as an exabyte of disk. Because:
factories to build 3D NAND are vastly more expensive than plants that produce planar NAND or HDDs -- a single plant can cost $10 billion
no-one is going to invest the roughly $80B needed to displace hard disks because the investment would not earn a viable return.

Second, the view from the flash advocates. They argue that the fabs will be built, because they are no longer subject to conventional economics. The governments of China, Japan, and other countries are stimulating their economies by encouraging investment, and they regard dominating the market for essential chips as a strategic goal, something that justifies investment. They are thinking long-term, not looking at the next quarter's results. The flash companies can borrow at very low interest rates, so even if they do need to show a return, they only need to show a very low return.

This quarter's results from Seagate and Western Digital show that the decline in unit shipments for all segments except "enterprise nearline" continues, but that "enterprise nearline" is up 93% (Seagate) and 31% (WD) year-on-year. There's no evidence of massive new flash fab construction, so it looks to me like the conventional wisdom is correct.

Ratio of NAND to HDD $/GB
Flash will only displace hard disk for bulk storage if it gets closer to price parity. This graph from Fontana & Decad's data shows that, except for the impact of the Thai floods, the ratio of flash to hard disk price is approximately stable. The floods reduced the ratio from about 12 to about 8. So at least two more events at the scale of the floods would be needed to get to price parity.

Alternate Media

Because no other technology is currently shipping significant volumes, manufacturing ramp times mean no other technology will significantly impact the bulk storage market in the medium term.

SanDisk shipped the first flash SSDs to GRiD Systems in 1991. Even if flash were to impact the capacity market in 2018, it would have been 27 years after the first shipment. Initial flash impact is much more likely to take 40 years. The storage technology that follows flash is probably some form of Storage Class Memory (SCM) such as XPoint. Small volumes of some forms of SCM have been shipping for a few years. Like flash, SCMs leverage much of the semiconductor manufacturing technology. Optimistically, one might expect SCM to impact the capacity market sometime in the late 2030s.

I'm not aware of any other storage technologies that could compete for the capacity market in the next three decades. SCMs have occupied the niche for a technology that exploits semiconductor manufacturing. A technology that didn't would find it hard to build the manufacturing infrastructure to ship the thousands of exabytes a year the capacity market will need by then.

Economics of Long-Term Storage

Cost vs. Kryder rate
Back in 2011 I started building a model of the economics of long-term storage using data from Backblaze and the San Diego Supercomputer Center. It plots the net present value of all the expenditures incurred in storing a fixed-size dataset for 100 years against the Kryder rate. As you can see, at the 30-40%/yr rates that prevailed until 2010, the cost is low and doesn't depend much on the precise Kryder rate. Below 20%, the cost rises rapidly and depends strongly on the precise Kryder rate.
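
The shape of that curve can be reproduced with a deliberately crude Python sketch. This is not the model described above: it assumes each year's storage cost is simply the previous year's cost reduced by a constant Kryder rate, discounted at a constant real rate. The 2% discount rate, the function name and its parameters are all assumptions for illustration.

```python
def endowment(kryder_rate, discount_rate=0.02, years=100, first_year_cost=1.0):
    """Net present value of storing a fixed-size dataset for `years` years.

    Hypothetical simplification: the yearly cost falls by `kryder_rate`
    each year and is discounted back at a constant `discount_rate`.
    """
    total = 0.0
    for year in range(years):
        cost = first_year_cost * (1 - kryder_rate) ** year
        total += cost / (1 + discount_rate) ** year
    return total

for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    print(f"Kryder rate {rate:.0%}: endowment = {endowment(rate):.1f}x first-year cost")
```

Even in this toy version, moving from 40% to 30% changes the total far less than moving from 20% to 10% does, which is the qualitative behavior the graph shows.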

2014 cost/byte projection
As it turned out, we were already well below 20%. Here is a 2014 graph from Preeti Gupta, a Ph.D. student at UC Santa Cruz, plotting $/GB against time. The red lines are projections at the industry roadmap's 20% and my less optimistic 10%. It shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • So far, the projections have panned out and we're in the region between the red lines. If we stay there, by 2020 disk costs will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
Cost vs. Kryder & Discount rates
Here's a graph from a simplified version of the model I built last year with data from the Internet Archive. The funds required to deliver on a commitment to store a chunk of data for the long term depend strongly on the future Kryder rate, especially in the first decade or two, and on the future real interest rate. Two things we aren't going to know. Industry projections of both rates have a history of optimism, and are vulnerable to natural disasters, industry consolidation, and so on. So we aren't going to know the cost; it is probably going to be a lot more expensive than we expect.

Long-Lived Media?

Every few months there is another press release announcing that some new, quasi-immortal medium such as 5D quartz or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost. The drives currently marketed for "archival" use have a shorter warranty and a shorter MTBF than the enterprise drives, so they're not expected to have long service lives.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center racks or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
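
As a rough check on these figures, the following Python sketch assumes the Kryder rate means the space needed per byte shrinks by that fraction every year. That interpretation, and the function name, are assumptions.

```python
def space_fraction(kryder_rate, years):
    """Fraction of the original space needed after `years`, assuming the
    space per byte shrinks by `kryder_rate` each year (an assumption)."""
    return (1 - kryder_rate) ** years

for rate in (0.10, 0.30):
    fraction = space_fraction(rate, 10)
    print(f"{rate:.0%}/yr for 10 years: {fraction:.3f} of the space, "
          f"or {1 / fraction:.1f}x the data in the same space")
```

Under this assumption 10%/yr gives a little over a third of the space after a decade, and 30%/yr gives a multiple of roughly 35, the same order as the round figure of 30 above.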

The reason why disks are engineered to have a 5-year service life is that, at 30-40% Kryder rates, they were going to be replaced within 5 years simply for economic reasons. But, if Kryder rates are going to be much lower going forward, the incentives to replace drives early will be much less, so a somewhat longer service life would make economic sense for the customer. From the disk vendor's point of view, a longer service life means they would sell fewer drives. Not a reason to make them.

Additional reasons for skepticism include:
  • The research we have been doing in the economics of long-term preservation demonstrates the enormous barrier to adoption that accounting techniques pose for media that have high purchase but low running costs, such as these long-lived media.
  • The big problem in digital preservation is not keeping bits safe for the long term, it is paying for keeping bits safe for the long term. So an expensive solution to a sub-problem can actually make the overall problem worse, not better.
  • These long-lived media are always off-line media. In most cases, the only way to justify keeping bits for the long haul is to provide access to them (see Blue Ribbon Task Force). The access latency scholars (and general Web users) will tolerate rules out off-line media for at least one copy. As Rob Pike said "if it isn't on-line no-one cares any more".
  • So at best these media can be off-line backups. But the long access latency for off-line backups has led the backup industry to switch to on-line backup with de-duplication and compression. So even in the backup space long-lived media will be a niche product.
  • Off-line media need a reader. Good luck finding a reader for a niche medium a few decades after it faded from the market - one of the points Jeff Rothenberg got right two decades ago.
Archival-only media such as steel tape, silica DVDs, 5D quartz DVDs, and now DNA also face some fundamental business model problems because they function only at the very bottom of the storage hierarchy. The usual diagram of the storage hierarchy, like this one from the Microsoft/UW team researching DNA storage, makes it look like the size of the market increases downwards. But that's very far from the case.

2016 Media Shipments

Medium      Exabytes   Revenue   $/GB
Hard Disk   693        $26.8B    $0.039
LTO Tape    40         $0.65B    $0.016
This table, with data extracted from Robert Fontana and Gary Decad's paper, shows that the size of the market in dollar terms decreases downwards. LTO tape is less than 1% of the media market in dollar terms and less than 5% in capacity terms. Archival media are a very small part of the storage market.

Why is this? The upper layers of the hierarchy generate revenue; the archival layer is purely a cost. If the data are still generating revenue, at least one copy is on flash or hard disk. Even if there is a copy in the archive, that one isn't generating revenue. Facebook expects the typical reason for a read request for data from their Blu-Ray cold storage will be a subpoena. Important, but not a revenue generator. So archival media are a market where customers are reluctant to spend, because there is no return on the investment.

This means that both revenue and margins decrease down the hierarchy, and thus that R&D spending decreases down the hierarchy. R&D spending on a new archival medium is aimed at a market with low revenues and low margins. Not a good investment decision.

But that isn't the worst prospect facing a new archival medium. As we currently see with flash, R&D investment in storage media is focused at the top of the hierarchy, where the revenues and margins are best. The result is to push legacy media, currently hard disk, down the hierarchy. Thus new, archival-only media have to compete with legacy universal media being pushed down the hierarchy. They face two major disadvantages:
  • The legacy medium's investment in R&D and manufacturing capacity has been amortized at the higher levels of the hierarchy, whereas the new medium's R&D and manufacturing investments have to earn their whole return at the archival layer. So the legacy medium is likely to be cheaper.
  • The legacy medium has latency and bandwidth suited to the higher layers of the hierarchy, albeit some time ago. It thus out-performs the new, archival-only medium.
The price/performance playing field is far from level. The market is small, with low margins. This isn't a good business to be in.

Ultra-Reliable Media?

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures, but you can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly.
To give you an idea of the reliability requirements, since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in. You have just specified a half-life for the bits. It's about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
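
The half-life figure can be derived in a few lines of Python. This sketch assumes a Petabyte of 10^15 bytes, bits that decay independently at a constant rate, and an age of the universe of about 13.8 billion years; none of these assumptions is spelled out in the talk.

```python
import math

BITS = 8 * 10**15          # one Petabyte, assuming 10**15 bytes
YEARS = 100                # how long the data sits in the black box
AGE_OF_UNIVERSE = 13.8e9   # years; an assumed round figure

# Assume every bit survives independently with probability p_bit.
# All bits survive with probability p_bit**BITS = 0.5, so
# ln(p_bit) = ln(0.5) / BITS.
log_p_bit = math.log(0.5) / BITS

# A bit with half-life T survives YEARS with probability 0.5**(YEARS / T),
# so T = YEARS * ln(0.5) / ln(p_bit).
half_life_years = YEARS * math.log(0.5) / log_p_bit

print(f"bit half-life: {half_life_years:.1e} years")
print(f"about {half_life_years / AGE_OF_UNIVERSE / 1e6:.0f} million "
      "times the age of the universe")
```

The half-life simplifies to the storage period multiplied by the number of bits, which is why it grows so quickly with the size of the collection.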

Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of Backblaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
Eric Brewer made the same point in his 2016 FAST keynote. Because for availability and resilience against disasters they need geographic diversity, they have replicas from which to recover. So spending more to increase media reliability makes no sense; they're already reliable enough. This is because the systems that surround the drives have been engineered to deliver adequate reliability despite the current unreliability of the drives. Thus engineering away the value of more reliable drives.
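
Wilson's arithmetic is easy to reproduce. The Python sketch below uses only the numbers in his quote; the variable and function names are illustrative, and the $5,000 salary saving is his rough estimate rather than a computed value.

```python
drives = 30_000
minutes_per_replacement = 15
fleet_cost = 4_000_000   # dollars, from the quote
salary_saved = 5_000     # dollars, Wilson's rough estimate

def replacement_hours(failure_rate):
    """Staff hours spent swapping failed drives at a given failure rate."""
    return drives * failure_rate * minutes_per_replacement / 60

hours_at_2_percent = replacement_hours(0.02)
hours_at_1_percent = replacement_hours(0.01)
hours_saved = hours_at_2_percent - hours_at_1_percent

# The price premium that halving the failure rate would justify.
premium_worth_paying = salary_saved / fleet_cost

print(f"hours at 2% failures: {hours_at_2_percent:.0f}")
print(f"hours saved at 1% failures: {hours_saved:.0f}")
print(f"premium worth paying: {premium_worth_paying:.3%}")
```

The saving is small because labor, not the drive itself, is the only cost that doubling reliability reduces in this accounting.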


Here are the takeaway lessons from the economics of long-term storage:


David. said...

It turns out that phase-change memory, the technology behind Storage-Class Memories, is more than 20 years older than flash. Stanford Ovshinsky filed the first patent in September 1966.

David. said...

"Samsung [announced] 90+ layer 3D-NAND chip manufacturing, with 1Tbit and QLC (4-level cell) chips coming." from Chris Mellor at The Register. Also:

"The V-NAND chip is a 256Gbit single stack device whereas Western Digital and Toshiba's sampling 96-layer chips are made from two 48-layer components stacked one above the other (string-stacking). Micron is also developing 96-layer chip technology."