Thursday, May 12, 2016

The Future of Storage

My preparation for a workshop on the future of storage included giving a talk at Seagate and talking to the all-flash advocates. Below the fold I attempt to organize into a coherent whole the results of these discussions and content from a lot of earlier posts.

I'd like to suggest answers to five questions related to the economics of long-term storage:
  • How far into the future should we be looking?
  • What do the economics of storing data for that long look like?
  • How long should the media last?
  • How reliable do the media need to be?
  • What should the architecture of a future storage system look like?

How far into the future?

Source: Disks for Data Centers
Discussions of storage tend to focus on the sexy, expensive, high-performance market. Those systems are migrating to flash. The data in those systems is always just a cache. In the long term, that data lives further down the hierarchy. What I'm talking about is the next layer down the hierarchy, the capacity systems where all the cat videos, snapshots and old blog posts live. And the scientific data.

Iain Emsley's talk at PASIG2016 on planning the storage requirements of the 1PB/day Square Kilometer Array mentioned that the data was expected to be used for 50 years. How hard a problem is planning with this long a horizon? Let's go back 50 years and see.


IBM2314s (source)
In 1966, as I was writing my first program, disk technology was about 10 years old; the IBM 350 RAMAC was introduced in 1956. The state of the art was the IBM 2314. Each removable disk pack stored 29MB on 11 platters with a 310KB/s data transfer rate, roughly equivalent to 60MB/rack. The SKA would have needed to add nearly 17M racks, or about 10 square kilometers of them, each day.
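
The rack arithmetic can be checked in a few lines. This is a minimal sketch assuming decimal units (1PB = 10^15 bytes, 1MB = 10^6 bytes). The footprint per rack is not a figure from the sources; it is simply what the 10 square kilometer estimate implies.

```python
# Back-of-the-envelope check of the 1966 rack count for the SKA.
# Assumption: decimal units (1 PB = 1e15 bytes, 1 MB = 1e6 bytes).
PB = 10**15
MB = 10**6

ska_bytes_per_day = 1 * PB   # SKA ingest rate quoted above
rack_capacity = 60 * MB      # IBM 2314-era estimate quoted above

racks_per_day = ska_bytes_per_day / rack_capacity
print(f"racks per day: {racks_per_day / 1e6:.1f} million")

# Floor area per rack implied by "about 10 square kilometers" (derived, not sourced).
area_m2 = 10 * 1_000_000
print(f"implied footprint: {area_m2 / racks_per_day:.2f} m^2 per rack")
```

The result is a little under 17 million racks a day, each occupying a bit more than half a square meter of floor.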

R. M. Fano's 1967 paper The Computer Utility and the Community reports that for MIT's IBM 7094-based CTSS:
the cost of storing in the disk file the equivalent of one page of single-spaced typing is approximately 11 cents per month.
It would have been hard to believe a projection that in 2016 it would be more than 7 orders of magnitude cheaper.

IBM2401s By Erik Pitti CC BY 2.0.
The state of the art in tape storage was the IBM 2401, the first nine-track tape drive, storing 45MB per tape with a 320KB/s maximum transfer rate. Roughly equivalent to 45MB/rack of accessible data.

Your 1966 alter-ego's data management plan would be correct in predicting that 50 years later the dominant media would be "disk" and "tape", and that disk's lower latency would carry a higher cost per byte. But it's hard to believe that any more detailed predictions about the technology would be correct. The extraordinary 30-year history of 30-40% annual cost per byte decrease, the Kryder rate, had yet to start.

Although disk is a 60-year-old technology, a 50-year time horizon for a workshop on the Future of Storage may seem too long to be useful. But a 10-year time horizon is definitely too short to be useful. Storage is not just a technology, but also a multi-billion dollar manufacturing industry dominated by a few huge businesses, with long, hard-to-predict lead times.

Seagate 2008 roadmap
To illustrate the lead times, here is a Seagate roadmap slide from 2008 predicting that perpendicular magnetic recording (PMR) would be replaced in 2009 by heat-assisted magnetic recording (HAMR), which would in turn be replaced in 2013 by bit-patterned media (BPM).

In 2016, the trade press is reporting that:
Seagate plans to begin shipping HAMR HDDs next year.
ASTC 2016 roadmap
Here is a recent roadmap from ASTC showing HAMR starting in 2017 and BPM in 2021. So in 8 years HAMR has gone from next year to next year, and BPM has gone from 5 years out to 5 years out. The reason for this real-time schedule slip is that as technologies get closer and closer to the physical limits, the difficulty, and above all the cost, of getting from lab demonstration to shipping in volume increases exponentially.

A recent TrendFocus report suggests that the industry is preparing to slip the new technologies even further:
The report suggests we could see 14TB PMR drives in 2017 and 18TB SMR drives as early as 2018, with 20TB SMR drives arriving by 2020.
I believe this is mostly achieved by using helium-filled drives to add platters, and thus cost, not by increasing density above current levels.


Historically, tape was the medium of choice for long-term storage. Its basic recording technology is around 8 years behind hard disk, so it has a much more credible technology road-map than disk. But its importance is fading rapidly. There are several reasons:
  • Tape is a very small market in unit terms:
    Just under 20 million LTO cartridges were sent to customers last year. As a comparison let's note that WD and Seagate combined shipped more than 350 million disk drives in 2015; the tape cartridge market is less than 0.00567 per cent [by these figures, about 5.7 per cent] of the disk drive market in unit terms
  • In effect there is now a single media supplier, raising fears of price gouging and supply vulnerability. The disk market has consolidated too, but there are still two very viable suppliers.
  • The advent of data-mining and web-based access to archives make the long access latency of tape less tolerable.
  • To maximize the value of the limited number of slots in the robots it is necessary to migrate data to new, higher-capacity cartridges as soon as they appear. This has two effects. First, it makes the long data life of tape media less important. Second, it consumes a substantial fraction of the available bandwidth, up to a quarter in some cases.


Source: The Register
Flash as a data storage technology is almost 30 years old. Eli Harari filed the key enabling patent in 1988, describing multi-level cell, wear-leveling and the Flash Translation Layer. Flash has yet to make a significant impact on the capacity storage market. Probably, at some point in the future it will displace hard disk as the medium for this level of the hierarchy. There are two contrasting views as to how long this will take.

Exabytes shipped
First, the conventional wisdom as expressed by the operators of cloud services and the disk industry, and supported by these graphs showing how few exabytes of flash are shipped in comparison to disk. Although flash is displacing disk from markets such as PCs, laptops and servers, Eric Brewer's fascinating keynote at this year's FAST conference started from the assertion that the only feasible medium for bulk data storage in the cloud was spinning disk.

NAND vs. HDD capex/TB
The argument is that flash, despite its many advantages, is and will remain too expensive for the capacity layer. The graph of the ratio of capital expenditure per TB of flash and hard disk shows that each exabyte of flash contains about 50 times as much capital as an exabyte of disk. Because:
factories to build 3D NAND are vastly more expensive than plants that produce planar NAND or HDDs -- a single plant can cost $10 billion
no-one is going to invest the roughly $80B needed to displace hard disks because the investment would not earn a viable return.

WD unit shipments
Second, the view from the flash advocates. They argue that the fabs will be built, because they are no longer subject to conventional economics. The governments of China, Japan, and other countries are stimulating their economies by encouraging investment, and they regard dominating the market for essential chips as a strategic goal, something that justifies investment. They are thinking long-term, not looking at the next quarter's results. The flash companies can borrow at very low interest rates, so even if they do need to show a return, they only need to show a very low return.

Seagate unit shipments
If the fabs are built, the increase in supply will increase the Kryder rate of flash. This will accelerate the trend of storage moving from disk to flash. In turn, this will increase the rate at which the disk vendors' unit shipments decrease. In turn, this will decrease their economies of scale, and cause disk's Kryder rate to go negative. The point at which flash becomes competitive with disk moves closer in time. Disk enters a death spiral.

The result would be that the Kryder rate for the capacity market, which has been very low, would get back closer to the historic rate sooner, and thus that storing bulk data for the long term would be significantly cheaper. But this isn't the only effect. When Data Domain's disk-based backup displaced tape, greatly reducing the access latency for backup data, the way backup data was used changed. Instead of backups being used mostly to cover media failures, they became used mostly to cover operator errors.

Similarly, if flash were to displace disk, the access latency for stored data would be significantly reduced, and the way the data is used would change. Because it is more accessible, people would find more ways to extract value from it. The changes induced by reduced latency would probably significantly increase the perceived value of the stored data, which would itself accelerate the turn-over from disk to flash.

I hope everyone is familiar with the concept of "stranded assets", for example the idea that if we're not to fry the planet oil companies cannot develop many of the reserves they carry on their books. Both views of the future of disk vs. flash involve a reduction in the unit volume of drives. The disk vendors cannot raise prices significantly; doing so would accelerate the reduction in unit volume. Thus their income will decrease, and with it their ability to finance the investments needed to get HAMR and then BPM into the market. The longer they delay these investments, the more difficult it becomes to afford them. Thus it is likely that HAMR and BPM will be "stranded technologies", advances we know how to build but never actually deploy.

Alternate Media

Media trends to 2014
Robert Fontana of IBM has an excellent overview of the roadmaps for tape, disk, optical and NAND flash (PDF) through the early 2020s. Clearly no other technology will significantly impact the storage market before then.

SanDisk shipped the first flash SSDs to GRiD Systems in 1991. Even if flash impacts the capacity market in 2018, it will have been 27 years after the first shipment. The storage technology that follows flash is probably some form of Storage Class Memory (SCM) such as XPoint. Small volumes of some forms of SCM have been shipping for a couple of years. Like flash, SCMs leverage much of the semiconductor manufacturing technology. Optimistically, one might expect SCM to impact the capacity market sometime in the late 2030s.

I'm not aware of any other storage technologies that could compete for the capacity market in the next three decades. SCMs have occupied the niche for a technology that exploits semiconductor manufacturing. A technology that didn't would find it hard to build the manufacturing infrastructure to ship the thousands of exabytes a year the capacity market will need by then.

Economics of Long-Term Storage

Cost vs. Kryder rate
Here is a graph from a model of the economics of long-term storage I built back in 2012 using data from Backblaze and the San Diego Supercomputer Center. It plots the net present value of all the expenditures incurred in storing a fixed-size dataset for 100 years against the Kryder rate. As you can see, at the 30-40%/yr rates that prevailed until 2010, the cost is low and doesn't depend much on the precise Kryder rate. Below 20%, the cost rises rapidly and depends strongly on the precise Kryder rate.
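
The shape of that curve can be reproduced with a toy calculation. The sketch below is not the 2012 model, which used Backblaze and SDSC data; it is a deliberately simplified stand-in in which the yearly cost of storing the dataset falls at the Kryder rate and future spending is discounted at a fixed rate. The function name `endowment`, the 2% discount rate, and the normalized first-year cost are all assumptions made for illustration.

```python
def endowment(kryder_rate, discount_rate=0.02, years=100, first_year_cost=1.0):
    """Net present value of paying for storage every year for `years` years.

    Simplified on purpose: the yearly cost falls by `kryder_rate` each year
    and is discounted at `discount_rate`. Both defaults are assumptions.
    """
    total = 0.0
    for year in range(years):
        cost = first_year_cost * (1 - kryder_rate) ** year
        total += cost / (1 + discount_rate) ** year
    return total

for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    print(f"Kryder rate {rate:.0%}: NPV = {endowment(rate):.1f} x first-year cost")
```

Even in this crude form the pattern is visible: moving from 40% to 30% changes the total from roughly 2.4 to 3.2 times the first-year cost, while moving from 20% to 10% takes it from roughly 4.6 to 8.5.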

2014 cost/byte projection
As it turned out, we were already well below 20%. Here is a 2014 graph from Preeti Gupta, a Ph.D. student at UC Santa Cruz, plotting $/GB against time. The red lines are projections at the industry roadmap's 20% and my less optimistic 10%. It shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • If the industry projections pan out, as shown by the red lines, by 2020 disk costs will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
The funds required to deliver on a commitment to store a chunk of data for the long term depend strongly on the Kryder rate, especially in the first decade or two. Industry projections of the rate have a history of optimism, and are vulnerable to natural disasters, industry consolidation, and so on. We aren't going to know the cost, and the probability is that it is going to be a lot more expensive than we expect.

Long-Lived Media?

Every few months there is another press release announcing that some new, quasi-immortal medium such as 5D quartz or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost. The drives currently marketed for "archival" use have a shorter warranty and a shorter MTBF than the enterprise drives, so they're not expected to have long service lives.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in about 1/3 the space. Since space in the data center racks or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store more than 30 times as much data in the same space.
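
These space figures follow from compounding the annual rate. A minimal sketch, assuming that the space needed per byte shrinks at the same annual rate as the cost per byte (the helper `space_fraction` is hypothetical):

```python
def space_fraction(kryder_rate, years):
    """Fraction of the original space needed after `years`, assuming density
    improves at the same annual rate as cost per byte falls (an assumption)."""
    return (1 - kryder_rate) ** years

slow = space_fraction(0.10, 10)
fast = space_fraction(0.30, 10)
print(f"10%/yr for 10 years: {slow:.2f} of the space")
print(f"30%/yr for 10 years: {1 / fast:.0f} times the data in the same space")
```

At 10%/yr the data fits in about 0.35 of the original space after a decade; at 30%/yr the same space holds about 35 times as much.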

The reason why disks are engineered to have a 5-year service life is that, at 30-40% Kryder rates, they were going to be replaced within 5 years simply for economic reasons. But, if Kryder rates are going to be much lower going forward, the incentives to replace drives early will be much less, so a somewhat longer service life would make economic sense for the customer. From the disk vendor's point of view, a longer service life means they would sell fewer drives. Not a reason to make them.

Additional reasons for skepticism include:
  • The research we have been doing in the economics of long-term preservation demonstrates the enormous barrier to adoption that accounting techniques pose for media that have high purchase but low running costs, such as these long-lived media.
  • The big problem in digital preservation is not keeping bits safe for the long term, it is paying for keeping bits safe for the long term. So an expensive solution to a sub-problem can actually make the overall problem worse, not better.
  • These long-lived media are always off-line media. In most cases, the only way to justify keeping bits for the long haul is to provide access to them (see Blue Ribbon Task Force). The access latency scholars (and general Web users) will tolerate rules out off-line media for at least one copy. As Rob Pike said, "if it isn't on-line no-one cares any more".
  • So at best these media can be off-line backups. But the long access latency for off-line backups has led the backup industry to switch to on-line backup with de-duplication and compression. So even in the backup space long-lived media will be a niche product.
  • Off-line media need a reader. Good luck finding a reader for a niche medium a few decades after it faded from the market - one of the points Jeff Rothenberg got right two decades ago.

Ultra-Reliable Media?

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. But current media are many orders of magnitude too unreliable for the task ahead, so you can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of Backblaze, points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
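
Wilson's arithmetic can be reproduced directly from the numbers in the quote. In this sketch the variable names are mine, and the $5,000 labor saving is his own rough estimate rather than something derived here.

```python
drives = 30_000
minutes_per_swap = 15
drive_fleet_cost = 4_000_000   # dollars, from the quote
labor_saving = 5_000           # dollars, Wilson's rough estimate

def replacement_hours(failure_rate):
    """Hours of hands-on work to replace the drives that fail."""
    return drives * failure_rate * minutes_per_swap / 60

hours_at_2pct = replacement_hours(0.02)
hours_at_1pct = replacement_hours(0.01)
premium = labor_saving / drive_fleet_cost  # affordable price premium

print(f"2% failures: {hours_at_2pct:.0f} hours")
print(f"1% failures: {hours_at_1pct:.0f} hours")
print(f"justified premium: {premium:.3%} of fleet cost")
```

Halving the failure rate saves about 75 hours of work, which justifies a premium of about 0.125% of the cost of the drives, in line with the "1/10th of 1 percent" in the quote.
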
Eric Brewer made the same point in his 2016 FAST keynote. Because cloud operators need geographic diversity for availability and resilience against disasters, they already have replicas from which to recover. So spending more to increase media reliability makes no sense; the drives are already reliable enough. This is because the systems that surround the drives have been engineered to deliver adequate reliability despite the current unreliability of the drives, thus engineering away the value of more reliable drives.

Future Storage System Architecture?

What do we want from a future bulk storage system?
  • An object storage fabric.
  • With low power usage and rapid response to queries.
  • That maintains high availability and durability by detecting and responding to media failures without human intervention.
  • And whose reliability is externally auditable.
At the 2009 SOSP David Andersen and co-authors from C-MU presented FAWN, a Fast Array of Wimpy Nodes. It inspired me to suggest, in my 2010 JCDL keynote, that the cost savings FAWN realized without performance penalty by distributing computation across a very large number of very low-power nodes might also apply to storage.

The following year Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We showed that it was indeed plausible that, even at then current flash prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk while providing high performance and enabling self-healing.

Two subsequent developments suggest we were on the right track. First, Seagate announced its Kinetic architecture, and Western Digital subsequently announced drives that ran Linux. Both exploited the processing power available from the computers in the drives, which perform command processing, internal maintenance operations, and signal processing, to delegate computation from servers to the storage media and to get IP communication all the way to the media, as DAWN suggested. IP to the drive is a great way to future-proof the drive interface.

FlashBlade hardware
Second, although flash remains more expensive than hard disk, since 2011 the gap has narrowed from a factor of about 12 to about 6. Pure Storage recently announced FlashBlade, an object storage fabric composed of large numbers of blades, each equipped with:
  • Compute: 8-core Xeon system-on-a-chip, and Elastic Fabric Connector for external, off-blade, 40GbitE networking,
  • Storage: NAND storage with 8TB or 52TB of raw capacity and on-board NV-RAM with a super-capacitor-backed write buffer plus a pair of ARM CPU cores and an FPGA,
  • On-blade networking: PCIe card to link compute and storage cards via a proprietary protocol.
FlashBlade clearly isn't DAWN. Each blade is much bigger, much more powerful and much more expensive than a DAWN node. No-one could call a node with an 8-core Xeon, 2 ARMs, and 52TB of flash "wimpy", and it'll clearly be too expensive for long-term bulk storage. But it is a big step in the direction of the DAWN architecture.

DAWN exploits two separate sets of synergies:
  • Like FlashBlade, DAWN moves the computation to where the data is, rather than moving the data to where the computation is, reducing both latency and power consumption. The further data moves on wires from the storage medium, the more power and time it takes. This is why Berkeley's Aspire project's architecture is based on optical interconnect technology, which when it becomes mainstream will be both faster and lower-power than wires. In the meantime, we have to use wires.
  • Unlike FlashBlade, DAWN divides the object storage fabric into a much larger number of much smaller nodes, implemented using the very low-power ARM chips used in cellphones. Because the power a CPU needs tends to grow faster than linearly with performance, the additional parallelism provides comparable performance at lower power.
So FlashBlade currently exploits only one of the two sets of synergies. But once Pure Storage has deployed this architecture in its current relatively high-cost and high-power technology, re-implementing it in lower-cost, lower-power technology should be easy and non-disruptive. They have done the harder of the two parts.
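
The second synergy can be illustrated with a toy power model. This is a sketch under stated assumptions, not a measurement of any real hardware: each node's power is taken to be its performance raised to an exponent `alpha` greater than 1, standing in for "faster than linearly", and the workload is assumed to divide perfectly across nodes with no per-node overhead. The function `fabric_power` and the value `alpha=1.5` are hypothetical.

```python
def fabric_power(total_performance, nodes, alpha=1.5):
    """Total power of `nodes` identical nodes sharing `total_performance`.

    Assumes each node's power is (its performance) ** alpha in arbitrary
    units. alpha=1.5 is illustrative only; the model ignores per-node
    overheads and imperfect parallel scaling.
    """
    per_node = total_performance / nodes
    return nodes * per_node ** alpha

big = fabric_power(64.0, nodes=4)      # a few powerful nodes
wimpy = fabric_power(64.0, nodes=64)   # many low-power nodes
print(f"4 big nodes: {big:.0f} units, 64 wimpy nodes: {wimpy:.0f} units")
```

Under these assumptions the same aggregate performance costs a quarter of the power when spread over 16 times as many nodes. Real systems will fall short of this because networking, packaging, and coordination overheads grow with node count.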

Storage systems are extremely reliable, but at scale nowhere near reliable enough to mean data loss can be ignored. Internal auditing, in which the system detects and reports its own losses, for example by hashing the stored data and comparing the result with a stored hash, is important but is not enough. The system's internal audit function will itself have bugs, which are likely to be related to the bugs in the underlying functionality causing data loss. Having the system report "I think everything is fine" is not as reassuring as one would like.

Auditing a system by extracting its entire contents for integrity checking does not scale, and is likely itself to cause errors. Asking a storage system for the hash of an object is not adequate, because the system could have remembered the object's hash instead of computing it afresh. Although we don't yet have a perfect solution to the external audit problem, it is clear that part of the solution is the ability to supply a random nonce that is prepended to the object's data before hashing. Because the result is different every time, the system cannot simply remember it.
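
The nonce idea can be sketched in a few lines. Everything here is illustrative: `ToyStore` and `audit` are hypothetical names, SHA-256 is my choice of hash, and the auditor is assumed to hold its own trusted copy of the object, which is exactly the kind of gap that keeps this from being a complete solution.

```python
import hashlib
import secrets

class ToyStore:
    """Stand-in for a storage system that answers audit challenges honestly."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def challenge(self, key, nonce):
        # The nonce is prepended to the data, so a remembered digest is
        # useless: the hash must be recomputed over the stored bytes.
        return hashlib.sha256(nonce + self._objects[key]).hexdigest()

def audit(store, key, reference_copy):
    """Auditor side. Assumes the auditor has a trusted reference copy."""
    nonce = secrets.token_bytes(32)  # fresh random nonce per challenge
    expected = hashlib.sha256(nonce + reference_copy).hexdigest()
    return store.challenge(key, nonce) == expected

store = ToyStore()
store.put("report", b"important bits")
print(audit(store, "report", b"important bits"))   # True
store._objects["report"] = b"important bitz"       # simulate silent corruption
print(audit(store, "report", b"important bits"))   # False
```

A system that had merely cached the hash of the object would fail the first challenge, since it cannot predict the nonce in advance.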


I'm grateful to Seagate for (twice) allowing me to pontificate about their industry, to Brian Berg for his encyclopedic knowledge of the history of flash, and to Tom Coughlin for illuminating discussions and the graph of exabytes shipped. This isn't to say that they agree with any of the above.


David. said...

Chris Mellor at The Register has a piece worth reading pushing the claim that Seagate will ship HAMR drives next year.

David. said...

From Chris Mellor at The Register:

Rakers said Seagate has a plan to reduce its HDD manufacturing capacity footprint by 35 per cent, saving approximately $20m per quarter.

David. said...

More on FlashBlade from Enrico Signoretti at The Register.

David. said...

Seagate plans to lay off 14% of its workforce.

David. said...

Chris Mellor at The Register has graphs showing peak disk for both Seagate and WDC.

David. said...

Chris Mellor at The Register has a piece headlined "Good gravy, Toshiba QLC flash chips are getting closer". QLC is 4 bits per cell. I'll blog about the significance of this shortly.

David. said...

WD's 4th quarter continues the decline in hard disk unit shipments, and The Register has updated graphs.

David. said...

Also in The Register's report is this quote from WD's CEO on the prospect for HAMR:

"Well, I would say it's 2018, 2019 kind of dynamic."

So the real-time slippage continues.

David. said...

Chris Mellor at The Register reports on Seagate's 4th quarter with updated shipment graphs:

"The segment splits show quarter on quarter rises for enterprise and consumer electronics drives and falls elsewhere, the largest for enterprise and notebook drives. However, within the enterprise segment, mission-critical (performance) drives had a q-on-q fall from 3.2 million to 3 million drives, while nearline (capacity) drives rose from 4.5 million to 5.5 million; reflecting its entry into the 8TB drive sector."

Seagate's 8TB drives are doing well. Stifel Nicolaus estimates Seagate:

"gained ~6pp of capacity ship share in the high-cap/nearline HDD market during the June quarter (~49 per cent ship share vs. WD’s ~46 per cent ship share.)"

David. said...

Validation of Pure Storage's FlashBlade architecture comes from Toshiba's announcement of FlashMatrix, a rather similar architecture.

David. said...

It appears that Intel may have rushed XPoint into the market too early so that it isn't delivering on the promises made at introduction.