I met Dave Anderson years ago at the Library of Congress' Storage Architecture workshop. When Dave heard I'd been invited to a workshop on The Future of Storage, he invited me here to preview my position paper for it. What you'll hear is an expanded version of the talk I've been asked to give at the workshop. In it, I'd like to suggest answers to six questions:
- How far into the future should we be looking?
- What do the economics of storing data for that long look like?
- How long should the media last?
- How reliable do the media need to be?
- What should the architecture of a future storage system look like?
- How much storage will be consumed in 2020?
How far into the future?
R. M. Fano's 1967 paper The Computer Utility and the Community reports that for MIT's IBM7094-based CTSS:
the cost of storing in the disk file the equivalent of one page of single-spaced typing is approximately 11 cents per month.
It would have been hard to believe a projection that in 2016 it would be more than 7 orders of magnitude cheaper.
|IBM2401s By Erik Pitti CC BY 2.0.|
Your 1966 alter-ego's data management plan would be correct in predicting that 50 years later the dominant media would be "disk" and "tape", and that disk's lower latency would carry a higher cost per byte. But it's hard to believe that any more detailed predictions about the technology would be correct. The extraordinary 30-year exponential cost per byte decrease had yet to start. The idea that ordinary citizens would carry tens of gigabytes in their pockets would have seemed ludicrous.
Thus a 50-year time horizon for a workshop on the Future of Storage may seem too long to be useful. But a 10-year time horizon is definitely too short to be useful. Storage is not just a technology, but also a multi-billion dollar manufacturing industry dominated by a few huge businesses, with long, hard-to-predict lead times.
|Seagate 2008 roadmap|
In 2016, the trade press is reporting that:
Seagate plans to begin shipping HAMR HDDs next year.
|ASTC 2016 roadmap|
A recent TrendFocus report suggests that the industry is preparing to slip the new technologies even further:
The report suggests we could see 14TB PMR drives in 2017 and 18TB SMR drives as early as 2018, with 20TB SMR drives arriving by 2020.
I believe this is mostly achieved by using helium-filled drives to add platters, and thus cost, not by increasing density above current levels.
Historically, tape was the medium of choice for long-term storage. Its basic recording technology is around 8 years behind hard disk, so it has a much more credible technology road-map than disk. But its importance is fading rapidly. There are several reasons:
- Tape is a very small market in unit terms:
Just under 20 million LTO cartridges were sent to customers last year. As a comparison let's note that WD and Seagate combined shipped more than 350 million disk drives in 2015; the tape cartridge market is less than 0.00567 per cent of the disk drive market in unit terms
- In effect there is now a single media supplier, raising fears of price gouging and supply vulnerability. The disk market has consolidated too, but there are still two very viable suppliers.
- The advent of data-mining and web-based access to archives makes the long access latency of tape less tolerable.
- To maximize the value of the limited number of slots in the robots it is necessary to migrate data to new, higher-capacity cartridges as soon as they appear. This has two effects. First, it makes the long data life of tape media less important. Second, it consumes a substantial fraction of the available bandwidth, up to a quarter in some cases.
|Source: The Register|
factories to build 3D NAND are vastly more expensive than plants that produce planar NAND or HDDs -- a single plant can cost $10 billion
|NAND vs. HDD capex/TB|
So we need to think about media that might enter service as the bulk store for persistent data starting in the second half of the next decade. Before that, Robert Fontana of IBM has an excellent overview of the roadmaps for tape, disk, optical and NAND flash (PDF), so we know the answer.
Economics of Long-Term Storage
|Cost vs. Kryder rate|
|2014 cost/byte projection|
- The slowing started in 2010, before the floods hit Thailand.
- Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
- If the industry projections pan out, as shown by the red lines, by 2020 disk costs will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
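The size of such gaps follows from compounding. Below is a minimal sketch, assuming cost per byte falls by a constant fraction each year. The function name `cost_gap` and the example rates are illustrative assumptions, not values read from the graphs.

```python
def cost_gap(kryder_rate: float, actual_rate: float, years: int) -> float:
    """Ratio of actual cost per byte to the cost had the historic rate continued.

    Both rates are annual fractional decreases in cost per byte
    (0.40 means the cost falls 40% each year).
    """
    projected = (1.0 - kryder_rate) ** years  # relative cost on the historic trend
    actual = (1.0 - actual_rate) ** years     # relative cost on the slower trend
    return actual / projected


if __name__ == "__main__":
    # Hypothetical inputs: a 40%/yr historic rate compared with flat prices
    # for 4 years, and with a 10%/yr rate for 10 years.
    print(f"4 years flat vs 40%/yr:     {cost_gap(0.40, 0.00, 4):.1f}x")
    print(f"10 years 10%/yr vs 40%/yr:  {cost_gap(0.40, 0.10, 10):.1f}x")
```

Even a few years of slower improvement opens a multiple-fold gap, and a decade opens a gap of one to two orders of magnitude.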
Long-Lived Media?
Every few months there is another press release announcing that some new, quasi-immortal medium such as 5D quartz or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost. The drives currently marketed for "archival" use have a shorter warranty and a shorter MTBF than the enterprise drives, so they're not expected to have long service lives.
The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center racks or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
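The arithmetic behind these figures can be sketched as follows, assuming the Kryder rate is an annual fractional decrease in the space (or cost) needed per byte. The function names are illustrative.

```python
def relative_space(kryder_rate: float, years: int) -> float:
    """Fraction of the original space needed for the same data after `years`."""
    return (1.0 - kryder_rate) ** years


def density_gain(kryder_rate: float, years: int) -> float:
    """How many times more data fits in the same space after `years`."""
    return 1.0 / relative_space(kryder_rate, years)


if __name__ == "__main__":
    for rate in (0.10, 0.30):
        print(
            f"{rate:.0%}/yr over 10 years: "
            f"space x{relative_space(rate, 10):.2f}, "
            f"capacity x{density_gain(rate, 10):.0f}"
        )
```

Under this assumption, 10%/yr leaves the same data in about 0.35 of the space after a decade, and 30%/yr gives a density gain of roughly 35 times, the same order as the figure quoted above.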
The reason why disks are engineered to have a 5-year service life is that, at 30-40% Kryder rates, they were going to be replaced within 5 years simply for economic reasons. But, if Kryder rates are going to be much lower going forward, the incentives to replace drives early will be much less, so a somewhat longer service life would make economic sense for the customer. From the disk vendor's point of view, a longer service life means they would sell fewer drives.
Additional reasons for skepticism include:
- The research we have been doing in the economics of long-term preservation demonstrates the enormous barrier to adoption that accounting techniques pose for media that have high purchase but low running costs, such as these long-lived media.
- The big problem in digital preservation is not keeping bits safe for the long term, it is paying for keeping bits safe for the long term. So an expensive solution to a sub-problem can actually make the overall problem worse, not better.
- These long-lived media are always off-line media. In most cases, the only way to justify keeping bits for the long haul is to provide access to them (see Blue Ribbon Task Force). The access latency scholars (and general Web users) will tolerate rules out off-line media for at least one copy. As Rob Pike said "if it isn't on-line no-one cares any more".
- So at best these media can be off-line backups. But the long access latency for off-line backups has led the backup industry to switch to on-line backup with de-duplication and compression. So even in the backup space long-lived media will be a niche product.
- Off-line media need a reader. Good luck finding a reader for a niche medium a few decades after it faded from the market - one of the points Jeff Rothenberg got right two decades ago.
Ultra-Reliable Media?
The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. But current media are many orders of magnitude too unreliable for the task ahead, so you can't:
- Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
- Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly.
Double the reliability is only worth 1/10th of 1 percent cost increase. ...
Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
Eric Brewer made the same point in his 2016 FAST keynote. Because they need geographic diversity for availability and resilience against disasters, they have replicas from which to recover. So spending more to increase media reliability makes no sense; the media are already reliable enough. This is because the systems that surround the drives have been engineered to deliver adequate reliability despite the current unreliability of the drives, thus engineering out the value of more reliable drives.
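The drive-replacement arithmetic in the quote can be checked with a short sketch. All inputs are the figures quoted above; the function and variable names are illustrative.

```python
def replacement_hours(drives: int, failure_rate: float, minutes_per_swap: float = 15.0) -> float:
    """Hours of labor needed to replace the drives that fail."""
    failed = drives * failure_rate
    return failed * minutes_per_swap / 60.0


DRIVES = 30_000
FLEET_COST = 4_000_000   # dollars, as quoted for 30,000 drives
LABOR_SAVING = 5_000     # dollars, the quote's rough estimate for 2 weeks of salary

hours_at_2pct = replacement_hours(DRIVES, 0.02)
hours_at_1pct = replacement_hours(DRIVES, 0.01)
hours_saved = hours_at_2pct - hours_at_1pct

# Fraction of the purchase price that halving the failure rate is worth.
justified_premium = LABOR_SAVING / FLEET_COST

if __name__ == "__main__":
    print(f"labor at 2% failures: {hours_at_2pct:.0f} hours")
    print(f"labor saved at 1%:    {hours_saved:.0f} hours")
    print(f"justified premium:    {justified_premium:.3%} of fleet cost")
```

The saving works out to about an eighth of one percent of the fleet cost, consistent with the quote's "1/10th of 1 per cent".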
Future Storage System Architecture?
What do we want from a future bulk storage system?
- An object storage fabric.
- With low power usage and rapid response to queries.
- That maintains high availability and durability by detecting and responding to media failures without human intervention.
- And whose reliability is externally auditable.
The following year Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We showed that it was indeed plausible that, even at then current flash prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk while providing high performance and enabling self-healing.
Two subsequent developments suggest we were on the right track. First, Seagate's announcement of its Kinetic architecture and Western Digital's subsequent announcement of drives that ran Linux both exploited the processing power available from the computers in the drives, which perform command processing, internal maintenance operations, and signal processing, to delegate computation from servers to the storage media and to get IP communication all the way to the media, as DAWN suggested. IP to the drive is a great way to future-proof the drive interface.
- Compute: 8-core Xeon system-on-a-chip, and Elastic Fabric Connector for external, off-blade, 40GbitE networking,
- Storage: NAND storage with 8TB or 52TB of raw capacity and on-board NV-RAM with a super-capacitor-backed write buffer plus a pair of ARM CPU cores and an FPGA,
- On-blade networking: PCIe card to link compute and storage cards via a proprietary protocol.
DAWN exploits two separate sets of synergies:
- Like FlashBlade, DAWN moves the computation to where the data is, rather than moving the data to where the computation is, reducing both latency and power consumption. The further data moves on wires from the storage medium, the more power and time it takes. This is why Berkeley's Aspire project's architecture is based on optical interconnect technology, which when it becomes mainstream will be both faster and lower-power than wires. In the meantime, we have to use wires.
- Unlike FlashBlade, DAWN divides the object storage fabric into a much larger number of much smaller nodes, implemented using the very low-power ARM chips used in cellphones. Because the power a CPU needs tends to grow faster than linearly with performance, the additional parallelism provides comparable performance at lower power.
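The second synergy can be illustrated with a toy model. This sketch assumes that a node's power grows as a power law of its performance with an exponent above 1, that the workload parallelizes perfectly, and that interconnect overhead is ignored. The exponent of 2 is a hypothetical value chosen for illustration, not a measurement.

```python
def total_power(total_perf: float, nodes: int, exponent: float, unit_power: float = 1.0) -> float:
    """Power to deliver `total_perf` split evenly across `nodes` identical nodes.

    Assumes each node draws unit_power * (per-node performance) ** exponent.
    """
    per_node_perf = total_perf / nodes
    return nodes * unit_power * per_node_perf ** exponent


if __name__ == "__main__":
    # Same aggregate performance, delivered by one big node or many wimpy ones.
    for n in (1, 4, 16):
        print(f"{n:2d} nodes -> relative power {total_power(16.0, n, exponent=2.0):.0f}")
```

With any exponent above 1, total power for a fixed aggregate performance falls as the node count rises. In practice, coordination and networking costs limit how far this can be pushed.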
Storage systems are extremely reliable, but at scale nowhere near reliable enough to mean data loss can be ignored. Internal auditing, in which the system detects and reports its own losses, for example by hashing the stored data and comparing the result with a stored hash, is important but is not enough. The system's internal audit function will itself have bugs, which are likely to be related to the bugs in the underlying functionality causing data loss. Having the system report "I think everything is fine" is not as reassuring as one would like.
Auditing a system by extracting its entire contents for integrity checking does not scale, and is likely itself to cause errors. Asking a storage system for the hash of an object is not adequate, because the system could have remembered the object's hash instead of computing it afresh. Although we don't yet have a perfect solution to the external audit problem, it is clear that part of the solution is the ability to supply a nonce that is prepended to the object's data before hashing. Because the result is different every time, the system cannot simply remember it.
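A minimal sketch of the nonce idea follows. It assumes the auditor can compute the expected answer from a reference copy of the object, which is only one possible arrangement. The `HonestStore` and `LazyStore` classes, the `challenge` method, and the choice of SHA-256 are all hypothetical, not the interface of any real storage system.

```python
import hashlib
import secrets


def nonce_hash(nonce: bytes, data: bytes) -> str:
    """Hash of the nonce prepended to the object's data."""
    h = hashlib.sha256()
    h.update(nonce)
    h.update(data)
    return h.hexdigest()


class HonestStore:
    """Keeps the data and computes each answer afresh."""

    def __init__(self, objects: dict):
        self._objects = dict(objects)

    def challenge(self, key: str, nonce: bytes) -> str:
        return nonce_hash(nonce, self._objects[key])


class LazyStore:
    """Hypothetical misbehaving store: discarded the data, kept only an old hash."""

    def __init__(self, objects: dict):
        self._remembered = {k: hashlib.sha256(v).hexdigest() for k, v in objects.items()}

    def challenge(self, key: str, nonce: bytes) -> str:
        return self._remembered[key]  # ignores the nonce, so the answer is stale


def audit(store, key: str, reference_copy: bytes) -> bool:
    """Issue a fresh random nonce and check the store's answer."""
    nonce = secrets.token_bytes(16)
    return store.challenge(key, nonce) == nonce_hash(nonce, reference_copy)


if __name__ == "__main__":
    objects = {"report": b"contents of a preserved object"}
    print("honest store passes:", audit(HonestStore(objects), "report", objects["report"]))
    print("lazy store passes:  ", audit(LazyStore(objects), "report", objects["report"]))
```

Because each audit uses a new nonce, a store that no longer holds the data cannot answer from memory. The sketch does not address how the auditor obtains a trustworthy reference, which is part of why the problem remains open.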
2020 Storage Consumption
How much storage will be consumed in 2020? I'm not an economist or a supply chain expert, and I don't have access to the relevant data, so it's not clear why Dave thinks my answer would be interesting.
|Seagate unit shipments|
|WD unit shipments|
- By 2020 the only large growing market left for hard drives will be cloud storage.
- That market has a fairly small number of large customers, which tends to depress margins.
- But it has only two suppliers, which tends to increase margins.
- Two-supplier markets with large customers tend to be stable; the customers don't want to end up with only one supplier, so they buy from both. (See Nvidia vs. ATI after about 1998, Boeing vs. Airbus, ...).
- It seems that in this market segment unit shipments are about stable, so the increase in unit capacity is roughly matching the increase in demand.
- Overall margins have not increased significantly; an increase would be a sign that the vendors were not satisfying demand.
Both graphs' "enterprise" line includes both performance and capacity (nearline) drives. According to Stifel:
High-capacity (nearline) enterprise HDD shipments are now estimated to grow from 37 million units in 2015 to 48 million units by 2020.
The graphs suggest about a 60M/year ship rate for the sector in 2015. By 2020 one would expect the performance market to be almost all flash. If Stifel is right, this means that nearline is the only growing part of the market.
Nidec, the dominant supplier of drive spindle motors, was a source for Stifel:
It foresees demand for about 400 million units in 2016, ...
It foresees 376 million disk drive shipments in 2017 and then 357m, 343m, and 333m units shipped in 2018, 2019, and 2020.
The reason disk is so cheap is that building drives is a high-volume business, with strong economies of scale. As volumes decrease, economies of scale go into reverse. This effect is amplified if the volume decrease is at the high end. Technology improvements are introduced first at the high end, where higher prices can generate a better return on the investment in making them. Then they migrate down the range. But flash is removing the market for the most expensive drives. Thus it will be harder and harder for the manufacturers to keep driving prices down.
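The Stifel and Nidec figures quoted above can be turned into compound annual rates for comparison. This is a minimal sketch using only the quoted unit counts; the function name is illustrative.

```python
def annual_growth(start_units: float, end_units: float, years: int) -> float:
    """Compound annual growth rate between two unit-shipment figures."""
    return (end_units / start_units) ** (1.0 / years) - 1.0


# Quoted figures, in millions of units.
nearline_rate = annual_growth(37, 48, 2020 - 2015)   # Stifel nearline estimate
total_rate = annual_growth(400, 333, 2020 - 2016)    # Nidec total demand forecast

if __name__ == "__main__":
    print(f"nearline units: {nearline_rate:+.1%} per year")
    print(f"total units:    {total_rate:+.1%} per year")
```

On these numbers nearline units grow by a little over 5% per year while total units shrink by about 4.5% per year, which is the volume decline the argument here turns on.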
Although disk $/GB has decreased somewhat in the last year or so, my guess is that going forward the Kryder rate for disk will be in the single digits. The effects include:
- Greatly increased costs for "keeping everything for ever".
- Further reduction in the cost per byte advantage of disk over flash, and thus increased erosion of disk's share in the total storage market. Maybe even a disk death spiral similar to tape's.
- The need for some understanding between cloud storage customers and the disk vendors as to how the improvements Eric Brewer wants can be financed. In the near future, the nearline part of the market doesn't have the volumes needed to justify the investment, especially since some of the changes, such as a different form factor, are not relevant to the other parts of the market.