Tuesday, December 13, 2016

The Medium-Term Prospects for Long-Term Storage Systems

Back in May I posted The Future of Storage, a brief talk written for a DARPA workshop of the same name. The participants were experts in one or another area of storage technology, so the talk left out a lot of background that a more general audience would have needed. Below the fold, I try to cover the same ground but with this background included, which makes for a long post.

This is an enhanced version of a journal article that has been accepted for publication in Library Hi Tech, with images that didn't meet the journal's criteria, and additional material reflecting developments since submission. Storage technology evolution can't be slowed down to the pace of peer review.

What is Long-Term Storage?

Storage Hierarchy
Public Domain
The storage of a computer system is usually described as a hierarchy. Newly created or recently accessed data resides at the top of the hierarchy in relatively small amounts of very fast, very expensive media. As it ages, it migrates down the hierarchy to larger, slower and cheaper media.

Long-term storage implements the base layers of the hierarchy, often called "bulk" or "capacity" storage. Most discussions of storage technology focus on the higher, faster layers, which these days are the territory of all-flash arrays holding transactional databases, search indexes, breaking news pages and so on. The data in those systems is always just a cache. Long-term storage is where old blog posts, cat videos and most research datasets spend their lives.

What Temperature is Your Data?

If everything is working as planned, data in the top layers of the hierarchy will be accessed much more frequently, be "hotter", than data further down. At scale, this effect can be extremely strong.

Muralidhar et al Figure 3
Subramanian Muralidhar and a team from Facebook, USC and Princeton have an OSDI paper, f4: Facebook's Warm BLOB Storage System, describing the warm layer between Facebook's Haystack hot storage layer and their cold storage layers. Section 3 describes the behavior of BLOBs (Binary Large OBjects) of different types in Facebook's storage system. Each type of BLOB contains a single type of immutable binary content, such as photos, videos, documents, etc. The access rates for different types of BLOB drop differently with age, but all 9 types have dropped by 2 orders of magnitude within 8 months, and all but 1 (profile photos) have dropped by an order of magnitude within the first week, as shown in their Figure 3.

The Facebook data make two really strong arguments for hierarchical storage architectures at scale:
  • That significant kinds of data should be moved from expensive, high-performance hot storage to cheaper warm and then cold storage as rapidly as feasible.
  • That the I/O rate that warm storage should be designed to sustain is so different from that of hot storage, at least 2 and often many more orders of magnitude, that attempting to re-use hot storage technology for warm and even worse for cold storage is futile.
The argument that the long-term bulk storage layers will need their own technology is encouraging, because (see below) there isn't going to be enough of the flash media that are taking over the performance layers to hold everything.

But there is a caveat. Typical at-scale systems such as Facebook's do show infrequent access to old data. This used to be true in libraries and archives. But the advent of data mining and other "big data" applications means that increasingly scholars want not to access a few specific items, but instead to ask statistical questions of an entire collection. The implications of this change in access patterns for long-term storage architectures are discussed below.

How Long is the "Medium Term"?

Iain Emsley's talk at PASIG2016 on planning the storage requirements of the 1PB/day Square Kilometer Array mentioned that the data was expected to be used for 50 years. How hard a problem is planning with this long a horizon? Looking back 50 years can provide a clue.

Public Domain
In 1966 disk technology was about 10 years old; the IBM 350 RAMAC was introduced in 1956. The state of the art was the IBM 2314. Each removable disk pack stored 29MB on 11 platters with a 310KB/s data transfer rate, roughly equivalent to 60MB/rack. Every day, the SKA would have needed to add nearly 17M racks, covering about 10 square kilometers.
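
As a rough check of the rack count, here is a back-of-envelope sketch. It assumes decimal units (1PB = 10^15 bytes) and takes the 60MB/rack figure at face value; the variable names are purely illustrative.

```python
# Back-of-envelope check of the 1966 rack count (decimal units assumed).
PB = 10**15
MB = 10**6

ska_bytes_per_day = 1 * PB       # SKA ingest rate quoted in the text
bytes_per_rack_1966 = 60 * MB    # rough IBM 2314-era capacity per rack, from the text

racks_per_day = ska_bytes_per_day / bytes_per_rack_1966
print(f"{racks_per_day / 1e6:.1f} million racks per day")
```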

R. M. Fano's 1967 paper The Computer Utility and the Community reports that for MIT's IBM 7094-based CTSS:
the cost of storing in the disk file the equivalent of one page of single-spaced typing is approximately 11 cents per month.
It would have been hard to believe a projection that in 2016 it would be more than 7 orders of magnitude cheaper.

By Erik Pitti CC BY 2.0.
The state of the art in tape storage was the IBM 2401, the first nine-track tape drive, storing 45MB per tape with a 320KB/s maximum transfer rate, roughly equivalent to 45MB/rack of accessible data.

A 1966 data management plan would have been correct in predicting that 50 years later the dominant media would be "disk" and "tape", and that disk's lower latency would carry a higher cost per byte. But it's hard to believe that any more detailed predictions about the technology would have been correct. The extraordinary 30-year history of 30-40% annual cost per byte decrease of disk media, their Kryder rate, had yet to start.

Although disk and tape are 60-year old technologies, a 50-year time horizon may seem too long to be useful. But a 10-year time horizon is definitely too short to be useful. Storage is not just a technology, but also a multi-billion dollar manufacturing industry dominated by a few huge businesses, with long, hard-to-predict lead times.

Seagate 2008 roadmap 
Disk technology shows how hard it is to predict lead times. Here is a Seagate roadmap slide from 2008 predicting that the then (and still) current technology, perpendicular magnetic recording (PMR), would be replaced in 2009 by heat-assisted magnetic recording (HAMR), which would in turn be replaced in 2013 by bit-patterned media (BPM).

In 2016, the trade press is reporting that:
Seagate plans to begin shipping HAMR HDDs next year.
ASTC 2016 roadmap
Here is a recent roadmap from ASTC showing HAMR starting in 2017 and BPM in 2021. So in 8 years HAMR has gone from next year to next year, and BPM has gone from 5 years out to 5 years out. The reason for this real-time schedule slip is that as technologies get closer and closer to the physical limits, the difficulty and above all cost of getting from lab demonstration to shipping in volume increases exponentially.

A recent TrendFocus report suggests that the industry is preparing to slip the new technologies even further:
The report suggests we could see 14TB PMR drives in 2017 and 18TB SMR drives as early as 2018, with 20TB SMR drives arriving by 2020.
Here, the medium term is loosely defined as the next couple of decades, or 2-3 times the uncertainty in industry projections.

What Is The Basic Problem of Long-Term Storage?

The fundamental problem is not storing bits safely for the long term, it is paying to store bits safely for the long term. With an unlimited budget an unlimited amount of data could be stored arbitrarily reliably indefinitely. But in the real world of limited budgets there is an inevitable tradeoff between storing more data, and storing the data more reliably.

Historically, this tradeoff has not been pressing, because the rate at which the cost per byte of storage dropped (the Kryder rate) was so large that if you could afford to keep some data for a few years, you could afford to keep it "forever". The incremental cost would be negligible. Alas, this is no longer true.

Cost vs. Kryder rate
Here is a graph from a model of the economics of long-term storage I built back in 2012 using data from Backblaze and the San Diego Supercomputer Center. It plots the net present value of all the expenditures incurred in storing a fixed-size dataset for 100 years against the Kryder rate. As you can see, at the 30-40%/yr rates that prevailed until 2010, the cost is low and doesn't depend much on the precise Kryder rate. Below 20%, the cost rises rapidly and depends strongly on the precise Kryder rate.
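
The shape of that curve can be reproduced with a toy calculation. This is not the 2012 model, only a minimal sketch under loudly stated assumptions: media are the only cost, they are replaced every `service_life` years, the price per byte falls by the Kryder rate each year, and future purchases are discounted at a constant rate. The function name and all parameter values are hypothetical.

```python
def endowment(kryder_rate, years=100, service_life=4,
              discount_rate=0.02, initial_media_cost=1.0):
    """Net present value of buying media now and replacing it every
    service_life years, for a fixed-size dataset.

    Toy assumptions: only media purchases are counted, the price per byte
    falls by kryder_rate each year, and the discount rate is constant.
    """
    total = 0.0
    for year in range(0, years, service_life):
        price = initial_media_cost * (1 - kryder_rate) ** year
        total += price / (1 + discount_rate) ** year
    return total


for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    print(f"Kryder rate {rate:.0%}: NPV = {endowment(rate):.2f}x the initial purchase")
```

Even in this crude form, the total barely moves between 40% and 30%, but climbs steeply once the rate falls below 20%.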

2014 cost/byte projection
As it turned out, we were already well below 20%. Here is a 2014 graph from Preeti Gupta of UC Santa Cruz plotting $/GB against time. The red lines are projections at the industry roadmap's 20% and a less optimistic 10%. It shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • If the industry projections pan out, as shown by the red lines, by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
The total cost of delivering on a commitment to store a fixed-size dataset for the long term depends strongly on the Kryder rate, especially in the first decade or two. Industry projections of the rate have a history of optimism, and are vulnerable to natural disasters, industry consolidation, and so on. We aren't going to know the cost, and the probability is that it is going to be a lot more expensive than we expect.

How Much Long-Term Storage Do We Need?

Lay people reading the press about storage (a typical example is Lauro Rizatti's recent article in EE Times entitled Digital Data Storage is Undergoing Mind-Boggling Growth) believe two things:
  • per byte, storage media are getting cheaper very rapidly (Kryder's Law), and
  • the demand for storage greatly exceeds the supply.
These two things cannot both be true. If the demand for storage greatly exceeded the supply, the price would rise until supply and demand were in balance.

In 2011 we actually conducted an experiment to show that this is what happens. We nearly halved the supply of disk drives by flooding large parts of Thailand including the parts where disks were manufactured. This flooding didn't change the demand for disks, because these parts of Thailand were not large consumers of disks. What happened? As Preeti Gupta's graph shows, the price of disks immediately nearly doubled, choking off demand to match the available supply, and then fell slowly as supply recovered.

So we have two statements. The first is "per byte, storage media are getting cheaper very rapidly". We can argue about exactly how rapidly, but there are decades of factual data recording the drop in cost per byte of disk and other storage media. So it is reasonable to believe the first statement. Anyone who has been buying computers for a few years can testify to it.

The second is "the demand for storage greatly exceeds the supply". The first statement is true, so this has to be false. Why do people believe it? The evidence for the excess of demand over supply in Rizatti's article is a graph with blue bars labeled "demand" overwhelming orange bars. The orange bars are labeled "output", which appears to represent the total number of bytes of storage media manufactured each year. This number should be fairly accurate, but it overstates the amount of newly created information stored each year for many reasons:
  • Newly manufactured media does not instantly get filled. There are delays in the distribution pipeline - for example I have nearly half a terabyte of unwritten DVD-R media sitting on a shelf. This is likely to be a fairly small percentage.
  • Some media that gets filled turns out to be faulty and gets returned under warranty. This is likely to be a fairly small percentage.
  • Some of the newly manufactured media replaces obsolete media, so isn't available to store newly created information.
  • Because of overhead from file systems and so on, newly created information occupies more bytes of storage than its raw size. This is typically a small percentage.
  • If newly created information does actually get written to a storage medium, several copies of it normally get written. This is likely to be a factor of about two.
  • Some newly created information exists in vast numbers of copies. For example, my iPhone 6 claims to have 64GB of storage. That corresponds to the amount of newly manufactured storage medium (flash) it consumes. But about 8.5GB of that is consumed by a copy of iOS, the same information that consumes 8.5GB in every iPhone 6. Between October 2014 and October 2015 Apple sold 222M iPhones, so those 8.5GB of information are replicated 222M times, consuming about 1.9EB of the storage manufactured in that year.
The mismatch between the blue and orange bars is much greater than it appears.
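
The iPhone arithmetic in the last bullet can be checked in a few lines. This sketch assumes decimal units (1GB = 10^9 bytes, 1EB = 10^18 bytes) and, like the text, treats every phone sold as carrying the same 8.5GB copy of iOS; the variable names are illustrative.

```python
GB = 10**9
EB = 10**18

ios_copy_bytes = 8.5 * GB      # size of the iOS copy, from the text
iphones_sold = 222_000_000     # October 2014 to October 2015, from the text

replicated_bytes = ios_copy_bytes * iphones_sold
share_of_phone = 8.5 / 64      # fraction of a 64GB phone holding that copy

print(f"{replicated_bytes / EB:.1f} EB of flash holding identical copies of iOS")
print(f"{share_of_phone:.0%} of each 64GB phone")
```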

What do the blue bars represent? They are labeled "demand" but, as we have seen, the demand for storage depends on the price. There's no price specified for these bars. The caption of the graph says "Source: Recode", which I believe refers to a 2014 article by Rocky Pimentel entitled Stuffed: Why Data Storage Is Hot Again. (Really!). Based on the IDC/EMC Digital Universe report, Pimentel writes:
The total amount of digital data generated in 2013 will come to 3.5 zettabytes (a zettabyte is 1 with 21 zeros after it, and is equivalent to about the storage of one trillion USB keys). The 3.5 zettabytes generated this year will triple the amount of data created in 2010. By 2020, the world will generate 40 zettabytes of data annually, or more than 5,200 gigabytes of data for every person on the planet.
The operative words are "data generated". Not "data stored permanently", nor "bytes of storage consumed". The numbers projected by IDC for "data generated" have always greatly exceeded the numbers actually reported for storage media manufactured in a given year, which in turn, as discussed above, exaggerate the capacity added to the world's storage infrastructure.

The assumption behind "demand exceeds supply" is that every byte of "data generated" in the IDC report is a byte of demand for permanent storage capacity. Even in a world where storage was free there would still be much data generated that was never intended to be stored for any length of time, and would thus not represent demand for storage media.

WD results
In the real world data costs money to store, and much of that money ends up with the storage media companies. This provides another way of looking at the idea that Digital Data Storage is Undergoing Mind-Boggling Growth. What does it mean for an industry to have Mind-Boggling Growth? It means that the companies in the industry have rapidly increasing revenues and, normally, rapidly increasing profits.

Seagate results
The graphs show the results for the two companies that manufacture the bulk of the storage bytes each year. Revenues are flat or decreasing, profits are decreasing for both companies. These do not look like companies faced by insatiable demand for their products; they look like mature companies facing increasing difficulty in scaling their technology.

For a long time, discussions of storage have been bedevilled by the confusion between IDC's projections for "data generated" and the actual demand for storage media. The actual demand will be much lower, and will depend on the price.

Does Long-Term Storage Need Long-Lived Media?

Every few months there is another press release announcing that some new, quasi-immortal medium such as 5D quartz or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but a study of the market for them revealed that no-one would pay the relatively small additional cost. The drives currently marketed for "archival" use have a shorter warranty and a shorter MTBF than enterprise drives, so they're not expected to have long service lives.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center racks or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
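
The compounding behind these figures is easy to sketch. The assumption here is that the Kryder rate is the annual fractional decrease in the space (or cost) needed per byte; `relative_space` is an illustrative name, not anything from the sources.

```python
def relative_space(kryder_rate, years):
    """Fraction of the original space needed to hold the same data after
    `years`, if the space per byte shrinks by kryder_rate every year."""
    return (1 - kryder_rate) ** years


for rate in (0.10, 0.30):
    frac = relative_space(rate, 10)
    print(f"{rate:.0%}/yr for 10 years: {frac:.2f} of the space, "
          f"or about {1 / frac:.0f}x the data in the same space")
```

Under this reading, 10%/yr gives roughly a third of the space after a decade, and 30%/yr gives about 35 times the data in the same space, the same order as the round figure of 30 used above.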

The reason why disks are engineered to have a 5-year service life is that, at 30-40% Kryder rates, they were going to be replaced within 5 years simply for economic reasons. But, if Kryder rates are going to be much lower going forward, the incentives to replace drives early will be much less, so a somewhat longer service life would make economic sense for the customer. From the disk vendor's point of view, a longer service life means they would sell fewer drives. Not a reason to make them.

Additional reasons for skepticism include:
  • Our research into the economics of long-term preservation demonstrates the enormous barrier to adoption that accounting techniques pose for media that have high purchase but low running costs, such as these long-lived media.
  • Since the big problem in digital preservation is not keeping bits safe for the long term, it is paying for keeping bits safe for the long term, an expensive solution to a sub-problem can actually make the overall problem worse, not better.
  • These long-lived media are always off-line media. In most cases, the only way to justify keeping bits for the long haul is to provide access to them (see Blue Ribbon Task Force). The access latency scholars (and general Web users) will tolerate rules out off-line media for at least one copy.
  • Thus at best these media can be off-line backups. But the long access latency for off-line backups has led the backup industry to switch to on-line backup with de-duplication and compression. So even in the backup space long-lived media will be a niche product.
  • Off-line media need a reader. Good luck finding a reader for a niche medium a few decades after it faded from the market - one of the points Jeff Rothenberg got right two decades ago.
Since at least one copy needs to be on-line, and since copyability is an inherent property of being on-line, migrating data to a new medium is not a big element of the total cost of data ownership. Reducing migration cost by extending media service life thus doesn't make a big difference.

Does Long-Term Storage Need Ultra-Reliable Media?

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. But current media are many orders of magnitude too unreliable for the task ahead, so you can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly.
Thus replication, and error detection and recovery, are required features of a long-term storage system regardless of the medium it uses. Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of Backblaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Moral of the story: design for failure and buy the cheapest components you can.
Eric Brewer made the same point in his 2016 FAST keynote. For availability and resilience against disasters Google needs geographic diversity, so they have replicas from which to recover. Spending more to increase media reliability makes no sense; the media are already reliable enough. The systems that surround the drives have been engineered to deliver adequate reliability despite the current unreliability of the drives, thus engineering away the value of more reliable drives.

How Much Replication Do We Need?

Facebook's hot storage layer, Haystack, uses RAID-6 and replicates data across three data centers, using 3.6 times as much storage as the raw data. The next layer down, Facebook's f4, uses two fault-tolerance techniques:
  • Within a data center it uses erasure coding with 10 data blocks and 4 parity blocks. Careful layout of the blocks ensures that the data is resilient to drive, host and rack failures at an effective replication factor of 1.4.
  • Between data centers it uses XOR coding. Each block is paired with a different block in another data center, and the XOR of the two blocks stored in a third. If any one of the three data centers fails, both paired blocks can be restored from the other two.
The result is fault-tolerance to drive, host, rack and data center failures at an effective replication factor of 2.1, reducing overall storage demand from Haystack's factor of 3.6 by nearly 42% for the vast bulk of Facebook's data. Erasure-coding everything except the hot storage layer seems economically essential.
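
The replication factors and the XOR recovery can be sketched together. The figures are the ones quoted above; the RAID-6 overhead of 1.2 is an assumption inferred from 3.6 / 3, and the block contents and the `xor_blocks` helper are made up for illustration, not taken from f4.

```python
# Effective replication factors, using the figures quoted in the text.
raid6_overhead = 1.2                 # assumption: implied by 3.6 / 3 data centers
haystack = raid6_overhead * 3        # RAID-6 copies in three data centers
f4_within_dc = (10 + 4) / 10         # 10 data blocks + 4 parity blocks
f4_total = f4_within_dc * 3 / 2      # 2 data blocks + 1 XOR block across 3 sites

saving = 1 - f4_total / haystack
print(f"Haystack {haystack:.1f}, f4 {f4_total:.1f}, saving {saving:.1%}")


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


block_a = b"photo-1!"                   # stored in data center 1
block_b = b"video-2!"                   # its buddy, stored in data center 2
parity = xor_blocks(block_a, block_b)   # stored in data center 3

# Lose data center 1: rebuild block_a from the two surviving blocks.
recovered_a = xor_blocks(parity, block_b)
assert recovered_a == block_a
```

Because XOR is its own inverse, the same call rebuilds whichever one of the three blocks is lost.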

Another point from the f4 paper worth noting is the value of heterogeneity as a way of avoiding correlated failures:
We recently learned about the importance of heterogeneity in the underlying hardware for f4 when a crop of disks started failing at a higher rate than normal. In addition, one of our regions experienced higher than average temperatures that exacerbated the failure rate of the bad disks. This combination of bad disks and high temperatures resulted in an increase from the normal ~1% AFR to an AFR over 60% for a period of weeks. Fortunately, the high-failure-rate disks were constrained to a single cell and there was no data loss because the buddy and XOR blocks were in other cells with lower temperatures that were unaffected.

Current Technology Choices

Fontana 2016 analysis
Robert Fontana of IBM has an excellent overview of the roadmaps for tape, disk, optical and NAND flash (PDF) through the early 2020s. These are the only media technologies currently shipping in volume. Given the long lead times for new storage technologies, no other technology will significantly impact the bulk storage market before then.


Tape

Historically, tape was the medium of choice for long-term storage. Its basic recording technology lags hard disk by many years, so it has a much more credible technology road-map than disk. The reason is that the bits on the tape are much larger: current hard disks are roughly 1000Gbit/in², while tape is projected to reach roughly 50Gbit/in² in 6 years' time.

But tape's importance is fading rapidly. There are several reasons:
  • Tape is a very small market in unit terms (See this comment):
    Just under 20 million LTO cartridges were sent to customers last year. As a comparison let's note that WD and Seagate combined shipped more than 350 million disk drives in 2015; the tape cartridge market is less than 0.00567 per cent of the disk drive market in unit terms
  • In effect there is now a single media supplier per technology, raising fears of price gouging and supply vulnerability. The disk market has consolidated too, but there are still two very viable suppliers plus another. Hard disk market share is:
    split between the three remaining HDD companies with Western Digital’s market share at 42%, Seagate’s at 37% and Toshiba at 21%.
  • The advent of data-mining and web-based access to archives makes the long access latency of tape less tolerable.
  • The robots that, at scale, access the tape cartridges have a limited number of slots. To maximize the value of each slot it is necessary to migrate data to new, higher-capacity cartridges as soon as they appear. This has two effects. First, it makes the long service life of tape media less important. Second, it consumes a substantial fraction of the available bandwidth.
As an off-line medium, tape's cost and performance are determined by the ratio between the number of media (slots), which sets the total capacity of the system at a given cartridge technology, and the number of drives, which sets the access bandwidth. It can appear very inexpensive at a high media/drive ratio, but the potential bandwidth of the drives is likely to be mostly consumed with migrating old cartridges to new, higher-capacity ones. This is an illustration of the capacity vs. bandwidth tradeoffs explored in Steven Hetzler and Tom Coughlin's Touch Rate: A metric for analyzing storage system performance.
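
To see how the media/drive ratio drives this tradeoff, here is a minimal sketch of the time needed just to read every cartridge in a library once. Every number in it is hypothetical (a 5,000-slot library of 6TB cartridges and drives streaming at 300MB/s), and it ignores robot, mount and seek time, so real migrations would take longer.

```python
def migration_days(slots, cartridge_tb, drives, drive_mb_per_s):
    """Days for all drives, streaming flat out, to read every cartridge once."""
    total_bytes = slots * cartridge_tb * 10**12
    bytes_per_day = drives * drive_mb_per_s * 10**6 * 86_400
    return total_bytes / bytes_per_day


# Same hypothetical library, different numbers of drives.
for drives in (4, 16, 64):
    days = migration_days(slots=5_000, cartridge_tb=6,
                          drives=drives, drive_mb_per_s=300)
    print(f"{drives:>2} drives ({5_000 // drives} slots per drive): "
          f"{days:,.0f} days to read the whole library once")
```

The fewer drives per slot, the cheaper the system looks per byte, and the larger the share of drive time that a single migration pass consumes.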


Optical

Like tape, optical media (DVD and Blu-ray) are off-line media whose cost and performance are determined by the media/drive ratio in their robots. They have long media life and some other attractive properties that mitigate some threats: they are immune to electromagnetic pulse effects, and most are physically write-once.

Recently, Facebook and Panasonic have provided an impressive example of the appropriate and cost-effective use of optical media. The initial response to Facebook's announcement of their prototype Blu-ray cold storage system focused on the 50-year life of the disks, but it turns out that this isn't the interesting part of the story. Facebook's problem is that they have a huge flow of data that is accessed rarely but needs to be kept for the long-term at the lowest possible cost. They need to add bottom tiers to their storage hierarchy to do this.

The first tier they added to the bottom of the hierarchy stored the data on mostly powered-down hard drives. Some time ago a technology called MAID (Massive Array of Idle Disks) was introduced but didn't make it in the market. The idea was that by putting a large cache in front of the disk array, most of the drives could be spun down to reduce the average power draw. MAID did reduce the average power draw, at the cost of some delay from cache misses, but in practice the proportion of drives that were spun down wasn't as great as expected, so the average power reduction wasn't as much as hoped. And the worst case was about the same as a RAID, because the cache could be thrashed in a way that caused almost all the drives to be powered up.

Facebook's design is different. It is aimed at limiting the worst-case power draw. It exploits the fact that this storage is at the bottom of the storage hierarchy and can tolerate significant access latency. Disks are assigned to groups in equal numbers. One group of disks is spun up at a time in rotation, so the worst-case access latency is the time needed to cycle through all the disk groups. But the worst-case power draw is only that for a single group of disks and enough compute to handle a single group.
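
The bound this design gives can be written down directly. The sketch below is an illustration only: the function name, the group count, the disks per group, the spin-up window and the per-disk wattage are all hypothetical, and the compute needed to serve the active group is left out.

```python
def worst_case(num_groups, disks_per_group, window_hours, watts_per_disk):
    """Worst-case read latency and disk power when exactly one group of
    disks is spun up at a time, in strict rotation."""
    # In the worst case a request waits for a full cycle through the groups.
    latency_hours = num_groups * window_hours
    # Only the single active group ever draws spinning power.
    power_watts = disks_per_group * watts_per_disk
    return latency_hours, power_watts


groups, disks, window, watts = 12, 40, 1.0, 8.0   # hypothetical values
latency, power = worst_case(groups, disks, window, watts)
all_spinning = groups * disks * watts             # MAID-style worst case

print(f"worst-case latency: {latency:.0f} hours")
print(f"worst-case disk power: {power:.0f} W, versus {all_spinning:.0f} W "
      f"if every disk could be spun up at once")
```

The point of the design is that the second number is a hard ceiling, so power and cooling can be provisioned for it rather than for the whole array.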

Why is this important? Because of the synergistic effects knowing the maximum power draw enables. The power supplies can be much smaller, and because the access time is not critical, need not be duplicated. Because Facebook builds entire data centers for cold storage, the data center needs much less power and cooling, and doesn't need backup generators. It can be more like cheap warehouse space than expensive data center space. Aggregating these synergistic cost savings at data center scale leads to really significant savings.

Nevertheless, this design has high performance where it matters to Facebook, in write bandwidth. While a group of disks is spun up, any reads queued up for that group are performed. But almost all the I/O operations to this design are writes. Writes are erasure-coded, and the shards all written to different disks in the same group. In this way, while a group is spun up, all disks in the group are writing simultaneously providing huge write bandwidth. When the group is spun down, the disks in the next group take over, and the high write bandwidth is only briefly interrupted.

Next, below this layer of disk cold storage Facebook implemented the Blu-ray cold storage that drew such attention. It has 12 Blu-ray drives for an entire rack of cartridges holding 10,000 100GB Blu-ray disks managed by a robot. When the robot loads a group of 12 fresh Blu-ray disks into the drives, the appropriate amount of data to fill them is read from the currently active hard disk group and written to them. This scheduling of the writes allows for effective use of the limited write capacity of the Blu-ray drives. If the data are ever read, a specific group has to be loaded into the drives, interrupting the flow of writes, but this is a rare occurrence. Once all 10,000 disks in a rack have been written, the disks will be loaded for reads infrequently. Most of the time the entire petabyte rack will sit there idle.

It is this careful, organized scheduling of the system's activities at data center scale that enables the synergistic cost reductions of cheap power and space. It may be true that the Blu-ray disks have a 50-year lifetime but this isn't what matters. No-one expects the racks to sit in the data center for 50 years; at some point before then they will be obsoleted by some unknown new, much denser and more power-efficient cold storage medium (perhaps DNA).


Hard Disk

Exabytes shipped
It is still the case, as it has been for decades, that the vast majority of bytes of storage shipped each year are hard disk. Until recently, these disks went into many different markets: desktops, laptops, servers, digital video recorders, and storage systems from homes to data centers. Over the last few years flash has increasingly displaced hard disk from most of these markets, with the sole exception of bulk storage in data centers ("the cloud").

The shrinking size of the magnetic domains that store data on the platters of hard disks, and the fairly high temperatures found inside operating drives, mean that the technology is approaching the superparamagnetic limit at which the bits become unstable. HAMR is the response, using materials which are more resistant to thermal instability but which therefore require heating before they can be written. The heat is supplied by a laser focused just ahead of the magnetic write head. As we saw above, the difficulty and cost of this technology transition have been consistently under-estimated. The successor to HAMR, BPM, is likely to face even worse difficulties. Disk areal density will continue to improve, but much more slowly than in its pre-2010 heyday. Vendors are attempting to make up for the slow progress in areal density in two ways:
  • Shingling, which means moving the tracks so close together that writing a track partially overwrites the adjacent track. Very sophisticated signal processing allows the partially overwritten data to be read. Shingled drives come in two forms. WD's drives expose the shingling to the host, requiring the host software to be changed to treat them like append-only media. Seagate's drives are device-managed, with on-board software obscuring the effect of shingling, at the cost of greater variance in performance.
  • Helium, which replaces air inside the drive, allowing the heads to fly lower and thus allowing more platters to fit in the same form factor. WD's recently announced 12TB drives have 8 platters. Adding platters adds cost, so does little to increase the Kryder rate.
WD unit shipments
Disk manufacturing is a high-volume, low-margin business. The industry was already consolidating before the floods in Thailand wiped out 40% of the world's disk manufacturing capacity. Since this disruption, the industry has consolidated to two-and-a-half suppliers: Western Digital has a bit over 2/5 of the market, and Seagate has a bit under 2/5. Toshiba is a distant third with 1/5, raising doubts about their ability to remain competitive in this volume market.

Seagate unit shipments
The effect of flash displacing hard disk from many of its traditional markets can be seen in the unit volumes for the two major manufacturers. Both have seen decreasing total unit shipments for more than 2 years. Economic stress on the industry has increased. Seagate plans a 35% reduction in capacity and 14% layoffs. WDC has announced layoffs. The most recent two quarters have seen a slight recovery:
Hard disk drives shipments have had several quarters of declining shipments since the most recent high in 2014 and the peak of 653.6 million units in 2010 (before the 2011 Thailand floods). Last quarter and likely this quarter will see significant HDD shipment increases, partly making up for declining shipments in the first half of 2016.
The spike in demand is for high-end capacity disks and is causing supply chain difficulties for component manufacturers:
Trendfocus thinks [manufacturers] therefore won't invest heavily to meet spurts of demand. Instead, the firm thinks, suppliers will do their best to juggle disk-makers' demands.

“This may result in more tight supply situations like this in the future, but ultimately, it is far better off to deal with tight supply conditions than to deal with over-supply and idle capacity issues” says analyst John Kim
Reducing unit volumes reduces the economies of scale underlying the low cost of disk, slowing disk's Kryder rate, and making disk less competitive with flash. Reduced margins from this pricing pressure reduce the cash available for investment in improving the technology, further reducing the Kryder rate. This looks like the start of a slow death spiral for disk.


Flash as a data storage technology is almost 30 years old. Eli Harari filed the key enabling patent in 1988, describing multi-level cells, wear-leveling and the Flash Translation Layer. Flash has yet to make a significant impact on the lower levels of the storage hierarchy. If flash is to displace disk from these lower levels, massive increases in flash shipments will be needed. There are a number of ways flash manufacturers could increase capacity.

They could build more flash fabs, but this is extremely expensive. If there aren't going to be a lot of new flash fabs, what else could the manufacturers do to increase shipments from the fabs they have?

The traditional way of delivering more chip product from the same fab has been to shrink the chip technology. Unfortunately, shrinking the technology from which flash is made has bad effects. The smaller the cells, the less reliable the storage and the fewer times it can be written, as shown by the vertical axis in this table:
Write endurance vs. cell size
Both in logic and in flash, the difficulty in shrinking the technology further has led to 3D, stacking layers on top of each other. 64-layer flash is in production, allowing manufacturers to go back to larger cells with better write endurance.

Flash has another way to increase capacity. It can store more bits in each cell, as shown in the horizontal axis of the table. The behavior of flash cells is analog; the bits are the result of signal processing in the flash controller. By improving the analog behavior through tweaks to the chip-making process, and improving the signal processing in the flash controller, it has been possible to move from 1 (SLC) to 2 (MLC) to 3 (TLC) bits per cell. Because 3D has allowed increased cell size (moving up the table), TLC SSDs are now suitable for enterprise workloads.

Back in 2009, thanks to their acquisition of M-Systems, SanDisk briefly shipped some 4-bit-per-cell (QLC) memory (hat tip to Brian Berg). But up to now the practical limit has been 3. As the table shows, storing more bits per cell also reduces the write endurance (and the reliability).

LAM research NAND roadmap
As more and more layers are stacked the difficulty of the process increases, and it is currently expected that 64 layers will be the limit for the next few years. This is despite the normal optimism from the industry roadmaps.

Beyond that for the near term manufacturers expect to use die-stacking. That involves taking two (or potentially more) complete 64-layer chips and bonding one on top of the other, connecting them by Through Silicon Vias (TSVs). TSVs are holes through the chip substrate containing wires.

Although adding 3D layers does add processing steps, and thus some cost, it merely lengthens the processing pipeline. It doesn't slow the rate at which wafers can pass through and, because each wafer contains more storage, it increases the fab's output of storage. Die-stacking, on the other hand, doesn't increase the amount of storage per wafer, only per package. It doesn't increase the fab's output of bytes.

It is only recently that sufficient data has become available to study the reliability of flash at scale in data centers. The behavior observed differs in several important ways from that of hard disks in similar environments with similar workloads. But since the workloads typical of current flash usage (near the top of the hierarchy) are quite unlike those of long-term bulk storage, the relevance of these studies is questionable.

QLC will not have enough write endurance for conventional SSD applications. So will there be enough demand for manufacturers to produce it, and thus double their output relative to TLC?

Cloud systems such as Facebook's use tiered storage architectures in which re-write rates decrease rapidly down the layers. Because most re-writes would be absorbed by higher layers, it is likely that QLC-based SSDs would work well at the bulk storage level despite only a 500 write cycle life. To do so they would likely need quite different software in the flash controllers.
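A back-of-envelope sketch of why low re-write rates matter here. The 500-cycle endurance is the figure above; the rewrite intervals are hypothetical illustrations, not measurements, and the calculation ignores write amplification and retention effects:

```python
# Rough wear-out estimate for a flash device, ignoring write amplification.
# The 500-cycle endurance figure is from the text; the rewrite intervals
# below are hypothetical illustrations only.

def years_to_wear_out(write_cycles: int, days_per_full_rewrite: float) -> float:
    """Years until the device has been fully rewritten `write_cycles` times."""
    return write_cycles * days_per_full_rewrite / 365.0

QLC_CYCLES = 500

for label, days in [("rewritten daily (hot tier)", 1.0),
                    ("rewritten monthly", 30.0),
                    ("rewritten yearly (bulk tier)", 365.0)]:
    print(f"{label}: {years_to_wear_out(QLC_CYCLES, days):.1f} years")
```

Under these assumptions a device rewritten in full every day wears out in well under two years, while one rewritten once a year would outlast any plausible service life.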

Flash vs. Disk

Probably, at some point in the future flash will displace hard disk as the medium for long-term storage. There are two contrasting views as to how long this will take.

Fontana EB shipped
First, the conventional wisdom as expressed by the operators of cloud services and the disk industry, and supported by the graph showing how few exabytes of flash are shipped in comparison to disk. Note also that total capacity manufactured annually is increasing linearly, not exponentially as many would believe.

Although flash is displacing disk from markets such as PCs, laptops and servers, Eric Brewer's fascinating keynote at the 2016 FAST conference started from the assertion that in the medium term the only feasible technology for bulk data storage in the cloud was spinning disk.

NAND vs. HDD capex/TB
The argument is that flash, despite its many advantages, is and will remain too expensive for the bulk storage layer. The graph of the ratio of capital expenditure per TB of flash and hard disk shows that each exabyte of flash contains about 50 times as much capital as an exabyte of disk. Fontana estimates that last year flash shipped 83EB and hard disk shipped 565EB. For flash to displace hard disk immediately would need 32 new state-of-the-art fabs at around $9B each or nearly $300B in total investment.
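The arithmetic behind these numbers can be checked with a few lines. The inputs are the figures quoted above; the per-fab output is merely implied by dividing one by the other and is not a figure from Fontana:

```python
# Figures quoted in the text (Fontana's estimates for last year).
FLASH_EB = 83               # exabytes of flash shipped
DISK_EB = 565               # exabytes of hard disk shipped
NEW_FABS = 32               # fabs needed to displace disk immediately
COST_PER_FAB_BILLION = 9    # approximate cost per fab, in $ billions

total_investment = NEW_FABS * COST_PER_FAB_BILLION   # $ billions
implied_eb_per_fab = DISK_EB / NEW_FABS              # derived, not from the text
disk_to_flash_ratio = DISK_EB / FLASH_EB

print(f"Total investment: ${total_investment}B")
print(f"Implied output per new fab: {implied_eb_per_fab:.1f} EB/yr")
print(f"Disk shipped {disk_to_flash_ratio:.1f}x as many exabytes as flash")
```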

But over the next 4 years Fontana projects NAND flash shipments will grow to 400EB/yr versus hard disk shipments of perhaps 800EB/yr. So there will be continued gradual erosion of hard disk market share.

Second, the view from the flash advocates. They argue that the fabs will be built, because they are no longer subject to conventional economics. The governments of China, Japan, and other countries are stimulating their economies by encouraging investment, and they regard dominating the market for essential chips as a strategic goal, something that justifies investment. They are thinking long-term, not looking at the next quarter's results. The flash companies can borrow at very low interest rates, so even if they do need to show a return, they only need to show a very low return.

If the fabs are built, and if QLC becomes usable for bulk storage, the increase in supply will increase the Kryder rate of flash. This will accelerate the trend of storage moving from disk to flash. In turn, this will increase the rate at which disk vendors' unit shipments decrease. In turn, this will decrease their economies of scale, and cause disk's Kryder rate to go negative. The point at which flash becomes competitive with disk moves closer in time.

The result would be that the Kryder rate for bulk storage, which has been very low, would get back closer to the historic rate sooner, and thus that storing bulk data for the long term would be significantly cheaper. But this isn't the only effect. When Data Domain's disk-based backup displaced tape, greatly reducing the access latency for backup data, the way backup data was used changed. Instead of backups being used mostly to cover media failures, they became used mostly to cover user errors.

Similarly, if flash were to displace disk, the access latency for stored data would be significantly reduced, and the way the data is used would change. Because it is more accessible, people would find more ways to extract value from it. The changes induced by reduced latency would probably significantly increase the perceived value of the stored data, which would itself accelerate the turn-over from disk to flash.

If we're not to fry the planet, oil companies cannot develop many of the reserves they carry on their books; they are "stranded assets". Both views of the future of disk vs. flash involve a reduction in the unit volume of drives. The disk vendors cannot raise prices significantly, because doing so would accelerate the reduction in unit volume. Thus their income will decrease, and with it their ability to finance the investments needed to get HAMR and then BPM into the market. The longer they delay these investments, the more difficult it becomes to afford them. Thus it is possible that HAMR and likely that BPM will be "stranded technologies", advances we know how to build, but never actually deploy in volume.

Future Technology Choices

Simply for business reasons, the time and the cost of developing the necessary huge manufacturing capacity, it is likely that disk and flash will dominate the bulk storage market for the medium term. Tape and optical will probably fill the relatively small niche for off-line media. Nevertheless, it is worth examining the candidate technologies being touted as potentially disrupting this market.

Storage Class Memories

As we saw above, and just like hard disk's magnetic domains, flash cells are approaching the physical limits at which they can no longer retain data. 3D and QLC have delayed the point at which flash density slows, but they are one-time boosts to the technology. Grouped under the name Storage Class Memories (SCMs), several technologies that use similar manufacturing techniques to those that build DRAM (Dynamic Random Access Memory) are competing to be the successor to flash. Like flash but unlike DRAM, they are non-volatile, so are suitable as a storage medium. Like DRAM but unlike flash, they are random- rather than block-access and do not need to be erased before writing, eliminating a cause of the tail latency that can be a problem with flash (and hard disk) systems.

Typically, these memories are slower than DRAM so are not a direct replacement but are intended to form a layer between (faster) DRAM and (cheaper) flash. But included in this group is a more radical technology that claims to be able to replace DRAM while being non-volatile. Nantero uses carbon nanotubes to implement memories that they claim:
will have several thousand times faster rewrites and many thousands of times more rewrite cycles than embedded flash memory.

NRAM, non-volatile RAM, is based on carbon nanotube (CNT) technology and has DRAM-class read and write speeds. Nantero says it has ultra-high density – but no numbers have been given out – and it's scalable down to 5nm, way beyond NAND.

This is the first practicable universal memory, combining DRAM speed and NAND non-volatility, better-than-NAND endurance and lithography-shrink prospects.
Fujitsu is among the companies to have announced plans for this technology.

Small volumes of some forms of SCM have been shipping for a couple of years. Intel and Micron attracted a lot of attention with their announcement of 3D XPoint, an SCM technology. Potentially, SCM is a better technology than flash, 1000 times faster than NAND, 1000 times the endurance, and 100 times denser.

However, the performance of initial 3D XPoint SSD products disappointed the market by being only about 8 times faster than NAND SSDs. In Why 3D XPoint SSDs Will Be Slow, Jim Handy (The SSD Guy) explained that this is a system-level issue. The system overheads of accessing an SSD include the bus transfer, the controller logic and the file system software. They:
account for about 15 microseconds of delay.  If you were to use a magical memory that had zero delays, then its bar would never get any smaller than 15 microseconds.

In comparison, the upper bar, the one representing the NAND-based SSD, has combined latencies of about 90 microseconds, or six times as long.
Thus for the performance of SCMs to justify the large cost increase, these overheads need to be eliminated:
Since so much of 3D XPoint’s speed advantage is lost to these delays The SSD Guy expects for the PCIe interface to contribute very little to long-term 3D XPoint Memory revenues.  Intel plans to offer another implementation, shipping 3D XPoint Memory on DDR4 DIMMs.  A DIMM version will have much smaller delays of this sort, allowing 3D XPoint memory to provide a much more significant speed advantage.
DIMMs are the form factor of DRAM; the system would see DIMM SCMs as memory not as an I/O device. They would be slower than DRAM but non-volatile. Exploiting these attributes requires a new, non-volatile memory layer in the storage hierarchy. Changing the storage software base to implement this will take time, which is why it makes sense to market SCMs initially as SSDs despite these products providing much less than the potential performance of the medium.
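Handy's point is an Amdahl's-law style bound, which a small sketch makes concrete. The 15 and 90 microsecond figures are those quoted above; the "1000x faster" medium is a hypothetical illustration:

```python
# Latency figures quoted from Jim Handy's analysis in the text.
SYSTEM_OVERHEAD_US = 15.0   # bus transfer, controller logic, file system
NAND_SSD_TOTAL_US = 90.0    # overhead plus NAND media latency

def ssd_speedup(media_latency_us: float) -> float:
    """Speedup over the NAND SSD for a medium with the given latency."""
    return NAND_SSD_TOTAL_US / (SYSTEM_OVERHEAD_US + media_latency_us)

# NAND media latency implied by the two quoted figures (derived).
nand_media_us = NAND_SSD_TOTAL_US - SYSTEM_OVERHEAD_US

print(ssd_speedup(0.0))                    # a "magical" zero-latency medium
print(ssd_speedup(nand_media_us / 1000))   # hypothetical medium 1000x faster than NAND
```

On these figures, however fast the medium, an SSD behind the same interface can never be more than six times faster than the NAND SSD; only removing the fixed overhead changes that.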

Like flash, SCMs leverage much of the semiconductor manufacturing technology. Optimistically, one might expect SCM to impact the capacity market sometime in the late 2030s. SCMs have occupied the niche for a technology that exploits semiconductor manufacturing. A technology that didn't would find it hard to build the manufacturing infrastructure to ship the thousands of exabytes a year the capacity market will need by then.

Exotic Optical Media

Those who believe that quasi-immortal media are the solution to the long-term storage problem have two demonstrable candidates, each aping a current form factor. One is various forms of robust DVD, such as the University of Southampton's 5D quartz DVDs, or Hitachi's fused silica glass, claimed to be good for 300 million years. The other is using lasers to write data on to the surface of steel tape.

Both are necessarily off-line media, with the limited applicability that implies. Neither is shipping in volume, nor has any prospect of doing so. Thus neither will significantly impact the bulk storage market in the medium term.

The latest proposal for a long-term optical storage medium is diamond, lauded by Abigail Beall in the Daily Mail as Not just a girl's best friend: Defective DIAMONDS could solve our data crisis by storing 100 times more than a DVD:
'Without better solutions, we face financial and technological catastrophes as our current storage media reach their limits,' co-first author Dr Siddharth Dhomkar wrote in an article for The Conversation. 'How can we store large amounts of data in a way that's secure for a long time and can be reused or recycled?'.

Speaking to the future real-world practicality of their innovation, Mr. Jacob Henshaw, co-first author said: 'This proof of principle work shows that our technique is competitive with existing data storage technology in some respects, and even surpasses modern technology in terms of re-writability.'
In reality, the researchers reported that:
"images were imprinted via a red laser scan with a variable exposure time per pixel (from 0 to 50 ms). Note the gray scale in the resulting images corresponding to multivalued (as opposed to binary) encoding. ... Information can be stored and accessed in three dimensions, as demonstrated for the case of a three-level stack. Observations over a period of a week show no noticeable change in these patterns for a sample kept in the dark. ... readout is carried out via a red laser scan (200 mW at 1 ms per pixel). The image size is 150 × 150 pixels in all cases."
So, at presumably considerable cost, the researchers wrote maybe 100K bits at a few milliseconds per bit, read them back a week later at maybe a few hundred microseconds per bit without measuring an error rate. Unless the "financial and technological catastrophes" are more than two decades away, diamond is not a solution to them.


Nature recently featured a news article by Andy Extance entitled How DNA could store all the world's data, which claimed:
If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.
The article is based on research at Microsoft that involved storing 151KB in DNA. The research is technically interesting, starting to look at fundamental DNA storage system design issues. But it concludes (my emphasis):
DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. While this is not practical yet due to the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry[4].
The paper doesn't claim that the solution is at hand any time soon. Reference 4 is a two-year-old post to Rob Carlson's blog. A more recent post to the same blog puts the claim that:
both technologies are improving at an exponential rate
in a somewhat less optimistic light. It may be true that DNA sequencing is getting cheaper very rapidly. But already the cost of sequencing (read) was insignificant in the total cost of DNA storage. What matters is the synthesis (write) cost. Extance writes:
A closely related factor is the cost of synthesizing DNA. It accounted for 98% of the expense of the $12,660 EBI experiment. Sequencing accounted for only 2%, thanks to a two-millionfold cost reduction since the completion of the Human Genome Project in 2003.
The rapid decrease in the read cost is irrelevant to the economics of DNA storage; if it were free it would make no difference. Carlson's graph shows that the write cost, the short DNA synthesis cost (red line) is falling more slowly than the gene synthesis cost (yellow line). He notes:
But the price of genes is now falling by 15% every 3-4 years (or only about 5% annually).
A little reference checking reveals that the Microsoft paper's claim that:
both technologies are improving at an exponential rate
while strictly true is deeply misleading. The relevant technology is currently getting cheaper more slowly than hard disk or flash memory! And since this has been true for around two decades, making the necessary 3-4 fold improvement just to keep up with the competition is going to be hard.
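Carlson's conversion from "15% every 3-4 years" to an annual rate can be checked by compounding. This is only a sketch of that conversion, using the figures in his quote:

```python
def annual_decline(total_decline: float, years: float) -> float:
    """Constant annual price decline that compounds to `total_decline` over `years`."""
    return 1 - (1 - total_decline) ** (1 / years)

# "15% every 3-4 years", as quoted from Carlson above.
for years in (3, 4):
    rate = annual_decline(0.15, years)
    print(f"15% over {years} years is about {rate:.1%} per year")
```

Both ends of the range come out at roughly 4-5% per year, consistent with his "only about 5% annually".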

Decades from now, DNA will probably be an important archival medium. But the level of hype around the cost of DNA storage is excessive. Extance's article admits that cost is a big problem, yet it finishes by quoting Goldman, lead author of a 2013 paper in Nature whose cost projections were massively over-optimistic. Goldman's quote is possibly true but again deeply misleading:
"Our estimate is that we need 100,000-fold improvements to make the technology sing, and we think that's very credible," he says. "While past performance is no guarantee, there are new reading technologies coming onstream every year or two. Six orders of magnitude is no big deal in genomics. You just wait a bit."
Yet again the DNA enthusiasts are waving the irrelevant absolute cost decrease in reading to divert attention from the relevant lack of relative cost decrease in writing. They need an improvement in relative write cost of at least 6 orders of magnitude. To do that in a decade means cutting the relative cost roughly fourfold every year (merely halving it every year would deliver only about 3 orders of magnitude), not increasing the relative cost by 10-15% every year.

Journalists like Beall writing for mass-circulation newspapers can perhaps be excused for merely amplifying the researchers' hype, but Extance, writing for Nature, should be more critical.

Storage Media or Storage Systems?

As we have seen, the economics of long-term storage mean that the media to be used will have neither the service life nor the reliability needed. These attributes must therefore be provided by a storage system whose architecture anticipates that media and other hardware components will be replaced frequently, and that data will be replicated across multiple media.

The system architecture surrounding the media is all the more important given two findings from research into production use of large disk populations:
What do we want from a future bulk storage system?
  • An object storage fabric.
  • With low power usage and rapid response to queries.
  • That maintains high availability and durability by detecting and responding to media failures without human intervention.
  • And whose reliability is externally auditable.
At the 2009 SOSP David Anderson and co-authors from C-MU presented FAWN, the Fast Array of Wimpy Nodes. It inspired me to suggest, in my 2010 JCDL keynote, that the cost savings FAWN realized without performance penalty by distributing computation across a very large number of very low-power nodes might also apply to storage.

The following year Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We showed that it was indeed plausible that, even at then current flash prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk while providing high performance and enabling self-healing.

Two subsequent developments suggest we were on the right track. First, Seagate's announcement of its Kinetic architecture and Western Digital's subsequent announcement of drives that ran Linux. Drives have on-board computers that perform command processing, internal maintenance operations, and signal processing. These computers have spare capacity, which both approaches use to delegate computation from servers to the storage media, and to get IP communication all the way to the media, as DAWN suggested. IP to the medium is a great way to future-proof the drive interface.

FlashBlade hardware
Second, although flash remains more expensive than hard disk, since 2011 the gap has narrowed from a factor of about 12 to about 6. Pure Storage recently announced FlashBlade, an object storage fabric composed of large numbers of blades, each equipped with:
  • Compute: 8-core Xeon system-on-a-chip, and Elastic Fabric Connector for external, off-blade, 40GbitE networking,
  • Storage: NAND storage with 8TB or 52TB of raw capacity and on-board NV-RAM with a super-capacitor-backed write buffer plus a pair of ARM CPU cores and an FPGA,
  • On-blade networking: PCIe card to link compute and storage cards via a proprietary protocol.
FlashBlade clearly isn't DAWN. Each blade is much bigger, much more powerful and much more expensive than a DAWN node. No-one could call a node with an 8-core Xeon, 2 ARMs, and 52TB of flash "wimpy", and it'll clearly be too expensive for long-term bulk storage. But it is a big step in the direction of the DAWN architecture.

DAWN exploits two separate sets of synergies:
  • Like FlashBlade, DAWN moves the computation to where the data is, rather than moving the data to where the computation is, reducing both latency and power consumption. The further data moves on wires from the storage medium, the more power and time it takes. This is why Berkeley's Aspire project's architecture is based on optical interconnect technology, which when it becomes mainstream will be both faster and lower-power than wires. In the meantime, we have to use wires.
  • Unlike FlashBlade, DAWN divides the object storage fabric into a much larger number of much smaller nodes, implemented using the very low-power ARM chips used in cellphones. Because the power a CPU needs tends to grow faster than linearly with performance, the additional parallelism provides comparable performance at lower power.
So FlashBlade currently exploits only one of the two sets of synergies. But once Pure Storage has deployed this architecture in its current relatively high-cost and high-power technology, re-implementing it in lower-cost, lower-power technology should be easy and non-disruptive. They have done the harder of the two parts.
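The second synergy can be illustrated with a toy model. It assumes, purely for illustration, that a CPU's power grows as performance raised to an exponent alpha greater than 1, and that the workload parallelizes perfectly across nodes; neither the exponent nor the node counts come from the DAWN report:

```python
def total_power(total_perf: float, nodes: int, alpha: float) -> float:
    """Total power for `nodes` equal CPUs that together deliver `total_perf`.

    Toy model: each CPU's power is (its performance) ** alpha, in arbitrary
    units, and the workload is assumed to split perfectly across nodes.
    """
    per_node_perf = total_perf / nodes
    return nodes * per_node_perf ** alpha

ALPHA = 2.0   # hypothetical super-linear exponent, for illustration only

for nodes in (1, 8, 64):
    print(f"{nodes:3d} nodes: power = {total_power(64.0, nodes, ALPHA):.0f}")
```

In this model the same aggregate performance costs N^(alpha-1) times less power when spread over N nodes, which is the intuition behind using many wimpy nodes. Real workloads parallelize imperfectly, so the actual saving would be smaller.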

Igneous board
Update: Further evidence for the trend towards DAWN-like architectures comes from Chris Mellor at The Register:
Both OpenIO and Igneous have launched plug-on ARM server cards for storage drives: these single-board computers each snap onto a hard drive to form nano-servers that are organized into a grid of object storage nodes. ... Igneous ... launched its service in October. There were two parts to Igneous’ concept: a subscription service to a managed on-premises storage array presented like a public cloud S3 API-accessed storage service; and the actual technology, with two 1U stateless x86 dataRouters (aka controllers) fronting a 4U dataBox containing 60 nano-servers ... Each strap-on board has a 32-bit ARMv7 1GHz dual-ARM-Cortex-A9-core Marvell Armada 370 system-on-chip that includes two 1Gbps Ethernet ports and can talk SATA to the direct-attached 3.5-inch disk drive. That processor gives the card enough compute power to run Linux right up against the storage.
OpenIO board
OpenIO's system is similar:
The SLS-4U86 is a 4U box holding up to 96 vertically mounted 3.5-inch disk drives, providing up to 960TB of raw storage with 10TB drives and 1,152 TB with 12TB drives. The disk drives are actually nano-servers, nano-nodes in OpenIO terms, as they each have on-drive data processing capability.
A nano-node contains:
  • Hot-swappable 10TB or 12TB 3.5-inch SATA nearline disk drive
  • Dual 2.5Gb/s SGMII (Serial Gigabit Media Independent Interface) ports
Both systems are much closer to the DAWN concept than FlashBlade, albeit using hard drives as the storage medium.


Although predictions are always risky, it seems appropriate to conclude with some. The first is by far the most important:
  1. Increasing technical difficulty and decreasing industry competition will continue to keep the rate at which the per-byte cost of bulk storage media decreases well below pre-2010 levels. Over a decade or two this will cause a very large increase in the total cost of ownership of long-term data.
  2. No new media will significantly impact the bulk storage layer of the hierarchy in the medium term. It will be fought out between tape, disk, flash and conventional optical. Media that would impact the layer in this time-frame would already be shipping in volume, and none are.
  3. Their long latencies will confine tape and optical to really cold data, and thus to a small niche of the market.
  4. Towards the end of the medium term, storage class memories will start to push flash down the hierarchy into the bulk storage layer. 
  5. Storage system architectures will migrate functionality from the servers towards the storage media.


I'm grateful to Seagate, and in particular to Dave B. Anderson, for (twice) allowing me to pontificate about their industry, to Brian Berg for his encyclopedic knowledge of the history of flash, and Tom Coughlin for illuminating discussions and the first graph of exabytes shipped. This isn't to say that they agree with any of the above.


Nick Krabbenhoeft said...

If I understand the comparison correctly (20 million units to 350 million units), "less than 0.00567 per cent" should be "less than 5.7 per cent"

David. said...

Nick does understand the comparison correctly. Chris Mellor at The Register, from whose piece Riddle me this: What grows as it shrinks? Answer: LTO tape I quoted, got the math wrong. My bad, I should have checked it.

Dragan Espenschied said...

Finally a URL I can give to people talking about preserving digital artifacts for 10'000 years!

Thomas Lindgren said...

Some random speculation:

Sophisticated coding/signal processing has improved wireless capacity greatly over the years. We have HAMR, but can we expect any further heroics in this area for disks? If so, how far would that take us?

Another question that might be of some interest is the access patterns of cold storage. In particular, how many times will a cold disk be reused/overwritten during its lifetime? The case of a single write might be an interesting special case. (In my home environment, this is approximately the case.)

In either case, it seems at first glance like doing more processing at write and read could further improve capacity. In the same vein, could the disk r/w head be further evolved to assist in more complex operations? Could we get rid of tracks, for example?

On a higher level, for further one-time savings, compression will be ever more useful, I assume. As long as the logical contents of the disk (prior to encryption) do not look random, bits are being wasted. Likewise, the disk should presumably be available at as close as possible to raw capacity. Is it worth it to have self-describing disks, for example?

All of this subject to error correction, of course. But much of that will presumably be managed on higher levels (e.g., erasure coding over multiple disks).

David. said...

Just to emphasize the message about timescales, here is Chris Mellor at The Register reporting on Micron's fiscal Q1 results:

"The multi-year transition from 2D planar NAND to 3D flash is well under way, with almost 90 per cent of NAND shipments in 2020 expected to be 3D NAND, compared to 17 per cent currently (IDC numbers). 2D NAND is looking to be a long tail, niche market in the future with 3D flash taking over the mainstream. Micron's 2nd gen, 64-Layer, TLC (3 bits/cell), 3D NAND should be in mass production by August 2017."

David. said...

Toshiba, the distant third in the disk market, is in much worse financial trouble than the big two. The disk business isn't the major problem:

"Faced with the prospect of a multi-billion-dollar writedown that could wipe out its shareholders' equity, Japan's Toshiba is running out of fixes: it is burning cash, cannot issue shares and has few easy assets left to sell.

The Tokyo-based conglomerate, which is still recovering from a $1.3 billion accounting scandal in 2015, dismayed investors and lenders again this week by announcing that cost overruns at a U.S. nuclear business bought only last year meant it could now face a crippling charge against profit."

David. said...

Chris Mellor at The Register reports:

"YMTC, through its ownership of contract chip manufacturer XMC, has started building a memory semiconductor fab on a 13-hectare site at the Donghu New Technology Development Zone in Wuhan. ... the fab would soak up $24bn in investment. ... This will be the largest memory plant in China and include three 3D NAND production lines. Volume production should start in 2018, with a run rate of 300,000 12-inch wafers a month by 2020."

$8B per fab line supports the estimates above. Note that this development is less than 10% of the capacity needed to displace hard disk if it were on stream immediately instead of in 2020.

David. said...

Chris' report also validates the argument that the Chinese are investing strategically:

"48.96 per cent of YMTC is owned by China's National Integrated Circuit Industry Investment Fund, the Hubei IC Industry Investment Fund, and the Hubei Science and Technology Investment Group."

David. said...

Another Storage Class Memory technology has started sampling, albeit only in a 40nm process:

"ReRAM startup Crossbar has sample embedded ReRAM chips from SMIC that are currently undergoing evaluation.

SMIC is using a 40nm process and there are plans for a 28nm process in development but Crossbar envisages scaling at least to 16nm and 10nm but then lower."

David. said...

Yet another illustration of the seductive power the idea of quasi-immortal media has over the minds of people who don't understand the actual problems of preserving data is Richard Kemeny's All of Human Knowledge Buried in a Salt Mine.

David. said...

Bad news for the future of optical storage media is implied by Cyrus Farivar's Sony missed writing on the wall for DVD sales, takes nearly $1B writedown:

"Sony has finally figured out what the rest of us already knew—people just aren’t buying physical media like they used to.

In a Monday statement to investors, the company attributed the “downward revision… to a lowering of previous expectations regarding the home entertainment business, mainly driven by an acceleration of market decline.”

“The decline in the DVD and Blu-ray market was faster than we anticipated,” Takashi Iida, a Sony spokesman, told Bloomberg News."

David. said...

At The Register, Chris Mellor reports on Micron's analysts' day. Key quotes:

"It is shipping its gen-1 3D NAND with 32 layers and 384Gb die capacity and moving towards its second generation with 64 layers and 256Gb capacity in a 59mm2 die size. More than half its bit output in the second half of 2016 went into 3D NAND, which means planar, 2D NAND is now falling away."

and (optimistic):

"During 2017 Micron will work on developing QLC (quad-level cell or 4bits/cell flash) with a third more capacity than equivalent tech TLC flash. However, QLC flash will have lower endurance (write cycles essentially) and slower access than TLC flash, making it only suitable for read-intensive applications. ... We could be looking at making realtime analysis of archival data more affordable. ... Toshiba and WD also have a focus on QLC flash, which gives us a hint that we might hope to see QLC drives appear in 2018."

and (more realistic):

"The semiconductor incursion into storage is growing in strength and depth. If it can provide affordable and reliable bulk capacity storage then, quite simply, disk technology will go the way of tape technology over the next couple of decades."

David. said...

At The Register, Chris Mellor reports that the French video streaming site Dailymotion is using OpenIO nano-servers.

David. said...

Peter Bright at Ars Technica reports on the specs for Intel's first XPoint-based SSD. As predicted above by the SSD Guy, they fall far short of marketing hype:

"while these numbers do represent improvements on NAND flash, they're a far cry from the promised 1,000-fold improvements."

David. said...

Chris Mellor at The Register reports on a "no comment" from Oracle that casts doubt on the future of the T10000 tape format:

"There is now doubt over the continuing life of Oracle's StreamLine tape library product range. Customers cannot assume that the products will be developed or that Oracle's proprietary T10000 tape format has a future.

The obvious strategy is to transition to the open LTO format, which has a roadmap from its current LTO-7 format (6TB raw, 15TB compressed) ... another stage in the long decline of tape as a backup and archive storage medium."

David. said...

Lucas123 at /. points me to Lucas Mearian's Why laptops won’t come with larger SSDs this year:

"A dearth in NAND flash chip supply will cause the prices of mainstream solid-state drives (SSDs) to leap by as much as 16% this quarter over the previous quarter, meaning laptop makers won't likely offer consumers higher capacity SSDs in their new systems, according to a report from market research firm DRAMeXchange.

On average, contract prices for multi-level cell (MLC) SSDs supplied to the PC manufacturing industry are projected to go up by 12% to 16% compared with the final quarter of 2016; prices of triple-level cell (TLC) SSDs are expected to rise by 10% to 16% sequentially, according to DRAMeXchange."

This illustrates the point I made in Where Did All Those Bits Go? that supply and demand for storage are in balance; when demand rises but supply does not, prices rise to maintain the balance. It also illustrates the point that a huge wave of flash is not going to displace HDDs any time soon. Mearian writes:

"The SSD adoption rate in the global notebook market is estimated to reach 45% this year, according to DRAMeXchange."

David. said...

Chris Mellor and Simon Sharwood try to make sense of the deliberately opaque performance numbers in Intel's announcement of their P4800X XPoint-based NVMe SSD. The product is still extremely expensive and in very limited supply.

David. said...

More analysis of Intel's Optane product announcement in Chris Mellor's Inside Intel's Optanical garden.

David. said...

Chris Mellor at The Register reports that SK Hynix has 72-layer flash but at double the cell size of its competitors, so the capacity isn't that impressive.

David. said...

At IEEE Spectrum, Marty Perlmutter's The Lost Picture Show: Hollywood Archivists Can’t Outpace Obsolescence is a great explanation of why tape's media longevity is irrelevant to long-term storage:

"While LTO is not as long-lived as polyester film stock, which can last for a century or more in a cold, dry environment, it’s still pretty good.

The problem with LTO is obsolescence. Since the beginning, the technology has been on a Moore’s Law–like march that has resulted in a doubling in tape storage densities every 18 to 24 months. As each new generation of LTO comes to market, an older generation of LTO becomes obsolete. LTO manufacturers guarantee at most two generations of backward compatibility. What that means for film archivists with perhaps tens of thousands of LTO tapes on hand is that every few years they must invest millions of dollars in the latest format of tapes and drives and then migrate all the data on their older tapes—or risk losing access to the information altogether.

That costly, self-perpetuating cycle of data migration is why Dino Everett, film archivist for the University of Southern California, calls LTO “archive heroin—the first taste doesn’t cost much, but once you start, you can’t stop. And the habit is expensive.” As a result, Everett adds, a great deal of film and TV content that was “born digital,” even work that is only a few years old, now faces rapid extinction and, in the worst case, oblivion."

David. said...

The good Dr. Pangloss would be delighted with Seagate's latest optimistic roadmap:

"Seagate is getting closer to reaching its goal of making 20TB hard drives by 2020.

Over the next 18 months, the company plans to ship 14TB and 16TB hard drives, company executives said on an earnings call this week."

David. said...

Via Catalin Cimpanu, Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques by Yu Cai and a team from CMU, ETH Zurich and Seagate reports on:

"two sources of errors that can corrupt LSB data, and characterize their impact on real state-of-the-art 1X-nm (i.e., 15-19nm) MLC NAND flash chips. The first error source, cell-to-cell program interference, introduces errors into a flash cell when neighboring cells are programmed, as a result of parasitic capacitance coupling ... The second error source, read disturb, disrupts the contents of a flash cell when another cell is read."

The first allows for an attack similar to the Rowhammer attack on DRAM, and the second allows an attack in which a rapid flow of read operations causes read disturb errors. Cimpanu writes:

"these read disturb errors will "corrupt both pages already written to partially-programmed wordlines and pages that have yet to be written," ruining the SSD's ability to store data in a reliable manner in the future."

David. said...

At Tom's Hardware, Chris Ramseyer's Flash Industry Trends Could Lead Users Back to Spinning Disks is an important analysis of the impact of flash technology's evolution to TLC and 3D:

"The goal is to push the technology into more devices and increase market share over HDDs. The push for market share has decreased the divide in performance between flash and spinning disks."

Doing this means:

"the trend has been to slow performance to reduce costs. The more the technology is neutered, the closer to hard disk performance we see. On the controller side we've seen the number of processor cores and channels from each controller to the NAND flash shrink. On paper the new flash is faster than the old flash, so it's possible to achieve the same performance with fewer channels, but the larger die sizes also give us less parallelization. On the flash side, the move to more cost efficient 3-bit per cell (TLC) has delivered less sustained performance for heavy workloads that take longer to complete."

The impact is bigger on endurance:

"Vertically stacked TLC does increase endurance over planar 2D TLC, but the gain isn't as high as you might expect. We had two engineers tell us at Computex that Micron 64-layer TLC carries between 1,000 and 1,500 P/E cycles using their testing models. The Micron 256Gbit (Gen 2) TLC is still early but it's not a good sign for users. Neither Toshiba nor Micron want to discuss endurance with us at the die level; all endurance talk comes at the device level, where powerful error correction technology plays a large role. Planar MLC devices didn't use LDPC, an advanced form of error correction technology, but the controllers did run less powerful BCH ECC."

and on write bandwidth:

"The Intel SSD 600p is a very good indicator of the performance users will see in future. The drive features 3D TLC paired with a low-cost NVMe controller. In our reviews of the series we found the performance to be better than any SATA SSD ever shipped for most users, but the sustained write performance is lower than even mainstream SATA SSDs with MLC flash."

This is an inevitable evolution. Gradually, as storage class memories such as Optane enter the market, flash will lose out at the highest-performance segment of the market. Cost-per-byte will become more important and performance less important as flash gets pushed down towards the bulk storage segment.

David. said...

At The Register, Chris Mellor's IBM will soon become sole gatekeepers to the realm of tape – report covers Spectra Logic's Digital Data Storage Outlook 2017. Overall, Spectra Logic comes to conclusions similar to mine about the bulk storage segment of the market:

"3.5-inch disk will continue to store a majority of enterprise and cloud data requiring online or nearline access if, and only if, the magnetic disk industry is able to successfully deploy technologies that allow it to continue the downward trend of cost per capacity"


"Tape has the easiest commercialisation and manufacturing path to higher capacity technologies, but will require continuous investment in drive and media development. The size of the tape market will result in further consolidation, perhaps leaving only one drive and two tape media suppliers"


"No new storage technologies will have significant impact on the storage digital universe through 2026 with the possible exception of [storage class memories]"

David. said...

At The Register, Chris Mellor's report Quad goals: Western Digital clambers aboard the 4bits/cell wagon reveals that:

"Western Digital's 3D, 64-layer NAND is being armed with 4bits/cell (quad-level cell, QLC) and bit-cost scaling (BiCS3) technology. ... WD will have QLC SSD and removable drives on show at the Flash Memory Summit in Santa Clara next month. ... WD expects that its in-development 96-layer 3D NAND will come in QLC form as well."

David. said...

Chris Mellor at The Register reports that:

"Samsung has fired out four flashy announcements with higher capacity chips, faster drives, new packaging format and a flash version of Seagate's Kinetic disk concept."

and that:

"The fourth item in Sammy’s news blast was a Key Value SSD.


The Sammy "take" on this is that SSDs can store data faster and more simply if they take in a data object and store it as it is without converting it into logical blocks and mapping them to physical blocks. A data item is given a key which is its direct address, regardless of its size.

Sammy says, as a result, when data is read or written, a Key Value SSD can reduce redundant steps, which leads to faster data inputs and outputs, as well as increasing TCO and significantly extending the life of an SSD."

David. said...

Ed Grochowski's fascinating presentation on the history of hard disk technology at the Flash Memory Summit includes density and price graphs that illustrate the slowdown in Kryder's Law. Tom Gardner's view of the history is slightly different but also worth your time.

David. said...

Western Digital's HGST unit claims a record for the most bytes in a 3.5" drive:

"WDC has released an Ultrastar 14TB disk drive with host application software managing its shingled writing scheme.

It is the world's first 14TB disk drive, and is helium-filled, as usual at these greater-than-10TB capacities. The disk uses shingled media recording (SMR) with partially overlapping write tracks to increase the areal density to 1034Gbit/in2. An earlier He10 (10TB helium drive), which was not shingled, using perpendicular magnetic recording (PMR) technology, had an 816Gbit/in2 areal density, and the PMR He12 has a 864Gb/in2 one. Shingling adds 2TB of extra capacity over the He12."

Note the relatively small 17% capacity increment from shingling.
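The 17% figure follows from the capacities in the quote, and the areal densities quoted for the two drives tell a similar story. A minimal sketch of the arithmetic, using only the quoted numbers (the variable names are my own):

```python
# Capacities and areal densities as quoted in the report above.
he12_capacity_tb = 12        # PMR He12
smr_capacity_tb = 14         # shingled 14TB Ultrastar
he12_density_gbit_in2 = 864
smr_density_gbit_in2 = 1034

# Relative gain of the shingled drive over the PMR He12.
capacity_gain = (smr_capacity_tb - he12_capacity_tb) / he12_capacity_tb
density_gain = (smr_density_gbit_in2 - he12_density_gbit_in2) / he12_density_gbit_in2

print(f"capacity gain: {capacity_gain:.1%}")
print(f"areal density gain: {density_gain:.1%}")
```

The capacity gain rounds to 17%, while the areal density gain is a little under 20%.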

David. said...

The abstract for Blu-Ray Media Stability and Suitability for Long-Term Storage by Joe Iraci is not encouraging:

"The most recent generation of optical disc media available is the Blu-ray format. Blu-rays offer significantly more storage capacity than compact discs (CDs) and digital versatile discs (DVDs) and thus are an attractive option for the storage of large image or audio and video files. However, uncertainty exists on the stability and longevity of Blu-ray discs and the literature does not contain much information on these topics. In this study, the stabilities of Blu-ray formats such as read-only movie discs as well as many different brands of recordable and erasable media were evaluated. Testing involved the exposure of samples to conditions of 80 °C and 85 % relative humidity for intervals up to 84 days. Overall, the stability of the Blu-ray formats was poor with many discs significantly degraded after only 21 days of accelerated ageing. In addition to large increases in error rates, many discs showed easily identifiable visible degradation in several different forms. In a comparison with other optical disc formats examined previously, Blu-ray stability ranked very low. Other data from the study indicated that recording Blu-ray media with low initial error rates is challenging for some brands at this time, which is a factor that ultimately affects longevity."

David. said...

I keep referring back to the 2009 paper from CMU, FAWN, the Fast Array of Wimpy Nodes. Now, ten years on, Storage Newsletter points to Catalina: In-Storage Processing Acceleration for Scalable Big Data Analytics, which from the abstract sounds very FAWN-like:

"In this paper, we investigated the deployment of storage units with embedded low-power application processors along with FPGA-based reconfigurable hardware accelerators to address both performance and energy efficiency. To this purpose, we developed a high-capacity solid-state drive (SSD) named Catalina equipped with a quad-core ARM A53 processor running a Linux operating system along with a highly efficient FPGA accelerator for running applications in-place."