This is an enhanced version of a journal article that has been accepted for publication in Library Hi Tech, with images that didn't meet the journal's criteria, and additional material reflecting developments since submission. Storage technology evolution can't be slowed down to the pace of peer review.
What is Long-Term Storage?
Long-term storage implements the base layers of the hierarchy, often called "bulk" or "capacity" storage. Most discussions of storage technology focus on the higher, faster layers, which these days are the territory of all-flash arrays holding transactional databases, search indexes, breaking news pages and so on. The data in those systems is always just a cache. Long-term storage is where old blog posts, cat videos and most research datasets spend their lives.
What Temperature is Your Data?
If everything is working as planned, data in the top layers of the hierarchy will be accessed much more frequently, be "hotter", than data further down. At scale, this effect can be extremely strong.
|Muralidhar et al Figure 3
The Facebook data make two really strong arguments for hierarchical storage architectures at scale:
- That significant kinds of data should be moved from expensive, high-performance hot storage to cheaper warm and then cold storage as rapidly as feasible.
- That the I/O rate that warm storage should be designed to sustain is so different from that of hot storage, at least 2 and often many more orders of magnitude, that attempting to re-use hot storage technology for warm and even worse for cold storage is futile.
But there is a caveat. Typical at-scale systems such as Facebook's do show infrequent access to old data. This used to be true in libraries and archives. But the advent of data mining and other "big data" applications means that increasingly scholars want not to access a few specific items, but instead to ask statistical questions of an entire collection. The implications of this change in access patterns for long-term storage architectures are discussed below.
How Long is the "Medium Term"?
Iain Emsley's talk at PASIG2016 on planning the storage requirements of the 1PB/day Square Kilometer Array mentioned that the data was expected to be used for 50 years. How hard a problem is planning with this long a horizon? Looking back 50 years can provide a clue.
R. M. Fano's 1967 paper The Computer Utility and the Community reports that for MIT's IBM 7094-based CTSS:
the cost of storing in the disk file the equivalent of one page of single-spaced typing is approximately 11 cents per month.
It would have been hard to believe a projection that in 2016 it would be more than 7 orders of magnitude cheaper.
By Erik Pitti CC BY 2.0.
A 1966 data management plan would have been correct in predicting that 50 years later the dominant media would be "disk" and "tape", and that disk's lower latency would carry a higher cost per byte. But it's hard to believe that any more detailed predictions about the technology would be correct. The extraordinary 30-year history of 30-40% annual cost per byte decrease of disk media, their Kryder rate, had yet to start.
Although disk and tape are 60-year old technologies, a 50-year time horizon may seem too long to be useful. But a 10-year time horizon is definitely too short to be useful. Storage is not just a technology, but also a multi-billion dollar manufacturing industry dominated by a few huge businesses, with long, hard-to-predict lead times.
|Seagate 2008 roadmap
In 2016, the trade press is reporting that:
Seagate plans to begin shipping HAMR HDDs next year.
|ASTC 2016 roadmap
A recent TrendFocus report suggests that the industry is preparing to slip the new technologies even further:
The report suggests we could see 14TB PMR drives in 2017 and 18TB SMR drives as early as 2018, with 20TB SMR drives arriving by 2020.
Here, the medium term is loosely defined as the next couple of decades, or 2-3 times the uncertainty in industry projections.
What Is The Basic Problem of Long-Term Storage?
The fundamental problem is not storing bits safely for the long term, it is paying to store bits safely for the long term. With an unlimited budget an unlimited amount of data could be stored arbitrarily reliably indefinitely. But in the real world of limited budgets there is an inevitable tradeoff between storing more data, and storing the data more reliably.
Historically, this tradeoff has not been pressing, because the rate at which the cost per byte of storage dropped (the Kryder rate) was so large that if you could afford to keep some data for a few years, you could afford to keep it "forever". The incremental cost would be negligible. Alas, this is no longer true.
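The "keep it for a few years, keep it forever" effect can be sketched with a deliberately simple model, which is an assumption of this example rather than the economic model behind the figures: each year's storage bill is a fixed fraction (1 minus the Kryder rate) of the previous year's, so the bills form a geometric series. The function name and the rates tried are illustrative only.

```python
def cost_to_store_forever(first_year_cost, kryder_rate, years=None):
    """Sum the yearly storage bills when each year's bill is
    (1 - kryder_rate) times the previous year's.

    With years=None the infinite geometric series is summed in
    closed form: first_year_cost / kryder_rate.
    """
    if not 0 < kryder_rate < 1:
        raise ValueError("kryder_rate must be between 0 and 1")
    if years is None:
        return first_year_cost / kryder_rate
    ratio = 1 - kryder_rate
    return first_year_cost * (1 - ratio ** years) / kryder_rate


# Compare a historic-style rate with a much lower one (illustrative values).
for rate in (0.35, 0.10):
    forever = cost_to_store_forever(1.0, rate)
    first_four = cost_to_store_forever(1.0, rate, years=4)
    print(f"Kryder rate {rate:.0%}: forever costs {forever:.1f}x the first year; "
          f"the first 4 years are {first_four / forever:.0%} of that")
```

Under this model, at a 35% rate the first four years account for roughly four fifths of the total cost of keeping the data forever, whereas at 10% they account for only about a third, so the long tail of future costs can no longer be ignored.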
|Cost vs. Kryder rate
|2014 cost/byte projection
- The slowing started in 2010, before the floods hit Thailand.
- Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
- If the industry projections pan out, as shown by the red lines, by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
How Much Long-Term Storage Do We Need?
Lay people reading the press about storage (a typical example is Lauro Rizatti's recent article in EE Times entitled Digital Data Storage is Undergoing Mind-Boggling Growth) believe two things:
- per byte, storage media are getting cheaper very rapidly (Kryder's Law), and
- the demand for storage greatly exceeds the supply.
So we have two statements. The first is "per byte, storage media are getting cheaper very rapidly". We can argue about exactly how rapidly, but there are decades of factual data recording the drop in cost per byte of disk and other storage media. So it is reasonable to believe the first statement. Anyone who has been buying computers for a few years can testify to it.
- Newly manufactured media does not instantly get filled. There are delays in the distribution pipeline - for example I have nearly half a terabyte of unwritten DVD-R media sitting on a shelf. This is likely to be a fairly small percentage.
- Some media that gets filled turns out to be faulty and gets returned under warranty. This is likely to be a fairly small percentage.
- Some of the newly manufactured media replaces obsolete media, so isn't available to store newly created information.
- Because of overhead from file systems and so on, newly created information occupies more bytes of storage than its raw size. This is typically a small percentage.
- If newly created information does actually get written to a storage medium, several copies of it normally get written. This is likely to be a factor of about two.
- Some newly created information exists in vast numbers of copies. For example, my iPhone 6 claims to have 64GB of storage. That corresponds to the amount of newly manufactured storage medium (flash) it consumes. But about 8.5GB of that is consumed by a copy of iOS, the same information that consumes 8.5GB in every iPhone 6. Between October 2014 and October 2015 Apple sold 222M iPhones, so those 8.5GB of information are replicated 222M times, consuming about 1.9EB of the storage manufactured in that year.
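The arithmetic behind the iPhone example is easy to check. This is a minimal sketch using only the two figures quoted above and assuming decimal units (1GB = 10^9 bytes, 1EB = 10^18 bytes); the variable names are illustrative.

```python
GB = 10**9
EB = 10**18

ios_copy_bytes = 8.5 * GB      # size of the iOS copy, from the text
iphones_sold = 222 * 10**6     # units sold Oct 2014 to Oct 2015, from the text

# Every phone sold carries its own identical copy of the same information.
replicated_bytes = ios_copy_bytes * iphones_sold
replicated_eb = replicated_bytes / EB
print(f"{replicated_eb:.2f} EB of new flash holds identical copies of iOS")
```

The result rounds to the "about 1.9EB" figure, all of it counted as media manufactured but none of it representing newly created information.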
What do the blue bars represent? They are labeled "demand" but, as we have seen, the demand for storage depends on the price. There's no price specified for these bars. The caption of the graph says "Source: Recode", which I believe refers to a 2014 article by Rocky Pimentel entitled Stuffed: Why Data Storage Is Hot Again. (Really!). Based on the IDC/EMC Digital Universe report, Pimentel writes:
The total amount of digital data generated in 2013 will come to 3.5 zettabytes (a zettabyte is 1 with 21 zeros after it, and is equivalent to about the storage of one trillion USB keys). The 3.5 zettabytes generated this year will triple the amount of data created in 2010. By 2020, the world will generate 40 zettabytes of data annually, or more than 5,200 gigabytes of data for every person on the planet.
The operative words are "data generated". Not "data stored permanently", nor "bytes of storage consumed". The numbers projected by IDC for "data generated" have always greatly exceeded the numbers actually reported for storage media manufactured in a given year, which in turn as discussed above exaggerate the capacity added to the world's storage infrastructure.
The assumption behind "demand exceeds supply" is that every byte of "data generated" in the IDC report is a byte of demand for permanent storage capacity. Even in a world where storage was free there would still be much data generated that was never intended to be stored for any length of time, and would thus not represent demand for storage media.
For a long time, discussions of storage have been bedevilled by the confusion between IDC's projections for "data generated" and the actual demand for storage media. The actual demand will be much lower, and will depend on the price.
Does Long-Term Storage Need Long-Lived Media?
Every few months there is another press release announcing that some new, quasi-immortal medium such as 5D quartz or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but a study of the market for them revealed that no-one would pay the relatively small additional cost. The drives currently marketed for "archival" use have a shorter warranty and a shorter MTBF than enterprise drives, so they're not expected to have long service lives.
The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center racks or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
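The space argument is a compounding calculation. The sketch below assumes the Kryder rate compounds annually and that space per byte shrinks in step with cost per byte; both are simplifications made for illustration, and the function name is hypothetical.

```python
def relative_space_after(kryder_rate, years):
    """Fraction of the original space needed for the same data, assuming
    the rate compounds annually and space per byte tracks cost per byte."""
    return (1 - kryder_rate) ** years


for rate in (0.10, 0.30):
    fraction = relative_space_after(rate, 10)
    print(f"{rate:.0%}/yr for 10 years: {fraction:.3f} of the space, "
          f"or {1 / fraction:.1f}x the data in the same space")
```

At 10%/yr the same data needs about a third of the space after a decade. At 30%/yr strict compounding gives a factor of roughly 35, the same order as the figure of 30 used above.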
The reason why disks are engineered to have a 5-year service life is that, at 30-40% Kryder rates, they were going to be replaced within 5 years simply for economic reasons. But, if Kryder rates are going to be much lower going forward, the incentives to replace drives early will be much less, so a somewhat longer service life would make economic sense for the customer. From the disk vendor's point of view, a longer service life means they would sell fewer drives. Not a reason to make them.
Additional reasons for skepticism include:
- Our research into the economics of long-term preservation demonstrates the enormous barrier to adoption that accounting techniques pose for media that have high purchase but low running costs, such as these long-lived media.
- Since the big problem in digital preservation is not keeping bits safe for the long term, it is paying for keeping bits safe for the long term, an expensive solution to a sub-problem can actually make the overall problem worse, not better.
- These long-lived media are always off-line media. In most cases, the only way to justify keeping bits for the long haul is to provide access to them (see Blue Ribbon Task Force). The access latency scholars (and general Web users) will tolerate rules out off-line media for at least one copy.
- Thus at best these media can be off-line backups. But the long access latency for off-line backups has led the backup industry to switch to on-line backup with de-duplication and compression. So even in the backup space long-lived media will be a niche product.
- Off-line media need a reader. Good luck finding a reader for a niche medium a few decades after it faded from the market - one of the points Jeff Rothenberg got right two decades ago.
Does Long-Term Storage Need Ultra-Reliable Media?
The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. But current media are many orders of magnitude too unreliable for the task ahead, so you can't:
- Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
- Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly.
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Moral of the story: design for failure and buy the cheapest components you can.
Eric Brewer made the same point in his 2016 FAST keynote. For availability and resilience against disasters Google needs geographic diversity, so they have replicas from which to recover. Spending more to increase media reliability makes no sense, the media are already reliable enough. The systems that surround the drives have been engineered to deliver adequate reliability despite the current unreliability of the drives, thus engineering away the value of more reliable drives.
How Much Replication Do We Need?
Facebook's hot storage layer, Haystack, uses RAID-6 and replicates data across three data centers, using 3.6 times as much storage as the raw data. The next layer down, Facebook's f4, uses two fault-tolerance techniques:
- Within a data center it uses erasure coding with 10 data blocks and 4 parity blocks. Careful layout of the blocks ensures that the data is resilient to drive, host and rack failures at an effective replication factor of 1.4.
- Between data centers it uses XOR coding. Each block is paired with a different block in another data center, and the XOR of the two blocks stored in a third. If any one of the three data centers fails, both paired blocks can be restored from the other two.
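The XOR scheme between data centers can be sketched in a few lines. This is a toy illustration, not Facebook's implementation: the block contents and names are hypothetical, and the erasure coding within a data center is not modeled, only its 10+4 overhead.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical block contents; each block lives in a different data center.
block_a = b"block stored in data center 1"
block_b = b"block stored in data center 2"
xor_ab = xor_blocks(block_a, block_b)   # stored in a third data center

# Lose any one data center and rebuild its block from the other two.
assert xor_blocks(xor_ab, block_b) == block_a
assert xor_blocks(xor_ab, block_a) == block_b
assert xor_blocks(block_a, block_b) == xor_ab

# Storage overhead implied by the two techniques described above.
within_dc = (10 + 4) / 10    # 10 data blocks plus 4 parity blocks
across_dcs = 3 / 2           # two data blocks plus one XOR block
print(f"within a data center: {within_dc:.1f}x")
print(f"including the XOR block: {within_dc * across_dcs:.1f}x")
```

Multiplying the two overheads gives about 2.1 times the raw data, well below the 3.6 times quoted for Haystack, which is the point of moving warm data down a layer.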
Another point worth noting that the f4 paper makes relates to heterogeneity as a way of avoiding correlated failures:
We recently learned about the importance of heterogeneity in the underlying hardware for f4 when a crop of disks started failing at a higher rate than normal. In addition, one of our regions experienced higher than average temperatures that exacerbated the failure rate of the bad disks. This combination of bad disks and high temperatures resulted in an increase from the normal ~1% AFR to an AFR over 60% for a period of weeks. Fortunately, the high-failure-rate disks were constrained to a single cell and there was no data loss because the buddy and XOR blocks were in other cells with lower temperatures that were unaffected.
Current Technology Choices
|Fontana 2016 analysis
Tape
Historically, tape was the medium of choice for long-term storage. Its basic recording technology lags hard disk by many years, so it has a much more credible technology road-map than disk. The reason is that the bits on the tape are much larger. Current hard disks are roughly 1000Gbit/in², while tape is projected to reach roughly 50Gbit/in² in 6 years' time.
But tape's importance is fading rapidly. There are several reasons:
- Tape is a very small market in unit terms (See this comment):
Just under 20 million LTO cartridges were sent to customers last year. As a comparison let's note that WD and Seagate combined shipped more than 350 million disk drives in 2015; the tape cartridge market is less than 0.00567 per cent of the disk drive market in unit terms
- In effect there is now a single media supplier per technology, raising fears of price gouging and supply vulnerability. The disk market has consolidated too, but there are still two very viable suppliers plus another. Hard disk market share is:
split between the three remaining HDD companies with Western Digital’s market share at 42%, Seagate’s at 37% and Toshiba at 21%.
- The advent of data-mining and web-based access to archives make the long access latency of tape less tolerable.
- The robots that, at scale, access the tape cartridges have a limited number of slots. To maximize the value of each slot it is necessary to migrate data to new, higher-capacity cartridges as soon as they appear. This has two effects. First, it makes the long service life of tape media less important. Second, it consumes a substantial fraction of the available bandwidth.
Optical
Like tape, optical media (DVD and Blu-ray) are off-line media whose cost and performance are determined by the media/drive ratio in their robots. They have long media life and some other attractive properties that mitigate some threats: they are immune to electromagnetic pulse effects, and most are physically write-once.
Recently, Facebook and Panasonic have provided an impressive example of the appropriate and cost-effective use of optical media. The initial response to Facebook's announcement of their prototype Blu-ray cold storage system focused on the 50-year life of the disks, but it turns out that this isn't the interesting part of the story. Facebook's problem is that they have a huge flow of data that is accessed rarely but needs to be kept for the long-term at the lowest possible cost. They need to add bottom tiers to their storage hierarchy to do this.
The first tier they added to the bottom of the hierarchy stored the data on mostly powered-down hard drives. Some time ago a technology called MAID (Massive Array of Idle Drives) was introduced but didn't make it in the market. The idea was that by putting a large cache in front of the disk array, most of the drives could be spun-down to reduce the average power draw. MAID did reduce the average power draw, at the cost of some delay from cache misses, but in practice the proportion of drives that were spun-down wasn't as great as expected so the average power reduction wasn't as much as hoped. And the worst case was about the same as a RAID, because the cache could be thrashed in a way that caused almost all the drives to be powered up.
Facebook's design is different. It is aimed at limiting the worst-case power draw. It exploits the fact that this storage is at the bottom of the storage hierarchy and can tolerate significant access latency. Disks are assigned to groups in equal numbers. One group of disks is spun up at a time in rotation, so the worst-case access latency is the time needed to cycle through all the disk groups. But the worst-case power draw is only that for a single group of disks and enough compute to handle a single group.
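The trade-off in this design can be made concrete with a small model. Every number below is hypothetical, chosen only to show the shape of the trade-off; the class and its fields are not taken from Facebook's system, and the power drawn by spun-down disks and by the compute is ignored.

```python
from dataclasses import dataclass


@dataclass
class ColdStorageRack:
    # All values are hypothetical, chosen only to illustrate the trade-off.
    groups: int                   # number of equal-sized disk groups
    disks_per_group: int
    watts_per_active_disk: float
    minutes_per_turn: float       # how long each group stays spun up

    def worst_case_power_watts(self) -> float:
        """Only one group spins at a time, so only its disks draw power."""
        return self.disks_per_group * self.watts_per_active_disk

    def all_spinning_power_watts(self) -> float:
        """Power draw if every disk were spun up at once, for comparison."""
        return self.groups * self.worst_case_power_watts()

    def worst_case_latency_minutes(self) -> float:
        """A request may have to wait for a full cycle through the groups."""
        return self.groups * self.minutes_per_turn

    def active_group(self, minute: float) -> int:
        """Index of the group that is spun up at a given time."""
        return int(minute // self.minutes_per_turn) % self.groups


rack = ColdStorageRack(groups=15, disks_per_group=32,
                       watts_per_active_disk=8.0, minutes_per_turn=30.0)
print(f"worst-case power:   {rack.worst_case_power_watts():.0f} W "
      f"(vs {rack.all_spinning_power_watts():.0f} W with every disk spinning)")
print(f"worst-case latency: {rack.worst_case_latency_minutes():.0f} minutes")
```

Adding groups lowers the worst-case power as a fraction of the total but lengthens the worst-case wait in proportion, which is acceptable only because this tier sits at the bottom of the hierarchy.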
Why is this important? Because of the synergistic effects knowing the maximum power draw enables. The power supplies can be much smaller, and because the access time is not critical, need not be duplicated. Because Facebook builds entire data centers for cold storage, the data center needs much less power and cooling, and doesn't need backup generators. It can be more like cheap warehouse space than expensive data center space. Aggregating these synergistic cost savings at data center scale leads to really significant savings.
Nevertheless, this design has high performance where it matters to Facebook, in write bandwidth. While a group of disks is spun up, any reads queued up for that group are performed. But almost all the I/O operations to this design are writes. Writes are erasure-coded, and the shards all written to different disks in the same group. In this way, while a group is spun up, all disks in the group are writing simultaneously providing huge write bandwidth. When the group is spun down, the disks in the next group take over, and the high write bandwidth is only briefly interrupted.
Next, below this layer of disk cold storage Facebook implemented the Blu-ray cold storage that drew such attention. It has 12 Blu-ray drives for an entire rack of cartridges holding 10,000 100GB Blu-ray disks managed by a robot. When the robot loads a group of 12 fresh Blu-ray disks into the drives, the appropriate amount of data to fill them is read from the currently active hard disk group and written to them. This scheduling of the writes allows for effective use of the limited write capacity of the Blu-ray drives. If the data are ever read, a specific group has to be loaded into the drives, interrupting the flow of writes, but this is a rare occurrence. Once all 10,000 disks in a rack have been written, the disks will be loaded for reads infrequently. Most of the time the entire Exa
It is this careful, organized scheduling of the system's activities at data center scale that enables the synergistic cost reductions of cheap power and space. It may be true that the Blu-ray disks have a 50-year lifetime but this isn't what matters. No-one expects the racks to sit in the data center for 50 years, at some point before then they will be obsoleted by some unknown new, much denser and more power-efficient cold storage medium (perhaps DNA).
The shrinking size of the magnetic domains that store data on the platters of hard disks, and the fairly high temperatures found inside operating drives, mean that the technology is approaching the superparamagnetic limit at which the bits become unstable. HAMR is the response, using materials which are more resistant to thermal instability but which therefore require heating before they can be written. The heat is supplied by a laser focused just ahead of the magnetic write head. As we saw above, the difficulty and cost of this technology transition has been consistently under-estimated. The successor to HAMR, BPM, is likely to face even worse difficulties. Disk areal density will continue to improve, but much more slowly than in its pre-2010 heyday. Vendors are attempting to make up for the slow progress in areal density in two ways:
- Shingling, which means moving the tracks so close together that writing a track partially overwrites the adjacent track. Very sophisticated signal processing allows the partially overwritten data to be read. Shingled drives come in two forms. WD's drives expose the shingling to the host, requiring the host software to be changed to treat them like append-only media. Seagate's drives are device-managed, with on-board software obscuring the effect of shingling, at the cost of greater variance in performance.
- Helium, which replaces air inside the drive, allowing the heads to fly lower and thus more platters to fit in the same form factor. WD's recently announced 12TB drives have 8 platters. Adding platters adds cost, so does little to increase the Kryder rate.
|WD unit shipments
|Seagate unit shipments
Hard disk drives shipments have had several quarters of declining shipments since the most recent high in 2014 and the peak of 653.6 million units in 2010 (before the 2011 Thailand floods). Last quarter and likely this quarter will see significant HDD shipment increases, partly making up for declining shipments in the first half of 2016.
The spike in demand is for high-end capacity disks and is causing supply chain difficulties for component manufacturers:
Trendfocus thinks [manufacturers] therefore won't invest heavily to meet spurts of demand. Instead, the firm thinks, suppliers will do their best to juggle disk-makers' demands.
Reducing unit volumes reduces the economies of scale underlying the low cost of disk, slowing disk's Kryder rate, and making disk less competitive with flash. Reduced margins from this pricing pressure reduce the cash available for investment in improving the technology, further reducing the Kryder rate. This looks like the start of a slow death spiral for disk.
“This may result in more tight supply situations like this in the future, but ultimately, it is far better off to deal with tight supply conditions than to deal with over-supply and idle capacity issues” says analyst John Kim
Flash
Flash as a data storage technology is almost 30 years old. Eli Harari filed the key enabling patent in 1988, describing multi-level cells, wear-leveling and the Flash Translation Layer. Flash has yet to make a significant impact on the lower levels of the storage hierarchy. If flash is to displace disk from these lower levels, massive increases in flash shipments will be needed. There are a number of ways flash manufacturers could increase capacity.
They could build more flash fabs, but this is extremely expensive. If there aren't going to be a lot of new flash fabs, what else could the manufacturers do to increase shipments from the fabs they have?
The traditional way of delivering more chip product from the same fab has been to shrink the chip technology. Unfortunately, shrinking the technology from which flash is made has bad effects. The smaller the cells, the less reliable the storage and the fewer times it can be written, as shown by the vertical axis in this table:
|Write endurance vs. cell size
Flash has another way to increase capacity. It can store more bits in each cell, as shown in the horizontal axis of the table. The behavior of flash cells is analog, the bits are the result of signal-processing in the flash controller. By improving the analog behavior by tweaking the chip-making process, and improving the signal processing in the flash controller, it has been possible to move from 1 (SLC) to 2 (MLC) to 3 (TLC) bits per cell. Because 3D has allowed increased cell size (moving up the table), TLC SSDs are now suitable for enterprise workloads.
Back in 2009, thanks to their acquisition of M-Systems, SanDisk briefly shipped some 4 (QLC) bits per cell memory (hat tip to Brian Berg). But up to now the practical limit has been 3. As the table shows, storing more bits per cell also reduces the write endurance (and the reliability).
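Why each extra bit per cell is harder than the last follows from simple counting: storing n bits in a cell requires the controller to distinguish 2^n charge levels. The sketch below tabulates this for the cell types named above. It models only the counting, not endurance or reliability, and the names are illustrative.

```python
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def charge_levels(bits_per_cell: int) -> int:
    """Distinct charge levels the controller must tell apart."""
    return 2 ** bits_per_cell


previous_bits = None
for name, bits in CELL_TYPES.items():
    line = f"{name}: {bits} bit(s) per cell, {charge_levels(bits)} levels"
    if previous_bits is not None:
        # Capacity gain over the previous cell type for the same number of cells.
        line += f", {bits / previous_bits - 1:.0%} more capacity than the step before"
    print(line)
    previous_bits = bits
```

Each step doubles the number of levels to be distinguished while adding a progressively smaller share of capacity, which is consistent with the practical limit having stayed at 3 bits per cell for so long.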
|LAM research NAND roadmap
Beyond that for the near term manufacturers expect to use die-stacking. That involves taking two (or potentially more) complete 64-layer chips and bonding one on top of the other, connecting them by Through Silicon Vias (TSVs). TSVs are holes through the chip substrate containing wires.
Although adding 3D layers does add processing steps, and thus some cost, it merely lengthens the processing pipeline. It doesn't slow the rate at which wafers can pass through and, because each wafer contains more storage, it increases the fab's output of storage. Die-stacking, on the other hand, doesn't increase the amount of storage per wafer, only per package. It doesn't increase the fab's output of bytes.
It is only recently that sufficient data has become available to study the reliability of flash at scale in data centers. The behavior observed differs in several important ways from that of hard disks in similar environments with similar workloads. But since the workloads typical of current flash usage (near the top of the hierarchy) are quite unlike those of long-term bulk storage, the relevance of these studies is questionable.
QLC will not have enough write endurance for conventional SSD applications. So will there be enough demand for manufacturers to produce it, and thus increase their output by a third relative to TLC (4 bits per cell instead of 3)?
Cloud systems such as Facebook's use tiered storage architectures in which re-write rates decrease rapidly down the layers. Because most re-writes would be absorbed by higher layers, it is likely that QLC-based SSDs would work well at the bulk storage level despite only a 500 write cycle life. To do so they would likely need quite different software in the flash controllers.
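A rough calculation shows why a 500 write cycle life could be adequate at the bulk layer. The sketch assumes perfect wear-leveling and no write amplification, both idealizations, and the rewrite rates for the two tiers are hypothetical values picked for contrast, not measurements.

```python
def years_until_worn_out(write_cycle_life, full_rewrites_per_year):
    """Years before the cells reach their rated write cycles, assuming
    perfect wear-leveling and no write amplification (both idealizations)."""
    if full_rewrites_per_year <= 0:
        raise ValueError("rewrite rate must be positive")
    return write_cycle_life / full_rewrites_per_year


QLC_CYCLES = 500  # write cycle life quoted in the text

# Hypothetical rewrite rates, in full-device rewrites per year.
for label, rewrites in (("hot tier", 365.0), ("bulk tier", 2.0)):
    years = years_until_worn_out(QLC_CYCLES, rewrites)
    print(f"{label}: {rewrites:.0f} rewrites/yr -> worn out after {years:.1f} years")
```

With one full rewrite a day the cells would be exhausted in under a year and a half, while at a couple of rewrites a year write endurance stops being the limiting factor.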
Flash vs. Disk
Probably, at some point in the future flash will displace hard disk as the medium for long-term storage. There are two contrasting views as to how long this will take.
|Fontana EB shipped
Although flash is displacing disk from markets such as PCs, laptops and servers, Eric Brewer's fascinating keynote at the 2016 FAST conference started from the assertion that in the medium term the only feasible technology for bulk data storage in the cloud was spinning disk.
|NAND vs. HDD capex/TB
But over the next 4 years Fontana projects NAND flash shipments will grow to 400EB/yr versus hard disk shipments perhaps 800EB/yr. So there will be continued gradual erosion of hard disk market share.
Second, the view from the flash advocates. They argue that the fabs will be built, because they are no longer subject to conventional economics. The governments of China, Japan, and other countries are stimulating their economies by encouraging investment, and they regard dominating the market for essential chips as a strategic goal, something that justifies investment. They are thinking long-term, not looking at the next quarter's results. The flash companies can borrow at very low interest rates, so even if they do need to show a return, they only need to show a very low return.
If the fabs are built, and if QLC becomes usable for bulk storage, the increase in supply will increase the Kryder rate of flash. This will increase the trend of storage moving from disk to flash. In turn, this will increase the rate at which disk vendors' unit shipments decrease. In turn, this will decrease their economies of scale, and cause disk's Kryder rate to go negative. The point at which flash becomes competitive with disk moves closer in time.
The result would be that the Kryder rate for bulk storage, which has been very low, would get back closer to the historic rate sooner, and thus that storing bulk data for the long term would be significantly cheaper. But this isn't the only effect. When Data Domain's disk-based backup displaced tape, greatly reducing the access latency for backup data, the way backup data was used changed. Instead of backups being used mostly to cover media failures, they became used mostly to cover user errors.
Similarly, if flash were to displace disk, the access latency for stored data would be significantly reduced, and the way the data is used would change. Because it is more accessible, people would find more ways to extract value from it. The changes induced by reduced latency would probably significantly increase the perceived value of the stored data, which would itself accelerate the turn-over from disk to flash.
If we're not to fry the planet, oil companies cannot develop many of the reserves they carry on their books; they are "stranded assets". Both views of the future of disk vs. flash involve a reduction in the unit volume of drives. The disk vendors cannot raise prices significantly, doing so would accelerate the reduction in unit volume. Thus their income will decrease, and thus their ability to finance the investments needed to get HAMR and then BPM into the market. The longer they delay these investments, the more difficult it becomes to afford them. Thus it is possible that HAMR and likely that BPM will be "stranded technologies", advances we know how to build, but never actually deploy in volume.
Future Technology Choices
Simply for business reasons (the time and the cost of developing the necessary huge manufacturing capacity), it is likely that disk and flash will dominate the bulk storage market for the medium term. Tape and optical will probably fill the relatively small niche for off-line media. Nevertheless, it is worth examining the candidate technologies being touted as potentially disrupting this market.
Storage Class Memories
As we saw above, and just like hard disk's magnetic domains, flash cells are approaching the physical limits at which they can no longer retain data. 3D and QLC have delayed the point at which flash density slows, but they are one-time boosts to the technology. Grouped under the name Storage Class Memories (SCMs), several technologies that use similar manufacturing techniques to those that build DRAM (Dynamic Random Access Memory) are competing to be the successor to flash. Like flash but unlike DRAM, they are non-volatile, so are suitable as a storage medium. Like DRAM but unlike flash, they are random- rather than block-access and do not need to be erased before writing, eliminating a cause of the tail latency that can be a problem with flash (and hard disk) systems.
Typically, these memories are slower than DRAM so are not a direct replacement but are intended to form a layer between (faster) DRAM and (cheaper) flash. But included in this group is a more radical technology that claims to be able to replace DRAM while being non-volatile. Nantero uses carbon nanotubes to implement memories that they claim:
will have several thousand times faster rewrites and many thousands of times more rewrite cycles than embedded flash memory.
Fujitsu is among the companies to have announced plans for this technology.
NRAM, non-volatile RAM, is based on carbon nanotube (CNT) technology and has DRAM-class read and write speeds. Nantero says it has ultra-high density – but no numbers have been given out – and it's scalable down to 5nm, way beyond NAND.
This is the first practicable universal memory, combining DRAM speed and NAND non-volatility, better-than-NAND endurance and lithography-shrink prospects.
Small volumes of some forms of SCM have been shipping for a couple of years. Intel and Micron attracted a lot of attention with their announcement of 3D XPoint, an SCM technology. Potentially, SCM is a better technology than flash, 1000 times faster than NAND, 1000 times the endurance, and 100 times denser.
|SSD vs NVDIMM
account for about 15 microseconds of delay. If you were to use a magical memory that had zero delays, then its bar would never get any smaller than 15 microseconds.
Thus for SCMs' performance to justify the large cost increase, they need to eliminate these overheads:
In comparison, the upper bar, the one representing the NAND-based SSD, has combined latencies of about 90 microseconds, or six times as long.
Since so much of 3D XPoint’s speed advantage is lost to these delays The SSD Guy expects for the PCIe interface to contribute very little to long-term 3D XPoint Memory revenues. Intel plans to offer another implementation, shipping 3D XPoint Memory on DDR4 DIMMs. A DIMM version will have much smaller delays of this sort, allowing 3D XPoint memory to provide a much more significant speed advantage.
DIMMs are the form factor of DRAM; the system would see DIMM SCMs as memory not as an I/O device. They would be slower than DRAM but non-volatile. Exploiting these attributes requires a new, non-volatile memory layer in the storage hierarchy. Changing the storage software base to implement this will take time, which is why it makes sense to market SCMs initially as SSDs despite these products providing much less than the potential performance of the medium.
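The effect of these fixed overheads can be sketched with a little arithmetic. The snippet below is only an illustration, not a measurement: it uses the 15 and 90 microsecond figures quoted above, and it assumes that the 90 microseconds for the NAND SSD includes the same 15 microseconds of overhead. The function names are mine.

```python
# Figures quoted above. Treating the 90 us total as including the
# 15 us of overhead is an assumption made for this sketch.
OVERHEAD_US = 15.0        # delays that a faster medium cannot remove
NAND_SSD_TOTAL_US = 90.0  # combined latency of the NAND-based SSD


def total_latency_us(media_speedup: float) -> float:
    """Total latency when only the medium becomes media_speedup times faster."""
    nand_media_us = NAND_SSD_TOTAL_US - OVERHEAD_US
    return OVERHEAD_US + nand_media_us / media_speedup


def system_speedup(media_speedup: float) -> float:
    """Speedup seen by the system, relative to the NAND SSD."""
    return NAND_SSD_TOTAL_US / total_latency_us(media_speedup)


for s in (1, 10, 1000):
    print(f"medium {s:>4}x faster -> system {system_speedup(s):.2f}x faster")
```

Under these assumptions, even a medium 1000 times faster than NAND makes the PCIe SSD less than 6 times faster, which is why the DIMM form factor matters.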
Like flash, SCMs leverage much of the semiconductor manufacturing technology. Optimistically, one might expect SCM to impact the capacity market sometime in the late 2030s. SCMs have occupied the niche for a technology that exploits semiconductor manufacturing. A technology that didn't would find it hard to build the manufacturing infrastructure to ship the thousands of exabytes a year the capacity market will need by then.
Exotic Optical Media
Those who believe that quasi-immortal media are the solution to the long-term storage problem have two demonstrable candidates, each aping a current form factor. One is various forms of robust DVD, such as the University of Southampton's 5D quartz DVDs, or Hitachi's fused silica glass, claimed to be good for 300 million years. The other is using lasers to write data on to the surface of steel tape.
Both are necessarily off-line media, with the limited applicability that implies. Neither is shipping in volume, nor has any prospect of doing so. Thus neither will significantly impact the bulk storage market in the medium term.
The latest proposal for a long-term optical storage medium is diamond, lauded by Abigail Beall in the Daily Mail as Not just a girl's best friend: Defective DIAMONDS could solve our data crisis by storing 100 times more than a DVD:
'Without better solutions, we face financial and technological catastrophes as our current storage media reach their limits,' co-first author Dr Siddharth Dhomkar wrote in an article for The Conversation. 'How can we store large amounts of data in a way that's secure for a long time and can be reused or recycled?'.
In reality, the researchers reported that:
Speaking to the future real-world practicality of their innovation, Mr. Jacob Henshaw, co-first author said: 'This proof of principle work shows that our technique is competitive with existing data storage technology in some respects, and even surpasses modern technology in terms of re-writability.'
"images were imprinted via a red laser scan with a variable exposure time per pixel (from 0 to 50 ms). Note the gray scale in the resulting images corresponding to multivalued (as opposed to binary) encoding. ... Information can be stored and accessed in three dimensions, as demonstrated for the case of a three-level stack. Observations over a period of a week show no noticeable change in these patterns for a sample kept in the dark. ... readout is carried out via a red laser scan (200 mW at 1 ms per pixel). The image size is 150 × 150 pixels in all cases."
So, at presumably considerable cost, the researchers wrote maybe 100K bits at a few milliseconds per bit, read them back a week later at maybe a few hundred microseconds per bit without measuring an error rate. Unless the "financial and technological catastrophes" are more than two decades away, diamond is not a solution to them.
DNA
Nature recently featured a news article by Andy Extance entitled How DNA could store all the world's data, which claimed:
If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.
The article is based on research at Microsoft that involved storing 151KB in DNA. The research is technically interesting, starting to look at fundamental DNA storage system design issues. But it concludes (my emphasis):
DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. While this is not practical yet due to the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry.
both technologies are improving at an exponential rate
in a somewhat less optimistic light. It may be true that DNA sequencing is getting cheaper very rapidly. But already the cost of sequencing (read) was insignificant in the total cost of DNA storage. What matters is the synthesis (write) cost. Extance writes:
A closely related factor is the cost of synthesizing DNA. It accounted for 98% of the expense of the $12,660 EBI experiment. Sequencing accounted for only 2%, thanks to a two-millionfold cost reduction since the completion of the Human Genome Project in 2003.
The rapid decrease in the read cost is irrelevant to the economics of DNA storage; if it were free it would make no difference. Carlson's graph shows that the write cost, the short DNA synthesis cost (red line), is falling more slowly than the gene synthesis cost (yellow line). He notes:
But the price of genes is now falling by 15% every 3-4 years (or only about 5% annually).
A little reference checking reveals that the Microsoft paper's claim that:
both technologies are improving at an exponential rate
while strictly true is deeply misleading. The relevant technology is currently getting cheaper slower than hard disk or flash memory! And since this has been true for around two decades, making the necessary 3-4 fold improvement just to keep up with the competition is going to be hard.
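Carlson's figures are easy to annualize. The sketch below assumes a constant compound rate of decline, which is my simplification, and the function names are illustrative.

```python
import math


def annual_decline(total_decline: float, years: float) -> float:
    """Constant yearly decline that compounds to total_decline over years."""
    return 1.0 - (1.0 - total_decline) ** (1.0 / years)


def years_to_halve(annual: float) -> float:
    """Years for a price to halve at a constant annual decline."""
    return math.log(0.5) / math.log(1.0 - annual)


for span in (3, 4):
    rate = annual_decline(0.15, span)
    print(f"15% over {span} years is about {rate:.1%} per year")

print(f"at 5% per year the price halves in {years_to_halve(0.05):.1f} years")
```

With these assumptions a 15% fall over 3 to 4 years works out to roughly 4% to 5% a year, and at 5% a year the price takes more than 13 years to halve.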
Decades from now, DNA will probably be an important archival medium. But the level of hype around the cost of DNA storage is excessive. Extance's article admits that cost is a big problem, yet it finishes by quoting Goldman, lead author of a 2013 paper in Nature whose cost projections were massively over-optimistic. Goldman's quote is possibly true but again deeply misleading:
"Our estimate is that we need 100,000-fold improvements to make the technology sing, and we think that's very credible," he says. "While past performance is no guarantee, there are new reading technologies coming onstream every year or two. Six orders of magnitude is no big deal in genomics. You just wait a bit."
Yet again the DNA enthusiasts are waving the irrelevant absolute cost decrease in reading to divert attention from the relevant lack of relative cost decrease in writing. They need an improvement in relative write cost of at least 6 orders of magnitude. To do that in a decade means halving the relative cost roughly every six months, not increasing the relative cost by 10-15% every year.
Journalists like Beall writing for mass-circulation newspapers can perhaps be excused for merely amplifying the researchers' hype, but Extance, writing for Nature, should be more critical.
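A rough check of what an improvement of six orders of magnitude in a decade would require. This is simple arithmetic assuming a steady rate of improvement; the function names are illustrative.

```python
import math


def halvings_needed(improvement: float) -> float:
    """Number of successive halvings that add up to the given improvement."""
    return math.log2(improvement)


def months_per_halving(improvement: float, years: float) -> float:
    """Time allowed for each halving if the improvement must fit in years."""
    return 12.0 * years / halvings_needed(improvement)


def annual_factor(improvement: float, years: float) -> float:
    """Steady yearly improvement factor that reaches the target in years."""
    return improvement ** (1.0 / years)


target = 1e6  # six orders of magnitude
print(f"halvings needed: {halvings_needed(target):.1f}")
print(f"months per halving over a decade: {months_per_halving(target, 10):.1f}")
print(f"improvement factor per year: {annual_factor(target, 10):.2f}")
```

Six orders of magnitude is about 20 halvings, so fitting them into a decade means a halving about every six months, or a roughly fourfold improvement in relative cost every year.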
Storage Media or Storage Systems?
As we have seen, the economics of long-term storage mean that the media to be used will have neither the service life nor the reliability needed. These attributes must therefore be provided by a storage system whose architecture anticipates that media and other hardware components will be replaced frequently, and that data will be replicated across multiple media.
The system architecture surrounding the media is all the more important given two findings from research into production use of large disk populations:
- The media themselves cause only about half the detected errors, with the other half coming from non-media components such as buses, controllers, power supplies, etc.
- The media are significantly less reliable than their specified error rates.
- An object storage fabric.
- With low power usage and rapid response to queries.
- That maintains high availability and durability by detecting and responding to media failures without human intervention.
- And whose reliability is externally auditable.
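To make the last two requirements concrete, here is a minimal sketch of detecting and repairing damaged replicas without human intervention. It is a toy, in-memory model rather than a description of any real system: the Node and Fabric classes, their method names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest used to check a replica (an illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


class Node:
    """Hypothetical storage node holding object replicas in memory."""

    def __init__(self, name: str):
        self.name = name
        self.objects = {}  # key -> bytes


class Fabric:
    """Hypothetical object store that keeps one replica on every node."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.expected = {}  # key -> digest recorded at ingest

    def put(self, key: str, data: bytes) -> None:
        self.expected[key] = digest(data)
        for node in self.nodes:
            node.objects[key] = data

    def audit(self):
        """Return (node name, key) for every missing or corrupt replica."""
        bad = []
        for key, want in self.expected.items():
            for node in self.nodes:
                data = node.objects.get(key)
                if data is None or digest(data) != want:
                    bad.append((node.name, key))
        return bad

    def repair(self) -> int:
        """Overwrite each bad replica from any replica that still verifies."""
        by_name = {node.name: node for node in self.nodes}
        repaired = 0
        for name, key in self.audit():
            want = self.expected[key]
            good = next(
                (n.objects[key] for n in self.nodes
                 if key in n.objects and digest(n.objects[key]) == want),
                None,
            )
            if good is not None:
                by_name[name].objects[key] = good
                repaired += 1
        return repaired


fabric = Fabric([Node("a"), Node("b"), Node("c")])
fabric.put("post-1", b"old blog post")
fabric.nodes[1].objects["post-1"] = b"old blog pest"  # simulated silent corruption
print(fabric.audit())   # the damaged replica is reported
print(fabric.repair())  # number of replicas rewritten from a good copy
print(fabric.audit())   # nothing left to report
```

In this sketch the fabric holds the expected digests itself. For the audit to be external, an independent auditor would need to hold them instead.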
The following year Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We showed that it was indeed plausible that, even at then current flash prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk while providing high performance and enabling self-healing.
Two subsequent developments suggest we were on the right track. First, Seagate's announcement of its Kinetic architecture and Western Digital's subsequent announcement of drives that ran Linux. Drives have on-board computers that perform command processing, internal maintenance operations, and signal processing. These computers have spare capacity, which both approaches use to delegate computation from servers to the storage media, and to get IP communication all the way to the media, as DAWN suggested. IP to the medium is a great way to future-proof the drive interface.
- Compute: 8-core Xeon system-on-a-chip, and Elastic Fabric Connector for external, off-blade, 40GbitE networking,
- Storage: NAND storage with 8TB or 52TB of raw capacity and on-board NV-RAM with a super-capacitor-backed write buffer plus a pair of ARM CPU cores and an FPGA,
- On-blade networking: PCIe card to link compute and storage cards via a proprietary protocol.
DAWN exploits two separate sets of synergies:
- Like FlashBlade, DAWN moves the computation to where the data is, rather than moving the data to where the computation is, reducing both latency and power consumption. The further data moves on wires from the storage medium, the more power and time it takes. This is why Berkeley's Aspire project's architecture is based on optical interconnect technology, which when it becomes mainstream will be both faster and lower-power than wires. In the meantime, we have to use wires.
- Unlike FlashBlade, DAWN divides the object storage fabric into a much larger number of much smaller nodes, implemented using the very low-power ARM chips used in cellphones. Because the power a CPU needs tends to grow faster than linearly with performance, the additional parallelism provides comparable performance at lower power.
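The power argument in the second point can be illustrated with a simple model. Assume, purely for illustration, that a node's power grows as its performance raised to an exponent alpha greater than 1, that the workload divides perfectly across nodes, and that per-node overheads are negligible. None of these assumptions comes from measurements, and the value alpha = 2 below is hypothetical.

```python
def node_power(performance: float, alpha: float) -> float:
    """Power of one node under the assumed law: power = performance ** alpha."""
    return performance ** alpha


def fabric_power(total_performance: float, nodes: int, alpha: float) -> float:
    """Total power when the same work is split evenly across nodes."""
    return nodes * node_power(total_performance / nodes, alpha)


ALPHA = 2.0        # hypothetical exponent; any value above 1 gives the same trend
TOTAL_PERF = 16.0  # arbitrary units of aggregate performance

for n in (1, 4, 16):
    watts = fabric_power(TOTAL_PERF, n, ALPHA)
    print(f"{n:>2} nodes -> relative power {watts:.0f}")
```

In this model the total power falls as the node count raised to the power (1 - alpha), so sixteen wimpy nodes deliver the same aggregate performance as one big node for one sixteenth of the power. Real systems would fall short of this because of per-node overheads and imperfect parallelism.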
Both OpenIO and Igneous have launched plug-on ARM server cards for storage drives: these single-board computers each snap onto a hard drive to form nano-servers that are organized into a grid of object storage nodes. ... Igneous ... launched its service in October. There were two parts to Igneous’ concept: a subscription service to a managed on-premises storage array presented like a public cloud S3 API-accessed storage service; and the actual technology, with two 1U stateless x86 dataRouters (aka controllers) fronting a 4U dataBox containing 60 nano-servers ... Each strap-on board has a 32-bit ARMv7 1GHz dual-ARM-Cortex-A9-core Marvell Armada 370 system-on-chip that includes two 1Gbps Ethernet ports and can talk SATA to the direct-attached 3.5-inch disk drive. That processor gives the card enough compute power to run Linux right up against the storage.
The SLS-4U86 is a 4U box holding up to 96 vertically mounted 3.5-inch disk drives, providing up to 960TB of raw storage with 10TB drives and 1,152 TB with 12TB drives. The disk drives are actually nano-servers, nano-nodes in OpenIO terms, as they each have on-drive data processing capability.
A nano-node contains:
- ARM CPU - Marvell Armada-3700 dual-core Cortex-A53 ARMv8 @ 1.2GHz
- Hot-swappable 10TB or 12TB 3.5-inch SATA nearline disk drive
- Dual 2.5Gb/s SGMII (Serial Gigabit Media Independent Interface) ports
Predictions
Although predictions are always risky, it seems appropriate to conclude with some. The first is by far the most important:
- Increasing technical difficulty and decreasing industry competition will continue to keep the rate at which the per-byte cost of bulk storage media decreases well below pre-2010 levels. Over a decade or two this will cause a very large increase in the total cost of ownership of long-term data.
- No new media will significantly impact the bulk storage layer of the hierarchy in the medium term. It will be fought out between tape, disk, flash and conventional optical. Media that would impact the layer in this time-frame would already be shipping in volume, and none are.
- Their long latencies will confine tape and optical to really cold data, and thus to a small niche of the market.
- Towards the end of the medium term, storage class memories will start to push flash down the hierarchy into the bulk storage layer.
- Storage system architectures will migrate functionality from the servers towards the storage media.
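The first prediction can be illustrated with a deliberately crude model of media cost. Assume the media are replaced at fixed intervals, that each replacement costs less than the last because the per-byte price falls at a constant annual rate, and that everything else (power, staff, interest rates) is ignored. The 40% and 10% rates, the 5-year media life and the 50-year horizon are made-up illustrative values, not figures from this article.

```python
def lifetime_media_cost(annual_decline: float,
                        horizon_years: int = 50,
                        media_life_years: int = 5,
                        initial_cost: float = 1.0) -> float:
    """Sum of media purchases over the horizon, in units of the first purchase."""
    total = 0.0
    for year in range(0, horizon_years, media_life_years):
        # Price of a replacement bought in the given year.
        total += initial_cost * (1.0 - annual_decline) ** year
    return total


fast = lifetime_media_cost(0.40)  # hypothetical rapid price decline
slow = lifetime_media_cost(0.10)  # hypothetical slow price decline
print(f"40% per year: {fast:.2f} times the initial purchase")
print(f"10% per year: {slow:.2f} times the initial purchase")
print(f"ratio: {slow / fast:.2f}")
```

With these made-up numbers the media bill over 50 years is more than twice as large at the slower rate of decline, which is the kind of effect the first prediction describes.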