Thursday, May 27, 2021

Storage Update

It has been too long since I wrote about storage technologies, so below the fold I comment on a keynote and three papers of particular interest from Usenix's File and Storage Technologies conference last February, and a selection of other news.

DNA Storage

I have been blogging about DNA data storage since 2012, and last blogged about the groundbreaking work of the U. Washington/Microsoft Molecular Information Systems Lab in January 2020. For FAST's closing keynote, entitled DNA Data Storage and Near-Molecule Processing for the Yottabyte Era, Karin Strauss gave a comprehensive overview of the technology for writing and reading data in DNA, and its evolution since the 1980s. Luis Ceze described how computations can be performed on data in DNA, ending with this vision of how both quantum and molecular computing can be integrated at the system level. Their abstract reads:
DNA data storage is an attractive option for digital data storage because of its extreme density, durability, eternal relevance and environmental sustainability. This is especially attractive when contrasted with the exponential growth in world-wide digital data production. In this talk we will present our efforts in building an end-to-end system, from the computational component of encoding and decoding to the molecular biology component of random access, sequencing and fluidics automation. We will also discuss some early efforts in building a hybrid electronic/molecular computer system that can offer more than data storage, for example, image similarity search.
The video of their talk is here, and it is well worth watching.

Facebook's File System

I wrote back in 2014 about Facebook's layered storage architecture, with Haystack as the hot layer, f4 as the warm layer, and the optical media cold layer. Now, Satadru Pan et al describe how Facebook realized many advantages by combining both hot and warm layers in a single infrastructure, Facebook's Tectonic Filesystem: Efficiency from Exascale. Their abstract reads:
Tectonic is Facebook’s exabyte-scale distributed filesystem. Tectonic consolidates large tenants that previously used service-specific systems into general multitenant filesystem instances that achieve performance comparable to the specialized systems. The exabyte-scale consolidated instances enable better resource utilization, simpler services, and less operational complexity than our previous approach. This paper describes Tectonic’s design, explaining how it achieves scalability, supports multitenancy, and allows tenants to specialize operations to optimize for diverse workloads. The paper also presents insights from designing, deploying, and operating Tectonic.
They explain how these advantages are generated:
Tectonic simplifies operations because it is a single system to develop, optimize, and manage for diverse storage needs. It is resource-efficient because it allows resource sharing among all cluster tenants. For instance, Haystack was the storage system specialized for new blobs; it bottlenecked on hard disk IO per second (IOPS) but had spare disk capacity. f4, which stored older blobs, bottlenecked on disk capacity but had spare IO capacity. Tectonic requires fewer disks to support the same workloads through consolidation and resource sharing.
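The resource-sharing argument can be made concrete with a little arithmetic. The sketch below is mine, not from the paper: the per-disk figures and the two workloads are hypothetical, and it ignores replication, erasure-coding overhead and imperfect load spreading. It only shows why pooling an IOPS-bound tenant with a capacity-bound one needs fewer disks than provisioning each separately.

```python
import math

def disks_needed(capacity_tb, iops, disk_capacity_tb=10.0, disk_iops=100.0):
    """Disks required to satisfy both the capacity and the IOPS demand.

    disk_capacity_tb and disk_iops are hypothetical per-disk figures.
    """
    by_capacity = math.ceil(capacity_tb / disk_capacity_tb)
    by_iops = math.ceil(iops / disk_iops)
    return max(by_capacity, by_iops)

# Hypothetical workloads, not numbers from the Tectonic paper.
hot = {"capacity_tb": 2_000, "iops": 100_000}    # Haystack-like: IOPS-bound
warm = {"capacity_tb": 10_000, "iops": 20_000}   # f4-like: capacity-bound

# Separate clusters: each is sized by its own bottleneck.
separate = disks_needed(**hot) + disks_needed(**warm)

# One shared cluster: the demands add, and the spare IOPS of one tenant
# offsets the spare capacity of the other.
shared = disks_needed(
    hot["capacity_tb"] + warm["capacity_tb"],
    hot["iops"] + warm["iops"],
)

print(f"separate clusters: {separate} disks, shared cluster: {shared} disks")
```

With these made-up numbers the separate clusters need 2,000 disks and the shared one 1,200; the real savings depend entirely on how complementary the tenants' bottlenecks are.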
The paper is well worth reading; the details of the implementation are fascinating and, as the graphs show, the system achieves performance comparable with Haystack and f4 with higher efficiency.

Caching At Scale

A somewhat similar idea underlies The Storage Hierarchy is Not a Hierarchy: Optimizing Caching on Modern Storage Devices with Orthus by Kan Wu et al:
We introduce non-hierarchical caching (NHC), a novel approach to caching in modern storage hierarchies. NHC improves performance as compared to classic caching by redirecting excess load to devices lower in the hierarchy when it is advantageous to do so. NHC dynamically adjusts allocation and access decisions, thus maximizing performance (e.g., high throughput, low 99%-ile latency). We implement NHC in Orthus-CAS (a block-layer caching kernel module) and Orthus-KV (a user-level caching layer for a key-value store). We show the efficacy of NHC via a thorough empirical study: Orthus-KV and Orthus-CAS offer significantly better performance (by up to 2x) than classic caching on various modern hierarchies, under a range of realistic workloads.
They use an example to motivate their approach:
consider a two-level hierarchy with a traditional Flash-based SSD as the capacity layer, and a newer, seemingly faster Optane SSD as the performance layer. As we will show, in some cases, Optane outperforms Flash, and thus the traditional caching/tiering arrangement works well. However, in other situations (namely, when the workload has high concurrency), the performance of the devices is similar (i.e., the storage hierarchy is actually not a hierarchy), and thus classic caching and tiering do not utilize the full bandwidth available from the capacity layer. A different approach is needed to maximize performance.
To over-simplify, their approach satisfies requests using the performance layer until it appears to be saturated, then satisfies as much of the remaining load as it can from the capacity layer. The saturation may be caused simply by excess load, or it may be caused by the effects of concurrency.
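As a toy illustration of that over-simplification, and emphatically not the Orthus algorithm (which, per the abstract, adjusts its allocation and access decisions dynamically), the sketch below splits an offered load between two devices with fixed throughput limits. The function names and the numbers are hypothetical, and it assumes the requested data is available from both layers.

```python
def classic_caching(offered, perf_limit):
    """Classic caching: cache hits are served only by the performance layer."""
    return min(offered, perf_limit)

def split_load(offered, perf_limit, cap_limit):
    """Fill the performance layer first, then send the excess down.

    Returns (served_by_performance_layer, served_by_capacity_layer).
    All arguments are in the same unit, e.g. MB/s; the limits are
    hypothetical fixed saturation points.
    """
    to_perf = min(offered, perf_limit)
    to_cap = min(offered - to_perf, cap_limit)
    return to_perf, to_cap

# Hypothetical numbers: 3000 MB/s offered, devices saturate at 2000 and 1500.
to_perf, to_cap = split_load(3000, perf_limit=2000, cap_limit=1500)
print("classic caching serves:", classic_caching(3000, perf_limit=2000))
print("non-hierarchical split serves:", to_perf + to_cap)
```

When the offered load is below the performance layer's limit the two policies behave identically; the difference appears only once that layer saturates, which is the case the paper's motivating example describes.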

3D XPoint

One of the media Kan Wu et al evaluated in their storage hierarchies was Intel's Optane, based on the Intel/Micron 3D XPoint Storage-Class Memory (SCM) technology. Alas, Chris Mellor's Micron: We're pulling the plug on 3D XPoint. Anyone in the market for a Utah chip factory? doesn't bode well for its future availability:
Intel and Micron started 3D XPoint development in 2012, with Intel announcing the technology and its Optane brand in 2015, claiming it was 1,000 times faster than flash, with up to 1,000 times flash's endurance. That speed claim was not the case for block-addressable Optane SSDs, which used a PCIe interface. However bit-addressable Optane Persistent Memory (PMEM), which comes in a DIMM form factor, was much faster than SSDs but slower than DRAM. It is a form of storage-class memory, and required system and application software changes for its use.

These software changes were complex and Optane PMEM adoption has been slow, with Intel ploughing resources into building an ecosystem of enterprise software partners that support Optane PMEM.

Intel decided to make Optane PMEM a proprietary architecture with a closed link to specific Xeon CPUs. It has not made this CPU-Optane DIMM interconnect open and neither AMD, Arm nor any other CPU architectures can use it. Nor has Intel added CXL support to Optane.

The result is that, more than five years after its introduction, it is still not in wide scale use.
Intel priced Optane too high for the actual performance it delivered, especially considering the programming changes it needed to achieve full performance. And proprietary interfaces in the PC and server space face huge difficulties in gaining market share.

Hard Disk Stagnation

As Chris Mellor reports in Seagate solidifies HDD market top spot as areal density growth stalls, the hard disk market remains a two-and-a-half player market focused on nearline, with Toshiba inching up in third place by taking market share from Western Digital, leaving Western Digital a bit further behind Seagate. The promised technology transition to HAMR and MAMR is more than a decade late and progressing very slowly. Mellor quotes Tom Coughlin:
“The industry is in a period of extended product and laboratory areal density stagnation, exceeding the length of prior stagnations.”

The problem is that a technology transition from perpendicular magnetic recording (PMR), which has reached a limit in terms of decreasing bit area, to energy-assisted technologies – which support smaller bit areas – has stalled.

The two alternatives, HAMR (Heat-Assisted Magnetic Recording) and MAMR (Microwave-Assisted Magnetic Recording) both require new recording medium formulations and additional components on the read-write heads to generate the heat or microwave energy required. That means extra cost. So far none of the three suppliers: Seagate (HAMR), Toshiba and Western Digital (MAMR), have been confident enough in the characteristics of their technologies to make the switch from PMR across their product ranges.
Wikibon analyst David Floyer said: “HDD vendors of HAMR and MAMR are unlikely to drive down the costs below those of the current PMR HDD technology.”

Due to this: “Investments in HAMR and MAMR are not the HDD vendors’ main focus. Executives are placing significant emphasis on production efficiency, lower sales and distribution costs, and are extracting good profits in a declining market. Wikibon would expect further consolidation of vendors and production facilities as part of this focus on cost reduction.”
It has been this way for years. The vendors want to delay the major costs of the transition as long as possible. In the meantime, since the areal density of the PMR platters isn't growing, the best they can do is to add platters for capacity, and add a second set of arms and heads for performance, and thus add costs. Jim Salter reports on an example in Seagate’s new Mach.2 is the world’s fastest conventional hard drive:
Seagate has been working on dual-actuator hard drives—drives with two independently controlled sets of read/write heads—for several years. Its first production dual-actuator drive, the Mach.2, is now "available to select customers," meaning that enterprises can buy it directly from Seagate, but end-users are out of luck for now.

Seagate lists the sustained, sequential transfer rate of the Mach.2 as up to 524MBps—easily double that of a fast "normal" rust disk and edging into SATA SSD territory. The performance gains extend into random I/O territory as well, with 304 IOPS read / 384 IOPS write and only 4.16 ms average latency. (Normal hard drives tend to be 100/150 IOPS and about the same average latency.)
It is a 14TB, helium-filled PMR drive. It isn't a surprise that the latency doesn't improve; the selected head still needs to seek to where the data is, so dual actuators don't help.
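One way to see why the latency figure doesn't move: the quoted 4.16 ms is what you would expect for the average rotational delay alone, the time for the platter to turn half a revolution, if the drive spins at 7200 RPM. The spindle speed is my assumption; the quoted passage doesn't state it. A second actuator does nothing to make the platters spin faster. A minimal check of the arithmetic:

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational delay: on average the target sector is half a
    revolution away from the head when the seek completes."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

# 7200 RPM is an assumption, not a figure from the quoted article.
latency = avg_rotational_latency_ms(7200)
print(f"{latency:.3f} ms")
```

This comes to roughly 4.17 ms, which matches the quoted 4.16 ms if the spec sheet truncates rather than rounds.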

Seagate's Roadmap

At the 2009 Library of Congress workshop on Architectures for Digital Preservation, Dave Anderson of Seagate presented the company's roadmap for hard disks. He included this graph projecting that the next recording technology, Heat Assisted Magnetic Recording (HAMR), would take over in the next year, and would be supplanted by a successor technology called Bit Patterned Media around 2015. I started writing skeptically about industry projections of technology evolution the next year, in 2010.

Whenever hard disk technology stagnates, as it has recently, the industry tries to distract the customers, and delight the good Dr. Pangloss, by publishing roadmaps of the glorious future awaiting them. Anton Shilov's Seagate's Roadmap: The Path to 120 TB Hard Drives covers the latest version:
Seagate recently published its long-term technology roadmap revealing plans to produce ~50 TB hard drives by 2026 and 120+ TB HDDs after 2030. In the coming years, Seagate is set to leverage usage of heat-assisted magnetic recording (HAMR), adopt bit patterned media (BPM) in the long term, and to expand usage of multi-actuator technology (MAT) for high-capacity drives. This is all within the 3.5-inch form factor.
In the recent years HDD capacity has been increasing rather slowly as perpendicular magnetic recording (PMR), even boosted with two-dimensional magnetic recording (TDMR), is reaching its limits. Seagate's current top-of-the-range HDD features a 20 TB capacity and is based on HAMR, which not only promises to enable 3.5-inch hard drives with a ~90 TB capacity in the long term, but also to allow Seagate to increase capacities of its products faster.

In particular, Seagate expects 30+ TB HDDs to arrive in calendar 2023, then 40+ TB drives in 2024 ~ 2025, and then 50+ TB HDDs sometimes in 2026. This was revealed at its recent Virtual Analyst Event. In 2030, the manufacturer intends to release a 100 TB HDD, with 120 TB units following on later next decade. To hit these higher capacities, Seagate is looking to adopt new types of media.
Today's 20 TB HAMR HDD uses nine 2.22-TB platters featuring an areal density of around 1.3 Tb/inch2. To build a nine-platter 40 TB hard drive, the company needs HAMR media featuring an areal density of approximately 2.6 Tb/inch2. Back in 2018~2019 the company already achieved a 2.381 Tb/inch2 areal density in spinstand testing in its lab and recently it actually managed to hit 2.6 Tb/inch2 in the lab, so the company knows how to build media for 40 TB HDDs. However to build a complete product, it will still need to develop the suitable head, drive controller, and other electronics for its 40 TB drive, which will take several years.
In general, Seagate projects HAMR technology to continue scaling for years to come without cardinal changes. The company expects HAMR and nanogranular media based on glass substrates and featuring iron platinum alloy (FePt) magnetic films to scale to 4 ~ 6 Tb/inch2 in areal density. This should enable hard drives of up to 90 TB in capacity.

In a bid to hit something like 105 TB, Seagate expects to use ordered-granular media with 5 ~ 7 Tb/inch2 areal density. To go further, the world's largest HDD manufacturer plans to use 'fully' bit patterned media (BPM) with an 8 Tb/inch2 areal density or higher. All new types of media will still require some sort of assisted magnetic recording, so HAMR will stay with us in one form or another for years to come.
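The capacity figures in the roadmap follow from simple proportionality. As a rough check, the sketch below scales the quoted baseline (20 TB at about 1.3 Tb/inch2) linearly with areal density. The linear scaling, and the assumption that platter count and usable platter area stay the same, are mine; the function name is hypothetical.

```python
BASE_CAPACITY_TB = 20.0   # today's nine-platter HAMR drive, per the quote
BASE_DENSITY = 1.3        # its areal density in Tb/inch2, per the quote

def projected_capacity_tb(density_tb_per_in2):
    """Drive capacity if it scales linearly with areal density, assuming
    the same platter count and the same usable area per platter."""
    return BASE_CAPACITY_TB * density_tb_per_in2 / BASE_DENSITY

# Densities taken from the quoted roadmap.
for density in (2.6, 6.0, 7.0, 8.0):
    print(f"{density} Tb/inch2 -> {projected_capacity_tb(density):.0f} TB")
```

The results, about 40, 92, 108 and 123 TB, line up with the roadmap's 40 TB, "up to 90 TB", "something like 105 TB" and 120 TB, so the capacity projections appear to be little more than the areal density projections multiplied through.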
After they finally bite the investment bullet for mass adoption of HAMR, they're going to postpone the next big investment, getting to BPM, as long as they possibly can, just as they did for a decade with HAMR.

As usual, I suggest viewing the roadmap skeptically. One area where I do agree with Seagate is:
Speaking of TCO, Seagate is confident that hard drives will remain cost-effective storage devices for many years to come. Seagate believes that 3D NAND will not beat HDDs in terms of per-GB cost any time soon and TCO of hard drives will remain at competitive levels. Right now, 90% of data stored by cloud datacenters is stored on HDDs and Seagate expects this to continue.


David. said...

In This hard drive breakthrough could see HDDs holding 10 times more data, Liam Tung points to Graphene overcoats for ultra-high storage density magnetic media by Dwivedi, Ott et al. They replaced the top layer of hard disk platters with graphene allowing the heads to get closer to the magnetic layer:

"Here we show that graphene-based overcoats can overcome all these limitations, and achieve two-fold reduction in friction and provide better corrosion and wear resistance than state-of-the-art COCs, while withstanding HAMR conditions. Thus, we expect that graphene overcoats may enable the development of 4–10 Tb/in2 areal density HDDs when employing suitable recording technologies, such as HAMR and HAMR+bit patterned media."

Platypus said...

I worked on "Tectonic" (which is just the external name BTW; we all knew it as something else and AFAIK still do). It's not a filesystem. My first team at FB supported an actual filesystem, and that was deprecated in favor of Tectonic for reasons both real and political. It does have many interesting features, though. Most of those are related to its sheer scale. Placing data across network and power domains at multiple levels for availability, then preserving those constraints when rebalancing (both due to natural imbalance and due to loss/gain of capacity), can get pretty interesting. Finding/repairing blocks affected by a bug not previously known to the background "anti entropy" scanner was another good one, usually involving custom scripts and data pipelines that would take days to weeks to run across the entire cluster. I retired in September, I don't miss the code or work environment at all, but I do miss the abstract problems and the colleagues who shared an understanding of them.

David. said...

Jim Salter's Western Digital introduces new non-SMR 20TB HDDs with onboard NAND explains how WD uses a small amount of flash to provide an effective increase in areal density:

"Repeatable Runout (RRO) is a description of a rotational system's inaccuracies that can be predicted ahead of time—for example, a steady wobble caused by the microscopically imperfect alignment of a rotor. RRO data is specific to each individual drive and is generated at the factory during manufacturing and stored on the disk itself.

In typical conventional drives, RRO metadata is interleaved with customer-accessible data on the platters themselves, reducing the overall areal density of the disk due to reducing the number of tracks per inch (TPI) available for customer data. OptiNAND architecture allows Western Digital to move this metadata off the platters and onto the onboard NAND.

In order to hit 20TB on a nine-platter drive with current recording technologies, you need a next-generation "edge" of some sort. In last year's 20TB drives, that edge was SMR—in this year's newest models, it's OptiNAND."

David. said...

A team at Stanford has greatly reduced the energy needed to switch phase-change memory:

"To address this challenge, the Stanford team set out to design a phase-change memory cell that operates with low power and can be embedded on flexible plastic substrates commonly used in bendable smartphones, wearable body sensors and other battery-operated mobile electronics.
In the study, Daus and his colleagues discovered that a plastic substrate with low thermal conductivity can help reduce current flow in the memory cell, allowing it to operate efficiently.

“Our new device lowered the programming current density by a factor of 10 on a flexible substrate and by a factor of 100 on rigid silicon,” Pop said. “Three ingredients went into our secret sauce: a superlattice consisting of nanosized layers of the memory material, a pore cell – a nanosized hole into which we stuffed the superlattice layers – and a thermally insulating flexible substrate. Together, they significantly improved energy efficiency.”