Tuesday, March 1, 2016

The Cloudy Future of Disk Drives

For many years, following Dave Anderson of Seagate, I've been pointing out that the constraints of manufacturing capacity mean that the only medium available on which to store the world's bulk data is hard disk. Eric Brewer's fascinating FAST 2016 keynote, entitled Spinning Disks and their Cloudy Future, and Google's associated white paper start from this premise:
The rise of portable devices and services in the Cloud has the consequence that (spinning) hard disks will be deployed primarily as part of large storage services housed in data centers. Such services are already the fastest growing market for disks and will be the majority market in the near future.
Eric's argument is that since cloud storage will shortly be the majority of the market, and the other segments are declining, the design of hard drives no longer needs to be a compromise suitable for a broad range of uses, but should instead be optimized for the Cloud. Below the fold, I look into some details of the optimizations and provide some supporting evidence.

Unattributed quotes below come from the white paper Disks for Data Centers, which you should definitely read. It starts by pointing out the scale of the problem cloud providers face:
for YouTube alone, users upload over 400 hours of video every minute, which at one gigabyte per hour requires a petabyte (1M GB) of new storage every day. As shown in the graph, this continues to grow exponentially, with a 10x increase every five years.
Google agrees with Dave's and my argument but puts it differently:
An obvious question is why are we talking about spinning disks at all, rather than SSDs, which have higher IOPS and are the “future” of storage. The root reason is that the cost per GB remains too high, and more importantly that the growth rates in capacity/$ between disks and SSDs are relatively close (at least for SSDs that have sufficient numbers of program/erase cycles to use in data centers), so that cost will not change enough in the coming decade.
The reason why the rates are close is that the investment in fab capacity needed to displace hard disks would not generate an economic return, so that disks and SSDs will share the market, with SSDs taking the market segment in which their higher performance can justify the higher cost needed to generate a return on the fab investment.

In summary, the white paper's argument is that at scale disks are only ever used as part of a massive collection. This has two effects. The first is:
Achieving durability in practice requires storing valuable data on multiple drives. Within a data center, high availability in the presence of host failures also requires storing data on multiple disks, even if the disks were perfect and failure free. Tolerance to disasters and other rare events requires replication of data to multiple independent locations. Although it is a goal for disks to provide durability, they can at best be only part of the solution and should avoid extensive optimization to avoid losing data.

This is a variation of the “end to end” argument: avoid doing in lower layers what you have to do anyway in upper layers. In particular, since data of value is never just on one disk, the bit error rate (BER) for a single disk could actually be orders of magnitude higher (i.e. lose more bits) than the current target of 1 in 10^15, assuming that we can trade off that error rate (at a fixed TCO) for something else, such as capacity or better tail latency.
This is caricatured in the media as Google wants less reliable hard disks, which misses the point. Eight years ago, Jiang et al's Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics showed that disks are the cause of less than half of storage failures encountered in the field. Thus even perfectly reliable disks would not remove the need for replication. Because the data is replicated, and because disks are already very reliable, sacrificing some reliability would be a good trade-off if it improved other metrics sufficiently. If the other metrics could be improved without sacrificing reliability, that would be even better, since system-level recovery from errors is never free.

The second effect is related: the key metrics (as we see, including data durability) are collection-level, not disk-level. They are:
  1. Higher I/Os per second (IOPS)
  2. Higher capacity, in GB
  3. Lower tail latency
  4. Meet security requirements
  5. Lower total cost of ownership (TCO)
The first two pose a trade-off, and "The industry is relatively good at improving GB/$, but less so at IOPS/GB". The reason is obvious: at constant cost, moving to faster disks means reducing capacity, which moves toward the market segment dominated by SSDs.

The effect of the trade-off is illustrated in a conceptual graph plotting GB/$ against IOPS/GB. Improving both metrics would move up and to the right, but successive generations tend to move up and to the left - they get a lot bigger but not much faster, because the data rate increases linearly with the density while the capacity increases with the square of the density. Google has a target for IOPS/GB, and to meet it will buy the mix of drives on the red line that drives the fleet average closer to the target.
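To make the fleet-mix idea concrete, here is a toy Python sketch of my own. The drive specifications and the IOPS/GB target are invented for illustration, not Google's numbers; the point is only that varying the mix of a bigger, slower-per-byte model and a smaller, faster-per-byte model steers the fleet average toward a target.

```python
# Toy illustration of steering a fleet's average IOPS/GB toward a target by
# mixing drive models. All numbers are invented for the example.

def fleet_iops_per_gb(mix):
    """mix: list of (count, capacity_gb, iops) per drive model."""
    total_iops = sum(count * iops for count, _, iops in mix)
    total_gb = sum(count * gb for count, gb, _ in mix)
    return total_iops / total_gb

big = (8000, 120)      # 8 TB drive, ~120 random IOPS -> 0.015 IOPS/GB
small = (4000, 120)    # 4 TB drive, same ~120 IOPS   -> 0.030 IOPS/GB
target = 0.02          # hypothetical fleet-wide target

for pct_small in range(0, 101, 25):
    mix = [(100 - pct_small, *big), (pct_small, *small)]
    avg = fleet_iops_per_gb(mix)
    marker = " <- meets target" if avg >= target else ""
    print(f"{pct_small:3d}% small drives: {avg:.4f} IOPS/GB{marker}")
```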

Perhaps the most critical metric is the third, tail latency. Krste Asanović's keynote at the 2014 FAST conference cited Dean and Barroso's 2013 paper The Tail At Scale in drawing attention to the importance of tail latency at data center scale. Dean and Barroso explain:
Variability in the latency distribution of individual components is magnified at the service level; for example, consider a system where each server typically responds in 10ms but with a 99th-percentile latency of one second. If a user request is handled on just one such server, one user request in 100 will be slow (one second). ... service-level latency in this hypothetical scenario is affected by very modest fractions of latency outliers. If a user request must collect responses from 100 such servers in parallel, then 63% of user requests will take more than one second
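The arithmetic behind that 63% figure is worth making explicit; a one-liner reproduces it using the numbers from the quote:

```python
# The arithmetic from the quote: each server exceeds one second 1% of the
# time; a request that fans out to 100 servers is slow if any of them is.
p_slow, fanout = 0.01, 100
print(f"{1 - (1 - p_slow) ** fanout:.0%}")   # -> 63%
```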
One of the interesting papers at this year's FAST, Mingzhe Hao et al's The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments showed that tail latency is a big issue for both hard disks and, more surprisingly, SSDs. They analyzed:
storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an overall total of 857 million (disk) and 7 million (SSD) drive hours.
using data from NetApp filers in the field. They found a small proportion of very slow responses from the media:
0.2% of the time, a disk is more than 2x slower than its peer drives in the same RAID group (and 0.6% for SSD). As a consequence, disk and SSD-based RAIDs experience at least one slow drive (i.e., storage tail) 1.5% and 2.2% of the time. ... We observe that storage tails can adversely impact RAID performance
No kidding. Adapting a strategy from Dean and Barroso, they propose a tail-tolerant RAID-6 implementation they call ToleRAID:
In normal reads, the two parity drives are unused (if no errors), and thus can be leveraged to mask up to two slow data drives. For example, if one data drive is slow, ToleRAID can issue an extra read to one parity drive and rebuild the “late” data.
This is quite effective; it:
can cut long tails and ensure RAID only slows down by at most 3x, while only introducing 0.5% I/O overhead
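To illustrate the idea, here is my own schematic sketch (not the authors' code) of the single-slow-drive case using simple XOR parity; real RAID-6 adds a second, Reed-Solomon-style parity that covers two slow or failed drives:

```python
# Schematic sketch of the single-slow-drive case: rebuild a late data strip
# from the XOR parity and the strips that did respond. Real RAID-6 adds a
# second (Reed-Solomon) parity covering two slow or failed drives.
from functools import reduce

def xor_strips(strips):
    """Bytewise XOR of equal-length byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), strips)

def rebuild_slow_strip(fast_strips, parity_strip):
    """Reconstruct the one strip that has not arrived yet."""
    return xor_strips(fast_strips + [parity_strip])

# Toy stripe: three data strips and their XOR parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
parity = xor_strips(data)

slow = 1                                        # pretend strip 1 is late
fast = [s for i, s in enumerate(data) if i != slow]
assert rebuild_slow_strip(fast, parity) == data[slow]
```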
Google agrees with the multiple-request strategy:
For better overall tail latency, we sometimes issue the same read multiple times to different disks and use the first one to return. This consumes extra resources, but is still sometimes worthwhile. Once we get a returned value, we sometimes cancel the other outstanding requests (to reduce the wasted work). This is pretty likely if one disk (or system) has it cached, but the others have queued up real reads.
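This pattern is often called hedged or backup requests. Here is a minimal asyncio sketch of it, with simulated replica latencies standing in for real disk reads; it illustrates the pattern, not Google's implementation:

```python
# Minimal sketch of the hedged-read pattern described above: send the same
# read to several replicas, take the first answer, cancel the stragglers.
# read_replica() is a stand-in with simulated latency, not a real disk read.
import asyncio, random

async def read_replica(name, key):
    await asyncio.sleep(random.uniform(0.01, 0.5))   # simulated disk latency
    return f"{key} from {name}"

async def hedged_read(key, replicas):
    tasks = [asyncio.create_task(read_replica(r, key)) for r in replicas]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:                 # cancel the outstanding requests
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

print(asyncio.run(hedged_read("block-42", ["disk-a", "disk-b", "disk-c"])))
```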
Google's suggestions for improving tail latency involve enhancing the disk's API in the following ways:
  • Flexible read bounds
  • Profiling data
  • Host-managed read retries
  • Target error rate
  • Background task management
  • Background scanning
The details are in the white paper, but the key idea is that the system has a prioritized queue of requests for data, and the drive in addition has a queue of requests for internal activities, such as scanning for bad blocks. These queues need to be managed by both sides. For example, a read in the system's queue might be cancelled because the data was obtained elsewhere. Or the drive might notice that a read near the tail of the queue can be merged with one near the front because, although they are disparate in logical space, they are adjacent in physical space.

With the current API the system can't communicate the prioritization to the drive, so it keeps the drive's queue as short as possible. This has two effects. It prevents the drive from using its knowledge of what is actually going on to manage the queue so as to conform to the system's priorities. And it gives the drive a false impression that there isn't much work queued up, so it gives its internal activities more resources than they deserve.
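Here is a toy sketch, with invented names and layout, of what host-side queue trimming plus drive-side merging might look like; the real mechanism Google proposes would of course live in the drive's API, not in Python:

```python
# Toy sketch (invented names) of the queue management described above: the
# host keeps a prioritized read queue it can trim, and the drive can merge
# reads that are adjacent in physical space despite distant logical addresses.
import heapq

class HostQueue:
    def __init__(self):
        self._heap = []                  # (priority, logical_block_address)
        self._cancelled = set()

    def submit(self, priority, lba):
        heapq.heappush(self._heap, (priority, lba))

    def cancel(self, lba):
        # e.g. the same data already came back from another disk or replica
        self._cancelled.add(lba)

    def pop(self):
        while self._heap:
            priority, lba = heapq.heappop(self._heap)
            if lba not in self._cancelled:
                return priority, lba
        return None

def mergeable(lba_a, lba_b, logical_to_physical):
    """True if two logically distant reads sit next to each other on disk."""
    return abs(logical_to_physical[lba_a] - logical_to_physical[lba_b]) == 1

q = HostQueue()
q.submit(0, 1000)                        # urgent foreground read
q.submit(9, 5000)                        # low-priority background read
q.cancel(1000)                           # data turned up elsewhere
layout = {1000: 7, 5000: 8}              # invented logical -> physical map
print(q.pop(), mergeable(1000, 5000, layout))
```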

This is just a high-level sample of the thought-provoking ideas in the white paper. Google's hope is to launch a conversation between manufacturers, cloud providers and academics to re-think the design compromises embodied in current designs. I hope they succeed.

1 comment:

David. said...

Backblaze and I have argued for quite some time that you don't have to get all that big to reap enough economies of scale to get cheaper than Amazon. Now it seems that Dropbox (at 1/2EB!) has figured it out too, and is moving their users' data from S3 to in-house infrastructure.