One of the best parts of FAST is the papers based on collecting large amounts of data about the behavior of storage systems in production. This time there were three of them.
Ioannis Manousakis et al. got one of the two Best Paper awards for Environmental Conditions and Disk Reliability in Free-cooled Datacenters, based on a massive dataset recording the environmental conditions and the disk errors in nine of Microsoft's data centers, of three different types:
- the original resource-intensive, chiller-based type, in which both chillers and cooling towers are used continually to maintain tight control of environmental conditions;
- the water-side economized type, in which the chillers are used only when the cooling towers are unable to maintain environmental control;
- free-cooled data centers, which use fans to drive filtered outside air through the data center, adding evaporative cooling (and thus humidity) when needed to maintain acceptable temperatures.
Bianca Schroeder gave a polished presentation of work with two co-authors from Google. Flash Reliability in Production: The Expected and the Unexpected used data covering "many millions of drive days, ten different drive models, different flash technologies (MLC, eMLC, SLC) over 6 years" to reveal some unexpected aspects of large-scale flash use. Google built the drives themselves, using different commodity flash chips but:
"We focus on two generations of drives, where all drives of the same generation use the same device driver and firmware. That means that they also use the same error correcting codes (ECC) to detect and correct corrupted bits and the same algorithms for wear-levelling."
Thus they were really looking at differences in behavior of the underlying flash media. Among their results were:
"The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors. ... Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors."
Krste Asanović's keynote at the 2014 FAST conference cited The Tail At Scale by Dean and Barroso in drawing attention to the importance of tail latency at data center scale. Mingzhe Hao et al.'s The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments paid homage to Dean and Barroso's work as they analyzed:
"storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an overall total of 857 million (disk) and 7 million (SSD) drive hours"
using data from NetApp filers in the field. They found a small proportion of very slow responses from the media:
"0.2% of the time, a disk is more than 2x slower than its peer drives in the same RAID group (and 0.6% for SSD). As a consequence, disk and SSD-based RAIDs experience at least one slow drive (i.e., storage tail) 1.5% and 2.2% of the time. ... We observe that storage tails can adversely impact RAID performance."
At scale, these slow-downs are very significant. Many of them reflect internal drive operations such as bad-block remapping and garbage collection, making this paper highly relevant to Eric Brewer's keynote.
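To make the slowdown metric concrete, here is a minimal Python sketch (my own illustration, not the authors' code) of the comparison described above: a drive's latency in a measurement window is divided by the median latency of its peers in the same RAID group, and a ratio of 2x or more marks it as a tail drive. The function name and inputs are made up for the example.

```python
# Minimal sketch of the "slow drive" test: compare each drive's latency in a
# window to the median latency of its RAID-group peers; a ratio >= 2 flags a tail.
from statistics import median

def slow_drives(latencies_ms, threshold=2.0):
    """latencies_ms: per-drive average latency for one RAID group in one window."""
    peer_median = median(latencies_ms)
    return [i for i, lat in enumerate(latencies_ms)
            if lat / peer_median >= threshold]

# Example: drive 3 is ~4x slower than its peers, so the whole RAID group
# experiences a "storage tail" in this window.
print(slow_drives([5.1, 4.9, 5.3, 21.0, 5.0]))   # -> [3]
```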
Drives are not the only cause of errors and performance problems in storage systems. Storage system software has bugs, and the inherent non-determinism of these systems makes the more subtle of them very hard to find. Pantazis Deligiannis et al.'s Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!) described an important technique by which critical software components could be exercised in a framework that mimicked the non-determinism while capturing enough trace information to allow the root causes of the bugs it found to be identified.
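To illustrate the general idea, here is a Python sketch of controlled-scheduling testing; it is not the authors' framework, and every name in it is made up. The test harness takes over the source of non-determinism, picks the order in which pending events are delivered from a seeded random generator, and records that order as a trace, so any failing run can be replayed deterministically.

```python
# Sketch: the harness, not the OS or network, decides event delivery order.
# All non-determinism comes from one seed, and the choices are recorded as a
# trace, so a run that violates an invariant can be reproduced exactly.
import random

def run_controlled(handlers, initial_events, seed):
    rng = random.Random(seed)
    pending, trace = list(initial_events), []
    while pending:
        i = rng.randrange(len(pending))      # harness picks the next event to deliver
        event = pending.pop(i)
        trace.append(event)
        pending.extend(handlers[event]())    # a handler may generate follow-up events
    return trace

def explore(handlers, initial_events, invariant, runs=100):
    for seed in range(runs):
        trace = run_controlled(handlers, initial_events, seed)
        assert invariant(trace), f"bug found: seed={seed}, trace={trace}"

# Example: two events whose delivery order matters; the "bug" appears whenever
# "read" is delivered before "write", and the printed seed reproduces it.
handlers = {"write": lambda: [], "read": lambda: []}
try:
    explore(handlers, ["write", "read"], lambda t: t == ["write", "read"], runs=50)
except AssertionError as e:
    print(e)
```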
As usual at FAST, the University of Wisconsin-Madison was well represented. Lanyue Lu et al.'s WiscKey: Separating Keys from Values in SSD-conscious Storage looked at the implementation of key-value stores, typically as LSM-trees, on SSD storage and showed how simply introducing a level of indirection between the keys and their values could radically reduce the I/O amplification:
"this I/O amplification in typical LSM-trees can reach a factor of 50x or higher"
and enable a set of other optimizations.
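A minimal Python sketch of the key/value separation (my own illustration, not WiscKey's implementation; it uses a dict where WiscKey uses an LSM-tree): keys map to small (offset, length) pointers into an append-only value log, so tree maintenance and compaction never have to rewrite the values themselves.

```python
# Sketch of key/value separation: values go to an append-only log, the index
# holds only small key -> (offset, length) pointers. Only the pointers live in
# the tree, so compaction rewrites far less data.
import os

class KVStore:
    def __init__(self, log_path="values.log"):
        self.log = open(log_path, "a+b")   # append-only value log
        self.index = {}                    # key -> (offset, length); stands in for the LSM-tree

    def put(self, key, value: bytes):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)
        self.log.flush()
        self.index[key] = (offset, len(value))   # only this small entry is indexed

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

store = KVStore()
store.put("user:42", b"some large value blob")
assert store.get("user:42") == b"some large value blob"
```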
At scale, dev-ops tools such as Docker are essential. They make configuring and bringing up instances of systems easy, but they aren't as fast as one would like. Another good paper from Madison, Tyler Harter et al.'s Slacker: Fast Distribution with Lazy Docker Containers, developed a benchmark to look into why this is and found that:
"pulling packages accounts for 76% of container start time, but only 6.4% of that data is read."
Slacker, their new, lazy Docker storage driver:
"speeds up the median container development cycle by 20x and deployment cycle by 5x."
It turns out that Docker's standard storage driver is based on AUFS (Another Union File System), and a good deal of the inefficiency comes from the way AUFS is used. I discussed the history of union file systems in It Takes Longer Than It Takes.
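The lazy-fetch idea can be sketched in a few lines of Python; this is an illustration of the general approach, not Slacker's actual driver (which serves container data from shared network storage), and fetch_block is a hypothetical helper. Image blocks are materialized only when a container first reads them, so the large fraction of image data that is never read at startup is never transferred.

```python
# Sketch of lazy image distribution: instead of pulling every block of an image
# at deploy time, blocks are fetched from the registry only on first read.
class LazyImage:
    def __init__(self, fetch_block, num_blocks):
        self.fetch_block = fetch_block      # callable: block index -> bytes (hypothetical)
        self.cache = {}                     # locally materialized blocks
        self.num_blocks = num_blocks

    def read(self, block_idx: int) -> bytes:
        if block_idx not in self.cache:     # fetch on first access only
            self.cache[block_idx] = self.fetch_block(block_idx)
        return self.cache[block_idx]

# Usage with a dummy fetcher: only the blocks actually read are transferred.
image = LazyImage(lambda i: b"\0" * 4096, num_blocks=25600)
first = image.read(0)
```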
Meza et al. from CMU and Facebook have a study of the reliability of:
"a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours"
entitled A Large-Scale Study of Flash Memory Failures in the Field.