Tuesday, February 19, 2013

Thoughts from FAST 2013

I attended Usenix's 2013 FAST conference. I was so interested in Kai Li's keynote entitled Disruptive Innovation: Data Domain Experience that I'll devote a separate post to it. Below the fold are some other things that caught my attention. Thanks to Usenix's open access policy, you can follow the links and read the papers if I've piqued your interest.

First, an important Work In Progress and poster by Jon Elerath and Jiri Schindler of NetApp entitled Beyond MTTDL: A Closed-Form RAID 6 Reliability Equation. It has been obvious for years that the reliability estimate in the original RAID paper was vastly optimistic. One reason was that it assumed the errors were random and un-correlated, which turns out not to be the case. But producing better reliability estimates has been very difficult, requiring complex Monte Carlo models, a single run of which could last many hours. Elerath and Schindler ran Monte Carlo simulations based on NetApp's extensive database of disk reliability data and derived a simple analytic equation that closely matches both the simulation results and the experience in the field. In doing so they confirm that the original estimate is so wildly optimistic as to be useless, and that RAID-6 is approaching the end of its useful deployment. They have made a RAID6 reliability calculator available for everyone here. Try it to see just how optimistic the MTTDL estimate is!
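To see what they are reacting against, here is a minimal sketch of the textbook MTTDL approximation for a RAID-6 group, the kind of estimate descended from the original RAID paper. It assumes independent, exponentially distributed drive failures and ignores latent sector errors and correlated faults; the drive MTTF, rebuild time and group size below are illustrative assumptions of mine, and the NetApp closed-form equation itself is not reproduced here.

```python
# Textbook (optimistic) MTTDL approximation for RAID-6: data is lost only
# if a third drive fails while two rebuilds are in flight.  Assumes
# independent, exponentially distributed failures -- exactly the assumption
# the field data contradicts.  All input numbers are illustrative.

def mttdl_raid6(n_disks: int, mttf_hours: float, mttr_hours: float) -> float:
    """Classic mean time to data loss for an N-disk double-parity group."""
    return mttf_hours ** 3 / (
        n_disks * (n_disks - 1) * (n_disks - 2) * mttr_hours ** 2
    )

if __name__ == "__main__":
    # e.g. a 14-drive group, a claimed 1M-hour MTTF, a 24-hour rebuild
    hours = mttdl_raid6(n_disks=14, mttf_hours=1_000_000, mttr_hours=24)
    print(f"MTTDL ~ {hours:.2e} hours (~{hours / 8760:.1e} years)")
```

With these inputs the formula predicts a mean time to data loss of tens of millions of years, which is exactly the kind of number that field experience with correlated failures and latent sector errors makes implausible.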

Second, and one of the Best Paper winners, A Study of Linux File System Evolution by Lanyue Lu et al. from the University of Wisconsin, Madison. This fascinating paper analyzed every one of the more than 5000 changes to six of the file systems in the Linux kernel over the past 8 years, classifying them exhaustively. It must have been an enormous amount of work; the results are available in a public database for others to use. They conclude that nearly half the patches are the result of refactoring, and 40% (about 1800) fix bugs. Four of the file systems in question are stable and have been in production use for the whole 8 years; Ext4 became stable during the period and BTRFS is still under development. Even the stable file systems had a continuing flow of bug fixes, and most of those fixes addressed problems that could crash the system and/or corrupt the on-disk image.

From the same group came ffsck: The Fast File System Checker by Ao Ma et al. This asked the question, obvious in hindsight, "why is fsck so slow?" They showed that most of the time went into reading the indirect blocks, and that a simple re-organization of the layout of these blocks produces a dramatic speed-up. Of course, it could be argued that speeding up fsck is addressing a symptom rather than the underlying problem, which is that fsck is needed in the first place. But in the light of Lu et al. and earlier research showing the prevalence of bugs in file system code, the belt-and-braces approach may be unavoidable.
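A toy back-of-the-envelope model (mine, not the paper's) shows why the layout of the indirect blocks dominates: if each indirect block sits next to its data, every one of them costs a seek, whereas if they are packed together one seek amortizes over a long sequential read. The seek time and bandwidth figures are rough assumptions.

```python
# Toy model (not from the paper) of why scattered indirect blocks make
# fsck slow: each scattered metadata block costs a seek, while co-located
# blocks can be read in one sequential sweep.  Figures are rough assumptions.

SEEK_MS = 8.0           # average seek + rotational latency (assumed)
BANDWIDTH_MB_S = 100.0  # sequential read bandwidth (assumed)
BLOCK_KB = 4

def scattered_scan_seconds(n_blocks: int) -> float:
    """Indirect blocks spread across the disk: one seek per block."""
    transfer = n_blocks * BLOCK_KB / 1024 / BANDWIDTH_MB_S
    return n_blocks * SEEK_MS / 1000 + transfer

def colocated_scan_seconds(n_blocks: int) -> float:
    """Indirect blocks packed together: one seek, then sequential reads."""
    transfer = n_blocks * BLOCK_KB / 1024 / BANDWIDTH_MB_S
    return SEEK_MS / 1000 + transfer

if __name__ == "__main__":
    n = 1_000_000  # a million indirect blocks to check
    print(f"scattered:  {scattered_scan_seconds(n) / 60:.1f} minutes")
    print(f"co-located: {colocated_scan_seconds(n):.1f} seconds")
```

Even this crude arithmetic turns a scan of over two hours into one of well under a minute, which is the flavor of improvement the re-organization is after.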

Unioning of the Buffer Cache and Journaling Layers with Non-volatile Memory by Eunji Lee and Hyokyung Bahn of Ewha University and Sam H. Noh of Hongik University (both in Seoul, South Korea) was the other Best Paper winner. It is an early paper looking at the effects that future non-volatile memories will have on system design. They show that if the RAM holding the buffer cache is non-volatile, the buffer cache can take on the role of the file system journal and provide a major I/O speedup. I think in the long term the effects of non-volatile RAM will be far more fundamental, but this is very interesting low-hanging fruit.
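As I understand it, the core idea is that commit stops being a copy of dirty blocks into an on-disk journal and becomes a matter of flipping state on blocks already sitting in non-volatile RAM. A rough conceptual sketch of that distinction (my illustration, not the paper's design):

```python
# Rough conceptual sketch (my illustration, not the paper's design) of the
# difference between a conventional journal commit and a commit when the
# buffer cache itself is non-volatile: the former copies dirty blocks into
# a journal area, the latter just freezes them in place.

class ConventionalJournal:
    """Dirty buffers must be copied to a journal area before checkpoint."""
    def __init__(self):
        self.cache = {}     # volatile buffer cache: block_no -> data
        self.journal = []   # extra writes to stable storage

    def write(self, block_no, data):
        self.cache[block_no] = data

    def commit(self):
        # Every dirty block is written twice: to the journal now, and to
        # its home location at checkpoint time.
        self.journal.extend(self.cache.items())
        return len(self.cache)      # journal I/Os issued


class NonVolatileCacheJournal:
    """With an NVRAM buffer cache, commit just marks blocks as frozen."""
    def __init__(self):
        self.cache = {}     # non-volatile: survives a crash as-is
        self.frozen = set() # committed blocks, pinned until checkpoint

    def write(self, block_no, data):
        self.cache[block_no] = data

    def commit(self):
        # No copy, no extra I/O: the committed state already persists.
        self.frozen.update(self.cache)
        return 0                    # journal I/Os issued
```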

Horus: Fine-Grained Encryption-Based Security for Large-Scale Storage by Yan Li et al. was a collaboration between UC Santa Cruz and Sandia Labs, and Shroud: Ensuring Private Access to Large-Scale Data in the Data Center by Jacob R. Lorch et al. was a collaboration between Microsoft Research, IBM Research and AMD. Both examined how to use encryption to protect information in the cloud.

Horus is the simpler and more practical of the two, showing how compute nodes can be allowed to decrypt only the particular part of the particular file they need. Horus encrypts each block of a file with a separate key. These keys are generated from a single root key as a hash tree of keys, so that given the key for any node in the tree the keys for its children can easily be computed, but computing the keys for its parent nodes is infeasible. The compute nodes request the keys for the range of data in the file that they need from a key server service. This service, which holds the root keys, can be at the data owner's site; it does not need to be in the cloud with the encrypted data, which is a very valuable security feature. The key server can be implemented efficiently; it does not need much storage, compute power or network bandwidth.
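The keyed hash tree is easy to sketch. Here is a minimal illustration (my code, not the Horus implementation) of the one-way derivation: a child's key is an HMAC of its parent's key and its position, so holding the key for a subtree lets you derive every key below it, while recovering the parent's key would require inverting the hash.

```python
# Minimal sketch (my code, not the Horus implementation) of one-way key
# derivation in a keyed hash tree: descend from a node's key to any
# descendant's key with HMAC, but never back up.

import hmac
import hashlib

def child_key(parent_key: bytes, level: int, index: int) -> bytes:
    """Derive the key of child `index` at `level` from its parent's key."""
    msg = f"{level}:{index}".encode()
    return hmac.new(parent_key, msg, hashlib.sha256).digest()

def block_key(root_key: bytes, block_no: int, fanout: int, depth: int) -> bytes:
    """Walk from the root down to the leaf key covering `block_no`."""
    # Compute the path of child indices from the root to the leaf.
    path, n = [], block_no
    for _ in range(depth):
        path.append(n % fanout)
        n //= fanout
    key = root_key
    for level, index in enumerate(reversed(path), start=1):
        key = child_key(key, level, index)
    return key

if __name__ == "__main__":
    root = b"\x00" * 32  # the data owner's root key (illustrative)
    # The key server hands a compute node the key for one subtree; from it
    # the node derives the per-block keys for just that range of the file.
    print(block_key(root, block_no=42, fanout=4, depth=8).hex())
```

This is also why the key server can be so cheap: it stores only root keys and answers each request with a handful of hash computations.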

Shroud is tackling a much harder problem, and is a long way from being usable in practice. It assumes cloud compute and storage services controlled by the adversary, and tries to ensure that not merely does the content of the files not leak, but neither does the information about who accessed which part of which file when. Shroud does so using a technique called Oblivious RAM and a large number of small, trusted computers such as those found in smart cards. The basic idea is to bury the actual data accesses in a large number of noise data accesses, which makes it too expensive and slow for practical use in the near future. But during the Q&A after the presentation Dave Anderson of Seagate pointed out that disks now contain substantial computing resources, and that supporting trusted data access in the drives would be quite feasible. This would probably make Shroud-level security practical.
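The access-pattern-hiding idea can be caricatured in a few lines. This is a naive decoy-read scheme purely for illustration, not Shroud's ORAM protocol, which gives much stronger guarantees (real ORAM re-encrypts and shuffles blocks so that even repeated accesses to the same block are unlinkable):

```python
# Naive decoy-read sketch, only to illustrate the cost of hiding access
# patterns; NOT Shroud's ORAM protocol, and far weaker than real ORAM.

import random

def oblivious_read(storage: dict, wanted: int, n_blocks: int, decoys: int = 15):
    """Fetch the wanted block hidden among `decoys` randomly chosen blocks."""
    requests = {wanted}
    while len(requests) < decoys + 1:
        requests.add(random.randrange(n_blocks))
    order = list(requests)
    random.shuffle(order)   # the untrusted server sees these, in this order
    fetched = {block: storage[block] for block in order}
    return fetched[wanted]

if __name__ == "__main__":
    store = {i: f"block-{i}".encode() for i in range(1024)}
    data = oblivious_read(store, wanted=42, n_blocks=1024)
    print(data, "-> 16 blocks transferred to hide 1 real read")
```

Even this toy version makes the overhead obvious: sixteen reads to hide one, for a weak guarantee. The stronger constructions pay more per access, which is why Shroud leans on many small trusted processors and why pushing the work into the drives is such an appealing suggestion.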

Returning to the issue of correlated failures in storage systems, SATA Port Multipliers Considered Harmful by Peng Li et al. from the University of Minnesota & FutureWei Technologies looked at the effect of SATA port multipliers on system reliability and showed that the failure of one drive on a port multiplier caused I/Os to other drives on the multiplier to fail as well. As can be seen from their Figure 6, after they cause the first drive to fail (by removing the drive cover and waiting for dirt in the air to cause a head crash! They did this 40 times!) the second drive stops performing I/Os, then starts and stops for a while before their test code determines that it too has failed. The authors are correct in pointing out the risks this poses for systems, such as Backblaze's, that use SATA port multipliers. But this looks to me more like a problem with the driver's handling of error conditions than a fundamental hardware problem. The non-failed drive on the port multiplier was still capable of performing I/Os after the failure, but it was repeatedly stopped by something that is very likely software resetting the controller. Thus, since it is pretty certain that the SATA controllers in Backblaze's case are different, and it seems likely that the SATA port multipliers are different too, it isn't safe to extrapolate from this result to other system configurations whose interactions with the driver might be different.

Also having fun doing bad things to good storage were Mai Zheng et al. of Ohio State and HP Labs, with Understanding the Robustness of SSDs under Power Fault. They built a test rig in which they could cut the DC power to an SSD while it was under heavy load, power it back up, and examine the wreckage. They identified 6 different types of error, and actually saw 5 of them in their tests of 15 different SSDs and 2 HDDs. Two SSDs became unusable early in testing, while others showed a wide range of resilience. Two SSDs and one HDD showed no problems at all. This talk got my favorite trick question of the conference, when someone from Microsoft asked if they could tell whether the failures happened during power-off or power-on.
