Friday, February 20, 2015

Report from FAST15

I spent most of last week at Usenix's File and Storage Technologies conference. Below the fold, notes on the most interesting talks from my perspective.


A Brief History of the BSD Fast Filesystem. My friend Kirk McKusick was awarded the 2009 IEEE Reynold B. Johnson Information Storage Systems Award at the 2009 FAST conference for his custody of this important technology, but he only had a few minutes to respond. This time he had an hour to review over 30 years of high-quality engineering. Two aspects of the architecture were clearly important.

The first dates from the beginning in 1982. It is the strict split between the mechanism of the on-disk bitmaps (code unchanged since the first release) and the policy for laying out blocks on the drive. This split means that, if you had an FFS disk from 1982, or its image, the current code would mount it with no problems. The blocks would be laid out very differently from a current disk (and would be much smaller), but the way this different layout was encoded on the disk would be the same. The mechanism guarantees consistency: there's no way for a bad policy to break the file system, it can only slow it down. As an example, over lunch after listening to Ao Ma et al's 2013 FAST paper, ffsck: The Fast File System Checker, Kirk implemented their layout policy for FFS. Ma et al's implementation added 1357 lines of code to the ext3 implementation.
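To make the mechanism/policy split concrete, here is a minimal Python sketch. It is not FFS code; the class `BlockBitmap`, the two policy functions and `write_file` are all hypothetical names invented for illustration, and real FFS layout (cylinder groups, fragments and so on) is far richer. The point it tries to show is only that the bitmap refuses inconsistent allocations regardless of which policy picks the blocks.

```python
from typing import Callable, List, Optional


class BlockBitmap:
    """Mechanism: tracks which blocks are in use and refuses double allocation."""

    def __init__(self, nblocks: int) -> None:
        self.used = [False] * nblocks

    def allocate(self, block: int) -> None:
        if not 0 <= block < len(self.used):
            raise ValueError("block out of range")
        if self.used[block]:
            raise ValueError("block already allocated")
        self.used[block] = True

    def free_blocks(self) -> List[int]:
        return [i for i, in_use in enumerate(self.used) if not in_use]


# A policy only chooses among blocks the mechanism reports as free.
Policy = Callable[[List[int], Optional[int]], int]


def contiguous_policy(free: List[int], prev: Optional[int]) -> int:
    """Prefer the first free block after the previously placed one."""
    if prev is not None:
        later = [b for b in free if b > prev]
        if later:
            return later[0]
    return free[0]


def scattered_policy(free: List[int], prev: Optional[int]) -> int:
    """A deliberately poor policy: always take the highest free block."""
    return free[-1]


def write_file(bitmap: BlockBitmap, nblocks: int, policy: Policy) -> List[int]:
    """Place nblocks blocks using the given policy; return their locations."""
    placed: List[int] = []
    prev: Optional[int] = None
    for _ in range(nblocks):
        free = bitmap.free_blocks()
        if not free:
            raise RuntimeError("no free blocks")
        prev = policy(free, prev)
        bitmap.allocate(prev)  # the mechanism enforces consistency
        placed.append(prev)
    return placed


good_layout = write_file(BlockBitmap(8), 3, contiguous_policy)
poor_layout = write_file(BlockBitmap(8), 3, scattered_policy)
```

Both layouts are valid as far as the bitmap is concerned. The second would simply cost more seeks on a real disk, which is the sense in which a bad policy can slow the file system down but not break it.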

The second dates from 1987 and, as Kirk tells it, resulted from a conversation with me. It is the clean and simple implementation of stacking vnodes, which allows very easy and modular implementation of additional file system functionality, such as user/group ID remapping or extended attributes. Most of Kirk's talk was a year-by-year recounting of incremental progress of this kind.


Analysis of the ECMWF Storage Landscape by Matthias Grawinkel et al is based on a collection of logs from two tape-based data archives fronted by disk cache (ECFS is 15PB with disk:tape ratio 1:43, MARS is 55PB with a 1:38 ratio). They have published the data:
  • ECFS access trace: Timestamps, user id, path, size of GET, PUT, DELETE, RENAME requests. 2012/01/02-2014/05/21.
  • ECFS / HPSS database snapshot: Metadata snapshot of ECFS on tape. Owner, size, creation/read/modification date, paths of files. Snapshot of 2014/09/05.
  • MARS feedback logs: MARS client requests (ARCHIVE, RETRIEVE, DELETE). Timestamps, user, query parameters, execution time, archived or retrieved bytes and fields. 2010/01/01-2014/02/27.
  • MARS / HPSS database snapshot: Metadata snapshot of MARS files on tape. Owner, size, creation/read/modification date, paths of files. Snapshot of 2014/09/06.
  • HPSS WHPSS logs / robot mount logs: Timestamps, tape ids, information on full usage lifecycle from access request till cartridges are put back to the library. 2012/01/01-2013/12/31.
This is extraordinarily valuable data for archival system design, and their analyses are very interesting. I plan to blog in detail about this soon.

Efficient Intra-Operating System Protection Against Harmful DMAs by Moshe Malka et al provides a fascinating insight into the cost to the operating system of managing IOMMUs such as those used by Amazon and NVIDIA and identifies major cost savings.

ANViL: Advanced Virtualization for Modern Non-Volatile Memory Devices by Zev Weiss et al looks at managing the storage layer of a file system the same way the operating system manages RAM, by virtualizing it with a page map. This doesn't work well for hard disks, because the latency of the random I/Os needed to do garbage collection is so long and variable. But for flash and its successors it can potentially simplify the file system considerably.

Reducing File System Tail Latencies with Chopper by Jun He et al. Krste Asanovic's keynote at the last FAST stressed the importance for large systems of suppressing tail latencies. This paper described ways to exercise the file system to collect data on tail latencies, and to analyse the data to understand where the latencies were coming from so as to fix their root cause. They found four problems in the ext4 block allocator that were root causes.

Skylight—A Window on Shingled Disk Operation by Abutalib Aghayev and Peter Desnoyers won the Best Paper award. One response of the drive makers to the fact that Shingled Magnetic Recording (SMR) turns hard disks from randomly writable to append-only media is Drive-Managed SMR, in which a Shingled Translation Layer (STL) hides this fact using internal buffers to make the drive interface support random writes. Placing this after the tail latency paper was a nice touch - one result of the buffering is infrequent long delays as the drive buffers are flushed! The paper is a very clear presentation of the SMR technology, the problems it poses, the techniques for implementing STLs, and the authors' data collection techniques. These included filming the head movements with a high-speed camera through a window they installed in the drive top cover.

RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures by Ao Ma et al shows that in EMC's environment they can effectively predict SATA disk failures by observing the reallocated sector count and, by proactively replacing drives whose counts exceed a threshold, greatly reduce RAID failures. This is of considerable importance in improving the reliability of disk-based archives.
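The threshold idea can be sketched in a few lines of Python. This is an illustration only, not EMC's implementation: the function name `drives_to_replace`, the drive names, the counts and the threshold of 100 are all made up for the example, and the paper should be consulted for how a threshold is actually chosen.

```python
from typing import Dict, List


def drives_to_replace(reallocated: Dict[str, int], threshold: int) -> List[str]:
    """Return the drives whose reallocated sector count exceeds the threshold."""
    return sorted(d for d, count in reallocated.items() if count > threshold)


# Hypothetical reallocated sector counts, as might be read from SMART data.
counts = {"disk0": 3, "disk1": 412, "disk2": 0, "disk3": 97}

# Hypothetical threshold; drives above it are swapped out before they fail.
flagged = drives_to_replace(counts, threshold=100)
```

Here only `disk1` would be flagged for proactive replacement.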

Work-In-Progress talks and posters

Building Native Erasure Coding Support in HDFS by Zhe Zhang et al - this WIP described work to rebuild the framework underlying HDFS so that flexible choices can be made between replication and erasure coding, between contiguous and striped data layout, and between erasure codes.

Changing the Redundancy Paradigm: Challenges of Building an Entangled Storage by Verónica Estrada Galiñanes and Pascal Felber - this WIP updated work published earlier in Helical Entanglement Codes: An Efficient Approach for Designing Robust Distributed Storage Systems. This is an alternative to erasure codes for efficiently increasing the robustness of stored data. Instead of adding parity blocks, they entangle incoming blocks with previously stored blocks:
To upload a piece of data to the system, a client must first download some existing blocks ... and combine them with the new data using a simple exclusive-or (XOR) operation. The combined blocks are then uploaded to different servers, whereas the original data is not stored at all. The newly uploaded blocks will be subsequently used in combination with future blocks, hence creating intricate dependencies that provide strong durability properties. The original piece of data can be reconstructed in several ways by combining different pairs of blocks stored in the system. These blocks can themselves be repaired by recursively following dependency chain
It is an interesting idea that, at data center scale, is claimed to provide very impressive fault-tolerance for archival data.
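The XOR step in the quoted description can be illustrated with a small Python sketch. It is a deliberate simplification, not the helical entanglement code itself: the in-memory `store` dictionary stands in for the servers, the block names and the helpers `xor_blocks`, `entangle` and `recover` are hypothetical, and the choice of which existing blocks to entangle with (the heart of the real scheme) is ignored.

```python
import os
from typing import Dict, List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def entangle(data: bytes, old_blocks: List[bytes]) -> List[bytes]:
    """Combine new data with each existing block; the data itself is not kept."""
    return [xor_blocks(data, old) for old in old_blocks]


def recover(entangled: bytes, old: bytes) -> bytes:
    """XOR is its own inverse, so the same operation recovers the data."""
    return xor_blocks(entangled, old)


# Two previously stored blocks (random bytes standing in for earlier uploads).
store: Dict[str, bytes] = {"p0": os.urandom(16), "p1": os.urandom(16)}

data = b"archival payload"  # 16 bytes, matching the block size used here

# Upload: entangle with both existing blocks, store only the results.
store["p2"], store["p3"] = entangle(data, [store["p0"], store["p1"]])

# Download: either pair of stored blocks reconstructs the original data.
via_first_pair = recover(store["p2"], store["p0"])
via_second_pair = recover(store["p3"], store["p1"])
```

Because either pair suffices, losing one block from each pair's counterpart still leaves a route to the data, and a lost block can itself be rebuilt from the blocks it was entangled with.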
