Friday, March 3, 2017

Notes from FAST17

As usual, I attended Usenix's File and Storage Technologies conference. Below the fold, my comments on the presentations I found interesting.

Kimberly Keeton's keynote, Memory-Driven Computing, reported on HPE's experience in their "Machine" program of investigating the impact of Storage Class Memory (SCM) technology on system architecture. HPE has built, and open-sourced, a number of components of a future memory-based architecture. They have also run simulations on a 240-core, 12TB DRAM machine, which show the potential for very large performance improvements across a broad range of applications processing large datasets. It is clearly important to research these future impacts before they hit.

The environment they have built so far does show the potential performance gains from a load/store interface to persistent storage. It is, however, worth noting that DRAM is faster than SCMs, so the gains are somewhat exaggerated. But Keeton also described the "challenges" that result from the fact that a bare load/store interface doesn't provide many of the services we have come to depend upon from persistent storage, including access control, resilience to failures, encryption and naming. The challenges basically reduce to the fact that the potential performance comes from eliminating the file system and I/O stack. But replacing the necessary services that the file system and I/O stack provide will involve creating a service stack under the load/store interface, which will have hardware and performance costs. It will also be very hard, since the finer granularity of the interface obscures information that the stack needs.
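To make the load/store idea concrete, here is a minimal sketch of the programming model using Python's mmap over an ordinary file. This is only an approximation: with real SCM the mapping would be backed by persistent memory rather than the page cache, and durability would need cache-line writeback rather than msync, but the contrast with the read()/write() syscall path is the same.

```python
import mmap
import os
import tempfile

# "Load/store" access to persistent data via a memory mapping,
# versus the per-operation read()/write() syscall path.
path = os.path.join(tempfile.mkdtemp(), "pmem.img")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)          # pre-size the "persistent" region

fd = os.open(path, os.O_RDWR)
buf = mmap.mmap(fd, 4096)

# Store: an ordinary memory write -- no per-access syscall.
buf[0:5] = b"hello"
# On real persistent memory a flush of the relevant CPU cache lines
# would be needed here to guarantee durability; msync stands in.
buf.flush()

# Load: an ordinary memory read.
data = bytes(buf[0:5])
buf.close()
os.close(fd)
print(data)                        # b'hello'
```

Note that everything the kernel normally does on the read()/write() path (permission checks, quotas, encryption) is absent here, which is exactly the service gap Keeton described.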

Last year's Test of Time Award went to a NetApp paper from 2004. This year's award went to a NetApp paper from 2002. Not content with that, one of the two Best Paper awards went to Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL by Ram Kesavan et al from NetApp and UW Madison. This excellent paper described how the fix they developed for a serious scaling problem, one affecting emerging workloads on NetApp filers, created two new problems. First, it slightly degraded performance on legacy workloads, i.e. for the majority of their customers. Second, and much worse, in certain very rare cases it caused the filer to go catatonic for about 30 seconds. The story of how they came up with a second solution that fixed both is interesting.

A big theme was how to deal with devices, such as Shingled Magnetic Recording (SMR) hard drives and flash SSDs, that do expensive things like garbage collection autonomously. Murphy's Law suggests that they will do them at inconvenient times.

Now that SMR disks are widely available, the technology rated an entire session. Abutalib Aghayev and co-authors from CMU, Google and Northeastern presented Evolving Ext4 for Shingled Disks. They report:
For non-sequential workloads, [drive-managed SMR] disks show bimodal behavior: After a short period of high throughput they enter a continuous period of low throughput.

We introduce ext4-lazy, a small change to the Linux ext4 file system that significantly improves the throughput in both modes. We present benchmarks on four different drive-managed SMR disks from two vendors, showing that ext4-lazy achieves 1.7-5.4x improvement over ext4 on a metadata-light file server benchmark. On metadata-heavy benchmarks it achieves 2-13x improvement over ext4 on drive-managed SMR disks as well as on conventional disks.
This work is impressive; a relatively small change caused a really significant performance improvement. We use Seagate's 8TB drive-managed SMR disks in our low-cost LOCKSS box prototype. We need to investigate whether our workload triggers this behavior and, if so, try their fix.
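As I understand the paper, the core of ext4-lazy is to stop writing dirty metadata blocks back to their scattered home locations (random writes that a drive-managed SMR disk must later clean up) and instead leave them in the sequentially-written journal, with an in-memory map recording where the latest copy lives. A toy sketch of that mapping idea:

```python
# Toy sketch of the ext4-lazy idea: dirty metadata is appended to the
# journal rather than written back in place, and a map ("jmap" in the
# paper) tracks the latest copy of each metadata block.

journal = []          # sequential journal area (append-only)
jmap = {}             # metadata block number -> index in journal

def write_metadata(block_no, data):
    """A random in-place write becomes a sequential append."""
    jmap[block_no] = len(journal)
    journal.append(data)

def read_metadata(block_no, disk):
    """Serve from the journal if mapped, else from the home location."""
    if block_no in jmap:
        return journal[jmap[block_no]]
    return disk[block_no]

disk = {7: b"old"}                 # stale home-location copy
write_metadata(7, b"new")
write_metadata(9, b"inode")
print(read_metadata(7, disk))      # b'new'
```

The SMR disk then sees only sequential metadata writes, which its translation layer handles cheaply.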

Among the many papers on flash, one entire session was devoted to Open-Channel SSDs, an alternate architecture for flash storage providing a lower-level interface that allows the Flash Translation Layer to be implemented on the host. Open-Channel SSD hardware is now available, and Linux is in the process of releasing a subsystem that uses it. LightNVM: The Linux Open-Channel SSD Subsystem by Matias Bjørling and co-authors from CNEX Labs, Inc. and IT University of Copenhagen described the subsystem and its multiple levels in detail. From their abstract:
We present our experience building LightNVM, the Linux Open-Channel SSD subsystem. We introduce a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. Our experimental results demonstrate that LightNVM has modest host overhead, that it can be tuned to limit read latency variability and that it can be customized to achieve predictable I/O latencies.
Other papers in the session described how wear-leveling of Open-Channel SSDs can be handled in this subsystem, and how it can be used to optimize the media for key-value caches.
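The essential job a host-side FTL takes over from the device is the logical-to-physical mapping: flash pages can't be overwritten until their block is erased, so every update goes out of place to a fresh page and the map is updated. A toy sketch (my own illustration, not LightNVM's data structures):

```python
# Toy host-side Flash Translation Layer of the kind an Open-Channel
# SSD makes possible: the device exposes physical pages, and the host
# maps logical block addresses onto them, writing out of place.

PAGES = 8
flash = [None] * PAGES        # physical pages, write-once until erased
l2p = {}                      # logical block address -> physical page
next_free = 0

def ftl_write(lba, data):
    global next_free
    if next_free >= PAGES:
        raise RuntimeError("no free pages: garbage collection needed")
    flash[next_free] = data   # out-of-place write to a fresh page
    l2p[lba] = next_free      # remap; the old page becomes stale
    next_free += 1

def ftl_read(lba):
    return flash[l2p[lba]]

ftl_write(0, b"v1")
ftl_write(0, b"v2")           # the update goes to a new physical page
print(ftl_read(0), l2p[0])    # b'v2' 1
```

Moving this logic to the host is what lets it be co-designed with the application, e.g. scheduling garbage collection when it won't hurt tail latency.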

It is frequently claimed that modern file systems have engineered away the problem of fragmentation and the resulting performance degradation as the file system ages, at least in file systems that aren't nearly full. File Systems Fated for Senescence? Nonsense, Says Science! by Alex Conway et al from Rutgers, UNC, Stony Brook, MIT and Farmingdale State College demonstrated that these claims aren't true. They showed that repeated git pull invocations on a range of modern file systems caused throughput to decrease by factors of 2 to 30, even though the file systems were never more than 6% full:
Traditional file systems employ heuristics, such as collocating related files and data blocks, to avoid aging, and many file system implementors treat aging as a solved problem. ... However, this paper describes realistic as well as synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging. ... BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. ... We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.
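The mechanism behind this kind of aging can be shown with a toy allocator simulation (my own illustration, much cruder than the paper's experiments): after churn of small-file creates and deletes, first-fit can no longer find contiguous runs, so a new file is spread across many extents and sequential reads become seeks.

```python
import random

# Toy illustration of file-system aging via fragmentation.
N = 1000
free = [True] * N

def alloc(nblocks):
    """First-fit, block at a time; return the block numbers used."""
    blocks = []
    for i in range(N):
        if free[i]:
            free[i] = False
            blocks.append(i)
            if len(blocks) == nblocks:
                return blocks
    raise RuntimeError("full")

def extents(blocks):
    """Count contiguous runs in a block list."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)

rng = random.Random(42)
fresh = extents(alloc(50))             # young file system: one extent

files = [alloc(rng.randint(1, 4)) for _ in range(100)]
for f in rng.sample(files, 50):        # delete half, leaving holes
    for b in f:
        free[b] = True

aged = extents(alloc(50))              # aged: scattered across holes
print(fresh, aged)
```

Note the disk is nowhere near full throughout, matching the paper's point that low utilization doesn't prevent aging.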
Bharath Kumar Reddy Vangoor and co-authors from IBM Almaden and Stony Brook looked at the performance characteristics of a component of almost every approach to emulation for digital preservation in To FUSE or Not to FUSE: Performance of User-Space File Systems. From their abstract:
Nowadays, user-space file systems are often used to prototype and evaluate new approaches to file system design. Low performance is considered the main disadvantage of user-space file systems but the extent of this problem has never been explored systematically. ... In this paper we analyze the design and implementation of the most widely known user-space file system framework—FUSE—and characterize its performance for a wide range of workloads. ... Our experiments indicate that depending on the workload and hardware used, performance degradation caused by FUSE can be completely imperceptible or as high as –83% even when optimized; and relative CPU utilization can increase by 31%.
I haven't heard reports of FUSE causing performance problems in emulation systems, but this issue is something to keep an eye on as usage of these systems increases.

I really liked High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System by Harendra Kumar and co-authors from UW Madison and NetApp. If you're NetApp and have about 250K boxes in production at customers, all the common bugs have been found. The ones that haven't yet been found happen very rarely, but they can have a big impact not just on the customers, whose systems are mission-critical, but also within the company, because they are very hard to diagnose and replicate.

The paper described three defensive techniques that catch erroneous attempts to change metadata and panic the system before the metadata is corrupted. After this kind of panic there is no need for a lengthy fsck-like integrity check of the file system, so recovery is fast. Deployment of the two low-overhead defenses reduced the incidence of recoveries by a factor of three! Over the last five years 83 systems have been protected from 17 distinct bugs. This implies that the annual probability of one of these bugs occurring in a given system is about 0.007%, which gives you some idea of how rare they are.
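The 0.007% figure is just the back-of-the-envelope arithmetic from the numbers above, assuming the roughly 250K deployed systems mentioned earlier:

```python
# 83 systems protected over five years, out of ~250,000 deployed filers.
incidents = 83
systems = 250_000
years = 5

annual_prob = incidents / (systems * years)
print(f"{annual_prob:.3%}")   # 0.007%
```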

vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O by Ming Chen and co-authors from Stony Brook University, IBM Research-Almaden and Ward Melville High School (!) described a client-side library that exported a vectorized I/O interface allowing multiple operations on multiple files to be aggregated. The NFS 4.1 protocol supports aggregated (compound) operations, but they aren't much used because the POSIX I/O interface doesn't support aggregation. The result is that operations are serialized, and each operation takes a network round-trip latency. Applications that were ported to use the aggregating library showed really large performance improvements, up to two orders of magnitude, running against file systems exported by unmodified NFS 4.1 servers.
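A toy latency model shows why compounding wins. The API and the numbers here are invented for illustration; the point is that serialized POSIX-style calls pay a round trip each, while one compound request pays a single round trip:

```python
# Toy model of serialized vs compound NFS operations.
RTT_MS = 1.0            # assumed network round-trip time
SERVER_OP_MS = 0.05     # assumed per-operation server-side cost

def serialized(n_ops):
    """Each operation is a separate request: n round trips."""
    return n_ops * (RTT_MS + SERVER_OP_MS)

def compound(n_ops):
    """All operations batched into one request: one round trip."""
    return RTT_MS + n_ops * SERVER_OP_MS

n = 1000                          # e.g. getattr on 1000 files
speedup = serialized(n) / compound(n)
print(round(speedup, 1))
```

As the per-operation server cost shrinks relative to the round-trip time, the speedup approaches n, which is consistent with the up-to-two-orders-of-magnitude gains reported.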

The Work In Progress session featured some interesting talks:
  • On Fault Resilience of File System Checkers by Om Rameshwar Gatla and Mai Zheng of New Mexico State described fault injection in file system checkers. It was closely related to the paper by Ganesan et al on fault injection in distributed storage systems, so I discussed it in my post on that paper.
  • 6Stor: A Scalable and IPv6-centric Distributed Object Storage System by Ruty Guillaume et al discussed the idea of object storage using the huge address space of IPv6 to assign an IPv6 address to both each object's metadata and its data. At first sight, this idea seems nuts, but it sort of grows on you. Locating objects in distributed storage requires a process that is somewhat analogous to the network routing that is happening underneath.
  • Enhancing Lifetime and Performance of Non-Volatile Memories through Eliminating Duplicate Writes by Pengfei Zuo et al from Huazhong University of Science and Technology and Arizona State University made the interesting point that, just like hard disks, the fact that non-volatile memories like flash and SCMs retain data when power is removed means that they need encryption. This causes write amplification: modifying part of the plaintext causes the entire ciphertext to change.
  • A Simple Cache Prefetching Layer Based on Block Correlation by Juncheng Yang et al from Emory University showed how using temporal correlations among blocks to drive pre-fetching could improve cache hit rates.
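The block-correlation idea in the last talk can be sketched very simply: learn, for each block, which block most often follows it in the access stream, and prefetch that successor on a hit. This is my own minimal sketch of the concept, not the authors' design:

```python
from collections import Counter, defaultdict

# Toy block-correlation prefetcher: track successor frequencies and
# prefetch the most common successor of each accessed block.
successors = defaultdict(Counter)

def train(trace):
    """Record which block follows which in an access trace."""
    for a, b in zip(trace, trace[1:]):
        successors[a][b] += 1

def prefetch(block):
    """Return the most likely next block, or None if unseen."""
    if successors[block]:
        return successors[block].most_common(1)[0][0]
    return None

train([1, 2, 3, 1, 2, 4, 1, 2, 3])
print(prefetch(1), prefetch(2))   # 2 3
```

Prefetching the predicted successor turns a likely future miss into a hit, at the cost of some wasted bandwidth when the prediction is wrong.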


baldman said...

Thanks for your kind words on the WAFL papers. Some of us have finally found the time to write+submit papers describing a large amount of interesting and sometimes seminal (we think!) work in the file systems area. We hope to do more of that in the near future.

David. said...

It is starting to look as though the drives in our low-cost LOCKSS box prototype are running into the bimodal performance problem described by Aghayev et al. We are investigating and, if the diagnosis is correct, planning to try their ext4-lazy.