Saturday, February 7, 2015

It takes longer than it takes

I hope it is permissible to blow my own horn on my own blog. Two concepts recently received official blessing after a good long while, for one of which I'm responsible, and for the other of which I'm partly responsible. The mysteries are revealed below the fold.



The British Parliament is celebrating the 800th anniversary of Magna Carta:
On Thursday 5 February 2015, the four surviving original copies of Magna Carta were displayed in the Houses of Parliament – bringing together the documents that established the principle of the rule of law in the place where law is made in the UK today.  
The closing speech of the ceremony in the House of Lords was given by Sir Tim Berners-Lee, who is reported to have said:

I invented the acronym LOCKSS more than a decade and a half ago. Thank you, Sir Tim!

On October 24, 2014 Linus Torvalds added overlayfs to release 3.18 of the Linux kernel. Various Linux distributions have implemented various versions of overlayfs for some time, but now it is an official part of Linux. Overlayfs is a simplified implementation of union mounts, which allow a set of file systems to be superimposed on a single mount point. This is useful in many ways, for example to make a read-only file system such as a CD-ROM appear to be writable by mounting a read-write file system "on top" of it.
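The superimposing behavior described above can be sketched with a toy model. This is only an illustration of the semantics, not how overlayfs or any kernel implements them: the class name `UnionMount`, its methods, and the use of plain dictionaries as "file systems" are all hypothetical. It assumes just two layers, a read-only lower one (the CD-ROM) and a writable upper one.

```python
class UnionMount:
    """Toy two-layer union: a writable upper dict over a read-only lower dict.

    Hypothetical model for illustration only; real union mounts deal with
    directories, permissions, renames and much else that is ignored here.
    """

    _WHITEOUT = object()  # marker in the upper layer that hides a lower name

    def __init__(self, lower):
        self.lower = dict(lower)  # treated as read-only from here on
        self.upper = {}           # every modification lands here

    def read(self, path):
        # The upper layer wins; fall through to the lower layer otherwise.
        if path in self.upper:
            value = self.upper[path]
            if value is self._WHITEOUT:
                raise FileNotFoundError(path)
            return value
        if path in self.lower:
            return self.lower[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch the lower layer, so it can stay read-only.
        self.upper[path] = data

    def remove(self, path):
        self.read(path)  # raises FileNotFoundError if the name is not visible
        if path in self.lower:
            # The lower copy cannot be deleted, so hide it instead.
            self.upper[path] = self._WHITEOUT
        else:
            del self.upper[path]

    def listdir(self):
        # Merge the names from both layers, dropping anything whited out.
        names = set(self.lower) | set(self.upper)
        return sorted(n for n in names if self.upper.get(n) is not self._WHITEOUT)


# Usage: the "CD-ROM" appears writable, yet its contents never change.
cdrom = {"README": "v1", "bin/sh": "shell"}
merged = UnionMount(cdrom)
merged.write("README", "v2")
merged.remove("bin/sh")
```

After these calls the merged view contains only the new `README`, while the lower layer still holds both original entries.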

Other Unix-like systems have had union mounts for a long time. BSD systems first implemented them in 4.4BSD-Lite two decades ago. The concept traces back five years earlier to my paper for the Summer 1990 USENIX Conference, "Evolving the Vnode Interface", which describes a prototype implementation of "stackable vnodes". Among other things, it could implement union mounts, as shown in the paper's Figure 10.
This use of stackable vnodes was in part inspired by work at Sun two years earlier on the Translucent File Service, a user-level NFS service by David Hendricks that implemented a restricted version of union mounts. All I did was prototype the concept, and like many of my prototypes it served mainly to discover that the problem was harder than I initially thought. It took others another five years to deploy it in SunOS and BSD. By far the most elegant and sophisticated implementation, from around the same time, was by Rob Pike and the Plan 9 team, who weren't hamstrung by legacy code and semantics. Instead of being a bolt-on addition, union mounting was fundamental to the way Plan 9 worked.

About five years later Erez Zadok at Stony Brook led the FiST project, a major development of stackable file systems including two successive major releases of unionfs, a unioning file system for Linux.

About the same time I tried to use OpenBSD's implementation of union mounts early in the boot sequence to construct the root directory by mounting a RAM file system over a read-only root file system on a CD, but gave up on encountering deadlocks.

In 2009 Valerie Aurora published a truly excellent series of articles going into great detail about the difficult architectural and implementation issues that arise when implementing union mounts in Unix kernels. It includes the following statement, with which I concur:
The consensus at the 2009 Linux file systems workshop was that stackable file systems are conceptually elegant, but difficult or impossible to implement in a maintainable manner with the current VFS structure. My own experience writing a stacked file system (an in-kernel chunkfs prototype) leads me to agree with these criticisms.
Note that my original paper was only incidentally about union mounts; it was a critique of the then-current VFS structure, and a suggestion that stackable vnodes might be a better way to go. It was such a seductive suggestion that it took nearly two decades to refute it! My apologies for pointing down a blind alley.

The overlayfs implementation in 3.18 is minimal:
Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer.
But given the architectural issues, doing one thing really well has a lot to recommend it over doing many things fairly well. This is, after all, the use case from my paper.
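For readers who want to try this one use case, here is a minimal sketch of how the mount invocation might be assembled. The helper name `overlay_mount_argv` and all four paths are hypothetical. It assumes the file system type name `overlay` and the `lowerdir`, `upperdir` and `workdir` options as described in the kernel's overlayfs documentation, where the work directory has to be an empty directory on the same file system as the upper one.

```python
def overlay_mount_argv(lower, upper, work, target):
    """Build the argument list for mounting `upper` over `lower` at `target`.

    Hypothetical helper for illustration. It does no escaping, so paths
    containing commas would break the option string.
    """
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]


# Example with made-up paths: a read-only root with a RAM file system on top.
argv = overlay_mount_argv("/ro", "/rw/upper", "/rw/work", "/merged")
print(" ".join(argv))
```

The list could be handed to `subprocess.run(argv, check=True)` as root. Building it separately keeps the sketch runnable without touching the system.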

It took a quarter of a century, but the idea has finally been accepted. And, even though I had to build a custom 3.18 kernel to do so, I am using it on a Raspberry Pi serving as part of the CLOCKSS Archive.

Thank you, Linus! And everyone else who worked on the idea during all that time!


2 comments:

David. said...

Kirk McKusick's fascinating FAST15 keynote "A Brief History of the BSD Fast File System" corrects my memory. BSD acquired stacking vnodes in 1987, according to Kirk after discussions with me. I think Steve Kleiman was responsible for much of the content of those discussions. For sure, Kirk's implementation of stacking in BSD was much cleaner than my prototype in SunOS.

David. said...

Tim Anderson reports that systemd has taken up overlayfs:

"As developer Lennart Poettering explained: "When a system extension image is activated, its /usr/ and /opt/ hierarchies and os-release information are combined via overlayfs with the file system hierarchy of the host OS."

The primary use case for system extension images is for immutable operating systems like Red Hat's Silverblue and Kinoite. In these OSes, the file system is read-only and is updated by replacing it with a new image rather than being patched, which is better both for security and stability.

It does cause compatibility issues for applications that need updated system files, and is difficult for developers who need more flexibility. Typically, these problems are overcome by running virtual machines or containers, but system extension images let users and developers update or add system files without actually modifying the immutable file system."