Tuesday, July 9, 2024

Engineering For The Long Term

Content Warning: this post contains blatant self-promotion.

Contributions to engineering fields can only reasonably be assessed in hindsight, by looking at how they survived exposure to the real world over the long term. Four of my contributions to various systems have stood the test of time. Below the fold, I blow my own horn four times.

Four Decades

X11R1 on a Sun/1
Wikipedia has a pretty good history of the early days of the X Window System. In The X Window System At 40 I detailed my contributions to the early development of X. To my amazement, 40 years after Bob Scheifler's initial release, it is still valiantly resisting obsolescence. I contributed to the design, implementation, testing, release engineering and documentation of X11 starting a bit over 39 years ago. At least my design for how X handles keyboards is still the way it works.

All this while I was also working on a competitor, Sun's NeWS — which didn't survive the test of time.

Nearly Three-and-a-Half Decades

One of the things I really enjoyed about working on NeWS was that the PostScript environment it implemented was object-oriented, a legacy of PostScript's origins at Xerox PARC. Owen Densmore and I developed A User‐Interface Toolkit in Object‐Oriented PostScript that made developing NeWS applications very easy, provided you were comfortable with an object-oriented programming paradigm.

I think it was sometime in 1988 while working on the SunOS 4.0 kernel that I realized that the BSD Vnode interface was in a loose sense object-oriented. It defines the interface between the file system and the rest of the kernel. An instance of BSD's type vnode consisted of some instance data and a pointer to an "ops vector" that defined its class via an array of methods (function pointers). But it wasn't object-oriented enough to, for example, implement inheritance properly.
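To make the "instance data plus ops vector" shape concrete, here is a minimal sketch in Python rather than kernel C. Every name in it (Vnode, UFS_OPS, ROFS_OPS, vop) is hypothetical and chosen for illustration; none is taken from the BSD or SunOS sources. The last few lines show one way to read the inheritance limitation: with a bare table of function pointers, deriving a variant means copying the table by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical illustration: a vnode is some instance data plus a reference
# to a shared "ops vector" (a table of functions) that acts as its class.
OpsVector = Dict[str, Callable]


@dataclass
class Vnode:
    data: dict       # per-instance state
    ops: OpsVector   # shared table of "methods"


def ufs_read(vn: Vnode) -> bytes:
    return vn.data["contents"]


def ufs_write(vn: Vnode, buf: bytes) -> int:
    vn.data["contents"] = buf
    return len(buf)


UFS_OPS: OpsVector = {"read": ufs_read, "write": ufs_write}


def vop(vn: Vnode, name: str, *args):
    """Dispatch an operation through the vnode's ops vector."""
    return vn.ops[name](vn, *args)


def rofs_write(vn: Vnode, buf: bytes) -> int:
    raise PermissionError("read-only file system")


# No built-in subclassing: a variant is made by copying the table and
# overwriting entries, which is an assumption of this sketch.
ROFS_OPS: OpsVector = dict(UFS_OPS, write=rofs_write)
```

The rest of the kernel would only ever call vop(); which file system answers depends on the table the vnode points at.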

This flaw had led to some inelegancies as the interface had evolved through time, but what interested me more were the potential applications that would be unleashed if the interface could be made properly object-oriented. Instead of being implemented from scratch, file systems could be implemented by sub-classing other file systems. For example, a read-only file system such as a CD-ROM could be made writable by "stacking" a cache file system on top, as shown in Figure 11. I immediately saw the possibility of significant improvements in system administration that could flow from stacking file systems.

Evolving the Vnode Interface: Fig. 11
I started building a prototype by performing major surgery on a copy of the code that would become SunOS 4.1. By late 1989 it worked well enough to demonstrate the potential of the idea, so I published 1990's Evolving the Vnode Interface. The paper describes a number of Vnode modules that can be stacked together to implement interesting functions. Among them was cache-fs, which layered a writable local file system above a local or remote read-only file system:
This simple module can use any file system as a file-level cache for any other (read-only) file system. It has no knowledge of the file systems it is using; it sees them only via their opaque vnodes. Figure 11 shows it using a local writable ufs file system to cache a remote read-only NFS file system, thereby reducing the load on the server. Another possible configuration would be to use a local writable ufs file system to cache a CD-ROM, obscuring the speed penalty of CD.
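The stacking described in that passage can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the code from the paper: MemFS and CacheFS are hypothetical names, files are whole byte strings, and sending writes to the front layer only is a choice made for this sketch. The point it illustrates is that the cache layer sees both file systems only through the same small interface.

```python
class MemFS:
    """Toy file system: path -> bytes. Stands in for ufs, NFS or a CD-ROM."""

    def __init__(self, files=None, read_only=False):
        self.files = dict(files or {})
        self.read_only = read_only
        self.reads = 0  # counts reads, to show the load on this layer

    def read(self, path):
        self.reads += 1
        return self.files[path]  # raises KeyError if the file is absent

    def write(self, path, data):
        if self.read_only:
            raise PermissionError(path)
        self.files[path] = data


class CacheFS:
    """Stacks a writable 'front' file system over a read-only 'back' one."""

    def __init__(self, front, back):
        self.front = front
        self.back = back

    def read(self, path):
        try:
            return self.front.read(path)   # cache hit
        except KeyError:
            data = self.back.read(path)    # miss: fetch from the layer below
            self.front.write(path, data)   # populate the file-level cache
            return data

    def write(self, path, data):
        # Assumption of this sketch: writes never reach the back layer.
        self.front.write(path, data)
```

For example, CacheFS(MemFS(), MemFS({"/etc/motd": b"hi"}, read_only=True)) reads /etc/motd from the back layer once and serves later reads from the front layer.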
Over the next quarter-century the idea of stacking vnodes and the related idea of "union mounts" from Rob Pike and Plan 9 churned around until, in October 2014, Linus Torvalds added overlayfs to the 3.18 kernel. I covered the details of this history in 2015's It takes longer than it takes. In it I quoted from Valerie Aurora's excellent series of articles about the architectural and implementation difficulties involved in adding union mounts to the Linux kernel. I concurred with her statement that:
The consensus at the 2009 Linux file systems workshop was that stackable file systems are conceptually elegant, but difficult or impossible to implement in a maintainable manner with the current VFS structure. My own experience writing a stacked file system (an in-kernel chunkfs prototype) leads me to agree with these criticisms.
I wrote:
Note that my original paper was only incidentally about union mounts, it was a critique of the then-current VFS structure, and a suggestion that stackable vnodes might be a better way to go. It was such a seductive suggestion that it took nearly two decades to refute it!
Nevertheless, the example I used in Evolving the Vnode Interface of a use for stacking vnodes was what persisted. It took a while for the fact that overlayfs was an official part of the Linux kernel to percolate through the ecosystem, but after six years I was able to write Blatant Self-Promotion about the transformation it wrought on Linux's packaging and software distribution, inspired by Liam Proven's NixOS and the changing face of Linux operating systems. He writes about less radical ideas than NixOS:
So, instead of re-architecting the way distros are built, vendors are reimplementing similar functionality using simpler tools inherited from the server world: containers, squashfs filesystems inside single files, and, for distros that have them, copy-on-write filesystems to provide rollback functionality.

The goal is to build operating systems as robust as mobile OSes: periodically, the vendor ships a thoroughly tested and integrated image which end users can't change and don't need to. In normal use, the root filesystem is mounted read-only, and there's no package manager.
Since then this model has become universal. Distros ship as a bootable ISO image, which uses overlayfs to mount a writable temporary file system on top. This is precisely how my 1989 prototype was intended to ship SunOS 4.1. The technology has spread to individual applications with systems such as Snaps and Flatpak.

Three Decades

The opportunity we saw when we started Nvidia was that the PC was transitioning from the ISA bus to version 1 of the PCI bus. The ISA bus' bandwidth was completely inadequate for 3D games, but the PCI bus had considerably more. Whether it was enough was an open question. We clearly needed to make the best possible use of the limited bandwidth we could get.

Nvidia's first chip had three key innovations:
  1. Rendering objects with quadric patches not triangles. A realistic model using quadric patches needed perhaps a fifth of the data for an equivalent triangle model.
  2. I/O virtualization with applications using a write-mostly, object-oriented interface. Read operations are necessarily synchronous, whereas write operations are asynchronous. Thus the more writes per read across the bus, the better the utilization of the available bus bandwidth.
  3. A modular internal architecture based on an on-chip token-ring network. The goal was that each functional unit be simple enough to be designed and tested by a three-person team.
SEGA's Virtua Fighter on NV1
The first two of these enabled us to get Sega's arcade games running at full frame rate on a PC. Curtis Priem and I designed the second of these, and it is the one that has lasted:
  • I/O virtualization allowed multiple processes direct access to the graphics hardware, with no need to pass operations through the operating system. I explained the importance of this a decade ago in Hardware I/O Virtualization, using the example of Amazon building their own network interface cards. The first chip appeared on the bus as having 128 wide FIFOs. The operating system could map one of them into each process wanting access to the chip, allowing applications direct access to the hardware but under the control of the operating system.
  • The interface was write-mostly because the application could read from the FIFO the number of free slots, that is, the number of writes it could issue before the bus would stall, and then issue that many writes without reading again.
  • The interface was object-oriented because the data and the offset in the FIFO formed an invocation of a method on an instance of a (virtual) class. Some classes were implemented in hardware, others trapped into the kernel and were implemented by the driver, but the application just created and used instances of the available classes without knowing which was which. The classes were arranged in a hierarchy starting with class CLASS. Enumerating the instances of class CLASS told the application which classes it could use. Enumerating the instances of each of those classes told the application how many of each type of resource it could use.
The importance of the last of these was that it decoupled the hardware and software release schedules. Drivers could emulate classes that had yet to appear in hardware; applications would use the hardware once it was available. Old software would run on newer hardware; it would just see some classes it didn't know how to use. One of our frustrations with Sun was the way software and hardware release schedules were inextricably interlinked.
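The three properties in the list above can be modelled together in a short Python sketch. It is a toy under loud assumptions, not the chip's real interface: the class name RECTANGLE, the offsets 0x0 and 0x4, the FIFO depth and the names Device, Fifo and submit are all invented for illustration. It shows one status read buying a batch of writes, a write at an offset acting as a method call, and an application running unchanged whether a class lives in hardware or is emulated by the driver.

```python
from collections import deque


class Rectangle:
    """Hypothetical drawing class; offsets and method names are invented."""

    def __init__(self):
        self.color = None
        self.drawn = []

    def set_color(self, value):
        self.color = value

    def draw(self, size):
        self.drawn.append((self.color, size))

    # Offset within the FIFO -> method invoked by a write at that offset.
    METHODS = {0x0: set_color, 0x4: draw}


class Fifo:
    """Toy model of one FIFO mapped into a process."""

    def __init__(self, depth, target):
        self.depth = depth
        self.target = target   # instance whose methods the offsets select
        self.queue = deque()
        self.reads = 0
        self.writes = 0

    def free_slots(self):
        """The read: how many writes fit before the bus would stall."""
        self.reads += 1
        return self.depth - len(self.queue)

    def write(self, offset, data):
        """A write is a method invocation: offset picks the method."""
        if len(self.queue) >= self.depth:
            raise BufferError("FIFO full: the bus would stall")
        self.writes += 1
        self.queue.append((offset, data))

    def drain(self):
        """Stand-in for the chip consuming queued writes asynchronously."""
        while self.queue:
            offset, data = self.queue.popleft()
            self.target.METHODS[offset](self.target, data)


class Device:
    """Toy chip plus driver; the application cannot tell them apart."""

    def __init__(self, depth, in_hardware, in_driver):
        self.depth = depth
        self.in_hardware = dict(in_hardware)  # class name -> implementation
        self.in_driver = dict(in_driver)      # emulated until hardware exists

    def classes(self):
        """Stand-in for enumerating the instances of class CLASS."""
        return sorted(set(self.in_hardware) | set(self.in_driver))

    def open(self, class_name):
        """Create an instance and return a FIFO bound to it."""
        cls = self.in_hardware.get(class_name) or self.in_driver[class_name]
        return Fifo(self.depth, cls())


def submit(fifo, commands):
    """Send commands using one status read per batch, not one per write."""
    pending = list(commands)
    while pending:
        n = fifo.free_slots()
        if n == 0:
            fifo.drain()   # stand-in for waiting for the hardware
            continue
        for offset, data in pending[:n]:
            fifo.write(offset, data)
        pending = pending[n:]
    fifo.drain()
```

The same submit() call works against Device(4, {}, {"RECTANGLE": Rectangle}), where the driver emulates the class, and against Device(4, {"RECTANGLE": Rectangle}, {}), where the hardware implements it.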

Two-and-a-Half Decades

Last October I celebrated the LOCKSS Program Turns 25. Vicky Reich explained to me how libraries preserved paper academic journals and how their move to the Web was changing libraries' role from purchasing a copy of the journal to renting access to the publisher's copy, and I came up with the overall peer-to-peer architecture (and the acronym). With help from Mark Seiden I built the prototype, and after using it to demonstrate the feasibility of the concept, also used it to show vulnerabilities in the initial protocol. In 2003 I was part of the team that solved these problems, for which we were awarded the Best Paper award at the Symposium on Operating Systems Principles for Preserving peer replicas by rate-limited sampled voting.

4 comments:

Geoff said...

Cachefs plus immutable distributions.... that was Pravda, right?

David. said...

Right, that was my name for the project. The point being to establish the official "party line" and identify any deviations from it.

David. said...

Andreas Spies has a useful video explaining how you can greatly extend the life of SD cards with the Raspbian OS for the Pi with two clicks. Raspbian implements a setup option to layer overlayfs over the SD card, thus avoiding writes to it.

David. said...

Enable and disable overlayfs from the command line with

sudo raspi-config nonint enable_overlayfs

or

sudo raspi-config nonint disable_overlayfs