Tuesday, December 16, 2014

Hardware I/O Virtualization

At enterprisetech.com, Timothy Prickett Morgan has an interesting post entitled A Rare Peek Into The Massive Scale Of AWS. It is based on a talk by Amazon's James Hamilton at the re:Invent conference. Morgan's post provides a hierarchical, network-centric view of the AWS infrastructure:
  • Regions, 11 of them around the world, contain Availability Zones (AZ).
  • The 28 AZs are arranged so that each Region contains at least 2 and up to 6 datacenters.
  • Morgan estimates that there are close to 90 datacenters in total, each with 2000 racks, burning 25-30MW.
  • Each rack holds 25 to 40 servers.
AZs are no more than 2ms apart measured in network latency, allowing for synchronous replication. This means the AZs in a region are only a couple of kilometres apart, which is less geographic diversity than one might want, but a disaster still has to have a pretty big radius to take out more than one AZ. The datacenters in an AZ are not more than 250us apart in latency terms, close enough that a disaster might take all the datacenters in one AZ out.

Below the fold, some details and the connection between what Amazon is doing now, and what we did in the early days of NVIDIA.


Amazon uses custom-built hardware, including network hardware, and their own network software. Doing so is simpler and more efficient than generic hardware and software because they only need to support a very restricted set of configurations and services. In particular they build their own network interface cards (NICs). The reason is particularly interesting to me, as it is to solve exactly the same problem that we faced as we started NVIDIA more than two decades ago.

The state-of-the-art of PC games, and thus PC graphics, were based on Windows, at that stage little more than a library on top of MS-DOS. The game was the only application running on the hardware. It didn't have to share the hardware with, and thus need the operating system (OS) to protect it from, any other application. Coming from the Unix world we knew how the OS shared access to physical hardware devices, such as the graphics chip, among multiple processes while protecting them (and the operating system) from each other. Processes didn't access the devices directly, they made system calls which invoked device driver code in the OS kernel that accessed the physical hardware on their behalf.

We understood that Windows would have to evolve into a multi-process OS with real inter-process protection. Our problem, like Amazon's, was two-fold; latency and the variance of latency. If the games were to provide arcade performance on mid-90s PCs, there was no way the game software could take the overhead of calling into the OS to perform graphics operations on its behalf. It had to talk directly to the graphics chip, not via a driver in the OS kernel.

If there would have been only a single process, such as the X server, doing graphics this would not have been a problem. Using the Memory Management Unit (MMU), the hardware provided to mediate access of multiple processes to memory, the OS could have mapped the graphic chip's IO registers into that process' address space. That process could access the graphics chip with no OS overhead. Other processes would have to use inter-process communications to request graphics operations, as X clients do.

SEGA's Virtua Fighter on NV1
Because we expected there to be many applications simultaneously doing graphics, and they all needed low, stable latency, we needed to make it possible for the OS safely to map the chip's registers into multiple processes at one time. We devoted a lot of the first NVIDIA chip to implementing what looked to the application like 128 independent sets of I/O registers. The OS could map one of the sets into a process' address space, allowing it to do graphics by writing directly to these hardware registers. The technical name for this is hardware I/O virtualization; we pioneered this technology in the PC space. It provided the very low latency that permitted arcade performance on the PC, despite other processes doing graphics at the same time. And because the competition between the multiple process' accesses to their virtual I/O resources was mediated on-chip as it mapped the accesses to the real underlying resources, it provided very stable latency without the disruptive long tail that degrades the user experience.

Amazon's problem was that, like PCs running multiple graphics applications on one real graphics card, they run many virtual machines (VMs) on each real server. These VMs have to share access to the physical network interface card (NIC). Mediating this in software in the hypervisor imposes both overhead and variance. Their answer was enhanced NICs:
The network interface cards support Single Root I/O Virtualization (SR-IOV), which is an extension to the PCI-Express protocol that allows the resources on a physical network device to be virtualized. SR-IOV gets around the normal software stack running in the operating system and its network drivers and the hypervisor layer that they sit on. It takes milliseconds to wade down through this software from the application to the network card. It only takes microseconds to get through the network card itself, and it takes nanoseconds to traverse the light pipes out to another network interface in another server. “This is another way of saying that the only thing that matters is the software latency at either end,” explained Hamilton. SR-IOV is much lighter weight and gives each guest partition on a virtual machine its own virtual network interface card, which rides on the physical card.
This, as shown on Hamilton's graph, provides much less variance in latency:
The new network, after it was virtualized and pumped up, showed about a 2X drop in latency compared to the old network at the 50th percentile for latency on data transmissions, and at the 99.9th percentile the latency dropped by about a factor of 10X.
The importance of reducing the variance of latency for Web services at Amazon scale is detailed in a fascinating, must-read paper, The Tail At Scale by Dean and Barroso.

Amazon had essentially the same problem we had, and came up with the same basic hardware solution - hardware I/O virtualization.

1 comment:

David. said...

In the first of a series Rich Miller at Data Center Frontier adds a little to the Timothy Prickett Morgan post.