- Regions, 11 of them around the world, contain Availability Zones (AZs).
- The 28 AZs are arranged so that each Region contains at least 2 and up to 6 datacenters.
- Morgan estimates that there are close to 90 datacenters in total, each with 2000 racks, burning 25-30MW.
- Each rack holds 25 to 40 servers.
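Multiplying these estimates out gives a rough sense of scale. The sketch below is only back-of-envelope arithmetic on the figures quoted above, treating "close to 90" as exactly 90; the function name is mine, not Morgan's.

```python
# Back-of-envelope server count from the estimates quoted above.
DATACENTERS = 90              # "close to 90", taken as 90 for illustration
RACKS_PER_DATACENTER = 2000
SERVERS_PER_RACK = (25, 40)   # low and high ends of the quoted range


def server_range(datacenters, racks, servers_per_rack):
    """Return (low, high) total server counts for the given estimates."""
    low, high = servers_per_rack
    total_racks = datacenters * racks
    return total_racks * low, total_racks * high


low, high = server_range(DATACENTERS, RACKS_PER_DATACENTER, SERVERS_PER_RACK)
print(f"{low:,} to {high:,} servers")
```

On these assumptions the total comes out between 4.5 and 7.2 million servers.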
Below the fold are some details, and the connection between what Amazon is doing now and what we did in the early days of NVIDIA.
Amazon uses custom-built hardware, including network hardware, and their own network software. Doing so is simpler and more efficient than using generic hardware and software because they only need to support a very restricted set of configurations and services. In particular they build their own network interface cards (NICs). The reason is particularly interesting to me, because they are solving exactly the same problem that we faced when we started NVIDIA more than two decades ago.
The state of the art in PC games, and thus PC graphics, was based on Windows, which at that stage was little more than a library on top of MS-DOS. The game was the only application running on the hardware. It didn't have to share the hardware with any other application, and thus didn't need the operating system (OS) to protect it from one. Coming from the Unix world, we knew how the OS shared access to physical hardware devices, such as the graphics chip, among multiple processes while protecting them (and the operating system) from each other. Processes didn't access the devices directly. Instead, they made system calls which invoked device driver code in the OS kernel that accessed the physical hardware on their behalf.
We understood that Windows would have to evolve into a multi-process OS with real inter-process protection. Our problem, like Amazon's, was two-fold: latency and the variance of latency. If the games were to provide arcade performance on mid-90s PCs, there was no way the game software could take the overhead of calling into the OS to perform graphics operations on its behalf. It had to talk directly to the graphics chip, not via a driver in the OS kernel.
If there had been only a single process doing graphics, such as the X server, this would not have been a problem. Using the Memory Management Unit (MMU), the hardware provided to mediate access by multiple processes to memory, the OS could have mapped the graphics chip's I/O registers into that process's address space. That process could then access the graphics chip with no OS overhead. Other processes would have to use inter-process communication to request graphics operations, as X clients do.
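To make the mapping idea concrete, here is a minimal sketch in Python. It is not how any real graphics driver works: an ordinary temporary file stands in for the device's register window, and the window size and register offset are hypothetical values chosen for illustration. The point is only that once a region is mapped into the process's address space, reading and writing it are plain memory accesses rather than calls into the kernel.

```python
import mmap
import struct
import tempfile

REGISTER_BYTES = 4096   # hypothetical size of the register window (one page)
CONTROL_OFFSET = 0x10   # hypothetical register offset, for illustration only


def map_registers(backing_file):
    """Map the stand-in register window into this process's address space."""
    return mmap.mmap(backing_file.fileno(), REGISTER_BYTES)


def write_register(regs, offset, value):
    # A store into mapped memory; the kernel is not entered for this access.
    struct.pack_into("<I", regs, offset, value)


def read_register(regs, offset):
    # A load from mapped memory, decoded as a little-endian 32-bit word.
    return struct.unpack_from("<I", regs, offset)[0]


with tempfile.TemporaryFile() as f:
    f.truncate(REGISTER_BYTES)   # ordinary file standing in for a device
    regs = map_registers(f)
    write_register(regs, CONTROL_OFFSET, 0xDEADBEEF)
    value = read_register(regs, CONTROL_OFFSET)
    regs.close()

print(hex(value))
```

With a real device the OS would map the chip's physical register addresses rather than a file, and it would do so only for the one process trusted to drive the hardware.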
SEGA's Virtua Fighter on NV1
Amazon's problem was that, like PCs running multiple graphics applications on one real graphics card, they run many virtual machines (VMs) on each real server. These VMs have to share access to the physical network interface card (NIC). Mediating this in software in the hypervisor imposes both overhead and variance. Their answer was enhanced NICs:
The network interface cards support Single Root I/O Virtualization (SR-IOV), which is an extension to the PCI-Express protocol that allows the resources on a physical network device to be virtualized. SR-IOV gets around the normal software stack running in the operating system and its network drivers and the hypervisor layer that they sit on. It takes milliseconds to wade down through this software from the application to the network card. It only takes microseconds to get through the network card itself, and it takes nanoseconds to traverse the light pipes out to another network interface in another server. “This is another way of saying that the only thing that matters is the software latency at either end,” explained Hamilton. SR-IOV is much lighter weight and gives each guest partition on a virtual machine its own virtual network interface card, which rides on the physical card.
The new network, after it was virtualized and pumped up, showed about a 2X drop in latency compared to the old network at the 50th percentile for latency on data transmissions, and at the 99.9th percentile the latency dropped by about a factor of 10X.
The importance of reducing the variance of latency for Web services at Amazon scale is detailed in a fascinating, must-read paper, The Tail At Scale by Dean and Barroso.
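The core arithmetic behind that paper's argument is easy to reproduce. The sketch below assumes, as a simplification, that each server independently has a fixed chance of responding slowly, and that a request fanned out to many servers is only as fast as the slowest of them. The function name and the 1% figure are illustrative choices.

```python
def prob_any_slow(fanout, per_server_slow_prob):
    """Chance that at least one of `fanout` independent servers is slow."""
    return 1.0 - (1.0 - per_server_slow_prob) ** fanout


# Each server is slow on 1% of requests, i.e. beyond its 99th percentile.
for fanout in (1, 10, 100):
    p = prob_any_slow(fanout, 0.01)
    print(f"fan-out {fanout:>3}: {p:.1%} of requests hit a slow server")
```

With a fan-out of 100, roughly 63% of requests wait on at least one slow server, which is why a 10X improvement at the 99.9th percentile matters far more than the 2X improvement at the median.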
Amazon had essentially the same problem we had, and came up with the same basic solution: hardware I/O virtualization.