The first generation of Warehouse-Scale Computers (WSC) built everything from commercial off-the-shelf (COTS) components: computers, switches, and racks. The second generation, which is being deployed today, uses custom computers, custom switches, and even custom racks, albeit all built using COTS chips. We believe the third generation of WSC in 2020 will be built from custom chips. If WSC architects are free to design custom chips, what should they do differently?

There is much to think about in the talk, which stands out because it treats the entire stack, from hardware to applications, in a holistic way. It is well worth your time to read the slides and watch the video. Below the fold, I have some comments.
The Aspire project's concept is that the third generation of warehouse computing would consist of "Fireboxes":
Firebox is a 50kW WSC building block containing a thousand compute sockets and 100 Petabytes (2^57B) of non-volatile memory connected via a low-latency, high-bandwidth optical switch. We expect a 2020 WSC to be composed of 200 to 400 FireBoxes instead of 20,000 to 40,000 servers, thereby reducing management overhead. Each compute socket contains a System-on-a-Chip (SoC) with around 100 cores connected to high-bandwidth on-package DRAM. Fast SoC network interfaces reduce the software overhead of communicating between application services, and high-radix network backplane switches connected by Terabit/sec optical fibers reduce the network's contribution to tail latency. The very large non-volatile store directly supports in-memory databases, and pervasive encryption ensures that data is always protected in transit and in storage.

The two key goals of this architecture are to reduce power consumption and to reduce tail latency. Low tail latency is one major reason why Google beat Alta Vista to become the dominant search engine. Alta Vista's centralized architecture often delivered results faster than Google's, but was sometimes much slower. Google's front-ends fanned each query out to their distributed architecture, waited a fixed time to collect results, and delivered whatever results they had at that predictable time. The proportion of Google's searches that took noticeably longer was insignificant. Alta Vista's architecture meant that a failure caused a delay; Google's meant that a failure made the search result slightly worse. Even a small proportion of noticeable delays is perceived as a much worse user interface. Krste cites (slide 11) a fascinating paper on this topic, The Tail At Scale by Dean and Barroso, which will repay careful reading. They describe some remarkably effective techniques:
For example, in a Google benchmark that reads the values for 1,000 keys stored in a BigTable table distributed across 100 different servers, sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests.

The benefits of this architecture are considerable, but so is the cost of designing, testing and fabricating the custom silicon. But with each data center costing upwards of $100M, and the need for at least three geographically separate centers, diverting some of that cost to custom silicon would make economic sense even if the benefit were fairly small. The project will release an open-source chip generator to reduce the design cost.
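Google's fan-out-with-a-fixed-deadline approach described above can be sketched in a few lines. This is a hypothetical illustration, not Google's actual code: the shard count, deadline, and simulated latencies are all invented for the sketch.

```python
# Sketch of fan-out with a fixed deadline: query every shard in parallel,
# and at the deadline return whatever results have arrived, ignoring stragglers.
import concurrent.futures as cf
import random
import time

def query_shard(shard_id, query):
    # Simulated shard lookup: mostly fast, occasionally a straggler
    # (the latency numbers here are invented for illustration).
    time.sleep(random.choice([0.01] * 19 + [0.5]))
    return ["%s@shard%d" % (query, shard_id)]

def fan_out(query, n_shards=20, deadline_s=0.05):
    pool = cf.ThreadPoolExecutor(max_workers=n_shards)
    futures = [pool.submit(query_shard, i, query) for i in range(n_shards)]
    done, _not_done = cf.wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False, cancel_futures=True)  # don't wait for stragglers
    results = []
    for f in done:
        results.extend(f.result())
    # Possibly slightly incomplete, but delivered at a predictable latency.
    return results
```

The design choice is exactly the trade Google made against Alta Vista: a slow shard degrades result quality slightly instead of delaying the whole response.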
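The hedged-request technique in the quoted benchmark can likewise be sketched. Again this is a minimal, hypothetical Python illustration of the idea from The Tail At Scale, with invented replica latencies; it is not the paper's or BigTable's implementation.

```python
# Sketch of a hedged request: ask the primary replica; if it has not answered
# within a short delay, send the same request to a backup and take whichever
# reply arrives first.
import concurrent.futures as cf
import random
import time

def read_replica(replica, key):
    # Simulated replica read: usually fast, occasionally a straggler
    # (the latency numbers here are invented for illustration).
    time.sleep(random.choice([0.005] * 49 + [0.3]))
    return "value-of-%s@replica%d" % (key, replica)

def hedged_read(key, hedge_after_s=0.010):
    pool = cf.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(read_replica, 0, key)]
    done, _ = cf.wait(futures, timeout=hedge_after_s)
    if not done:  # primary is slow: hedge to a backup replica
        futures.append(pool.submit(read_replica, 1, key))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False, cancel_futures=True)
    return next(iter(done)).result()
```

Because the hedge fires only when the primary is already slow, the extra load is a small percentage of requests, which is how the benchmark cuts tail latency so sharply while sending just 2% more traffic.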
Two other points from the talk are noteworthy:
- Slide 9, based on data covering 50,000 real-life IT environments and software projects since 1985, is a great illustration that in software, small projects succeed cheaply but large projects fail expensively. The system design therefore assumes a service-oriented software architecture, enabling a large project to be implemented as a loosely-coupled set of small projects.
- This picture, from slide 21, shows that the architecture takes a realistic view of future chip technology. Krste said that, right now, we probably have the cheapest transistors we are ever going to have. Scaling down further is possible, but is becoming expensive. Future architectures cannot count on solving their problems by throwing more, ever-cheaper transistors at them.