So far this year I've attended two talks that were really revelatory: Krste Asanović's keynote at FAST 13, which I blogged about earlier, and Kestutis Patiejunas' talk about Facebook's cold storage systems. Unfortunately, Kestutis' talk was off-the-record, so I couldn't blog about it at the time. But he just gave a shorter version at the Library of Congress' Designing Storage Architectures workshop, so now I can blog about this fascinating and important system. Below the fold, the details.
The initial response to Facebook's announcement of their prototype Blu-ray cold storage system focused on the 50-year life of the disks, but it turns out that this isn't the interesting part of the story. Facebook's problem is that they have a huge flow of data that is accessed rarely but needs to be kept for the long term at the lowest possible cost. They need to add bottom tiers to their storage hierarchy to do this.
The first tier they added to the bottom of the hierarchy stored the data on mostly powered-down hard drives. Some time ago a technology called MAID (Massive Array of Idle Drives) was introduced but didn't succeed in the market. The idea was that by putting a large cache in front of the disk array, most of the drives could be spun down to reduce the average power draw. MAID did reduce the average power draw, at the cost of some delay from cache misses, but in practice the proportion of drives that were spun down wasn't as great as expected, so the average power reduction wasn't as much as hoped. And the worst-case power draw was about the same as a RAID, because the cache could be thrashed in a way that caused almost all the drives to be powered up.
Facebook's design is different. It is aimed at limiting the worst-case power draw. It exploits the fact that this storage is at the bottom of the storage hierarchy and can tolerate significant access latency. Disks are assigned to groups in equal numbers. One group of disks is spun up at a time in rotation, so the worst-case access latency is the time needed to cycle through all the disk groups. But the worst-case power draw is only that for a single group of disks and enough compute to handle a single group.
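To make the trade-off concrete, here is a minimal sketch of the rotation described above. The class name, the field names and every number in the usage note are my own hypothetical choices, not figures from the talk, and the model ignores spin-up time and the compute needed to serve a group.

```python
from dataclasses import dataclass


@dataclass
class RotatingDiskRack:
    """Hypothetical model: disks split into equal groups, one group active at a time."""
    num_disks: int
    num_groups: int
    watts_per_active_disk: float   # assumed draw of one spinning disk
    minutes_per_group: float       # assumed time each group stays spun up

    @property
    def disks_per_group(self) -> int:
        if self.num_disks % self.num_groups != 0:
            raise ValueError("groups must contain equal numbers of disks")
        return self.num_disks // self.num_groups

    def worst_case_power_watts(self) -> float:
        # Only one group is ever spinning, so this bounds the disk power draw.
        return self.disks_per_group * self.watts_per_active_disk

    def worst_case_latency_minutes(self) -> float:
        # A request may have to wait for a full cycle through all the groups.
        return self.num_groups * self.minutes_per_group

    def active_group(self, minute: float) -> int:
        # Index of the group that is spun up at the given time.
        return int(minute // self.minutes_per_group) % self.num_groups
```

With, say, 480 disks in 16 groups, 8 W per spinning disk and 30 minutes per group, the worst-case disk power is that of 30 disks rather than 480, and the price is a worst-case wait of 16 group slots.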
Why is this important? Because of the synergistic effects that knowing the maximum power draw enables. The power supplies can be much smaller and, because the access time is not critical, they need not be duplicated. Because Facebook builds entire data centers for cold storage, the data center needs much less power and cooling. It can be more like cheap warehouse space than expensive data center space. Aggregating these synergistic cost savings at data center scale leads to really significant savings.
Nevertheless, this design has high performance where it matters to Facebook, in write bandwidth. While a group of disks is spun up, any reads queued up for that group are performed. But almost all the I/O operations to this design are writes. Writes are erasure-coded, and the shards are all written to different disks in the same group. In this way, while a group is spun up, all disks in the group are writing simultaneously, providing huge write bandwidth. When the group is spun down, the disks in the next group take over, and the high write bandwidth is only briefly interrupted.
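As an illustration of spreading the shards of one write across the disks of one group, here is a toy sketch. It uses single XOR parity, the simplest possible erasure code, which is my substitution for illustration only; the talk did not say which code Facebook uses. The function names and disk labels are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Optional


def _xor_columns(shards: List[bytes]) -> bytes:
    # XOR the shards together byte by byte.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*shards))


def encode_xor(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal shards (zero-padded) and append one parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [_xor_columns(shards)]


def place_on_group(shards: List[bytes], group_disks: List[str]) -> Dict[str, bytes]:
    """Assign each shard to a different disk of the currently spun-up group."""
    if len(shards) > len(group_disks):
        raise ValueError("group has too few disks for one shard per disk")
    return dict(zip(group_disks, shards))


def recover(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) from the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity tolerates only one lost shard")
    result = list(shards)
    if missing:
        result[missing[0]] = _xor_columns([s for s in shards if s is not None])
    return result
```

Because every shard lands on a different disk, all the disks in the group can write in parallel, and losing any one disk in this toy scheme still leaves the data recoverable.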
Next, below this layer of disk cold storage Facebook implemented the Blu-ray cold storage that drew such attention. It has 12 Blu-ray drives for an entire rack of cartridges holding 10,000 100GB Blu-ray disks managed by a robot. When the robot loads a group of 12 fresh Blu-ray disks into the drives, the appropriate amount of data to fill them is read from the currently active hard disk group and written to them. This scheduling of the writes allows for effective use of the limited write capacity of the Blu-ray drives. If the data are ever read, a specific group has to be loaded into the drives, interrupting the flow of writes, but this is a rare occurrence. Once all 10,000 disks in a rack have been written, the disks will be loaded for reads infrequently. Most of the time the entire petabyte rack will sit there idle.
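A quick back-of-the-envelope check of the rack numbers, taking each disk as 100 GB (decimal units), which is the capacity that makes 10,000 disks add up to a petabyte. The constant and function names are mine.

```python
import math

DISKS_PER_RACK = 10_000
DRIVES_PER_RACK = 12
DISK_CAPACITY_GB = 100  # assumed per-disk capacity, decimal gigabytes


def rack_capacity_pb(disks: int = DISKS_PER_RACK, gb_per_disk: int = DISK_CAPACITY_GB) -> float:
    # 1 PB = 1,000,000 GB in decimal units.
    return disks * gb_per_disk / 1_000_000


def batch_size_gb(drives: int = DRIVES_PER_RACK, gb_per_disk: int = DISK_CAPACITY_GB) -> int:
    # Data staged from the active hard disk group for one robot load.
    return drives * gb_per_disk


def loads_to_fill_rack(disks: int = DISKS_PER_RACK, drives: int = DRIVES_PER_RACK) -> int:
    # Number of robot loads needed to write every disk once.
    return math.ceil(disks / drives)
```

So each robot load consumes 1.2 TB from the hard disk tier, and filling a rack takes 834 loads, the last one only partly full.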
It is this careful, organized scheduling of the system's activities at data center scale that enables the synergistic cost reductions of cheap power and space. It is, or at least may be, true that the Blu-ray disks have a 50-year lifetime but this isn't what matters. No one expects the racks to sit in the data center for 50 years; at some point before then they will be obsoleted by some unknown new, much denser and more power-efficient cold storage medium (perhaps DNA).
After the off-the-record talk I was thinking about the synergistic effects that Facebook got from the hard limit the system provides on the total power consumption of the data center. This limit is fixed; the system schedules its activities to stay under the limit. I connected this idea to the Internet Archive's approach to operating their data center in the church in San Francisco without air conditioning. Most of the time, under the fog, San Francisco is quite cool, but there are occasional hot days. So the Internet Archive team built a scheduling system that, when the temperature rises, delays non-essential tasks. Since hot days are rare, these delays do not significantly reduce the system's throughput, although they can impact the latency of the non-essential tasks.
The interesting thing about renewable energy sources, such as solar and wind, is that these days the output of these sources can be predicted with considerable accuracy. If Facebook's scheduler could modulate the power limit dynamically, it could match the data center's demand to the power available from the solar panels or wind turbines powering it. This could enable, for example, off-the-grid cold storage data centers in the desert, eliminating some of the possible threats to the data.
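I don't know how the Internet Archive's scheduler is actually built, but the idea can be sketched in a few lines. The Task type, the function name and the 30 °C threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    name: str
    essential: bool


def split_by_temperature(
    tasks: List[Task], temperature_c: float, threshold_c: float = 30.0
) -> Tuple[List[Task], List[Task]]:
    """Return (run_now, deferred); non-essential work waits while it is hot."""
    if temperature_c <= threshold_c:
        return list(tasks), []
    run_now = [t for t in tasks if t.essential]
    deferred = [t for t in tasks if not t.essential]
    return run_now, deferred
```

On a cool day everything runs. On a hot day only the essential tasks run, and the rest simply see higher latency.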
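As a thought experiment only, here is how such a scheduler might turn a power forecast into a plan. Nothing here comes from Facebook; the function name, the per-slot forecast, the fixed base load and the wattages are all assumptions of mine.

```python
from typing import List


def groups_allowed(
    forecast_watts: List[float], group_watts: float, base_watts: float = 0.0
) -> List[int]:
    """For each time slot, how many disk groups the forecast power can spin up.

    forecast_watts: predicted renewable output per slot (hypothetical input)
    group_watts:    worst-case draw of one spun-up group
    base_watts:     assumed fixed overhead that must always be powered
    """
    if group_watts <= 0:
        raise ValueError("group_watts must be positive")
    plan = []
    for available in forecast_watts:
        spare = available - base_watts
        plan.append(max(0, int(spare // group_watts)))
    return plan
```

For example, a forecast of 0 W, 300 W, 1000 W and 500 W with 240 W groups and a 50 W base load allows 0, 1, 3 and 1 groups in the four slots. Writes queue up while nothing can spin, which this bottom tier of the hierarchy is designed to tolerate.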