Thursday, October 23, 2014

Facebook's Warm Storage

Last month I was finally able to post about Facebook's cold storage technology. Now, Subramanian Muralidhar and a team from Facebook, USC and Princeton have a paper at OSDI that describes the warm layer between the two cold storage layers and Haystack, the hot storage layer. f4: Facebook's Warm BLOB Storage System is perhaps less directly aimed at long-term preservation, but the paper is full of interesting information. You should read it, but below the fold I relate some details.

A BLOB is a Binary Large OBject. Each type of BLOB contains a single type of immutable binary content, such as photos, videos, documents, etc. Section 3 of the paper is a detailed discussion of the behavior of BLOBs of different kinds in Facebook's storage system.

Figure 3 shows that the rate of I/O requests to BLOBs drops rapidly through time. The rates for different types of BLOB drop differently, but all 9 types have dropped by 2 orders of magnitude within 8 months, and all but 1 (profile photos) have dropped by an order of magnitude within the first week.

The vast majority of Facebook's BLOBs are warm, as shown in Figure 5 - notice the scale goes from 80-100%. Thus the vast majority of the BLOBs generate I/O rates at least 2 orders of magnitude less than recently generated BLOBs.

In my talk to the 2012 Library of Congress Storage Architecture meeting I noted the start of an interesting evolution:
a good deal of previous meetings was a dialog of the deaf. People doing preservation said "what I care about is the cost of storing data for the long term". Vendors said "look at how fast my shiny new hardware can access your data".  ... The interesting thing at this meeting is that even vendors are talking about the cost.
This year's meeting was much more cost-focused. The Facebook data make two really strong cases in this direction:
  • That significant kinds of data should be moved from expensive, high-performance hot storage to cheaper warm and then cold storage as rapidly as feasible.
  • That the I/O rate that warm storage should be designed to sustain is so different from that of hot storage, at least 2 and often many more orders of magnitude, that attempting to re-use hot storage technology for warm and even worse for cold storage is futile.
This is good, because hot storage will be high-performance flash or other solid state memory and, as I and others have been pointing out for some time, there isn't going to be enough of it to go around.

Haystack uses RAID-6 and replicates data across three data centers, using 3.6 times as much storage as the raw data. f4 uses two fault-tolerance techniques:
  • Within a data center it uses erasure coding with 10 data blocks and 4 parity blocks. Careful layout of the blocks ensures that the data is resilient to drive, host and rack failures at an effective replication factor of 1.4.
  • Between data centers it uses XOR coding. Each block is paired with a different block in another data center, and the XOR of the two blocks stored in a third. If any one of the three data centers fails, both paired blocks can be restored from the other two.
The result is fault-tolerance to drive, host, rack and data center failures at an effective replication factor of 2.1, reducing overall storage demand from Haystack's factor of 3.6 by nearly 42% for the vast bulk of Facebook's BLOBs.  When fully deployed, this will save 87PB of storage. Erasure-coding everything except the hot storage layer seems economically essential.

Another point worth noting that the paper makes relates to heterogeneity as a way of avoiding correlated failures:
We recently learned about the importance of heterogeneity in the underlying hardware for f4 when a crop of disks started failing at a higher rate than normal. In addition, one of our regions experienced higher than average temperatures that exacerbated the failure rate of the bad disks. This combination of bad disks and high temperatures resulted in an increase from the normal ~1% AFR to an AFR over 60% for a period of weeks. Fortunately, the high-failure-rate disks were constrained to a single cell and there was no data loss because the buddy and XOR blocks were in other cells with lower temperatures that were unaffected.

1 comment:

David. said...

Via High Scalability I find another interesting blog post about Facebook's storage architecture by Murat.