Tuesday, March 8, 2011

Deduplicating Devices Considered Harmful

In my brief report from FAST11 I mentioned that Michael Wei's presentation of his paper on erasing information from flash drives (PDF) revealed that at least one flash controller was, by default, doing block-level deduplication of data written to it. I e-mailed Michael about this, and learned that the SSD controller in question is the SandForce SF-1200. This sentence is a clue:
"DuraWrite technology extends the life of the SSD over conventional controllers, by optimizing writes to the Flash memory and delivering a write amplification below 1, without complex DRAM caching requirements."
This controller is used in SSDs from, for example, Corsair, ADATA and Mushkin.

It is easy to see the attraction of this idea. Flash controllers need a block re-mapping layer, called the Flash Translation Layer (FTL) (PDF). Enhancing this layer to map all logical blocks written with identical data to the same underlying physical block reduces the number of actual writes to flash, extends the life of the device, and increases the write bandwidth. However, it was immediately obvious to me that this posed risks for file systems. Below the fold is an explanation.

File systems write the same metadata to multiple logical blocks as a way of avoiding a single block failure causing massive, or in some cases total, loss of user data. An example is the superblock in UFS. Suppose you have one of these SSDs with a UFS file system on it. Each of the multiple alternate logical locations for the superblock will be mapped to the same underlying physical block. If any of the bits in this physical block goes bad, the same bit will go bad in every alternate logical superblock, so the redundancy provides no protection.
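
The failure mode can be shown with a toy model. This is only a sketch under stated assumptions: the DedupFTL class, its method names, the 512-byte block size and the three logical block addresses are all invented for illustration, and nothing here claims to match how the SF-1200 or any other real controller works internally.

```python
import hashlib


class DedupFTL:
    """Toy flash translation layer that deduplicates identical blocks.

    Hypothetical model: no garbage collection, wear leveling or
    reference counting, just the logical-to-physical mapping.
    """

    def __init__(self):
        self.logical_to_physical = {}  # logical block address -> physical index
        self.digest_to_physical = {}   # content digest -> physical index
        self.physical = []             # contents of each physical block

    def write(self, lba, data):
        # Identical data is stored once; later writes only add a mapping.
        digest = hashlib.sha256(data).digest()
        phys = self.digest_to_physical.get(digest)
        if phys is None:
            phys = len(self.physical)
            self.physical.append(bytearray(data))
            self.digest_to_physical[digest] = phys
        self.logical_to_physical[lba] = phys

    def read(self, lba):
        return bytes(self.physical[self.logical_to_physical[lba]])

    def flip_bit(self, phys, byte_index, bit):
        # Simulate a single bit going bad in one physical block.
        self.physical[phys][byte_index] ^= 1 << bit


ftl = DedupFTL()
superblock = b"UFS superblock".ljust(512, b"\0")  # stand-in contents
alternate_lbas = [16, 4096, 8192]                 # made-up alternate locations

# The file system writes the same superblock to every alternate location.
for lba in alternate_lbas:
    ftl.write(lba, superblock)

physical_blocks_used = len(ftl.physical)

# One bit fails in the single physical block backing all the copies.
ftl.flip_bit(ftl.logical_to_physical[alternate_lbas[0]], 0, 0)

# Which alternate superblocks still read back intact?
surviving = [lba for lba in alternate_lbas if ftl.read(lba) == superblock]
```

In this model the three logical writes consume one physical block, which is where the write amplification below 1 comes from, and after the single bit flip the list of surviving copies is empty.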

I discussed this problem with Kirk McKusick, and he discussed it with the ZFS team. In brief, the fact that devices sometimes do this is very bad news indeed, especially for file systems such as ZFS that are intended to deliver the level of reliability that large file systems need.

Thanks to the ZFS team, here is a more detailed explanation of why this is a problem for ZFS. For critical metadata (and optionally for user data) ZFS stores up to 3 copies of each block. The checksum of each block is stored in its parent, so that ZFS can ensure the integrity of its metadata before using it. If corrupt metadata is detected, it can find an alternate copy and use that. Here are the problems:
  • If the stored metadata gets corrupted, the corruption will apply to all copies, so recovery is impossible.
  • To defeat this, we would need to put a random salt into each of the copies, so that each block would be different. But the multiple copies are written by scheduling multiple writes of the same data in memory to different logical block addresses on the device. Changing this to copy the data into multiple buffers, salt them, then write each one once would be difficult and inefficient.
  • Worse, it would mean that the checksum of each of the copies of the child block would be different; at present they are all the same. Retaining the identity of the copy checksums would require excluding the salt from the checksum. But ZFS computes the checksum of every block at a level in the stack where the kind of data in the block is unknown. Losing the identity of the copy checksums would require changes to the on-disk layout.
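
The checksum consequence of salting can be sketched in a few lines. Everything here is an assumption made for illustration: SHA-256 stands in for whichever checksum the file system uses, the 8-byte salt is placed at the front of the block, and the names checksum, salted_copies and SALT_LEN are hypothetical. It is not ZFS code.

```python
import hashlib
import os

SALT_LEN = 8  # hypothetical salt size, chosen only for this sketch


def checksum(block):
    # Stand-in for the file system's block checksum.
    return hashlib.sha256(block).hexdigest()


def salted_copies(payload, n):
    # Each copy gets its own random salt so a deduplicating device
    # sees n different blocks. Note this needs n separate buffers.
    return [os.urandom(SALT_LEN) + payload for _ in range(n)]


payload = b"critical metadata".ljust(504, b"\0")

# Today: the same in-memory buffer is written to three addresses.
unsalted = [payload] * 3
unsalted_sums = {checksum(c) for c in unsalted}

# With salting: checksumming the whole block gives three different values,
# so one checksum held in the parent can no longer verify every copy.
salted = salted_copies(payload, 3)
salted_sums = {checksum(c) for c in salted}

# Keeping a single checksum requires skipping the salt, which means the
# checksum code must know where the salt sits inside the block.
payload_only_sums = {checksum(c[SALT_LEN:]) for c in salted}
```

The unsalted copies share one checksum, the salted copies have three distinct checksums, and the copies agree again only when the salt is excluded, which is exactly the knowledge of block contents that the checksum layer lacks.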
This isn't something specific to ZFS; similar problems arise for all file systems that use redundancy to provide robustness. The bottom line is that drivers for devices capable of doing this need to turn it off. But the whole point of SSDs is that they live behind the same generic disk driver as all SATA devices. It may be possible to use mechanisms such as FreeBSD's quirks to turn deduplication off, but that assumes that you know the devices with controllers that deduplicate, that the controllers support commands to disable deduplication, and that you know what the commands are.

2 comments:

David. said...

This blog post morphed into an ACM Queue piece.

David. said...

Robin Harris carries this story further, and I respond here.