DuraWrite technology extends the life of the SSD over conventional controllers, by optimizing writes to the Flash memory and delivering a write amplification below 1, without complex DRAM caching requirements.This controller is used in SSDs from, for example, Corsair, ADATA and Mushkin.
It is easy to see the attraction of this idea. Flash controllers need a block re-mapping layer, called the Flash Translation Layer (FTL) (PDF) and, by enhancing this layer to map all logical blocks written with identical data to the same underlying physical block, the number of actual writes to flash can be reduced, the life of the device improved, and the write bandwidth increased. However, it was immediately obvious to me that this posed risks for file systems. Below the fold is an explanation.
File systems write the same metadata to multiple logical blocks as a way of avoiding a single block failure causing massive, or in some cases total, loss of user data. An example is the superblock in UFS. Suppose you have one of these SSDs with a UFS file system on it. Each of the multiple alternate logical locations for the superblock will be mapped to the same underlying physical block. If any of the bits in this physical block goes bad, the same bit will go bad in every alternate logical superblock,
I discussed this problem with Kirk McKusick, and he with the ZFS team. In brief, that devices sometimes do this is very bad news indeed, especially for file systems such as ZFS intended to deliver the level of reliability that large file systems need.
Thanks to the ZFS team, here is a more detailed explanation of why this is a problem for ZFS. For critical metadata (and optionally for user data) ZFS stores up to 3 copies of each block. The checksum of each block is stored in its parent, so that ZFS can ensure the integrity of its metadata before using it. If corrupt metadata is detected, it can find an alternate copy and use that. Here are the problems:
- If the stored metadata gets corrupted, the corruption will apply to all copies, so recovery is impossible.
- To defeat this, we would need to put a random salt into each of the copies, so that each block would be different. But the multiple copies are written by scheduling multiple writes of the same data in memory to different logical block addresses on the device. Changing this to copy the data into multiple buffers, salt them, then write each one once would be difficult and inefficient.
- Worse, it would mean that the checksum of each of the copies of the child block would be different; at present they are all the same. Retaining the identity of the copy checksums would require excluding the salt from the checksum. But ZFS computes the sum of every block at a level in the stack where the kind of data in the block is unknown. Loosing the identity of the copy checksums would require changes to the on-disk layout.
This blog post morphed into an ACM Queue piece.
ReplyDeleteRobin Harris carries this story further, and I respond here.
ReplyDelete