
Tuesday, June 28, 2011

More on de-duplicating flash controllers

My ACM Queue piece describing the problems caused by storage devices invisibly doing de-duplication attracted the attention of Robin Harris, who actually asked SandForce and other manufacturers to comment. The details of SandForce's response are in Robin's StorageMojo article, but the key claims are:
There is no more likelihood of DuraWrite losing data than if it was not present.
and:
That is why SandForce created RAISE (Redundant Array of Independent Silicon Elements) and includes it on every SSD that uses a SandForce SSD Processor. ... if the ECC engine is unable to correct the bit error RAISE will step in to correct a complete failure of an entire sector, page, or block. ... This combination of ECC and RAISE protection provides a resulting UBER of 10^-29 virtually eliminates the probabilities of data corruption.
I would regard both claims with considerable skepticism:

Tuesday, March 8, 2011

Deduplicating Devices Considered Harmful

In my brief report from FAST11 I mentioned that Michael Wei's presentation of his paper on erasing information from flash drives (PDF) revealed that at least one flash controller was, by default, doing block-level deduplication of data written to it. I e-mailed Michael about this, and learned that the SSD controller in question is the SandForce SF-1200. This sentence is a clue:
DuraWrite technology extends the life of the SSD over conventional controllers, by optimizing writes to the Flash memory and delivering a write amplification below 1, without complex DRAM caching requirements.
This controller is used in SSDs from, for example, Corsair, ADATA and Mushkin.

It is easy to see the attraction of this idea. Flash controllers need a block re-mapping layer, called the Flash Translation Layer (FTL) (PDF). By enhancing this layer to map all logical blocks written with identical data to the same underlying physical block, a controller can reduce the number of actual writes to flash, extend the life of the device, and increase write bandwidth. However, it was immediately obvious to me that this posed risks for file systems. Below the fold is an explanation.
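To make the mechanism concrete, here is a minimal sketch in Python of a deduplicating FTL. It is a toy model, not SandForce's actual design: it keys physical blocks by a SHA-256 digest of their contents, so logical writes of identical data consume only one physical block.

```python
import hashlib

class DedupFTL:
    """Toy flash translation layer that deduplicates identical blocks."""

    def __init__(self):
        self.lba_map = {}      # logical block address -> physical block index
        self.by_content = {}   # SHA-256 of contents -> physical block index
        self.flash = []        # the "physical" blocks actually programmed

    def write(self, lba, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        phys = self.by_content.get(digest)
        if phys is None:
            # New content: program a fresh physical block.
            phys = len(self.flash)
            self.flash.append(data)
            self.by_content[digest] = phys
        # Duplicate content costs only a map update, no flash write.
        self.lba_map[lba] = phys

    def read(self, lba) -> bytes:
        return self.flash[self.lba_map[lba]]

ftl = DedupFTL()
for lba in (0, 1024, 2048):
    ftl.write(lba, b"\xAA" * 512)   # three logical writes of identical data
print(len(ftl.flash))               # 1 physical block programmed
```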

File systems write the same metadata to multiple logical blocks as a way of avoiding a single block failure causing massive, or in some cases total, loss of user data. An example is the superblock in UFS. Suppose you have one of these SSDs with a UFS file system on it. Each of the multiple alternate logical locations for the superblock will be mapped to the same underlying physical block. If any of the bits in this physical block goes bad, the same bit will go bad in every alternate logical superblock, defeating the purpose of keeping the alternate copies in the first place.
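Continuing the same toy model (again an illustration, not any real controller's behaviour), a single bit error in the one shared physical block is visible at every logical address that held a copy of the superblock:

```python
import hashlib

flash = {}     # SHA-256 of contents -> bytearray holding the physical block
lba_map = {}   # logical block address -> SHA-256 key

def write(lba, data: bytes):
    key = hashlib.sha256(data).hexdigest()
    flash.setdefault(key, bytearray(data))   # identical data shares one block
    lba_map[lba] = key

superblock = bytes(512)
replica_lbas = [16, 5000, 10000, 15000]      # alternate superblock locations
for lba in replica_lbas:
    write(lba, superblock)

# One bit in the single shared physical block goes bad...
next(iter(flash.values()))[0] ^= 0x01

# ...and every "independent" replica reads back corrupt.
print([bytes(flash[lba_map[lba]]) == superblock for lba in replica_lbas])
# -> [False, False, False, False]
```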

I discussed this problem with Kirk McKusick, and he with the ZFS team. In brief, the fact that devices sometimes do this is very bad news indeed, especially for file systems such as ZFS that are intended to deliver the level of reliability that large file systems need.

Thanks to the ZFS team, here is a more detailed explanation of why this is a problem for ZFS. For critical metadata (and optionally for user data) ZFS stores up to 3 copies of each block. The checksum of each block is stored in its parent, so that ZFS can ensure the integrity of its metadata before using it. If corrupt metadata is detected, it can find an alternate copy and use that. Here are the problems:
  • If the stored metadata gets corrupted, the corruption will apply to all copies, so recovery is impossible.
  • To defeat this, we would need to put a random salt into each of the copies, so that each block would be different. But the multiple copies are written by scheduling multiple writes of the same data in memory to different logical block addresses on the device. Changing this to copy the data into multiple buffers, salt them, then write each one once would be difficult and inefficient.
  • Worse, it would mean that the checksum of each of the copies of the child block would be different; at present they are all the same. Retaining the identity of the copy checksums would require excluding the salt from the checksum. But ZFS computes the checksum of every block at a level in the stack where the kind of data in the block is unknown. Losing the identity of the copy checksums would require changes to the on-disk layout. The sketch after this list illustrates this trade-off.
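Here is a rough sketch of that trade-off, using a toy block format rather than ZFS's actual on-disk layout. The parent's single checksum covers all three ditto copies only because the copies are byte-identical; a per-copy salt makes the copies physically distinct, but then the checksum routine must know where the salt lives inside the block in order to preserve that single checksum.

```python
import hashlib, os

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

payload = b"critical metadata block"

# Today: up to three identical ditto copies, one checksum stored in the parent.
copies = [payload] * 3
parent_checksum = checksum(copies[0])
assert all(checksum(c) == parent_checksum for c in copies)
# ...yet a deduplicating device keeps only one physical block for all of them.

# Hypothetical fix: prepend a random salt so each copy is physically unique.
SALT_LEN = 16
salted = [os.urandom(SALT_LEN) + payload for _ in range(3)]
print(len({checksum(c) for c in salted}))              # 3: whole-block checksums now differ

# The single parent checksum only survives if the salt is excluded, which means
# the checksum code must know the layout of an otherwise opaque block: the
# on-disk format change the last bullet describes.
print(len({checksum(c[SALT_LEN:]) for c in salted}))   # 1
```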
This isn't something specific to ZFS; similar problems arise for all file systems that use redundancy to provide robustness. The bottom line is that drivers for devices capable of doing this need to turn it off. But the whole point of SSDs is that they live behind the same generic disk driver as all SATA devices. It may be possible to use mechanisms such as FreeBSD's quirks to turn deduplication off, but that assumes you know which devices have controllers that deduplicate, that the controllers support commands to disable deduplication, and that you know what those commands are.

Friday, February 18, 2011

FAST'11

I attended USENIX's File And Storage Technologies conference. Here's a brief list of the things that caught my attention:
  • The first paper, and one of the Best Paper awardees, was "A Study of Practical Deduplication" (PDF), an excellent overview of deduplication applied to file systems. It makes available much valuable new data. In their environment, whole-file deduplication achieves about 3/4 of the total savings from aggressive block-level deduplication (a toy comparison of the two approaches appears after this list).
  • In fact, deduplication and flash memory dominated the conference. "Reliably Erasing Data From Flash-Based Solid State Drives" from a team at UCSD revealed that, because flash memories effectively require copy-on-write techniques, they contain many logically inaccessible copies of a file. These copies are easily accessible by de-soldering the chips and thus gaining a physical view of the storage. Since existing "secure delete" techniques can't go around the controller, and most controllers either don't or don't correctly implement the "sanitization" commands, it is essential to use encrypted file systems on flash devices if they are to store confidential information.
  • Even worse, Michael Wei's presentation of this paper revealed that at least one flash controller was doing block deduplication "under the covers". This is very tempting, in that it can speed up writes and extend the device lifetime considerably. But it can play havoc with the techniques file systems use to improve robustness.
  • "AONT-RS: Blending Security and Performance in Dispersed Storage Systems" was an impressive overview of how all-or-nothing transforms can provide security in Cleversafe's k-of-n dispersed storage system, without requiring complex key management schemes. I will write more on this in subsequent posts.
  • "Exploiting Memory Device Wear-Out Dynamics to Improve NAND Flash Memory System Performance" from RPI provides much useful background on the challenges flash technology faces in maintaining reliability as densities increase.
  • Although it is early days, it was interesting that several papers and posters addressed the impacts that non-volatile RAM technologies such as Phase Change Memory and memristors will have.
  • "Repairing Erasure Codes" was an important Work In Progress talk from a team at USC, showing how to reduce one of the more costly functions of k-of-n dispersed storage systems, organizing a replacement when one of the n slices fails. Previously, this required bringing together at least k slices, but they showed that it was possible to manage it with many fewer slices for at least some erasure codes, though so far none of the widely used ones. The talk mentioned this useful Wiki of papers about storage coding.