Thursday, May 21, 2015

Unrecoverable read errors

Trevor Pott has a post at The Register entitled Flash banishes the spectre of the unrecoverable data error in which he points out that while disk manufacturers' quoted Bit Error Rates (BER) for hard disks are typically 10^-14 or 10^-15, SSD BERs range from 10^-16 for consumer drives to 10^-18 for hardened enterprise drives. Below the fold, a look at his analysis of the impact of this difference of up to 4 orders of magnitude.

When a disk in a RAID-5 array fails and is replaced, all the data on other drives in the array must be read to reconstruct the data from the failed drive. If an unrecoverable read error (URE) is encountered in this process, one or more data blocks will be lost. RAID-6 and up can survive increasing numbers of UREs.
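To get a feel for how likely a URE is during such a rebuild, here is a rough sketch. It assumes bit errors are independent and occur at exactly the quoted BER, which is a simplification (real errors tend to be correlated, and quoted BERs are worst-case bounds). The function name and the list of BERs are illustrative choices, not taken from any data sheet.

```python
import math

def p_ure_during_read(bytes_read: float, ber: float) -> float:
    """Probability of at least one URE while reading `bytes_read` bytes,
    assuming independent bit errors at rate `ber` (a simplifying assumption)."""
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed in a numerically stable way for tiny ber.
    return -math.expm1(bits * math.log1p(-ber))

TB = 1e12  # decimal terabyte, as used on drive data sheets

# Data that must be read from the surviving drives during a rebuild.
for read_tb in (12, 18):
    for ber in (1e-14, 1e-15, 1e-16, 1e-18):
        p = p_ure_during_read(read_tb * TB, ber)
        print(f"read {read_tb} TB at BER {ber:.0e}: P(URE) = {p:.4%}")
```

Under these assumptions, reading 12TB at a BER of 10^-14 gives roughly a 62% chance of at least one URE, while at 10^-18 the chance is below 0.01%.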

It has been obvious for some time that, as hard disks got bigger without a corresponding decrease in BER, RAID technology had a problem: the probability of encountering a URE during reconstruction was going up, and thus so was the probability of losing data when a drive failed. As Trevor writes:
Putting this into rather brutal context, consider the data sheet for the 8TB Archive Drive from Seagate. This has an error rate of 10^14 bits. That is one URE every 12.5TB. That means Seagate will not guarantee that you can fully read the entire drive twice before encountering a URE.
Let's say that I have a RAID 5 of four 4TB drives and one dies. There is 12TB worth of data to be read from the remaining three drives before the array can be rebuilt. Taking all of the URE math from the above links and dramatically simplifying it, my chances of reading all 12TB before hitting a URE are not very good.
With 6TB drives I am beyond the math. In theory, I shouldn't be able to rebuild a failed RAID 5 array using 6TB drives that have a 10^14 BER. I will encounter a URE before the array is rebuilt and then I’d better hope the backups work.
So RAID 5 for consumer hard drives is dead.
Well, yes, but RAID-5, and RAID in general, is just one rather simple form of erasure coding. There are better forms of erasure coding for long-term data reliability. I disagree with Trevor when he writes:
There are plenty of ways to ensure that we can reliably store data, even as we move beyond 8TB drives. The best way, however, may be to put stuff you really care about on flash arrays. Especially if you have an attachment to the continued use of RAID 5.
Trevor is ignoring the economics. Hard drives are a lot cheaper for bulk storage than flash. As Chris Mellor pointed out in a post at The Register about a month ago, each byte of flash contains at least 50 times as much capital investment as a byte of hard drive. So flash will be a lot more expensive, even if not 50 times as expensive. For the sake of argument, let's say it is 5 times as expensive. To a first approximation, cost increases linearly with the replication factor, but reliability increases exponentially. So, instead of a replication factor of 1.2 in a RAID-5 flash array, for the same money I can have a replication factor of 6 in a hard disk array. Data in the hard drive array would be much, much safer for the same money. Or, if I used a replication factor of 2.5, the data would still be a great deal safer, for about 40% of the cost.
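The arithmetic behind this argument can be sketched as follows. The 5x cost ratio is the for-the-sake-of-argument figure from the paragraph above. The per-copy loss probabilities are purely hypothetical placeholders, and treating loss probability as `p ** replication` is only a crude stand-in for "reliability increases exponentially", assuming independent failures.

```python
def equal_budget_replication(flash_replication: float, cost_ratio: float) -> float:
    """Replication factor affordable on hard disk for the price of a flash
    array, assuming cost scales linearly with replication factor."""
    return flash_replication * cost_ratio

def loss_probability(per_copy_loss: float, replication: float) -> float:
    """Crude model: data is lost only if every independent copy is lost."""
    return per_copy_loss ** replication

COST_RATIO = 5.0          # flash cost per byte relative to disk (assumed above)
FLASH_REPLICATION = 1.2   # RAID-5 flash array

# Hypothetical per-copy loss probabilities, for illustration only.
P_FLASH_COPY = 1e-4
P_DISK_COPY = 1e-2

disk_replication = equal_budget_replication(FLASH_REPLICATION, COST_RATIO)
print(f"same-budget disk replication factor: {disk_replication:.1f}")
print(f"flash loss probability: {loss_probability(P_FLASH_COPY, FLASH_REPLICATION):.2e}")
print(f"disk loss probability:  {loss_probability(P_DISK_COPY, disk_replication):.2e}")

# A cheaper disk option: replication factor 2.5.
cheaper = 2.5
print(f"disk at {cheaper}x: {cheaper / disk_replication:.0%} of the budget, "
      f"loss probability {loss_probability(P_DISK_COPY, cheaper):.2e}")
```

With different placeholder probabilities the exact numbers change, but the shape of the comparison stays the same: the disk copies are individually less reliable, and the extra replication the budget buys more than compensates.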

1 comment:

Q said...

Personally I think URE numbers like 10^14 are way too conservative. I don't think it reflects real life that we see a URE every 12.5 TB, do you?