Thursday, February 5, 2015

Disk reliability

Two recent publications about disk reliability are of considerable interest. Continuing their exemplary tradition of transparency, Backblaze followed up their 2013 report on their experience of disk failures with a report on 2014, along with the raw data and a set of FAQs. And J-F Paris et al. published Self-Repairing Disk Arrays. Below the fold, thoughts on the relationship between these two.

Backblaze now have over 41K drives ranging from 1.5TB to 6TB spinning. Their data for a year consists of 365 daily tables each with one row for each spinning drive, so there is a lot of it, over 12M records. The 4TB disk generation looks good:
We like every one of the 4 TB drives we bought this year. For the price, you get a lot of storage, and the drive failure rates have been really low. The Seagate Desktop HDD.15 has had the best price, and we have a LOT of them. Over 12 thousand of them. The failure rate is a nice low 2.6% per year. Low price and reliability is good for business.
The HGST drives, while priced a little higher, have an even lower failure rate, at 1.4%. It’s not enough of a difference to be a big factor in our purchasing, but when there’s a good price, we grab some. We have over 12 thousand of these drives.
It's too soon to tell about the 6TB generation:
Currently we have 270 of the Western Digital Red 6 TB drives. The failure rate is 3.1%, but there have been only 3 failures. ... We have just 45 of the Seagate 6 TB SATA 3.5 drives, although more are on order. They’ve only been running a few months, and none have failed so far.
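The quoted 3.1% is not simply 3 failures divided by 270 drives (about 1.1%), because the rate is annualized over the time the drives were actually in service. A minimal sketch of that calculation, assuming AFR is failures per drive-year computed from drive-days, with one row per drive per day as in the daily tables described above. The row keys `model` and `failure` and the sample rows are hypothetical stand-ins, not Backblaze's actual data:

```python
from collections import defaultdict

DAYS_PER_YEAR = 365


def annual_failure_rate(failures: int, drive_days: int) -> float:
    """Failures per drive-year of service, as a fraction (0.031 means 3.1%)."""
    return failures / (drive_days / DAYS_PER_YEAR)


def afr_by_model(rows):
    """Aggregate daily rows into an AFR per model.

    rows: iterable of dicts, one per drive per day. The keys 'model' and
    'failure' are assumptions for this sketch; 'failure' is 1 on the day
    a drive fails and 0 otherwise.
    """
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        drive_days[row["model"]] += 1          # each row is one drive-day
        failures[row["model"]] += int(row["failure"])
    return {m: annual_failure_rate(failures[m], drive_days[m]) for m in drive_days}


# Synthetic data: 730 drive-days of a made-up model with a single failure.
rows = [{"model": "EXAMPLE-6TB", "failure": 0} for _ in range(2 * 365 - 1)]
rows.append({"model": "EXAMPLE-6TB", "failure": 1})
example_afr = afr_by_model(rows)["EXAMPLE-6TB"]  # 1 failure over 2 drive-years

# Back-solving the quoted 6 TB figures: 3 failures at a (rounded) 3.1% AFR.
implied_drive_days = 3 * DAYS_PER_YEAR / 0.031
implied_days_per_drive = implied_drive_days / 270
```

Under these assumptions the quoted figures imply roughly 35,000 drive-days, or about 130 days of service per drive on average, which is why three failures are too few to say much about the 6TB generation.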
What grabbed all the attention was the 3TB generation:
The HGST Deskstar 5K3000 3 TB drives have proven to be very reliable, but expensive relative to other models (including similar 4 TB drives by HGST). The Western Digital Red 3 TB drives annual failure rate of 7.6% is a bit high but acceptable. The Seagate Barracuda 7200.14 3 TB drives are another story.
Their 1163 Seagate 3TB drives with an average age of 2.2 years had an annual failure rate (AFR) over 40% in 2014. Backblaze's economics mean that they can live with a reasonably high failure rate:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)
40% AFR is really high, but labor to replace the failed drives would still have cost less than $8/drive. The cost isn't the interesting aspect of this story. The drives would have failed at some point anyway, incurring the replacement labor cost. The 40% AFR just meant the labor cost, and the capital cost of new drives, was incurred earlier than expected, reducing the return on the investment in purchasing those drives.
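The arithmetic behind the "less than $8/drive" figure can be reconstructed from the numbers in the Backblaze quote. This is a back-of-envelope sketch: the hourly rate is not stated anywhere, it is merely implied by the quote's "maybe $5,000" guess, and the function names are mine:

```python
MINUTES_PER_SWAP = 15   # from the Backblaze quote
FLEET = 30_000          # drives, from the quote
SAVING_USD = 5_000      # the quote's guess at the saving from 2% -> 1% AFR


def swap_hours(drives: int, afr: float, minutes_per_swap: float = MINUTES_PER_SWAP) -> float:
    """Hours of labor per year spent replacing failed drives."""
    return drives * afr * minutes_per_swap / 60


hours_at_2_percent = swap_hours(FLEET, 0.02)                 # about 150 hours
hours_saved = hours_at_2_percent - swap_hours(FLEET, 0.01)   # about 75 hours

# Implied by the quote, not stated in it: dollars per hour of labor.
implied_hourly_rate = SAVING_USD / hours_saved
cost_per_swap = implied_hourly_rate * MINUTES_PER_SWAP / 60


def labor_per_drive_year(afr: float) -> float:
    """Expected replacement labor per drive in the fleet per year, in dollars."""
    return afr * cost_per_swap


labor_at_40_percent = labor_per_drive_year(0.40)
```

On these assumptions each swap costs under $17 in labor, so even a 40% AFR works out to between $6 and $7 per drive per year.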

Alas, there is a long history of high failure rates among particular batches of drives. An experience similar to Backblaze's at Facebook is related here, with an AFR over 60%. My first experience of this was nearly 30 years ago in the early days of Sun Microsystems. There are many causes for these correlated failures: manufacturing defects, software bugs, mishandling by distributors, and vibration resonance, among others. It is the correlated failures that make the interesting connection with the Self-Repairing Disk Arrays paper.

The first thing to note about the paper is that Paris et al are not dealing with Backblaze-scale arrays:
These solutions are not difficult to implement in installations that have trained personnel on site round-the-clock. When this is not the case, disk repairs will have to wait until a technician can service the failed disk. There are two major disadvantages to this solution. First, it introduces an additional delay, which will have a detrimental effect on the reliability of the storage system. Second, the cost of the service call is likely to exceed that of the equipment being replaced.
[Image: 4-slot Drobo]
The first problem with the paper is that there has been a technological solution to this problem for a decade since Data Robotics (now Drobo) introduced the Drobo. I've been using them ever since. They are available in configurations from 4 to 12 slots and in all cases when a drive fails the light by the slot flashes red. All that is needed is to pull out the failed drive and push in a replacement disk the same size or bigger. The Drobo's firmware handles hot-swapping and recovers the failed drive's data with no human intervention. No technician and much less than 15 minutes per drive needed.

The second problem is that although the paper's failure model is based on 2013 failure data from Backblaze, it appears to assume that the failures are uncorrelated. The fact that errors in storage systems are correlated has been known since at least the work of Talagala at Berkeley in 1999. Correlated failures such as those of the 3TB Seagate drives at Backblaze in 2014 would invalidate the paper's claim that:
we have shown that several complete two-dimensional disk arrays with n parity disks, n(n – 1)/2 data disks, and less than n(n + 1)/2 data disks could achieve a 99.999 percent probability of not losing data over four years.
A 99.999 percent probability would mean that only 1 in 100,000 arrays would lose data in 4 years. But the very next year's data from their data source would probably have caused most of the arrays to lose data. When designing reliable storage, the failure model needs to be pessimistic, not average. And it needs to consider correlated failures, which is admittedly very hard to do.
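To get a feel for how sensitive such a claim is to the failure rate, here is a deliberately crude sketch. It is not the paper's model: it ignores repair and spares entirely, treats failures as independent at a fixed annual rate, and uses an illustrative 10-disk array (n = 4 parity disks plus n(n – 1)/2 = 6 data disks) with an arbitrary threshold of three failures in a year. It only compares a typical AFR of 2% with the 40% seen for the 3TB Seagates:

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(at least k of n disks fail), each failing independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ARRAY_DISKS = 10  # illustrative: 4 parity + 6 data disks
FAILURES = 3      # illustrative threshold, not the paper's data-loss criterion

typical = prob_at_least(FAILURES, ARRAY_DISKS, 0.02)    # ordinary batch
bad_batch = prob_at_least(FAILURES, ARRAY_DISKS, 0.40)  # 3TB Seagate-like batch

ratio = bad_batch / typical
```

Even this sketch, which still assumes independence at the elevated rate, moves three-or-more failures in a year from a rare event to the expected outcome. Truly correlated failures cluster in time as well, which is worse again for an array that depends on finishing one repair before the next failure.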
