Tuesday, October 31, 2017

Storage Failures In The Field

It's past time for another look at the invaluable hard drive data that Backblaze puts out quarterly. As Peter Bright notes at Ars Technica, although they are based on limited data, the current stats support two interesting observations:
  • Backblaze is seeing reduced rates of infant mortality for the 10TB and 12TB drive generations:
    The initial data from the 10TB and 12TB disks, however, has not shown that pattern. While the data so far is very limited, with 1,240 disks and 14,220 aggregate drive days accumulated so far, none of these disks (both Seagate models) have failed.
  • Backblaze is seeing no reliability advantage for enterprise over consumer drives:
    the company has now accumulated 3.7 million drive days for the consumer disks and 1.4 million for the enterprise ones. Over this usage, the annualized failure rates are 1.1 percent for the consumer disks and 1.2 percent for the enterprise ones.
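
To put the quoted numbers in perspective, here is a minimal sketch of the arithmetic, not Backblaze's own code. It assumes the usual definition of annualized failure rate (failures per drive-year, as a percentage) and, for the zero-failure case, a constant failure rate (a Poisson model). The function names are mine, and the failure counts it derives are only what the rounded rates imply, not figures Backblaze reported.

```python
import math

DAYS_PER_YEAR = 365


def annualized_failure_rate(failures, drive_days):
    """AFR in percent: failures per drive-year of operation."""
    drive_years = drive_days / DAYS_PER_YEAR
    return 100.0 * failures / drive_years


def afr_upper_bound_zero_failures(drive_days, confidence=0.95):
    """Upper bound on AFR (percent) when no failures have been seen.

    Assumes a constant failure rate, so the chance of seeing zero
    failures is exp(-rate * drive_days). The bound is the rate at
    which that chance drops to (1 - confidence).
    """
    rate_per_day = -math.log(1.0 - confidence) / drive_days
    return 100.0 * rate_per_day * DAYS_PER_YEAR


def implied_failures(afr_percent, drive_days):
    """Approximate failure count implied by an AFR and its drive days."""
    return afr_percent / 100.0 * drive_days / DAYS_PER_YEAR


if __name__ == "__main__":
    # 10TB and 12TB drives: 14,220 drive days, no failures.
    print(annualized_failure_rate(0, 14_220))
    print(round(afr_upper_bound_zero_failures(14_220), 2))
    # Consumer vs. enterprise: counts implied by the rounded AFRs.
    print(round(implied_failures(1.1, 3_700_000)))
    print(round(implied_failures(1.2, 1_400_000)))
```

Under these assumptions, zero failures in 14,220 drive days is consistent with an annualized failure rate anywhere up to roughly 7.7 percent at 95% confidence, which is why the data on the new drives counts as very limited. The consumer and enterprise rates imply on the order of 112 and 46 failures respectively.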
The first thing to note is that devoting engineering effort to reducing infant mortality can have a significant return on investment. A drive that fails early will be returned under warranty, costing the company money; a drive that fails after the warranty expires cannot be returned. Because the company must reserve against warranty costs in its accounts, any reduction in the rate of early failures goes straight to its bottom line.

Thus engineering devoted to reducing infant mortality is much more profitable than engineering devoted to extending the drives' service life. Extending service life beyond the current five years is wasted effort, because unless Kryder's law slows even further, the drives will be replaced to get more capacity in the same slot. Backblaze is replacing drives for this reason:
    You’ll also notice that we have used a total of 85,467 hard drives. But at the end of 2016 we had 71,939 hard drives. Are we missing 13,528 hard drives? Not really. While some drives failed, the remaining drives were removed from service due primarily to migrations from smaller to larger drives.
The first observation makes it look as though the disk manufacturers have been following this strategy, which would also explain the second observation. The goal is zero infant failures for both enterprise and consumer drives. To the extent that this goal is met, failure rates for both types in the first two years would be the same: zero. It might be that after the first two years, once the consumer drives were out of warranty, they would start to fail while the enterprise drives, still in warranty, would not.

But my guess is that both drive types will continue to fail at about the same rate, because they share so much underlying technology. Backblaze has a long history of using consumer drives, and their stats show that some models are reliable over 4-5 years while others are not. A significant part of the enterprise drives' higher price is the cost of the five-year warranty.

