Tuesday, October 28, 2025

The Bathtub Curve

The economics of long-term data storage are critically dependent not just upon the Kryder rate, the rate at which the technology improves cost per byte, but also upon the reliability of the media over time. You want to replace media because they are no longer economic, not because they are no longer reliable despite still being economic.

Source
For more than a decade Backblaze has been providing an important public service by publishing data on the reliability of their hard drives, and more recently their SSDs. Below the fold I comment on this month's post from their Drive Stats Team, Are Hard Drives Getting Better? Let’s Revisit the Bathtub Curve.

Wikipedia defines the Bathtub Curve as a common concept in reliability engineering:
The 'bathtub' refers to the shape of a line that curves up at both ends, similar in shape to a bathtub. The bathtub curve has 3 regions:
  1. The first region has a decreasing failure rate due to early failures.
  2. The middle region is a constant failure rate due to random failures.
  3. The last region is an increasing failure rate due to wear-out failures.
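
One common way to picture the three regions is as the sum of three hazard rates: a decreasing one, a constant one and an increasing one. The sketch below does this with Weibull hazards. The function names and every parameter value are illustrative assumptions, chosen only to produce the bathtub shape; they are not fitted to Backblaze's data or to any real drive population.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Toy bathtub curve: failures per drive-year at age t (years).

    All constants below are made-up illustrative values.
    """
    early = weibull_hazard(t, shape=0.5, scale=50.0)   # shape < 1: decreasing (early failures)
    random_failures = 0.01                             # constant (random failures)
    wear_out = weibull_hazard(t, shape=5.0, scale=12.0)  # shape > 1: increasing (wear-out)
    return early + random_failures + wear_out


if __name__ == "__main__":
    for age in (0.1, 1, 3, 6, 9, 12):
        print(f"age {age:>4} years: hazard {bathtub_hazard(age):.4f}")
```

With these toy parameters the hazard is high for very young drives, falls to a floor in mid-life, and climbs again as the wear-out term takes over.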

In 2017's Storage Failures In The Field I commented on Backblaze's observation that the 10TB and 12TB HDD generations showed much reduced infant mortality:
devoting engineering effort to reducing infant mortality can have a significant return on investment. A drive that fails early will be returned under warranty, costing the company money. A drive that fails after the warranty expires cannot be returned. Warranty costs must be reserved against in the company's accounts. Any reduction in the rate of early failures goes straight to the company's bottom line.
Enterprise disks are typically warranted for 5 years, so a disk manufacturer is incentivized to focus engineering effort on eliminating the "first region", the left side of the bathtub, and ensuring that the second region extends past the 5-year mark. Eight years ago Backblaze was starting to see that the engineers were succeeding in the first region:
While the data so far is very limited, with 1,240 disks and 14,220 aggregate drive days accumulated so far, none of these disks (both Seagate models) have failed. The low level of usage means that the disks have been installed and formatted and not much beyond that, but true infant mortality—disks that immediately expire on their first use—hasn’t become apparent.
Source
Four years later, in Drive Failure Over Time: The Bathtub Curve Is Leaking, Klein had much stronger evidence:
The left side of the bathtub, the area of “decreasing failure rate,” is dramatically lower in 2021 than in 2013. In fact, for our 2021 curve, there is almost no left side of the bathtub, making it hard to take a bath, to say the least. We have reported how Seagate breaks in and tests their newly manufactured hard drives before shipping in an effort to lower the failure rates of their drives. Assuming all manufacturers do the same, that may explain some or all of this observation.
Note that the engineers hadn't quite succeeded in the second region, as the 2021 failure rate for years 4 and 5 was noticeably higher than for younger drives. But they had succeeded in pushing the major increase in failures out beyond 5.5 years. Everything before that was less than 4% Annual Failure Rate (AFR).
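
For readers unfamiliar with the metric, an AFR of this kind is conventionally computed as failures per drive-year of service, expressed as a percentage. The following is a minimal sketch under that assumption; the function name is hypothetical, and the only real figure used is the 14,220 aggregate drive days quoted above from 2017.

```python
def annualized_failure_rate(failures, drive_days):
    """AFR in percent: failures divided by drive-years of service, times 100.

    Assumes the conventional definition (drive-years = drive_days / 365).
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return 100.0 * failures / drive_years


if __name__ == "__main__":
    # The 2017 observation: 14,220 aggregate drive days, no failures.
    print(annualized_failure_rate(0, 14_220))
    # Hypothetical: what a single failure in that small sample would have meant.
    print(round(annualized_failure_rate(1, 14_220), 2))
```

The hypothetical second call shows why the 2017 data was described as "very limited": with fewer than 40 drive-years accumulated, one failure would have moved the AFR from zero to about 2.6%.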

Source
Another four years later, in the current graph, things have changed quite a bit:
  • For the first 5 years the AFR is under 2%.
  • For the first 5 years the AFR gradually increases; the first region has been completely eliminated.
  • For the three years after a 5-year warranty expires, the AFR is under 3%.
  • For the next two years the AFR drops, bottoming out around 1%.
  • Everything out to 10 years has an AFR under 3%.
  • The third region starts with a spike at 10.5 years.
  • Even at 10.5 years the AFR is only just over 4%.
Source
Backblaze's final graph puts all three sets of data in one graph and shows a dramatic improvement in drive longevity:
  • Drives in 2013 were dramatically less reliable than in later years, both because their AFRs were consistently higher and because their AFR hit 13% after 3 years.
  • By 2021 the engineers had kept the AFR around 2% through the warranty, but the drives wore out rapidly in the 7th year.
  • Now, the drives are only showing signs of beginning to wear out at 10.5 years.
That is some serious engineering at the long end! And at the short end things are great:
we see that the drive failure rates on the front end of the curve are also incredibly low—when a drive is between zero and one years old, we barely crack 1.30% AFR.
The left side of the bathtub is really gone, improving the manufacturers' margins. But if I eyeball the 2025 graph's first 20 quarters, I estimate the AFR averages 1.6%, which implies that over the 5-year warranty about 8% of the drives failed. Clearly, the engineers still have work to do.
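
The 8% figure is the simple product of the eyeballed 1.6% average AFR and the 5 warranty years. A quick check, assuming a constant AFR and that failed drives are not replaced, shows that compounding the annual survival rate gives a slightly lower number. The function names are hypothetical and the 1.6% input is the eyeballed estimate, not a measured value.

```python
def simple_cumulative_failure(afr_percent, years):
    """Linear approximation: AFR times the number of years, in percent."""
    return afr_percent * years


def compounded_cumulative_failure(afr_percent, years):
    """Percent of an initial cohort failed after `years`, assuming each
    surviving drive has the same constant annual failure probability."""
    annual_survival = 1.0 - afr_percent / 100.0
    return 100.0 * (1.0 - annual_survival ** years)


if __name__ == "__main__":
    afr = 1.6  # eyeballed average AFR over the first 20 quarters
    print(f"simple:     {simple_cumulative_failure(afr, 5):.2f}%")
    print(f"compounded: {compounded_cumulative_failure(afr, 5):.2f}%")
```

At failure rates this low the two estimates differ by only about a quarter of a percentage point, so the round 8% is a reasonable summary.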

Remember to check back in 2029 when Backblaze plans to return to this issue.
