Thursday, June 9, 2022

Backblaze On Hard Disk Reliability

It has been a long time since I blogged about the invaluable hard drive reliability data that Backblaze has been publishing quarterly since 2015, so I checked their blog and found Andy Klein's Star Wars themed Backblaze Drive Stats for Q1 2022, as well as his fascinating How Long Do Disk Drives Last?. Below the fold I comment on both.

Apart from the appropriate Star Wars quotes Klein uses as section headings, there are a number of interesting tidbits in the Q1 2022 stats. First, a tribute to the disk vendors' "kaizen" (continuous improvement) process:
The lifetime annualized failure rate for all the drives listed above is 1.39%. That was down from 1.40% at the end of 2021. One year ago (3/31/2021), the lifetime AFR was 1.49%.
It may not sound like much, but improving the reliability of an already incredibly reliable product by 0.1 percentage points per year is a significant achievement.

Second, when I say "incredibly reliable" I mean something like this:
The 6TB Seagate (model: ST6000DX000) continues to defy time with zero failures during Q1 2022 despite an average age of nearly seven years (83.7 months). 98% of the drives (859) were installed within the same two-week period back in Q1 2015.
Only 86 drives out of a total of 886 have failed in nearly seven years.
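
As a back-of-the-envelope check on those figures, here is a minimal sketch in Python. It is not Backblaze's method: their AFR is computed from drive days, which the post does not quote for this model, so the sketch assumes every drive ran for the full average age. Because failed drives actually stopped accumulating days early, this crude rate will somewhat understate the true AFR.

```python
# Rough check of the ST6000DX000 numbers quoted above.
failed = 86             # drives that have failed so far (from the post)
total = 886             # total drives of this model (from the post)
avg_age_months = 83.7   # average age quoted by Klein

cumulative_failure = failed / total
years = avg_age_months / 12

# Assumption: every drive, failed or not, was in service for the full
# average age. Backblaze instead divides failures by drive days / 365.
crude_annual_rate = cumulative_failure / years

print(f"cumulative failures: {cumulative_failure:.1%}")
print(f"crude annual rate:   {crude_annual_rate:.2%}")
```

Under that assumption roughly one drive in ten has failed in seven years, or about 1.4% per year, which is in line with the fleet-wide lifetime AFR quoted earlier.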

Third, Klein's innovation of two forms of a quadrant chart:
Each point on the Drive Stats Failure Square represents a hard drive model in operation in our environment as of 3/31/2022 and lies at the intersection of the average age of that model and the annualized failure rate of that model. We only included drive models with a lifetime total of one million drive days or with a confidence interval of all drive models included being 0.6 or less.
Klein describes each quadrant thus:
  1. Retirees are drives that are no longer reliable and should be replaced.
  2. Winners are drives that have performed well for a long time.
  3. Challengers are drives that are currently performing well but are still young.
  4. Muddlers are young drives that are performing less well.
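
The four quadrants amount to a two-way split on average age and AFR, which can be sketched as below. The split points (`AGE_SPLIT_MONTHS`, `AFR_SPLIT_PERCENT`) and the function name are hypothetical, since nothing quoted here says where Klein draws the axes.

```python
# Hypothetical thresholds; the quoted text does not give the real ones.
AGE_SPLIT_MONTHS = 48.0
AFR_SPLIT_PERCENT = 2.0


def quadrant(avg_age_months: float, afr_percent: float) -> str:
    """Classify a drive model by average age and annualized failure rate."""
    old = avg_age_months >= AGE_SPLIT_MONTHS
    reliable = afr_percent < AFR_SPLIT_PERCENT
    if old:
        return "Winner" if reliable else "Retiree"
    return "Challenger" if reliable else "Muddler"


# 83.7 months is the age quoted for the 6TB Seagate; the AFR is made up.
print(quadrant(83.7, 0.9))
```

With these assumed thresholds, an old model with a low AFR such as the 6TB Seagate lands among the Winners.
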
Even more interesting is Klein's second version of the quadrant chart, featuring only the "Winners":
Each drive model is represented by a snake-like line (Snakes on a plane!?) which shows the AFR of the drive model as the average age of the fleet increased over time.
This chart is extremely informative:
Interestingly, each of the six models currently in quadrant II has a different backstory. For example, who could have predicted that the 6TB Seagate drive (model: ST6000DX000) would have ended up in the Winners quadrant given its less than auspicious start in 2015. And that drive was not alone; the 8TB Seagate drives (models: ST8000NM0055 and ST8000DM002) experienced the same behavior.

This chart can also give us a visual clue as to the direction of the annualized failure rate over time for a given drive model. For example, the 10TB Seagate drive seems more interested in moving into the Retiree quadrant over the next quarter or so and as such its replacement priority could be increased.
Last December Klein posted How Long Do Disk Drives Last?, updating a version posted in 2013:
The initial drive life study was done with 25,000 disk drives and about four years of data. Today’s study includes data from over 200,000 disk drives, many of which have survived six years and longer. This gives us more data to review and lets us extend our projections. For example, in our original report we reported that 78% of the drives we purchased were living longer than four years. Today, about 90% of the drives we own have lasted four years and 65% are living longer than six years. So how long do drives last? Keep reading.
What Klein wants to figure out is the half-life of the drive:
The number we should be able to compute is the median lifespan of a new drive. That is the age at which half of the drives fail. Let’s see how close we can get to predicting the median lifespan of a new drive given all the data we’ve collected over the years.
Klein plotted the survival rate, the proportion of drives still alive, against the age of the drives. He noted that:
The life expectancy decreases at a fairly stable rate of 2% to 2.5% a year for the first four years, then the decrease begins to accelerate. Looking back at the AFR by quarter chart above, this makes sense as the failure rate increases beginning in year four. After six years we end up with a life expectancy of 65%. Stated another way, if we bought a hard drive six years ago, there is a 65% chance it is still alive today.

Klein then used the data to project beyond six years, which is the limit of the statistically significant data they have:
What happens to drives when they’re older than six years? We do have drives that are older than six years, so why did we stop there? We didn’t have enough data to be confident beyond six years as the number of drives drops off at that point and becomes composed almost entirely of one or two drive models versus a diverse selection. Instead, we used the data we had through six years and extrapolated from the life expectancy line to estimate the point at which half the drives will have died.

How long do drives last? It would appear a reasonable estimate of the median life expectancy is six years and nine months.
All of this is another tribute to the engineers. The failure rate, the slope of the graph, is low until the drive warranty expires, and then increases. This (a) decreases the vendors' warranty costs, and (b) implements planned obsolescence, motivating the drives' replacement and generating income for the vendor. Thus the economics suggest that drive life will probably remain stable into the future, although AFR during the first 4-5 years is likely to continue its slow decline, making the break in the slope of the graph sharper.
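
To get a feel for the extrapolation, here is a minimal sketch that runs a straight line through two survival figures quoted above (about 90% at four years, 65% at six) and finds where it crosses 50%. This is not Klein's method, which follows the whole life expectancy curve; the function name and the linear assumption are mine.

```python
def median_life_linear(t1: float, s1: float, t2: float, s2: float,
                       target: float = 0.5) -> float:
    """Extend the line through (t1, s1) and (t2, s2) to the age at which
    the surviving fraction reaches `target`."""
    slope = (s2 - s1) / (t2 - t1)  # change in survival per year
    if slope >= 0:
        raise ValueError("survival must be decreasing to reach the target")
    return t2 + (target - s2) / slope


# Survival fractions taken from the quoted text: 90% at 4 years, 65% at 6.
median_years = median_life_linear(4.0, 0.90, 6.0, 0.65)
print(f"{median_years:.1f} years")
```

The straight line crosses 50% at about 7.2 years, a little later than Klein's six years and nine months. That is what one would expect if the failure rate keeps accelerating after year six rather than holding at its year-four-to-six average.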
