Tuesday, September 11, 2018

What Does Data "Durability" Mean

In What Does 11 Nines of Durability Really Mean? David Friend writes:
No amount of nines can prevent data loss.

There is one very important and inconvenient truth about reliability: Two-thirds of all data loss has nothing to do with hardware failure.

The real culprits are a combination of human error, viruses, bugs in application software, and malicious employees or intruders. Almost everyone has accidentally erased or overwritten a file. Even if your cloud storage had one million nines of durability, it can’t protect you from human error.
Friend may be right that these are the top 5 causes of data loss, but over the timescale of preservation as opposed to storage they are far from the only ones. In Requirements for Digital Preservation Systems: A Bottom-Up Approach we listed 13 of them. Below the fold, some discussion of the meaning and usefulness of durability claims.

Having a number to distinguish your storage service from the competition, and to illustrate how much more cost-effective yours is than theirs, turns out to be a marketing necessity. So we get the claims Friend set out in this table, based on this assumption:
At Wasabi, we store billions of “objects,” or files that customers have sent us. On average, files are about 800 KB in size. So if your organization is storing 1 PB of data, it’s likely that you have something like 1.2 billion objects.
Note that the table is out-of-date. Based on improved code and better drive reliability, B2 is now also claiming 11 nines of durability.

The first thing to note is that durability is defined in terms of the probability that an "object" is lost. Friend assumes that an "object" is 800KB. The amount of data lost in any one loss event therefore depends on how big the objects are. To take it to an extreme, if you stored a single Petabyte object in S3 RRS, the claim means you'd lose it on average once in 10,000 years, which is clearly not realistic.
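To make the dependence on object count concrete, here is a minimal sketch. It assumes that "N nines" means each object is lost independently with annual probability 10^-N, which is one plausible reading of the vendors' claims and not something they confirm. The function name loss_profile, the 1.2 billion object count (taken from Friend's quote), and the 4 nines used for S3 RRS are illustrative inputs.

```python
def loss_profile(total_bytes, n_objects, nines):
    """Expected annual losses for a store of n_objects.

    Assumption: each object is lost independently with annual
    probability 10**-nines. This is a reading of the durability
    claim, not a vendor-supplied model.
    """
    p_loss = 10.0 ** -nines
    objects_lost_per_year = n_objects * p_loss
    return {
        "objects_lost_per_year": objects_lost_per_year,
        "years_between_losses": 1.0 / objects_lost_per_year,
        "bytes_per_lost_object": total_bytes / n_objects,
    }


PB = 10 ** 15

# 1 PB held as roughly 1.2 billion objects at 11 nines (Friend's example).
many_small = loss_profile(PB, 1.2e9, 11)

# 1 PB held as a single object at 4 nines (the S3 RRS extreme above).
one_huge = loss_profile(PB, 1, 4)

for name, profile in (("many small", many_small), ("one huge", one_huge)):
    print(name, profile)
```

Under these assumptions the many-small-objects store loses an object of under a megabyte every 80 years or so, while the single-object store loses the whole petabyte about once in 10,000 years. The same kind of headline number describes very different outcomes.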

The second thing to note is that only Backblaze reveals how they arrive at their number of nines, and that their methodology considers only hardware failures, and in fact only whole-drive failures. This is understandable; they have excellent data compiled over many years on drive failures. Developing a model that uses this data to predict data loss due to drive failures is tractable. Modeling "human error" or "theft" is not tractable, and even if it were there is much less data of much lower quality to drive the model.

The third thing to note is that, even if we accept that Friend's 5 causes are the only reasons for data loss, hardware failure is only 1/3 of them. Let's suppose that a service has durability against hardware failures of 8 nines, i.e. a probability of loss from hardware failure of 10^-8. If hardware is 1/3 of the problem, its probability of loss from all causes is 3*10^-8. Now suppose we improve its durability against hardware failures to 11 nines. Its probability of loss from all causes is now (2 + 10^-3)*10^-8. Three orders of magnitude change in hardware durability has made a 33% improvement in overall durability.
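The arithmetic can be checked with a few lines of Python. This is only a sketch under the assumptions in the paragraph above: hardware accounts for one third of losses at 8 nines, the other causes are unaffected by the hardware improvement, and the individual loss probabilities are small enough to simply add. The names overall_loss and OTHER_CAUSES are made up for the illustration.

```python
def overall_loss(hardware_nines, other_causes):
    """Total loss probability: hardware plus everything else.

    Assumption: the probabilities are small enough that adding them
    is a good approximation to combining independent causes.
    """
    return 10.0 ** -hardware_nines + other_causes


# If hardware at 8 nines (1e-8) is one third of all loss,
# the remaining causes contribute twice that.
OTHER_CAUSES = 2e-8

before = overall_loss(8, OTHER_CAUSES)   # 3 * 10^-8
after = overall_loss(11, OTHER_CAUSES)   # (2 + 10^-3) * 10^-8
improvement = (before - after) / before

print(f"before={before:.4e} after={after:.4e} improvement={improvement:.1%}")
```

The improvement comes out at about a third, and no further hardware nines can push it past that, because the other causes set a floor of 2*10^-8.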

In Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter Brian Wilson wrote:
Yes, our nines go to 11. Where is that point? That’s open for debate. But somewhere around the 8th nine we start moving from practical to purely academic. Why? Because at these probability levels, it’s far more likely that:
  • An armed conflict takes out data center(s).
  • Earthquakes / floods / pests / or other events known as “Acts of God” destroy multiple data centers.
  • There’s a prolonged billing problem and your account data is deleted.
That last one is particularly interesting. Any vendor selling cloud storage relies on billing its customers. If a customer stops paying, after some grace period, the vendor will delete the data to free up space for a paying customer.

Some customers pay by credit card. We don’t have the math behind it, but we believe there’s a greater than 1 in a million chance that the following events could occur:
  • You change your credit card provider. The credit card on file is invalid when the vendor tries to bill it.
  • Your email service provider thinks billing emails are SPAM. You don’t see the emails coming from your vendor saying there is a problem.
  • You do not answer phone calls from numbers you do not recognize; Customer Support is trying to call you from a blocked number; they are trying to leave voicemails but the mailbox is full.
If all those things are true, it’s possible that your data gets deleted simply because the system is operating as designed.
Thus Backblaze believes that the probability of losing all your objects due to billing problems is more than 10^-6, which makes the difference between 8 and 11 nines of durability of a single object irrelevant.

In Mmm, yes. 11-nines data durability? Mmmm, that sounds good. Except it's virtually meaningless Chris Mellor wrote:
There are two general ways to lengthen the data durability time. The first is to use algorithms, along with extra information about the data, to detect corruption and restore files and objects if some portions are lost to bit rot. Erasure coding is one such method. Reed-Solomon coding is another.

The second way is to store multiple copies of the data across multiple locations, allowing you to overcome individual drive and array failures all the way to data centers being flooded, torched by rioters, shattered by earthquakes, or eating a nuke. This is redundancy.
As I understand it, Amazon's S3 is redundant across three data centers, whereas Backblaze's B2 is not. Thus, despite both quoting 11 nines of durability against hardware failures, S3 is durable against failures that B2 is not, and is thus better. Mellor concludes:
It's unlikely Amazon, Azure, and Google will reveal the basis of their data durability calculations just because minnow Backblaze shook a stick at them this week. The moral is that we're not necessarily comparing apples and oranges when looking at costs for 11 nines data durability from cloud storage providers. Sup their data with a long spoon.


Paul McJones said...

The link https://www.blogger.com/XXX ("What Does 11 Nines of Durability Really Mean?") doesn't seem right.

ranti said...

The link to the "What Does 11 Nines of Durability Really Mean?" points to a non-existent web.

Is this the one? https://wasabi.com/blog/11-nines-durability/

David. said...

My bad - thank you both - now fixed.

Warren Myers said...

Yet more positive reason to diversify your vendor base

I love Backblaze, GCP, etc - but to rely on just one of them seems monumentally foolish, if you have anything of value running/stored therein.