DSHR's Blog: Threats to stored data

Recently there's been a lively series of exchanges on the pasig-discuss mailing list, sparked by an inquiry from Jeanne Kramer-Smyth of the World Bank about any additional risks posed by media, such as disks, that perform encryption or compression. It morphed into a discussion of the "how many copies" question and related issues. Below the fold, my reflections on the discussion.

The initial question was pretty clearly based on a misunderstanding of the way self-encrypting disk drives (SED) and hardware compression in tape drives work. Quoting the Wikipedia article Hardware-based full disk encryption:

The drive except for bootup authentication operates just like any drive with no degradation in performance.

The encrypted data is never visible outside the drive, and the same is true for the compressed data on tape. So as far as systems using them are concerned, whether the drive encrypts or not is irrelevant. Unlike disk, tape capacities are quoted assuming compression is enabled. If your data is already compressed, you likely get no benefit from the drive's compression.

SEDs have one additional failure mode compared with regular drives: they support a crypto erase command that renders the data inaccessible. As far as the data is concerned, the effect is the same as a major head crash. Archival systems that fail if a head crashes are useless, so they must be designed to survive total loss of the data on a drive. There is thus no reason not to use self-encrypting drives, and many reasons why one might want to.

But note that their use does not mean there is no reason for the system to encrypt the data sent to the drive. Depending on your threat model, encrypting data at rest may be a good idea. Relying on the media to do it for you, and thus not knowing whether or how it is being done, may not be an adequate threat mitigation.

Then the discussion broadened but, as usual, it was confusing: it was about protecting data from loss, yet it was not based on explicit statements of what the threats to the data were, other than bit-rot.

There was some discussion of the "how many copies do we need to be safe?" question. Several people pointed to research that constructed models to answer this question. I responded:

Models claiming to estimate loss probability from replication factor, whether true replication or erasure coding, are wildly optimistic and should be treated with great suspicion. There are three reasons:

The models are built on models of underlying failures. The data on which these failure models are typically based are (a) based on manufacturers' reliability claims, and (b) ignore failures upstream of the media. Much research shows that actual failures in the field are (a) vastly more likely than manufacturers' claims, and (b) more likely to be caused by system components other than the media.

The models almost always assume that the failures are un-correlated, because modeling correlated failures is much more difficult, and requires much more data than un-correlated failures. In practice it has been known for decades that failures in storage systems are significantly correlated. Correlations among failures greatly raise the probability of data loss.

The models ignore almost all the important threats, since they are hard to quantify and highly correlated. Examples include operator error, internal or external attack, and natural disaster.
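To make the effect of correlation concrete, here is a toy calculation, not one of the models from the discussion. It assumes each replica is lost independently with probability `p_independent` per period, and adds a single hypothetical common-cause event (operator error, attack, disaster) with probability `p_common` that destroys every replica at once. The function name and all the numbers are invented for illustration; in practice `p_common` is exactly the quantity that cannot be measured.

```python
def loss_probability(n_replicas: int, p_independent: float,
                     p_common: float = 0.0) -> float:
    """Probability that every replica is lost in one period.

    Toy assumption: a common-cause event (probability p_common) destroys
    all replicas together; otherwise replicas fail independently.
    """
    all_fail_independently = p_independent ** n_replicas
    return p_common + (1.0 - p_common) * all_fail_independently


if __name__ == "__main__":
    independent_only = loss_probability(3, 0.01)
    with_common_cause = loss_probability(3, 0.01, p_common=0.001)
    print(f"independent failures only: {independent_only:.2e}")
    print(f"with a 0.1% common cause:  {with_common_cause:.2e}")
    print(f"ratio: {with_common_cause / independent_only:.0f}x")
```

With these made-up inputs, a common cause that strikes only one period in a thousand raises the loss probability roughly a thousand-fold over the independent-failure estimate, and adding more replicas barely helps because the common-cause term does not shrink.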

For replicated systems, three replicas is the absolute minimum IF your threat model excludes all external or internal attacks. Otherwise four (see Byzantine Fault Tolerance).

For (k of n) erasure coded systems the absolute minimum is three sites arranged so that k shards can be obtained from any two sites. This is because shards in a single site are subject to correlated failures (e.g. earthquake).
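The placement rule for erasure-coded systems can be checked mechanically. A minimal sketch, assuming a hypothetical layout described as a mapping from site name to the set of shard numbers held there; the function name and the example layouts are mine, not from any particular storage system.

```python
from itertools import combinations


def any_two_sites_suffice(placement: dict[str, set[int]], k: int) -> bool:
    """True if every pair of sites together holds at least k distinct shards."""
    if len(placement) < 3:
        return False  # the minimum stated above is three sites
    return all(
        len(placement[a] | placement[b]) >= k
        for a, b in combinations(placement, 2)
    )


if __name__ == "__main__":
    # Hypothetical 4-of-6 code: six shards, any four reconstruct the data.
    spread = {"site_a": {0, 1}, "site_b": {2, 3}, "site_c": {4, 5}}
    lopsided = {"site_a": {0, 1, 2, 3}, "site_b": {4}, "site_c": {5}}
    print(any_two_sites_suffice(spread, k=4))
    print(any_two_sites_suffice(lopsided, k=4))
```

The evenly spread layout passes. The lopsided one fails, because losing `site_a` leaves only two shards at the remaining sites.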

This is a question I blogged about in 2016, 2011, and 2010, when I concluded:

The number of copies needed cannot be discussed except in the context of a specific threat model.

The important threats are not amenable to quantitative modeling.

Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".

In the discussion Matthew Addis of Arkivum made some excellent points, and pointed to two interesting reports:

A report from the PrestoPRIME project. He wrote:

There’s some examples of the effects that bit-flips and other data corruptions have on compressed AV content in a report from the PrestoPRIME project. There’s some links in there to work by Heydegger and others, e.g. impact of bit errors on JPEG2000. The report mainly covers AV, but there are some references in there about other compressed file formats, e.g. work by CERN on problems opening zips after bit-errors. See page 57 onwards.

A report from the EU's DAVID project. He wrote:

This was followed up by work in the DAVID project that did a more extensive survey of how AV content gets corrupted in practice within big AV archives. Note that bit-errors from storage, a.k.a bit rot was not a significant issue, well not compared with all the other problems!

Matthew wrote the 2010 PrestoPRIME report, building on, among others, Heydegger's 2008 and 2009 work on the effects of flipping bits in compressed files (both links are paywalled, but the 2008 paper is available via the Wayback Machine). The 2013 DAVID report concluded:

It was acknowledged that some rare cases or corruptions might have been explained by the occurrence of bit rot, but the importance and the risk of this phenomenon was at the present time much lower than any other possible causes of content losses.

On the other hand, they were clear that:

Human errors are a major cause of concern. It can be argued that most of the other categories may also be caused by human errors (e.g. poor code, incomplete checking...), but we will concentrate here on direct human errors. In any complex system, operators have to be in charge. They have to perform essential tasks, maintaining the system in operation, checking that resources are sufficient to face unexpected conditions, and recovering the problems that can arise. However vigilant an operator is, he will always make errors, usually without consequence, but sometimes for the worst. The list is virtually endless, but one can cite:

Removing more files than wanted

Removing files in the wrong folder

Pulling out from a RAID a working disk instead of the faulty one

Copying and editing a configuration file, not changing all the necessary parameters

Editing a configuration file into a bad one, having no backup

Corrupting a database

Dropping a data tape / a hard disk drive

Introducing an adjustment with unexpected consequences

Replacing a correct file or setup from a wrong backup.

Such errors have the potential for affecting durably the performances of a system, and are not always reversible. In addition, the risk of error is increased by the stress introduced by urgency, e.g. when trying to make some room on in storage facilities approaching saturation, or introducing further errors when trying to recover using backup copies.

We agree, and have been saying so since at least 2005. And the evidence keeps rolling in. For example, on January 31st GitLab.com suffered a major data loss. Simon Sharwood at The Register wrote:

Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. ... Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.

Commendably, GitLab made a Google Doc public with a lot of detail about the problem and their efforts to mitigate it:

LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage

Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.

The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost

The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented

SH: We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.

Our backups to S3 apparently don’t work either: the bucket is empty

We don’t have solid alerting/paging for when backups fails, we are seeing this in the dev host too now.

So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. => we're now restoring a backup from 6 hours ago that worked
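Several of the failures in that list (dumps only a few bytes in size, an empty bucket, no alerting when backups fail) are the kind a routine check could surface. A minimal sketch, assuming backups land as files in a local directory; the function name, the thresholds, and the directory layout are hypothetical, and none of this is GitLab's tooling. A check like this shows only that a plausible file exists, not that it can be restored.

```python
import os
import tempfile
import time
from typing import List, Optional


def backup_problems(directory: str, min_bytes: int, max_age_seconds: float,
                    now: Optional[float] = None) -> List[str]:
    """Return human-readable problems with the newest backup in directory."""
    now = time.time() if now is None else now
    files = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    if not files:
        return ["no backup files found"]
    newest = max(files, key=os.path.getmtime)
    info = os.stat(newest)
    problems = []
    if info.st_size < min_bytes:
        problems.append(f"{newest}: only {info.st_size} bytes")
    if now - info.st_mtime > max_age_seconds:
        problems.append(f"{newest}: older than {max_age_seconds} seconds")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as backup_dir:
        # Simulate a dump that "succeeded" but wrote almost nothing.
        with open(os.path.join(backup_dir, "dump.sql"), "wb") as dump:
            dump.write(b"tiny")
        for problem in backup_problems(backup_dir, min_bytes=1024,
                                       max_age_seconds=24 * 3600):
            print("ALERT:", problem)
```

The point of returning a list rather than a boolean is that whatever runs the check can page someone with the specific reason.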

The operator error revealed the kind of confusion and gradual decay of infrastructure processes that is common when procedures are used only to recover from failures, not as a routine. Backups that are not routinely restored are unlikely to work when you need them. The take-away is that any time you reach for the backups, you're likely already in big enough trouble that your backups can't fix it. I was taught this lesson in the 1970s. The early Unix dump command failed to check the return value from the write() call. If you forgot to write-enable the tape by inserting the write ring, the dump would appear to succeed and the tape would look like it was spinning, but no data would be written to the backup tape.
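The dump bug is easy to reproduce in spirit: a write that stores fewer bytes than requested is noticed only if the caller looks at the result. A minimal sketch in Python, where `os.write` returns the number of bytes actually written and raises `OSError` on outright failure; `write_all` and `backup_and_verify` are hypothetical helpers, and the read-back comparison is a stand-in for a real restore test, not a replacement for one.

```python
import hashlib
import os
import tempfile


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of data to fd, failing loudly if a write makes no progress."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # may be a short write
        if written <= 0:
            raise OSError("write() made no progress; is the medium writable?")
        view = view[written:]


def backup_and_verify(data: bytes, path: str) -> str:
    """Write data to path, read it back, and return its SHA-256 if it matches."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        write_all(fd, data)
        os.fsync(fd)  # ask the OS to push the data to the device
    finally:
        os.close(fd)
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "rb") as stored:
        actual = hashlib.sha256(stored.read()).hexdigest()
    if actual != expected:
        raise OSError(f"read-back of {path} does not match what was written")
    return actual


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as target_dir:
        digest = backup_and_verify(b"important records\n",
                                   os.path.join(target_dir, "backup.bin"))
        print("verified backup, sha256 =", digest)
```

Reading back immediately can be satisfied from the operating system's cache, so this catches the "nothing was written" class of failure but says little about the medium itself.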

Fault injection should be, but rarely is, practiced at all levels of the system. The results of not doing so are shown by UW Madison's work injecting faults into file systems and distributed storage. My blog posts on this topic include Injecting Faults in Distributed Storage, More bad news on storage reliability, and Forcing Frequent Failures.
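As a small illustration of the idea, and not the UW Madison methodology, the sketch below injects a single bit flip into one stored copy and checks that a checksum audit notices. The helper names and the in-memory "replicas" are hypothetical; a real exercise would inject faults into the actual storage stack rather than into Python byte strings.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (the injected fault)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def audit(replicas: dict[str, bytes], expected_sha256: str) -> list[str]:
    """Return the names of replicas whose SHA-256 differs from the expected value."""
    return [
        name for name, content in replicas.items()
        if hashlib.sha256(content).hexdigest() != expected_sha256
    ]


if __name__ == "__main__":
    original = b"content worth preserving"
    expected = hashlib.sha256(original).hexdigest()
    replicas = {"copy_1": original, "copy_2": original, "copy_3": original}
    replicas["copy_2"] = flip_bit(original, 42)  # inject the fault
    print("damaged replicas:", audit(replicas, expected))
```

If the audit reports anything other than `copy_2` alone, the detection path is broken, which is precisely what the exercise is meant to find out before a real fault does.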

Update: much as I love Kyoto, as a retiree I can't afford to attend iPRES2017. Apparently, there's a panel being proposed on the "bare minimum" for digital preservation. If I were on this panel I'd be saying something like the following.

We know the shape of the graph of loss probability against cost - it starts at one at zero cost and is an S-curve that gets to zero at infinite cost. Unfortunately, because the major threats to stored data are not amenable to quantitative modeling (see above), and technologies differ in their cost-effectiveness, we cannot actually plot the graph. So there are no hard-and-fast answers.

The real debate here is how to distinguish between "digital storage" and "digital preservation". We do have a hard-and-fast answer for this. There are three levels of certification: the Data Seal of Approval (DSA), NESTOR's DIN 31644, and TRAC/ISO 16363. If you can't even pass DSA then what you're doing can't be called digital preservation.

Especially in the current difficult funding situation, it is important NOT to give the impression that we can "preserve" digital information with ever-decreasing resources, because then what we will get is ever-decreasing resources. Because there will always be someone willing to claim that they can do the job cheaper. Their short-cuts won't be exposed until it's too late. That's why certification is important.

We need to be able to say "I'm sorry, but preserving this stuff costs this much. Less money, no preservation, just storage."

Thursday, March 23, 2017