Tuesday, March 15, 2011

How Few Copies?

I spoke at the Screening the Future 2011 conference at the Netherlands Beeld en Geluid in Hilversum, on the subject of "How Few Copies?". Below the fold is an edited text of the talk with links to the resources.

The program I work on at the Stanford Libraries is called LOCKSS, for Lots Of Copies Keep Stuff Safe. The goal is to allow libraries to continue in the digital world the model that worked for centuries in the paper world; libraries pay money to publishers and, in return, get to own a copy of the publisher's copyright material. It is theirs to keep and use, subject to long-established restrictions that prevent purchasers, for example, copying their copy and selling the copies. The advent of the Web allowed publishers to force a change on libraries; on the Web their money didn't buy a copy, it merely rented access to the publisher's copy. The LOCKSS technology provides libraries with a digital equivalent of the stacks, where each library can keep its own copy of the material that it purchased from the digital publisher.

When we designed the LOCKSS system 12 years ago, the major problems we had to solve were not technical but legal. In order for a library to keep a copy of the book or journal for which they had paid the publisher in their LOCKSS stacks, copyright law required that the publisher grant them permission. Thus it was important that the system was simple, familiar and reassuring to publishers.

Making the system work the way that the paper system did achieved these goals. Money flowed from each library to each publisher. In return, a copy of each publisher's content flowed from that publisher to that library. That library could use its copy for its own readers but not for others.

Each library had its own copy so, for popular works such as the New England Journal of Medicine, the system had many, many copies. This architecture meant that the technical question posed by the LOCKSS design was one that hadn't been asked previously; what could we do with this abundance of copies to make the individual copies resist loss or damage?

Note that because each LOCKSS box collects content independently, and then audits the content against the other LOCKSS boxes with the same content, we have the kind of automated authenticity check discussed earlier by Howard Besser and Seamus Ross, albeit only for static Web content.

People often ask us, isn't it wasteful to have so many copies? It is true that a system with fewer copies would spend less in total on disks. But it would spend more in total on lawyers. Lawyers are much more expensive than disks. 5 lawyer/hours currently buys enough disk to store the entire corpus of academic journals.

Fortunately for us, e-journals and e-books are a very efficient way of encoding culture to transmit it to future generations. Other genres require less efficient encodings. Digitized books, audio and video are much larger per unit of culture. Enough disk to hold two million feature movies, an estimate of the total corpus, would cost about 1000 lawyer/hours (half a lawyer/year) even if movie industry lawyers were as cheap as book industry lawyers.

Digital preservation systems for these types of content can't afford to solve their legal problems buy throwing hardware at them. They have to throw money at the lawyers, which leaves them asking a different question, the question this talk addresses. Not "we have lots of copies, what can we do with them?" but "how few copies can we get away with?.

The need to back up your computer is drilled in to every recruit to cyberspace. Ignoring the lesson is painful and expensive, although Cathy Marshall's research shows that many people accept the pain and expense fatalistically. So you would think that there would be a ready answer to the question "how few backup copies do I need to be safe?". Last November I published a paper in Communications of the ACM that explains in detail why there is no ready answer to this question; here is the essence of the explanation.

Digital preservation is an engineering problem. Engineers operate in the real world; they can never achieve perfection. Suppose you have a Petabyte of data that you want to keep for 100 years, and have a 50% probability that every bit will survive undamaged. Think of each of those bits as being like a radioactive atom that can decay randomly with a given half-life. The half-life you need to meet your goal is about 60 million times the age of the universe. You can't measure that half-life.

Another way to look at it is that you need 18 nines of reliability. The most reliable systems we know how to build are perhaps 9 nines of reliability. We need a system a billion times more reliable than the most reliable systems we know how to build.

So, stuff is going to get lost and damaged. We can spend more to keep more copies, or be more careful with each copy, and hopefully lose less stuff. Its a trade-off. In theory, for each technology there is a curve like this, relating the number of copies to the probability of loss. Each additional copy improves reliability, subject to the law of diminishing returns.

In theory, for each number of copies there is a curve like this, relating the cost per copy to the probability of loss. Each additional dollar improves reliability, subject to the law of diminishing returns.

What we'd like to do is to put these two sets of graphs together, like this. Then, given a target loss probability, we could work out the most cost-effective technology and replication factor. Or, given a budget, we could work out the best achievable loss probability.

Why don't we have such a graph? There are two main reasons. The first is that, although they aren't nearly reliable enough, storage systems are very, very reliable. To measure a system's loss rate involves looking at a huge amount of data for a long time, so gathering the data for a graph like this is way too expensive to be feasible. The second is that the loss rate depends critically on the details of the internal design and implementation of every component of every system. We have to do this infeasibly expensive experiment for every single system, we can't do it once and generalize.

In fact, vendors don't even try to do the experiment once. Instead, they build mathematical models of their systems and simulate them to project what they think the reliability of their system should be. These simulations produce results that are ridiculously optimistic; typically claiming that there is almost no possibility that any of their systems will ever lose any data. There are five main reasons why these simulations are useless:

  • they don't include most of the causes of data loss,
  • they don't include correlations between these causes,
  • the people doing the simulations have a vested interest in the answer,
  • each vendor has to be at least as optimistic as their competitors,
  • vendors have no liability for loss or damage to the stored data.
What are the causes of data loss? If you ask most people, you would get a list like this:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
It's true, all these are causes of data loss. But if you ask the people who run large data centers what are the most important causes of data loss, you get a list like this:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organization Failure
The vendor's simulations include fairly realistic models of data loss in individual disk drives, but they don't have any model for insider abuse or operator error.

Why are correlations between causes of data loss important? Suppose we have two replicas, and at some point one replica fails. Some time after that, the failure is detected and a repair process is initiated, copying from the other replica. If the second replica fails after the initial failure and before the copy is completed the data will be lost. Thus the critical number is not the probability of a replica failing, but the probability of a second replica failing during the detection and repair interval after a first failure.

Lets take a real-life example. We all know that RAID systems are much less reliable than the manufacturers claim. The reason is that the theoretical models of RAID reliability assume that disks fail randomly, the failures are not correlated. But they are. The disks of a RAID are typically all from the same batch with the same firmware and manufacturing glitches. They share the same power supply, same cooling, same vibration environment. So if one fails, there is a high probability that another will fail soon after. The failures are correlated.

And this goes for all causes. Suppose the power fails to your data center. When the power comes back it is likely to fail again very soon. The system administrators will be already be stressed dealing with the fallout of the first failure. Stressed operators are much more likely to make mistakes when the power fails again. There is no way to build realistic models of these failures.

We'll just have to make the best decisions we can without the graph. Even though we can't draw the exact graph, we know the shape it will have, and thus we can formulate some rules of thumb, all of which are subject to diminishing returns:
  • The more copies the safer.
  • The less correlated the copies the safer.
  • The more reliable each copy the safer.
  • The faster failures are detected and repaired the safer.
In the normal case, these rules are in descending order of importance; extra copies are more important than less correlation is more important than increased copy reliability is more important than faster detection and repair.

Especially for audio/visual material, we usually can't afford a lot of copies. So it is important to spread the few copies we can afford across diverse technology, so that the technology failures aren't correlated, and across multiple administrative domains, so that a single sleepy or paniced or disgruntled operator can't overwrite or delete all the copies. Once we've done that, we should look at increasing the reliability of each copy.

Typically, in order to provide access, at least one of your copies is going to be on disk. How reliable are disks? Remarkably reliable, but also remarkably big. This year, we are promised 4TB disk drives. Each drive contains 1/30 of an Exabit. Disk drives specify an Unrecoverable Bit Error Rate (UBER), which is typically something like 10 in an Exabit. Lets put aside skepticism about these numbers for now and look at the implication.

If these bit errors were uncorrelated, this would mean that 1 in 3 times you read the whole drive you would get an error. In practice, it is likely that the errors affect a string of adjacent bits, so we get bigger errors less often. If the average error destroyed 10 bits, then about 1 in 30 times you read the whole disk you would get an error. The sustained transfer rate is typically about a gigabit a second, so reading the whole drive takes 30K seconds. Reading the drive continuously you'd get an error every 1M seconds or one every 12 days.

Measurements of real storage system show many more errors than would be expected from the UBER of the drives. Their root causes have been traced; about half of them come from the drives and about half from the buses, controllers, caches, RAM and other components.

The problem is, you have lots of disks. Suppose you have a Petabyte on a RAID-5 configuration of 312 4TB drives. Every single time you read the Petabyte you will get about 1 10-bit error. You're going to be doing a RAID repair every time you read the data, just from the UBER of the drives and the read operation itself, let alone all the other possible causes of data loss. In practice you will be doing many RAID repairs every time you read the data.

We know in archival use we write the data once. How often do we read it? A study of real archives has shown that the vast majority of reads are internal, resulting from integrity checks and replica management. The small proportion of external reads are spread fairly evenly across the whole archive, with only modest popularity hot-spots. Incidentally, this is the reason at least one of your copies needs to be on disk; even a large cache in front of a tape robot will have a poor hit rate in archival use, both because the small proportion of external reads are hard to predict and because the large proportion of internal reads are all guaranteed to be cache misses.

This workload poses problems. We can't piggy-back the integrity checks on the reads for user access, since they are so infrequent. We want to detect errors quickly, so that they can be quickly repaired. But doing so requires reading all the data frequently, which in and of itself causes errors that need to be repaired. A similar dilemma applies to tape.

Here are figures from Vijay Gill, who runs Google's server farms, for the monthly cost of a bunch of machines in a farm. The hardware is a little over a quarter. Power, cooling and space are nearly two thirds.

The low-hanging fruit in cost reduction is power and cooling. Suppose we turn the disks off. A study has shown that at least some disks written once and stored retain data very well. There is a technology, called MAID for Massive Array of Idle Disks, that works this way. Unfortunately, spinning the disks up for external accesses and integrity checks causes wear, and leads to errors. And, again, MAID works best with a cache. The archival workload means that the vast majority of reads would miss the cache and take a delay while the disk spins up.

Powering down disks doesn't appear to be a good way to reduce total storage costs for archives, but are the alternatives? The development of solid state storage, currently flash memory but soon to have alternatives such as phase change memory and memristors, is leading in that direction. It is true that the raw media cost of these technologies is much higher than disk or tape, flash is currently about 15 times as expensive as disk. But, as a team from Carnegie-Mellon has shown with an architecture called Fast Array of Wimpy Nodes (FAWN), flash-based systems can deliver the same performance as conventional disk-based servers using two orders of magnitude less power. Against an archival workload, the savings should be even greater and flash-based systems have many other advantages:
  • No moving parts, physically robust, much smaller form factor.
  • Zero power for data retention, good data retention even at high temperature.
  • Low latency and low power for access.
  • No effect of power cycles on reliability.
To approximate the numbers from Google and also from the San Diego Supercomputer Center, say that running costs are 2/3 and hardware costs are 1/3 for disk-based systems, and their life is 3 years. Suppose that a FAWN-like architecture makes running costs insignificant, and that the life of it's nodes is 15 years. Spending 15 times as much on flash-based hardware as on disk-based hardware and amortizing over the life of the hardware would make the monthly costs the same. This analysis is oversimplified but suggests that flash, which currently costs 15 times as much as hard disk, could be competitive for archival use.

Alternatively, compare owning flash with renting storage from Amazon's S3, which charges $0.14/GB/month, or $1680/TB/year for the first Terabyte. You can currently get a Terabyte of SSD for $1600 at retail. S3 is replicating your data for reliability, but probably not more than 3 times. So, if the SSDs last more than 3 years they would be cheaper. This again indicates that flash could already be competitive with disk for long-term storage. Amazon's pricing goes down to $0.055/GB/mo over 5PB, or $660/TB/year; how long it will be before solid state memory is competitive at this scale depends on the relative performance of Kryder's and Moore's laws. My skepticism about the near-term future of Kryder's law is on record.

Because their contents are large, audio-visual archives typically can't afford enough copies, enough redundancy, to keep the rate of loss and damage as low as they would like it to be. But they face another dilemma. Audio-visual content can be effectively compressed. The more it is compressed, the more copies you can afford to keep, and the safer the content will be. But the other side of this is that compression works by eliminating redundancy within the file. Just as redundancy among the copies acts to reduce loss and damage, so redundancy within the file acts to reduce the impact of corruption. A single bit flip in an uncompressed video file will probably be invisible, a single bit flip in a motion JPEG file will probably affect only a single frame, a single bit flip in a MP4 will probably affect everything beyond it in the stream. The more you reduce internal redundancy, the bigger the effect of any corruption.

This sets up another tradeoff for which we can't provide exact answers. For example, halving the size of an audio-visual collection by compressing it allows twice as many copies to be kept for the same budget. Clearly, if before you could keep one copy, and now you can keep two, this has greatly decreased the risk of loss and damage. Even though any corruption is likely to affect more of one copy, it is likely that it can be repaired from the other. For the eventual audience, this extra safety will probably outweigh the loss in fidelity from compression.

If before you could afford 10 copies and now you can afford 20, the law of diminishing returns means that you haven't decreased the risk of loss and damage much. If your system is designed properly, the risk of loss and damage affecting all 10 copies is extremely low already, so the increase in the size of damage from a single bit flip is not a concern. Since compression hasn't changed the risk much, but it has reduced the fidelity experienced by future audiences, it isn't likely to be justified.

Compression reduces the redundancy within a single copy and increases the risk of damage. There are also techniques that increase the redundancy within a single copy and reduce the risk. Using cryptographic techniques called erasure codes it is possible to encode a file into n subfiles called shares such that from any k of them the file can be decoded. For example, suppose n is 8 and k is 5. Then any 3 of the shares can be lost or damaged without affecting the system's ability to recover the file. You can think of this as having 1.6 copies of the file. If the 8 shares are distributed among geographically diverse servers, this technique can be very effective against some of the threats we saw earlier. Against others, such as software failure, insider abuse and economic failure, it is ineffective.

One would think that the question "how few copies can I keep and still meet my reliability target?' would have a definite answer. Alas, it doesn't. Archives, especially large audio-visual archives are chronically short of money and, even if there was a definite answer, would probably not be able to afford that many copies.

Thus we can say that some digital content is going to get lost or damaged. This shouldn't be news; the same is true of analog content. We have rules of thumb to guide us in trying to reduce the amount of loss and damage:
  • The more copies the safer.
  • The less correlated the copies the safer.
  • The more reliable each copy the safer.
  • The faster failures are detected and repaired the safer.
  • The less aggressive the compression the smaller the effect of damage.


Erik Hetzner said...

A quick remark about compression. As far as I can tell you are only discussing lossy compression, which seems to me something which an organization needs to evaluate separately.

With regards to lossless compression, it doesn't make sense to me to keep data uncompressed, for reasons I outlined here: http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c

David. said...

No, the effect I discussed, that the greater the compression the greater the effect of corruption, applies to both lossy and lossless compression. Of course, lossy compression is only worth using if it compresses more than lossless compression. So this effect would be more pronounced in practice if lossy compression were to be used.

Erik Hetzner said...

The effect is the there for lossy & lossless. But whether or not an organization is willing to accept a loss in fidelity of an image or video is something they will need to consider carefully. Lossless compression is not going to involve the same considerations.

But as I outlined briefly in that linked mail message, I believe that using the natural redundancy of an uncompressed document is a poor way of exploiting redundancy to prevent errors. My intuition is that it is better to compress a document losslessly, then use an error correcting code to increase the redundancy.

Thanks for the article! I think you are correct to focus on operator error, etc., and this has something that has been very much neglected in the world of digital preservation.

Anonymous said...

This is a great article. Have you considered trying to 'publish' it somewhere (scare quotes because your blog is a kind of publishing) with a wider readership?

The Code4Lib Journal might be interested, and doesn't mind that it was on your blog first.

David. said...

The link to Heydegger's 2008 IST paper is broken but it is available from the Wayback machine so you don't have to pay Ingenta $20.