Tuesday, October 30, 2012

Forcing Frequent Failures

My co-author Mema Roussopoulos pointed me to an Extremetech report on a Harvard team's success in storing 70 billion copies of a 700KB file in about 60 grams of DNA:
Scientists have been eyeing up DNA as a potential storage medium for a long time, for three very good reasons: It’s incredibly dense (you can store one bit per base, and a base is only a few atoms large); it’s volumetric (beaker) rather than planar (hard disk); and it’s incredibly stable — where other bleeding-edge storage mediums need to be kept in sub-zero vacuums, DNA can survive for hundreds of thousands of years in a box in your garage.
I believe that DNA is a better long-term medium than, for example, the "stone" DVDs announced last year. The reason is that "lots of copies keep stuff safe", and with DNA it is easy to make lots of copies. The Harvard team made about 70 billion copies, which is a lot by anyone's standards. Their paper in Science is here.

However, DNA, like all "archival" media, poses a system-level problem. Below the fold I discuss the details.
Does the system:
  • Assume that the medium is far more reliable than needed to meet the system-level reliability targets, and simply ignore the possibility of media failure?
  • Assume that even very reliable media can fail, and implement techniques for detecting and recovering from such failures, for example replication and integrity checks?
If the system takes the first approach and the medium does fail, it will be in big trouble. The history of media reliability projections is that, in practice, media rarely live up to their manufacturers' claims. This all-eggs-in-one-basket approach is not a good idea. I discussed some reasons here.

If the system takes the second approach, and the media are extremely reliable, then the media failure detection and recovery code will be exercised very, very infrequently. The fact that it is almost never exercised means that it is much less likely to work when it is actually needed than similar code running on less reliable media, where it is exercised routinely. This problem is illustrated by file system code, whose main code paths are executed intensively and are extremely reliable, but whose less frequently executed paths have been shown to be full of bugs (PDF). Media reliability is a good thing up to a point, beyond which it is actually counter-productive.
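To make the rarely-run path concrete, here is a minimal sketch in Python of a checksum-based audit-and-repair loop over replicas. The names (ReplicatedObject, audit_and_repair) and the SHA-256 and three-replica choices are illustrative assumptions of mine, not drawn from any real system; the point is simply that the repair branch is the code that hardly ever executes when the media are very reliable:

    import hashlib

    def digest(data):
        # SHA-256 fingerprint used as the integrity check.
        return hashlib.sha256(bytes(data)).hexdigest()

    class ReplicatedObject:
        def __init__(self, data, n_replicas=3):
            self.expected = digest(data)        # integrity metadata, kept separately
            self.replicas = [bytearray(data) for _ in range(n_replicas)]

        def audit_and_repair(self):
            # Check every replica; rewrite corrupt ones from a replica that still matches.
            good = next((r for r in self.replicas if digest(r) == self.expected), None)
            if good is None:
                raise RuntimeError("all replicas corrupt; object lost")
            repaired = 0
            for i, replica in enumerate(self.replicas):
                if digest(replica) != self.expected:    # detection
                    self.replicas[i] = bytearray(good)  # recovery: the rarely-run path
                    repaired += 1
            return repaired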

Obviously, one should not look a gift horse in the mouth and reject more reliable media. But if one believes the media are so reliable that their reliability has become counter-productive, one should inject deliberate, random faults to exercise the detection and recovery code.
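Here is a sketch of what such deliberate fault injection might look like, reusing the hypothetical ReplicatedObject above; the 1% injection rate and the flip-one-byte corruption model are arbitrary choices for illustration, not a recommendation:

    import random

    def inject_random_fault(obj, rate=0.01, rng=random):
        # With probability `rate`, flip one byte in one randomly chosen replica.
        if rng.random() >= rate:
            return False
        replica = rng.choice(obj.replicas)
        if replica:
            pos = rng.randrange(len(replica))
            replica[pos] ^= 0xFF                # deliberate, random corruption
        return True

    # Run the injector alongside the normal audit cycle; alarm if recovery ever
    # fails to restore a healthy state.
    obj = ReplicatedObject(b"some archived content")
    for _ in range(10_000):
        inject_random_fault(obj)
        obj.audit_and_repair()
    assert all(digest(r) == obj.expected for r in obj.replicas)

At a rate like this the repair branch runs on the order of a hundred times in this small test, instead of almost never, so a bug in it shows up during routine operation rather than years later when the data is at stake.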

There is a fascinating analogy to this argument in recent discussions of the "too big to fail" banking problem. Steve Randy Waldman writes:
I’m sympathetic to the view that financial regulation ought to strive not to prevent failures but to ensure that failures are frequent and tolerable. Rather than make that case, I’ll refer you to the oeuvre of the remarkable Ashwin Parameswaran, or macroresilience. Really, take a day and read every post. Learn why “micro-fragility leads to macro-resilience”.
The idea that “micro-fragility leads to macro-resilience” is really important. If failures are routine, they will be handled properly and will not be disruptive. Waldman argues that banks should be at risk of deliberate but random regulatory reorganization in which their debt would be converted to equity, and that this risk should increase as their balance sheets get riskier:
Stochastic failures are desirable for a variety of reasons. If failures were not stochastic, if we simply chose the worst-ranked banks for restructuring, then we’d create perverse incentives for iffy banks to game the criteria, because very small changes in one's score would lead to very large changes in outcomes among tightly clustered banks. If restructuring is stochastic and the probability of restructuring is dependent upon a bank’s distance from the center rather than its relationship with its neighbor, there is little benefit to becoming slightly better than the next guy. It only makes sense to play for substantive change. ... It might make sense for the scale of debt/equity conversions to be stochastic as well, so that most forced failures would be manageable, but investors would still have to prepare for occasional, very disruptive reorganizations.
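To see why there is "little benefit to becoming slightly better than the next guy" under such a rule, here is a toy simulation (my numbers and scaling, not Waldman's) comparing a deterministic "restructure the worst bank" rule with a risk-proportional stochastic one:

    import random

    scores = {"bank_a": 0.80, "bank_b": 0.79, "bank_c": 0.30}  # higher = riskier (made-up)

    # Deterministic rule: the single worst-ranked bank is always restructured,
    # so bank_b gains a lot by nudging its score just below bank_a's.
    print("deterministic:", max(scores, key=scores.get), "is restructured every time")

    # Stochastic rule: each bank is restructured with probability proportional
    # to its riskiness (the 0.1 scaling is arbitrary).
    rng = random.Random(0)
    trials = 100_000
    hits = {name: 0 for name in scores}
    for _ in range(trials):
        for name, score in scores.items():
            if rng.random() < 0.1 * score:
                hits[name] += 1
    for name in scores:
        print(f"stochastic: {name} restructured in {hits[name] / trials:.1%} of trials")
    # bank_a and bank_b face nearly identical restructuring risk (~8%), so a
    # marginal improvement over a neighbour buys almost nothing; only a
    # substantive reduction in riskiness changes the outcome distribution.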
Both Waldman and Parameswaran use the analogy of controlled burns in forestry which, by reducing the fuel load, greatly reduce the potential damage from natural fires:
Just like channelisation of a river results in increased silt load within the river banks, the absence of fires leads to a fuel buildup thus making the eventual fire that much more severe. In Minskyian terms, this is analogous to the buildup of leverage and ‘Ponzi finance’ within the economic system.
In our context, the more you try to ensure that failures at one layer of the system never happen, the more you increase the damage likely to be caused at the system level by the failures that do occur at that layer.
