Thursday, November 18, 2010

The Anonymity of Crowds

In an earlier post I discussed the consulting contract that Ithaka S+R is working on for GPO to project the future of the Federal Depository Library Program (FDLP). As part of this, Roger Schonfeld asked us:
"about minimum levels of replication required in order to ensure long-term reliability"

We get asked this all the time, because people think this is a simple question that should have a simple answer. We reply that experience leads us to believe that for the LOCKSS system the minimum number of copies is about 7, and surround this answer with caveats. But this answer is almost always quoted out of context as being applicable to systems in general, and being a hard number. It may be useful to give my answer to Schonfeld's question wider distribution; it is below the fold.



We have been careful to say that our estimate of 7 LOCKSS boxes to be reasonably safe is based on experience, not on data. The search for "data-driven information about numbers of copies needed" is futile. The reason can be seen in our 2005 D-Lib paper Requirements for Digital Preservation Systems: A Bottom-Up Approach.

Briefly, stored bits are subject to a broad range of threats. Building realistic models of even the obvious ones, such as bit rot and media failure, is very hard because the failures are known to be highly correlated - see our 2006 EuroSys paper A Fresh Look at the Reliability of Long-term Digital Storage.

But the threats that actually cause significant data loss, such as operator error and internal attack, are even harder to model. Not only are these threats even more highly correlated and critically dependent on the detailed architecture of the systems in question but, more importantly, they are so embarrassing to the victims that they are actively covered up. Thus, even if the difficulties of modeling such highly correlated threats could be overcome, the data necessary to drive the models are not available.

Any claim to a data-driven adequate number of copies should be treated with great skepticism, since it will be based on simplistic models that ignore the threats that actually lose data, and these models will use highly suspect data. Further, these difficulties all lead to estimates of the adequate number of copies that are systematically too low.
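To see why ignoring correlation pushes the estimate downward, here is a toy sketch. It is not a model anyone should use to pick a number of copies; the function names, the single "common cause" event, and all the probabilities are hypothetical, chosen only to show the direction of the error. It compares the chance of losing every copy when copies are assumed to fail independently with the chance when one shared event (say, an operator error) can destroy all copies at once.

```python
def loss_if_independent(p_copy_loss: float, copies: int) -> float:
    """Chance that every copy is lost, assuming copies fail independently."""
    return p_copy_loss ** copies


def loss_with_common_cause(p_copy_loss: float, copies: int, p_common: float) -> float:
    """Chance that every copy is lost when a shared event can wipe out all copies.

    With probability p_common the shared event destroys everything;
    otherwise the copies fail independently, as in the simple model.
    """
    return p_common + (1 - p_common) * p_copy_loss ** copies


if __name__ == "__main__":
    p_copy_loss = 0.1  # hypothetical per-copy loss probability
    p_common = 0.01    # hypothetical probability of the shared event
    for copies in (2, 3, 5, 7):
        simple = loss_if_independent(p_copy_loss, copies)
        correlated = loss_with_common_cause(p_copy_loss, copies, p_common)
        print(f"{copies} copies: independent {simple:.2e}, with common cause {correlated:.2e}")
```

In this toy, adding copies drives the independent estimate toward zero, while the estimate with the shared event never falls below the probability of that event. A model that leaves the shared event out will therefore always conclude that fewer copies are enough.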

In the particular case that Schonfeld is examining, government documents, the most critical threat is insider abuse. I regularly point out that governments routinely suppress and sanitize documents. Under the circumstances, the paper FDLP was remarkably resistant to this. When the government wanted to recall a document, for example Vol. XXVI of Foreign Relations of the United States, they had to notify a large number of librarians. Human librarians were essential to making the system both tamper-evident and, to some extent, tamper-resistant. Because there were so many of them, some of them would view their responsibility to society at large as requiring them to make this request public, or to squirrel away a copy. In modern terms, the resilience of the system came from crowd-sourcing. Just in the last few days we have had another example of the "anonymity of crowds" defeating censorship.

It is worth noting that the paper on which the LOCKSS protocol is based, which won a "Best Paper" award at the 2003 SOSP, exploits the "anonymity of crowds" too. An attacker needs to predict which nodes will vote in any given poll. The more nodes there are, the smaller the proportion of the total that needs to vote, and thus the smaller the probability of a successful prediction.
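The arithmetic behind this can be sketched as follows. This is not the LOCKSS protocol; it is a simplification that assumes each poll draws a fixed number of voters uniformly at random from all the nodes, and the function names and the figure of 5 voters per poll are hypothetical.

```python
from math import comb


def chance_node_votes(total_nodes: int, voters_per_poll: int) -> float:
    """Probability that one particular node is among the voters in a poll."""
    return voters_per_poll / total_nodes


def chance_exact_guess(total_nodes: int, voters_per_poll: int) -> float:
    """Probability that a single guess names exactly the set of voters."""
    return 1 / comb(total_nodes, voters_per_poll)


if __name__ == "__main__":
    voters_per_poll = 5  # hypothetical poll size, held fixed as the crowd grows
    for total_nodes in (7, 20, 100):
        per_node = chance_node_votes(total_nodes, voters_per_poll)
        exact = chance_exact_guess(total_nodes, voters_per_poll)
        print(f"{total_nodes} nodes: any given node votes with p={per_node:.3f}, "
              f"exact voter set guessed with p={exact:.2e}")
```

Under these assumptions, holding the poll size fixed while the number of nodes grows shrinks both the chance that any given node is a voter and the chance that an attacker guesses the whole voter set.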

Clearly, an electronic system that follows Schonfeld's recommendation of "a small number of dedicated preservation entities" is unlikely to have these essential properties. Suppression or sanitization would probably be automatic and extremely hard to detect. Even if humans at the "entities" were involved, because there would be so few of them it would be too easy for the government to identify the leakers. Arm-twisting would be effective in deterring librarians at the "entities" from following their conscience; the more so in that these "entities" would be very vulnerable to the government retaliating by excluding them from the program.

This illustrates three important principles:
  • The number of copies needed cannot be discussed except in the context of a specific threat model.
  • The important threats are not amenable to quantitative modeling.
  • Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".
