The fundamental problem in the design of the
LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:
- Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyrighted, and the copyrights were owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
- Unreliable, mutually untrusting systems were a consequence. Each library's system had to be as cheap as possible to own, administer and operate, both to keep the aggregate cost of the system manageable and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
- Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. Worse, it would likely have been a copyright violation, and would have subjected us to criminal liability under the DMCA. Instead, audits compare hashes of the content, as the sketch after this list illustrates.
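To make "without downloading" concrete, here is a minimal sketch of the challenge-response idea involved (the function names are mine, and this omits the voting and proof-of-work machinery discussed below): the auditor sends a fresh random nonce, and the node being audited returns a hash over the nonce followed by its copy of the content. Only the small hash crosses the network, and the fresh nonce prevents the node from replaying a precomputed answer.

```python
import hashlib
import os

def make_challenge() -> bytes:
    # The auditor picks a fresh random nonce for each audit round.
    return os.urandom(32)

def prove_possession(nonce: bytes, content: bytes) -> bytes:
    # The audited node hashes the nonce followed by its copy of the
    # content. A fresh nonce means it cannot reuse a cached hash; it
    # must actually read the bytes it claims to hold.
    return hashlib.sha256(nonce + content).digest()

def audit(nonce: bytes, response: bytes, reference: bytes) -> bool:
    # Compare the node's response against one computed over a
    # reference copy. LOCKSS has no trusted reference, so it instead
    # compares the responses of many peers against each other.
    return response == prove_possession(nonce, reference)
```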
Our approach, published now more than 16 years ago, was to have each node in the network compare its content with the consensus among a randomized subset of the other nodes holding the same content. The nodes did so via a peer-to-peer protocol based on proof-of-work, in some respects one of the many precursors of Satoshi Nakamoto's Bitcoin protocol.
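The following toy poll, with simplifying assumptions of my own (peers simulated in-process, no proof-of-work effort, no peer reputation or rate limiting, made-up names and thresholds), shows the shape of the protocol: a node challenges a random sample of peers with a fresh nonce, tallies their hashed votes, and compares the consensus against its own copy.

```python
import hashlib
import os
import random
from collections import Counter

def vote(nonce: bytes, replica: bytes) -> str:
    # Each peer hashes the nonce plus its own replica, so a vote
    # cannot be precomputed or copied from another peer's answer.
    return hashlib.sha256(nonce + replica).hexdigest()

def run_poll(my_replica: bytes, peers: list, sample_size: int = 5,
             landslide: float = 0.7) -> str:
    # Challenge a random subset of the peers holding this content.
    nonce = os.urandom(32)
    sample = random.sample(peers, min(sample_size, len(peers)))
    tally = Counter(vote(nonce, replica) for replica in sample)
    consensus, count = tally.most_common(1)[0]
    if count / len(sample) < landslide:
        return "alarm"    # no landslide either way: possible attack
    if vote(nonce, my_replica) == consensus:
        return "agree"    # my copy matches the consensus
    return "repair"       # my copy is damaged: fetch a repair
```

Roughly speaking, in the published protocol the outcomes matter: landslide agreement means the node's copy is good, landslide disagreement triggers a repair, and anything in between raises an alarm, since random storage failures rarely produce close votes.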
Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many, for obvious economic reasons. Integrity audit systems that didn't need so many replicas were developed back then, including work by Mehul Shah et al. and by Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space there was no requirement for them.
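To give the flavor of auditing without lots of replicas (this is my own illustration of the general precomputed-challenge idea, not the actual scheme of Shah et al. or Jaja and Song): before outsourcing the content, the owner precomputes a finite list of nonce/hash pairs, and later spends one pair per audit without needing a replica of its own.

```python
import hashlib
import os

def precompute_challenges(content: bytes, rounds: int) -> list:
    # Before handing the content to the storage service, the owner
    # precomputes one (nonce, expected hash) pair per future audit.
    # After this, the owner need not keep a replica of the content.
    challenges = []
    for _ in range(rounds):
        nonce = os.urandom(32)
        challenges.append((nonce, hashlib.sha256(nonce + content).digest()))
    return challenges

def audit_round(challenges: list, respond) -> bool:
    # Spend one unused challenge. `respond` stands in for the
    # service's prover: it takes a nonce, returns a hash over its copy.
    nonce, expected = challenges.pop()
    return respond(nonce) == expected
```

The obvious limitation is that the challenge list is finite; once it is exhausted the owner must retrieve the content to generate more, which is part of what later schemes based on homomorphic tags were designed to avoid.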
A decade and a half later, the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al.'s Multiple-replica integrity auditing schemes for cloud data storage provides a useful review of the current state of the art. Below the fold, a discussion of their work, and of some related work.