- Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyrighted, and the copyright was owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
- Unreliable, mutually untrusting systems were a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
- Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation, subjecting us to criminal liability under the DMCA.
Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many, for obvious economic reasons. Back then, integrity audit systems that didn't need an excess of replicas were developed, including work by Mehul Shah et al and by Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.
A decade and a half later the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's *Multiple-replica integrity auditing schemes for cloud data storage* provides a useful review of the current state of the art. Below the fold, a discussion of their work, and of some related work.
Their abstract reads:
Cloud computing has been an essential technology for providing on‐demand computing resources as a service on the Internet. Not only enterprises but also individuals can outsource their data to the cloud without worrying about purchase and maintenance cost. The cloud storage system, however, is not fully trustable. Cloud data integrity auditing is crucial for defending against the security threats of data in the untrusted multicloud environment. Storing multiple replicas is a commonly used strategy for the availability and reliability of critical data. In this paper, we summarize and analyze the state‐of‐the‐art multiple‐replica integrity auditing schemes in cloud data storage. We present the system model and security threats of outsourcing data to the cloud with classification of ongoing developments. We also summarize the existing data integrity auditing schemes for multicloud data storage. The important open issues and potential research directions are addressed.
**System Architecture**

There are three possible system architectures for auditing the integrity of multiple replicas:
- As far as I'm aware, LOCKSS is unique in using a true peer-to-peer architecture, in which nodes storing content mutually audit each other.
- In another possible architecture the data owner (DO in Yangfei Lin et al's nomenclature) audits the replicas.
- Yangfei Lin et al generally consider an architecture in which a trusted third party audits the replicas on behalf of the DO.
**Proof-of-Possession vs. Proof-of-Retrievability**

There are two kinds of audit:
- A Proof-of-Retrievability (PoR) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed and every bit was intact.
- A Proof-of-Possession (PoP) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed, but not that every bit was intact. The paper uses the acronym PDP for Provable Data Possession.
**Immutable, Trustworthy Storage**

The reason integrity audits are necessary is that storage systems are neither reliable nor trustworthy, especially at scale. Some audit systems depend on storing integrity tokens, such as hashes, in storage which has to be assumed reliable. If the token storage is corrupted, it may be possible to detect but not recover from the corruption. It is generally assumed that, because the tokens are much smaller than the content to whose integrity they attest, they are correspondingly more reliable. But it is easy to forget that both the tokens and the content are made of the same kind of bits, and that even storage protected by cryptographic hardware has vulnerabilities.
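As a toy illustration of this asymmetry (all names here are hypothetical): a stored hash token can detect corruption of the content, but if the token store itself is corrupted the auditor sees the same mismatch, cannot tell which side is wrong, and cannot repair either from the other.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# The content and its integrity token are both just bits in storage.
content = b"an archived journal article"
token = sha256(content)  # stored separately, assumed (perhaps wrongly) reliable

# Case 1: the content is corrupted -- the mismatch is detected.
corrupted_content = b"an archived journal articlX"
assert sha256(corrupted_content) != token

# Case 2: the *token* is corrupted -- there is still a mismatch, but the
# auditor cannot tell which side is wrong, and the one-way hash gives no
# way to recover the content from the token or vice versa.
corrupted_token = token[:-1] + ("0" if token[-1] != "0" else "1")
assert sha256(content) != corrupted_token
```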
**Encrypted Replicas**

In many applications of cloud storage it is important that confidentiality of the data is preserved by encrypting it. In the digital preservation context, encrypting the data adds a significant single point of failure, the loss or corruption of the key, so is generally not used. If encryption is used, some means for ensuring that the ciphertext of each replica is different is usually desirable, as is the use of immutable, trustworthy storage for the decryption keys. The paper discusses doing this via probabilistic encryption using public/private key pairs, or via symmetric encryption using random noise added to the plaintext.
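A minimal sketch of the symmetric case, assuming fresh per-replica randomness (this toy stream cipher built from SHA-256 in counter mode is for illustration only, not the paper's construction; a real system would use a vetted cipher such as AES-GCM):

```python
import hashlib
import os

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key || nonce || counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(16)  # fresh randomness makes each replica's ciphertext differ
    ct = bytes(p ^ k for p, k in zip(plaintext, keystream(key, nonce, len(plaintext))))
    return nonce + ct

def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:16], blob[16:]
    return bytes(c ^ k for c, k in zip(ct, keystream(key, nonce, len(ct))))

key = os.urandom(32)
plaintext = b"the same preserved content"
r1, r2 = encrypt(key, plaintext), encrypt(key, plaintext)
# Both replicas decrypt to the same plaintext, but their ciphertexts --
# and hence their hashes -- differ.
assert decrypt(key, r1) == decrypt(key, r2) == plaintext
assert hashlib.sha256(r1).digest() != hashlib.sha256(r2).digest()
```

This is exactly why, as the next section notes, a simple compare-the-hashes audit no longer works across encrypted replicas.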
If the replicas are encrypted they are not bit-for-bit identical, so their hashes will differ whether they are intact or corrupt. To audit them, a homomorphic encryption algorithm must be used:
Homomorphic encryption is a form of encryption with an additional evaluation capability for computing over encrypted data without access to the secret key. The result of such a computation remains encrypted.

In Section 3.3 Yangfei Lin et al discuss two auditing schemes based on homomorphic encryption:
- RSA-based homomorphism
- Boneh-Lynn-Shacham (BLS) signature-based homomorphism.
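The RSA case can be sketched in a few lines: textbook RSA (without padding) is multiplicatively homomorphic, so the product of two ciphertexts decrypts to the product of the plaintexts. A toy demonstration with insecurely small primes (the BLS case needs pairing libraries and is omitted):

```python
# Toy RSA parameters -- far too small for real use, illustration only.
p, q = 61, 53
n = p * q                           # modulus, 3233
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+ modular inverse)

def enc(m: int) -> int:
    return pow(m, e, n)

def dec(c: int) -> int:
    return pow(c, d, n)

m1, m2 = 7, 11
# Multiplying ciphertexts corresponds to multiplying plaintexts:
# enc(m1) * enc(m2) = (m1 * m2)^e mod n
assert dec(enc(m1) * enc(m2) % n) == (m1 * m2) % n
```

It is this ability to compute on data the auditor never sees in the clear that the auditing schemes exploit.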
**Secrets**

If an audit operation is not to involve downloading the entire content, it must involve the auditor requiring the system storing the replica to perform a computation that:
- The storage system does not know the result of ahead of time.
- Takes as input part (PoP) or all (PoR) of the replica.
PoP systems can, for example, satisfy these requirements by requesting the hash of a random range of bytes within the content. PoR systems can, for example, satisfy these requirements by providing a random nonce that the replica store must prepend to the content before hashing it. It is important that, if the auditor pre-computes and stores these random values, they be kept secret from the replica stores. If the replica store discovers them, it can pre-compute the responses to future audit requests and discard the content without detection.
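Both challenge mechanisms can be sketched as follows (the helper names are hypothetical; a real scheme would combine these with the homomorphic machinery above). The key property is that the replica store cannot answer correctly without the data the challenge touches:

```python
import hashlib
import secrets

content = b"0123456789" * 100     # the auditor's known-good copy
intact_store = content             # a store holding the replica intact
corrupt_store = b"X" + content[1:] # a store whose replica has a flipped byte

# PoP-style challenge: hash a random byte range. Answering requires
# possession of at least that part of the replica.
def pop_challenge(length: int):
    start = secrets.randbelow(length - 1)
    end = start + 1 + secrets.randbelow(length - start)
    return start, end

def pop_response(data: bytes, start: int, end: int) -> str:
    return hashlib.sha256(data[start:end]).hexdigest()

# PoR-style challenge: a fresh random nonce prepended to the whole
# content, so the response cannot be precomputed and covers every bit.
def por_response(data: bytes, nonce: bytes) -> str:
    return hashlib.sha256(nonce + data).hexdigest()

nonce = secrets.token_bytes(32)
expected = por_response(content, nonce)
assert por_response(intact_store, nonce) == expected
assert por_response(corrupt_store, nonce) != expected  # corruption detected
```

Note that the nonce is generated fresh per audit; if the store learned future nonces it could precompute the answers and discard the content, which is exactly the leak risk discussed above.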
Unfortunately, it is not possible to completely exclude the possibility that a replica store, or a conspiracy among the replica stores, has compromised the storage holding the auditor's pre-computed values. An ideal auditor would generate the random values at each audit time, rather than pre-computing them. Alas, this is typically possible only if the auditor has access to a replica stored in immutable, trustworthy storage (see above). In the mutual audit architecture used by the LOCKSS system, the nodes do have access to a replica, albeit not in reliable storage, so the random nonces the system uses are generated afresh for each audit.
It is an unfortunate reality of current systems that, over long periods, preventing secrets from leaking and detecting in a timely fashion that they have leaked are both effectively impossible.
**Auditing Dynamic vs. Static Data**

In the digital preservation context, the replicas being audited can be assumed to be static, or at least append-only. The paper addresses the much harder problem of auditing replicas that are dynamic, subject to updates through time. In Section 3.2 Yangfei Lin et al discuss a number of authenticated data structures (ADSs) that allow efficient auditing of dynamic data:
There are three main ADSs: rank-based authenticated skip list (RASL), Merkle hash tree (MHT), and map version table (MVT).
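The Merkle hash tree is the most familiar of the three, and shows why these structures suit auditing: a store can prove that one block belongs under a signed root by supplying only the sibling hashes on the path to the root. A minimal sketch (duplicating the last node on odd-sized levels is one common convention, not necessarily the paper's):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _pad(level):
    # Duplicate the last node when a level has odd size.
    return level + [level[-1]] if len(level) % 2 else level

def merkle_root(blocks) -> bytes:
    level = [h(b) for b in blocks]
    while len(level) > 1:
        level = _pad(level)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(blocks, index):
    """Sibling hashes needed to recompute the root from blocks[index]."""
    level, proof = [h(b) for b in blocks], []
    while len(level) > 1:
        level = _pad(level)
        proof.append((level[index ^ 1], index % 2 == 0))  # (sibling, node-is-left)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(block, proof, root) -> bool:
    node = h(block)
    for sibling, node_is_left in proof:
        node = h(node + sibling) if node_is_left else h(sibling + node)
    return node == root

blocks = [b"block0", b"block1", b"block2", b"block3", b"block4"]
root = merkle_root(blocks)                 # the small token the auditor keeps
proof = merkle_proof(blocks, 2)            # supplied by the replica store
assert verify(b"block2", proof, root)      # intact block verifies
assert not verify(b"tampered", proof, root)  # corruption is detected
```

An update to one block only changes the hashes on its root path, which is what makes the structure efficient for dynamic data.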
**Methods**

Yangfei Lin et al summarize the seven methods they describe in Table 4:
Adapted from Yangfei Lin et al: Table 4

| Scheme | ADS | | | Features |
|--------|-----|---|---|----------|
| MR-PDP | - | NO | NO | Pseudo-random functions for masking, BLS signatures |
| EMC-PDP | - | YES | NO | Symmetric encryption diffusion, BLS signatures |
| FHE-PDP | - | YES | YES | Fully homomorphic encryption |
| MB-PMDDP | Map version table | YES | YES | Requires extra storage but less computation |
| MR-MHT | Merkle hash tree | YES | YES | Replica as a subtree |
| MuR-DPA | Merkle hash tree | YES | YES | Block replica as a subtree |
As you can see, only four of the seven satisfy both of their requirements, and if I interpret this (my emphasis):

an overall comparison summary among the existing replicated data *possession* auditing schemes ... is shown in Table 4.

correctly, all are PoP not PoR.
**But What About Blockchain?**

In discussions of integrity verification these days, the idea of using a blockchain is inevitably raised. A recent example is *R&D: Blockchain-Based Publicly Verifiable Cloud Storage* by Nachiket Tapas et al. Their abstract reads:
Cloud storage adoption, due to the growing popularity of IoT solutions, is steadily on the rise, and ever more critical to services and businesses. In light of this trend, customers of cloud-based services are increasingly reliant, and their interests correspondingly at stake, on the good faith and appropriate conduct of providers at all times, which can be misplaced considering that data is the "new gold", and malicious interests on the provider side may conjure to misappropriate, alter, hide data, or deny access. A key to this problem lies in identifying and designing protocols to produce a trail of all interactions between customers and providers, at the very least, and make it widely available, auditable and its contents therefore provable. This work introduces preliminary results of this research activity, in particular including scenarios, threat models, architecture, interaction protocols and security guarantees of the proposed blockchain-based solution.
|Ether miners 7/9/19|
Everything they want from the Ethereum blockchain could be provided by the same kind of verifiable logs as are used in Certificate Transparency, thereby avoiding the problems of public blockchains. A public blockchain, by contrast, would face insuperable scaling problems at the transaction rates of industrial cloud deployments.