DSHR's Blog: Auditing The Integrity Of Multiple Replicas

The fundamental problem in the design of the LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:

Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyright, and the copyright was owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
Unreliable, mutually untrusting systems was a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation and subjected us to criminal liability under the DMCA.

Our approach, published now more than 16 years ago, was to have each node in the network compare its content with that of the consensus among a randomized subset of the other nodes holding the same content. They did so using a peer-to-peer protocol using proof-of-work, in some respects one of the many precursors of Satoshi Nakamoto's Bitcoin protocol.

Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many for obvious economic reasons. Back then there were integrity audit systems developed that didn't need an excess of replicas, including work by Mehul Shah et al, and Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.

A decade and a half later the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's Multiple‐replica integrity auditing schemes for cloud data storage provides a useful review of the current state-of-the-art. Below the fold, a discussion of their, and some related work.

Their abstract reads:

Cloud computing has been an essential technology for providing on‐demand computing resources as a service on the Internet. Not only enterprises but also individuals can outsource their data to the cloud without worrying about purchase and maintenance cost. The cloud storage system, however, is not fully trustable. Cloud data integrity auditing is crucial for defending against the security threats of data in the untrusted multicloud environment. Storing multiple replicas is a commonly used strategy for the availability and reliability of critical data. In this paper, we summarize and analyze the state‐of‐the‐art multiple‐replica integrity auditing schemes in cloud data storage. We present the system model and security threats of outsourcing data to the cloud with classification of ongoing developments. We also summarize the existing data integrity auditing schemes for multicloud data storage. The important open issues and potential research directions are addressed.

System Architecture

There are three possible system architectures for auditing the integrity of multiple replicas:

As far as I'm aware, LOCKSS is unique in using a true peer-to-peer architecture, in which nodes storing content mutually audit each other.
In another possible architecture the data owner (DO in Yangfei Lin et al's nomenclature) audits the replicas.
Yangfei Lin et al generally consider an architecture in which a trusted third party audits the replicas on behalf of the DO.

Proof-of-Possession vs. Proof-of-Retrievability

There are two kinds of audit:

A Proof-of-Retrievability (PoR) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed and every bit was intact.
A Proof-of-Possession (PoP) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed, but not that every bit was intact. The paper uses the acronym PDP for Provable Data Possession.

Immutable, Trustworthy Storage

The reason integrity audits are necessary is that storage systems are neither reliable nor trustworthy, especially at scale. Some audit systems depend on storing integrity tokens, such as hashes, in storage which has to be assumed reliable. If the token storage is corrupted, it may be possible to detect but not recover from the corruption. It is generally assumed that, because the tokens are much smaller than the content to whose integrity they attest, they are correspondingly more reliable. But it is easy to forget that both the tokens and the content are made of the same kind of bits, and that even storage protected by cryptographic hardware has vulnerabilities.

Encrypted Replicas

In many applications of cloud storage it is important that confidentiality of the data is preserved by encrypting it. In the digital preservation context, encrypting the data adds a significant single point of failure, the loss or corruption of the key, so is generally not used. If encryption is used, some means for ensuring that the ciphertext of each replica is different is usually desirable, as is the use of immutable, trustworthy storage for the decryption keys. The paper discusses doing this via probabilistic encryption using public/private key pairs, or via symmetric encryption using random noise added to the plaintext.

If the replicas are encrypted they are not bit-for-bit identical and thus their hashes will be different whether they are intact or corrupt. Thus a homomorphic encryption algorithm must be used:

Homomorphic encryption is a form of encryption with an additional evaluation capability for computing over encrypted data without access to the secret key. The result of such a computation remains encrypted.

In Section 3.3 Yangfei Lin et al discuss two auditing schemes based on homomorphic encryption:

RSA-based homomorphism
Boneh-Lynn-Shacham (BLS) signature-based homomorphism.

One of the schemes they discuss uses Paillier encryption, another homomorphic technique. It and RSA are compared here. See also here.

Secrets

If an audit operation is not to involve downloading the entire content, it must involve the auditor requiring the system storing the replica to perform a computation that:

The storage system does not know the result of ahead of time.
Takes as input part (PoP) or all (PoR) of the replica.

Thus, for example, asking the replica store for the hash of the content is not adequate, since the store could have pre-computed and stored the hash, rather than the content.

PoP systems can, for example, satisfy these requirements by requesting the hash of a random range of bytes within the content. PoR systems can, for example, satisfy these requirements by providing a random nonce that the replica store must prepend to the content before hashing it. It is important that, if the auditor pre-computes and stores these random values, they be kept secret from the replica stores. If the replica store discovers them, it can pre-compute the responses to future audit requests and discard the content without detection.

Unfortunately, it is not possible to completely exclude the possibility that a replica store, or a conspiracy among the replica stores, has compromised the storage holding the auditor's pre-computed values. A ideal design of auditor would generate the random values at each audit time, rather than pre-computing them. Alas, this is typically possible only if the auditor has access to a replica stored in immutable, trustworthy storage (see above). In the mutual audit architecture used by the LOCKSS system, the nodes do have access to a replica, albeit not in reliable storage, so the random nonces the system uses are generated afresh for each audit.

It is an unfortunate reality of current systems that, over long periods, preventing secrets from leaking and detecting in a timely fashion that they have leaked are both effectively impossible.

Auditing Dynamic vs. Static Data

In the digital preservation context, the replicas being audited can be assumed to be static, or at least append-only. The paper addresses the much harder problem of auditing replicas that are dynamic, subject to updates through time. In Section 3.2 Yangfei Lin et al discuss a number of techniques for authenticated data structures (ADS) to allow efficient auditing of dynamic data:

There are three main ADSs: rank-based authenticated skip list (RASL), Merkle hash tree (MHT), and map version table (MVT).

Methods

Yangfei Lin et al summarize the seven methods they describe in Table 4:

Adapted from Yangfei Lin et al: Table 4
Scheme	Authenticated Data Structure	Public Verifiability	Dynamic Data Operation	Implementation
MR-PDP	-	NO	NO	Pseudo-random functions for masking, BLS signatures
EMC-PDP	-	YES	NO	Symmetric encryption diffusion, BLS signatures
DMR-PDP	-	NO	YES	Paillier encryption
FHE-PDP	-	YES	YES	Fully homomorphic encryption
MB-PMDDP	Map version table	YES	YES	Require extra storage but less computation
MR-MHT	Merkle hash table	YES	YES	Replica as a subtree
MuR-DPA	Merkle hash table	YES	YES	Block replica as a subtree

As you can see, only four of the seven satisfy both of their requirements and if I interpret this (my emphasis):

an overall comparison summary among the existing replicated data possession auditing schemes ... is shown in Table 4.

correctly all are PoP not PoR.

But What About Blockchain?

In discussions of integrity verification these days, the idea of using a blockchain is inevitably raised. A recent example is R&D: Blockchain-Based Publicly Verifiable Cloud Storage by Nachiket Tapas et al. Their abstract reads:

Cloud storage adoption, due to the growing popularity of IoT solutions, is steadily on the rise, and ever more critical to services and businesses. In light of this trend, customers of cloud-based services are increasingly reliant, and their interests correspondingly at stake, on the good faith and appropriate conduct of providers at all times, which can be misplaced considering that data is the "new gold", and malicious interests on the provider side may conjure to misappropriate, alter, hide data, or deny access. A key to this problem lies in identifying and designing protocols to produce a trail of all interactions between customers and providers, at the very least, and make it widely available, auditable and its contents therefore provable. This work introduces preliminary results of this research activity, in particular including scenarios, threat models, architecture, interaction protocols and security guarantees of the proposed blockchain-based solution.

Ether miners 7/9/19

As usual, Tapas et al simply assume the Ethereum blockchain is immutable, secure and always available, despite the fact that its security depends on its being decentralized, which it isn't. Since the infrastructure on which their edifice is based does not in practice provide the conditions their system depends upon, it would not in practice provide the security guarantees they claim for it.

Everything they want from the Ethereum blockchain could be provided by the same kind of verifiable logs as are used in Certificate Transparency, thereby avoiding the problems of public blockchains. But doing so would face insuperable scaling problems under the transaction rates of industrial cloud deployments.

DSHR's Blog

Thursday, November 14, 2019

Auditing The Integrity Of Multiple Replicas