Tuesday, November 6, 2018

Making PIEs Is Hard

In The Four Most Expensive Words In The English Language I wrote:
Since the key property of a cryptocurrency-based storage service is a lack of trust in the storage providers, Proofs of Space and Time are required. As Bram Cohen has pointed out, this is an extraordinarily difficult problem at the very frontier of research.
The post argues that the economics of decentralized storage services aren't viable, so the difficulty of Proofs of Space and Time isn't that important. All the same, this area of research is fascinating. Now, in One File for the Price of Three: Catching Cheating Servers in Decentralized Storage Networks, Ethan Cecchetti, Ian Miers, and Ari Juels have pushed the frontier further out by inventing PIEs. Below the fold, some details.

Here's the problem the Cornell team is working on:
Decentralized Storage Networks (DSNs) such as Filecoin want to store your files on strangers' spare disk space. To do it properly, you need to store the file in multiple places since you don't trust any individual stranger's computer. But how do you differentiate between three honest servers with one copy each and three cheating servers with one copy total? Anything you ask one server about the file it can get from its collaborator.
This is most of the problem of Proofs of Space. Their solution is:
the world's first provably secure, practical Public Incompressible Encoding (PIE). A PIE lets you encode a file in multiple different ways such that anyone can decode it, but nobody can efficiently use one encoding to answer questions about another. Using known techniques, you can ask simple questions to see if someone is storing an encoded replica, giving you a Proof of Replication (PoRep).
In other words, instead of storing N identical replicas of the original file, N different replicas are stored. From any one of the different replicas, the original file can be recovered. But an auditor can ask questions that are specific to each individual replica.

Anyone can already do this for private data:
Sia, Storj, MaidSafe, etc. do this by encrypting your pictures three times and distributing those encrypted copies. This strategy solves the problem perfectly if you're the one encrypting your photos and someone else is storing them.
But not for public data. The Cornell team's goal is to do it for public data such as blockchains:
What happens if it's a public good like blockchain state? Blockchains are getting huge—Ethereum is over 75 GB and growing fast—and it's really expensive to have everyone store everything. But who encrypts the blockchain state? Everyone needs to read it. And worse, we don't trust anyone to encrypt it properly. We need something totally different that anyone can check and anyone can decode.
The process of encrypting each of the "copies" needs to be slow, and previous proposals about how to make it slow have been broken. The Cornell team's proposal describes their computation as:
a directed acyclic graph (DAG). Each vertex represents an operation with two steps: first we derive a key using KDF [Key Derivation Function - think hash], and then we encrypt some data using that key. Edges are either data edges, representing the inputs and outputs of the encryption, or key edges, representing the data used by KDF to derive the key.
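To make this concrete, here is a minimal Python sketch of a single vertex operation, assuming SHA-256 stands in for the KDF and a toy XOR keystream stands in for the encryption; the function names and primitives are mine, not the paper's:

    import hashlib

    def kdf(key_edge_inputs):
        # Derive the vertex's key from the data arriving along its key edges.
        h = hashlib.sha256()
        for block in key_edge_inputs:
            h.update(block)
        return h.digest()

    def encrypt(key, data):
        # Toy cipher: XOR the data with a keystream expanded from the key.
        stream, counter = b"", 0
        while len(stream) < len(data):
            stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(d ^ s for d, s in zip(data, stream))

    def vertex(data_edge_inputs, key_edge_inputs):
        # One vertex of the encoding DAG: a KDF step, then an encryption step.
        return encrypt(kdf(key_edge_inputs), b"".join(data_edge_inputs))

Because the toy cipher is its own inverse, anyone who can recompute the key can also decode, which is what makes the encoding public rather than private.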
In order to ensure that a copy is intact, the computation must be:
A depth-robust graph (DRG) is a DAG with ... this property: even if a (potentially sizable) fraction of the vertices are removed, it retains a long path.
Thus, if the computation follows a DRG, a cheater who discards a fraction of the data still faces a long sequential recomputation, because a long path through the graph survives the deletions. But that by itself isn't enough: the cheater must also be forced to actually compute along that long path rather than around it. The Cornell team ensure this by layering alternate DRGs with:
a second type of graph called butterfly graph, which has the property that there's a path from every input node to every output node. This graph, with its dependency of every output on every input, helps ensure, roughly speaking, that Mallory must compute every input node, and therefore cannot avoid computing along the long sequential path in the depth-robust graph.
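Here is a rough Python sketch of the butterfly dependency pattern only; the combine() function is a stand-in, not the paper's vertex operation, and the number of blocks is assumed to be a power of two:

    import hashlib

    def combine(a, b):
        # Stand-in for the per-vertex KDF-then-encrypt operation.
        return hashlib.sha256(a + b).digest()

    def butterfly(blocks):
        # After log2(n) layers, every output block depends on every input
        # block, so a cheater cannot produce any output without first
        # computing all of the inputs.
        n, stride = len(blocks), 1
        while stride < n:
            blocks = [combine(blocks[i], blocks[i ^ stride]) for i in range(n)]
            stride *= 2
        return blocks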
The result is what they call a Dagwood sandwich. Now that they have a Public Incompressible Encoding, they propose to use it to build a Decentralized Storage Network:
Pretty much any PIE-based DSN architecture involves two steps for a given file:
  1. Prove once that the file is encoded correctly.
  2. Audit by verifying continuously that the file is intact.
Let's start with (1). Storage providers in the DSN must prove to somebody—the file owner or the network—that an encoding G of a file F is a correct PIE. Given an authenticated version of F, such as a hash stored in a trusted place, it's easy to verify that a PIE is correct.
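As a naive illustration only, assuming decode() inverts the public encoding, the check that someone holding a trusted hash of F can run looks like this; the paper's actual verification procedure may differ:

    import hashlib

    def check_against_trusted_hash(encoded_replica, trusted_hash, decode):
        # Anyone can decode the replica, so anyone holding the authenticated
        # hash of F can check what the replica decodes to.
        return hashlib.sha256(decode(encoded_replica)).digest() == trusted_hash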
I hope to return to the issues raised by "Given an authenticated version of F" in a later post. As with LOCKSS, proving the data correct once is not enough; it must also be audited continuously:
As for (2), it's not much help for G to be correct if it goes missing. It's critical to continuously check that storage providers are still storing G and haven't thrown data away. There are a number of efficient, well established techniques for this purpose.
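One such well-established technique is a random spot check against a Merkle commitment to G, recorded when storage begins. The Python sketch below is illustrative, not taken from the paper, and assumes the number of blocks is a power of two:

    import hashlib

    def h(x):
        return hashlib.sha256(x).digest()

    def merkle_root(blocks):
        level = [h(b) for b in blocks]
        while len(level) > 1:
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    def prove(blocks, index):
        # The provider answers a challenge for block `index` with the block
        # itself plus the sibling hashes on its path to the root.
        level, proof, i = [h(b) for b in blocks], [], index
        while len(level) > 1:
            proof.append(level[i ^ 1])
            level = [h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
            i //= 2
        return blocks[index], proof

    def verify(root, index, block, proof):
        # The auditor checks the answer against the root it recorded earlier.
        node = h(block)
        for sibling in proof:
            node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
            index //= 2
        return node == root

The auditor need only keep the root; a provider that has thrown blocks away will eventually fail a randomly chosen challenge.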
So far, I may be a bit hazy on the details but I think I understand the big picture. They then propose:
A blockchain can then perform the auditing. This could be an existing blockchain like Ethereum or, as with schemes like Sia, a new one whose consensus algorithm is independent of its storage goals. A particularly intriguing option, though, is to create a new blockchain where audit = mining.
The idea of audit = mining is where they lose me. Here is the explanation for my confusion.

In conventional blockchains such as Bitcoin's, a miner (or more likely a mining pool) decides which transactions are in the block it will mine. Among them will be one coinbase transaction:
A special kind of transaction, called a coinbase transaction, has no inputs. It is created by miners, and there is one coinbase transaction per block. Because each block comes with a reward of newly created Bitcoins (e.g. 50 BTC for the first 210,000 blocks), the first transaction of a block is, with few exceptions, the transaction that grants those coins to their recipient (the miner). In addition to the newly created Bitcoins, the coinbase transaction is also used for assigning the recipient of any transaction fees that were paid within the other transactions being included in the same block.
If the block the miner created wins, the coinbase and other transactions in the block take effect. The miner has created their own reward for the resources they devoted to the primary function of the blockchain, in this case, verifying transactions.
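To make the incentive structure explicit, here is a toy Python sketch of block assembly; the field names are illustrative, not Bitcoin's actual transaction format:

    def assemble_block(miner_address, mempool_txs, block_subsidy):
        # The miner chooses which transactions to include, and writes the
        # coinbase transaction paying itself the subsidy plus their fees.
        fees = sum(tx["fee"] for tx in mempool_txs)
        coinbase = {
            "inputs": [],  # a coinbase transaction has no inputs
            "outputs": [{"to": miner_address, "amount": block_subsidy + fees}],
        }
        return [coinbase] + mempool_txs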

In DSNs, the major costs are incurred by the storage nodes. Audit is a minor, albeit necessary, cost. Presumably, a storage node cannot audit itself. Thus audit is a function that some other node has to perform on behalf of each storage node. Audit = mining, as I see it, gives rise to a number of issues:
  • Auditors will create the blocks they mine, deciding which transactions are included, and thus which nodes they audit for this block. Among the transactions will be the coinbase transaction, rewarding the auditor for auditing. How are the storage nodes rewarded for storing data? Presumably, the idea is that the auditor will also include transactions that transfer fees paid for storage from the owners of the data they store to each of the nodes audited in this block (a toy sketch of such a block follows this list). This means that storage nodes will depend on the kindness of strangers to get paid.
  • So, like Bitcoin users who want their transactions confirmed, storage nodes will need to pay fees to auditors. As with Bitcoin's transaction fees, these will be set in a blind auction, leading both to over-bidding and to long delays from under-bidding.
  • For the same economic reasons as in Bitcoin, auditors will form pools. Auditing will be dominated by a few large pools. They will be able to collude in various forms of anti-competitive behavior, such as auditing only storage nodes which are members of the pools, or charging higher audit fees to non-members. Doing so would increase the costs of competing storage nodes.
  • But the key point is that if the economics are to work out, audit fees and the inflation of the cryptocurrency by the issuance of audit rewards must be only a small part of the cost of running a storage node. In The Four Most Expensive Words In The English Language I pointed out that DSN storage nodes can't charge more than Amazon's S3, and realistically have to charge much less. Thus the income from auditing has to be minuscule.
  • But the idea is that audit = mining. It may seem like a laudable goal to make mining cheap, but it is problematic. Eric Budish writes:
    From a computer security perspective, the key thing to note ... is that the security of the blockchain is linear in the amount of expenditure on mining power ... In contrast, in many other contexts investments in computer security yield convex returns (e.g., traditional uses of cryptography) — analogously to how a lock on a door increases the security of a house by more than the cost of the lock.
    Bram Cohen's Chia Network is using Proofs of Space and Time to force miners to waste resources by storing large amounts of otherwise useless data, so running a miner is expensive. But if audit = mining in a DSN storing useful data, running a miner (= auditor) has to be cheap, and thus the network will be vulnerable.
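Purely to make the speculation in the first bullet concrete, here is a toy Python sketch of what an audit block might contain under my reading; none of this comes from the paper:

    def assemble_audit_block(auditor_address, audit_results, audit_reward):
        # audit_results: list of (storage_node, data_owner, storage_fee, passed)
        txs = [{
            "inputs": [],  # coinbase-style reward created for the auditor
            "outputs": [{"to": auditor_address, "amount": audit_reward}],
        }]
        for node, owner, fee, passed in audit_results:
            if passed:
                # The owner's storage fee is released to the storage node,
                # but only for the nodes this auditor chose to include.
                txs.append({"from": owner, "to": node, "amount": fee})
        return txs

Whether a storage node gets paid thus depends entirely on some auditor choosing to include it, which is the "kindness of strangers" problem.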
Cecchetti et al. don't elaborate on the details of their audit = mining concept, and a Google search doesn't turn up the details elsewhere. So it is possible that I have misinterpreted their ideas. But unless they have some way to make mining (= auditing) expensive, they are on the wrong track.
