Tuesday, June 24, 2014


For a long time there have been a number of possible "holy grails" for digital preservation, ideas that if it were possible to implement them would transform the problem. One of them has been the idea of an Internet-scale peer-to-peer network that would use excess disk storage at everyone's computers, in the same way that networks like Folding@Home use excess CPU, to deliver a robust, attack-resistant, decentralized storage infrastructure. Intermemory, from NEC's Princeton lab in 1998, was one of the first, but the concept is so attractive that there have been many others, such as Berkeley's Oceanstore. None have succeeded in attracting the mass participation of projects such as Folding@Home. None have become a widely-used infrastructure for digital preservation because without mass participation none provides the needed robustness or capacity.

By far the most successful peer-to-peer network in attracting participation has been Bitcoin, because the reward for participation is monetary. Now, it seems to me that Andrew Miller and his co-authors from the University of Maryland and Microsoft Research have taken a giant step towards this "holy grail" with their paper Permacoin: Repurposing Bitcoin Work for Data Preservation (hereafter MJSPK). This is despite the fact that, as I predicted in a comment last April, the current Bitcoin implementation has now definitively failed in its goal of establishing a decentralized currency because GHash has, for extended periods, controlled an absolute majority of the mining power. Follow me below the fold for my analysis of Permacoin and how this failure affects it.

Why didn't Intermemory, Oceanstore and the others succeed? I can think of a number of reasons:
  • Their model of the edge of the Internet was that there were a lot of desktop computers, continuously connected and powered-up, with low latency and no bandwidth charges, and with 3.5" hard disks that were mostly empty. Since then, the proportion of the edge with these characteristics has become vanishingly small.
  • In many cases, for example Samsara, the idea was that participants would contribute disk space and, in return, be entitled to store data in the network. Mechanisms were needed to enforce this trade, to ensure that peers actually did store the data they claimed to, and these mechanisms turned out to be hard to make attack-resistant.
  • Even if erasure coding were used to reduce the overall replication factor, it would still be necessary for participants to contribute significantly more space than they could receive in return. And the replication factor would need to be higher than in a centrally managed storage network.
  • I don't remember any case in which participants could expect a monetary reward. In the days when storage was thought to be effectively free, why would participants need to be paid? Alas, storage is a lot less free than it used to be.
  • The idea that peer-to-peer technology could construct a reliable long-term storage infrastructure from a self-organizing set of unreliable, marginally motivated desktops wasn't persuasive.
So, the holy grail needs a network consisting of permanently connected servers, which generate monetary income more than enough to cover their costs.  Bitcoin is such a network; the rewards are not merely monetary, but also speculative, always a major draw. They have been enough to drive a large and continuing  investment in hardware. MJSPK estimate a lower bound of $80M for the current network, likely a considerable under-estimate. Note also that this investment has a very short productive life:
the overall network hash rate has been doubling every 3-4 weeks, and therefore, mining equipment has been losing half its production capability within the same time frame. After 21-28 weeks (7 halvings), mining rigs lose 99.3% of their value. So there is very high equipment turnover in the mining industry.
Thus the rate of investment in the network must be several hundred million dollars a year. Even though Bitcoin is currently an $8000M market, that is a significant overhead. Not to mention power, cooling, space and other costs.

Bitcoin peers expend vast amounts of compute power, and thus energy, performing operations that have no point other than running the currency network. MJSPK shows how this effort could, in addition to running the network, perform the integrity checking needed for a robust, decentralized storage infrastructure at Internet scale. What follows is my summary of MJSPK; I don't claim to be a Bitcoin expert so for the real scoop you need to read the paper.

The process that consumes the compute power in Bitcoin is SHA256 hashing. MJSPK describes the role this hashing plays in the system as being a scratch-off puzzle (SOP). They express the requirements for Bitcoin SOPs as:
  • Predictable effort. The gradual re-calibration of the difficulty of mining to ensure a reward is issued about every 10 minutes requires that the effort to solve an SOP be, on average, predictable.
  • Fast verification. Since transactions and coins have to be checked frequently by users, the process for checking that a scratched-off ticket is a winner must be cheap.
  • Pre-computation resistance. It must not be possible for a miner to get started working on a new SOP before its competitors.
  • Linearity of expected reward. I.e. incentive-compatibility.
For simplicity, MJSPK envisages the network storing a single static dataset too large to be stored by any single node in the network, although they discuss extensions to handle dynamic data. They use erasure coding, so that the dataset is divided into f blocks and the blocks coded into rf (r > 1) segments for storage such that from any f segments the dataset can be recovered. Although tightly managed storage infrastructures such as Cleversafe use values of r only slightly greater than 1, the loosely managed Permacoin network would need to use much larger values; in one example they use r=4.

Each node initially generates a public-private key pair, chooses the number of segments it will store, and fetches that many segments. The specific set of segments is determined by the hash of its public key. It is important for Permacoin that the segments be small enough that a typical node will store a large number of them. In the paper 64 bytes is the segment size.

The nodes cannot simply be trusted to store their assigned segments, they must be required at regular intervals to prove that they are correctly storing them. It is these Proofs of Retrievability (PORs) that form Permacoin's SOPs. Each SOP ticket requires the node, for a subset of the segments it stores, to compute the hash of the segment and other information, then sign the resulting hash with their private key.

The next segment from the subset is then chosen based on the node's public key and the signature, so that it cannot be pre-fetched. This disincentivizes nodes from using remote storage, e.g. in the cloud or in another node, since the network latency would slow down their mining and reduce their income. The alternative would be to reveal their private key to the cloud; since this is the key that controls their payments it is unlikely to be revealed.

One interesting aspect of Permacoin is:
Another distinctive feature of our setting is that most PORs aren’t explicitly verified. In order to try to solve an SOP, a client must generate PORs on its blocks of F . But a client that never mines a block successfully may never release any POR. In this sense, PORs implicitly incentivize correct storage: a client stores Fi in order to be able to generate a correct POR, whether or not the POR is ever in fact verified.
Nodes must store their segments of the dataset correctly in order to have any chance of a reward. They will verify their own PORs in order to ensure this whether or not they ever gain a reward and thus have their POR publicly verified.

Permacoin is an extremely interesting proposal, although I have three doubts about its practicality.

First, I doubt that it would be possible to evolve the current Bitcoin network into Permacoin, or that a separate Permacoin network could survive in competition with Bitcoin. The additional hardware and power costs of the storage needed would prevent existing Bitcoin miners from switching. Network effects have successfully prevented alternative currencies even with the same mining costs from gaining significant market share against Bitcoin. Note that unless Permacoin gains significant participation it will suffer the same fate as systems like Intermemory.

Second, as I noted in my April comment, increasing returns to scale (economies of scale) pose a fundamental problem for peer-to-peer networks that do gain significant participation. One necessary design goal for networks such as Bitcoin is that the protocol be incentive-compatible, or as Ittay Eyal and Emin Gun Sirer (ES) express it:
the best strategy of a rational minority pool is to be honest, and a minority of colluding miners cannot earn disproportionate benefits by deviating from the protocol
They show that the Bitcoin protocol was, and still is, not incentive-compatible.

Even if the protocol were incentive-compatible, the implementation of each miner would, like almost all technologies, be subject to increasing returns to scale. Buying the specialized mining hardware in bulk would be cheaper, owning a data center in which to install it would be cheaper than renting space, and so on the larger the scale. Thus the profits from mining will accrue preferentially to the larger players. W. Brian Arthur's analysis will apply, and the mining market will be dominated by one, or a few organizations. This has actually happened (see GHash). So far these organizations have been pools, in which many individual miners cooperate to reduce the variance of their income.

Indeed, miners have a very strong incentive to join a pool. In their blog post proposing a fix to disincentivize large pools, ES point out that:
A miner scratches one ticket after another until he finds a winning ticket and publishes it to receive its rewards. But a lone miner working alone can be scratching for months before he finds a winning ticket. For instance, A $1500 mining card, available for preorder from Butterfly Labs, will need an expected time of over 6 years to mine a single block, if it were available today. That's a long time to wait in between winning tickets.
A "long time" is the kind of understatement that I, as an Englishman, appreciate. The same post reveals that the "long time" is more than an order of magnitude longer than the useful life of the hardware. So the overwhelming majority of mining cards will never find a winning ticket. Pools are essential to maintaining participation in the network; without them participation would be like a lottery with a single prize.

ES's proposed disincentive for large pools depends on making all participants in a pool capable of stealing the reward, or backstabbing, so they have to place a great deal of trust in each other. Unfortunately, backstabbing might well be so effective that no pool could gather enough members to provide variance of income low enough to motivate participation.

Backstabbing doesn't work among members belonging to the same organization (or infected by the same malware):
This mechanism does not address the case of a single large miner who owns all of his own equipment. In such a case, he can simply trust all of his machines to not leak the secret, and amass power. We not only have never heard of any technique that can address this issue, but we do not believe it needs to be addressed. The problem at hand stems from open, public pools. The costs of acquiring sufficient hardware to own 51% of the Bitcoin network put it out of the reach of most non-governmental single actors.
The proposed disincentive is ingenious and would work against current large pools, but I wouldn't be so sanguine:
GHash reportedly owns 25% of their own mining rigs, which corresponds only 14% of the overall hash rate.
A lower bound on the cost of the current hardware is $80M. GHash itself (not its members) has already invested at least $10M. A lower bound on the cost of the additional hardware GHash would need to achieve total permanent control of an $8000M market would be about $60M. That is very considerable leverage; they would only need to extract 0.7% of the value of the market to be in profit. Even if the cost was well above the lower bound it would be well within the budget of crime syndicates, for example in Russia where GHash is apparently based.

Even if no organization set out to buy control of the network, it would end up with very few individuals, basically just lottery addicts, and a set of organizations each owning their own equipment. Again, increasing returns to scale would be effective, and despite the disincentive a single dominant organization would be likely to emerge.

Unless some solution to the effect of increasing returns to scale can be developed, Permacoin would not be very different from buying storage from a few large providers, such as Amazon. Storing, as MJSPK suggests, "data of great public significance" in a network that might suddenly, without notice, become controlled by a single organization does not seem prudent. Especially if, as appears to be the case with GHash, the organization has a history of behaving in suspicious ways, breaking its promises, and abusing its power. And if the potential rewards for misbehavior vastly exceed those for correct behavior.

Third, note that Bitcoin may be wildly successful in comparison to other digital currencies, but its rate  of 60-70K transactions/day is about 10% of Western Union's, 1% of PayPal's and 0.02% of the big 4 credit card companies' rate. The rate was 60-70K a year ago, meaning Bitcoin's share of the transaction market was probably decreasing even before its current troubles. So even if it were possible to evolve the Bitcoin network into Permacoin, and even if the problem of increasing returns to scale were solved, the network might not have enough traction to survive over timescales of interest for digital preservation.


Henry Newman said...


A few other comments after some more thought.

1. The difference between Bitcoin and Permacoin is that amount of space needed is far different. You can move a bitcoin to your cell phone but only part of a permacoin will be able to be moved.
2. Permacoin depends quality storage vs. high performance CPU.

David. said...

Henry, one misconception and one good point.

What you want on your cell phone is your Bitcoin wallet. All that is in your Bitcoin wallet is the cryptographic credentials that secure your transactions; the coins themselves are entries in the blockchain which the miners store. Permacoin wallets would not be significantly bigger than Bitcoin wallets.

The storage in Permacoin is associated not with wallets but with mining. No-one mines Bitcoin on a phone except via malware (and Permacoin would make even that pretty much infeasible); these days it takes a rack full of custom chips to be a competitive miner.

On the other hand, you are correct that to be a competitive Permacoin miner, as MJSPK demonstrate, requires the miner to store its shards of the dataset in RAM (see Table 2) coupled with (at least) a GPU.

In effect, Permacoin is a storage system that keeps r copies, say 4, of the dataset in RAM. That is extraordinarily expensive in hardware and power compared to more conventional storage techniques. Couple that with the very rapid depreciation of mining hardware and it seems clear that Permacoin would be vastly too expensive to compete for the long-term storage market.

If my estimate of, say, two hundred million dollars per year for the rate of equipment investment in Bitcoin mining is correct, and we assume the RAM is free, and the network could store 4PB (see Section 7), we end up with $50M/PB/yr. Storing all 4PB in S3's Reduced Redundancy Storage costs just over $1M/yr. So even if Permacoin were successful it wouldn't be even close to cost-competitive.

David. said...

The unfortunate side-effects of storing "data of great public significance" in a system like Permacoin need to be taken into account too.

David. said...

Ghash.io is promising to be good in the future. What such promises are worth given their history is left as an exercise for the reader.

David. said...

The Bitcoin mining arms race continues apace. At least two mining silicon vendors have build 20MW data centers full of their mining chips in an illustration of the need for scale even to generate steady rather than increasing returns.

Industrial miners are expected to invest at least $100M/month over the second half of 2014. My estimate was too low by a factor of 6, so we are talking $300M/PB/yr.

This level of industrialization would make Permacoin essentially identical to S3 just 300 times as expensive. What exactly do you get for the extra $299M/yr?

David. said...

A recent Bitcoin theft reveals another threat to networks such as Permacoin in which storage is rewarded with valuable "coins"; BGP hijacking.

Of course, even storage networks that don't reward with valuable "coins" are vulnerable to BGP hijacking, but there is much less motivation to employ it. Storage networks should validate both ends of each connection with SSL certificates using keystores supplied out-of-band, not via "trusted CAs" which are not in fact trustworthy. LOCKSS networks can be configured in this way.

David. said...

A couple of links for those readers who still think the edge of the Internet is made up of desktop systems with lots of spare disk capacity.