Tuesday, May 30, 2017

Blockchain as the Infrastructure for Science? (updated)

Herbert Van de Sompel pointed me to Lambert Heller's How P2P and blockchains make it easier to work with scientific objects – three hypotheses as an example of the persistent enthusiasm for these technologies as a way of communicating and preserving research, among other things. Another link from Herbert, Chris H. J. Hartgerink's Re-envisioning a future in scholarly communication from this year's IFLA conference, proposes something similar:
Distributing and decentralizing the scholarly communications system is achievable with peer-to-peer (p2p) Internet protocols such as dat and ipfs. Simply put, such p2p networks securely send information across a network of peers but are resilient to nodes being removed or adjusted because they operate in a mesh network. For example, if 20 peers have file X, removing one peer does not affect the availability of the file X. Only if all 20 are removed from the network, file X will become unavailable. Vice versa, if more peers on the network have file X, it is less likely that file X will become unavailable. As such, this would include unlimited redistribution in the scholarly communication system by default, instead of limited redistribution due to copyright as it is now.
I first expressed skepticism about this idea three years ago discussing a paper proposing a P2P storage infrastructure called Permacoin. It hasn't taken over the world. [Update: my fellow Sun Microsystems alum Radia Perlman has a broader skeptical look at blockchain technology. I've appended some details.]

I understand the theoretical advantages of peer-to-peer (P2P) technology. But after nearly two decades researching, designing, building, deploying and operating P2P systems I have learned a lot about how hard it is for these theoretical advantages actually to be obtained at scale, in the real world, for the long term. Below the fold, I try to apply these lessons.

For the purpose of this post I will stipulate that the implementations of both the P2P technology and the operating system on which it runs are flawless, and their design contains no vulnerabilities that the bad guys can exploit. Of course, in the real world there will be flaws and vulnerabilities, but discussing their effects on the system would distract from the message of this post.

Heller's three hypotheses are based on the idea of using a P2P storage infrastructure such as IPFS that names objects by their hash:
  • It would be better for researchers to allocate persistent object names than for digital archives to do so. There are a number of problems with this hypothesis. First, it doesn't describe the current situation accurately. Archives such as the Wayback Machine or LOCKSS try hard not to assign names to content they preserve, striving to ensure that it remains accessible via its originally assigned URL, DOI or metadata (such as OpenURL). Second, the names Heller suggests are not assigned by researchers; they are hashes computed from the content. Third, hashes are not persistent over the timescales needed because, as technology improves over time, it becomes possible to create "hash collisions", as we have seen recently with SHA1.
  • From name allocation plus archiving plus x as a “package solution” to an open market of modular services. Heller is correct to point out that:
    The mere allocation of a persistent name does not ensure the long-term accessibility of objects. This is also the case for a P2P file system such as IPFS. ... Since name allocation using IPFS or a blockchain is not necessarily linked to the guarantee of permanent availability, the latter must be offered as a separate service.
    The upside of using hashes as names would be that the existence and location of the archive would be invisible. The downside of using hashes as names is that the archive would be invisible, posing insurmountable business model difficulties for those trying to offer archiving services, and insurmountable management problems for those such as the Keeper's Registry who try to ensure that the objects that should be preserved actually are being preserved. There can't be a viable market in archiving services if the market participants and their products are indistinguishable and accessible freely to all. Especially not if the objects in question are academic papers, which are copyright works.
  • It is possible to make large volumes of data scientifically usable more easily without APIs and central hosts. In an ideal world in which both storage and bandwidth were infinite and free, storing all the world's scientific data in an IPFS-like P2P service backed up by multiple independent archive services would indeed make the data vastly more accessible, useful and persistent than it is now. But we don't live in an ideal world. If this P2P network is to be sustainable for the long term, the peers in the network need a viable business model, to pay for both storage and bandwidth. But they can't charge for access to the data, since that would destroy its usability. They can't charge the researchers for storing their data, since it is generated by research that is funded by term-limited grants. Especially in the current financial environment, they can't charge the researchers' institutions, because they have more immediate funding priorities than allowing other institutions' researchers to access the data in the future for free.
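To make the point about hash-based names concrete, here is a minimal Python sketch. It assumes plain SHA-256 hex digests as names; real IPFS identifiers use a multihash encoding over chunked data, which this does not reproduce, and `content_name` is a hypothetical helper of mine, not part of any library.

```python
import hashlib


def content_name(data: bytes, algorithm: str = "sha256") -> str:
    """Derive a name from the bytes themselves (hypothetical helper, not the IPFS format)."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


paper_v1 = b"Results: the effect size was 0.42\n"
paper_v2 = b"Results: the effect size was 0.43\n"

name_v1 = content_name(paper_v1)
name_v2 = content_name(paper_v2)

# The researcher does not choose the name: the same bytes always give the same name.
assert content_name(paper_v1) == name_v1

# Changing a single character gives an unrelated name, so a "name" cannot follow
# an object through corrections or new versions.
assert name_v1 != name_v2

# Moving to a different hash function renames every object; links that embed
# the old name do not follow automatically.
legacy_name = content_name(paper_v1, "sha1")
assert legacy_name != name_v1
```

The last assertion is the persistence problem in miniature: once a hash function has to be retired, every name derived from it has to be replaced, and nothing in the naming scheme itself says how.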
I have identified three major problems with Heller's proposal which also apply to Hartgerink's:
  • They would populate the Web with links to objects that, while initially unique, would over time become non-unique. That is, it would become possible for objects to be corrupted. When the links become vulnerable, they need to be replaced with better hashes. But there is no mechanism for doing so. This is not a theoretical concern: the BitTorrent protocol underlying IPFS has been shown to be vulnerable to SHA1 collisions.
  • The market envisaged, at least for archiving services, does not allow for viable business models, in that the market participants are indistinguishable.
  • Unlike Bitcoin, there is no mechanism for rewarding peers for providing services to the network.
None of these has anything to do with the functioning of the software system. Heller writes:
There is hope that we will see more innovative, reliable and reproducible services in the future, also provided by less privileged players; services that may turn out to be beneficial and inspirational to actors in the scientific community.
I don't agree, especially about "provided by less privileged players". Leave aside that the privileged players in the current system have proven very adept at countering efforts to invade their space, for example by buying up the invaders. There is a much more fundamental problem facing P2P systems.

Four months after the Permacoin post, inspired in part by Natasha Lomas' Techcrunch piece The Server Needs To Die To Save The Internet about the MaidSafe P2P storage network, I wrote Economies of Scale in Peer-to-Peer Networks. This is a detailed explanation of how the increasing returns to scale inherent to technologies in general (and networked systems in particular) affect P2P systems, making it inevitable that they will gradually lose their decentralized nature and the benefits that it provides, such as resistance to some important forms of attack.

[Figure: Unconfirmed transactions]
The history of Bitcoin shows this centralizing effect in practice. It also shows that, even when peers have a viable (if perhaps not sustainable) business model, based in Bitcoin's case on financial speculation, Chinese flight capital and crime such as ransomware, resources do not magically appear to satisfy demand.

As I write, about 100MB of transactions are waiting to be confirmed. A week and a half ago, Izabella Kaminska reported that there were over 200,000 transactions in the queue. At around 5 transactions/sec, that's around an 11-hour backlog. Right now, the number is about half that. How much less likely are resources to become available to satisfy demand if the peers lack a viable business model?
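The 11-hour figure is simple arithmetic. A small Python sketch, assuming a constant confirmation rate and no new transactions arriving while the queue drains (`backlog_hours` is a name I made up for illustration):

```python
def backlog_hours(queued_transactions: int, tx_per_second: float) -> float:
    """Hours needed to clear the queue at a constant rate, ignoring new arrivals."""
    seconds = queued_transactions / tx_per_second
    return seconds / 3600


hours = backlog_hours(200_000, 5)
print(f"{hours:.1f} hours")  # prints "11.1 hours"
```

Since new transactions keep arriving, this is a lower bound on how long a transaction at the back of the queue would actually wait.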

Because Bitcoin has a lot of peers and speculation has driven its value sky-high, it is easy to assume that it is a successful technology. Clearly, it is very successful along some axes. Along others, not so much. For example, Kaminska writes:
The views of one trader:
... This is the biggest problem with bitcoin, it’s not just that it’s expensive to transact, it’s uncertain to transact. It’s hard to know if you’ve put enough of a fee. So if you significantly over pay to get in, even then it’s not guaranteed. There are a lot of people who don’t know how to set their fees, and it takes hours to confirm transactions. It’s a bad system and no one has any solutions.
Transactions which fail to get the attention of miners sit in limbo until they drop out. But the suspended state leaves payers entirely helpless. They can’t risk resending the transaction, in case the original one does clear eventually. They can’t recall the original one either. Our source says he’s had a significant sized transaction waiting to be settled for two weeks.

The heart of the problem is game theoretical. Users may not know it but they’re participating in what amounts to a continuous blind auction.

Legacy fees can provide clues to what fees will get your transactions done — and websites are popping up which attempt to offer clarity on that front — but there’s no guarantee that the state of the last block is equivalent to the next one.
Right now, if you want a median-sized transaction in the next block you're advised to bid nearly $3. The uncertainty is problematic for large transactions and the cost is prohibitive for small ones. Kaminska points out that the irony is:
given bitcoin’s decentralised and real-time settlement obsession, ... how the market structure has evolved to minimise the cost of transaction.

Traders, dealers, wallet and bitcoin payments services get around transaction settlement choke points and fees by netting transactions off-blockchain.

This over time has created a situation where the majority of small-scale payments are not processed on the bitcoin blockchain at all. To the contrary, intermediaries operate for the most part as trusted third parties settling netted sums as and when it becomes cost effective to do so. ... All of which proves bitcoin is anything but a cheap or competitive system. With great irony, it is turning into a premium service only cost effective for those who can’t — for some reason, ahem — use the official system.
There's no guarantee that the axes on which Bitcoin succeeded are those relevant to other blockchain uses; the ones on which it is failing may well be. Among the blockchain's most hyped attributes were the lack of a need for trust, and the lack of a single point of failure. Another of Kaminska's posts:
Coinbase has been intermittently down for at least two days.

With an unprecedented amount of leverage in the bitcoin and altcoin market, a runaway rally that doesn’t seem to know when to stop, the biggest exchange still not facilitating dollar withdrawals and incremental reports about other exchanges encountering service disruption, it could just be there’s more to this than first meets the eye.

(Remember from 2008 how liquidity issues tend to cause a spike in the currency that’s in hot demand?)
These problems illustrate the difficulty of actually providing the theoretical advantages of a P2P technology "at scale, in the real world, for the long term".

Update: In Blockchain: Hype or Hope? Radia Perlman provides a succinct overview of blockchain technology, asks what is novel about it, and argues that the only feature of the blockchain that cannot be provided at much lower cost by preexisting technology is:
a ledger agreed upon by consensus of thousands of anonymous entities, none of which can be held responsible or be shut down by some malevolent government
But, as she points out:
most applications would not require or even want this property. And, as demonstrated by the Bitcoin community's reaction to forks, there really are a few people in charge who can control the system
She doesn't point out that, in order to make money, the "thousands of ... entities" are forced to cooperate in pools, so that in practice the system isn't very decentralized, and the "anonymous entities" are much less anonymous than they would like to believe (see here and here).

Radia's article is a must-read corrective to the blockchain hype. Alas, although I have it in my print copy of Usenix ;login:, it doesn't appear to be on the Usenix website yet, and even when it is it will only be available to members for a year. I've made a note to post about it again when it is available.


David. said...

Radia Perlman's article is now on-line. It's free for Usenix members but otherwise $8 until this time next year. A sample of her critique:

"The startling aspect of the Bitcoin hash is that it is equally difficult for the community of miners to compute a hash as for someone to forge a hash. This means that the security of Bitcoin depends on the assumption that no entity or collection of entities can amass as much compute power as the Bitcoin mining community. This is a very surprising assumption. It would indeed be easy for a nation-state to amass more compute power than the Bitcoin community."

It doesn't take a nation-state, just a collaboration among miners mustering more than half the mining power. This is not a theoretical possibility; it happened in 2014.
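Why "more than half" is the threshold follows from the gambler's-ruin analysis in the Bitcoin white paper: an attacker with share q of the mining power who is z blocks behind the honest chain (share p = 1 - q) eventually catches up with probability (q/p)^z if q < p, and with certainty otherwise. A Python sketch of that formula; the function name is mine, and the model is the white paper's simplified one, ignoring refinements such as selfish mining:

```python
def catch_up_probability(attacker_share: float, blocks_behind: int) -> float:
    """Chance that an attacker ever catches up from `blocks_behind` blocks back.

    Simplified gambler's-ruin model from the Bitcoin white paper:
    (q/p)**z when q < p, otherwise 1.
    """
    if not 0.0 <= attacker_share <= 1.0:
        raise ValueError("attacker_share must be between 0 and 1")
    if blocks_behind < 0:
        raise ValueError("blocks_behind must be non-negative")
    honest_share = 1.0 - attacker_share
    if attacker_share >= honest_share:
        return 1.0  # with half or more of the power, catching up is certain
    return (attacker_share / honest_share) ** blocks_behind


# A minority attacker's chances shrink geometrically with each confirmation...
minority = catch_up_probability(0.25, 6)
# ...but a pool or cartel with a majority always succeeds eventually.
majority = catch_up_probability(0.51, 6)
assert minority < 0.01
assert majority == 1.0
```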

David. said...

Izabella Kaminska's latest post to the FT's Alphaville blog discusses Dissecting Ponzi schemes on Ethereum: identification, analysis, and impact by Bartoletti et al. Kaminska writes:

"For the most part, the question that really needs asking is this: What sort of legitimate organisation really benefits from a decentralised or headless state? Also, what sort of company benefits from decentralised funding options or from providing decentralised services? The answer is almost none.

Decentralisation is, in almost all cases, not an efficiency. To the contrary, it’s a cost that adds complexity and creates an unnecessary burden for both users and operators unless centralised layers are added on top of it — defying the whole point.


At the end of the day, there are only two groups of people prepared to go to costly lengths to decentralise a service which is already available (in what is often a much higher quality form) in a centralised or conventional hierarchal state. One group is criminals and fraudsters. The other is ideologues and cultists. The first sees the additional cost/effort as worthwhile due to the un-censorable utility of these systems. The second consumes it simply as a luxury or cultured good."