Thursday, May 5, 2022

Probabilistic Fault Tolerance

In our 2003 SOSP "Best Paper" Preserving Peer Replicas By Rate-Limited Sampled Voting (also here, expanded version here) we introduced a number of concepts. The most important was consensus based on sampling the electorate at random, in contrast to the traditional Byzantine Fault Tolerance (BFT), which at every step solicits votes from the entire electorate. The advantages of this mechanism are so significant that, a decade and a half later, it was re-discovered and improved upon in the context of a blockchain protocol. Below the fold I explain our original consensus system, and describe how the basic idea has been adopted and enhanced.
To err is human, but to really foul things up you need a computer. Anonymous


In the real world computers are, alas, neither reliable nor proof against attacks. Thus when in 1982 Leslie Lamport et al published The Byzantine Generals Problem it was a major advance in computer science. They showed an algorithm by means of which a system of n=3f+1 replicas would perform correctly despite f of the replicas simultaneously being failed or malign.

Our application was long-term preservation of the digital academic literature, as libraries have managed so successfully over centuries for the paper literature. Reliability and resistance to attack were critical. In this context BFT suffered the following problems:
  • It was brittle. Despite the theoretical rigor of the sufficiency of 3f+1 replicas to survive f simultaneous faults, any practical implementation would have some finite probability of encountering F>f faults at the same time. Once F simultaneous faults occur, the state of the entire system is unpredictable, or in our case failed. Thus BFT provides probabilistic fault tolerance, and does not degrade gracefully under failure or attack.
  • It was centralized. BFT requires trustworthy global knowledge of the n members of the electorate, which through time requires a trusted central authority admintting and excluding members, and communicating these changes to the electorate. Part of the reason for the global community of libraries' success in preserving paper is that it is decentralized; there is no central authority determining whether an institution is a library. As we wrote in 2000's Permanent Web Publishing:
    Libraries' circulating collections form a model fault-tolerant distributed system. It is highly replicated, and exploits this to deliver a service that is far more reliable than any individual component. There is no single point of failure, no central control to be subverted. There is a low degree of policy coherence between the replicas, and thus low systemic risk. The desired behavior of the system as a whole emerges as the participants take actions in their own local interests and cooperate in ad-hoc, informal ways with other participants.
  • It did not scale. Because the content we preserved is owned by some of the most litigious copyright holders, we felt any scheme that involved libraries providing preservation as a service to other libraries was likely to end up in court. Important academic journals have hundreds or thousands of library subscriptions, so we needed a system that would scale to thousands of replicas. BFT requires global communication from an elected leader node, so has overhead that rises rapidly with n.
We summarized our approach thus:
In common with the Byzantine-fault-tolerance ... our voting protocol derives an apparent prevailing opinion among a set of peers, some of whom are malicious. There are many differences; our population size is too large for BFT’s global communication, we degrade gradually rather than mask the effects of failure or attack, and because we cannot assume an absolute upper bound on the malicious peers’ resources we have to consider the possibility of being overwhelmed. We use sampling to avoid global knowledge or communication, rate-limiters to prevent our adversary’s unlimited resources from overwhelming the system quickly, and integrated intrusion detection to preempt unrecoverable failure.
A library's node assures itself that its copy of some content matches the consensus of the network of libraries, each node at intervals selects a random subset of the other libraries it knows about and invites them to vote in a poll. To oversimplify, the group of nodes vote on the hash of the content. Why is this a good idea?

Our research simulated attacks by three kinds of adversaries; a stealth attacker whose goal was to change the consensus without detection, a nuisance adversary wwhose goal was to discredit the system by flooding it wih false alarms, and an attrition adversary whose goal was DDoS-like, to waste resources. Because random corruption of content is both rare and un-correlated, poll results should be bimodal; landslide agreement if the node's content is good, landslide disagreement if it is corrupt. Any poll result between these extremes indicates correlated corruption, a signal of a stealth attack that raises an alarm.

The stealth attacker is assumed to control many Sybil nodes, but in order to avoid raising an alarm must vote "correctly" in any poll for which these nodes are not a large majority. The attacker's problem is that he only knows which of them have been invited into each poll, not how many invitees there are in total. Thus he has to guess in which polls he has enough of a majority to win without raising an alarm. Since changing the overall consensus requires winning many polls in quick succession, the attacker has to guess right consistently many times, making an undetected successful attack extremely difficult. The sample graph shows simulations of a stealth attack. The X axis is the proportion of the network nodes the attacker controls, the Y axis is the time to attack detection in years. In this simulation no runs featured a successful undetected attack. Note that the more of the network the attacker controls, the sooner he is detected.
  • It is resilient. It is extremely difficult for an attack, even with unlimited resources, to succeed in corrupting a majority of nodes before being detected. Once detected, node administrators can recover from the majority of valid nodes.
  • It is decentralized. Each node operates autonomously, communicating with nodes it discovers because in the past they have agreed with other nodes it knows about. There is no assigned or elected leader and no global communication or synchronization.
  • It scales, because there is no global communication or synchronization, and because as n increases the fraction of nodes involved in each poll decreases.


Scalable and Probabilistic Leaderless BFT Consensus through Metastability by "Team Rocket" et al claims (my emphasis):
This paper introduces a new family of consensus protocols called Snow. Inspired by gossip algorithms, this family gains its properties through a deliberately metastable mechanism. Specifically, the system operates by repeatedly sampling the network at random, and steering correct nodes towards a common outcome. Analysis shows that this metastable mechanism is powerful: it can move a large network to an irreversible state quickly, where the irreversibility implies that a sufficiently large portion of the network has accepted a proposal and a conflicting proposal will not be accepted with any higher than negligible (ε) probability.

... the Snow family tolerates discrepancies in knowledge of membership, as we discuss later. In contrast, classical consensus protocols require the full and accurate knowledge of n as its safety foundation.

Snow’s subsampled voting mechanism has two additional properties that improve on previous approaches for consensus. Whereas the safety of quorum-based approaches breaks down immediately when the predetermined threshold f is exceeded, Snow’s probabilistic safety guarantee degrades smoothly when Byzantine participants exceed f.
Does this sound familiar? And yet their extensive "Related Work" section fails to cite our SOSP paper, or the later expanded version in ACM Transactions on Computer Systems. This is all to say that "Team Rocket" et al overstate how novel their system is. Despite that, it is clear that they have made major advances in the following areas:
  • Their system's consensus is dynamic where our application required a static consensus.
  • Their system operates rapidly where our application required strong rate-limiters.
  • They provide a rigorous theoretical analysis, which our paper lacked.
"Team Rocket" et al's exposition proceeds in a sequence of increasingly effective protocols:
  • In Slush, each node repeatedly samples the set of nodes it knows in essentially the same way the we did.
  • In Snowflake, the nodes retain state counting how long it has been since it saw disagreement. This is similar to the state our nodes retained.
  • In Snowball, the nodes retain additional state that provides hysteresis, a significant improvement on our protocol.
  • In Avalanche, they use Snowball to implement the basic functions of a cryptocurrency, a completely different and more demanding application than ours.
The "leaderless" in their title is important. Nakamoto consensus is in effect a two-phase process. First there is a race to elect the node which will validate the block, in Avalanche's terms a leader, and then there is an extended consensus phase driven by the longest chain rule which convinces nodes that disagreed with the election to fall in line. This is why Bitcoin transactions should not be considered final until they are six blocks from the head. Snowball is a single-phase consensus protocol.

Note that none of these protocols defend against Sybil attacks:
A long line of work, including PBFT [15], treats the Sybil problem separately from consensus, and rightfully so, as Sybil control mechanisms are distinct from the underlying, more complex agreement protocol. In fact, to our knowledge, only Nakamoto-style consensus has “baked-in” Sybil prevention as part of its consensus, made possible by chained proof-of-work [5], which requires miners to continuously stake a hardware investment. Other protocols, discussed in Section 7, rely on proof-of-stake (by economic argument) or proof-of-authority (by administrative argument that makes the system “permissioned”). The consensus protocols presented in this paper can adopt any Sybil control mechanism, although proof-of-stake is most aligned with their quiescent operation. One can use an already established proof-of-stake based mechanism [30].
Nor do they defend against the kind of attacks we discussed in Attrition Defenses for a Peer-to-Peer Digital Preservation System by TJ Giuli et al. They write::
Without a protection mechanism, an attacker can generate large numbers of transactions and flood protocol data structures, consuming storage. There are a multitude of techniques to deter such attacks, including network-layer protection, proof-of-authority, local proof-of-work and economic mechanisms. In Avalanche, we use transaction fees, making such attacks costly even if the attacker is sending money back to addresses under its control.
Of course, fees need to be high enough to deter flooding attacks; greatly increasing the supply of transactions as Avalanche does will reduce the per-transaction fee but may not decrease the total cost of a flooding attack. But note that the Solana blockchain's competitive advantage was very low fees, and the result was:
On April 30, NFT minting bots began flooding the Solana network with 4 million transactions per second, causing the network to lose consensus. The project tweeted that "Engineers are still investigating why the network was unable to recover, and validator operators prepare for a restart." The network was offline for seven hours.

This is hardly the first instability the network has demonstrated, much to the chagrin of its users. Transaction flooding is an issue on Solana in part because of the low transaction fees compared to networks like Bitcoin and Ethereum, which have relatively high gas fees that would make flooding extremely expensive.
There are a number of issues around Proof-of-Stake and fees that will have to wait for a future post.

Overall, Avalanche's use of random sampling, a directed acyclic graph rather than a chain, and Proof-of-Stake results in a more efficient and more robust cryptocurrency implementation, which is a significant achievement.

1 comment:

David. said...

In the wake of the great UST de-peg Avalanche's AVAX has been hit hard. Its high for the week was $69.48 - today it dropped to $26.03 before recovering a little.