The problem with pursuing such a goal is that it has led us down a path of "brittle failure" where things work right up until they fail, and then they fail catastrophically. The outcome is forced to be binary.
My response also made the IP list:
In most of Computer Science, there have been only relatively modest efforts directed at building systems which fail gracefully, or partially. Certainly some sub-specialties have spent a lot of effort on this notion, but it is not the norm in the education of a journeyman system builder.
If it is the case that we are unlikely to build any large system which is fail-proof, and that certainly seems to be the situation, we need to focus on building systems which can tolerate, isolate, and survive local failures.
Mike is absolutely right to point out the brittle nature of most current systems. But education isn't going to fix this.
My co-authors and I won Best Paper at SOSP 2003 for showing a system in a restricted application space that, under attack, failed slowly and made "alarming noises". The analogy is with suspension bridges - they use stranded cables for just this reason.
I should explain the comment about the importance of "rate limits, excess replication and randomization":
However, the cost differential between stranded and solid cables in a bridge is small. Brittle fault-tolerant systems such as Byzantine Fault Tolerance are a lot more expensive than a non-fault-tolerant system that (most of the time) does the same job. Systems such as the one we showed are a lot more expensive than BFT. This is because three essential aspects of a, I believe any, solution are rate limits, excess replication and randomization.
The problem is that vendors of systems are allowed to disclaim liability for their products. Given that even the most egregious failure is unlikely to cause more than reputational harm, why would a vendor even implement BFT, let alone something much more expensive?
Just finding techniques that allow systems to fail gracefully is not going to be enough (not that it is happening). We need techniques that do so with insignificant added cost. That is a truly hard problem. But we also need to change the law so that vendors cannot escape financial liability for the failures of their products. That is an even harder problem.
- Rate Limits: The design goal of almost all systems is to do what the user wants as fast as possible. This means that when the bad guy wrests control of the system from the user, the system will do what the bad guy wants as fast as possible. Doing what the bad guy wants as fast as possible pretty much defines brittleness in a system; failures will be complete and abrupt. In last year's talk at UC Berkeley's Swarm Lab I pointed out that rate limits were essential to LOCKSS, and linked to Paul Vixie's article Rate-Limiting State making the case for rate limits on DNS, NTP and other Internet services. Imposing rate limits on system components makes the overall system more expensive.
- Excess Replication: The standard fault-tolerance technique, Byzantine Fault Tolerance (BFT), is brittle. As faults in the system increase, it works perfectly until they pass a threshold. After that the system is completely broken. The reason is that BFT defines the minimum number of replicas that can survive a given number of faults. In order to achieve this minimum, every replica is involved in every operation of the system. There is no cushion of excess, unnecessary replicas to help the system retain some functionality above the threshold at which it stops behaving perfectly. The LOCKSS system was not concerned with minimizing the number of replicas. It assumed that it had excess replicas, Lots Of Copies, so it could Keep Stuff Safe by failing gradually as faults increased. Adding replicas to the system makes it more expensive.
- Randomization: In general, the more predictable the behavior of the system the easier it is to attack. Randomizing the system's behavior makes it unpredictable. A significant part of the LOCKSS system's defenses is that since the selection of replicas to take part in each operation is random, the bad guy cannot predict which they are. Adding randomization to the system makes it more expensive (and harder to debug and test).
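The interaction between excess replication and randomization can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the LOCKSS protocol: it assumes each operation polls a uniformly random sample of replicas and is subverted only when a strict majority of that sample is faulty. The function names, the population of 100 replicas and the sample size of 9 are all hypothetical. For contrast, `bft_tolerates` encodes the standard BFT bound of at least 3f + 1 replicas to survive f faults, which is a step function.

```python
from math import comb


def bft_tolerates(total: int, faulty: int) -> bool:
    """Standard BFT bound: n replicas survive f faults only if n >= 3f + 1.

    Below the threshold the answer is 'works perfectly'; above it, 'broken'.
    """
    return 3 * faulty + 1 <= total


def p_sample_subverted(total: int, faulty: int, sample: int) -> float:
    """Probability that a random sample has a strict majority of faulty replicas.

    Toy model (an assumption, not taken from the post): `sample` replicas are
    drawn uniformly without replacement from `total`, of which `faulty` are bad.
    This is the upper tail of a hypergeometric distribution.
    """
    need = sample // 2 + 1  # smallest count that is a strict majority
    hits = sum(
        comb(faulty, k) * comb(total - faulty, sample - k)
        for k in range(need, min(sample, faulty) + 1)
    )
    return hits / comb(total, sample)


if __name__ == "__main__":
    total, sample = 100, 9  # hypothetical sizes, chosen only for illustration
    for faulty in range(0, total + 1, 10):
        print(
            f"faulty={faulty:3d}  "
            f"BFT ok={bft_tolerates(total, faulty)!s:5}  "
            f"P(random poll subverted)={p_sample_subverted(total, faulty, sample):.4f}"
        )
```

In this model the chance that any one poll is subverted rises smoothly as faults accumulate rather than jumping from zero to one at a threshold, and because the sample is random the attacker cannot choose which replicas to corrupt for a given operation. The price is the one the list above names: many more replicas than the minimum.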
One of the motivations for packet switching and the ARPAnet was the ability to continue communications even during/after a nuclear holocaust. (Yes, I know that some people claim that that was not the purpose - but I was there, at SDC, from 1972 building ARPAnet-like networks with that specific purpose.)
Examples of brittle systems abound:
In recent years, or decades, we seem to be moving towards network architectures that are more brittle.
For example, there is a lot of discussion about "Software Defined Networks" and OpenFlow - which to my mind is ATM re-invented. Every time I look at it I think to myself "this design invites brittle failures."
My personal concern is slightly different. I come from a family of repairmen - radio and then TV - so when I look at something I wonder "how can it break?" and "how can it be repaired?".
We've engineered the internet so that it is not easy to diagnose problems. Unlike Ma Bell, we have not learned to make remote loopbacks a mandatory part of many parts of the system. Thus we often have a flat, one-sided view of what is happening. And if we need the view from the other end we often have to ask for the assistance of non-technical people who lack proper tools or the knowledge of how to use them.
As a first step we ought to be engineering more test points and remote loopback facilities into internet protocols and devices.
And a second step ought to be the creation of a database of network pathology. With that we can begin to create tools that help us reason backwards from symptoms towards causes. I'm not talking artificial intelligence or even highly expert systems. Rather this would be something that would help us look at symptoms, understand possible causes, and know what tests we need to run to begin to evaluate which of the possible causes are candidates and which are not.
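One way to picture such a database is as a simple lookup from observed symptoms to candidate causes and the tests that would discriminate between them. The following is a minimal sketch, not a proposal for the actual schema: the `Pathology` record, the `candidates` function and the three catalog entries are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathology:
    cause: str
    symptoms: frozenset  # symptoms this cause is known to produce
    tests: tuple         # tests that help confirm or rule out the cause


# Illustrative entries only; a real catalog would be built from field reports.
CATALOG = [
    Pathology(
        cause="path MTU discovery blocked",
        symptoms=frozenset({"large transfers stall", "small packets succeed"}),
        tests=("send don't-fragment probes of increasing size",),
    ),
    Pathology(
        cause="DNS resolver failure",
        symptoms=frozenset({"names fail to resolve", "connections by address succeed"}),
        tests=("query the configured resolver directly",),
    ),
    Pathology(
        cause="duplex mismatch",
        symptoms=frozenset({"large transfers stall", "packet loss under load"}),
        tests=("inspect interface error counters on both ends",),
    ),
]


def candidates(observed):
    """Return catalog entries sharing at least one symptom with `observed`,
    ordered by the number of matching symptoms (most matches first)."""
    observed = set(observed)
    scored = [(len(p.symptoms & observed), p) for p in CATALOG]
    scored.sort(key=lambda sp: (-sp[0], sp[1].cause))
    return [p for score, p in scored if score > 0]


if __name__ == "__main__":
    for p in candidates({"large transfers stall", "small packets succeed"}):
        print(p.cause, "->", "; ".join(p.tests))
```

Nothing here is intelligent. It only narrows the list of possible causes and names the next test to run, which is the modest kind of tool the paragraph above asks for.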
- SSL is brittle in many ways. Browsers trust a pre-configured list of certificate authorities, whose role is to provide the illusion of security. If any one of them is malign or incompetent, the system is completely broken, as we see with the recent failure of the official Chinese certificate authority.
- IP routing is brittle. Economic pressures have eliminated the "route around failure" property of the IP networks that Karl was building to survive nuclear war. Advertising false routes is a routine trick used by the bad guys to divert traffic for interception.
- Perimeter security as implemented in firewalls is brittle. Once the bad guy is inside there are few limits on what, and how fast, he can do Bad Things.
- The blockchain, and its applications such as Bitcoin, are brittle.
More than two decades ago at Sun I was convinced that making systems ductile (the opposite of brittle) was the hardest and most important problem in system engineering. After working on it in the LOCKSS Program for nearly 17 years I'm still convinced that this is true.
The revolution in progress can generally be described as “disintermediation”. It is the transference of trust, data, and ownership infrastructure from banks and businesses into distributed peer to peer network protocols.
These techno-optimists never seem to ask "what could possibly go wrong"? To quote from this blog post:
A distributed “world wide ledger” is one of several technologies transforming our highly centralized structures. This technology, cryptically named the “block chain” is embodied in several distributed networks such as Bitcoin, Eris Industries DB, and Ethereum.
Through an encrypted world wide ledger built on a block chain, trust in the systems maintained by third party human institutions can be replaced by trust in math. In block chain systems, account identity and transactions are cryptographically verified by network “consensus” rather than by trust in a single third party.
Since then, there has been a flood of proposals to base other P2P storage systems, election voting, even a replacement for the Internet on blockchain technology. Every one of these proposals for using the blockchain as a Solution for Everything I've looked at appears to make three highly questionable assumptions:
There have been times in the past when a single mining pool controlled more than 50% of the mining power, and thus the blockchain. That pool is known to have abused its control of the blockchain.
As I write this, 3 pools control 57% of the mining power. Thus a conspiracy between three parties would control the blockchain.