Tuesday, June 2, 2015

Brittle systems

In my recent rant on the Internet of Things, I linked to Mike O'Dell's excellent post to Dave Farber's IP list, Internet of Obnoxious Things, and suggested you read it. I'm repeating that advice as, below the fold, I start from a different part of Mike's post.

Mike writes:
The problem with pursuing such a goal is that it has led us down a path of "brittle failure" where things work right up until they fail, and then they fail catastrophically. The outcome is forced to be binary.

In most of Computer Science, there have been only relatively modest efforts directed at building systems which fail gracefully, or partially. Certainly some sub-specialties have spent a lot of effort on this notion, but it is not the norm in the education of a journeyman system builder.

If it is the case that we are unlikely to build any large system which is fail-proof, and that certainly seems to be the situation, we need to focus on building systems which can tolerate, isolate, and survive local failures.
My response also made the IP list:
Mike is absolutely right to point out the brittle nature of most current systems. But education isn't going to fix this.
My co-authors and I won Best Paper at SOSP 2003 for showing a system in a restricted application space that, under attack, failed slowly and made "alarming noises". The analogy is with suspension bridges - they use stranded cables for just this reason.

However, the cost differential between stranded and solid cables in a bridge is small. Brittle fault-tolerant systems such as Byzantine Fault Tolerance (BFT) are a lot more expensive than a non-fault-tolerant system that (most of the time) does the same job. Systems such as the one we showed are a lot more expensive than BFT. This is because three essential aspects of a solution - and, I believe, of any solution - are rate limits, excess replication and randomization.

The problem is that vendors of systems are allowed to disclaim liability for their products. Given that even the most egregious failure is unlikely to cause more than reputational harm, why would a vendor even implement BFT, let alone something much more expensive?

Just finding techniques that allow systems to fail gracefully is not going to be enough (not that it is happening). We need techniques that do so with insignificant added cost. That is a truly hard problem. But we also need to change the law so that vendors cannot escape financial liability for the failures of their products. That is an even harder problem.
I should explain the comment about the importance of "rate limits, excess replication and randomization"; a toy sketch combining all three follows the list:
  • Rate Limits: The design goal of almost all systems is to do what the user wants as fast as possible. This means that when the bad guy wrests control of the system from the user, the system will do what the bad guy wants as fast as possible. Doing what the bad guy wants as fast as possible pretty much defines brittleness in a system; failures will be complete and abrupt. In last year's talk at UC Berkeley's Swarm Lab I pointed out that rate limits were essential to LOCKSS, and linked to Paul Vixie's article Rate-Limiting State making the case for rate limits on DNS, NTP and other Internet services. Imposing rate limits on system components makes the overall system more expensive.
  • Excess Replication: The standard fault-tolerance technique, Byzantine Fault Tolerance (BFT), is brittle. As faults in the system increase, it works perfectly until they pass a threshold. After that the system is completely broken. The reason is that BFT defines the minimum number of replicas that can survive a given number of faults; typically 3f+1 replicas are needed to survive f Byzantine faults, and at f+1 faults the guarantees evaporate. In order to achieve this minimum, every replica is involved in every operation of the system. There is no cushion of excess, unnecessary replicas to help the system retain some functionality above the threshold at which it stops behaving perfectly. The LOCKSS system was not concerned with minimizing the number of replicas. It assumed that it had excess replicas, Lots Of Copies, so it could Keep Stuff Safe by failing gradually as faults increased. Adding replicas to the system makes it more expensive.
  • Randomization: In general, the more predictable the behavior of the system the easier it is to attack. Randomizing the system's behavior makes it unpredictable. A significant part of the LOCKSS system's defenses is that since the selection of replicas to take part in each operation is random, the bad guy cannot predict which they are. Adding randomization to the system makes it more expensive (and harder to debug and test).
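To make these three ingredients concrete, here is a toy sketch in Python. It is emphatically not the LOCKSS protocol; all names and parameters are invented for illustration. A random sample of an over-provisioned replica set votes on a content value, and replicas that voted too recently must abstain:

```python
import random
import time

# Toy sketch only - not the LOCKSS protocol. All names and parameters
# are invented for illustration.

MIN_VOTE_INTERVAL = 60.0   # rate limit: seconds a replica must wait between votes
SAMPLE_SIZE = 7            # randomization: voters drawn at random per poll

class Replica:
    def __init__(self, name, content):
        self.name = name
        self.content = content
        self.last_vote = float("-inf")   # time of this replica's last vote

def poll(replicas, now):
    """Ask a random subset of an over-provisioned replica set to vote.

    Rate limits: recently used replicas abstain, bounding how fast any
    party, user or attacker, can drive the system.
    Excess replication: SAMPLE_SIZE is much less than len(replicas), so
    there is a cushion of unused replicas and damage shows up gradually.
    Randomization: the attacker cannot predict which replicas will vote.
    """
    eligible = [r for r in replicas if now - r.last_vote >= MIN_VOTE_INTERVAL]
    voters = random.sample(eligible, min(SAMPLE_SIZE, len(eligible)))
    tally = {}
    for r in voters:
        r.last_vote = now
        tally[r.content] = tally.get(r.content, 0) + 1
    return tally

if __name__ == "__main__":
    random.seed(1)
    # 20 replicas, 3 of which an attacker has silently corrupted.
    replicas = [Replica("r%d" % i, "good") for i in range(17)]
    replicas += [Replica("bad%d" % i, "evil") for i in range(3)]
    print(poll(replicas, time.time()))   # e.g. a lopsided but imperfect tally
```

Because each poll samples a small random subset of a deliberately oversized replica population, corrupting a few replicas skews the tallies gradually and visibly - the "alarming noises" - instead of flipping the system from perfect to broken at a threshold.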
Debugging and testing were key to Karl Auerbach's contribution to the IP list discussion (reproduced in full by permission):
One of the motivations for packet switching and the ARPAnet was the ability to continue communications even during/after a nuclear holocaust. (Yes, I know that some people claim that that was not the purpose - but I was there, at SDC, from 1972 building ARPAnet-like networks with that specific purpose.)

In recent years, or decades, we seem to be moving towards network architectures that are more brittle.

For example, there is a lot of discussion about "Software Defined Networks" and Openflow - which to my mind is ATM re-invented. Every time I look at it I think to myself "this design invites brittle failures."

My personal concern is slightly different. I come from a family of repairmen - radio and then TV - so when I look at something I wonder "how can it break?" and "how can it be repaired?".

We've engineered the internet so that it is not easy to diagnose problems. Unlike Ma Bell we have not learned to make remote loopbacks a mandatory part of many parts of the system. Thus we often have a flat, one sided view of what is happening. And if we need the view from the other end we often have to ask assistance of non-technical people who lack proper tools or knowledge how to use them.

As a first step we ought to be engineering more test points and remote loopback facilities into internet protocols and devices.

And a second step ought to be the creation of a database of network pathology. With that we can begin to create tools that help us reason backwards from symptoms towards causes. I'm not talking artificial intelligence or even highly expert systems. Rather this would be something that would help us look at symptoms, understand possible causes, and know what tests we need to run to begin to evaluate which of the possible causes are candidates and which are not.
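As a concrete illustration of the loopback idea, here is a minimal sketch assuming a plain UDP echo responder on an arbitrary unprivileged port; it is an illustration, not a proposal for any specific protocol:

```python
import socket

LOOPBACK_PORT = 7777   # hypothetical test-point port, chosen for illustration

def loopback_responder():
    """Runs on the remote device: echo every datagram back to its sender,
    giving the operator at the far end a two-sided view of the path."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", LOOPBACK_PORT))
    while True:
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)   # the loopback: reflect the probe unchanged

def probe(host, payload=b"probe", timeout=2.0):
    """Runs at the operator's end: True iff the far end echoed our probe,
    with no need for a human at the remote site to run anything."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.sendto(payload, (host, LOOPBACK_PORT))
    try:
        data, _ = sock.recvfrom(2048)
        return data == payload
    except socket.timeout:
        return False
```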
Examples of brittle systems abound:
  • SSL is brittle in many ways. Browsers trust a pre-configured list of certificate authorities, whose role is to provide the illusion of security. If any one of them is malign or incompetent, the system is completely broken, as we see with the recent failure of the official Chinese certificate authority.
  • IP routing is brittle. Economic pressures have eliminated the "route around failure" property of the IP networks that Karl was building to survive nuclear war. Advertising false routes is a routine trick used by the bad guys to divert traffic for interception.
  • Perimeter security as implemented in firewalls is brittle. Once the bad guy is inside there are few limits on what, and how fast, he can do Bad Things.
  • The blockchain, and its applications such as Bitcoin, are brittle.
The blockchain is brittle because it can be taken over by a conspiracy. As I wrote in another of my contributions to the IP list, responding to and quoting from this piece of techno-optimism:
The revolution in progress can generally be described as “disintermediation”. It is the transference of trust, data, and ownership infrastructure from banks and businesses into distributed peer to peer network protocols.

A distributed “world wide ledger” is one of several technologies transforming our highly centralized structures. This technology, cryptically named the “block chain” is embodied in several distributed networks such as Bitcoin, Eris Industries DB, and Ethereum.

Through an encrypted world wide ledger built on a block chain, trust in the systems maintained by third party human institutions can be replaced by trust in math. In block chain systems, account identity and transactions are cryptographically verified by network “consensus” rather than by trust in a single third party.
These techno-optimists never seem to ask "what could possibly go wrong"? To quote from this blog post:
Since then, there has been a flood of proposals to base other P2P storage systems, election voting, even a replacement for the Internet on blockchain technology. Every one of these proposals for using the blockchain as a Solution for Everything I've looked at appears to make three highly questionable assumptions:
There have been times in the past when a single mining pool controlled more than 50% of the mining power, and thus the blockchain. That pool is known to have abused its control of the blockchain.

As I write this, 3 pools control 57% of the mining power. Thus a conspiracy between three parties would control the blockchain.
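A back-of-the-envelope sketch of the conspiracy threshold makes the point; the pool shares below are invented to match the 57% figure, not live data:

```python
def smallest_majority_coalition(pool_shares):
    """Fewest pools whose combined hash power exceeds 50%: greedily add
    pools from the largest share down until the total passes a majority."""
    coalition, total = [], 0.0
    for name, share in sorted(pool_shares.items(), key=lambda kv: -kv[1]):
        coalition.append(name)
        total += share
        if total > 0.5:
            break
    return coalition, total

# Hypothetical shares matching the text: the top three pools sum to 57%.
shares = {"PoolA": 0.25, "PoolB": 0.18, "PoolC": 0.14, "PoolD": 0.10,
          "PoolE": 0.09, "PoolF": 0.08, "PoolG": 0.07, "PoolH": 0.05,
          "PoolI": 0.04}
print(smallest_majority_coalition(shares))
# (['PoolA', 'PoolB', 'PoolC'], ~0.57): three parties suffice.
```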
More than two decades ago at Sun I was convinced that making systems ductile (the opposite of brittle) was the hardest and most important problem in system engineering. After working on it in the LOCKSS Program for nearly 17 years I'm still convinced that this is true.

9 comments:

  1. In a must-read piece entitled A Machine For Keeping Secrets?, Vinay Gupta of Ethereum, a blockchain startup, shows that inadequate diversity, the bane of distributed systems, is another way in which the blockchain is brittle.

    An attacker with zero-day exploits for each of the three major operating systems on which blockchain software runs could use them to take over the blockchain. There is a market for zero-day exploits, so we know how much it would cost to take over the blockchain. Good operating system zero-days are reputed to sell for $250-500K each, so it would cost about $1.5M to control the Bitcoin blockchain, currently representing nearly $3.3B in capital. That's 220,000% leverage! Goldman Sachs, eat your heart out.
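    A quick check of that arithmetic, using the figures quoted above (a sketch, not a claim about current exploit prices):

    ```python
    cost_per_zero_day = 500000            # upper end of the quoted $250-500K range
    attack_cost = 3 * cost_per_zero_day   # one exploit per major OS: $1.5M
    capital_at_risk = 3.3e9               # Bitcoin capitalization cited above
    print(capital_at_risk / attack_cost)  # ~2200x, i.e. roughly 220,000% leverage
    ```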

  2. Benjamin Lawsky, New York State's successful but alas now retiring financial regulator, understands the brittleness of the technology on which society depends. Interviewed at New York magazine by Chris Smith:

    What’s the greatest risk to the financial system right now? A big jump in interest rates? Some trading technology or speculative instrument we don’t understand?

    The problem is we react and try to fix things going forward based on previous experience. But the next thing is always a little different. The thing that really worries me right now is cybersecurity. We’re seeing more and more of these hacks. It is an incredibly difficult issue to deal with, not just for the financial sector, for our entire society. My hope is that it’s not going to take a really catastrophic cyberattack that causes a financial crisis.

    How likely is that right now?

    Look, we just had one announced yesterday, this Chinese thing, which is millions and millions of people [whose data was stolen], and they’ve hacked the federal government. So the chances are 100 percent that it’s going to keep happening. I’ve had trouble ball-parking the chances of something really systemic happening.

    Above or below 50 percent?

    I think below. But Phil Reitinger, who is on the governor’s cybersecurity advisory board with me, says what he worries about is an attack that shuts down power on the Eastern Seaboard and then a day later shuts down a bank or an exchange, and then everyone gets into a panic.

  3. Lawsky is too optimistic. Trustwave reports that the return on malware investment dwarfs anything except the returns on lobbying the government:

    "Attackers receive an estimated 1,425 percent return on investment for exploit kit and ransomware schemes ($84,100 net revenue for each $5,900 investment)."

  4. "Through an encrypted world wide ledger built on a block chain"

    Hmm, there is no encryption in block chains. Only public-key cryptography (signing) and hash functions. The block chain itself is public and unencrypted.

    "These techno-optimists never seem to ask "what could possibly go wrong"?"

    I don't know which part of the universe you have been living in, but all the bitcoin enthusiasts I know constantly revisit the risks, threats and opportunities of the system. There are blind believers, but those are not usually very technically oriented. For example, many devs in the bitcoin core group tend to criticize some bitcoin properties quite heavily from time to time.

    "Every one of these proposals for using the blockchain as a Solution for Everything I've looked at appears to make three highly questionable assumptions: "

    Nowadays I quite rarely see these assumptions made clearly by anyone even slightly technically competent. In fact, it is widely known that bitcoin doesn't provide anonymity, and that's why there exist tens of privacy-improving services/tools/projects that try to improve the privacy model either for individual users or bitcoin users in general.

    Even if bitcoin sucks and has certain risks, it still could be the "least bad" alternative for some use cases.

  5. What you write applies very much to "security", which for most vendors and buyers is a cost to be minimized, just like resilience.

    In another post you write "in my usual way, ask what could possibly go wrong?" and that's not what businesses and even voters worry about. Such a question probably relates to a "maxmin" strategy, maximizing the minimum win, or minimizing the maximum loss; but most businesses etc. are aiming for "maxmax", that is maximizing the maximum win, and the devil take the hindmost.

    Economists and businessmen have names for the strategy of assuming the best and bailing out if the worst happens, like "picking up pennies in front of steamrollers" and "capital decimation partners". But it is a very profitable strategy for those who are lucky and the "bad outcome" does not happen.

    Accountants call that "underdepreciation of risk", and that can be extremely profitable on paper, but businessmen pay themselves bonuses based on short-term underdepreciated paper profits, not long-term depreciated actual ones. Basically, paying themselves out of gross profits, not net profits. It is an important tool of asset stripping.

    Economists sum it up by saying that many people have very high "discount rates", that is they value very highly present actual income and worry a lot less about future likely losses.

    Especially if they can take the present actual income for themselves and give the future likely losses to someone else.

    http://dilbert.com/strip/2006-07-11
    http://dilbert.com/strip/1994-10-10

  6. As an illustration of how brittle the systems we depend upon are, today United Airlines, the New York Stock Exchange, and the Wall Street Journal were all shut down through system malfunctions. The Dept. of Homeland Security reassures us that this wasn't "malicious activity". It probably wasn't; if it had been, these sites would still be down.

  7. Symantec has been caught issuing "extended validation" certificates for google.com. Aren't you glad that the certificate authorities you trust are so trustworthy?

  8. Dan Goodin at Ars Technica reports that Still fuming over HTTPS mishap, Google makes Symantec an offer it can’t refuse. Google is threatening that if Symantec doesn't shape up Chrome will start flagging their certificates.

    Goodin is wrong to call this a "mishap". It was a complete abdication of everything Symantec is in business to provide to Google and its other customers. I agree with the commenter who wrote:

    "I think Google's actually being too lenient here. Symantec has violated the agreement that allows their root CA certificates to be trusted by Chrome. They have similar agreements with MS and Mozilla. By the letter of these agreements, any of these browsers could legitimately stop trusting the Symantec root CA certificates. By offering a remedy, Google is doing them a favor. Not out of altruism, of course, but because enough sites have Symantec certificates that flagging all of them would seriously inconvenience their users."

    All browser and OS vendors should agree that a single instance of issuing a false EV certificate should result in immediate removal of the CA's root certificate. Yes, customers would be annoyed. The alternative is customers being compromised. Which would you prefer?

    Symantec didn't just fail catastrophically, they proceeded to lie about it:

    "Symantec first said it improperly issued 23 test certificates for domains owned by Google, browser maker Opera, and three other unidentified organizations without the domain owners' knowledge. A few weeks later, after Google disputed the low number, Symantec revised that figure upward, saying it found an additional 164 certificates for 76 domains and 2,458 certificates for domains that had never been registered."

    This is the organization that the security of much of the Web relies on. If it wasn't so serious it'd be a joke.

  9. I should have noticed earlier that Reddit's r/Place experiment shows the importance of rate limits. In Reddit and the Struggle to Detoxify the Internet at the New Yorker Andrew Marantz writes:

    'Last April Fools’, instead of a parody announcement, Reddit unveiled a genuine social experiment. It was called r/Place, and it was a blank square, a thousand pixels by a thousand pixels. In the beginning, all million pixels were white. Once the experiment started, anyone could change a single pixel, anywhere on the grid, to one of sixteen colors. The only restriction was speed: the algorithm allowed each redditor to alter just one pixel every five minutes. “That way, no one person can take over—it’s too slow,” Josh Wardle, the Reddit product manager in charge of Place, explained. “In order to do anything at scale, they’re gonna have to coöperate.”'

    The whole article is fascinating, but the description of the evolution of r/Place is a great example of the benefit of rate limits.
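    A minimal sketch of that per-user cooldown (grid size and interval as described in the article; the code itself is illustrative):

    ```python
    import time

    COOLDOWN = 5 * 60                    # five minutes, in seconds
    WIDTH = HEIGHT = 1000                # the million-pixel canvas
    canvas = [[0] * WIDTH for _ in range(HEIGHT)]   # 16 colors, 0..15
    last_edit = {}                       # redditor id -> time of last placement

    def place_pixel(user, x, y, color, now=None):
        """Apply one pixel edit iff the user's cooldown has expired."""
        now = time.time() if now is None else now
        if now - last_edit.get(user, float("-inf")) < COOLDOWN:
            return False                 # too soon: the rate limit bites
        canvas[y][x] = color
        last_edit[user] = now
        return True                      # one pixel per user per five minutes
    ```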
