Thursday, February 18, 2021

Blast Radius

Last December Simon Sharwood reported on an "Infrastructure Keynote" by Amazon's Peter DeSantis in AWS is fed up with tech that wasn’t built for clouds because it has a big 'blast radius' when things go awry:
Among the nuggets he revealed was that AWS has designed its own uninterruptible power supplies (UPS) and that there’s now one in each of its racks. AWS decided on that approach because the UPS systems it needed were so big they required a dedicated room to handle the sheer quantity of lead-acid batteries required to keep its kit alive. The need to maintain that facility created more risk and made for a larger “blast radius” - the extent of an incident's impact - in the event of failure or disaster.

AWS is all about small blast radii, DeSantis explained, and in the past the company therefore wrote its own UPS firmware for third-party products.

“Software you don’t own in your infrastructure is a risk,” DeSantis said, outlining a scenario in which notifying a vendor of a firmware problem in a device commences a process of attempting to replicate the issue, followed by developing a fix and then deployment.

“It can take a year to fix an issue,” he said. And that’s many months too slow for AWS given a bug can mean downtime for customers.
This is a remarkable argument for infrastructure based on open source software, but that isn't what this post is about. Below the fold is a meditation on the concept of "blast radius", the architectural dilemma it poses, and its relevance to recent outages and compromises.

In 2013 Stanford discovered that an Advanced Persistent Threat (APT) actor had breached their network and compromised the Active Directory server. The assessment was that, as a result, nothing in the network could be trusted. In great secrecy, Stanford built a complete replacement network: switches, routers, servers and all. The secrecy was needed because the attackers were watching; for example, nothing about the new network could be mentioned in e-mail. Eventually there was a "flag day" when everyone was cut over to the new network with new passwords. Then, I believe, all the old network hardware was fed into the trash compactor. The authentication technology for a network, in Stanford's case Active Directory, has a huge "blast radius", which is why it is a favorite target for attackers.

Two recent outages demonstrate the "blast radius" of authentication services:
  • Last September, many of Microsoft's cloud services failed. Thomas Claburn's Microsoft? More like: My software goes off reported:
    Microsoft's online authentication systems are inaccessible for at least some customers today, locking those subscribers out of tons of Redmond-hosted services if they are not already logged in.
    ...
    Beyond Microsoft's public and government cloud wobbles ... the authentication system outage has hit its other online services, including Outlook, Office, Teams, and Microsoft Authenticator. If you're not already logged in, it appears, you may be unable to get in and use the cloud-based applications as a result of the ongoing downtime.
    The next day, Tim Anderson's With so many cloud services dependent on it, Azure Active Directory has become a single point of failure for Microsoft had more details of the failed attempts to recover by backing out a recent change, and the lingering after-effects:
    The core service affected was Azure Active Directory, which controls login to everything from Outlook email to Teams to the Azure portal, used for managing other cloud services. The five-hour impact was also felt in productivity-stopping annoyances like some installations of Microsoft Office and Visual Studio, even on the desktop, declaring that they could not check their licensing and therefore would not run. ... If the problem is authentication, even resilient services with failover to other Azure regions may become inaccessible and therefore useless.

    The company has yet to provide full details, but a status report today said that "a recent configuration change impacted a backend storage layer, which caused latency to authentication requests".
    Part of Microsoft's difficulty in responding was likely that the staff working on the authentication problem themselves had to authenticate to obtain the access required to make the necessary changes.
  • Then, in December, many of Google's cloud services failed. Again, Anderson reported the details in Not just Microsoft: Auth turns out to be a point of failure for Google's cloud, too:
    Google has posted more details about its 50 minute outage yesterday, though promising a "full incident report" to follow. It was authentication that broke, reminiscent of Microsoft's September cloud outage caused by an Azure Active Directory failure.

    In an update to its Cloud Status dashboard, Google said that: "The root cause was an issue in our automated quota management system which reduced capacity for Google's central identity management system, causing it to return errors globally. As a result, we couldn't verify that user requests were authenticated and served errors to our users."

    Not mentioned is the fact that the same dashboard showed all green during at least the first part of the outage. Perhaps it did not attempt to authenticate against the services, which were otherwise running OK. As so often, Twitter proved more reliable for status information.

    Services affected included Cloud Console, Cloud Storage, BigQuery, Google Kubernetes Engine, Gmail, Calendar, Meet, Docs and Drive.

    "Many of our internal users and tools experienced similar errors, which added delays to our outage external communication," the search and advertising giant confessed.
    The last point is important. Even services that didn't require authentication of external requests, such as the font service, failed because they in turn used internal services to which they had to authenticate.
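The cascading failures above can be pictured as a graph problem: any service that transitively depends on the identity service fails with it. The sketch below is a toy illustration (the service names and dependency edges are hypothetical, not any provider's real topology) of computing that blast radius as a transitive closure over inverted dependency edges.

```python
# Toy model: compute which services fail, directly or transitively,
# when one service goes down. All names and edges are hypothetical.
from collections import defaultdict

# "A depends on B" edges
deps = {
    "gmail": ["identity"],
    "docs": ["identity", "storage"],
    "fonts": ["cdn"],        # no external authentication...
    "cdn": ["identity"],     # ...but an internal dependency on identity
    "storage": [],
    "identity": [],
}

def blast_radius(deps, failed):
    """Return the set of services that fail when `failed` is down."""
    # Invert the edges: for each service, who depends on it?
    dependents = defaultdict(set)
    for svc, needs in deps.items():
        for need in needs:
            dependents[need].add(svc)
    # Walk the inverted graph outward from the failed service.
    hit, frontier = {failed}, [failed]
    while frontier:
        svc = frontier.pop()
        for d in dependents[svc]:
            if d not in hit:
                hit.add(d)
                frontier.append(d)
    return hit

# "fonts" never authenticates users, yet it is in the set via "cdn".
print(sorted(blast_radius(deps, "identity")))
```

In this toy topology an "identity" failure takes out every service except "storage", which is the shape of both the Microsoft and Google incidents.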
To be fair, I should also mention that around the same time Amazon's US-EAST-1 region also suffered an outage. The cause had a big "blast radius" but wasn't an authentication failure, so you should follow the link for the fascinating explanation. Returning to authentication, Anderson wrote:
Authentication is a tricky problem for the big cloud platforms. It is critically important and cannot be fudged; security trumps resilience.
The keys to resilient services are replication, independence and (ideally) voting among the replicas so that there are no single points of failure. But the whole point of an authentication service is to centralize knowledge in a single authoritative source. Multiple independent databases of authentication information present the risk of inconsistency and delayed updates.
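The replication-plus-voting pattern can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: a client queries several independent replicas and accepts only a majority answer, so no single replica is a single point of failure. The replica lookup functions and data are invented for the example, and the tension described above is visible in it: the stale replica is exactly the inconsistency risk that makes replicated authentication hard.

```python
# Minimal majority-vote read across independent replicas (illustrative only).
from collections import Counter

def read_with_quorum(replicas, key):
    """Return the value a strict majority of replicas agree on."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica(key))
        except Exception:
            continue  # a failed replica simply loses its vote
    if not answers:
        raise RuntimeError("no replica answered")
    value, votes = Counter(answers).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value

# Three toy replicas, one of which holds stale credentials:
fresh = {"alice": "key-v2"}
stale = {"alice": "key-v1"}
replicas = [fresh.get, fresh.get, stale.get]
print(read_with_quorum(replicas, "alice"))  # "key-v2" wins the vote 2-1
```

The majority vote masks one stale or failed replica, but it only works because a majority happens to hold the current value; an authentication service must instead guarantee that every replica reflects the latest revocation, which is why centralizing in a single authoritative source is so tempting.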

Another service that must be centralized to be effective is network monitoring. The whole point of network monitoring is to provide a single, holistic view of network activity. Thus it is that the recent SolarWinds and Centreon compromises targeted network monitoring systems.

This all suggests that critical systems that can perform their function without being on-line, such as release build environments, should be air-gapped, or at least be disconnected from the general authentication system. Of course, most systems cannot perform their functions without being on-line and authenticated. For these systems, authentication and network monitoring systems in particular, networks need contingency plans for when, not if, they are compromised.

4 comments:

David. said...

Google agrees with Amazon:

"Lorenc explains some of the steps Google takes to ensure the security of open-source code that it uses internally, including Linux. "One of the things that we try to do for any open source that we use, and something we recommend anybody uses, is being able to build it yourself. It is not always easy or trivial to build, but knowing that you can is half the battle, in case you ever need to.

We require that all open source we use is built by us, from our internal repositories, just to prove that we can, if we ever need to make a patch, and so that we have better provenance, knowing where it is coming from."

David. said...

Brian Krebs' At Least 30,000 U.S. Organizations Newly Hacked Via Holes in Microsoft’s Email Software reports on a cybersecurity failure rivalling SolarWinds. He writes:

"At least 30,000 organizations across the United States — including a significant number of small businesses, towns, cities and local governments — have over the past few days been hacked by an unusually aggressive Chinese cyber espionage unit that’s focused on stealing email from victim organizations, multiple sources tell KrebsOnSecurity. The espionage group is exploiting four newly-discovered flaws in Microsoft Exchange Server email software, and has seeded hundreds of thousands of victim organizations worldwide with tools that give the attackers total, remote control over affected systems."

The problem is that most of these organizations aren't capable of the kind of response Stanford mounted to be sure it had excluded the attackers. Volexity President Steven Adair said:

"he’s fielded dozens of calls today from state and local government agencies that have identified the backdoors in their Exchange servers and are pleading for help. The trouble is, patching the flaws only blocks the four different ways the hackers are using to get in. But it does nothing to undo the damage that may already have been done.
...
Another government cybersecurity expert who participated in a recent call with multiple stakeholders impacted by this hacking spree worries the cleanup effort required is going to be Herculean.

“On the call, many questions were from school districts or local governments that all need help,” the source said, speaking on condition they were not identified by name. “If these numbers are in the tens of thousands, how does incident response get done? There are just not enough incident response teams out there to do that quickly.”

David. said...

Nicholas Weaver's The Microsoft Exchange Hack and the Great Email Robbery is a must-read warning about the consequences to come:

"I would expect these exploits to be in criminal toolkits shortly and that the world is, at most, days away from ransomware gangs mass-exploiting Exchange servers, encrypting the contents, and offering the victims a choice: pay up, or your emails will be published for everyone else and deleted from your own servers."

David. said...

Here comes the ransomware tsunami. Dan Goodin reports that 7,000 Exchange servers first compromised by Chinese hackers hit with ransomware:

"Security firm Kryptos Logic said Friday afternoon that it has detected close to 7,000 compromised Exchange servers that are being infected with ransomware. Kryptos Logic security researcher Marcus Hutchins told Ars that the ransomware is DearCry.
...
Little is known about DearCry. Security firm Sophos said that it’s based on a public-key cryptosystem, with the public key embedded in the file that installs the ransomware. That allows files to be encrypted without the need to first connect to a command-and-control server. To decrypt the data, victims must obtain the private key that’s known only to the attackers."