Among the nuggets he revealed was that AWS has designed its own uninterruptible power supplies (UPS) and that there’s now one in each of its racks. AWS decided on that approach because the UPS systems it needed were so big they required a dedicated room to handle the sheer quantity of lead-acid batteries required to keep its kit alive. The need to maintain that facility created more risk and made for a larger “blast radius” - the extent of an incident's impact - in the event of failure or disaster.

This is a remarkable argument for infrastructure based on open source software, but that isn't what this post is about. Below the fold is a meditation on the concept of "blast radius", the architectural dilemma it poses, and its relevance to recent outages and compromises.
AWS is all about small blast radii, DeSantis explained, and in the past the company therefore wrote its own UPS firmware for third-party products.
“Software you don’t own in your infrastructure is a risk,” DeSantis said, outlining a scenario in which notifying a vendor of a firmware problem in a device commences a process of attempting to replicate the issue, followed by developing a fix and then deployment.
“It can take a year to fix an issue,” he said. And that’s many months too slow for AWS given a bug can mean downtime for customers.
In 2013 Stanford discovered that an Advanced Persistent Threat (APT) actor had breached its network and compromised the Active Directory server. The assessment was that, as a result, nothing in the network could be trusted. In great secrecy, Stanford built a complete replacement network: switches, routers, servers and all. The secrecy was needed because the attackers were watching; for example, nothing about the new network could be mentioned in e-mail. Eventually there was a "flag day" when everyone was cut over to the new network with new passwords. Then, I believe, all the old network hardware was fed into the trash compactor. The authentication technology for a network, in Stanford's case Active Directory, has a huge "blast radius", which is why it is a favorite target for attackers.
Two recent outages demonstrate the "blast radius" of authentication services:
- Last September, many of Microsoft's cloud services failed. Thomas Claburn's Microsoft? More like: My software goes off reported:
Microsoft's online authentication systems are inaccessible for at least some customers today, locking those subscribers out of tons of Redmond-hosted services if they are not already logged in.

The next day, Tim Anderson's With so many cloud services dependent on it, Azure Active Directory has become a single point of failure for Microsoft had more details of the failed attempts to recover by backing out a recent change, and the lingering after-effects:
Beyond Microsoft's public and government cloud wobbles ... the authentication system outage has hit its other online services, including Outlook, Office, Teams, and Microsoft Authenticator. If you're not already logged in, it appears, you may be unable to get in and use the cloud-based applications as a result of the ongoing downtime.
The core service affected was Azure Active Directory, which controls login to everything from Outlook email to Teams to the Azure portal, used for managing other cloud services. The five-hour impact was also felt in productivity-stopping annoyances like some installations of Microsoft Office and Visual Studio, even on the desktop, declaring that they could not check their licensing and therefore would not run. ... If the problem is authentication, even resilient services with failover to other Azure regions may become inaccessible and therefore useless.

Part of Microsoft's difficulty in responding was likely that the staff fixing the authentication problem themselves had to authenticate to obtain the access needed to make the necessary changes.
The company has yet to provide full details, but a status report today said that "a recent configuration change impacted a backend storage layer, which caused latency to authentication requests".
- Then, in December, many of Google's cloud services failed. Again, Anderson reported the details in Not just Microsoft: Auth turns out to be a point of failure for Google's cloud, too:
Google has posted more details about its 50-minute outage yesterday, though promising a "full incident report" to follow. It was authentication that broke, reminiscent of Microsoft's September cloud outage caused by an Azure Active Directory failure.

The last point is important. Even services that didn't require authentication of external requests, such as the font service, failed because they in turn used internal services to which they had to authenticate; the sketch after this list makes the pattern concrete.
In an update to its Cloud Status dashboard, Google said that: "The root cause was an issue in our automated quota management system which reduced capacity for Google's central identity management system, causing it to return errors globally. As a result, we couldn't verify that user requests were authenticated and served errors to our users."
Not mentioned is the fact that the same dashboard showed all green during at least the first part of the outage. Perhaps it did not attempt to authenticate against the services, which were otherwise running OK. As so often, Twitter proved more reliable for status information.
Services affected included Cloud Console, Cloud Storage, BigQuery, Google Kubernetes Engine, Gmail, Calendar, Meet, Docs and Drive.
"Many of our internal users and tools experienced similar errors, which added delays to our outage external communication," the search and advertising giant confessed.
Authentication is a tricky problem for the big cloud platforms. It is critically important and cannot be fudged; security trumps resilience.

The keys to resilient services are replication, independence and (ideally) voting among the replicas, so that there are no single points of failure. But the whole point of an authentication service is to centralize knowledge in a single authoritative source. Multiple independent databases of authentication information present the risk of inconsistency and delayed updates.
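A minimal sketch of that replication-with-voting pattern, assuming hypothetical replica objects, shows both why it eliminates single points of failure and why it sits uneasily with authentication:

```python
# Minimal sketch of replication with voting: no single replica is a
# point of failure, because the majority answer wins. All names are
# hypothetical.
from collections import Counter

def quorum_lookup(replicas, key):
    """Ask every reachable replica for `key`; return the majority answer."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica.get(key))
        except ConnectionError:
            continue  # an unreachable replica simply loses its vote
    if not answers:
        raise RuntimeError("no replica reachable")
    value, votes = Counter(answers).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value

# The problem for authentication: just after a credential is revoked
# or a password changed, stale replicas can out-vote the one replica
# holding the correct, most recent answer.
stale_majority = [{"alice": "old-hash"}, {"alice": "old-hash"}, {"alice": "new-hash"}]
print(quorum_lookup(stale_majority, "alice"))  # -> "old-hash", the stale value
```

The final line is exactly the inconsistency and delayed update the centralized design exists to avoid.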
Another service that must be centralized to be effective is network monitoring. The whole point of network monitoring is to provide a single, holistic view of network activity. Thus it is that the recent SolarWinds and Centreon compromises targeted network monitoring systems.
This all suggests that critical systems such as release build environments that can perform their function without being on-line should be air-gapped, or at least be disconnected from the general authentication system. Of course, most systems cannot perform their functions without being on-line and authenticated. For these systems, authentication and network monitoring systems in particular, networks need contingency plans for when, not if, they are compromised.
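What might such a contingency plan look like in practice? One possible shape, sketched below in Python with invented file paths, parameters and function names, is a "break-glass" login path whose credentials are provisioned offline and verified locally, so responders can still get in when the central authentication service is down or can no longer be trusted:

```python
# A sketch of one contingency: offline-provisioned "break-glass"
# credentials verified locally, independent of the central identity
# service. The file path, iteration count and function names are
# all invented for illustration.
import hashlib
import hmac
import json

BREAK_GLASS_FILE = "/etc/break-glass.json"  # written during offline provisioning

def central_identity_check(user, password):
    # Stand-in for the normal, centralized path; here we simulate the
    # outage the text says networks must plan for.
    raise ConnectionError("identity service unreachable")

def verify_break_glass(user, password):
    # Verify against local records: {"user": {"salt": hex, "hash": hex}}
    with open(BREAK_GLASS_FILE) as f:
        records = json.load(f)
    rec = records.get(user)
    if rec is None:
        return False
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                 bytes.fromhex(rec["salt"]), 600_000)
    return hmac.compare_digest(digest.hex(), rec["hash"])

def authenticate(user, password):
    try:
        return central_identity_check(user, password)
    except ConnectionError:
        # Contingency path: central system unreachable. Fall back to
        # locally-verified credentials, and alert loudly out-of-band.
        return verify_break_glass(user, password)
```

The essential property is that the fallback shares nothing with the system it substitutes for: crashing or compromising the central service leaves the break-glass path intact, at the cost of having to protect and rotate the offline records themselves.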