Wednesday, May 11, 2011

Amazon's Outage

I've been looking at the problems of specifying, measuring and auditing (PDF) the reliability of storage technologies since 2006. When I heard that Amazon's recent outage had lost customers' data I hoped that I could use this example as an Awful Warning that my Cassandra-like prophecies were coming true.

After some research, I can't claim that this is an example of the doom that awaits unsuspecting customers. But the outage and data loss do illustrate a number of interesting aspects of cloud storage. Details below the fold.

The service most people think of when they hear the words "cloud storage" is Amazon's Simple Storage Service (S3). This is "designed to provide 99.999999999% durability of objects over a given year" and to "sustain the concurrent loss of data in two facilities". Amazon also provides Reduced Redundancy Storage (RRS), a version of S3 that is "Designed to provide 99.99% durability and 99.99% availability of objects over a given year." Had this outage affected S3 or even RRS, it would have indicated that these services were not achieving their design levels of reliability. But as far as I can tell they were unaffected.
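
To put these design targets in perspective, here is a minimal sketch of the arithmetic. It assumes, as a simplification that Amazon does not state, that object losses are independent and that "N nines" of durability means an annual per-object loss probability of 10^-N. The function names and the 10,000-object example are illustrative only.

```python
# Illustrative arithmetic only: assumes independent, per-object annual losses.
S3_NINES = 11   # "99.999999999% durability"
RRS_NINES = 4   # "99.99% durability"


def annual_loss_probability(nines: int) -> float:
    """Probability that a single object is lost in a year, given N nines."""
    return 10.0 ** -nines


def expected_annual_losses(num_objects: int, nines: int) -> float:
    """Expected number of objects lost per year out of num_objects."""
    return num_objects * annual_loss_probability(nines)


def mean_years_between_losses(num_objects: int, nines: int) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_losses(num_objects, nines)


for name, nines in (("S3", S3_NINES), ("RRS", RRS_NINES)):
    losses = expected_annual_losses(10_000, nines)
    years = mean_years_between_losses(10_000, nines)
    print(f"{name}: {losses:.3g} objects/year, one loss every {years:.3g} years")
```

Under these assumptions, 10,000 objects stored in S3 would suffer an expected loss about once every ten million years, while the same objects in RRS would lose about one object per year on average.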

The service that was affected was Elastic Block Store (EBS), a quite different storage service. S3 stores buckets containing digital objects, which are accessed via a Web API and replicated across multiple Availability Zones. EBS stores volumes containing disk blocks, which are accessed by mounting them as devices on virtual machines running in Amazon's Elastic Compute Cloud (EC2). Virtual machines in EC2 come and go; EBS provides storage that persists while they do. EBS is intended to be quite reliable, but blocks are replicated only within a single Availability Zone in a single Region (in this case the US East Region). EBS volumes are thus far less reliable than S3, as Amazon explains:

Amazon EBS volumes are designed to be highly available and reliable. Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.

Because Amazon EBS servers are replicated within a single Availability Zone, mirroring data across multiple Amazon EBS volumes in the same Availability Zone will not significantly improve volume durability. However, for those interested in even more durability, Amazon EBS provides the ability to create point-in-time consistent snapshots of your volumes that are then stored in Amazon S3, and automatically replicated across multiple Availability Zones. So, taking frequent snapshots of your volume is a convenient and cost effective way to increase the long term durability of your data. In the unlikely event that your Amazon EBS volume does fail, all snapshots of that volume will remain intact, and will allow you to recreate your volume from the last snapshot point.

Thus customers who actually lost data were not using EBS as it is intended to be used: as primary storage backed up by snapshots in S3. Those, such as Henry Blodget, who complain about Amazon's backups of data appear to be pointing the finger of blame in the wrong direction.
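
The failure rates Amazon quotes can be compared directly. The sketch below uses only the figures in the quoted passage (an AFR of 0.1% to 0.5% for EBS volumes, around 4% for commodity disks); the function names and the five-year horizon are my own illustrative choices, and the survival calculation assumes a constant, independent failure rate each year.

```python
# Figures taken from Amazon's quoted description of EBS.
COMMODITY_DISK_AFR = 0.04
EBS_AFR_LOW = 0.001
EBS_AFR_HIGH = 0.005


def reliability_ratio(baseline_afr: float, afr: float) -> float:
    """How many times less likely a failure is, relative to the baseline."""
    return baseline_afr / afr


def survival_probability(afr: float, years: int) -> float:
    """Chance of no failure over the given years, assuming a constant AFR."""
    return (1.0 - afr) ** years


worst = reliability_ratio(COMMODITY_DISK_AFR, EBS_AFR_HIGH)
best = reliability_ratio(COMMODITY_DISK_AFR, EBS_AFR_LOW)
print(f"EBS vs commodity disk: {worst:.0f}x to {best:.0f}x fewer failures")

for label, afr in (("disk", COMMODITY_DISK_AFR), ("EBS worst case", EBS_AFR_HIGH)):
    print(f"{label}: {survival_probability(afr, 5):.3f} chance of surviving 5 years")
```

The ratio works out to between roughly 8 and 40, so Amazon's "10 times more reliable" sits near the conservative end of its own range. Even so, an unsnapshotted volume is still orders of magnitude more likely to be lost than an object in S3.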

After they had restored service, Amazon published a commendably detailed account of the outage, which is revealing. The outage was triggered by an operator error in one of Amazon's Availability Zones; I have been stressing the importance of operator error in our model of threats to digital preservation for a long time.

The initial trigger, which caused a large number of nodes in the affected zone to lose connectivity with each other, was quickly repaired, re-establishing connectivity among the nodes. However, the nodes had noticed the loss of connectivity and assumed (wrongly) that the nodes they could no longer contact had failed. Each node needs other nodes in the same zone to which it can replicate blocks of data in order to meet the reliability targets Amazon sets. When a node notices that another node to which it is replicating blocks has failed, its top priority is to find another node which has free space and "re-mirror" the no longer replicated blocks to it. Normally, only a small proportion of the nodes in a zone have failed, so there is only a small amount of load on the network and nodes from this re-mirroring process, and only a small increase in demand for storage. But in this case a large proportion of the nodes were seeking storage, which rapidly became exhausted, leaving many nodes hammering the network and each other looking fruitlessly for free space. The load generated by this re-mirroring storm triggered bugs in the node software, causing nodes to crash and re-start, which in turn caused more need for re-mirroring.
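
Why a re-mirroring storm exhausts free space can be seen in a deliberately crude toy model. This is not Amazon's algorithm: it pools all free space in a zone, assumes every node is equally full, and assumes each node seeking a new replica needs room for all of its used blocks. The function name and all the numbers are hypothetical.

```python
def unmet_remirror_demand(
    num_nodes: int,
    capacity_per_node: int,
    used_per_node: int,
    fraction_seeking: float,
) -> float:
    """Return the re-mirroring demand (in blocks) that cannot find free space.

    Toy aggregate model: all free space in the zone is pooled, and each node
    that is re-mirroring needs space for all of its used blocks.
    """
    free_space = num_nodes * (capacity_per_node - used_per_node)
    demand = fraction_seeking * num_nodes * used_per_node
    return max(0.0, demand - free_space)


# Hypothetical zone: 1,000 nodes of 100 blocks each, every node 80% full.
for fraction in (0.01, 0.10, 0.50):
    unmet = unmet_remirror_demand(1000, 100, 80, fraction)
    print(f"{fraction:.0%} of nodes re-mirroring: {unmet:,.0f} blocks with nowhere to go")
```

In this toy zone the free space absorbs the demand easily when 1% or 10% of nodes re-mirror, but once more than a quarter of the nodes do so at once there is nowhere for the blocks to go. That is the regime in which nodes keep searching fruitlessly and load the network.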

Although these cascading problems were, as designed, confined to a single Availability Zone, they increased the load on the EBS "control plane", which provides management and coordination services to the multiple Availability Zones in a Region. The added load swamped the control plane, and degraded service to all the Availability Zones in the Region. In particular, a bug in the control plane was triggered that caused high error rates when new virtual machines using EBS were being created across the entire Region.

After about 12 hours Amazon succeeded in isolating the trouble from the control plane and preventing the affected nodes from futilely searching for free space, stabilizing the situation and restoring normal service outside the affected zone. Resolving the problems within the affected zone was a complex and time-consuming process involving trucking large amounts of spare storage to the affected data center, figuring out how to tell the affected nodes, which had been prevented from looking for free storage, that free storage was now available, and finally re-enabling communication with the control plane. Then, the small proportion of the nodes which lost un-replicated blocks were dealt with, mostly by recovering from S3 snapshots taken by Amazon early in the outage. Ultimately, 0.07% of the volumes could not be recovered.

What can we learn from this?
  • Operator error is a major threat to stored data.
  • Cloud storage services are not all the same. The differences, in this case between EBS and S3, are important. Using the services in the way they were designed to be used is essential; customers who were frequently creating snapshots of their EBS volumes in S3 were at very low risk of significant data loss.
  • Replicated services that appear uniform necessarily have "control planes" that can propagate failures between apparently isolated replicas.
  • These complex connections between systems intended to be independent create correlated failures, making estimates such as Amazon's 11 nines of durability suspect.
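
The effect of correlation on durability estimates can be illustrated with a back-of-the-envelope calculation. If each of r replicas is lost independently with probability p, the chance of losing all of them is p^r. If in addition some common-mode event (an operator error, a control plane bug) destroys all replicas at once with probability c, the chance becomes c + (1 - c) * p^r. This is a sketch of the general argument, not Amazon's durability model, and the values of p, r and c below are invented for illustration.

```python
def independent_loss(p_replica: float, replicas: int) -> float:
    """Probability that every replica is lost, assuming independent failures."""
    return p_replica ** replicas


def correlated_loss(p_replica: float, replicas: int, p_common: float) -> float:
    """Probability of total loss when a common-mode event can hit all replicas."""
    return p_common + (1.0 - p_common) * independent_loss(p_replica, replicas)


# Hypothetical values: 1% annual loss per replica, 3 replicas,
# and a 1-in-10,000 annual chance of a common-mode failure.
p, r, c = 0.01, 3, 1e-4
print(f"independent estimate: {independent_loss(p, r):.3g}")
print(f"with common mode:     {correlated_loss(p, r, c):.3g}")
```

With these made-up numbers the common-mode term dominates: the realistic loss probability is about a hundred times the independent estimate, and adding more replicas barely helps because they all share the same common-mode exposure.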

I'll close with a clip from a fascinating and accessible analysis of the stock market's "flash crash" on 6 May 2010 by Donald MacKenzie in the London Review of Books (link added):

Systems that are both tightly coupled and highly complex, Perrow argues in Normal Accidents (1984), are inherently dangerous. Crudely put, high complexity in a system means that if something goes wrong it takes time to work out what has happened and to act appropriately. Tight coupling means that one doesn’t have that time. Moreover, he suggests, a tightly coupled system needs centralised management, but a highly complex system can’t be managed effectively in a centralised way because we simply don’t understand it well enough; therefore its organisation must be decentralised. Systems that combine tight coupling with high complexity are an organisational contradiction, Perrow argues: they are ‘a kind of Pushmepullyou out of the Doctor Dolittle stories (a beast with heads at both ends that wanted to go in both directions at once)’.

This is relevant because the "flash crash" appears to be a case where a cascading series of failures in a complex system was broken by a programmed pause, of only five seconds, triggered when the Chicago Globex's systems detected the cascade.


LagPoker said...

A lot of issues of this class are described in detail in "Normal Accidents" which was written about other complex systems. It was written 20 years ago, and is completely applicable. As they say NNUTS.

David. said...

Following Amazon's excellent example, Microsoft has released an analysis of Azure's September 4th outage, which Richard Speed reports on for The Register. It is fascinating reading, describing a sequence of correlated failures starting with:

"A lightning strike at 0842 UTC caused one data center to switch to generator power and also overloaded suppressors on the mechanical cooling system, shutting it down."