In English this time: How Amazon let its cloud crash and why it should have known better

When you screw up, don't just admit the problem; fix it and let us know if we can rely on you next time.

On Friday Amazon posted an apology to its EC2 customers and a more detailed explanation of what caused three days of glitches and downtime that knocked out sites including Quora, Reddit, Foursquare and HootSuite.

Though it posted updates throughout the three-day troubleshooting period, and admitted afterward it might never recover some customers' data, this is the most comprehensive explanation Amazon has provided for the outages.

The short version is that the enemy of Good was Better.

Amazon breaks its EC2 infrastructure into segments it refers to as Availability Zones. Part of the data network within each Availability Zone is dedicated to traffic among the storage volumes that make up the Elastic Block Store (EBS) virtual storage network.

EBS storage volumes talk among themselves a lot. Some of that traffic is pure overhead: identifying which volumes are present, the routes among them, and replication schedules. The rest is actual data transfer, as EBS volumes replicate their contents to other volumes, both as backup and to create alternative sources for that data in case the principal copy fails.

As that traffic increases, Amazon has to increase the bandwidth available so it doesn't take all day for one volume to tell another all the changes it has seen lately.

There are actually two EBS networks – a primary running on top-end routers that carries the replication data and does most of the work, and a secondary that carries housekeeping information and backs up the primary network.

The primary and secondary each have their own backup routers – so if the primary network's router failed, traffic would fail over to the backup primary router. The secondary network is there if it's needed, but runs on much slower routers.

Just after midnight Pacific time on April 21, during a change meant to upgrade the primary network's capacity, Amazon techs routed that heavy stream of replication traffic not onto the backup primary router, but onto the low-bandwidth secondary network.

The backup primary router, full of bandwidth and ready to work, was left out of the loop instead.

The secondary network, whose narrow pipes could handle the full-force failover from the primary network only in the case of real disaster – and with warning to the rest of the network to not overload it – was swamped almost immediately.
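To put that in programmer's terms: anything that shifts traffic should check whether the destination can actually carry the load before it flips the switch. Here is a minimal, purely illustrative Python sketch of the guard that was effectively missing; the class, function names and numbers are mine, not Amazon's.

```python
# Illustrative only: hypothetical names and made-up numbers, not Amazon's tooling.

class NetworkPath:
    def __init__(self, name, capacity_gbps):
        self.name = name
        self.capacity_gbps = capacity_gbps

def shift_traffic(load_gbps, target):
    """Refuse to move traffic onto a path that cannot absorb it."""
    if load_gbps > target.capacity_gbps:
        raise RuntimeError(
            f"{target.name} cannot absorb {load_gbps} Gbps "
            f"(capacity: {target.capacity_gbps} Gbps)"
        )
    print(f"Shifting {load_gbps} Gbps onto {target.name}")

backup_primary = NetworkPath("backup primary router", capacity_gbps=100)
secondary = NetworkPath("low-bandwidth secondary network", capacity_gbps=10)

replication_load = 80  # made-up figure for EBS replication traffic

shift_traffic(replication_load, backup_primary)  # the intended destination

try:
    shift_traffic(replication_load, secondary)   # the mistake actually made
except RuntimeError as err:
    print("Blocked:", err)
```

A check that simple would have turned a network-wide outage into a rejected change request.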

Data flowing among replicas, overhead traffic identifying which volumes were online and where, and router and network ID information were all cut off.

To each storage volume in the affected part of the network, it was as if the rest of the world had disappeared all at once, leaving it isolated, with no replicas around to make it feel secure and no other nodes on which to build new replicas.

Each began flooding what negligible bit of network was available with one request after another for replication and re-mirroring, almost none of which got through the overburdened network.

Before techs could correct the network-configuration error, or do anything about the saturated network, 13 percent of the storage volumes caught in what Amazon called a "re-mirroring storm" had entered a semi-permanent coma Amazon calls a "stuck" state, waiting for responses to their requests but no longer able to receive any.

That didn't slow down requests for data or storage space from applications or storage volumes elsewhere in the network, however.

With nothing responding to them – and with APIs built with long time-outs, so a little extra latency wouldn't confuse applications written by customers less familiar with EBS than Amazon's own techs – requests started to pile up across the controllers that direct traffic and responses across EBS.

Fairly quickly, the control plane became swamped by backed-up requests and started to "brown out" as well.

At 2:40 a.m. PDT, techs cut off the portion of the storage network that was affected to ease the backlog. By 2:50 the latency and errors from requests to create new data volumes had cleared.

The problem had not.

Servers within the affected zone kept aggressively sending out requests, and waiting too long before writing off a request as hopeless – essentially drowning themselves in their own calls for help.
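The textbook fix for that failure mode is exponential backoff: retry, but wait longer each time, add some randomness so neighbors don't all retry in lockstep, and eventually give up. Here's a minimal Python sketch of the idea; the function names are made up for illustration, not anything from EBS itself.

```python
import random
import time

def request_remirror(send, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry a re-mirroring request with exponential backoff and jitter,
    instead of hammering a saturated network with back-to-back retries."""
    for attempt in range(max_attempts):
        if send():  # send() stands in for the real network call
            return True
        # Wait base * 2^attempt seconds, capped, with jitter so peers
        # that failed at the same moment don't retry at the same moment.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return False  # finally write the request off as hopeless

if __name__ == "__main__":
    # Simulate a network so overloaded that 90% of requests fail.
    flaky_send = lambda: random.random() < 0.1
    print("re-mirror succeeded:", request_remirror(flaky_send))
```

The point isn't the exact numbers; it's that each failed request makes the node quieter, not louder.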

Servers and storage volumes outside the affected area didn't know it was down, or that part of the storage network was degraded, so they kept sending out requests as usual.

By 8:20 that morning the EBS control plane began to brown out again, swamped by requests from distant servers just as it had been by those nearby.

Amazon techs shut off the ability of the rest of the cloud to pass requests into the affected area.

The Relational Database Service (RDS), the database function available to any application in the cloud, also started backing up and throwing errors as various instances of it realized they couldn't get to their data.

Input/output requests from databases are even more sensitive to delays than replication requests between data volumes.

Amazon wouldn't say how many of the database instances trying to talk to storage in the affected area also became "stuck," but estimated "only" 41 percent of all the database instances in the affected area were still stuck 24 hours after the initial error; 14.6 percent were still stuck after 48 hours. The rest recovered over the weekend.

The backlog had caused enough havoc that customers whose segments of the cloud had nothing to do with the affected area, and whose data requests came nowhere near it, were seeing performance bog down and errors increase as the delay between an application's request and the response from storage grew longer.

Those performance problems continued throughout the day. Within the affected zone, most of the big problems were fixed by noon Pacific time.

It took a lot longer to pull the "stuck" data volumes out of their funk and get them working again, however.

By lunchtime on the 22nd all but 2.2 percent of the stuck volumes had been unstuck, though Amazon techs had to throttle the response of each as it burst out of stasis and tried to replicate promiscuously across all the bandwidth suddenly available to it.

By 12:30 the afternoon of the 24th – three-and-a-half days after the initial problem – the response crew decided it had gotten everything unstuck that was going to be.

1.04 percent of the volumes that had been affected refused to come back; Amazon started restoring them using snapshot backups.

It brought all but 0.7 percent back to life, though some of the snapshots might have been minutes or hours old when the whole works froze – meaning anything not backed up by that time was lost during the recovery.

Amazon plans to keep similar disasters from happening by letting customers run applications in more than one Availability Zone – giving them the opportunity to build their own failover instances for critical data, or at least a second point from which to operate, even without access to all their data.

It will also make multi-Zone deployments simpler to design and operate, it said. Right now they are complicated enough that even customers who want the extra security avoid them, the report found.
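For customers who don't want to wait, spreading an application across zones is already something you can script against the public EC2 API. A rough sketch using boto3 – the region, AMI ID and instance type are placeholders, not values from Amazon's report:

```python
import boto3

# Placeholders: region, AMI and instance type are illustrative assumptions.
ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_in_zone(zone):
    """Launch one instance pinned to a specific Availability Zone."""
    resp = ec2.run_instances(
        ImageId="ami-12345678",          # placeholder AMI
        InstanceType="m1.small",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    return resp["Instances"][0]["InstanceId"]

# One instance in each of two zones, so losing a zone doesn't mean losing the app.
primary = launch_in_zone("us-east-1a")
standby = launch_in_zone("us-east-1b")
print("running in two zones:", primary, standby)
```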

It also plans to improve its ability to see into each Zone, so it can tell when and how problems are developing, and to expand its ability to recover data or apps by adding features such as the ability to recover a "stuck" volume by taking a snapshot of it, turning the snapshot into a live volume, and deleting the stuck version.
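That snapshot-and-swap sequence can already be approximated with the public EBS API calls. A rough boto3 sketch of the steps Amazon describes – the volume ID, region and zone are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def replace_stuck_volume(stuck_volume_id, availability_zone):
    """Snapshot a wedged volume, build a fresh volume from the snapshot,
    then delete the original. IDs and zone are placeholders."""
    snap = ec2.create_snapshot(
        VolumeId=stuck_volume_id,
        Description="rescue copy of stuck volume",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    new_vol = ec2.create_volume(
        SnapshotId=snap["SnapshotId"],
        AvailabilityZone=availability_zone,
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_vol["VolumeId"]])

    # Only after the replacement is live does the stuck original get deleted.
    ec2.delete_volume(VolumeId=stuck_volume_id)
    return new_vol["VolumeId"]

# Usage (placeholder values):
# replace_stuck_volume("vol-0123456789abcdef0", "us-east-1a")
```

The catch, as Amazon's own recovery showed, is that a snapshot of a stuck volume is only as fresh as the last data the volume managed to write.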

It also promises better communication with customers and a 10-day credit to customers whose volumes of data, application instances or databases were unavailable during the outage.

Conclusion:

The explanation is a reasonable one, and reasonably detailed.

I wrote during the outage that services such as Amazon's were still built on good old computers, not airy, information-automagic that could transform the dirty, complex job of operating a data center into something more like a petting zoo for unicorns.

My intent was to remind people bedazzled with the cloud concept that even the most sophisticated, well-maintained IT systems are subject to the laws of physics and it would probably be an overreaction to decide Amazon had somehow caused the expulsion of humanity from cloud-computing paradise and that the only option was to burn it as a witch.

Even its customers seemed pretty understanding, even though villagers marching on the lair of the witch or evil scientist makes for a great story.

I didn't want to blame Amazon too early for something that might not have been entirely its fault. It would always have been culpable for any damage or downtime, but if the root cause was a meteor through the roof, there's a limit to how much moral blame you can lather on.

Now that we know the real problem was some network tech telling a storage-area-network router to send traffic to Node C rather than Node B, I'm less sympathetic.

That's a common type of mistake, but not one that should take down a big part of one of the biggest data-center operations in the country.

If your data network is set up in such a way that one screwup can take down a big chunk of it – even if you're running redundant routers for your primaries and have a whole secondary network with redundant routers of its own – you set it up badly.

Amazon needs to do more than offer apologies, a discount for services it couldn't deliver and a really good explanation of what happened.

It needs to look at the fundamental architecture of its storage network, figure out how to make sure that "redundant" really means "redundant," and come to its customers with an explanation of that, too.

No matter how cool "cloud" is, or how much more the really good geeks like you than they like Microsoft and its cloud, you owe your customers that much.

In fact, Amazon owes the industry that much. As the company that really got people to take the plunge into cloud computing, and a million IT companies to dive in and try to offer it as well, Amazon would show poor leadership and no consideration for those who rely on it if it simply put a patch over the hole and went on about its business like nothing ever happened.

That hole appeared because one of its people leaned on a wall that couldn't keep an elbow from punching through it. And the wall he leaned against was the dyke holding back a river of hurt.

It's probably not a good idea to leave the rest of the wall in that condition no matter how well the patch appears to be holding.
