The backup caused enough havoc that even customers whose segments of the cloud had nothing to do with the affected area, and whose data requests came nowhere near it, saw performance bog down and errors increase as the delay between an application's request and the response from storage grew longer.
Those performance problems continued throughout the day. Within the affected zone, most of the big problems were fixed by noon Pacific time.
It took a lot longer to pull the "stuck" data volumes out of their funk and get them working again, however.
By lunchtime on the 22nd all but 2.2 percent of the stuck volumes had been unstuck, though Amazon techs had to filter the response of each as it burst out of stasis and tried to replicate promiscuously across all the bandwidth suddenly available to it.
By 12:30 on the afternoon of the 24th, three and a half days after the initial problem, the response crew decided it had gotten everything unstuck that was going to come unstuck.
Of the volumes that had been affected, 1.04 percent refused to come back; Amazon started restoring them using snapshot backups.
It brought all but 0.7 percent back to life, though some of the snapshots might have been minutes or hours old when the whole works froze – meaning anything not backed up by that time was lost during the recovery.
Amazon plans to keep similar disasters from happening by letting customers run applications in more than one Availability Zone, giving them the opportunity to build their own failover instances for critical data, or at least a second point from which to operate, even without access to all their data.
It will also make multi-Zone deployments simpler to design and operate, it said. Right now such deployments are complicated enough that even customers who would like the added security avoid them, the report found.
It also plans to improve its own visibility into each Zone so it can tell when and how problems are developing, and to expand its ability to recover data or apps by adding things like the ability to recover a "stuck" volume by taking a snapshot of it, turning the snapshot into a live volume, and deleting the stuck version.
It also promises better communication with customers and a 10-day credit to customers whose volumes of data, application instances or databases were unavailable during the outage.
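For readers who want to picture the "stuck" volume recovery Amazon describes (snapshot the volume, turn the snapshot into a live volume, delete the stuck version), here is a minimal sketch of that sequence. It is not Amazon's implementation and it calls no real AWS API: the Zone, Volume and Snapshot classes and the recover_stuck_volume function are hypothetical, in-memory stand-ins invented purely for illustration.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # hypothetical ID generator for illustration only


@dataclass
class Volume:
    volume_id: str
    blocks: dict          # block index -> bytes
    stuck: bool = False   # True while the volume is not serving I/O


@dataclass(frozen=True)
class Snapshot:
    source_id: str
    blocks: dict          # point-in-time copy of the source volume's blocks


class Zone:
    """Hypothetical in-memory stand-in for the volumes in one Availability Zone."""

    def __init__(self):
        self.volumes = {}

    def create_volume(self, blocks):
        # A new volume always starts out healthy (not stuck).
        vol = Volume(f"vol-{next(_ids)}", dict(blocks))
        self.volumes[vol.volume_id] = vol
        return vol

    def snapshot(self, volume_id):
        vol = self.volumes[volume_id]
        return Snapshot(vol.volume_id, dict(vol.blocks))

    def delete_volume(self, volume_id):
        del self.volumes[volume_id]


def recover_stuck_volume(zone, volume_id):
    """Snapshot the stuck volume, build a live volume from it, drop the original."""
    snap = zone.snapshot(volume_id)               # 1. take a snapshot
    replacement = zone.create_volume(snap.blocks) # 2. turn it into a live volume
    zone.delete_volume(volume_id)                 # 3. delete the stuck version
    return replacement


if __name__ == "__main__":
    zone = Zone()
    stuck = zone.create_volume({0: b"customer data"})
    stuck.stuck = True
    fresh = recover_stuck_volume(zone, stuck.volume_id)
    print(fresh.volume_id, fresh.stuck, fresh.blocks)
```

The replacement only comes into existence after the snapshot succeeds, and the stuck volume is deleted last, so in this toy model a failure partway through never leaves the customer with no copy at all.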
Conclusion:
The explanation is a reasonable one, and reasonably detailed.
I wrote during the outage that services such as Amazon's were still built on good old computers, not airy, information-automagic that could transform the dirty, complex job of operating a data center into something more like a petting zoo for unicorns.



















