November 01, 2012, 11:29 AM — An endless stream of tweets and blog posts have noted, described and bewailed last week's Amazon Web Services outage. Some people characterized the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterized it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to outline how users have to be sure to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.
Most of these responses are reflective of bias or the commenter's own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide appropriate risk mitigation strategies appropriate for the new world of IT.
Analysis: Amazon Outage Started Small, Snowballed Into 12-Hour Event
The first thing to look at is what risk really is. Wikipedia calls it "the probable frequency and probable magnitude of future loss." In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish.
Amazon Outages Show That Failure Is An Option
The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.
In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon's Elastic Block Storage (EBS) service.