AWS first reported a small issue at 10 a.m. PT, but within an hour said it was affecting a "large number of volumes" in the affected availability zone. This seems to be the point when major sites such as Reddit, Imgur, Airbnb and Salesforce.com's Heroku platform all went down. By 1:40 p.m. PT, AWS said, 60% of the impacted volumes had recovered, but AWS engineers were still baffled as to what had caused the problem.
"The large surge in failover and recovery activity in the cluster made it difficult for the team to identify the root cause of the event," the report reads. Two hours later the team figured out the problem, and restoration of the remaining impacted services continued until it was almost complete at 4:15 p.m. PT.
AWS vows not to get caught twice by the same bug. It has instituted new alarms to prevent this specific incident from happening again, and has also modified its broader EBS memory monitoring and its alerts for detecting when new hardware is not being accepted into the system. "We believe we can make adjustments to reduce the impact of any similar correlated failure or degradation of EBS servers within an Availability Zone," AWS says.
Gartner analyst Kyle Hilgendorf says it's slightly surprising that human error caused a DNS propagation issue, which led to much of an availability zone going down. "They're supposed to be deploying the best and brightest to handle these systems," he says. But accidents happen. The bigger flaw, he says, is that AWS did not have alerts in place to catch the issue earlier. "That's the damaging part," he says. "A week passed and no one noticed memory was continuously leaking. That was the unacceptable part of this."
So could it have been prevented?
AWS says that customers who heeded the company's advice about using multiple availability zones were able to tolerate the outage, for the most part. But some customers, including at least one Network World reader (see comments section of this story), reported that even with a multi-AZ architecture they still had problems moving workloads into healthy AZs. AWS says it messed this up, too.