July 03, 2012, 1:45 PM — Amazon Web Services says power outages, software bugs and rebooting bottlenecks led to a "significant impact to many customers" last week, according to a detailed post-mortem report the company released today about the service disruption.
As storms raged through the mid-Atlantic on Friday night, AWS experienced power outages that initially impacted the company's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Relational Database Service (RDS) offerings, but extended into "control plane" services, such as its Elastic Load Balancer (ELB), which are designed to shift traffic away from impacted areas of the company's service.
REMEMBER WHEN: Amazon EC2 outage calls 'availability zones' into question
ONE YEAR LATER: Amazon outage one year later: Are we safer?
AWS experienced multiple power outages on Friday night, most of which were handled by a backup generator kicking in to supply power. Shortly before 8 p.m. PDT, however, a backup generator failed to fully kick in after one outage, and the facility's uninterruptible power supply (UPS), a second layer of backup, was depleted within seven minutes. Beginning at 8:04 p.m., parts of the impacted data center went without power for 10 minutes, which brought down the EC2 and EBS services in the impacted area.
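The redundancy chain the report describes can be reduced to a toy model (the timings below are illustrative assumptions, not AWS's actual figures): a generator is supposed to carry the load through a utility outage, with the UPS only bridging the gap until it starts.

```python
# Toy model of the power-redundancy chain: utility power fails over to
# a generator, with a UPS bridging the gap. All numbers here are
# hypothetical illustrations, not figures from the AWS report.

def minutes_dark(generator_started: bool, ups_minutes: float,
                 outage_minutes: float) -> float:
    """Minutes equipment spends without power during a utility outage."""
    if generator_started:
        return 0.0  # generator carries the load for the whole outage
    # Otherwise the UPS drains first, then the racks go dark.
    return max(0.0, outage_minutes - ups_minutes)

# If the generator fails to start and the UPS lasts ~7 minutes,
# a 17-minute outage leaves the racks dark for 10 minutes:
print(minutes_dark(False, ups_minutes=7, outage_minutes=17))  # → 10.0
```

The point of the model is that the UPS is sized only to cover the generator's startup window, so a generator that never starts exhausts it in minutes.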
As a result, for more than an hour between 8:04 and 9:10 p.m. PDT on Friday, customers were unable to create new EC2 instances or EBS volumes. The "vast majority" of the instances came back online between 11:15 p.m. PDT and just after midnight, AWS says, but recovery was slowed by a bottleneck in the server booting process caused by the large number of simultaneous reboot requests. AWS says removing that bottleneck is one of the improvements it will make for future power-failure recoveries.
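The kind of bottleneck AWS describes is easy to see with back-of-the-envelope arithmetic (the request counts, pool size and boot time below are hypothetical, not from the report): when a surge of reboot requests hits a fixed-capacity boot pipeline all at once, recovery time grows with the size of the backlog.

```python
import math

# Toy model of a reboot bottleneck: `requests` instances all ask to
# boot at once, but only `workers` boots can proceed concurrently.
# All figures are hypothetical illustrations, not AWS's numbers.

def drain_time(requests: int, workers: int, boot_seconds: float) -> float:
    """Seconds to boot every queued instance with a fixed worker pool."""
    waves = math.ceil(requests / workers)  # full passes through the pool
    return waves * boot_seconds

# A simultaneous surge of 10,000 reboots through 200 concurrent slots,
# at 90 seconds per boot, takes 75 minutes to drain:
print(drain_time(10_000, 200, 90) / 60)  # → 75.0
```

Under this model, the fix AWS alludes to amounts to raising the pipeline's concurrency (or prioritizing the queue) so a mass-restart event drains in minutes rather than hours.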
AWS breaks its regions up into multiple availability zones (AZs), which are designed to be isolated from failure. Even though the issues on Friday were centered in a single AZ, AWS ran into more trouble when load balancers attempted to switch traffic to unaffected AZs. "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn't seen before," the company wrote. The bug caused a flood of requests which, combined with EC2 instances coming back online, created a backlog in the system.
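The traffic-shifting behavior described above can be sketched as a simple health-based routing loop (the zone names and health map are hypothetical illustrations, not ELB's actual implementation):

```python
# Minimal sketch of routing around an unhealthy availability zone,
# mimicking a load balancer shifting traffic away from an impacted AZ.
# Zone names and health states are hypothetical illustrations.

AZ_HEALTHY = {
    "us-east-1a": False,  # the zone that lost power
    "us-east-1b": True,
    "us-east-1c": True,
}

def route(zones: dict) -> str:
    """Return the first healthy zone to receive traffic."""
    for zone, healthy in zones.items():
        if healthy:
            return zone
    raise RuntimeError("no healthy availability zone available")

print(route(AZ_HEALTHY))  # → us-east-1b
```

The failure mode in the report sits outside this happy path: when the impacted zone recovered, every load balancer re-registering at once generated its own flood of control-plane requests, so the recovery traffic itself became the backlog.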