July 16, 2012, 7:32 AM — In June, Amazon Web Services (AWS) suffered two high-profile outages that left a number of customers--including clients like Instagram, Pinterest and Netflix--unable to provide services to their customers. But do AWS' clients share some of the responsibility?
"There's blame on both sides of the equation," says Jason Currill, founder and CEO of Ospero, a global Infrastructure as a Service (IaaS) company. "From the Amazon side, clearly I think there's an obvious issue with their redundancy power. After one outage, you'd think they'd have learned their lesson. Redundancy power is one of those elementary things that data centers are normally very, very good at."
Organizations Must Treat Cloud Providers as a Utility
But the clients affected by the outage also share blame, Currill says, because they failed customers using their services.
"If you're a corporation and you have a building, you have a diesel generator in the basement in case the electricity goes out," he says. "You have two telco lines coming in so if you lose one, you still have communications. Cloud is the same thing. It's a utility. Have two."
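Currill's "have two" advice can be sketched in code. The Python example below is a minimal illustration, not a description of how any of the affected companies built their systems: it tries a primary provider first and falls back to a second one if the first fails. The names `call_with_failover`, `primary_cloud` and `secondary_cloud` are hypothetical stand-ins for real provider clients.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    errors: List[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # broad on purpose: this is only a sketch
            errors.append(exc)
    # No provider answered, so surface every collected error to the caller.
    raise AllProvidersFailed(errors)


def primary_cloud() -> str:
    # Hypothetical stand-in for a request to a provider that is down.
    raise ConnectionError("primary provider is unavailable")


def secondary_cloud() -> str:
    # Hypothetical stand-in for a request to a second, independent provider.
    return "served by secondary"


if __name__ == "__main__":
    print(call_with_failover([primary_cloud, secondary_cloud]))
```

A real deployment would also need data replication, health checks and DNS or load-balancer changes, none of which this sketch attempts to cover.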
Generator Failures, Software Bug Cause Outage
In a detailed post-mortem released after the most recent outage on June 29, Amazon cited a series of power outages, generator failures and rebooting backlogs that led to a "significant impact to many customers."
The problems began with a large-scale electrical storm in northern Virginia, in what Amazon designates its U.S. East-1 Region. U.S. East-1 consists of more than 10 data centers structured into multiple availability zones. This structure is designed to prevent exactly the sort of problem that occurred on June 29: each availability zone runs on its own physically distinct, independent infrastructure, and common points of failure like generators and cooling equipment are not shared across zones. In theory, even a disaster like a fire, tornado or flood would affect only a single availability zone, and service would remain uninterrupted because traffic would be routed around that zone to the others.
But on that Friday, when a large voltage spike occurred in the electrical switching equipment in two of the U.S. East-1 data centers, there was a problem bringing the generators online in one of the affected data centers.