That is what many AWS customers experienced this past April, when Amazon's Northern Virginia data center suffered a glitch and -- to use the technical term -- went totally nutso.
The error started during a network upgrade, when a misrouted traffic shift sent a cluster of Amazon EBS (Elastic Block Store) volumes into a remirroring storm, as they sought out available boxes into which they could insert backups of themselves -- perverse, I know. That set off a series of events that ultimately took down much of the company's U.S. East Region.
That's the short version, anyway -- if you're interested in the full nitty-gritty, clear out 47 hours in your schedule and read Amazon's novel-length explanation.
The problems persisted for about four days. But while many businesses struggled, others such as Netflix took the storm in stride. The key to survival? Designing your systems with these types of failures in mind.
"Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3, and Cassandra services that we do depend upon were not affected by the outage," Netflix engineers wrote in their "Lessons Netflix Learned From the AWS Outage" blog post. Stateless services and multiple redundant hot copies of data across availability zones were key to avoiding AWS cloud fail pain.
Think you have to be a Netflix-size business to stay safe? Think again. Twilio, a company that helps developers integrate communications into their Web apps, uses Amazon's EC2 to host the core of its infrastructure -- yet April's outage had little to no impact on its stability.
"The fundamental premise of building on the cloud is assuming that the network will have glitches," says Evan Cooke, Twilio's co-founder and chief technology officer. "We built an infrastructure around the idea that a host can and will fail, so we don't rely on any single machine or single component in the core architecture itself."
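For the code-minded, the idea Cooke describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Twilio's or Netflix's actual code: the names `call_with_failover`, `AllReplicasFailed`, `zone_a`, and `zone_b` are hypothetical, and each "replica" stands in for a redundant copy of a service running in a different availability zone.

```python
import random
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised only when every redundant copy of the service fails."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in random order until one answers.

    Hypothetical helper: no replica is treated as "the" primary, so the
    loss of any single host or zone does not take the caller down.
    """
    order = list(replicas)
    random.shuffle(order)  # spread load; no single machine is special
    errors: List[Exception] = []
    for replica in order:
        try:
            return replica()
        except ConnectionError as exc:
            # A failed host is the expected case, not the exceptional one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


# Stand-ins for the same service deployed in two availability zones.
def zone_a() -> str:
    raise ConnectionError("zone A is unreachable")  # simulated outage


def zone_b() -> str:
    return "ok from zone B"


print(call_with_failover([zone_a, zone_b]))
```

The caller never learns that zone A was down. That only holds if the service behind each replica is stateless or its data is already copied to every zone, which is the second half of the Netflix lesson above.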
Colossal cloud outage No. 2: The Sidekick shutdown
Smartphones make it easy to access your data on the go, but just because something has "smart" in its name doesn't mean it can't be dumb. Case in point: the T-Mobile Sidekick screwup, circa fall 2009.