In English this time: How Amazon let its cloud crash and why it should have known better

When you screw up, don't just admit the problem; fix it and let us know if we can rely on you next time.


My intent was to remind people bedazzled by the cloud concept that even the most sophisticated, well-maintained IT systems are subject to the laws of physics. It would probably be an overreaction to decide Amazon had somehow caused the expulsion of humanity from cloud-computing paradise and that the only option was to burn it as a witch.

Its customers seemed pretty understanding, too, even though villagers marching on the lair of the witch or evil scientist makes for a great story.

I didn't want to blame Amazon too early for something that might not have been entirely its fault. It would always have been culpable for any damage or downtime, but if the root cause was a meteor through the roof, there's a limit to how much moral blame you can lather on.

Now that we know the real problem was some network tech telling a storage-area-network router to send traffic to Node C rather than Node B, I'm less sympathetic.

That's a common type of mistake, but not one that should take down a big part of one of the biggest data-center operations in the country.

If your data networks are set up in such a way that one screwup can take down a big chunk of them – even if you're running redundant routers for your primaries and have a whole secondary network with redundant routers of its own – you set them up badly.

Amazon needs to do more than offer apologies, a discount for services it couldn't deliver and a really good explanation of what happened.

It needs to look at the fundamental architecture of its storage network, figure out how to make sure that "redundant" really means "redundant," and come to its customers with an explanation of that, too.

No matter how cool "cloud" is or how much more the really good geeks like you than they like Microsoft and its cloud, you owe your customers that much.

In fact, Amazon owes the industry that much. As the company that really got people to take the plunge into cloud computing, and got a million IT companies to dive in and try to offer it as well, you would show poor leadership and no consideration for those who rely on you if you put a patch over the hole and went on about your business like nothing ever happened.

That hole appeared because one of your guys leaned on a wall too weak to withstand an elbow. And the wall he leaned against was the dyke holding back a river of hurt.

It's probably not a good idea to leave the rest of the wall in that condition no matter how well the patch appears to be holding.
