The 10 worst cloud outages (and what we can learn from them)
Sending your IT business to the cloud comes with risk, as those affected by these 10 colossal cloud outages can attest
As a concept, there's a lot to like about the cloud. Drop those bulky servers and get yourself a big, white hard drive in the sky. Someone else handles the upkeep and lets you put your data where you want it. Even the word "cloud" itself brings to mind a heavenly (if slightly fluffy) fantasy.
The reality is, of course, a mixed bag. What you gain in avoiding upkeep, you lose in control. And the security concerns are considerable. But nowhere is the nightmare as vivid as it is when your cloud service goes down.
Just ask any of the businesses affected by Amazon Web Services' high-profile outage in April.
"We were pretty blown away," says Nick Francis, whose startup, Help Scout, had launched just one week prior to Amazon's problem. "We definitely weren't prepared."
Francis wasn't the only one caught off-guard. Big-name properties like Reddit and Foursquare fell flat when Amazon's cloud sputtered.
"The cloud has been sold as this magical thing that just works and is totally reliable," says Lew Moorman, chief strategy officer of Rackspace, a cloud provider that's seen its fair share of outages. "The truth is that buying through the cloud is another way of buying computing, and computing is inherently flawed. If you want to make sure those flaws don't hurt you, you have to plan ahead."
To help keep your business pain-free in the cloud, we offer these hard-earned lessons at the hands of 10 of the worst cloud storms the Web has weathered.
Colossal cloud outage No. 1: Amazon Web Services goes poof
Freeing yourself from network maintenance gruntwork is a chief selling point for doing business in the cloud. The downside? Standing by helplessly when your cloud vendor's routine configuration change grinds your business to a halt.
That is what many AWS customers experienced this past April, when Amazon's Northern Virginia data center suffered a glitch and -- to use the technical term -- went totally nutso.
The error started during a network upgrade, when a misrouted traffic shift sent a cluster of Amazon EBS (Elastic Block Store) volumes into a remirroring storm, as they sought out available boxes into which they could insert backups of themselves -- perverse, I know. That set off a series of events that ultimately took down much of the company's U.S. East Region.
That's the short version, anyway -- if you're interested in the full nitty-gritty, clear out 47 hours in your schedule and read Amazon's novel-length explanation.
The problems persisted for about four days. But while many businesses struggled, others such as Netflix took the storm in stride. The key to survival? Designing your systems with these types of failures in mind.
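What does designing for failure look like in practice? Here is a minimal sketch, not any particular company's implementation: a caller that treats every host as expendable and simply moves on to the next copy when one fails. The host functions and the `AllReplicasFailed` name are hypothetical stand-ins invented for illustration.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica in the list has failed."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # any failure means: move on to the next host
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed: {errors}")


# Hypothetical stand-ins for hosts running in two separate zones.
def zone_a_host(request: str) -> str:
    raise ConnectionError("zone A is unreachable")


def zone_b_host(request: str) -> str:
    return f"handled {request!r} in zone B"


print(call_with_failover([zone_a_host, zone_b_host], "send-sms"))
```

The trick only works if the services behind those hosts are interchangeable, which is why keeping them stateless matters: any copy can answer any request.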
"Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3, and Cassandra services that we do depend upon were not affected by the outage," Netflix engineers wrote in their "Lessons Netflix Learned From the AWS Outage" blog post. Stateless services and multiple redundant hot copies of data across availability zones were key to avoiding AWS cloud fail pain.
Think you have to be a Netflix-size business to stay safe? Think again. Twilio, a company that helps developers integrate communications into their Web apps, uses Amazon's EC2 to host the core of its infrastructure -- yet April's outage had little to no impact on its stability.
"The fundamental premise of building on the cloud is assuming that the network will have glitches," says Evan Cooke, Twilio's co-founder and chief technology officer. "We built an infrastructure around the idea that a host can and will fail, so we don't rely on any single machine or single component in the core architecture itself."
Colossal cloud outage No. 2: The Sidekick shutdown
Smartphones make it easy to access your data on the go, but just because something has "smart" in its name doesn't mean it can't be dumb. Case in point: the T-Mobile Sidekick screwup, circa fall 2009.
Remember this fiasco? The Microsoft-owned Sidekick suffered a nearly week-long service outage that left users without access to email, calendar info, and other personal data. Then, adding insult to injury, Microsoft confessed it had completely lost the cloud-stored bits and wouldn't be able to restore them. Evidently, the good ol' gang from Redmond had forgotten to make backups.
The technology may have evolved since then, but the lesson remains the same: When it comes to crucial data, never assume someone else is automatically protecting you. Make sure you understand your cloud provider's disaster recovery setup -- better yet, make your own arrangements to back up your important data independently.
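As a rough sketch of what "your own arrangements" might mean, the snippet below copies an exported file to storage you control and then checks that the copy matches the original byte for byte. It assumes your provider lets you export data as ordinary files; the file name, the `offsite` folder, and both function names are made up for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy matches byte for byte."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target


# Temporary directories stand in for the provider export and your own storage.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "contacts.csv"
    export.write_text("name,phone\nAda,555-0100\n")
    copy = backup_and_verify(export, Path(tmp) / "offsite")
    print(copy.name, "verified")
```

A backup you have never verified is only a hope, so the checksum comparison is the part worth keeping even if everything else about your setup differs.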
"The same operational rules apply even in the cloud," says Ken Godskind, vice president of monitoring products for AlertSite, a SmartBear company. "Organizations using the cloud can't just assume that because it's in the cloud, all the responsibility for business continuity planning has somehow been transferred to the provider."
Colossal cloud outage No. 3: Gmail fail
Of all cloud services, Google's Gmail presents one of the more likely threats to Microsoft's on-premises stranglehold on the enterprise. Replace your high-maintenance Exchange servers with a cheap, dependable email service backed by Postini. What's not to like?
A rash of irksome outages, the most recent of which had 150,000 Gmail users signing in to their accounts only to find blank slates -- no emails, no folders, nothing that indicated they were actually looking at their own inboxes. To Google's credit, it provided regular updates and promised a quick fix. But repairs took as long as four days for some of the affected users.
"How could this happen if we have multiple copies of your data, in multiple data centers?" Google vice president of engineering Ben Treynor asked in a blog post published at the time. "In some rare instances, software bugs can affect several copies of the data. That's what happened here."
Google ended up having to turn to actual physical tape backups in order to restore the data. Ultimately, the company's multilayered data protection did work, but not without leaving thousands of users locked out of their email for days.
Is that a reason to run, arms flailing, away from anything cloud-connected? Probably not. But it is a reason to look carefully at your own data safeguards and think about setting up a backup or offline-access solution now, before an urgent need arises.
"When you look at broad averages, the cloud will have a lot more operational success than you would as an individual," says AlertSite's Ken Godskind. "It's just that when you go to Web scale, the impact of failure is amplified in a much greater way."
Colossal cloud outage No. 4: Hotmail's hot mess
Of course, Microsoft hasn't always provided the greatest advertisement for its big push for the cloud, either. Witness Microsoft's Hotmail service, which experienced database errors of its own at the end of 2010, resulting in tens of thousands of empty inboxes at the turn of the new year.
The error, according to Microsoft, stemmed from a script that was meant to delete dummy accounts created for automated testing. The script mistakenly targeted 17,000 real accounts instead.
It took Microsoft three days to restore service for most of those users. An unlucky 8% of affected emailers had to wait an extra three days before their data was back where it belonged.
Even Clippy couldn't smile through a headache like that.
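Microsoft didn't publish the script, so what follows is a generic safeguard rather than a reconstruction of what went wrong or how it was fixed. The idea: a cleanup job should delete only accounts explicitly flagged as test accounts, and it should refuse to run at all if the batch is suspiciously large. The `Account` class, the `is_test` flag, and the limit of 100 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    account_id: int
    is_test: bool  # hypothetical flag set only when automated tests create the account


def select_for_deletion(accounts, max_deletions=100):
    """Return only flagged test accounts, refusing to proceed if the batch looks too large."""
    targets = [a for a in accounts if a.is_test]
    if len(targets) > max_deletions:
        raise RuntimeError(
            f"refusing to delete {len(targets)} accounts (limit {max_deletions})"
        )
    return targets


accounts = [Account(1, False), Account(2, True), Account(3, True)]
# Dry run: report what would be removed before anything is actually deleted.
for account in select_for_deletion(accounts):
    print("would delete", account.account_id)
```

The size cap is the cheap insurance here: a job expecting a few dozen dummy accounts that suddenly selects thousands should stop and ask a human.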
Colossal cloud outage No. 5: The Intuit double-down
Intuit hit a rough patch last year when its cloud-connected services, including popular platforms like TurboTax, Quicken, and QuickBooks, went offline twice within a single month. The worst case was a 36-hour outage in June. A power failure evidently caused things to go haywire, with the company's primary and backup systems getting knocked completely off the grid.
It only added insult to injury, then, when another apparent power failure hit Intuit weeks later. Among other issues, the second outage appeared to cause an abnormally high rate of obscenity-laden shouting.
"Twenty-five hours downtime is hard to swallow," one user tweeted at the time. "Passive, opaque and stiff communication from Intuit didn't help."
"The truth is, there are better solutions than a single cloud if you need absolute availability," says Chris Whitener, chief strategist of HP's Secure Advantage program. "It's not necessarily that you have to duplicate everything, but even putting one extra step in there -- maybe backing up crucial data yourself -- can make all the difference."
Colossal cloud outage No. 6: Microsoft's BPOS oops
It's hard to be productive when your cloud-based productivity suite bites the virtual dust. That's what happened to organizations relying on Microsoft's business cloud offering just weeks ago: The service, named -- in true Microsoft style -- Microsoft Business Productivity Online Standard Suite, started to stutter around May 10. Paying customers' email was delayed by as much as nine hours as a result.
Two days later, just when it looked like BPOS was in the clear, the delay returned and outgoing messages started getting stuck in the pipeline, too. As if that weren't enough, Microsoft experienced a separate issue that prevented users from logging in to its Web-based Outlook portal as well.
"I'd like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused," Dave Thompson, corporate vice president for Microsoft Online Services, wrote in a blog post.
"I'd also like to apologize for the obvious inconvenience of having to speak 15 syllables every time you say our service's ridiculous name," he probably should have added.
Colossal cloud outage No. 7: The Salesforce slipup
An hour of downtime may not sound like much, but when your company holds the keys to the customer service operations of tens of thousands of businesses, more than a few of those organizations are bound to view those 60 minutes as a lifetime.
Salesforce.com learned this the hard way when its data center shut down last January. Just four days into the new year, Salesforce.com reported a full-on failure -- meaning services, backups, the whole nine yards were kaput.
Annoying? Absolutely. Surprising? Not entirely.
"The reality is that cloud-based data centers -- guess what? -- they go down, too," says Tim Crawford, chief information officer of All Covered, a division of Konica Minolta. "That has always been the case and will always be the case. We have to be realistic about it."
Crawford says successful cloud computing requires a different mind-set than traditional server setups: It's up to you, he suggests, to decide whether your business's data can endure occasional downtime -- and if not, to make sure your configuration has the resiliency needed to avoid it.
"When you pick a cloud provider, you need to do your homework to understand how they're providing those services and if they're able to build a level of redundancy as good or better than what you're able to do on your own," Crawford says. "If the answer is no, then why are you using them?"
Colossal cloud outage No. 8: Terremark's terrible day
These days, Terremark may be making headlines for its billion-dollar Verizon deal, but in early 2010, an extended outage dominated the cloud provider's coverage.
Terremark's luck turned sour on St. Patrick's Day, March 17, 2010. The company's vCloud Express service took a nosedive that day, with a Miami-based data center going offline for about seven hours. Users were unable to access data stored in the center for the entire period.
Not to get overly redundant, but this brings up the value of redundancy -- having your crucial data available on multiple servers in different data centers or, even better, different regions. You could also take the extra step of spreading it among different providers as a failsafe.
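To put a rough number on that value, here is a back-of-the-envelope calculation. It rests on a big assumption: that copies fail independently of one another. The 99% availability figure is purely illustrative and does not come from any provider mentioned in this article.

```python
from math import prod


def combined_availability(availabilities):
    """Chance that at least one copy is up, assuming failures are independent."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


single = 0.99  # illustrative figure, not a number from any provider in this article
paired = combined_availability([single, single])

print(f"one copy:   {downtime_hours_per_year(single):.1f} hours down per year")
print(f"two copies: {downtime_hours_per_year(paired):.2f} hours down per year")
```

Under those assumptions, a second copy cuts expected downtime from roughly 88 hours a year to under one. Real-world gains are smaller whenever failures are correlated, for instance when one software bug or one power event takes out every copy at once, which is the argument for spreading copies across regions or providers.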
"You can pick a series of vendors to host a workload -- one as a backup or two as a backup, and then another as your primary," suggests Harold Moss, chief technology officer of IBM's Cloud Security Strategy program. "You can then implement your workload there in a secure manner, with the appropriate security, and start to introduce your resiliency capabilities."
Colossal cloud outage No. 9: PayPal
This is no hypothetical exercise: PayPal fell for real in the summer of 2009, leaving millions of merchants around the world with no way to sell their stuff. The service was completely unavailable for about an hour and remained spotty for several more. PayPal said hardware failure was to blame.
It's a rare kind of outage, no doubt -- but with all the sales lost, this unfortunate interruption easily earns a spot in cloud computing's hall of shame.
Colossal cloud outage No. 10: Rackspace's rough year
When you provide cloud services to Web presences like TechCrunch and Justin Timberlake, you'd better believe people are going to notice when your servers stop working.
Rackspace learned that lesson a few times in 2009. The cloud provider suffered four high-profile failures throughout the year, adding up to hours of offline time for the company's customers. One blip was bad enough that Rackspace had to pay out nearly $3 million in service credits to its users.
Rackspace called the incidents "painful and very disappointing" and promised to "execute at a high level for a long time" after. Today, the company continues to focus on uptime but also works to help users plan for the inevitable turbulence that comes with life in the cloud.
"If you want to cluster a server or build geographical redundancy, it's easier to do now than it ever was before, but you have to actually take those steps," says Rackspace's Lew Moorman. "The cloud doesn't bring inherent weaknesses that weren't present if you did things in-house before."
All considered, the biggest lesson here may be that no single server, center, or service is 100% reliable. If you don't build your business with that in mind -- well, my friend, you're just walking around with your head in the cloud.
This article, "The 10 worst cloud outages (and what we can learn from them)," originally appeared at InfoWorld.com.