The beauty and convenience of mobile computing, Web 2.0 applications and the always-online work- and lifestyle is that everything you know or need to know is always available – as long as your mobile has a connection and your cloud doesn't crash.
Startup Springpad and its customers are discovering the danger in that last bit, as Amazon confesses that not all the customers affected by its hours-long outage and days-long recovery will get all their data back.
That's a problem for Springpad, whose whole business is not only built on Amazon's EC2, but also on the proposition that its own customers can type, scan or speak anything they want into whatever device is in front of them and Springpad will save it for them in a way that will make it useful later.
If it hasn't disappeared into the cracks in Amazon's data centers.
Springpad has been making huge inroads into online note taking, a field crowded with competitors but dominated by blogger-favorite Evernote. Favorable reviews call Springpad more comprehensive, easier to use, easier to organize and better suited for business (even as a CRM).
With more than a million users, a little venture-capital backing, three years in business and only 15 people on staff, Springpad couldn't afford to build its own data centers. It rented Amazon's.
It didn't back up Amazon's cloud with an internal one of its own. That was a big mistake, though an understandable one given Amazon's reputation, the reputation of "the cloud" for bulletproof reliability and the expense of replicating high-end data centers you can only afford to rent anyway.
"We know that many of you rely on Springpad in your daily lives, and we take that responsibility very seriously," Springpad wrote to users in its management blog. "We can’t do enough to apologize for the impact that you’re feeling right now."
Springpad offered more detail to customers than most other online service companies that came down along with Amazon:
"Yesterday morning at 1:41AM PDT Amazon started experiencing network issues with EBS. Well over 100 companies were impacted. Amazon has been working hard to recover from the issue in all data centers, but the zone that Springpad is in is still recovering. Specifically, user data is stored in 12 different replicated Cassandra servers and 50% of those servers are down, which is beyond our capacity to recover... Right now we are holding tight and researching our options. We are hoping that EBS recovers soon, but in the event that it does not, we are planning our next steps."
Those next steps don't involve a backup data center.
They involve adding reliability by changing the way Springpad uses the cloud:
- Using Amazon's multiple availability zones to let EC2 essentially back itself up, using one zone as failover for another;
- Adding another cloud provider such as Rackspace, as possible failover for the first one;
- Possibly shifting from cloud rental to co-location arrangements to give Springpad more control over its own environment;
- Accelerating software development to let customers store data offline on their own equipment.
All are good ideas, except that last one; don't sell your data storage service as reliable while telling customers the best way to make sure their data are safe is to not rely on your service.
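The first two ideas amount to the same pattern: try the primary location and fall back to a second one when it fails. Here is a minimal Python sketch, assuming each storage location is wrapped in a function that raises `ConnectionError` when it is unreachable. The names `save_with_failover`, `primary_zone` and `secondary_zone` are hypothetical, made up for illustration; this is not Springpad's code or any Amazon or Rackspace API.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every configured backend rejects the write."""


def save_with_failover(note: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend(note)
        except ConnectionError as exc:
            # This location is down; remember why and move on to the next one.
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed")


# Hypothetical stand-ins for two storage locations (assumptions, not real APIs).
def primary_zone(note: str) -> str:
    raise ConnectionError("primary zone unreachable")


def secondary_zone(note: str) -> str:
    return f"stored in secondary zone: {note}"


print(save_with_failover("buy milk", [primary_zone, secondary_zone]))
```

The pattern only helps if the fallback doesn't share the primary's failure mode, which is the argument for a second zone or, better, a second provider.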
Most of the responses from Springpad's customers say they understand that computers and computer services (especially free ones) can be problematic and unreliable. One compared it to the elevator in his building that was out of service for two days so he had to walk up seven floors.
He called complainers "whiners."
I'd call them "customers" and say Springpad was right to be apologetic and to look into alternative backup plans. It should have done that sooner.
If you offer a service to customers, especially if you pitch it as a business tool they can rely on, it has to be available.
If you can't control how reliable the equipment providing the service is, you can't ensure the service will be reliable.
Amazon's outage is absolutely the kind of thing that "just happens" in computing. It happens due to hardware or software failures, mistakes in data-center or systems-software design, errors in operations management and the failure of systems designed to back up everything that might crash if anything else goes wrong.
It's possible to get far too anal about reliability and availability in computing. HPC people are famous for it. Though it's only a rumor that they always wear two pairs of underwear in case one mysteriously dissolves, they have been known to carpool by following each other to work so employees in the cars behind can pick up the ones in front in case of car trouble. (They do something similar about the underwear situation, but those arrangements are too delicate to discuss here.)
That kind of obsessive attention to over-preparation and redundancy is why they make a living, though not many companies really need that level of attention.
A company like Springpad probably doesn't. One like Amazon absolutely does.
A company like Springpad that needs three nines of reliability and hires one with five nines, like Amazon, is still out of luck if the bottom drops out and all those nines end up scattered on the floor.
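To put rough numbers on those nines, the sketch below converts an availability target into a yearly downtime budget, then checks what a single long outage does to it. The 48-hour outage is an assumed figure for illustration only (the recovery here was described only as "days-long"), and the function names are mine, not anything Springpad or Amazon publishes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Minutes of downtime per year allowed by an availability of N nines."""
    return MINUTES_PER_YEAR / 10 ** nines


def achieved_availability(outage_minutes: float) -> float:
    """Fraction of the year a service was up, given its total outage time."""
    return 1 - outage_minutes / MINUTES_PER_YEAR


# Assumption for illustration: one outage lasting 48 hours.
ASSUMED_OUTAGE_MINUTES = 48 * 60

print(f"three nines allows {downtime_budget_minutes(3):.1f} minutes a year")
print(f"five nines allows {downtime_budget_minutes(5):.2f} minutes a year")
print(f"after the assumed outage: {achieved_availability(ASSUMED_OUTAGE_MINUTES):.2%}")
```

Under those assumptions, one two-day outage burns through more than five years of a three-nines budget, which is why renting five nines doesn't guarantee you three.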
Cloud is great. Control is better. Rent one, buy the other. Don't count on either one to always work the way it's supposed to.
Simple lesson from a complex failure.