Amazon's outage is absolutely the kind of thing that "just happens" in computing. It happens due to hardware or software failures, mistakes in data-center or systems-software design, errors in operations management, and the failure of the systems designed to back up anything that might crash when something else goes wrong.
It's possible to get far too anal about reliability and availability in computing. HPC people are famous for it. Though it's only a rumor that they always wear two pairs of underwear in case one mysteriously dissolves, they have been known to carpool by following each other to work, so employees in the cars behind can pick up the ones in front in case of car trouble. (They do something similar about the underwear situation, but those arrangements are too delicate to discuss here.)
That kind of obsessive attention to over-preparation and redundancy is how they make a living, though not many companies really need that level of attention.
A company like Springpad probably doesn't. One like Amazon absolutely does.
A company like Springpad that needs three nines of reliability and hires a provider with five nines, like Amazon, is still out of luck if the bottom drops out and all those nines end up scattered on the floor.
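For a sense of scale, here is what those nines mean as a yearly downtime budget. This is a minimal sketch, assuming a 365-day year and reading "N nines" as an availability of 1 - 10^-N (three nines is 99.9 percent, five nines is 99.999 percent). The function name is illustrative and does not come from any provider's SLA.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

Three nines works out to a bit under nine hours a year; five nines is a little over five minutes. Either number is a target, not a guarantee, which is the point.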
Cloud is great. Control is better. Rent one, buy the other. Don't count on either one to always work the way it's supposed to.
Simple lesson from a complex failure.