RightScale CTO Thorsten von Eicken says that during the latest Amazon outage, RightScale's own internal operations had trouble scaling across availability zones (AZs) in Amazon's cloud. AWS admitted it was "throttling" customers, meaning it limited how much data they could transfer from one AZ to another, a practice it has vowed to apply less aggressively in the future. The point is that even if a system is architected to be fault tolerant, unexpected problems can still arise.
There are multiple ways to architect fault tolerant systems, von Eicken says. Customers can create two active-active services, or one active service and a standby "clone," for example. Each has its own advantages and cost considerations.
Basic fault tolerance: In a basic fault tolerant architecture, there is a production architecture and a standby "clone architecture." If the master AZ fails, the system can be switched over to the cloned version. That switch-over usually has to be done manually. In addition, the databases are usually replicated to Amazon's Simple Storage Service (S3) only about every 10 minutes, so when a switch-over does occur, you could lose about the last 10 minutes' worth of data, RightScale says.
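To make the 10-minute figure concrete, here is a minimal sketch of the data-loss window in a standby setup. It assumes, purely for illustration, that backups land at fixed intervals starting at time zero and complete instantly; the function name and parameters are hypothetical and not part of any RightScale or AWS tool.

```python
def data_loss_window(failure_time_s, snapshot_interval_s=600):
    """Seconds of writes lost if the primary fails at failure_time_s.

    Assumption: snapshots are taken at t = 0, 600, 1200, ... seconds
    and are available to the standby immediately.
    """
    if failure_time_s < 0 or snapshot_interval_s <= 0:
        raise ValueError("times must be non-negative and the interval positive")
    # Everything written since the most recent snapshot is gone.
    return failure_time_s % snapshot_interval_s


if __name__ == "__main__":
    # Failure 25 minutes in: the last snapshot was taken at the 20-minute mark.
    print(data_loss_window(25 * 60))
```

Under these assumptions the loss ranges from nothing (a failure right after a backup) to just under the full interval (a failure right before the next one), which is why a 10-minute replication cycle translates to "about the last 10 minutes" in the worst case.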
Advanced fault tolerant system: A more advanced system creates two active systems running simultaneously. In this active-active setup, any instance, or even an entire AZ, can fail and the system will automatically be able to complete all its functions from another AZ that is pre-architected and ready to take over. RightScale says this architecture will cost more than double a single AZ setup: not only do all of the services from the single AZ have to be replicated, but there are also data transfer costs that come with keeping both systems up to date in real time.
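The routing side of an active-active setup can be sketched roughly as follows. This is an illustration only, not RightScale's or Amazon's implementation: the zone names, the health map, and the `route_requests` function are all hypothetical, and a real deployment would rely on a load balancer and health checks rather than an in-process loop.

```python
from itertools import cycle


def route_requests(requests, zone_health):
    """Spread requests round-robin across the zones currently marked healthy.

    zone_health maps a (hypothetical) zone name to True/False.
    """
    healthy = [zone for zone, ok in zone_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy availability zone")
    zones = cycle(healthy)
    return [(request, next(zones)) for request in requests]


if __name__ == "__main__":
    # Both zones up: traffic alternates between them.
    print(route_requests(["r1", "r2", "r3"], {"az-1": True, "az-2": True}))
    # az-1 down: the surviving zone absorbs everything with no manual switch.
    print(route_requests(["r1", "r2", "r3"], {"az-1": False, "az-2": True}))
```

Because both zones already serve traffic, losing one only changes where requests go. The trade-off is the one RightScale describes: each zone must be provisioned to carry the full load, and the data in both must be kept in sync.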
There are other options, too.
2. Application design
Sean Hull is an independent scalability and performance consultant with iHeavy in New York. Shortly after the AWS outage, he authored a blog post titled "AirBNB didn't have to fail," referring to the travel site that was one of dozens across the Internet that went down when AWS's cloud hiccupped. In the post, Hull argues there are tools Web developers can use to make their applications tolerant of outages.