Rapid growth makes five-nines availability more important, not less, so when the aging three-tier IT infrastructure started to look creaky, the company had to decide whether or not to build a series of its own data centers at high cost in both capital and time, according to Netflix spokesblogger John Ciancutti.
"We could have chosen to build out new data centers, build our own redundancy and failover, data synchronization systems, etc. Or, we could opt to write a check to someone else to do that instead," he wrote.
The migration was an incredible amount of work, but AWS provides "undifferentiated heavy lifting" -- tons of capacity that can help Netflix compensate for being "not very good at predicting customer growth or device engagement."
The biggest problems were adjusting to servers and networks picked out by someone else -- which meant different requirements for memory management and different expectations for the rate of component failure and for variable latency in network connections.
In its own data center, Netflix could control for those variables. In Amazon's, it had to share infrastructure with other tenants. That meant either handling all the systems management with its own IT crew rather than Amazon's, or writing apps so they could learn how to share -- putting a stalled sub-task on hold to let someone else's task take over the processor or memory resource first.
Without being able to predict mean rates of failure, Netflix also had to build its systems with more resilience than it normally would. Each system had to survive the failure of one or more external apps or components without freaking out.
They also had to promise to wash all their own dishes and take their turns vacuuming the common areas.
The coolest thing the Netflix development crew did, though, was write a piece of software called the Chaos Monkey, which roams the system and randomly shuts down application instances or services. That keeps the architecture under constant test, to be sure it will survive unexpected outages of either its own processes or someone else's.
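For a rough feel of how that works, here is a minimal Python sketch of the idea. It is not Netflix's code: the Instance, Service and ChaosMonkey classes, the instance names and the failover-by-trying-the-next-instance logic are all hypothetical stand-ins, and nothing here talks to AWS. It only shows the principle of killing a random live instance and checking that the service still answers.

```python
import random


class Instance:
    """Hypothetical stand-in for one running application instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Hypothetical pool of redundant instances with naive failover."""

    def __init__(self, names):
        self.instances = [Instance(n) for n in names]

    def handle(self, request):
        # Try each instance in turn, skipping any that have been shut down.
        for inst in self.instances:
            try:
                return inst.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instances left")


class ChaosMonkey:
    """Randomly shuts down live instances of a service (illustrative only)."""

    def __init__(self, service, rng=None):
        self.service = service
        # An injectable random source makes test runs repeatable.
        self.rng = rng or random.Random()

    def strike(self):
        """Shut down one randomly chosen live instance; return it, or None."""
        live = [i for i in self.service.instances if i.alive]
        if not live:
            return None
        victim = self.rng.choice(live)
        victim.alive = False
        return victim


if __name__ == "__main__":
    service = Service(["app-1", "app-2", "app-3"])
    monkey = ChaosMonkey(service)
    for _ in range(2):
        victim = monkey.strike()
        # The service should keep answering while any instance survives.
        print(f"killed {victim.name}; {service.handle('GET /queue')}")
```

The point of the exercise is the pairing: the monkey is only useful if something like the failover loop in Service.handle exists for it to exercise, so a request that fails after a strike is a bug found on your own schedule rather than during a real outage.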
It's not a complex story the way Ciancutti tells it, but it makes clear that, given a clearly defined set of requirements, a willingness to experiment with new situations and the patience to keep tinkering until you get things right, it's possible to use external cloud services for resource-intensive systems that touch customers directly and can't be allowed to fail.
The most common request from commenters, interestingly, was for an open-source version of Chaos Monkey. If there were a plush version, I'd give them out as Christmas gifts.