Netflix shows how to build a big business on the cloud

Most cloud projects are still limited to pilots or cloudbursting

Most of the attention Blockbuster-slayer Netflix has gotten recently has been about its conflict with Comcast, which has become the poster child for Grinchlike bandwidth rationing and secretive throttling of all the good stuff customers wanted when Comcast reps gave them the hard sell on high-capacity connections.

At least as interesting is how Netflix has handled a subscriber base that grew 270 percent between January 2007, when it introduced its media-streaming service, and the end of September 2010.

That kind of growth would be hard on the logistics and IT operations of any company, especially with 17 million small customers rather than one big customer who pays a cost equivalent to 17 million monthly subscriptions. (In which case it wouldn't need a database with 17 million records, but each database field would have to be really, really big.)

How'd Netflix scale that quickly without breaking the bank? (Scale on the IT side, I mean. Logistics only involves the real world, not technology, so who cares.)

Same way everyone else does: hired help. Sorry -- partnerships and outsourcing.

Its most important partnership for the streaming service, though also its newest one, is with Level 3, a content-distribution-network provider that shifts Netflix content onto its own servers in data centers around the world. Physical proximity and high-capacity backbone links allow Level 3 to get content to customers a lot faster than Netflix could by itself, at least until the content hits the Last Mile Tollbooth and has to wake up the Comcast guy to let part of it through.

Much earlier, and to a much more significant degree, it started moving a lot of its Web-site and search-engine traffic, streaming servers, data storage, caching, databases and applications out of its own data center and onto Amazon Web Services (AWS).

Rapid growth makes five-nines availability more important, not less, so when the aging three-tier IT infrastructure started to look creaky, the company had to decide whether or not to build a series of its own data centers at high cost in both capital and time, according to Netflix spokesblogger John Ciancutti.

"We could have chosen to build out new data centers, build our own redundancy and failover, data synchronization systems, etc. Or, we could opt to write a check to someone else to do that instead," he wrote.

The migration was an incredible amount of work, but AWS handles the "undifferentiated heavy lifting" -- tons of capacity that can help Netflix compensate for being "not very good at predicting customer growth or device engagement."

The biggest problems were adjusting to servers and networks picked out by someone else -- which meant different requirements for memory management, different expectations for the rate of component failure and variable latency in network connections.

In its own data center, Netflix could control for those variables. In Amazon's, it had to share infrastructure with other tenants. That meant either handling all the systems management with its own IT crew rather than Amazon's, or writing apps so they could learn how to share -- putting a stalled sub-task on hold to let someone else's task take over the processor or memory resource first.

Without being able to predict mean rates of failure, Netflix also had to build its systems with more resilience than it normally would. Each system had to survive the failure of one or more external apps or components without freaking out.

They also had to promise to wash all their own dishes and take their turns vacuuming the common areas.
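
To make "survive the failure of an external component" a little more concrete, here is a minimal Python sketch of one common approach: retry the dependency a couple of times, then fall back to a degraded answer instead of crashing. It is only an illustration, not anything Ciancutti describes; the function names (call_with_fallback, flaky_recommendations, popular_titles) are hypothetical.

```python
def call_with_fallback(primary, fallback, retries=2):
    """Call primary() up to `retries` times; if every attempt raises, use fallback()."""
    for _ in range(retries):
        try:
            return primary()
        except Exception:
            # The dependency failed; try again until attempts run out.
            continue
    return fallback()


# Hypothetical external dependency that is currently down.
def flaky_recommendations():
    raise ConnectionError("recommendation service unavailable")


# Hypothetical degraded answer that needs no external service.
def popular_titles():
    return ["generic popular titles"]


if __name__ == "__main__":
    # The caller still gets a usable answer even though the dependency failed.
    print(call_with_fallback(flaky_recommendations, popular_titles))
```

The point of the design is that a failure in someone else's component turns into a worse answer rather than no answer at all.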

The coolest thing the Netflix development crew did, though, was a piece of software called the Chaos Monkey, which would roam the system and randomly shut down application instances or services, constantly testing the architecture to be sure it would survive unexpected outages of its own processes or anyone else's.
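
Netflix hasn't published the Chaos Monkey itself, so the following is only a toy Python sketch of the idea as described here: walk a list of instances, shut some down at random, then check whether the service as a whole is still up. The Instance class, unleash_monkey, service_is_up and the kill_probability parameter are all invented for illustration and say nothing about how Netflix actually built it.

```python
import random


class Instance:
    """Hypothetical stand-in for one running application instance."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def terminate(self):
        self.running = False


def unleash_monkey(instances, kill_probability, rng):
    """Randomly terminate running instances; return the names that were killed."""
    killed = []
    for inst in instances:
        if inst.running and rng.random() < kill_probability:
            inst.terminate()
            killed.append(inst.name)
    return killed


def service_is_up(instances):
    """Assumption: the service survives while at least one instance is running."""
    return any(inst.running for inst in instances)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    fleet = [Instance(f"web-{i}") for i in range(5)]
    print("killed:", unleash_monkey(fleet, kill_probability=0.3, rng=rng))
    print("service up:", service_is_up(fleet))
```

Run often enough against a real system, a check like this would presumably surface the single points of failure long before a real outage does.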

It's not a complex story the way Ciancutti tells it, but it makes clear that, given a clearly defined set of requirements, willingness to experiment with new situations and patience to keep tinkering until you get things right, it's possible to use external cloud services for resource-intensive systems that touch customers directly and can't be allowed to fail.

The most common request from commenters, interestingly, was for an open-source version of Chaos Monkey. If there were a plush version, I'd give them out as gifts for Christmas.

Kevin Fogarty writes about enterprise IT for ITworld. Follow him on Twitter @KevinFogarty.
