Netflix uncages Chaos Monkey disaster testing system

By Brandon Butler, Network World |  Cloud Computing, Amazon Web Services, GitHub

Netflix has released Chaos Monkey, a tool it uses internally to test the resiliency of its Amazon Web Services cloud computing architecture, making freely available one of the tools the video streaming company relies on to keep its massive cloud infrastructure running.

Chaos Monkey is a free download available from GitHub as of today. It works by randomly terminating instances of virtual machines in applications, simulating what would happen during a disaster event. "The best defense against major unexpected failures is to fail often," Netflix officials wrote in the blog post titled "Chaos Monkey released into the wild."

Just how dependable providers' public cloud computing offerings are came into focus this summer, when Amazon Web Services suffered a significant outage that brought down Netflix as well as other media companies, including Instagram and Pinterest. Salesforce.com, the major software-as-a-service (SaaS) provider, was also hit with two outages in as many weeks earlier this summer.

Chaos Monkey can be configured to work on the Amazon Web Services offering or, with some tweaking, on other cloud computing offerings. It can be scheduled to run at various frequencies and at various times of the day, for example an average of once a week or once a day. In practice, a highly resilient cloud should automatically detect the outage and spin up new, identically configured virtual machines that keep the application running with no visible impact to the user.
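The behavior described above can be illustrated with a small simulation. This is only a sketch of the general idea, not Netflix's code or the Chaos Monkey configuration format: `InstanceGroup`, `chaos_run`, `termination_probability`, and the 9-to-15 hour window are hypothetical names and values chosen for illustration. It shows how a per-run kill probability can produce an average frequency such as once a day, and how a self-healing group restores its size after a termination.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstanceGroup:
    """Hypothetical stand-in for a group of identically configured VMs."""
    name: str
    desired_size: int
    instances: list = field(default_factory=list)
    _next_id: int = 0

    def launch(self):
        """Start one new instance and return its (made-up) id."""
        instance_id = f"{self.name}-{self._next_id}"
        self._next_id += 1
        self.instances.append(instance_id)
        return instance_id

    def heal(self):
        """Launch replacements until the group is back at its desired size."""
        launched = []
        while len(self.instances) < self.desired_size:
            launched.append(self.launch())
        return launched


def termination_probability(mean_kills_per_day, runs_per_day):
    """Per-run kill chance that yields the requested average frequency."""
    return min(1.0, mean_kills_per_day / runs_per_day)


def in_window(hour, start_hour=9, end_hour=15):
    """Assumed testing window; the hours are illustrative, not from the source."""
    return start_hour <= hour < end_hour


def chaos_run(group, hour, probability, rng):
    """Maybe terminate one randomly chosen instance; return its id or None."""
    if not in_window(hour) or not group.instances:
        return None
    if rng.random() >= probability:
        return None
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    group = InstanceGroup("api", desired_size=3)
    group.heal()
    # Six hourly runs inside the window, averaging one termination per day.
    chance = termination_probability(mean_kills_per_day=1, runs_per_day=6)
    for hour in range(24):
        victim = chaos_run(group, hour, chance, rng)
        if victim is not None:
            print(f"hour {hour}: terminated {victim}, relaunched {group.heal()}")
```

Setting `mean_kills_per_day` to `1 / 7` would approximate the once-a-week average mentioned above. A real deployment would call the cloud provider's API instead of editing a list in memory.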

Netflix says it has run Chaos Monkey internally to create 65,000 failed instances across its system. "Failures happen and they inevitably happen when least desired or expected," the blog reads, continuing later: "Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that 'simple fix' you put in place last week could have undesired consequences."


Originally published on Network World.