Ensure cloud application resilience the Netflix way

By Bernard Golden, CIO |  Cloud Computing, Amazon Web Services, Netflix

More important, the fixes needed to address these problems are unknowable in advance. This means outages may be lengthy as development organizations sort through layer upon layer of complexity to understand the problem and design a fix. Clearly, if this approach is inadequate for the new application model, something different is required: an approach suited to the new application type.

Netflix Uses Army of Monkeys to Make Apps Robust

The company that best represents this new application architecture and topology is undoubtedly Netflix. It runs a highly decentralized application composed of independent services that are aggregated to provide specific functionality. Each of the services operates separately, and the resulting application is unique for every user. Finally, the entire collection of services runs in Amazon Web Services.

The approach Netflix has taken to address the problems associated with these new applications reveals a new pattern of resilience, one we will see more of in the future as companies move to this new application design orientation. Its approach might be summed up as "If it's not broke, break it": induce failures deliberately to be sure the application is robust in the face of unexpected ones.

Netflix began its resilience approach in a straightforward way. It developed a tool that unexpectedly shuts down instances within the underlying AWS infrastructure to ensure its application is robust in the face of resource failure. It dubbed this tool Chaos Monkey. It has since developed other tools to improve resilience, uses Monkey as a standard naming convention, and refers to the collection of tools as the Simian Army.
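
The idea behind Chaos Monkey can be illustrated with a minimal simulation. The sketch below is not Netflix's tool and makes no AWS calls; the `Instance` class, the `unleash_chaos` and `service_is_up` functions, and the termination probability are all hypothetical names and values chosen for illustration. It randomly "terminates" instances in an in-memory fleet and then checks whether each service still has at least one instance running.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance; not an AWS API object."""
    instance_id: str
    service: str
    running: bool = True


def unleash_chaos(instances, rng, probability=0.2):
    """Shut down each running instance with the given probability.

    Returns the list of instances that were shut down in this round.
    """
    victims = []
    for inst in instances:
        if inst.running and rng.random() < probability:
            inst.running = False  # simulated termination, no real API call
            victims.append(inst)
    return victims


def service_is_up(instances, service):
    """A service survives if at least one of its instances is still running."""
    return any(i.running for i in instances if i.service == service)


if __name__ == "__main__":
    services = ["api", "api", "api", "billing", "billing"]
    fleet = [Instance(f"i-{n:04d}", s) for n, s in enumerate(services)]
    killed = unleash_chaos(fleet, random.Random(), probability=0.3)
    print("terminated:", [i.instance_id for i in killed])
    for name in ("api", "billing"):
        print(name, "up" if service_is_up(fleet, name) else "DOWN")
```

Running a check like this repeatedly is what exposes a service that depends on a single instance: sooner or later the random selection takes that instance out, and the failure shows up under controlled conditions rather than during a real outage.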

Janitor Monkey cleans up after the application by shutting down unneeded instances. Security Monkey finds instances with improper security settings and shuts them down. Doctor Monkey tracks instance behavior and shuts down instances that have poor response time or show high resource use without corresponding useful activity.
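
The Doctor Monkey rule described above amounts to a simple health predicate. The following sketch assumes a hypothetical `Metrics` record and made-up thresholds; none of these names or numbers come from Netflix. It flags an instance when its response time is poor, or when its CPU use is high while it serves almost no requests.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical per-instance measurements gathered by a monitor."""
    instance_id: str
    avg_response_ms: float
    cpu_percent: float
    requests_per_sec: float


# Illustrative thresholds only; not values published by Netflix.
MAX_RESPONSE_MS = 500.0
HIGH_CPU_PERCENT = 80.0
MIN_USEFUL_RPS = 1.0


def is_unhealthy(m):
    """True if the instance is slow, or busy without doing useful work."""
    too_slow = m.avg_response_ms > MAX_RESPONSE_MS
    busy_but_idle = (
        m.cpu_percent > HIGH_CPU_PERCENT
        and m.requests_per_sec < MIN_USEFUL_RPS
    )
    return too_slow or busy_but_idle


def instances_to_shut_down(metrics):
    """Return the IDs of instances that the health rule would remove."""
    return [m.instance_id for m in metrics if is_unhealthy(m)]
```

An instance with high CPU use and a healthy request rate is left alone under this rule, since the resource use corresponds to useful activity.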

Netflix even has the Chaos Gorilla, which simulates an outage of an entire AWS availability zone. Netflix is currently AWS region-bound, but is actively exploring how to spread its application across regions to further improve resilience. One can be sure that another tool will be created to validate resilience in the event of an entire region going down. Perhaps it will be called Chaos King Kong.
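
A zone-level failure test can be reasoned about in the same spirit. The sketch below is an assumption-laden illustration, not Netflix's Chaos Gorilla: the `services_lost` function and the placement data are hypothetical. Given where each service's instances are placed, it reports which services would have no instances left if one availability zone went dark.

```python
def services_lost(placements, failed_zone):
    """Return services with no instance outside the failed zone.

    placements is a list of (service, zone) pairs, one per instance.
    """
    all_services = {service for service, _ in placements}
    survivors = {service for service, zone in placements if zone != failed_zone}
    return sorted(all_services - survivors)


if __name__ == "__main__":
    # Hypothetical placement of instances across two zones.
    placements = [
        ("api", "zone-a"),
        ("api", "zone-b"),
        ("billing", "zone-a"),
    ]
    for zone in ("zone-a", "zone-b"):
        print(zone, "outage loses:", services_lost(placements, zone))
```

In this example placement, `billing` runs only in `zone-a`, so an outage there takes it down entirely, while `api` survives the loss of either zone.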


Originally published on CIO.