More important, the fixes needed to address these problems cannot be known in advance. Outages may therefore be lengthy as development organizations sort through layer upon layer of complexity to understand the problem and design a fix. Clearly, if this approach is inadequate for the new application model, something different is required: an approach that is appropriate for the new application type.
Netflix Uses Army of Monkeys to Make Apps Robust
The company that best represents this new application architecture and topology is undoubtedly Netflix. It runs a highly decentralized application composed of independent services that are aggregated to provide specific functionality. Each of the services operates separately, and the resulting application is unique for every user. Finally, the entire collection of services runs in Amazon Web Services.
The approach Netflix has taken to address the problems associated with these new applications reveals a new pattern of resilience, one we will see more of in the future as companies move to this new application design orientation. Its approach might be summed up as "If it's not broke, break it": cause failures deliberately to be sure the application is robust in the face of unexpected ones.
Netflix began its resilience approach in a straightforward way. It developed a tool that shuts down instances within the underlying AWS infrastructure without warning, to ensure its application is robust in the face of resource failure. It dubbed this tool Chaos Monkey. It has since developed other tools to improve resilience, uses "Monkey" as a standard naming convention and refers to the collection of tools as the Simian Army.
Janitor Monkey cleans up after the application by shutting down unneeded instances; Security Monkey finds instances with improper security settings and shuts them down; and Doctor Monkey tracks instance behavior and shuts down instances that have poor response time or show high resource use without corresponding useful activity.
Netflix even has the Chaos Gorilla, which simulates an outage of an entire AWS availability zone. Netflix is currently AWS region-bound, but is actively exploring how to spread its application across regions to further improve resilience. One can be sure that another tool will be created to validate resilience in the event of an entire region going down. Perhaps it will be called Chaos King Kong.
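To make the idea concrete, here is a minimal sketch in Python of the two failure-injection behaviors described above: stopping one instance at random and taking out a whole availability zone. It is not Netflix's code and makes no AWS calls. The fleet is an in-memory stand-in, and every name in it (Instance, Fleet, chaos_monkey, chaos_gorilla, the zone labels and the "at least one instance running" health rule) is a hypothetical chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for a cloud instance; not an AWS object.
    instance_id: str
    zone: str
    running: bool = True


@dataclass
class Fleet:
    instances: list

    def running(self):
        """Return the instances that are still up."""
        return [i for i in self.instances if i.running]

    def is_serving(self):
        # Assumed health rule: the application is up while any instance runs.
        return len(self.running()) > 0


def chaos_monkey(fleet, rng):
    """Stop one randomly chosen running instance and return it (or None)."""
    candidates = fleet.running()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def chaos_gorilla(fleet, zone):
    """Stop every running instance in one zone and return those stopped."""
    victims = [i for i in fleet.running() if i.zone == zone]
    for instance in victims:
        instance.running = False
    return victims


def build_fleet():
    """Build a four-instance fleet spread evenly across two made-up zones."""
    zones = ["zone-a", "zone-a", "zone-b", "zone-b"]
    return Fleet([Instance(f"i-{n}", z) for n, z in enumerate(zones)])


if __name__ == "__main__":
    fleet = build_fleet()
    victim = chaos_monkey(fleet, random.Random())
    print("stopped", victim.instance_id, "still serving:", fleet.is_serving())
    chaos_gorilla(fleet, "zone-a")
    print("zone-a down, still serving:", fleet.is_serving())
```

Because the sample fleet is spread across two zones, it keeps serving after both the single-instance failure and the zone outage. Put all four instances in one zone and the second check fails, which is exactly the kind of weakness this style of testing is meant to expose before a real outage does.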