The Origin of Chaos Monkey - Why Netflix Needed to Create Failure

In this chapter we’ll take a deep dive into the origins and history of Chaos Monkey, how Netflix streaming services emerged, and why Netflix needed to create failure within their systems to improve their service and customer experiences. We’ll also provide a brief overview of the Simian Army and its relation to the original Chaos Monkey technology. Finally, we’ll jump into the present and future of Chaos Monkey, dig into the creation and implementation of Failure Injection Testing at Netflix, and discuss the potential issues and limitations presented by Chaos Monkey’s reliance on Spinnaker.

The History of Netflix Streaming

Netflix launched their streaming service in early 2007 as a free add-on for their existing DVD-by-mail subscribers. While the streaming library contained only around 1,000 titles at launch, popularity and demand continued to rise, and Netflix kept adding to the library, reaching over 12,000 titles by June 2009.

Netflix’s streaming service was initially built by Netflix engineers on top of Microsoft software and housed within vertically scaled server racks. However, this single point of failure came back to bite them in August 2008, when a major database corruption resulted in a three-day downtime during which DVDs couldn’t be shipped to customers. Following this event, Netflix engineers began migrating the entire Netflix stack away from a monolithic architecture and into a distributed cloud architecture deployed on Amazon Web Services.

This major shift toward a distributed architecture of hundreds of microservices introduced a great deal of additional complexity. With so many interconnected services, the system’s failure modes became intractable to reason about, and a new approach was required to prevent seemingly random outages. By using proper Chaos Engineering techniques, starting with Chaos Monkey and evolving into more sophisticated tools like FIT, Netflix was able to engineer a resilient architecture.

Netflix’s move toward a horizontally scaled software stack required systems that were much more reliable and fault tolerant. One of the most critical lessons was that “the best way to avoid failure is to fail constantly.” The engineering team needed a tool that could proactively inject failure into the system. This would show the team how the system behaved under abnormal conditions, and would teach them how to alter the system so other services could easily tolerate future, unplanned failures. Thus, the Netflix team began their journey into Chaos.

The Simian Army

The Simian Army is a suite of failure injection tools created by Netflix that address some of the limitations of Chaos Monkey’s scope. Check out the Simian Army - Overview and Resources chapter for all the details on what the Simian Army is, why it was created, the tools that make up the Army, the strategies used to perform various Chaos Experiments, and a tutorial to help you install and begin using the Simian Army tools.

Chaos Monkey Today

Chaos Monkey 2.0 was announced and publicly released on GitHub in late 2016. The new version includes a handful of major feature changes and additions.

  • Spinnaker Requirement: Spinnaker is an open-source, multi-cloud continuous delivery platform developed by Netflix, which allows for automated deployments across multiple cloud providers like AWS, Azure, Kubernetes, and a few more. One major drawback of using Chaos Monkey is that it forces you and your organization to build atop Spinnaker’s CD architecture. If you need some guidance on that, check out our Spinnaker deployment tutorials.
  • Improved Scheduling: Instance termination schedules are no longer determined by probabilistic algorithms, but are instead based on the mean time between terminations. Check out How to Schedule Chaos Monkey Terminations for technical instructions.
  • Trackers: Trackers are Go language objects that report instance terminations to external services.
  • Loss of Additional Capabilities: Prior to 2.0, Chaos Monkey was capable of performing additional actions beyond just terminating instances. With version 2.0, those capabilities have been removed from Chaos Monkey and moved to other Simian Army tools.
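
To make the idea of a tracker more concrete, here is a minimal Go sketch of an object that reports instance terminations to an external service. Every name in it (Termination, Tracker, MemoryTracker, reportAll) is a hypothetical stand-in chosen for illustration, and the in-memory tracker simply records events rather than calling a real service. Treat it as a sketch of the concept under those assumptions, not as Chaos Monkey’s actual API.

```go
package main

import (
	"fmt"
	"time"
)

// Termination describes a single instance kill.
// Hypothetical type for illustration; not taken from Chaos Monkey's source.
type Termination struct {
	InstanceID string
	App        string
	Time       time.Time
}

// Tracker is anything that can report a termination to an external service.
// The interface name and method are assumptions made for this sketch.
type Tracker interface {
	Track(t Termination) error
}

// MemoryTracker stands in for an external service by recording events in memory.
type MemoryTracker struct {
	Events []Termination
}

// Track appends the termination to the in-memory event list.
func (m *MemoryTracker) Track(t Termination) error {
	m.Events = append(m.Events, t)
	return nil
}

// reportAll sends one termination to every tracker and stops at the first error.
func reportAll(trackers []Tracker, t Termination) error {
	for _, tr := range trackers {
		if err := tr.Track(t); err != nil {
			return fmt.Errorf("tracking %s: %w", t.InstanceID, err)
		}
	}
	return nil
}

func main() {
	mem := &MemoryTracker{}
	t := Termination{InstanceID: "i-0123456789abcdef0", App: "example-app", Time: time.Now()}
	if err := reportAll([]Tracker{mem}, t); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("recorded %d termination(s)\n", len(mem.Events))
}
```

In a real setup, the in-memory tracker would be swapped for one that posts each event to whatever monitoring or event system your team already uses.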

Failure Injection Testing

In October 2014, dissatisfied with the lack of control introduced when unleashing some of the Simian Army tools, Netflix introduced a solution they called Failure Injection Testing (FIT). Built by a small team of Netflix engineers – including Gremlin Co-Founder and CEO Kolton Andrus – FIT added dimensions to the failure injection process, allowing Netflix to more precisely determine what was failing and which components that failure impacted.

FIT works by first pushing failure simulation metadata to Zuul, which is an edge service developed by Netflix. Zuul handles all requests from devices and applications that utilize the back end of Netflix’s streaming service. As of version 2.0, Zuul can handle dynamic routing, monitoring, security, resiliency, load balancing, connection pooling, and more. The core functionality of Zuul’s business logic comes from Filters, which behave like simple pass/fail tests applied to each request and determine if a given action should be performed for that request. A filter can handle actions such as adding debug logging, determining if a response should be GZipped, or attaching injected failure, as in the case of FIT.
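
The filter idea described above can be sketched in a few lines of Go. Zuul itself is a Java project, so this is not Zuul’s API: the Request, Filter, DebugFilter, FailureFilter, and Apply names, the X-Debug header, and the context keys are all hypothetical, chosen only to show how a pass/fail test can decide whether an action (such as attaching an injected failure) runs for a given request.

```go
package main

import "fmt"

// Request is a simplified stand-in for a request arriving at the edge service.
type Request struct {
	Path    string
	Headers map[string]string
	Context map[string]string // per-request metadata written by filters
}

// NewRequest builds a request with empty header and context maps.
func NewRequest(path string) *Request {
	return &Request{Path: path, Headers: map[string]string{}, Context: map[string]string{}}
}

// Filter pairs a pass/fail test with an action to run when the test passes.
type Filter interface {
	ShouldFilter(r *Request) bool
	Run(r *Request)
}

// DebugFilter turns on debug logging for requests that ask for it.
type DebugFilter struct{}

func (DebugFilter) ShouldFilter(r *Request) bool { return r.Headers["X-Debug"] == "true" }
func (DebugFilter) Run(r *Request)               { r.Context["debug-logging"] = "on" }

// FailureFilter attaches simulated-failure metadata to requests for one path.
type FailureFilter struct {
	TargetPath string
	Failure    string
}

func (f FailureFilter) ShouldFilter(r *Request) bool { return r.Path == f.TargetPath }
func (f FailureFilter) Run(r *Request)               { r.Context["injected-failure"] = f.Failure }

// Apply runs each filter's action only if that filter's test passes.
func Apply(filters []Filter, r *Request) {
	for _, f := range filters {
		if f.ShouldFilter(r) {
			f.Run(r)
		}
	}
}

func main() {
	filters := []Filter{
		DebugFilter{},
		FailureFilter{TargetPath: "/api/recommendations", Failure: "latency"},
	}
	req := NewRequest("/api/recommendations")
	Apply(filters, req)
	fmt.Println(req.Context)
}
```

Keeping the test and the action separate is what lets a failure be scoped precisely: only requests that match the filter’s condition carry the injected-failure metadata, and everything else passes through untouched.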

The introduction of FIT into Netflix’s failure injection strategy was a good move toward better, modern-day Chaos Engineering practices. Since FIT is a service unto itself, it allowed failure to be injected by a variety of teams, who could then perform proactive Chaos Experiments with greater precision. This allowed Netflix to emphasize a core discipline of Chaos Engineering: testing for failure in every nook and cranny, which built confidence that their systems were resilient to truly unexpected failures.

Unlike Chaos Monkey, tools like FIT and Gremlin are able to test for a wide range of failure states beyond simple instance destruction. In addition to killing instances, Gremlin can fill available disk space, hog CPU and memory, overload IO, perform advanced network traffic manipulation, terminate processes, and much more.

Chaos Monkey and Spinnaker

As discussed above and later in our Spinnaker Quick Start guide, Chaos Monkey can only be used to terminate instances within an application managed by Spinnaker.

This requirement is not a problem for Netflix or those other companies (such as Waze) that use Spinnaker to great success. However, limiting your Chaos Engineering tools and practices to just Chaos Monkey also means limiting yourself to only Spinnaker as your continuous delivery and deployment solution. This is a great solution if you’re looking to tightly integrate with all the tools Spinnaker brings with it. On the other hand, if you’re looking to expand out into other tools this may present a number of potential issues:

  • Setup and Propagation: Spinnaker requires quite a bit of investment in server setup and propagation. As you may notice in even the streamlined, provider-specific tutorials found later in this guide, getting Spinnaker up and running in a production environment takes a lot of time (and a hefty number of CPU cycles).
  • Limited Documentation: Spinnaker’s official documentation is rather limited and somewhat outdated in certain areas.
  • Provider Support: Spinnaker currently supports most of the big-name cloud providers, but if your use case requires a provider outside of this list you’re out of luck (or will need to develop your own Clouddriver integration).