Netflix is the birthplace of chaos engineering, an increasingly significant approach to how complex modern technology architectures are developed. It essentially means that as you’re binging on your favourite Netflix show, the platform is testing its software while you watch.
The practice of chaos engineering began when Netflix’s core business was online DVD rentals. A single database corruption meant a big systems outage, which delayed the shipping of DVDs for three days. This prompted Netflix’s engineers to migrate from a monolithic on-premises software stack to a distributed cloud-based architecture running on Amazon Web Services (AWS).
While the move to a distributed architecture with hundreds of microservices eliminated the single point of failure, it also created a much more complex system to manage and maintain. This led to a counterintuitive realisation: to reduce the risk of serious outages, the Netflix engineering team needed to get used to failing regularly!
Enter Chaos Monkey: a tool Netflix built to roam across its intricate architecture and cause failures in random places and at arbitrary intervals. With it, the team could quickly verify whether its services were robust and resilient enough to withstand unplanned incidents.
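To make the idea of random failure injection concrete, here is a minimal sketch in Python. It is a toy simulation rather than Chaos Monkey's actual implementation: the `chaos_round` function, the `cluster` dictionary and the rule of never removing a service's last instance are all assumptions made for illustration.

```python
import random


def chaos_round(cluster, rng, probability=0.5):
    """Terminate at most one random instance per service.

    `cluster` (hypothetical) maps a service name to a list of running
    instance ids. Returns the (service, instance) pairs terminated.
    """
    terminated = []
    for service, instances in cluster.items():
        # Assumption for this toy model: never remove the last instance.
        if len(instances) > 1 and rng.random() < probability:
            victim = rng.choice(instances)
            instances.remove(victim)
            terminated.append((service, victim))
    return terminated


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    cluster = {
        "playback": ["p1", "p2", "p3"],
        "recommendations": ["r1", "r2"],
        "billing": ["b1"],
    }
    print("terminated:", chaos_round(cluster, rng))
    print("remaining:", cluster)
```

Seeding the random generator is a deliberate choice here: a failure that exposes a weakness is far more useful if the same run can be replayed.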
This was the beginning of chaos engineering – the practice of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent conditions in production and unexpected failures.
Chaos Monkey is open source, and a growing number of organisations, such as Amazon, Google and Nike, now use chaos engineering in their architectures. But how chaotic can chaos engineering really get?
Successful chaos engineering includes a series of thoughtful, planned and controlled experiments, designed to demonstrate how your systems behave in the face of failure.
Ironically, this sounds like the opposite of chaos. However, practitioners must keep in mind that the goal is to learn in order to prepare for the unexpected. Modern software systems are often too complex to understand fully, so the discipline relies on experiments to expose what is not yet known. Each chaos engineering experiment expands our knowledge of systemic weaknesses.
Before chaos engineering can be put into practice, your systems must first have some level of stability. We do not recommend inducing chaos if you are constantly fighting fires. Once that stability is in place, here are some key tips for conducting successful chaos engineering experiments:
01. Define the steady state
Begin by identifying metrics that indicate your systems are healthy and functioning as they should. Netflix uses ‘stream starts per second’ (the rate at which customers press the play button on a video streaming device) to measure its steady state.
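A steady-state check of this kind can be sketched in a few lines of Python. This is only an illustration, not Netflix's method: the function names, the event timestamps and the 10% tolerance are assumptions chosen for the example.

```python
def starts_per_second(event_times, window_start, window_seconds):
    """Rate of play events (timestamps in seconds) inside one time window."""
    window_end = window_start + window_seconds
    count = sum(1 for t in event_times if window_start <= t < window_end)
    return count / window_seconds


def is_steady(observed, baseline, tolerance=0.1):
    """True when `observed` is within `tolerance` (a fraction) of `baseline`."""
    return abs(observed - baseline) <= tolerance * baseline


if __name__ == "__main__":
    # Hypothetical timestamps of customers pressing play.
    plays = [0.5, 1.0, 1.5, 2.0, 9.9, 10.0]
    rate = starts_per_second(plays, window_start=0, window_seconds=10)
    print("rate:", rate, "steady:", is_steady(rate, baseline=0.5))
```

In practice the baseline would come from historical data for the same time of day, since a healthy rate at peak viewing time differs from a healthy rate overnight.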
02. Create a hypothesis
Every experiment needs a hypothesis to test. As you’re trying to disrupt the steady state, your hypothesis should look something like: ‘When we do X, there should be no change in the steady state of this system’. All chaos engineering activities should involve real experiments, using real unknowns.
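The hypothesis ‘when we do X, the steady state does not change’ maps naturally onto a small experiment harness. The sketch below is one possible shape under stated assumptions: `run_experiment` and the `measure`, `inject` and `rollback` callables are hypothetical names, and the toy service with replicas stands in for a real system.

```python
def run_experiment(measure, inject, rollback, tolerance=0.1):
    """Test the hypothesis that injecting a failure leaves the metric steady.

    `measure` returns the steady-state metric, `inject` applies the failure
    ("X") and `rollback` undoes it. All three are caller-supplied callables.
    """
    baseline = measure()
    try:
        inject()
        observed = measure()
    finally:
        rollback()  # always restore the system, even if measuring fails
    held = abs(observed - baseline) <= tolerance * abs(baseline)
    return {"baseline": baseline, "observed": observed, "hypothesis_held": held}


if __name__ == "__main__":
    # Toy system: each replica serves 40 requests/s against a demand of 100.
    state = {"replicas": 3}

    def measure():
        return min(100, 40 * state["replicas"])

    def inject():
        state["replicas"] -= 1  # X = lose one replica

    def rollback():
        state["replicas"] += 1

    print(run_experiment(measure, inject, rollback))
```

Running the rollback in a `finally` block keeps the experiment controlled: the injected failure is removed whether or not the hypothesis holds.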
03. Consider real world scenarios
For optimal results, ask ‘What could go wrong?’ and then simulate that. Prioritise the potential failures as well, so that the scenarios that matter most are tested first.
Chaos engineering might seem scary at first, but when done in a controlled way it can be invaluable for understanding how complex modern systems can be made more resilient and robust. Learning to embrace organised chaos will help your teams fully understand how efficient and resilient your systems are under hazardous conditions.