How to make the most out of chaos engineering

Netflix is the birthplace of chaos engineering, an increasingly influential approach to building and operating complex modern technology architectures. In essence, while you’re binging on your favourite Netflix show, the platform is deliberately injecting failures into its live systems to test how well its software copes.

The practice of chaos engineering began when Netflix’s core business was online DVD rentals. A single database corruption meant a big systems outage, which delayed the shipping of DVDs for three days. This prompted Netflix’s engineers to migrate from a monolithic on-premises software stack to a distributed cloud-based architecture running on Amazon Web Services (AWS).

While moving to a distributed architecture made up of hundreds of microservices eliminated the single point of failure, it also created a much more complex system to manage and maintain. This led to a counterintuitive realisation: to make the system resilient to failure, the Netflix engineering team needed to get used to failing regularly.

01. Define your steady state

Begin by identifying metrics that indicate your systems are healthy and functioning as they should. Netflix uses ‘stream starts per second’ (SPS), the rate at which customers press the play button on a video streaming device, to measure its steady state.
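As a rough illustration of what a steady-state check could look like, the Python sketch below builds a tolerance band around readings taken while the system is known to be healthy. The sample values, the 5% tolerance and the function names are all assumptions made for this example; they are not Netflix's figures or tooling.

```python
from statistics import mean


def steady_state_band(baseline_samples, tolerance=0.05):
    """Return (low, high) bounds around the mean of healthy-period samples."""
    centre = mean(baseline_samples)
    return centre * (1 - tolerance), centre * (1 + tolerance)


def is_steady(current_samples, band):
    """True if the mean of the current samples falls inside the band."""
    low, high = band
    return low <= mean(current_samples) <= high


# Hypothetical starts-per-second readings taken during normal operation.
baseline = [980, 1010, 1005, 995, 1010]
band = steady_state_band(baseline, tolerance=0.05)

print(is_steady([1000, 990], band))  # readings close to the baseline
print(is_steady([700, 720], band))   # readings well below the baseline
```

A fixed percentage band is the simplest possible choice. A metric with strong daily or weekly cycles would probably need a baseline taken from the same time window instead.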

02. Create a hypothesis

Every experiment needs a hypothesis to test. Because you’re trying to disrupt the steady state, your hypothesis should look something like: ‘When we do X, there should be no change in the steady state of this system.’ All chaos engineering activities should be real experiments that probe real unknowns.
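One way to picture such a hypothesis is as a small test harness: measure the metric, inject the fault, measure again, always roll back, and then compare. The sketch below is a minimal version of that idea. `run_experiment`, `FakeService` and the 5% tolerance are hypothetical names and values invented for illustration, and the toy service stands in for a real system.

```python
def run_experiment(measure, inject, rollback, tolerance=0.05):
    """Check the hypothesis 'injecting the fault does not change the metric'."""
    before = measure()
    if before == 0:
        raise ValueError("baseline metric is zero; system is not in a steady state")
    inject()
    try:
        during = measure()
    finally:
        rollback()  # always undo the fault, even if measuring fails
    drift = abs(during - before) / before
    return {
        "before": before,
        "during": during,
        "drift": drift,
        "hypothesis_holds": drift <= tolerance,
    }


class FakeService:
    """Toy stand-in for a real system (hypothetical, for illustration only)."""

    def __init__(self, replicas=3):
        self.replicas = replicas
        self._saved = replicas

    def starts_per_second(self):
        # Toy model: traffic is served as long as one replica survives.
        return 1000.0 if self.replicas > 0 else 0.0

    def kill(self, count):
        self._saved = self.replicas
        self.replicas = max(0, self.replicas - count)

    def restore(self):
        self.replicas = self._saved


service = FakeService()
result = run_experiment(
    measure=service.starts_per_second,
    inject=lambda: service.kill(1),  # the 'X' in the hypothesis
    rollback=service.restore,
)
print(result["hypothesis_holds"])
```

Putting the rollback in a `finally` block is a deliberate choice: the fault is removed even when the measurement itself raises an error, which keeps the experiment controlled.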

03. Consider real-world scenarios

For the best results, ask ‘What could go wrong?’ and then simulate it. Prioritise the potential failures too, starting with those that are most likely to happen or would do the most damage.

Chaos engineering might seem scary at first, but when done in a controlled way it can be invaluable for understanding how complex modern systems can be made more resilient and robust. Learning to embrace organised chaos will help your teams fully understand how well your systems hold up under hazardous conditions.
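To make the advice on simulating and prioritising failures more concrete, here is a small Python sketch. It ranks a list of scenarios by a simple likelihood-times-impact score and wraps a function so that it fails at a chosen rate. The scenario names, the 1 to 5 scores, `with_faults` and `InjectedFault` are all assumptions made up for this example rather than part of any real chaos engineering tool.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"simulated failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def prioritise(scenarios):
    """Order scenarios by likelihood x impact, highest score first."""
    return sorted(scenarios, key=lambda s: s["likelihood"] * s["impact"], reverse=True)


# Hypothetical scenarios scored from 1 (low) to 5 (high).
scenarios = [
    {"name": "database unavailable", "likelihood": 2, "impact": 5},
    {"name": "slow dependency", "likelihood": 4, "impact": 3},
    {"name": "single instance terminated", "likelihood": 5, "impact": 1},
]
for scenario in prioritise(scenarios):
    print(scenario["name"])


def lookup_user(user_id):
    return {"id": user_id}


# A seeded generator keeps the simulated failures repeatable between runs.
flaky_lookup = with_faults(lookup_user, failure_rate=0.3, rng=random.Random(42))
failures = 0
for user_id in range(100):
    try:
        flaky_lookup(user_id)
    except InjectedFault:
        failures += 1
print(f"{failures} of 100 calls failed")
```

Multiplying likelihood by impact is only one possible ranking. A team might reasonably weight impact more heavily, or start with low-impact scenarios while it builds confidence.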

This article was originally published in issue 324 of net, the world's best-selling magazine for web designers and developers.

Wieldt is a developer evangelist and senior solutions marketing manager at New Relic.