How Netflix pioneered Chaos Engineering
Outages can be lethal for any service or application. In the ‘on-demand’ age we live in today, we can’t expect consumers to hang around when the lights go out. That’s true for any business, but for the most popular streaming service in the world, one approaching 150 million users, that’s a code to live by.
Having given us the phenomenon now widely known as ‘binge-watching’, and even entered the cultural lexicon with ‘Netflix-n-chill’, the streaming service has disrupted the entertainment industry immeasurably.
Beyond savvy marketing and a flawless business model, though, its product matters most, and it’s at the coalface that Netflix’s engineers have broken new ground with a method of software development dubbed ‘chaos engineering’.
Put simply, chaos engineering means deliberately injecting faults into distributed software systems in production to test their resilience under turbulent or unexpected conditions. When it introduced the concept to the coding community, Netflix reports, it was met with both “incredulity and skepticism”.
Drawn in by this maverick approach and the tool that sprang from it, Chaos Monkey, TechHQ approached Netflix’s engineering team for comment and was pointed towards Ali Basiri, the company’s Senior Software Development Lead and a central founder of the Chaos Engineering methodology.
Basiri told TechHQ that the method came about when Netflix moved its services from the data center to the cloud. His team quickly learned that they needed to adapt to the change in reliability of the company’s servers, or ‘instances’.
“Instance uptime could be days or weeks instead of months or years,” Basiri said. “An instance could disappear in the middle of the night when it was harder for the engineers to respond quickly.”
With users scattered across time zones, and Netflix being the entertainment of choice for many at any hour of the day, such outages would inevitably damage paying subscribers’ perception of the service.
“That’s when we asked, what if instances could terminate during business hours, when engineers are available and can respond quickly? And what if it was a much more regular occurrence that you couldn’t just ignore and couldn’t just leave to chance?”
So emerged Chaos Monkey, a tool that randomly “unplugs” instances during business hours, to the point that instance outages became the norm for the product’s developers: something to prepare for throughout the entire engineering process.
“Every design decision would have to consider the availability of the service in the face of instance terminations,” Basiri said.
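The idea behind the tool can be illustrated with a minimal sketch. This is not Netflix’s implementation: the 9-to-5 weekday window, the function names and the instance names below are all assumptions made for illustration, and the sketch only selects an instance rather than terminating anything.

```python
import random
from datetime import datetime, time
from typing import List, Optional

# Assumed working window; the article only says "business hours".
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on weekdays (Mon-Fri) between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(instances: List[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Pick one instance at random, or None outside business hours.

    Hypothetical helper: a real tool would go on to call the cloud
    provider's API to terminate the chosen instance.
    """
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    fleet = ["i-a", "i-b", "i-c"]  # made-up instance names
    print("terminate:", pick_victim(fleet, datetime.now(), random.Random()))
```

Restricting the selection to working hours is the point of the design: the failure still happens, but at a time when engineers are on hand to respond.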
The phrase Chaos Engineering implies a lack of control, but in actuality, the Netflix team are measured in their approach. Each experiment begins with the hypothesis that the “steady state” of the system— or the rate at which customers are able to watch shows and movies— will not be changed.
“We then run experiments with scenarios for which we have specifically engineered resilience. But the nature of the experimentation is that sometimes those resilience mechanisms will fail, and there is potential for impacting the customer experience.”
One of the core principles is to minimize the “blast radius”, or the impact on the service, when an issue is discovered. That’s where ChAP, the Chaos Automation Platform, comes in: a tool that runs experiments on a small fraction of traffic by using experiment and control populations.
“Chaos Monkey was the start of chaos engineering at Netflix. However, as our service became more complex, our chaos engineering methodology became more sophisticated,” explains Basiri, adding that ChAP represents the company’s next stage of work with Chaos Engineering.
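To make the idea of experiment and control populations concrete, here is a rough sketch of how a small slice of users might be split into two groups and compared. It is an illustration under stated assumptions, not ChAP’s actual design: the hashing scheme, the 1% default fraction, the 2% tolerance and both function names are hypothetical.

```python
import hashlib


def assign_group(user_id: str, fraction: float = 0.01) -> str:
    """Deterministically place a user in 'experiment', 'control' or 'none'.

    Each of the two populations receives roughly `fraction` of all users,
    so the fault only ever touches a small share of traffic.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    cutoff = round(fraction * 10_000)
    if bucket < cutoff:
        return "experiment"  # fault is injected for these users
    if bucket < 2 * cutoff:
        return "control"     # same size, no fault, used for comparison
    return "none"            # everyone else is left alone


def steady_state_holds(control_ok: int, control_total: int,
                       exp_ok: int, exp_total: int,
                       tolerance: float = 0.02) -> bool:
    """True if the experiment success rate is within `tolerance` of control."""
    if control_total == 0 or exp_total == 0:
        raise ValueError("both populations need at least one request")
    control_rate = control_ok / control_total
    exp_rate = exp_ok / exp_total
    return control_rate - exp_rate <= tolerance
```

Comparing against a control group of the same size, rather than against the whole service, means a dip caused by something unrelated shows up in both groups and is not blamed on the injected fault.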
One would imagine it would be hard to convince senior leaders at the growing company to deploy the technique.
Basiri made sure to “tread carefully”, saying “We started Chaos Monkey with sane defaults for how often terminations would happen and we made it opt-in initially”, but added that the company’s bottom-up culture and emphasis on freedom and responsibility were integral.
Since releasing Chaos Engineering into the wild, Netflix has gained recognition as a pioneer of the left-field approach to product builds.
That reputation has attracted interest from other companies: not just digital-first enterprises or startups, but large banks and even healthcare services, organizations that stand to lose the most from outages.
But for every other business under the sun, is chaos engineering a viable option?
“Systems today are becoming so complex that it’s hard for any single person to reason about how they work and how they will behave under different circumstances,” Basiri said.
“I believe chaos engineering is a great technique for testing the resilience of such systems and building confidence that the business will survive when bad luck strikes.”
For those considering the approach, expect failures to happen whether you adopt it or not: “You can have a controlled failure and use it to learn from and adapt your system to be more resilient, or you can be caught off guard.
“It might help to think of chaos engineering as similar to a controlled forest burn that helps prevent large uncontrolled forest fires.”
That being said, it helps to take baby steps: “Start with the hypothesis that you will not break things when you introduce a fault (such as instance terminations); test that hypothesis in a safe environment, then use chaos engineering to increase your confidence in the validity of a hypothesis.
“Make sure you have observability and can identify and react if things don’t behave as you expect them to,” Basiri added. “The experimentation could take many forms and could be as simple or as sophisticated as you are comfortable with.”
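The advice above can be sketched as a small experiment loop: record the steady state, introduce a fault, watch a metric, and stop as soon as the hypothesis fails. This is a simplified illustration rather than any particular tool’s method; `run_experiment` and its parameters are hypothetical names, and the caller supplies the functions that inject the fault, revert it and read the metric.

```python
from typing import Callable


def run_experiment(inject: Callable[[], None],
                   revert: Callable[[], None],
                   read_metric: Callable[[], float],
                   max_drop: float,
                   checks: int = 5) -> bool:
    """Return True if the steady-state hypothesis held for every check.

    The hypothesis is that the metric will not fall more than `max_drop`
    below its pre-fault baseline while the fault is active.
    """
    baseline = read_metric()  # observe the steady state before the fault
    inject()
    try:
        for _ in range(checks):
            if baseline - read_metric() > max_drop:
                return False  # hypothesis refuted: stop early
        return True
    finally:
        revert()              # always undo the fault, pass or fail
```

In a safe environment the same loop could be exercised with a fake metric first, which matches the suggestion to start small before moving closer to production.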
The concept of Chaos Engineering sounds unrelenting, and yet the rationale that programs will be built stronger over time is hard to contradict.
Seven years after its creation, with Netflix now a global giant, Chaos Monkey continues to wreak havoc. But even if it were switched off, Basiri believes Chaos Engineering has become so embedded among Netflix engineers that they would continue to develop programs with resilience in mind.
4 October 2022