Chaos engineering – why resilience by design is just best practice

‘This is how software will be built in ten years’ – an interview with the CEO of Gremlin.
16 July 2020

Organized ‘chaos’ – the best approach to building resilience? Source: Shutterstock

The phrase ‘chaos engineering’ has leftfield allure, but beneath the label, it’s simply about best practice.

A discipline pioneered at streaming giant Netflix, chaos engineering is “thoughtful planned experiments designed to reveal weaknesses in our systems and in our teams and processes.”

That’s the definition lent by Kolton Andrus. He’s former Amazon and Netflix engineering stock, and now founder and CEO of Gremlin, a SaaS platform devoted to bringing chaos engineering principles to major league firms like Walmart, Under Armour, Siemens, and Twilio.

“In the beginning, people were going out and shutting down racks or cutting network cables – and one of those might have caused a side-effect outage,” Andrus said of his time working on software at Amazon in the e-commerce giant’s earlier days.

“That’s where the business said, hey, we’ve got to do this better.”

Andrus’s already developed skill in building resilience into software and services led him to Netflix, a company that was “embracing this” in its culture. “What I learned about three months in was they had this tooling, but they needed better. They needed something that was more safe, more precise, that would let people do [chaos engineering] more in a decentralized way.”

Cementing the culture of chaos engineering into the company further, supported by ‘game days’ – team-based learning exercises designed to give players a chance to put their skills to the test in a real-world, risk-free environment – Andrus helped Netflix go from eight and a half hours of outage in his first year, to less than 45 minutes in his second.

Not just good, but reliable

A culture of resilience is central to the success of tech giants like Amazon and Netflix. Their appeal is not just down to the services they offer, but how reliable those services are. And as we increasingly rely on software and cloud services in core elements of our daily lives, resilience is less a luxury than a necessity – a fact that’s been highlighted amid the pandemic.

Modern cloud computing technology makes companies more responsive to increased traffic and demand, but cloud’s predictive analytics rely on historical data, and nobody (bar perhaps Bill Gates) could have predicted the pandemic or the effect it would have on the world of work. Much of the sleeplessness and high blood pressure CIOs have suffered in recent months stems from the fact that many companies have neglected to perform ‘fire drills’. Instead, holes and vulnerabilities in their systems have lain hidden and dormant until emergency surges unearthed them.

Take Zoom, for example, the video conferencing firm that shot to mass popularity amid the Covid-19 pandemic, whose service has been faultless despite downloads increasing 30-fold. Its readiness for rapid scale has meant it’s maintained dominant market share against slower-to-react heavyweight rivals in Google Meet and Microsoft Teams, in a growing sector, despite significant, justifiable scrutiny regarding security flaws.

On the other hand, it’s easy to spot those services that weren’t ready for the unexpected. Black Friday online retail traffic annually sorts the wheat from the chaff in e-commerce; trading app Robinhood faced its first lawsuit after an outage on a “historic trading day”.

Outages of trading apps, e-commerce websites, and streaming services may seem trivial in the scheme of things, with disgruntled customers, lost revenues, and bruised reputation perhaps the only things at stake. But the importance of airtight resilience hits home when we take stock of the more central role of software today in applications such as healthcare, online voting, and even autonomous driving.

And, not only does chaos engineering help businesses establish reliable and responsive systems, but it also skirts the need for panicked “knee-jerk” solutions put in place in critical moments.

Chaos engineering culture

Having built the foundations of chaos engineering into individual businesses, Andrus has brought together resilience-focused engineers from firms including Amazon, Netflix, Google, and Dropbox to make building resilience a best practice across the software development industry. Gremlin aims to make companies ready, around the clock, for unplanned interruptions.

One of the key barriers to adopting chaos engineering seems to be a lack of understanding about the concept, and an unwillingness to create more “technical debt” in trying to integrate it. Andrus admits the name can also be a “misnomer.”

“When I go and talk to banking executives or higher-ups at more traditional companies, I’m often speaking about ‘reliability engineering’, and how we’re going about accomplishing it,” he said.

But while the name conjures associations with, well, chaos, induced software breakages are methodical. The ‘blast radius’ – or the impact on the service – is always minimized and enforced outages are scheduled: “We’re not creating chaos; there’s already chaos – we have to tame it,” Andrus said.

“I’m quick to point out our number one goal is to never cause an outage,” he continued, “and there’s a time and a place to randomly cause failures. But really, we think about the scientific method; we have a hypothesis, we have some risk mitigation, we’re going to go test this hypothesis and we’re going to learn from it to improve things […] It’s better to schedule it and communicate it and let people know it’s coming.

“To do this well, to really build a reliable offering, requires a culture of resilience, it requires that everyone understands that it’s important […] Everyone needs to grow up and be there – and when that doesn’t occur, you get these piecemeal approaches.”
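The experiment loop Andrus describes (a hypothesis, some risk mitigation, a limited blast radius, and a lesson at the end) can be sketched in a few lines of Python. This is an illustrative sketch only, not Gremlin’s tooling or API: the `Experiment` fields, the `run_experiment` helper, and the simulated fleet of hosts are all hypothetical names assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """One scheduled chaos experiment (all names here are hypothetical)."""
    hypothesis: str          # what we expect to stay true during the fault
    blast_radius: float      # fraction of hosts the fault is allowed to touch
    abort_error_rate: float  # risk mitigation: halt if errors exceed this


def pick_targets(hosts: List[str], blast_radius: float) -> List[str]:
    """Limit the fault to a small slice of the fleet (at least one host)."""
    count = max(1, int(len(hosts) * blast_radius))
    return hosts[:count]


def run_experiment(
    exp: Experiment,
    hosts: List[str],
    inject_fault: Callable[[str], None],
    remove_fault: Callable[[str], None],
    error_rate: Callable[[], float],
) -> str:
    """Inject a fault host by host, watch one metric, and always clean up."""
    injected: List[str] = []
    try:
        for host in pick_targets(hosts, exp.blast_radius):
            inject_fault(host)
            injected.append(host)
            if error_rate() > exp.abort_error_rate:
                return "aborted: hypothesis disproved, fix before widening"
        return "passed: hypothesis held, consider a larger blast radius"
    finally:
        # Roll back every injected fault, whether the run passed or aborted.
        for host in injected:
            remove_fault(host)


# Simulated fleet: each faulted host adds 0.2% to the error rate.
hosts = [f"host-{i}" for i in range(20)]
faulted = set()
small = Experiment("errors stay under 1% with 10% of hosts degraded", 0.10, 0.01)
result = run_experiment(
    small, hosts, faulted.add, faulted.discard, lambda: 0.002 * len(faulted)
)
```

In this simulation the small experiment passes because only two hosts are touched; widening `blast_radius` to 0.5 would trip the abort threshold partway through, and the `finally` block would still roll every fault back.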

One of the common objections Andrus’s team gets is that, amid the various projects and features developer teams are tasked with, there is simply no time to allocate to building in more reliability. This mindset results in inefficiencies down the road, when things do break.

“The [engineers] that we’ve convinced are the ones that we’ve shown that this approach saves you time, saves you outages, makes you better at your job,” Andrus said.

It’s ‘best practice’

As a SaaS provider, Gremlin provides companies with the tools to safely run chaos engineering tests in order to improve the health of their systems. But because it is staffed by chaos engineering practitioners who are experienced in developing the concept as a culture, Gremlin also advises on how to weave the process into teams through boot camps or game days.

“We’re able to get people hands-on, running these experiments and gaining that comfort. There’s a little bit of [apprehension] until you’ve run it yourself, and the world hasn’t caught on fire.”

For Andrus and his team, there is no doubt that chaos engineering is just “best practice,” whether it’s already in play or hasn’t been considered.

“There’s no doubt it’s just the best practice – it’s how software will be built in ten years,” Andrus told us. “Everyone else is waiting to understand it, and have it clearly articulated before they’re willing to jump into it.

“There’s a lot of critical parts of society that are going to be emerging over the next 10 and 20 years, whether it’s drones or self-driving cars, whether it’s elections, whether it’s how money is transferred and exchanged – there’s a lot of important things where people’s safety could be at risk.

“As an engineer, I’m a big believer that if you want me to get in a self-driving car, I would hope that you’re taking every step possible to mitigate risk and ensure that the system will operate when things go wrong.

“Because in the real world, we know, things will go wrong.”