JPMC, Uber, GrubHub — tech giants talk chaos engineering
“Any sufficiently advanced technology is indistinguishable from magic,” is the recently revived quote of science fiction writer Arthur C. Clarke.
We can now summon a private car to our doorstep with the tap of the thumb, and it costs less than a cab. In a decade, that car may not even have a driver. As consumers, we take for granted our increasing reliance on apps to navigate modern life — to order our food, manage our finances, and get from A to B. The icons on our smartphone home screens are portals to vast and powerful services, and the best thing is, they just work.
Product simplicity is a constant goal of technology companies, both startups and evolving incumbents. If the user catches a glimpse of the pulleys working behind the scenes or, worse, the curtain has to fall while stagehands fix a jammed trapdoor, the illusion of simplicity is ruined.
Continuing the illusion becomes more difficult as systems become more complex. Today’s technology companies — that is to say, nearly every business — host vast architectures of microservices. As businesses continue to scale, serve more customers, and move to the cloud, they must ensure their product continues to function flawlessly.
In these complex backend environments, it gets harder to pinpoint incidents when they occur, and disruptions become more costly. Outages lasting even just minutes can cost companies hundreds of thousands of dollars. It’s not uncommon for services to strain at the most inopportune times due to peaks in traffic.
And while outages of trading apps, e-commerce websites, and streaming services may seem trivial in the scheme of things, with disgruntled customers, lost revenues, and bruised reputations perhaps the only things at stake, the importance of airtight resilience hits home when we take stock of the more central role of software today in applications such as healthcare, online voting and — when that car pulls up with no driver — even autonomous driving.
Chaos engineering is a methodology for examining how every component of a system, and all of its dependencies, can fail, and then building in resilience. It matters more as reliability becomes central to the success of apps and services.
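To make the idea concrete, here is a minimal sketch of a chaos-style experiment in Python. It is not taken from any of the companies quoted in this article: the dependency, the retry count, the fallback value, and every function name are hypothetical. The sketch injects failures into a simulated downstream call at a chosen rate and counts how many requests were still served, either live or from a fallback.

```python
import random


class DependencyError(Exception):
    """Raised when the (simulated) downstream dependency fails."""


def flaky_dependency(failure_rate, rng):
    """Hypothetical downstream call that fails with the injected probability."""
    if rng.random() < failure_rate:
        raise DependencyError("injected failure")
    return "live-result"


def resilient_call(failure_rate, rng, retries=2, fallback="cached-result"):
    """Call the dependency, retry on failure, and fall back if it stays down."""
    for _ in range(retries + 1):
        try:
            return flaky_dependency(failure_rate, rng)
        except DependencyError:
            continue
    return fallback


def run_experiment(failure_rate, requests=1000, seed=42):
    """Inject failures at the given rate and report how requests were served."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    results = [resilient_call(failure_rate, rng) for _ in range(requests)]
    return {
        "served": len(results),
        "live": results.count("live-result"),
        "fallback": results.count("cached-result"),
    }


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.0))  # baseline: no injected faults
    print(run_experiment(failure_rate=1.0))  # total outage of the dependency
```

In this toy setup every request is still answered during a total outage because the fallback takes over. A real experiment would inject faults into real infrastructure and compare the results against production metrics.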
Ahead of Gremlin’s Chaos Conf live this week from October 6-8, TechHQ caught up with leading figures from three world-leading brands to find out just how much resilience matters.
JP Morgan Chase
JP Morgan Chase (JPMC) is a banking giant, but some may be surprised at the sheer scale of its development operations.
Rahul Arya heads up the firm’s global technology solutions architecture team, which is responsible for accelerating the bank’s modern platform architecture and engineering. In a finance world now disrupted by hungry challengers and data-powered fintech products, JPMC’s digital ambitions are staggering. Arya’s unit serves a mind-boggling 50,000 engineers and around 6,500 applications around the globe.
“Customer experience and our digital products ultimately drive our business. Technology teams at JPMC and our large, diverse, and global bench of engineers make this all possible,” Arya told TechHQ.
“Given our scale, we focus on both line of business-driven products to our end customers and internal products and services that enable our next-gen apps and platforms to scale.”
Digital transformation in the finance sector, however, carries unique complexities. Given the criticality of customer data processed by banking apps every day, banks are held to strict compliance requirements, which can often stymie the pace and scale of next-gen products, much to the chagrin of well-oiled engineering teams. Arya’s team is focused on removing the friction of compliance from developers, which “massively increases their velocity to ship code.”
“As we enable our modern cloud platforms that empower our developers and abstract away our internal compliance requirements, ensuring resiliency for our apps then becomes extremely critical,” said Arya. “Chaos enables resiliency via code.”
Chaos engineering is a cultural shift away from legacy resiliency engineering, and it requires a focus on empowering developers to use chaos seamlessly to manage their apps. Much as it did for compliance, Arya’s team made chaos “incredibly simple and easy to use.”
“Ultimately, resilient apps means better products and happy customers,” Arya said.
GrubHub
In the hyper-competitive food delivery sector, system failures can lead to late or missing deliveries. That’s not only letting down hungry customers, but also the restaurant partners connected to those apps. If a customer has a bad experience, there is no shortage of competition.
“Reliability is always at the forefront of our technology discussions and decisions,” said Doug Campbell, senior site reliability engineer at GrubHub. Day to day, Campbell does everything from developing new tools to evangelizing good continuous integration & development practices and maintaining the infrastructure involved in delivering software.
“Grubhub development teams have pressure to design reliable, fault-tolerant, distributed services. Everything needs to operate in a microservice, active-active multi-region setup, and we put strict requirements on monitoring and availability,” said Campbell.
GrubHub engineers are ultimately responsible for deployments and support of their own services, Campbell explained: “We have a variety of ways to manage all of this, like robust continuous integration and continuous delivery pipelines with a strong focus on automation and reproducibility.
“Our biggest focus with chaos engineering is enabling our developers to experiment.
“While we are still early in our chaos journey, the biggest success I have seen is around confidence. Chaos engineering practices have helped us validate long-held assumptions about the health of our services, which leads to increased confidence that our services can successfully handle a variety of failure scenarios.
“The biggest challenge is getting everyone on board. We really have to evangelize the tools and be a chaos engineering salesperson. From a technical perspective, chaos engineering is not hard to implement; dealing with the human side is the challenging part.”
Uber
As at other rapid-growth tech companies, Uber’s outages stem from things like change management, code deployments, configuration changes, and scaling quickly. “The landscape is [continuously] changing, the technologies are always changing. And there is always a balance between speed and reliability,” explained Ranjib Dey, software engineer at Uber.
And there’s a lot that can go wrong: “We are the largest microservice-based architecture that I am aware of and we are operating at a web-scale, where everything can independently evolve over time and as a result, certain things go different ways,” said Dey. “There are a lot of interfaces in terms of how they are interconnected and how the failures cascade or propagate across different boundaries. So, that makes it very technologically challenging.”
But the ridesharing leader also faces entirely unique challenges. Its services out in the real world are at the mercy of judicial and legal systems, protests, terrorist attacks, and natural causes — all of which can mean the firm must “shut down” certain cities with zero notice.
The reality of Uber’s marketplace is unpredictable, so the app has been built with an “anti-fragile” mindset: outages are accepted as inevitable, but measures are in place to keep them to a minimum.
“We know there is a delta between the production of any small, miniature replica or simulated environment versus real-life — even further greater in a microservice-like architecture.”
Dey explains that when an outage does occur, Uber treats it as an opportunity to learn. “We have extended continuous integration and continued to continuous delivery. And instead of unit tests, now we have this chaos engineering practice that predicates our deployment and change management.”
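One way to picture chaos experiments acting as a gate on deployment is the Python sketch below. It is an illustration only and does not describe Uber’s actual pipeline: the `ChaosExperiment` structure, the `gate_deployment` function, and the toy service are all hypothetical names invented for this example. Each experiment injects a fault, checks whether the service still behaves acceptably, and always rolls the fault back; a release would go ahead only if no experiment fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChaosExperiment:
    name: str
    inject: Callable[[], None]        # turn the fault on
    rollback: Callable[[], None]      # turn the fault off again
    steady_state: Callable[[], bool]  # True if the service still looks healthy


def gate_deployment(experiments: List[ChaosExperiment]) -> List[str]:
    """Return the names of experiments whose health check failed under fault."""
    failed = []
    for exp in experiments:
        exp.inject()
        try:
            if not exp.steady_state():
                failed.append(exp.name)
        finally:
            exp.rollback()  # always restore the system, even if the check raises
    return failed


class ToyService:
    """Stand-in for a real service that reads from a database."""

    def __init__(self, has_cache_fallback: bool):
        self.db_up = True
        self.has_cache_fallback = has_cache_fallback

    def handle_request(self) -> bool:
        # Succeeds if the database is up, or if a cache can cover for it.
        return self.db_up or self.has_cache_fallback


def db_outage_experiment(service: ToyService) -> ChaosExperiment:
    """Build an experiment that takes the toy database down."""
    def inject() -> None:
        service.db_up = False

    def rollback() -> None:
        service.db_up = True

    return ChaosExperiment("db-outage", inject, rollback, service.handle_request)


if __name__ == "__main__":
    for fallback in (True, False):
        failures = gate_deployment([db_outage_experiment(ToyService(fallback))])
        print("deploy" if not failures else f"block: {failures}")
```

The service with a cache fallback passes the gate, while the one without it is blocked on the `db-outage` experiment.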
Dey said the introduction of chaos engineering was an incremental push: “We never made big changes anywhere. Because we know that our adoption will be hindered by those kinds of massive maneuvers.”
For an anti-fragile company like Uber, that mindset applies not just to product but to process and people, and it means persuading teams that it’s OK to “proactively inject failure in your database layer, though this sounds very, very strange.”
2 July 2022