Recent AWS outages foreshadow a much larger problem

The AWS outages in 2021 indicate a problem of service concentration.
21 December 2021

It is ironic that AWS, the company that pioneered application and deployment resilience in the cloud, has suffered outages that took out so many essential applications and services. Last week’s downtime lasted only around 30 minutes and affected the company’s US-West-1 and US-West-2 regions, but it came hard on the heels of the more prolonged outage earlier this month that hit the US-East-1 region.

A statement from AWS to The Register, issued after services had returned to normal, read, “This traffic engineering incorrectly moved more traffic than expected to parts of the AWS backbone that affected connectivity to a subset of Internet destinations. The issue has been resolved, and we do not expect a recurrence.”

Coping with “more traffic than expected” is a primary raison d’être of cloud providers: essentially, it’s what they are meant to be good at. But schadenfreude, while enjoyed by some, doesn’t help the businesses and essential organizations hit hard by this latest outage and by the earlier, larger-scale one.

With every misstep by large internet businesses, even the mainstream media is becoming more aware that large portions of everyday life rely almost entirely on internet services, and that a handful of private companies control what is now vital infrastructure.

Businesses choosing SaaS applications over in-house solutions place trust in external parties, from the internet gateway through to the end-provider, which in many cases is one of the hyperscale cloud providers. While the internet’s protocols were developed with resilience in mind (auto-routing around bottlenecks and dead waypoint nodes, for instance), cloud providers’ systems are not engineered to the same tolerances. And it’s the end-user who pays when someone thousands of miles away accidentally power cycles the wrong box – metaphorically speaking.

End-users, in the form of customers of SaaS business applications and services, are well abstracted away from the bare metal underneath what they use. In its simplest form, Company A (Bob’s Building Blocks Inc.) pays Company B (Peter’s Payroll Services) for a service, and Company B’s stack – or enough of it to matter – is hosted in AWS’s US-East-1. When AWS misconfigures an acronym (DNS is a favorite), it’s Bob’s workers who don’t get their paychecks. Taken to its logical end, it’s only a matter of time before everyone on the planet can say that they, too, have been significantly affected by an AWS, GCP, or Azure outage. But by that point, IT decision-makers will hopefully have reconsidered their cloud strategy, if not their hosting strategy, at a deeper level.
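The chain of dependency described above can be made concrete with a small sketch. It is illustrative only: the dependency map is hypothetical and simply mirrors the Bob-and-Peter example, and the function name `affected_by` is made up for this illustration. It walks the "depends on" relationships backwards to list every party hit when one underlying service fails.

```python
from collections import deque

# Hypothetical dependency map: each key relies on the services listed for it.
# The names mirror the example in the text; none of this is real data.
DEPENDS_ON = {
    "Bob's Building Blocks Inc.": ["Peter's Payroll Services"],
    "Peter's Payroll Services": ["AWS US-East-1"],
    "AWS US-East-1": [],
}


def affected_by(failed, depends_on):
    """Return everything that depends, directly or indirectly, on `failed`."""
    # Invert the map so we can look up who relies on a given service.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outwards from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


if __name__ == "__main__":
    for name in sorted(affected_by("AWS US-East-1", DEPENDS_ON)):
        print(name)
```

Bob has no contract with AWS and may not know it is involved, yet his company appears in the result. A business could build a similar map of its own suppliers, and their suppliers, to see where its real exposure lies.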

The problem here isn’t just with cloud providers. Too high a concentration of power and capability creates market imbalances. Google’s dominance in search technology has changed the face of the web. Where websites were once a digitally agnostic method of disseminating useful information, sites are now often little more than marketing collateral carefully worded for “search engine” ranking. By “search engine,” we mean Google, of course.

Many organizations in 2020 and 2021 were also hit by outages in a type of internet service that is little understood by the mainstream press and even less by people outside of IT: content delivery networks (CDNs). In June this year, the Fastly network’s downtime cost Amazon around US$6000 a second and stopped access to some of the world’s biggest and best-known sites and services until the previously dormant bug was weeded out.

CDNs are typically deployed for their caching capabilities and as a way to help mitigate DDoS attacks. Traffic via CDNs should flow more predictably, with less load on origin servers and a lower chance of malicious actors successfully queuing up bots to flood them with requests. But as with the hyperscale cloud providers, concentrating CDN use among Cloudflare, Fastly, Akamai and CloudFront means that should one of those services fall over, the consequences will be felt both upstream and downstream.

The extreme yet impractical answer to dependency on large providers of clouds or CDNs is to host everything possible on-premises and ramp up funding for cybersecurity, but that ignores many of the cloud’s (and CDNs’) advantages. A longer-term approach might be to ensure that staff are trained not in cloud-specific technologies, but in more generic systems administration, storage engineering, and infrastructure development techniques. That way, when the time comes to spin up a new service, the choice will not be limited to what staff are most conversant with, but guided by what best suits the business.

The modern version of the 1960s clarion call, “no-one ever got fired for choosing IBM,” currently reads, “no-one ever got fired for choosing AWS.” As more essential services fall over because IT decision-makers choose what’s front-of-mind rather than best-fit, the Route One choices start to look less attractive.