What we can learn from five recent IT outages
Financial services provider Allianz Group found that business interruptions and cyber incidents such as IT failures, outages, and data breaches were the two leading global business risks for 2019.
A study of technical outages at UK banks found that daily IT failures were common across the six leading banks, with one major incident occurring every two weeks.
Similarly, the U.S. Government Accountability Office (GAO) studied 34 airline IT outages from 2015 to 2017 and found that 85 percent of them caused flight disruptions, delays, or cancellations.
The year 2019 was remarkable for the sheer volume and diversity of IT outages that organizations experienced. It seemed like no one was immune from performance degradations, including major airlines, hospitals, commercial banks, stock exchanges, and even cloud providers.
Here are five outages from 2019 that not only caused reputational damage, customer dissatisfaction, and financial losses but also offered valuable lessons in how to respond to large-scale incidents.
January 2019: The UK Ministry of Justice
Background. In January 2019, the UK’s Ministry of Justice (MoJ) suffered an embarrassing IT outage that derailed the performance of critical IT systems like the Crown Prosecution Service, the Criminal Justice Secure Email system (CJSM), and the court hearing information recording system for an entire week.
Impact. Legal professionals in the UK were unable to access either the court Wi-Fi system or email services from the MoJ. The outage affected hundreds of MoJ websites, preventing jurors from enrolling and delaying hearings for minor offenses by more than two weeks.
Cause. The MoJ held its supplier, Atos, responsible for the system-wide disruption. While the MoJ named an infrastructure failure at an Atos datacenter as the likely culprit, historical under-investment in aging IT systems at the MoJ also helped create a perfect storm for the collapse of the court’s technology systems.
March 2019: Facebook, Instagram, and WhatsApp
Background. Facebook suffered its worst outage in March 2019, when its popular social media applications (Facebook, Instagram, Messenger, and WhatsApp) remained inaccessible for more than 14 hours. The outage affected more than two billion people across the world who rely on Facebook’s family of apps for business and pleasure.
Impact. Facebook earned US$17 billion in its latest quarter, with 99 percent of revenues coming from its advertising platform. Facebook’s shares fell by close to 3 percent in early trading the next day, while several advertisers queued up for refunds on ad spend wasted during the outage.
Cause. The reason for Facebook’s epic outage was a server configuration change that triggered a cascading series of issues. Cascading failures can create a chain of incidents across interdependent systems resulting in large-scale network disruption.
Alex Stamos, former chief information security officer of Facebook, offered the clearest explanation of the probable reasons behind this massive outage.
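The cascading pattern described above can be illustrated with a small sketch. It is not a model of Facebook's actual architecture: the service names and the dependency map are hypothetical, and the sketch assumes that a service fails as soon as any one of its dependencies fails.

```python
from collections import deque

# Hypothetical dependency map (illustrative only): each service lists
# the services it depends on directly.
DEPENDS_ON = {
    "config-store": [],
    "auth": ["config-store"],
    "messaging": ["auth"],
    "cdn": [],
    "photo-feed": ["auth", "cdn"],
    "ads": ["photo-feed"],
}


def cascade(depends_on, initial_failure):
    """Return the set of services that fail once `initial_failure` goes down.

    Assumption: every dependency is a hard dependency, so one failed
    dependency is enough to take a service down.
    """
    # Invert the map: for each service, which services rely on it directly?
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk from the failed component through its dependents.
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in failed:
                failed.add(service)
                queue.append(service)
    return failed


print(sorted(cascade(DEPENDS_ON, "config-store")))
# ['ads', 'auth', 'config-store', 'messaging', 'photo-feed']
```

In this toy map, a fault in a single low-level component takes down five of the six services, which is why a small configuration change can turn into a large-scale disruption.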
May 2019: Salesforce Pardot
Background. Salesforce faced its biggest service disruption in May 2019 when the deployment of a database script to its Pardot Marketing Cloud ended up granting elevated permissions to regular users.
Salesforce had to block access for Pardot users to prevent employees from stealing sensitive corporate data. However, when this fix didn’t work, Salesforce then had to block network access to other Salesforce services like Sales Cloud and Service Cloud.
Impact. Customers were unable to access the Pardot Marketing Cloud for 20 hours as Salesforce engineers took affected systems offline to resolve user access permissions. While Salesforce was able to restore data permissions for most customers within a day, it took an additional 12 days to roll out fixes for other Salesforce services.
Cause. Configuration changes have been a frequent cause of IT outages at leading cloud providers. A faulty database script led Salesforce to shut down almost its entire infrastructure in order to address the issue of broken user permissions.
Once the company fixed the problem, Salesforce customers had to spend hours setting up the right levels of access permissions.
August 2019: British Airways
Background. British Airways (BA) is the UK’s largest international airline, serving more than 45 million customers every year. BA had to cancel more than 100 flights and delay 200 domestic and international flights at Heathrow, Gatwick, and London City airports due to performance issues with its flight check-in and departure systems.
Impact. BA had to check in passengers manually during the incident, leading to long queues and inordinate wait times. The airline offered irate customers the option of either rebooking their flights or getting a refund to prevent further chaos and confusion across the three London airports. The outage also led to flight delays at other airports in Great Britain and Europe.
Cause. The service degradation started at 8 am on Wednesday, August 7th and took until nearly 4 pm to resolve. Two separate IT systems for online check-in and flight departures were responsible for the outage that affected more than 25,000 passengers.
October 2019: Chime Banking
Background. Chime is a digital-only bank that serves five million U.S. customers through its mobile app. Chime suffered a protracted outage that started on Wednesday, October 16th and continued until Friday, October 18th, preventing members from withdrawing cash at ATMs, accessing salary accounts, making debit card purchases, and checking account balances.
Impact. Chime customers were unable to pay their bills or meet their financial obligations during the two-day outage. Two hundred customers officially filed complaints with the Federal Deposit Insurance Corporation while 6,000 customers signed an online petition demanding compensation from Chime. After repeated complaints, Chime finally credited $10 to all active customers who couldn’t use their banking services during the prolonged service disruption.
Cause. Chime’s third-party payment gateway processor, Galileo Financial Technologies, experienced an “operational incident” with its database systems that prevented millions of Chime customers from accessing their online accounts. Without any physical branch presence, customers had no recourse but to patiently wait for Chime to restore its banking services.
Ensuring IT resilience during technology outages
On average, a single hour of downtime costs an organization $126,000. Given the financial implications and customer trust issues surrounding a high-profile outage, IT practitioners need to carefully examine their current operational workflows and build the right contingency plans to handle critical incidents. Here are three takeaways that technology leaders should keep in mind while dealing with an unanticipated outage:
- Embrace chaos engineering. Given that the most sophisticated technology teams are not immune to network outages, enterprises should invest in running experiments that truly test the limits of their IT infrastructure. The emerging discipline of chaos engineering offers an excellent framework to minimize systemic risks by designing software and underlying infrastructure for real-world production scenarios.
- Plan for human error. Ponemon’s 2016 report on the Cost of Data Center Outages found that human error accounts for 22 percent of unplanned downtime. Despite the adoption of software-defined infrastructure and on-demand cloud services, IT teams still need human operators to keep the lights on. Technology leaders will need to build consistent processes, foster open communication, embrace preventive maintenance, and create a culture of blameless postmortems to catch and prevent disruptive outages.
- Include supplier systems in your IT preparedness plan. Several outages in 2019 illustrated the interdependent nature of modern applications that operate through open APIs: major airlines like American and JetBlue were disrupted by their technology provider, Sabre; Microsoft Office 365 was hit by a CenturyLink DNS issue; and a Cloudflare outage in July took down large parts of the Internet for a couple of hours. CIOs should carefully evaluate the external technology systems that their digital services rely on and collaborate with their suppliers to rapidly restore services during an incident.
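To put the $126,000-per-hour average cited above in perspective, here is a rough back-of-the-envelope sketch. It simply multiplies that average by approximate outage durations mentioned in this article (14 hours for Facebook, 20 hours for Pardot, and about 8 hours for British Airways). The function name is hypothetical, and the results are illustrative only, since the real cost to each of these companies is not stated here and would differ widely from an industry average.

```python
# Average hourly downtime cost cited in the article (USD).
HOURLY_DOWNTIME_COST_USD = 126_000


def estimated_cost(hours, hourly_cost=HOURLY_DOWNTIME_COST_USD):
    """Estimate outage cost as duration times an average hourly cost."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * hourly_cost


# Approximate durations taken from the incidents described above.
outage_hours = {
    "Facebook (March 2019)": 14,
    "Salesforce Pardot (May 2019)": 20,
    "British Airways (August 2019)": 8,
}

for name, hours in outage_hours.items():
    print(f"{name}: {hours} h -> ${estimated_cost(hours):,}")
# Facebook (March 2019): 14 h -> $1,764,000
# Salesforce Pardot (May 2019): 20 h -> $2,520,000
# British Airways (August 2019): 8 h -> $1,008,000
```

Even at an average rate, an outage that lasts a working day runs past a million dollars, before counting refunds, compensation, or lost customer trust.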
Bottom line. Frequent technology outages are a great reminder of the fragile nature of modern infrastructure. While technology failure and system downtime are inevitable, IT operations teams that build the right incident response playbooks, invest in modern performance management tools, establish the right communication cadence during an outage, and learn from past incidents will be best placed to deal with unplanned outages.
This article was contributed by Deepak Jannu, Director of Product Marketing at OpsRamp.
31 March 2020