Facebook’s outage highlights the big problem with the remote work policy

With only 25% of its workforce in office, Facebook's remote work policy may have delayed the recovery period of its massive outage.
8 October 2021

Even as some companies push for hybrid work or require vaccinated employees to return to the office, COVID-19 is forcing others to reconsider the remote working policies they have implemented. When remote working took hold in Q1 2020, many businesses adapted quickly to the change, with minimal disruption.

According to the Future of Jobs Report 2020 by the World Economic Forum, some 84% of employers are set to rapidly digitize work processes, including a significant expansion of remote working. Organizations feel there is the potential to move 44% of their workforce to fully remote operations in the future. However, 78% of business leaders expect some negative impact on worker productivity, and many businesses are taking steps to help their employees adapt.

Given the choice, many employees would prefer to work remotely. Some report being more productive, and many feel they save time by skipping the commute and get more work done at home. There is, however, the concern of mental fatigue from continuous remote work, which is why some companies have opted for hybrid working models in which employees come into the office on some days.

And this is where the tricky part comes in. How do companies schedule their hybrid workforce? How do companies decide which roles should be done remotely and which should not? How do companies invest in technology in the long run? Should they focus just on workplace collaborative tools for remote workers, or focus more on their workers in offices?

While these questions shape remote work policy for most employees, IT teams face a different scenario. They need to ensure employees have a seamless experience when working, regardless of whether they are at home or in the office. They are responsible for keeping services running, no matter where they are.

Today, cloud technology enables IT teams to work remotely and gives them visibility into the tools needed to manage company data, workloads, and applications. But what happens when a problem occurs? In most cases, these problems can be solved remotely. However, this wasn’t the case for Facebook.

Two phases of the newly completed Facebook data center sit at the base of mountains in the Rush Valley. (Photo by GEORGE FREY / GETTY IMAGES NORTH AMERICA / Getty Images via AFP)

When remote work policy backfires

Facebook has long advocated remote work for its employees. The company has a workforce of 60,000, with 75% of them still working remotely. Little did the company expect that a routine configuration update could lead to severe repercussions for the social media company.

Earlier this week, Facebook experienced its biggest outage since 2008. The outage affected Facebook and its other social platforms, WhatsApp and Instagram. Striking during peak hours, it left billions of social media users frustrated worldwide. It also impacted many of Facebook’s internal tools and systems for its day-to-day operations, which in turn hampered attempts to quickly diagnose and resolve the problem.

While most users were able to switch to alternative apps for communication, businesses that rely solely on Facebook were not as lucky. The six-hour outage saw Facebook lose over US$6 billion. That’s roughly a billion dollars an hour, and it remains to be seen what the actual losses are for businesses affected by the downtime.

Santosh Janardhan, VP of Infrastructure at Facebook, said their engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between Facebook’s data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way Facebook’s data centers communicate, bringing services to a halt.

“Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end. We also have no evidence that user data was compromised as a result of this downtime,” said Janardhan.

Interestingly, Facebook insiders claim the outage that took all of its services offline was exacerbated by employees working from home as many were locked out of key data centers and messaging services.

Postings on Twitter and chat forums showed that remote working engineers had to be rushed to the company’s data centers to reset the servers manually. That was not a simple process either, as most of the employees who could fix the problem were working from home and faced logistical challenges. Reports on Reddit also suggested that Facebook’s data centers had lower staffing due to pandemic restrictions.

Janardhan also mentioned that the outage was a situation the company was actually prepared to deal with.

“This is an event we’re well prepared for thanks to the “storm” drills we’ve been running for a long time now. In a storm exercise, we simulate a major system failure by taking a service, data center, or entire region offline, stress testing all the infrastructure and software involved. Experience from these drills gave us the confidence and experience to bring things back online and carefully manage the increasing loads,” he posted in an update.

Whether the drills were done with remote working in mind is unclear, but the reality is that the outage could likely have been resolved faster if more employees had been on site. Facebook’s difficulty in resolving the crisis quickly highlights a major problem that organizations, large and small, face when it comes to their remote work policies.

Despite this, the company is still allowing its employees to work remotely until January 2022. Meanwhile, most of the big tech companies are hoping to have their employees back in the office, either full-time or in a hybrid working environment, by early next year.

With vaccination rates increasing and the global workforce now inching its way back to offices, perhaps it’s finally time for all companies to revisit their remote work policies. While some roles can be performed remotely, organizations need to ensure they have sufficient backup tools in place should a situation like this ever occur again.