The evolution of observability in IT

18 December 2020

Observability has become one of the most talked-about topics in DevOps in 2020. But is it just a fancy word for monitoring? Or is it really a new approach? Looking at the history of the term can tell us why it has become so important, and how it has evolved from an academic formalism into a way of understanding user experience and overall application health by connecting cause and effect.

Observability’s origin story

The term “observability” has its roots in control theory, the mathematical study of systems – from industrial processes to aircraft – with the goal of operating those systems in safe and efficient ways. In this context, a system is observable if its internal state can be inferred from its external outputs. It might not be possible to measure directly how fast a chemical reaction is occurring, for example, but we can measure temperature and other indicators and use them to build models of what is happening inside a processing unit.
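For the special case of linear systems, control theory makes this definition concrete with a standard rank test (the Kalman rank condition): a system with n internal state variables, dynamics matrix A, and output matrix C is observable if the matrix formed by stacking C, CA, ..., CA^(n-1) has rank n. The sketch below is a minimal, standard-library-only illustration of that test. The function names and the position-and-velocity model are assumptions made for this example; they are not taken from the chemical-process example above.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def rank(matrix):
    """Rank via Gaussian elimination, using exact fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0  # number of pivot rows found so far
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def is_observable(A, C):
    """Kalman rank test for x[k+1] = A x[k], y[k] = C x[k]."""
    n = len(A)
    stacked, block = [], C
    for _ in range(n):
        stacked.extend(block)     # append C, then CA, then CA^2, ...
        block = matmul(block, A)
    return rank(stacked) == n


# Hypothetical model: the state is (position, velocity), and position
# advances by the velocity at every step.
A = [[1, 1],
     [0, 1]]

print(is_observable(A, [[1, 0]]))  # we can only measure position
print(is_observable(A, [[0, 1]]))  # we can only measure velocity
```

Measuring position alone is enough here, because velocity can be inferred from how position changes over time. Measuring velocity alone is not, because no sequence of velocity readings reveals where the system started.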

Managing complexity?

Observability became part of how we talk about production software systems when a group of engineers published a blog post, “Observability at Twitter,” describing their approach to operating and debugging their increasingly complex distributed system. While the post doesn’t define observability explicitly, it does describe using a combination of tools to collect, store, and analyze three types of telemetry: metrics, logs, and traces.
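As a rough illustration of how these three types of telemetry differ, here is a minimal sketch in Python. The class names, field names, and the handle_request function are hypothetical; they do not reflect Twitter’s tooling or any particular library. The sketch only shows that a single request can yield a number to aggregate, an event to read, and a timed operation to place in a trace.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    """A named number that is cheap to aggregate over many requests."""
    name: str
    value: float
    tags: dict


@dataclass
class LogRecord:
    """A timestamped event describing one moment in detail."""
    timestamp: float
    message: str
    trace_id: str


@dataclass
class Span:
    """One timed operation; spans that share a trace_id form a trace."""
    trace_id: str
    name: str
    start: float
    end: float
    parent_name: Optional[str] = None


def handle_request(path):
    """Pretend to serve a request and return the telemetry it produced."""
    trace_id = uuid.uuid4().hex
    start = time.monotonic()  # monotonic clock, so durations are never negative
    # ... the real work of the request would happen here ...
    end = time.monotonic()

    span = Span(trace_id=trace_id, name=f"GET {path}", start=start, end=end)
    log = LogRecord(timestamp=time.time(),
                    message=f"served {path}",
                    trace_id=trace_id)
    metric = Metric(name="request.duration_ms",
                    value=(end - start) * 1000.0,
                    tags={"path": path})
    return metric, log, span
```

In this sketch the log record and the span carry the same trace ID, which is one simple way to tie the different signals for a request back together.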

Given this, you can’t entirely blame those who went on to define metrics, logging, and distributed tracing as the “three pillars of observability” – after all, Twitter did it (and Google too), and if we’ve already got metrics and logging, adding one more to round out the trio seems easy, right? If we’re going to avoid cargo culting, however, we need to understand more about why these three tools were chosen and how they were used.

Managing unknowns?

In “Distributed Systems Observability,” Cindy Sridharan points out that merely having metrics, logs, and traces is not enough. She defines observability as data-driven debugging and a holistic approach to operating reliable systems, one that starts with design and spans coding, testing, deployment, and monitoring.

In “Observability is a Many-Splendored Definition,” Charity Majors says that “metrics, logs and traces are just data types” and that you should instead focus on the “unknown unknowns” – the questions you didn’t know you needed to ask… until something went wrong, and then you did. This is a clear shift from how we debugged monoliths: in a distributed system, you cannot anticipate every possible failure mode, because there are too many things that might go wrong. You cannot create a dashboard that enumerates all of these potential failures.

In fact, there is a glimmer of the formal definition from control theory here: you won’t have a chance to add any new diagnostics during an incident. Instead, you need to make do with the externally visible signals that exist right now.

But there are so many failure modes that, even with the right tools and the opportunity, experienced engineers find asking the right questions overwhelming and out of reach. There are just too many things that might have contributed to the failure: relying on intuition or guesswork is no longer a viable option.

Managing change

In control theory, observability is a part of creating stable systems within dynamic environments. In software systems, the environment is certainly dynamic: infrastructure and user behavior frequently change. But in software, the system itself is also constantly changing: we roll out new features, we optimize, and we fix bugs. So trying to model the software today based on how it behaved two weeks ago is a non-starter!

But while constant change can be a challenge, it is also a path to understanding: change is the clue that leads us to the failures that matter. Instead of investigating every infrastructure hiccup, we should start by looking at changes to user-perceived performance, SLOs, and application health – and then trace back to the underlying factors that contributed to those changes. 

This is how we should think about observability in software systems: not as three independent pillars or as the ability to ask arbitrary questions, but as the ability to quickly navigate from an effect back to its causes. Understanding cause and effect has always been important, but in distributed systems, the connections have become more difficult to follow. This is why we need to adapt not just how we monitor our applications, but how we build, test, and deploy them as well.
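To make the idea of navigating from an effect back to its causes a little more concrete, here is a minimal sketch in Python. It starts from a user-visible effect (requests that exceed a latency objective) and ranks the request attributes that are over-represented among those slow requests. The record layout, the attribute names, the 200 ms threshold, and the rank_suspects function are all assumptions made for illustration. The scoring is a simple difference in proportions, which points to candidates worth investigating rather than proving a cause.

```python
from collections import Counter


def rank_suspects(requests, slo_ms):
    """Rank attribute values by how much more common they are in slow requests."""
    slow = [r for r in requests if r["latency_ms"] > slo_ms]
    fast = [r for r in requests if r["latency_ms"] <= slo_ms]
    if not slow or not fast:
        return []  # nothing to compare against

    def share(group):
        # Fraction of requests in the group carrying each (attribute, value).
        counts = Counter((k, v) for r in group for k, v in r["attrs"].items())
        return {kv: c / len(group) for kv, c in counts.items()}

    slow_share, fast_share = share(slow), share(fast)
    scores = {kv: s - fast_share.get(kv, 0.0) for kv, s in slow_share.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical request records taken shortly after a new version rolled out.
requests = [
    {"latency_ms": 80,  "attrs": {"version": "v1", "region": "us"}},
    {"latency_ms": 90,  "attrs": {"version": "v1", "region": "eu"}},
    {"latency_ms": 85,  "attrs": {"version": "v1", "region": "us"}},
    {"latency_ms": 100, "attrs": {"version": "v1", "region": "eu"}},
    {"latency_ms": 450, "attrs": {"version": "v2", "region": "us"}},
    {"latency_ms": 500, "attrs": {"version": "v2", "region": "eu"}},
]

for (attribute, value), score in rank_suspects(requests, slo_ms=200):
    print(f"{attribute}={value}: {score:+.2f}")
```

With this made-up data, the new version stands out while region does not, which mirrors the argument above: a recent change is often the most useful clue.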

This article was contributed by Daniel “Spoons” Spoonhower, CTO and co-founder of Lightstep.