Getting Started with Observability

A practical look at understanding different concepts in observability. We build our own application and create experiments to make sense of each concept as we move along.

Jan 31, 2025

Let’s start with a definition, according to Wikipedia:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

In other words, it is how well we can understand what is happening inside a system based on what it outputs.

Quite often, we come across long-winded explanations of what observability is, and I got caught up in them as well while reading about the topic. I was confused for quite a while.

And to be honest, none of them are wrong. I think the audience they speak to is incredibly experienced, so for me it became a journey of understanding the idea and, in turn, breaking it down through a practical approach so that anyone can make sense of it.

Observability has existed in software development for decades, with logging being its oldest form.

In 2017, Peter Bourgon wrote a post about metrics, logs and traces. Briefly, these three pillars are defined as:

Metrics: aggregatable data. An example is the number of HTTP requests.
Traces: any data that can be bound to the lifecycle of a request, usually visualised as a waterfall of requests.
Logs: timestamped records with metadata about the events that occurred during a request.
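
To make these three definitions a little more concrete, here is a minimal sketch in plain Python that uses only the standard library. The helper name handle_request, the route strings and the field names are all made up for illustration. This is not how WriteStack is instrumented, and a real setup would rely on dedicated metrics, logging and tracing libraries.

```python
import json
import time
import uuid
from collections import Counter

# Metric: an aggregatable number, here a count of requests per route.
request_count = Counter()
# Logs: timestamped records describing events that happened during a request.
log_lines = []
# Traces: data bound to the lifecycle of one request, stored here as simple spans.
spans = []


def handle_request(route):
    """Pretend to serve a request and emit all three signals (hypothetical helper)."""
    trace_id = uuid.uuid4().hex  # ties the log line and the span to this request
    started = time.perf_counter()

    # Metric: bump the counter for this route.
    request_count[route] += 1

    # Log: one timestamped, structured record about this request.
    log_lines.append(json.dumps({
        "timestamp": time.time(),
        "trace_id": trace_id,
        "route": route,
        "message": "request handled",
    }))

    # Trace: record how long this request's (only) span took.
    spans.append({
        "trace_id": trace_id,
        "name": "GET " + route,
        "duration_seconds": time.perf_counter() - started,
    })
    return trace_id
```

Calling handle_request("/posts") twice and handle_request("/users") once leaves request_count["/posts"] at 2, with three log lines and three spans. The counter can be summed and compared across routes, while each log line and span carries a trace_id that points back to one specific request.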

These pillars roughly define the overall idea of observability.

Many others have offered their own definitions of observability over the years, and there has also been unofficial semantic versioning around the topic, which we’ll discuss in later posts.

Observability isn’t a tool. It’s a property of a system. It’s how we help ourselves see what we’re building as we’re building it.

With observability, we can better understand which components in our systems break, and how and why they do.

In order to make sense of observability, we’re going to be working on a practical project which will allow us to ask questions, break things and rework our assumptions about metrics, traces and logs.

Hopefully, at the end of this exercise, we’ll have some kind of philosophy we can run with. A better idea of how to understand our systems.

Of course, as part of this, I want to cover other topics that relate to building better systems, such as canonical logs, SLOs/SLIs, alerting, runbooks, real user monitoring, Core Web Vitals and infrastructure monitoring.

So to begin, I’ll describe the app I’m working on.

Our application

I’ve built my own tiny version of Substack called WriteStack using FastAPI on the back end.

The application has 5 models: User, Subscription, Newsletter, Post and Comment.

Below is what the model definition looks like from a high level.

[Figure: Model definition]

You can also look at it in more detail here in code.

The application is made up of the following components, and we want to understand how they perform under the pressure of live traffic:

Our application is very simple. It has a back-end API written in FastAPI, a front end and a Postgres database where we persist our data.

I’ve also defined a load generator, built with Locust, that I’m going to use to simulate live traffic on the app. This will allow us to generate the logs, metrics and traces we need to form hypotheses about the application.

All the code related to this series is in this repo, and I’ve organised the repo so that you can choose whichever topic you want to learn about without being bogged down by the others. If you want to learn only about traces, you can do that. If your interest is elsewhere, you can make that choice.

Ultimately, by the end of the next few weeks, we should be able to define ideas that are useful to our everyday work as engineers. As you learn, so will I.

If at some point during the blog series you have more questions about a certain idea, you can ping me on LinkedIn or send me an email.

Or if you’d like to correct me on certain things, I’d appreciate that as well.

Let’s learn together.

Unreliability.
