Logs, Metrics, Traces: How I Think About Observability in a Distributed System


How to check a system’s pulse and understand what is happening and where.

Measuring is controlling: the time invested in implementing logs is never wasted; it is one of the best investments you can make.


Difference Between Debugging a Problem in a Monolith and in a Distributed System

In the monolithic world, problem analysis is almost always a linear and predictable path: one codebase, one environment, one point of observation. If something goes wrong, everything is right there: your IDE, the debugger, the full stack trace.

In distributed systems, however, this approach is no longer enough.
A single request may pass through ten services, two queues, a scheduler, an asynchronous job, and maybe even an external API call. The issue may show up minutes after it originated.

You no longer have the ability to “stop and see what happens.”
Classic debugging simply does not apply anymore.
You need a new way to observe the system. A discipline that allows you to see what you cannot pause.

This is where observability comes into play.


LID Philosophy — Log Instead Debugging

The LID (Log Instead Debugging) philosophy stems from a clear idea:

Provide in the logs all the information needed to understand what a process is doing, without requiring debugging.

It is not an absolute replacement for the debugger — debugging is still more powerful.
But in distributed systems it is often impossible to use, because services are isolated, ephemeral, asynchronous, running in clusters or containers that you cannot “pause.”

LID means designing software so that:

  • anyone reading the logs can reconstruct the program’s flow,
  • even without having the source code,
  • even without knowing the underlying technology stack.
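A minimal sketch of what a LID-style log could look like, using only Python's standard library. Every record carries enough context (an order ID and a step name, both illustrative field names of my choosing) to reconstruct the flow of a request without attaching a debugger:

```python
import io
import json
import logging

# LID sketch: a JSON formatter that emits self-describing records.
# The field names ("order_id", "step") are illustrative, not a standard.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "order_id": getattr(record, "order_id", None),
            "step": getattr(record, "step", None),
        }
        return json.dumps(payload)

buffer = io.StringIO()           # stands in for stdout / a log shipper
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.setLevel(logging.INFO)
log.addHandler(handler)

def process_order(order_id: str) -> None:
    ctx = {"order_id": order_id}
    log.info("order received", extra={**ctx, "step": "receive"})
    log.info("payment authorized", extra={**ctx, "step": "payment"})
    log.info("order completed", extra={**ctx, "step": "done"})

process_order("ord-42")
print(buffer.getvalue())
```

Someone reading these three lines can follow the whole flow of order `ord-42` — which is exactly the point: the log, not the source code, tells the story.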

This approach has two strategic consequences:

  1. It reduces diagnosis costs
    Less experienced technicians can analyze complex issues by reading clear, structured logs.
  2. It makes the system understandable by anyone
You don’t need to be an expert in the stack to understand what is happening.

In other words, LID democratizes troubleshooting.
It’s an investment that pays off multiple times: less tribal knowledge, less dependence on “gurus,” and more autonomy for the team.


Log Management

Logging is not enough: you need to be able to use those logs.
Log management transforms a mass of events into a useful, searchable source of knowledge.

Key objectives include:

  • centralizing logs from many services;
  • indexing them quickly;
  • searching for events, patterns, and correlations;
  • creating dashboards and visualizations;
  • setting alerting rules;
  • managing retention, storage, and large volumes.

Tools like ELK, Loki, OpenSearch, or Datadog have become operational standards. Without a log management platform, even the best logs become useless: just lines scattered across ephemeral containers.
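As a toy illustration of "centralize and correlate" — no real log platform, just stdlib Python — here is a sketch that groups structured log lines from several services by a shared request ID (all field names are illustrative):

```python
import json

# Toy "search and correlate": given centralized JSON log lines from
# several services, group them by request_id. A real platform
# (ELK, Loki, OpenSearch, ...) does this at scale, with indexing.
raw_lines = [
    '{"service": "api", "request_id": "r1", "msg": "received"}',
    '{"service": "billing", "request_id": "r1", "msg": "charged"}',
    '{"service": "api", "request_id": "r2", "msg": "received"}',
]

by_request: dict[str, list[str]] = {}
for line in raw_lines:
    event = json.loads(line)
    by_request.setdefault(event["request_id"], []).append(
        f'{event["service"]}: {event["msg"]}'
    )

print(by_request["r1"])  # → ['api: received', 'billing: charged']
```

Three lines scattered across two services become one readable story per request — which is the whole value proposition of log management.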

Log management makes the LID investment exploitable.


Quantifying: The Art of Using Metrics

Metrics are the quantitative side of observability.
They are simple, fast, numeric, and comparable over time.

Unlike logs, which tell you what happened, metrics answer the question:

Is the system working as it should?

Key metrics include:

  • RED metrics (Rate, Errors, Duration) — for APIs and services.
  • Resource usage — CPU, memory, I/O, DB connections.
  • Queues and asynchronous systems — length, wait time, processing rate.
  • Business metrics — application KPIs that really matter (orders, labels generated, successful integrations).
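To make the RED triple concrete, here is a toy in-process tracker, stdlib only. In production you would use a metrics library (Prometheus clients, StatsD, etc.); this sketch just shows what the three numbers are:

```python
from dataclasses import dataclass, field

# Toy RED (Rate, Errors, Duration) tracker for a single service.
@dataclass
class RedMetrics:
    requests: int = 0
    errors: int = 0
    durations: list = field(default_factory=list)

    def observe(self, duration_s: float, ok: bool) -> None:
        """Record one request: count it, count its error, keep its duration."""
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations.append(duration_s)

    def summary(self) -> dict:
        avg = sum(self.durations) / len(self.durations) if self.durations else 0.0
        return {"requests": self.requests,
                "errors": self.errors,
                "avg_duration_s": round(avg, 3)}

red = RedMetrics()
red.observe(0.120, ok=True)
red.observe(0.480, ok=False)
print(red.summary())  # → {'requests': 2, 'errors': 1, 'avg_duration_s': 0.3}
```

A real system would also track rate per time window and duration percentiles, not just an average; the point here is only that three small numbers already answer "is the system working as it should?".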

Without metrics there is no control. Without control there is no improvement.
Metrics tell you how severe a problem is and when it started. Logs tell you why.


The “Smoke Trail”: Tracing a Process in a Distributed System

Tracing is what ties logs, metrics, and service calls together.
It provides the end-to-end view of a single request — what I call “the smoke trail”: the complete path of an event through the system.

With tracing:

  • every request gets a trace ID,
  • services add their own information,
  • you obtain a graph of the flow and timings,
  • you identify bottlenecks and slow components,
  • you understand where a problem actually originates.
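The mechanics of the trace ID can be sketched in a few lines of stdlib Python. Real systems use OpenTelemetry for this; here a `contextvars` variable stands in for the propagated context, carrying the ID across two hypothetical "services" so every log line can be correlated to one request:

```python
import contextvars
import uuid

# The trace id lives in a context variable, so it travels implicitly
# with the flow of the request instead of being passed as a parameter.
trace_id_var = contextvars.ContextVar("trace_id", default="none")

def log(service: str, message: str) -> None:
    print(f"trace={trace_id_var.get()} service={service} msg={message}")

def service_b() -> None:
    log("service-b", "charging payment")

def service_a() -> None:
    log("service-a", "order received")
    service_b()  # the trace id travels with the context, untouched

def handle_request() -> None:
    trace_id_var.set(uuid.uuid4().hex[:8])  # assigned once, at the edge
    service_a()

handle_request()
```

Both printed lines share the same `trace=` prefix: grep for it in centralized logs and you have the "smoke trail" of that single request.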

Without tracing, a distributed system is opaque.
With tracing, it becomes readable like a map.

OpenTelemetry has finally standardized this approach, making it accessible without vendor lock-in.


Conclusions

This is a huge topic, and here I can only give a quick overview — but I will write more detailed articles on each specific theme.

Observability is not just a set of tools: it’s a mindset.
It means designing the system knowing that sooner or later something will break — and you’ll need to understand it quickly.

Log Instead Debugging, solid metrics, and widespread tracing are three complementary pillars:

  • logs tell the story,
  • metrics measure the system’s health,
  • tracing connects everything together.

You can live without these in a monolith.
In a distributed system, they are simply essential.

Investing in observability means investing in the future of the project and the team.
Because the more complex a system becomes, the more valuable it is to understand it effortlessly.