Production Incident Management: My Mental Checklist

When something goes wrong in production, you need method, clarity, and above all, collaboration.
Over time, I’ve developed a mental checklist that helps me move in an orderly way, avoiding impulsive reactions and working in tandem with the team best suited to the situation.
In this article, I share my approach to incident management, organized into logical blocks that can be applied in any technical environment.
The Goal of Incident Management
The goal is threefold: restore operations as quickly as possible, communicate clearly with customers and stakeholders, and prevent the problem from recurring.
Gathering Information and Activating Communication
The first step is not to rush into debugging, but to gather information and establish an initial picture.
In parallel:
- I inform customers and stakeholders that the issue has been acknowledged,
- I define a communication channel,
- I start identifying which people need to be involved.
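The initial picture can be as simple as a timestamped record of what is known and who has been told. A minimal sketch, with illustrative field names and a hypothetical incident (this is not a real tool, just the shape of the data worth capturing):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """Minimal initial picture of an incident (illustrative fields)."""
    summary: str
    channel: str                      # the dedicated communication channel
    stakeholders_notified: bool = False
    timeline: list = field(default_factory=list)

    def log(self, event: str) -> None:
        # Timestamped timeline entries make the post-mortem much easier.
        self.timeline.append((datetime.now(timezone.utc).isoformat(), event))

incident = IncidentRecord(summary="Checkout errors spiking",
                          channel="#inc-checkout")
incident.log("Issue acknowledged, stakeholders informed")
incident.stakeholders_notified = True
incident.log("Paging backend and database on-call")
print(len(incident.timeline))  # 2 timeline entries so far
```

Even a structure this small pays off later: the timeline doubles as raw material for the post-mortem.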
This phase is also crucial to assemble the incident team, which is not fixed but changes based on the information collected: infrastructure, backend, frontend, integrations, database, or customer support.
The Role of the Incident Commander
During an incident, it’s essential that someone maintains overall coordination.
This figure, the Incident Commander, is not necessarily the deepest technical expert on the problem, but the person who:
- keeps the big picture in mind,
- handles internal and external communication,
- shields those investigating from noise,
- ensures that steps are executed in order,
- prevents the team from losing focus.
This allows specialists to concentrate on diagnosis without worrying about communications or stakeholder pressure.
Immediate Classification: Infrastructure or Software?
Once communication is underway, I move to the first major logical fork:
- Infrastructure fault: servers, network, DNS, load balancers, storage, queues, databases.
- Software bug: recent releases, configurations, feature flags, functional regressions.
This distinction guides both who to involve and the direction of the analysis.
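The fork can be expressed as a rough first-pass rule. The signals below (a recent deploy, active infrastructure alerts) are illustrative inputs, not a real monitoring API:

```python
def classify_incident(recent_deploy: bool, infra_alerts: list) -> str:
    """Rough first-pass triage: infrastructure fault vs. software bug.

    Inputs are hypothetical signals a team might pull from its
    monitoring and deploy history; the point is the decision order.
    """
    # Active platform alerts (network, DNS, storage, DB) point at infrastructure.
    if infra_alerts:
        return "infrastructure"
    # No platform alerts but a fresh release? Suspect the software first.
    if recent_deploy:
        return "software"
    return "unknown"

print(classify_incident(recent_deploy=True, infra_alerts=[]))        # software
print(classify_incident(recent_deploy=False, infra_alerts=["dns"]))  # infrastructure
```

The "unknown" branch matters: when neither signal is present, the answer is to gather more information, not to guess.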
If It’s a Bug: Evaluating a Rollback
When the incident appears related to a recent release, the most effective question is:
Can we perform a rollback quickly and safely?
A rollback:
- restores operability in minutes,
- reduces impact,
- gives the team time to analyze the issue calmly.
It’s not a failure: it’s an operational tool to guarantee continuity.
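The "quickly and safely" question can be made concrete with a heuristic. The conditions below (irreversible schema migration, time since deploy) are assumptions chosen for illustration; every team's real safety criteria will differ:

```python
def rollback_is_safe(has_schema_migration: bool,
                     minutes_since_deploy: int,
                     max_window_minutes: int = 60) -> bool:
    """Heuristic answer to 'can we roll back quickly and safely?' (illustrative).

    A release that shipped an irreversible schema migration, or that has
    been live long enough to accumulate new state, usually cannot be
    rolled back blindly.
    """
    if has_schema_migration:
        return False
    return minutes_since_deploy <= max_window_minutes

print(rollback_is_safe(has_schema_migration=False, minutes_since_deploy=15))  # True
print(rollback_is_safe(has_schema_migration=True, minutes_since_deploy=15))   # False
```

When the check says no, the fallback is a hotfix or disabling the feature, not forcing the rollback anyway.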
Block Thinking: Building the Mental Perimeter
To begin the diagnosis, I apply block thinking, a structured approach that helps avoid being overwhelmed by chaos.
Phase 1 — Define the impact perimeter
- which users are affected?
- which functionalities are degraded?
- which components may be correlated?
Phase 2 — Break the problem down into logical blocks
The idea is simple: always start from the most likely and simplest causes, gradually moving toward the more complex ones.
Examples of blocks:
- resources (CPU, RAM, disk, DB connections),
- internal / external network,
- queues and asynchronous systems,
- recent configurations,
- communication architectures (APIs, microservices),
- application logic.
This approach prevents dispersion and allows the team to proceed systematically.
Narrowing the Scope: From Simple to Complex
This is where block thinking is fully applied.
The principle is:
never start with the most far-fetched hypotheses, nor with the smartest ones… start with the simplest.
A typical sequence:
- “Stupid but essential” checks (exhausted resources, exceeded limits, crashed processes).
- Infrastructure checks (load balancer, ingress, networking, queues).
- Dependency checks (DB, storage, external APIs).
- Application log inspection.
- Correlation with metrics and tracing.
- Only at the end: complex bugs, edge cases, rare conditions.
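The sequence above is, in essence, an ordered list of checks run until one fails. A minimal sketch: the disk check is real (Python's standard library), while the dependency and application checks are placeholders standing in for whatever probes a team actually has:

```python
import shutil

def check_disk() -> bool:
    # "Stupid but essential": is the disk simply full?
    usage = shutil.disk_usage("/")
    return usage.used / usage.total < 0.95

def check_dependencies() -> bool:
    # Placeholder: ping the DB, storage, and external APIs here.
    return True

def check_application_logic() -> bool:
    # Placeholder: the expensive, complex investigation comes last.
    return True

# Ordered from the most likely and simplest causes toward the complex ones.
CHECKS = [("resources", check_disk),
          ("dependencies", check_dependencies),
          ("application logic", check_application_logic)]

def triage() -> str:
    for name, check in CHECKS:
        if not check():
            return name  # the first failing block narrows the scope
    return "all blocks passed"

print(triage())
```

The payoff of the ordering is that the cheap checks either find the cause in seconds or rule out entire blocks before anyone opens a debugger.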
This method dramatically reduces resolution time.
Restoring Operability
Once the cause has been identified, proceed with restoration:
- rollback,
- hotfix,
- failover,
- scaling,
- temporary disabling of the problematic functionality.
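Temporary disabling is often the fastest of these options when a feature flag already guards the functionality. A toy sketch with an in-memory flag store and a hypothetical checkout path (real systems would use a flag service):

```python
# In-memory flag store for illustration only; production systems
# typically use a dedicated feature-flag service.
FLAGS = {"new_checkout_flow": True}

def disable_feature(flag: str) -> None:
    """Temporarily switch off the problematic functionality."""
    FLAGS[flag] = False

def checkout(flags: dict) -> str:
    # The degraded-but-working path keeps the service operational.
    return "new flow" if flags.get("new_checkout_flow") else "legacy flow"

disable_feature("new_checkout_flow")
print(checkout(FLAGS))  # legacy flow
```

This is "restore first, refine later" in miniature: users get a working (if degraded) path immediately, and the broken code can be fixed without time pressure.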
The rule is: restore first, refine later.
And keep customers and stakeholders updated at every step.
Post-Mortem: Learning From the Incident
The post-mortem is the most important phase of the entire process.
It serves to:
- analyze what happened objectively,
- understand the conditions that allowed the incident,
- evaluate what worked and what didn’t in the response process,
- define preventive actions,
- improve logging, monitoring, alerting, and release processes.
Conclusions
Incident management is neither a speed contest nor an exercise in individual brilliance: it is a collective process that requires method, communication, and the ability to think in a structured way.
Block thinking, the presence of an Incident Commander, and assembling the right team are elements that transform a critical moment into a controlled operation.
Every incident is different, but the approach can always be the same: gather information, set priorities, analyze from simple to complex, restore operability, and learn from what happened.
This discipline doesn’t eliminate incidents, but it ensures lower impact, more effective responses, and a progressively stronger team.
Ultimately, the real value lies not only in restoring the service, but in the evolution of the system and the people who keep it running.
Valerio's Cave