What I Learned Running HA Systems During the Holidays (Without Panic)


Every year, as punctual as Christmas dinner or the New Year’s Eve toast, it arrives:
the production incident during the holidays.

Anyone who works with high-availability systems knows this well: for our services there are no office hours, no “closed for holidays”. And for years I’ve dealt with this from different angles: first as a sysadmin, then as a tech lead, and finally as a CTO.

The conclusion I’ve reached is simple, but not trivial:

Calm, more than technology, is the real key to high availability.

Technology is complex, but relatively predictable. People under stress, much less so. And the difference between a well-handled incident and a night of total panic almost always lies in process, communication, and the ability not to be overwhelmed by urgency.

Below I’m sharing what I’ve learned, especially in that time of the year when half the team is on holiday, customers are hypersensitive, and SLAs don’t go on vacation.


Scenario: holidays, incidents, and SLAs

The holiday period is a perfect organizational stress test.

You have:

  • fewer people on call,
  • more traffic on certain services,
  • smaller maintenance windows,
  • customers who, if something goes down on “the evening of the 24th”, take it almost personally.

In theory, HA systems should absorb failures and spikes: clustered redundancy, load balancers, replicated storage, tested backups. In practice, we know incidents never show up in a “clean” way:

  • a node that crashes not by itself, but right while you’re deploying;
  • an alert that fires, but the person on call doesn’t really understand what it means;
  • a failover that technically works, but hits performance at the worst possible moment.

What I’ve understood is that you don’t defend SLAs with architecture alone, but with the way the company organizes itself around its systems:

  • who decides what;
  • who talks to customers;
  • who has the final say if you need to “break the change freeze”.

During the holidays, this becomes even more obvious: every organizational ambiguity turns into minutes of downtime.


Escalation processes and communication

The first time a major customer calls you at 5:00 PM on December 31st, you immediately understand how important it is not to improvise.

Escalation: less creativity, more scripts

A good escalation process is almost boring, and that’s exactly how it should be.
When something happens, the person on call needs:

  • a clear step-by-step checklist: what to check, and in what order;
  • second-level contacts (database, network, application, cloud provider);
  • clear thresholds for “raising the level”: when to involve the CTO, when to involve sales, when to officially declare an incident.
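The "clear thresholds" idea can be sketched in a few lines of code. This is a minimal, illustrative model — the severity criteria, the minute thresholds, and the roles involved are assumptions for the example, not a standard; adapt them to your own on-call policy.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    customer_facing: bool       # are users/customers visibly affected?
    sla_at_risk: bool           # is a contractual SLA in danger?
    minutes_unresolved: int     # time since the alert fired

def escalation_targets(incident: Incident) -> list[str]:
    """Return who to involve, in order, for the current incident state.

    Thresholds and role names are illustrative placeholders.
    """
    targets = ["on-call engineer"]
    if incident.customer_facing or incident.minutes_unresolved >= 15:
        targets.append("second-level (db / network / app / cloud provider)")
    if incident.sla_at_risk or incident.minutes_unresolved >= 30:
        targets.append("CTO")
    if incident.customer_facing and incident.minutes_unresolved >= 30:
        targets.append("sales / account manager")
    return targets
```

The point is not the exact numbers, but that the decision is written down *before* the incident — the person on call executes a table, they don't negotiate it at 2 AM.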

The rule I apply is:

“Better one escalation too many than one too few.”

It’s much better to disturb someone and then scale back than to try a “quick fix” on something that is already spiraling out of control.

Communication: silence increases perceived downtime

The truth is that, during an incident, silence weighs more than the outage itself.

There are two types of communication that really matter:

  1. Internal

    • A single dedicated channel (e.g. #incident-<date>).
    • An incident commander, even informally, who coordinates and decides.
    • Real-time event logging: who does what, what has been tried, what has been ruled out.
  2. External (customers / stakeholders)

    • A short and honest first message: we are investigating.
    • Regular updates (it’s better to say “every 15 minutes” than “we’ll update you as soon as possible”).
    • A closing communication that includes:
      • what happened,
      • how long it lasted,
      • what we’ll do to prevent it from happening again.
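The "next update at a fixed time" habit is simple enough to encode. Here is a small sketch of a status message with an explicit next-update time; the wording and the 15-minute cadence are illustrative choices taken from the text, not a template from any tool.

```python
from datetime import datetime, timedelta

def status_update(summary: str, now: datetime, interval_min: int = 15) -> str:
    """Format a short, honest update that commits to a next update time."""
    next_update = now + timedelta(minutes=interval_min)
    return (
        f"We are investigating: {summary}. "
        f"Next update at {next_update:%H:%M}."
    )

print(status_update("elevated error rates on checkout",
                    datetime(2024, 12, 31, 17, 0)))
# → We are investigating: elevated error rates on checkout. Next update at 17:15.
```

Committing to a concrete time ("17:15") instead of "as soon as possible" is what keeps perceived downtime from growing faster than actual downtime.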

I’ve noticed that a well-handled incident, clearly communicated, actually increases trust.
A poorly communicated incident, even if technically resolved quickly, leaves a trail of doubt behind it.


Prevention and monitoring

The visible part of incidents is the sleepless nights.
The invisible — and truly decisive — part is the work done beforehand.

Prevention: the real “hero work” is boring

The activities that really save your holidays are rarely glamorous:

  • failover tests done for real, not just “on paper”;
  • periodic checks of:
    • updated runbooks,
    • correct phone numbers,
    • working emergency access credentials;
  • dry runs of recovery procedures in test or staging environments.

Investing time in these things before the holidays has a huge ROI in terms of “incidents that never happen”.

Monitoring: alarms yes, alarmism no

Running an HA environment also means accepting that something is always broken somewhere, in a controlled way. Monitoring should reflect this philosophy.

What I look for in a solid monitoring setup is:

  • alerts that are few but meaningful, tied to:
    • user impact,
    • SLA impact,
    • real data risk;
  • true observability, not just metrics:
    • centralized logs,
    • tracing where it makes sense,
    • clear dashboards, not walls of unreadable graphs;
  • smart silencing during planned work, without hiding important signals.
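These two rules — page only on real impact, and silence planned work without hiding important signals — can be sketched as a simple paging filter. The fields and the rule set are illustrative assumptions, not any monitoring product's API.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    user_impact: bool = False   # do users actually feel this?
    sla_impact: bool = False    # does this threaten an SLA?
    data_risk: bool = False     # could data be lost or corrupted?

def should_page(alert: Alert, maintenance_window: bool = False) -> bool:
    """Decide whether an alert wakes someone up."""
    meaningful = alert.user_impact or alert.sla_impact or alert.data_risk
    if not meaningful:
        return False  # metric noise: record it, don't page anyone
    if maintenance_window and not alert.data_risk:
        return False  # planned work: silence, but never mute data risk
    return True
```

A filter like this also makes the last point measurable: if `should_page` returns `True` and the alert is still routinely ignored, the classification is wrong — fix the monitoring, not the person.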

One indicator I’ve learned to pay attention to is this:
if the person on call systematically ignores certain types of alerts, the problem is not the person, it’s the monitoring system.


The value of human redundancy

We always talk about redundancy of servers, regions, circuits… but the most critical redundancy is people.

Avoiding “bus factor 1” disguised as “the expert”

If there’s one person who “knows everything” about a critical system, that’s a risk.
During the holidays, that risk becomes real exposure:

  • that person might be on vacation,
  • they might get sick,
  • they might simply be unreachable.

What I’ve learned to do is:

  • distribute knowledge through:
    • usable documentation (not novels buried in forgotten wikis),
    • pairing sessions on critical procedures,
    • real on-call rotation across different stacks;
  • accept that:
    • a less experienced person might need 10 minutes more,
    • but the organization as a whole will be much more resilient.

Protecting the people “keeping the lights on”

There’s also the human side: those who are on call during the holidays carry an extra load of stress (and sacrifice).

I’ve seen a clear difference when:

  • on-call duty is truly recognized (financially and culturally);
  • there’s at least a “light” backup (someone you can call if needed, without feeling guilty);
  • after a heavy incident, there’s some decompression:
    • a lighter day,
    • no pointless meetings,
    • or at least explicit recognition of the effort.

High availability doesn’t mean squeezing the same people over and over “because they know how to fix it”.
It means building a team that can hold up over time.


Practical takeaways for CTOs and sysadmins

If I had to summarize in a few concrete actions what really made a difference in running HA systems during the holidays, I’d say:

For CTOs / technical leaders

  1. Define a simple and shared incident management process

    • Who decides what.
    • How an incident is opened, managed, and closed.
    • How and when you communicate with customers.
  2. Enable your team to plan the holiday period like a project, not “just another time of year”

    • who is on call,
    • who is backup,
    • which changes are forbidden (change freeze) and with what exceptions.
  3. Invest in human redundancy

    • minimal but useful documentation,
    • knowledge sharing on critical stacks,
    • responsibility rotation.
  4. Give on-call duty the dignity it deserves

    • financial recognition,
    • cultural recognition (“you’re a central part of our reliability”),
    • post-incident support.

For sysadmins / SREs / devs on call

  1. Prepare beforehand

    • verify access,
    • test VPN, bastion, tools,
    • re-read the runbooks for the most critical areas.
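The "verify access" step above can even be partially automated as a preflight script run before the holidays start: check that the emergency paths you would need at 2 AM are reachable *now*. The hostnames and ports below are placeholders — substitute your own bastion, VPN endpoint, and tools.

```python
import socket

# Placeholder endpoints: replace with your real bastion / VPN / tooling.
CHECKS = {
    "bastion ssh":   ("bastion.example.com", 22),
    "vpn endpoint":  ("vpn.example.com", 443),
    "monitoring ui": ("grafana.example.com", 443),
}

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(checks: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Run every check and report which emergency paths are open."""
    return {name: reachable(host, port) for name, (host, port) in checks.items()}
```

A TCP connect is only a rough proxy (it won't catch expired credentials), but a red line in a preflight report the week before Christmas is far cheaper than discovering a dead bastion on the 24th.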
  2. Follow the process, not the “quick fix” instinct

    • understand first,
    • then act,
    • log what you do: it will be useful for the post-mortem (and to avoid repeating useless attempts).
  3. Communicate, even if you don’t have the solution yet

    • “we’re investigating X, next update at HH:MM” is always better than 40 minutes of silence.
  4. Treat calm as a technical asset

    • if you notice you’re clicking around “randomly”, stop,
    • breathe, re-read the latest logs, and ask yourself a simple question:
      “What do I know for sure, and what am I just assuming?”

In closing

Running HA systems during the holidays is not a test of technical heroism, it’s a test of organizational maturity.

Technologies change, stacks evolve, cloud providers multiply.
What stays constant is that:

  • calm beats panic,
  • process beats improvisation,
  • people always beat machines.

If we keep these three points at the center, we don’t just survive the holidays… we build an infrastructure – both technical and human – that’s worth far more than any uptime achieved “by luck”.