What Being On-Call Taught Me: Lessons from the Incident Trenches

Phase 1: Detection - When It All Starts

The start of every incident is rarely glamorous. It’s often an annoying ding from PagerDuty during dinner, or a Slack ping about something “looking weird.”

I’ve relied heavily on tools like Kibana, Datadog, and Zabbix. These tools are great - when they’re tuned well. One of my earlier mistakes was misconfiguring alert thresholds, which buried us in noise. Eventually we learned to define crisp, actionable alerts, and the team stopped drowning in false positives.

Routing alerts to the right person at the right time made all the difference. If there’s one thing I wish my past self had known: alert fatigue is real, and it’s solvable.

For example: We replaced a generic “CPU usage high” alert with “CPU usage > 90% for 5 minutes during working hours,” and added a direct link to relevant dashboards. This reduced noise by around 40%.
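To make that concrete, here is a minimal Python sketch of the same rule, assuming one CPU sample per minute. A real setup would express this in the monitoring tool’s own query language, so treat the helper below as illustrative only.

    from datetime import datetime
    from typing import Sequence, Tuple

    # One (timestamp, cpu_percent) sample per minute - illustrative input format.
    Sample = Tuple[datetime, float]

    def should_alert(samples: Sequence[Sample],
                     threshold: float = 90.0,
                     window_minutes: int = 5,
                     work_start_hour: int = 9,
                     work_end_hour: int = 18) -> bool:
        """Fire only if CPU stayed above the threshold for the whole window
        and the latest sample falls within working hours."""
        if len(samples) < window_minutes:
            return False
        recent = samples[-window_minutes:]
        sustained = all(cpu > threshold for _, cpu in recent)
        latest_hour = recent[-1][0].hour
        in_working_hours = work_start_hour <= latest_hour < work_end_hour
        return sustained and in_working_hours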

Phase 2: Triage - Figuring Out What’s Actually Broken

This is where the detective hat goes on. One of my more memorable moments involved a SEV-1 incident where users couldn’t log in - during a product launch.

In those critical moments, I’ve learned to ask:

  • Who’s affected?
  • What’s still working?
  • Is data at risk?

We started using severity templates and shared triage boards to make communication fast and clear. Honestly, it took a few messy incidents for us to appreciate how crucial structured triage is. A well-run triage saves time, saves trust, and saves face.
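The templates themselves were nothing fancy. As a rough Python sketch (the field names below are illustrative, not our actual schema), a triage entry simply captured the answers to those three questions plus an owner:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TriageEntry:
        severity: str                # e.g. "SEV-1" when a core user flow is down
        affected: str                # who is impacted, stated plainly
        still_working: List[str] = field(default_factory=list)  # what is known to be unaffected
        data_at_risk: bool = False   # forces extra caution before any fix
        owner: str = ""              # the one person driving the incident

    entry = TriageEntry(
        severity="SEV-1",
        affected="All users: login fails during the product launch",
        still_working=["already-authenticated sessions", "public pages"],
        data_at_risk=False,
        owner="on-call engineer",
    )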

Phase 3: Escalation - Knowing When to Ask for Help

No one should try to be a hero alone. Some of the best outcomes we’ve had were when I escalated early, looped in a domain expert, and got things moving.

We set up a clear escalation matrix - who to call, and for what kind of issue. PagerDuty handles the routing, but what helped was defining roles during a major incident: a commander, a comms lead, and a tech lead. It brought order to chaos.
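In code form, the matrix can be as simple as a dictionary. The categories and contacts below are invented for illustration, not our real routing table:

    # Hypothetical escalation matrix: issue category -> ordered escalation chain.
    ESCALATION_MATRIX = {
        "data-loss": ["data engineering on-call", "data platform lead"],
        "auth":      ["identity team on-call", "security lead"],
        "payments":  ["payments on-call", "billing domain expert"],
    }

    # Roles we assign at the start of every major incident.
    INCIDENT_ROLES = ("incident commander", "comms lead", "tech lead")

    def escalation_chain(category: str) -> list:
        """Return who to loop in for a given issue, defaulting to the commander."""
        return ESCALATION_MATRIX.get(category, ["incident commander"])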

The best advice here? Escalate before you hit your limit.

Real case: In one incident involving billing data loss, I escalated within 10 minutes by tagging the data engineer and assigning her the ticket. She spotted a failing batch job that wasn’t visible to the frontend team.

Phase 4: Investigation - The Heart of the Incident

In the chaos of an incident, the instinct is to fix fast. But lasting solutions come from understanding. That’s why we lean on the 5 Whys - not as a rigid ritual, but as a tool to cut through complexity and uncover the root of the issue just enough to act effectively.

Take this real case: users were suddenly unable to confirm payments. No obvious errors. Just silent failures.

  • Why were payments failing?
    Because the final payment call never completed.
  • Why didn’t it complete?
    Because it was stuck waiting on a service timeout.
  • Why was there a timeout?
    Because the inventory check API wasn’t responding.
  • Why was the inventory service failing?
    Because of a malformed cache key causing a mismatch.
  • Why was the key malformed?
    Because a recent deploy changed its format - but only in one region.

By the fourth why, we already had a strong lead. That was enough to build a temporary path around the issue. We documented the fifth for the postmortem.

💡 Lean mindset: Stop when you’ve reduced uncertainty enough to act. Don’t waste time chasing the “perfect” root cause in the middle of the fire. Your goal is safe restoration, not academic clarity.

📌 Document each "Why" as you go. It gives structure to your thinking - and context for everyone else later.
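One lightweight way to do that is to capture each step as a small record while the incident is still live. Here is a Python sketch of the payment case above; the structure is illustrative, not a prescribed format:

    from dataclasses import dataclass

    @dataclass
    class WhyStep:
        question: str
        answer: str

    # The payment incident, captured as it unfolded.
    five_whys = [
        WhyStep("Why were payments failing?", "The final payment call never completed."),
        WhyStep("Why didn't it complete?", "It was stuck waiting on a service timeout."),
        WhyStep("Why was there a timeout?", "The inventory check API wasn't responding."),
        WhyStep("Why was the inventory service failing?", "A malformed cache key caused a mismatch."),
        WhyStep("Why was the key malformed?", "A recent deploy changed its format in one region."),
    ]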

Phase 5: Resolution - Bringing It Back to Life

Fixing things feels good - but only if you're careful.

Sometimes it’s a rollback. Sometimes a hotfix. Sometimes, just restarting a pod does the trick. But we’ve burned ourselves before by skipping validation. Now, we test fixes in staging first when possible, monitor the regression window, and communicate broadly once services are restored.
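One habit that helped: script the regression-window check instead of eyeballing dashboards. A rough Python sketch, where fetch_error_rate is a hypothetical helper you would wire to your own monitoring backend:

    import time

    def fetch_error_rate(service: str) -> float:
        """Hypothetical helper: query your monitoring backend for the current error rate."""
        raise NotImplementedError("wire this to Datadog, Kibana, or whatever you use")

    def watch_regression_window(service: str, minutes: int = 30, threshold: float = 0.01) -> bool:
        """Poll once a minute after a fix; return True only if the error rate stayed healthy."""
        for _ in range(minutes):
            if fetch_error_rate(service) > threshold:
                return False  # regression: the fix did not hold
            time.sleep(60)
        return True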

Nothing beats the quiet satisfaction of seeing dashboards go green again - and knowing customers can breathe easy.

For example: During a login failure incident, after hours of debugging auth flows, we found out the root cause was a misconfigured environment variable in one pod. Instead of continuing the deep dive, we restarted the pod with the last known good config. It worked instantly - and we scheduled a deeper config audit later. Sometimes, the fix is boring - and that’s okay.

Phase 6: Post-Incident Review - Where the Real Learning Happens

Once the fire is out, it’s time to learn. We schedule our Post-Incident Reviews (PIRs) within 72 hours.

And we made a rule: blameless by design, inspired by Lean management methods such as the Dantotsu framework. This means focusing on understanding the current situation in detail (“What exactly happened?”), defining what an ideal resolution looks like (“What should we aim for?”), identifying root causes (“Why did it happen?”), and deciding on clear actions (“What can we do to fix it and prevent recurrence?”).

We also consider how to verify those fixes work and how to embed the improvements sustainably in our processes. Everyone on the team is encouraged to contribute, with clear responsibilities and deadlines assigned for each action.

For example, rather than blaming an individual for a deployment error, we examined how the deployment process allowed it and improved our testing and review steps accordingly.

This method helps create psychological safety and drives continuous improvement - turning incidents into valuable learning opportunities.

We log action items, assign them, and track them. This is how we’ve made permanent improvements, not just temporary patches.
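In practice, “log, assign, track” usually lives in the ticket tracker, but even a simple structured list forces the right fields to exist. An illustrative Python sketch (the example item is hypothetical):

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ActionItem:
        description: str
        owner: str
        due: date
        verification: str   # how we will prove the fix actually works
        done: bool = False

    # Example item from a hypothetical cache-key postmortem.
    actions = [
        ActionItem(
            description="Add a pre-deploy check that rejects region-specific cache key formats",
            owner="platform team",
            due=date(2025, 8, 1),
            verification="Deploy a deliberately malformed key to staging and confirm it is blocked",
        ),
    ]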

What Makes a Great On-Call Culture?

For me, it's a mix of:

  • Psychological safety (you won’t get blamed for being human).
  • Strong documentation.
  • Readiness and repetition.
  • Open, timely communication.

Over time, I’ve come to see on-call not just as a duty - but as a driver of engineering maturity.

Final Thoughts

Incidents will happen. But chaos? That’s optional.

Being on-call taught me that reliability isn’t just about tools or alerts - it’s about people, habits, and learning. If we treat every incident as a learning opportunity, we not only fix problems - we build better teams.
