Blameless postmortem by design: In praise of the Five Whys
If you’ve ever been part of a post-incident review (postmortem), you’ve probably heard that postmortems should be blameless. And if you’ve grown up in an environment where fault-finding is the norm, embracing blamelessness can feel awkward, even unnatural.
At the core of a blameless postmortem is a simple idea: the human is not the root cause. We build complex, distributed systems, and we must expect them to be tolerant of human mistakes. Systems should be designed to absorb, detect, and recover from failures because mistakes are inevitable.
The First Cause Isn’t Good Enough
Root cause analysis is meant to uncover the underlying reasons for an incident, and it often involves diving into logs and metrics. But simply identifying the trigger is not enough: it may tell us what happened, but not why our systems allowed it to happen in the first place. Too often, postmortems stop short. Consider the following:
A developer updated the production configuration, introducing an error in the cache key format. This caused cache lookups to fail, leading to widespread timeouts.
This might sound like a sufficient explanation. After all, it tells us what happened. However, it subtly frames the incident as a human mistake. Even without naming anyone, we walk away thinking: This was someone’s fault.
But that’s not systemic thinking. That’s symptomatic thinking. Pointing fingers—even indirectly—doesn’t help systems improve. Worse, it erodes psychological safety, one of the most important traits of high-performing teams.
Peeling Back the Layers
To go deeper, and to stay truly blameless, we need an approach that assists us: the Five Whys. Originally developed at Toyota as part of lean manufacturing, this deceptively simple technique just asks why, again and again, to uncover the different stages at which the system failed us.
Here’s how it might look if we applied the Five Whys to the incident above:
1. Why did the site not work for users?
   Because the backend took longer to respond.
2. Why did the backend take longer to respond?
   Because requests to the backend timed out.
3. Why did requests time out?
   Because request latency increased.
4. Why did request latency increase?
   Because the cache lookup performed retries.
5. Why did the cache lookup perform retries?
   Because the cache key was constructed incorrectly.
6. Why was the key incorrect?
   Because a configuration change introduced an error in the key format.
7. Why could the configuration change go out to production?
   Because CI validation does not have a step to check the configuration.
You might notice this example goes beyond five whys. The Five Whys is a starting point, not a hard limit. You should feel encouraged to keep going as long as the answers reveal meaningful insights.
Each of these whys peels back another layer. And the further you go, the less it becomes about people and the more it becomes about the system and how it behaves.
Algorithmic Objectivity
One reason the Five Whys works so well is that it’s almost algorithmic: you’re just following the next logical question. No judgment. No assumptions. Just structured curiosity.
It doesn’t have to follow a single straight path, either. The questions often branch into multiple lines of inquiry, especially when a question has more than one answer.
Why was the cache key wrong?
- Configuration change introduced an error.
- Unit tests didn’t cover the use of the configuration.
- The key format wasn’t documented.
Why did latency spike instead of failing fast?
- Retry logic kept requests alive too long.
- Retry mechanism doesn’t distinguish between transient and fatal errors.
- There are no fallback mechanisms to handle errors gracefully.
Following these branches reveals blind spots, undifferentiated system behavior, and weak assumptions. It exposes where our systems silently rely on humans to do the right thing.
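The retry branch, in particular, maps directly onto code. Below is a minimal sketch in Python, assuming a hypothetical cache client, of a lookup wrapper that retries transient errors but fails fast on fatal ones. The exception classes, the `fetch` callable, and the retry budget are illustrative placeholders rather than the actual system from the incident.

```python
import time

# Hypothetical error categories; a real cache client exposes its own exceptions.
class TransientCacheError(Exception):
    """Timeouts, connection resets: a retry may plausibly succeed."""

class FatalCacheError(Exception):
    """Malformed keys, auth failures: a retry only adds latency."""

def cache_get(fetch, key, max_retries=2, backoff_seconds=0.05):
    """Fetch a value, retrying only errors that can get better on their own."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(key)
        except FatalCacheError:
            # Fail fast: a malformed key will never succeed on retry,
            # so surface the error instead of inflating request latency.
            raise
        except TransientCacheError:
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds * (2 ** attempt))  # simple exponential backoff
```

A wrapper along these lines would have turned the incident’s slow, retry-amplified failure into an immediate, visible one.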
From Investigation to Action Items
The Five Whys doesn’t stop at understanding. Each answer opens a door to action: what can we do at this layer of the system to reduce the risk, or to mitigate the impact the next time?
In our example, we might derive the following action items:
- Add CI validation for production configurations (a minimal sketch follows this list).
- Write unit tests to verify correct cache key construction.
- Differentiate between transient and fatal errors to decide retry behavior.
- Handle fatal errors gracefully with fallbacks or fail-fast logic.
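As a sketch of the first action item, the CI step can be as small as a script that loads the production configuration and rejects a malformed cache key template before it ships. The file layout, the `cache_key_format` field, and the required placeholders below are hypothetical; the point is that the check runs on every change, not that this is the real schema.

```python
import json
import re
import sys

# Hypothetical placeholders the key template must contain.
REQUIRED_PLACEHOLDERS = {"{tenant}", "{object_id}"}

def validate_cache_key_format(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config is valid."""
    problems = []
    fmt = config.get("cache_key_format")
    if not fmt:
        return ["cache_key_format is missing"]
    placeholders = set(re.findall(r"\{[^{}]+\}", fmt))
    missing = REQUIRED_PLACEHOLDERS - placeholders
    if missing:
        problems.append(f"cache_key_format is missing placeholders: {sorted(missing)}")
    if re.search(r"\s", fmt):
        problems.append("cache_key_format must not contain whitespace")
    return problems

if __name__ == "__main__":
    # In CI: python validate_config.py production.json
    with open(sys.argv[1]) as f:
        problems = validate_cache_key_format(json.load(f))
    for problem in problems:
        print(f"config error: {problem}")
    sys.exit(1 if problems else 0)
```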
Suddenly, we’re not talking about a one-off mistake. We’re rethinking the system design to make entire classes of errors less likely—or at least easier to detect and recover from.
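The second action item is just as lightweight. A unit test can pin down the key format so that a future change to it fails in review rather than in production; the `build_cache_key` helper and its expected format here are hypothetical stand-ins for whatever the real service uses.

```python
import unittest

def build_cache_key(tenant: str, object_id: str) -> str:
    # Hypothetical helper; the real one would live in the service's cache module.
    if not tenant or not object_id or " " in tenant or " " in object_id:
        raise ValueError("cache key parts must be non-empty and whitespace-free")
    return f"cache:{tenant}:{object_id}"

class CacheKeyFormatTest(unittest.TestCase):
    def test_key_matches_expected_format(self):
        self.assertEqual(build_cache_key("acme", "42"), "cache:acme:42")

    def test_rejects_malformed_parts(self):
        with self.assertRaises(ValueError):
            build_cache_key("acme corp", "42")

if __name__ == "__main__":
    unittest.main()
```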
Designed for Empathy
It’s easy to ask: Shouldn’t someone have validated the key format before deploying? Shouldn’t someone have caught the configuration error? Maybe—but be careful. Even well-intentioned expectations can turn into assumptions—and assumptions are dangerous because people may be dealing with things we can’t see—illness, stress, or personal crisis.
If something can be automated in theory, it’s worth exploring in practice. Even if the complete solution isn’t possible today, closing that gap is progress. Every step toward automation is a step away from human-dependent workflows. More importantly, engaging in that process makes you a better engineer—it pushes you to think creatively, design defensively, and invent novel solutions.
Instead of asking, “Why did this happen?”, ask: “How can we build the system to catch this at different layers?” Automation allows people to focus on the challenging aspects rather than the error-prone ones. It’s a way to invest empathy into infrastructure.