
How to Write a Post-Mortem / Retrospective

A practical guide to writing blameless post-mortems and retrospectives — timeline construction, root cause analysis, action items that actually get done, and a Markdown template.

By mdkit Team · 10 min read

A post-mortem is a structured review of something that didn't go as planned — an outage, a failed launch, a missed deadline, a shipped bug that reached production. A retrospective is the same structure applied to planned reviews — end of sprint, end of quarter, end of project.

Both exist for one reason: so the organization learns faster than its people leave. Without them, every new engineer rediscovers every old failure mode.

This guide covers how to run a post-mortem that produces action items that actually get done, and how to write it up so that people six months from now (including you) can learn from it.

The cultural prerequisite: blameless

Before anyone writes a template, the team has to agree on one thing: post-mortems are blameless. That means:

  • The focus is on systems, processes, and decisions — not individuals.
  • We assume people did the best they could with the information they had at the time.
  • Hindsight is not evidence. "You should have known" is not a valid criticism.
  • Nobody gets fired, demoted, or put on a PIP because of a post-mortem finding.

Blameless doesn't mean unaccountable. If someone repeatedly ignores process or acts negligently, that's a management conversation — it happens outside the post-mortem, not in it. The post-mortem stays focused on "what can we change in the system to prevent this class of incident."

If your organization can't commit to blamelessness, post-mortems will be theater. People will cover themselves, withhold information, and assign vague action items. Nothing will improve. Fix the culture first.

When to write one

Write a post-mortem for:

  • Any customer-facing incident (outage, data loss, major bug in production)
  • Any security incident (breach, exposure, unauthorized access)
  • Any missed launch or significantly missed deadline
  • Any "near miss" — something that almost went wrong and shouldn't happen again
  • Any project, at its conclusion (this is the retrospective flavor)

You do not need one for:

  • Individual bugs caught in code review or QA
  • Minor delays inside a sprint
  • Expected failures in a controlled environment (load testing, chaos engineering)

A good heuristic: if you'd want someone joining the team next year to know about it, write one.

The structure

A post-mortem has six sections:

  1. Summary — what happened, at a glance
  2. Timeline — what happened, with times
  3. Root causes — why it happened
  4. Impact — what it cost
  5. What went well, what didn't — the reflection
  6. Action items — what we'll do

The Project Post-Mortem template is a filled-in starting point. Here's a walkthrough of each section.
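As a sketch, the bare skeleton of such a template might look like this (section names as above; severity labels are placeholders — adapt them to your incident taxonomy):

```markdown
# Post-Mortem: <incident title>

**Date:** YYYY-MM-DD · **Severity:** SEV-? · **Status:** Draft / Final

## Summary

## Timeline

## Root causes

## Impact

## What went well / what didn't

## Action items
```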

1. Summary

3–5 sentences at the top. Someone reading only the summary should know:

  • What happened (the user-visible symptom)
  • When and for how long
  • Who was affected
  • What the root cause was
  • What we're doing about it

Example:

Summary: On 2026-03-15 from 14:22 to 15:47 UTC (1h 25m), the checkout API returned 500 errors for approximately 8% of requests. Root cause was an expired TLS certificate on our payment provider's staging environment, which was inadvertently used in production due to a feature flag misconfiguration. Most affected checkouts were retried successfully; the remaining customers were contacted by support. We are adding certificate expiration monitoring and tightening feature flag environment checks.

2. Timeline

A chronological list with timestamps. Times in UTC for shared context.

```markdown
## Timeline

All times UTC.

- **14:22** — Payment processor starts returning TLS errors. Not yet detected.
- **14:24** — First checkout 500 errors appear in logs.
- **14:31** — PagerDuty alert fires (checkout error rate > 5%).
- **14:33** — On-call engineer @alice acknowledges; opens incident channel.
- **14:41** — @alice identifies payment provider as upstream cause.
- **14:49** — @bob joins; notices staging endpoint in config.
- **14:52** — Feature flag rollback initiated.
- **15:14** — Rollback completes; error rate drops.
- **15:47** — Error rate back to baseline; incident closed.
```

Rules for the timeline:

  • Fact, not judgment. "Error rate dropped" — not "we quickly fixed the issue."
  • Times in UTC. If the team is distributed, local times are confusing.
  • Attribute responders by initial or handle, not full names. Keeps the focus on roles.
  • Include detection gaps. The time between the problem starting (14:22) and the alert firing (14:31) is data. Nine minutes of undetected errors is an action item candidate.
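Those gaps are easy to quantify once the timeline exists. A quick sketch, using the timestamps from the example timeline above:

```python
from datetime import datetime

# Timestamps from the example incident timeline (UTC, same day).
fmt = "%H:%M"
impact_start = datetime.strptime("14:22", fmt)  # errors begin
detected     = datetime.strptime("14:31", fmt)  # alert fires
mitigated    = datetime.strptime("15:14", fmt)  # rollback completes
resolved     = datetime.strptime("15:47", fmt)  # back to baseline

ttd = detected - impact_start   # time to detect: 9 minutes
ttm = mitigated - impact_start  # time to mitigate: 52 minutes
ttr = resolved - impact_start   # time to resolve: 1h 25m

print(f"TTD: {ttd}, TTM: {ttm}, TTR: {ttr}")
```

Each interval is an action-item candidate in its own right: TTD points at monitoring, TTM at rollback tooling, TTR at verification.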

3. Root causes

Root causes are why the incident happened, separated from the symptoms. Use the "5 Whys" technique to dig past the first layer.

Surface cause: "Feature flag pointed to staging."
Why? Config file was copied from a test environment.
Why? No CI check validates environment-specific values.
Why? The config system doesn't have environment awareness.

A good root cause section distinguishes between:

  • Trigger — the specific change or event that started the incident
  • Contributing factors — conditions that made the incident possible or worse
  • Detection gaps — why it wasn't caught sooner
  • Response gaps — why it wasn't fixed faster

Incidents almost always have multiple root causes. "The cert expired" is a cause. "Our monitoring didn't catch cert expiration" is also a cause. Both get listed.

4. Impact

Quantified, specific:

  • Customer impact: "~2,400 checkout attempts failed. ~1,900 retried successfully. ~500 abandoned and have been contacted by support."
  • Revenue impact: "Estimated $18,000 in lost or delayed transactions."
  • Internal impact: "On-call team paged for 1h 25m. Engineering time spent: ~12 person-hours across 4 engineers."
  • Data integrity: "No data loss. 3 partial orders are being manually reconciled."

Impact is not "it was bad." Impact is numbers, identities, and time.

5. What went well, what didn't

Usually two bullet lists. This is the retrospective heart of the document.

What went well:

  • Alert fired within 9 minutes of the first error.
  • On-call engineer identified the upstream cause quickly.
  • Rollback mechanism worked as designed.
  • Support was looped in within 15 minutes.

What didn't:

  • Cert expiration wasn't monitored — a preventable root cause.
  • Config file didn't fail a CI check despite pointing to staging.
  • Status page wasn't updated until 20 minutes into the incident.
  • No runbook for payment provider incidents; engineer worked from first principles.

Be specific. Every bullet should point to a concrete moment or artifact, not a general feeling.

6. Action items

Where most post-mortems fail. Action items are commitments, not wishes.

Each action item gets:

  • An owner (one person)
  • A priority (P0 / P1 / P2)
  • A due date
  • A ticket number (not a bullet in a Google doc — a tracked item in Jira, Linear, GitHub Issues, whatever you use)

Template:

```markdown
## Action items

| Owner  | Priority | Due        | Item                                                  | Ticket    |
| ------ | :------: | ---------- | ----------------------------------------------------- | --------- |
| @alice |    P0    | 2026-03-22 | Add cert expiration monitoring for all prod endpoints | INFRA-312 |
| @bob   |    P1    | 2026-04-01 | CI check for environment-specific config values       | PLAT-204  |
| @carol |    P1    | 2026-03-25 | Write payment provider incident runbook               | OPS-189   |
| @dave  |    P2    | 2026-04-15 | Reduce detection time — error rate alert threshold    | OBS-77    |
```
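Action items are most credible when the first step is small. As an illustration of what a cert-expiration monitoring item might start as, here's a minimal sketch (the endpoint list is a placeholder, and `page_oncall` is a hypothetical alert hook — real monitoring belongs in your alerting system):

```python
import socket
import ssl
from datetime import datetime, timezone

# Placeholder list -- in practice, pull endpoints from service discovery.
PROD_ENDPOINTS = ["example.com"]
WARN_DAYS = 21  # flag certs expiring within three weeks


def parse_not_after(not_after: str) -> datetime:
    """Parse the 'notAfter' field of ssl.getpeercert(),
    e.g. 'Jun  1 12:00:00 2027 GMT'."""
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)


def days_until_expiry(host: str, port: int = 443) -> float:
    """Connect to host:port and return days left on its TLS certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = parse_not_after(cert["notAfter"])
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400


# Usage (hits the network, so run from a cron job or monitoring agent):
# for host in PROD_ENDPOINTS:
#     if days_until_expiry(host) < WARN_DAYS:
#         page_oncall(host)  # hypothetical alert hook
```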

Categories of action items:

  • Prevent the same failure (primary goal)
  • Detect sooner (reduce time-to-detection)
  • Mitigate faster (reduce time-to-resolution)
  • Communicate better (internal and external)
  • Document (runbook, onboarding, decision record)

Track every action item in a dashboard. Review open items at every subsequent retrospective. An action item that ages out without being completed is a red flag — either it wasn't important, or the ownership is broken.
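The review itself can be as simple as a script over your tracker's API. A sketch with hypothetical in-memory records (the field names are assumptions, not any particular tracker's schema):

```python
from datetime import date

# Hypothetical action-item records -- in practice, fetch these from your
# issue tracker's API (Jira, Linear, GitHub Issues, ...).
action_items = [
    {"ticket": "INFRA-312", "owner": "alice", "due": date(2026, 3, 22), "done": True},
    {"ticket": "PLAT-204", "owner": "bob", "due": date(2026, 4, 1), "done": False},
    {"ticket": "OPS-189", "owner": "carol", "due": date(2026, 3, 25), "done": False},
]


def overdue(items, today):
    """Return open items past their due date -- candidates for escalation
    at the next retrospective."""
    return [i for i in items if not i["done"] and i["due"] < today]


for item in overdue(action_items, today=date(2026, 4, 2)):
    print(f"{item['ticket']} (owner: {item['owner']}) is overdue")
```

An item that shows up in this list for two retrospectives running is exactly the "aged out" red flag described above.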

Facilitation

The document structure matters, but so does the meeting.

Before the meeting:

  • The facilitator (often a service owner or team lead) drafts the timeline and a first-pass root cause analysis.
  • Share the draft 24–48 hours before the meeting so people arrive prepared.

During the meeting (60–90 minutes):

  1. Review the timeline (15 min). Anyone who was there fills gaps, corrects errors.
  2. Root cause discussion (20 min). Apply the 5 Whys. Challenge surface causes.
  3. What went well / didn't (15 min). Round-robin so everyone speaks.
  4. Action items (15 min). For each candidate, assign owner, priority, and due date in the meeting — don't defer.
  5. Wrap (5 min). Agree who publishes the final doc and by when.

Rules of engagement:

  • No laptops except the scribe. Eye contact > multitasking.
  • "I" statements about your own role. "We" statements about systems.
  • Facilitator gently interrupts blame-shaped language: "Let's focus on what the system allowed, not who pushed the button."
  • If emotions run hot, pause. Reconvene the next day if needed.

Publishing

Internal post-mortems go in a shared folder, linked from your incident management tool. Searchable, discoverable, part of onboarding.

Customer-facing post-mortems are a subset of the internal one. Include:

  • What happened (in customer terms, not internal jargon)
  • When and for how long
  • What you're doing about it
  • An apology, if warranted

Don't include: internal team names, individual handles, raw logs, speculation about specific engineers' actions. Customers want to know they're in good hands, not rubberneck your process.

Templates vs. reality

No template fits every incident. A network partition post-mortem and a missed marketing launch post-mortem have the same structure but very different content. Use the template as a skeleton, not a cage. If a section doesn't apply, omit it. If an incident needs a section the template doesn't have (e.g., "regulatory disclosure timeline" for a security incident), add it.

The template serves the goal. The goal is learning.

Running retrospectives (not incident-driven)

Retrospectives follow the same structure with lighter weighting:

  • What we did (the project or sprint, briefly)
  • What went well
  • What didn't
  • What we'd do differently
  • Action items

Common formats:

  • Start, Stop, Continue — what to start doing, stop doing, keep doing
  • Mad, Sad, Glad — what frustrated people, disappointed them, delighted them
  • Sailboat — what's moving us forward, what's holding us back

For ongoing teams, do retros every 2–4 weeks. Skipping them is tempting when things are going well — don't. The point is to notice problems early, not clean up after disasters.

Anti-patterns

Watch for these failure modes:

  • The "always on time next time" action item. Vague promises without mechanisms don't change anything. "Add CI check for X" changes things.
  • The meeting that produces no document. If it's not written down, it didn't happen.
  • The document nobody reads. Indexed, searchable, linked from onboarding — or it's wasted effort.
  • Private finger-pointing after the meeting. Symptom of insufficient blamelessness in the room.
  • The "we'll fix it later" that never happens. Every action item without an owner and due date.

Closing

The test of a good post-mortem isn't the meeting or the document. It's whether the same class of incident happens again in six months. If it does, your action items didn't stick — either they were wrong, they weren't owned, or they weren't executed.

Treat every post-mortem as an investment in the team's future speed. Incidents are going to happen. What separates mature teams from immature ones is whether they compound the lessons.

Frequently Asked Questions

What's the difference between a post-mortem and a retrospective?
Post-mortems are typically triggered by incidents (an outage, a failed launch, a data bug) and focus on what went wrong and why. Retrospectives are routine (end of sprint, end of project) and cover what went well, what didn't, and what to change. Structurally they overlap; use 'post-mortem' for incidents, 'retrospective' for planned reviews.
What does 'blameless' actually mean?
Blameless means the post-mortem focuses on systems, processes, and decisions — not individuals. It assumes people made the best choices they could with the information they had at the time. Blameless doesn't mean unaccountable. It means the question is 'what in our system allowed this to happen' rather than 'who screwed up.'
Who should attend a post-mortem?
Everyone who was directly involved in the incident (responders, decision-makers, affected service owners), plus representatives from downstream teams that felt the impact. For scheduled retrospectives, the full project team plus any key stakeholders. Keep it small enough that everyone speaks — 5–10 people is ideal.
How long after the incident should we hold the post-mortem?
Within 3–5 business days while memory is fresh, but not on the same day or the day after — people need time to cool off, sleep on it, and gather logs. If an incident affected customers, publish the summary within 2 weeks; internal deep-dives can take longer.
Why do most action items never get done?
Because they're not assigned, not prioritized, and not tracked. Every action item needs an owner (one person, not a team), a due date, and a tracking mechanism (ticket in your issue tracker, not a bullet in a doc). Review open action items in every subsequent retrospective until they close.
