AI Post-incident Platform

An AI post-incident platform that turned Slack threads and Meet transcripts into structured reports and a searchable incident memory.

At a large enterprise, incident response already had a defined rhythm: PagerDuty page, Slack channel, Google Meet, triage team. The weak spot was everything after recovery. Incidents were tracked in a spreadsheet, reports were inconsistent, and the organization had no usable memory once the incident closed. My team and I built an AI post-incident platform that fit the existing workflow, generated structured reports from the artifacts teams already produced, and made past incidents discoverable when the next incident hit.

Problem

The team handling incidents had process, but not a system. Resolution details lived across Slack threads, call notes, and a spreadsheet row that usually captured only the minimum needed to close the loop. There were no consistent post-mortems, no structured record of root cause, and no easy way to see what had already been tried during a similar failure.

That gap mattered in two places. Engineers lost time re-learning the same lessons during new incidents. Leadership had no reliable way to review patterns across incidents without asking people to reconstruct context by hand. The organization had incident activity, but not incident memory.

Approach

The first design constraint was cultural, not technical. We could not start by asking incident responders to adopt a new workflow. The platform had to plug into the path they already used and prove value before it tried to change behavior.

That led to a zero-friction integration model. When PagerDuty opened an incident, it already created a Slack channel and pulled the right people in. The AI post-incident platform joined that channel automatically, collected the full thread, and paired it with the Google Meet transcript captured through PagerDuty Scribe. From there, the platform generated a structured report: root cause analysis, timeline, actions taken, follow-ups, and links to the relevant repos, services, docs, and logs.
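The report shape described above can be sketched as a small schema. The field names and the `build_report` helper here are illustrative assumptions, not the platform's actual data model; in the real system an LLM fills in the analysis, so this only stubs the structure.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineEntry:
    timestamp: str      # e.g. a Slack message ts or ISO-8601 time
    description: str

@dataclass
class IncidentReport:
    # Sections mirror the generated report: root cause, timeline,
    # actions taken, follow-ups, and links to related systems.
    incident_id: str
    root_cause: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    actions_taken: list[str] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)  # label -> URL

def build_report(incident_id: str, slack_thread: list[dict],
                 transcript: str) -> IncidentReport:
    """Assemble raw artifacts into a report skeleton (no AI here)."""
    report = IncidentReport(incident_id=incident_id,
                            root_cause="(pending analysis)")
    for msg in slack_thread:
        report.timeline.append(TimelineEntry(msg["ts"], msg["text"]))
    if transcript:
        # Hypothetical storage path for the Meet transcript.
        report.links["meet-transcript"] = f"transcripts/{incident_id}.txt"
    return report
```

Keeping the report as a fixed schema, rather than free-form prose, is what later made search and programmatic access tractable.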

The second part of the product was retrieval. A report that exists but cannot be found is still operationally dead. We built a searchable interface for incident history and added a RAG-based chatbot on top of Turbopuffer embeddings so engineers and leaders could query prior incidents in natural language instead of scanning old channels by hand.
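In production the embeddings lived in Turbopuffer, but the retrieval idea itself is nearest-neighbor search over report embeddings. A minimal in-memory sketch of that idea (the vectors and IDs are made up; the real system embeds reports with a model and queries the vector store):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float],
          store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """store: (incident_id, embedding) pairs, as a vector DB would hold them.
    Returns the k incident IDs whose reports are closest to the query."""
    scored = [(cosine(query_vec, vec), iid) for iid, vec in store]
    scored.sort(reverse=True)
    return [iid for _, iid in scored[:k]]
```

The chatbot embeds the question, retrieves the top-k incident reports this way, and feeds them to the model as context for the answer.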

For high-stakes operational windows we added views that grouped related incidents and surfaced periodic AI summaries so teams could reason about system-wide impact instead of treating each page as an isolated event.
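One way to sketch the grouping behind those views. The clustering rule here (same service, opened within a rolling window) is an assumption for illustration; the real views may group differently:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def group_related(incidents: list[tuple[str, datetime]],
                  window: timedelta = timedelta(hours=4)) -> dict:
    """Cluster incidents that hit the same service close together in time.

    incidents: (service, opened_at) pairs.
    Returns {service: [clusters]}, where each cluster is a list of open
    times near enough to summarize as one system-wide event.
    """
    by_service: dict[str, list[list[datetime]]] = defaultdict(list)
    for service, opened_at in sorted(incidents, key=lambda i: i[1]):
        clusters = by_service[service]
        if clusters and opened_at - clusters[-1][-1] <= window:
            clusters[-1].append(opened_at)   # extend the current storm
        else:
            clusters.append([opened_at])     # start a new cluster
    return dict(by_service)
```

Each cluster can then be handed to the summarizer as one unit, which is what lets teams reason about impact across pages instead of page by page.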

Hard Parts

Adoption came before process change

The product worked because it respected the incident culture that already existed. Trying to enforce a new ritual on day one would have failed. Meeting the current workflow first let the platform prove that a higher-quality incident record could appear without adding work for responders. That credibility made later process conversations possible.

Parallel AI processing created a new systems problem

Our first instinct was to treat report generation as a single end-to-end generation step. That path hit a latency wall quickly. Breaking the work into parallel pipelines over sections of the report improved speed, but it introduced a new problem: each sub-agent needed enough shared context to stay consistent without blowing up token usage or conflicting with the others. The architecture moved from "call a model" to "orchestrate partial views of the same incident."
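A minimal sketch of that fan-out. The section names and the shape of the shared context are assumptions, and the model call is stubbed; the real orchestration also handles retries and cross-section consistency:

```python
import asyncio

SECTIONS = ["root_cause", "timeline", "actions_taken", "follow_ups"]

async def generate_section(name: str, shared_context: str) -> tuple[str, str]:
    """Stand-in for one sub-agent. Each sub-agent receives the same
    compact shared context rather than the full raw thread, which keeps
    token usage bounded and the sections mutually consistent."""
    await asyncio.sleep(0)  # placeholder for the model call
    return name, f"[{name}] based on: {shared_context}"

async def generate_report(shared_context: str) -> dict[str, str]:
    # Fan out one task per report section and run them concurrently,
    # instead of one monolithic end-to-end generation.
    tasks = [generate_section(s, shared_context) for s in SECTIONS]
    return dict(await asyncio.gather(*tasks))
```

The latency win comes from the `gather`; the hard part, as described above, is deciding what goes into `shared_context` so the partial views do not drift apart.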

Observability lagged behind the agent complexity

Once the AI pipeline split into multiple steps, debugging got harder. We looked at durable workflow and trace-first runtimes to make execution and failures legible, but infrastructure constraints inside the company made a clean adoption hard. The result worked, but understanding why a given report was good, slow, or wrong still took too much manual digging.

Outcome

The clearest validation came during the organization's busiest operational period and the months after. We launched the platform as a proof of concept, before doing real scalability hardening, and it still stayed reliable under that load. It has since processed hundreds of incidents, which forced the ingestion, summarization, and retrieval flows through a real production workload instead of a staged demo.

  • Incident volume — Hundreds of incidents. Processed in production over time; the platform stayed reliable through peak load even though it launched as a proof of concept without dedicated scalability hardening.
  • Executive adoption — Leadership usage. The leadership team used the platform for reporting, follow-up, and incident pattern review.
  • Platform pull — API demand. Engineering teams asked for programmatic access to incident data for automation and remediation flows.

The product also pulled demand from both ends of the organization. Leadership started using it for reporting and trend analysis. Engineering teams asked for deeper integrations, including an API they could use for automated incident investigation and post-incident code fixes. That was the point where the project stopped looking like a single internal tool and started looking like a platform.

What I would do differently

We shipped the first version as a proof of concept, which was the right call for adoption, but it also meant some technical debt was intentional. If I were doing it again, I would keep the zero-friction rollout and add a narrow round of scalability hardening earlier. Peak load proved the platform could hold up under real pressure, but that reliability came from a simpler architecture and some luck, not from explicit capacity work.

I would also separate the "make this useful fast" decisions from the "make this easy to operate" decisions more aggressively. The product earned trust quickly; the harder part was keeping agent behavior, latency, and debugging legible as the workflow grew.

What I would do the same

I would still design for zero behavior change at the start. The fastest way to lose adoption in incident response is to ask responders to learn a new ritual in the middle of a high-stress workflow.

I would still generate a structured report, not just a summary. Root cause, timeline, actions taken, and linked systems gave the platform enough shape to support both human review and downstream retrieval.

I would still treat discoverability as a first-class feature. Search and the incident agent mattered as much as report generation because they turned stored incident data into something teams could actually reuse.

What I would do next

I would architect for parallel, section-based report generation from day one instead of starting with a single monolithic generation step and discovering the latency ceiling later.

I would adopt first-class agent orchestration with retries and distributed tracing much earlier so execution was observable by design instead of bolted on after the pipeline got complicated.

I would replace prompt stuffing with dynamic context discovery. Letting the agent navigate incident data directly is a better long-term model than trying to preload every relevant detail into the prompt.
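The contrast with prompt stuffing can be sketched as a tool registry the agent pulls from on demand. Everything here is hypothetical (the tool names, the fetchers, and the `plan` standing in for the model's tool-call decisions):

```python
# Hypothetical artifact fetchers keyed by tool name. A real agent would
# call services for these; lambdas stand in so the sketch is runnable.
TOOLS = {
    "slack_thread": lambda incident_id: f"(thread for {incident_id})",
    "transcript":   lambda incident_id: f"(transcript for {incident_id})",
    "deploy_log":   lambda incident_id: f"(deploys around {incident_id})",
}

def investigate(incident_id: str, plan: list[str]) -> dict[str, str]:
    """Fetch only the artifacts the agent decided it needs.

    'plan' stands in for the model's tool-call decisions; a real agent
    would pick each next tool based on what it has read so far, instead
    of having every artifact preloaded into one giant prompt.
    """
    context: dict[str, str] = {}
    for tool in plan:
        if tool in TOOLS:
            context[tool] = TOOLS[tool](incident_id)
    return context
```

The point of the pattern is that context grows with the investigation rather than being fixed up front, which is what keeps token usage bounded as incident data grows.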

I would add evals at the start, not near the end. I had begun building AssertAI on top of the Braintrust open-source eval model, but the platform needed that feedback loop much earlier.

I would expose a public API for incident data so other teams could build automated investigation and remediation flows on top of the same incident memory.


  • AI
  • Incident management
  • Internal tools
  • Platform engineering
Made with ❤️ in 🇨🇦 · Copyright © 2026 Valentin Prugnaud