AI Post-incident Platform
Overview
At a large enterprise, incident response already had a defined rhythm: PagerDuty page, Slack channel, Google Meet, triage team. The weak spot was everything after recovery. My team and I built an internal AI platform that turned the artifacts responders already produced into structured reports and searchable incident history.
Technical write-up
Read the implementation notes on ingestion, report generation, retrieval, and agent observability.
The Problem
The incident team had process, but not a durable system of record. Resolution details lived across Slack threads, call notes, and spreadsheet rows that captured only enough to close the loop.
That made repeat incidents harder than they needed to be. Engineers lost time reconstructing what had happened before. Leaders had no reliable way to review patterns without asking people to rebuild context by hand.
The bet
If incident memory could be captured from the existing workflow, responders would not need a new ritual to produce better post-incident records.
What I Built
- Automatic ingestion from PagerDuty-created Slack incident channels and Google Meet transcripts
- Structured post-incident reports with root cause, timeline, actions taken, follow-ups, and linked systems
- A searchable incident history for engineers and leadership
- A retrieval layer that let people ask questions across past incidents instead of scanning old channels
- Operational views for grouping related incidents and summarizing high-pressure windows
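To make the shape of these pieces concrete, here is a minimal sketch of a structured report record, a search over past incidents, and a grouping view. Everything in it is an assumption for illustration: the `IncidentReport` and `TimelineEntry` names, the field names, and the plain keyword scoring are hypothetical stand-ins, not the platform's actual schema or retrieval method.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime  # when the event happened
    note: str     # what a responder did or observed


@dataclass
class IncidentReport:
    # Hypothetical schema mirroring the report sections listed above.
    incident_id: str
    title: str
    root_cause: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    actions_taken: list[str] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)
    linked_systems: list[str] = field(default_factory=list)

    def searchable_text(self) -> str:
        """Flatten the text fields into one lowercase string."""
        parts = [self.title, self.root_cause]
        parts += self.actions_taken + self.follow_ups + self.linked_systems
        parts += [entry.note for entry in self.timeline]
        return " ".join(parts).lower()


def search(reports: list[IncidentReport], query: str) -> list[IncidentReport]:
    """Rank reports by how often the query terms occur as substrings.

    Keyword counting is an assumed stand-in for a real retrieval layer.
    """
    terms = query.lower().split()
    scored = []
    for report in reports:
        text = report.searchable_text()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((score, report))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [report for _, report in scored]


def group_by_system(reports: list[IncidentReport]) -> dict[str, list[IncidentReport]]:
    """Group reports by each linked system, as a simple operational view."""
    groups: dict[str, list[IncidentReport]] = defaultdict(list)
    for report in reports:
        for system in report.linked_systems:
            groups[system].append(report)
    return dict(groups)
```

Keeping the report as a typed record rather than free text is what makes both the search and the grouping view cheap to build on top of it.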
Incident volume
500+ incidents
Processed in production within the first month of launch
Workflow change
No new ritual
Ingestion plugged into the PagerDuty, Slack, and Meet paths responders already used
Usage
API requests
Engineering teams immediately requested API access for automation and remediation flows
The Tradeoff
The fastest path to adoption was not the deepest technical design. We shipped the first version as a proof of concept and kept the rollout close to the workflow teams already trusted. That was the right product call, but it deferred work on scalability hardening, agent observability, and evals.
The useful lesson was that post-incident AI only works when it respects the incident culture around it. A technically stronger system that asks responders to change behavior too early is likely to fail before anyone sees the value.
Where It Went
The platform held up through the organization's busiest operational period and kept processing incidents after launch. Leadership used it for reporting and follow-up. Engineering teams started asking for programmatic access to incident data for automation and remediation flows.
That was the point where the project stopped looking like a single internal tool and started looking like platform infrastructure for incident memory.
Technical notes
I moved the deeper architecture notes into a paired post so this page can stay focused on the product shape, adoption constraint, and outcome.
- AI
- Incident management
- Internal tools
- Developer tools