AI Post-incident Platform

An AI post-incident platform that turned Slack threads and Meet transcripts into structured reports and a searchable incident memory.

At a large enterprise, incident response already had a defined rhythm: PagerDuty page, Slack channel, Google Meet, triage team. The weak spot was everything after recovery. Incidents were tracked in a spreadsheet, reports were inconsistent, and the organization had no usable memory once the incident closed. My team and I built an AI post-incident platform that fit the existing workflow, generated structured reports from the artifacts teams already produced, and made past incidents discoverable when the next incident hit.

Problem

The team handling incidents had process, but not a system. Resolution details lived across Slack threads, call notes, and a spreadsheet row that usually captured only the minimum needed to close the loop. There were no consistent post-mortems, no structured record of root cause, and no easy way to see what had already been tried during a similar failure.

That gap mattered in two places. Engineers lost time re-learning the same lessons during new incidents. Leadership had no reliable way to review patterns across incidents without asking people to reconstruct context by hand. The organization had incident activity, but not incident memory.

Approach

The first design constraint was cultural, not technical. We could not start by asking incident responders to adopt a new workflow. The platform had to plug into the path they already used and prove value before it tried to change behavior.

That led to a zero-friction integration model. When PagerDuty opened an incident, it already created a Slack channel and pulled the right people in. The AI post-incident platform joined that channel automatically, collected the full thread, and paired it with the Google Meet transcript captured through PagerDuty Scribe. From there, the platform generated a structured report: root cause analysis, timeline, actions taken, follow-ups, and links to the relevant repos, services, docs, and logs.
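The report shape described above can be sketched as a small schema. The field names and the `build_report` helper here are illustrative assumptions, not the platform's actual data model; in the real system an LLM fills in the analysis, so this only stubs the structure.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineEntry:
    timestamp: str      # e.g. a Slack message ts or ISO-8601 time
    description: str

@dataclass
class IncidentReport:
    # Sections mirror the generated report: root cause, timeline,
    # actions taken, follow-ups, and links to related systems.
    incident_id: str
    root_cause: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    actions_taken: list[str] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)  # label -> URL

def build_report(incident_id: str, slack_thread: list[dict],
                 transcript: str) -> IncidentReport:
    """Assemble raw artifacts into a report skeleton (no AI here)."""
    report = IncidentReport(incident_id=incident_id,
                            root_cause="(pending analysis)")
    for msg in slack_thread:
        report.timeline.append(TimelineEntry(msg["ts"], msg["text"]))
    if transcript:
        # Hypothetical storage path for the Meet transcript.
        report.links["meet-transcript"] = f"transcripts/{incident_id}.txt"
    return report
```

Keeping the report as a fixed schema, rather than free-form prose, is what later made search and programmatic access tractable.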

The second part of the product was retrieval. A report that exists but cannot be found is still operationally dead. We built a searchable interface for incident history and added a RAG-based chatbot on top of Turbopuffer embeddings so engineers and leaders could query prior incidents in natural language instead of scanning old channels by hand.
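In production the embeddings lived in Turbopuffer, but the retrieval idea itself is nearest-neighbor search over report embeddings. A minimal in-memory sketch of that idea (the vectors and IDs are made up; the real system embeds reports with a model and queries the vector store):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float],
          store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """store: (incident_id, embedding) pairs, as a vector DB would hold them.
    Returns the k incident IDs whose reports are closest to the query."""
    scored = [(cosine(query_vec, vec), iid) for iid, vec in store]
    scored.sort(reverse=True)
    return [iid for _, iid in scored[:k]]
```

The chatbot embeds the question, retrieves the top-k incident reports this way, and feeds them to the model as context for the answer.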

For high-stakes operational windows we added views that grouped related incidents and surfaced periodic AI summaries so teams could reason about system-wide impact instead of treating each page as an isolated event.
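One way to sketch the grouping behind those views. The clustering rule here (same service, opened within a rolling window) is an assumption for illustration; the real views may group differently:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def group_related(incidents: list[tuple[str, datetime]],
                  window: timedelta = timedelta(hours=4)) -> dict:
    """Cluster incidents that hit the same service close together in time.

    incidents: (service, opened_at) pairs.
    Returns {service: [clusters]}, where each cluster is a list of open
    times near enough to summarize as one system-wide event.
    """
    by_service: dict[str, list[list[datetime]]] = defaultdict(list)
    for service, opened_at in sorted(incidents, key=lambda i: i[1]):
        clusters = by_service[service]
        if clusters and opened_at - clusters[-1][-1] <= window:
            clusters[-1].append(opened_at)   # extend the current storm
        else:
            clusters.append([opened_at])     # start a new cluster
    return dict(by_service)
```

Each cluster can then be handed to the summarizer as one unit, which is what lets teams reason about impact across pages instead of page by page.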

Hard Parts

Adoption came before process change

The product worked because it respected the incident culture that already existed. Trying to enforce a new ritual on day one would have failed. Meeting the current workflow first let the platform prove that a higher-quality incident record could appear without adding work for responders. That credibility made later process conversations possible.

Parallel AI processing created a new systems problem

Our first instinct was to treat report generation as a single end-to-end generation step. That path hit a latency wall quickly. Breaking the work into parallel pipelines over sections of the report improved speed, but it introduced a new problem: each sub-agent needed enough shared context to stay consistent without blowing up token usage or conflicting with the others. The architecture moved from "call a model" to "orchestrate partial views of the same incident."
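A minimal sketch of that fan-out. The section names and the shape of the shared context are assumptions, and the model call is stubbed; the real orchestration also handles retries and cross-section consistency:

```python
import asyncio

SECTIONS = ["root_cause", "timeline", "actions_taken", "follow_ups"]

async def generate_section(name: str, shared_context: str) -> tuple[str, str]:
    """Stand-in for one sub-agent. Each sub-agent receives the same
    compact shared context rather than the full raw thread, which keeps
    token usage bounded and the sections mutually consistent."""
    await asyncio.sleep(0)  # placeholder for the model call
    return name, f"[{name}] based on: {shared_context}"

async def generate_report(shared_context: str) -> dict[str, str]:
    # Fan out one task per report section and run them concurrently,
    # instead of one monolithic end-to-end generation.
    tasks = [generate_section(s, shared_context) for s in SECTIONS]
    return dict(await asyncio.gather(*tasks))
```

The latency win comes from the `gather`; the hard part, as described above, is deciding what goes into `shared_context` so the partial views do not drift apart.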

Observability lagged behind the agent complexity

Once the AI pipeline split into multiple steps, debugging got harder. We looked at durable workflow and trace-first runtimes to make execution and failures legible, but infrastructure constraints inside the company made a clean adoption hard. The result worked, but understanding why a given report was good, slow, or wrong still took too much manual digging.

Outcome

The clearest validation came during the organization's busiest operational period and the months after. We launched the platform as a proof of concept, before doing real scalability hardening, and it still stayed reliable under that load. It has since processed hundreds of incidents, which forced the ingestion, summarization, and retrieval flows through a real production workload instead of a staged demo.

  • Incident volume — Hundreds of incidents. Processed in production over time; the platform stayed reliable through peak load even though it launched as a proof of concept without dedicated scalability hardening.
  • Executive adoption — Leadership usage. The leadership team used the platform for reporting, follow-up, and incident pattern review.
  • Platform pull — API demand. Engineering teams asked for programmatic access to incident data for automation and remediation flows.

The product also pulled demand from both ends of the organization. Leadership started using it for reporting and trend analysis. Engineering teams asked for deeper integrations, including an API they could use for automated incident investigation and post-incident code fixes. That was the point where the project stopped looking like a single internal tool and started looking like a platform.

What I would do differently

We shipped the first version as a proof of concept, which was the right call for adoption, but it also meant some technical debt was intentional. If I were doing it again, I would keep the zero-friction rollout and add a narrow round of scalability hardening earlier. Peak load proved the platform could hold up under real pressure, but that reliability came from a simpler architecture and some luck, not from explicit capacity work.

I would also separate the "make this useful fast" decisions from the "make this easy to operate" decisions more aggressively. The product earned trust quickly; the harder part was keeping agent behavior, latency, and debugging legible as the workflow grew.

What I would do the same

I would still design for zero behavior change at the start. The fastest way to lose adoption in incident response is to ask responders to learn a new ritual in the middle of a high-stress workflow.

I would still generate a structured report, not just a summary. Root cause, timeline, actions taken, and linked systems gave the platform enough shape to support both human review and downstream retrieval.

I would still treat discoverability as a first-class feature. Search and the incident agent mattered as much as report generation because they turned stored incident data into something teams could actually reuse.

What I would do next

I would architect for parallel, section-based report generation from day one instead of starting with a single monolithic generation step and discovering the latency ceiling later.

I would adopt first-class agent orchestration with retries and distributed tracing much earlier so execution was observable by design instead of bolted on after the pipeline got complicated.

I would replace prompt stuffing with dynamic context discovery. Letting the agent navigate incident data directly is a better long-term model than trying to preload every relevant detail into the prompt.
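The contrast with prompt stuffing can be sketched as a tool registry the agent pulls from on demand. Everything here is hypothetical (the tool names, the fetchers, and the `plan` standing in for the model's tool-call decisions):

```python
# Hypothetical artifact fetchers keyed by tool name. A real agent would
# call services for these; lambdas stand in so the sketch is runnable.
TOOLS = {
    "slack_thread": lambda incident_id: f"(thread for {incident_id})",
    "transcript":   lambda incident_id: f"(transcript for {incident_id})",
    "deploy_log":   lambda incident_id: f"(deploys around {incident_id})",
}

def investigate(incident_id: str, plan: list[str]) -> dict[str, str]:
    """Fetch only the artifacts the agent decided it needs.

    'plan' stands in for the model's tool-call decisions; a real agent
    would pick each next tool based on what it has read so far, instead
    of having every artifact preloaded into one giant prompt.
    """
    context: dict[str, str] = {}
    for tool in plan:
        if tool in TOOLS:
            context[tool] = TOOLS[tool](incident_id)
    return context
```

The point of the pattern is that context grows with the investigation rather than being fixed up front, which is what keeps token usage bounded as incident data grows.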

I would add evals at the start, not near the end. I had begun building AssertAI on top of the Braintrust open-source eval model, but the platform needed that feedback loop much earlier.

I would expose a public API for incident data so other teams could build automated investigation and remediation flows on top of the same incident memory.


  • AI
  • Incident management
  • Internal tools
  • Platform engineering
Made with ❤️ in 🇨🇦 · Copyright © 2026 Valentin Prugnaud