Building incident memory from Slack and Meet transcripts

Most incident AI ideas start in the wrong place. They ask responders to use a new product during the most stressful part of the workflow.

That was the constraint we tried not to violate. The teams already had a working rhythm: PagerDuty opened the incident, Slack became the coordination channel, Google Meet carried the live discussion, and people recovered the system. The weak spot was everything after recovery. The organization had activity, but not memory.

The product we built was an internal AI post-incident platform. It joined the workflow responders already used, collected the Slack thread and Meet transcript, generated a structured report, and made prior incidents searchable.

The portfolio version is here: AI post-incident platform. This post covers the technical decisions that were too implementation-heavy for the case study.

Start with the workflow, not the model

The first design constraint was cultural. Incident responders were not going to adopt a new ritual during an outage just because an AI feature existed. The platform had to fit the path they already trusted.

PagerDuty gave us the trigger. When an incident opened, PagerDuty already created or coordinated the Slack space where responders worked. Google Meet carried the live triage discussion. Those artifacts were imperfect, but they were real. They reflected what people said and did while the incident was active.

That led to a zero-friction ingestion model:

PagerDuty opened the incident.
The platform joined or listened to the incident Slack channel.
The platform collected the full thread and relevant metadata.
PagerDuty Scribe supplied the Google Meet transcript.
The platform generated a structured post-incident report.
The report became searchable and queryable after review.
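
As a rough illustration of what the collection step hands to report generation, here is a minimal sketch of an incident bundle. The class and field names (IncidentBundle, SlackMessage, TranscriptSegment) are hypothetical and do not mirror the Slack or PagerDuty API payloads; the only idea taken from the steps above is that Slack messages and transcript segments end up in one time-ordered record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class SlackMessage:
    ts: datetime
    author: str
    text: str


@dataclass
class TranscriptSegment:
    ts: datetime
    speaker: str
    text: str


@dataclass
class IncidentBundle:
    """Everything collected for one incident before report generation (hypothetical shape)."""
    incident_id: str
    title: str
    slack_messages: list[SlackMessage] = field(default_factory=list)
    transcript: list[TranscriptSegment] = field(default_factory=list)

    def chronological(self) -> list[tuple[datetime, str, str, str]]:
        """Interleave Slack and transcript entries into one time-ordered stream."""
        rows = [(m.ts, "slack", m.author, m.text) for m in self.slack_messages]
        rows += [(s.ts, "meet", s.speaker, s.text) for s in self.transcript]
        return sorted(rows, key=lambda r: r[0])


# Made-up example data, for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
bundle = IncidentBundle(
    incident_id="INC-1",
    title="Checkout latency",
    slack_messages=[
        SlackMessage(t0, "alice", "Seeing elevated 500s on checkout"),
        SlackMessage(t0 + timedelta(minutes=10), "alice", "Rollback complete"),
    ],
    transcript=[
        TranscriptSegment(t0 + timedelta(minutes=5), "bob", "Let's roll back the deploy"),
    ],
)
for ts, source, who, text in bundle.chronological():
    print(ts.isoformat(), source, who, text)
```

Merging both sources by timestamp is what lets a spoken decision from the call appear between the Slack messages around it.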

[Diagram: the PagerDuty incident, Slack incident channel, and Google Meet transcript feed the AI post-incident platform, which produces the structured incident report. The report backs the searchable incident UI, the RAG incident agent, and the cross-incident ops view used by engineers and leadership.]

The important product decision was not "use an LLM." The important decision was to capture the data where responders already created it.

Generate a report, not a summary

A summary is useful once. A structured report is useful later.

The report schema mattered because different readers needed different parts of the same incident:

Responders needed the timeline and actions taken.
Engineering managers needed root cause, follow-ups, owners, and due dates.
Leadership needed patterns across incidents.
Future responders needed links to services, repos, logs, dashboards, and prior decisions.

The platform generated sections such as root cause analysis, timeline, actions taken, unresolved questions, follow-ups, and linked systems. That structure turned unstructured conversation into something a person could review and something a retrieval system could index.
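
To make the idea of a report schema concrete, here is a minimal sketch built from the sections named above. The field names and the review_gaps helper are assumptions for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class FollowUp:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


@dataclass
class IncidentReport:
    """Hypothetical report shape; sections mirror the ones listed in the text."""
    incident_id: str
    root_cause: str
    timeline: list[str] = field(default_factory=list)
    actions_taken: list[str] = field(default_factory=list)
    unresolved_questions: list[str] = field(default_factory=list)
    follow_ups: list[FollowUp] = field(default_factory=list)
    linked_systems: list[str] = field(default_factory=list)

    def review_gaps(self) -> list[str]:
        """List follow-ups that are missing an owner or a due date."""
        gaps = []
        for fu in self.follow_ups:
            if fu.owner is None:
                gaps.append(f"follow-up without owner: {fu.description}")
            if fu.due is None:
                gaps.append(f"follow-up without due date: {fu.description}")
        return gaps


# Made-up example data, for illustration only.
report = IncidentReport(
    incident_id="INC-1",
    root_cause="Connection pool exhausted after a config change",
    follow_ups=[
        FollowUp("Add alert on pool saturation", owner="alice", due=date(2024, 2, 1)),
        FollowUp("Document rollback steps"),
    ],
)
print(report.review_gaps())
```

Typed fields are what make checks like this possible. A free-text summary gives a reviewer nothing to validate against.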

This is also where the first latency problem showed up.

Our first instinct was a single end-to-end generation step: send the incident context to a model and ask for the report. That was simple to reason about, but it hit a ceiling quickly. Long Slack threads and transcripts made generation slow, and a single large prompt made it harder to control section quality.

The better shape was section-based generation. Separate report sections could run in parallel, each with focused instructions and context. That improved latency, but it changed the system from "call a model" to "orchestrate partial views of the same incident."
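
A minimal sketch of that shape, assuming each section is generated by an independent async call. The generate_section function is a hypothetical stand-in that sleeps instead of calling a model; the section names are illustrative.

```python
import asyncio

SECTIONS = ["root_cause", "timeline", "actions_taken", "follow_ups"]


async def generate_section(name: str, context: str) -> str:
    """Hypothetical stand-in for one model call with section-specific instructions."""
    await asyncio.sleep(0.01)  # simulates model latency
    return f"[{name}] drafted from {len(context)} chars of context"


async def generate_report(contexts: dict[str, str]) -> dict[str, str]:
    """Run all section generators concurrently and collect results by section name."""
    tasks = [generate_section(name, contexts[name]) for name in SECTIONS]
    results = await asyncio.gather(*tasks)  # results come back in task order
    return dict(zip(SECTIONS, results))


report = asyncio.run(generate_report({name: "example context" for name in SECTIONS}))
print(report["timeline"])
```

With concurrent calls, total latency tracks the slowest section, not the sum of all sections. The cost is that no section sees what the others wrote.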

Parallel generation creates consistency problems

Parallelism helped speed. It also created coordination work.

Each section generator needed enough shared context to stay consistent with the rest of the report. If every worker received the full incident payload, token usage grew quickly. If each worker received only a narrow slice, sections could contradict each other or miss important facts.

The useful middle ground was to separate global incident facts from section-specific context:

Global context: incident title, severity, service, timestamps, responders, known customer impact, and source links.
Timeline context: timestamped Slack messages, transcript segments, status changes, and notable decisions.
Root cause context: evidence-bearing messages, remediation discussion, linked logs, and explicit uncertainty.
Follow-up context: action items, owners, open questions, and unresolved mitigations.

That split did not make the problem disappear. It made the failure modes easier to see. A timeline section could still miss a spoken decision from the transcript. A root cause section could overstate confidence if the source material was ambiguous. Follow-ups could duplicate the same action with slightly different wording.
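
One way to picture the split between global facts and section-specific context is the sketch below. It assumes every source item was tagged during ingestion; the tag names, GLOBAL_KEYS, and build_context are all hypothetical.

```python
GLOBAL_KEYS = ["title", "severity", "service"]

# Hypothetical tags assigned to each source item during ingestion.
SECTION_TAGS = {
    "timeline": {"status_change", "decision"},
    "root_cause": {"evidence", "remediation"},
    "follow_ups": {"action_item", "open_question"},
}


def build_context(incident: dict, items: list[dict], section: str) -> dict:
    """Combine shared global facts with only the items relevant to one section."""
    wanted = SECTION_TAGS[section]
    return {
        "global": {k: incident[k] for k in GLOBAL_KEYS},
        "items": [i for i in items if i["tag"] in wanted],
    }


# Made-up example data, for illustration only.
incident = {"title": "Checkout latency", "severity": "SEV2", "service": "checkout", "channel": "inc-1"}
items = [
    {"tag": "decision", "text": "Decided to roll back the deploy"},
    {"tag": "evidence", "text": "Pool saturation visible on dashboard"},
    {"tag": "action_item", "text": "Add alert on pool saturation"},
]
timeline_ctx = build_context(incident, items, "timeline")
print(timeline_ctx)
```

Every worker gets the same small global block, so sections agree on the basic facts, while the per-section filter keeps token usage bounded. A mis-tagged item is exactly how a timeline can miss a spoken decision.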

The engineering work became less about prompt polish and more about data boundaries, section ownership, retries, and merge behavior.
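
As one example of merge behavior, duplicated follow-ups can be collapsed with a simple similarity check before the report is assembled. This sketch uses token-set Jaccard overlap with an arbitrary 0.8 threshold; both choices are assumptions for illustration, and a real system might compare embeddings or ask a reviewer instead.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two strings have in common."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def dedupe_follow_ups(items: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first wording of each action; drop later near-duplicates."""
    kept: list[str] = []
    for item in items:
        if all(jaccard(item, k) < threshold for k in kept):
            kept.append(item)
    return kept


# Made-up example data, for illustration only.
merged = dedupe_follow_ups([
    "Add alert on queue depth",
    "add an alert on queue depth",
    "Rotate the API credentials",
])
print(merged)
```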

Retrieval is part of the product

An incident report that cannot be found is operationally dead.

We built a searchable interface for incident history and added a retrieval-augmented chatbot on top of embeddings. The goal was not to make a clever chat UI. The goal was to let engineers and leaders ask practical questions:

"Have we seen this service fail this way before?"
"Which incidents involved this dependency?"
"What follow-ups were created after the last peak traffic event?"
"Did we already try this mitigation?"

Natural-language retrieval was helpful because incident language is messy. Teams do not always use the same service names, failure modes, or acronyms. Search needed to tolerate that mess while still returning reports with enough structure to verify the answer.
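
The ranking step behind that kind of search can be sketched as cosine similarity between a query vector and report vectors. The embed function here is a deliberate stand-in: it counts words, so it will not bridge different service names or acronyms the way a learned embedding model can. The report texts are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, reports: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k reports most similar to the query."""
    q = embed(query)
    ranked = sorted(reports, key=lambda rid: cosine(q, embed(reports[rid])), reverse=True)
    return ranked[:k]


# Made-up example data, for illustration only.
reports = {
    "INC-1": "checkout service timeouts caused by database connection pool exhaustion",
    "INC-2": "expired tls certificate on the api gateway",
    "INC-3": "database failover delayed order processing",
}
hits = search("database connection pool exhaustion", reports)
print(hits)
```

Returning report ids instead of generated text keeps the answer verifiable: the reader can open the structured report behind each hit.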

For high-stakes operational windows, we also added views that grouped related incidents and surfaced periodic AI summaries. Treating every incident as isolated was the wrong model during peak load. Teams needed to see system-wide impact, related failures, and repeated symptoms.

Observability became the hard part

Once the pipeline split into ingestion, section generation, report assembly, embedding, retrieval, and chat, normal logs were not enough.

The hard questions were not just "did the job fail?" They were:

Which source artifact influenced this section?
Why did one section finish slowly?
Did a retry change the output?
Which model call produced this sentence?
Was the answer wrong because retrieval missed the document, or because generation misread it?

We evaluated durable-workflow and trace-first runtimes to make execution legible, but infrastructure constraints inside the company made clean adoption hard. The resulting pipeline worked, but debugging still required too much manual digging.

That is the part I would change first if I were building the system again. Agent orchestration needs first-class traces, retries, and step-level observability from day one. Adding that after the pipeline is already complicated is expensive.
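
A minimal sketch of what step-level tracing could record, assuming each pipeline step is a plain function call. StepTrace and run_traced are hypothetical names, not part of any workflow runtime, and a production version would also attach model metadata and source attribution.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepTrace:
    step: str
    inputs: dict
    output: Any = None
    attempts: int = 0
    errors: list[str] = field(default_factory=list)
    duration_s: float = 0.0


def run_traced(step: str, fn: Callable[..., Any], inputs: dict, max_attempts: int = 3) -> StepTrace:
    """Run one pipeline step with retries, recording what happened on every attempt."""
    trace = StepTrace(step=step, inputs=inputs)
    start = time.perf_counter()
    for _ in range(max_attempts):
        trace.attempts += 1
        try:
            trace.output = fn(**inputs)
            break
        except Exception as exc:  # record the failure, then retry
            trace.errors.append(repr(exc))
    # If every attempt failed, output stays None and errors holds each failure.
    trace.duration_s = time.perf_counter() - start
    return trace


calls = {"n": 0}


def flaky_timeline(incident_id: str) -> str:
    """Hypothetical step that times out on its first call."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("model call timed out")
    return f"timeline for {incident_id}"


trace = run_traced("timeline", flaky_timeline, {"incident_id": "INC-1"})
print(trace.attempts, trace.errors, trace.output)
```

With a record like this per step, questions such as "did a retry change the output?" become lookups, not log archaeology.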

Evals should not wait

The platform needed evals earlier than we added them.

Post-incident generation has predictable failure modes:

Missing an important timeline event.
Inventing certainty where the source material is unclear.
Dropping owners or due dates from follow-ups.
Treating speculation in Slack as confirmed root cause.
Returning a prior incident that looks related but has a different failure mode.

Human review can catch some of this, but review does not scale into a feedback system by itself. We needed a small eval set from real or sanitized incidents: source artifacts, expected report fields, retrieval questions, and examples of unacceptable answers.
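
A small eval case along those lines might look like the sketch below. The EvalCase fields and the string checks in score are assumptions chosen to match the failure modes listed above; they are not the API of any particular eval framework, and substring matching is only the cheapest possible check.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    name: str
    expected_timeline_events: list[str]
    expected_owners: list[str]
    forbidden_phrases: list[str]  # wording that claims certainty the sources do not support


def score(case: EvalCase, report_text: str) -> dict[str, bool]:
    """Run cheap string checks that map to known failure modes."""
    text = report_text.lower()
    return {
        "timeline_complete": all(e.lower() in text for e in case.expected_timeline_events),
        "owners_present": all(o.lower() in text for o in case.expected_owners),
        "no_false_certainty": not any(p.lower() in text for p in case.forbidden_phrases),
    }


# Made-up example data, for illustration only.
case = EvalCase(
    name="checkout-latency",
    expected_timeline_events=["rollback started", "rollback complete"],
    expected_owners=["alice"],
    forbidden_phrases=["confirmed root cause"],
)
generated = (
    "Timeline: rollback started at 12:05, rollback complete at 12:10. "
    "Confirmed root cause: connection pool exhaustion. Follow-up owner: Alice."
)
result = score(case, generated)
print(result)
```

Even checks this crude turn a prompt change into a pass or fail signal across a fixed set of incidents.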

I had started building AssertAI on top of the Braintrust open-source eval model, but the incident platform needed that loop much earlier. Without evals, every prompt change felt plausible until a person read enough outputs to build confidence by hand.

What I would keep

I would still start with zero behavior change. Incident response is not the place to lead with a new ritual.

I would still generate structured reports, not summaries. Root cause, timeline, actions taken, follow-ups, and linked systems gave the platform enough shape to support both human review and retrieval.

I would still treat discoverability as part of the core product. Search and the incident agent mattered as much as report generation because they turned stored reports into reusable operational memory.

What I would change

I would design section-based generation from the start. The single-call version was useful for proving the idea, but it hid the latency and consistency problems until the input size grew.

I would adopt traceable orchestration earlier. Every generation step should have inputs, outputs, retries, model metadata, timing, and source attribution attached.

I would replace prompt stuffing with dynamic context discovery. Letting an agent navigate incident data directly is a better long-term model than trying to preload every relevant detail into one prompt.

I would add evals before tuning prompts. Otherwise, the team ends up arguing from anecdotes.

The product lesson is simple: incident AI is not mainly an AI problem. It is a workflow, data quality, retrieval, and operations problem with a model in the middle.
