Debugging & Incident Playbook
When an LLM system breaks, the scariest part isn't the error itself -- it's the team not knowing where to look first. AI incidents are rarely single-point failures. They can involve the model, retrieval, prompt, tools, provider, queue -- or sometimes just a config change that quietly amplified bad behavior. Without a playbook, debugging feels like stumbling around in the dark.
So this page isn't about "check logs when things break." It's about how AI engineers should turn debugging and incident response into a reusable production workflow.
Bottom line: classify first, then debug
The most common mistake during AI incidents is immediately staring at model output.
A more effective sequence:
- Identify what type of incident this is
- Narrow down to which layer
- Only then do prompt / model-level investigation
If you don't classify first, debugging gets buried in noise.
5 common types of AI incidents
| Type | Common symptoms |
|---|---|
| provider / infra | 401, 403, 429, 5xx, timeout |
| quality drift | Suddenly more hallucinations, bad citations, broken formatting |
| retrieval failure | Can't find sources, citations are empty |
| tool failure | Tool timeout, schema mismatch, execution error |
| cost / latency spike | Token usage spikes, excessive fallbacks, P95 blows up |
Classify the incident into one of these categories and you'll locate the problem much faster.
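The classification table can be sketched as a first-pass triage function. This is a hedged sketch: field names like `status_code` and `empty_citations` are illustrative stand-ins, not a real schema.

```python
def classify_incident(symptom: dict) -> str:
    """Map raw symptoms to one of the five incident types above."""
    status = symptom.get("status_code")
    if symptom.get("timeout") or status in (401, 403, 429) or (
        status is not None and status >= 500
    ):
        return "provider/infra"
    if symptom.get("tool_error"):
        return "tool failure"
    if symptom.get("empty_citations"):
        return "retrieval failure"
    if symptom.get("token_spike") or symptom.get("latency_spike"):
        return "cost/latency spike"
    # Nothing errored loudly: treat as quality drift and go to sample review.
    return "quality drift"

print(classify_incident({"status_code": 429}))       # provider/infra
print(classify_incident({"empty_citations": True}))  # retrieval failure
```

Even a crude classifier like this is useful as a shared vocabulary: the on-call engineer names the bucket before anyone opens a prompt.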
What to check in the first round of triage
A more practical triage order:
| Question | Where to look first |
|---|---|
| Was there a recent config / prompt / model change? | deploy / config timeline |
| Is the error concentrated on one provider / region? | provider dashboard / route log |
| Is everything broken, or just a specific request type? | request segment / tenant / feature flag |
| Is it a deterministic error or quality drift? | logs + samples + metrics |
The worst thing during an AI incident is fixating on a single bad sample. Look at the big picture first, then zoom in.
A more production-like debugging layer model
| Layer | What to investigate |
|---|---|
| request layer | Which users, which feature, which tenant is affected |
| routing layer | Which provider / model / fallback was used |
| context layer | Is the prompt, history, or retrieval chunk abnormal? |
| execution layer | Are tools, queues, workers, timeout, or retries failing? |
| outcome layer | Are quality, cost, latency, or schema out of control? |
With these layers, incident discussions become much more productive than "did the model get dumber?"
Quick actions for common incidents
| Incident | Faster action |
|---|---|
| 401 / 403 | Check key, permissions, env changes first |
| 429 | Reduce concurrency, enable backoff, check traffic spike |
| 5xx / timeout | Check provider health, switch to fallback if needed |
| schema fail | Add a repair path or revert to old prompt |
| hallucination surge | Lower temperature, tighten scope, strengthen source guard |
In incident response, "stop the bleeding first" is usually more important than "find the root cause immediately."
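The 429 / 5xx rows above can be sketched as a stop-the-bleeding wrapper: exponential backoff with jitter, then a switch to a fallback route. `RateLimited` and the two callables are hypothetical stand-ins for your client layer, not a real provider SDK.

```python
import random
import time

class RateLimited(Exception):
    """Raised when the provider answers 429."""

def call_with_backoff(call_primary, call_fallback, max_retries=3):
    for attempt in range(max_retries):
        try:
            return call_primary()
        except RateLimited:
            # Back off exponentially with jitter to relieve the provider.
            time.sleep(2 ** attempt + random.random())
    # Still rate-limited after retries: route to the fallback provider.
    return call_fallback()
```

Note that this only stops the bleeding; the root-cause question (why did traffic spike?) still belongs in the postmortem.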
Tracing and logging: what's the minimum?
Debugging AI without a trace ID is almost always painful later.
Each request should at minimum be linked to:
- trace_id
- model / provider
- prompt or config version
- retry count
- retrieval source IDs
- tool call summary
- latency
- token usage
This isn't about logging sensitive content verbatim -- it's about being able to trace a bad request end-to-end through the system.
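One way to enforce that minimum is a per-request trace record emitted as a structured log line. The field names and values below are illustrative, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class TraceRecord:
    trace_id: str
    model: str
    provider: str
    prompt_version: str
    retry_count: int = 0
    retrieval_source_ids: list = field(default_factory=list)
    tool_call_summary: str = ""
    latency_ms: float = 0.0
    token_usage: int = 0

record = TraceRecord(
    trace_id="tr-1234",
    model="example-model",   # hypothetical names throughout
    provider="provider-a",
    prompt_version="prompt-v12",
    retry_count=1,
    retrieval_source_ids=["doc-7", "doc-9"],
    tool_call_summary="search() -> 3 hits",
    latency_ms=840.0,
    token_usage=1570,
)
print(json.dumps(asdict(record)))  # one line per request, greppable by trace_id
```

Emitting this as one JSON line per request means a single `grep trace_id` recovers the whole story: which model, which prompt version, which sources, how many retries.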
Quality incidents and infra incidents aren't the same thing
This is where many teams get confused.
| Infra incident | Quality incident |
|---|---|
| Obvious errors, timeouts, failures | Looks successful, but content is noticeably worse |
| Easier to detect with system metrics | Often only found through sample review |
| Usually look at provider / worker first | Usually look at prompt / retrieval / policy first |
If you treat a quality incident like a regular 5xx outage, you'll miss the real problem.
Runbooks shouldn't just be archived docs
A usable runbook needs at minimum:
- Symptoms
- Quick diagnostic entry points
- Temporary mitigation actions
- Root cause investigation path
- Rollback method
- Responsible person and escalation path
Without these, runbooks quickly become "postmortem reading material" instead of a tool you actually use on-call.
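One way to keep runbooks honest is to treat them as structured data that can be validated and rendered into on-call docs. The 429 entry below is illustrative content under that assumption, not a prescribed format.

```python
RUNBOOK_429 = {
    "symptoms": "429 responses rising; provider rate-limit headers present",
    "quick_checks": ["provider dashboard", "route log", "traffic graph"],
    "mitigation": ["reduce concurrency", "enable backoff", "shift to fallback"],
    "root_cause_path": "compare request volume against quota; check recent rollouts",
    "rollback": "revert the config change that raised traffic or concurrency",
    "escalation": "on-call AI platform engineer -> infra lead",
}

REQUIRED_FIELDS = {"symptoms", "quick_checks", "mitigation",
                   "root_cause_path", "rollback", "escalation"}

def is_oncall_ready(runbook: dict) -> bool:
    """A runbook is usable on-call only if every required field is filled."""
    return all(runbook.get(f) for f in REQUIRED_FIELDS)
```

A check like `is_oncall_ready` can run in CI, so a runbook with a missing rollback step fails review instead of failing at 3 a.m.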
The most valuable part of a postmortem
AI incident postmortems shouldn't just say "issue fixed."
What they should really cover:
- Which monitor should have alerted earlier?
- Which bad case should be added to the eval set?
- Which rollback switch wasn't fast enough?
- Which guardrail could have caught it proactively?
A truly good postmortem turns incidents into future system capabilities.
Practice
Take one of your live AI features and fill in these 4 things:
- A one-page incident classification table
- A one-page triage sequence
- Runbooks for your 3 most common failure types
- A set of fields that must be in every trace
Once these are in place, the team will be much more stable when things go wrong.
❓ FAQ
The most frequently searched questions on this chapter's topic
What is the correct order for debugging an AI incident?
Classify first, then debug, and only look at the model last. Three steps: (1) determine which type of incident it is (one of the 5: provider/infra, quality drift, retrieval failure, tool failure, cost/latency spike); (2) narrow down to which layer (request / routing / context / execution / outcome); (3) only then do prompt / model-level investigation. Staring at a single bad sample from the start drowns you in noise.
What should the first round of AI incident triage look at?
Four questions, from the big picture down to specifics: was there a recent config / prompt / model change (check the deploy timeline); are errors concentrated on one provider / region (check the provider dashboard); is everything broken or only one request type (slice by segment / tenant / feature flag); and is it a deterministic error or quality drift. Look at the big picture first, then zoom in, or you'll get stuck on individual cases.
How should quality incidents and infra incidents be handled differently?
They need two completely different playbooks. Infra incident: obvious errors / timeouts / failures, detected through system metrics, so look at the provider and workers first. Quality incident: looks successful but content is noticeably worse, usually only found through sample review, so look at prompt / retrieval / policy first. Treating a quality incident like a 5xx outage is a common misdiagnosis -- you'll stare at retries when the problem isn't in retries at all.
What fields should an AI request trace carry, at minimum, for effective debugging?
At least 8: trace_id, model / provider, prompt or config version, retry count, retrieval source IDs, tool call summary, latency, and token usage. The point isn't to log all sensitive content verbatim -- it's to be able to trace a bad request end-to-end through the system. Debugging AI without a trace_id is almost always painful.
What should an AI runbook contain to actually be usable during an incident?
Six things, none optional: symptoms, quick diagnostic entry points, temporary mitigation actions, a root cause investigation path, a rollback method, and the responsible person plus escalation path. Your 3 most common failure types each need their own runbook: for 401/403, check keys/permissions first; for 429, reduce concurrency and enable backoff first; for a hallucination surge, lower temperature, tighten scope, and strengthen the source guard first. A runbook can't just be an after-the-fact archive.