Debugging & Incident Playbook
When an LLM system breaks, the scariest part isn't the error itself -- it's the team not knowing where to look first. AI incidents are rarely single-point failures. They can involve the model, retrieval, prompt, tools, provider, queue -- or sometimes just a config change that quietly amplified bad behavior. Without a playbook, debugging feels like stumbling around in the dark.
So this page isn't about "check logs when things break." It's about how AI engineers should turn debugging and incident response into a reusable production workflow.
Bottom line: classify first, then debug
The most common mistake during AI incidents is immediately staring at model output.
A more effective sequence:
- Identify what type of incident this is
- Narrow down to which layer
- Only then do prompt / model-level investigation
If you don't classify first, debugging gets buried in noise.
5 common types of AI incidents
| Type | Common symptoms |
|---|---|
| provider / infra | 401, 403, 429, 5xx, timeout |
| quality drift | Suddenly more hallucinations, bad citations, broken formatting |
| retrieval failure | Can't find sources, citations are empty |
| tool failure | Tool timeout, schema mismatch, execution error |
| cost / latency spike | Token usage spikes, excessive fallbacks, P95 blows up |
Classify the incident into one of these categories and you'll locate the problem much faster.
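The classification table can be sketched as a first-pass triage function. This is a hedged sketch: field names like `status_code` and `empty_citations` are illustrative stand-ins, not a real schema.

```python
def classify_incident(symptom: dict) -> str:
    """Map raw symptoms to one of the five incident types above."""
    status = symptom.get("status_code")
    if symptom.get("timeout") or status in (401, 403, 429) or (
        status is not None and status >= 500
    ):
        return "provider/infra"
    if symptom.get("tool_error"):
        return "tool failure"
    if symptom.get("empty_citations"):
        return "retrieval failure"
    if symptom.get("token_spike") or symptom.get("latency_spike"):
        return "cost/latency spike"
    # Nothing errored loudly: treat as quality drift and go to sample review.
    return "quality drift"

print(classify_incident({"status_code": 429}))       # provider/infra
print(classify_incident({"empty_citations": True}))  # retrieval failure
```

Even a crude classifier like this is useful as a shared vocabulary: the on-call engineer names the bucket before anyone opens a prompt.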
What to check in the first round of triage
A more practical triage order:
| Question | Where to look first |
|---|---|
| Was there a recent config / prompt / model change? | deploy / config timeline |
| Is the error concentrated on one provider / region? | provider dashboard / route log |
| Is everything broken, or just a specific request type? | request segment / tenant / feature flag |
| Is it a deterministic error or quality drift? | logs + samples + metrics |
The worst thing during an AI incident is fixating on a single bad sample. Look at the big picture first, then zoom in.
A more production-like debugging layer model
| Layer | What to investigate |
|---|---|
| request layer | Which users, which feature, which tenant is affected |
| routing layer | Which provider / model / fallback was used |
| context layer | Is the prompt, history, or retrieval chunk abnormal? |
| execution layer | Are tools, queues, workers, timeout, or retries failing? |
| outcome layer | Are quality, cost, latency, or schema out of control? |
With these layers, incident discussions become much more productive than "did the model get dumber?"
Quick actions for common incidents
| Incident | Faster action |
|---|---|
| 401 / 403 | Check key, permissions, env changes first |
| 429 | Reduce concurrency, enable backoff, check traffic spike |
| 5xx / timeout | Check provider health, switch to fallback if needed |
| schema fail | Add a repair path or revert to old prompt |
| hallucination surge | Lower temperature, tighten scope, strengthen source guard |
In incident response, "stop the bleeding first" is usually more important than "find the root cause immediately."
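The 429 / 5xx rows above can be sketched as a stop-the-bleeding wrapper: exponential backoff with jitter, then a switch to a fallback route. `RateLimited` and the two callables are hypothetical stand-ins for your client layer, not a real provider SDK.

```python
import random
import time

class RateLimited(Exception):
    """Raised when the provider answers 429."""

def call_with_backoff(call_primary, call_fallback, max_retries=3):
    for attempt in range(max_retries):
        try:
            return call_primary()
        except RateLimited:
            # Back off exponentially with jitter to relieve the provider.
            time.sleep(2 ** attempt + random.random())
    # Still rate-limited after retries: route to the fallback provider.
    return call_fallback()
```

Note that this only stops the bleeding; the root-cause question (why did traffic spike?) still belongs in the postmortem.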
Tracing and logging: what's the minimum?
Debugging AI without a trace ID is almost always painful later.
Each request should at minimum be linked to:
- trace_id
- model / provider
- prompt or config version
- retry count
- retrieval source IDs
- tool call summary
- latency
- token usage
This isn't about logging sensitive content verbatim -- it's about being able to trace a bad request end-to-end through the system.
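One way to enforce that minimum is a per-request trace record emitted as a structured log line. The field names and values below are illustrative, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class TraceRecord:
    trace_id: str
    model: str
    provider: str
    prompt_version: str
    retry_count: int = 0
    retrieval_source_ids: list = field(default_factory=list)
    tool_call_summary: str = ""
    latency_ms: float = 0.0
    token_usage: int = 0

record = TraceRecord(
    trace_id="tr-1234",
    model="example-model",   # hypothetical names throughout
    provider="provider-a",
    prompt_version="prompt-v12",
    retry_count=1,
    retrieval_source_ids=["doc-7", "doc-9"],
    tool_call_summary="search() -> 3 hits",
    latency_ms=840.0,
    token_usage=1570,
)
print(json.dumps(asdict(record)))  # one line per request, greppable by trace_id
```

Emitting this as one JSON line per request means a single `grep trace_id` recovers the whole story: which model, which prompt version, which sources, how many retries.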
Quality incidents and infra incidents aren't the same thing
This is where many teams get confused.
| Infra incident | Quality incident |
|---|---|
| Obvious errors, timeouts, failures | Looks successful, but content is noticeably worse |
| Easier to detect with system metrics | Often only found through sample review |
| Usually look at provider / worker first | Usually look at prompt / retrieval / policy first |
If you treat a quality incident like a regular 5xx outage, you'll miss the real problem.
Runbooks shouldn't just be archived docs
A usable runbook needs at minimum:
- Symptoms
- Quick diagnostic entry points
- Temporary mitigation actions
- Root cause investigation path
- Rollback method
- Responsible person and escalation path
Without these, runbooks quickly become "postmortem reading material" instead of a tool you actually use on-call.
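One way to keep runbooks honest is to treat them as structured data that can be validated and rendered into on-call docs. The 429 entry below is illustrative content under that assumption, not a prescribed format.

```python
RUNBOOK_429 = {
    "symptoms": "429 responses rising; provider rate-limit headers present",
    "quick_checks": ["provider dashboard", "route log", "traffic graph"],
    "mitigation": ["reduce concurrency", "enable backoff", "shift to fallback"],
    "root_cause_path": "compare request volume against quota; check recent rollouts",
    "rollback": "revert the config change that raised traffic or concurrency",
    "escalation": "on-call AI platform engineer -> infra lead",
}

REQUIRED_FIELDS = {"symptoms", "quick_checks", "mitigation",
                   "root_cause_path", "rollback", "escalation"}

def is_oncall_ready(runbook: dict) -> bool:
    """A runbook is usable on-call only if every required field is filled."""
    return all(runbook.get(f) for f in REQUIRED_FIELDS)
```

A check like `is_oncall_ready` can run in CI, so a runbook with a missing rollback step fails review instead of failing at 3 a.m.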
The most valuable part of a postmortem
AI incident postmortems shouldn't just say "issue fixed."
What they should really cover:
- Which monitor should have alerted earlier?
- Which bad case should be added to the eval set?
- Which rollback switch wasn't fast enough?
- Which guardrail could have caught it proactively?
A truly good postmortem turns incidents into future system capabilities.
Practice
Take one of your live AI features and fill in these 4 things:
- A one-page incident classification table
- A one-page triage sequence
- Runbooks for your 3 most common failure types
- A set of fields that must be in every trace
Once these are in place, the team will be much more stable when things go wrong.
❓ FAQ
The most frequently searched questions on this chapter's topic
What is the correct order for debugging an AI incident?
Classify first, then debug, and only look at the model last. Three steps: (1) determine which type of incident it is (one of the 5: provider/infra, quality drift, retrieval failure, tool failure, cost/latency spike); (2) narrow down to which layer (request / routing / context / execution / outcome); (3) only then do prompt / model-level investigation. Staring at a single bad sample from the start drowns you in noise.
What should the first round of AI incident triage look at?
Four questions, from the big picture down to specifics: was there a recent config / prompt / model change (check the deploy timeline); are errors concentrated on one provider / region (check the provider dashboard); is everything broken or only one request type (slice by segment / tenant / feature flag); and is it a deterministic error or quality drift. Look at the big picture first, then zoom in, or you'll get stuck on individual cases.
How should quality incidents and infra incidents be handled differently?
They need two completely different playbooks. Infra incident: obvious errors / timeouts / failures, detected through system metrics, so look at the provider and workers first. Quality incident: looks successful but content is noticeably worse, usually only found through sample review, so look at prompt / retrieval / policy first. Treating a quality incident like a 5xx outage is a common misdiagnosis -- you'll stare at retries when the problem isn't in retries at all.
What fields should an AI request trace carry, at minimum, for effective debugging?
At least 8: trace_id, model / provider, prompt or config version, retry count, retrieval source IDs, tool call summary, latency, and token usage. The point isn't to log all sensitive content verbatim -- it's to be able to trace a bad request end-to-end through the system. Debugging AI without a trace_id is almost always painful.
What should an AI runbook contain to actually be usable during an incident?
Six things, none optional: symptoms, quick diagnostic entry points, temporary mitigation actions, a root cause investigation path, a rollback method, and the responsible person plus escalation path. Your 3 most common failure types each need their own runbook: for 401/403, check keys/permissions first; for 429, reduce concurrency and enable backoff first; for a hallucination surge, lower temperature, tighten scope, and strengthen the source guard first. A runbook can't just be an after-the-fact archive.