Debugging & Incident Playbook

⏱️ 35 min

When an LLM system breaks, the scariest part isn't the error itself; it's the team not knowing where to look first. AI incidents are rarely single-point failures. They can involve the model, retrieval, prompt, tools, provider, or queue, and sometimes the cause is just a config change that quietly amplified bad behavior. Without a playbook, debugging feels like stumbling around in the dark.

So this page isn't about "check logs when things break." It's about how AI engineers should turn debugging and incident response into a reusable production workflow.

[Figure: AI Incident Response Flow]


Bottom line: classify first, then debug

The most common mistake during AI incidents is immediately staring at model output.

A more effective sequence:

  1. Identify what type of incident this is
  2. Narrow down to which layer
  3. Only then investigate at the prompt / model level

If you don't classify first, debugging gets buried in noise.


5 common types of AI incidents

Type | Common symptoms
provider / infra | 401, 403, 429, 5xx, timeout
quality drift | Suddenly more hallucinations, bad citations, broken formatting
retrieval failure | Can't find sources, citations are empty
tool failure | Tool timeout, schema mismatch, execution error
cost / latency spike | Token usage spikes, excessive fallbacks, P95 blows up

Classify the incident into one of these categories and you'll locate the problem much faster.
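As a rough illustration of "classify first," the table can be turned into a small lookup function. This is only a sketch: the `IncidentSignal` fields and the order in which the checks run are assumptions made for the example, not a prescribed taxonomy, and a real system would feed it from its own monitoring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentSignal:
    """Hypothetical summary of what monitoring shows right now."""
    http_status: Optional[int] = None      # dominant error status, if any
    timed_out: bool = False                # provider-side timeouts observed
    tool_error: bool = False               # tool timeout, schema mismatch, execution error
    empty_citations: bool = False          # retrieval returned no sources
    cost_or_latency_spike: bool = False    # token usage or P95 well above baseline
    quality_complaints: bool = False       # hallucinations, bad citations, broken formatting


def classify_incident(signal: IncidentSignal) -> str:
    """Map observed symptoms to one of the five incident types in the table.

    Hard failures are checked before soft ones; that ordering is an assumption.
    """
    status = signal.http_status
    hard_status = status is not None and (status in (401, 403, 429) or status >= 500)
    if signal.timed_out or hard_status:
        return "provider / infra"
    if signal.tool_error:
        return "tool failure"
    if signal.empty_citations:
        return "retrieval failure"
    if signal.cost_or_latency_spike:
        return "cost / latency spike"
    if signal.quality_complaints:
        return "quality drift"
    return "unclassified"
```

Even a crude classifier like this gives the on-call person a shared label to start from, which is the point of the table.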


What to check in the first round of triage

A more practical triage order:

Question | Where to look first
Was there a recent config / prompt / model change? | deploy / config timeline
Is the error concentrated on one provider / region? | provider dashboard / route log
Is everything broken, or just a specific request type? | request segment / tenant / feature flag
Is it a deterministic error or quality drift? | logs + samples + metrics

The worst thing during an AI incident is fixating on a single bad sample. Look at the big picture first, then zoom in.
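One way to "look at the big picture first" is to compute error rates per segment before opening any individual sample. The sketch below assumes each request log entry is a dict with a boolean `error` field plus segment keys such as `provider`, `region`, or `tenant`; those field names are hypothetical and should be swapped for whatever your logs actually record.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def error_concentration(
    requests: Iterable[Mapping[str, object]], key: str
) -> Dict[str, float]:
    """Return the error rate for each value of `key` (e.g. "provider" or "tenant")."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for req in requests:
        segment = str(req.get(key, "unknown"))
        totals[segment] += 1
        if req.get("error"):
            errors[segment] += 1
    # Every segment in `totals` has at least one request, so no division by zero.
    return {segment: errors[segment] / totals[segment] for segment in totals}
```

Running it once per key (provider, region, tenant, feature flag) shows quickly whether the failure is global or concentrated in one slice.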


A production-oriented layer model for debugging

Layer | What to investigate
request layer | Which users, which feature, which tenant is affected
routing layer | Which provider / model / fallback was used
context layer | Is the prompt, history, or retrieval chunk abnormal?
execution layer | Are tools, queues, workers, timeout, or retries failing?
outcome layer | Are quality, cost, latency, or schema out of control?

With these layers, incident discussions become much more productive than "did the model get dumber?"


Quick actions for common incidents

Incident | Quick action
401 / 403 | Check key, permissions, env changes first
429 | Reduce concurrency, enable backoff, check traffic spike
5xx / timeout | Check provider health, switch to fallback if needed
schema fail | Add a repair path or revert to old prompt
hallucination surge | Lower temperature, tighten scope, strengthen source guard

In incident response, "stop the bleeding first" is usually more important than "find the root cause immediately."
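For the 429 row, "enable backoff" usually means retrying with exponentially growing, randomized delays. The following is a minimal sketch under stated assumptions: `RateLimitError` is a hypothetical exception your own client wrapper would raise on HTTP 429, and the default delays are placeholders rather than recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical exception raised by your client wrapper on HTTP 429."""


def call_with_backoff(
    call: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimitError using exponential backoff with full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller fall back or alert
            # Cap the exponential delay, then pick a random point below it
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    # Only reached if max_retries is negative, i.e. the loop never ran.
    raise ValueError("max_retries must be >= 0")
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without waiting. Backoff alone does not fix a traffic spike, so it pairs with the concurrency reduction named in the same row.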


Tracing and logging: what's the minimum?

Debugging AI without a trace ID is almost always painful later.

Each request should at minimum be linked to:

  • trace_id
  • model / provider
  • prompt or config version
  • retry count
  • retrieval source IDs
  • tool call summary
  • latency
  • token usage

The goal isn't to log sensitive content verbatim; it's to be able to follow a bad request end-to-end through the system.
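The field list above could be captured as a single structured record emitted once per request. This is a sketch, not a required schema: the class name, field names, and units (milliseconds, separate prompt and completion token counts) are assumptions chosen for illustration.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RequestTrace:
    """Minimum per-request trace record; field names are illustrative."""
    model: str
    provider: str
    config_version: str  # prompt or config version
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    retry_count: int = 0
    retrieval_source_ids: List[str] = field(default_factory=list)
    tool_call_summary: List[str] = field(default_factory=list)  # e.g. "search:ok"
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def to_log_line(self) -> str:
        """Serialize IDs and counters only; no prompt or response text."""
        return json.dumps(asdict(self), sort_keys=True)
```

Because the record holds only identifiers, versions, and counters, it can be logged for every request without copying user content into the log stream.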


Quality incidents and infra incidents aren't the same thing

This is where many teams get confused.

Infra incident | Quality incident
Obvious errors, timeouts, failures | Looks successful, but content is noticeably worse
Easier to detect with system metrics | Often only found through sample review
Usually look at provider / worker first | Usually look at prompt / retrieval / policy first

If you treat a quality incident like a regular 5xx outage, you'll miss the real problem.


Runbooks shouldn't just be archived docs

A usable runbook needs at minimum:

  1. Symptoms
  2. Quick diagnostic entry points
  3. Temporary mitigation actions
  4. Root cause investigation path
  5. Rollback method
  6. Responsible person and escalation path

Without these, runbooks quickly become "postmortem reading material" instead of a tool you actually use on-call.
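One way to keep runbooks from drifting into archive material is to check them for the six sections automatically, for example in CI. The sketch below assumes runbooks are stored as structured records; the `Runbook` class and its field names are hypothetical stand-ins for whatever format your team uses.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Runbook:
    """The six sections listed above; every field is free text."""
    symptoms: str = ""
    diagnostic_entry_points: str = ""
    mitigation_actions: str = ""
    root_cause_path: str = ""
    rollback_method: str = ""
    owner_and_escalation: str = ""


def missing_sections(runbook: Runbook) -> List[str]:
    """Return the names of required sections that are empty or whitespace-only."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name).strip()]
```

A runbook that returns a non-empty list here is not yet ready for on-call use.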


The most valuable part of a postmortem

AI incident postmortems shouldn't just say "issue fixed."

What they should really cover:

  • Which monitor should have alerted earlier?
  • Which bad case should be added to the eval set?
  • Which rollback switch wasn't fast enough?
  • Which guardrail could have caught it proactively?

A truly good postmortem turns incidents into future system capabilities.


Practice

Take one of your live AI features and fill in these 4 things:

  1. A one-page incident classification table
  2. A one-page triage sequence
  3. Runbooks for your 3 most common failure types
  4. A set of fields that must be in every trace

Once these are in place, the team will respond far more calmly and consistently when things go wrong.

❓ FAQ

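What is the right order for debugging an AI incident?

Classify first, then debug, and only look at the model last. Three steps: (1) determine which type of incident it is (one of five: provider/infra, quality drift, retrieval failure, tool failure, cost/latency spike); (2) narrow it down to a layer (request / routing / context / execution / outcome); (3) only then investigate at the prompt / model level. If you start by staring at a single bad sample, you will get buried in noise.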

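What should the first round of AI incident triage look at?

Four questions, from broad to narrow: has there been a recent config / prompt / model change (check the deploy timeline); is the error concentrated on one provider / region (check the provider dashboard); is everything broken or only one type of request (slice by segment / tenant / feature flag); is it a deterministic error or quality drift. Look at the big picture first, then zoom in; otherwise it is easy to get stuck on individual cases.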

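How should quality incidents and infra incidents be handled differently?

They call for two entirely different approaches. Infra incident: obvious errors / timeouts / failures, detected through system metrics; look at the provider and workers first. Quality incident: looks successful but the content is noticeably worse, and is usually found only through sample review; look at prompt / retrieval / policy first. Treating a quality incident like a 5xx is a common misjudgment: you end up watching retries when the problem has nothing to do with retries.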

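Which fields should an AI request trace at minimum to make debugging workable?

At least 8: trace_id, model / provider, prompt or config version, retry count, retrieval source IDs, tool call summary, latency, token usage. The point isn't to record all the sensitive content; it's to be able to tie one bad request together across the system. Debugging AI without a trace_id is almost always painful.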

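What should an AI runbook contain to be actually usable on-call?

Six things, none optional: symptom description, quick diagnostic entry points, temporary mitigation actions, root cause investigation path, rollback method, and responsible person and escalation path. Each of the 3 most common failure types must have its own runbook: for 401/403, check key/permission first; for 429, reduce concurrency and enable backoff first; for a hallucination surge, lower temperature, tighten scope, and add a source guard first. A runbook can't be just an after-the-fact archive.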