Debugging & Incident Playbook

When an LLM system breaks, the scariest part isn't the error itself -- it's the team not knowing where to look first. AI incidents are rarely single-point failures. They can involve the model, retrieval, prompt, tools, provider, queue -- or sometimes just a config change that quietly amplified bad behavior. Without a playbook, debugging feels like stumbling around in the dark.

So this page isn't about "check logs when things break." It's about how AI engineers should turn debugging and incident response into a reusable production workflow.

AI Incident Response Flow


Bottom line: classify first, then debug

The most common mistake during AI incidents is immediately staring at model output.

A more effective sequence:

  1. Identify what type of incident this is
  2. Narrow down to which layer
  3. Only then do prompt / model-level investigation

If you don't classify first, debugging gets buried in noise.


5 common types of AI incidents

| Type | Common symptoms |
| --- | --- |
| provider / infra | 401, 403, 429, 5xx, timeouts |
| quality drift | Suddenly more hallucinations, bad citations, broken formatting |
| retrieval failure | Can't find sources, citations come back empty |
| tool failure | Tool timeouts, schema mismatches, execution errors |
| cost / latency spike | Token usage spikes, excessive fallbacks, P95 blows up |

Classify the incident into one of these categories and you'll locate the problem much faster.
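The table above can be sketched as a first-pass classifier. This is a hypothetical mapping from raw error signals to the five categories; the field names (`status`, `retrieval_empty`, `tool_error`, and so on) and the latency budget are assumptions about what your monitoring emits, not a standard taxonomy.

```python
def classify_incident(signal: dict) -> str:
    """First-pass classification of an AI incident from coarse error signals.

    `signal` is an illustrative dict of whatever your monitoring surfaces;
    every key here is an assumption about your telemetry.
    """
    status = signal.get("status")
    # Hard provider/infra errors: auth, rate limits, server errors, timeouts.
    if status in (401, 403, 429) or (status is not None and status >= 500) \
            or signal.get("timeout"):
        return "provider/infra"
    if signal.get("retrieval_empty"):
        return "retrieval failure"
    if signal.get("tool_error"):
        return "tool failure"
    # Cost/latency: P95 over budget or a token spike flagged upstream.
    if signal.get("p95_latency_ms", 0) > signal.get("latency_budget_ms", 5000) \
            or signal.get("token_spike"):
        return "cost/latency spike"
    # Nothing hard-failed: treat as quality drift and go review samples.
    return "quality drift"
```

The useful property is the ordering: hard infra signals are checked before soft quality signals, so a 429 storm never gets misread as "the model got worse."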


What to check in the first round of triage

A more practical triage order:

| Question | Where to look first |
| --- | --- |
| Was there a recent config / prompt / model change? | deploy / config timeline |
| Is the error concentrated on one provider / region? | provider dashboard / route log |
| Is everything broken, or just a specific request type? | request segment / tenant / feature flag |
| Is it a deterministic error or quality drift? | logs + samples + metrics |

The worst thing during an AI incident is fixating on a single bad sample. Look at the big picture first, then zoom in.
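One way to keep that ordering honest is to encode it as data rather than tribal knowledge. A minimal sketch, with entirely illustrative wording:

```python
# Triage order encoded as data: each entry pairs the question with where to
# look first. The order itself is the point -- change history before provider
# health, provider health before per-sample inspection.
TRIAGE_ORDER = [
    ("Recent config / prompt / model change?", "deploy / config timeline"),
    ("Error concentrated on one provider / region?", "provider dashboard / route log"),
    ("Everything broken, or one request type?", "request segment / tenant / feature flag"),
    ("Deterministic error or quality drift?", "logs + samples + metrics"),
]

def triage_plan() -> list[str]:
    """Render the checklist as numbered steps, e.g. for an on-call bot."""
    return [f"{i}. {q} -> {where}"
            for i, (q, where) in enumerate(TRIAGE_ORDER, start=1)]
```

Pasting `triage_plan()` into the incident channel at the start of a response is a cheap way to stop everyone from zooming straight into one bad sample.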


A more production-like debugging layer model

| Layer | What to investigate |
| --- | --- |
| request layer | Which users, which feature, which tenant is affected |
| routing layer | Which provider / model / fallback was used |
| context layer | Is the prompt, history, or retrieval chunk abnormal? |
| execution layer | Are tools, queues, workers, timeouts, or retries failing? |
| outcome layer | Are quality, cost, latency, or schema out of control? |

With these layers, incident discussions become much more productive than "did the model get dumber?"
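The layer model can be walked mechanically over a single request trace. The sketch below checks routing, context, execution, and outcome in order (the request layer is about segmentation across many traces, so it's noted but not checked per trace); every trace field name here is an assumption about what your system records.

```python
def first_suspect_layer(trace: dict) -> str:
    """Walk the layers in order and report the first anomalous one.

    Each checker returns a falsy value when the layer looks healthy, or a
    short finding string. Field names are illustrative assumptions.
    (The request layer is a cross-trace segmentation question, so it is
    handled by dashboards, not by inspecting one trace.)
    """
    checks = [
        ("routing layer",
         lambda t: t.get("fallback_used") and "fallback provider was used"),
        ("context layer",
         lambda t: not t.get("retrieval_ids") and "no retrieval sources attached"),
        ("execution layer",
         lambda t: t.get("tool_errors") and f"{t['tool_errors']} tool error(s)"),
        ("outcome layer",
         lambda t: t.get("schema_valid") is False and "output failed schema validation"),
    ]
    for layer, check in checks:
        finding = check(trace)
        if finding:
            return f"{layer}: {finding}"
    return "no obvious layer-level finding; go review samples"
```

The payoff is that incident discussion starts from "the context layer has no retrieval sources" instead of "the answers look wrong."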


Quick actions for common incidents

| Incident | Faster action |
| --- | --- |
| 401 / 403 | Check keys, permissions, and env changes first |
| 429 | Reduce concurrency, enable backoff, check for a traffic spike |
| 5xx / timeout | Check provider health, switch to a fallback if needed |
| schema fail | Add a repair path or revert to the old prompt |
| hallucination surge | Lower temperature, tighten scope, strengthen source guards |

In incident response, "stop the bleeding first" is usually more important than "find the root cause immediately."
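For the 429 / 5xx rows specifically, "stop the bleeding" usually means backoff plus a fallback route. A minimal sketch, assuming your client raises some transient-error exception (`TransientError` here is a stand-in, and the retry budget and base delay are arbitrary):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for your client's 429 / 5xx / timeout exception."""

def call_with_backoff(call, fallback, max_retries: int = 3, base: float = 0.5):
    """Retry `call` with exponential backoff + jitter, then use `fallback`.

    `call` and `fallback` are zero-arg callables wrapping your primary and
    fallback providers; both names are illustrative.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except TransientError:
            # Exponential backoff with jitter so retries don't synchronize
            # across workers and re-trigger the rate limit.
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))
    # Primary exhausted: stop the bleeding via the fallback provider.
    return fallback()
```

Note that this deliberately does not diagnose anything; it buys time so the triage above can happen calmly.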


Tracing and logging: what's the minimum?

Debugging AI without a trace ID is almost always painful later.

Each request should at minimum be linked to:

  • trace_id
  • model / provider
  • prompt or config version
  • retry count
  • retrieval source IDs
  • tool call summary
  • latency
  • token usage

This isn't about logging sensitive content verbatim -- it's about being able to trace a bad request end-to-end through the system.
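The field list above fits in a small trace envelope. A sketch, with illustrative names throughout; the point is that content is referenced by IDs and versions, not logged verbatim:

```python
import dataclasses

@dataclasses.dataclass
class TraceRecord:
    """Minimal per-request trace envelope (field names are illustrative)."""
    trace_id: str
    model: str
    provider: str
    prompt_version: str       # prompt/config version, not the prompt text
    retry_count: int
    retrieval_ids: list       # source document IDs, not their content
    tool_calls: list          # e.g. [{"name": "search", "status": "ok"}]
    latency_ms: float
    tokens_in: int
    tokens_out: int

    def summary(self) -> str:
        """One-line form suitable for logs and incident channels."""
        return (f"{self.trace_id} {self.provider}/{self.model} "
                f"prompt={self.prompt_version} retries={self.retry_count} "
                f"lat={self.latency_ms:.0f}ms "
                f"tok={self.tokens_in}+{self.tokens_out}")
```

With an envelope like this, "trace a bad request end-to-end" becomes grepping one `trace_id` across services instead of correlating timestamps by hand.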


Quality incidents and infra incidents aren't the same thing

This is where many teams get confused.

| Infra incident | Quality incident |
| --- | --- |
| Obvious errors, timeouts, failures | Looks successful, but content is noticeably worse |
| Easier to detect with system metrics | Often only found through sample review |
| Usually look at provider / worker first | Usually look at prompt / retrieval / policy first |

If you treat a quality incident like a regular 5xx outage, you'll miss the real problem.


Runbooks shouldn't just be archived docs

A usable runbook needs at minimum:

  1. Symptoms
  2. Quick diagnostic entry points
  3. Temporary mitigation actions
  4. Root cause investigation path
  5. Rollback method
  6. Responsible person and escalation path

Without these, runbooks quickly become "postmortem reading material" instead of a tool you actually use on-call.
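One way to keep runbooks usable rather than archived is to lint them in CI. A sketch assuming runbooks are structured files (e.g. YAML parsed into dicts); the section keys mirror the six items above and are assumptions about your format:

```python
# The six sections a usable runbook must carry, per the list above.
# Key names are illustrative assumptions about your runbook schema.
REQUIRED_SECTIONS = [
    "symptoms",
    "diagnostics",        # quick diagnostic entry points
    "mitigation",         # temporary stop-the-bleeding actions
    "root_cause_path",    # root cause investigation path
    "rollback",
    "owner_escalation",   # responsible person and escalation path
]

def missing_sections(runbook: dict) -> list[str]:
    """Return the required sections that are absent or empty."""
    return [s for s in REQUIRED_SECTIONS if not runbook.get(s)]
```

Failing the build when `missing_sections` is non-empty keeps half-written runbooks from silently becoming the on-call plan.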


The most valuable part of a postmortem

AI incident postmortems shouldn't just say "issue fixed."

What they should really cover:

  • Which monitor should have alerted earlier?
  • Which bad case should be added to the eval set?
  • Which rollback switch wasn't fast enough?
  • Which guardrail could have caught it proactively?

A truly good postmortem turns incidents into future system capabilities.


Practice

Take one of your live AI features and fill in these 4 things:

  1. A one-page incident classification table
  2. A one-page triage sequence
  3. Runbooks for your 3 most common failure types
  4. A set of fields that must be in every trace

Once these are in place, the team will be much more stable when things go wrong.
