Debugging & Incident Playbook
When an LLM system breaks, the scariest part isn't the error itself -- it's the team not knowing where to look first. AI incidents are rarely single-point failures. They can involve the model, retrieval, prompt, tools, provider, queue -- or sometimes just a config change that quietly amplified bad behavior. Without a playbook, debugging feels like stumbling around in the dark.
So this page isn't about "check logs when things break." It's about how AI engineers should turn debugging and incident response into a reusable production workflow.
Bottom line: classify first, then debug
The most common mistake during AI incidents is immediately staring at model output.
A more effective sequence:
- Identify what type of incident this is
- Narrow down to which layer
- Only then do prompt / model-level investigation
If you don't classify first, debugging gets buried in noise.
5 common types of AI incidents
| Type | Common symptoms |
|---|---|
| provider / infra | 401, 403, 429, 5xx, timeout |
| quality drift | Suddenly more hallucinations, bad citations, broken formatting |
| retrieval failure | Can't find sources, citations are empty |
| tool failure | Tool timeout, schema mismatch, execution error |
| cost / latency spike | Token usage spikes, excessive fallbacks, P95 blows up |
Classify the incident into one of these categories and you'll locate the problem much faster.
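The first-pass classification can be mechanical. Here's a minimal sketch, assuming a simplified alert/error record; every field name (`status`, `timeout`, `retrieved_chunks`, etc.) is illustrative, not a real API:

```python
def classify_incident(record: dict) -> str:
    """Map a raw error/alert record to one of the five incident types."""
    status = record.get("status")
    if record.get("timeout") or status in (401, 403, 429) or (
            isinstance(status, int) and status >= 500):
        return "provider/infra"
    if record.get("tool_error"):
        return "tool failure"
    if record.get("retrieved_chunks", 1) == 0 or record.get("citations_empty"):
        return "retrieval failure"
    if record.get("token_spike") or record.get("latency_p95_breach"):
        return "cost/latency spike"
    # No hard error, no infra signal: treat as quality drift, which means
    # sample review rather than log grepping.
    return "quality drift"
```

Even a crude classifier like this is useful as the first branch in an alerting pipeline: it decides whether the on-call person opens the provider dashboard or the sample viewer.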
What to check in the first round of triage
A more practical triage order:
| Question | Where to look first |
|---|---|
| Was there a recent config / prompt / model change? | deploy / config timeline |
| Is the error concentrated on one provider / region? | provider dashboard / route log |
| Is everything broken, or just a specific request type? | request segment / tenant / feature flag |
| Is it a deterministic error or quality drift? | logs + samples + metrics |
The worst thing during an AI incident is fixating on a single bad sample. Look at the big picture first, then zoom in.
A more production-like debugging layer model
| Layer | What to investigate |
|---|---|
| request layer | Which users, which feature, which tenant is affected |
| routing layer | Which provider / model / fallback was used |
| context layer | Is the prompt, history, or retrieval chunk abnormal? |
| execution layer | Are tools, queues, workers, timeout, or retries failing? |
| outcome layer | Are quality, cost, latency, or schema out of control? |
With these layers, incident discussions become much more productive than "did the model get dumber?"
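One way to make the layers concrete is to tag every request trace with them, so an incident discussion can point at a layer instead of a vibe. A hypothetical trace record might look like this (all keys and values are illustrative):

```python
# One request, annotated by debugging layer. During an incident you diff
# records like this between "good" and "bad" requests, layer by layer.
trace = {
    "request":   {"tenant": "acme", "feature": "doc-qa", "flag": "new-ui"},
    "routing":   {"provider": "provider-a", "model": "model-x",
                  "fallback_used": False},
    "context":   {"prompt_version": "v14", "history_turns": 6,
                  "retrieved_chunks": 5},
    "execution": {"tool_calls": 1, "retries": 0, "queue_wait_ms": 120},
    "outcome":   {"latency_ms": 2300, "tokens": 1850, "schema_valid": True},
}
```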
Quick actions for common incidents
| Incident | Faster action |
|---|---|
| 401 / 403 | Check key, permissions, env changes first |
| 429 | Reduce concurrency, enable backoff, check traffic spike |
| 5xx / timeout | Check provider health, switch to fallback if needed |
| schema fail | Add a repair path or revert to old prompt |
| hallucination surge | Lower temperature, tighten scope, strengthen source guard |
In incident response, "stop the bleeding first" is usually more important than "find the root cause immediately."
Tracing and logging: what's the minimum?
Debugging AI without a trace ID is almost always painful later.
Each request should at minimum be linked to:
- trace_id
- model / provider
- prompt or config version
- retry count
- retrieval source IDs
- tool call summary
- latency
- token usage
This isn't about logging sensitive content verbatim -- it's about being able to trace a bad request end-to-end through the system.
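The field list above can be pinned down as a record type, so it can't silently drift. A sketch using a dataclass; the names are illustrative and should be mapped onto whatever tracing backend you already use:

```python
from dataclasses import dataclass, field


@dataclass
class LLMTrace:
    """Minimum per-request trace record for debugging AI incidents."""
    trace_id: str
    model: str
    provider: str
    prompt_version: str          # prompt or config version, not the prompt text
    retry_count: int = 0
    retrieval_source_ids: list = field(default_factory=list)
    tool_call_summary: str = ""  # e.g. "search: ok, calc: timeout"
    latency_ms: float = 0.0
    tokens_in: int = 0
    tokens_out: int = 0
```

Storing versions and IDs instead of raw content keeps the trace end-to-end linkable without logging sensitive text verbatim.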
Quality incidents and infra incidents aren't the same thing
This is where many teams get confused.
| Infra incident | Quality incident |
|---|---|
| Obvious errors, timeouts, failures | Looks successful, but content is noticeably worse |
| Easier to detect with system metrics | Often only found through sample review |
| Usually look at provider / worker first | Usually look at prompt / retrieval / policy first |
If you treat a quality incident like a regular 5xx outage, you'll miss the real problem.
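The distinction shows up directly in monitoring: a quality incident can have a 100% HTTP success rate. A hedged sketch of a content-level signal, using empty-citation rate over a sample window (the field names and the 2x threshold are made up for illustration):

```python
def quality_drift(samples, baseline_rate, threshold=2.0):
    """Flag drift when the empty-citation rate exceeds threshold x baseline."""
    if not samples:
        return False
    rate = sum(1 for s in samples if not s.get("citations")) / len(samples)
    return rate > baseline_rate * threshold


# Every request "succeeded" (status 200), yet 80% of answers cite nothing:
window = [{"status": 200, "citations": []} for _ in range(8)] + \
         [{"status": 200, "citations": ["doc-1"]} for _ in range(2)]
```

System metrics alone would show this window as healthy; only a content-level check (or human sample review) catches it.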
Runbooks shouldn't just be archived docs
A usable runbook needs at minimum:
- Symptoms
- Quick diagnostic entry points
- Temporary mitigation actions
- Root cause investigation path
- Rollback method
- Responsible person and escalation path
Without these, runbooks quickly become "postmortem reading material" instead of a tool you actually use on-call.
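One way to keep runbooks from rotting is to encode them as data and lint them for the six required fields in CI. A hypothetical entry for the 429 case (all content is illustrative):

```python
# The six fields every runbook entry must have, checkable in CI.
REQUIRED_FIELDS = {"symptoms", "diagnostics", "mitigation",
                   "root_cause_path", "rollback", "owner_escalation"}

runbook_429 = {
    "symptoms": "429s concentrated on one provider; retry storm in queue",
    "diagnostics": "provider dashboard, route log, concurrency metrics",
    "mitigation": "reduce concurrency, enable backoff, shift to fallback",
    "root_cause_path": "traffic spike vs quota change vs retry amplification",
    "rollback": "revert the routing/config change; re-enable previous route",
    "owner_escalation": "on-call AI platform -> provider support ticket",
}

# A missing field fails fast here instead of being discovered mid-incident.
assert REQUIRED_FIELDS <= runbook_429.keys()
```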
The most valuable part of a postmortem
AI incident postmortems shouldn't just say "issue fixed."
What they should really cover:
- Which monitor should have alerted earlier?
- Which bad case should be added to the eval set?
- Which rollback switch wasn't fast enough?
- Which guardrail could have caught it proactively?
A truly good postmortem turns incidents into future system capabilities.
Practice
Take one of your live AI features and fill in these 4 things:
- A one-page incident classification table
- A one-page triage sequence
- Runbooks for your 3 most common failure types
- A set of fields that must be in every trace
Once these are in place, the team will be much more stable when things go wrong.