29
Debugging & Incident Playbook
Debugging and incident response keep LLM features stable when things break in prod.
1) Fast Triage
- Capture request/response (scrub PII) with trace IDs, model, config version, retries.
- Classify errors: 4xx (config/auth), 429 (rate), 5xx (provider), format failures.
- Check recent deploy/config change; roll back if spiking errors.
2) Common Failures & Fixes
- Schema/format fail: tighten prompts; add format-repair retry path.
- Hallucination/citation miss: enforce “answer only from context”; reduce temperature; add rerank.
- Latency spike: provider health? queue lag? switch to backup model/region.
- Cost spike: large prompts? history not trimmed? enforce caps; enable cache.
3) Playbooks
- 401/403: key/permission expired; rotate key; verify env vars.
- 429: reduce concurrency; add jittered backoff; activate rate limiter.
- 5xx: fail-open vs fail-closed policy; fallback model; partial responses.
- Tool failures: sandbox limits, timeout; circuit-break noisy tools.
4) Observability Essentials
- Traces: per-call timing (provider, embedding, retrieval, rerank, LLM).
- Metrics: success rate, P95 latency, tokens, retries, fallback rate.
- Logs: compact, structured; avoid sensitive payloads; link to trace IDs.
5) Runbooks & Ownership
- Per-feature runbook: known errors, dashboards, rollback steps, alert thresholds.
- On-call: rotation, paging rules; severity definitions; comms template.
- Postmortems: root cause, fixes, tests added, regression checks; track action items.
6) Resilience Patterns
- Fallback hierarchy: primary/secondary models; cached answer when all fail.
- Graceful degradation: shorter answers, smaller model, or non-streaming.
- Idempotency + retries with limits; DLQ for repeated failures.
7) Readiness Tests
- Chaos drills: kill primary provider; inject latency; verify fallback/alerts.
- Load tests: check 429 behavior and queueing.
- Contract tests: prompts/models/tool schemas unchanged across deploys.
8) Minimal Checklist
- Trace IDs in every log; dashboards for errors/latency/cost.
- Playbooks for 401/403, 429, 5xx, schema fail, hallucination.
- Fallback + kill switch ready; alerting hooked to on-call.