logo
29

Debugging & Incident Playbook

⏱️ 35分钟

Debugging and incident response keep LLM features stable when things break in prod.

1) Fast Triage

  • Capture request/response (scrub PII) with trace IDs, model, config version, retries.
  • Classify errors: 4xx (config/auth), 429 (rate), 5xx (provider), format failures.
  • Check recent deploy/config change; roll back if spiking errors.

2) Common Failures & Fixes

  • Schema/format fail: tighten prompts; add format-repair retry path.
  • Hallucination/citation miss: enforce “answer only from context”; reduce temperature; add rerank.
  • Latency spike: provider health? queue lag? switch to backup model/region.
  • Cost spike: large prompts? history not trimmed? enforce caps; enable cache.

3) Playbooks

  • 401/403: key/permission expired; rotate key; verify env vars.
  • 429: reduce concurrency; add jittered backoff; activate rate limiter.
  • 5xx: fail-open vs fail-closed policy; fallback model; partial responses.
  • Tool failures: sandbox limits, timeout; circuit-break noisy tools.

4) Observability Essentials

  • Traces: per-call timing (provider, embedding, retrieval, rerank, LLM).
  • Metrics: success rate, P95 latency, tokens, retries, fallback rate.
  • Logs: compact, structured; avoid sensitive payloads; link to trace IDs.

5) Runbooks & Ownership

  • Per-feature runbook: known errors, dashboards, rollback steps, alert thresholds.
  • On-call: rotation, paging rules; severity definitions; comms template.
  • Postmortems: root cause, fixes, tests added, regression checks; track action items.

6) Resilience Patterns

  • Fallback hierarchy: primary/secondary models; cached answer when all fail.
  • Graceful degradation: shorter answers, smaller model, or non-streaming.
  • Idempotency + retries with limits; DLQ for repeated failures.

7) Readiness Tests

  • Chaos drills: kill primary provider; inject latency; verify fallback/alerts.
  • Load tests: check 429 behavior and queueing.
  • Contract tests: prompts/models/tool schemas unchanged across deploys.

8) Minimal Checklist

  • Trace IDs in every log; dashboards for errors/latency/cost.
  • Playbooks for 401/403, 429, 5xx, schema fail, hallucination.
  • Fallback + kill switch ready; alerting hooked to on-call.

📚 相关资源