20
Evaluation & Quality Monitoring
Evaluation and quality monitoring keep your LLM features reliable. This chapter gives a practical playbook for building evals, catching regressions, and closing the feedback loop.
1) What to Measure
- Relevance: Does the answer match the user intent?
- Faithfulness: Is the answer grounded in sources (no hallucinations)?
- Format: Does it follow the required schema/JSON/table?
- Safety: No toxic/PII leakage/policy violations.
- Latency & Cost: P95 response time, tokens per call.
2) Offline Eval Sets (Golden Tests)
- Build a small, versioned dataset (e.g., 50-200 samples) with:
- Input: query + context (if RAG) + expected format.
- Labels: ideal answer, citations, expected refusal cases.
- Negative cases: empty context, conflicting docs, forbidden requests.
- Keep per-direction/task evals (FAQ, RAG Q&A, extraction, summarization).
3) Automatic Checks (Cheap Baseline)
- Format validation: JSON schema / regex / length limits.
- Citation enforcement: answer must contain provided citation IDs.
- Refusal checks: when no context, expect a refusal phrase.
- Guardrails: block unsafe categories (violence/PII) via content filters.
4) LLM-as-Judge (Quality Scoring)
- Use a judge prompt to score relevance/faithfulness/clarity on a 1-5 scale.
- Ask judge to mark hallucinations and missing citations.
- Run with a smaller, cheaper model for CI, a stronger model for nightly evals.
Sample Judge Prompt
You are an evaluator. Given:
- question
- retrieved context (with IDs)
- model answer
Rate 1-5 for: relevance, faithfulness, format compliance.
List any hallucinations or missing citations. Reply in JSON:
{
"relevance": 1-5,
"faithfulness": 1-5,
"format": 1-5,
"issues": ["..."]
}
5) Online Metrics & Instrumentation
- Log per request: model, temperature, latency, tokens, response status, retry count.
- For RAG: log top-k docs, scores, chosen citations.
- Add user feedback hooks (thumbs up/down with optional comment).
- Build dashboards: success rate, P95 latency, cost per 1k calls, top errors/429s.
6) Regression Gates
- CI gate: run offline eval set; fail if score drops > threshold or format errors increase.
- Schema gate: reject responses failing JSON schema; retry with “fix-format” prompt.
- Safety gate: block outputs violating policy; return safe fallback.
7) Hallucination & Safety Mitigations
- Prompt constraints: “answer only from provided context; if unknown, say UNKNOWN”.
- Citations mandatory for factual claims.
- Refusal logic: no context → refuse; unsafe request → safe decline.
- Rerank/verify: cross-check with a secondary model or rule-based validators.
8) A/B and Canary
- Route small traffic to new model/prompt; compare CTR/feedback/latency/cost.
- Kill switch: instant rollback to previous config if metrics degrade.
- Version configs: prompt/version ID baked into logs for traceability.
9) Red Teaming
- Adversarial prompts: prompt injection, jailbreaks, policy bypass attempts.
- Data exfiltration tests: attempts to leak secrets/PII.
- Toxicity/abuse: profanity, harassment; ensure safe replies or refusal.
10) Cadence & Ownership
- Daily/weekly: review dashboards, top failing prompts, cost anomalies.
- Monthly: refresh eval sets with fresh data; rotate judge model if needed.
- Ownership: assign an “LLM quality owner” per domain to triage regressions.
11) Minimal Toolkit
- Eval runner: script to load golden set, call model, run checks, produce JSON report.
- Schema validator: JSON schema for each task type.
- Judge prompts: stored and versioned; tag with model version.
- Dashboard: Grafana/Looker or vendor dashboard for latency/cost/success.
免费证书
立即查看 →最新免费认证与项目
快速补充简历亮点,提升竞争力。
12) Practice Tasks
- Build a 50-sample golden set for your FAQ/RAG use case with citations.
- Implement a judge prompt and JSON schema check; fail build if format score < 4 or faithfulness < 4.
- Add user feedback logging (thumbs + comment) and surface top 10 failure queries in a dashboard.