Evaluation & Quality Monitoring
Evaluation and quality monitoring keep your LLM features reliable. This chapter gives a practical playbook for building evals, catching regressions, and closing the feedback loop.
1) What to Measure
- Relevance: Does the answer match the user intent?
- Faithfulness: Is the answer grounded in sources (no hallucinations)?
- Format: Does it follow the required schema/JSON/table?
- Safety: No toxic/PII leakage/policy violations.
- Latency & Cost: P95 response time, tokens per call.
2) Offline Eval Sets (Golden Tests)
- Build a small, versioned dataset (e.g., 50-200 samples) with:
- Input: query + context (if RAG) + expected format.
- Labels: ideal answer, citations, expected refusal cases.
- Negative cases: empty context, conflicting docs, forbidden requests.
- Keep per-direction/task evals (FAQ, RAG Q&A, extraction, summarization).
3) Automatic Checks (Cheap Baseline)
- Format validation: JSON schema / regex / length limits.
- Citation enforcement: answer must contain provided citation IDs.
- Refusal checks: when no context, expect a refusal phrase.
- Guardrails: block unsafe categories (violence/PII) via content filters.
4) LLM-as-Judge (Quality Scoring)
- Use a judge prompt to score relevance/faithfulness/clarity on a 1-5 scale.
- Ask judge to mark hallucinations and missing citations.
- Run with a smaller, cheaper model for CI, a stronger model for nightly evals.
Sample Judge Prompt
You are an evaluator. Given:
- question
- retrieved context (with IDs)
- model answer
Rate 1-5 for: relevance, faithfulness, format compliance.
List any hallucinations or missing citations. Reply in JSON:
{
"relevance": 1-5,
"faithfulness": 1-5,
"format": 1-5,
"issues": ["..."]
}
5) Online Metrics & Instrumentation
- Log per request: model, temperature, latency, tokens, response status, retry count.
- For RAG: log top-k docs, scores, chosen citations.
- Add user feedback hooks (thumbs up/down with optional comment).
- Build dashboards: success rate, P95 latency, cost per 1k calls, top errors/429s.
6) Regression Gates
- CI gate: run offline eval set; fail if score drops > threshold or format errors increase.
- Schema gate: reject responses failing JSON schema; retry with "fix-format" prompt.
- Safety gate: block outputs violating policy; return safe fallback.
7) Hallucination & Safety Mitigations
- Prompt constraints: "answer only from provided context; if unknown, say UNKNOWN".
- Citations mandatory for factual claims.
- Refusal logic: no context -> refuse; unsafe request -> safe decline.
- Rerank/verify: cross-check with a secondary model or rule-based validators.
8) A/B and Canary
- Route small traffic to new model/prompt; compare CTR/feedback/latency/cost.
- Kill switch: instant rollback to previous config if metrics degrade.
- Version configs: prompt/version ID baked into logs for traceability.
9) Red Teaming
- Adversarial prompts: prompt injection, jailbreaks, policy bypass attempts.
- Data exfiltration tests: attempts to leak secrets/PII.
- Toxicity/abuse: profanity, harassment; ensure safe replies or refusal.
10) Cadence & Ownership
- Daily/weekly: review dashboards, top failing prompts, cost anomalies.
- Monthly: refresh eval sets with fresh data; rotate judge model if needed.
- Ownership: assign an "LLM quality owner" per domain to triage regressions.
11) Minimal Toolkit
- Eval runner: script to load golden set, call model, run checks, produce JSON report.
- Schema validator: JSON schema for each task type.
- Judge prompts: stored and versioned; tag with model version.
- Dashboard: Grafana/Looker or vendor dashboard for latency/cost/success.
Latest free certifications and projects
Quickly add resume highlights and boost competitiveness.
12) Practice Tasks
- Build a 50-sample golden set for your FAQ/RAG use case with citations.
- Implement a judge prompt and JSON schema check; fail build if format score < 4 or faithfulness < 4.
- Add user feedback logging (thumbs + comment) and surface top 10 failure queries in a dashboard.
📚 相关资源
❓ 常见问题
关于本章主题最常被搜索的问题,点击展开答案
LLM 应用的 offline eval set 该怎么搭?多少样本算够?
本章给的标准:一个 small versioned dataset,50-200 samples,按 task 分(FAQ、RAG Q&A、extraction、summarization 各一份)。每个样本要有:input(query + context if RAG + expected format)、labels(ideal answer、citations、expected refusal cases)、negative cases(empty context、conflicting docs、forbidden requests)。50 个样本起步够日常 CI 用,但要覆盖 happy path + 边界 + 拒答场景,不是全 happy path。每月 refresh 一次加新数据 + 旋转 judge model。
LLM-as-Judge 在 CI 跑应该用便宜模型还是贵模型?
本章建议双层策略:CI 用 smaller cheaper model(如 GPT-4o-mini)跑 fast 反馈,nightly evals 换 stronger model(如 GPT-5 / Claude Sonnet 4.5)做高质量打分。CI 每次 PR 跑一遍判断 "有没有明显倒退",nightly 跑深度评估 "质量绝对值多少"。Judge prompt 让模型按 1-5 评 relevance / faithfulness / format compliance,并标 hallucinations 和 missing citations,输出 JSON 格式 —— 强制结构化输出方便 dashboard 抓数据。
regression gate 在 CI 里怎么设阈值?
本章给的最小标准:CI gate 跑 offline eval set,failure 条件 "score drops > threshold or format errors increase"。具体阈值参考练习题:format score < 4 或 faithfulness < 4 就 fail build。三层 gate 配合:(1) Schema gate —— 拒绝违反 JSON schema 的 response,retry with fix-format prompt;(2) Safety gate —— 违反 policy 的 output 直接 block,返回 safe fallback;(3) Quality gate —— 评分跌破阈值 fail。阈值要根据基线慢慢调高,新功能起步可以宽松,跑稳了再收紧。
online metrics 至少要 log 什么?dashboard 该看什么?
每次请求至少记:model、temperature、latency、tokens、response status、retry count;RAG 场景额外记 top-k docs / scores / chosen citations;加用户反馈钩子(thumbs up/down + 可选评论)。Dashboard 至少四块:success rate、P95 latency、cost per 1k calls、top errors / 429s。然后做 top failing queries 排行榜(前 10 名是什么)+ cost anomaly 检测。每天 / 每周看一次,月度做 eval set refresh 和 judge model rotation。owner 制 —— 每个 domain 指定一个 "LLM quality owner" 处理 regression。
A/B test 一个新 prompt / 新模型该怎么做?
Canary 路线:把小流量(如 5%)路由到新 model / prompt,对比 CTR、用户反馈、latency、cost;指标退化立刻 kill switch 回滚到上一版 config。关键操作:prompt / version ID 必须烙进 logs(traceability),不然你回滚后查 logs 都不知道哪条是哪版本生成的。A/B 期间也要跑 offline eval set 双保险 —— online 指标不一定立刻反映质量退化。Red Teaming 也是 A/B 的兄弟实践:上线前用 adversarial prompts(prompt injection、jailbreaks、data exfiltration、toxicity)打一遍。