20

Evaluation & Quality Monitoring

⏱️ 45 min

Evaluation and quality monitoring keep your LLM features reliable. This chapter gives a practical playbook for building evals, catching regressions, and closing the feedback loop.

1) What to Measure

  • Relevance: Does the answer match the user intent?
  • Faithfulness: Is the answer grounded in sources (no hallucinations)?
  • Format: Does it follow the required schema/JSON/table?
  • Safety: No toxic/PII leakage/policy violations.
  • Latency & Cost: P95 response time, tokens per call.

2) Offline Eval Sets (Golden Tests)

  • Build a small, versioned dataset (e.g., 50-200 samples) with:
    • Input: query + context (if RAG) + expected format.
    • Labels: ideal answer, citations, expected refusal cases.
    • Negative cases: empty context, conflicting docs, forbidden requests.
  • Keep per-direction/task evals (FAQ, RAG Q&A, extraction, summarization).

3) Automatic Checks (Cheap Baseline)

  • Format validation: JSON schema / regex / length limits.
  • Citation enforcement: answer must contain provided citation IDs.
  • Refusal checks: when no context, expect a refusal phrase.
  • Guardrails: block unsafe categories (violence/PII) via content filters.

4) LLM-as-Judge (Quality Scoring)

  • Use a judge prompt to score relevance/faithfulness/clarity on a 1-5 scale.
  • Ask judge to mark hallucinations and missing citations.
  • Run with a smaller, cheaper model for CI, a stronger model for nightly evals.

Sample Judge Prompt

You are an evaluator. Given:
- question
- retrieved context (with IDs)
- model answer
Rate 1-5 for: relevance, faithfulness, format compliance.
List any hallucinations or missing citations. Reply in JSON:
{
  "relevance": 1-5,
  "faithfulness": 1-5,
  "format": 1-5,
  "issues": ["..."]
}

5) Online Metrics & Instrumentation

  • Log per request: model, temperature, latency, tokens, response status, retry count.
  • For RAG: log top-k docs, scores, chosen citations.
  • Add user feedback hooks (thumbs up/down with optional comment).
  • Build dashboards: success rate, P95 latency, cost per 1k calls, top errors/429s.

6) Regression Gates

  • CI gate: run offline eval set; fail if score drops > threshold or format errors increase.
  • Schema gate: reject responses failing JSON schema; retry with "fix-format" prompt.
  • Safety gate: block outputs violating policy; return safe fallback.

7) Hallucination & Safety Mitigations

  • Prompt constraints: "answer only from provided context; if unknown, say UNKNOWN".
  • Citations mandatory for factual claims.
  • Refusal logic: no context -> refuse; unsafe request -> safe decline.
  • Rerank/verify: cross-check with a secondary model or rule-based validators.

8) A/B and Canary

  • Route small traffic to new model/prompt; compare CTR/feedback/latency/cost.
  • Kill switch: instant rollback to previous config if metrics degrade.
  • Version configs: prompt/version ID baked into logs for traceability.

9) Red Teaming

  • Adversarial prompts: prompt injection, jailbreaks, policy bypass attempts.
  • Data exfiltration tests: attempts to leak secrets/PII.
  • Toxicity/abuse: profanity, harassment; ensure safe replies or refusal.

10) Cadence & Ownership

  • Daily/weekly: review dashboards, top failing prompts, cost anomalies.
  • Monthly: refresh eval sets with fresh data; rotate judge model if needed.
  • Ownership: assign an "LLM quality owner" per domain to triage regressions.

11) Minimal Toolkit

  • Eval runner: script to load golden set, call model, run checks, produce JSON report.
  • Schema validator: JSON schema for each task type.
  • Judge prompts: stored and versioned; tag with model version.
  • Dashboard: Grafana/Looker or vendor dashboard for latency/cost/success.
Free Certificates

Latest free certifications and projects

Quickly add resume highlights and boost competitiveness.

View Now

12) Practice Tasks

  1. Build a 50-sample golden set for your FAQ/RAG use case with citations.
  2. Implement a judge prompt and JSON schema check; fail build if format score < 4 or faithfulness < 4.
  3. Add user feedback logging (thumbs + comment) and surface top 10 failure queries in a dashboard.

📚 相关资源

❓ 常见问题

关于本章主题最常被搜索的问题,点击展开答案

LLM 应用的 offline eval set 该怎么搭?多少样本算够?

本章给的标准:一个 small versioned dataset,50-200 samples,按 task 分(FAQ、RAG Q&A、extraction、summarization 各一份)。每个样本要有:input(query + context if RAG + expected format)、labels(ideal answer、citations、expected refusal cases)、negative cases(empty context、conflicting docs、forbidden requests)。50 个样本起步够日常 CI 用,但要覆盖 happy path + 边界 + 拒答场景,不是全 happy path。每月 refresh 一次加新数据 + 旋转 judge model。

LLM-as-Judge 在 CI 跑应该用便宜模型还是贵模型?

本章建议双层策略:CI 用 smaller cheaper model(如 GPT-4o-mini)跑 fast 反馈,nightly evals 换 stronger model(如 GPT-5 / Claude Sonnet 4.5)做高质量打分。CI 每次 PR 跑一遍判断 "有没有明显倒退",nightly 跑深度评估 "质量绝对值多少"。Judge prompt 让模型按 1-5 评 relevance / faithfulness / format compliance,并标 hallucinations 和 missing citations,输出 JSON 格式 —— 强制结构化输出方便 dashboard 抓数据。

regression gate 在 CI 里怎么设阈值?

本章给的最小标准:CI gate 跑 offline eval set,failure 条件 "score drops > threshold or format errors increase"。具体阈值参考练习题:format score < 4 或 faithfulness < 4 就 fail build。三层 gate 配合:(1) Schema gate —— 拒绝违反 JSON schema 的 response,retry with fix-format prompt;(2) Safety gate —— 违反 policy 的 output 直接 block,返回 safe fallback;(3) Quality gate —— 评分跌破阈值 fail。阈值要根据基线慢慢调高,新功能起步可以宽松,跑稳了再收紧。

online metrics 至少要 log 什么?dashboard 该看什么?

每次请求至少记:model、temperature、latency、tokens、response status、retry count;RAG 场景额外记 top-k docs / scores / chosen citations;加用户反馈钩子(thumbs up/down + 可选评论)。Dashboard 至少四块:success rate、P95 latency、cost per 1k calls、top errors / 429s。然后做 top failing queries 排行榜(前 10 名是什么)+ cost anomaly 检测。每天 / 每周看一次,月度做 eval set refresh 和 judge model rotation。owner 制 —— 每个 domain 指定一个 "LLM quality owner" 处理 regression。

A/B test 一个新 prompt / 新模型该怎么做?

Canary 路线:把小流量(如 5%)路由到新 model / prompt,对比 CTR、用户反馈、latency、cost;指标退化立刻 kill switch 回滚到上一版 config。关键操作:prompt / version ID 必须烙进 logs(traceability),不然你回滚后查 logs 都不知道哪条是哪版本生成的。A/B 期间也要跑 offline eval set 双保险 —— online 指标不一定立刻反映质量退化。Red Teaming 也是 A/B 的兄弟实践:上线前用 adversarial prompts(prompt injection、jailbreaks、data exfiltration、toxicity)打一遍。