Should LLM-as-Judge in CI use a cheap or expensive model?

The chapter recommends a two-tier strategy: CI uses a smaller cheaper model (e.g., GPT-4o-mini) for fast feedback; nightly evals swap in a stronger one (e.g., GPT-5 / Claude Sonnet 4.5) for high-quality scoring. CI catches "did anything obviously regress" per PR; nightly answers "what's the absolute quality." The judge prompt asks for 1-5 scores on relevance/faithfulness/format compliance, plus hallucination flags and missing citations, in JSON — structured output keeps dashboards consumable.

How do I set thresholds for regression gates in CI?

Chapter minimum: the CI gate runs the offline eval set; failure if "score drops > threshold or format errors increase." Concrete thresholds from the practice task: format score < 4 or faithfulness < 4 fails the build. Three layered gates: (1) Schema gate — reject responses violating JSON schema, retry with a fix-format prompt; (2) Safety gate — block outputs violating policy, return a safe fallback; (3) Quality gate — fail when scores drop below threshold. Tune thresholds gradually against baseline; start lenient on new features and tighten after stability.

What online metrics must I log, and what should the dashboard show?

Per request, log at minimum: model, temperature, latency, tokens, response status, retry count; for RAG also record top-k docs/scores/chosen citations; add user feedback hooks (thumbs + optional comment). Dashboard needs at least four panels: success rate, P95 latency, cost per 1k calls, top errors/429s. Then a top-failing-queries leaderboard (top 10) + cost anomaly detection. Review daily/weekly; monthly refresh the eval set and rotate the judge model. Ownership: every domain gets an "LLM quality owner" who triages regressions.

How do I A/B test a new prompt or a new model?

Canary route: send a small slice of traffic (e.g., 5%) to the new model/prompt, compare CTR, user feedback, latency, cost; metric regressions trigger an instant kill switch to roll back to previous config. Crucial: bake prompt/version ID into the logs (traceability) — otherwise post-rollback you can't tell which log came from which version. Run the offline eval set in parallel during A/B as a second net — online metrics don't always show quality regressions immediately. Red Teaming is the A/B sibling practice: hit pre-launch with adversarial prompts (injection, jailbreaks, data exfiltration, toxicity).

Evaluation & Quality Monitoring

Q: How do I build an offline eval set for an LLM app, and how many samples are enough?

The chapter standard: a small versioned dataset, 50-200 samples, split by task (FAQ, RAG Q&A, extraction, summarization — one set each). Each sample needs: input (query + context if RAG + expected format), labels (ideal answer, citations, expected refusal cases), negative cases (empty context, conflicting docs, forbidden requests). Fifty starts the door for daily CI, but must cover happy path + edge cases + refusals — not all happy path. Refresh monthly with fresh data + rotate the judge model.

Q: What online metrics must I log, and what should the dashboard show?

Per request, log at minimum: model, temperature, latency, tokens, response status, retry count; for RAG also record top-k docs/scores/chosen citations; add user feedback hooks (thumbs + optional comment). Dashboard needs at least four panels: success rate, P95 latency, cost per 1k calls, top errors/429s. Then a top-failing-queries leaderboard (top 10) + cost anomaly detection. Review daily/weekly; monthly refresh the eval set and rotate the judge model. Ownership: every domain gets an "LLM quality owner" who triages regressions.

Q: How do I A/B test a new prompt or a new model?

Canary route: send a small slice of traffic (e.g., 5%) to the new model/prompt, compare CTR, user feedback, latency, cost; metric regressions trigger an instant kill switch to roll back to previous config. Crucial: bake prompt/version ID into the logs (traceability) — otherwise post-rollback you can't tell which log came from which version. Run the offline eval set in parallel during A/B as a second net — online metrics don't always show quality regressions immediately. Red Teaming is the A/B sibling practice: hit pre-launch with adversarial prompts (injection, jailbreaks, data exfiltration, toxicity).

⏱️ 45 min

Evaluation and quality monitoring keep your LLM features reliable. This chapter gives a practical playbook for building evals, catching regressions, and closing the feedback loop.

1) What to Measure

Relevance: Does the answer match the user intent?
Faithfulness: Is the answer grounded in sources (no hallucinations)?
Format: Does it follow the required schema/JSON/table?
Safety: No toxic/PII leakage/policy violations.
Latency & Cost: P95 response time, tokens per call.

2) Offline Eval Sets (Golden Tests)

Build a small, versioned dataset (e.g., 50-200 samples) with:
- Input: query + context (if RAG) + expected format.
- Labels: ideal answer, citations, expected refusal cases.
- Negative cases: empty context, conflicting docs, forbidden requests.
Keep per-direction/task evals (FAQ, RAG Q&A, extraction, summarization).

3) Automatic Checks (Cheap Baseline)

Format validation: JSON schema / regex / length limits.
Citation enforcement: answer must contain provided citation IDs.
Refusal checks: when no context, expect a refusal phrase.
Guardrails: block unsafe categories (violence/PII) via content filters.

4) LLM-as-Judge (Quality Scoring)

Use a judge prompt to score relevance/faithfulness/clarity on a 1-5 scale.
Ask judge to mark hallucinations and missing citations.
Run with a smaller, cheaper model for CI, a stronger model for nightly evals.

Sample Judge Prompt

You are an evaluator. Given:
- question
- retrieved context (with IDs)
- model answer
Rate 1-5 for: relevance, faithfulness, format compliance.
List any hallucinations or missing citations. Reply in JSON:
{
  "relevance": 1-5,
  "faithfulness": 1-5,
  "format": 1-5,
  "issues": ["..."]
}

5) Online Metrics & Instrumentation

Log per request: model, temperature, latency, tokens, response status, retry count.
For RAG: log top-k docs, scores, chosen citations.
Add user feedback hooks (thumbs up/down with optional comment).
Build dashboards: success rate, P95 latency, cost per 1k calls, top errors/429s.

6) Regression Gates

CI gate: run offline eval set; fail if score drops > threshold or format errors increase.
Schema gate: reject responses failing JSON schema; retry with "fix-format" prompt.
Safety gate: block outputs violating policy; return safe fallback.

7) Hallucination & Safety Mitigations

Prompt constraints: "answer only from provided context; if unknown, say UNKNOWN".
Citations mandatory for factual claims.
Refusal logic: no context -> refuse; unsafe request -> safe decline.
Rerank/verify: cross-check with a secondary model or rule-based validators.

8) A/B and Canary

Route small traffic to new model/prompt; compare CTR/feedback/latency/cost.
Kill switch: instant rollback to previous config if metrics degrade.
Version configs: prompt/version ID baked into logs for traceability.

9) Red Teaming

Adversarial prompts: prompt injection, jailbreaks, policy bypass attempts.
Data exfiltration tests: attempts to leak secrets/PII.
Toxicity/abuse: profanity, harassment; ensure safe replies or refusal.

10) Cadence & Ownership

Daily/weekly: review dashboards, top failing prompts, cost anomalies.
Monthly: refresh eval sets with fresh data; rotate judge model if needed.
Ownership: assign an "LLM quality owner" per domain to triage regressions.

11) Minimal Toolkit

Eval runner: script to load golden set, call model, run checks, produce JSON report.
Schema validator: JSON schema for each task type.
Judge prompts: stored and versioned; tag with model version.
Dashboard: Grafana/Looker or vendor dashboard for latency/cost/success.

Free Certificates

Latest free certifications and projects

Quickly add resume highlights and boost competitiveness.

View Now

12) Practice Tasks

Build a 50-sample golden set for your FAQ/RAG use case with citations.
Implement a judge prompt and JSON schema check; fail build if format score < 4 or faithfulness < 4.
Add user feedback logging (thumbs + comment) and surface top 10 failure queries in a dashboard.

📚 相关资源

OpenAI API Docs

❓ 常见问题

关于本章主题最常被搜索的问题，点击展开答案

LLM 应用的 offline eval set 该怎么搭？多少样本算够？

本章给的标准：一个 small versioned dataset，50-200 samples，按 task 分（FAQ、RAG Q&A、extraction、summarization 各一份）。每个样本要有：input（query + context if RAG + expected format）、labels（ideal answer、citations、expected refusal cases）、negative cases（empty context、conflicting docs、forbidden requests）。50 个样本起步够日常 CI 用，但要覆盖 happy path + 边界 + 拒答场景，不是全 happy path。每月 refresh 一次加新数据 + 旋转 judge model。

LLM-as-Judge 在 CI 跑应该用便宜模型还是贵模型？

本章建议双层策略：CI 用 smaller cheaper model（如 GPT-4o-mini）跑 fast 反馈，nightly evals 换 stronger model（如 GPT-5 / Claude Sonnet 4.5）做高质量打分。CI 每次 PR 跑一遍判断 "有没有明显倒退"，nightly 跑深度评估 "质量绝对值多少"。Judge prompt 让模型按 1-5 评 relevance / faithfulness / format compliance，并标 hallucinations 和 missing citations，输出 JSON 格式 —— 强制结构化输出方便 dashboard 抓数据。

regression gate 在 CI 里怎么设阈值？

本章给的最小标准：CI gate 跑 offline eval set，failure 条件 "score drops > threshold or format errors increase"。具体阈值参考练习题：format score < 4 或 faithfulness < 4 就 fail build。三层 gate 配合：(1) Schema gate —— 拒绝违反 JSON schema 的 response，retry with fix-format prompt；(2) Safety gate —— 违反 policy 的 output 直接 block，返回 safe fallback；(3) Quality gate —— 评分跌破阈值 fail。阈值要根据基线慢慢调高，新功能起步可以宽松，跑稳了再收紧。

online metrics 至少要 log 什么？dashboard 该看什么？

每次请求至少记：model、temperature、latency、tokens、response status、retry count；RAG 场景额外记 top-k docs / scores / chosen citations；加用户反馈钩子（thumbs up/down + 可选评论）。Dashboard 至少四块：success rate、P95 latency、cost per 1k calls、top errors / 429s。然后做 top failing queries 排行榜（前 10 名是什么）+ cost anomaly 检测。每天 / 每周看一次，月度做 eval set refresh 和 judge model rotation。owner 制 —— 每个 domain 指定一个 "LLM quality owner" 处理 regression。

A/B test 一个新 prompt / 新模型该怎么做？

Canary 路线：把小流量（如 5%）路由到新 model / prompt，对比 CTR、用户反馈、latency、cost；指标退化立刻 kill switch 回滚到上一版 config。关键操作：prompt / version ID 必须烙进 logs（traceability），不然你回滚后查 logs 都不知道哪条是哪版本生成的。A/B 期间也要跑 offline eval set 双保险 —— online 指标不一定立刻反映质量退化。Red Teaming 也是 A/B 的兄弟实践：上线前用 adversarial prompts（prompt injection、jailbreaks、data exfiltration、toxicity）打一遍。