
LLM-as-a-Judge Evaluation

⏱️ 35 minutes

Advanced Evaluation

This chapter covers the full methodology of using an LLM as an evaluator (LLM-as-a-Judge), including direct scoring, pairwise comparison, rubric generation, and bias mitigation. It is not a single trick but a set of composable evaluation strategies.

  • Direct scoring suits objective criteria.
  • Pairwise comparison suits preference-based evaluation.
  • You must swap positions to reduce position bias.
  • Rubrics can substantially reduce evaluation variance.
  • Confidence calibration matters as much as the score itself.

What You Will Learn

  • When to use direct scoring and when to use pairwise comparison
  • How to build reusable rubrics
  • How to handle position, length, authority, and other biases

When to Activate

Activate this skill when:

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories:

Direct Scoring: a single LLM assigns a score to one response.

  • Best for: factual accuracy, instruction following, toxicity
  • Failure mode: scale drift, inconsistent interpretation

Pairwise Comparison: compare two responses and pick the better one.

  • Best for: tone, style, persuasiveness
  • Failure mode: position bias, length bias

Research (MT-Bench, Zheng et al., 2023) shows that pairwise comparison is more reliable than direct scoring on preference-based tasks.

The Bias Landscape

Common biases in LLM judges:

Position Bias: favoring a particular position. Mitigation: swap positions.

Length Bias: longer responses score higher. Mitigation: explicitly instruct the judge to ignore length.

Self-Enhancement Bias: a model rates its own outputs higher. Mitigation: use different models for generation and evaluation.

Verbosity Bias: more detail scores higher. Mitigation: make the rubric explicitly penalize irrelevant detail.

Authority Bias: a more confident tone scores higher. Mitigation: require evidence or citations.
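One practical way to monitor length bias is to correlate judge scores with response length over a batch of evaluations. A minimal pure-Python sketch; the `results` data below is hypothetical, used only to show the check:

```python
# Sketch: detecting length bias in judge scores.
# A strong positive correlation between response length and score,
# with quality held roughly constant, suggests length bias.

def pearson(xs, ys):
    """Pearson correlation coefficient, pure Python."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical judge results: (response length in tokens, score on 1-5)
results = [(50, 3), (120, 4), (200, 5), (80, 3), (300, 5), (60, 2)]
lengths = [length for length, _ in results]
scores = [score for _, score in results]

r = pearson(lengths, scores)
# A large positive r (e.g. > 0.5) on a batch where quality varies
# independently of length is a red flag worth investigating.
print(f"length-score correlation r = {r:.2f}")
```

This is a coarse signal, not proof: long responses can legitimately be better, so a flagged batch should be spot-checked by a human.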

Metric Selection Framework

Task Type | Primary Metrics | Secondary Metrics
Binary classification | Recall, Precision, F1 | Cohen's κ
Ordinal scale (1-5) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted)
Pairwise preference | Agreement rate, Position consistency | Confidence calibration
Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall

Key point: systematic bias matters more than the absolute agreement rate.
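To make that concrete, here is a pure-Python sketch that computes Cohen's κ between judge and human labels and, separately, the mean signed offset that exposes systematic bias. The data and helper names are illustrative, not from this chapter:

```python
# Sketch: judge-human agreement (Cohen's kappa) plus a systematic-offset
# check. Kappa corrects raw agreement for chance; the mean signed
# difference reveals systematic bias (e.g. the judge scoring ~0.5 high).

from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 labels for the same eight samples
human = [3, 4, 2, 5, 3, 4, 1, 5]
judge = [4, 4, 3, 5, 4, 4, 2, 5]

kappa = cohens_kappa(human, judge)
offset = sum(j - h for h, j in zip(human, judge)) / len(human)
print(f"kappa = {kappa:.2f}, mean offset = {offset:+.2f}")
```

Here the judge agrees only moderately (κ ≈ 0.35) but the offset (+0.50) shows the disagreement is systematic: the judge is consistently half a point generous, which a calibration step can correct even when raw agreement looks poor.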

Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires clear criteria, a calibrated scale, and structured output.

Criteria Definition Pattern:

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]

Scale Calibration:

  • 1-3: low cognitive load
  • 1-5: the best balance
  • 1-10: requires an explicit rubric

Prompt Structure:

You are an expert evaluator assessing response quality.
...

Requiring justification before the score improves consistency.
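One way to enforce justification-before-score is at the output-schema level: the prompt asks for evidence and justification fields ahead of the score, and a validator rejects outputs that skip them. A hypothetical sketch; the template text, field names, and function are illustrative:

```python
# Sketch: enforcing justification-before-score via output validation.
# The prompt requests reasoning fields *before* the score so the model
# commits to evidence first; the validator rejects malformed outputs.

import json

PROMPT_TEMPLATE = """You are an expert evaluator assessing response quality.
Criterion: {criterion}
Response: {response}
Reply in JSON with these keys, in this order:
"evidence" (list of strings), "justification" (string), "score" (1-{max_score})."""

def validate_judgment(raw: str, max_score: int = 5) -> dict:
    data = json.loads(raw)
    for key in ("evidence", "justification", "score"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    if not 1 <= data["score"] <= max_score:
        raise ValueError(f"score out of range: {data['score']}")
    return data

raw = '{"evidence": ["cites axial tilt"], "justification": "Accurate.", "score": 5}'
judgment = validate_judgment(raw)
print(judgment["score"])  # 5
```

Rejected outputs can simply be retried; in practice a retry with an explicit "you omitted the justification field" message usually recovers a valid judgment.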

Pairwise Comparison Implementation

Position Bias Mitigation Protocol:

  1. A first / B second
  2. B first / A second
  3. If inconsistent → TIE + lower confidence
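The three steps above can be sketched as a small aggregation function. The averaging and tie-confidence rules here are one reasonable choice, not a prescribed formula; winners are assumed to be already mapped back to the originally presented A/B labels:

```python
# Sketch: combining two pairwise passes into a final verdict.

def aggregate_passes(first: dict, second: dict) -> dict:
    """first: pass with A shown first; second: pass with B shown first.
    Each pass is {"winner": "A"|"B", "confidence": float}, with the
    winner already mapped back to the original response labels."""
    if first["winner"] == second["winner"]:
        return {
            "winner": first["winner"],
            "confidence": round((first["confidence"] + second["confidence"]) / 2, 3),
            "consistent": True,
        }
    # The verdict flipped with position: declare a tie, cut confidence.
    return {
        "winner": "TIE",
        "confidence": round(min(first["confidence"], second["confidence"]) * 0.5, 3),
        "consistent": False,
    }

result = aggregate_passes(
    {"winner": "B", "confidence": 0.8},
    {"winner": "B", "confidence": 0.6},
)
print(result)  # {'winner': 'B', 'confidence': 0.7, 'consistent': True}
```

This reproduces the final result of Example 2 below: consistent passes average their confidences, inconsistent passes collapse to a low-confidence tie.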

Rubric Generation

Rubrics reduce evaluation variance by 40-60%.

Rubric Components:

  1. Level descriptions
  2. Characteristics
  3. Examples (optional)
  4. Edge cases
  5. Scoring guidelines
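The five components map naturally onto a small data structure that can be rendered into prompt text. A hypothetical Python sketch; the class and field names are illustrative:

```python
# Sketch: a rubric as a data structure mirroring the five components.

from dataclasses import dataclass, field

@dataclass
class Rubric:
    criterion: str
    levels: dict            # level number -> description
    characteristics: list = field(default_factory=list)
    examples: list = field(default_factory=list)   # optional
    edge_cases: list = field(default_factory=list)
    guidelines: str = ""

    def render(self) -> str:
        """Render the rubric into prompt-ready text."""
        lines = [f"Criterion: {self.criterion}"]
        for level in sorted(self.levels):
            lines.append(f"  {level}: {self.levels[level]}")
        for case in self.edge_cases:
            lines.append(f"  Edge case: {case}")
        if self.guidelines:
            lines.append(f"Guidelines: {self.guidelines}")
        return "\n".join(lines)

accuracy = Rubric(
    criterion="Factual Accuracy",
    levels={1: "Major errors", 3: "Minor errors", 5: "Fully accurate"},
    edge_cases=["Unverifiable claims score at most 3"],
)
print(accuracy.render())
```

Keeping rubrics as data rather than hand-written prompt strings makes them reusable across direct scoring, pairwise comparison, and human evaluation.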

Practical Guidance

Evaluation Pipeline Design

Criteria Loader → Primary Scorer → Bias Mitigation → Confidence Scoring
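The four stages can be sketched as composable functions. Everything below is a stub: the criteria store and the scorer are placeholders where real rubric loading and judge-model calls would go:

```python
# Sketch: the evaluation pipeline as four composable stages.

def load_criteria(task_type: str) -> list:
    # Stage 1: in practice, load rubrics from a store keyed by task type.
    return {"qa": ["accuracy", "completeness"]}.get(task_type, ["quality"])

def primary_score(response: str, criteria: list) -> dict:
    # Stage 2: stub; a real implementation calls the judge model here.
    return {criterion: 4 for criterion in criteria}

def mitigate_bias(scores: dict, response: str) -> dict:
    # Stage 3: example mitigation, flag very long responses for review.
    scores["length_flag"] = len(response.split()) > 500
    return scores

def attach_confidence(scores: dict) -> dict:
    # Stage 4: lower confidence when a bias flag fired.
    scores["confidence"] = 0.5 if scores.get("length_flag") else 0.9
    return scores

def evaluate(response: str, task_type: str) -> dict:
    criteria = load_criteria(task_type)
    return attach_confidence(mitigate_bias(primary_score(response, criteria), response))

print(evaluate("Seasons are caused by Earth's axial tilt.", "qa"))
```

The value of the decomposition is that each stage can be swapped independently: a different bias mitigation, a panel of scorers, or a stricter confidence rule, without touching the rest of the pipeline.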

Common Anti-Patterns

  • Scoring without justification
  • Single-pass pairwise
  • Overloaded criteria
  • Missing edge cases
  • Ignoring confidence calibration

Decision Framework: Direct vs. Pairwise

Objective? → Direct scoring
Preference? → Pairwise comparison
Reference available? → Reference-based evaluation

Scaling Evaluation

  • PoLL (Panel of LLMs)
  • Hierarchical evaluation
  • Human-in-the-loop

Examples


Example 1: Direct Scoring for Accuracy

Input:

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

Output:

{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}

Example 2: Pairwise Comparison with Position Swap

Input:

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

First Pass (A first):

{ "winner": "B", "confidence": 0.8 }

Second Pass (B first):

{ "winner": "A", "confidence": 0.6 }

Final Result (the second pass's "A" is the positionally first response, i.e. the original Response B, so after mapping labels back both passes chose B):

{
	"winner": "B",
	"confidence": 0.7,
	"positionConsistency": {
		"consistent": true,
		"firstPassWinner": "B",
		"secondPassWinner": "B"
	}
}

Minimal Rubric Template

Criterion: <name>
Description: <what good looks like>
Scale: 1-5
Levels:

  1: <poor>
  3: <adequate>
  5: <excellent>
Edge cases: <how to score ambiguous cases>

Guidelines

  1. Always require justification before scores
  2. Always swap positions in pairwise comparison
  3. Match scale granularity to rubric specificity
  4. Separate objective and subjective criteria
  5. Include confidence scores
  6. Define edge cases explicitly
  7. Use domain-specific rubrics
  8. Validate against human judgments
  9. Monitor systematic bias
  10. Design for iteration

Practice Task

  • Write a 1-5 rubric (e.g. for "answer accuracy")
  • Use it to evaluate two model outputs and record the differences

Integration

  • context-fundamentals
  • tool-design
  • context-optimization
  • evaluation

References

  • Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators
  • Judging LLM-as-a-Judge (Zheng et al., 2023)
  • G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)
  • Large Language Models are not Fair Evaluators (Wang et al., 2023)

Skill Metadata

Created: 2024-12-24
Last Updated: 2024-12-24
Author: Muratcan Koylan
Version: 1.0.0


❓ Frequently Asked Questions

The most frequently searched questions about this chapter's topic, with answers below.

When should you use direct scoring, and when pairwise comparison?

Direct scoring suits objective criteria: tasks with a clear right and wrong, such as factual accuracy, instruction following, and toxicity. Pairwise suits preference-based evaluation: tone, style, persuasiveness, where there is no single correct answer. The MT-Bench paper (Zheng et al., 2023) shows that on preference tasks pairwise is more reliable than direct scoring: even human raters drift when assigning absolute scores, but agreement on "which of A and B is better" is far higher. Direct scoring's failure mode is scale drift; pairwise's failure modes are position bias and length bias.

What are the common biases in LLM judges, and how do you counter each one?

This chapter lists five. (1) Position bias (favoring a particular position); mitigation: swap positions (run A-first and B-first once each; on disagreement, declare TIE + lower confidence). (2) Length bias (longer responses score higher); mitigation: explicitly instruct the judge to ignore length. (3) Self-enhancement bias (a model rates its own outputs higher); mitigation: use different models for generation and evaluation. (4) Verbosity bias (more detail scores higher); mitigation: make the rubric explicitly penalize irrelevant detail. (5) Authority bias (a more confident tone scores higher); mitigation: require evidence or citations. Each bias has a corresponding engineering fix; you cannot rely on "just use a bigger model".

Can rubrics really improve scoring consistency by 40-60%? How do you write a good one?

Per this chapter, rubrics reduce evaluation variance by 40-60%. A complete rubric has five components: (1) level descriptions (what each score from 1 to 5 looks like); (2) characteristics (key features); (3) examples (optional, but very helpful); (4) edge cases (how to score ambiguous situations); (5) scoring guidelines. On scale choice: 1-3 has low cognitive load, 1-5 is the best balance (the default for most scenarios), and 1-10 needs a more detailed rubric to avoid drift. Minimal template: Criterion / Description / Scale 1-5 / Levels (1 poor, 3 adequate, 5 excellent) / Edge cases.

Why require the LLM judge to give a justification before the score?

Justification-before-score measurably improves consistency. When an LLM emits a number directly, it is going by feel; writing the rationale first forces it to walk through the evidence chain, so the number it then assigns is no longer pulled out of thin air. Example 1 in this chapter follows exactly this structure: first list the evidence (three items such as "Correctly identifies axial tilt as primary cause"), then give a justification paragraph, and finally the score of 5 plus an improvement suggestion. Combined with a confidence field (lower confidence when pairwise passes disagree), this also surfaces the samples the judge itself is unsure about. It is the standard practice recommended in the G-Eval paper.

How exactly does the pairwise swap protocol work?

Example 2 in this chapter shows the full flow: (1) First pass: A first / B second, given the prompt + criteria; the judge outputs winner B, confidence 0.8. (2) Second pass: B first / A second on the same prompt; it outputs winner A, confidence 0.6 (a positional label that maps back to the original Response B). (3) If consistent, use the agreed result with averaged confidence; if inconsistent, fall back to TIE or the lower-confidence result. The final output includes winner, confidence, and positionConsistency fields. Adding PoLL (a Panel of LLMs voting) and hierarchical evaluation (different levels caring about different dimensions) pushes bias down further. Human-in-the-loop review serves as the backstop for samples the judge cannot decide.