LLM-as-a-Judge 评测
Advanced Evaluation
This chapter covers the full methodology of using an LLM as an evaluator (LLM-as-a-Judge), spanning direct scoring, pairwise comparison, rubric generation, and bias mitigation. It is not a single trick but a set of composable evaluation strategies.
- Direct scoring suits objective criteria.
- Pairwise comparison suits preference-based evaluation.
- Positions must be swapped to reduce position bias.
- Rubrics can substantially reduce evaluation variance.
- Confidence calibration matters as much as the score itself.
What You Will Learn
- When to use direct scoring and when to use pairwise comparison
- How to build reusable rubrics
- How to handle position, length, authority, and other biases
When to Activate
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
Core Concepts
The Evaluation Taxonomy
Evaluation approaches fall into two primary categories:
Direct Scoring: a single LLM assigns a score to one response.
- Best for: factual accuracy, instruction following, toxicity
- Failure mode: scale drift, inconsistent interpretation
Pairwise Comparison: compare two responses and pick the better one.
- Best for: tone, style, persuasiveness
- Failure mode: position bias, length bias
Research (MT-Bench, Zheng et al., 2023) shows that pairwise comparison is more reliable than direct scoring for preference-based tasks.
The Bias Landscape
Common biases in LLM judges:
Position Bias: the judge favors a particular position. Mitigation: swap positions.
Length Bias: longer responses score higher. Mitigation: explicitly instruct the judge to ignore length.
Self-Enhancement Bias: models rate their own outputs higher. Mitigation: use different models for generation and evaluation.
Verbosity Bias: more detail scores higher. Mitigation: state in the rubric that irrelevant detail is penalized.
Authority Bias: a more confident tone scores higher. Mitigation: require evidence or citations.
Metric Selection Framework
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
Key point: systematic bias matters more than the absolute agreement rate.
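Both of the secondary ordinal metrics above can be checked without external libraries. The sketch below implements Cohen's κ and a systematic-offset check from scratch; the label data in the usage note is made up for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def systematic_offset(judge_scores, human_scores):
    """Mean signed difference on an ordinal scale; a nonzero value means
    the judge is systematically harsher or more lenient than humans."""
    n = len(judge_scores)
    return sum(j - h for j, h in zip(judge_scores, human_scores)) / n
```

A judge with high raw agreement but a consistently positive offset is drifting lenient; that systematic drift is the signal to monitor, per the key point above.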
Evaluation Approaches
Direct Scoring Implementation
Direct scoring requires clear criteria, a calibrated scale, and structured output.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
Scale Calibration:
- 1-3: low cognitive load
- 1-5: the most balanced (a sensible default)
- 1-10: needs a detailed rubric to avoid drift
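The criteria pattern above maps naturally onto a small data structure. This is a hypothetical sketch (the criterion names and weights are illustrative, not from the chapter) of how weighted per-criterion scores combine into an overall score:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str  # what this criterion measures
    weight: float     # relative importance, 0-1

def overall_score(per_criterion_scores, criteria):
    """Weighted average of per-criterion scores (all on the same scale)."""
    total_weight = sum(c.weight for c in criteria)
    return sum(per_criterion_scores[c.name] * c.weight for c in criteria) / total_weight
```

For example, with accuracy weighted 1.0 and clarity 0.5, scores of 5 and 4 combine to (5·1.0 + 4·0.5) / 1.5 ≈ 4.67.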
Prompt Structure:
You are an expert evaluator assessing response quality.
...
Requiring justification before the score improves consistency.
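The justification-before-score ordering can be enforced structurally by fixing the order of the requested output fields. A hypothetical prompt builder (the wording and field names are illustrative):

```python
def build_scoring_prompt(criterion, description, scale_max, response):
    # Keys are requested in a fixed order so the judge must commit to
    # evidence and justification BEFORE emitting a number.
    return (
        "You are an expert evaluator assessing response quality.\n"
        f"Criterion: {criterion} - {description}\n"
        f"Scale: 1-{scale_max}\n"
        'Reply in JSON with keys in this exact order: "evidence" '
        '(list of quotes), "justification", "score", "improvement".\n\n'
        f"Response to evaluate:\n{response}"
    )
```

The point of the fixed key order is that autoregressive generation makes the model write its evidence chain before the number, so the score is conditioned on the justification rather than the other way around.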
Pairwise Comparison Implementation
Position Bias Mitigation Protocol:
- Pass 1: A first / B second
- Pass 2: B first / A second
- Disagreement between passes → TIE + lower confidence
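A minimal sketch of the aggregation step for this protocol, assuming the judge labels winners by presented position (so second-pass labels must be mapped back to the original responses):

```python
def unswap(label):
    # In pass 2 the responses are presented in swapped order, so the
    # judge's positional "A" is the original B, and vice versa.
    return {"A": "B", "B": "A"}.get(label, label)

def aggregate_passes(first_pass, second_pass_raw):
    """Combine two judging passes run with swapped presentation order.

    Each pass is a dict like {"winner": "A"|"B"|"TIE", "confidence": float}.
    """
    second_winner = unswap(second_pass_raw["winner"])
    consistent = first_pass["winner"] == second_winner
    if consistent:
        winner = first_pass["winner"]
        confidence = (first_pass["confidence"] + second_pass_raw["confidence"]) / 2
    else:
        # Disagreement across passes: record a TIE at lower confidence.
        winner = "TIE"
        confidence = min(first_pass["confidence"], second_pass_raw["confidence"])
    return {
        "winner": winner,
        "confidence": confidence,
        "positionConsistency": {
            "consistent": consistent,
            "firstPassWinner": first_pass["winner"],
            "secondPassWinner": second_winner,
        },
    }
```

Fed the two passes from Example 2 below (B at 0.8, then positional A at 0.6), this yields winner B at confidence 0.7 with consistent position checks, matching that example's final result.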
Rubric Generation
Rubrics reduce evaluation variance by 40-60%.
Rubric Components:
- Level descriptions
- Characteristics
- Examples (optional)
- Edge cases
- Scoring guidelines
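The components above can be carried as plain data and rendered into the judge prompt. A sketch, with an illustrative rubric (the level wording here is an example, not prescribed by the chapter):

```python
RUBRIC = {
    "criterion": "Factual Accuracy",
    "description": "Claims are correct and verifiable",
    "scale": (1, 5),
    "levels": {
        1: "Multiple factual errors or fabrications",
        3: "Mostly correct with minor imprecision",
        5: "Fully correct, no errors",
    },
    "edge_cases": "Unverifiable claims score at most 3",
}

def render_rubric(rubric):
    """Render a rubric dict into the text block inserted into the judge prompt."""
    lines = [
        f"Criterion: {rubric['criterion']}",
        f"Description: {rubric['description']}",
        f"Scale: {rubric['scale'][0]}-{rubric['scale'][1]}",
        "Levels:",
    ]
    lines += [f"- {level}: {desc}" for level, desc in sorted(rubric["levels"].items())]
    lines.append(f"Edge cases: {rubric['edge_cases']}")
    return "\n".join(lines)
```

Keeping rubrics as data rather than hard-coded prompt strings makes them reusable across judges and easy to version alongside evaluation results.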
Practical Guidance
Evaluation Pipeline Design
Criteria Loader → Primary Scorer → Bias Mitigation → Confidence Scoring
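One way to sketch this four-stage pipeline is as composable functions threaded over a shared record. The stage bodies below are stubs with placeholder values; a real Primary Scorer would call the judge model:

```python
def load_criteria(record):
    # Criteria Loader: attach the rubric / criteria for this task.
    record["criteria"] = ["factual_accuracy"]
    return record

def primary_scorer(record):
    # Primary Scorer: would call the judge model; stubbed here.
    record["score"] = 4
    return record

def bias_mitigation(record):
    # Bias Mitigation: e.g. re-run with swapped positions, length checks.
    record["bias_checked"] = True
    return record

def confidence_scoring(record):
    # Confidence Scoring: attach a calibrated confidence estimate.
    record["confidence"] = 0.75
    return record

def run_pipeline(task):
    record = {"task": task}
    for stage in (load_criteria, primary_scorer, bias_mitigation, confidence_scoring):
        record = stage(record)
    return record
```

The benefit of the staged shape is that bias mitigation and confidence scoring can be toggled or swapped without touching the scorer itself.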
Common Anti-Patterns
- Scoring without justification
- Single-pass pairwise
- Overloaded criteria
- Missing edge cases
- Ignoring confidence calibration
Decision Framework: Direct vs. Pairwise
Objective? → Direct scoring
Preference? → Pairwise comparison
Reference available? → Reference-based evaluation
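The decision framework above fits in a few lines. The chapter does not specify precedence when branches overlap; giving an available gold reference priority is one reasonable reading, assumed here:

```python
def choose_approach(objective, has_reference):
    """Route a task to an evaluation approach per the decision framework."""
    if has_reference:
        # A gold reference enables direct comparison against ground truth.
        return "reference-based evaluation"
    return "direct scoring" if objective else "pairwise comparison"
```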
Scaling Evaluation
- PoLL (Panel of LLMs)
- Hierarchical evaluation
- Human-in-the-loop
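A PoLL setup can be sketched as a majority vote over independent judge verdicts, with the agreement fraction doubling as a confidence signal (low agreement is a natural trigger for the human-in-the-loop path):

```python
from collections import Counter

def panel_vote(verdicts):
    """PoLL-style majority vote over verdicts ("A" / "B" / "TIE")
    from independent judge models."""
    counts = Counter(verdicts)
    winner, votes = counts.most_common(1)[0]
    return {"winner": winner, "agreement": votes / len(verdicts)}
```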
Examples
Example 1: Direct Scoring for Accuracy
Input:
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
Output:
{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
"improvement": "Could add the specific tilt angle (23.5°) for completeness."
}
Example 2: Pairwise Comparison with Position Swap
Input:
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
First Pass (A first):
{ "winner": "B", "confidence": 0.8 }
Second Pass (B first):
{ "winner": "A", "confidence": 0.6 }
Note: the judge labels winners by presented position, so "A" in this pass is the response shown first, i.e. original Response B. After un-swapping, both passes prefer B.
Final Result:
{
"winner": "B",
"confidence": 0.7,
"positionConsistency": {
"consistent": true,
"firstPassWinner": "B",
"secondPassWinner": "B"
}
}
Minimal Rubric Template
Criterion: <name>
Description: <what good looks like>
Scale: 1-5
Levels:
- 1: <poor>
- 3: <adequate>
- 5: <excellent>
Edge cases: <how to score ambiguous cases>
Guidelines
- Always require justification before scores
- Always swap positions in pairwise comparison
- Match scale granularity to rubric specificity
- Separate objective and subjective criteria
- Include confidence scores
- Define edge cases explicitly
- Use domain-specific rubrics
- Validate against human judgments
- Monitor systematic bias
- Design for iteration
Practice Task
- Write a 1-5 rubric (e.g., for "answer accuracy")
- Use it to evaluate two model outputs and record where the scores differ
Related Pages
Integration
- context-fundamentals
- tool-design
- context-optimization
- evaluation
References
- Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators
- Judging LLM-as-a-Judge (Zheng et al., 2023)
- G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)
- Large Language Models are not Fair Evaluators (Wang et al., 2023)
Skill Metadata
Created: 2024-12-24 Last Updated: 2024-12-24 Author: Muratcan Koylan Version: 1.0.0
FAQ
Frequently searched questions about this chapter's topic.
When should I use direct scoring, and when pairwise comparison?
Direct scoring suits objective criteria: factual accuracy, instruction following, toxicity, and other tasks with a clear right answer. Pairwise comparison suits preference-based evaluation: tone, style, persuasiveness, and other tasks with no single correct answer. The MT-Bench paper (Zheng et al., 2023) found pairwise more reliable than direct scoring on preference tasks, because even humans drift when assigning absolute scores, while "which of A and B is better" is far more consistent. Direct scoring's failure mode is scale drift; pairwise's failure modes are position bias and length bias.
What are the common biases in LLM judges, and how do you counter each one?
This chapter lists five: (1) Position bias, favoring one position; mitigation: swap positions (run A-first and B-first passes, and on disagreement record a TIE with lower confidence). (2) Length bias, where longer responses score higher; mitigation: explicitly instruct the judge to ignore length. (3) Self-enhancement bias, where models rate their own outputs higher; mitigation: use different models for generation and evaluation. (4) Verbosity bias, where more detail scores higher; mitigation: state in the rubric that irrelevant detail is penalized. (5) Authority bias, where a confident tone scores higher; mitigation: require evidence or citations. Each bias has a corresponding engineering countermeasure; you cannot fix them just by switching to a larger model.
Can rubrics really improve scoring consistency by 40-60%? How do I write a good one?
Per this chapter, rubrics reduce evaluation variance by 40-60%. A complete rubric has five components: (1) level descriptions (what each score from 1 to 5 looks like); (2) characteristics (key features); (3) examples (optional but very helpful); (4) edge cases (how to score ambiguous inputs); (5) scoring guidelines. On scale choice: 1-3 has low cognitive load, 1-5 is the most balanced (the default for most scenarios), and 1-10 needs a more detailed rubric to avoid drift. Minimal template: Criterion / Description / Scale 1-5 / Levels (1 poor, 3 adequate, 5 excellent) / Edge cases.
Why require the LLM judge to write justification before the score?
Justification-before-score measurably improves consistency. When an LLM emits a number directly, it is going on feel; writing the rationale first forces it through the evidence chain, so the score no longer appears out of thin air. Example 1 in this chapter uses exactly this structure: first list evidence ("Correctly identifies axial tilt as primary cause" and two more items), then a justification paragraph, and only then the score of 5 plus an improvement suggestion. Combined with a confidence field (lowering confidence when pairwise passes disagree), this also surfaces the samples the judge itself is unsure about. This is the standard practice recommended in the G-Eval paper.
How exactly does the pairwise swap protocol work?
Example 2 in this chapter walks through the full flow: (1) First pass, A first / B second, given the prompt plus criteria; the judge outputs winner B with confidence 0.8. (2) Second pass, B first / A second, on the same prompt; it outputs winner A with confidence 0.6 (a positional label that maps back to original B). (3) If the passes agree, use the shared result with the averaged confidence; if they disagree, record a TIE or the lower-confidence result. The final output includes winner, confidence, and a positionConsistency field. Adding PoLL (a Panel of LLMs voting) and hierarchical evaluation (different tiers attending to different dimensions) suppresses bias further, with human-in-the-loop as the backstop for samples the judge cannot settle.