LLM-as-a-Judge Evaluation
Advanced Evaluation
This skill covers the complete methodology of using an LLM as an evaluator (LLM-as-a-Judge), spanning direct scoring, pairwise comparison, rubric generation, and bias mitigation. It is not a single trick but a set of composable evaluation strategies.
- Direct scoring suits objective criteria.
- Pairwise comparison suits preference-style judgments.
- Always swap positions to reduce position bias.
- Rubrics substantially reduce evaluation variance.
- Confidence calibration matters as much as the scores themselves.
What You Will Learn
- When to use direct scoring and when to use pairwise comparison
- How to build reusable rubrics
- How to handle position, length, authority, and other biases
When to Activate
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
Core Concepts
The Evaluation Taxonomy
Evaluation approaches fall into two primary categories:
Direct Scoring: a single LLM assigns a score to an individual response against defined criteria.
- Best for: factual accuracy, instruction following, toxicity
- Failure modes: scale drift, inconsistent interpretation of the scale
Pairwise Comparison: two responses are compared and the better one is selected.
- Best for: tone, style, persuasiveness
- Failure modes: position bias, length bias
Research (MT-Bench, Zheng et al., 2023) shows that pairwise comparison is more reliable than direct scoring for preference-style tasks.
The Bias Landscape
Common biases in LLM judges:
Position Bias: the judge favors a response based on where it appears in the prompt. Mitigation: swap positions and compare both passes.
Length Bias: longer responses get higher scores (a monitoring sketch follows this list). Mitigation: instruct the judge explicitly to ignore length.
Self-Enhancement Bias: a model rates its own outputs more favorably. Mitigation: use different models for generation and evaluation.
Verbosity Bias: more detail earns higher scores regardless of relevance. Mitigation: state in the rubric that irrelevant detail is penalized.
Authority Bias: a more confident tone earns higher scores. Mitigation: require evidence or citations.
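One way to monitor length and verbosity bias after the fact is to check whether judge scores track response length. A minimal sketch, assuming scipy is available and that `lengths` and `scores` come from your own evaluation logs (the values below are placeholders):

```python
# Sketch: monitor length/verbosity bias by rank-correlating judge scores
# with response length.
from scipy.stats import spearmanr

lengths = [45, 120, 80, 200, 60, 150]   # word counts of judged responses
scores  = [3, 4, 3, 5, 2, 5]            # judge scores (1-5) for the same responses

rho, p_value = spearmanr(lengths, scores)

# A strong positive correlation across many samples suggests the judge is
# rewarding length itself rather than quality.
print(f"Spearman rho(length, score) = {rho:.2f} (p = {p_value:.3f})")
```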
Metric Selection Framework
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
Key point: systematic bias between the judge and human raters matters more than the absolute agreement rate.
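For example, on an ordinal 1-5 scale, agreement between the LLM judge and human raters can be measured with rank correlations and weighted kappa. A minimal sketch assuming scipy and scikit-learn are installed; the score arrays are placeholders:

```python
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

# Placeholder 1-5 scores for the same items from the LLM judge and a human rater.
judge_scores = [5, 3, 4, 2, 4, 1, 5, 3]
human_scores = [4, 3, 5, 2, 4, 2, 5, 3]

rho, _ = spearmanr(judge_scores, human_scores)
tau, _ = kendalltau(judge_scores, human_scores)
# Quadratic weighting penalizes large disagreements (e.g., 1 vs 5) more than
# adjacent ones (e.g., 4 vs 5), which is appropriate for an ordinal scale.
kappa = cohen_kappa_score(judge_scores, human_scores, weights="quadratic")

print(f"Spearman rho={rho:.2f}  Kendall tau={tau:.2f}  weighted kappa={kappa:.2f}")
```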
Evaluation Approaches
Direct Scoring Implementation
Direct scoring requires three things: clear criteria, a calibrated scale, and structured output.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
Scale Calibration:
- 1-3: low cognitive load
- 1-5: the best balance of resolution and consistency
- 1-10: only with an explicit rubric
Prompt Structure:
You are an expert evaluator assessing response quality.
...
Requiring a justification before the score improves consistency; the sketch below follows this pattern.
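To make the justification-before-score pattern concrete, here is a minimal sketch of a direct-scoring call with structured JSON output. `call_llm` stands in for whatever client you use, and the prompt wording is illustrative rather than an exact template:

```python
import json

def build_direct_scoring_prompt(task, response, criterion, description, scale=(1, 5)):
    # The requested key order forces evidence and justification before the score.
    return f"""You are an expert evaluator assessing response quality.

Task: {task}
Response to evaluate:
{response}

Criterion: {criterion}
Description: {description}
Scale: {scale[0]} (worst) to {scale[1]} (best)

Respond with JSON containing, in this order:
"evidence" (list of specific observations),
"justification" (brief reasoning),
"score" (integer from {scale[0]} to {scale[1]})."""

def direct_score(task, response, criterion, description, call_llm):
    prompt = build_direct_scoring_prompt(task, response, criterion, description)
    raw = call_llm(prompt)       # placeholder: your LLM client call
    return json.loads(raw)       # assumes the judge returned valid JSON
```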
Pairwise Comparison Implementation
Position Bias Mitigation Protocol:
- A first / B second
- B first / A second
- If the two passes disagree → record a TIE and lower the confidence (see the sketch below)
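A minimal sketch of the swap protocol, assuming a `judge_pair(prompt, first, second)` helper that returns which position won ("first", "second", or "tie") plus a confidence; both the helper and the tie-handling rule are assumptions, not a fixed API:

```python
def pairwise_with_swap(prompt, response_a, response_b, judge_pair):
    # Pass 1: A shown first, B second.
    pass1 = judge_pair(prompt, response_a, response_b)
    winner1 = {"first": "A", "second": "B", "tie": "TIE"}[pass1["winner"]]

    # Pass 2: positions swapped (B shown first, A second).
    pass2 = judge_pair(prompt, response_b, response_a)
    winner2 = {"first": "B", "second": "A", "tie": "TIE"}[pass2["winner"]]

    consistent = winner1 == winner2
    if consistent:
        winner = winner1
        confidence = (pass1["confidence"] + pass2["confidence"]) / 2
    else:
        # Disagreement across orderings -> treat as a tie with reduced confidence.
        winner = "TIE"
        confidence = min(pass1["confidence"], pass2["confidence"]) * 0.5

    return {
        "winner": winner,
        "confidence": round(confidence, 2),
        "positionConsistency": {
            "consistent": consistent,
            "firstPassWinner": winner1,
            "secondPassWinner": winner2,
        },
    }
```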
Rubric Generation
Rubrics can reduce evaluation variance by 40-60%.
Rubric Components (sketched as a data structure after this list):
- Level descriptions
- Characteristics
- Examples (optional)
- Edge cases
- Scoring guidelines
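One way to keep rubrics reusable is to store them as data and render them into the judge prompt on demand. A minimal sketch; the field names mirror the components above and are not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int
    description: str                      # level description
    characteristics: list[str] = field(default_factory=list)

@dataclass
class Rubric:
    criterion: str
    description: str
    levels: list[RubricLevel]
    edge_cases: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Turn the rubric into text that can be embedded in a judge prompt.
        lines = [f"Criterion: {self.criterion}",
                 f"Description: {self.description}",
                 "Levels:"]
        for level in sorted(self.levels, key=lambda lv: lv.score):
            lines.append(f"  {level.score}: {level.description}")
            lines.extend(f"    - {c}" for c in level.characteristics)
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {e}" for e in self.edge_cases)
        return "\n".join(lines)
```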
Practical Guidance
Evaluation Pipeline Design
Criteria Loader → Primary Scorer → Bias Mitigation → Confidence Scoring
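A minimal sketch of that pipeline as plain function composition; the stage names follow the diagram and each stage is a callable you supply:

```python
def run_evaluation(item, criteria_loader, primary_scorer, bias_mitigation, confidence_scoring):
    criteria = criteria_loader(item)                    # Criteria Loader
    raw = [primary_scorer(item, c) for c in criteria]   # Primary Scorer (one pass per criterion)
    adjusted = bias_mitigation(item, raw)               # Bias Mitigation (swaps, length checks, re-asks)
    return confidence_scoring(adjusted)                 # Confidence Scoring (attach calibrated confidence)
```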
Common Anti-Patterns
- Scoring without justification
- Single-pass pairwise
- Overloaded criteria
- Missing edge cases
- Ignoring confidence calibration
Decision Framework: Direct vs. Pairwise
Objective? → Direct scoring
Preference? → Pairwise comparison
Reference available? → Reference-based evaluation
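The same decision framework, written out as a small helper; the category names are illustrative:

```python
def choose_approach(criterion_is_objective: bool, is_preference: bool, has_reference: bool) -> str:
    # Mirrors the decision framework above.
    if criterion_is_objective:
        return "direct scoring"
    if is_preference:
        return "pairwise comparison"
    if has_reference:
        return "reference-based evaluation"
    return "direct scoring"  # reasonable default when none of the above clearly applies
```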
Scaling Evaluation
- PoLL (Panel of LLMs)
- Hierarchical evaluation
- Human-in-the-loop
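For example, a small panel of judges (PoLL-style) can majority-vote on pairwise preferences and average direct scores. A minimal sketch, with `judges` being a list of judging callables you supply:

```python
from collections import Counter
from statistics import mean

def panel_pairwise(prompt, response_a, response_b, judges):
    # Each judge returns "A", "B", or "TIE"; the panel verdict is the majority vote.
    votes = [judge(prompt, response_a, response_b) for judge in judges]
    counts = Counter(votes)
    winner, top = counts.most_common(1)[0]
    return {"winner": winner, "agreement": top / len(votes), "votes": dict(counts)}

def panel_direct(prompt, response, judges):
    # Each judge returns a numeric score; report the mean and the spread across judges.
    scores = [judge(prompt, response) for judge in judges]
    return {"score": mean(scores), "spread": max(scores) - min(scores)}
```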
Examples
Example 1: Direct Scoring for Accuracy
Input:
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
Output:
{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
"improvement": "Could add the specific tilt angle (23.5°) for completeness."
}
Example 2: Pairwise Comparison with Position Swap
Input:
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
First Pass (A first):
{ "winner": "B", "confidence": 0.8 }
Second Pass (B first; "A"/"B" refer to presentation order, so "A" here is the original Response B):
{ "winner": "A", "confidence": 0.6 }
Final Result:
{
"winner": "B",
"confidence": 0.7,
"positionConsistency": {
"consistent": true,
"firstPassWinner": "B",
"secondPassWinner": "B"
}
}
Minimal Rubric Template
Criterion: <name>
Description: <what good looks like>
Scale: 1-5
Levels:
- 1: <poor>
- 3: <adequate>
- 5: <excellent>
Edge cases: <how to score ambiguous cases>
Guidelines
- Always require justification before scores
- Always swap positions in pairwise comparison
- Match scale granularity to rubric specificity
- Separate objective and subjective criteria
- Include confidence scores
- Define edge cases explicitly
- Use domain-specific rubrics
- Validate against human judgments
- Monitor systematic bias
- Design for iteration
Practice Task
- Write a 1-5 rubric (e.g., for "answer accuracy")
- Use it to evaluate two model outputs and record where they differ
Related Pages
Integration
- context-fundamentals
- tool-design
- context-optimization
- evaluation
References
- LLM-as-Judge Implementation Patterns
- Bias Mitigation Techniques
- Metric Selection Guide
- Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
- Large Language Models are not Fair Evaluators (Wang et al., 2023)
Skill Metadata
Created: 2024-12-24 Last Updated: 2024-12-24 Author: Muratcan Koylan Version: 1.0.0