
LLM-as-a-Judge Evaluation

⏱️ 35 minutes

Advanced Evaluation

This chapter covers the full methodology of using an LLM as an evaluator (LLM-as-a-Judge), spanning direct scoring, pairwise comparison, rubric generation, and bias mitigation. It is not a single trick but a set of composable evaluation strategies.

  • Direct scoring works best for objective criteria.
  • Pairwise comparison works best for preference-based judgments.
  • Always swap positions to reduce position bias.
  • Rubrics substantially reduce evaluation variance.
  • Confidence calibration matters as much as the score itself.

What You Will Learn

  • When to use direct scoring and when to use pairwise comparison
  • How to build reusable rubrics
  • How to handle position, length, and authority biases

When to Activate

Activate this skill when:

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories:

Direct Scoring: a single LLM assigns a score to one response against defined criteria.

  • Best for: factual accuracy, instruction following, toxicity
  • Failure modes: scale drift, inconsistent interpretation of the scale

Pairwise Comparison: two responses are compared and the better one is selected.

  • Best for: tone, style, persuasiveness
  • Failure modes: position bias, length bias

Research (MT-Bench, Zheng et al., 2023) shows that for preference-based tasks, pairwise comparison is more reliable than direct scoring.

The Bias Landscape

Common biases in LLM judges:

Position Bias: the judge favors a response based on where it appears. Mitigation: swap positions and run the comparison twice.

Length Bias: longer responses receive higher scores. Mitigation: instruct the judge explicitly to ignore length.

Self-Enhancement Bias: a model rates its own outputs more favorably. Mitigation: use different models for generation and evaluation.

Verbosity Bias: more detail earns higher scores even when it is irrelevant. Mitigation: state in the rubric that irrelevant detail is penalized.

Authority Bias: a more confident tone earns higher scores. Mitigation: require evidence or citations to back up claims.

Metric Selection Framework

| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |

Key point: systematic bias matters more than the absolute agreement rate.
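
As a minimal sketch (the function names are ours, not from any particular library), agreement rate and Cohen's κ between a judge and a human rater can be computed directly:

// Minimal sketch: agreement rate and Cohen's kappa between an LLM judge
// and a human rater over the same set of categorical labels.
function agreementRate(judge: string[], human: string[]): number {
  const matches = judge.filter((label, i) => label === human[i]).length;
  return matches / judge.length;
}

function cohensKappa(judge: string[], human: string[]): number {
  const n = judge.length;
  const labels = Array.from(new Set([...judge, ...human]));
  const observed = agreementRate(judge, human);
  // Expected agreement under chance, from each rater's marginal label frequencies.
  let expected = 0;
  for (const label of labels) {
    const pJudge = judge.filter((l) => l === label).length / n;
    const pHuman = human.filter((l) => l === label).length / n;
    expected += pJudge * pHuman;
  }
  return expected === 1 ? 1 : (observed - expected) / (1 - expected);
}

// Example: the judge agrees on 4 of 5 binary labels.
console.log(agreementRate(["yes", "yes", "no", "no", "yes"],
                          ["yes", "no", "no", "no", "yes"])); // 0.8

Spearman's ρ and Kendall's τ follow the same pattern, computed over ranked scores instead of exact label matches.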

Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires three things: clear criteria, a calibrated scale, and structured output.

Criteria Definition Pattern:

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]

Scale Calibration:

  • 1-3: low cognitive load, coarse distinctions
  • 1-5: the most balanced choice for most tasks
  • 1-10: only with a highly specific rubric

Prompt Structure:

You are an expert evaluator assessing response quality.
...

Requiring a justification before the score improves consistency; a sketch follows.
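
A minimal sketch of a direct-scoring call, assuming a generic callJudge helper that sends a prompt to whatever judge model you use and returns its raw text reply (the helper and the JSON shape are illustrative, not a specific vendor API):

interface DirectScore {
  criterion: string;
  justification: string; // requested before the score to improve consistency
  score: number;         // 1-5
  confidence: number;    // 0-1
}

// Assumed helper: sends a prompt to your judge model and returns its text output.
declare function callJudge(prompt: string): Promise<string>;

async function scoreResponse(
  prompt: string,
  response: string,
  criterion: { name: string; description: string },
): Promise<DirectScore> {
  const judgePrompt = [
    "You are an expert evaluator assessing response quality.",
    `Criterion: ${criterion.name} - ${criterion.description}`,
    "Scale: 1 (poor) to 5 (excellent). Ignore response length.",
    "First write a justification citing evidence from the response, then give the score.",
    `Reply as JSON: {"criterion": "...", "justification": "...", "score": 1-5, "confidence": 0-1}`,
    `Prompt: ${prompt}`,
    `Response: ${response}`,
  ].join("\n");
  // Assumes the judge complies with the JSON instruction; add validation in practice.
  return JSON.parse(await callJudge(judgePrompt)) as DirectScore;
}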

Pairwise Comparison Implementation

Position Bias Mitigation Protocol:

  1. Run the comparison with A first and B second.
  2. Run it again with B first and A second.
  3. If the two passes disagree, record a TIE and lower the confidence (see the sketch below).
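
A minimal sketch of this protocol, assuming a judgePair helper that asks the judge to compare two responses in the order given (the helper and verdict shape are illustrative):

type Verdict = { winner: "A" | "B" | "TIE"; confidence: number };

// Assumed helper: asks the judge to compare two responses in the order presented.
declare function judgePair(prompt: string, first: string, second: string): Promise<Verdict>;

async function comparePair(prompt: string, a: string, b: string): Promise<Verdict> {
  // Pass 1: A in the first position; pass 2: B in the first position.
  const pass1 = await judgePair(prompt, a, b);
  const pass2 = await judgePair(prompt, b, a);
  // Map the second pass back to original identities (position "A" held response B).
  const pass2Winner = pass2.winner === "A" ? "B" : pass2.winner === "B" ? "A" : "TIE";
  if (pass1.winner === pass2Winner) {
    return { winner: pass1.winner, confidence: (pass1.confidence + pass2.confidence) / 2 };
  }
  // Disagreement across orderings: report a tie with reduced confidence.
  return { winner: "TIE", confidence: Math.min(pass1.confidence, pass2.confidence) * 0.5 };
}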

Rubric Generation

Well-designed rubrics reduce evaluation variance by 40-60%. A data-structure sketch follows the component list below.

Rubric Components:

  1. Level descriptions
  2. Characteristics
  3. Examples (optional)
  4. Edge cases
  5. Scoring guidelines
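
One possible data structure that mirrors these components (the type names are ours, not from the original):

interface RubricLevel {
  score: number;             // e.g. 1-5
  description: string;       // what a response at this level looks like
  characteristics: string[]; // observable markers of the level
  example?: string;          // optional illustrative response
}

interface Rubric {
  criterion: string;
  scale: [min: number, max: number];
  levels: RubricLevel[];
  edgeCases: string[];       // how to score ambiguous responses
  scoringGuidelines: string; // e.g. "round down when a response falls between levels"
}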

Practical Guidance

Evaluation Pipeline Design

Criteria Loader → Primary Scorer → Bias Mitigation → Confidence Scoring
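
A sketch of how these stages might be wired together, with each stage left as a function you supply (all names here are illustrative):

interface EvalResult { criterion: string; score: number; confidence: number }

declare function loadCriteria(path: string): Promise<{ name: string; description: string }[]>;
declare function primaryScore(
  criterion: { name: string; description: string },
  prompt: string,
  response: string,
): Promise<EvalResult>;
// e.g. re-score with length-neutral instructions, or swap framing and reconcile.
declare function mitigateBias(result: EvalResult, response: string): Promise<EvalResult>;
declare function calibrateConfidence(results: EvalResult[]): EvalResult[];

async function runPipeline(prompt: string, response: string): Promise<EvalResult[]> {
  const criteria = await loadCriteria("criteria.json");
  const raw = await Promise.all(criteria.map((c) => primaryScore(c, prompt, response)));
  const debiased = await Promise.all(raw.map((r) => mitigateBias(r, response)));
  return calibrateConfidence(debiased);
}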

Common Anti-Patterns

  • Scoring without justification
  • Single-pass pairwise
  • Overloaded criteria
  • Missing edge cases
  • Ignoring confidence calibration

Decision Framework: Direct vs. Pairwise

Objective? → Direct scoring
Preference? → Pairwise comparison
Reference available? → Reference-based evaluation
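
A tiny routing function that mirrors this decision framework (purely illustrative):

type EvalMethod = "direct" | "pairwise" | "reference-based";

function chooseMethod(opts: { objective: boolean; preference: boolean; hasReference: boolean }): EvalMethod {
  if (opts.objective) return "direct";
  if (opts.preference) return "pairwise";
  if (opts.hasReference) return "reference-based";
  return "direct"; // fall back to direct scoring with an explicit rubric
}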

Scaling Evaluation

  • PoLL (Panel of LLMs)
  • Hierarchical evaluation
  • Human-in-the-loop
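
As a sketch of the PoLL idea, several judge models vote independently and the verdicts are aggregated by majority, with confidence taken from the size of the majority (the judgeOnce helper is assumed, not a real API):

// Assumed helper: one judge model returns its verdict for a single comparison.
declare function judgeOnce(model: string, prompt: string, a: string, b: string): Promise<"A" | "B" | "TIE">;

async function panelVote(models: string[], prompt: string, a: string, b: string) {
  const votes = await Promise.all(models.map((m) => judgeOnce(m, prompt, a, b)));
  const tally = new Map<string, number>();
  for (const v of votes) tally.set(v, (tally.get(v) ?? 0) + 1);
  // Winner is the most common verdict; confidence is the share of judges that agree.
  const [winner, count] = [...tally.entries()].sort((x, y) => y[1] - x[1])[0];
  return { winner, confidence: count / votes.length, votes };
}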

Examples


Example 1: Direct Scoring for Accuracy

Input:

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

Output:

{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}

Example 2: Pairwise Comparison with Position Swap

Input:

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

First Pass (A first):

{ "winner": "B", "confidence": 0.8 }

Second Pass (B first):

{ "winner": "A", "confidence": 0.6 }

Note: in the second pass the responses are swapped, so position "A" holds Response B. Both passes therefore prefer Response B.

Final Result:

{
	"winner": "B",
	"confidence": 0.7,
	"positionConsistency": {
		"consistent": true,
		"firstPassWinner": "B",
		"secondPassWinner": "B"
	}
}

Minimal Rubric Template

Criterion: <name>
Description: <what good looks like>
Scale: 1-5
Levels:
  1: <poor>
  3: <adequate>
  5: <excellent>
Edge cases: <how to score ambiguous cases>

Guidelines

  1. Always require justification before scores
  2. Always swap positions in pairwise comparison
  3. Match scale granularity to rubric specificity
  4. Separate objective and subjective criteria
  5. Include confidence scores
  6. Define edge cases explicitly
  7. Use domain-specific rubrics
  8. Validate against human judgments
  9. Monitor systematic bias
  10. Design for iteration

Practice Task

  • Write a 1-5 rubric for a criterion such as "answer accuracy"
  • Use it to evaluate two model outputs and record where their scores differ

Integration

  • context-fundamentals
  • tool-design
  • context-optimization
  • evaluation

References

  • Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

Skill Metadata

Created: 2024-12-24
Last Updated: 2024-12-24
Author: Muratcan Koylan
Version: 1.0.0