21

LLM-as-a-Judge Evaluation

⏱️ 35 min

Advanced Evaluation

This chapter covers the complete methodology for using LLMs as evaluators (LLM-as-a-Judge), including direct scoring, pairwise comparison, rubric generation, and bias mitigation. It is not a single trick but a composable set of evaluation strategies.

  • Direct scoring works best for objective criteria.
  • Pairwise comparison works best for preference-based evaluation.
  • You must swap positions to reduce position bias.
  • Rubrics can dramatically reduce evaluation variance.
  • Confidence calibration matters as much as the scores themselves.

What You'll Learn

  • When to use direct scoring vs. pairwise comparison
  • How to build reusable rubrics
  • How to handle position/length/authority and other biases

When to Activate

Activate this skill when:

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories:

Direct Scoring: A single LLM scores the response.

  • Best for: factual accuracy, instruction following, toxicity
  • Failure mode: scale drift, inconsistent interpretation

Pairwise Comparison: Compare two responses and pick the better one.

  • Best for: tone, style, persuasiveness
  • Failure mode: position bias, length bias

Research (MT-Bench, Zheng et al., 2023) shows that for preference-based tasks, pairwise comparison is more reliable than direct scoring.

The Bias Landscape

Common biases in LLM judges:

Position Bias: Preference for a response because of where it appears, typically the first position. Mitigation: swap positions and compare both orderings.

Length Bias: Longer responses get higher scores regardless of quality. Mitigation: explicitly instruct the judge to ignore length.

Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: use different models for generation and evaluation.

Verbosity Bias: Extra detail earns a higher score even when it is unnecessary. Mitigation: the rubric should explicitly state "deduct for irrelevant detail."

Authority Bias: A more confident tone earns a higher score. Mitigation: require evidence or citations.

Metric Selection Framework

Task Type | Primary Metrics | Secondary Metrics
Binary classification | Recall, Precision, F1 | Cohen's κ
Ordinal scale (1-5) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted)
Pairwise preference | Agreement rate, Position consistency | Confidence calibration
Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall

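One gotcha: systematic bias matters more than the raw agreement rate. A judge that consistently diverges from human raters in one direction, or on one criterion, is a bigger problem than a judge whose disagreements are random noise.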
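To make the distinction between agreement and systematic bias concrete, here is a minimal sketch in plain Python. It computes unweighted Cohen's κ and the mean signed difference between judge and human scores. The function names and the sample scores are illustrative assumptions, not part of the chapter; the weighted κ listed for ordinal scales is not covered here.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_signed_difference(judge_scores, human_scores):
    """Average of (judge - human); a nonzero value signals systematic bias."""
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return sum(diffs) / len(diffs)


# Made-up scores for illustration: the judge never scores below the human.
judge = [5, 4, 4, 3, 5, 4]
human = [4, 4, 3, 3, 4, 4]
print(round(cohens_kappa(judge, human), 3))            # 0.182
print(round(mean_signed_difference(judge, human), 3))  # 0.5
```

In this made-up sample the judge agrees with the human on half the items, yet every disagreement points the same way (+1). That pattern is what the signed difference exposes and the agreement rate alone hides.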

Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires: clear criteria, a calibrated scale, and structured output.

Criteria Definition Pattern:

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
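As a rough sketch of how this pattern might be represented in code, the snippet below models a criterion as a small dataclass and combines per-criterion scores into a weighted mean. The `Criterion` class, the `weighted_score` helper, and the sample criteria are hypothetical names chosen for illustration; normalizing by total weight is an assumption, since the chapter does not say how weights are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    weight: float  # relative importance, 0-1


def weighted_score(criteria, scores):
    """Weighted mean of per-criterion scores, normalized by total weight."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


# Illustrative criteria and scores on a 1-5 scale.
criteria = [
    Criterion("Factual Accuracy", "Claims are correct and verifiable", 0.6),
    Criterion("Instruction Following", "Response does what was asked", 0.4),
]
scores = {"Factual Accuracy": 5, "Instruction Following": 3}
print(weighted_score(criteria, scores))
```

Normalizing by the total weight means the weights do not have to sum to exactly 1, which keeps the result on the original scale when criteria are added or removed.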

Scale Calibration:

  • 1-3: low cognitive load
  • 1-5: best balance
  • 1-10: needs detailed rubric

Prompt Structure:

You are an expert evaluator assessing response quality.
...

Requiring justification-before-score improves consistency.
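One way to enforce justification-before-score is to ask for it in the prompt and then reject replies that lack it. The sketch below is an assumption-laden illustration: apart from the opening line, the prompt wording is not the chapter's (the original prompt is elided above), and `build_scoring_prompt` and `parse_scoring_reply` are hypothetical helper names. No model call is made; the reply string stands in for a judge's output.

```python
import json


def build_scoring_prompt(prompt, response, criterion, description, scale_max=5):
    """Assemble a direct-scoring prompt that asks for justification first."""
    return (
        "You are an expert evaluator assessing response quality.\n\n"
        f"Original prompt:\n{prompt}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Criterion: {criterion}\n"
        f"Description: {description}\n\n"
        "Write your justification first, then give the score.\n"
        "Reply with a JSON object whose keys appear in this order: "
        f'"justification" (string), "score" (integer from 1 to {scale_max}).'
    )


def parse_scoring_reply(reply, scale_max=5):
    """Validate the judge's JSON reply and return (score, justification)."""
    data = json.loads(reply)
    justification = str(data.get("justification", "")).strip()
    if not justification:
        raise ValueError("judge reply has no justification")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 1 <= score <= scale_max:
        raise ValueError(f"score {score} is outside 1-{scale_max}")
    return score, justification


# A stand-in for a judge reply.
reply = '{"justification": "Axial tilt is correctly identified.", "score": 5}'
print(parse_scoring_reply(reply))
```

Rejecting replies without a justification turns the guideline into a hard check, so a pipeline can retry or flag the sample rather than silently accept a bare number.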

Pairwise Comparison Implementation

Position Bias Mitigation Protocol:

  1. A first / B second
  2. B first / A second
  3. Inconsistent results -> TIE + lower confidence
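The three steps above can be sketched as a small function. Everything named here is an assumption for illustration: `judge` is a hypothetical callable that returns a positional verdict ("first", "second", or "tie") plus a confidence, and the two stub judges replace a real model call. Averaging the confidences on agreement matches the chapter's Example 2; halving the lower confidence on disagreement is an arbitrary choice.

```python
def compare_with_swap(judge, prompt, response_a, response_b):
    """Run the judge in both orders and reconcile the two verdicts.

    `judge(prompt, first, second)` is a hypothetical callable returning
    ("first" | "second" | "tie", confidence between 0 and 1).
    """
    verdict_1, conf_1 = judge(prompt, response_a, response_b)  # A first
    verdict_2, conf_2 = judge(prompt, response_b, response_a)  # B first

    # Map positional verdicts back to the original labels.
    pass_1 = {"first": "A", "second": "B", "tie": "TIE"}[verdict_1]
    pass_2 = {"first": "B", "second": "A", "tie": "TIE"}[verdict_2]

    consistent = pass_1 == pass_2
    if consistent:
        winner = pass_1
        confidence = (conf_1 + conf_2) / 2
    else:
        # Disagreement across orderings suggests position bias:
        # report a tie and lower the confidence (halving is arbitrary).
        winner = "TIE"
        confidence = min(conf_1, conf_2) / 2
    return {
        "winner": winner,
        "confidence": round(confidence, 2),
        "positionConsistency": {
            "consistent": consistent,
            "firstPassWinner": pass_1,
            "secondPassWinner": pass_2,
        },
    }


def prefers_shorter(prompt, first, second):
    """Stub judge that looks at content only (picks the shorter response)."""
    return ("first" if len(first) < len(second) else "second", 0.8)


def always_first(prompt, first, second):
    """Stub judge with pure position bias."""
    return ("first", 0.9)


print(compare_with_swap(prefers_shorter, "q", "a long, jargon-heavy answer", "short"))
print(compare_with_swap(always_first, "q", "a long, jargon-heavy answer", "short"))
```

The content-based stub picks B in both orderings, so the result is consistent. The position-biased stub flips its answer when the order flips, and the protocol reports a low-confidence tie instead of a false winner.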

Rubric Generation

Rubrics reduce evaluation variance by 40-60%.

Rubric Components:

  1. Level descriptions
  2. Characteristics
  3. Examples (optional)
  4. Edge cases
  5. Scoring guidelines
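As an illustration of how these components might be kept reusable, the sketch below stores a rubric as a dataclass and renders it as plain text for a judge prompt. The `Rubric` class, its field names, and the sample "Answer Accuracy" wording are hypothetical; characteristics and examples are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    criterion: str
    description: str
    levels: dict  # score -> level description
    edge_cases: list = field(default_factory=list)
    guidelines: list = field(default_factory=list)

    def render(self):
        """Format the rubric as plain text for inclusion in a judge prompt."""
        lines = [
            f"Criterion: {self.criterion}",
            f"Description: {self.description}",
            f"Scale: {min(self.levels)}-{max(self.levels)}",
            "Levels:",
        ]
        for score in sorted(self.levels):
            lines.append(f"  {score}: {self.levels[score]}")
        # Optional sections are printed only when they have content.
        for label, items in (("Edge cases", self.edge_cases),
                             ("Scoring guidelines", self.guidelines)):
            if items:
                lines.append(f"{label}:")
                lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


# Illustrative rubric content, not taken from the chapter.
rubric = Rubric(
    criterion="Answer Accuracy",
    description="Claims in the answer are correct and relevant",
    levels={
        1: "Contains major factual errors",
        3: "Mostly correct with minor errors or omissions",
        5: "Fully correct with no factual errors",
    },
    edge_cases=["Correct answer with an unverifiable aside: score the main claim"],
)
print(rubric.render())
```

Keeping the rubric as data rather than as prompt text makes it straightforward to reuse the same definition for human raters and automated judges.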

Practical Guidance

Evaluation Pipeline Design

Criteria Loader -> Primary Scorer -> Bias Mitigation -> Confidence Scoring

Common Anti-Patterns

  • Scoring without justification
  • Single-pass pairwise
  • Overloaded criteria
  • Missing edge cases
  • Ignoring confidence calibration

Decision Framework: Direct vs. Pairwise

Objective? -> Direct scoring
Preference? -> Pairwise comparison
Reference available? -> Reference-based evaluation
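The framework can be written as a short routing function. This is a sketch under one stated assumption: the three questions are checked in the order listed, so an objective criterion wins over the other two. The function name and return labels are hypothetical.

```python
def choose_method(is_objective, is_preference, has_reference):
    """Pick an evaluation approach, checking the questions in listed order."""
    if is_objective:
        return "direct_scoring"
    if is_preference:
        return "pairwise_comparison"
    if has_reference:
        return "reference_based"
    return None  # the framework above does not cover this case


print(choose_method(is_objective=True, is_preference=False, has_reference=False))
print(choose_method(is_objective=False, is_preference=True, has_reference=False))
print(choose_method(is_objective=False, is_preference=False, has_reference=True))
```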

Scaling Evaluation

  • PoLL (Panel of LLMs)
  • Hierarchical evaluation
  • Human-in-the-loop
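A minimal sketch of how a panel vote and a human-review fallback could fit together, assuming each judge in the panel has already returned a verdict label. The function names, the majority-vote rule, and the 0.6 agreement threshold are illustrative assumptions rather than details from the chapter or the PoLL paper.

```python
from collections import Counter


def panel_verdict(votes):
    """Majority vote over judge verdicts.

    Returns (winner, agreement), where agreement is the winning vote share.
    If the top options are tied, the winner is "TIE".
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "TIE", top_count / len(votes)
    return top_label, top_count / len(votes)


def route_sample(votes, min_agreement=0.6):
    """Escalate to a human when the panel ties or agreement is too low."""
    winner, agreement = panel_verdict(votes)
    escalate = winner == "TIE" or agreement < min_agreement
    return {"winner": winner, "agreement": agreement, "human_review": escalate}


print(route_sample(["A", "A", "B"]))
print(route_sample(["A", "B", "TIE"]))
```

With three judges, a 2-to-1 split clears the threshold and is accepted, while a three-way split is routed to a human reviewer.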

Examples

Example 1: Direct Scoring for Accuracy

Input:

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

Output:

{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}

Example 2: Pairwise Comparison with Position Swap

Input:

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

First Pass (A first):

{ "winner": "B", "confidence": 0.8 }

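Second Pass (B first):

{ "winner": "A", "confidence": 0.6 }

In the second pass the labels follow position, so "A" refers to the response shown first, which is the original Response B. Mapped back to the original labels, the second-pass winner is B.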

Final Result:

{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}

Minimal Rubric Template

Criterion: <name>
Description: <what good looks like>
Scale: 1-5
Levels:
  1: <poor>
  3: <adequate>
  5: <excellent>
Edge cases: <how to score ambiguous cases>

Guidelines

  1. Always require justification before scores
  2. Always swap positions in pairwise comparison
  3. Match scale granularity to rubric specificity
  4. Separate objective and subjective criteria
  5. Include confidence scores
  6. Define edge cases explicitly
  7. Use domain-specific rubrics
  8. Validate against human judgments
  9. Monitor systematic bias
  10. Design for iteration

Practice Task

  • Write a 1-5 rubric (e.g., "answer accuracy")
  • Use it to evaluate 2 model outputs and record the differences

Integration

  • context-fundamentals
  • tool-design
  • context-optimization
  • evaluation

References

  • Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
  • G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
  • Large Language Models are not Fair Evaluators (Wang et al., 2023)

Skill Metadata

Created: 2024-12-24
Last Updated: 2024-12-24
Author: Muratcan Koylan
Version: 1.0.0

❓ Frequently Asked Questions

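When should I use direct scoring, and when pairwise comparison?

Direct scoring suits objective criteria: factual accuracy, instruction following, toxicity, and other tasks with a clear right or wrong answer. Pairwise comparison suits preference-based evaluation: tone, style, persuasiveness, and other tasks with no absolute correct answer. The MT-Bench paper (Zheng et al., 2023) shows that pairwise comparison is more reliable than direct scoring on preference-based tasks, because even human raters drift when assigning absolute scores, while judgments of "which is better, A or B" are far more consistent. The failure mode of direct scoring is scale drift; the failure modes of pairwise comparison are position bias and length bias.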

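What are the common biases of an LLM judge, and how do you address each one?

This chapter lists five: (1) Position bias: favoring a particular position. Mitigation: swap positions (run once with A first and once with B first; if the results disagree, return TIE with lower confidence). (2) Length bias: longer responses score higher. Mitigation: explicitly instruct the judge to ignore length. (3) Self-enhancement bias: a model scores its own outputs higher. Mitigation: use different models for generation and evaluation. (4) Verbosity bias: more detail scores higher. Mitigation: have the rubric state "deduct for irrelevant detail." (5) Authority bias: a more confident tone scores higher. Mitigation: require evidence or citations. Each bias has a corresponding engineering practice; you cannot rely on "switching to a bigger model" alone.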

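Can a rubric really improve scoring consistency by 40-60%? How do you write a good one?

This chapter's figure is that rubrics can reduce evaluation variance by 40-60%. A complete rubric has five components: (1) Level descriptions (what each level from 1 to 5 looks like); (2) Characteristics (key features); (3) Examples (optional, but very helpful); (4) Edge cases (how to score ambiguous situations); (5) Scoring guidelines. Scale choice: 1-3 carries low cognitive load, 1-5 offers the best balance (the default for most scenarios), and 1-10 needs a more detailed rubric to avoid drift. The minimal template is: Criterion / Description / Scale 1-5 / Levels (1 poor, 3 adequate, 5 excellent) / Edge cases.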

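Why require the LLM judge to give its justification before the score?

Justification-before-score noticeably improves consistency. When an LLM outputs a number directly, it is going on gut feeling; writing the reasons first forces it to walk through the chain of evidence, so the score no longer comes out of thin air. Example 1 in this chapter shows the pieces: a list of evidence ("Correctly identifies axial tilt as primary cause" and two more items), a justification paragraph, the score of 5, and an improvement suggestion. Combined with a confidence field (lowered when the pairwise passes disagree), it also helps surface the samples the judge itself is unsure about. This approach follows the practice recommended in the G-Eval paper.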

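How exactly does the swap protocol for pairwise evaluation work?

Example 2 in this chapter shows the full flow: (1) First pass: present A first and B second along with the prompt and criteria; the judge outputs winner B with confidence 0.8. (2) Second pass: present B first and A second with the same prompt; the judge outputs winner A with confidence 0.6, where "A" is the positional label for the response shown first, so it maps back to the original Response B. (3) If the two passes agree, use the agreed result with the averaged confidence; if they disagree, return TIE or a lower-confidence result. The final output contains the winner, confidence, and positionConsistency fields. Adding PoLL (Panel of LLMs, with several judges voting) and hierarchical evaluation (different tiers focusing on different dimensions) can push bias down further. Human-in-the-loop review is the fallback for samples the judge is unsure about.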