When should I use direct scoring vs pairwise comparison?

Direct scoring fits objective criteria: factual accuracy, instruction following, toxicity, anything with a clear right or wrong answer. Pairwise comparison fits preference judgments: tone, style, persuasiveness, anything without an absolutely correct answer. The chapter cites MT-Bench (Zheng et al., 2023) for the finding that pairwise comparison is more reliable than direct scoring on preference tasks: absolute scores drift, even among human raters, while "is A or B better?" gets far more consistent answers. Direct scoring's failure mode is scale drift; pairwise comparison's failure modes are position bias and length bias.

What are the common biases in LLM judges, and how do I mitigate each?

The chapter lists five: (1) Position bias: favoring a slot, typically whichever response appears first. Mitigation: swap positions (run once with A first and once with B first; if the two verdicts disagree, return TIE with lowered confidence); (2) Length bias: longer responses score higher. Mitigation: explicitly instruct the judge to ignore length; (3) Self-enhancement bias: a model rates its own outputs higher. Mitigation: use different models for generation and evaluation; (4) Verbosity bias: more detail wins. Mitigation: the rubric explicitly deducts for irrelevant detail; (5) Authority bias: a confident tone wins. Mitigation: require evidence or citations. Each bias has its own engineered fix; "just use a bigger model" does not cover it.

Can rubrics really cut scoring variance 40-60%? How do I write a good one?

Chapter data: rubrics cut evaluation variance by 40-60%. A complete rubric has five parts: (1) Level descriptions (what 1-5 looks like); (2) Characteristics (key traits); (3) Examples (optional but high-value); (4) Edge cases (how to score the gray zone); (5) Scoring guidelines. Scale choice: 1-3 has lowest cognitive load, 1-5 is the most balanced (default for most cases), 1-10 needs a more detailed rubric or it drifts. Minimum template: Criterion / Description / Scale 1-5 / Levels (1 poor, 3 adequate, 5 excellent) / Edge cases.

Why force the LLM judge to write justification before the score?

Requiring justification before the score significantly improves consistency. When an LLM emits a bare number, it is scoring on "vibes"; writing the rationale first walks it through the evidence before it commits to a number. The chapter's Example 1 follows this structure: evidence first (three points such as "Correctly identifies axial tilt as primary cause"), then a justification paragraph, then the score (5) and an improvement suggestion. Pair this with a confidence field (lowered when the two pairwise passes disagree) to surface the samples the judge itself is unsure about. The approach is in the same spirit as the chain-of-thought evaluation steps in G-Eval (Liu et al., 2023).

What does the swap protocol look like in practice for pairwise evaluation?

The chapter's Example 2 spells it out: (1) First pass, A first / B second: the judge sees the prompt and criteria and outputs winner B, confidence 0.8; (2) Second pass, B first / A second on the same prompt: the judge outputs winner "A", confidence 0.6. Here "A" names the first slot, which now holds the original Response B, so the verdict maps back to B; (3) the two passes agree → use that winner with the averaged confidence (0.7); if they disagree → TIE with lowered confidence. The final output carries winner, confidence, and positionConsistency fields. Add PoLL (Panel of LLMs, several judges voting) and hierarchical evaluation (different layers care about different dimensions) to push bias lower. Human-in-the-loop catches the cases the judge can't call.

LLM-as-a-Judge Evaluation


Advanced Evaluation

This chapter covers the complete methodology for using LLMs as evaluators (LLM-as-a-Judge), including direct scoring, pairwise comparison, rubric generation, and bias mitigation. It's not a single trick — it's a composable set of evaluation strategies.

Direct scoring works best for objective criteria.
Pairwise comparison works best for preference-based evaluation.
You must swap positions to reduce position bias.
Rubrics can dramatically reduce evaluation variance.
Confidence calibration matters as much as the scores themselves.

What You'll Learn

When to use direct scoring vs. pairwise comparison
How to build reusable rubrics
How to handle position/length/authority and other biases

When to Activate

Activate this skill when:

Building automated evaluation pipelines for LLM outputs
Comparing multiple model responses to select the best one
Establishing consistent quality standards across evaluation teams
Debugging evaluation systems that show inconsistent results
Designing A/B tests for prompt or model changes
Creating rubrics for human or automated evaluation
Analyzing correlation between automated and human judgments

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories:

Direct Scoring: A single LLM scores the response.

Best for: factual accuracy, instruction following, toxicity
Failure mode: scale drift, inconsistent interpretation

Pairwise Comparison: Compare two responses and pick the better one.

Best for: tone, style, persuasiveness
Failure mode: position bias, length bias

Research (MT-Bench, Zheng et al., 2023) shows that for preference-based tasks, pairwise comparison is more reliable than direct scoring.

The Bias Landscape

Common biases in LLM judges:

Position Bias: Preference for whichever response appears first. Mitigation: swap positions.

Length Bias: Longer responses get higher scores. Mitigation: explicitly instruct to ignore length.

Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: use different models for generation and evaluation.

Verbosity Bias: More detail = higher score. Mitigation: rubric should explicitly state "deduct for irrelevant detail."

Authority Bias: More confident tone = higher score. Mitigation: require evidence or citations.
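
Besides the prompt-level mitigations above, length bias can also be checked after the fact. The following is a minimal sketch, assuming you already have a batch of responses and the judge's scores for them. A strong positive correlation between word count and score is a hint, not proof, of length bias. The function names and sample data are illustrative and not from the chapter, and Pearson correlation is used only for brevity.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable relationship
    return cov / sqrt(var_x * var_y)

def length_score_correlation(responses, scores):
    """Correlate response length (in words) with the judge's scores."""
    return pearson([len(r.split()) for r in responses], scores)

# Hypothetical batch: the judge's score rises in lockstep with word count.
responses = ["Tilt.", "Axial tilt.", "Earth's axial tilt.", "Earth's axial tilt, mostly."]
scores = [1, 2, 3, 4]
print(length_score_correlation(responses, scores))  # 1.0
```

If the correlation stays high on responses that humans rated as equally good, the "ignore length" instruction is probably not working.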

Metric Selection Framework

Task Type	Primary Metrics	Secondary Metrics
Binary classification	Recall, Precision, F1	Cohen's κ
Ordinal scale (1-5)	Spearman's ρ, Kendall's τ	Cohen's κ (weighted)
Pairwise preference	Agreement rate, Position consistency	Confidence calibration
Multi-label	Macro-F1, Micro-F1	Per-label precision/recall

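One gotcha: systematic bias matters more than raw agreement rate. A judge that disagrees with humans in the same direction every time (for example, always passing borderline answers) does more damage than one whose errors are random, even when both show the same agreement rate.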
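
To make the point about agreement rate concrete, here is a small sketch in plain Python that computes raw agreement and Cohen's κ for a judge against human labels. The function names and the sample labels are assumptions for illustration; the chapter does not prescribe an implementation.

```python
from collections import Counter

def agreement_rate(judge, human):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Chance agreement: both raters pick the same label independently.
    p_e = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail labels for eight responses.
human = ["pass", "pass", "pass", "pass", "pass", "pass", "fail", "fail"]
judge = ["pass"] * 8  # a judge that never fails anything

print(agreement_rate(judge, human))  # 0.75
print(cohens_kappa(judge, human))    # 0.0
```

The judge agrees with humans 75% of the time, yet κ is 0: it never fails anything, so its agreement is no better than chance given the label distribution.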

Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires: clear criteria, a calibrated scale, and structured output.

Criteria Definition Pattern:

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
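
One way to carry this pattern into code is a small record type plus a weighted average. This is a sketch under assumptions: the `Criterion` class, the `weighted_score` helper, and the example criteria and scores are all hypothetical, and normalizing by total weight is one reasonable choice rather than something the chapter specifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    weight: float  # relative importance, 0-1

def weighted_score(criteria, scores):
    """Combine per-criterion scores into one number, normalized by total weight."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight

# Hypothetical criteria and per-criterion scores on a 1-5 scale.
criteria = [
    Criterion("Factual Accuracy", "Claims are correct and verifiable", 0.5),
    Criterion("Instruction Following", "Response does what the prompt asked", 0.25),
    Criterion("Clarity", "Easy to follow for the target reader", 0.25),
]
scores = {"Factual Accuracy": 5, "Instruction Following": 4, "Clarity": 3}
print(weighted_score(criteria, scores))  # 4.25
```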

Scale Calibration:

1-3: low cognitive load
1-5: best balance
1-10: needs detailed rubric

Prompt Structure:

You are an expert evaluator assessing response quality.
...

Requiring justification-before-score improves consistency.
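
The ordering can also be enforced when parsing the judge's reply. The sketch below assumes the judge answers with a JSON object containing `evidence`, `justification`, and `score` fields (as in Example 1) and rejects replies where the score comes first. `parse_judgment` is a hypothetical helper, not part of the chapter.

```python
import json

REQUIRED_FIELDS = ("evidence", "justification", "score")

def parse_judgment(raw, scale_min=1, scale_max=5):
    """Parse a judge reply and reject it unless the reasoning precedes the score."""
    data = json.loads(raw)  # dicts keep the key order of the JSON text
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    keys = list(data)
    if keys.index("score") < keys.index("justification"):
        raise ValueError("score was emitted before the justification")
    score = data["score"]
    if type(score) is not int or not scale_min <= score <= scale_max:
        raise ValueError(f"score must be an integer in [{scale_min}, {scale_max}]")
    return data

good = '{"evidence": ["No factual errors present"], "justification": "Accurate.", "score": 5}'
print(parse_judgment(good)["score"])  # 5
```

Because a model generates its reply token by token, the field order in the reply reflects the order in which the judge actually produced the reasoning and the number.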

Pairwise Comparison Implementation

Position Bias Mitigation Protocol:

A first / B second
B first / A second
Inconsistent results -> TIE + lower confidence
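
The three steps above can be sketched as follows. The `judge` callable is a stand-in for a real model call and is purely hypothetical: it is assumed to return a slot-based verdict (`"first"` or `"second"`) with a confidence. Halving the confidence on disagreement is also an assumption; the chapter only says to lower it.

```python
def pairwise_with_swap(judge, prompt, response_a, response_b):
    """Run the judge twice with positions swapped and reconcile the verdicts.

    `judge(prompt, first, second)` is a hypothetical callable returning
    {"winner": "first" | "second", "confidence": float}.
    """
    first_pass = judge(prompt, response_a, response_b)   # A in slot 1
    second_pass = judge(prompt, response_b, response_a)  # B in slot 1

    # Map slot-based verdicts back to the original labels.
    first_winner = "A" if first_pass["winner"] == "first" else "B"
    second_winner = "B" if second_pass["winner"] == "first" else "A"

    consistent = first_winner == second_winner
    mean_conf = (first_pass["confidence"] + second_pass["confidence"]) / 2
    return {
        "winner": first_winner if consistent else "TIE",
        # Assumption: halve the confidence on disagreement.
        "confidence": round(mean_conf if consistent else mean_conf / 2, 2),
        "positionConsistency": {
            "consistent": consistent,
            "firstPassWinner": first_winner,
            "secondPassWinner": second_winner,
        },
    }

def prefers_shorter(prompt, first, second):
    """Stub judge with no position bias: picks the shorter response."""
    return {"winner": "first" if len(first) <= len(second) else "second", "confidence": 0.8}

def always_first(prompt, first, second):
    """Stub judge with pure position bias: always picks slot 1."""
    return {"winner": "first", "confidence": 0.9}

a, b = "A long, jargon-heavy explanation.", "A short analogy."
print(pairwise_with_swap(prefers_shorter, "Explain ML", a, b)["winner"])  # B
print(pairwise_with_swap(always_first, "Explain ML", a, b)["winner"])     # TIE
```

The second stub shows why a single pass is not enough: a judge that always picks slot 1 looks decisive in one pass and is only exposed by the swap.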

Rubric Generation

Rubrics reduce evaluation variance by 40-60%.

Rubric Components:

Level descriptions
Characteristics
Examples (optional)
Edge cases
Scoring guidelines
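
A rubric with these five components can be kept as data and rendered into the judge prompt, which makes it reusable across evaluations. The sketch below is one possible shape; the `Rubric` class, its field names, and the example level text and edge case are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    criterion: str
    description: str
    levels: dict                                          # score -> level description
    characteristics: list = field(default_factory=list)
    examples: list = field(default_factory=list)          # optional
    edge_cases: list = field(default_factory=list)
    guidelines: list = field(default_factory=list)

    def render(self):
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [
            f"Criterion: {self.criterion}",
            f"Description: {self.description}",
            f"Scale: {min(self.levels)}-{max(self.levels)}",
            "Levels:",
        ]
        lines += [f"- {score}: {text}" for score, text in sorted(self.levels.items())]
        sections = [
            ("Characteristics", self.characteristics),
            ("Examples", self.examples),
            ("Edge cases", self.edge_cases),
            ("Scoring guidelines", self.guidelines),
        ]
        for title, items in sections:
            if items:  # skip empty optional sections
                lines.append(f"{title}:")
                lines += [f"- {item}" for item in items]
        return "\n".join(lines)

# Hypothetical rubric content for illustration.
rubric = Rubric(
    criterion="Factual Accuracy",
    description="Claims are correct and verifiable",
    levels={1: "Major factual errors", 3: "Mostly correct, minor errors", 5: "Fully correct"},
    edge_cases=["Correct but unverifiable claim: cap the score at 3"],
)
print(rubric.render())
```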

Practical Guidance

Evaluation Pipeline Design

Criteria Loader -> Primary Scorer -> Bias Mitigation -> Confidence Scoring

Common Anti-Patterns

Scoring without justification
Single-pass pairwise
Overloaded criteria
Missing edge cases
Ignoring confidence calibration

Decision Framework: Direct vs. Pairwise

Objective? -> Direct scoring
Preference? -> Pairwise comparison
Reference available? -> Reference-based evaluation
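
As a tiny sketch, the framework can be written as a function. Checking the rules in the order listed is an assumption, as are the function name and the returned labels.

```python
def choose_method(objective: bool, preference: bool, reference_available: bool) -> str:
    """Pick an evaluation approach; rules are checked in the order listed above."""
    if objective:
        return "direct_scoring"
    if preference:
        return "pairwise_comparison"
    if reference_available:
        return "reference_based"
    # Assumption: nothing matched, so the criterion itself needs clarifying.
    raise ValueError("no rule matched; clarify the criterion first")

print(choose_method(objective=False, preference=True, reference_available=False))
# pairwise_comparison
```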

Scaling Evaluation

PoLL (Panel of LLMs)
Hierarchical evaluation
Human-in-the-loop
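
A minimal sketch of how a panel vote and human escalation might fit together, assuming each judge has already produced a verdict of "A", "B", or "TIE". The `panel_vote` helper and the `min_agreement` threshold of 0.6 are illustrative assumptions, not values from the chapter.

```python
from collections import Counter

def panel_vote(verdicts, min_agreement=0.6):
    """Majority vote over verdicts from several judges, with an escalation flag."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(verdicts)
    # No strict plurality: report a tie and hand the sample to a human.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return {"winner": "TIE", "agreement": agreement, "escalate": True}
    # Weak majorities are also escalated for human review.
    return {"winner": top_label, "agreement": agreement, "escalate": agreement < min_agreement}

print(panel_vote(["B", "B", "A"]))
print(panel_vote(["A", "B"]))
```

The first call returns winner "B" without escalation; the second is a split panel, so it returns "TIE" and is flagged for human review.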

Examples

Example 1: Direct Scoring for Accuracy

Input:

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

Output:

{
  "criterion": "Factual Accuracy",
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "score": 5,
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}

Example 2: Pairwise Comparison with Position Swap

Input:

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

First Pass (A first):

{ "winner": "B", "confidence": 0.8 }

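Second Pass (B first):

{ "winner": "A", "confidence": 0.6 }

Here "A" labels the first slot, which now holds the original Response B. Mapped back to the original labels, the second-pass winner is B, so the two passes agree.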

Final Result:

{
	"winner": "B",
	"confidence": 0.7,
	"positionConsistency": {
		"consistent": true,
		"firstPassWinner": "B",
		"secondPassWinner": "B"
	}
}

Minimal Rubric Template

Criterion: <name>
Description: <what good looks like>
Scale: 1-5
Levels:

-   1: <poor>
-   3: <adequate>
-   5: <excellent>

Edge cases: <how to score ambiguous cases>

Guidelines

Always require justification before scores
Always swap positions in pairwise comparison
Match scale granularity to rubric specificity
Separate objective and subjective criteria
Include confidence scores
Define edge cases explicitly
Use domain-specific rubrics
Validate against human judgments
Monitor systematic bias
Design for iteration

Practice Task

Write a 1-5 rubric (e.g., "answer accuracy")
Use it to evaluate 2 model outputs and record the differences

Integration

context-fundamentals
tool-design
context-optimization
evaluation

References

Evaluating the Effectiveness of LLM-Evaluators (Eugene Yan)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
Large Language Models are not Fair Evaluators (Wang et al., 2023)

Skill Metadata

Created: 2024-12-24
Last Updated: 2024-12-24
Author: Muratcan Koylan
Version: 1.0.0
