
LLM Evaluation

Evaluation prompts (overview)

The core of evaluation: write judging criteria clearly enough that an LLM acting as judge can give explainable comparisons or scores. You're not looking for "the perfect answer" -- you're building a stable, reusable, auditable evaluation workflow.


Learning Path (suggested order)

  1. Beginner: Fix scoring dimensions and output format
  2. Intermediate: Introduce rubrics and weights
  3. Advanced: Use evaluation results to drive iteration

What Is an Evaluation Prompt?

An Evaluation Prompt has the model play judge/reviewer -- comparing output quality, scoring, and explaining its reasoning.

┌──────────────────────────────────────────────────────────────────────────────────────────┐
│                                  Evaluation Prompt Flow                                  │
├──────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                          │
│  Candidate outputs → Evaluation criteria → Score/rank       → Explanation & suggestions  │
│  (A/B/multiple)      (rubric)              (scores/ranking)   (improvement direction)    │
│                                                                                          │
└──────────────────────────────────────────────────────────────────────────────────────────┘

Why Evaluation Matters

Use Case            | Specific Application          | Business Value
--------------------|-------------------------------|---------------------------
Prompt iteration    | Pick the better version       | Lower trial-and-error cost
Content production  | Copy/summary quality review   | Better consistency
Model comparison    | Compare outputs across models | Inform model selection
Standardized output | Auto-scoring and filtering    | Better efficiency

Business Output (PM Perspective)

With Evaluation Prompts you can deliver:

  • Quantifiable comparison results (A/B output rankings)
  • Evaluation templates (reusable rubrics)
  • Improvement suggestions (for prompt iteration)

Completion criteria (suggested):

  • Read this page + complete 1 exercise + self-check once

Core Prompt Structure

Goal: Evaluate candidate outputs
Criteria: Scoring dimensions and weights
Format: Output structure (scores/rationale/conclusion)
Input: Candidate answers

General Template

You are a strict evaluator. Compare the outputs using the criteria below.

Scoring criteria (1-5 per dimension):
1) Accuracy
2) Clarity
3) Completeness

Candidate outputs:
A: {output_a}
B: {output_b}

Output format:
- Scores: A=?, B=?
- Winner:
- Rationale (1-3 points):
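The template above can be filled in programmatically before each judging call. A minimal sketch, using the same template text; `build_eval_prompt` is an illustrative helper name, not part of any library:

```python
# Sketch: fill the general evaluation template with two candidate
# outputs. The {output_a}/{output_b} placeholders are the ones from
# the template above.
TEMPLATE = """You are a strict evaluator. Compare the outputs using the criteria below.

Scoring criteria (1-5 per dimension):
1) Accuracy
2) Clarity
3) Completeness

Candidate outputs:
A: {output_a}
B: {output_b}

Output format:
- Scores: A=?, B=?
- Winner:
- Rationale (1-3 points):"""


def build_eval_prompt(output_a: str, output_b: str) -> str:
    """Return the evaluation prompt with both candidates filled in."""
    return TEMPLATE.format(output_a=output_a, output_b=output_b)


prompt = build_eval_prompt("Answer 1", "Answer 2")
print(prompt)
```

The resulting string is what you would send to the judge model; swapping in new candidates never touches the criteria or the output format.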

Quick Start: A/B Comparison

Compare the two answers below. Score on "accuracy, clarity, completeness" (1-5 each).

A: Answer 1
B: Answer 2

Example 1: Writing Quality Evaluation

Evaluate these two product descriptions. Criteria: conciseness, persuasiveness, information completeness.

A: Lightweight and durable, great for travel.
B: Ultra-light design, 30L capacity, works for both urban and travel use.

Example 2: Summary Quality Evaluation

Evaluate the two summaries. Criteria: covers key points, clearly expressed, doesn't introduce new information.

Example 3: Structured Scoring (Rubric)

Scoring dimensions:
1) Accuracy (40%)
2) Readability (30%)
3) Structure (30%)

Output:
- Total score (0-100)
- Per-dimension scores
- Winner
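Turning the per-dimension scores into the 0-100 total is a weighted sum over the rubric above. A small sketch, assuming each dimension is itself scored 0-100 (the rubric does not fix the per-dimension scale):

```python
# Sketch: weighted total from the rubric above
# (accuracy 40%, readability 30%, structure 30%).
# The 0-100 per-dimension scale is an assumption.
WEIGHTS = {"accuracy": 0.4, "readability": 0.3, "structure": 0.3}


def weighted_total(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each 0-100)."""
    assert set(scores) == set(WEIGHTS), "score every rubric dimension"
    return sum(WEIGHTS[dim] * score for dim, score in scores.items())


total = weighted_total({"accuracy": 90, "readability": 80, "structure": 70})
print(total)  # 0.4*90 + 0.3*80 + 0.3*70 = 81.0
```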

Migration Template (swap variables to reuse)

Criteria: {criteria}
Candidates: {outputs}
Output: Scores + Winner + Rationale
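To make the migration template concretely reusable, the `{criteria}` and `{outputs}` variables can be bound per task while the shell stays fixed. A sketch; `make_prompt` and the joining conventions are illustrative choices:

```python
# Sketch: the migration template as a reusable function.
# {criteria} and {outputs} are the swappable variables named above.
MIGRATION = """Criteria: {criteria}
Candidates: {outputs}
Output: Scores + Winner + Rationale"""


def make_prompt(criteria: list[str], outputs: dict[str, str]) -> str:
    """Bind a rubric and a set of labeled candidates into the template."""
    return MIGRATION.format(
        criteria=", ".join(criteria),
        outputs="; ".join(f"{label}: {text}" for label, text in outputs.items()),
    )


p = make_prompt(["accuracy", "clarity"], {"A": "Answer 1", "B": "Answer 2"})
print(p)
```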

Self-check Checklist (review before submitting)

  • Are scoring dimensions clear and actionable?
  • Does it prevent the model from introducing new info?
  • Is the output structure fixed?
  • Can the output be parsed programmatically?
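The last checklist item is easy to verify in practice: if the output structure is truly fixed, a few regular expressions recover the scores and winner. A minimal sketch against the field labels from the general template; real judge output may need more defensive parsing:

```python
# Sketch: parse the fixed output format ("Scores: A=3, B=5",
# "Winner: B") with regular expressions.
import re


def parse_eval(text: str) -> dict:
    """Extract per-candidate scores and the declared winner."""
    scores = {m.group(1): int(m.group(2))
              for m in re.finditer(r"\b([AB])=(\d+)", text)}
    winner = re.search(r"Winner:\s*([AB])", text)
    return {"scores": scores, "winner": winner.group(1) if winner else None}


result = parse_eval("- Scores: A=3, B=5\n- Winner: B\n- Rationale: B is more specific.")
print(result)  # {'scores': {'A': 3, 'B': 5}, 'winner': 'B'}
```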

Advanced Tips

  1. Weighted scoring: Assign different weights to different metrics.
  2. Score first, explain after: Prevents rationale from retroactively influencing scores.
  3. Three-round evaluation: Run multiple times and average to reduce bias.
  4. Align with goals: Scoring criteria should match business objectives.
  5. Output improvement suggestions: Makes it easy to iterate directly.
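Tip 3 (multi-round averaging) can be sketched in a few lines. `run_judge` below stands in for one call to the LLM judge and is a hypothetical callable, not a real API; the canned scores in the usage example replace three real judge calls:

```python
# Sketch: average scores over several evaluation rounds to damp
# run-to-run noise (tip 3 above).
from statistics import mean


def average_rounds(run_judge, rounds: int = 3) -> dict[str, float]:
    """Call the judge `rounds` times and average per-candidate scores."""
    results = [run_judge() for _ in range(rounds)]
    return {cand: mean(r[cand] for r in results) for cand in results[0]}


# Usage with canned scores standing in for three real judge calls:
fake_runs = iter([{"A": 3, "B": 5}, {"A": 4, "B": 4}, {"A": 3, "B": 5}])
avg = average_rounds(lambda: next(fake_runs), rounds=3)
print(avg)  # A averages to ~3.33, B to ~4.67
```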

Common Problems & Solutions

Problem             | Cause            | Solution
--------------------|------------------|-------------------------------
Inconsistent scores | Vague criteria   | Clarify dimension descriptions
Verbose output      | No format limits | Fix output fields
Introduces new info | Not restricted   | Add "based on input only"
Too subjective      | No rubric        | Design a scoring table

Hands-on Exercises

Exercise 1: A/B Evaluation

Evaluate two course descriptions. Criteria: clarity, appeal, information completeness.

Exercise 2: Multi-candidate Ranking

Rank 3 answers and provide rationale.

Exercise Scoring Rubric (self-assessment)

Dimension      | Passing Criteria
---------------|--------------------------------------------------
Clear criteria | Scoring dimensions are actionable
Stable output  | Scores and rationale are structurally consistent
Reusable       | Rubric is swappable
Parseable      | Output can be processed programmatically

Takeaways

  1. The key to Evaluation Prompts is actionable scoring criteria.
  2. Fixed output structure makes comparison and automation easier.
  3. Rubrics significantly reduce subjective bias.
  4. Output suggestions can feed directly into prompt iteration.
  5. Templates improve reuse efficiency.