LLM Evaluation
Evaluation prompts (overview)
The core of evaluation is writing judging criteria clearly enough that an LLM acting as judge can produce explainable comparisons or scores. You're not looking for "the perfect answer" -- you're building a stable, reusable, auditable evaluation workflow.
Learning Path (suggested order)
- Beginner: Fix scoring dimensions and output format
- Intermediate: Introduce rubrics and weights
- Advanced: Use evaluation results to drive iteration
What Is an Evaluation Prompt?
An Evaluation Prompt has the model play judge/reviewer -- comparing output quality, scoring, and explaining its reasoning.
┌────────────────────────────────────────────────────────┐
│                 Evaluation Prompt Flow                 │
├────────────────────────────────────────────────────────┤
│                                                        │
│  Candidate outputs  →  Evaluation criteria             │
│  (A/B/multiple)        (rubric)                        │
│          ↓                                             │
│  Score / rank       →  Explanation & suggestions       │
│  (scores/ranking)      (improvement direction)         │
│                                                        │
└────────────────────────────────────────────────────────┘
Why Evaluation Matters
| Use Case | Specific Application | Business Value |
|---|---|---|
| Prompt iteration | Pick the better version | Lower trial-and-error cost |
| Content production | Copy/summary quality review | Better consistency |
| Model comparison | Compare outputs across models | Inform model selection |
| Standardized output | Auto-scoring and filtering | Better efficiency |
Business Output (PM Perspective)
With Evaluation Prompts you can deliver:
- Quantifiable comparison results (A/B output rankings)
- Evaluation templates (reusable rubrics)
- Improvement suggestions (for prompt iteration)
Completion criteria (suggested):
- Read this page + complete 1 exercise + self-check once
Core Prompt Structure
Goal: Evaluate candidate outputs
Criteria: Scoring dimensions and weights
Format: Output structure (scores/rationale/conclusion)
Input: Candidate answers
General Template
You are a strict evaluator. Compare the outputs using the criteria below.
Scoring criteria (1-5 per dimension):
1) Accuracy
2) Clarity
3) Completeness
Candidate outputs:
A: {output_a}
B: {output_b}
Output format:
- Scores: A=?, B=?
- Winner:
- Rationale (1-3 points):
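In code, the general template reduces to string formatting: only the candidate slots change, while the criteria and output format stay fixed. A minimal sketch (sending the prompt to a model is left as a placeholder, since the client API depends on your provider):

```python
# Build an evaluation prompt from the general template above.
# Only the candidate outputs vary; criteria and format stay fixed.

EVAL_TEMPLATE = """You are a strict evaluator. Compare the outputs using the criteria below.

Scoring criteria (1-5 per dimension):
1) Accuracy
2) Clarity
3) Completeness

Candidate outputs:
A: {output_a}
B: {output_b}

Output format:
- Scores: A=?, B=?
- Winner:
- Rationale (1-3 points):"""


def build_eval_prompt(output_a: str, output_b: str) -> str:
    """Fill the candidate slots; everything else is pinned down."""
    return EVAL_TEMPLATE.format(output_a=output_a, output_b=output_b)


prompt = build_eval_prompt("Answer 1", "Answer 2")
# Ready to send to whichever model acts as judge, e.g.:
# verdict = call_llm(prompt)   # call_llm is a placeholder for your client
```

Keeping the template as one constant (rather than assembling it ad hoc per call) is what makes the rubric reusable and the runs comparable.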
Quick Start: A/B Comparison
Compare the two answers below. Score on "accuracy, clarity, completeness" (1-5 each).
A: Answer 1
B: Answer 2
Example 1: Writing Quality Evaluation
Evaluate these two product descriptions. Criteria: conciseness, persuasiveness, information completeness.
A: Lightweight and durable, great for travel.
B: Ultra-light design, 30L capacity, works for both urban and travel use.
Example 2: Summary Quality Evaluation
Evaluate the two summaries. Criteria: covers key points, clearly expressed, doesn't introduce new information.
Example 3: Structured Scoring (Rubric)
Scoring dimensions:
1) Accuracy (40%)
2) Readability (30%)
3) Structure (30%)
Output:
- Total score (0-100)
- Per-dimension scores
- Winner
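The weighted total in Example 3 is plain arithmetic, so it can (and arguably should) be computed in code rather than trusted to the judge. A sketch, assuming the 1-5 per-dimension scale from the earlier examples mapped onto the 0-100 total:

```python
# Combine per-dimension scores (1-5) into a weighted 0-100 total,
# using the weights from the rubric above.
WEIGHTS = {"accuracy": 0.40, "readability": 0.30, "structure": 0.30}


def weighted_total(scores: dict) -> float:
    """Map each 1-5 score onto 0-100, then apply the rubric weights."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[dim] * (score / 5) * 100 for dim, score in scores.items())


print(weighted_total({"accuracy": 4, "readability": 3, "structure": 5}))  # 80.0
```

Letting the judge emit only per-dimension scores and doing the aggregation yourself removes one source of arithmetic errors from the model's output.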
Migration Template (swap variables to reuse)
Criteria: {criteria}
Candidates: {outputs}
Output: Scores + Winner + Rationale
Self-check Checklist (review before submitting)
- Are scoring dimensions clear and actionable?
- Does it prevent the model from introducing new info?
- Is the output structure fixed?
- Can the output be parsed programmatically?
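The last checklist item is easiest to satisfy when the judge's reply follows the fixed fields of the general template. A regex sketch for that format (the field labels "Scores", "Winner" are taken from the template above; adjust the patterns if your format differs):

```python
import re

# Parse a judge reply that follows the fixed output format:
#   - Scores: A=?, B=?
#   - Winner:
#   - Rationale: ...

def parse_verdict(reply: str) -> dict:
    scores = re.search(r"Scores:\s*A\s*=\s*(\d+)\s*,\s*B\s*=\s*(\d+)", reply)
    winner = re.search(r"Winner:\s*([AB])", reply)
    if not scores or not winner:
        raise ValueError("judge output did not match the fixed format")
    return {
        "a": int(scores.group(1)),
        "b": int(scores.group(2)),
        "winner": winner.group(1),
    }


reply = "- Scores: A=3, B=5\n- Winner: B\n- Rationale: B covers capacity and use cases."
print(parse_verdict(reply))  # {'a': 3, 'b': 5, 'winner': 'B'}
```

Raising on a format mismatch (rather than guessing) surfaces drift in the judge's output early, which is exactly what "parseable" buys you.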
Advanced Tips
- Weighted scoring: Assign different weights to different metrics.
- Score first, explain after: Prevents rationale from retroactively influencing scores.
- Three-round evaluation: Run multiple times and average to reduce bias.
- Align with goals: Scoring criteria should match business objectives.
- Output improvement suggestions: Makes it easy to iterate directly.
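The multi-round tip is a small loop around the judge call. A sketch, where `judge` stands in for your actual LLM call plus score parsing (the stub below only exists to demonstrate the averaging):

```python
from statistics import mean

# Run the judge several times and average the scores,
# smoothing out run-to-run variation in the model's output.

def evaluate_n_rounds(judge, prompt: str, rounds: int = 3) -> float:
    return mean(judge(prompt) for _ in range(rounds))


# Stub judge returning a different score each call, standing in for
# a real call_llm + parse step:
_scores = iter([4, 5, 3])

def stub_judge(prompt: str) -> int:
    return next(_scores)


avg = evaluate_n_rounds(stub_judge, "Evaluate A vs B")
print(avg)  # averages [4, 5, 3] to 4
```

With a real model you would also want a nonzero temperature across rounds; averaging three identical greedy runs reduces nothing.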
Common Problems & Solutions
| Problem | Cause | Solution |
|---|---|---|
| Inconsistent scores | Vague criteria | Clarify dimension descriptions |
| Verbose output | No format limits | Fix output fields |
| Introduces new info | Not restricted | Add "based on input only" |
| Too subjective | No rubric | Design a scoring table |
Hands-on Exercises
Exercise 1: A/B Evaluation
Evaluate two course descriptions. Criteria: clarity, appeal, information completeness.
Exercise 2: Multi-candidate Ranking
Rank 3 answers and provide rationale.
Exercise Scoring Rubric (self-assessment)
| Dimension | Passing Criteria |
|---|---|
| Clear criteria | Scoring dimensions are actionable |
| Stable output | Scores and rationale are structurally consistent |
| Reusable | Rubric is swappable |
| Parseable | Output can be processed programmatically |
Takeaways
- The key to Evaluation Prompts is actionable scoring criteria.
- Fixed output structure makes comparison and automation easier.
- Rubrics significantly reduce subjective bias.
- Output suggestions can feed directly into prompt iteration.
- Templates improve reuse efficiency.