Prompt Master

Master the art of conversing with AI

Evaluate Outputs (Teacher)

Use an LLM to compare and evaluate outputs

TL;DR

  • This is a classic LLM-as-a-Judge use case: have a judge model compare two outputs and give feedback like a teacher grading papers.
  • Great for evaluation: A/B testing prompts, comparing different models, or comparing different settings of the same model.
  • The risk is judge bias and instability. Mitigate it by fixing the rubric in advance, requiring evidence citations (direct quotes from the outputs), and cross-validating with multiple rounds or multiple judges.

Background

This prompt tests an LLM's ability to evaluate and compare outputs from two different models (or two different prompts), as if it were a teacher grading student work.

One workflow:

  1. Ask two models to write the dialogue using the same prompt
  2. Ask a judge model to compare the two outputs

Example generation prompt (for the two models):

Plato's Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?

How to Apply

You can break this evaluation workflow into three steps:

  1. Generate: Use the same generation prompt to produce two outputs (different models or different prompt versions)
  2. Judge: Use a judge prompt to compare and evaluate
  3. Decide: Pick the better version based on the rubric, or feed the feedback into the next iteration round
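
The three steps above can be sketched in plain Python. `call_model` is a hypothetical placeholder for whatever chat-completion client you use, and the prompt strings are illustrative:

```python
# A hypothetical stand-in for your chat-completion client of choice.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your API client here")

# Illustrative generation prompt (abbreviated from the Gorgias example above).
GENERATION_PROMPT = (
    "Write a dialogue by Plato criticizing the use of "
    "autoregressive language models."
)

def build_judge_prompt(output_a: str, output_b: str) -> str:
    # Step 2: the judge sees both outputs in a single prompt.
    return (
        "Can you compare the two outputs below as if you were a teacher?\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )

def evaluate(model_a: str, model_b: str, judge_model: str) -> str:
    out_a = call_model(model_a, GENERATION_PROMPT)  # Step 1: generate twice
    out_b = call_model(model_b, GENERATION_PROMPT)
    # Step 3 (decide) happens on the returned verdict, by hand or by parsing.
    return call_model(judge_model, build_judge_prompt(out_a, out_b))
```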

In production, make the rubric more specific, for example:

  • coherence (logic and structure)
  • faithfulness (does it stay on topic)
  • style adherence (does it match Plato's dialogue style)
  • clarity (readability and expression)
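
One way to pin these dimensions down is to render them into the judge prompt as a scored checklist. A sketch using the four dimensions above; the instruction wording is illustrative, not canonical:

```python
# The four rubric dimensions above, rendered into a scored judge prompt.
RUBRIC = {
    "coherence": "logic and structure",
    "faithfulness": "does it stay on topic",
    "style adherence": "does it match Plato's dialogue style",
    "clarity": "readability and expression",
}

def rubric_judge_prompt(output_a: str, output_b: str, rubric=RUBRIC) -> str:
    dims = "\n".join(f"- {name} (1-5): {desc}" for name, desc in rubric.items())
    return (
        "Compare the two outputs below as if you were a teacher. "
        "Score each output on every dimension, quote evidence for each score, "
        "then name a winner or declare a tie.\n\n"
        f"Dimensions:\n{dims}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )
```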

How to Iterate

  1. Have the judge output structured results: Winner + Scores + Evidence + Actionable feedback
  2. Fix comparison dimensions and scoring ranges (e.g., 1-5) to reduce arbitrariness
  3. Add a "tie / unsure" option to avoid forced picks
  4. Use multiple judge prompts or multiple judge models for consistency checks (majority vote)
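
The majority vote in point 4 is straightforward once each judge run is reduced to a label. A minimal sketch, assuming verdicts have already been parsed into 'A', 'B', or 'tie':

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Combine per-judge verdicts ('A', 'B', or 'tie') into one decision.

    Returns 'tie' when no label wins a strict majority, which also covers
    the 'unsure' case from point 3 above.
    """
    top_label, top_count = Counter(verdicts).most_common(1)[0]
    if top_count * 2 <= len(verdicts):
        return "tie"
    return top_label
```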

Self-check Rubric

  • Does the judge cite specific passages from the outputs (evidence)?
  • Do the scores correspond to rubric dimensions, not vague generalizations?
  • Can it provide actionable improvement suggestions (how to change the prompt next round)?
  • Are results stable across multiple runs (temperature control + multi-round consistency)?
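
The last check, run-to-run stability, can be quantified by re-running the same judge prompt several times and measuring how often the runs agree. A small sketch, again assuming verdicts are parsed into labels:

```python
from collections import Counter

def agreement_rate(verdicts: list[str]) -> float:
    """Fraction of repeated judge runs that agree with the modal verdict.

    1.0 means perfectly stable; a value near chance level means the judge
    is effectively guessing and the comparison should not be trusted.
    """
    if not verdicts:
        return 0.0
    _, modal_count = Counter(verdicts).most_common(1)[0]
    return modal_count / len(verdicts)
```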

Practice

Exercise: do an A/B prompt test on a writing task you commonly use:

  • prompt A: shorter, more open-ended
  • prompt B: longer, with constraints and structured output

Then use a judge prompt to compare them and produce an "improvement checklist" for the next prompt iteration round.
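
A minimal harness for this exercise might look like the following. Both prompt texts are made-up examples for a generic writing task; swap in your own:

```python
# Two made-up prompt variants for the A/B exercise.
PROMPT_A = "Write a short announcement for our new release."

PROMPT_B = (
    "Write an announcement for our new release.\n"
    "Constraints: under 150 words, neutral tone, no superlatives.\n"
    "Output format: a headline, one TL;DR sentence, three bullet points."
)

def checklist_judge_prompt(output_a: str, output_b: str) -> str:
    # Asks for a winner plus the improvement checklist for the next round.
    return (
        "Compare the two outputs below as if you were a teacher. "
        "Pick the stronger one, then write an improvement checklist "
        "for the next prompt iteration.\n\n"
        f"Output A (from the short prompt):\n{output_a}\n\n"
        f"Output B (from the constrained prompt):\n{output_b}"
    )
```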

Prompt (evaluation)

Can you compare the two outputs below as if you were a teacher?

Output from ChatGPT: {output 1}

Output from GPT-4: {output 2}

Code / API

OpenAI (Python)

from openai import OpenAI

client = OpenAI()

# Ask the judge to grade the two outputs. temperature=1 gives varied
# feedback; lower it (e.g., 0.2) for more reproducible verdicts.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(response.choices[0].message.content)

Fireworks (Python)

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)

# stream=True returns an iterator of chunks; print the judge's feedback
# as it arrives (early chunks may carry no content).
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
