Evaluate Outputs (Teacher)
Use an LLM to compare and evaluate outputs
TL;DR
- This is a classic LLM-as-a-Judge use case: have a judge model compare two outputs and give feedback, like a teacher grading papers.
- Great for evaluation: A/B testing prompts, comparing different models, or comparing different settings of the same model.
- The risk is judge bias and instability. Pin down the rubric, require evidence citations (quotes from the outputs), and do multi-round / multi-judge cross-validation.
Background
This prompt tests an LLM's ability to evaluate and compare outputs from two different models (or two different prompts), as if it were a teacher.
One workflow:
- Ask two models to write the dialogue with the same prompt
- Ask a judge model to compare the two outputs
Example generation prompt (for the two models):
Plato's Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?
How to Apply
You can break this evaluation workflow into three steps:
- Generate: Use the same generation prompt to produce two outputs (different models or different prompt versions)
- Judge: Use a judge prompt to compare and evaluate
- Decide: Pick the better version based on the rubric, or feed the feedback into the next iteration round
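The three steps above can be sketched as a minimal loop. This is a sketch under assumptions: `generate` and `judge` are hypothetical placeholders for your actual model calls, and the JSON verdict shape (`winner`, `scores`) is one possible convention, not a fixed API.

```python
import json

def generate(model, prompt):
    """Placeholder: call your model of choice and return its text output."""
    raise NotImplementedError

def judge(judge_prompt, output_a, output_b):
    """Placeholder: ask the judge model for a JSON verdict with winner + scores."""
    raise NotImplementedError

def decide(verdict):
    """Step 3: pick the better version from the judge's structured verdict."""
    winner = verdict.get("winner")
    if winner not in ("A", "B"):
        return "tie"  # honor a tie / unsure option instead of forcing a pick
    return winner

# Example verdict the judge might return (shape is an assumption):
verdict = json.loads('{"winner": "A", "scores": {"A": 4, "B": 3}}')
print(decide(verdict))  # "A"
```

Keeping the decision logic separate from the judge call makes it easy to swap in different aggregation rules later (e.g., majority vote across judges).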
In production, make the rubric more specific, for example:
- coherence (logic and structure)
- faithfulness (does it stay on topic)
- style adherence (does it match Plato dialogue style)
- clarity (readability and expression)
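A rubric like the one above can be injected into the judge prompt programmatically, so every evaluation run uses the same dimensions and scale. A minimal sketch; the prompt wording and the 1-5 scale are illustrative choices, not a prescribed format:

```python
RUBRIC = {
    "coherence": "logic and structure",
    "faithfulness": "does it stay on topic",
    "style adherence": "does it match Plato dialogue style",
    "clarity": "readability and expression",
}

def build_judge_prompt(output_1, output_2, rubric=RUBRIC):
    """Render a judge prompt that scores each rubric dimension on 1-5."""
    lines = [
        "Compare the two outputs below as a teacher would.",
        "Score each output 1-5 on every dimension and quote evidence:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in rubric.items()]
    lines += ["", "Output A:", output_1, "", "Output B:", output_2]
    return "\n".join(lines)

prompt = build_judge_prompt("{output 1}", "{output 2}")
```

Because the rubric lives in one dict, adding or removing a dimension changes every judge call consistently.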
How to Iterate
- Have the judge output structured results: Winner + Scores + Evidence + Actionable feedback
- Pin down the comparison dimensions and scoring ranges (e.g., 1-5) to reduce arbitrariness
- Add a "tie / unsure" option to avoid forced picks
- Use multiple judge prompts or multiple judge models for consistency checks (majority vote)
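The majority-vote consistency check above can be implemented as a small aggregation function. A sketch, assuming each judge run yields a single winner label ("A", "B", or "tie"):

```python
from collections import Counter

def majority_vote(verdicts, min_margin=1):
    """Aggregate winners from several judge runs; return 'tie' if no clear majority."""
    counts = Counter(verdicts)
    top_two = counts.most_common(2)
    if len(top_two) == 1:
        return top_two[0][0]  # unanimous
    (first, n1), (_second, n2) = top_two
    return first if n1 - n2 >= min_margin else "tie"

print(majority_vote(["A", "A", "B"]))  # "A"
print(majority_vote(["A", "B"]))       # "tie"
```

Raising `min_margin` makes the vote stricter, so borderline disagreements between judges resolve to "tie" rather than a noisy winner.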
Self-check Rubric
- Does the judge cite specific passages from the outputs (evidence)?
- Do the scores correspond to rubric dimensions, not vague generalizations?
- Can it provide actionable improvement suggestions (how to change the prompt next round)?
- Are results stable across multiple runs (temperature control + multi-round consistency)?
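For the stability check in the last item, a simple agreement rate over repeated judge runs is enough to flag an unstable setup. A minimal sketch; the 0.8 threshold is an illustrative assumption, not a standard:

```python
def agreement_rate(verdicts):
    """Fraction of runs that agree with the modal verdict; 1.0 means fully stable."""
    if not verdicts:
        return 0.0
    modal = max(set(verdicts), key=verdicts.count)
    return verdicts.count(modal) / len(verdicts)

# e.g., five repeated judge runs at low temperature:
runs = ["A", "A", "A", "tie", "A"]
print(agreement_rate(runs))  # 0.8
```

If the rate falls below your threshold (say 0.8), tighten the rubric or lower the judge temperature before trusting the verdict.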
Practice
Exercise: do an A/B prompt test on a writing task you commonly use:
- prompt A: shorter, more open-ended
- prompt B: longer, with constraints and structured output
Then use a judge prompt to compare them and produce an "improvement checklist" for the next prompt iteration round.
Prompt (evaluation)
Can you compare the two outputs below as if you were a teacher?
Output from ChatGPT: {output 1}
Output from GPT-4: {output 2}
Code / API
OpenAI (Python)
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

# The judge's evaluation is in the first choice's message
print(response.choices[0].message.content)
```
Fireworks (Python)
```python
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)

# With stream=True the call returns a generator of chunks; iterate to read the deltas
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```