Evaluate outputs (teacher)
Use an LLM to compare outputs
TL;DR
- This is a classic LLM-as-a-Judge setup: a judge model compares two outputs and gives feedback the way a teacher grades homework.
- It is a good fit for evaluation: A/B testing prompts, comparing different models, or comparing different settings of the same model.
- The main risk is judge bias and instability; fix the rubric, require evidence (quoted excerpts from the outputs), and cross-check with multiple rounds or multiple judges.
Background
This prompt tests an LLM's ability to evaluate and compare outputs from two different models (or two different prompts), as if it were a teacher.
One workflow:
- Ask two models to write the dialogue with the same prompt
- Ask a judge model to compare the two outputs
Example generation prompt (for the two models):
Plato’s Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?
How to Apply
You can split this evaluation workflow into three steps (a minimal sketch follows the list):
- Generate: produce two outputs from the same generation prompt (different models, or different prompt versions)
- Judge: compare the two outputs with the judge prompt
- Decide: pick the better version against the rubric, or feed the feedback into the next iteration
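A minimal sketch of the Generate and Judge steps, assuming the same OpenAI Python client used in the Code / API section below; the two model names are placeholders for whichever pair you want to compare:
from openai import OpenAI

client = OpenAI()

# Shortened here; use the full generation prompt from the Background section.
GENERATION_PROMPT = (
    "Can you write a dialogue by Plato where he criticizes the use of "
    "autoregressive language models?"
)

def generate(model: str, prompt: str) -> str:
    # One completion from the given model for the given prompt.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Generate: same prompt, two candidates (different models or prompt versions).
output_1 = generate("gpt-3.5-turbo", GENERATION_PROMPT)  # placeholder model names
output_2 = generate("gpt-4", GENERATION_PROMPT)

# Judge: compare the two candidates with the evaluation prompt.
verdict = generate(
    "gpt-4",
    "Can you compare the two outputs below as if you were a teacher?\n\n"
    f"Output from model A:\n{output_1}\n\n"
    f"Output from model B:\n{output_2}",
)
print(verdict)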
In production, it helps to make the rubric more concrete, for example (see the prompt sketch below):
- coherence (logic and structure)
- faithfulness (does it stay on topic?)
- style adherence (does it match the Plato dialogue style?)
- clarity (readability and expression)
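One way to make those dimensions explicit is to write them into the judge prompt with a fixed 1-5 scale; the wording below is an illustrative sketch, not a canonical template:
RUBRIC_JUDGE_PROMPT = """\
Compare the two outputs below as if you were a teacher.
Score each output from 1 to 5 on each dimension:
- coherence: logic and structure
- faithfulness: does it stay on topic?
- style adherence: does it read like a Plato dialogue?
- clarity: readability and expression
Quote at least one excerpt from each output as evidence for your scores,
then name the overall winner, or declare a tie if you are unsure.

Output A:
{output_1}

Output B:
{output_2}
"""

judge_prompt = RUBRIC_JUDGE_PROMPT.format(output_1=output_1, output_2=output_2)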
How to Iterate
- Have the judge return structured results: Winner + Scores + Evidence + Actionable feedback
- Fix the comparison dimensions and scoring range (e.g. 1-5) to reduce arbitrariness
- Add a "tie / unsure" option so the judge is not forced to pick a side
- Use multiple judge prompts or multiple judge models as a consistency check (majority vote); see the sketch after this list
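A sketch of a structured, repeated judging pass with a majority vote over the winner field, reusing the client and judge_prompt defined above; asking for JSON in the prompt and parsing it is a simple approach and may need validation or retries in practice:
import json
from collections import Counter

STRUCTURED_SUFFIX = (
    "\nReturn only a JSON object with keys: winner ('A', 'B' or 'tie'), "
    "scores (1-5 per dimension for each output), evidence (quoted excerpts), "
    "and feedback (actionable suggestions for the next prompt iteration)."
)

def judge_once(judge_model: str, prompt: str) -> dict:
    # One judging pass; low temperature keeps repeated runs more comparable.
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt + STRUCTURED_SUFFIX}],
        temperature=0.2,
    )
    return json.loads(resp.choices[0].message.content)

# Consistency check: several passes (or several judge models), then a majority vote.
verdicts = [judge_once("gpt-4", judge_prompt) for _ in range(3)]
winner, votes = Counter(v["winner"] for v in verdicts).most_common(1)[0]
print(f"majority winner: {winner} ({votes}/{len(verdicts)} votes)")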
Self-check rubric
- Does the judge quote specific excerpts from the outputs as evidence?
- Do the scores map to the rubric dimensions, rather than being generic remarks?
- Does it give actionable suggestions (how to change the prompt in the next round)?
- Are results stable across repeated runs (temperature control plus multi-round consistency)?
Practice
Exercise: run an A/B prompt test on a writing task you use often:
- prompt A: shorter and more open-ended
- prompt B: longer, with constraints and structured output
Then compare the two with the judge prompt and turn the feedback into an improvement list for the next prompt iteration (a skeleton follows).
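A skeleton for this exercise, reusing the generate, RUBRIC_JUDGE_PROMPT, and judge_once helpers sketched above; the two prompt variants are only example wording to adapt to your own task:
# Prompt A: shorter and more open-ended (example wording).
PROMPT_A = "Write a short announcement for our new feature."

# Prompt B: longer, with constraints and a structured output (example wording).
PROMPT_B = (
    "Write an announcement for our new feature. Constraints: under 150 words, "
    "friendly tone, end with a call to action. Structure: a headline, two short "
    "paragraphs, and a bullet list of benefits."
)

output_a = generate("gpt-4", PROMPT_A)
output_b = generate("gpt-4", PROMPT_B)

verdict = judge_once("gpt-4", RUBRIC_JUDGE_PROMPT.format(output_1=output_a, output_2=output_b))
print(verdict["feedback"])  # the improvement list for the next prompt iteration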
Prompt (evaluation)
Can you compare the two outputs below as if you were a teacher?
Output from ChatGPT: {output 1}
Output from GPT-4: {output 2}
Code / API
OpenAI (Python)
from openai import OpenAI

client = OpenAI()

# Replace {output 1} and {output 2} with the two candidate outputs before sending.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(response.choices[0].message.content)
Fireworks (Python)
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

# Replace {output 1} and {output 2} with the two candidate outputs before sending.
completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,  # the call returns an iterator of chunks rather than a single response
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)