# Evaluate outputs (teacher)

Use an LLM to compare two outputs as if it were a teacher.
# TL;DR

- A typical `LLM-as-a-Judge` setup: a judge model compares two outputs and gives feedback the way a teacher grades homework.
- Well suited to `evaluation` work: A/B testing prompts, comparing different models, or comparing different settings of the same model.
- The main risk is judge bias and instability; fix a rubric, require evidence (quoted snippets from the outputs), and cross-check with multiple rounds or multiple judges.
# Background
This prompt tests an LLM's ability to evaluate and compare outputs from two different models (or two different prompts), as if it were a teacher.
One workflow:
- Ask two models to write the dialogue with the same prompt
- Ask a judge model to compare the two outputs
Example generation prompt (for the two models):

```text
Plato’s Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?
```
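A minimal sketch of the generation step, assuming the OpenAI Python SDK used later in this guide; the model names are placeholders for whichever two models (or prompt versions) you are comparing:

```python
from openai import OpenAI

client = OpenAI()

# The Plato/Gorgias generation prompt shown above
generation_prompt = (
    "Plato's Gorgias is a critique of rhetoric and sophistic oratory, where he makes "
    "the point that not only is it not a proper form of art, but the use of rhetoric "
    "and oratory can often be harmful and malicious. Can you write a dialogue by Plato "
    "where instead he criticizes the use of autoregressive language models?"
)

def generate(model: str) -> str:
    """Run the same generation prompt against one model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": generation_prompt}],
    )
    return response.choices[0].message.content

# Placeholder model names; swap in the two models you actually want to compare
output_1 = generate("gpt-3.5-turbo")
output_2 = generate("gpt-4")
```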
# How to Apply

You can split this evaluation workflow into three steps:

- Generate: produce two outputs from the same generation prompt (different models, or different prompt versions)
- Judge: compare the two outputs with a judge prompt
- Decide: pick the better version against the rubric, or feed the feedback into the next iteration

In production, make the rubric more concrete, for example (a judge-prompt sketch follows this list):

- coherence (logic and structure)
- faithfulness (does it stay on the original task?)
- style adherence (does it match the Plato dialogue style?)
- clarity (readability and expression)
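One way to make these dimensions operational is to bake them into the judge prompt. A sketch with illustrative wording and a 1-5 scale; output_1 and output_2 are the two generated outputs from the sketch above:

```python
# Judge prompt template with explicit rubric dimensions (wording is illustrative)
judge_prompt = """Can you compare the two outputs below as if you were a teacher?

Score each output from 1 to 5 on:
- coherence (logic and structure)
- faithfulness (does it stay on the original task?)
- style adherence (does it read like a Platonic dialogue?)
- clarity (readability and expression)

Quote at least one sentence from each output as evidence for your scores.

Output from ChatGPT:
{output_1}

Output from GPT-4:
{output_2}"""

filled_prompt = judge_prompt.format(output_1=output_1, output_2=output_2)
```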
# How to Iterate

- Have the judge return a structured result: `Winner` + `Scores` + `Evidence` + `Actionable feedback` (see the sketch after this list)
- Fix the comparison dimensions and the scoring range (e.g. 1-5) to reduce arbitrariness
- Add a "tie / unsure" option so the judge is not forced to pick a side
- Run several judge prompts or judge models and check agreement (majority vote)
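Continuing from filled_prompt above, a sketch of structured verdicts plus a simple majority vote across judges; the JSON field names and the judge model list are illustrative, and real responses may need extra parsing:

```python
import json
from collections import Counter

from openai import OpenAI

client = OpenAI()

# Ask for a fixed JSON structure so verdicts are comparable across judges and runs
structured_judge_prompt = filled_prompt + """

Respond with JSON only, using exactly these keys:
{"winner": "A" | "B" | "tie", "scores": {...}, "evidence": [...], "feedback": [...]}"""

def judge_once(model: str) -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": structured_judge_prompt}],
        temperature=0,  # lower temperature for more repeatable verdicts
    )
    # May need to strip code fences or retry if the model does not return pure JSON
    return json.loads(response.choices[0].message.content)

# Illustrative judge models; repeated runs of a single judge work the same way
verdicts = [judge_once(m) for m in ["gpt-4", "gpt-4-turbo", "gpt-4o"]]
winner, _ = Counter(v["winner"] for v in verdicts).most_common(1)[0]
```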
# Self-check rubric

- Does the judge quote specific snippets from the outputs (evidence)?
- Do the scores map to the rubric dimensions, rather than staying generic?
- Does it give actionable suggestions (how to change the prompt in the next round)?
- Are repeated runs stable (temperature control + multi-run consistency)? A small stability check is sketched below.
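For that last check, one simple probe is to repeat the same judge call and see how often the verdict agrees with itself, reusing judge_once from the sketch above:

```python
# Run the same judge several times and measure how often the winner repeats
runs = [judge_once("gpt-4")["winner"] for _ in range(5)]
agreement = runs.count(max(set(runs), key=runs.count)) / len(runs)
print(runs, f"agreement={agreement:.0%}")
```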
# Practice

Exercise: run an A/B prompt test on a writing task you use often:

- Prompt A: shorter and more open-ended
- Prompt B: longer, with constraints and a structured output format

Then compare the two with the judge prompt and turn the feedback into an "improvement list" for the next prompt iteration.
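A possible starting point for the exercise, with a purely illustrative task and wording:

```python
# Prompt A: shorter and more open-ended (illustrative wording)
prompt_a = "Summarize this week's team updates for the company newsletter."

# Prompt B: longer, with constraints and a structured output format (illustrative wording)
prompt_b = """Summarize this week's team updates for the company newsletter.
Constraints:
- at most 150 words
- one headline, three bullet points, one closing sentence
- neutral, factual tone
Return the result as: HEADLINE / BULLETS / CLOSING."""
```

Generate an output from each prompt, then feed the pair into the judge prompt above.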
# Prompt (evaluation)

```text
Can you compare the two outputs below as if you were a teacher?

Output from ChatGPT:
{output 1}

Output from GPT-4:
{output 2}
```
# Code / API
## OpenAI (Python)
```python
from openai import OpenAI

client = OpenAI()

# Judge call: replace {output 1} / {output 2} with the two generated outputs
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)
```
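The judge's comparison comes back in the first choice's message:

```python
print(response.choices[0].message.content)
```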
## Fireworks (Python)
```python
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

# Same judge prompt, served by Mixtral on Fireworks; stream=True yields chunks
completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)
```
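Because stream=True, the call returns an iterator of chunks rather than a single message. A sketch of consuming it, assuming the chunks follow the OpenAI-style delta format:

```python
# Print the judge's comparison as it streams in (assumes OpenAI-style delta chunks)
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```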