Evaluate outputs (teacher)
Use an LLM to compare outputs
TL;DR
- This is a classic LLM-as-a-Judge setup: a judge model compares two outputs and gives feedback the way a teacher grades homework.
- It is a good fit for evaluation: A/B testing prompts, comparing different models, or comparing different settings of the same model.
- The main risk is judge bias and instability; fix the rubric, require evidence (quoted excerpts from the outputs), and cross-check with multiple rounds or multiple judges.
Background
This prompt tests an LLM's ability to evaluate and compare outputs from two different models (or two different prompts), as if it were a teacher.
One workflow:
- Ask two models to write the dialogue with the same prompt
- Ask a judge model to compare the two outputs
Example generation prompt (for the two models):
Plato’s Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?
How to Apply
You can split this evaluation workflow into three steps (a minimal sketch follows the list):
- Generate: produce two outputs from the same generation prompt (different models, or different prompt versions)
- Judge: compare the two outputs with the judge prompt
- Decide: pick the better version against the rubric, or feed the feedback into the next iteration
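A minimal sketch of the Generate and Judge steps, assuming the same OpenAI Python client used in the Code / API section below; the two model names are placeholders for whichever pair you want to compare:
from openai import OpenAI

client = OpenAI()

# Shortened here; use the full generation prompt from the Background section.
GENERATION_PROMPT = (
    "Can you write a dialogue by Plato where he criticizes the use of "
    "autoregressive language models?"
)

def generate(model: str, prompt: str) -> str:
    # One completion from the given model for the given prompt.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Generate: same prompt, two candidates (different models or prompt versions).
output_1 = generate("gpt-3.5-turbo", GENERATION_PROMPT)  # placeholder model names
output_2 = generate("gpt-4", GENERATION_PROMPT)

# Judge: compare the two candidates with the evaluation prompt.
verdict = generate(
    "gpt-4",
    "Can you compare the two outputs below as if you were a teacher?\n\n"
    f"Output from model A:\n{output_1}\n\n"
    f"Output from model B:\n{output_2}",
)
print(verdict)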
In production, it helps to make the rubric more concrete, for example (see the prompt sketch below):
- coherence (logic and structure)
- faithfulness (does it stay on topic?)
- style adherence (does it match the Plato dialogue style?)
- clarity (readability and expression)
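One way to make those dimensions explicit is to write them into the judge prompt with a fixed 1-5 scale; the wording below is an illustrative sketch, not a canonical template:
RUBRIC_JUDGE_PROMPT = """\
Compare the two outputs below as if you were a teacher.
Score each output from 1 to 5 on each dimension:
- coherence: logic and structure
- faithfulness: does it stay on topic?
- style adherence: does it read like a Plato dialogue?
- clarity: readability and expression
Quote at least one excerpt from each output as evidence for your scores,
then name the overall winner, or declare a tie if you are unsure.

Output A:
{output_1}

Output B:
{output_2}
"""

judge_prompt = RUBRIC_JUDGE_PROMPT.format(output_1=output_1, output_2=output_2)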
How to Iterate
- Have the judge return structured results: Winner + Scores + Evidence + Actionable feedback
- Fix the comparison dimensions and scoring range (e.g. 1-5) to reduce arbitrariness
- Add a "tie / unsure" option so the judge is not forced to pick a side
- Use multiple judge prompts or multiple judge models as a consistency check (majority vote); see the sketch after this list
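A sketch of a structured, repeated judging pass with a majority vote over the winner field, reusing the client and judge_prompt defined above; asking for JSON in the prompt and parsing it is a simple approach and may need validation or retries in practice:
import json
from collections import Counter

STRUCTURED_SUFFIX = (
    "\nReturn only a JSON object with keys: winner ('A', 'B' or 'tie'), "
    "scores (1-5 per dimension for each output), evidence (quoted excerpts), "
    "and feedback (actionable suggestions for the next prompt iteration)."
)

def judge_once(judge_model: str, prompt: str) -> dict:
    # One judging pass; low temperature keeps repeated runs more comparable.
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt + STRUCTURED_SUFFIX}],
        temperature=0.2,
    )
    return json.loads(resp.choices[0].message.content)

# Consistency check: several passes (or several judge models), then a majority vote.
verdicts = [judge_once("gpt-4", judge_prompt) for _ in range(3)]
winner, votes = Counter(v["winner"] for v in verdicts).most_common(1)[0]
print(f"majority winner: {winner} ({votes}/{len(verdicts)} votes)")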
Self-check rubric
- Does the judge quote specific excerpts from the outputs as evidence?
- Do the scores map to the rubric dimensions, rather than being generic remarks?
- Does it give actionable suggestions (how to change the prompt in the next round)?
- Are results stable across repeated runs (temperature control plus multi-round consistency)?
Practice
Exercise: run an A/B prompt test on a writing task you use often:
- prompt A: shorter and more open-ended
- prompt B: longer, with constraints and structured output
Then compare the two with the judge prompt and turn the feedback into an improvement list for the next prompt iteration (a skeleton follows).
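A skeleton for this exercise, reusing the generate, RUBRIC_JUDGE_PROMPT, and judge_once helpers sketched above; the two prompt variants are only example wording to adapt to your own task:
# Prompt A: shorter and more open-ended (example wording).
PROMPT_A = "Write a short announcement for our new feature."

# Prompt B: longer, with constraints and a structured output (example wording).
PROMPT_B = (
    "Write an announcement for our new feature. Constraints: under 150 words, "
    "friendly tone, end with a call to action. Structure: a headline, two short "
    "paragraphs, and a bullet list of benefits."
)

output_a = generate("gpt-4", PROMPT_A)
output_b = generate("gpt-4", PROMPT_B)

verdict = judge_once("gpt-4", RUBRIC_JUDGE_PROMPT.format(output_1=output_a, output_2=output_b))
print(verdict["feedback"])  # the improvement list for the next prompt iteration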
Prompt (evaluation)
Can you compare the two outputs below as if you were a teacher?
Output from ChatGPT: {output 1}
Output from GPT-4: {output 2}
Code / API
OpenAI (Python)
from openai import OpenAI

client = OpenAI()

# Replace {output 1} and {output 2} with the two candidate outputs before sending.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(response.choices[0].message.content)
Fireworks (Python)
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

# Replace {output 1} and {output 2} with the two candidate outputs before sending.
completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,  # the call returns an iterator of chunks rather than a single response
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)