Self-Consistency
Sample multiple reasoning paths and aggregate to stabilize CoT outputs
Self-consistency is one of the more advanced prompting techniques out there. Proposed by Wang et al. (2022), it aims to "replace the naive greedy decoding used in chain-of-thought prompting." The idea: sample multiple different reasoning paths through few-shot CoT, then pick the most consistent answer from the generated results. This boosts CoT performance on arithmetic and commonsense reasoning tasks.
Let's try this arithmetic reasoning example:
Prompt:
When I was 6 years old, my sister was half my age. Now I am 70 years old. How old is my sister?
Output:
35
Wrong. So how do we fix this with self-consistency? Let's use the few-shot exemplars from Wang et al. 2022 (Table 17):
Prompt:
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. We end with 21 trees. The difference is the number of trees planted. So they planted 21 - 15 = 6 trees. The answer is 6.
Q: There are 3 cars in the parking lot and 2 more cars arrive. How many cars are in the parking lot?
A: There are already 3 cars. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah has 32 chocolates and her sister has 42. If they eat 35, how many do they have left in total?
A: Leah has 32 and her sister has 42, so initially they have 32 + 42 = 74 chocolates. They ate 35. So they have 74 - 35 = 39 chocolates left. The answer is 39.
Q: Jason has 20 lollipops. He gives some to Denny. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason had 20 lollipops. Now he has 12, so he gave away 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has 5 toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: He has 5 toys. He got 2 from his mom, so he has 5 + 2 = 7 toys. Then he got 2 more from his dad, so in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There are 9 computers in the server room. From Monday to Thursday, 5 computers are installed each day. How many computers are in the server room now?
A: From Monday to Thursday is 4 days. Each day 5 computers are added, so 4 * 5 = 20 computers were added. Starting with 9 computers, now there are 9 + 20 = 29 computers. The answer is 29.
Q: Michael has 58 golf balls. On Tuesday he loses 23. On Wednesday he loses 2 more. How many golf balls does he have left at the end of Wednesday?
A: Michael started with 58. After losing 23 on Tuesday he has 58 - 23 = 35. After losing 2 more on Wednesday he has 35 - 2 = 33. The answer is 33.
Q: Olivia has 23 dollars. She buys five bagels at 3 dollars each. How much money does she have left?
A: Five bagels at 3 dollars each cost 15 dollars. She has 23 - 15 = 8 dollars left. The answer is 8.
Q: When I was 6 years old, my sister was half my age. Now I am 70 years old. How old is my sister?
A:
Output 1:
When I was 6, my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.
Output 2:
When the narrator was 6, their sister was half their age, so she was 3. Now the narrator is 70, so the sister is 70 - 3 = 67. The answer is 67.
Output 3:
When I was 6, my sister was half my age, so she was 3. Now I am 70, so she is 70 / 2 = 35. The answer is 35.
Computing the final answer involves a few more steps (see the paper for details), but the short version: a majority answer has already emerged, with two of the three sampled paths agreeing on 67, so 67 becomes the final answer.
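To make that voting step concrete, here is a minimal sketch of the aggregation in Python, assuming the three sampled outputs above have already been collected as strings; extract_answer is an illustrative regex helper, not code from the paper.

```python
import re
from collections import Counter

def extract_answer(output: str) -> str | None:
    """Pull the final number after 'The answer is' out of a CoT output."""
    match = re.search(r"The answer is (-?\d+)", output)
    return match.group(1) if match else None

# The three reasoning paths sampled above.
outputs = [
    "When I was 6, my sister was half my age, so she was 3. "
    "Now I am 70, so she is 70 - 3 = 67. The answer is 67.",
    "When the narrator was 6, their sister was half their age, so she was 3. "
    "Now the narrator is 70, so the sister is 70 - 3 = 67. The answer is 67.",
    "When I was 6, my sister was half my age, so she was 3. "
    "Now I am 70, so she is 70 / 2 = 35. The answer is 35.",
]

# Majority vote over the extracted final answers.
votes = Counter(a for a in map(extract_answer, outputs) if a is not None)
answer, count = votes.most_common(1)[0]
print(f"Final answer: {answer} ({count}/{len(outputs)} paths agree)")  # 67 (2/3)
```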
❓ FAQ
The most frequently searched questions about this chapter's topic
What is Self-Consistency, and how does it relate to CoT?
Proposed by Wang et al. (2022) specifically to replace CoT's default greedy decoding. The recipe: use few-shot CoT to sample multiple different reasoning paths (temperature > 0), then vote over the final answer each path produces; the majority answer wins. It is not a new reasoning model but a decoding upgrade for CoT, aimed at tasks with a single correct answer, such as arithmetic and commonsense reasoning.
How do you implement Self-Consistency, and how many samples are worth taking?
Run the CoT prompt N times (temperature 0.5-0.7), extract the final answer from each output, and take a majority vote. In the paper, gains peaked around N = 40; in practice N = 5-10 captures most of the benefit, with diminishing returns beyond that. The cost is N times the tokens, so it fits scenarios where a wrong answer is expensive and the answer space is discrete enough to vote on; it is not for open-ended generation.
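That recipe maps directly onto a short loop. Below is a minimal sketch, assuming a hypothetical sample_completion(prompt, temperature) wrapper around whatever LLM API you use; the wrapper name and the N and temperature defaults are illustrative, not prescribed by the paper.

```python
import re
from collections import Counter

N = 10             # sampled reasoning paths; 5-10 captures most of the gain
TEMPERATURE = 0.7  # must be > 0, or every sample collapses onto the same greedy path

def sample_completion(prompt: str, temperature: float) -> str:
    """Hypothetical wrapper: call your LLM API of choice and return its text."""
    raise NotImplementedError

def extract_answer(output: str) -> str | None:
    """Grab the number after 'The answer is' from a CoT output."""
    match = re.search(r"The answer is (-?\d+)", output)
    return match.group(1) if match else None

def self_consistent_answer(cot_prompt: str) -> str:
    """Sample N independent paths, then majority-vote the final answers."""
    answers = [extract_answer(sample_completion(cot_prompt, TEMPERATURE)) for _ in range(N)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0]
```

Note that the vote only works because the final answer is discrete; for open-ended outputs there is nothing stable to count.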
Why can Self-Consistency correct CoT's wrong answers?
Because there is usually more than one correct reasoning path: different paths converge on the right answer, while wrong paths err in different, mutually inconsistent ways. In the example above ("When I was 6, my sister was half my age. Now I am 70. How old is my sister?"), a single CoT run gives 35 (wrong), but sampling 3 times yields 67 (correct) twice, and the majority vote recovers it.
Which tasks suit Self-Consistency, and which don't?
Good fits: arithmetic (GSM8K), commonsense reasoning, multiple choice, short-answer extraction, where answers are discrete and votable. Poor fits: code generation (multiple versions can't be meaningfully voted on), long-form writing, open-ended dialogue, JSON extraction (slight structural differences count as inconsistent). Those scenarios call for ToT, Reflexion, or multi-model fusion, not a majority vote.
Is Self-Consistency the same thing as an ensemble?
Similar but not identical. An ensemble usually means several different models voting (GPT-4 + Claude + Gemini), whereas Self-Consistency samples the same model on the same prompt multiple times, relying on temperature for diversity. The former spans multiple vendors' APIs and costs more; the latter needs only one API called N times and is simpler to engineer, but it cannot correct the model's own systematic biases.