Self-Consistency
Sample multiple reasoning paths and aggregate to stabilize CoT outputs
Self-consistency is one of the more advanced prompting techniques out there. Proposed by Wang et al. (2022), it aims to "replace the naive greedy decoding used in chain-of-thought prompting." The idea: sample multiple different reasoning paths through few-shot CoT, then pick the most consistent answer from the generated results. This boosts CoT performance on arithmetic and commonsense reasoning tasks.
Let's try this arithmetic reasoning example:
Prompt:
When I was 6 years old, my sister was half my age. Now I am 70 years old. How old is my sister?
Output:
35
Wrong. So how do we fix this with self-consistency? Let's use the few-shot exemplars from Wang et al. 2022 (Table 17):
Prompt:
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. We end with 21 trees. The difference is the number of trees planted. So they planted 21 - 15 = 6 trees. The answer is 6.
Q: There are 3 cars in the parking lot and 2 more cars arrive. How many cars are in the parking lot?
A: There are already 3 cars. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah has 32 chocolates and her sister has 42. If they eat 35, how many do they have left in total?
A: Leah has 32 and her sister has 42, so initially they have 32 + 42 = 74 chocolates. They ate 35. So they have 74 - 35 = 39 chocolates left. The answer is 39.
Q: Jason has 20 lollipops. He gives some to Denny. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason had 20 lollipops. Now he has 12, so he gave away 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has 5 toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: He has 5 toys. He got 2 from his mom, so he has 5 + 2 = 7 toys. Then he got 2 more from his dad, so in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There are 9 computers in the server room. From Monday to Thursday, 5 computers are installed each day. How many computers are in the server room now?
A: From Monday to Thursday is 4 days. Each day 5 computers are added, so 4 * 5 = 20 computers were added. Starting with 9 computers, now there are 9 + 20 = 29 computers. The answer is 29.
Q: Michael has 58 golf balls. On Tuesday he loses 23. On Wednesday he loses 2 more. How many golf balls does he have left at the end of Wednesday?
A: Michael started with 58. After losing 23 on Tuesday he has 58 - 23 = 35. After losing 2 more on Wednesday he has 35 - 2 = 33. The answer is 33.
Q: Olivia has 23 dollars. She buys five bagels at 3 dollars each. How much money does she have left?
A: Five bagels at 3 dollars each cost 15 dollars. She has 23 - 15 = 8 dollars left. The answer is 8.
Q: When I was 6 years old, my sister was half my age. Now I am 70 years old. How old is my sister?
A:
Output 1:
When I was 6, my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.
Output 2:
When the narrator was 6, their sister was half their age, so she was 3. Now the narrator is 70, so the sister is 70 - 3 = 67. The answer is 67.
Output 3:
When I was 6, my sister was half my age, so she was 3. Now I am 70, so she is 70 / 2 = 35. The answer is 35.
Computing the final answer involves a few more steps (see the paper for details), but the short version: a majority answer has already emerged, with two of the three sampled paths agreeing on 67, so 67 becomes the final answer.
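To make that voting step concrete, here is a minimal sketch of the aggregation in Python, assuming the three sampled outputs above have already been collected as strings; extract_answer is an illustrative regex helper, not code from the paper.

```python
import re
from collections import Counter

def extract_answer(output: str) -> str | None:
    """Pull the final number after 'The answer is' out of a CoT output."""
    match = re.search(r"The answer is (-?\d+)", output)
    return match.group(1) if match else None

# The three reasoning paths sampled above.
outputs = [
    "When I was 6, my sister was half my age, so she was 3. "
    "Now I am 70, so she is 70 - 3 = 67. The answer is 67.",
    "When the narrator was 6, their sister was half their age, so she was 3. "
    "Now the narrator is 70, so the sister is 70 - 3 = 67. The answer is 67.",
    "When I was 6, my sister was half my age, so she was 3. "
    "Now I am 70, so she is 70 / 2 = 35. The answer is 35.",
]

# Majority vote over the extracted final answers.
votes = Counter(a for a in map(extract_answer, outputs) if a is not None)
answer, count = votes.most_common(1)[0]
print(f"Final answer: {answer} ({count}/{len(outputs)} paths agree)")  # 67 (2/3)
```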
❓ FAQ
The most frequently searched questions about this chapter's topic
What is Self-Consistency, and how does it relate to CoT?
Proposed by Wang et al. (2022) specifically to replace CoT's default greedy decoding. The recipe: use few-shot CoT to sample multiple different reasoning paths (temperature > 0), then vote over the final answer each path produces; the majority answer wins. It is not a new reasoning model but a decoding upgrade for CoT, aimed at tasks with a single correct answer, such as arithmetic and commonsense reasoning.
How do you implement Self-Consistency, and how many samples are worth taking?
Run the CoT prompt N times (temperature 0.5-0.7), extract the final answer from each output, and take a majority vote. In the paper, gains peaked around N = 40; in practice N = 5-10 captures most of the benefit, with diminishing returns beyond that. The cost is N times the tokens, so it fits scenarios where a wrong answer is expensive and the answer space is discrete enough to vote on; it is not for open-ended generation.
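That recipe maps directly onto a short loop. Below is a minimal sketch, assuming a hypothetical sample_completion(prompt, temperature) wrapper around whatever LLM API you use; the wrapper name and the N and temperature defaults are illustrative, not prescribed by the paper.

```python
import re
from collections import Counter

N = 10             # sampled reasoning paths; 5-10 captures most of the gain
TEMPERATURE = 0.7  # must be > 0, or every sample collapses onto the same greedy path

def sample_completion(prompt: str, temperature: float) -> str:
    """Hypothetical wrapper: call your LLM API of choice and return its text."""
    raise NotImplementedError

def extract_answer(output: str) -> str | None:
    """Grab the number after 'The answer is' from a CoT output."""
    match = re.search(r"The answer is (-?\d+)", output)
    return match.group(1) if match else None

def self_consistent_answer(cot_prompt: str) -> str:
    """Sample N independent paths, then majority-vote the final answers."""
    answers = [extract_answer(sample_completion(cot_prompt, TEMPERATURE)) for _ in range(N)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0]
```

Note that the vote only works because the final answer is discrete; for open-ended outputs there is nothing stable to count.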
Why can Self-Consistency correct CoT's wrong answers?
Because there is usually more than one correct reasoning path: different paths converge on the right answer, while wrong paths err in different, mutually inconsistent ways. In the example above ("When I was 6, my sister was half my age. Now I am 70. How old is my sister?"), a single CoT run gives 35 (wrong), but sampling 3 times yields 67 (correct) twice, and the majority vote recovers it.
Which tasks suit Self-Consistency, and which don't?
Good fits: arithmetic (GSM8K), commonsense reasoning, multiple choice, short-answer extraction, where answers are discrete and votable. Poor fits: code generation (multiple versions can't be meaningfully voted on), long-form writing, open-ended dialogue, JSON extraction (slight structural differences count as inconsistent). Those scenarios call for ToT, Reflexion, or multi-model fusion, not a majority vote.
Is Self-Consistency the same thing as an ensemble?
Similar but not identical. An ensemble usually means several different models voting (GPT-4 + Claude + Gemini), whereas Self-Consistency samples the same model on the same prompt multiple times, relying on temperature for diversity. The former spans multiple vendors' APIs and costs more; the latter needs only one API called N times and is simpler to engineer, but it cannot correct the model's own systematic biases.