Few-shot Prompting
Enable in-context learning through demonstrations and examples
Large language models have impressive zero-shot abilities, but they still fall short on more complex tasks when you don't give them examples. Few-shot prompting is the fix -- you provide demonstrations in the prompt to steer the model toward better performance. These demonstrations condition the model for the subsequent examples where we want it to generate a response.
According to Touvron et al. (2023), few-shot properties first emerge once models are scaled to a sufficient size (Kaplan et al., 2020).
Let's walk through an example from Brown et al. 2020. Here the task is using a made-up word correctly in a sentence.
Prompt:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
Output:
When we won the game, we all started farduddling to celebrate.
The model learned the task from just one example (1-shot). For harder tasks, you can bump that up -- 3-shot, 5-shot, 10-shot, whatever it takes.
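Mechanically, assembling a few-shot prompt is just string templating: demonstrations first, then the query left incomplete for the model to continue. A minimal sketch (the helper name is illustrative, not from any particular library) that builds the 1-shot prompt above:

```python
def build_few_shot_prompt(demonstrations, query):
    """Concatenate (task, completion) demos, then the query with no completion."""
    parts = [f"{task}\n{completion}" for task, completion in demonstrations]
    parts.append(query)  # the model continues from here
    return "\n".join(parts)

demos = [
    ('A "whatpu" is a small, furry animal native to Tanzania. '
     "An example of a sentence that uses the word whatpu is:",
     "We were traveling in Africa and we saw these very cute whatpus."),
]
prompt = build_few_shot_prompt(
    demos,
    'To do a "farduddle" means to jump up and down really fast. '
    "An example of a sentence that uses the word farduddle is:",
)
```

Adding more shots is just appending more `(task, completion)` pairs to `demos`.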
Based on findings from Min et al. (2022), here are some extra tips about demonstrations for few-shot learning:
- "The label space and the distribution of the input text specified by the demonstrations are both important (regardless of whether the labels are correct for individual inputs)"
- The format you use matters a lot for performance. Even random labels work way better than no labels at all.
- Selecting random labels from the true label distribution (rather than uniform) also helps.
Let's try something interesting. Here's an example with randomized labels (meaning Negative and Positive are randomly assigned to inputs):
Prompt:
This is awesome! // Negative
This is terrible! // Positive
Wow, that movie was great! // Positive
What a horrible show! //
Output:
Negative
We still got the right answer even with randomized labels, helped by the consistent format. And newer GPT models appear to be even more robust to inconsistent formatting:
Prompt:
Positive This is awesome!
This is bad! Negative
Wow that movie was rad!
Positive
What a horrible show! --
Output:
Negative
Inconsistent format, but the model still nailed it. More thorough testing would be needed to confirm this holds across different and more complex tasks.
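To reproduce the randomized-label setup from Min et al. (2022) yourself, shuffle the gold labels across demonstrations while keeping the `input // label` format fixed. A minimal sketch (function and variable names are my own, not from the paper):

```python
import random

def randomize_labels(demos, seed=0):
    """Keep each input and the `text // label` format, but shuffle which
    label goes with which input (labels still come from the true label set)."""
    rng = random.Random(seed)
    texts = [text for text, _ in demos]
    labels = [label for _, label in demos]
    rng.shuffle(labels)
    return [f"{t} // {l}" for t, l in zip(texts, labels)]

demos = [
    ("This is awesome!", "Positive"),
    ("This is terrible!", "Negative"),
    ("Wow, that movie was great!", "Positive"),
]
lines = randomize_labels(demos)
prompt = "\n".join(lines) + "\nWhat a horrible show! //"
```

Note the shuffled labels are drawn from the true label set, matching the paper's finding that sampling from the true label distribution beats uniform random labels.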
Limitations of Few-Shot Prompting
Standard few-shot works well for many tasks, but it's not perfect -- especially for complex reasoning. Let's show why. Remember this task from earlier:
Prompt:
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
The model outputs:
Yes, the odd numbers in this group add up to 107, which is an even number.
Wrong on two counts: 107 isn't even, and the odd numbers here don't actually sum to 107. This highlights the limitations of the approach and the need for more advanced prompt engineering.
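A quick check shows how far off the model was. The odd numbers in the group sum to 41, which is odd, so the correct answer to the prompt is False:

```python
group = [15, 32, 5, 13, 82, 7, 1]
odd_sum = sum(n for n in group if n % 2 == 1)
print(odd_sum)           # 41
print(odd_sum % 2 == 0)  # False: the sum is odd, so the prompt's claim is wrong
```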
Let's see if adding few-shot examples helps.
Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Answer is False.
The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: Answer is True.
The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
A: Answer is True.
The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
A: Answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
Output:
Answer is True.
Still didn't work. Few-shot alone isn't enough for this kind of reasoning problem. The examples above give the model basic task info, but the task itself requires multi-step reasoning. In other words, breaking the problem into steps and showing the model how to reason through them would help. That's exactly what Chain-of-Thought (CoT) prompting was designed for.
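For contrast, a chain-of-thought demonstration spells out the intermediate steps instead of giving only the final label. A sketch of what one such exemplar might look like (the wording is illustrative; CoT itself is covered in the next chapter):

```python
# One demo with the reasoning chain written out, then the unsolved query.
cot_demo = (
    "The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.\n"
    "A: The odd numbers are 9, 15, and 1. Adding them: 9 + 15 + 1 = 25. "
    "25 is odd, so the answer is False."
)
query = (
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.\n"
    "A:"
)
prompt = cot_demo + "\n\n" + query
```

The demonstration now shows *how* to reach the label, not just the label itself, which is what the multi-step task actually requires.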
Bottom line: examples help for many tasks. But when zero-shot and few-shot both fail, the model probably needs more than pattern matching -- it needs to reason. From here, consider fine-tuning or more advanced prompting techniques. Next up, we'll cover chain-of-thought prompting, which has been getting a lot of attention.
❓ FAQ
Frequently asked questions about this chapter's topic
How many examples are best for few-shot prompting?
It depends on task difficulty. For simple tasks, 1-shot is enough (the whatpu/farduddle example above teaches the model to use a made-up word from a single demonstration). For harder tasks, scale up to 3-shot, 5-shot, 10-shot and beyond. Touvron et al. 2023 and Kaplan et al. 2020 note that few-shot abilities only emerge once models are large enough; piling more examples onto a small model doesn't help.
If the labels in the few-shot demonstrations are wrong, can the model still predict correctly?
Yes -- that's the counterintuitive finding of Min et al. 2022. Feed in randomly mislabeled demonstrations like `This is awesome! // Negative` and the model can still output `Negative` for a new negative input. The reason is that the model mainly learns the label space and the input distribution; whether each individual label is correct matters less. But the format must stay consistent -- dropping the labels entirely makes performance collapse.
How much does demonstration format matter in few-shot prompting?
Format matters more than label accuracy. As Min et al. 2022 put it, the format plays a key role in performance, and even random labels are much better than no labels at all. In practice, keep the separators, field order, and casing consistent across demonstrations -- e.g., use `// Positive` throughout rather than mixing `: Positive` and `// positive`, or the model's parsing can drift.
On what kinds of tasks does few-shot prompting fail?
Multi-step reasoning tasks. This chapter demonstrates summing the odd numbers in `15, 32, 5, 13, 82, 7, 1`: even with four demonstrations, the model still outputs `Answer is True.` (wrong). Few-shot examples only show inputs and outputs, not the reasoning steps -- such tasks call for Chain-of-Thought prompting (Wei et al. 2022), where the reasoning chain is written out in the demos.
If both zero-shot and few-shot fail, what should I try next?
As noted above, it likely means whatever the model learned isn't enough to do well at the task, and you should start thinking about fine-tuning or more advanced prompting techniques. Two concrete directions: (1) Chain-of-Thought, writing explicit reasoning steps into the demos; (2) fine-tuning the model or switching to a larger one. Prompting can't solve everything.