Few-shot Prompting
Enable in-context learning through demonstrations and examples
Large language models have impressive zero-shot abilities, but they still fall short on more complex tasks when you don't give them examples. Few-shot prompting is the fix -- you provide demonstrations in the prompt to steer the model toward better performance. These demonstrations condition the model for the subsequent examples where we want it to generate a response.
According to Touvron et al. (2023), few-shot properties first emerge once models are scaled to a sufficient size (Kaplan et al., 2020).
Let's walk through an example from Brown et al. 2020. Here the task is using a made-up word correctly in a sentence.
Prompt:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
Output:
When we won the game, we all started farduddling to celebrate.
The model learned the task from just one example (1-shot). For harder tasks, you can bump that up -- 3-shot, 5-shot, 10-shot, whatever it takes.
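Mechanically, assembling a few-shot prompt is just string templating: demonstrations first, then the query left incomplete for the model to continue. A minimal sketch (the helper name is illustrative, not from any particular library) that builds the 1-shot prompt above:

```python
def build_few_shot_prompt(demonstrations, query):
    """Concatenate (task, completion) demos, then the query with no completion."""
    parts = [f"{task}\n{completion}" for task, completion in demonstrations]
    parts.append(query)  # the model continues from here
    return "\n".join(parts)

demos = [
    ('A "whatpu" is a small, furry animal native to Tanzania. '
     "An example of a sentence that uses the word whatpu is:",
     "We were traveling in Africa and we saw these very cute whatpus."),
]
prompt = build_few_shot_prompt(
    demos,
    'To do a "farduddle" means to jump up and down really fast. '
    "An example of a sentence that uses the word farduddle is:",
)
```

Adding more shots is just appending more `(task, completion)` pairs to `demos`.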
Based on findings from Min et al. (2022), here are some extra tips about demonstrations for few-shot learning:
- "The label space and the distribution of the input text specified by the demonstrations are both important (regardless of whether the labels are correct for individual inputs)"
- The format you use matters a lot for performance. Even random labels work way better than no labels at all.
- Selecting random labels from the true label distribution (rather than uniform) also helps.
Let's try something interesting. Here's an example with randomized labels (meaning Negative and Positive are randomly assigned to inputs):
Prompt:
This is awesome! // Negative
This is terrible! // Positive
Wow, that movie was great! // Positive
What a horrible show! //
Output:
Negative
We still got the right answer even with randomized labels, helped by the consistent format. And newer GPT models appear to be even more robust to inconsistent formatting:
Prompt:
Positive This is awesome!
This is bad! Negative
Wow that movie was rad!
Positive
What a horrible show! --
Output:
Negative
Inconsistent format, but the model still nailed it. More thorough testing would be needed to confirm this holds across different and more complex tasks.
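To reproduce the randomized-label setup from Min et al. (2022) yourself, shuffle the gold labels across demonstrations while keeping the `input // label` format fixed. A minimal sketch (function and variable names are my own, not from the paper):

```python
import random

def randomize_labels(demos, seed=0):
    """Keep each input and the `text // label` format, but shuffle which
    label goes with which input (labels still come from the true label set)."""
    rng = random.Random(seed)
    texts = [text for text, _ in demos]
    labels = [label for _, label in demos]
    rng.shuffle(labels)
    return [f"{t} // {l}" for t, l in zip(texts, labels)]

demos = [
    ("This is awesome!", "Positive"),
    ("This is terrible!", "Negative"),
    ("Wow, that movie was great!", "Positive"),
]
lines = randomize_labels(demos)
prompt = "\n".join(lines) + "\nWhat a horrible show! //"
```

Note the shuffled labels are drawn from the true label set, matching the paper's finding that sampling from the true label distribution beats uniform random labels.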
Limitations of Few-Shot Prompting
Standard few-shot works well for many tasks, but it's not perfect -- especially for complex reasoning. Let's show why. Remember this task from earlier:
Prompt:
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
The model outputs:
Yes, the odd numbers in this group add up to 107, which is an even number.
Wrong on two counts: 107 isn't even, and the odd numbers here don't actually sum to 107. This highlights the limitations of the approach and the need for more advanced prompt engineering.
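A quick check shows how far off the model was. The odd numbers in the group sum to 41, which is odd, so the correct answer to the prompt is False:

```python
group = [15, 32, 5, 13, 82, 7, 1]
odd_sum = sum(n for n in group if n % 2 == 1)
print(odd_sum)           # 41
print(odd_sum % 2 == 0)  # False: the sum is odd, so the prompt's claim is wrong
```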
Let's see if adding few-shot examples helps.
Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Answer is False.
The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: Answer is True.
The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
A: Answer is True.
The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
A: Answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
Output:
Answer is True.
Still didn't work. Few-shot alone isn't enough for this kind of reasoning problem. The examples above give the model basic task info, but the task itself requires multi-step reasoning. In other words, breaking the problem into steps and showing the model how to reason through them would help. That's exactly what Chain-of-Thought (CoT) prompting was designed for.
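For contrast, a chain-of-thought demonstration spells out the intermediate steps instead of giving only the final label. A sketch of what one such exemplar might look like (the wording is illustrative; CoT itself is covered in the next chapter):

```python
# One demo with the reasoning chain written out, then the unsolved query.
cot_demo = (
    "The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.\n"
    "A: The odd numbers are 9, 15, and 1. Adding them: 9 + 15 + 1 = 25. "
    "25 is odd, so the answer is False."
)
query = (
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.\n"
    "A:"
)
prompt = cot_demo + "\n\n" + query
```

The demonstration now shows *how* to reach the label, not just the label itself, which is what the multi-step task actually requires.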
Bottom line: examples help for many tasks. But when zero-shot and few-shot both fail, the model probably needs more than pattern matching -- it needs to reason. From here, consider fine-tuning or more advanced prompting techniques. Next up, we'll cover chain-of-thought prompting, which has been getting a lot of attention.
❓ FAQ
Frequently asked questions about this chapter's topic
How many examples are best for few-shot prompting?
It depends on task difficulty. For simple tasks, 1-shot is enough (the whatpu/farduddle example above teaches the model to use a made-up word from a single demonstration). For harder tasks, scale up to 3-shot, 5-shot, 10-shot and beyond. Touvron et al. 2023 and Kaplan et al. 2020 note that few-shot abilities only emerge once models are large enough; piling more examples onto a small model doesn't help.
If the labels in the few-shot demonstrations are wrong, can the model still predict correctly?
Yes -- that's the counterintuitive finding of Min et al. 2022. Feed in randomly mislabeled demonstrations like `This is awesome! // Negative` and the model can still output `Negative` for a new negative input. The reason is that the model mainly learns the label space and the input distribution; whether each individual label is correct matters less. But the format must stay consistent -- dropping the labels entirely makes performance collapse.
How much does demonstration format matter in few-shot prompting?
Format matters more than label accuracy. As Min et al. 2022 put it, the format plays a key role in performance, and even random labels are much better than no labels at all. In practice, keep the separators, field order, and casing consistent across demonstrations -- e.g., use `// Positive` throughout rather than mixing `: Positive` and `// positive`, or the model's parsing can drift.
On what kinds of tasks does few-shot prompting fail?
Multi-step reasoning tasks. This chapter demonstrates summing the odd numbers in `15, 32, 5, 13, 82, 7, 1`: even with four demonstrations, the model still outputs `Answer is True.` (wrong). Few-shot examples only show inputs and outputs, not the reasoning steps -- such tasks call for Chain-of-Thought prompting (Wei et al. 2022), where the reasoning chain is written out in the demos.
If both zero-shot and few-shot fail, what should I try next?
As noted above, it likely means whatever the model learned isn't enough to do well at the task, and you should start thinking about fine-tuning or more advanced prompting techniques. Two concrete directions: (1) Chain-of-Thought, writing explicit reasoning steps into the demos; (2) fine-tuning the model or switching to a larger one. Prompting can't solve everything.