DSP

Directional Stimulus Prompting: use a policy LM to generate stimulus hints

Li et al. (2023) proposed Directional Stimulus Prompting (DSP), a new technique for better guiding LLM outputs.

The core idea: train a small, tunable policy model (Policy LM) that generates a "stimulus/hint" for each input, then sends that stimulus along with the original input to the black-box LLM to guide it toward higher-quality outputs. The policy model is optimized using reinforcement learning (RL).

The diagram below compares directional stimulus prompting with standard prompting. The policy LM can be small (like Flan-T5) and is optimized to generate hints that guide the frozen black-box LLM.

[Figure: DSP vs. standard prompting. Image source: Li et al. (2023)]

Why DSP?

Prompting LLMs directly has several problems:

  1. LLMs are black boxes: You can't modify GPT-4's parameters
  2. Prompt optimization is hard: Hand-writing prompts is time-consuming and inconsistent
  3. Lack of fine-grained control: It's tough to make the LLM focus on specific key information in the input

DSP's solution: don't tune the LLM -- train a small model to generate optimal "guide cues" for the LLM. Think of it this way: you can't change how an expert thinks, but you can learn to ask better questions.

How It Works

Overall Architecture

Input text ──→ Policy LM (small model) ──→ Generate stimulus (keywords/hints)
                                                │
                                                ▼
Input text + stimulus ──→ Black-box LLM (large model) ──→ Final output
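
The same pipeline as a minimal code sketch. A few things here are assumptions rather than the paper's released code: the model name "google/flan-t5-base" stands in for a trained policy checkpoint, the "Extract keywords:" instruction is illustrative, and call_llm is a placeholder for whatever black-box LLM API you use.

# Minimal sketch of DSP inference (summarization flavor), not the authors' implementation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

POLICY_MODEL = "google/flan-t5-base"   # placeholder; a trained policy checkpoint goes here
tokenizer = AutoTokenizer.from_pretrained(POLICY_MODEL)
policy_lm = AutoModelForSeq2SeqLM.from_pretrained(POLICY_MODEL)

def generate_stimulus(article: str, max_new_tokens: int = 32) -> str:
    """Policy LM step: turn the input article into a short keyword hint."""
    inputs = tokenizer("Extract keywords: " + article, return_tensors="pt", truncation=True)
    output_ids = policy_lm.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def build_prompt(article: str, stimulus: str) -> str:
    """Inject the stimulus into the prompt sent to the frozen black-box LLM."""
    return ("Please generate a summary for the following article.\n\n"
            f"Key points to cover: {stimulus}\n\n"
            f"Article: {article}\n\nSummary:")

def dsp_summarize(article: str, call_llm) -> str:
    """call_llm is any function str -> str wrapping the black-box LLM API."""
    return call_llm(build_prompt(article, generate_stimulus(article)))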

Step 1: Policy Model Generates Stimulus

The policy LM analyzes the input and generates a short but targeted "stimulus." The form varies by task:

Summarization task -- stimulus is keywords:

Input article: A recent study shows that walking 30 minutes daily can significantly
               reduce cardiovascular disease risk. The research team tracked 5,000
               participants for 10 years and found regular walkers had a 35% lower
               heart disease rate. Lead researcher Prof. Zhang stated...

Policy LM stimulus: walking, cardiovascular, 35%, 10-year study

Dialog task -- stimulus is a conversation strategy:

Input dialog context: User is unhappy with the product price

Policy LM stimulus: express understanding → emphasize value → offer alternatives

Step 2: Inject Stimulus into LLM Prompt

Please generate a summary for the following article.

Key points to cover: walking, cardiovascular, 35%, 10-year study

Article: A recent study shows that walking 30 minutes daily can significantly reduce cardiovascular disease risk...

Summary:

Step 3: Reinforcement Learning Optimization

The policy LM training process:

  1. Supervised pre-training: Initial training with a small set of human-annotated (input, stimulus) pairs
  2. RL fine-tuning: Use the LLM's output quality as the reward signal and optimize the policy LM with policy gradient methods (see the sketch after this list)
  3. Iterative optimization: Repeatedly generate stimulus → evaluate output → update policy
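
Step 2 of this recipe, sketched as a minimal REINFORCE update. It reuses policy_lm and tokenizer from the earlier sketch and again assumes a call_llm wrapper around the frozen black-box LLM; the ROUGE-L reward is one reasonable choice for summarization, not necessarily the paper's exact reward, and in practice a baseline is subtracted from the reward to reduce variance.

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_reward(summary: str, reference: str) -> float:
    """Reward = ROUGE-L F1 of the black-box LLM's summary against a reference."""
    return _scorer.score(reference, summary)["rougeL"].fmeasure

def reinforce_step(article, reference, policy_lm, tokenizer, call_llm, optimizer):
    # 1. Sample a stimulus from the current policy.
    inputs = tokenizer("Extract keywords: " + article, return_tensors="pt", truncation=True)
    sampled = policy_lm.generate(**inputs, do_sample=True, max_new_tokens=32)
    stimulus = tokenizer.decode(sampled[0], skip_special_tokens=True)

    # 2. Query the frozen LLM with input + stimulus and score its output.
    prompt = ("Please generate a summary for the following article.\n\n"
              f"Key points to cover: {stimulus}\n\n"
              f"Article: {article}\n\nSummary:")
    reward = rouge_reward(call_llm(prompt), reference)

    # 3. Policy-gradient update: minimizing reward * NLL of the sampled stimulus
    #    pushes the policy toward stimuli that earn higher rewards (REINFORCE).
    labels = sampled[:, 1:].clone()                   # drop the decoder start token
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    loss = reward * policy_lm(**inputs, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

An optimizer such as torch.optim.AdamW over policy_lm.parameters() with a small learning rate is a typical choice; step 3 then amounts to repeating reinforce_step over the training set.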

Experimental Results

The paper validated DSP across multiple tasks:

Summarization (CNN/DailyMail)

Method                   ROUGE-1   ROUGE-2   ROUGE-L
Standard Prompt          43.7      20.5      40.2
DSP (keyword stimulus)   45.1      21.8      41.5

Dialog Response (MultiWOZ)

Method            BLEU   Inform   Success
Standard Prompt   14.2   68.3%    58.1%
DSP               16.8   72.6%    63.4%

Key findings:

  • Even a policy LM with just a few hundred million parameters (way smaller than the target LLM) can effectively guide the large model
  • DSP's advantage is most obvious in information retention -- keyword stimuli help the LLM avoid missing key details
  • RL optimization beats pure supervised learning because it directly optimizes final output quality

Practical Application Guide

While a full DSP implementation requires training a policy model, the core idea carries over to everyday prompt engineering:

1. Manual DSP: Key Info Extraction + Guidance

You can play the role of the "policy LM" yourself -- extract key info first, then guide the model:

Step one (you do this): Read the article, extract 3-5 keywords/key points
Step two (send to LLM):

Please generate a summary based on the following article.

⚡ Must cover these points: [your extracted keywords]

Article content: ...

2. Two-Step Prompt Chain

Use one LLM call to simulate the "policy LM" and another to produce the final output (a code sketch follows the two prompts):

Prompt 1 (extract stimulus):
Read the following text and extract the 5 most important key points, each in 3-5 words.
Only output the list of points, nothing else.

[text]

---

Prompt 2 (guided generation):
Generate a professional summary for the following text.

⚡ Key points (must cover all):
[Prompt 1 output]

Text: [original text]
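
The same chain in code, a small sketch in which call_llm is a placeholder for whatever client you use (OpenAI, Anthropic, or any function that maps a prompt string to a completion string); the prompt wording follows the two prompts above.

def extract_key_points(text: str, call_llm) -> str:
    """Prompt 1: the LLM plays the policy LM and emits the key points."""
    prompt = ("Read the following text and extract the 5 most important key points, "
              "each in 3-5 words. Only output the list of points, nothing else.\n\n" + text)
    return call_llm(prompt)

def guided_summary(text: str, call_llm) -> str:
    """Prompt 2: generation guided by the extracted key points."""
    key_points = extract_key_points(text, call_llm)
    prompt = ("Generate a professional summary for the following text.\n\n"
              f"Key points (must cover all):\n{key_points}\n\n"
              f"Text: {text}")
    return call_llm(prompt)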

3. Strategy Guidance for Dialog Scenarios

Prompt 1 (generate strategy):
Here's a customer service dialog context. Analyze the user's emotion and core need,
then provide a 3-step response strategy (each step under 10 words).

Dialog: ...

---

Prompt 2 (execute strategy):
Please respond to the user based on the following strategy.

Response strategy: [Prompt 1 output]

Dialog context: ...

Comparison with Other Methods

Method             Mechanism                    Training Required?         Best For
Few-shot           Provide examples             No                         Format guidance
CoT                Guide reasoning process      No                         Reasoning tasks
Self-Consistency   Sample multiple, vote        No                         Improving accuracy
DSP                Small model generates cues   Yes (policy LM)            Info-dense generation
Prompt Tuning      Continuous vector prefix     Yes (needs model access)   When model weights are accessible

DSP's unique advantage: among these methods, it's the only one that optimizes prompts in a learnable way without needing access to the LLM's internal parameters.

Self-Check Checklist

  • Does the task involve focusing on specific key information during generation? (DSP is ideal for this)
  • Can you extract key points first, then guide the model?
  • Is the stimulus granularity appropriate? (Too coarse = no guidance, too fine = over-constraining)
  • Did you verify that the guided output actually covers the key information? (a quick check is sketched below)
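
For that last item, a crude automated check helps before a manual read: exact substring matching misses paraphrases, but it flags obvious omissions. A minimal sketch:

def missing_key_points(output: str, key_points: list[str]) -> list[str]:
    """Return the key points that never appear (case-insensitively) in the output."""
    lowered = output.lower()
    return [kp for kp in key_points if kp.lower() not in lowered]

# Example: missing_key_points(summary, ["walking", "cardiovascular", "35%", "10-year study"])
# returns ["10-year study"] if the summary dropped that point.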

References

Li et al. (2023). Guiding Large Language Models via Directional Stimulus Prompting.

❓ FAQ

The most frequently searched questions about this chapter's topic.

What is DSP and what problem does it solve?

Li et al. (2023) proposed Directional Stimulus Prompting for settings where the LLM is a black box whose weights can't be changed, prompts are hard to optimize by hand, and fine-grained control is lacking. The approach: train a small, tunable policy model (Policy LM, e.g. Flan-T5) that generates a "stimulus/hint" (keywords or a strategy) for each input, then send the stimulus together with the original input to the frozen large model.

How is DSP different from fine-tuning the large model?

DSP never touches the large model. You train a small policy model with a few hundred million parameters that learns how to "ask GPT-4 better questions." That means no access to model weights is required (it works with closed-source models like GPT-4 and Claude), and training costs far less than fine-tuning a 175B model. The trade-off: every inference adds one extra call to the small model.

How is the DSP policy model trained?

Three steps: 1) supervised pre-training, initializing with a small set of human-annotated (input, stimulus) pairs; 2) RL fine-tuning, using the large model's output quality as the reward signal and optimizing the policy LM with policy gradients; 3) iteration, repeatedly generating a stimulus → evaluating the output → updating the policy. In essence, RL optimizes the prompt generator rather than the prompt itself.

Which tasks does DSP work on?

The paper tests two kinds of tasks: CNN/DailyMail summarization, where keyword stimuli lift ROUGE-1 from 43.7 to 45.1 and ROUGE-2 from 20.5 to 21.8, and MultiWOZ dialog, where BLEU goes from 14.2 to 16.8, Inform from 68.3% to 72.6%, and Success from 58.1% to 63.4%. What they share: information-dense generation tasks where key points must be covered.

Can you simulate DSP without the resources to train a policy model?

Yes, with a two-step prompt chain: first use an LLM to extract 5 key points (playing the role of the policy LM), then splice those points into the final prompt as "key information that must be covered." This works for summarization, dialog, and report generation; you get most of the "information retention" benefit reported in the DSP paper, at the cost of two API calls instead of one.