DSP
Directional Stimulus Prompting: use a policy LM to generate stimulus hints
Li et al. (2023) proposed Directional Stimulus Prompting (DSP), a technique for guiding the outputs of black-box LLMs.
The core idea: train a small, tunable policy model (Policy LM) that generates a "stimulus/hint" for each input, then sends that stimulus along with the original input to the black-box LLM to guide it toward higher-quality outputs. The policy model is optimized using reinforcement learning (RL).
The diagram below compares directional stimulus prompting with standard prompting. The policy LM can be small (like Flan-T5) and is optimized to generate hints that guide the frozen black-box LLM.

Image source: Li et al. (2023)
Why DSP?
Directly prompting large LLMs has several problems:
- LLMs are black boxes: You can't modify GPT-4's parameters
- Prompt optimization is hard: Hand-writing prompts is time-consuming and inconsistent
- Lack of fine-grained control: It's tough to make the LLM focus on specific key information in the input
DSP's solution: don't tune the LLM -- train a small model to generate optimal "guide cues" for the LLM. Think of it this way: you can't change how an expert thinks, but you can learn to ask better questions.
How It Works
Overall Architecture
Input text ──→ Policy LM (small model) ──→ Generate stimulus (keywords/hints)
                                               │
                                               ▼
Input text + stimulus ──→ Black-box LLM (large model) ──→ Final output
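The two-stage flow above can be sketched in a few lines of Python. Here `policy_lm` and `black_box_llm` are hypothetical stand-ins (a stubbed policy model and a stubbed LLM API call), not the paper's actual implementation:

```python
def policy_lm(text: str) -> str:
    """Stand-in policy LM: returns fixed keywords for this demo.
    In real DSP this is a small trained model such as Flan-T5."""
    return "walking, cardiovascular, 35%, 10-year study"

def black_box_llm(prompt: str) -> str:
    """Stand-in for a frozen black-box LLM API call."""
    return f"[LLM output conditioned on: {prompt[:60]}...]"

def dsp_generate(text: str) -> str:
    stimulus = policy_lm(text)          # 1. small model generates the hint
    prompt = (                          # 2. hint is injected into the prompt
        "Please generate a summary for the following article.\n"
        f"Key points to cover: {stimulus}\n"
        f"Article: {text}\nSummary:"
    )
    return black_box_llm(prompt)        # 3. frozen LLM produces the output
```

The key design point: only `policy_lm` is ever trained; the large model sees nothing but an enriched prompt.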
Step 1: Policy Model Generates Stimulus
The policy LM analyzes the input and generates a short but targeted "stimulus." The form varies by task:
Summarization task -- stimulus is keywords:
Input article: A recent study shows that walking 30 minutes daily can significantly
reduce cardiovascular disease risk. The research team tracked 5,000
participants for 10 years and found regular walkers had a 35% lower
heart disease rate. Lead researcher Prof. Zhang stated...
Policy LM stimulus: walking, cardiovascular, 35%, 10-year study
Dialog task -- stimulus is a conversation strategy:
Input dialog context: User is unhappy with the product price
Policy LM stimulus: express understanding → emphasize value → offer alternatives
Step 2: Inject Stimulus into LLM Prompt
Please generate a summary for the following article.
Key points to cover: walking, cardiovascular, 35%, 10-year study
Article: A recent study shows that walking 30 minutes daily can significantly reduce cardiovascular disease risk...
Summary:
Step 3: Reinforcement Learning Optimization
The policy LM training process:
- Supervised pre-training: Initial training with a small set of human-annotated (input, stimulus) pairs
- RL fine-tuning: Use the LLM's output quality as the reward signal, optimize the policy LM with policy gradient methods
- Iterative optimization: Repeatedly generate stimulus → evaluate output → update policy
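The RL loop above can be sketched as follows. Everything here is a toy stand-in: `sample_stimulus`, `call_llm`, and the word-overlap `reward` (a crude proxy for ROUGE) are hypothetical helpers, not the paper's code:

```python
import random

random.seed(0)

def sample_stimulus(text):
    # Hypothetical: sample a candidate keyword set from the policy LM.
    candidates = [["walking", "35%"], ["study", "heart"],
                  ["walking", "cardiovascular"]]
    return random.choice(candidates)

def call_llm(text, stimulus):
    # Hypothetical frozen LLM stub: echoes the stimulus so the demo runs offline.
    return "summary covering " + ", ".join(stimulus)

def reward(output, reference):
    # Toy proxy for ROUGE: fraction of reference words present in the output.
    ref = reference.split()
    return sum(w in output for w in ref) / len(ref)

def rl_step(text, reference, baseline=0.0):
    stimulus = sample_stimulus(text)    # generate stimulus
    output = call_llm(text, stimulus)   # query the frozen LLM
    r = reward(output, reference)       # score the final output
    advantage = r - baseline            # policy-gradient signal
    # In the real method, the policy LM is updated with
    # grad(log p(stimulus | text)) * advantage; here we only return the signal.
    return stimulus, r, advantage
```

Note that the gradient never flows through the black-box LLM; only the policy LM's sampling distribution is updated.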
Experimental Results
The paper validated DSP across multiple tasks:
Summarization (CNN/DailyMail)
| Method | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Standard Prompt | 43.7 | 20.5 | 40.2 |
| DSP (keyword stimulus) | 45.1 | 21.8 | 41.5 |
Dialog Response (MultiWOZ)
| Method | BLEU | Inform | Success |
|---|---|---|---|
| Standard Prompt | 14.2 | 68.3% | 58.1% |
| DSP | 16.8 | 72.6% | 63.4% |
Key findings:
- Even a policy LM with just a few hundred million parameters (way smaller than the target LLM) can effectively guide the large model
- DSP's advantage is most obvious in information retention -- keyword stimuli help the LLM avoid missing key details
- RL optimization beats pure supervised learning because it directly optimizes final output quality
Practical Application Guide
While full DSP implementation requires training a policy model, the idea works in everyday prompt engineering:
1. Manual DSP: Key Info Extraction + Guidance
You can play the role of the "policy LM" yourself -- extract key info first, then guide the model:
Step one (you do this): Read the article, extract 3-5 keywords/key points
Step two (send to LLM):
Please generate a summary based on the following article.
⚡ Must cover these points: [your extracted keywords]
Article content: ...
2. Two-Step Prompt Chain
Use one LLM call to simulate the "policy LM," another to produce the final output:
Prompt 1 (extract stimulus):
Read the following text and extract the 5 most important key points, each in 3-5 words.
Only output the list of points, nothing else.
[text]
---
Prompt 2 (guided generation):
Generate a professional summary for the following text.
⚡ Key points (must cover all):
[Prompt 1 output]
Text: [original text]
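The two-step chain can be wired up as two sequential calls. `llm` below is a hypothetical single-call client stubbed so the sketch runs offline; swap in any real chat-completion API:

```python
def llm(prompt: str) -> str:
    # Stub so the sketch runs offline; a real client would call an API here.
    return "point one\npoint two\npoint three"

def extract_stimulus(text: str) -> str:
    # Call 1: the LLM plays the role of the policy LM.
    prompt = (
        "Read the following text and extract the 5 most important key points, "
        "each in 3-5 words. Only output the list of points, nothing else.\n\n"
        + text
    )
    return llm(prompt)

def guided_summary(text: str) -> str:
    # Call 2: the extracted points are injected as the stimulus.
    stimulus = extract_stimulus(text)
    prompt = (
        "Generate a professional summary for the following text.\n"
        f"Key points (must cover all):\n{stimulus}\n\nText: {text}"
    )
    return llm(prompt)
```

The trade-off is identical to the FAQ's: most of the information-retention benefit, at the cost of two API calls per input.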
3. Strategy Guidance for Dialog Scenarios
Prompt 1 (generate strategy):
Here's a customer service dialog context. Analyze the user's emotion and core need,
then provide a 3-step response strategy (each step under 10 words).
Dialog: ...
---
Prompt 2 (execute strategy):
Please respond to the user based on the following strategy.
Response strategy: [Prompt 1 output]
Dialog context: ...
Comparison with Other Methods
| Method | Mechanism | Training Required? | Best For |
|---|---|---|---|
| Few-shot | Provide examples | No | Format guidance |
| CoT | Guide reasoning process | No | Reasoning tasks |
| Self-Consistency | Sample multiple, vote | No | Improving accuracy |
| DSP | Small model generates cues | Yes (policy LM) | Info-dense generation |
| Prompt Tuning | Continuous vector prefix | Yes (needs model access) | White-box models |
DSP's unique advantage: among these methods, it's the only one that optimizes prompts in a learnable way without access to the LLM's internal parameters.
Self-Check Checklist
- Does the task involve focusing on specific key information during generation? (DSP is ideal for this)
- Can you extract key points first, then guide the model?
- Is the stimulus granularity appropriate? (Too coarse = no guidance, too fine = over-constraining)
- Did you verify that the guided output actually covers the key information?
References
- Guiding Large Language Models via Directional Stimulus Prompting (Li et al., 2023)
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (Fernando et al., 2023)
- Large Language Models Are Human-Level Prompt Engineers (Zhou et al., 2022 -- APE)
❓ FAQ
The most commonly asked questions about this chapter's topic
What is DSP, and what problem does it solve?
Li et al. (2023) proposed Directional Stimulus Prompting for settings where the LLM is a black box whose weights can't be modified, prompts are hard to optimize by hand, and fine-grained control is lacking. The approach: train a small, tunable policy model (Policy LM, e.g. Flan-T5) that generates a "stimulus/hint" (keywords or a strategy) for each input, then send the stimulus together with the original input to the frozen large model.
How does DSP differ from fine-tuning the large model?
DSP never touches the large model -- you train a small policy model of a few hundred million parameters that learns how to "ask GPT-4 better questions." This means no weight access is required (closed models like GPT-4 / Claude work too), and training costs far less than fine-tuning a 175B model. The trade-off: each inference adds one extra small-model call.
How is DSP's policy model trained?
Three steps: 1) supervised pre-training -- initialize with a small set of human-annotated (input, stimulus) pairs; 2) RL fine-tuning -- use the large model's output quality as the reward signal and optimize the policy LM with policy gradients; 3) iteration -- repeatedly generate stimulus → evaluate output → update policy. In essence, RL optimizes the prompt generator, not the prompt itself.
On which tasks is DSP effective?
The paper tested two: CNN/DailyMail summarization -- keyword stimuli lift ROUGE-1 from 43.7 to 45.1 and ROUGE-2 from 20.5 to 21.8; MultiWOZ dialog -- BLEU 14.2→16.8, Inform 68.3%→72.6%, Success 58.1%→63.4%. Common trait: information-dense generation tasks where key points must be covered.
Can you simulate DSP without the resources to train a policy model?
Yes, with a two-step prompt chain: first use an LLM call to extract 5 key points (playing the role of the policy LM), then paste those points into the final prompt as "key information that must be covered." This works for summaries, dialogs, and report generation -- you get most of the "information retention" benefit reported in the DSP paper, at the cost of two API calls instead of one.