DSP

Directional Stimulus Prompting: use a policy LM to generate stimulus hints

Li et al. (2023) proposed Directional Stimulus Prompting (DSP), a new technique for better guiding LLM outputs.

The core idea: train a small, tunable policy model (Policy LM) that generates a "stimulus/hint" for each input, then sends that stimulus along with the original input to the black-box LLM to guide it toward higher-quality outputs. The policy model is optimized using reinforcement learning (RL).

The diagram below compares directional stimulus prompting with standard prompting. The policy LM can be small (like Flan-T5) and is optimized to generate hints that guide the frozen black-box LLM.

[Figure: DSP vs. standard prompting. Image source: Li et al. (2023)]

Why DSP?

Prompting LLMs directly has several problems:

  1. LLMs are black boxes: You can't modify GPT-4's parameters
  2. Prompt optimization is hard: Hand-writing prompts is time-consuming and inconsistent
  3. Lack of fine-grained control: It's tough to make the LLM focus on specific key information in the input

DSP's solution: don't tune the LLM -- train a small model to generate optimal "guide cues" for the LLM. Think of it this way: you can't change how an expert thinks, but you can learn to ask better questions.

How It Works

Overall Architecture

Input text ──→ Policy LM (small model) ──→ Generate stimulus (keywords/hints)
                                                │
                                                ▼
Input text + stimulus ──→ Black-box LLM (large model) ──→ Final output
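
The same pipeline as a minimal code sketch. A few things here are assumptions rather than the paper's released code: the model name "google/flan-t5-base" stands in for a trained policy checkpoint, the "Extract keywords:" instruction is illustrative, and call_llm is a placeholder for whatever black-box LLM API you use.

# Minimal sketch of DSP inference (summarization flavor), not the authors' implementation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

POLICY_MODEL = "google/flan-t5-base"   # placeholder; a trained policy checkpoint goes here
tokenizer = AutoTokenizer.from_pretrained(POLICY_MODEL)
policy_lm = AutoModelForSeq2SeqLM.from_pretrained(POLICY_MODEL)

def generate_stimulus(article: str, max_new_tokens: int = 32) -> str:
    """Policy LM step: turn the input article into a short keyword hint."""
    inputs = tokenizer("Extract keywords: " + article, return_tensors="pt", truncation=True)
    output_ids = policy_lm.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def build_prompt(article: str, stimulus: str) -> str:
    """Inject the stimulus into the prompt sent to the frozen black-box LLM."""
    return ("Please generate a summary for the following article.\n\n"
            f"Key points to cover: {stimulus}\n\n"
            f"Article: {article}\n\nSummary:")

def dsp_summarize(article: str, call_llm) -> str:
    """call_llm is any function str -> str wrapping the black-box LLM API."""
    return call_llm(build_prompt(article, generate_stimulus(article)))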

Step 1: Policy Model Generates Stimulus

The policy LM analyzes the input and generates a short but targeted "stimulus." The form varies by task:

Summarization task -- stimulus is keywords:

Input article: A recent study shows that walking 30 minutes daily can significantly
               reduce cardiovascular disease risk. The research team tracked 5,000
               participants for 10 years and found regular walkers had a 35% lower
               heart disease rate. Lead researcher Prof. Zhang stated...

Policy LM stimulus: walking, cardiovascular, 35%, 10-year study

Dialog task -- stimulus is a conversation strategy:

Input dialog context: User is unhappy with the product price

Policy LM stimulus: express understanding → emphasize value → offer alternatives

Step 2: Inject Stimulus into LLM Prompt

Please generate a summary for the following article.

Key points to cover: walking, cardiovascular, 35%, 10-year study

Article: A recent study shows that walking 30 minutes daily can significantly reduce cardiovascular disease risk...

Summary:

Step 3: Reinforcement Learning Optimization

The policy LM training process:

  1. Supervised pre-training: Initial training with a small set of human-annotated (input, stimulus) pairs
  2. RL fine-tuning: Use the LLM's output quality as the reward signal and optimize the policy LM with policy gradient methods (see the sketch after this list)
  3. Iterative optimization: Repeatedly generate stimulus → evaluate output → update policy
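
Step 2 of this recipe, sketched as a minimal REINFORCE update. It reuses policy_lm and tokenizer from the earlier sketch and again assumes a call_llm wrapper around the frozen black-box LLM; the ROUGE-L reward is one reasonable choice for summarization, not necessarily the paper's exact reward, and in practice a baseline is subtracted from the reward to reduce variance.

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_reward(summary: str, reference: str) -> float:
    """Reward = ROUGE-L F1 of the black-box LLM's summary against a reference."""
    return _scorer.score(reference, summary)["rougeL"].fmeasure

def reinforce_step(article, reference, policy_lm, tokenizer, call_llm, optimizer):
    # 1. Sample a stimulus from the current policy.
    inputs = tokenizer("Extract keywords: " + article, return_tensors="pt", truncation=True)
    sampled = policy_lm.generate(**inputs, do_sample=True, max_new_tokens=32)
    stimulus = tokenizer.decode(sampled[0], skip_special_tokens=True)

    # 2. Query the frozen LLM with input + stimulus and score its output.
    prompt = ("Please generate a summary for the following article.\n\n"
              f"Key points to cover: {stimulus}\n\n"
              f"Article: {article}\n\nSummary:")
    reward = rouge_reward(call_llm(prompt), reference)

    # 3. Policy-gradient update: minimizing reward * NLL of the sampled stimulus
    #    pushes the policy toward stimuli that earn higher rewards (REINFORCE).
    labels = sampled[:, 1:].clone()                   # drop the decoder start token
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    loss = reward * policy_lm(**inputs, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

An optimizer such as torch.optim.AdamW over policy_lm.parameters() with a small learning rate is a typical choice; step 3 then amounts to repeating reinforce_step over the training set.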

Experimental Results

The paper validated DSP across multiple tasks:

Summarization (CNN/DailyMail)

Method                   ROUGE-1   ROUGE-2   ROUGE-L
Standard Prompt          43.7      20.5      40.2
DSP (keyword stimulus)   45.1      21.8      41.5

Dialog Response (MultiWOZ)

Method            BLEU   Inform   Success
Standard Prompt   14.2   68.3%    58.1%
DSP               16.8   72.6%    63.4%

Key findings:

  • Even a policy LM with just a few hundred million parameters (way smaller than the target LLM) can effectively guide the large model
  • DSP's advantage is most obvious in information retention -- keyword stimuli help the LLM avoid missing key details
  • RL optimization beats pure supervised learning because it directly optimizes final output quality

Practical Application Guide

While a full DSP implementation requires training a policy model, the core idea carries over to everyday prompt engineering:

1. Manual DSP: Key Info Extraction + Guidance

You can play the role of the "policy LM" yourself -- extract key info first, then guide the model:

Step one (you do this): Read the article, extract 3-5 keywords/key points
Step two (send to LLM):

Please generate a summary based on the following article.

⚡ Must cover these points: [your extracted keywords]

Article content: ...

2. Two-Step Prompt Chain

Use one LLM call to simulate the "policy LM" and another to produce the final output (a code sketch follows the two prompts):

Prompt 1 (extract stimulus):
Read the following text and extract the 5 most important key points, each in 3-5 words.
Only output the list of points, nothing else.

[text]

---

Prompt 2 (guided generation):
Generate a professional summary for the following text.

⚡ Key points (must cover all):
[Prompt 1 output]

Text: [original text]
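
The same chain in code, a small sketch in which call_llm is a placeholder for whatever client you use (OpenAI, Anthropic, or any function that maps a prompt string to a completion string); the prompt wording follows the two prompts above.

def extract_key_points(text: str, call_llm) -> str:
    """Prompt 1: the LLM plays the policy LM and emits the key points."""
    prompt = ("Read the following text and extract the 5 most important key points, "
              "each in 3-5 words. Only output the list of points, nothing else.\n\n" + text)
    return call_llm(prompt)

def guided_summary(text: str, call_llm) -> str:
    """Prompt 2: generation guided by the extracted key points."""
    key_points = extract_key_points(text, call_llm)
    prompt = ("Generate a professional summary for the following text.\n\n"
              f"Key points (must cover all):\n{key_points}\n\n"
              f"Text: {text}")
    return call_llm(prompt)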

3. Strategy Guidance for Dialog Scenarios

Prompt 1 (generate strategy):
Here's a customer service dialog context. Analyze the user's emotion and core need,
then provide a 3-step response strategy (each step under 10 words).

Dialog: ...

---

Prompt 2 (execute strategy):
Please respond to the user based on the following strategy.

Response strategy: [Prompt 1 output]

Dialog context: ...

Comparison with Other Methods

Method             Mechanism                    Training Required?         Best For
Few-shot           Provide examples             No                         Format guidance
CoT                Guide reasoning process      No                         Reasoning tasks
Self-Consistency   Sample multiple, vote        No                         Improving accuracy
DSP                Small model generates cues   Yes (policy LM)            Info-dense generation
Prompt Tuning      Continuous vector prefix     Yes (needs model access)   When model weights are accessible

DSP's unique advantage: among these methods, it's the only one that optimizes prompts in a learnable way without needing access to the LLM's internal parameters.

Self-Check Checklist

  • Does the task involve focusing on specific key information during generation? (DSP is ideal for this)
  • Can you extract key points first, then guide the model?
  • Is the stimulus granularity appropriate? (Too coarse = no guidance, too fine = over-constraining)
  • Did you verify that the guided output actually covers the key information? (a quick check is sketched below)
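
For that last item, a crude automated check helps before a manual read: exact substring matching misses paraphrases, but it flags obvious omissions. A minimal sketch:

def missing_key_points(output: str, key_points: list[str]) -> list[str]:
    """Return the key points that never appear (case-insensitively) in the output."""
    lowered = output.lower()
    return [kp for kp in key_points if kp.lower() not in lowered]

# Example: missing_key_points(summary, ["walking", "cardiovascular", "35%", "10-year study"])
# returns ["10-year study"] if the summary dropped that point.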

References

Li et al. (2023). Guiding Large Language Models via Directional Stimulus Prompting.

❓ FAQ

The most frequently searched questions about this chapter's topic.

What is DSP and what problem does it solve?

Li et al. (2023) proposed Directional Stimulus Prompting for settings where the LLM is a black box whose weights can't be changed, prompts are hard to optimize by hand, and fine-grained control is lacking. The approach: train a small, tunable policy model (Policy LM, e.g. Flan-T5) that generates a "stimulus/hint" (keywords or a strategy) for each input, then send the stimulus together with the original input to the frozen large model.

How is DSP different from fine-tuning the large model?

DSP never touches the large model. You train a small policy model with a few hundred million parameters that learns how to "ask GPT-4 better questions." That means no access to model weights is required (it works with closed-source models like GPT-4 and Claude), and training costs far less than fine-tuning a 175B model. The trade-off: every inference adds one extra call to the small model.

How is the DSP policy model trained?

Three steps: 1) supervised pre-training, initializing with a small set of human-annotated (input, stimulus) pairs; 2) RL fine-tuning, using the large model's output quality as the reward signal and optimizing the policy LM with policy gradients; 3) iteration, repeatedly generating a stimulus → evaluating the output → updating the policy. In essence, RL optimizes the prompt generator rather than the prompt itself.

Which tasks does DSP work on?

The paper tests two kinds of tasks: CNN/DailyMail summarization, where keyword stimuli lift ROUGE-1 from 43.7 to 45.1 and ROUGE-2 from 20.5 to 21.8, and MultiWOZ dialog, where BLEU goes from 14.2 to 16.8, Inform from 68.3% to 72.6%, and Success from 58.1% to 63.4%. What they share: information-dense generation tasks where key points must be covered.

Can you simulate DSP without the resources to train a policy model?

Yes, with a two-step prompt chain: first use an LLM to extract 5 key points (playing the role of the policy LM), then splice those points into the final prompt as "key information that must be covered." This works for summarization, dialog, and report generation; you get most of the "information retention" benefit reported in the DSP paper, at the cost of two API calls instead of one.