DSP
Directional Stimulus Prompting: use a policy LM to generate stimulus hints
Li et al. (2023) proposed Directional Stimulus Prompting (DSP), a technique for guiding the outputs of black-box LLMs.
The core idea: train a small, tunable policy model (policy LM) that generates a "stimulus" (a hint) for each input, then send that stimulus along with the original input to the black-box LLM to steer it toward higher-quality output. The policy model is optimized with reinforcement learning (RL).
The diagram below compares directional stimulus prompting with standard prompting. The policy LM can be small (like Flan-T5) and is optimized to generate hints that guide the frozen black-box LLM.

Image source: Li et al. (2023)
Why DSP?
Directly prompting large LLMs has several problems:
- LLMs are black boxes: You can't modify GPT-4's parameters
- Prompt optimization is hard: Hand-writing prompts is time-consuming and inconsistent
- Lack of fine-grained control: It's tough to make the LLM focus on specific key information in the input
DSP's solution: don't tune the LLM -- train a small model to generate optimal "guide cues" for the LLM. Think of it this way: you can't change how an expert thinks, but you can learn to ask better questions.
How It Works
Overall Architecture
Input text ──→ Policy LM (small model) ──→ Generate stimulus (keywords/hints)
                                                        │
                                                        ▼
Input text + stimulus ──→ Black-box LLM (large model) ──→ Final output
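The two-stage flow can be written as a thin wrapper. This is a minimal sketch: the function names, the hint format, and the stub models below are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable

def dsp_generate(
    text: str,
    policy_lm: Callable[[str], str],      # small tunable model -> stimulus
    black_box_llm: Callable[[str], str],  # frozen large model -> final output
) -> str:
    """Run the DSP pipeline: generate a stimulus, then make the guided LLM call."""
    stimulus = policy_lm(text)
    prompt = f"{text}\nKey hints: {stimulus}"
    return black_box_llm(prompt)

# Stub callables standing in for the real models:
out = dsp_generate(
    "Walking 30 minutes daily reduces heart disease risk.",
    policy_lm=lambda t: "walking, heart disease",
    black_box_llm=lambda p: f"[LLM output for prompt of {len(p)} chars]",
)
```

In practice `policy_lm` would wrap a small model such as Flan-T5 and `black_box_llm` would wrap an API call; only the policy side is ever trained.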
Step 1: Policy Model Generates Stimulus
The policy LM analyzes the input and generates a short but targeted "stimulus." The form varies by task:
Summarization task -- stimulus is keywords:
Input article: A recent study shows that walking 30 minutes daily can significantly
reduce cardiovascular disease risk. The research team tracked 5,000
participants for 10 years and found regular walkers had a 35% lower
heart disease rate. Lead researcher Prof. Zhang stated...
Policy LM stimulus: walking, cardiovascular, 35%, 10-year study
Dialog task -- stimulus is a conversation strategy:
Input dialog context: User is unhappy with the product price
Policy LM stimulus: express understanding → emphasize value → offer alternatives
Step 2: Inject Stimulus into LLM Prompt
Please generate a summary for the following article.
Key points to cover: walking, cardiovascular, 35%, 10-year study
Article: A recent study shows that walking 30 minutes daily can significantly reduce cardiovascular disease risk...
Summary:
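The injection step is just string templating. A minimal sketch (the function name and template wording are illustrative; the paper's exact prompt format may differ):

```python
def build_dsp_prompt(article: str, stimulus: str) -> str:
    """Combine the original input with the policy LM's stimulus into one prompt."""
    return (
        "Please generate a summary for the following article.\n"
        f"Key points to cover: {stimulus}\n"
        f"Article: {article}\n"
        "Summary:"
    )

prompt = build_dsp_prompt(
    "A recent study shows that walking 30 minutes daily ...",
    "walking, cardiovascular, 35%, 10-year study",
)
```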
Step 3: Reinforcement Learning Optimization
The policy LM training process:
- Supervised pre-training: Initial training with a small set of human-annotated (input, stimulus) pairs
- RL fine-tuning: Use the LLM's output quality as the reward signal, optimize the policy LM with policy gradient methods
- Iterative optimization: Repeatedly generate stimulus → evaluate output → update policy
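One generate-and-evaluate step of this loop can be sketched as follows. Everything here is a stub under loud assumptions: the keyword-sampling policy, the echoing LLM, and the keyword-overlap reward are placeholders (the paper uses a trained Flan-T5 policy, a real LLM call, and task metrics such as ROUGE for the reward).

```python
import random

random.seed(0)  # make the sketch deterministic

def policy_generate(text: str) -> list[str]:
    """Stub policy LM: sample candidate keywords from the input.
    A real implementation would be a small tunable model such as Flan-T5."""
    words = [w.strip(".,") for w in text.split() if len(w) > 4]
    return random.sample(words, k=min(4, len(words)))

def llm_summarize(text: str, stimulus: list[str]) -> str:
    """Stub for the frozen black-box LLM (an API call in practice)."""
    return f"Summary covering: {', '.join(stimulus)}"

def reward(output: str, reference_keywords: list[str]) -> float:
    """Stub reward: fraction of reference keywords covered by the output."""
    hits = sum(1 for k in reference_keywords if k in output)
    return hits / max(1, len(reference_keywords))

article = "Regular walking strongly reduces cardiovascular disease risk over decades"
reference = ["walking", "cardiovascular"]

# One generate -> evaluate step; a real trainer would then apply a policy
# gradient update (reward-weighted log-probs) to the policy LM's parameters.
stimulus = policy_generate(article)
r = reward(llm_summarize(article, stimulus), reference)
```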
Experimental Results
The paper validated DSP across multiple tasks:
Summarization (CNN/DailyMail)
| Method | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Standard Prompt | 43.7 | 20.5 | 40.2 |
| DSP (keyword stimulus) | 45.1 | 21.8 | 41.5 |
Dialog Response (MultiWOZ)
| Method | BLEU | Inform | Success |
|---|---|---|---|
| Standard Prompt | 14.2 | 68.3% | 58.1% |
| DSP | 16.8 | 72.6% | 63.4% |
Key findings:
- Even a policy LM with just a few hundred million parameters (orders of magnitude smaller than the target LLM) can effectively guide the large model
- DSP's advantage is most obvious in information retention -- keyword stimuli help the LLM avoid missing key details
- RL optimization beats pure supervised learning because it directly optimizes final output quality
Practical Application Guide
While full DSP implementation requires training a policy model, the idea works in everyday prompt engineering:
1. Manual DSP: Key Info Extraction + Guidance
You can play the role of the "policy LM" yourself -- extract key info first, then guide the model:
Step one (you do this): Read the article, extract 3-5 keywords/key points
Step two (send to LLM):
Please generate a summary based on the following article.
⚡ Must cover these points: [your extracted keywords]
Article content: ...
2. Two-Step Prompt Chain
Use one LLM call to simulate the "policy LM," another to produce the final output:
Prompt 1 (extract stimulus):
Read the following text and extract the 5 most important key points, each in 3-5 words.
Only output the list of points, nothing else.
[text]
---
Prompt 2 (guided generation):
Generate a professional summary for the following text.
⚡ Key points (must cover all):
[Prompt 1 output]
Text: [original text]
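The two-step chain above is straightforward to wire up in code. A minimal sketch: `call_llm` is a placeholder for whatever client you actually use (it returns a canned list here so the example is self-contained); the prompt wording follows the templates above.

```python
def call_llm(prompt: str) -> str:
    """Placeholder LLM call -- substitute a real API client here.
    For this sketch it echoes a canned key-point list."""
    return ("- daily walking\n- heart disease risk\n- 35% reduction\n"
            "- 10-year study\n- 5,000 participants")

def extract_stimulus(text: str) -> str:
    """Prompt 1: have the LLM act as the policy model and extract key points."""
    prompt = (
        "Read the following text and extract the 5 most important key points, "
        "each in 3-5 words.\n"
        "Only output the list of points, nothing else.\n"
        f"{text}"
    )
    return call_llm(prompt)

def guided_summary(text: str) -> str:
    """Prompt 2: feed the extracted points back in as the stimulus."""
    key_points = extract_stimulus(text)
    prompt = (
        "Generate a professional summary for the following text.\n"
        f"Key points (must cover all):\n{key_points}\n"
        f"Text: {text}"
    )
    return call_llm(prompt)
```

The chain costs two LLM calls instead of one, which is the usual trade-off for this pattern: better information coverage at roughly double the latency.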
3. Strategy Guidance for Dialog Scenarios
Prompt 1 (generate strategy):
Here's a customer service dialog context. Analyze the user's emotion and core need,
then provide a 3-step response strategy (each step under 10 words).
Dialog: ...
---
Prompt 2 (execute strategy):
Please respond to the user based on the following strategy.
Response strategy: [Prompt 1 output]
Dialog context: ...
Comparison with Other Methods
| Method | Mechanism | Training Required? | Best For |
|---|---|---|---|
| Few-shot | Provide examples | No | Format guidance |
| CoT | Guide reasoning process | No | Reasoning tasks |
| Self-Consistency | Sample multiple, vote | No | Improving accuracy |
| DSP | Small model generates cues | Yes (policy LM) | Info-dense generation |
| Prompt Tuning | Continuous vector prefix | Yes (needs weight access) | White-box settings |
DSP's distinctive advantage: among the methods above, it is the only one that optimizes the guidance in a learnable way without access to the LLM's internal parameters.
Self-Check Checklist
- Does the task involve focusing on specific key information during generation? (DSP is ideal for this)
- Can you extract key points first, then guide the model?
- Is the stimulus granularity appropriate? (Too coarse = no guidance, too fine = over-constraining)
- Did you verify that the guided output actually covers the key information?
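The last checklist item can be automated with a quick coverage check. A crude but useful heuristic (case-insensitive substring match -- my own sketch, not from the paper):

```python
def stimulus_coverage(output: str, keywords: list[str]) -> float:
    """Return the fraction of stimulus keywords that appear in the output."""
    text = output.lower()
    hits = [k for k in keywords if k.lower() in text]
    return len(hits) / len(keywords) if keywords else 1.0

summary = "A 10-year study of 5,000 people found walking cuts heart disease by 35%."
cov = stimulus_coverage(summary, ["walking", "cardiovascular", "35%", "10-year study"])
# cov == 0.75: "cardiovascular" is missing, so the summary should be revised
```

If coverage falls below a threshold you set, re-prompt with the missing keywords called out explicitly.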
References
- Guiding Large Language Models via Directional Stimulus Prompting (Li et al., 2023)
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (Fernando et al., 2023)
- Large Language Models Are Human-Level Prompt Engineers (Zhou et al., 2022 -- APE)