Directional Stimulus Prompting: use a policy LM to generate stimulus hints

Li et al. (2023) proposed Directional Stimulus Prompting (DSP), a technique for steering the outputs of black-box LLMs more precisely.

The core idea: train a small, tunable policy model (Policy LM) that generates a "stimulus/hint" for each input, then sends that stimulus along with the original input to the black-box LLM to guide it toward higher-quality outputs. The policy model is optimized using reinforcement learning (RL).

The diagram below compares directional stimulus prompting with standard prompting. The policy LM can be small (like Flan-T5) and is optimized to generate hints that guide the frozen black-box LLM.

[Figure: standard prompting vs. directional stimulus prompting]

Image source: Li et al. (2023)

Why DSP?

Directly prompting large LLMs has several problems:

  1. LLMs are black boxes: You can't modify GPT-4's parameters
  2. Prompt optimization is hard: Hand-writing prompts is time-consuming and inconsistent
  3. Lack of fine-grained control: It's tough to make the LLM focus on specific key information in the input

DSP's solution: don't tune the LLM -- train a small model to generate optimal "guide cues" for the LLM. Think of it this way: you can't change how an expert thinks, but you can learn to ask better questions.

How It Works

Overall Architecture

Input text ──→ Policy LM (small model) ──→ Generate stimulus (keywords/hints)
                                                │
                                                ▼
Input text + stimulus ──→ Black-box LLM (large model) ──→ Final output
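The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the real policy LM is a trained language model (e.g., Flan-T5), whereas here a trivial frequency-based keyword extractor stands in for it, and `llm` is any hypothetical prompt-to-completion callable you supply.

```python
import re
from collections import Counter

# Minimal stopword list for the toy extractor (assumption, not from the paper)
STOPWORDS = {"the", "a", "an", "and", "of", "to", "that", "for", "in", "can"}

def policy_lm(text, k=4):
    """Stand-in for the trained policy LM: extracts the k most frequent
    content words as the stimulus. The real policy LM generates free-form
    hints and is optimized with RL."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def build_prompt(text, stimulus):
    """Inject the stimulus alongside the original input (Step 2)."""
    return (
        "Please generate a summary for the following article.\n\n"
        f"Key points to cover: {', '.join(stimulus)}\n\n"
        f"Article: {text}\n\nSummary:"
    )

def dsp(text, llm):
    """Full pipeline: policy LM -> stimulus -> black-box LLM."""
    stimulus = policy_lm(text)
    return llm(build_prompt(text, stimulus))
```

The key design point is that `llm` stays frozen: all learning (here, none; in the paper, RL) happens in `policy_lm`.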

Step 1: Policy Model Generates Stimulus

The policy LM analyzes the input and generates a short but targeted "stimulus." The form varies by task:

Summarization task -- stimulus is keywords:

Input article: A recent study shows that walking 30 minutes daily can significantly
               reduce cardiovascular disease risk. The research team tracked 5,000
               participants for 10 years and found regular walkers had a 35% lower
               heart disease rate. Lead researcher Prof. Zhang stated...

Policy LM stimulus: walking, cardiovascular, 35%, 10-year study

Dialog task -- stimulus is a conversation strategy:

Input dialog context: User is unhappy with the product price

Policy LM stimulus: express understanding → emphasize value → offer alternatives

Step 2: Inject Stimulus into LLM Prompt

Please generate a summary for the following article.

Key points to cover: walking, cardiovascular, 35%, 10-year study

Article: A recent study shows that walking 30 minutes daily can significantly reduce cardiovascular disease risk...

Summary:

Step 3: Reinforcement Learning Optimization

The policy LM training process:

  1. Supervised pre-training: Initial training with a small set of human-annotated (input, stimulus) pairs
  2. RL fine-tuning: Using the LLM's output quality (e.g., ROUGE against reference summaries) as the reward signal, optimize the policy LM with policy gradient methods
  3. Iterative optimization: Repeatedly generate stimulus -> evaluate output -> update policy
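The RL step can be illustrated with a bare-bones REINFORCE loop. This is a deliberately tiny sketch under strong assumptions: the "policy" chooses among three fixed candidate stimuli (the real policy LM generates free-form text), and `reward` is a hard-coded stand-in for an output-quality score such as ROUGE of the LLM's summary.

```python
import math
import random

random.seed(0)

# Toy discrete action space: candidate stimuli (assumption for illustration)
CANDIDATES = ["walking, cardiovascular", "weather, sports", "35%, 10-year study"]
logits = [0.0, 0.0, 0.0]  # policy parameters

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def reward(stimulus):
    """Stand-in for evaluating the LLM's guided output (e.g., ROUGE).
    On-topic stimuli score higher; off-topic ones score zero."""
    return 1.0 if "walking" in stimulus or "study" in stimulus else 0.0

LR = 0.5
for _ in range(200):
    probs = softmax(logits)
    # Sample a stimulus from the current policy
    i = random.choices(range(len(CANDIDATES)), weights=probs)[0]
    r = reward(CANDIDATES[i])
    # REINFORCE update: grad of log pi(a) is one_hot(a) - probs
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += LR * r * grad
```

After the loop, the policy assigns low probability to the off-topic stimulus, which is the whole point: the reward comes only from the final output, never from gradients through the frozen LLM.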

Experimental Results

The paper validated DSP across multiple tasks:

Summarization (CNN/DailyMail)

Method                  ROUGE-1   ROUGE-2   ROUGE-L
Standard Prompt         43.7      20.5      40.2
DSP (keyword stimulus)  45.1      21.8      41.5

Dialog Response (MultiWOZ)

Method           BLEU   Inform   Success
Standard Prompt  14.2   68.3%    58.1%
DSP              16.8   72.6%    63.4%

Key findings:

  • Even a policy LM with just a few hundred million parameters (way smaller than the target LLM) can effectively guide the large model
  • DSP's advantage is most obvious in information retention -- keyword stimuli help the LLM avoid missing key details
  • RL optimization beats pure supervised learning because it directly optimizes final output quality

Practical Application Guide

While full DSP implementation requires training a policy model, the idea works in everyday prompt engineering:

1. Manual DSP: Key Info Extraction + Guidance

You can play the role of the "policy LM" yourself -- extract key info first, then guide the model:

Step one (you do this): Read the article, extract 3-5 keywords/key points
Step two (send to LLM):

Please generate a summary based on the following article.

⚡ Must cover these points: [your extracted keywords]

Article content: ...

2. Two-Step Prompt Chain

Use one LLM call to simulate the "policy LM," another to produce the final output:

Prompt 1 (extract stimulus):
Read the following text and extract the 5 most important key points, each in 3-5 words.
Only output the list of points, nothing else.

[text]

---

Prompt 2 (guided generation):
Generate a professional summary for the following text.

⚡ Key points (must cover all):
[Prompt 1 output]

Text: [original text]
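The two-step chain above is easy to wrap as a helper. A minimal sketch, assuming `llm` is any callable that maps a prompt string to a completion string (the function name and interface are illustrative, not from any specific library):

```python
def two_step_summary(text, llm):
    """Two LLM calls: the first plays the 'policy LM' and extracts key
    points; the second generates a summary guided by those points."""
    # Call 1: extract the stimulus
    extract_prompt = (
        "Read the following text and extract the 5 most important key points, "
        "each in 3-5 words. Only output the list of points, nothing else.\n\n"
        + text
    )
    key_points = llm(extract_prompt)

    # Call 2: guided generation with the stimulus injected
    guided_prompt = (
        "Generate a professional summary for the following text.\n\n"
        f"Key points (must cover all):\n{key_points}\n\n"
        f"Text: {text}"
    )
    return llm(guided_prompt)
```

Note the trade-off: you pay for two calls, but the second call's attention is anchored to the extracted points.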

3. Strategy Guidance for Dialog Scenarios

Prompt 1 (generate strategy):
Here's a customer service dialog context. Analyze the user's emotion and core need,
then provide a 3-step response strategy (each step under 10 words).

Dialog: ...

---

Prompt 2 (execute strategy):
Please respond to the user based on the following strategy.

Response strategy: [Prompt 1 output]

Dialog context: ...
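The dialog variant follows the same shape, with a strategy instead of keywords as the stimulus. Again a sketch under the same assumption of a hypothetical `llm` prompt-to-completion callable:

```python
def strategy_then_respond(dialog, llm):
    """Call 1 derives a 3-step response strategy (the stimulus);
    call 2 produces the actual reply under that strategy."""
    strategy = llm(
        "Here's a customer service dialog context. Analyze the user's emotion "
        "and core need, then provide a 3-step response strategy "
        "(each step under 10 words).\n\nDialog: " + dialog
    )
    return llm(
        "Please respond to the user based on the following strategy.\n\n"
        f"Response strategy: {strategy}\n\nDialog context: {dialog}"
    )
```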

Comparison with Other Methods

Method            Mechanism                    Training Required?        Best For
Few-shot          Provide examples             No                        Format guidance
CoT               Guide reasoning process      No                        Reasoning tasks
Self-Consistency  Sample multiple, vote        No                        Improving accuracy
DSP               Small model generates cues   Yes (policy LM)           Info-dense generation
Prompt Tuning     Continuous vector prefix     Yes (needs model access)  When model weights are available

DSP's unique advantage: among the methods above, it's the only one that optimizes prompts in a learnable way without access to the LLM's internal parameters.

Self-Check Checklist

  • Does the task involve focusing on specific key information during generation? (DSP is ideal for this)
  • Can you extract key points first, then guide the model?
  • Is the stimulus granularity appropriate? (Too coarse = no guidance, too fine = over-constraining)
  • Did you verify that the guided output actually covers the key information?

References

  • Li, Z., Peng, B., He, P., Galley, M., Gao, J., & Yan, X. (2023). Guiding Large Language Models via Directional Stimulus Prompting. NeurIPS 2023.