Prompt Master

Master the art of conversing with AI

Reflexion

Self-reflection: reinforce agents with verbal feedback

Reflexion is a framework that reinforces language agents through linguistic feedback. According to Shinn et al. (2023), "Reflexion is a new paradigm for 'verbal' reinforcement that parameterizes a policy as an agent's memory encoding paired with a choice of LLM parameters."

At a high level, Reflexion converts environment feedback (free-form language or scalar rewards) into linguistic feedback -- called self-reflection -- that provides context for the LLM agent in the next round. This helps the agent learn quickly and effectively from past mistakes, improving performance on many advanced tasks.

"Reflexion Framework"

As shown above, Reflexion consists of three distinct models:

  • Actor: Generates text and actions based on state observations. The actor takes actions in the environment and receives observations, forming trajectories. Chain-of-Thought (CoT) and ReAct serve as actor models. A memory component is also added to provide additional context.
  • Evaluator: Scores the actor's output. Specifically, it takes the generated trajectory (also called short-term memory) as input and outputs a reward score. Different reward functions are used for different tasks (LLM-based and rule-based heuristic rewards for decision tasks).
  • Self-Reflection: Generates verbal reinforcement cues to help the actor improve. This role is played by an LLM, producing valuable feedback for future trials. The self-reflection model uses the reward signal, current trajectory, and persistent memory to generate specific and relevant feedback, which is stored in the memory component. The agent leverages these experiences (stored in long-term memory) to rapidly improve its decisions.
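As a rough illustration, the three models can be sketched as plain Python classes around a single text-generation function. Everything here is hypothetical: `llm` is a canned stand-in for a real model call, and the rule-based reward is a toy heuristic, not the paper's evaluators.

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned response here."""
    return "stub response for: " + prompt[:40]

@dataclass
class Memory:
    """Long-term memory holding verbal self-reflections across trials."""
    reflections: list[str] = field(default_factory=list)

class Actor:
    """Generates actions from the task plus reflections on past trials."""
    def act(self, task: str, memory: Memory) -> str:
        context = "\n".join(memory.reflections)
        return llm(f"Task: {task}\nPast reflections:\n{context}\nNext action:")

class Evaluator:
    """Scores a trajectory; here a trivial rule-based heuristic reward."""
    def score(self, trajectory: str) -> float:
        return 1.0 if "success" in trajectory else 0.0

class SelfReflection:
    """Turns (trajectory, reward) into verbal feedback stored in memory."""
    def reflect(self, trajectory: str, reward: float, memory: Memory) -> str:
        feedback = llm(f"Trajectory: {trajectory}\nReward: {reward}\nWhat went wrong?")
        memory.reflections.append(feedback)
        return feedback
```

The important structural point is the data flow: the Actor reads long-term memory, the Evaluator reads only the trajectory, and Self-Reflection is the sole writer to memory.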

Overall, the key steps in Reflexion are: a) define the task, b) generate a trajectory, c) evaluate, d) perform self-reflection, e) generate the next trajectory. The figure below shows examples of a Reflexion agent learning to iteratively improve its behavior for decision-making, coding, and reasoning tasks. Reflexion extends the ReAct framework by introducing self-evaluation, self-reflection, and memory components.
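Under the same caveat, the five steps above reduce to a short trial loop. The `generate_trajectory`, `evaluate`, and `reflect` functions below are stubs standing in for the Actor, Evaluator, and Self-Reflection LLM calls; only the control flow is the point.

```python
def generate_trajectory(task, reflections):
    # (b) Actor: stand-in for an LLM producing a trajectory (stub).
    return f"attempt at '{task}' informed by {len(reflections)} reflections"

def evaluate(trajectory):
    # (c) Evaluator: toy reward; pretend every trial fails for the demo.
    return 0.0

def reflect(trajectory, reward):
    # (d) Self-Reflection: verbal feedback on the failed trial (stub).
    return f"reward {reward}: next time, decompose the task further"

def reflexion(task, max_trials=3, success_threshold=1.0):
    reflections = []                                     # (a) task defined by caller
    for _ in range(max_trials):
        trajectory = generate_trajectory(task, reflections)  # (b)
        reward = evaluate(trajectory)                        # (c)
        if reward >= success_threshold:
            return trajectory, reflections
        reflections.append(reflect(trajectory, reward))      # (d)
    return trajectory, reflections                       # (e) next trial sees feedback
```

Each failed trial appends one reflection, so every subsequent trajectory is conditioned on a growing record of what went wrong.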

"Reflexion Examples"

Results

Experiments show that Reflexion significantly improves performance on decision-making tasks in AlfWorld, question reasoning in HotPotQA, and Python programming tasks on HumanEval.

On sequential decision-making (AlfWorld) tasks, Reflexion paired with ReAct, using heuristic and GPT-based self-evaluation for binary classification, completed 130 of 134 tasks, significantly outperforming ReAct alone.

"Reflexion ALFWorld Results"

In just a few learning steps, Reflexion significantly outperformed all baseline methods. Both for reasoning alone and when episodic memory consisting of the most recent trajectory was added, Reflexion + CoT outperformed CoT-only and CoT with episodic memory, respectively.

"Reflexion HotpotQA Results"

As shown in the table below, Reflexion generally outperforms previous SOTA methods when writing Python and Rust code on MBPP, HumanEval, and Leetcode Hard.

"Reflexion Programming Results"

When to Use Reflexion?

Reflexion works best when:

  1. The agent needs to learn from trial and error: Reflexion is designed to help agents improve by reflecting on past mistakes and incorporating that knowledge into future decisions. It's a natural fit for tasks requiring iterative learning, such as decision-making, reasoning, and programming.

  2. Traditional RL methods aren't practical: Conventional reinforcement learning typically needs massive training data and expensive model fine-tuning. Reflexion offers a lightweight alternative that doesn't require fine-tuning the underlying language model, making it more efficient in terms of data and compute.

  3. Nuanced feedback is needed: Reflexion uses linguistic feedback, which is more nuanced and specific than scalar rewards in traditional RL. This lets the agent better understand its mistakes and make more targeted improvements in subsequent trials.

  4. Interpretability and explicit memory matter: Compared to traditional RL, Reflexion provides a more interpretable and explicit form of episodic memory. The agent's self-reflections are stored in its memory component, making it easier to analyze and understand its learning process.
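Point 3 is easiest to see concretely: a scalar reward only tells the agent *that* it failed, while stored reflections tell it *why* and what to change. A minimal sketch of how both might be packed into the next trial's prompt (all names illustrative):

```python
def build_next_prompt(task, scalar_reward, reflections):
    """Assemble the context for the next trial: the bare scalar plus
    the far more actionable verbal lessons from previous trials."""
    lines = [f"Task: {task}", f"Previous reward: {scalar_reward}"]
    if reflections:
        lines.append("Lessons from previous trials:")
        lines += [f"- {r}" for r in reflections]
    return "\n".join(lines)
```

A prompt built this way carries targeted guidance ("handle empty input") that a reward of 0.0 alone could never convey.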

Reflexion has proven effective on these tasks:

  • Sequential decision-making: Improved agent performance on AlfWorld tasks involving navigation and multi-step goals across various environments.
  • Reasoning: Improved performance on HotPotQA, a QA dataset requiring multi-document reasoning.
  • Programming: Reflexion agents wrote better code on benchmarks like HumanEval and MBPP, achieving SOTA results in some cases.

Some limitations to keep in mind:

  • Depends on self-evaluation ability: Reflexion relies on the agent accurately assessing its performance and generating useful reflections. This can be challenging for complex tasks, though it should improve as models get better.
  • Long-term memory constraints: Reflexion uses a sliding window with a maximum capacity. For more complex tasks, advanced structures like vector embeddings or SQL databases might work better.
  • Code generation limitations: Test-driven development has limitations in specifying accurate input-output mappings (e.g., non-deterministic generator functions and hardware-dependent function outputs).
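The sliding-window memory mentioned above is straightforward to sketch with a bounded deque: once the window is full, adding a new reflection evicts the oldest one. The capacity of 3 reflects the small bound used in the paper; the class itself is illustrative.

```python
from collections import deque

class ReflectionMemory:
    """Sliding-window long-term memory: keeps only the most recent
    `capacity` reflections, dropping the oldest when full."""
    def __init__(self, capacity: int = 3):
        self.window = deque(maxlen=capacity)

    def add(self, reflection: str) -> None:
        self.window.append(reflection)  # evicts the oldest entry when full

    def as_context(self) -> str:
        return "\n".join(self.window)
```

Swapping this class for a vector store or database, as suggested above, would change only `add` and `as_context`, not the surrounding agent loop.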

Image source: Reflexion: Language Agents with Verbal Reinforcement Learning

References

  • Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning.