
Multimodal CoT

A two-stage CoT framework combining vision and text

Zhang et al. (2023) proposed a multimodal chain-of-thought prompting approach. Traditional CoT focuses on the language modality only. Multimodal CoT incorporates both text and vision into a two-stage framework. The first stage involves rationale generation based on multimodal information. The second stage is answer inference, which leverages the generated rationale to arrive at the final answer.

The multimodal CoT model (1B parameters) outperformed GPT-3.5 on the ScienceQA benchmark.


Image source: Zhang et al. (2023)

Why Multimodal CoT?

Text-only CoT hits a wall when questions involve visual information. For example:

  • Science questions: Experiment diagrams, circuit diagrams, chemical structures
  • Math geometry: Reading angles, lengths, and spatial relationships from figures
  • Data analysis: Interpreting charts, bar graphs, line graphs
  • Everyday reasoning: Observing details in photos to answer questions

In these scenarios, text descriptions alone don't provide enough information. The model needs to "see" the image to reason correctly.

Two-Stage Framework

Stage 1: Rationale Generation

The model receives both text and image as input, then generates a reasoning process (rationale). The goal: get the model to "verbalize" its thinking, including what it observes in the image and how it combines visual and textual information.

Input:
- Text: An object slides down an inclined plane at the angle shown.
        The object's mass is 2 kg. Find the gravitational component
        along the plane.
- Image: [A diagram showing a 30-degree incline]

Rationale output:
From the diagram, the incline angle is 30 degrees. Object mass m = 2 kg,
so gravity G = mg = 2 × 9.8 = 19.6 N.
Gravitational component along the plane = G × sin(30°) = 19.6 × 0.5 = 9.8 N.
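The arithmetic in the rationale can be double-checked with a few lines of Python (a quick verification, not part of the framework itself):

```python
import math

m = 2.0                      # mass in kg (from the problem text)
g = 9.8                      # gravitational acceleration in m/s^2
theta = math.radians(30)     # incline angle read from the diagram

G = m * g                            # total gravity: 19.6 N
component = G * math.sin(theta)      # component along the plane: ~9.8 N
print(f"{component:.1f} N")
```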

Stage 2: Answer Inference

The rationale from stage 1 is concatenated with the original input (text + image) and fed to the model to generate the final answer. The two-stage approach works because the rationale acts as a "bridge," helping the model better leverage multimodal information.

Input:
- Original text + image
- Rationale (from stage 1)

Output:
Answer: The gravitational component along the plane is 9.8 N
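The two stages can be wired together with two prompt templates: stage 1 asks only for a rationale, and stage 2 feeds that rationale back alongside the original input. The sketch below shows one way to build those prompts; the wording and function names are illustrative, not from the paper:

```python
def build_stage1_prompt(question: str) -> str:
    """Stage 1: request a rationale grounded in the attached image,
    explicitly withholding the final answer."""
    return (
        "Look at the attached image and the question below.\n"
        "Describe what you observe in the image, then reason step by step.\n"
        "Do NOT give the final answer yet.\n\n"
        f"Question: {question}"
    )

def build_stage2_prompt(question: str, rationale: str) -> str:
    """Stage 2: concatenate the generated rationale with the original
    input so the model can infer the final answer."""
    return (
        f"Question: {question}\n\n"
        f"Reasoning so far:\n{rationale}\n\n"
        "Based on this reasoning and the attached image, give the final answer."
    )

question = "Find the gravitational component along the plane."
stage1 = build_stage1_prompt(question)
# In practice, the rationale below would come from the model's stage-1 output.
rationale = "The incline is 30 degrees. G = 2 x 9.8 = 19.6 N; component = 9.8 N."
stage2 = build_stage2_prompt(question, rationale)
```

Each prompt would be sent together with the image to a multimodal model; the same two calls work with any vision-capable API.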

Key Technical Details

Visual Feature Fusion

Multimodal CoT uses a Vision Transformer (ViT) to extract image features, fusing them with language features through:

  1. Feature concatenation: Concatenate visual feature vectors with text embeddings
  2. Cross-attention: Let text tokens "attend to" image patches in Transformer layers
  3. Gating mechanism: Dynamically control the proportion of visual information injected
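The cross-attention and gating steps above can be sketched numerically. This is a minimal NumPy illustration of the mechanism, not the paper's actual architecture; dimensions and the gate formula are simplified assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # shared hidden size (assumed)
text = rng.normal(size=(5, d))     # 5 text token embeddings
image = rng.normal(size=(3, d))    # 3 image patch features (e.g. from a ViT)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each text token attends to the image patches
attn = softmax(text @ image.T / np.sqrt(d))   # (5, 3), rows sum to 1
attended = attn @ image                       # (5, d) visual context per token

# Gating: a per-token sigmoid controls how much visual info is injected
gate = 1.0 / (1.0 + np.exp(-(text * attended).sum(-1, keepdims=True)))  # (5, 1)
fused = text + gate * attended                # (5, d) fused representation
```

Feature concatenation (the first option in the list) would instead be `np.concatenate([text_vec, image_vec])` followed by a projection back to dimension `d`.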

Hallucination Problem and Solution

The paper found something interesting: when the model uses only text input, the generated rationales frequently contain hallucinations -- fabricating information that doesn't exist in the image. Adding the visual modality significantly reduced the hallucination rate.

❌ Text-only CoT (hallucination):
"From the diagram we can see this is an equilateral triangle..."
(The actual diagram shows a right triangle)

✅ Multimodal CoT (correct):
"The diagram shows a right triangle, with one angle at 90 degrees and another labeled as 30 degrees..."

Experimental Results

Performance on ScienceQA benchmark:

Model                          Parameters   Accuracy
GPT-3.5 (CoT)                  175B         75.17%
GPT-4 (CoT)                    -            83.99%
Multimodal CoT (small model)   1B           84.91%

Key findings:

  • A 1B-parameter multimodal CoT model beat GPT-3.5 (175B) and even slightly outperformed GPT-4
  • The two-stage approach is 16% more accurate than single-stage (directly generating answers)
  • Visual features are critical for reducing hallucinations -- without visual input, the hallucination rate in rationales hit 65%

Practical Applications

1. Education and Exam Assistance

Multimodal CoT is perfect for exam questions with charts and diagrams:

Prompt:
Look at the chart below and answer the question. First describe what
you see in the chart, then reason step by step to reach your answer.

[Attached image: Company revenue bar chart 2020-2025]

Question: Which year had the highest revenue growth rate?

2. Medical Imaging Analysis

Prompt:
Please analyze the following X-ray. First describe the key features
you observe in the image, then give your preliminary assessment
and reasoning process based on those observations.

[Attached image: Chest X-ray]

3. Code + Screenshot Debugging

Prompt:
Here's some code and its runtime screenshot. First describe the error
shown in the screenshot, then analyze what in the code might cause it,
and finally suggest a fix.

Code: [code block]
Screenshot: [error screenshot]

Applying This with Modern Multimodal Models

While the original paper fine-tuned a small model, the multimodal CoT concept works equally well with modern large models (GPT-4o, Claude, Gemini). You can guide the model through similar two-stage reasoning via prompting:

Please answer the question following these steps:

Step 1 (Observe and Reason):
- Carefully observe all relevant information in the image
- List key visual elements (numbers, labels, relationships)
- Describe your reasoning process

Step 2 (Arrive at Answer):
- Based on your reasoning, give the final answer
- State your confidence level and any uncertainties
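The two-step instruction above can be packaged into a single API request. The sketch below uses the OpenAI Chat Completions vision message format as an example; other providers (Claude, Gemini) use a similar structure with different field names:

```python
def multimodal_cot_messages(question: str, image_url: str) -> list:
    """Build a chat message that embeds the two-stage CoT instruction
    alongside an image, in OpenAI-style vision format."""
    instruction = (
        "Please answer the question following these steps:\n"
        "Step 1 (Observe and Reason): carefully observe the image, "
        "list key visual elements (numbers, labels, relationships), "
        "and describe your reasoning process.\n"
        "Step 2 (Arrive at Answer): based on your reasoning, give the "
        "final answer and state your confidence level and any uncertainties."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"{instruction}\n\nQuestion: {question}"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

messages = multimodal_cot_messages(
    "Which year had the highest revenue growth rate?",
    "https://example.com/revenue-chart.png",  # placeholder URL
)
```

These messages would then be passed to the model (e.g. `client.chat.completions.create(model=..., messages=messages)`), with the rationale and answer returned in a single response.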

Self-Check Checklist

  • Does the question involve information that requires visual understanding? (If text alone suffices, use standard CoT)
  • Did you explicitly ask the model to "describe observations" before "reasoning toward an answer" (two-stage thinking)?
  • Is the image resolution high enough for the model to identify key details?
  • Did you provide a structured output format requirement?

References