Multimodal CoT
A two-stage CoT framework combining vision and text
Zhang et al. (2023) proposed a multimodal chain-of-thought prompting approach. Where traditional CoT operates on the language modality alone, multimodal CoT incorporates both text and vision into a two-stage framework: the first stage generates a rationale from the multimodal input, and the second stage performs answer inference, leveraging the generated rationale to arrive at the final answer.
The multimodal CoT model (1B parameters) outperformed GPT-3.5 on the ScienceQA benchmark.

Figure: the two-stage multimodal CoT framework (image source: Zhang et al., 2023)
Why Multimodal CoT?
Text-only CoT hits a wall when questions involve visual information. For example:
- Science questions: Experiment diagrams, circuit diagrams, chemical structures
- Math geometry: Reading angles, lengths, and spatial relationships from figures
- Data analysis: Interpreting charts, bar graphs, line graphs
- Everyday reasoning: Observing details in photos to answer questions
In these scenarios, text descriptions alone don't provide enough information. The model needs to "see" the image to reason correctly.
Two-Stage Framework
Stage 1: Rationale Generation
The model receives both text and image as input, then generates a reasoning process (rationale). The goal: get the model to "verbalize" its thinking, including what it observes in the image and how it combines visual and textual information.
Input:
- Text: An object slides down an inclined plane at the angle shown.
The object's mass is 2kg. Find the gravitational component
along the plane.
- Image: [A diagram showing a 30-degree incline]
Rationale output:
From the diagram, the incline angle is 30 degrees. Object mass m=2kg,
gravity G = mg = 2 × 9.8 = 19.6N.
Gravitational component along the plane = G × sin(30°) = 19.6 × 0.5 = 9.8N.
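The arithmetic in this rationale is easy to verify in a few lines of Python:

```python
import math

m = 2.0                      # mass in kg
g = 9.8                      # gravitational acceleration in m/s^2
theta = math.radians(30)     # incline angle from the diagram

G = m * g                    # weight: 19.6 N
component = G * math.sin(theta)  # component along the plane: 9.8 N
```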
Stage 2: Answer Inference
The rationale from stage 1 is concatenated with the original input (text + image) and fed to the model to generate the final answer. The two-stage approach works because the rationale acts as a "bridge," helping the model better leverage multimodal information.
Input:
- Original text + image
- Rationale (from stage 1)
Output:
Answer: The gravitational component along the plane is 9.8N
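The two-stage data flow can be sketched as a pair of model calls, where the stage-1 rationale is appended to the stage-2 input. Here `generate` is a placeholder for any multimodal model call, not a real API; only the data flow reflects the paper's framework:

```python
def multimodal_cot(generate, text, image):
    """Two-stage multimodal CoT: rationale generation, then answer inference.

    `generate(prompt, image)` is an assumed interface that returns a string.
    """
    # Stage 1: generate a rationale from text + image
    rationale = generate(prompt=f"{text}\nRationale:", image=image)
    # Stage 2: infer the answer from text + image + generated rationale
    answer = generate(prompt=f"{text}\n{rationale}\nAnswer:", image=image)
    return rationale, answer
```

The key design point is that the same multimodal input is shown to the model twice; only the rationale, acting as the "bridge," is new in stage 2.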
Key Technical Details
Visual Feature Fusion
Multimodal CoT uses a Vision Transformer (ViT) to extract image features, fusing them with language features through:
- Feature concatenation: Concatenate visual feature vectors with text embeddings
- Cross-attention: Let text tokens "attend to" image patches in Transformer layers
- Gating mechanism: Dynamically control the proportion of visual information injected
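As a rough sketch (not the paper's actual architecture; all dimensions here are illustrative assumptions), cross-attention plus a gating mechanism might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Illustrative text-vision fusion: text tokens attend to image patches,
    and a learned gate controls how much visual information is injected."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, d), image_feats: (B, P, d) from e.g. a ViT
        attended, _ = self.cross_attn(query=text_feats,
                                      key=image_feats, value=image_feats)
        # Per-token gate in [0, 1] based on both streams
        lam = torch.sigmoid(self.gate(torch.cat([text_feats, attended], dim=-1)))
        return (1 - lam) * text_feats + lam * attended

fusion = GatedCrossAttentionFusion()
text = torch.randn(2, 16, 768)    # 16 text tokens
image = torch.randn(2, 49, 768)   # 49 ViT patch features
fused = fusion(text, image)       # same shape as the text stream
```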
Hallucination Problem and Solution
The paper found something interesting: when the model uses only text input, the generated rationales frequently contain hallucinations -- fabricating information that doesn't exist in the image. Adding the visual modality significantly reduced the hallucination rate.
❌ Text-only CoT (hallucination):
"From the diagram we can see this is an equilateral triangle..."
(The actual diagram shows a right triangle)
✅ Multimodal CoT (correct):
"The diagram shows a right triangle, with one angle at 90 degrees and another labeled as 30 degrees..."
Experimental Results
Performance on ScienceQA benchmark:
| Model | Parameters | Accuracy |
|---|---|---|
| GPT-3.5 (CoT) | 175B | 75.17% |
| GPT-4 (CoT) | - | 83.99% |
| Multimodal CoT (small model) | 1B | 84.91% |
Key findings:
- A 1B-parameter multimodal CoT model beat GPT-3.5 (175B) and even slightly outperformed GPT-4
- The two-stage approach is 16% more accurate than single-stage (directly generating answers)
- Visual features are critical for reducing hallucinations -- without visual input, the hallucination rate in rationales hit 65%
Practical Applications
1. Education and Exam Assistance
Multimodal CoT is perfect for exam questions with charts and diagrams:
Prompt:
Look at the chart below and answer the question. First describe what
you see in the chart, then reason step by step to reach your answer.
[Attached image: Company revenue bar chart 2020-2025]
Question: Which year had the highest revenue growth rate?
2. Medical Imaging Analysis
Prompt:
Please analyze the following X-ray. First describe the key features
you observe in the image, then give your preliminary assessment
and reasoning process based on those observations.
[Attached image: Chest X-ray]
3. Code + Screenshot Debugging
Prompt:
Here's some code and its runtime screenshot. First describe the error
shown in the screenshot, then analyze what in the code might cause it,
and finally suggest a fix.
Code: [code block]
Screenshot: [error screenshot]
Applying This with Modern Multimodal Models
While the original paper fine-tuned a small model, the multimodal CoT concept works equally well with modern large models (GPT-4o, Claude, Gemini). You can guide the model through similar two-stage reasoning via prompting:
Please answer the question following these steps:
Step 1 (Observe and Reason):
- Carefully observe all relevant information in the image
- List key visual elements (numbers, labels, relationships)
- Describe your reasoning process
Step 2 (Arrive at Answer):
- Based on your reasoning, give the final answer
- State your confidence level and any uncertainties
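One way to package this template programmatically is to build an OpenAI-style chat message with mixed text and image content. The question and image URL below are placeholders, and the message schema is one common multimodal format, not the only option:

```python
def build_two_stage_prompt(question, image_url):
    """Build a chat message that guides a multimodal model through
    observe-then-answer reasoning. `image_url` is a placeholder."""
    instructions = (
        "Please answer the question following these steps:\n"
        "Step 1 (Observe and Reason): list the key visual elements "
        "(numbers, labels, relationships), then describe your reasoning.\n"
        "Step 2 (Arrive at Answer): give the final answer and state "
        "your confidence level and any uncertainties."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"{instructions}\n\nQuestion: {question}"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]
```

The resulting list can be passed as the `messages` argument to a vision-capable chat endpoint.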
Self-Check Checklist
- Does the question involve information that requires visual understanding? (If text alone suffices, use standard CoT)
- Did you explicitly ask the model to "describe observations" before "reasoning toward an answer" (two-stage thinking)?
- Is the image resolution high enough for the model to identify key details?
- Did you provide a structured output format requirement?
References
- Multimodal Chain-of-Thought Reasoning in Language Models (Zhang et al., 2023)
- Language Is Not All You Need: Aligning Perception with Language Models (Huang et al., 2023)
- Visual Instruction Tuning (Liu et al., 2023 - LLaVA)
- GPT-4V(ision) System Card