12

Multimodal & Tool Chains

⏱️ 40 min

Multimodal Tooling

Multimodal AI is easy to turn into a "just pass everything" feature. But in production, things break fast: image understanding is flaky, OCR costs add up, video is slow, audio pipelines are a pain, and tool schemas blow up the moment they get complex. The real job of an AI engineer isn't adding more modalities -- it's turning each modality into a controllable product capability.

So this page isn't a model catalog. It's about how to pick, integrate, and control multimodal tooling.

Multimodal Tooling Matrix


Bottom line: break down modalities first, then pick tools

A lot of teams see a model that supports vision / audio / video and immediately want to throw everything at it.

A more reliable sequence:

  1. Identify which input types you actually need to handle
  2. Decide whether to use native multimodal understanding or convert to text first
  3. Then choose between a general model and specialized tools

Not every multimodal task should go straight to one big model.


Different modalities, completely different engineering problems

Modality | Common tasks | Engineering focus
image | OCR, captioning, UI understanding, chart Q&A | resolution, OCR quality, region references
audio | STT, meeting notes, TTS | noise, speakers, chunking, timestamps
video | transcripts, scene summaries, event extraction | frame sampling, long durations, cost
files | PDF, slides, spreadsheets, repos | parsing, metadata, page mapping
tools | web search, DB queries, code execution | schemas, permissions, latency, safety

Treat these as one category and you'll end up with something expensive and unstable.


A more realistic tool selection logic

Scenario | More stable approach
Simple image understanding | vision-capable LLM
High-quality OCR | OCR first, then hand off to a text LLM
Meeting transcript | STT first, then summarize / extract action items
Long video analysis | transcript + scene summaries, not frame-by-frame
Structured chart/table understanding | specialized parser + LLM explanation

Many multimodal systems that actually work well in production do so because they preprocess first, rather than feeding raw input directly to the model.
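
The routing logic above can be sketched as a small dispatcher that decides the processing path before any model call. The names here (`Route`, `choose_route`, the model labels) are illustrative, not a real API:

```python
# Hypothetical modality router: decide "native multimodal" vs
# "convert to text first" before touching any model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Route:
    preprocess: Optional[str]  # converter to run first, if any
    model: str                 # what finally consumes the input

def choose_route(modality: str, task: str) -> Route:
    if modality == "image" and task == "ocr":
        # High-quality OCR: extract text first, then a text LLM.
        return Route(preprocess="ocr", model="text-llm")
    if modality == "audio":
        # Meetings: STT first, then summarize / extract actions.
        return Route(preprocess="stt", model="text-llm")
    if modality == "video":
        # Transcript + scene summaries, not frame-by-frame.
        return Route(preprocess="transcript+scenes", model="text-llm")
    if modality == "image":
        # Simple image understanding: vision LLM directly.
        return Route(preprocess=None, model="vision-llm")
    # Files and everything else: parse to text first.
    return Route(preprocess="parse", model="text-llm")
```

The point of the sketch is that the branch happens on cheap metadata (modality, task), not on model output.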


Metadata is the lifeline of multimodal systems

The moment you're dealing with images, audio, video, or PDFs, metadata becomes critical.

At minimum, keep:

  • filename / source id
  • page / frame / timestamp
  • chunk order
  • extraction method
  • confidence if available

Without these, citation, review, and debugging all become extremely painful.
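
A minimal record for that checklist might look like the following; the field names are illustrative, chosen to mirror the list above:

```python
# Minimal metadata attached to every extracted chunk, whatever the modality.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkMeta:
    source_id: str            # filename / source id
    locator: str              # page, frame, or timestamp, e.g. "p.12" or "00:03:41"
    chunk_order: int          # position of this chunk within the source
    extraction_method: str    # "ocr", "stt", "pdf-parse", ...
    confidence: Optional[float] = None  # extractor confidence, if available

    def citation(self) -> str:
        # The string you surface next to an answer for review/debugging.
        return f"{self.source_id} @ {self.locator}"
```

Carrying this record alongside every chunk is what makes citation and review possible later.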


Multimodal RAG isn't just dumping files into a vector store

A more reliable approach typically looks like:

extract text / OCR / transcript
  -> add metadata
  -> embed/index
  -> retrieve by source-aware chunks
  -> generate with citations

Especially for images and video, what's actually retrievable is usually not raw pixels but:

  • OCR text
  • caption
  • scene summary
  • timestamped transcript

That's why multimodal RAG is more like a data pipeline than a model feature.
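
The pipeline above can be sketched end to end. `extract` and `embed` are stand-ins for whatever extractor and embedding model you actually use; the in-memory list stands in for a real vector store:

```python
# Sketch of: extract -> add metadata -> embed/index -> retrieve with citations.

def build_index(documents, extract, embed):
    """documents: [(source_id, raw)]; extract: raw -> [(locator, text)]."""
    index = []
    for source_id, raw in documents:
        for order, (locator, text) in enumerate(extract(raw)):
            index.append({
                "vector": embed(text),
                "text": text,
                # Metadata travels with the chunk, so citations survive retrieval.
                "meta": {"source_id": source_id, "locator": locator,
                         "chunk_order": order,
                         "extraction_method": extract.__name__},
            })
    return index

def retrieve(index, query_vec, top_k=3):
    # Toy similarity: dot product over equal-length vectors.
    scored = sorted(index,
                    key=lambda e: -sum(a * b for a, b in zip(query_vec, e["vector"])))
    # Return text plus the metadata the generator needs for citations.
    return [(e["text"], e["meta"]) for e in scored[:top_k]]
```

Note that what gets embedded is the extracted text (OCR, caption, transcript), never the raw pixels.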


Tool integration: schema and permissions are the core

Whenever you're connecting external tools, don't just ask "can I call it?"

Focus on:

Question | Why it matters
Is the tool schema clear? | Unclear schemas lead to frequent misuse
Is the domain / DB scope restricted? | Prevents unauthorized or dirty queries
How are timeouts and retries configured? | Tool calls can easily slow down the whole chain
How is output fed back to the model? | Avoids context pollution and format drift

Once a multimodal system starts calling tools, stability gets noticeably harder.
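
A guarded tool call covering those four questions might look like this sketch. The registry shape, scope names, and `run` callback are assumptions, not a real framework:

```python
# Hedged sketch of a guarded tool call: explicit argument schema,
# scope restriction, and bounded retries.
import time

TOOLS = {
    "db_query": {
        "schema": {"query": str},            # args the model may pass, with types
        "allowed_scopes": {"analytics_ro"},  # restrict which DB the tool can touch
        "max_retries": 2,
    }
}

def call_tool(name, args, scope, run):
    spec = TOOLS[name]
    # Permission check before anything executes.
    if scope not in spec["allowed_scopes"]:
        raise PermissionError(f"{name}: scope {scope!r} not allowed")
    # Schema check: reject bad or missing arguments up front.
    for key, typ in spec["schema"].items():
        if not isinstance(args.get(key), typ):
            raise ValueError(f"{name}: bad or missing arg {key!r}")
    last_err = None
    for attempt in range(spec["max_retries"] + 1):
        try:
            # A real implementation would also enforce a timeout here.
            return run(**args)
        except TimeoutError as e:
            last_err = e
            time.sleep(0.1 * (attempt + 1))  # simple backoff between retries
    raise last_err
```

The useful property is that schema and permission failures surface as explicit exceptions instead of silently polluting the model's context.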


UX: don't forget to show users what the system actually saw

This one's important.

Many users upload a file and have no idea what the system actually extracted.

Better multimodal UX typically shows:

  • upload status
  • file / duration / page limits
  • extracted summary preview
  • citation to page / timestamp
  • error on unsupported input

If users can't tell what the system "saw," they won't trust it.
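
One way to make the checklist concrete is a single payload the backend returns to the frontend after extraction; every field name here is illustrative:

```python
# Hypothetical extraction report shown to the user so they can verify
# what the system actually "saw".
def extraction_report(source_id, status, preview, citations, limits, error=None):
    return {
        "source": source_id,      # which upload this refers to
        "status": status,         # "done", "processing", "failed"
        "limits": limits,         # e.g. {"max_pages": 200, "max_minutes": 60}
        "preview": preview,       # extracted-summary preview the user can check
        "citations": citations,   # [(page or timestamp, snippet)]
        "error": error,           # explicit message for unsupported input
    }
```

Returning an explicit `error` for unsupported input, rather than a silent empty result, is what keeps the failure visible to the user.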


Testing can't rely on text samples alone

Multimodal eval needs to be split by modality at minimum.

Test item | Why test it separately
noisy image / OCR | real images won't always be clean
noisy audio | meeting environments are often terrible
long video | cost and latency spike unexpectedly
chart / slide understanding | structural misreads are very common
tool-augmented output | once tools are involved, failure modes multiply

Testing a multimodal feature only with ideal samples is basically not testing at all.
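
A minimal per-modality harness for the table above could look like this; the case files are placeholders, and `pipeline` stands in for your real feature:

```python
# Per-modality eval suites: each modality gets its own adversarial cases
# instead of one shared "clean" set.
EVAL_SUITES = {
    "image_ocr": ["blurry_scan.png", "rotated_receipt.jpg"],
    "audio": ["subway_recording.wav", "overlapping_speakers.wav"],
    "video_long": ["two_hour_meeting.mp4"],
    "charts": ["stacked_bar_chart.png"],
}

def run_suite(pipeline, suites):
    """pipeline(modality, case) -> bool. Returns pass rate per modality,
    so a regression in one path is visible without averaging it away."""
    results = {}
    for modality, cases in suites.items():
        passed = sum(1 for case in cases if pipeline(modality, case))
        results[modality] = passed / len(cases)
    return results
```

Reporting per-modality pass rates (rather than one blended score) is the point: a 95% overall score can hide a 0% long-video path.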


The most underestimated costs

Multimodal costs aren't just about model calls.

They also include:

  • OCR / STT preprocessing costs
  • Large file storage and transfer costs
  • More complex logging and debugging costs
  • Higher eval and human review costs

If you only budget for LLM API calls, you'll significantly underestimate the total.
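
A back-of-envelope model of that total makes the gap visible; every rate below is a made-up placeholder, not a real price:

```python
# Toy monthly cost model: LLM calls are only one line item.
def monthly_cost(docs_per_day, llm_per_doc, ocr_per_doc,
                 storage_per_doc, review_rate, review_per_doc):
    # review_rate: fraction of docs that get human review.
    per_doc = (llm_per_doc + ocr_per_doc + storage_per_doc
               + review_rate * review_per_doc)
    return docs_per_day * 30 * per_doc
```

With placeholder rates where OCR costs twice the LLM call and 5% of docs get human review, the non-LLM share already dominates the per-document cost.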


Practice

Take a multimodal feature you're building and answer these 4 questions first:

  1. What modality is the real input?
  2. Would converting to text first be more stable?
  3. Which metadata fields need to be preserved?
  4. How will you handle citation and error UX?

Get these 4 answers nailed down before you start wiring up models and tools.

📚 Related Resources

❓ FAQ

The most frequently searched questions about this chapter's topic; click to expand the answers.

Should a multimodal system feed every input directly to one big model?

No. The more stable sequence is to break down by modality first, then decide the path. Simple images can go to a vision LLM; high-quality OCR should run OCR first and hand off to a text LLM; meeting recordings get STT first, then summarization; long video uses transcript + scene summaries rather than force-feeding frame by frame. Multimodal systems that actually run stably do so because they preprocess first, turning raw pixels / audio into text with metadata, instead of throwing raw data at the model and letting it sort things out.

What metadata should multimodal RAG preserve, at minimum?

At least five things: filename / source id, page / frame / timestamp, chunk order, extraction method, and confidence (if available). Without these, citation, review, and debugging all fall apart: when a user looks at an answer and asks "which page of the PDF is this from?" and you can't tell them, you lose their trust. What's actually retrievable for images / video is often not raw pixels but OCR text, captions, scene summaries, and timestamped transcripts, which is why metadata is the lifeline.

Which part of a multimodal system's real cost is most easily underestimated?

Preprocessing plus long-tail operations. The LLM call price is only the surface: OCR / STT preprocessing, large-file storage and transfer, more complex logging and debugging, and higher eval and human-review costs usually add up to more than the model calls themselves. A long-video transcript pipeline may cost cents per run, but ten thousand runs a day, plus failure retries and annotation review, puts the monthly cost in a completely different bracket. Teams that budget only for LLM tokens almost always overrun.

What's the most easily overlooked part of multimodal UX?

Telling users what the system saw. After uploading a PDF / image / recording, users have no idea what you actually extracted, and that black box destroys trust directly. At minimum, show: upload status, file / duration / page limits, an extracted-summary preview, citations to page / timestamp, and a clear error for unsupported input. Letting users verify that what you saw is what they meant is the baseline of multimodal UX.

Why can't multimodal features be tested with clean samples only?

The real world has no clean samples. Evals must be run per modality: noisy image / OCR, noisy audio (meeting environments are usually terrible), long video (cost and latency spike without warning), chart / slide understanding (structure is the easiest thing to misread), and tool-augmented output (failure modes multiply the moment tools are involved). A multimodal system tested only on ideal samples gets hammered by real user input in its first week: blurry scans, audio recorded on the subway, and 200-page PDFs are the norm.