12

Multimodal and Toolchains

⏱️ 40 minutes

Multimodal and tool integrations let your AI apps handle text, images, audio, video, and external actions.

1) Modalities & Use Cases

  • Vision: image captioning, OCR, UI understanding, chart/slide Q&A.
  • Audio: speech-to-text, text-to-speech, meeting notes.
  • Video: scene detection, transcript + summarization.
  • Files: PDFs, spreadsheets, code repos.
  • Tools: web search, DB queries, code execution, automation APIs.

2) Input Handling

  • Normalize: extract text (OCR), downsample images, trim audio.
  • Chunking: for long transcripts, chunk + timestamp; for images, limit size/frames.
  • Metadata: keep filename, page/frame/time for citations.
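
The chunking and metadata bullets above can be sketched as follows. This is a minimal illustration, assuming the speech-to-text step already yields timed segments; the names `Segment`, `Chunk`, `chunk_transcript`, and the `max_chars` budget are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    text: str


@dataclass
class Chunk:
    source: str   # filename kept for citations
    start: float  # start time of the first segment in the chunk
    end: float    # end time of the last segment in the chunk
    text: str


def chunk_transcript(segments: List[Segment], source: str, max_chars: int = 200) -> List[Chunk]:
    """Group consecutive segments into chunks of roughly max_chars, keeping time ranges."""
    chunks: List[Chunk] = []
    buf: List[Segment] = []
    size = 0

    def flush() -> None:
        # Emit the buffered segments as one chunk with its time range.
        chunks.append(Chunk(source, buf[0].start, buf[-1].end, " ".join(s.text for s in buf)))

    for seg in segments:
        # Start a new chunk when adding this segment would exceed the budget.
        if buf and size + len(seg.text) > max_chars:
            flush()
            buf, size = [], 0
        buf.append(seg)
        size += len(seg.text)
    if buf:
        flush()
    return chunks
```

Each chunk carries its filename and time range, so a later answer can cite "meeting.mp3, 00:05 to 00:12" rather than the whole file.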

3) Model Selection

  • Vision-capable LLMs (e.g., GPT-4o, Gemini) for light image understanding.
  • Dedicated OCR or speech models when quality matters; feed their text output to the LLM.
  • Latency vs quality: choose a smaller “flash”/“mini” tier for fast feedback and a larger “pro” tier for depth.
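
One way to make these selection rules explicit is a small routing function. The sketch below is only illustrative: the route labels (`dedicated_then_llm`, `fast_tier`, `deep_tier`) and the input flags are assumptions, and a real system would map them to whichever concrete models it uses.

```python
def choose_route(modality: str, needs_high_accuracy: bool, interactive: bool) -> str:
    """Pick a processing route from the modality and the latency/quality needs.

    All labels returned here are hypothetical placeholders, not model names.
    """
    # Scans and audio that need high accuracy go through a dedicated
    # OCR/speech model first; the LLM then works on the extracted text.
    if modality in {"scan", "audio"} and needs_high_accuracy:
        return "dedicated_then_llm"
    # Interactive use favors a small, fast tier.
    if interactive:
        return "fast_tier"
    # Offline or depth-oriented work can afford the larger tier.
    return "deep_tier"
```

Keeping the decision in one function makes it straightforward to log which route each request took and to adjust the rules later.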

4) Multimodal RAG

  • Store text + vector embeddings; for images, store captions/OCR + embeddings.
  • For video, derive transcript + scene summaries; index both.
  • Retrieval returns text snippets with source IDs/timecodes; include in prompt.
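
The indexing and retrieval flow above can be sketched with a toy in-memory index. The bag-of-words `embed` function is a deliberate stand-in for a real embedding model, and `Record`, `retrieve`, and `build_context` are hypothetical names chosen for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    source_id: str  # e.g. filename
    locator: str    # page, frame, or timecode used for citations
    kind: str       # "text", "image_caption", "video_scene", ...
    text: str       # caption, OCR output, transcript, or scene summary


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, records: List[Record], k: int = 2) -> List[Record]:
    """Return the k records whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r.text)), reverse=True)
    return ranked[:k]


def build_context(hits: List[Record]) -> str:
    """Format snippets with source IDs and locators for inclusion in the prompt."""
    return "\n".join(f"[{r.source_id} @ {r.locator}] {r.text}" for r in hits)
```

Because images and video are indexed through their captions, OCR text, and scene summaries, a single text query can surface all three kinds of source, each with a locator the model can cite.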

5) Tool Integration

  • Define tool schemas clearly; restrict domains/DBs; sanitize inputs.
  • Streaming tools: stream partial results (e.g., search hits) back to the model.
  • Safety: sandbox code exec; rate-limit external APIs; redact secrets.
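
As an illustration of the schema, restriction, and redaction points above, here is a minimal sketch for a hypothetical `fetch_page` tool. The schema layout follows the JSON-Schema-style shape that many function-calling APIs accept, but the tool name, the allowlist, and the secret pattern are all assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Hypothetical tool definition in a JSON-Schema-like shape.
FETCH_TOOL = {
    "name": "fetch_page",
    "description": "Fetch a documentation page from an approved domain.",
    "parameters": {
        "type": "object",
        "properties": {"url": {"type": "string", "description": "HTTPS URL to fetch"}},
        "required": ["url"],
    },
}

# Assumed allowlist; a real deployment would load this from configuration.
ALLOWED_DOMAINS = {"docs.example.com", "example.org"}

# Deliberately simple pattern for key=value style secrets in tool output.
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+")


def validate_fetch_args(args: dict) -> dict:
    """Reject model-supplied arguments that fall outside the allowed domains."""
    url = args.get("url")
    if not isinstance(url, str):
        raise ValueError("url must be a string")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"domain not allowed: {parsed.hostname}")
    return {"url": url}


def redact(text: str) -> str:
    """Mask secret-looking values before text is logged or returned to the model."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
```

Validation runs on the arguments the model produces, before the tool executes; redaction runs on what the tool returns.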

6) Output & UX

  • For audio: return both text and TTS audio if needed.
  • For images: return bounding boxes or region references, plus the source reference.
  • For video: link to timestamps; provide key moments list.
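
For the video case, a key-moments list with timestamp links might be produced as below. The `?t=<seconds>` query parameter is an assumed link format; the player in use may expect something different.

```python
from typing import List, Tuple


def format_timecode(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def key_moments(video_url: str, moments: List[Tuple[float, str]]) -> List[str]:
    """Build one line per moment, in chronological order, with a deep link."""
    lines = []
    for start, label in sorted(moments):
        # "?t=" is a hypothetical deep-link parameter.
        lines.append(f"{format_timecode(start)} {label} ({video_url}?t={int(start)})")
    return lines
```

For example, `key_moments("https://example.com/v/1", [(90, "Demo"), (5, "Intro")])` lists the intro at 00:00:05 before the demo at 00:01:30.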

7) Testing & Eval

  • Golden sets per modality: image Q&A, chart extraction, noisy audio.
  • Check for hallucinations: require citations to source pages/frames.
  • Measure latency and size: large files can blow up cost, so enforce limits.
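
A golden-set check with a citation requirement could look like the sketch below. It assumes answers cite sources in a `[source @ locator]` form; that format, and the `case` dictionary layout, are conventions invented for this example.

```python
import re
from typing import Dict, Set, Tuple

# Matches citations written as "[deck.pdf @ page 3]" (assumed format).
CITATION = re.compile(r"\[([^\[\]@]+) @ ([^\[\]]+)\]")


def cited_sources(answer: str) -> Set[Tuple[str, str]]:
    """Extract (source, locator) pairs from an answer."""
    return {(m.group(1).strip(), m.group(2).strip()) for m in CITATION.finditer(answer)}


def evaluate(case: Dict, answer: str) -> Dict[str, bool]:
    """Score one golden case: expected text present, and citations present and valid."""
    citations = cited_sources(answer)
    return {
        "has_expected_text": case["expected"].lower() in answer.lower(),
        "has_citation": bool(citations),
        # Every citation must point at a source the case allows.
        "citations_valid": bool(citations) and citations <= case["allowed_sources"],
    }
```

An answer with the right number but no citation fails `has_citation`, which separates a lucky guess from a grounded answer.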

8) Checklist

  • File limits: size, type, duration; user-facing errors when exceeded.
  • Preprocessing pipeline with retries and fallbacks (OCR/STT providers).
  • Logs include modality, size, processing steps, latency, cost.
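
The checklist items can be combined into a small sketch: an upload check that raises a user-facing error, and a provider chain with retries, fallback, and per-attempt logging. The limits, the `UploadError` class, and the provider callables are hypothetical; real values depend on the providers and budget involved.

```python
import time
from typing import Callable, Dict, List, Tuple

# Assumed per-type size limits in bytes.
LIMITS: Dict[str, int] = {"image/png": 10 * 1024 * 1024, "audio/mpeg": 25 * 1024 * 1024}


class UploadError(Exception):
    """Error whose message is safe to show to the user."""


def check_upload(mime_type: str, size_bytes: int) -> None:
    if mime_type not in LIMITS:
        raise UploadError(f"Unsupported file type: {mime_type}")
    if size_bytes > LIMITS[mime_type]:
        limit_mb = LIMITS[mime_type] // (1024 * 1024)
        raise UploadError(f"File is too large; the limit for {mime_type} is {limit_mb} MB")


def run_with_fallback(
    providers: List[Tuple[str, Callable[[str], str]]],
    payload: str,
    retries: int = 2,
    delay: float = 0.0,
) -> Tuple[str, List[dict]]:
    """Try each provider in order, retrying before falling back to the next one."""
    log: List[dict] = []
    for name, fn in providers:
        for attempt in range(1, retries + 1):
            started = time.perf_counter()
            try:
                result = fn(payload)
            except Exception as exc:  # any provider failure triggers a retry or fallback
                log.append({"provider": name, "attempt": attempt, "ok": False,
                            "error": str(exc), "latency_s": time.perf_counter() - started})
                time.sleep(delay)
                continue
            log.append({"provider": name, "attempt": attempt, "ok": True,
                        "latency_s": time.perf_counter() - started})
            return result, log
    raise RuntimeError("all providers failed")
```

The returned log records the provider, attempt, and latency for each call; modality, file size, and cost could be added to the same entries.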

📚 Related Resources