12

Multimodal and Toolchains

⏱️ 40 minutes

Multimodal and tool integrations let your AI apps handle text, images, audio, video, and external actions.

1) Modalities & Use Cases

Vision: image captioning, OCR, UI understanding, chart/slide Q&A.
Audio: speech-to-text, text-to-speech, meeting notes.
Video: scene detection, transcript + summarization.
Files: PDFs, spreadsheets, code repos.
Tools: web search, DB queries, code execution, automation APIs.

2) Input Handling

Normalize: extract text (OCR), downsample images, trim audio.
Chunking: for long transcripts, chunk + timestamp; for images, limit size/frames.
Metadata: keep filename, page/frame/time for citations.
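
The chunking and metadata points above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `Segment` and `Chunk` types, the `chunk_transcript` helper, and the `max_chars` budget are all hypothetical names introduced here, and the input is assumed to be a list of already-timestamped speech-to-text segments.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


@dataclass
class Chunk:
    source: str   # filename kept for citations
    start: float  # start time of the first segment in the chunk
    end: float    # end time of the last segment in the chunk
    text: str


def _flush(segs, source):
    # Merge a run of segments into one chunk that keeps its time range.
    return Chunk(source, segs[0].start, segs[-1].end,
                 " ".join(s.text for s in segs))


def chunk_transcript(segments, source, max_chars=200):
    """Group consecutive segments into chunks of at most max_chars characters.

    A single segment longer than max_chars still becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for seg in segments:
        # Length of the chunk text if this segment were appended.
        joined_len = size + len(seg.text) + (1 if current else 0)
        if current and joined_len > max_chars:
            chunks.append(_flush(current, source))
            current, joined_len = [], len(seg.text)
        current.append(seg)
        size = joined_len
    if current:
        chunks.append(_flush(current, source))
    return chunks
```

Each chunk carries the filename and time range, so a later answer can cite "meeting.wav, 0s to 4s" rather than the file as a whole.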

3) Model Selection

Vision-capable LLMs (e.g., GPT-4o, Gemini) for light image understanding.
Dedicated OCR or speech models when quality matters; feed their text output to the LLM.
Latency vs quality: choose “flash”/“mini” for fast feedback, “pro” for depth.
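
One way to make these trade-offs explicit is a small routing function. The sketch below is only illustrative: the `Request` type, the step names (`dedicated_ocr`, `dedicated_stt`, `llm:fast`, `llm:deep`), and the 2000 ms threshold are assumptions made for this example, not values taken from any provider.

```python
from dataclasses import dataclass

# Hypothetical mapping from modality to a dedicated preprocessing step.
DEDICATED_STEP = {"image": "dedicated_ocr", "audio": "dedicated_stt"}

# Hypothetical cut-off: below this budget, prefer a "flash"/"mini" class model.
FAST_TIER_BUDGET_MS = 2000


@dataclass(frozen=True)
class Request:
    modality: str              # "image", "audio", or "text"
    needs_high_accuracy: bool  # True when OCR/STT quality matters
    latency_budget_ms: int


def plan(req):
    """Return the ordered processing steps for a request."""
    steps = []
    # Use a dedicated model first only when quality is required.
    if req.needs_high_accuracy and req.modality in DEDICATED_STEP:
        steps.append(DEDICATED_STEP[req.modality])
    # Pick the LLM tier from the latency budget.
    tier = "fast" if req.latency_budget_ms < FAST_TIER_BUDGET_MS else "deep"
    steps.append(f"llm:{tier}")
    return steps
```

For example, `plan(Request("image", True, 5000))` yields a dedicated OCR step followed by the deeper model, while a quick image question with a 500 ms budget goes straight to the fast tier.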

4) Multimodal RAG

Store text + vector embeddings; for images, store captions/OCR + embeddings.
For video, derive transcript + scene summaries; index both.
Retrieval returns text snippets with source IDs/timecodes; include in prompt.
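
The flow above can be sketched end to end with a toy index. Note the assumptions: `embed` here is a bag-of-words stand-in for a real embedding model, and `Record`, `retrieve`, and `build_prompt` are hypothetical helpers. The source ID formats (`deck.pdf#p3`, `talk.mp4@00:01:15`) are also just one possible convention.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    source_id: str  # page reference or timecode used for citations
    kind: str       # "text", "image_caption", or "scene_summary"
    text: str       # the text that gets embedded and shown to the model


def embed(text):
    # Toy bag-of-words vector; replace with a real embedding model in practice.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, records, k=2):
    """Return the k records most similar to the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r.text)), reverse=True)
    return ranked[:k]


def build_prompt(query, hits):
    """Put retrieved snippets, tagged with their source IDs, into the prompt."""
    context = "\n".join(f"[{r.source_id}] {r.text}" for r in hits)
    return ("Answer using only the sources below and cite their IDs.\n"
            f"{context}\n\nQuestion: {query}")
```

Because image captions and scene summaries are stored as text next to their source IDs, the same retrieval path serves every modality.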

5) Tool Integration

Define tool schemas clearly; restrict domains/DBs; sanitize inputs.
Streaming tools: stream partial results (e.g., search hits) back to the model.
Safety: sandbox code exec; rate-limit external APIs; redact secrets.
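
A minimal sketch of a tool definition with input checks might look like the following. The schema uses the JSON-Schema-style shape that many function-calling APIs accept, but exact field names vary by provider, so treat it as an assumption. The `fetch_url` tool, the `ALLOWED_DOMAINS` set, and the `validate_fetch_args` and `redact` helpers are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical tool schema; adapt field names to your provider's format.
FETCH_TOOL = {
    "name": "fetch_url",
    "description": "Fetch a page from an approved documentation domain.",
    "parameters": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}

# Hypothetical allowlist restricting which domains the tool may reach.
ALLOWED_DOMAINS = {"example.com", "docs.example.com"}


def validate_fetch_args(args):
    """Validate model-supplied arguments before the tool runs."""
    allowed_keys = set(FETCH_TOOL["parameters"]["properties"])
    unknown = set(args) - allowed_keys
    if unknown:
        raise ValueError(f"unexpected arguments: {sorted(unknown)}")
    url = args.get("url")
    if not isinstance(url, str):
        raise ValueError("url must be a string")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"domain not allowed: {parsed.hostname}")
    return url


def redact(text, secrets):
    """Replace known secret values before text is logged or sent to the model."""
    for secret in secrets:
        if secret:
            text = text.replace(secret, "[REDACTED]")
    return text
```

Validation runs on the model's arguments, not on the user's message, because the model is the component that actually fills in the tool call.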

6) Output & UX

For audio: return both text and TTS audio if needed.
For images: return bounding boxes or region references, and cite the source image.
For video: link to timestamps; provide key moments list.
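
For the video case, a key-moments list with timestamp links could be produced roughly like this. The `#t=<seconds>` suffix follows the temporal syntax of the W3C Media Fragments URI specification. Whether a particular player honors it is an assumption, and `format_timestamp` and `key_moments` are hypothetical helpers.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def key_moments(moments, video_url):
    """Build display lines from (seconds, label) pairs, ordered by time.

    Each line links to the moment using a media-fragment style suffix.
    """
    lines = []
    for sec, label in sorted(moments):
        link = f"{video_url}#t={int(sec)}"
        lines.append(f"{format_timestamp(sec)} {label} ({link})")
    return lines
```

Calling `key_moments([(75, "Demo starts")], "https://example.com/talk.mp4")` gives a single line beginning with `00:01:15 Demo starts`.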

7) Testing & Eval

Golden sets per modality: image Q&A, chart extraction, noisy audio.
Check hallucinations: require citations to source pages/frames.
Measure latency and size: large files can blow up cost, so enforce limits.
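
The citation requirement can be checked automatically. The sketch below assumes answers cite sources in square brackets such as `[deck.pdf#p3]`. That convention and the `check_citations` and `run_golden_set` helpers are assumptions for illustration, with `answer_fn` standing in for whatever calls your model.

```python
import re

# Matches bracketed source IDs such as [deck.pdf#p3].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer, allowed_sources):
    """Report whether an answer cites sources, and which cited IDs are unknown."""
    cited = set(CITATION.findall(answer))
    return {
        "has_citation": bool(cited),
        "unknown": sorted(cited - set(allowed_sources)),
    }


def run_golden_set(cases, answer_fn):
    """Run each golden case and collect the ones that fail the citation check.

    cases: list of dicts with "question" and "sources" keys.
    answer_fn: callable taking a question and returning the model's answer.
    """
    failures = []
    for case in cases:
        report = check_citations(answer_fn(case["question"]), case["sources"])
        if not report["has_citation"] or report["unknown"]:
            failures.append((case["question"], report))
    return failures
```

An answer that cites a page that was never retrieved shows up under `unknown`, which is a cheap signal for a hallucinated reference.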

8) Checklist

File limits: size, type, duration; user-facing errors when exceeded.
Preprocessing pipeline with retries and fallbacks (OCR/STT providers).
Logs include modality, size, processing steps, latency, cost.
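
The checklist items can be combined into a small preprocessing guard. Everything named here is hypothetical: the `MAX_BYTES` limits, the `UploadError` type, and the `check_upload` and `run_with_fallback` helpers are illustrative choices, and the provider callables stand in for real OCR or STT clients.

```python
import logging
import time

log = logging.getLogger("preprocess")

# Hypothetical per-type size limits in bytes.
MAX_BYTES = {
    "image/png": 10 * 1024 * 1024,
    "audio/wav": 25 * 1024 * 1024,
    "application/pdf": 20 * 1024 * 1024,
}


class UploadError(ValueError):
    """Raised with a message that is safe to show to the user."""


def check_upload(mime, size_bytes, duration_s=None, max_duration_s=600):
    """Reject files that exceed the type, size, or duration limits."""
    if mime not in MAX_BYTES:
        raise UploadError(f"Unsupported file type: {mime}")
    if size_bytes > MAX_BYTES[mime]:
        limit_mb = MAX_BYTES[mime] // (1024 * 1024)
        raise UploadError(f"File too large: limit is {limit_mb} MB")
    if duration_s is not None and duration_s > max_duration_s:
        raise UploadError(f"Recording too long: limit is {max_duration_s} seconds")


def run_with_fallback(providers, payload, retries=2, delay_s=0.0):
    """Try each (name, callable) provider in order, retrying before falling back."""
    last_error = None
    for name, fn in providers:
        for attempt in range(1, retries + 1):
            start = time.perf_counter()
            try:
                result = fn(payload)
                latency_ms = (time.perf_counter() - start) * 1000
                log.info("provider=%s attempt=%d latency_ms=%.1f",
                         name, attempt, latency_ms)
                return name, result
            except Exception as exc:  # broad on purpose: any failure triggers retry
                last_error = exc
                log.warning("provider=%s attempt=%d failed: %s", name, attempt, exc)
                time.sleep(delay_s)
    raise RuntimeError("all providers failed") from last_error
```

Returning the provider name alongside the result lets the log record which fallback actually served each request.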

📚 Related Resources

Gemini API Quickstart