05
RAG Systems Introduction
RAG (Retrieval-Augmented Generation) lets models answer questions using your private data. It's a critical piece of most AI applications.
1) Concept & Use Cases
- Solves knowledge freshness and private data problems: the model "retrieves first, then answers."
- Good fit: enterprise knowledge base Q&A, document assistants, customer service FAQs, compliance search, codebase Q&A.
- Poor fit: tasks that need heavy reasoning rather than supporting documents; real-time or multimodal scenarios may need capabilities beyond retrieval.
2) End-to-End Pipeline
User question → Vectorize → Retrieve → Build context → LLM generates answer → (optional) Return citations/confidence
Core components: chunking, embedding, vector store, retrieval, reranking, prompt construction, generation, feedback loop.
3) Data Preparation & Chunking
- Cleaning: strip headers/footers, table of contents/watermarks, footnotes; preserve paragraph semantics.
- Chunking: recursive character splitting (200–400 tokens per chunk, 10–20% overlap); tables/code can use semantic blocks.
- Metadata: document name, section, page number, timestamp, source type — used for filtering and citation.
Python Example (LangChain)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,    # counts characters by default; pass a token counter via length_function for token-based sizing
    chunk_overlap=60,  # ~20% overlap preserves context across chunk boundaries
)
chunks = splitter.split_text(text)
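To make the metadata bullet concrete, here is a minimal dependency-free sketch that pairs each chunk with the metadata fields listed above. The field names (`doc`, `page`, `chunk_id`) and the fixed-window splitting are illustrative assumptions, not a prescribed schema:

```python
# A minimal sketch: attach source metadata to each chunk so it can be used
# for filtering and citations later. Field names are illustrative.
def chunk_with_metadata(text, doc_name, page, chunk_size=300, overlap=60):
    step = chunk_size - overlap
    records = []
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        records.append({
            "text": piece,
            "metadata": {"doc": doc_name, "page": page, "chunk_id": i},
        })
    return records

records = chunk_with_metadata("a" * 700, doc_name="faq.pdf", page=3)
```

In a real pipeline the metadata dict travels with the chunk into the vector store, so retrieval results can be filtered by document and cited by page.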
4) Embedding (Vectorization)
- Model selection: e.g., OpenAI text-embedding-3-*; balance cost against retrieval quality for your languages (e.g., Chinese/English).
- Dimensions: defaults are usually fine; index and query with the same model — vectors from different models live in incompatible spaces.
- Deduplication: hash chunk content or use stable IDs to prevent duplicate writes.
Python Example
from openai import OpenAI
client = OpenAI()
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks,
).data
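The deduplication bullet can be sketched with a content hash: normalize the text, hash it, and skip chunks whose hash was already written. The normalization rule (collapse whitespace, lowercase) is an assumption; pick whatever makes "duplicate" mean what you want:

```python
import hashlib

def content_id(text: str) -> str:
    """Stable ID from normalized chunk text, used to skip duplicate writes."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

seen = set()
unique_chunks = []
for chunk in ["Refund  policy", "refund policy", "Shipping info"]:
    cid = content_id(chunk)
    if cid not in seen:          # same normalized content -> same ID -> skipped
        seen.add(cid)
        unique_chunks.append(chunk)
```

Using the hash as the vector store's record ID also makes re-ingestion idempotent: writing the same chunk twice overwrites rather than duplicates.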
5) Vector Database Selection
- Managed: Pinecone, Weaviate Cloud — no ops overhead, good for production.
- Self-hosted: Chroma (lightweight), Weaviate, Milvus; watch out for persistence and backups.
- Selection criteria: latency, scalability, filter support (metadata filters), multi-tenant isolation.
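To see what the selection criteria mean in code, here is a toy in-memory store exposing the two operations worth benchmarking in any candidate: similarity search and metadata filtering. This is a teaching sketch, not a substitute for a real vector DB:

```python
import math

class TinyVectorStore:
    """Toy in-memory store: cosine similarity search plus metadata
    filtering, the two core operations to evaluate in a real vector DB."""
    def __init__(self):
        self.items = []  # list of (vector, text, metadata)

    def add(self, vector, text, metadata=None):
        self.items.append((vector, text, metadata or {}))

    def search(self, query, k=3, where=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        # Apply the metadata filter BEFORE ranking, as real stores do
        candidates = [
            (cosine(query, v), t, m) for v, t, m in self.items
            if not where or all(m.get(key) == val for key, val in where.items())
        ]
        return sorted(candidates, key=lambda c: -c[0])[:k]

db = TinyVectorStore()
db.add([1.0, 0.0], "refund policy", {"tenant": "acme"})
db.add([0.9, 0.1], "refund steps", {"tenant": "other"})
hits = db.search([1.0, 0.0], k=2, where={"tenant": "acme"})
```

A linear scan like this is fine for thousands of vectors; the managed and self-hosted options above exist because approximate-nearest-neighbor indexes keep latency low at millions.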
6) Retrieval Strategies
- Similarity search (k=3–6) is the baseline.
- Filtering: filter by document type/time/tenant to reduce false positives.
- Hybrid retrieval: BM25 + vector; or rerank to boost relevance (e.g., Cohere Rerank / bge-reranker).
- Dedup & rerank: merge same-source passages to avoid redundancy.
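One common way to combine BM25 and vector results, as the hybrid bullet suggests, is Reciprocal Rank Fusion (RRF): each list votes by rank, so no score calibration between the two retrievers is needed. A minimal sketch (the doc IDs are made up):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc IDs.
    score(d) = sum over lists of 1 / (k + rank); k=60 is the usual default."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d2"]   # keyword ranking
vector_hits = ["d1", "d4", "d3"] # semantic ranking
merged = rrf_merge([bm25_hits, vector_hits])
```

Documents that rank well in both lists (here `d1` and `d3`) float to the top; a cross-encoder reranker can then re-score just the fused top-k.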
7) Context Construction & Prompting
- Format: list retrieved passages with numbered IDs and sources; instruct "answer only based on the provided context; if unknown, say unknown."
- Length control: keep context tokens to 1/2–2/3 of the model's limit; leave room for output.
- Citations: require answers to include passage numbers for traceability.
Example Prompt Fragment
You are a knowledge base assistant. Answer ONLY based on the
"provided passages" below. Do not make things up.
Provided passages:
[1] Passage content...
[2] Passage content...
If the answer is not found, reply "Not found in the provided passages."
Include citation numbers in your answer, e.g., [1][2].
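The length-control bullet above can be enforced with a greedy budget: keep passages in retrieval order until the token budget is spent. The ~4-characters-per-token ratio below is a crude English heuristic (an assumption); swap in a real tokenizer such as tiktoken for accurate counts:

```python
def fit_context(passages, max_context_tokens, chars_per_token=4):
    """Greedily keep passages, in retrieval order, until a rough token
    budget is exhausted. chars_per_token=4 is a coarse heuristic."""
    budget_chars = max_context_tokens * chars_per_token
    kept, used = [], 0
    for i, text in enumerate(passages, start=1):
        entry = f"[{i}] {text}"            # numbered, as the prompt format requires
        if used + len(entry) > budget_chars:
            break
        kept.append(entry)
        used += len(entry)
    return "\n".join(kept)

ctx = fit_context(["a" * 100, "b" * 100, "c" * 100], max_context_tokens=60)
```

Dropping from the tail works because retrieval order is relevance order; a fancier variant truncates the last passage instead of dropping it.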
8) Generation & Anti-Hallucination
- System constraint: only use the provided context, don't invent facts.
- JSON/structured output: makes it easier for frontend or downstream consumption.
- Confidence: return retrieval scores and sources; flag low scores as "low confidence."
- Refusal: when unknown, return "Not found" — don't speculate.
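The structured-output and confidence bullets can be combined in a small wrapper that packages the model's answer with sources and a confidence flag derived from retrieval scores. The threshold of 0.3 and the field names are illustrative assumptions:

```python
def package_answer(answer_text, hits, low_score_threshold=0.3):
    """Wrap the model's answer with sources and a confidence flag, so the
    frontend can render citations or surface a refusal.
    `hits` is a list of (retrieval_score, source_id) pairs."""
    top_score = max((score for score, _ in hits), default=0.0)
    return {
        "answer": answer_text,
        "sources": [source for _, source in hits],
        "confidence": "low" if top_score < low_score_threshold else "ok",
        "refused": answer_text.strip() == "Not found in the provided passages.",
    }

result = package_answer("Refunds take 5 business days. [1]", [(0.82, "faq.pdf#3")])
```

Downstream code can then branch on `refused` and `confidence` instead of parsing free text.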
9) Incremental Updates
- Periodic/real-time sync: watch for file changes, incrementally update the vector store.
- Versioning: tag documents with version numbers; include versions in retrieval results for traceability.
- Expiration strategy: retire old versions, clean up stale vectors.
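The sync and expiration bullets reduce to a diff: hash each current document, compare against what is indexed, and compute what to re-embed and what to delete. A minimal sketch (the hash-per-document granularity is an assumption; production systems often diff per chunk):

```python
import hashlib

def plan_sync(current_docs, indexed_hashes):
    """Diff current documents against the index.
    current_docs: {doc_id: text}; indexed_hashes: {doc_id: sha256 hex}.
    Returns (to_upsert, to_delete) so only changed content is re-embedded."""
    current_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    to_upsert = [d for d, h in current_hashes.items() if indexed_hashes.get(d) != h]
    to_delete = [d for d in indexed_hashes if d not in current_hashes]
    return to_upsert, to_delete

docs = {"faq.md": "v2 text", "guide.md": "same"}
indexed = {
    "faq.md": hashlib.sha256(b"v1 text").hexdigest(),   # changed
    "guide.md": hashlib.sha256(b"same").hexdigest(),    # unchanged
    "old.md": "deadbeef",                               # removed upstream
}
upserts, deletes = plan_sync(docs, indexed)
```

Run this on a schedule or from a file-watcher; stale vectors disappear via `to_delete`, which is the expiration strategy in practice.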
10) Performance & Cost
- Control chunk size and retrieval k to reduce context length.
- Pre-summarize: for very long documents, summarize first, then retrieve against summaries.
- Caching: cache popular Q&A pairs; pre-generate answers for common queries.
- Infrastructure: place the vector store near the model's region to reduce latency.
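The caching bullet can start as simple as an exact-match dictionary keyed by the normalized query, sketched below. This only catches verbatim repeats; a semantic cache (matching on query embeddings) is the usual next step:

```python
import hashlib

class AnswerCache:
    """Tiny exact-match cache for popular questions: normalize the query,
    hash it, and reuse the stored answer instead of re-running retrieval."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query):
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, answer):
        self._store[self._key(query)] = answer

cache = AnswerCache()
cache.put("What's the refund  process?", "See policy [1].")
hit = cache.get("what's the refund process?")  # case/whitespace-insensitive hit
```

Remember to invalidate cached answers when the underlying documents change (tie it to the incremental-update step above).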
11) Evaluation & Alignment
- Offline eval set: prepare Q&A pairs, check relevance, accuracy, and citation correctness.
- Online feedback: collect thumbs up/down on misses; feed back to improve retrieval or chunking.
- Metrics: hit rate, no-answer rate, average citation count, P95 latency, cost.
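Of the metrics listed, hit rate is the simplest to implement: the fraction of eval questions whose gold passage appears in the top-k retrieved IDs. A sketch, with a stand-in retrieval function:

```python
def hit_rate(eval_set, retrieve, k=4):
    """Fraction of questions where any gold passage ID appears in the
    top-k retrieved IDs. eval_set: list of (question, set_of_gold_ids);
    retrieve: your retrieval function, returning ranked passage IDs."""
    hits = 0
    for question, gold_ids in eval_set:
        retrieved = retrieve(question)[:k]
        if any(g in retrieved for g in gold_ids):
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0

# Stand-in retriever for illustration; plug in your real one
fake_retrieve = lambda q: ["p1", "p7", "p9", "p2"]
score = hit_rate([("q1", {"p7"}), ("q2", {"p5"})], fake_retrieve)
```

Tracking hit rate separately from answer accuracy tells you whether failures come from retrieval or from generation, which is exactly the feedback the chunking and reranking steps need.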
12) Security & Multi-Tenancy
- Tenant isolation: metadata includes tenant_id; enforce filtering at retrieval time.
- Access control: filter visible documents by user/role.
- Data sensitivity: redact sensitive fields; don't store raw text in logs.
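The isolation bullet has one rule worth encoding: the tenant filter is injected server-side, never taken from client input. A sketch, with a toy backend standing in for your vector store's filtered-search call:

```python
def tenant_search(db_query, tenant_id, query, k=4):
    """Server-side guard: the tenant filter is added here, from the
    authenticated session, so one tenant can never read another's docs.
    `db_query` stands in for your vector store's filtered search."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    return db_query(query, k=k, where={"tenant_id": tenant_id})

# Toy backend to show the guard in action
corpus = [("refunds", {"tenant_id": "acme"}), ("pricing", {"tenant_id": "beta"})]
def fake_db_query(query, k, where):
    return [t for t, m in corpus if m["tenant_id"] == where["tenant_id"]][:k]

docs = tenant_search(fake_db_query, "acme", "refund policy")
```

Role-based document visibility works the same way: extend the `where` clause with the user's allowed groups before the query ever reaches the store.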
13) Minimal Working Example
# 1) Chunk + embed + store (simplified; `db` stands in for your vector store client)
chunks = splitter.split_text(text)
vectors = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks,
).data
db.add(texts=chunks, embeddings=[v.embedding for v in vectors])

# 2) Query
query = "What's the refund process?"
q_emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding
docs = db.similarity_search_by_vector(q_emb, k=4)

# 3) Build the prompt (use "\n", not "\\n", so passages land on separate lines)
context = "\n".join(f"[{i+1}] {d.page_content}" for i, d in enumerate(docs))
prompt = f"""You are a customer service assistant. Answer ONLY based on the
provided passages. Do not fabricate.
Provided passages:
{context}
Question: {query}
If the answer is not found, reply "Not found in the provided passages." Include citation numbers."""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
14) Exercises
- Build a local vector store (Chroma) with 3 articles, implement retrieval Q&A with citation numbers in the output.
- Add reranking (e.g., bge-reranker or Cohere Rerank) and compare answer quality.
- Add "no answer" detection logic to avoid false answers, and log retrieval scores and latency.