05
RAG Systems Introduction
RAG (Retrieval-Augmented Generation) lets models answer questions using your private data. It's a critical piece of most AI applications.
1) Concept & Use Cases
- Solves knowledge freshness and private data problems: the model "retrieves first, then answers."
- Good fit: enterprise knowledge base Q&A, document assistants, customer service FAQs, compliance search, codebase Q&A.
- Poor fit: tasks that need heavy reasoning rather than supporting documents; real-time or multimodal scenarios may need capabilities beyond retrieval.
2) End-to-End Pipeline
User question → Vectorize → Retrieve → Build context → LLM generates answer → (optional) Return citations/confidence
Core components: chunking, embedding, vector store, retrieval, reranking, prompt construction, generation, feedback loop.
3) Data Preparation & Chunking
- Cleaning: strip headers/footers, table of contents/watermarks, footnotes; preserve paragraph semantics.
- Chunking: recursive character splitting (200–400 tokens per chunk, 10–20% overlap); tables/code can use semantic blocks.
- Metadata: document name, section, page number, timestamp, source type — used for filtering and citation.
Python Example (LangChain)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,    # counts characters by default; pass a token counter via length_function for token-based sizing
    chunk_overlap=60,  # ~20% overlap preserves context across chunk boundaries
)
chunks = splitter.split_text(text)
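To make the metadata bullet concrete, here is a minimal dependency-free sketch that pairs each chunk with the metadata fields listed above. The field names (`doc`, `page`, `chunk_id`) and the fixed-window splitting are illustrative assumptions, not a prescribed schema:

```python
# A minimal sketch: attach source metadata to each chunk so it can be used
# for filtering and citations later. Field names are illustrative.
def chunk_with_metadata(text, doc_name, page, chunk_size=300, overlap=60):
    step = chunk_size - overlap
    records = []
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        records.append({
            "text": piece,
            "metadata": {"doc": doc_name, "page": page, "chunk_id": i},
        })
    return records

records = chunk_with_metadata("a" * 700, doc_name="faq.pdf", page=3)
```

In a real pipeline the metadata dict travels with the chunk into the vector store, so retrieval results can be filtered by document and cited by page.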
4) Embedding (Vectorization)
- Model selection: e.g., OpenAI text-embedding-3-*; balance cost against retrieval quality for your languages (e.g., Chinese/English).
- Dimensions: defaults are usually fine; index and query with the same model — vectors from different models live in incompatible spaces.
- Deduplication: hash chunk content or use stable IDs to prevent duplicate writes.
Python Example
from openai import OpenAI
client = OpenAI()
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks,
).data
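The deduplication bullet can be sketched with a content hash: normalize the text, hash it, and skip chunks whose hash was already written. The normalization rule (collapse whitespace, lowercase) is an assumption; pick whatever makes "duplicate" mean what you want:

```python
import hashlib

def content_id(text: str) -> str:
    """Stable ID from normalized chunk text, used to skip duplicate writes."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

seen = set()
unique_chunks = []
for chunk in ["Refund  policy", "refund policy", "Shipping info"]:
    cid = content_id(chunk)
    if cid not in seen:          # same normalized content -> same ID -> skipped
        seen.add(cid)
        unique_chunks.append(chunk)
```

Using the hash as the vector store's record ID also makes re-ingestion idempotent: writing the same chunk twice overwrites rather than duplicates.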
5) Vector Database Selection
- Managed: Pinecone, Weaviate Cloud — no ops overhead, good for production.
- Self-hosted: Chroma (lightweight), Weaviate, Milvus; watch out for persistence and backups.
- Selection criteria: latency, scalability, filter support (metadata filters), multi-tenant isolation.
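To see what the selection criteria mean in code, here is a toy in-memory store exposing the two operations worth benchmarking in any candidate: similarity search and metadata filtering. This is a teaching sketch, not a substitute for a real vector DB:

```python
import math

class TinyVectorStore:
    """Toy in-memory store: cosine similarity search plus metadata
    filtering, the two core operations to evaluate in a real vector DB."""
    def __init__(self):
        self.items = []  # list of (vector, text, metadata)

    def add(self, vector, text, metadata=None):
        self.items.append((vector, text, metadata or {}))

    def search(self, query, k=3, where=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        # Apply the metadata filter BEFORE ranking, as real stores do
        candidates = [
            (cosine(query, v), t, m) for v, t, m in self.items
            if not where or all(m.get(key) == val for key, val in where.items())
        ]
        return sorted(candidates, key=lambda c: -c[0])[:k]

db = TinyVectorStore()
db.add([1.0, 0.0], "refund policy", {"tenant": "acme"})
db.add([0.9, 0.1], "refund steps", {"tenant": "other"})
hits = db.search([1.0, 0.0], k=2, where={"tenant": "acme"})
```

A linear scan like this is fine for thousands of vectors; the managed and self-hosted options above exist because approximate-nearest-neighbor indexes keep latency low at millions.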
6) Retrieval Strategies
- Similarity search (k=3–6) is the baseline.
- Filtering: filter by document type/time/tenant to reduce false positives.
- Hybrid retrieval: BM25 + vector; or rerank to boost relevance (e.g., Cohere Rerank / bge-reranker).
- Dedup & rerank: merge same-source passages to avoid redundancy.
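One common way to combine BM25 and vector results, as the hybrid bullet suggests, is Reciprocal Rank Fusion (RRF): each list votes by rank, so no score calibration between the two retrievers is needed. A minimal sketch (the doc IDs are made up):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc IDs.
    score(d) = sum over lists of 1 / (k + rank); k=60 is the usual default."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d2"]   # keyword ranking
vector_hits = ["d1", "d4", "d3"] # semantic ranking
merged = rrf_merge([bm25_hits, vector_hits])
```

Documents that rank well in both lists (here `d1` and `d3`) float to the top; a cross-encoder reranker can then re-score just the fused top-k.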
7) Context Construction & Prompting
- Format: list retrieved passages with numbered IDs and sources; instruct "answer only based on the provided context; if unknown, say unknown."
- Length control: keep context tokens to 1/2–2/3 of the model's limit; leave room for output.
- Citations: require answers to include passage numbers for traceability.
Example Prompt Fragment
You are a knowledge base assistant. Answer ONLY based on the
"provided passages" below. Do not make things up.
Provided passages:
[1] Passage content...
[2] Passage content...
If the answer is not found, reply "Not found in the provided passages."
Include citation numbers in your answer, e.g., [1][2].
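The length-control bullet above can be enforced with a greedy budget: keep passages in retrieval order until the token budget is spent. The ~4-characters-per-token ratio below is a crude English heuristic (an assumption); swap in a real tokenizer such as tiktoken for accurate counts:

```python
def fit_context(passages, max_context_tokens, chars_per_token=4):
    """Greedily keep passages, in retrieval order, until a rough token
    budget is exhausted. chars_per_token=4 is a coarse heuristic."""
    budget_chars = max_context_tokens * chars_per_token
    kept, used = [], 0
    for i, text in enumerate(passages, start=1):
        entry = f"[{i}] {text}"            # numbered, as the prompt format requires
        if used + len(entry) > budget_chars:
            break
        kept.append(entry)
        used += len(entry)
    return "\n".join(kept)

ctx = fit_context(["a" * 100, "b" * 100, "c" * 100], max_context_tokens=60)
```

Dropping from the tail works because retrieval order is relevance order; a fancier variant truncates the last passage instead of dropping it.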
8) Generation & Anti-Hallucination
- System constraint: only use the provided context, don't invent facts.
- JSON/structured output: makes it easier for frontend or downstream consumption.
- Confidence: return retrieval scores and sources; flag low scores as "low confidence."
- Refusal: when unknown, return "Not found" — don't speculate.
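The structured-output and confidence bullets can be combined in a small wrapper that packages the model's answer with sources and a confidence flag derived from retrieval scores. The threshold of 0.3 and the field names are illustrative assumptions:

```python
def package_answer(answer_text, hits, low_score_threshold=0.3):
    """Wrap the model's answer with sources and a confidence flag, so the
    frontend can render citations or surface a refusal.
    `hits` is a list of (retrieval_score, source_id) pairs."""
    top_score = max((score for score, _ in hits), default=0.0)
    return {
        "answer": answer_text,
        "sources": [source for _, source in hits],
        "confidence": "low" if top_score < low_score_threshold else "ok",
        "refused": answer_text.strip() == "Not found in the provided passages.",
    }

result = package_answer("Refunds take 5 business days. [1]", [(0.82, "faq.pdf#3")])
```

Downstream code can then branch on `refused` and `confidence` instead of parsing free text.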
9) Incremental Updates
- Periodic/real-time sync: watch for file changes, incrementally update the vector store.
- Versioning: tag documents with version numbers; include versions in retrieval results for traceability.
- Expiration strategy: retire old versions, clean up stale vectors.
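The sync and expiration bullets reduce to a diff: hash each current document, compare against what is indexed, and compute what to re-embed and what to delete. A minimal sketch (the hash-per-document granularity is an assumption; production systems often diff per chunk):

```python
import hashlib

def plan_sync(current_docs, indexed_hashes):
    """Diff current documents against the index.
    current_docs: {doc_id: text}; indexed_hashes: {doc_id: sha256 hex}.
    Returns (to_upsert, to_delete) so only changed content is re-embedded."""
    current_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    to_upsert = [d for d, h in current_hashes.items() if indexed_hashes.get(d) != h]
    to_delete = [d for d in indexed_hashes if d not in current_hashes]
    return to_upsert, to_delete

docs = {"faq.md": "v2 text", "guide.md": "same"}
indexed = {
    "faq.md": hashlib.sha256(b"v1 text").hexdigest(),   # changed
    "guide.md": hashlib.sha256(b"same").hexdigest(),    # unchanged
    "old.md": "deadbeef",                               # removed upstream
}
upserts, deletes = plan_sync(docs, indexed)
```

Run this on a schedule or from a file-watcher; stale vectors disappear via `to_delete`, which is the expiration strategy in practice.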
10) Performance & Cost
- Control chunk size and retrieval k to reduce context length.
- Pre-summarize: for very long documents, summarize first, then retrieve against summaries.
- Caching: cache popular Q&A pairs; pre-generate answers for common queries.
- Infrastructure: place the vector store near the model's region to reduce latency.
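The caching bullet can start as simple as an exact-match dictionary keyed by the normalized query, sketched below. This only catches verbatim repeats; a semantic cache (matching on query embeddings) is the usual next step:

```python
import hashlib

class AnswerCache:
    """Tiny exact-match cache for popular questions: normalize the query,
    hash it, and reuse the stored answer instead of re-running retrieval."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query):
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, answer):
        self._store[self._key(query)] = answer

cache = AnswerCache()
cache.put("What's the refund  process?", "See policy [1].")
hit = cache.get("what's the refund process?")  # case/whitespace-insensitive hit
```

Remember to invalidate cached answers when the underlying documents change (tie it to the incremental-update step above).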
11) Evaluation & Alignment
- Offline eval set: prepare Q&A pairs, check relevance, accuracy, and citation correctness.
- Online feedback: collect thumbs up/down on misses; feed back to improve retrieval or chunking.
- Metrics: hit rate, no-answer rate, average citation count, P95 latency, cost.
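Of the metrics listed, hit rate is the simplest to implement: the fraction of eval questions whose gold passage appears in the top-k retrieved IDs. A sketch, with a stand-in retrieval function:

```python
def hit_rate(eval_set, retrieve, k=4):
    """Fraction of questions where any gold passage ID appears in the
    top-k retrieved IDs. eval_set: list of (question, set_of_gold_ids);
    retrieve: your retrieval function, returning ranked passage IDs."""
    hits = 0
    for question, gold_ids in eval_set:
        retrieved = retrieve(question)[:k]
        if any(g in retrieved for g in gold_ids):
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0

# Stand-in retriever for illustration; plug in your real one
fake_retrieve = lambda q: ["p1", "p7", "p9", "p2"]
score = hit_rate([("q1", {"p7"}), ("q2", {"p5"})], fake_retrieve)
```

Tracking hit rate separately from answer accuracy tells you whether failures come from retrieval or from generation, which is exactly the feedback the chunking and reranking steps need.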
12) Security & Multi-Tenancy
- Tenant isolation: metadata includes tenant_id; enforce filtering at retrieval time.
- Access control: filter visible documents by user/role.
- Data sensitivity: redact sensitive fields; don't store raw text in logs.
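The isolation bullet has one rule worth encoding: the tenant filter is injected server-side, never taken from client input. A sketch, with a toy backend standing in for your vector store's filtered-search call:

```python
def tenant_search(db_query, tenant_id, query, k=4):
    """Server-side guard: the tenant filter is added here, from the
    authenticated session, so one tenant can never read another's docs.
    `db_query` stands in for your vector store's filtered search."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    return db_query(query, k=k, where={"tenant_id": tenant_id})

# Toy backend to show the guard in action
corpus = [("refunds", {"tenant_id": "acme"}), ("pricing", {"tenant_id": "beta"})]
def fake_db_query(query, k, where):
    return [t for t, m in corpus if m["tenant_id"] == where["tenant_id"]][:k]

docs = tenant_search(fake_db_query, "acme", "refund policy")
```

Role-based document visibility works the same way: extend the `where` clause with the user's allowed groups before the query ever reaches the store.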
13) Minimal Working Example
# 1) Chunk + embed + store (simplified; `db` stands in for your vector store client)
chunks = splitter.split_text(text)
vectors = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks,
).data
db.add(texts=chunks, embeddings=[v.embedding for v in vectors])

# 2) Query
query = "What's the refund process?"
q_emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding
docs = db.similarity_search_by_vector(q_emb, k=4)

# 3) Build the prompt (use "\n", not "\\n", so passages land on separate lines)
context = "\n".join(f"[{i+1}] {d.page_content}" for i, d in enumerate(docs))
prompt = f"""You are a customer service assistant. Answer ONLY based on the
provided passages. Do not fabricate.
Provided passages:
{context}
Question: {query}
If the answer is not found, reply "Not found in the provided passages." Include citation numbers."""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
14) Exercises
- Build a local vector store (Chroma) with 3 articles, implement retrieval Q&A with citation numbers in the output.
- Add reranking (e.g., bge-reranker or Cohere Rerank) and compare answer quality.
- Add "no answer" detection logic to avoid false answers, and log retrieval scores and latency.