05

RAG Systems Introduction

⏱️ 60 min

RAG (Retrieval-Augmented Generation) lets models answer questions using your private data. It's a critical piece of most AI applications.

1) Concept & Use Cases

  • Solves knowledge freshness and private data problems: the model "retrieves first, then answers."
  • Good fit: enterprise knowledge base Q&A, document assistants, customer service FAQs, compliance search, codebase Q&A.
  • Not a fit: tasks that need heavy reasoning rather than supporting documents; real-time or multimodal reasoning may require capabilities beyond retrieval.

2) End-to-End Pipeline

User question → Vectorize → Retrieve → Build context → LLM generates answer → (optional) Return citations/confidence

Core components: chunking, embedding, vector store, retrieval, reranking, prompt construction, generation, feedback loop.
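The stages above can be sketched as plain functions. Everything here is an illustrative stub (toy embedding, toy distance), not a real library API; a production system swaps in a real embedding model, vector DB, and LLM call.

```python
# Minimal sketch of the RAG pipeline stages as plain functions.
# All names and the toy embedding are illustrative stubs.

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: map characters to floats.
    return [float(ord(c)) for c in text[:8]]

def retrieve(query_vec, store, k=3):
    # Nearest neighbors by a toy squared distance; a vector DB does this for real.
    def dist(record):
        return sum((a - b) ** 2 for a, b in zip(query_vec, record["vec"]))
    return sorted(store, key=dist)[:k]

def build_context(passages):
    # Number passages so the answer can cite them as [1], [2], ...
    return "\n".join(f"[{i+1}] {p['text']}" for i, p in enumerate(passages))

def answer(question, store):
    hits = retrieve(embed(question), store)
    context = build_context(hits)
    # A real system would now send `context` + `question` to the LLM.
    return context

store = [{"text": t, "vec": embed(t)}
         for t in ["refund policy", "shipping info", "warranty terms"]]
print(answer("refund policy", store))
```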

3) Data Preparation & Chunking

  • Cleaning: strip headers/footers, table of contents/watermarks, footnotes; preserve paragraph semantics.
  • Chunking: recursive character splitting (200–400 tokens per chunk, 10–20% overlap); tables/code can use semantic blocks.
  • Metadata: document name, section, page number, timestamp, source type — used for filtering and citation.

Python Example (LangChain)

# Newer LangChain releases ship this in `langchain_text_splitters`;
# older releases import it from `langchain.text_splitter`.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=60
)
chunks = splitter.split_text(text)
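The metadata bullet above can be implemented by wrapping each chunk in a small record. This is a stdlib-only sketch with hypothetical field names (document, chunk_index, source_type); adapt the schema to whatever your vector store expects.

```python
# Sketch: attach filtering/citation metadata to each chunk.
# Field names are illustrative, not a fixed schema.
import datetime

def make_records(chunks, doc_name, source_type="pdf"):
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        {
            "id": f"{doc_name}#{i}",  # stable per-document chunk ID
            "text": chunk,
            "metadata": {
                "document": doc_name,
                "chunk_index": i,
                "source_type": source_type,
                "indexed_at": now,
            },
        }
        for i, chunk in enumerate(chunks)
    ]

records = make_records(["chunk one", "chunk two"], doc_name="handbook.pdf")
print(records[0]["id"])  # handbook.pdf#0
```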

4) Embedding (Vectorization)

  • Model selection: e.g., OpenAI text-embedding-3-*; balance cost against multilingual (Chinese/English) retrieval quality.
  • Dimensions: defaults are usually fine; stick with the same model to avoid space mismatches.
  • Deduplication: hash content or use IDs to prevent duplicate writes.

Python Example

from openai import OpenAI
client = OpenAI()
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
).data
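The deduplication bullet above can be done with a content hash before writing to the vector store, so re-running ingestion never inserts the same chunk twice. A minimal stdlib sketch:

```python
# Sketch: content-hash deduplication before embedding/writing.
import hashlib

def dedupe(chunks, seen=None):
    # `seen` can be persisted between ingestion runs.
    seen = set() if seen is None else seen
    unique = []
    for chunk in chunks:
        h = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(chunk)
    return unique

print(dedupe(["a", "b", "a"]))  # ['a', 'b']
```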

5) Vector Database Selection

  • Managed: Pinecone, Weaviate Cloud — no ops overhead, good for production.
  • Self-hosted: Chroma (lightweight), Weaviate, Milvus; watch out for persistence and backups.
  • Selection criteria: latency, scalability, filter support (metadata filters), multi-tenant isolation.

6) Retrieval Strategies

  • Similarity search (k=3–6) is the baseline.
  • Filtering: filter by document type/time/tenant to reduce false positives.
  • Hybrid retrieval: BM25 + vector; or rerank to boost relevance (e.g., Cohere Rerank / bge-reranker).
  • Dedup & merge: combine passages from the same source to avoid redundant context.
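One common way to merge the BM25 and vector rankings mentioned above is reciprocal rank fusion (RRF). A minimal sketch (doc IDs are illustrative):

```python
# Sketch: reciprocal rank fusion (RRF) over two rankings.
# score(doc) = sum over rankings of 1 / (k + rank); k=60 is the usual default.

def rrf(rankings, k=60):
    # rankings: list of ordered doc-ID lists, best first.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d3", "d9"]
print(rrf([bm25_hits, vector_hits]))
```

Documents that appear high in both rankings float to the top, which is why RRF is a cheap but effective hybrid baseline before a learned reranker.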

7) Context Construction & Prompting

  • Format: list retrieved passages with numbered IDs and sources; instruct "answer only based on the provided context; if unknown, say unknown."
  • Length control: keep context tokens to 1/2–2/3 of the model's limit; leave room for output.
  • Citations: require answers to include passage numbers for traceability.

Example Prompt Fragment

You are a knowledge base assistant. Answer ONLY based on the
"provided passages" below. Do not make things up.

Provided passages:
[1] Passage content...
[2] Passage content...

If the answer is not found, reply "Not found in the provided passages."
Include citation numbers in your answer, e.g., [1][2].
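The numbering and length-control rules above can be combined in one helper. This sketch uses a crude chars/4 token estimate rather than a real tokenizer; swap in tiktoken or your model's tokenizer for accurate budgeting.

```python
# Sketch: build a numbered, source-tagged context under a token budget.
# len(text) // 4 is a crude token estimate, not a real tokenizer.

def build_context(passages, max_tokens=2000):
    # passages: list of (text, source) pairs, best first.
    lines, used = [], 0
    for i, (text, source) in enumerate(passages, start=1):
        cost = len(text) // 4
        if used + cost > max_tokens:
            break  # leave the rest of the budget for the model's output
        lines.append(f"[{i}] ({source}) {text}")
        used += cost
    return "\n".join(lines)

ctx = build_context([("Refunds take 5-7 days.", "faq.md"),
                     ("Contact support to start a refund.", "policy.pdf")])
print(ctx)
```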

8) Generation & Anti-Hallucination

  • System constraint: only use the provided context, don't invent facts.
  • JSON/structured output: makes it easier for frontend or downstream consumption.
  • Confidence: return retrieval scores and sources; flag low scores as "low confidence."
  • Refusal: when the context doesn't contain the answer, return "Not found" — never speculate.
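The confidence and refusal bullets can be enforced before generation even runs. A sketch, assuming similarity scores in [0, 1]; the 0.35/0.6 thresholds are illustrative values to tune on your own data:

```python
# Sketch: gate generation on retrieval scores; refuse on low confidence.
# Thresholds (0.35, 0.6) are illustrative and should be tuned per dataset.

def assess(hits, min_score=0.35):
    # hits: list of (passage, similarity_score), best first.
    if not hits or hits[0][1] < min_score:
        return {"answerable": False, "confidence": "low", "sources": []}
    conf = "high" if hits[0][1] >= 0.6 else "medium"
    return {"answerable": True, "confidence": conf,
            "sources": [p for p, s in hits if s >= min_score]}

print(assess([("Refund policy text", 0.72)]))
print(assess([("Unrelated text", 0.12)]))  # refuse instead of guessing
```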

9) Incremental Updates

  • Periodic/real-time sync: watch for file changes, incrementally update the vector store.
  • Versioning: tag documents with version numbers; include versions in retrieval results for traceability.
  • Expiration strategy: retire old versions, clean up stale vectors.
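Incremental sync reduces to "which documents changed since last run?". A content-hash sketch (the `index` dict stands in for persisted state; in production it would live in a database alongside the vectors):

```python
# Sketch: detect changed documents by content hash so only modified ones
# are re-chunked and re-embedded. `index` stands in for persisted state.
import hashlib

def changed_docs(docs, index):
    # docs: {doc_id: raw_text}; index: {doc_id: last_seen_hash} (mutated).
    changed = []
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index.get(doc_id) != h:
            index[doc_id] = h
            changed.append(doc_id)
    return changed

index = {}
print(changed_docs({"a.md": "v1"}, index))  # ['a.md'] — new document
print(changed_docs({"a.md": "v1"}, index))  # [] — unchanged, skip re-embedding
print(changed_docs({"a.md": "v2"}, index))  # ['a.md'] — updated, re-embed
```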

10) Performance & Cost

  • Control chunk size and retrieval k to reduce context length.
  • Pre-summarize: for very long documents, summarize first, then retrieve against summaries.
  • Caching: cache popular Q&A pairs; pre-generate answers for common queries.
  • Infrastructure: place the vector store near the model's region to reduce latency.
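The caching bullet can be as simple as keying answers on a normalized query, so repeated popular questions skip retrieval and generation entirely. A sketch (the normalization here is just strip+lowercase; a real system might also use semantic similarity):

```python
# Sketch: cache answers for popular queries, keyed on a normalized query.
import hashlib

cache = {}

def cache_key(query):
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def answer_with_cache(query, generate):
    key = cache_key(query)
    if key not in cache:
        cache[key] = generate(query)  # the expensive RAG call happens here
    return cache[key]

calls = []
fake_generate = lambda q: (calls.append(q) or f"answer to {q}")
print(answer_with_cache("Refund process?", fake_generate))
print(answer_with_cache("refund process? ", fake_generate))  # cache hit
print(len(calls))  # 1 — generation ran only once
```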

11) Evaluation & Alignment

  • Offline eval set: prepare Q&A pairs, check relevance, accuracy, and citation correctness.
  • Online feedback: collect thumbs up/down on misses; feed back to improve retrieval or chunking.
  • Metrics: hit rate, no-answer rate, average citation count, P95 latency, cost.
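Hit rate, the first metric above, is easy to compute offline: for each eval question, check whether the expected source chunk appears in the top-k retrieval. A sketch with a stubbed `retrieve` function:

```python
# Sketch: offline retrieval hit rate over a small eval set.
# `retrieve` is a stub standing in for your real retrieval call.

def hit_rate(eval_set, retrieve, k=4):
    # eval_set: list of {"question": ..., "expected_id": ...}
    hits = 0
    for case in eval_set:
        top = retrieve(case["question"], k)  # returns a list of doc IDs
        if case["expected_id"] in top:
            hits += 1
    return hits / len(eval_set)

fake_retrieve = lambda q, k: ["d1", "d2"]
cases = [{"question": "q1", "expected_id": "d1"},
         {"question": "q2", "expected_id": "d9"}]
print(hit_rate(cases, fake_retrieve))  # 0.5
```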

12) Security & Multi-Tenancy

  • Tenant isolation: metadata includes tenant_id; enforce filtering at retrieval time.
  • Access control: filter visible documents by user/role.
  • Data sensitivity: redact sensitive fields; don't store raw text in logs.
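The key point in tenant isolation is enforcing the filter inside the retrieval layer, not trusting the caller to pass it. A minimal in-memory sketch (a real vector DB applies the same idea via a metadata `where` filter):

```python
# Sketch: tenant isolation enforced inside the retrieval function itself.
# `query_fn` is a stand-in scoring function for illustration.

def search(store, query_fn, tenant_id, k=4):
    # store: list of {"text": ..., "tenant_id": ...}
    visible = [r for r in store if r["tenant_id"] == tenant_id]  # enforced here
    return sorted(visible, key=query_fn, reverse=True)[:k]

store = [{"text": "acme refund policy", "tenant_id": "acme"},
         {"text": "globex refund policy", "tenant_id": "globex"}]
results = search(store, lambda r: len(r["text"]), tenant_id="acme")
print([r["text"] for r in results])  # ['acme refund policy']
```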

13) Minimal Working Example

# 1) Chunk + Embed + Store (simplified)
chunks = splitter.split_text(text)
vectors = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
).data
db.add(texts=chunks, embeddings=[v.embedding for v in vectors])

# 2) Query
query = "What's the refund process?"
q_emb = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
docs = db.similarity_search_by_vector(q_emb, k=4)

# 3) Build prompt
context = "\n".join(f"[{i+1}] {d.page_content}" for i, d in enumerate(docs))
prompt = f"""You are a customer service assistant. Answer ONLY based on the
provided passages. Do not fabricate.

Provided passages:
{context}

Question: {query}
If the answer is not found, reply "Not found in the provided passages." Include citation numbers."""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
print(resp.choices[0].message.content)

14) Exercises

  1. Build a local vector store (e.g., Chroma) from 3 articles and implement retrieval Q&A that includes citation numbers in its answers.
  2. Add reranking (e.g., bge-reranker or Cohere Rerank) and compare answer quality.
  3. Add "no answer" detection logic to avoid false answers, and log retrieval scores and latency.

📚 Related Resources