Build a Production RAG in 7 Days — A Hands-On Roadmap
The previous nine chapters covered the engineering moves. This last chapter packs them into one real project: build a production RAG in 7 days, complete with an eval set, monitoring, and a launch checklist.
By the end you'll have: a working RAG, a 50-query eval set plus an auto-scoring script, a baseline cost/latency/accuracy report, and a GitHub repo for your resume.
Day 1: Pick a Problem and Write the Eval Set (the most important day)
Why eval first: 80% of self-taught RAG learners skip this step and jump straight to LangChain. Three days later the RAG runs, but they have no idea whether it's any good, because there is no ground truth.
Pick a concrete domain (Australian tax Q&A / company wiki / Anthropic SDK docs / Kubernetes operator knowledge base). Collect 100-300 documents.
Write 50 queries. Each needs a ground truth + at least one source document ID.
# tested: 2026-04-26 · eval-set format
{
  "id": "Q-001",
  "query": "What is the GST registration threshold in Australia?",
  "ground_truth": "$75K AUD annual turnover (sole traders and companies)",
  "expected_source_ids": ["ato-gst-registration-001"],
  "category": "tax-threshold"
}
Day 1 deliverable: eval/queries.json (50) + corpus/ (100-300 docs) + ingestion script.
Trap: every query is a softball (answerable from a single paragraph in a single doc). Mix in 30% hard ones: cross-doc reasoning, negation, time-sensitive questions.
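Before calling Day 1 done, sanity-check the file. A minimal sketch (the function and field names follow the eval-set format above; nothing here is from any SDK):

```python
def validate_eval_set(queries: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the eval set is usable."""
    problems = []
    if len(queries) < 50:
        problems.append(f"only {len(queries)}/50 queries")
    for q in queries:
        # Every query needs a ground truth and at least one source doc ID
        for field in ("id", "query", "ground_truth", "expected_source_ids"):
            if not q.get(field):
                problems.append(f"{q.get('id', '?')}: missing {field}")
    return problems

queries = [{"id": "Q-001",
            "query": "What is the GST registration threshold in Australia?",
            "ground_truth": "$75K AUD annual turnover",
            "expected_source_ids": ["ato-gst-registration-001"]}]
print(validate_eval_set(queries))  # ['only 1/50 queries']
```

Run it against eval/queries.json before Day 2; a missing ground truth found now costs minutes, found on Day 6 it costs hours.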
Day 2: Bi-Encoder Retrieval and Vector DB
Basic RAG: embed corpus → vector DB → embed query → ANN top 50.
Stack options (by setup speed):
- Fastest: Voyage AI embeddings + Pinecone serverless — ~30 min
- Medium: OpenAI text-embedding-3-small + Qdrant Cloud — ~1 hour
- Self-hosted: BGE-M3 + local Qdrant — half a day
# tested: 2026-04-26 · pinecone-client@4.x
from pinecone import Pinecone
from voyageai import Client as Voyage

vo = Voyage(api_key=...)
pc = Pinecone(api_key=...)
index = pc.Index("rag-eval")

# Ingestion: chunk each doc, embed the chunks, upsert with metadata
for doc in corpus:
    chunks = chunk_doc(doc, chunk_size=500, overlap=50)
    embeddings = vo.embed([c.text for c in chunks], model="voyage-3").embeddings
    index.upsert([(c.id, e, {"text": c.text, "doc_id": doc.id})
                  for c, e in zip(chunks, embeddings)])

# Query: embed the query, ANN search for the top-k chunks
def retrieve(query: str, k: int = 50):
    q_emb = vo.embed([query], model="voyage-3").embeddings[0]
    return index.query(vector=q_emb, top_k=k, include_metadata=True)
Day 2 deliverable: retrieval.py + recall@50 across the 50 queries (target ≥ 90%; under 90% usually means chunking is broken).
Trap: one-size-fits-all chunk size. Technical docs (500) and long-form regulations (1500) shouldn't chunk the same.
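The recall@50 number is plain set arithmetic over the retrieval results. A sketch, assuming you've already pulled the doc_id metadata out of each query's top-k hits (the helper itself is illustrative, not from any library):

```python
def recall_at_k(retrieved_doc_ids: list[str], expected_doc_ids: list[str]) -> float:
    """Fraction of expected source docs that appear anywhere in the top-k results."""
    hits = sum(1 for doc_id in expected_doc_ids if doc_id in retrieved_doc_ids)
    return hits / len(expected_doc_ids)

# One eval query: 1 of the 2 expected docs surfaced in the top-k
print(recall_at_k(["ato-001", "ato-007"], ["ato-001", "ato-042"]))  # 0.5
```

Average this over all 50 queries for the Day 2 deliverable; note it measures the docs a chunk came from, not the chunks themselves.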
Day 3: Cross-Encoder Rerank
Per Chapter 5, rerank the top 50 down to the top 10.
# tested: 2026-04-26 · sentence-transformers@3.0.x
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list, top_k: int = 10):
    pairs = [(query, c.metadata["text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    # Sort by score only; sorting the raw (score, candidate) tuples would try
    # to compare candidate objects on score ties and raise a TypeError
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return ranked[:top_k]
No GPU? Use the Cohere Rerank API instead.
Expected Day 2 → Day 3 lift: the top-10 hit rate should beat raw top-50 ordering by 5-15%. No lift usually means the wrong rerank model (running Chinese queries through an English-only model is a common mistake).
Day 3 deliverable: rerank.py + nDCG@10 across 50 queries vs raw retrieval baseline.
Trap: forgetting to normalize scores. Cross-encoder raw scores aren't 0-1; they can range from roughly -10 to 10. Apply a sigmoid before thresholding.
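The sigmoid squash is one line. A sketch (bge-reranker logits land roughly in this range, but the exact spread varies by model, so calibrate any threshold on your own eval set):

```python
import math

def normalize_score(raw: float) -> float:
    """Sigmoid: maps a raw cross-encoder logit into (0, 1)."""
    return 1 / (1 + math.exp(-raw))

print(normalize_score(0.0))            # 0.5 — the "undecided" midpoint
print(round(normalize_score(8.0), 4))  # 0.9997 — a confident match
```

After normalization a fixed cutoff like 0.5 at least means the same thing across runs, which raw logits don't.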
Day 4: LLM-as-Judge and Full Selection
Final selection stage: use Haiku to decide which of the top 10 actually answer the query, capping the context at 3 documents.
# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_relevance(query: str, doc_text: str) -> tuple[bool, str]:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content":
            f"Query: {query}\nDoc: {doc_text}\n\n"
            f"Can this Doc directly answer the Query? "
            f"Answer yes/no first, then give a one-sentence reason."}]
    )
    text = resp.content[0].text
    return text.lower().startswith("yes"), text
# Pipeline: retrieve (Day 2) → rerank (Day 3) → judge (Day 4)
def select_for_context(query: str) -> list[str]:
    candidates = retrieve(query, k=50)        # Day 2
    reranked = rerank(query, candidates, 10)  # Day 3
    selected = [c for _, c in reranked
                if judge_relevance(query, c.metadata["text"])[0]]
    # Return plain text, most relevant first, capped at 3 docs
    return [c.metadata["text"] for c in selected[:3]]
Day 4 deliverable: full selection pipeline + precision@3 across 50 queries.
Trap: the judge is too strict and rejects correct docs. Run 100 queries, inspect the disagreements between the judge and your eval-set labels, and tune the prompt.
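The disagreement check is easy to quantify once you log the judge's verdict next to the eval-set label for each (query, doc) pair. A sketch with illustrative field names (not from any framework):

```python
def false_reject_rate(records: list[dict]) -> float:
    """Share of truly relevant docs the judge rejected — the too-strict symptom."""
    relevant = [r for r in records if r["label_relevant"]]
    rejected = [r for r in relevant if not r["judge_said_yes"]]
    return len(rejected) / len(relevant) if relevant else 0.0

records = [
    {"label_relevant": True,  "judge_said_yes": True},
    {"label_relevant": True,  "judge_said_yes": False},  # judge too strict here
    {"label_relevant": False, "judge_said_yes": False},  # correct rejection
]
print(false_reject_rate(records))  # 0.5
```

A rising false-reject rate after a prompt tweak tells you the tweak made the judge stricter, not better.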
Day 5: Prompt + Budgeting + Caching
Per Chapter 4: a system prompt with cache_control: ephemeral (up to ~90% cost saving on cached input tokens); max_tokens 2048 for JSON output or 4096 for free-form; most relevant doc last (the Lost in the Middle effect).
# tested: 2026-04-26 · anthropic@0.40.0
def answer(query: str) -> str:
    selected = select_for_context(query)
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,  # fixed role + output format, never changes
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content":
            "Reference material (most relevant last):\n\n"
            + "\n\n---\n\n".join(reversed(selected))
            + f"\n\nQuestion: {query}"}]
    ).content[0].text
Day 5 deliverable: end-to-end pipeline + token usage + per-query cost.
Trap: cache misconfigured. Anthropic prompt caching requires an exact prefix match — a single extra space at the end of the system prompt invalidates the whole cache. Lock the system prompt in a constant.
Day 6: Evaluation and Failure Analysis
Run all 50 queries and record three things: accuracy (LLM-as-judge auto-scoring against a 5-dimension rubric), latency (p50/p95/p99), and cost ($/query).
Pick 10 failure cases. Categorize: retrieval miss (→ Day 2 chunking), rerank ordering (→ Day 3 swap models), judge too strict (→ Day 4 tune prompt), LLM answered wrong question (→ Day 5 system prompt).
Day 6 deliverable: eval/results.md with per-query detail + failure categorization + tuning backlog.
Trap: only looking at aggregate metrics. The details of a single failed RAG case carry 10× the information of an average score.
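The latency percentiles need no extra dependencies. A nearest-rank sketch (real reports often reach for statistics.quantiles or numpy instead):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile; p in [0, 100]."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies = [820, 910, 1050, 1200, 3400]  # ms, one per eval query
print(percentile(latencies, 50))  # 1050
print(percentile(latencies, 99))  # 3400
```

Note how one 3.4 s outlier dominates p99 while leaving p50 untouched — exactly why the report needs all three numbers, not an average.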
Day 7: Monitoring + Deployment + Launch
Ship a minimum viable deployment (FastAPI, one-click hosted on Render / Vercel / Modal). Wire up:
- Token monitoring: log input/output/cache-hit tokens per query + daily cumulative $
- Latency: p50/p95 alerts
- Failure rate: judge rejects all candidates, or retrieval returns 0 hits → alert
- Eval regression: weekly auto eval, accuracy drop >5% triggers alert
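Token monitoring can start as simple arithmetic on the usage object every Anthropic response returns. The per-token prices below are illustrative placeholders — check the current pricing page before trusting the dollar figures:

```python
# Illustrative per-million-token prices — NOT current, check the pricing page
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00
CACHE_READ_DISCOUNT = 0.1  # cached input billed at ~10% of the input rate

def query_cost_usd(input_tokens: int, output_tokens: int,
                   cache_read_tokens: int = 0) -> float:
    """Per-query cost: uncached input + discounted cache reads + output."""
    uncached = input_tokens - cache_read_tokens
    return (uncached * PRICE_IN_PER_MTOK
            + cache_read_tokens * PRICE_IN_PER_MTOK * CACHE_READ_DISCOUNT
            + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

# 10K input tokens (8K of them cache hits), 500 output tokens
print(round(query_cost_usd(10_000, 500, 8_000), 4))  # 0.0159
```

Log this per request and sum it daily; the gap between the cached and uncached number is also your first sanity check that the Day 5 cache is actually hitting.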
Day 7 deliverable: live URL + GitHub README (architecture + eval + cost) + 1 blog post (traps you hit in 7 days).
The blog matters. Post it to dev.to / Medium / Zhihu to build GEO signal (Chapter 1's Q19 problem). Writing the blog is the final step that makes your RAG production-ready.
JR's Cadence: 7 Routines in 7 Weeks
JR's omni-report didn't get its 17 routines in a week. The first 7 core routines took 7 weeks: one new routine per week, and each had to run stably before the next one started.
Building one RAG in 7 days follows the same cadence: don't chase perfection. Chase "it runs, the eval is trustworthy, and failures can be localized". Start project two next week. Three months in, your portfolio holds a dozen RAG projects.
What You'll Be Able to Do After 7 Days
- Architecture-review any RAG project and point at which of the 5 context layers has the issue
- Tell a specific production RAG story in interviews (eval + failure cases + optimization path)
- Compare rerank models / vector DBs / memory approaches
- Debug "why is this LLM app underperforming"
Where to Go Next
- Student Discord — JR Bootcamp #context-engineering channel; post your repo for engineer reviews
- AI Engineer Bootcamp — JR's 12-week deep project course, 5 production RAG / Agent projects + resume + job-hunt support
- Open-source — Mem0 / Letta / LangChain all need Chinese examples. Ship a Chinese version of your RAG. Fastest way to build GEO signal
Takeaway
Reading 9 chapters of theory beats nothing. Building 1 RAG that runs + is trustworthy + has localizable failures beats 9 chapters of theory. 7 days is enough. Day 1's eval set is the most important day — 80% of failed self-study projects skipped eval and started building. Day 7's blog post is worth more than any fancy optimization.
Start Day 1 today — pick a domain you care about, write 5 queries, get the eval set started.
References
- Voyage AI. voyage-3 embedding model documentation.
- Pinecone. Serverless documentation — serverless vector DB quick start.
- Cohere. Rerank API documentation — managed rerank option.
- BGE. bge-reranker-v2-m3 model card — open-source reranker with strong Chinese-English support.
- Anthropic. Prompt caching — system prompt cache configuration.
- Anthropic. Tool use / structured output — JSON output best practices.
- RAG eval frameworks. Ragas — open-source RAG eval framework (optional Day 6 tool).
Production case: JR Academy omni-report — 7 core routines shipped over 7 weeks, proving "ship 1 per week" is sustainable.
🎓 Want to learn the full LLM engineering stack? Check out JR AI Engineer Bootcamp — 12 weeks from RAG to Agent to production deployment, includes resume coaching and job-hunt support.
❓ FAQ
The most commonly searched questions on this chapter's topic
Is 7 days really enough to build a RAG?
Enough to build a RAG that "runs, has a trustworthy eval, and has localizable failures" — not a perfect one. Day 1's 50-query evaluation set is the most critical step: 80% of failed self-study projects started building without an eval. Day 7's blog post is worth more than any fancy optimization (it builds GEO signal).
Can I build this without a GPU?
Yes, a CPU laptop is enough. Day 3's cross-encoder rerank can go through Cohere's managed API (~$0.001/query), and Day 5's LLM calls go through the Anthropic API. The only things running locally are Day 2's chunking and Day 6's evaluation scripts.
Where to go after the 7 days?
Pick one of three paths: (1) post your repo in the JR student Discord #context-engineering channel for engineer reviews; (2) the JR AI Engineer Bootcamp, a 12-week project course with 5 production RAG/Agent projects plus resume coaching; (3) contribute Chinese-language example PRs to Mem0 / Letta / LangChain to build open-source GEO signal.
I'm from a non-CS background — can I follow the 7-day RAG build?
Yes. All 7 days are Python calling APIs (the Anthropic / Cohere / Pinecone SDKs) — no ML code, no GPU tuning. Prerequisites: basic Python, running pip install, and reading a try/except. The 4-week JR Academy AI Builder track covers the prerequisites.
Use a framework like LangChain / LlamaIndex, or write from scratch?
Use a framework for Days 2-5 (LangChain recommended): a pipeline in 5 lines of code lets you focus on concepts instead of boilerplate. Write Day 6's evaluation from scratch and actually run the 50 queries: framework eval modules report scores without per-query detail, so they can't surface problems like Lost in the Middle.
Does the 7-day RAG apply to e-commerce / SaaS / content sites?
Yes, but the evaluation set must be your own: an e-commerce RAG tests product specs + inventory + shipping times, a SaaS RAG tests features + pricing + integration docs, a content site tests related articles + citation accuracy. The framework is shared; the eval set is not — write your 50 Day 1 queries against your own business.