
⏱️ 7 days

Build a Production RAG in 7 Days — A Hands-On Roadmap

The previous 9 chapters covered the engineering moves. This final chapter packs them into one real project: build a production RAG in 7 days, complete with an eval set, monitoring, and a launch checklist.

By the end you'll have: a working RAG, a 50-query eval set plus an auto-scoring script, a baseline cost/latency/accuracy report, and a GitHub repo for your resume.

Day 1: Pick a Problem and Write the Eval Set (the most important day)

Why eval first: 80% of self-taught RAG learners skip this and jump to LangChain. Three days later the RAG runs and they have no idea if it's any good — no ground truth.

Pick a concrete domain (Australian tax Q&A / company wiki / Anthropic SDK docs / Kubernetes operator knowledge base). Collect 100-300 documents.

Write 50 queries. Each needs a ground truth + at least one source document ID.

# tested: 2026-04-26 · eval-set format
{
  "id": "Q-001",
  "query": "What is the GST registration threshold in Australia?",
  "ground_truth": "$75K AUD annual revenue (freelancers / companies)",
  "expected_source_ids": ["ato-gst-registration-001"],
  "category": "tax-threshold"
}

Day 1 deliverable: eval/queries.json (50) + corpus/ (100-300 docs) + ingestion script.

Trap: every query is a softball (single paragraph, single doc). Mix in 30% hard ones: cross-doc reasoning, negation, time-sensitive.

Day 2: Bi-Encoder Retrieval and Vector DB

Basic RAG: embed corpus → vector DB → embed query → ANN top 50.

Stack (by setup speed):

# tested: 2026-04-26 · pinecone-client@4.x
from pinecone import Pinecone
from voyageai import Client as Voyage

vo = Voyage(api_key=...)
pc = Pinecone(api_key=...)
index = pc.Index("rag-eval")

# Ingestion
for doc in corpus:
    chunks = chunk_doc(doc, chunk_size=500, overlap=50)
    embeddings = vo.embed([c.text for c in chunks], model="voyage-3").embeddings
    index.upsert([(c.id, e, {"text": c.text, "doc_id": doc.id})
                  for c, e in zip(chunks, embeddings)])

# Query
def retrieve(query: str, k: int = 50):
    q_emb = vo.embed([query], model="voyage-3").embeddings[0]
    return index.query(vector=q_emb, top_k=k, include_metadata=True)

Day 2 deliverable: retrieval.py + recall@50 across 50 queries (≥ 90%; under 90% means chunking is broken).
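
A sketch of how to score that deliverable, using the retrieve() above. This is a hypothetical helper: it assumes the Day 1 eval-set format and that each Pinecone match carries the doc_id written into metadata during ingestion.

# sketch (untested): recall@k over eval/queries.json
import json

def recall_at_k(queries_path: str = "eval/queries.json", k: int = 50) -> float:
    queries = json.load(open(queries_path))
    hits = 0
    for q in queries:
        matches = retrieve(q["query"], k=k).matches
        retrieved_doc_ids = {m.metadata["doc_id"] for m in matches}
        if retrieved_doc_ids & set(q["expected_source_ids"]):
            hits += 1
    return hits / len(queries)

print(f"recall@50 = {recall_at_k():.1%}")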

Trap: one-size-fits-all chunk size. Technical docs (500) and long-form regulations (1500) shouldn't chunk the same.
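
The Day 2 ingestion loop leans on a chunk_doc helper the chapter doesn't define. A minimal sliding-window sketch, with per-doc-type sizes as one hedge against the one-size-fits-all trap (the doc_type labels are assumptions):

# sketch (untested): character-window chunking with per-doc-type sizes
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str

CHUNK_SIZES = {"technical-doc": 500, "regulation": 1500}  # assumed labels; tune against recall@50

def chunk_doc(doc, chunk_size: int = 500, overlap: int = 50) -> list[Chunk]:
    size = CHUNK_SIZES.get(getattr(doc, "doc_type", ""), chunk_size)
    step = size - overlap
    return [Chunk(id=f"{doc.id}-{i}", text=doc.text[start:start + size])
            for i, start in enumerate(range(0, len(doc.text), step))]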

Day 3: Cross-Encoder Rerank

Per Chapter 5, top 50 → top 10.

# tested: 2026-04-26 · sentence-transformers@3.0.x
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list, top_k: int = 10):
    pairs = [(query, c.metadata["text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    # sort by score only; candidate objects aren't comparable when scores tie
    return sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)[:top_k]

No GPU → use Cohere rerank API.
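
A drop-in sketch of that route. The model name and client surface are assumptions; check the current Cohere SDK docs before relying on it.

# sketch (untested): managed rerank via Cohere instead of a local cross-encoder
import cohere

co = cohere.Client(api_key=...)

def rerank_cohere(query: str, candidates: list, top_k: int = 10):
    docs = [c.metadata["text"] for c in candidates]
    resp = co.rerank(model="rerank-multilingual-v3.0", query=query,
                     documents=docs, top_n=top_k)
    return [(r.relevance_score, candidates[r.index]) for r in resp.results]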

Day 2 → Day 3 lift: the reranked top-10 hit rate should beat the raw bi-encoder top-10 by 5-15%. No lift means the wrong rerank model (Chinese queries with an English-only model is a common mistake).

Day 3 deliverable: rerank.py + nDCG@10 across 50 queries vs raw retrieval baseline.

Trap: forgetting to normalize scores. Cross-encoder raw scores aren't 0-1, can be -10 to 10. Need a sigmoid.
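
A plain sigmoid is enough when you just need scores in 0-1 before mixing them with other signals; a minimal sketch:

# sketch: squash raw cross-encoder logits (roughly -10..10) into 0-1
import numpy as np

def normalize_scores(scores: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-scores))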

Day 4: LLM-as-Judge and Full Selection

Final selection stage: use Haiku to decide which of the top 10 actually answer the query, capping at 3.

# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic(api_key=...)

def judge_relevance(query: str, doc_text: str) -> tuple[bool, str]:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content":
            f"Query: {query}\nDoc: {doc_text}\n\n这个 Doc 能直接回答 Query 吗?"
            f"先回答 yes/no,然后一句话理由。"}]
    )
    text = resp.content[0].text
    return text.lower().startswith("yes"), text

# Pipeline
def select_for_context(query: str) -> list[str]:
    candidates = retrieve(query, k=50)        # Day 2
    reranked = rerank(query, candidates, 10)  # Day 3
    selected = [c.metadata["text"] for _, c in reranked
                if judge_relevance(query, c.metadata["text"])[0]]
    return selected[:3]

Day 4 deliverable: full selection pipeline + precision@3 across 50 queries.

Trap: judge too strict, rejects correct docs. Run 100 queries, look at disagreement, tune the prompt.
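
One way to run that audit, assuming the Day 1 eval set's expected_source_ids mark docs the judge should have accepted (the helper name is hypothetical):

# sketch (untested): eval-set-relevant docs the judge rejected, with its stated reasons
import json

def judge_disagreements(queries_path: str = "eval/queries.json") -> list[dict]:
    disagreements = []
    for q in json.load(open(queries_path)):
        reranked = rerank(q["query"], retrieve(q["query"], k=50), 10)
        for _, c in reranked:
            if c.metadata["doc_id"] not in q["expected_source_ids"]:
                continue
            ok, reason = judge_relevance(q["query"], c.metadata["text"])
            if not ok:
                disagreements.append({"query_id": q["id"],
                                      "doc_id": c.metadata["doc_id"],
                                      "judge_reason": reason})
    return disagreements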

Day 5: Prompt + Budgeting + Caching

Per Chapter 4: system prompt with cache_control: ephemeral (90% cost saving); max_tokens 2048 (JSON) / 4096 (free-form); most relevant doc last (Lost in the Middle).

# tested: 2026-04-26 · anthropic@0.40.0
def answer(query: str) -> str:
    selected = select_for_context(query)
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,  # 不变的角色 + 输出格式
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content":
            "参考资料(最相关的在最后):\n\n"
            + "\n\n---\n\n".join(reversed(selected))
            + f"\n\n问题: {query}"}]
    ).content[0].text

Day 5 deliverable: end-to-end pipeline + token usage + per-query cost.

Trap: cache misconfiguration. Anthropic's prompt cache requires a byte-exact prefix match; one extra space at the end of the system prompt invalidates the cache entirely. Lock the system prompt in a constant.
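
A quick way to verify the cache is actually hit: the Anthropic SDK exposes cache token counts on the response's usage object (field names as of the anthropic version pinned above; confirm against your version).

# sketch (untested): lock the prompt in a constant, then compare usage across two identical calls
SYSTEM_PROMPT = "..."  # the real prompt lives here once; never rebuild or concatenate it per request

def check_cache_hit():
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system=[{"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "cache smoke test"}],
    )
    # first call: cache_creation_input_tokens > 0; a second identical call: cache_read_input_tokens > 0
    # (prompts below the model's minimum cacheable length are never cached at all)
    print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)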

Day 6: Evaluation and Failure Analysis

Run the 50 queries and record: Accuracy (LLM-as-judge auto-scoring against a 5-dimension rubric), Latency (p50/p95/p99), and Cost ($/query).
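
A minimal runner for the latency side of that report, sketch only; for $/query the Day 5 answer() would need to return token usage so you can multiply by your model's pricing.

# sketch (untested): run all 50 queries, report latency percentiles, dump answers for judging
import json, statistics, time

def run_eval(queries_path: str = "eval/queries.json"):
    latencies, rows = [], []
    for q in json.load(open(queries_path)):
        t0 = time.perf_counter()
        text = answer(q["query"])                      # Day 5 pipeline
        latencies.append(time.perf_counter() - t0)
        rows.append({"id": q["id"], "answer": text, "latency_s": round(latencies[-1], 2)})
    cuts = statistics.quantiles(latencies, n=100)
    print(f"p50={cuts[49]:.2f}s  p95={cuts[94]:.2f}s  p99={cuts[98]:.2f}s")
    json.dump(rows, open("eval/answers.json", "w"), ensure_ascii=False, indent=2)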

Pick 10 failure cases. Categorize: retrieval miss (→ Day 2 chunking), rerank ordering (→ Day 3 swap models), judge too strict (→ Day 4 tune prompt), LLM answered wrong question (→ Day 5 system prompt).

Day 6 deliverable: eval/results.md with per-query detail + failure categorization + tuning backlog.

Trap: only looking at aggregate metrics. The detail of one failed case carries 10× the information of an average score.

Day 7: Monitoring + Deployment + Launch

Minimum viable version (FastAPI + one-click Render / Vercel / Modal). Wire up the following (a minimal logging sketch follows the list):

  • Token monitoring: log input/output/cache-hit tokens per query + daily cumulative $
  • Latency: p50/p95 alerts
  • Failure rate: alert when the judge rejects every candidate or retrieval returns 0 hits
  • Eval regression: weekly auto eval, accuracy drop >5% triggers alert
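
A minimal sketch of the per-query record behind those four alerts. It assumes the Day 5 answer() is adjusted to return the full API response (not just the text) so token usage is available.

# sketch (untested): JSONL query log, one line per request; tail it for alerting and the weekly eval diff
import json, time

def log_query(query: str, resp, selected_count: int, latency_s: float,
              path: str = "logs/queries.jsonl") -> None:
    record = {
        "ts": time.time(),
        "query": query,
        "latency_s": round(latency_s, 3),
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "cache_read_tokens": getattr(resp.usage, "cache_read_input_tokens", 0) or 0,
        "selected_docs": selected_count,  # 0 means the judge rejected everything or retrieval came up empty
    }
    with open(path, "a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")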

Day 7 deliverable: live URL + GitHub README (architecture + eval + cost) + 1 blog post (traps you hit in 7 days).

The blog matters. Post to dev.to / Medium / 知乎 to build GEO signal (Chapter 1's Q19 problem). Writing the blog is the final piece making your RAG production-ready.

JR's Cadence: 7 Routines in 7 Weeks

The 17 routines in JR's omni-report weren't built in a week. The first 7 core routines took 7 weeks: one new routine per week, and each had to run stably before the next started.

Building 1 RAG in 7 days follows the same cadence: don't chase perfect. Chase "runs + trustworthy eval + localizable failures". Next week, start project two. Three months in, your portfolio has 12 RAG projects.

What You'll Be Able to Do After 7 Days

  • Architecture-review any RAG project and point to which of the 5 context layers has the issue
  • Tell a specific production RAG story in interviews (eval + failure cases + optimization path)
  • Compare rerank models / vector DBs / memory approaches
  • Debug "why is this LLM app underperforming"

Where to Go Next

  • Student Discord — JR Bootcamp #context-engineering channel; post your repo for engineer reviews
  • AI Engineer Bootcamp — JR's 12-week deep project course, 5 production RAG / Agent projects + resume + job-hunt support
  • Open-source — Mem0 / Letta / LangChain all need Chinese examples. Ship a Chinese version of your RAG. Fastest way to build GEO signal

Takeaway

Reading 9 chapters of theory is better than nothing. Building 1 RAG that runs, has a trustworthy eval, and has localizable failures beats 9 chapters of theory. 7 days is enough. Day 1, writing the eval set, matters most: 80% of failed self-study projects skipped the eval and started building. Day 7's blog post is worth more than any fancy optimization.

Start Day 1 today — pick a domain you care about, write 5 queries, get the eval set started.


References

  1. Voyage AI. voyage-3 embedding documentation — the embedding model used on Day 2, recommended in Anthropic's embeddings guide.
  2. Pinecone. Serverless documentation — serverless vector DB quick start.
  3. Cohere. Rerank API documentation — managed rerank option.
  4. BGE. bge-reranker-v2-m3 model card — bilingual Chinese-English open source rerank.
  5. Anthropic. Prompt caching — system prompt cache configuration.
  6. Anthropic. Tool use / structured output — JSON output best practices.
  7. RAG eval frameworks. Ragas — open-source RAG eval framework (optional Day 6 tool).

Production case: JR Academy omni-report — 7 core routines shipped over 7 weeks, proving "ship 1 per week" is sustainable.


🎓 Want to learn the full LLM engineering stack? Check out JR AI Engineer Bootcamp — 12 weeks from RAG to Agent to production deployment, includes resume coaching and job-hunt support.


❓ FAQ

The most frequently searched questions about this chapter's topic

Is 7 days really enough to build a RAG?

Enough to build a RAG that "runs + has a trustworthy eval + has localizable failures", not a perfect one. Day 1's 50-query evaluation set is the most critical step: 80% of failed self-study projects started building without an eval. Day 7's blog post is worth more than any fancy optimization (it builds GEO signal).

Can I build this without a GPU?

Yes, a CPU laptop is enough. Day 3's cross-encoder rerank can go through Cohere's managed API ($0.001/query), and Day 5's LLM calls go through the Anthropic API. The only things running locally are the Day 2 chunking and the Day 6 evaluation scripts.

Where do I go after the 7 days?

Pick one of three paths: (1) post your repo in the JR student Discord #context-engineering channel and get engineer reviews; (2) the JR AI Engineer Bootcamp, a 12-week project course with 5 production RAG/Agent projects plus resume coaching; (3) contribute Chinese-language example PRs to Mem0 / Letta / LangChain to build open-source GEO signal.

I'm not from a CS background. Can I follow the 7-day RAG build?

Yes: all 7 days are Python calling APIs (the Anthropic / Cohere / Pinecone SDKs), with no ML code to write and no GPU to tune. Prerequisites: basic Python, being able to run pip install, and reading try/except. Four weeks of the JR Academy AI Builder track is enough to cover the prerequisites.

Should I use a framework like LangChain / LlamaIndex or write it from scratch?

Use a framework for Days 2-5 (LangChain recommended): a pipeline in 5 lines of code, so you focus on the concepts instead of getting stuck on boilerplate. Day 6's evaluation must be written from scratch and actually run across the 50 queries: framework eval modules only report scores, not details, so problems like Lost in the Middle stay invisible.

Does the 7-day RAG apply to e-commerce / SaaS / content sites?

It applies to all of them, but the evaluation set has to be your own: an e-commerce RAG tests "product specs + inventory + shipping times", a SaaS RAG tests "features + pricing + integration docs", a content site tests "related articles + citation accuracy". The framework is shared; the eval set is not. Write your 50 queries around your own business on Day 1.