Build a Production RAG in 7 Days — A Hands-On Roadmap
The previous nine chapters covered the engineering moves. This last chapter packs them into one real project: build a production RAG in 7 days, complete with an eval set, monitoring, and a launch checklist.
By the end you'll have: a working RAG, a 50-query eval set plus an auto-scoring script, a baseline cost/latency/accuracy report, and a GitHub repo for your resume.
Day 1: Pick a Problem and Write the Eval Set (the most important day)
Why eval first: 80% of self-taught RAG learners skip this step and jump straight to LangChain. Three days later the RAG runs, but they have no idea whether it's any good, because there is no ground truth.
Pick a concrete domain (Australian tax Q&A / company wiki / Anthropic SDK docs / Kubernetes operator knowledge base). Collect 100-300 documents.
Write 50 queries. Each needs a ground truth + at least one source document ID.
# tested: 2026-04-26 · eval-set format
{
  "id": "Q-001",
  "query": "What is the GST registration threshold in Australia?",
  "ground_truth": "$75K AUD annual turnover (sole traders and companies)",
  "expected_source_ids": ["ato-gst-registration-001"],
  "category": "tax-threshold"
}
Day 1 deliverable: eval/queries.json (50) + corpus/ (100-300 docs) + ingestion script.
Trap: every query is a softball (answerable from a single paragraph in a single doc). Mix in 30% hard ones: cross-doc reasoning, negation, time-sensitive questions.
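Before calling Day 1 done, sanity-check the file. A minimal sketch (the function and field names follow the eval-set format above; nothing here is from any SDK):

```python
def validate_eval_set(queries: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the eval set is usable."""
    problems = []
    if len(queries) < 50:
        problems.append(f"only {len(queries)}/50 queries")
    for q in queries:
        # Every query needs a ground truth and at least one source doc ID
        for field in ("id", "query", "ground_truth", "expected_source_ids"):
            if not q.get(field):
                problems.append(f"{q.get('id', '?')}: missing {field}")
    return problems

queries = [{"id": "Q-001",
            "query": "What is the GST registration threshold in Australia?",
            "ground_truth": "$75K AUD annual turnover",
            "expected_source_ids": ["ato-gst-registration-001"]}]
print(validate_eval_set(queries))  # ['only 1/50 queries']
```

Run it against eval/queries.json before Day 2; a missing ground truth found now costs minutes, found on Day 6 it costs hours.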
Day 2: Bi-Encoder Retrieval and Vector DB
Basic RAG: embed corpus → vector DB → embed query → ANN top 50.
Stack options (by setup speed):
- Fastest: Voyage AI embeddings + Pinecone serverless — ~30 min
- Medium: OpenAI text-embedding-3-small + Qdrant Cloud — ~1 hour
- Self-hosted: BGE-M3 + local Qdrant — half a day
# tested: 2026-04-26 · pinecone-client@4.x
from pinecone import Pinecone
from voyageai import Client as Voyage

vo = Voyage(api_key=...)
pc = Pinecone(api_key=...)
index = pc.Index("rag-eval")

# Ingestion: chunk each doc, embed the chunks, upsert with metadata
for doc in corpus:
    chunks = chunk_doc(doc, chunk_size=500, overlap=50)
    embeddings = vo.embed([c.text for c in chunks], model="voyage-3").embeddings
    index.upsert([(c.id, e, {"text": c.text, "doc_id": doc.id})
                  for c, e in zip(chunks, embeddings)])

# Query: embed the query, ANN search for the top-k chunks
def retrieve(query: str, k: int = 50):
    q_emb = vo.embed([query], model="voyage-3").embeddings[0]
    return index.query(vector=q_emb, top_k=k, include_metadata=True)
Day 2 deliverable: retrieval.py + recall@50 across the 50 queries (target ≥ 90%; under 90% usually means chunking is broken).
Trap: one-size-fits-all chunk size. Technical docs (500) and long-form regulations (1500) shouldn't chunk the same.
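The recall@50 number is plain set arithmetic over the retrieval results. A sketch, assuming you've already pulled the doc_id metadata out of each query's top-k hits (the helper itself is illustrative, not from any library):

```python
def recall_at_k(retrieved_doc_ids: list[str], expected_doc_ids: list[str]) -> float:
    """Fraction of expected source docs that appear anywhere in the top-k results."""
    hits = sum(1 for doc_id in expected_doc_ids if doc_id in retrieved_doc_ids)
    return hits / len(expected_doc_ids)

# One eval query: 1 of the 2 expected docs surfaced in the top-k
print(recall_at_k(["ato-001", "ato-007"], ["ato-001", "ato-042"]))  # 0.5
```

Average this over all 50 queries for the Day 2 deliverable; note it measures the docs a chunk came from, not the chunks themselves.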
Day 3: Cross-Encoder Rerank
Per Chapter 5, rerank the top 50 down to the top 10.
# tested: 2026-04-26 · sentence-transformers@3.0.x
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list, top_k: int = 10):
    pairs = [(query, c.metadata["text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    # Sort by score only; sorting the raw (score, candidate) tuples would try
    # to compare candidate objects on score ties and raise a TypeError
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return ranked[:top_k]
No GPU? Use the Cohere Rerank API instead.
Expected Day 2 → Day 3 lift: the top-10 hit rate should beat raw top-50 ordering by 5-15%. No lift usually means the wrong rerank model (running Chinese queries through an English-only model is a common mistake).
Day 3 deliverable: rerank.py + nDCG@10 across 50 queries vs raw retrieval baseline.
Trap: forgetting to normalize scores. Cross-encoder raw scores aren't 0-1; they can range from roughly -10 to 10. Apply a sigmoid before thresholding.
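The sigmoid squash is one line. A sketch (bge-reranker logits land roughly in this range, but the exact spread varies by model, so calibrate any threshold on your own eval set):

```python
import math

def normalize_score(raw: float) -> float:
    """Sigmoid: maps a raw cross-encoder logit into (0, 1)."""
    return 1 / (1 + math.exp(-raw))

print(normalize_score(0.0))            # 0.5 — the "undecided" midpoint
print(round(normalize_score(8.0), 4))  # 0.9997 — a confident match
```

After normalization a fixed cutoff like 0.5 at least means the same thing across runs, which raw logits don't.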
Day 4: LLM-as-Judge and Full Selection
Final selection stage: use Haiku to decide which of the top 10 actually answer the query, capping the context at 3 documents.
# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_relevance(query: str, doc_text: str) -> tuple[bool, str]:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content":
            f"Query: {query}\nDoc: {doc_text}\n\n"
            f"Can this Doc directly answer the Query? "
            f"Answer yes/no first, then give a one-sentence reason."}]
    )
    text = resp.content[0].text
    return text.lower().startswith("yes"), text
# Pipeline: retrieve (Day 2) → rerank (Day 3) → judge (Day 4)
def select_for_context(query: str) -> list[str]:
    candidates = retrieve(query, k=50)        # Day 2
    reranked = rerank(query, candidates, 10)  # Day 3
    selected = [c for _, c in reranked
                if judge_relevance(query, c.metadata["text"])[0]]
    # Return plain text, most relevant first, capped at 3 docs
    return [c.metadata["text"] for c in selected[:3]]
Day 4 deliverable: full selection pipeline + precision@3 across 50 queries.
Trap: the judge is too strict and rejects correct docs. Run 100 queries, inspect the disagreements between the judge and your eval-set labels, and tune the prompt.
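The disagreement check is easy to quantify once you log the judge's verdict next to the eval-set label for each (query, doc) pair. A sketch with illustrative field names (not from any framework):

```python
def false_reject_rate(records: list[dict]) -> float:
    """Share of truly relevant docs the judge rejected — the too-strict symptom."""
    relevant = [r for r in records if r["label_relevant"]]
    rejected = [r for r in relevant if not r["judge_said_yes"]]
    return len(rejected) / len(relevant) if relevant else 0.0

records = [
    {"label_relevant": True,  "judge_said_yes": True},
    {"label_relevant": True,  "judge_said_yes": False},  # judge too strict here
    {"label_relevant": False, "judge_said_yes": False},  # correct rejection
]
print(false_reject_rate(records))  # 0.5
```

A rising false-reject rate after a prompt tweak tells you the tweak made the judge stricter, not better.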
Day 5: Prompt + Budgeting + Caching
Per Chapter 4: a system prompt with cache_control: ephemeral (up to ~90% cost saving on cached input tokens); max_tokens 2048 for JSON output or 4096 for free-form; most relevant doc last (the Lost in the Middle effect).
# tested: 2026-04-26 · anthropic@0.40.0
def answer(query: str) -> str:
    selected = select_for_context(query)
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,  # fixed role + output format, never changes
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content":
            "Reference material (most relevant last):\n\n"
            + "\n\n---\n\n".join(reversed(selected))
            + f"\n\nQuestion: {query}"}]
    ).content[0].text
Day 5 deliverable: end-to-end pipeline + token usage + per-query cost.
Trap: cache misconfigured. Anthropic prompt caching requires an exact prefix match — a single extra space at the end of the system prompt invalidates the whole cache. Lock the system prompt in a constant.
Day 6: Evaluation and Failure Analysis
Run all 50 queries and record three things: accuracy (LLM-as-judge auto-scoring against a 5-dimension rubric), latency (p50/p95/p99), and cost ($/query).
Pick 10 failure cases. Categorize: retrieval miss (→ Day 2 chunking), rerank ordering (→ Day 3 swap models), judge too strict (→ Day 4 tune prompt), LLM answered wrong question (→ Day 5 system prompt).
Day 6 deliverable: eval/results.md with per-query detail + failure categorization + tuning backlog.
Trap: only looking at aggregate metrics. The details of a single failed RAG case carry 10× the information of an average score.
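The latency percentiles need no extra dependencies. A nearest-rank sketch (real reports often reach for statistics.quantiles or numpy instead):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile; p in [0, 100]."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies = [820, 910, 1050, 1200, 3400]  # ms, one per eval query
print(percentile(latencies, 50))  # 1050
print(percentile(latencies, 99))  # 3400
```

Note how one 3.4 s outlier dominates p99 while leaving p50 untouched — exactly why the report needs all three numbers, not an average.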
Day 7: Monitoring + Deployment + Launch
Ship a minimum viable deployment (FastAPI, one-click hosted on Render / Vercel / Modal). Wire up:
- Token monitoring: log input/output/cache-hit tokens per query + daily cumulative $
- Latency: p50/p95 alerts
- Failure rate: judge rejects all candidates, or retrieval returns 0 hits → alert
- Eval regression: weekly auto eval, accuracy drop >5% triggers alert
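Token monitoring can start as simple arithmetic on the usage object every Anthropic response returns. The per-token prices below are illustrative placeholders — check the current pricing page before trusting the dollar figures:

```python
# Illustrative per-million-token prices — NOT current, check the pricing page
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00
CACHE_READ_DISCOUNT = 0.1  # cached input billed at ~10% of the input rate

def query_cost_usd(input_tokens: int, output_tokens: int,
                   cache_read_tokens: int = 0) -> float:
    """Per-query cost: uncached input + discounted cache reads + output."""
    uncached = input_tokens - cache_read_tokens
    return (uncached * PRICE_IN_PER_MTOK
            + cache_read_tokens * PRICE_IN_PER_MTOK * CACHE_READ_DISCOUNT
            + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

# 10K input tokens (8K of them cache hits), 500 output tokens
print(round(query_cost_usd(10_000, 500, 8_000), 4))  # 0.0159
```

Log this per request and sum it daily; the gap between the cached and uncached number is also your first sanity check that the Day 5 cache is actually hitting.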
Day 7 deliverable: live URL + GitHub README (architecture + eval + cost) + 1 blog post (traps you hit in 7 days).
The blog matters. Post it to dev.to / Medium / Zhihu to build GEO signal (Chapter 1's Q19 problem). Writing the blog is the final step that makes your RAG production-ready.
JR's Cadence: 7 Routines in 7 Weeks
JR's omni-report didn't get its 17 routines in a week. The first 7 core routines took 7 weeks: one new routine per week, and each had to run stably before the next one started.
Building one RAG in 7 days follows the same cadence: don't chase perfection. Chase "it runs, the eval is trustworthy, and failures can be localized". Start project two next week. Three months in, your portfolio holds a dozen RAG projects.
What You'll Be Able to Do After 7 Days
- Architecture-review any RAG project and point at which of the 5 context layers has the issue
- Tell a specific production RAG story in interviews (eval + failure cases + optimization path)
- Compare rerank models / vector DBs / memory approaches
- Debug "why is this LLM app underperforming"
Where to Go Next
- Student Discord — JR Bootcamp #context-engineering channel; post your repo for engineer reviews
- AI Engineer Bootcamp — JR's 12-week deep project course, 5 production RAG / Agent projects + resume + job-hunt support
- Open-source — Mem0 / Letta / LangChain all need Chinese examples. Ship a Chinese version of your RAG. Fastest way to build GEO signal
Takeaway
Reading 9 chapters of theory beats nothing. Building 1 RAG that runs + is trustworthy + has localizable failures beats 9 chapters of theory. 7 days is enough. Day 1's eval set is the most important day — 80% of failed self-study projects skipped eval and started building. Day 7's blog post is worth more than any fancy optimization.
Start Day 1 today — pick a domain you care about, write 5 queries, get the eval set started.
References
- Voyage AI. voyage-3 embedding model documentation.
- Pinecone. Serverless documentation — serverless vector DB quick start.
- Cohere. Rerank API documentation — managed rerank option.
- BGE. bge-reranker-v2-m3 model card — open-source reranker with strong Chinese-English support.
- Anthropic. Prompt caching — system prompt cache configuration.
- Anthropic. Tool use / structured output — JSON output best practices.
- RAG eval frameworks. Ragas — open-source RAG eval framework (optional Day 6 tool).
Production case: JR Academy omni-report — 7 core routines shipped over 7 weeks, proving "ship 1 per week" is sustainable.
🎓 Want to learn the full LLM engineering stack? Check out JR AI Engineer Bootcamp — 12 weeks from RAG to Agent to production deployment, includes resume coaching and job-hunt support.
❓ FAQ
The most commonly searched questions on this chapter's topic
Is 7 days really enough to build a RAG?
Enough to build a RAG that "runs, has a trustworthy eval, and has localizable failures" — not a perfect one. Day 1's 50-query evaluation set is the most critical step: 80% of failed self-study projects started building without an eval. Day 7's blog post is worth more than any fancy optimization (it builds GEO signal).
Can I build this without a GPU?
Yes, a CPU laptop is enough. Day 3's cross-encoder rerank can go through Cohere's managed API (~$0.001/query), and Day 5's LLM calls go through the Anthropic API. The only things running locally are Day 2's chunking and Day 6's evaluation scripts.
Where to go after the 7 days?
Pick one of three paths: (1) post your repo in the JR student Discord #context-engineering channel for engineer reviews; (2) the JR AI Engineer Bootcamp, a 12-week project course with 5 production RAG/Agent projects plus resume coaching; (3) contribute Chinese-language example PRs to Mem0 / Letta / LangChain to build open-source GEO signal.
I'm from a non-CS background — can I follow the 7-day RAG build?
Yes. All 7 days are Python calling APIs (the Anthropic / Cohere / Pinecone SDKs) — no ML code, no GPU tuning. Prerequisites: basic Python, running pip install, and reading a try/except. The 4-week JR Academy AI Builder track covers the prerequisites.
Use a framework like LangChain / LlamaIndex, or write from scratch?
Use a framework for Days 2-5 (LangChain recommended): a pipeline in 5 lines of code lets you focus on concepts instead of boilerplate. Write Day 6's evaluation from scratch and actually run the 50 queries: framework eval modules report scores without per-query detail, so they can't surface problems like Lost in the Middle.
Does the 7-day RAG apply to e-commerce / SaaS / content sites?
Yes, but the evaluation set must be your own: an e-commerce RAG tests product specs + inventory + shipping times, a SaaS RAG tests features + pricing + integration docs, a content site tests related articles + citation accuracy. The framework is shared; the eval set is not — write your 50 Day 1 queries against your own business.