03

Context Selection — Why RAG Recall Does Not Equal Accuracy

⏱️ 25 min


Anyone who's shipped RAG has hit this number: 90% recall (top-k docs include the correct answer 90% of the time), but end-to-end accuracy is only 60%. Where did the 30% go?

Retrieval isn't the problem. Selection never happened.

Lost in the Middle — The 30% Empirical Gap

In July 2023, Nelson Liu and colleagues at Stanford published Lost in the Middle (arXiv:2307.03172), exposing a hidden flaw in long-context LLMs with a single experiment:

Experiment design:

  1. Multi-document QA task (20 docs + 1 question)
  2. Only 1 doc contains the answer; 19 are distractors (topic-relevant, no answer)
  3. Place answer-doc at different positions (1st / 5th / 10th / 15th / 20th)
  4. Measure accuracy at each position

Result: every model tested (GPT-3.5, Claude, LLaMA) showed a U-shaped accuracy curve: highest when the answer sits at the start or the end, dropping 30%+ when it is buried in the middle.

[Figure from the Lost in the Middle paper: see the figure on page 3 of the original.]
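The experiment above can be sketched as a position sweep. This is a minimal sketch, not the paper's actual harness; `ask` is a placeholder for an LLM call (prompt in, answer string out):

```python
def position_sweep(question, gold_answer, answer_doc, distractors, positions, ask):
    """Insert the gold doc at each position among the distractors,
    ask the model, and record whether the gold answer comes back.
    'ask' is a hypothetical stand-in for an LLM completion call."""
    accuracy = {}
    for pos in positions:
        docs = distractors[:pos] + [answer_doc] + distractors[pos:]
        prompt = "\n\n".join(docs) + f"\n\nQuestion: {question}"
        accuracy[pos] = gold_answer.lower() in ask(prompt).lower()
    return accuracy
```

Plotting accuracy against position reproduces the U-shaped curve: edges high, middle low.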

With 10 retrieved docs, the model effectively cannot see the content sitting at roughly positions 4 through 7. Even at 100% recall, the model cannot use information placed in the wrong position.

This is not a bug. It is the joint product of transformer attention and training-data distribution: in training data, start and end positions carry more weight (paper abstracts, article conclusions). Anthropic reproduced the effect in its 200K-context models, and its Long context tips doc explicitly recommends putting key information at the end.

Three Root Causes — Not Just Position

Position bias is the most visible cause, but "RAG recall ≠ correct answer" has three root causes:

1. Position bias — Middle blindness

With 10 retrieved chunks, 5 of them relevant, sorted by relevance: relevance rank #1 is not attention rank #1. The fix is to actively place the most critical chunk at the end.
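A minimal sketch of that fix, assuming the chunks arrive best-first from the reranker (the function name is illustrative; this is similar in spirit to LangChain's LongContextReorder transformer):

```python
def reorder_for_position_bias(chunks_by_relevance):
    """Place the strongest chunks at the edges of the context and push
    the weakest into the middle 'blind spot'. Input is sorted best-first
    (rank #1 at index 0); rank #1 ends up last, rank #2 first."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            back.append(chunk)   # even ranks collect at the back, rank #1 last
        else:
            front.append(chunk)  # odd ranks fill the front
    return front + back[::-1]
```

For five rank-ordered chunks `[1, 2, 3, 4, 5]` this yields `[2, 4, 5, 3, 1]`: both edges hold strong chunks, the weakest sits in the middle.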

2. Distractor pollution

The 2024 paper Found in the Middle (He et al.) shows that the more distractors in the context (docs that are topic-similar but contain no answer), the higher the hallucination rate: the model "assembles" relevant nouns from the distractors into fake answers.

An Australian tax RAG retrieves 10 chunks: 3 about GST, 7 about income tax. The user asked about GST, so the 7 income-tax chunks are distractors. The model answers "GST is ..., and note that income tax has ...", packing income-tax details in as if they were GST facts. Recalling the right chunks while also recalling similar-but-wrong ones is pollution.

3. Instruction-following degradation

Per Anthropic's 200K context evaluation doc: the longer the context, the lower the adherence to hard rules in the system prompt (output format, prohibitions). At 50K tokens with a JSON output requirement, the model occasionally drops fields; at 200K, the field-dropping rate jumps noticeably.

This is the hidden cost of recalling more: you think "more retrieval is safer", but in reality the system instruction gets diluted.

Selection ≠ Retrieval

Retrieval finds candidates; selection picks which ones actually enter the context.

| Stage | Job | Target | Tools |
| --- | --- | --- | --- |
| Retrieval | Find top-N from corpus | Recall | bi-encoder / BM25 / hybrid |
| Selection | Pick top-K from top-N | Precision + position | rerank / filter / LLM-judge |
| Composition | Order top-K, stuff into prompt | Position, token budget | manual + templates |

Intro RAG tutorials only teach retrieval: top-5 straight into the prompt. That is the root cause of "90% recall, 60% accuracy".

Selection's Three-Stage Pipeline

Chapter 5 details each stage's toolchain. The overall shape:

Stage 1 — Filter (coarse pass)

Drop by hard rules:

  • Stale (a 2023 policy doc discussing a 2026 event)
  • Source blacklist (recycled CSDN articles)
  • Length anomalies (< 50 or > 5000 chars)

Cost is dirt cheap (O(n)), and it drops 30-50% of candidates.

Stage 2 — Rerank (fine pass)

Re-score with cross-encoder:

  • The bi-encoder used in retrieval encodes query and doc separately: fast but imprecise
  • A cross-encoder reads query + doc together: 30-50% more precise at 50-100x the cost, so run it only on the top-50

Mainstream: Cohere rerank-3 / BGE-reranker-v2 / Anthropic contextual retrieval.
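A minimal rerank sketch. The scoring function is pluggable so the example stays self-contained; in production it would wrap a real cross-encoder, for instance `sentence_transformers.CrossEncoder("BAAI/bge-reranker-v2-m3").predict(...)`:

```python
def cross_encoder_rerank(query, docs, score_fn, top_k=10):
    """Stage-2 rerank: score each (query, doc) pair jointly, keep top_k.

    score_fn is any (query, doc) -> float callable; it stands in for a
    cross-encoder such as BGE-reranker-v2 or Cohere rerank-3.
    Returns (doc, score) pairs, best first."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Keeping the scores in the output matters: the later "threshold plus hard cap" cut needs them.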

Stage 3 — LLM-as-judge (final pass)

A cheap model (Claude Haiku or GPT-4o-mini) acts as judge:

  • Input: query + single doc
  • Output: does this answer the query? yes/no + reason

It costs 5-10x rerank, but catches the "topic-related but off-target" passages that rerank lets through.
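A minimal judge sketch. The prompt wording is illustrative, and `complete` is a placeholder for any cheap-model completion call (e.g. a wrapper around Haiku or 4o-mini), not a real SDK function:

```python
import json

# Hypothetical judge prompt; {{ }} escapes literal braces for str.format.
JUDGE_PROMPT = """You are a relevance judge.
Query: {query}
Document: {doc}
Does this document directly answer the query?
Reply as JSON: {{"answers_query": true/false, "reason": "..."}}"""

def llm_judge(query, docs, complete):
    """Stage-3 judge: keep only docs the cheap model says answer the query.
    'complete' is any prompt -> completion-string callable (assumption)."""
    kept = []
    for doc in docs:
        raw = complete(JUDGE_PROMPT.format(query=query, doc=doc))
        verdict = json.loads(raw)
        if verdict["answers_query"]:
            kept.append(doc)
    return kept
```

Judging one doc per call (rather than all docs in one prompt) deliberately avoids the position bias this chapter is about.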

Combined: retrieval 100 → filter 50 → rerank 10 → judge 3, and those 3 enter the context. Recall stays intact; distractor pollution drops to zero.

JR Real Case: daily-jobs Selection Pipeline

JR Academy runs 3 daily-jobs routines per day, one for each bootcamp track (ai-essentials, ai-engineer, ai-engineer-rag), recommending 3 jobs per cohort. It is textbook selection:

# tested: 2026-04-26 · routine: Daily Jobs - AI Engineer Bootcamp

Stage 0: Fetch (retrieval)
  WebFetch au.linkedin.com/jobs/junior-machine-learning-jobs
  WebFetch au.linkedin.com/jobs/ai-engineer-jobs
  → ~30 jobs fetched

Stage 1: Filter (hard rules)
  Drop recruiter spam (e.g. YO IT Consulting blasting the same job across many locations)
  Drop SEEK-sourced listings (403 by default, treated as invalid)
  Drop postings older than 7 days
  → ~18 jobs remain

Stage 2: Rerank (by tier label)
  Assign 3 tier labels:
    aspirational (stretch tier: Mid-Senior + T0/T1 brand)
    actionable (within-reach tier: Junior/Graduate + AU Big 4)
    special (special opportunities: Intern / Graduate Program / 2026-27 start)
  → keep the top 3-5 candidates per tier

Stage 3: LLM-as-judge (persona-based)
  Write a whyForLearners for each candidate from the perspective of a JR Bootcamp student
  ≥ 30 characters, must be specific, no template fill-ins
  → pick the 1 best per tier → 3 finals

Final context into the prompt: 3 full job posts + 3 whyForLearners

No fancy algorithms: three filter stages plus one judge model. The 3 jobs that come out have passed distractor screening and position ordering (the tier is the position signal).

Students see "3 hand-picked jobs that matter to you", not "30 fetched, 3 chosen". That is the product value of selection.

High Recall + Strong Selection vs Low Recall + Weak Selection: the Trade-off

| Dimension | High recall + strong selection | Low recall + weak selection |
| --- | --- | --- |
| Recall | High (top 50 ≥ 95%) | Medium (top 10 ≈ 75-85%) |
| Selection engineering | Three-stage + LLM judge | One rerank pass |
| Token cost | High (50 chunks through rerank) | Low (10 straight to prompt) |
| Latency | +2-5 s | +200 ms |
| Distractor risk | High, must filter hard | Low, but miss rate is high |
| Fits | Knowledge-base QA (90%+ accuracy target) | Real-time chat (latency-sensitive) |
| Doesn't fit | Real-time chat | Strict compliance (missed recall unacceptable) |

JR's experience: external production RAG goes high-recall + strong-selection; internal tools go low-recall + weak-selection. Different latency, cost, and quality bars.

Takeaway

Recall is retrieval's job. Accuracy is selection's job. Stuffing top-10 straight into prompt is a toy demo. Production RAG must split into retrieval / selection / composition. The middle stage (selection) is where the engineering work between 90% recall and 90% accuracy lives.


References

  1. Liu et al. (2023). Lost in the Middle. arXiv:2307.03172.
  2. He et al. (2024). Found in the Middle. arXiv:2403.04797.
  3. Anthropic. Long context tips.
  4. Anthropic (2023). 100K context windows.
  5. Liu. lost-in-the-middle (GitHub repository).

Production case: JR Academy omni-report Daily Jobs: retrieval plus a three-stage selection pipeline (filter / rerank / LLM-judge).


❓ FAQ

The most-searched questions about this chapter's topic.

RAG recall is 90% but answer accuracy is only 60%. Where is the problem?

90% recall means the correct answer is in the top-k but the LLM did not use it, for three reasons: (1) Lost-in-the-Middle attention decay, (2) distractor-passage pollution, (3) degraded instruction-following in long contexts. The fix is to add a selection layer, not to keep tuning retrieval.

Are selection and rerank the same thing?

Rerank is one stage of selection, not the whole of it. Full selection has three stages: filter (hard rules against stale, blacklisted, or length-anomalous docs) → rerank (cross-encoder fine pass) → LLM-as-judge (Haiku/4o-mini final pass). 100 candidates → 3 that actually enter the prompt.

When can you skip selection?

Only when all three conditions hold: real-time chat assistant + ≤ 5 recalled docs + latency-sensitive. Otherwise production RAG must do selection. JR's internal rule: external student-facing systems recall more and filter hard; internal dashboards recall less and filter lightly.

What does a full selection pipeline cost per month?

At 10K queries/day: Cohere rerank $30/mo + LLM-as-judge (Haiku) $300/mo + Pinecone Standard vector DB $70/mo ≈ $400/mo. Switching to self-hosted BGE + Haiku judging ≈ $120/mo (excluding GPU amortization). The LLM input-token cost saved by selection typically pays this back 5-10x.

Should an internal enterprise knowledge base use selection?

Yes. Enterprise KB sources are messy (Confluence / Slack / stale Google Docs), so distractors are worse than in public-web RAG. Start by adding a single LLM-as-judge layer: at roughly $0.0005 per query it can lift answer accuracy from 60% to 85%+. Filter + rerank are follow-up optimizations.

What is the most common selection failure mode?

Turning selection into "re-sorting" instead of "actually cutting". Reranking 100 candidates and still stuffing the top-20 into the prompt is not selection. The correct move: after rerank, apply a score threshold (drop anything below 0.3) plus a hard cap of top-3 to top-5. If nothing gets cut, no selection happened.
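The "threshold plus hard cap" rule can be sketched in a few lines. Function name and the 0.3 threshold follow the text above; the input format is an assumption (best-first `(doc, score)` pairs, as a reranker would emit):

```python
def cut_after_rerank(scored, threshold=0.3, cap=5):
    """Selection must cut, not just re-sort: drop below-threshold docs,
    then enforce a hard cap. 'scored' is a best-first list of
    (doc, score) pairs from the reranker."""
    kept = [(doc, score) for doc, score in scored if score >= threshold]
    return kept[:cap]
```

If this function routinely returns as many docs as it received, the threshold is too loose and no selection is happening.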