Context Selection — Why RAG Recall Does Not Equal Accuracy
Anyone who's shipped RAG has hit this number: 90% recall (the top-k docs include the correct answer 90% of the time), yet end-to-end accuracy is only 60%. Where did the other 30 points go?
Retrieval isn't the problem. Selection never happened.
Lost in the Middle — The 30% Empirical Gap
In July 2023, Nelson Liu and colleagues at Stanford published Lost in the Middle (arXiv:2307.03172), exposing a hidden flaw in every long-context LLM with a single experiment:
Experiment design:
- Multi-document QA task (20 docs + 1 question)
- Only 1 doc contains the answer; 19 are distractors (topic-relevant, no answer)
- Place answer-doc at different positions (1st / 5th / 10th / 15th / 20th)
- Measure accuracy at each position
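The setup above can be sketched as a small harness. This is not the paper's code — `ask_llm` is a hypothetical completion callable, and the doc contents are placeholders:

```python
import random

def build_context(distractors: list[str], answer_doc: str, position: int) -> str:
    """Insert the answer-bearing doc at a given 0-based position among distractors."""
    ctx = distractors[:]
    ctx.insert(position, answer_doc)
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(ctx))

def position_accuracy(ask_llm, question, answer, answer_doc, distractors,
                      positions, trials=50):
    """Measure accuracy with the answer doc placed at each position (Liu et al. setup)."""
    results = {}
    for pos in positions:
        correct = 0
        for _ in range(trials):
            random.shuffle(distractors)  # fresh distractor order each trial
            prompt = build_context(distractors, answer_doc, pos) + "\n\nQ: " + question
            correct += answer.lower() in ask_llm(prompt).lower()
        results[pos] = correct / trials
    return results
```

Plotting `results` against `positions` is what produces the U-shaped curve.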
Result: every model tested (GPT-3.5, Claude, LLaMA) showed a U-shaped curve — peaks when answer sits at start or end, drops 30%+ when buried in the middle.
(Figure omitted — see the figure on page 3 of the original paper.)
In a prompt with 10 retrieved docs, the model is effectively blind to docs 4 through 7. Even at 100% recall, the model can't use an answer that sits in the wrong position.
This is not a bug. It is the joint product of transformer attention and training-data distribution — in training data, the start and end positions carry the important content (paper abstracts, article conclusions). Anthropic reproduced the effect in its 200K-context models, and its long-context tips doc explicitly recommends putting key information at the end.
Three Root Causes — Not Just Position
Position bias is the most visible cause, but "RAG recall ≠ correct answer" has three:
1. Position bias — Middle blindness
10 retrieved chunks, 5 of them relevant, sorted by descending relevance — but relevance rank #1 is not attention rank #1, and the relevance order buries useful chunks in the attention-blind middle. Actively move the most critical chunk to the end.
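One common mitigation — not from the paper, but a widely used trick sometimes called "sandwich" or edge-first ordering — reorders relevance-sorted chunks so the strongest ones land at the start and end, pushing the weakest into the blind middle:

```python
def sandwich_order(chunks: list) -> list:
    """Reorder relevance-sorted chunks to exploit the U-shaped attention curve.

    Input: chunks sorted by descending relevance (chunks[0] = most relevant).
    Output: rank 1 first, rank 2 last, rank 3 second, rank 4 second-to-last...
    so the weakest chunks end up in the attention-blind middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For example, relevance order `[1, 2, 3, 4, 5]` becomes `[1, 3, 5, 4, 2]` — the two best chunks sit at the edges.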
2. Distractor pollution
The 2024 paper Found in the Middle (He et al.) shows that the more distractors in the context (docs that are topic-similar but answer-empty), the higher the hallucination rate — the model "assembles" relevant nouns from the distractors into fake answers.
Example: an Australian tax RAG retrieves 10 chunks — 3 about GST, 7 about income tax. The user asked about GST, so the 7 income-tax chunks are distractors. The model answers "GST is ..., note that income tax has ..." — packing income-tax details in as if they were GST facts. Recalling the right chunks alongside similar-but-wrong ones is pollution.
3. Instruction-following degradation
Anthropic's 200K-context evaluation doc shows that the longer the context, the lower the adherence to hard rules in the system prompt (output format, prohibitions). At 50K tokens with a JSON-output requirement, the model occasionally drops fields; at 200K, the field-dropping rate jumps noticeably.
This is the hidden cost of recalling more — you think "more retrieval is safer," when in reality the system instruction gets diluted.
Selection ≠ Retrieval
Retrieval finds candidates; selection picks which of them actually enter the context.
| Stage | Job | Target | Tools |
|---|---|---|---|
| Retrieval | Find top-N from corpus | Recall | bi-encoder / BM25 / hybrid |
| Selection | Pick top-K from top-N | Precision + position | rerank / filter / LLM-judge |
| Composition | Order top-K, stuff into prompt | Position, token budget | manual + templates |
Intro RAG tutorials teach only retrieval — top-5 straight into the prompt. That is the root cause of "90% recall, 60% accuracy".
Selection's Three-Stage Pipeline
Chapter 5 details each stage's toolchain. The overall shape:
Stage 1 — Filter (coarse pass)
Drop by hard rules:
- Stale (e.g. a 2023 policy doc discussing a 2026 event)
- Source blacklist (recycled CSDN articles)
- Length anomalies (< 50 or > 5000 chars)
Dirt cheap (O(n)) and drops 30-50% of the candidates.
Stage 2 — Rerank (fine pass)
Re-score with cross-encoder:
- The bi-encoder (the one used in retrieval) encodes query and doc separately — fast but imprecise
- A cross-encoder sees query + doc together: 30-50% more precise at 50-100x the cost — run it only on the top-50
Mainstream: Cohere rerank-3 / BGE-reranker-v2 / Anthropic contextual retrieval.
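A sketch of the rerank step, with the scorer injected so any cross-encoder backend fits — for instance, the `predict` method of a sentence-transformers `CrossEncoder` loaded with `BAAI/bge-reranker-v2-m3`, which returns one score per (query, doc) pair:

```python
def rerank(query: str, docs: list[str], score_pairs, top_k: int = 10):
    """Score (query, doc) pairs jointly with a cross-encoder and keep the top_k.

    score_pairs: callable taking a list of (query, doc) tuples and returning one
    relevance score per pair -- e.g. CrossEncoder("BAAI/bge-reranker-v2-m3").predict
    from sentence-transformers.
    """
    scores = score_pairs([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Because each pair is scored jointly, cost grows linearly with the candidate count — which is why this runs on the top-50, never the whole corpus.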
Stage 3 — LLM-as-judge (final pass)
Use a cheap model (Haiku or 4o-mini) as the judge:
- Input: query + single doc
- Output: does this answer the query? yes/no + reason
At 5-10x the rerank cost, it catches the "topic-related but off-target" passages that rerank lets through.
The three combined: retrieval 100 → filter 50 → rerank 10 → judge 3 — only 3 enter the context. Recall stays intact; distractor pollution drops to near zero.
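The funnel reads as a short skeleton; `retrieve`, `hard_filter`, `rerank`, and `judge` here are placeholder callables standing in for the stage implementations:

```python
def select(query, retrieve, hard_filter, rerank, judge,
           n_retrieve=100, n_filter=50, n_rerank=10, n_final=3):
    """Retrieval -> filter -> rerank -> judge funnel (100 -> 50 -> 10 -> 3)."""
    candidates = retrieve(query, k=n_retrieve)                    # recall stage
    candidates = [c for c in candidates if hard_filter(c)][:n_filter]  # hard rules
    candidates = rerank(query, candidates)[:n_rerank]             # cross-encoder pass
    return [c for c in candidates if judge(query, c)][:n_final]   # final judge pass
```

The per-stage caps are what make the cost model work: the expensive judge only ever sees `n_rerank` docs, the mid-cost reranker only `n_filter`.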
JR Real Case: daily-jobs Selection Pipeline
JR Academy runs 3 daily-jobs routines per day — one each for the ai-essentials, ai-engineer, and ai-engineer-rag bootcamp tracks — recommending 3 jobs per cohort. It is textbook selection:
# tested: 2026-04-26 · routine: Daily Jobs - AI Engineer Bootcamp
Stage 0: Fetch (retrieval)
WebFetch au.linkedin.com/jobs/junior-machine-learning-jobs
WebFetch au.linkedin.com/jobs/ai-engineer-jobs
→ ~30 jobs fetched
Stage 1: Filter (hard rules)
Drop recruiter spam (e.g. YO IT Consulting blasting the same job across many locations)
Drop SEEK sources (403 by default — unusable)
Drop postings older than 7 days
→ ~18 jobs remain
Stage 2: Rerank (by tier label)
Assign 3 tier labels:
aspirational (stretch tier — Mid-Senior + T0/T1 brand)
actionable (reachable tier — Junior/Graduate + AU Big 4)
special (special opportunities — Intern / Graduate Program / 2026-27 start)
→ keep the top 3-5 candidates per tier
Stage 3: LLM-as-judge (persona-driven)
Write a whyForLearners for each candidate from a "JR Bootcamp student" perspective
≥ 30 characters, must be specific, no template fill-ins
→ pick the best 1 per tier → 3 finals
Final context into the prompt: full text of the 3 jobs + 3 whyForLearners
No fancy algorithms — hard filters, tier-based reranking, and one judge model. The 3 jobs that emerge have passed distractor screening and position ordering (tier is the position signal).
Students see "3 hand-picked jobs that matter to you", not "30 fetched, 3 chosen". That is the product value of selection.
High Recall + Strong Selection vs Low Recall + Light Selection — the Trade-off
| Dimension | High recall + strong selection | Low recall + light selection |
|---|---|---|
| Recall | High (top 50 ≥ 95%) | Medium (top 10 ≈ 75-85%) |
| Selection eng | Three-stage + LLM judge | One rerank pass |
| Token cost | High (50 chunks through rerank) | Low (10 straight to prompt) |
| Latency | +2-5 sec | +200ms |
| Distractor risk | High — must filter hard | Low — but the miss rate is high |
| Fits | Knowledge base QA (90%+ accuracy) | Real-time chat (latency-sensitive) |
| Doesn't fit | Real-time chat | Strict compliance (missed recall unacceptable) |
JR's experience: external production RAG goes high-recall + strong-selection; internal tools go low-recall + light-selection. Different latency, cost, and quality bars.
Takeaway
Recall is retrieval's job. Accuracy is selection's job. Stuffing top-10 straight into prompt is a toy demo. Production RAG must split into retrieval / selection / composition. The middle stage (selection) is where the engineering work between 90% recall and 90% accuracy lives.
References
- Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
- He et al. (2024). Found in the Middle. arXiv:2403.04797.
- Anthropic. Long context tips.
- Anthropic (2023). Introducing 100K Context Windows.
- Liu. lost-in-the-middle (GitHub repository).
Production case: JR Academy omni-report Daily Jobs — a retrieval → filter → rerank → LLM-judge pipeline.
❓ FAQ
The most commonly searched questions on this chapter's topic.
RAG recall is 90% but answer accuracy is only 60% — where is the problem?
90% recall means the correct answer is in the top-k but the LLM didn't use it. Three causes: (1) Lost-in-the-Middle attention decay, (2) distractor-passage pollution, (3) instruction-following degradation in long contexts. The fix is adding a selection layer, not further tuning retrieval.
Are selection and rerank the same thing?
Rerank is one segment of selection, not the whole of it. Full selection has three stages: filter (hard rules on staleness / source / length anomalies) → rerank (cross-encoder fine pass) → LLM-as-judge (Haiku/4o-mini final pass). 100 candidates → 3 actually enter the prompt.
When can you skip selection?
Only when three conditions hold at once: a real-time chat assistant + ≤ 5 recalled docs + latency-sensitive. Otherwise production RAG must do selection. JR's internal rule: external student-facing systems recall more and filter hard; internal dashboards recall less and filter lightly.
What does a full selection pipeline cost per month?
At 10K queries/day: Cohere rerank $30/mo + LLM-as-judge (Haiku) $300/mo + Pinecone Standard vector DB $70/mo ≈ $400/mo. Swapping in self-hosted BGE + Haiku judging ≈ $120/mo (excluding GPU amortization). The LLM input tokens saved versus no selection typically pay this back 5-10x.
Should an internal enterprise knowledge base use selection?
Yes. Enterprise KB sources are messy (Confluence / Slack / stale Google Doc versions), so distractors are worse than in public-web RAG. Start by adding a single LLM-as-judge layer — at $0.0005 per query it can pull answer accuracy from 60% to 85%+. Filter + rerank come later as optimizations.
What is the most common failure mode of selection?
Treating selection as "re-sorting" instead of "actually cutting". Reranking 100 candidates and still stuffing the top-20 into the prompt is not selection. The right way: after rerank, apply a score threshold (drop anything < 0.3) plus a hard cap of top-3 to top-5. If you can't cut, you haven't selected.
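The threshold-plus-cap rule reads as a few lines; the 0.3 threshold and top-5 cap follow the numbers in the answer and should be tuned per reranker:

```python
def cut(reranked: list[tuple[str, float]], threshold: float = 0.3, cap: int = 5):
    """Selection means actually cutting: a score threshold AND a hard top-k cap.

    reranked: (doc, score) pairs; anything below `threshold` is dropped, and
    at most `cap` survivors are kept even if more clear the threshold.
    """
    ordered = sorted(reranked, key=lambda pair: pair[1], reverse=True)
    kept = [(doc, score) for doc, score in ordered if score >= threshold]
    return kept[:cap]
```

If `cut` regularly returns the full cap with every score well above threshold, the threshold is too loose — the cap should rarely be the binding constraint.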