Rerank — Turning Recall into Selection
Chapter 3 sketched the three-stage shape of selection (filter / rerank / LLM-judge). This chapter pulls the middle stage apart.
Why retrieval uses bi-encoder and rerank uses cross-encoder
Both encoders answer the same question ("how relevant is this doc to the query?") but in different ways:
Bi-encoder: query and doc go through the encoder separately, producing two independent embeddings, then cosine similarity.
Query ──→ Encoder ──→ q_vec ┐
├─→ cosine(q_vec, d_vec) → score
Doc ──→ Encoder ──→ d_vec ┘
Win: doc embeddings are computed offline and stored in a vector DB. Querying a 10M-doc corpus returns in milliseconds — only the query vector runs at runtime.
Cost: no cross-attention, so the model can't see specific match points. Query "Is Australia's GST 10%?" vs doc "Australia's consumption tax (GST) is currently 10%." — bi-encoder knows they're similar but can't tell whether the similarity is on "tax rate" or "country."
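A minimal sketch of the bi-encoder path using sentence-transformers (the model name here is illustrative, not a recommendation):
# bi-encoder sketch · model name is illustrative
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# offline step: embed the corpus once; in production these vectors live in a vector DB
docs = [
    "Australia's consumption tax (GST) is currently 10%.",
    "History of Australia's GST.",
]
doc_vecs = bi_encoder.encode(docs, convert_to_tensor=True)

# online step: embed only the query, then cosine similarity against the stored vectors
q_vec = bi_encoder.encode("Is Australia's GST 10%?", convert_to_tensor=True)
scores = util.cos_sim(q_vec, doc_vecs)[0]   # one similarity score per doc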
Cross-encoder: query and doc are concatenated and passed through the encoder together, outputting a 0-1 score.
[Query] [SEP] [Doc] ──→ Encoder ──→ score
Win: cross-attention sees token-to-token relationships. Precision lands 30-50% above bi-encoder (Sentence-Transformers comparison).
Cost: every query × doc requires a forward pass. 1 × 10M = 10M forwards. Can't be the first-stage filter — only the rerank.
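The same example through a cross-encoder, as a sketch (same library, same model as the rerank code later in this chapter):
# cross-encoder sketch · the (query, doc) pair is scored jointly
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("BAAI/bge-reranker-v2-m3")
score = cross_encoder.predict([
    ("Is Australia's GST 10%?",
     "Australia's consumption tax (GST) is currently 10%."),
])[0]
# one forward pass per pair: fine for 50 candidates, impossible for 10M docs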
The standard two-stage selection pipeline
Stage A (Retrieval, bi-encoder)
10M docs → embedded offline → vector DB
Query → embed → ANN top-50
~50ms, recall 90%+
Stage B (Rerank, cross-encoder)
Top-50 + Query → cross-encoder scores each pair → top-10
~200-500ms (batched on GPU)
Stage C (LLM-as-judge, optional)
Top-10 → cheap LLM (Haiku/4o-mini) yes/no → top-3
~1-2s, $0.001/query
The bi-encoder's top-N caps the rerank ceiling: anything Stage A misses can never be recovered downstream. N=50 is the common compromise between recall and rerank latency.
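In code the whole chain is a few lines of glue. A sketch where vector_db and embed stand in for your own retrieval stack, and rerank_top_k / judge are the functions defined later in this chapter:
# end-to-end glue sketch · vector_db and embed are placeholders for your own stack
q_vec = embed(query)                                          # Stage A: embed the query
candidates = vector_db.search(q_vec, top_k=50)                # Stage A: ANN lookup, ~50ms
top10 = rerank_top_k(query, [c.text for c in candidates])     # Stage B: cross-encoder rerank
top3 = [doc for _, doc in top10 if judge(query, doc)][:3]     # Stage C: optional LLM judge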
Mainstream rerank options (2026-04)
| Tool | Type | NDCG@10 | Cost | When to pick |
|---|---|---|---|---|
| Cohere rerank-3 | Managed | ≈ 0.75 | $2/1K query | No ML engineer |
| BGE-reranker-v2-m3 | self-host | ≈ 0.72 | ~10ms/query | GPUs + zh/en bilingual |
| Anthropic Contextual Retrieval | RAG kit | ≈ 0.78 | Anthropic API | All-in Anthropic |
| Voyage rerank-2 | Managed | ≈ 0.74 | $0.05/1M token | Small volume, token billing |
| Jina rerank | Managed + OSS | ≈ 0.71 | Free 1M token/mo | POC |
NDCG@10 figures come from vendor announcements and third-party BEIR benchmark comparisons; they vary roughly ±10% across corpora, so run your own eval.
- < 10K queries/mo → Cohere/Voyage
- > 100K queries/mo → BGE self-host (cost crushes managed)
- All-in Anthropic → Contextual Retrieval
- Chinese-heavy → BGE / Cohere
Rerank code (cross-encoder, BGE)
# tested: 2026-04-26 · sentence-transformers@3.0.1
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # first load downloads ~2GB

def rerank_top_k(query: str, docs: list[str], top_k: int = 10):
    """50 candidates → top_k. ~200ms per call on GPU with batch_size=32."""
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs, batch_size=32)
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]

# Stage A output from your vector DB feeds straight in
candidates = vector_db.search(query_embedding, top_k=50)
top10 = rerank_top_k(query, [c.text for c in candidates], top_k=10)
Without GPU batching, 50 pairs take around 5 seconds on CPU; production needs a GPU or a managed API.
LLM-as-judge — final filter on top of rerank
Rerank misses one failure class: docs that are topically relevant but don't actually answer the question. Query "Australia GST registration threshold": rerank ranks "History of Australia's GST" at #3. Both are about GST and both score high, but the doc never answers the threshold question.
A cheap model makes the call directly:
# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic()

def judge(query: str, doc: str) -> bool:
    """Haiku decides whether the doc can answer the query. ~$0.0001/call."""
    prompt = f"""You are a retrieval quality evaluator.
Query: {query}
Doc: {doc}
Can this Doc directly answer the Query? Answer only yes or no."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text.strip().lower().startswith("yes")
Cost: 10 candidates × $0.0001 = $0.001/query. 10K/day costs $10.
Skip when: realtime chat (users feel +1-2s latency), or retrieval is already stable (eval doesn't show judge gains).
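Wired in after rerank, the judge is a filter over the top-10. A sketch that runs the calls concurrently to keep latency down (judge is the function above; rerank_top_k output feeds in directly):
# judge the reranked top-10 concurrently, keep up to 3 that pass · a sketch, not a hardened client
from concurrent.futures import ThreadPoolExecutor

def judge_filter(query: str, ranked: list[tuple[float, str]], keep: int = 3) -> list[str]:
    docs = [doc for _, doc in ranked]
    with ThreadPoolExecutor(max_workers=10) as pool:
        verdicts = list(pool.map(lambda d: judge(query, d), docs))
    return [doc for doc, ok in zip(docs, verdicts) if ok][:keep]

top3 = judge_filter(query, top10)   # top10 = rerank_top_k(...) from the previous section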
Real JR case: implicit rerank inside daily-jobs
The daily-jobs "Stage 2: Rerank" isn't a cross-encoder model; it's a lightweight rerank built from structured tags plus business rules:
# tested: 2026-04-26 · routine: Daily Jobs - AI Engineer Bootcamp
# 18 candidate jobs → 3 final picks
Step 1: assign tier tags
- aspirational → Mid-Senior + brand in {Apple, Google, Canva, Atlassian, ...}
- actionable → Junior/Graduate + brand in {AU Big 4 banks, top tech}
- special → title contains {"Intern", "Graduate Program", "2026-27 Start"}
Step 2: rerank within each tier
- prefer posted < 24h
- prefer title containing {"AI Engineer", "ML Engineer", "LLM Engineer"}
- prefer location in Sydney/Melbourne
Step 3: pick 1 per tier → 3 final picks
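Expressed as Python, the same rules become a sort key. A sketch with hypothetical job fields (tier, posted_hours_ago, title, location); adapt to the real schema:
# rule-based rerank sketch · field names on job objects are hypothetical
PRIORITY_TITLES = ("AI Engineer", "ML Engineer", "LLM Engineer")
PRIORITY_CITIES = ("Sydney", "Melbourne")

def rule_score(job) -> tuple:
    # True sorts above False, so each signal acts as a ranked preference
    return (
        job.posted_hours_ago < 24,
        any(t in job.title for t in PRIORITY_TITLES),
        job.location in PRIORITY_CITIES,
    )

def pick_finalists(jobs) -> list:
    picks = []
    for tier in ("aspirational", "actionable", "special"):
        in_tier = [j for j in jobs if j.tier == tier]
        if in_tier:
            picks.append(max(in_tier, key=rule_score))   # best-scoring job per tier
    return picks                                          # up to 3 final picks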
"Business rule as rerank" — with clean ranking signals (tier, freshness, keywords), rules beat cross-encoder on explainability. But serious semantic search ("FastAPI vs Django for AI backend?") defeats rules — needs cross-encoder.
Managed API vs Self-host — Trade-off
| Dimension | Cohere/Voyage Managed | BGE Self-host |
|---|---|---|
| Integration | < 1 day | 1-2 weeks |
| Cost per query | $0.001-0.002 | $0.0001 |
| Latency | 50-200ms | 10-50ms |
| Data leaves company | Yes | No |
| Custom fine-tune | Not supported | Yes |
| Ops burden | 0 | 1 ML eng part-time |
| Throughput ceiling | API rate limit | Your GPU capacity |
| Best for | < 50K queries/mo | > 100K queries/mo |
JR experience: the first RAG project launched on Cohere; once it shipped, the monthly bill plus compliance requirements drove the migration to BGE. Validate on managed first, optimize cost with self-host second.
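For reference, the managed path is only a few lines. A sketch assuming the Cohere Python SDK; the model identifier shown is one of the rerank-3 family names, so check current docs before relying on it:
# managed rerank sketch · assumes the Cohere Python SDK; verify model name against current docs
import cohere

co = cohere.Client("YOUR_API_KEY")

def cohere_rerank(query: str, docs: list[str], top_k: int = 10) -> list[str]:
    resp = co.rerank(
        model="rerank-english-v3.0",   # one of the rerank-3 family models; pick per language
        query=query,
        documents=docs,
        top_n=top_k,
    )
    return [docs[r.index] for r in resp.results]   # results arrive sorted by relevance score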
Takeaway
Bi-encoder solves "find the candidates," cross-encoder solves "rank them precisely." The two stages stack to 90% recall + 80%+ precision. LLM-as-judge is the optional third stage, catching "relevant but doesn't answer." Production RAG pick: < 50K/month go managed (Cohere), > 100K/month go BGE self-host.
References
- Reimers, N. & Gurevych, I. Sentence-Transformers: Cross-Encoders documentation — engineering comparison of bi- vs cross-encoders.
- BGE. BAAI bge-reranker-v2-m3 model card — bilingual (zh/en) rerank model.
- Cohere. Rerank API documentation — managed rerank API.
- Anthropic. (2024-09-19). Introducing Contextual Retrieval — Anthropic's in-house RAG improvement.
- Thakur, N. et al. BEIR benchmark — third-party benchmark for retrieval and rerank evaluation.
- Voyage AI. Rerank-2 API docs — token-billed rerank.
- Production case: JR Academy omni-report Daily Jobs routines — lightweight pattern using business rules (tier + freshness + keyword) as rerank.
❓ FAQ
The most commonly searched questions on this chapter's topic.
Why not just use a cross-encoder for retrieval?
A cross-encoder is about six orders of magnitude slower than a bi-encoder: a 10M-doc corpus means 10M forward passes per query. A bi-encoder computes doc embeddings offline and stores them in a vector DB; at query time only the query embedding plus an ANN lookup runs, returning in milliseconds. Use them together: bi-encoder recalls 50 → cross-encoder reranks to 10.
How do I choose between Cohere rerank and self-hosted BGE?
Split by query volume: < 50K/mo use Cohere (managed, $0.001-0.002 per query); > 100K/mo use BGE-reranker-v2-m3 (self-hosted, ~$0.0001 per query after GPU amortization); all-in on the Anthropic stack, use Contextual Retrieval. For Chinese-heavy corpora pick BGE or Cohere.
When should LLM-as-judge be added?
Add it when rerank still passes docs that are topically relevant but don't answer the question, e.g. the query asks for the "Australia GST registration threshold" and rerank puts "History of Australia's GST" in the top 3. LLM-as-judge uses Haiku/4o-mini to answer yes/no on whether a doc can answer the query; 10K queries/day costs about $10. Don't add it for realtime chat or for a RAG pipeline that is already stable.
How much does rerank add to end-to-end latency?
Cohere rerank API 50-150ms (top-50 candidates), self-hosted BGE 30-80ms on GPU, BGE-base on CPU 200-500ms. Adding LLM-as-judge costs another 300-800ms (Haiku). Overall production budget: retrieval 50ms + rerank 100ms + judge 500ms + main LLM call 2-3s ≈ 3-4 seconds, acceptable for a conversational UI.
I'm already on LangChain; how do I plug in rerank?
LangChain ships ContextualCompressionRetriever plus several reranker wrappers: CohereRerank, JinaRerank, FlashrankRerank (CPU-friendly open source), and CrossEncoderReranker (wraps HuggingFace BGE). The original retriever stays untouched; wrap it in one layer, roughly 5-10 lines of code, as in the sketch below.
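A sketch of the CrossEncoderReranker wiring; import paths follow LangChain documentation at the time of writing and do shift between versions, and base_retriever is whatever retriever you already run:
# LangChain wiring sketch · import paths move between LangChain versions
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3"),
    top_n=5,
)
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,   # the retriever you already have, unchanged
)
docs = rerank_retriever.invoke("FastAPI vs Django for AI backend?")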
Rerank is live but results are worse than expected; most common cause?
Too few candidates recalled. If the bi-encoder only recalls top-10 to feed the rerank, the reranker has nothing to choose from. The right setup: bi-encoder recalls 50-100, rerank cuts to top-3 to top-5. With too few candidates, rerank is decoration and creates no value.