Rerank — Turning Recall into Selection
Chapter 3 sketched the three-stage shape of selection (filter / rerank / LLM-judge). This chapter pulls the middle stage apart.
Why retrieval uses bi-encoder and rerank uses cross-encoder
Both encoders answer the same question ("how relevant is this doc to the query?") but in different ways:
Bi-encoder: query and doc go through the encoder separately, producing two independent embeddings, then cosine similarity.
Query ──→ Encoder ──→ q_vec ┐
├─→ cosine(q_vec, d_vec) → score
Doc ──→ Encoder ──→ d_vec ┘
Win: doc embeddings are computed offline and stored in a vector DB. Querying a 10M-doc corpus returns in milliseconds — only the query vector runs at runtime.
Cost: no cross-attention, so the model can't see specific match points. Query "Is Australia's GST 10%?" vs doc "Australia's consumption tax (GST) is currently 10%." — bi-encoder knows they're similar but can't tell whether the similarity is on "tax rate" or "country."
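A minimal sketch of the bi-encoder path using sentence-transformers (the model name here is illustrative, not a recommendation):
# bi-encoder sketch · model name is illustrative
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# offline step: embed the corpus once; in production these vectors live in a vector DB
docs = [
    "Australia's consumption tax (GST) is currently 10%.",
    "History of Australia's GST.",
]
doc_vecs = bi_encoder.encode(docs, convert_to_tensor=True)

# online step: embed only the query, then cosine similarity against the stored vectors
q_vec = bi_encoder.encode("Is Australia's GST 10%?", convert_to_tensor=True)
scores = util.cos_sim(q_vec, doc_vecs)[0]   # one similarity score per doc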
Cross-encoder: query and doc are concatenated and passed through the encoder together, outputting a 0-1 score.
[Query] [SEP] [Doc] ──→ Encoder ──→ score
Win: cross-attention sees token-to-token relationships. Precision lands 30-50% above bi-encoder (Sentence-Transformers comparison).
Cost: every query × doc requires a forward pass. 1 × 10M = 10M forwards. Can't be the first-stage filter — only the rerank.
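The same example through a cross-encoder, as a sketch (same library, same model as the rerank code later in this chapter):
# cross-encoder sketch · the (query, doc) pair is scored jointly
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("BAAI/bge-reranker-v2-m3")
score = cross_encoder.predict([
    ("Is Australia's GST 10%?",
     "Australia's consumption tax (GST) is currently 10%."),
])[0]
# one forward pass per pair: fine for 50 candidates, impossible for 10M docs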
The standard two-stage selection pipeline
Stage A (Retrieval, bi-encoder)
10M docs → embedded offline → vector DB
Query → embed → ANN top-50
~50ms, recall 90%+
Stage B (Rerank, cross-encoder)
Top-50 + Query → cross-encoder scores each pair → top-10
~200-500ms (batched on GPU)
Stage C (LLM-as-judge, optional)
Top-10 → cheap LLM (Haiku/4o-mini) yes/no → top-3
~1-2s, $0.001/query
The bi-encoder's top-N caps the rerank ceiling: anything Stage A misses can never be recovered downstream. N=50 is the common compromise between recall and rerank latency.
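In code the whole chain is a few lines of glue. A sketch where vector_db and embed stand in for your own retrieval stack, and rerank_top_k / judge are the functions defined later in this chapter:
# end-to-end glue sketch · vector_db and embed are placeholders for your own stack
q_vec = embed(query)                                          # Stage A: embed the query
candidates = vector_db.search(q_vec, top_k=50)                # Stage A: ANN lookup, ~50ms
top10 = rerank_top_k(query, [c.text for c in candidates])     # Stage B: cross-encoder rerank
top3 = [doc for _, doc in top10 if judge(query, doc)][:3]     # Stage C: optional LLM judge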
Mainstream rerank options (2026-04)
| Tool | Type | NDCG@10 | Cost | When to pick |
|---|---|---|---|---|
| Cohere rerank-3 | Managed | ≈ 0.75 | $2/1K query | No ML engineer |
| BGE-reranker-v2-m3 | self-host | ≈ 0.72 | ~10ms/query | GPUs + zh/en bilingual |
| Anthropic Contextual Retrieval | RAG kit | ≈ 0.78 | Anthropic API | All-in Anthropic |
| Voyage rerank-2 | Managed | ≈ 0.74 | $0.05/1M token | Small volume, token billing |
| Jina rerank | Managed + OSS | ≈ 0.71 | Free 1M token/mo | POC |
NDCG@10 figures come from vendor announcements and third-party BEIR benchmark comparisons; they vary roughly ±10% across corpora, so run your own eval.
- < 10K queries/mo → Cohere/Voyage
- > 100K queries/mo → BGE self-host (cost crushes managed)
- All-in Anthropic → Contextual Retrieval
- Chinese-heavy → BGE / Cohere
Rerank code (cross-encoder, BGE)
# tested: 2026-04-26 · sentence-transformers@3.0.1
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # first load downloads ~2GB

def rerank_top_k(query: str, docs: list[str], top_k: int = 10):
    """50 candidates → top_k. ~200ms per call on GPU with batch_size=32."""
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs, batch_size=32)
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]

# Stage A output from your vector DB feeds straight in
candidates = vector_db.search(query_embedding, top_k=50)
top10 = rerank_top_k(query, [c.text for c in candidates], top_k=10)
Without GPU batching, 50 pairs take around 5 seconds on CPU; production needs a GPU or a managed API.
LLM-as-judge — final filter on top of rerank
Rerank misses one failure class: docs that are topically relevant but don't actually answer the question. Query "Australia GST registration threshold": rerank ranks "History of Australia's GST" at #3. Both are about GST and both score high, but the doc never answers the threshold question.
A cheap model makes the call directly:
# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic()

def judge(query: str, doc: str) -> bool:
    """Haiku decides whether the doc can answer the query. ~$0.0001/call."""
    prompt = f"""You are a retrieval quality evaluator.
Query: {query}
Doc: {doc}
Can this Doc directly answer the Query? Answer only yes or no."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text.strip().lower().startswith("yes")
Cost: 10 candidates × $0.0001 = $0.001/query. 10K/day costs $10.
Skip when: realtime chat (users feel +1-2s latency), or retrieval is already stable (eval doesn't show judge gains).
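Wired in after rerank, the judge is a filter over the top-10. A sketch that runs the calls concurrently to keep latency down (judge is the function above; rerank_top_k output feeds in directly):
# judge the reranked top-10 concurrently, keep up to 3 that pass · a sketch, not a hardened client
from concurrent.futures import ThreadPoolExecutor

def judge_filter(query: str, ranked: list[tuple[float, str]], keep: int = 3) -> list[str]:
    docs = [doc for _, doc in ranked]
    with ThreadPoolExecutor(max_workers=10) as pool:
        verdicts = list(pool.map(lambda d: judge(query, d), docs))
    return [doc for doc, ok in zip(docs, verdicts) if ok][:keep]

top3 = judge_filter(query, top10)   # top10 = rerank_top_k(...) from the previous section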
Real JR case: implicit rerank inside daily-jobs
The daily-jobs "Stage 2: Rerank" isn't a cross-encoder model; it's a lightweight rerank built from structured tags plus business rules:
# tested: 2026-04-26 · routine: Daily Jobs - AI Engineer Bootcamp
# 18 candidate jobs → 3 final picks
Step 1: assign tier tags
- aspirational → Mid-Senior + brand in {Apple, Google, Canva, Atlassian, ...}
- actionable → Junior/Graduate + brand in {AU Big 4 banks, top tech}
- special → title contains {"Intern", "Graduate Program", "2026-27 Start"}
Step 2: rerank within each tier
- prefer posted < 24h
- prefer title containing {"AI Engineer", "ML Engineer", "LLM Engineer"}
- prefer location in Sydney/Melbourne
Step 3: pick 1 per tier → 3 final picks
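Expressed as Python, the same rules become a sort key. A sketch with hypothetical job fields (tier, posted_hours_ago, title, location); adapt to the real schema:
# rule-based rerank sketch · field names on job objects are hypothetical
PRIORITY_TITLES = ("AI Engineer", "ML Engineer", "LLM Engineer")
PRIORITY_CITIES = ("Sydney", "Melbourne")

def rule_score(job) -> tuple:
    # True sorts above False, so each signal acts as a ranked preference
    return (
        job.posted_hours_ago < 24,
        any(t in job.title for t in PRIORITY_TITLES),
        job.location in PRIORITY_CITIES,
    )

def pick_finalists(jobs) -> list:
    picks = []
    for tier in ("aspirational", "actionable", "special"):
        in_tier = [j for j in jobs if j.tier == tier]
        if in_tier:
            picks.append(max(in_tier, key=rule_score))   # best-scoring job per tier
    return picks                                          # up to 3 final picks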
"Business rule as rerank" — with clean ranking signals (tier, freshness, keywords), rules beat cross-encoder on explainability. But serious semantic search ("FastAPI vs Django for AI backend?") defeats rules — needs cross-encoder.
Managed API vs Self-host — Trade-off
| Dimension | Cohere/Voyage Managed | BGE Self-host |
|---|---|---|
| Integration | < 1 day | 1-2 weeks |
| Cost per query | $0.001-0.002 | $0.0001 |
| Latency | 50-200ms | 10-50ms |
| Data leaves company | Yes | No |
| Custom fine-tune | Not supported | Yes |
| Ops burden | 0 | 1 ML eng part-time |
| Throughput ceiling | API rate limit | Your GPU capacity |
| Best for | < 50K queries/mo | > 100K queries/mo |
JR experience: the first RAG project launched on Cohere; once it shipped, the monthly bill plus compliance requirements drove the migration to BGE. Validate on managed first, optimize cost with self-host second.
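For reference, the managed path is only a few lines. A sketch assuming the Cohere Python SDK; the model identifier shown is one of the rerank-3 family names, so check current docs before relying on it:
# managed rerank sketch · assumes the Cohere Python SDK; verify model name against current docs
import cohere

co = cohere.Client("YOUR_API_KEY")

def cohere_rerank(query: str, docs: list[str], top_k: int = 10) -> list[str]:
    resp = co.rerank(
        model="rerank-english-v3.0",   # one of the rerank-3 family models; pick per language
        query=query,
        documents=docs,
        top_n=top_k,
    )
    return [docs[r.index] for r in resp.results]   # results arrive sorted by relevance score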
Takeaway
Bi-encoder solves "find the candidates," cross-encoder solves "rank them precisely." The two stages stack to 90% recall + 80%+ precision. LLM-as-judge is the optional third stage, catching "relevant but doesn't answer." Production RAG pick: < 50K/month go managed (Cohere), > 100K/month go BGE self-host.
References
- Reimers, N. & Gurevych, I. Sentence-Transformers: Cross-Encoders documentation — engineering comparison of bi- vs cross-encoders.
- BGE. BAAI bge-reranker-v2-m3 model card — bilingual (zh/en) rerank model.
- Cohere. Rerank API documentation — managed rerank API.
- Anthropic. (2024-09-19). Introducing Contextual Retrieval — Anthropic's in-house RAG improvement.
- Thakur, N. et al. BEIR benchmark — third-party benchmark for retrieval and rerank evaluation.
- Voyage AI. Rerank-2 API docs — token-billed rerank.
- Production case: JR Academy omni-report Daily Jobs routines — lightweight pattern using business rules (tier + freshness + keyword) as rerank.
❓ FAQ
The most commonly searched questions on this chapter's topic.
Why not just use a cross-encoder for retrieval?
A cross-encoder is about six orders of magnitude slower than a bi-encoder: a 10M-doc corpus means 10M forward passes per query. A bi-encoder computes doc embeddings offline and stores them in a vector DB; at query time only the query embedding plus an ANN lookup runs, returning in milliseconds. Use them together: bi-encoder recalls 50 → cross-encoder reranks to 10.
How do I choose between Cohere rerank and self-hosted BGE?
Split by query volume: < 50K/mo use Cohere (managed, $0.001-0.002 per query); > 100K/mo use BGE-reranker-v2-m3 (self-hosted, ~$0.0001 per query after GPU amortization); all-in on the Anthropic stack, use Contextual Retrieval. For Chinese-heavy corpora pick BGE or Cohere.
When should LLM-as-judge be added?
Add it when rerank still passes docs that are topically relevant but don't answer the question, e.g. the query asks for the "Australia GST registration threshold" and rerank puts "History of Australia's GST" in the top 3. LLM-as-judge uses Haiku/4o-mini to answer yes/no on whether a doc can answer the query; 10K queries/day costs about $10. Don't add it for realtime chat or for a RAG pipeline that is already stable.
How much does rerank add to end-to-end latency?
Cohere rerank API 50-150ms (top-50 candidates), self-hosted BGE 30-80ms on GPU, BGE-base on CPU 200-500ms. Adding LLM-as-judge costs another 300-800ms (Haiku). Overall production budget: retrieval 50ms + rerank 100ms + judge 500ms + main LLM call 2-3s ≈ 3-4 seconds, acceptable for a conversational UI.
I'm already on LangChain; how do I plug in rerank?
LangChain ships ContextualCompressionRetriever plus several reranker wrappers: CohereRerank, JinaRerank, FlashrankRerank (CPU-friendly open source), and CrossEncoderReranker (wraps HuggingFace BGE). The original retriever stays untouched; wrap it in one layer, roughly 5-10 lines of code, as in the sketch below.
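A sketch of the CrossEncoderReranker wiring; import paths follow LangChain documentation at the time of writing and do shift between versions, and base_retriever is whatever retriever you already run:
# LangChain wiring sketch · import paths move between LangChain versions
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3"),
    top_n=5,
)
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,   # the retriever you already have, unchanged
)
docs = rerank_retriever.invoke("FastAPI vs Django for AI backend?")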
Rerank is live but results are worse than expected; most common cause?
Too few candidates recalled. If the bi-encoder only recalls top-10 to feed the rerank, the reranker has nothing to choose from. The right setup: bi-encoder recalls 50-100, rerank cuts to top-3 to top-5. With too few candidates, rerank is decoration and creates no value.