05

Rerank — Turning Recall into Selection

⏱️ 25 min


Chapter 3 sketched the three-stage shape of selection (filter / rerank / LLM-judge). This chapter pulls the middle stage apart.

Why retrieval uses a bi-encoder and rerank uses a cross-encoder

Both encoders answer the same question ("how relevant are this query and this doc?") in different ways:

Bi-encoder: query and doc go through the encoder separately, producing two independent embeddings that are then compared with cosine similarity.

Query ──→ Encoder ──→ q_vec ─┐
                             ├─→ cosine(q_vec, d_vec) → score
Doc   ──→ Encoder ──→ d_vec ─┘

Win: doc embeddings are computed offline and stored in a vector DB. Querying a 10M-doc corpus returns in milliseconds; only the query embedding is computed at runtime.

Cost: no cross-attention, so the model can't see specific match points. Query "Is Australia's GST 10%?" vs doc "Australia's consumption tax (GST) is currently 10%." — bi-encoder knows they're similar but can't tell whether the similarity is on "tax rate" or "country."
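
A minimal sketch of the split, using sentence-transformers (the model name is illustrative; any bi-encoder works):

# minimal bi-encoder sketch; model name is illustrative
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# in production, d_vec is precomputed offline and stored in the vector DB
d_vec = encoder.encode("Australia's consumption tax (GST) is currently 10%.")
q_vec = encoder.encode("Is Australia's GST 10%?")  # only this runs at query time

score = float(util.cos_sim(q_vec, d_vec))  # one similarity number, no token-level matching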

Cross-encoder: query and doc are concatenated and passed through the encoder together, outputting a 0-1 relevance score.

[Query] [SEP] [Doc] ──→ Encoder ──→ score

Win: cross-attention sees token-to-token relationships. Precision lands 30-50% above bi-encoder (Sentence-Transformers comparison).

Cost: every query × doc pair requires a forward pass. 1 query × 10M docs = 10M forwards. It can't be the first-stage filter, only the rerank.

The standard two-stage selection pipeline

Stage A (Retrieval, bi-encoder)
  10M docs → embedded offline → vector DB
  Query → embed → ANN top-50
  ~50ms, recall 90%+

Stage B (Rerank, cross-encoder)
  Top-50 + Query → cross-encoder scores each pair → top-10
  ~200-500ms (batched on GPU)

Stage C (LLM-as-judge, optional)
  Top-10 → cheap LLM (Haiku / 4o-mini) yes/no → top-3
  ~1-2s, $0.001/query

The bi-encoder's top-N caps the rerank's ceiling: anything not recalled can never be reranked back in. N=50 is the common compromise.

Mainstream rerank options (2026-04)

Tool                             Type            NDCG@10   Cost                When to pick
Cohere rerank-3                  Managed         ≈ 0.75    $2/1K queries       No ML engineer
BGE-reranker-v2-m3               Self-host       ≈ 0.72    ~10ms/query         GPUs + zh/en bilingual
Anthropic Contextual Retrieval   RAG kit         ≈ 0.78    Anthropic API       All-in Anthropic
Voyage rerank-2                  Managed         ≈ 0.74    $0.05/1M tokens     Small volume, token billing
Jina rerank                      Managed + OSS   ≈ 0.71    Free 1M tokens/mo   POC

NDCG@10 figures come from vendor announcements plus third-party BEIR benchmark comparisons. Scores swing ±10% across corpora, so run your own eval.
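
A quick way to run that eval is scikit-learn's ndcg_score over a handful of hand-labeled queries (the labels below are made up for illustration):

# hedged sketch: spot-check a reranker's NDCG@10 on your own labeled data
from sklearn.metrics import ndcg_score

# one row per query: human relevance labels (0-3) for each candidate doc
true_relevance = [[3, 0, 2, 1, 0]]
# the reranker's scores for the same docs, in the same order
reranker_scores = [[0.91, 0.12, 0.75, 0.44, 0.08]]

print(ndcg_score(true_relevance, reranker_scores, k=10))  # 1.0 = perfect ordering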

  • < 10K queries/mo → Cohere/Voyage (see the call sketch below)
  • > 100K queries/mo → BGE self-host (cost crushes managed)
  • All-in Anthropic → Contextual Retrieval
  • Chinese-heavy → BGE / Cohere
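
If you go the managed route, the Cohere call looks roughly like this (model name and response fields follow Cohere's docs; verify against the current SDK before relying on it):

# hedged sketch of the managed route; check the model name against Cohere's current docs
import cohere

co = cohere.Client("YOUR_API_KEY")  # assumption: key comes from env/secret in production

resp = co.rerank(
    model="rerank-english-v3.0",
    query="Is Australia's GST 10%?",
    documents=[c.text for c in candidates],  # the bi-encoder's top-50
    top_n=10,
)
top10 = [(r.relevance_score, candidates[r.index]) for r in resp.results]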

Rerank code (cross-encoder, BGE)

# tested: 2026-04-26 · sentence-transformers@3.0.1
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # first load downloads ~2GB

def rerank_top_k(query: str, docs: list[str], top_k: int = 10):
    """50 candidates → top_k. GPU batch_size=32 单次 ~200ms."""
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs, batch_size=32)
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]

# upstream Stage A recall (vector_db and query_embedding come from your retrieval setup)
candidates = vector_db.search(query_embedding, top_k=50)
top10 = rerank_top_k(query, [c.text for c in candidates], top_k=10)

Without GPU batching, scoring 50 pairs takes ~5 seconds on CPU; production needs a GPU or a managed API.

LLM-as-judge — final filter on top of rerank

Rerank misses one class of failure: topically relevant but doesn't actually answer. Query "Australia GST registration threshold": rerank puts "History of Australia's GST" at #3. Both are about GST and both score high, but the doc never answers the threshold question.

A cheap model decides directly:

# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(query: str, doc: str) -> bool:
    """Ask Haiku whether the doc can answer the query. ~$0.0001/call."""
    prompt = f"""You are a retrieval quality assessor.
Query: {query}
Doc: {doc}

Can this Doc directly answer the Query? Answer yes or no only."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text.strip().lower().startswith("yes")

Cost: 10 candidates × $0.0001 = $0.001/query. 10K/day costs $10.

Skip it when: real-time chat (users feel the extra 1-2s of latency), or retrieval is already stable (your eval shows no gain from the judge).
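
Putting the three stages together (a sketch reusing rerank_top_k and judge from above; vector_db and query_embedding come from your own retrieval setup):

candidates = vector_db.search(query_embedding, top_k=50)   # Stage A: ANN recall, ~50ms
top10 = rerank_top_k(query, [c.text for c in candidates])  # Stage B: cross-encoder, ~200-500ms
top3 = [doc for _, doc in top10 if judge(query, doc)][:3]  # Stage C: optional judge, ~1-2s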

Real JR case: implicit rerank inside daily-jobs

The daily-jobs "Stage 2: Rerank" isn't a cross-encoder model. It's a lightweight rerank built from structured tags plus business rules:

# tested: 2026-04-26 · routine: Daily Jobs - AI Engineer Bootcamp
# 18 candidate jobs → 3 final picks

Step 1: assign tier tags
  - aspirational  → Mid-Senior + brand in {Apple, Google, Canva, Atlassian, ...}
  - actionable    → Junior/Graduate + brand in {AU Big 4 banks, top tech}
  - special       → title contains {"Intern", "Graduate Program", "2026-27 Start"}

Step 2: rerank within each tier
  - prefer posted < 24h
  - prefer title containing {"AI Engineer", "ML Engineer", "LLM Engineer"}
  - prefer location in Sydney/Melbourne

Step 3: pick 1 per tier → 3 final picks

"Business rule as rerank" — with clean ranking signals (tier, freshness, keywords), rules beat cross-encoder on explainability. But serious semantic search ("FastAPI vs Django for AI backend?") defeats rules — needs cross-encoder.

Managed API vs Self-host — Trade-off

Dimension             Cohere/Voyage managed   BGE self-host
Integration           < 1 day                 1-2 weeks
Cost per query        $0.001-0.002            ~$0.0001
Latency               50-200ms                10-50ms
Data leaves company   Yes                     No
Custom fine-tune      Not supported           Yes
Ops burden            None                    1 ML eng, part-time
Throughput ceiling    Rate limit              GPU capacity
Best for              < 50K queries/mo        > 100K queries/mo

JR experience: the first RAG project shipped on Cohere; once live, the monthly bill plus compliance requirements drove the migration to BGE. Validate on managed first, then optimize cost on self-host.

Takeaway

Bi-encoder solves "find the candidates," cross-encoder solves "rank them precisely." The two stages stack to 90% recall + 80%+ precision. LLM-as-judge is the optional third stage, catching "relevant but doesn't answer." Production RAG pick: < 50K/month go managed (Cohere), > 100K/month go BGE self-host.


References

  1. Reimers, N. & Gurevych, I. Sentence-Transformers: Cross-Encoders documentation — engineering comparison of bi- vs cross-encoders.
  2. BAAI. bge-reranker-v2-m3 model card — bilingual zh/en rerank model.
  3. Cohere. Rerank API documentation — managed rerank API.
  4. Anthropic. (2024-09-19). Introducing Contextual Retrieval — Anthropic's in-house RAG improvement.
  5. Thakur, N. et al. BEIR benchmark — third-party evaluation benchmark for retrieval/rerank.
  6. Voyage AI. Rerank-2 API docs — token-billed rerank.

Production case: JR Academy omni-report Daily Jobs routines — a lightweight pattern using business rules (tier + freshness + keywords) as the rerank.


❓ FAQ

The most commonly searched questions about this chapter's topic.

Why not just use a cross-encoder for retrieval?

A cross-encoder is roughly six orders of magnitude slower than a bi-encoder: a 10M-doc corpus means 10M forward passes per query. A bi-encoder precomputes doc embeddings offline into a vector DB; at query time it only embeds the query and runs an ANN lookup, returning in milliseconds. Combine them: bi-encoder recalls 50 → cross-encoder reranks to 10.

How do I choose between Cohere rerank and self-hosted BGE?

Split by query volume: < 50K/mo, use Cohere (managed, $0.001-0.002 per query); > 100K/mo, use BGE-reranker-v2-m3 (self-hosted, ~$0.0001 per query after GPU amortization); all-in on the Anthropic stack, use Contextual Retrieval. For Chinese-heavy corpora, pick BGE or Cohere.

When should I add LLM-as-judge?

Add it when rerank lets through docs that are topically relevant but don't answer the question, e.g. the query asks for the "Australia GST registration threshold" and rerank puts "History of Australia's GST" in the top 3. LLM-as-judge uses Haiku/4o-mini for a direct yes/no call on whether the doc can answer the query; 10K queries/day costs $10. Skip it for real-time chat and for RAG pipelines that are already stable.

How much does rerank add to end-to-end latency?

Cohere rerank API: 50-150ms (top-50 candidates); self-hosted BGE on GPU: 30-80ms; BGE-base on CPU: 200-500ms. Adding LLM-as-judge costs another 300-800ms (Haiku). Overall production target: retrieval 50ms + rerank 100ms + judge 500ms + main LLM call 2-3s ≈ 3-4 seconds, acceptable for chat.

I'm already using LangChain. How do I plug in rerank?

LangChain ships ContextualCompressionRetriever plus several reranker wrappers: CohereRerank, JinaRerank, FlashrankRerank (CPU-friendly open source), and CrossEncoderReranker (wraps HuggingFace BGE). Your existing retriever stays untouched; wrap it in one layer and you're live in 5-10 lines of code, as sketched below.
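
A hedged sketch of the CrossEncoderReranker route (vector_store is your existing store; import paths follow current LangChain docs, so verify against your installed version):

# hedged sketch: wrap an existing retriever with a BGE cross-encoder reranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
compressor = CrossEncoderReranker(model=model, top_n=10)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 50}),  # existing retriever, untouched
)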

Rerank shipped but results fall short. Most common cause?

Too few recalled candidates. If the bi-encoder only recalls the top-10 to feed the rerank, the reranker has nothing to choose from. Do it right: recall 50-100 with the bi-encoder, then cut to top-3 or top-5 with the rerank. Without enough recall, rerank is decoration and creates no value.