Bi-Encoders vs Cross-Encoders: Choosing the Right Architecture for Semantic Search
Deep dive into bi-encoder and cross-encoder architectures for semantic similarity. Learn the trade-offs, implementation patterns, and when to use each approach in RAG systems and search applications.
When building semantic search, RAG systems, or recommendation engines, one architectural decision will fundamentally shape your system's performance: should you use bi-encoders or cross-encoders? The answer isn't straightforward - each architecture makes different trade-offs between speed and accuracy that matter enormously at scale.
Bi-encoders can search through millions of documents in milliseconds but may miss nuanced relevance. Cross-encoders capture subtle semantic relationships with much higher accuracy, but in practice they can only score a few hundred to a few thousand query-document pairs per query. Understanding when to use each - and how to combine them - is essential for building production-grade semantic systems.
This guide explains both architectures from first principles, compares their characteristics, and shows you how to implement the two-stage retrieval pattern that is widely used in production search systems.
Key Takeaways
- Bi-encoders pre-compute embeddings for millisecond search across millions of documents
- Cross-encoders process query-document pairs together for higher accuracy but cannot scale
- Two-stage retrieval (bi-encoder retrieve, cross-encoder rerank) is the production standard
- Retrieve 50-200 candidates with the bi-encoder, rerank them with the cross-encoder, and return the top 10-20
- Fine-tuning on domain data typically improves both architectures by 10-20%
- Choose pre-trained models based on your language, accuracy needs, and latency budget
- Hybrid search combining BM25 keyword matching with semantic search often outperforms either alone
The Core Problem: Semantic Similarity at Scale
Traditional keyword search fails when users express the same concept differently. A search for "how to fix a slow laptop" won't match a document titled "Speed up your computer performance", even though the two mean essentially the same thing. Semantic search solves this by comparing meaning rather than words.
But here's the challenge: to find semantically similar documents, you need to compare your query against every document in your corpus. With millions of documents, this becomes computationally intractable - unless you're clever about how you structure the comparison.
This is where bi-encoders and cross-encoders diverge. They represent two fundamentally different approaches to the same problem:
Bi-Encoder Approach
"Encode everything once, compare embeddings fast"
- Pre-compute document embeddings
- Store in vector database
- Compare query embedding to all docs
- Millisecond retrieval at any scale
Cross-Encoder Approach
"Consider query and document together for precision"
- Process query+document pairs
- Full attention between all tokens
- More accurate relevance scores
- Can only score a few hundred pairs
How Bi-Encoders Work
A bi-encoder uses two separate transformer encoders (or the same encoder applied twice) to independently convert queries and documents into fixed-size embedding vectors. These embeddings exist in a shared semantic space where similar meanings cluster together.
Bi-Encoder Architecture
```text
Query: "laptop running slow"     Document: "Speed up your computer"
            │                                     │
            ▼                                     ▼
   ┌─────────────────┐                   ┌─────────────────┐
   │   Transformer   │                   │   Transformer   │
   │     Encoder     │                   │     Encoder     │
   └────────┬────────┘                   └────────┬────────┘
            │                                     │
            ▼                                     ▼
[0.23, -0.45, 0.12, ...]              [0.21, -0.42, 0.15, ...]
     Query Embedding                     Document Embedding
            │                                     │
            └──────────────────┬──────────────────┘
                               │
                               ▼
                       Cosine Similarity
                             0.94
```
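The final step in the diagram, cosine similarity, is simple enough to write out by hand. The following is a minimal pure-Python sketch for illustration only: the function name `cosine_similarity` and the toy three-dimensional vectors are assumptions, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not real model output)
query_vec = [1.0, 2.0, 0.0]
doc_vec = [2.0, 4.0, 0.0]        # points in the same direction as the query
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query_vec, doc_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

Because the score depends only on direction, scaling a vector does not change it, which is why embeddings are often normalised once at indexing time so that a plain dot product gives the same ranking.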
The Pre-Computation Advantage
The key insight is that document embeddings can be computed once and stored. When a query arrives, you only need to:
1. Encode the query: one forward pass through the transformer (~10-50ms)
2. Compare against all documents: vector similarity operations are extremely fast
With optimised libraries like FAISS or vector databases like Pinecone, you can search billions of vectors in under 100 milliseconds, typically by using approximate nearest-neighbour (ANN) indexes rather than an exhaustive scan. This is why bi-encoders dominate large-scale retrieval.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Pre-compute document embeddings (do this once)
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Encode all documents - these embeddings are stored/cached
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# At query time, encode the query and compare
query = "laptop running slow"
query_embedding = model.encode(query, convert_to_numpy=True)

# Compute cosine similarities
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Get top results
top_indices = np.argsort(similarities)[::-1]
for idx in top_indices[:3]:
    print(f"Score: {similarities[idx]:.3f} | {documents[idx]}")
```
Embedding Quality Matters
Because bi-encoders compress documents into fixed-size vectors (typically 384-768 dimensions), information is necessarily lost. The quality of this compression depends on:
- Model architecture: Larger models capture more nuance
- Training data: Models trained on your domain perform better
- Embedding dimension: Higher dimensions preserve more information but cost more to store and compare
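To make the storage side of that last point concrete, here is a back-of-the-envelope sketch. It assumes embeddings are stored as uncompressed float32 values (4 bytes each) and ignores any index overhead; the function name `embedding_storage_bytes` is hypothetical.

```python
def embedding_storage_bytes(num_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix (float32 by default), excluding index overhead."""
    return num_docs * dim * bytes_per_value


# Compare the two dimensions mentioned above for a one-million-document corpus
for dim in (384, 768):
    size_gib = embedding_storage_bytes(1_000_000, dim) / 2**30
    print(f"{dim} dims, 1M docs: {size_gib:.2f} GiB")
```

Doubling the dimension doubles both the storage and the work per similarity comparison, so the extra detail has to earn its cost.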
The Compression Trade-off
A bi-encoder must compress an entire document (potentially thousands of words) into a single vector of a few hundred numbers. This works well for capturing general topic similarity but can miss specific details that matter for relevance. A document about "Python code optimisation" and one about "Python snake habitats" might have more similar embeddings than you'd expect because "Python" dominates both.
How Cross-Encoders Work
Cross-encoders take a fundamentally different approach. Instead of encoding query and document separately, they process both together as a single input sequence. This allows full attention between query tokens and document tokens - the model can directly compare every word in the query against every word in the document.
Cross-Encoder Architecture
```text
Query: "laptop running slow"     Document: "Speed up your computer"
            │                                     │
            └──────────────────┬──────────────────┘
                               │
                               ▼
 [CLS] laptop running slow [SEP] Speed up your computer [SEP]
                               │
                               ▼
                    ┌─────────────────────┐
                    │     Transformer     │
                    │   (Full Attention   │
                    │     Between All     │
                    │       Tokens)       │
                    └──────────┬──────────┘
                               │
                               ▼
                      ┌─────────────────┐
                      │ Classification  │
                      │      Head       │
                      └────────┬────────┘
                               │
                               ▼
                     Relevance Score: 0.89
```
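The concatenation step in the diagram can be sketched in a few lines. This is a simplification under stated assumptions: whitespace splitting stands in for a real subword tokenizer, the helper name `build_pair_input` is hypothetical, and the segment ids follow the BERT-style convention of marking which half of the pair each token belongs to.

```python
def build_pair_input(query: str, document: str) -> tuple[list[str], list[int]]:
    """Concatenate query and document into one BERT-style input sequence.

    Whitespace splitting is a stand-in for a real subword tokenizer.
    """
    query_tokens = query.split()
    doc_tokens = document.split()
    tokens = ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]
    # Segment ids: 0 for [CLS] + query + first [SEP], 1 for document + final [SEP]
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input("laptop running slow", "Speed up your computer")
print(" ".join(tokens))
print(segment_ids)
```

Since the query is part of the input sequence itself, nothing about this computation can be cached per document ahead of time.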
Why Cross-Encoders Are More Accurate
The key advantage is joint attention over the concatenated input (often loosely called cross-attention). When processing the combined input, the transformer can:
- Directly compare "laptop" with "computer" and understand they're synonyms in context
- Recognise that "slow" relates to "speed up" as problem-to-solution
- Consider word order and grammatical relationships across the query-document boundary
This produces more nuanced relevance judgments. Cross-encoders consistently outperform bi-encoders on relevance benchmarks, often by significant margins (5-15% improvement in metrics like NDCG@10).
```python
from sentence_transformers import CrossEncoder

# Load a cross-encoder model trained for relevance ranking
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score query-document pairs
query = "laptop running slow"
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Create query-document pairs
pairs = [[query, doc] for doc in documents]

# Score all pairs (returns relevance scores)
scores = model.predict(pairs)

# Sort by score
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"Score: {score:.3f} | {doc}")
```
The Scalability Problem
Cross-encoders have a fatal flaw for large-scale retrieval: you cannot pre-compute anything. Every query requires processing the query with every document through the full transformer. For a corpus of 1 million documents:
Cross-Encoder Scaling Math
- Time per pair: ~5-10ms on GPU
- Documents: 1,000,000
- Total time: 5,000-10,000 seconds (roughly 1.4-2.8 hours per query)
This is obviously impractical for real-time search.
Cross-encoders can realistically only score hundreds to low thousands of candidates per query. This limitation is fundamental to the architecture - there's no way around it without sacrificing the cross-attention that makes them accurate.
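The scaling arithmetic is easy to reproduce. The sketch below uses the per-pair timings quoted above and assumes pairs are scored one after another with no batching or parallelism; the function name `full_scan_seconds` is hypothetical.

```python
def full_scan_seconds(num_pairs: int, ms_per_pair: float) -> float:
    """Time to score every pair sequentially, in seconds."""
    return num_pairs * ms_per_pair / 1000.0


for ms in (5, 10):
    whole_corpus = full_scan_seconds(1_000_000, ms)
    shortlist = full_scan_seconds(100, ms)
    print(
        f"{ms} ms/pair: full corpus {whole_corpus:,.0f} s "
        f"({whole_corpus / 3600:.1f} h), 100 candidates {shortlist:.1f} s"
    )
```

The same per-pair cost that makes a full scan hopeless is perfectly acceptable for a shortlist of a hundred candidates, which is the observation the two-stage pattern builds on.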
Bi-Encoder vs Cross-Encoder: Complete Comparison
Let's compare both architectures across the dimensions that matter for production systems:
| Characteristic | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Speed at Query Time | Very fast (milliseconds) | Slow (scales linearly with corpus) |
| Relevance Accuracy | Good | Excellent |
| Scalability | Billions of documents | Hundreds per query |
| Pre-computation | Yes - encode docs once | No - must process each query |
| Index Updates | Add new embeddings easily | No index needed |
| Memory for Corpus | Embedding storage required | Just document text |
| GPU Requirements | Query-time only (optional) | Required for reasonable speed |
| Best For | First-stage retrieval, large corpora | Reranking, high-stakes decisions |
Accuracy vs Speed: The Fundamental Trade-off
The performance difference isn't marginal. On standard benchmarks like MS MARCO:
Typical Benchmark Performance (MS MARCO)
Bi-Encoder (all-MiniLM-L6-v2)
- MRR@10: ~0.33
- Query latency: 20ms + search
- Can search 10M+ docs
Cross-Encoder (ms-marco-MiniLM-L-6-v2)
- MRR@10: ~0.39
- Query latency: ~5ms per doc
- Practical limit: ~1000 docs
The cross-encoder achieves roughly 18% better ranking quality, but at a cost that makes it unusable for first-stage retrieval at scale.
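For readers unfamiliar with the metric, MRR@10 averages the reciprocal rank of the first relevant document within the top 10 results of each query. The sketch below shows one way to compute it and to reproduce the relative gain from the figures above; the function names and the toy runs are hypothetical and are not taken from MS MARCO.

```python
def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """1 / rank of the first relevant document in the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Mean reciprocal rank over (ranking, relevant set) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranking, relevant, k) for ranking, relevant in runs) / len(runs)


# Toy runs (hypothetical document ids)
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2 -> 0.5
    (["d2", "d9", "d4"], {"d2"}),  # first relevant hit at rank 1 -> 1.0
    (["d5", "d6", "d8"], {"d0"}),  # no relevant hit -> 0.0
]
print(mrr_at_k(runs))  # 0.5

# Relative gain implied by the approximate scores quoted above
bi_mrr, cross_mrr = 0.33, 0.39
relative_gain = (cross_mrr - bi_mrr) / bi_mrr
print(f"{relative_gain:.1%}")
```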
The Two-Stage Retrieval Pattern
The solution used by virtually every production semantic search system is two-stage retrieval: use a bi-encoder to quickly retrieve candidates, then use a cross-encoder to precisely rerank the top results.
Two-Stage Retrieval Pipeline
```text
           User Query
                │
                ▼
┌───────────────────────────────┐
│ Stage 1: Retrieval            │
│ (Bi-Encoder)                  │
│                               │
│ • Encode query (~20ms)        │
│ • Search vector index (~50ms) │
│ • Return top 100 candidates   │
└───────────────┬───────────────┘
                │
          Top 100 docs
                │
                ▼
┌───────────────────────────────┐
│ Stage 2: Reranking            │
│ (Cross-Encoder)               │
│                               │
│ • Score 100 pairs (~500ms)    │
│ • Sort by relevance           │
│ • Return top 10               │
└───────────────┬───────────────┘
                │
                ▼
          Final Results
```
This approach captures most of the cross-encoder's accuracy improvement while keeping end-to-end latency under a second. The bi-encoder's job is recall (don't miss relevant documents), while the cross-encoder's job is precision (rank the relevant ones correctly).
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Optional, Tuple


class TwoStageRetriever:
    def __init__(
        self,
        bi_encoder_model: str = 'all-MiniLM-L6-v2',
        cross_encoder_model: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        top_k_retrieval: int = 100,
        top_k_rerank: int = 10
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.top_k_retrieval = top_k_retrieval
        self.top_k_rerank = top_k_rerank

        self.documents: List[str] = []
        self.doc_embeddings: Optional[np.ndarray] = None

    def index_documents(self, documents: List[str]) -> None:
        """Pre-compute and store document embeddings."""
        self.documents = documents
        self.doc_embeddings = self.bi_encoder.encode(
            documents,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        # Normalize for cosine similarity
        self.doc_embeddings = self.doc_embeddings / np.linalg.norm(
            self.doc_embeddings, axis=1, keepdims=True
        )

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Two-stage search: retrieve then rerank."""
        if self.doc_embeddings is None:
            raise RuntimeError("Call index_documents() before search()")

        # Stage 1: Bi-encoder retrieval
        query_embedding = self.bi_encoder.encode(query, convert_to_numpy=True)
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # Compute similarities (dot product of normalized vectors = cosine)
        similarities = np.dot(self.doc_embeddings, query_embedding)

        # Get top-k candidates
        top_indices = np.argsort(similarities)[::-1][:self.top_k_retrieval]
        candidates = [self.documents[i] for i in top_indices]

        # Stage 2: Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by rerank scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:self.top_k_rerank]


# Usage example
retriever = TwoStageRetriever()

# Index your documents (do once)
documents = [
    "How to improve laptop performance and speed",
    "Python programming best practices guide",
    "Troubleshooting slow computer issues",
    "Machine learning model optimization techniques",
    "Windows performance tuning tips",
    # ... thousands more documents
]
retriever.index_documents(documents)

# Search (fast, accurate)
results = retriever.search("my laptop is running slowly")
for doc, score in results:
    print(f"{score:.3f}: {doc}")
```
Tuning the Pipeline
The key parameters to tune are:
- top_k_retrieval: How many candidates to retrieve. Higher values improve recall but increase reranking time. 50-200 is typical.
- top_k_rerank: How many final results to return. Usually 10-20 for search, 3-5 for RAG.
Latency Budget Example
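For a 200ms total latency budget:
- Query encoding: 20ms
- Vector search (1M docs): 30ms
- Cross-encoder reranking (100 docs): 150ms
These three stages add up to the full 200ms, so reranking 100 candidates only fits if the cross-encoder averages about 1.5ms per pair, which usually requires batched GPU inference. To leave headroom for network latency, rerank fewer candidates or allow a larger budget.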
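The same budget arithmetic can be turned around to ask how many candidates a given budget allows. A minimal sketch, assuming reranking cost grows linearly with the number of pairs; the function name `max_rerank_candidates` is hypothetical, the 1.5ms figure is what 150ms for 100 documents implies, and 5ms is the unbatched per-pair figure quoted earlier.

```python
import math


def max_rerank_candidates(
    budget_ms: float,
    encode_ms: float,
    search_ms: float,
    rerank_ms_per_pair: float,
) -> int:
    """Largest number of pairs the cross-encoder can score within the remaining budget."""
    remaining_ms = budget_ms - encode_ms - search_ms
    if remaining_ms <= 0:
        return 0
    return math.floor(remaining_ms / rerank_ms_per_pair)


print(max_rerank_candidates(200, 20, 30, 1.5))  # per-pair cost implied by the example above
print(max_rerank_candidates(200, 20, 30, 5.0))  # slower, unbatched per-pair cost
```

Measuring the real per-pair cost on your own hardware and batch size is the only reliable input here; the candidate count then follows from the budget rather than from a rule of thumb.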
Business Use Cases
Understanding when each architecture shines helps you make the right choice for your specific application.
Semantic Search Systems
For customer-facing search (e-commerce, documentation, knowledge bases), the two-stage pattern is essential. Users expect sub-second responses, but also expect relevant results.
E-Commerce Product Search
- Stage 1: Bi-encoder retrieves 200 products from millions in 50ms
- Stage 2: Cross-encoder reranks to surface exact matches (e.g., "wireless noise-cancelling headphones" ranks higher than "wireless headphones")
- Impact: 15-25% improvement in click-through rate
RAG (Retrieval-Augmented Generation)
For RAG systems, the quality of retrieved context directly impacts the quality of generated responses. Cross-encoder reranking is particularly valuable here.
Customer Support AI
- Stage 1: Bi-encoder finds relevant support articles and past tickets
- Stage 2: Cross-encoder identifies the most applicable content
- Impact: Reduces hallucinations, improves answer accuracy by 20-30%
Duplicate Detection
Finding duplicate or near-duplicate content across large document sets. Here, bi-encoders often suffice because you're looking for high similarity rather than subtle relevance.
Content Deduplication
- Approach: Bi-encoder embeddings with high similarity threshold (>0.9)
- Scale: Can compare millions of documents in hours
- Cross-encoder role: Verify borderline cases (0.85-0.95 similarity)
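One way to wire these thresholds together is a small routing function. This is a sketch of one possible reading of the numbers above, not a prescribed policy: pairs at or above 0.95 are accepted as duplicates outright, pairs between 0.85 and 0.95 are sent to the cross-encoder, and everything else is treated as distinct. The function name and the returned labels are hypothetical.

```python
def route_pair(
    similarity: float,
    auto_duplicate: float = 0.95,
    review_floor: float = 0.85,
) -> str:
    """Decide what to do with a document pair given its bi-encoder cosine similarity."""
    if similarity >= auto_duplicate:
        return "duplicate"
    if similarity >= review_floor:
        return "verify_with_cross_encoder"
    return "distinct"


for score in (0.97, 0.91, 0.60):
    print(score, route_pair(score))
```

Only the borderline band reaches the expensive model, so the cross-encoder's cost stays proportional to the number of ambiguous pairs rather than to the size of the corpus.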
Recommendation Systems
Content-based recommendations using semantic similarity. Bi-encoders excel here because you need to compare user preferences against large item catalogs in real-time.
Content Recommendations
- Approach: Embed user's reading history, find similar articles
- Real-time: Update recommendations as user browses
- Cross-encoder role: Rerank the shortlist for relevance; diversity and freshness usually need separate ranking signals
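A minimal sketch of the embedding-based approach is shown below. It assumes the user profile is simply the mean of the embeddings of articles already read, which is one common simplification rather than the only option; the helper names and the toy two-dimensional vectors are hypothetical.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(
    history: list[list[float]],
    candidates: dict[str, list[float]],
    top_n: int = 2,
) -> list[str]:
    """Rank candidate articles by similarity to the mean of the reading history."""
    profile = mean_vector(history)
    ranked = sorted(candidates, key=lambda cid: cosine(profile, candidates[cid]), reverse=True)
    return ranked[:top_n]


# Toy embeddings (hypothetical, two dimensions for readability)
history = [[1.0, 0.0], [0.8, 0.2]]
candidates = {
    "laptop-tips": [0.9, 0.1],
    "gardening": [0.0, 1.0],
    "pc-tuning": [0.7, 0.3],
}
print(recommend(history, candidates))
```

Updating recommendations as the user browses then amounts to appending the newest article's embedding to the history and recomputing the mean.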
When Bi-Encoder Alone Suffices
Not every use case needs two-stage retrieval. Consider bi-encoder only when:
- Finding similar items (not query-document matching)
- High similarity threshold (duplicates, near-matches)
- Latency constraints under 50ms
- Lower accuracy is acceptable
Choosing the Right Architecture
Use this decision framework to select the right approach for your use case:
Architecture Selection Decision Tree
- Corpus size > 10,000 documents? Yes → You need a bi-encoder for first-stage retrieval
- Real-time latency requirements (< 500ms)? Yes → Two-stage with limited reranking candidates
- High-stakes decisions (legal, medical, financial)? Yes → Definitely add cross-encoder reranking
- Simple similarity matching (duplicates, recommendations)? Yes → A bi-encoder alone may be sufficient
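The four questions above can also be encoded as a small checklist function, which is handy for documenting the reasoning behind a design choice. This is an illustrative sketch only: the `Requirements` dataclass, its field names, and `architecture_advice` are hypothetical, and the thresholds are the ones stated in the decision tree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Requirements:
    corpus_size: int
    latency_budget_ms: Optional[float] = None  # None means no real-time requirement
    high_stakes: bool = False
    similarity_only: bool = False


def architecture_advice(req: Requirements) -> list[str]:
    """Return the recommendations triggered by each question in the decision tree."""
    advice = []
    if req.corpus_size > 10_000:
        advice.append("bi-encoder for first-stage retrieval")
    if req.latency_budget_ms is not None and req.latency_budget_ms < 500:
        advice.append("two-stage with limited reranking candidates")
    if req.high_stakes:
        advice.append("cross-encoder reranking")
    if req.similarity_only:
        advice.append("bi-encoder alone may be sufficient")
    return advice


print(architecture_advice(Requirements(corpus_size=1_000_000, latency_budget_ms=200, high_stakes=True)))
print(architecture_advice(Requirements(corpus_size=500, similarity_only=True)))
```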
Model Selection Guide
Choosing the right pre-trained models significantly impacts performance:
| Use Case | Recommended Bi-Encoder | Recommended Cross-Encoder |
|---|---|---|
| General English | all-MiniLM-L6-v2 | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Higher Accuracy | all-mpnet-base-v2 | cross-encoder/ms-marco-electra-base |
| Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 |
| Long Documents | BAAI/bge-large-en-v1.5 | BAAI/bge-reranker-large |
| Maximum Quality | intfloat/e5-large-v2 | cross-encoder/stsb-roberta-large |
Fine-Tuning Recommendation
Pre-trained models work well for general use, but fine-tuning on your domain data typically improves performance by 10-20%. This is especially true for specialised domains like legal, medical, or technical content. Both architectures can be fine-tuned on query-document pairs: bi-encoders typically with a contrastive objective, cross-encoders typically with a classification or ranking loss over labelled pairs.
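To give a feel for what a contrastive objective computes, here is a pure-Python sketch of an in-batch-negatives loss: each query's matching document is the positive, and every other document in the batch serves as a negative. The article does not name a specific loss, so this choice is an assumption; the function name and the scale factor of 20 are illustrative. In practice you would use a framework implementation (sentence-transformers, for example, ships a loss of this style called `MultipleNegativesRankingLoss`).

```python
import math


def in_batch_contrastive_loss(similarities: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where the positive for row i is column i.

    similarities[i][j] is the similarity between query i and document j;
    all other documents in the batch act as negatives for query i.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(similarities)


# Toy similarity matrices (hypothetical values)
well_separated = [[0.9, 0.1], [0.2, 0.8]]  # positives clearly beat negatives
confused = [[0.5, 0.5], [0.5, 0.5]]        # model cannot tell them apart

print(in_batch_contrastive_loss(well_separated))
print(in_batch_contrastive_loss(confused))
```

Training pushes the loss down by raising the similarity of matching pairs relative to the rest of the batch, which is how domain-specific pairs reshape the embedding space.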
Integration with Vector Databases
In production, you'll typically store bi-encoder embeddings in a vector database. Here's how the pattern looks with Pinecone:
```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer, CrossEncoder

# Initialize (Pinecone client v3+; older releases used pinecone.init(...) instead)
pc = Pinecone(api_key="your-api-key")
index = pc.Index("semantic-search")

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


def two_stage_search(query: str, top_k: int = 10) -> list:
    # Stage 1: Vector search with Pinecone
    query_embedding = bi_encoder.encode(query).tolist()

    results = index.query(
        vector=query_embedding,
        top_k=100,  # Retrieve more for reranking
        include_metadata=True
    )

    # Extract (id, text) pairs; assumes each vector was upserted with a 'text' metadata field
    candidates = [
        (match.id, match.metadata['text'])
        for match in results.matches
    ]
    if not candidates:
        return []

    # Stage 2: Cross-encoder reranking
    pairs = [[query, doc] for _, doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Combine IDs and text with reranked scores
    reranked = sorted(
        zip([c[0] for c in candidates], [c[1] for c in candidates], scores),
        key=lambda x: x[2],
        reverse=True
    )

    return reranked[:top_k]
```
The same pattern works with other vector databases like Qdrant, Weaviate, Milvus, or pgvector. The bi-encoder handles the initial retrieval from the vector index, and the cross-encoder refines the ranking.
Hybrid Search: Adding Keyword Matching
Many production systems combine semantic search with traditional keyword matching for even better results:
```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np


class HybridRetriever:
    def __init__(self, documents: list[str]):
        # BM25 for keyword matching
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Bi-encoder for semantic matching (normalized so dot product = cosine)
        self.bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.doc_embeddings = self.bi_encoder.encode(
            documents, normalize_embeddings=True
        )

        # Cross-encoder for reranking
        self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

        self.documents = documents

    def search(self, query: str, top_k: int = 10, alpha: float = 0.5):
        # BM25 scores, scaled by the maximum so the best keyword match is ~1
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = bm25_scores / (bm25_scores.max() + 1e-6)

        # Semantic scores, min-max scaled to [0, 1]
        query_emb = self.bi_encoder.encode(query, normalize_embeddings=True)
        semantic_scores = np.dot(self.doc_embeddings, query_emb)
        semantic_scores = (semantic_scores - semantic_scores.min()) / (
            semantic_scores.max() - semantic_scores.min() + 1e-6
        )

        # Combine scores (alpha weights semantic against keyword matching)
        hybrid_scores = alpha * semantic_scores + (1 - alpha) * bm25_scores

        # Get top candidates for reranking
        top_indices = np.argsort(hybrid_scores)[::-1][:100]
        candidates = [self.documents[i] for i in top_indices]

        # Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:top_k]
```
Conclusion
Bi-encoders and cross-encoders represent two points on the speed-accuracy trade-off curve. Bi-encoders enable searching billions of documents in milliseconds through pre-computed embeddings, while cross-encoders capture nuanced semantic relationships with superior accuracy but can only process hundreds of pairs per query.
For most production systems, the answer isn't choosing one or the other - it's combining them. The two-stage retrieval pattern (bi-encoder retrieval followed by cross-encoder reranking) has become the de facto standard because it captures most of the cross-encoder's accuracy benefits while maintaining real-time latency.
As you implement semantic search, RAG systems, or recommendation engines, start with the two-stage pattern. Tune the number of candidates retrieved and reranked based on your latency budget and accuracy requirements. And remember that fine-tuning on your specific domain data often provides the biggest performance gains of all.
