28 January 2025

Bi-Encoders vs Cross-Encoders: Choosing the Right Architecture for Semantic Search

Q: When should I use a bi-encoder vs a cross-encoder?

Use bi-encoders when you need to search large corpora (>10,000 documents) with low latency. Use cross-encoders when you have a small candidate set and accuracy is critical. For most production systems, use both: bi-encoder for initial retrieval, cross-encoder for reranking the top 50-200 candidates. This two-stage approach gives you the speed of bi-encoders with most of the accuracy benefits of cross-encoders.

Q: How do I fine-tune a bi-encoder or cross-encoder for my domain?

The two architectures are usually fine-tuned differently. In both cases you need training data consisting of query-document pairs with relevance labels. For bi-encoders, use contrastive learning, for example MultipleNegativesRankingLoss from sentence-transformers. For cross-encoders, standard classification fine-tuning on labelled pairs works well. Even a few thousand domain-specific examples can improve performance over pre-trained models, often by 10-20%, though the gain depends on the domain and the metric.

Q: What embedding dimension should I use for bi-encoders?

Most modern bi-encoders produce 384-768 dimensional embeddings. Smaller dimensions (384) are faster to compare and store but capture less nuance. Larger dimensions (768+) are more accurate but increase storage and computation costs. For most applications, 384-dimensional models like all-MiniLM-L6-v2 offer an excellent balance. Only go larger if you have empirical evidence it helps your specific use case.
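The storage side of this trade-off is simple arithmetic. A minimal sketch, assuming embeddings are stored as uncompressed 32-bit floats and ignoring any index overhead (the function names are illustrative, not from a library):

```python
def embedding_storage_bytes(num_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix (float32 by default), without index overhead."""
    return num_docs * dim * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return num_bytes / (1024 ** 3)


if __name__ == "__main__":
    for dim in (384, 768):
        size = embedding_storage_bytes(1_000_000, dim)
        print(f"{dim} dimensions: {to_gib(size):.2f} GiB per 1M documents")
```

Doubling the dimension doubles both the storage and the cost of each similarity computation, which is why the smaller model is a sensible default until measurements say otherwise.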

Q: How many candidates should I retrieve for cross-encoder reranking?

It depends on your latency budget and accuracy requirements. At roughly 5ms per pair on a GPU, reranking 100 candidates takes about 500ms and reranking 50 takes about 250ms; batching pairs can reduce this considerably. Start with 100 candidates and measure whether fewer works for your use case. The key insight is that you want enough candidates to ensure relevant documents are included (recall), but not so many that reranking becomes slow.

Q: Can I use cross-encoders for multilingual search?

Yes. Models like cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 are trained on multilingual data and can score relevance across languages. However, cross-language relevance (query in English, document in Spanish) is typically less accurate than same-language matching. For best multilingual results, consider language-specific models, and for the retrieval stage consider multilingual embedding models (bi-encoders) such as mUSE or LaBSE, which are trained to align sentences across languages.

Q: How do bi-encoders and cross-encoders compare to LLM-based reranking?

LLMs like GPT-4 can be used for reranking by prompting them to score relevance. They can be more accurate than traditional cross-encoders, especially for complex queries requiring reasoning. However, they are 10-100x slower and more expensive. Use LLM reranking only for high-value applications where the accuracy improvement justifies the cost, or as a third stage after cross-encoder reranking to refine the top 5-10 results.

Q: What is the impact of document length on encoder performance?

Most bi-encoders and cross-encoders have a maximum input length (typically 256-512 tokens). Longer documents must be chunked. For bi-encoders, you can encode each chunk separately and either use the best-matching chunk or aggregate chunk embeddings. Cross-encoders are more sensitive to length because they process the full query+document sequence. For long documents, strategies like sliding window or hierarchical encoding help maintain quality.
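The sliding-window and best-matching-chunk strategies can be sketched as follows. This is a simplified illustration: it splits on whitespace words, whereas real models count subword tokens, and `token_overlap` is a toy stand-in for a bi-encoder or cross-encoder score. All function names are hypothetical.

```python
from typing import Callable, List, Sequence


def sliding_window_chunks(tokens: Sequence[str], window: int = 256, stride: int = 128) -> List[List[str]]:
    """Split tokens into overlapping windows; the last window ends at the final token."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return []
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


def token_overlap(query: Sequence[str], chunk: Sequence[str]) -> float:
    """Toy relevance score: number of distinct query tokens present in the chunk."""
    return float(len(set(query) & set(chunk)))


def best_chunk_score(
    query: Sequence[str],
    chunks: List[List[str]],
    score_fn: Callable[[Sequence[str], Sequence[str]], float],
) -> float:
    """Score a long document by its best-matching chunk."""
    return max(score_fn(query, chunk) for chunk in chunks)


if __name__ == "__main__":
    doc = "a b c d e f g h i j".split()
    chunks = sliding_window_chunks(doc, window=4, stride=2)
    print(len(chunks), "chunks; best score:", best_chunk_score(["i", "j"], chunks, token_overlap))
```

Taking the maximum favours documents with one highly relevant passage; averaging chunk scores instead would favour documents that are consistently on topic.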

Q: How do I evaluate my retrieval system's performance?

Use standard information retrieval metrics: MRR (Mean Reciprocal Rank) measures where the first relevant result appears, NDCG (Normalized Discounted Cumulative Gain) evaluates ranking quality, and Recall@K measures what fraction of relevant documents appear in the top K results. Create a test set of queries with known relevant documents. Compare your two-stage system against bi-encoder-only and measure the improvement from reranking.
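A minimal sketch of these three metrics in plain Python follows. It assumes each ranking contains unique document IDs and uses linear gains for NDCG (some evaluations use exponential gains instead); the function names are illustrative.

```python
import math
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, or 0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: List[List[str]], relevant_sets: List[Set[str]], k: int = 10) -> float:
    """MRR@k: reciprocal rank averaged over all queries."""
    scores = [reciprocal_rank(r, rel, k) for r, rel in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores) if scores else 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids: List[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with linear gains; `gains` maps document ID to graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["d3", "d1", "d2"]
    print("RR:", reciprocal_rank(ranking, {"d1"}))
    print("Recall@2:", recall_at_k(ranking, {"d1", "d9"}, k=2))
    print("NDCG@3:", round(ndcg_at_k(ranking, {"d1": 2.0, "d2": 1.0}, k=3), 3))
```

Running the same functions on the bi-encoder-only ranking and on the reranked output for each test query gives a direct measure of what the second stage adds.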

Deep dive into bi-encoder and cross-encoder architectures for semantic similarity. Learn the trade-offs, implementation patterns, and when to use each approach in RAG systems and search applications.

Clever Ops Team

When building semantic search, RAG systems, or recommendation engines, one architectural decision will fundamentally shape your system's performance: should you use bi-encoders or cross-encoders? The answer isn't straightforward - each architecture makes different trade-offs between speed and accuracy that matter enormously at scale.

Bi-encoders can search through millions of documents in milliseconds but may miss nuanced relevance. Cross-encoders capture subtle semantic relationships with remarkable accuracy but can't scale beyond a few hundred comparisons per query. Understanding when to use each - and how to combine them - is essential for building production-grade semantic systems.

This guide explains both architectures from first principles, compares their characteristics, and shows you how to implement the two-stage retrieval pattern that powers modern search systems at companies like Google, Microsoft, and OpenAI.

Key Takeaways

Bi-encoders pre-compute embeddings for millisecond search across millions of documents
Cross-encoders process query-document pairs together for higher accuracy but cannot scale
Two-stage retrieval (bi-encoder retrieve, cross-encoder rerank) is the production standard
Retrieve 50-200 candidates with the bi-encoder, rerank them with the cross-encoder, and return the top 10-20
Fine-tuning on domain data typically improves both architectures by 10-20%
Choose pre-trained models based on your language, accuracy needs, and latency budget
Hybrid search combining BM25 keyword matching with semantic search often outperforms either alone

The Core Problem: Semantic Similarity at Scale

Traditional keyword search fails when users express the same concept differently. A search for "how to fix a slow laptop" may never surface a document titled "Speed up your computer performance", because the two share almost no keywords despite meaning nearly the same thing. Semantic search solves this by comparing meaning rather than words.

But here's the challenge: to find semantically similar documents, you need to compare your query against every document in your corpus. With millions of documents, this becomes computationally intractable - unless you're clever about how you structure the comparison.

This is where bi-encoders and cross-encoders diverge. They represent two fundamentally different approaches to the same problem:

Bi-Encoder Approach

"Encode everything once, compare embeddings fast"

• Pre-compute document embeddings
• Store in vector database
• Compare query embedding to all docs
• Millisecond retrieval at any scale

Cross-Encoder Approach

"Consider query and document together for precision"

• Process query+document pairs
• Full attention between all tokens
• More accurate relevance scores
• Can only score a few hundred pairs

How Bi-Encoders Work

A bi-encoder uses two separate transformer encoders (or the same encoder applied twice) to independently convert queries and documents into fixed-size embedding vectors. These embeddings exist in a shared semantic space where similar meanings cluster together.

Bi-Encoder Architecture

    Query: "laptop running slow"          Document: "Speed up your computer"
              │                                      │
              ▼                                      ▼
    ┌─────────────────┐                   ┌─────────────────┐
    │   Transformer   │                   │   Transformer   │
    │    Encoder      │                   │    Encoder      │
    └────────┬────────┘                   └────────┬────────┘
              │                                      │
              ▼                                      ▼
    [0.23, -0.45, 0.12, ...]             [0.21, -0.42, 0.15, ...]
         Query Embedding                    Document Embedding
              │                                      │
              └──────────────┬───────────────────────┘
                             │
                             ▼
                    Cosine Similarity
                         0.94

The Pre-Computation Advantage

The key insight is that document embeddings can be computed once and stored. When a query arrives, you only need to:

1. Encode the query - One forward pass through the transformer (~10-50ms)
2. Compare against all documents - Vector similarity operations are extremely fast

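With optimised libraries like FAISS or vector databases like Pinecone, approximate nearest-neighbour (ANN) indexes let you search collections of up to billions of vectors in under 100 milliseconds, given suitable hardware and at the cost of a small amount of recall. This is why bi-encoders dominate large-scale retrieval.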

Bi-Encoder with Sentence-Transformers (Python)

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Pre-compute document embeddings (do this once)
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Encode all documents - these embeddings are stored/cached
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# At query time, encode the query and compare
query = "laptop running slow"
query_embedding = model.encode(query, convert_to_numpy=True)

# Compute cosine similarities
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Get top results
top_indices = np.argsort(similarities)[::-1]
for idx in top_indices[:3]:
    print(f"Score: {similarities[idx]:.3f} | {documents[idx]}")

Embedding Quality Matters

Because bi-encoders compress documents into fixed-size vectors (typically 384-768 dimensions), information is necessarily lost. The quality of this compression depends on:

Model architecture: Larger models capture more nuance
Training data: Models trained on your domain perform better
Embedding dimension: Higher dimensions preserve more information but cost more to store and compare

The Compression Trade-off

A bi-encoder must compress an entire document (potentially thousands of words) into a single vector of a few hundred numbers. This works well for capturing general topic similarity but can miss specific details that matter for relevance. Documents about "Python code optimisation" and "Python snake habitats" might have more similar embeddings than you'd expect because "Python" dominates both.

How Cross-Encoders Work

Cross-encoders take a fundamentally different approach. Instead of encoding query and document separately, they process both together as a single input sequence. This allows full attention between query tokens and document tokens - the model can directly compare every word in the query against every word in the document.

Cross-Encoder Architecture

    Query: "laptop running slow"    Document: "Speed up your computer"
                    │                          │
                    └──────────┬───────────────┘
                               │
                               ▼
            [CLS] laptop running slow [SEP] Speed up your computer [SEP]
                               │
                               ▼
                    ┌─────────────────────┐
                    │    Transformer      │
                    │  (Full Attention    │
                    │   Between All       │
                    │     Tokens)         │
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────┐
                    │  Classification │
                    │     Head        │
                    └────────┬────────┘
                             │
                             ▼
                    Relevance Score: 0.89

Why Cross-Encoders Are More Accurate

The key advantage is joint attention over the concatenated input (often loosely called cross-attention). When processing the combined input, the transformer can:

Directly compare "laptop" with "computer" and understand they're synonyms in context
Recognise that "slow" relates to "speed up" as problem-to-solution
Consider word order and grammatical relationships across the query-document boundary

This produces more nuanced relevance judgments. Cross-encoders consistently outperform bi-encoders on relevance benchmarks, often by significant margins (5-15% improvement in metrics like NDCG@10).

Cross-Encoder with Sentence-Transformers (Python)

from sentence_transformers import CrossEncoder

# Load a cross-encoder model trained for relevance ranking
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score query-document pairs
query = "laptop running slow"
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Create query-document pairs
pairs = [[query, doc] for doc in documents]

# Score all pairs (returns relevance scores)
scores = model.predict(pairs)

# Sort by score
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"Score: {score:.3f} | {doc}")

The Scalability Problem

Cross-encoders have a fatal flaw for large-scale retrieval: you cannot pre-compute anything. Every query requires processing the query with every document through the full transformer. For a corpus of 1 million documents:

Cross-Encoder Scaling Math

• Time per pair: ~5-10ms on GPU
• Documents: 1,000,000
• Total time: 5,000-10,000 seconds (roughly 1.4-2.8 hours per query)

This is obviously impractical for real-time search.

Cross-encoders can realistically only score hundreds to low thousands of candidates per query. This limitation is fundamental to the architecture - there's no way around it without sacrificing the joint query-document attention that makes them accurate.
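The scaling arithmetic is easy to reproduce for your own corpus size. A small sketch, assuming a constant per-pair cost with no batching or parallelism (the function name is illustrative):

```python
def exhaustive_scoring_seconds(num_docs: int, ms_per_pair: float) -> float:
    """Time to score one query against every document, one pair at a time."""
    return num_docs * ms_per_pair / 1000.0


if __name__ == "__main__":
    for ms_per_pair in (5.0, 10.0):
        seconds = exhaustive_scoring_seconds(1_000_000, ms_per_pair)
        print(f"{ms_per_pair:.0f} ms/pair: {seconds:,.0f} s ({seconds / 3600:.1f} h) for 1M documents")
    # The same per-pair cost applied to a shortlist of 100 candidates
    print(f"100 candidates at 5 ms/pair: {exhaustive_scoring_seconds(100, 5.0):.1f} s")
```

The same cost that is prohibitive across a full corpus becomes manageable once the candidate set is cut to a few hundred documents.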

Bi-Encoder vs Cross-Encoder: Complete Comparison

Let's compare both architectures across the dimensions that matter for production systems:

Characteristic	Bi-Encoder	Cross-Encoder
Speed at Query Time	Very fast (milliseconds)	Slow (scales linearly with the number of candidates scored)
Relevance Accuracy	Good	Excellent
Scalability	Billions of documents	Hundreds per query
Pre-computation	Yes - encode docs once	No - must process each query
Index Updates	Add new embeddings easily	No index needed
Memory for Corpus	Embedding storage required	Just document text
GPU Requirements	Query-time only (optional)	Required for reasonable speed
Best For	First-stage retrieval, large corpora	Reranking, high-stakes decisions

Accuracy vs Speed: The Fundamental Trade-off

The performance difference isn't marginal. On standard benchmarks like MS MARCO:

Typical Benchmark Performance (MS MARCO)

Bi-Encoder (all-MiniLM-L6-v2)

• MRR@10: ~0.33
• Query latency: 20ms + search
• Can search 10M+ docs

Cross-Encoder (ms-marco-MiniLM-L-6-v2)

• MRR@10: ~0.39
• Query latency: ~5ms per doc
• Practical limit: ~1000 docs

The cross-encoder achieves roughly 18% better ranking quality in relative terms (0.39 vs 0.33 MRR@10), but at a cost that makes it unusable for first-stage retrieval at scale.

The Two-Stage Retrieval Pattern

The solution used by most production semantic search systems is two-stage retrieval: use a bi-encoder to quickly retrieve candidates, then use a cross-encoder to precisely rerank the top results.

Two-Stage Retrieval Pipeline

                        User Query
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Stage 1: Retrieval       │
            │         (Bi-Encoder)           │
            │                                │
            │  • Encode query (~20ms)        │
            │  • Search vector index (~50ms) │
            │  • Return top 100 candidates   │
            └───────────────┬───────────────┘
                            │
                     Top 100 docs
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Stage 2: Reranking       │
            │        (Cross-Encoder)         │
            │                                │
            │  • Score 100 pairs (~500ms)    │
            │  • Sort by relevance           │
            │  • Return top 10               │
            └───────────────┬───────────────┘
                            │
                            ▼
                    Final Results

This approach captures most of the cross-encoder's accuracy improvement while keeping end-to-end latency well under a second (roughly 570ms with the figures above). The bi-encoder's job is recall (don't miss relevant documents), while the cross-encoder's job is precision (rank the relevant ones correctly).

Complete Two-Stage Retrieval Implementation (Python)

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Optional, Tuple

class TwoStageRetriever:
    def __init__(
        self,
        bi_encoder_model: str = 'all-MiniLM-L6-v2',
        cross_encoder_model: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        top_k_retrieval: int = 100,
        top_k_rerank: int = 10
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.top_k_retrieval = top_k_retrieval
        self.top_k_rerank = top_k_rerank

        self.documents: List[str] = []
        self.doc_embeddings: Optional[np.ndarray] = None

    def index_documents(self, documents: List[str]) -> None:
        """Pre-compute and store document embeddings."""
        self.documents = documents
        self.doc_embeddings = self.bi_encoder.encode(
            documents,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        # Normalize for cosine similarity
        self.doc_embeddings = self.doc_embeddings / np.linalg.norm(
            self.doc_embeddings, axis=1, keepdims=True
        )

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Two-stage search: retrieve then rerank."""
        if self.doc_embeddings is None:
            raise RuntimeError("Call index_documents() before search()")

        # Stage 1: Bi-encoder retrieval
        query_embedding = self.bi_encoder.encode(query, convert_to_numpy=True)
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # Compute similarities (dot product of normalized vectors = cosine)
        similarities = np.dot(self.doc_embeddings, query_embedding)

        # Get top-k candidates
        top_indices = np.argsort(similarities)[::-1][:self.top_k_retrieval]
        candidates = [self.documents[i] for i in top_indices]

        # Stage 2: Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by rerank scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:self.top_k_rerank]


# Usage example
retriever = TwoStageRetriever()

# Index your documents (do once)
documents = [
    "How to improve laptop performance and speed",
    "Python programming best practices guide",
    "Troubleshooting slow computer issues",
    "Machine learning model optimization techniques",
    "Windows performance tuning tips",
    # ... thousands more documents
]
retriever.index_documents(documents)

# Search (fast, accurate)
results = retriever.search("my laptop is running slowly")
for doc, score in results:
    print(f"{score:.3f}: {doc}")

Tuning the Pipeline

The key parameters to tune are:

top_k_retrieval: How many candidates to retrieve. Higher values improve recall but increase reranking time. 50-200 is typical.
top_k_rerank: How many final results to return. Usually 10-20 for search, 3-5 for RAG.
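One way to choose top_k_retrieval empirically is to measure first-stage recall at several cut-offs and keep the smallest value that meets a target. A sketch, assuming you have a labelled test set; the function names, data layout, and the 0.9 target are hypothetical choices for illustration.

```python
from typing import Dict, List, Optional, Sequence, Set


def mean_recall_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Average fraction of relevant documents found in the top k, over labelled queries."""
    recalls = []
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        if not gold:
            continue  # skip queries with no labelled relevant documents
        hits = len(gold.intersection(ranked_ids[:k]))
        recalls.append(hits / len(gold))
    return sum(recalls) / len(recalls) if recalls else 0.0


def smallest_k_for_recall(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    candidate_ks: Sequence[int],
    target: float,
) -> Optional[int]:
    """Smallest cut-off whose mean recall reaches the target, or None if none does."""
    for k in sorted(candidate_ks):
        if mean_recall_at_k(rankings, relevant, k) >= target:
            return k
    return None


if __name__ == "__main__":
    # Toy first-stage rankings (bi-encoder output) and gold labels
    rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d8", "d7", "d6"]}
    relevant = {"q1": {"d2"}, "q2": {"d7"}}
    print(smallest_k_for_recall(rankings, relevant, candidate_ks=[1, 2, 3, 4], target=0.9))
```

Anything the first stage misses cannot be recovered by the reranker, so recall at the chosen cut-off sets the ceiling for the whole pipeline.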

Latency Budget Example

For a 200ms total latency budget:

• Query encoding: 20ms
• Vector search (1M docs): 30ms
• Cross-encoder reranking (100 docs): 150ms

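These three stages add up to the full 200ms, so network latency needs its own allowance on top, or a smaller candidate set. The 150ms reranking figure works out to about 1.5ms per pair, which assumes batched GPU inference rather than the ~5ms per pair quoted earlier for one-at-a-time scoring.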
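The budget can also be turned around to ask how many candidates fit. A sketch, assuming a fixed per-pair reranking cost: 1.5ms per pair is what 150ms for 100 documents implies, and 5ms is the lower per-pair figure from the scaling discussion. The function name is illustrative.

```python
def max_rerank_candidates(budget_ms: float, encode_ms: float, search_ms: float, ms_per_pair: float) -> int:
    """Number of candidates that can be reranked in the time left after retrieval."""
    if ms_per_pair <= 0:
        raise ValueError("ms_per_pair must be positive")
    remaining = budget_ms - encode_ms - search_ms
    if remaining <= 0:
        return 0
    return int(remaining // ms_per_pair)


if __name__ == "__main__":
    for ms_per_pair in (1.5, 5.0):
        n = max_rerank_candidates(budget_ms=200, encode_ms=20, search_ms=30, ms_per_pair=ms_per_pair)
        print(f"{ms_per_pair} ms/pair -> up to {n} candidates in a 200 ms budget")
```

Measuring your own per-pair cost on the target hardware is the only reliable input here; the two values above bracket the figures used in this guide.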

Business Use Cases

Understanding when each architecture shines helps you make the right choice for your specific application.

Semantic Search Systems

For customer-facing search (e-commerce, documentation, knowledge bases), the two-stage pattern is essential. Users expect sub-second responses, but also expect relevant results.

E-Commerce Product Search

Stage 1: Bi-encoder retrieves 200 products from millions in 50ms
Stage 2: Cross-encoder reranks to surface exact matches (e.g., "wireless noise-cancelling headphones" ranks higher than "wireless headphones")
Impact: 15-25% improvement in click-through rate

RAG (Retrieval-Augmented Generation)

For RAG systems, the quality of retrieved context directly impacts the quality of generated responses. Cross-encoder reranking is particularly valuable here.

Customer Support AI

Stage 1: Bi-encoder finds relevant support articles and past tickets
Stage 2: Cross-encoder identifies the most applicable content
Impact: Reduces hallucinations, improves answer accuracy by 20-30%

Duplicate Detection

Finding duplicate or near-duplicate content across large document sets. Here, bi-encoders often suffice because you're looking for high similarity rather than subtle relevance.

Content Deduplication

Approach: Bi-encoder embeddings with high similarity threshold (>0.9)
Scale: Can compare millions of documents in hours
Cross-encoder role: Verify borderline cases (0.85-0.95 similarity)
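The routing logic for deduplication can be sketched as a three-way decision. This is one possible reading of the thresholds above, in which pairs at or above the top of the borderline band are accepted outright and only the band itself is sent to the cross-encoder; the function name and labels are hypothetical.

```python
def route_pair(similarity: float, verify_low: float = 0.85, verify_high: float = 0.95) -> str:
    """Decide what to do with a candidate pair based on bi-encoder cosine similarity."""
    if similarity >= verify_high:
        return "duplicate"   # confident match, no second opinion needed
    if similarity >= verify_low:
        return "verify"      # borderline: score the pair with a cross-encoder
    return "distinct"        # clearly different documents


if __name__ == "__main__":
    for sim in (0.97, 0.90, 0.60):
        print(f"{sim:.2f} -> {route_pair(sim)}")
```

Because only the borderline band reaches the cross-encoder, the expensive model runs on a small fraction of pairs.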

Recommendation Systems

Content-based recommendations using semantic similarity. Bi-encoders excel here because you need to compare user preferences against large item catalogs in real-time.

Content Recommendations

Approach: Embed user's reading history, find similar articles
Real-time: Update recommendations as user browses
Cross-encoder role: Rerank for diversity and freshness
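A minimal sketch of the embedding side of this approach: represent the user as the mean of the embeddings of articles they have read, then rank the catalogue by cosine similarity to that profile. Averaging is an assumption made here for simplicity, the toy 2-dimensional vectors stand in for real bi-encoder output, and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(
    history: Sequence[Sequence[float]],
    catalog: Dict[str, Sequence[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank catalogue items by similarity to the mean of the user's history embeddings."""
    profile = mean_vector(history)
    scored = [(item_id, cosine(profile, vec)) for item_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    history = [[1.0, 0.0], [0.8, 0.2]]
    catalog = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
    for item_id, score in recommend(history, catalog):
        print(f"{item_id}: {score:.3f}")
```

Recomputing the profile as the user reads more articles is cheap, which is what makes real-time updates practical with pre-computed item embeddings.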

When Bi-Encoder Alone Suffices

Not every use case needs two-stage retrieval. Consider bi-encoder only when:

• Finding similar items (not query-document matching)
• High similarity threshold (duplicates, near-matches)
• Latency constraints under 50ms
• Lower accuracy is acceptable

Choosing the Right Architecture

Use this decision framework to select the right approach for your use case:

Architecture Selection Decision Tree

Corpus size > 10,000 documents?

Yes → You need bi-encoder for first-stage retrieval

Real-time latency requirements (< 500ms)?

Yes → Two-stage with limited reranking candidates

High-stakes decisions (legal, medical, financial)?

Yes → Definitely add cross-encoder reranking

Simple similarity matching (duplicates, recommendations)?

Yes → A bi-encoder alone may be sufficient

Model Selection Guide

Choosing the right pre-trained models significantly impacts performance:

Use Case	Recommended Bi-Encoder	Recommended Cross-Encoder
General English	all-MiniLM-L6-v2	cross-encoder/ms-marco-MiniLM-L-6-v2
Higher Accuracy	all-mpnet-base-v2	cross-encoder/ms-marco-electra-base
Multilingual	paraphrase-multilingual-MiniLM-L12-v2	cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
Long Documents	BAAI/bge-large-en-v1.5	BAAI/bge-reranker-large
Maximum Quality	intfloat/e5-large-v2	cross-encoder/stsb-roberta-large (trained on STS-B sentence similarity; validate on your ranking task)

Fine-Tuning Recommendation

Pre-trained models work well for general use, but fine-tuning on your domain data typically improves performance by 10-20%. This is especially true for specialised domains like legal, medical, or technical content. Bi-encoders are usually fine-tuned with contrastive learning on query-document pairs, and cross-encoders with classification fine-tuning on labelled pairs.
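To make the contrastive objective concrete, here is a pure-Python sketch of an in-batch negatives loss, the idea behind MultipleNegativesRankingLoss: each query should score its own document higher than every other document in the batch. The function names and the scale value of 20 are assumptions for illustration, and the toy vectors stand in for model outputs; in practice you would use the library's loss on real embeddings.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_negatives_loss(
    query_vecs: List[Sequence[float]],
    doc_vecs: List[Sequence[float]],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]
    and every other document in the batch serves as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the batch
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
    swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong document
    print("aligned loss:", in_batch_negatives_loss(queries, aligned_docs))
    print("swapped loss:", in_batch_negatives_loss(queries, swapped_docs))
```

The loss is near zero when each query sits closest to its own document and grows when another document in the batch scores higher, which is the signal that pulls matching pairs together during training.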

Integration with Vector Databases

In production, you'll typically store bi-encoder embeddings in a vector database. Here's how the pattern works with popular options:

Two-Stage Retrieval with Pinecone (Python)

from pinecone import Pinecone
from sentence_transformers import SentenceTransformer, CrossEncoder

# Initialize (Pinecone client v3+; older clients used pinecone.init(...) instead)
pc = Pinecone(api_key="your-api-key")
index = pc.Index("semantic-search")

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def two_stage_search(query: str, top_k: int = 10) -> list:
    # Stage 1: Vector search with Pinecone
    query_embedding = bi_encoder.encode(query).tolist()

    results = index.query(
        vector=query_embedding,
        top_k=100,  # Retrieve more for reranking
        include_metadata=True
    )

    # Extract documents from results
    # (assumes the document text was stored under metadata['text'] at upsert time)
    candidates = [
        (match.id, match.metadata['text'])
        for match in results.matches
    ]

    # Stage 2: Cross-encoder reranking
    pairs = [[query, doc] for _, doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Combine IDs and text with reranked scores
    reranked = sorted(
        zip([c[0] for c in candidates], [c[1] for c in candidates], scores),
        key=lambda x: x[2],
        reverse=True
    )

    return reranked[:top_k]

The same pattern works with other vector databases like Qdrant, Weaviate, Milvus, or pgvector. The bi-encoder handles the initial retrieval from the vector index, and the cross-encoder refines the ranking.

Hybrid Search: Adding Keyword Matching

Many production systems combine semantic search with traditional keyword matching for even better results:

Hybrid Search with BM25 + Semantic + Reranking (Python)

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

class HybridRetriever:
    def __init__(self, documents: list[str]):
        # BM25 for keyword matching
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Bi-encoder for semantic matching (unit-length vectors, so dot product = cosine)
        self.bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.doc_embeddings = self.bi_encoder.encode(documents, normalize_embeddings=True)

        # Cross-encoder for reranking
        self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

        self.documents = documents

    def search(self, query: str, top_k: int = 10, alpha: float = 0.5):
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = bm25_scores / (bm25_scores.max() + 1e-6)  # Normalize

        # Semantic scores
        query_emb = self.bi_encoder.encode(query, normalize_embeddings=True)
        semantic_scores = np.dot(self.doc_embeddings, query_emb)
        semantic_scores = (semantic_scores - semantic_scores.min()) / (
            semantic_scores.max() - semantic_scores.min() + 1e-6
        )

        # Combine scores (alpha weights semantic vs keyword matching)
        hybrid_scores = alpha * semantic_scores + (1 - alpha) * bm25_scores

        # Get top candidates for reranking
        top_indices = np.argsort(hybrid_scores)[::-1][:100]
        candidates = [self.documents[i] for i in top_indices]

        # Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:top_k]

Conclusion

Bi-encoders and cross-encoders represent two points on the speed-accuracy trade-off curve. Bi-encoders enable searching billions of documents in milliseconds through pre-computed embeddings, while cross-encoders capture nuanced semantic relationships with superior accuracy but can only process hundreds of pairs per query.

For most production systems, the answer isn't choosing one or the other - it's combining them. The two-stage retrieval pattern (bi-encoder retrieval followed by cross-encoder reranking) has become the de facto standard because it captures most of the cross-encoder's accuracy benefits while maintaining real-time latency.

As you implement semantic search, RAG systems, or recommendation engines, start with the two-stage pattern. Tune the number of candidates retrieved and reranked based on your latency budget and accuracy requirements. And remember that fine-tuning on your specific domain data often provides the biggest performance gains of all.
