
TL;DR
- Reranking techniques such as cross-encoder reranking improve RAG accuracy by 20-35% but add 200-500ms of latency per query.
- For production systems, rerank the top 20-50 retrieved documents down to 5-10 for the LLM to maximize relevance while controlling costs.
- Cohere Rerank and ms-marco-MiniLM-L-6-v2 offer the best balance of accuracy and speed for most applications.
RAG systems often fail not because of poor embeddings or weak LLMs, but because they feed irrelevant information to the generation stage.
Initial retrieval casts a wide net, returning documents that are semantically similar but not actually relevant to answering the specific query. This is where reranking transforms “good enough” RAG systems into production-grade applications that users trust.
Reranking is the critical second stage that separates signal from noise, ensuring your LLM works with the most relevant context rather than just the most similar vectors.
The Two-Stage Retrieval Architecture
Why Bi-Encoders Aren’t Enough
Traditional RAG systems rely on bi-encoder models (like sentence-transformers) that process queries and documents independently, creating separate embeddings and comparing them via cosine similarity.
This approach is fast and scalable but has fundamental limitations that become apparent in production systems.
The Core Problem with Independent Encoding: When a bi-encoder processes the query “What are the side effects of ACE inhibitors in diabetic patients?” and a document about “Cardiovascular medications and complications in diabetes management,” it creates two separate vector representations.
The similarity calculation happens in vector space without the model ever “seeing” both pieces of text together.
This separation means the model can’t understand nuanced relationships. It might match the query to a document because both contain “diabetes” and “medication,” but it can’t determine that the document specifically addresses ACE inhibitor side effects versus general diabetes medication guidance.
This lack of contextual understanding leads to high recall (finding many relevant documents) but poor precision (many irrelevant documents mixed in).
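To make the contrast concrete, here is a minimal sketch of bi-encoder retrieval with sentence-transformers: the query and each document are embedded independently, and relevance is reduced to a cosine similarity between vectors that were computed without either text ever seeing the other. (The model name and example documents below are illustrative, not from a specific production setup.)
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: query and documents are embedded independently
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example bi-encoder model

query = "What are the side effects of ACE inhibitors in diabetic patients?"
documents = [
    "Cardiovascular medications and complications in diabetes management.",
    "ACE inhibitors can cause cough, hyperkalemia, and changes in kidney function.",
]

query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_embs = bi_encoder.encode(documents, convert_to_tensor=True)

# Relevance is reduced to cosine similarity between independently built vectors
scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in zip(documents, scores):
    print(f"{float(score):.3f}  {doc}")
The model never processes the query and a document together, which is exactly the gap cross-encoder reranking closes.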
Bi-Encoder Limitations:
- Context blindness: Documents are embedded without knowing what questions they’ll be used to answer
- Information compression: All document meaning compressed into a single vector (typically 768-1536 dimensions)
- Keyword bias: May miss documents that are semantically relevant but lexically different
- Ranking granularity: Cosine similarity provides coarse relevance scoring that doesn’t capture fine distinctions
Performance Impact in Production: In real-world RAG applications, bi-encoders alone achieve 65-80% relevance accuracy on complex queries. This means 20-35% of retrieved documents are irrelevant or only tangentially related to the user’s question. When these irrelevant documents reach the LLM, they create several problems:
- Hallucination risk: LLMs may generate responses based on irrelevant context
- Answer dilution: Correct information gets mixed with irrelevant details
- Increased costs: Processing irrelevant context wastes computational resources
- Poor user experience: Responses may be unfocused or contain extraneous information
The Business Impact: A customer support chatbot relying solely on bi-encoder retrieval might respond to “How do I reset my password?” by including information about account creation, security policies, and billing procedures because all these topics contain password-related keywords.
Users get overwhelmed with information when they need a simple, focused answer.
Cross-Encoder Reranking Architecture
Cross-encoders process query and document together, enabling rich interaction analysis that bi-encoders cannot capture.
Technical Advantages:
- Joint processing: Query and document are concatenated and processed simultaneously
- Attention mechanisms: Can focus on specific query-document relationships
- Fine-grained scoring: Produces calibrated relevance scores (0-1)
- Contextual understanding: Understands how documents specifically relate to queries
Implementation Pattern:
from sentence_transformers import CrossEncoder
import numpy as np
class ProductionReranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)
        self.model_name = model_name

    def rerank(self, query, documents, top_k=5):
        # Prepare query-document pairs
        pairs = [[query, doc['content']] for doc in documents]
        # Get relevance scores
        scores = self.model.predict(pairs)
        # Combine scores with documents
        scored_docs = []
        for i, doc in enumerate(documents):
            scored_docs.append({
                **doc,
                'rerank_score': float(scores[i]),
                'original_rank': i
            })
        # Sort by relevance and return top_k
        reranked = sorted(scored_docs, key=lambda x: x['rerank_score'], reverse=True)
        return reranked[:top_k]

Production Reranking Models
Cross-Encoder/ms-marco-MiniLM-L-6-v2
The most widely used open-source reranker, optimized for web search scenarios.
Performance Characteristics:
- Latency: 50-150ms for 20 documents
- Accuracy: 85-90% on web search benchmarks
- Model size: 90MB
- Languages: English-optimized
Best for: General-purpose RAG applications, technical documentation, customer support
Implementation:
# Optimized production usage
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)
def efficient_rerank(query, docs, target_count=5):
    # Limit input length to avoid truncation issues
    truncated_pairs = [[query[:200], doc['content'][:300]] for doc in docs]
    scores = reranker.predict(truncated_pairs)
    # Sort on the score only, so ties don't try to compare document dicts
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return ranked[:target_count]

Cohere Rerank API
Enterprise-grade reranking service with multilingual support and optimized performance.
Performance Characteristics:
- Latency: 100-300ms depending on document count
- Accuracy: 90-95% on benchmarks
- Languages: 100+ languages supported
- Cost: $0.002 per 1K tokens
API Implementation:
import cohere
co = cohere.Client("your-api-key")
def cohere_rerank(query, documents, top_k=5):
    # Prepare documents for API
    docs_text = [doc['content'] for doc in documents]
    response = co.rerank(
        model="rerank-english-v2.0",
        query=query,
        documents=docs_text,
        top_n=top_k  # Cohere's rerank endpoint uses top_n
    )
    # Map results back to original documents
    reranked = []
    for result in response.results:
        original_doc = documents[result.index]
        reranked.append({
            **original_doc,
            'rerank_score': result.relevance_score
        })
    return reranked

When to Use Cohere:
- Multilingual RAG applications
- Enterprise applications requiring high SLA guarantees
- Teams wanting managed infrastructure without model hosting
BGE-Reranker-Large
High-performance open-source reranker from Beijing Academy of Artificial Intelligence.
Performance Characteristics:
- Latency: 100-250ms for 20 documents
- Accuracy: 92-96% on MTEB benchmarks
- Model size: 1.3GB
- Languages: Excellent multilingual performance
Implementation:
from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
def bge_rerank(query, passages, top_k=5):
    # BGE expects [query, passage] pairs
    pairs = [[query, passage] for passage in passages]
    scores = reranker.compute_score(pairs, normalize=True)
    # Sort and return top results
    scored_passages = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
    return scored_passages[:top_k]

LLM-Based Reranking
Using general-purpose LLMs like GPT-4 or Claude for reranking tasks.
Implementation Pattern:
def llm_rerank(query, documents, llm_client, top_k=5):
    # Prepare documents with indices
    doc_list = "\n".join([
        f"{i+1}. {doc['title']}: {doc['content'][:200]}..."
        for i, doc in enumerate(documents)
    ])
    prompt = f"""
    Query: {query}

    Documents:
    {doc_list}

    Rank these documents by relevance to the query. Return only the top {top_k} document numbers in order of relevance.
    Response format: [3, 1, 5] (just the numbers)
    """
    response = llm_client.generate(prompt)
    rankings = parse_rankings(response)
    return [documents[i-1] for i in rankings if 1 <= i <= len(documents)]

Trade-offs:
- Accuracy: Often highest for complex reasoning tasks
- Cost: 10-50x more expensive than dedicated rerankers
- Latency: 1-5 seconds depending on LLM provider
- Use cases: High-stakes applications where accuracy justifies cost
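The parse_rankings helper referenced in the pattern above is left undefined. A minimal sketch, assuming the LLM follows the bracketed-list response format requested in the prompt, could look like this:
import re

def parse_rankings(response_text):
    """Extract document numbers from an LLM response like '[3, 1, 5]'.

    Minimal sketch: pulls the first bracketed list if present, otherwise
    falls back to any integers found in the text, and de-duplicates while
    preserving order.
    """
    match = re.search(r"\[(.*?)\]", response_text)
    candidates = match.group(1) if match else response_text
    rankings = []
    for token in re.findall(r"\d+", candidates):
        idx = int(token)
        if idx not in rankings:
            rankings.append(idx)
    return rankings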
Advanced Reranking Techniques
Hybrid Reranking
Combine multiple reranking signals for improved accuracy:
class HybridReranker:
    def __init__(self):
        self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.bm25_weight = 0.3
        self.semantic_weight = 0.4
        self.cross_encoder_weight = 0.3

    def hybrid_rerank(self, query, documents, original_scores, bm25_scores):
        # Get cross-encoder scores
        pairs = [[query, doc['content']] for doc in documents]
        cross_scores = self.cross_encoder.predict(pairs)
        # Normalize all scores to [0, 1]
        norm_semantic = self.normalize_scores(original_scores)
        norm_bm25 = self.normalize_scores(bm25_scores)
        norm_cross = self.normalize_scores(cross_scores)
        # Weighted combination
        final_scores = []
        for i in range(len(documents)):
            score = (
                norm_semantic[i] * self.semantic_weight +
                norm_bm25[i] * self.bm25_weight +
                norm_cross[i] * self.cross_encoder_weight
            )
            final_scores.append(score)
        # Sort by combined score (sort on the score only, so ties don't compare dicts)
        scored_docs = sorted(zip(final_scores, documents), key=lambda x: x[0], reverse=True)
        return scored_docs

    def normalize_scores(self, scores):
        scores = np.array(scores, dtype=float)
        score_range = scores.max() - scores.min()
        if score_range == 0:
            return np.ones_like(scores)  # All scores equal; avoid division by zero
        return (scores - scores.min()) / score_range

Multi-Hop Reranking
For complex queries requiring information from multiple documents:
def multi_hop_rerank(query, documents, max_hops=2):
    # First hop: initial reranking
    first_hop = standard_rerank(query, documents, top_k=10)
    if max_hops == 1:
        return first_hop
    # Generate a follow-up query based on first-hop results
    context = " ".join([doc['content'][:200] for doc in first_hop[:3]])
    follow_up_query = generate_followup_query(query, context)
    # Second hop: rerank with the refined query
    second_hop = standard_rerank(follow_up_query, documents, top_k=10)
    # Combine results with decay
    combined_scores = {}
    for doc in first_hop:
        combined_scores[doc['id']] = doc['score'] * 1.0  # Full weight
    for doc in second_hop:
        doc_id = doc['id']
        if doc_id in combined_scores:
            combined_scores[doc_id] += doc['score'] * 0.5  # Reduced weight
        else:
            combined_scores[doc_id] = doc['score'] * 0.5
    # Final ranking
    return sort_by_combined_scores(documents, combined_scores)

Query Expansion + Reranking
Enhance retrieval coverage before reranking:
def expanded_query_rerank(original_query, documents):
    # Generate query variations
    expanded_queries = [
        original_query,
        generate_synonymous_query(original_query),
        generate_specific_query(original_query),
        generate_abstract_query(original_query)
    ]
    # Collect candidates from all query variations
    all_candidates = set()
    for query_variant in expanded_queries:
        candidates = retrieve_candidates(query_variant, top_k=15)
        all_candidates.update([doc['id'] for doc in candidates])
    # Retrieve the full candidate set
    candidate_docs = [doc for doc in documents if doc['id'] in all_candidates]
    # Rerank with the original query
    return rerank_with_cross_encoder(original_query, candidate_docs, top_k=5)

Performance Optimization Strategies
Batched Reranking
Process multiple queries efficiently:
class BatchedReranker:
    def __init__(self, model_name, batch_size=16):
        self.model = CrossEncoder(model_name)
        self.batch_size = batch_size

    def batch_rerank(self, query_doc_pairs):
        """
        query_doc_pairs: List of (query, [documents]) tuples
        """
        all_pairs = []
        pair_metadata = []
        # Flatten all query-document combinations
        for query_idx, (query, docs) in enumerate(query_doc_pairs):
            for doc_idx, doc in enumerate(docs):
                all_pairs.append([query, doc['content']])
                pair_metadata.append({
                    'query_idx': query_idx,
                    'doc_idx': doc_idx
                })
        # Process in batches
        all_scores = []
        for i in range(0, len(all_pairs), self.batch_size):
            batch = all_pairs[i:i + self.batch_size]
            scores = self.model.predict(batch)
            all_scores.extend(scores)
        # Group results by query
        results = [[] for _ in query_doc_pairs]
        for score, metadata in zip(all_scores, pair_metadata):
            query_idx = metadata['query_idx']
            doc_idx = metadata['doc_idx']
            results[query_idx].append({
                'doc': query_doc_pairs[query_idx][1][doc_idx],
                'score': score,
                'original_idx': doc_idx
            })
        # Sort each query's results
        for query_results in results:
            query_results.sort(key=lambda x: x['score'], reverse=True)
        return results

Caching Strategies
Cache reranking results for repeated queries:
import hashlib
import json
from functools import lru_cache

class CachedReranker:
    def __init__(self, reranker_model, cache_size=10000):
        self.reranker = reranker_model
        self.cache_size = cache_size

    def generate_cache_key(self, query, doc_ids):
        """Generate a deterministic cache key"""
        content = query + "|".join(sorted(doc_ids))
        return hashlib.md5(content.encode()).hexdigest()

    @lru_cache(maxsize=10000)
    def cached_rerank(self, cache_key, query, documents_json, top_k):
        """Cached reranking with serialized documents"""
        documents = json.loads(documents_json)
        return self.reranker.rerank(query, documents, top_k)

    def rerank_with_cache(self, query, documents, top_k=5):
        doc_ids = [doc.get('id', str(i)) for i, doc in enumerate(documents)]
        cache_key = self.generate_cache_key(query, doc_ids)
        # Use the cached result if available
        try:
            documents_json = json.dumps(documents, sort_keys=True)
            return self.cached_rerank(cache_key, query, documents_json, top_k)
        except Exception:
            # Fall back to uncached reranking
            return self.reranker.rerank(query, documents, top_k)

Production Implementation Guidelines
import time

class ProductionRerankingPipeline:
    def __init__(self, config):
        self.retrieval_count = config.get('retrieval_count', 20)
        self.rerank_count = config.get('rerank_count', 5)
        self.reranker_type = config.get('reranker_type', 'cross_encoder')
        # Initialize reranker based on type
        if self.reranker_type == 'cross_encoder':
            # ProductionReranker (defined earlier) wraps the cross-encoder with a rerank() method
            self.reranker = ProductionReranker('cross-encoder/ms-marco-MiniLM-L-6-v2')
        elif self.reranker_type == 'cohere':
            self.reranker = CohereReranker(api_key=config['cohere_api_key'])
        # Performance monitoring
        self.metrics = RetrievalMetrics()

    def process_query(self, query, vector_store):
        # Stage 1: Initial retrieval
        start_time = time.time()
        initial_docs = vector_store.similarity_search(
            query,
            k=self.retrieval_count
        )
        retrieval_time = time.time() - start_time
        # Stage 2: Reranking
        start_time = time.time()
        reranked_docs = self.reranker.rerank(
            query,
            initial_docs,
            top_k=self.rerank_count
        )
        rerank_time = time.time() - start_time
        # Track metrics
        self.metrics.record_query(
            query=query,
            retrieval_time=retrieval_time,
            rerank_time=rerank_time,
            initial_count=len(initial_docs),
            final_count=len(reranked_docs)
        )
        return reranked_docs

Error Handling and Fallbacks
class RobustReranker:
    def __init__(self, primary_reranker, fallback_strategy='similarity'):
        self.primary = primary_reranker
        self.fallback = fallback_strategy

    def rerank_with_fallback(self, query, documents, top_k=5):
        try:
            # Attempt primary reranking
            return self.primary.rerank(query, documents, top_k)
        except Exception as e:
            # Log the error and fall back
            logger.error(f"Primary reranker failed: {e}")
            if self.fallback == 'similarity':
                # Fall back to the original similarity scores
                return sorted(
                    documents,
                    key=lambda x: x.get('similarity_score', 0),
                    reverse=True
                )[:top_k]
            elif self.fallback == 'bm25':
                # Fall back to BM25 scoring
                return self.bm25_fallback(query, documents, top_k)
            else:
                # Return the original order as a last resort
                return documents[:top_k]

Monitoring and Observability
class RerankerMonitoring:
    def __init__(self):
        self.query_metrics = []

    def log_rerank_performance(self, query, initial_docs, reranked_docs, latency):
        # Calculate relevance improvement
        relevance_gain = self.calculate_relevance_gain(initial_docs, reranked_docs)
        metrics = {
            'timestamp': time.time(),
            'query_length': len(query.split()),
            'doc_count': len(initial_docs),
            'rerank_latency': latency,
            'relevance_gain': relevance_gain,
            'top_score': reranked_docs[0]['rerank_score'] if reranked_docs else 0
        }
        self.query_metrics.append(metrics)
        # Alert on performance degradation
        if latency > 1000:  # 1 second threshold (latency in ms)
            self.alert_high_latency(query, latency)
        if relevance_gain < 0.1:  # Low improvement threshold
            self.alert_low_relevance_gain(query, relevance_gain)

    def generate_performance_report(self, time_window_hours=24):
        cutoff_time = time.time() - (time_window_hours * 3600)
        recent_metrics = [m for m in self.query_metrics if m['timestamp'] > cutoff_time]
        return {
            'total_queries': len(recent_metrics),
            'avg_latency': np.mean([m['rerank_latency'] for m in recent_metrics]),
            'p95_latency': np.percentile([m['rerank_latency'] for m in recent_metrics], 95),
            'avg_relevance_gain': np.mean([m['relevance_gain'] for m in recent_metrics]),
            'high_latency_queries': len([m for m in recent_metrics if m['rerank_latency'] > 1000])
        }

Integration with RAG Frameworks
LangChain Integration
# Note: import paths and class names vary across LangChain versions
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.chains import RetrievalQA

# Set up reranker (CrossEncoderReranker wraps a cross-encoder model and keeps the top_n documents)
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
reranker = CrossEncoderReranker(model=model, top_n=5)

# Wrap the base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

# Use in a RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    chain_type="stuff"
)
Custom RAG Pipeline Integration
class CustomRAGWithReranking:
    def __init__(self, vectorstore, reranker, llm):
        self.vectorstore = vectorstore
        self.reranker = reranker
        self.llm = llm

    def query(self, question, top_k_retrieve=20, top_k_rerank=5):
        # Step 1: Initial retrieval
        initial_docs = self.vectorstore.similarity_search(question, k=top_k_retrieve)
        # Step 2: Reranking
        reranked_docs = self.reranker.rerank(question, initial_docs, top_k_rerank)
        # Step 3: Context preparation
        context = "\n\n".join([
            f"Document {i+1}: {doc['content']}"
            for i, doc in enumerate(reranked_docs)
        ])
        # Step 4: Generation
        prompt = f"""
        Based on the following context, answer the question: {question}

        Context:
        {context}

        Answer:
        """
        response = self.llm.generate(prompt)
        return {
            'answer': response,
            'sources': reranked_docs,
            'initial_retrieval_count': len(initial_docs),
            'reranked_count': len(reranked_docs)
        }

Cost-Performance Optimization
Selective Reranking
Only rerank when necessary to save computational costs:
def should_rerank(query, initial_scores):
    """Decide whether to rerank based on the score distribution"""
    scores = np.array(initial_scores)
    # If the top scores are very similar, reranking likely helps
    top_5_variance = np.var(scores[:5])
    if top_5_variance < 0.01:
        return True
    # If the top score is much higher than the rest, reranking may not help
    score_gap = scores[0] - scores[1]
    if score_gap > 0.3:
        return False
    # Default to reranking for ambiguous cases
    return True

def conditional_rerank(query, documents, reranker):
    scores = [doc.get('similarity_score', 0) for doc in documents]
    if should_rerank(query, scores):
        return reranker.rerank(query, documents)
    else:
        return documents[:5]  # Return the top 5 without reranking

Frequently Asked Questions
Should I always rerank, or only for specific query types?
Rerank selectively based on query complexity and initial retrieval confidence. Simple factual queries with high-confidence initial results may not benefit from reranking. Complex multi-part questions or queries with low initial score variance see the most improvement from reranking.
What’s the optimal number of documents to rerank?
Retrieve 20-50 documents initially and rerank to 5-10 for the LLM. This balance maximizes recall while controlling costs. Reranking more than 50 documents shows diminishing returns and increases latency significantly.
How do I choose between different reranking models?
Start with ms-marco-MiniLM-L-6-v2 for general use cases—it’s fast, accurate, and well-tested. Upgrade to BGE-reranker-large for multilingual needs or Cohere for enterprise SLA requirements. Use LLM-based reranking only for high-stakes applications where accuracy justifies 10-50x higher costs.
Can reranking make retrieval worse?
Yes, if the reranker is trained on different data distributions than your use case. Always A/B test reranking on your specific queries. Poor reranking can hurt more than it helps, especially if your initial retrieval is already well-tuned.
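One lightweight way to run that comparison offline is to measure hit rate at k on a small set of labeled queries, with and without the reranker. A minimal sketch follows; the evaluate_hit_rate name, the labeled_queries structure, and the retrieve function are illustrative assumptions, not part of any library:
def evaluate_hit_rate(labeled_queries, retrieve, reranker=None, k=5):
    """Fraction of queries whose known-relevant doc appears in the top-k results.

    labeled_queries: list of {'query': str, 'relevant_id': str}
    retrieve: function(query) -> list of candidate doc dicts with an 'id' key
    reranker: optional object with a rerank(query, docs, top_k) method
    """
    hits = 0
    for item in labeled_queries:
        candidates = retrieve(item['query'])
        if reranker is not None:
            candidates = reranker.rerank(item['query'], candidates, top_k=k)
        top_ids = [doc['id'] for doc in candidates[:k]]
        hits += int(item['relevant_id'] in top_ids)
    return hits / len(labeled_queries)

# Compare retrieval alone vs. retrieval + reranking on the same labeled set
baseline = evaluate_hit_rate(labeled_queries, retrieve, reranker=None)
with_rerank = evaluate_hit_rate(labeled_queries, retrieve, reranker=reranker)
print(f"hit@5 baseline: {baseline:.2f}, with reranking: {with_rerank:.2f}")
If the reranked hit rate is not clearly better on your own queries, the reranker may be mismatched to your domain.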
How do I handle reranking latency in real-time applications?
Implement asynchronous reranking, caching for common queries, and fallback strategies. For latency-critical applications, consider lighter rerankers like ms-marco-MiniLM-L-6-v2 over larger models. Cache reranking results for frequently asked questions.
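As a minimal sketch of one way to enforce a latency budget, you can run a synchronous reranker in a worker thread via asyncio and fall back to the original similarity ordering if it does not finish in time. The 300ms budget is an illustrative value, and the reranker is assumed to expose the rerank(query, documents, top_k) method used throughout this post:
import asyncio

async def rerank_with_budget(reranker, query, documents, top_k=5, budget_s=0.3):
    """Run a synchronous reranker with a latency budget.

    If reranking does not complete within budget_s seconds, fall back to the
    original similarity ordering so the user still gets a timely answer.
    (On timeout the worker thread keeps running; the result is simply ignored.)
    """
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(None, reranker.rerank, query, documents, top_k),
            timeout=budget_s,
        )
    except asyncio.TimeoutError:
        # Fallback: keep the top_k documents by their original similarity scores
        return sorted(
            documents,
            key=lambda d: d.get('similarity_score', 0),
            reverse=True,
        )[:top_k]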
What’s the ROI of implementing reranking in production?
Reranking typically improves RAG accuracy by 20-35% with 200-500ms additional latency. For customer-facing applications, this often translates to higher user satisfaction and reduced support tickets. The computational cost (2-10x higher than retrieval alone) is usually justified by improved user experience.
Reranking is one of the highest-impact optimizations you can make to a RAG system. The key is choosing the right model and implementation strategy for your specific accuracy, latency, and cost requirements rather than defaulting to the most sophisticated approach available.
For more RAG API-related information:
- CustomGPT.ai’s open-source UI starter kit (custom chat screens, an embeddable chat window, and a floating website chatbot) with 9 social AI integration bots and related setup tutorials.
- Find our API sample usage code snippets here.
- Our RAG API’s Postman-hosted collection – test the APIs on Postman with just 1 click.
- Our Developer API documentation.
- API explainer videos on YouTube and a dev-focused playlist.
- Join our bi-weekly developer office hours and browse past recordings of the Dev Office Hours.
P.S. Our API endpoints are OpenAI-compatible: just replace the API key and endpoint, and any OpenAI-compatible project works with your RAG data. Find more here.
Want to build something with our Hosted MCPs? Check out the docs.
Priyansh is a Developer Relations Advocate who loves technology, writes about it, and creates deeply researched content.