
TL;DR
- RAG reranking techniques like cross-encoder reranking improve RAG accuracy by 20-35% but add 200-500ms latency per query.
- For production systems, rerank the top 20-50 retrieved documents down to 5-10 for the LLM to maximize relevance while controlling costs.
- Cohere Rerank and ms-marco-MiniLM-L-6-v2 offer the best balance of accuracy and speed for most applications.
RAG systems often fail not because of poor embeddings or weak LLMs, but because they feed irrelevant information to the generation stage.
Initial retrieval casts a wide net, returning documents that are semantically similar but not actually relevant to answering the specific query. This is where reranking transforms “good enough” RAG systems into production-grade applications that users trust.
Reranking is the critical second stage that separates signal from noise, ensuring your LLM works with the most relevant context rather than just the most similar vectors.
The Two-Stage Retrieval Architecture
Why Bi-Encoders Aren’t Enough
Traditional RAG systems rely on bi-encoder models (like sentence-transformers) that process queries and documents independently, creating separate embeddings and comparing them via cosine similarity.
This approach is fast and scalable but has fundamental limitations that become apparent in production systems.
The Core Problem with Independent Encoding: When a bi-encoder processes the query “What are the side effects of ACE inhibitors in diabetic patients?” and a document about “Cardiovascular medications and complications in diabetes management,” it creates two separate vector representations.
The similarity calculation happens in vector space without the model ever “seeing” both pieces of text together.
This separation means the model can’t understand nuanced relationships. It might match the query to a document because both contain “diabetes” and “medication,” but it can’t determine that the document specifically addresses ACE inhibitor side effects versus general diabetes medication guidance.
This lack of contextual understanding leads to high recall (finding many relevant documents) but poor precision (many irrelevant documents mixed in).
Bi-Encoder Limitations:
- Context blindness: Documents are embedded without knowing what questions they’ll be used to answer
- Information compression: All document meaning compressed into a single vector (typically 768-1536 dimensions)
- Keyword bias: May miss documents that are semantically relevant but lexically different
- Ranking granularity: Cosine similarity provides coarse relevance scoring that doesn’t capture fine distinctions
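To make the "ranking granularity" point concrete, here is a minimal sketch using plain cosine similarity over toy, hand-made 4-dimensional vectors (real model embeddings have hundreds of dimensions; the vectors and their interpretations below are illustrative assumptions, not output from any actual encoder):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": both documents share surface-level topic dimensions
# with the query, so both score highly even though only one of them
# actually answers the question.
query = [0.9, 0.8, 0.1, 0.0]         # "ACE inhibitor side effects in diabetics"
doc_specific = [0.8, 0.9, 0.2, 0.1]  # directly about ACE inhibitor side effects
doc_generic = [0.9, 0.7, 0.0, 0.3]   # generic diabetes medication overview

print(cosine_similarity(query, doc_specific))
print(cosine_similarity(query, doc_generic))
```

Both similarities land above 0.95 and within a few hundredths of each other, which is exactly the coarse separation that makes a second, query-aware scoring stage valuable.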
Performance Impact in Production: In real-world RAG applications, bi-encoders alone achieve 65-80% relevance accuracy on complex queries. This means 20-35% of retrieved documents are irrelevant or only tangentially related to the user’s question. When these irrelevant documents reach the LLM, they create several problems:
- Hallucination risk: LLMs may generate responses based on irrelevant context
- Answer dilution: Correct information gets mixed with irrelevant details
- Increased costs: Processing irrelevant context wastes computational resources
- Poor user experience: Responses may be unfocused or contain extraneous information
The Business Impact: A customer support chatbot relying solely on bi-encoder retrieval might respond to “How do I reset my password?” by including information about account creation, security policies, and billing procedures because all these topics contain password-related keywords.
Users get overwhelmed with information when they need a simple, focused answer.
Cross-Encoder Reranking Architecture
Cross-encoders process query and document together, enabling rich interaction analysis that bi-encoders cannot capture.
Technical Advantages:
- Joint processing: Query and document are concatenated and processed simultaneously
- Attention mechanisms: Can focus on specific query-document relationships
- Fine-grained scoring: Produces fine-grained relevance scores (typically normalized to a 0-1 range)
- Contextual understanding: Understands how documents specifically relate to queries
Implementation Pattern:
```python
from sentence_transformers import CrossEncoder
import numpy as np

class ProductionReranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)
        self.model_name = model_name

    def rerank(self, query, documents, top_k=5):
        # Prepare query-document pairs
        pairs = [[query, doc['content']] for doc in documents]
        # Get relevance scores
        scores = self.model.predict(pairs)
        # Combine scores with documents
        scored_docs = []
        for i, doc in enumerate(documents):
            scored_docs.append({
                **doc,
                'rerank_score': float(scores[i]),
                'original_rank': i
            })
        # Sort by relevance and return top_k
        reranked = sorted(scored_docs, key=lambda x: x['rerank_score'], reverse=True)
        return reranked[:top_k]
```
Production Reranking Models
Cross-Encoder/ms-marco-MiniLM-L-6-v2
The most widely used open-source reranker, optimized for web search scenarios.
Performance Characteristics:
- Latency: 50-150ms for 20 documents
- Accuracy: 85-90% on web search benchmarks
- Model size: 90MB
- Languages: English-optimized
Best for: General-purpose RAG applications, technical documentation, customer support
Implementation:
```python
# Optimized production usage
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

def efficient_rerank(query, docs, target_count=5):
    # Limit input length to avoid truncation issues
    truncated_pairs = [[query[:200], doc['content'][:300]] for doc in docs]
    scores = reranker.predict(truncated_pairs)
    # Sort on the score only; comparing the doc dicts on score ties would raise a TypeError
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return ranked[:target_count]
```
Cohere Rerank API
Enterprise-grade reranking service with multilingual support and optimized performance.
Performance Characteristics:
- Latency: 100-300ms depending on document count
- Accuracy: 90-95% on benchmarks
- Languages: 100+ languages supported
- Cost: $0.002 per 1K tokens
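At $0.002 per 1K tokens, the cost per query scales with how many candidates you rerank and how long they are. A back-of-envelope estimator (the query volume, candidate count, and token length below are illustrative assumptions, not measurements):

```python
def rerank_cost_usd(queries_per_day, docs_per_query, avg_tokens_per_doc,
                    price_per_1k_tokens=0.002):
    # Rough back-of-envelope: every reranked document's tokens are billed
    daily_tokens = queries_per_day * docs_per_query * avg_tokens_per_doc
    return daily_tokens / 1000 * price_per_1k_tokens

# e.g. 10k queries/day, reranking 20 docs of ~300 tokens each
daily = rerank_cost_usd(10_000, 20, 300)
print(f"${daily:.2f}/day, ${daily * 30:.2f}/month")  # → $120.00/day, $3600.00/month
```

This is one reason the standard pattern is to rerank only the top 20-50 retrieved candidates rather than the full result set.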
API Implementation:
```python
import cohere

co = cohere.Client("your-api-key")

def cohere_rerank(query, documents, top_k=5):
    # Prepare documents for API
    docs_text = [doc['content'] for doc in documents]
    response = co.rerank(
        model="rerank-english-v2.0",
        query=query,
        documents=docs_text,
        top_n=top_k  # Cohere's parameter is top_n, not top_k
    )
    # Map results back to original documents
    reranked = []
    for result in response.results:
        original_doc = documents[result.index]
        reranked.append({
            **original_doc,
            'rerank_score': result.relevance_score
        })
    return reranked
```
When to Use Cohere:
- Multilingual RAG applications
- Enterprise applications requiring high SLA guarantees
- Teams wanting managed infrastructure without model hosting
BGE-Reranker-Large
High-performance open-source reranker from Beijing Academy of Artificial Intelligence.
Performance Characteristics:
- Latency: 100-250ms for 20 documents
- Accuracy: 92-96% on MTEB benchmarks
- Model size: 1.3GB
- Languages: Excellent multilingual performance
Implementation:
```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

def bge_rerank(query, passages, top_k=5):
    # BGE expects [query, passage] pairs
    pairs = [[query, passage] for passage in passages]
    scores = reranker.compute_score(pairs, normalize=True)
    # Sort on the score only and return the top results
    scored_passages = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return scored_passages[:top_k]
```
LLM-Based Reranking
Using general-purpose LLMs like GPT-4 or Claude for reranking tasks.
Implementation Pattern:
```python
def llm_rerank(query, documents, llm_client, top_k=5):
    # Prepare documents with indices
    doc_list = "\n".join([
        f"{i+1}. {doc['title']}: {doc['content'][:200]}..."
        for i, doc in enumerate(documents)
    ])
    prompt = f"""
Query: {query}

Documents:
{doc_list}

Rank these documents by relevance to the query. Return only the top {top_k} document numbers in order of relevance.
Response format: [3, 1, 5] (just the numbers)
"""
    response = llm_client.generate(prompt)
    rankings = parse_rankings(response)
    return [documents[i-1] for i in rankings if 1 <= i <= len(documents)]
```
Trade-offs:
- Accuracy: Often highest for complex reasoning tasks
- Cost: 10-50x more expensive than dedicated rerankers
- Latency: 1-5 seconds depending on LLM provider
- Use cases: High-stakes applications where accuracy justifies cost
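The `llm_rerank` pattern above relies on a `parse_rankings` helper that is not shown. A minimal regex-based sketch (the name and behavior are assumptions; it is deliberately tolerant of prose around the number list, since LLMs do not always follow the response format exactly):

```python
import re

def parse_rankings(response_text):
    # Pull every integer out of the model's reply, so "[3, 1, 5]" and
    # "The most relevant are 3, 1 and 5" both yield [3, 1, 5].
    # Deduplicate while preserving order, since models sometimes repeat numbers.
    seen = set()
    rankings = []
    for match in re.findall(r'\d+', response_text):
        idx = int(match)
        if idx not in seen:
            seen.add(idx)
            rankings.append(idx)
    return rankings

print(parse_rankings("[3, 1, 5]"))          # → [3, 1, 5]
print(parse_rankings("Top docs: 2, 2, 4"))  # → [2, 4]
```

The bounds check in `llm_rerank` (`1 <= i <= len(documents)`) then discards any hallucinated indices this parser lets through.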
Advanced Reranking Techniques
Hybrid Reranking
Combine multiple reranking signals for improved accuracy:
```python
class HybridReranker:
    def __init__(self):
        self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.bm25_weight = 0.3
        self.semantic_weight = 0.4
        self.cross_encoder_weight = 0.3

    def hybrid_rerank(self, query, documents, original_scores, bm25_scores):
        # Get cross-encoder scores
        pairs = [[query, doc['content']] for doc in documents]
        cross_scores = self.cross_encoder.predict(pairs)
        # Normalize all scores to [0, 1]
        norm_semantic = self.normalize_scores(original_scores)
        norm_bm25 = self.normalize_scores(bm25_scores)
        norm_cross = self.normalize_scores(cross_scores)
        # Weighted combination
        final_scores = []
        for i in range(len(documents)):
            score = (
                norm_semantic[i] * self.semantic_weight +
                norm_bm25[i] * self.bm25_weight +
                norm_cross[i] * self.cross_encoder_weight
            )
            final_scores.append(score)
        # Sort by combined score (key on the score avoids comparing dicts on ties)
        scored_docs = sorted(zip(final_scores, documents), key=lambda pair: pair[0], reverse=True)
        return scored_docs

    def normalize_scores(self, scores):
        scores = np.array(scores, dtype=float)
        score_range = scores.max() - scores.min()
        if score_range == 0:
            # All scores identical; avoid division by zero
            return np.zeros_like(scores)
        return (scores - scores.min()) / score_range
```
Multi-Hop Reranking
For complex queries requiring information from multiple documents:
```python
def multi_hop_rerank(query, documents, max_hops=2):
    # First hop: initial reranking
    first_hop = standard_rerank(query, documents, top_k=10)
    if max_hops == 1:
        return first_hop
    # Generate follow-up queries based on first hop results
    context = " ".join([doc['content'][:200] for doc in first_hop[:3]])
    follow_up_query = generate_followup_query(query, context)
    # Second hop: rerank with refined query
    second_hop = standard_rerank(follow_up_query, documents, top_k=10)
    # Combine results with decay
    combined_scores = {}
    for doc in first_hop:
        combined_scores[doc['id']] = doc['score'] * 1.0  # Full weight
    for doc in second_hop:
        doc_id = doc['id']
        if doc_id in combined_scores:
            combined_scores[doc_id] += doc['score'] * 0.5  # Reduced weight
        else:
            combined_scores[doc_id] = doc['score'] * 0.5
    # Final ranking
    return sort_by_combined_scores(documents, combined_scores)
```
Query Expansion + Reranking
Enhance retrieval coverage before reranking:
```python
def expanded_query_rerank(original_query, documents):
    # Generate query variations
    expanded_queries = [
        original_query,
        generate_synonymous_query(original_query),
        generate_specific_query(original_query),
        generate_abstract_query(original_query)
    ]
    # Collect candidates from all query variations
    all_candidates = set()
    for query_variant in expanded_queries:
        candidates = retrieve_candidates(query_variant, top_k=15)
        all_candidates.update([doc['id'] for doc in candidates])
    # Retrieve full candidate set
    candidate_docs = [doc for doc in documents if doc['id'] in all_candidates]
    # Rerank with original query
    return rerank_with_cross_encoder(original_query, candidate_docs, top_k=5)
```
Performance Optimization Strategies
Batched Reranking
Process multiple queries efficiently:
```python
class BatchedReranker:
    def __init__(self, model_name, batch_size=16):
        self.model = CrossEncoder(model_name)
        self.batch_size = batch_size

    def batch_rerank(self, query_doc_pairs):
        """
        query_doc_pairs: List of (query, [documents]) tuples
        """
        all_pairs = []
        pair_metadata = []
        # Flatten all query-document combinations
        for query_idx, (query, docs) in enumerate(query_doc_pairs):
            for doc_idx, doc in enumerate(docs):
                all_pairs.append([query, doc['content']])
                pair_metadata.append({
                    'query_idx': query_idx,
                    'doc_idx': doc_idx
                })
        # Process in batches
        all_scores = []
        for i in range(0, len(all_pairs), self.batch_size):
            batch = all_pairs[i:i + self.batch_size]
            scores = self.model.predict(batch)
            all_scores.extend(scores)
        # Group results by query
        results = [[] for _ in query_doc_pairs]
        for score, metadata in zip(all_scores, pair_metadata):
            query_idx = metadata['query_idx']
            doc_idx = metadata['doc_idx']
            results[query_idx].append({
                'doc': query_doc_pairs[query_idx][1][doc_idx],
                'score': score,
                'original_idx': doc_idx
            })
        # Sort each query's results
        for query_results in results:
            query_results.sort(key=lambda x: x['score'], reverse=True)
        return results
```
Caching Strategies
Cache reranking results for repeated queries:
```python
import hashlib
import json
from functools import lru_cache

class CachedReranker:
    def __init__(self, reranker_model, cache_size=10000):
        self.reranker = reranker_model
        self.cache_size = cache_size

    def generate_cache_key(self, query, doc_ids):
        """Generate deterministic cache key"""
        content = query + "|".join(sorted(doc_ids))
        return hashlib.md5(content.encode()).hexdigest()

    @lru_cache(maxsize=10000)
    def cached_rerank(self, cache_key, query, documents_json):
        """Cached reranking with serialized documents"""
        documents = json.loads(documents_json)
        return self.reranker.rerank(query, documents)

    def rerank_with_cache(self, query, documents, top_k=5):
        doc_ids = [doc.get('id', str(i)) for i, doc in enumerate(documents)]
        cache_key = self.generate_cache_key(query, doc_ids)
        # Use the cached result when the documents serialize cleanly
        try:
            documents_json = json.dumps(documents, sort_keys=True)
            return self.cached_rerank(cache_key, query, documents_json)
        except (TypeError, ValueError):
            # Fall back to uncached reranking
            return self.reranker.rerank(query, documents, top_k)
```
Production Implementation Guidelines
```python
import time

class ProductionRerankingPipeline:
    def __init__(self, config):
        self.retrieval_count = config.get('retrieval_count', 20)
        self.rerank_count = config.get('rerank_count', 5)
        self.reranker_type = config.get('reranker_type', 'cross_encoder')
        # Initialize reranker based on type
        if self.reranker_type == 'cross_encoder':
            self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        elif self.reranker_type == 'cohere':
            self.reranker = CohereReranker(api_key=config['cohere_api_key'])
        # Performance monitoring
        self.metrics = RetrievalMetrics()

    def process_query(self, query, vector_store):
        # Stage 1: Initial retrieval
        start_time = time.time()
        initial_docs = vector_store.similarity_search(
            query,
            k=self.retrieval_count
        )
        retrieval_time = time.time() - start_time
        # Stage 2: Reranking
        start_time = time.time()
        reranked_docs = self.reranker.rerank(
            query,
            initial_docs,
            top_k=self.rerank_count
        )
        rerank_time = time.time() - start_time
        # Track metrics
        self.metrics.record_query(
            query=query,
            retrieval_time=retrieval_time,
            rerank_time=rerank_time,
            initial_count=len(initial_docs),
            final_count=len(reranked_docs)
        )
        return reranked_docs
```
Error Handling and Fallbacks
```python
import logging

logger = logging.getLogger(__name__)

class RobustReranker:
    def __init__(self, primary_reranker, fallback_strategy='similarity'):
        self.primary = primary_reranker
        self.fallback = fallback_strategy

    def rerank_with_fallback(self, query, documents, top_k=5):
        try:
            # Attempt primary reranking
            return self.primary.rerank(query, documents, top_k)
        except Exception as e:
            # Log the error and fall back
            logger.error(f"Primary reranker failed: {e}")
            if self.fallback == 'similarity':
                # Fall back to original similarity scores
                return sorted(
                    documents,
                    key=lambda x: x.get('similarity_score', 0),
                    reverse=True
                )[:top_k]
            elif self.fallback == 'bm25':
                # Fall back to BM25 scoring
                return self.bm25_fallback(query, documents, top_k)
            else:
                # Return original order as last resort
                return documents[:top_k]
```
Monitoring and Observability
```python
class RerankerMonitoring:
    def __init__(self):
        self.query_metrics = []

    def log_rerank_performance(self, query, initial_docs, reranked_docs, latency):
        # Calculate relevance improvement
        relevance_gain = self.calculate_relevance_gain(initial_docs, reranked_docs)
        metrics = {
            'timestamp': time.time(),
            'query_length': len(query.split()),
            'doc_count': len(initial_docs),
            'rerank_latency': latency,
            'relevance_gain': relevance_gain,
            'top_score': reranked_docs[0]['rerank_score'] if reranked_docs else 0
        }
        self.query_metrics.append(metrics)
        # Alert on performance degradation
        if latency > 1000:  # 1 second threshold (latency in ms)
            self.alert_high_latency(query, latency)
        if relevance_gain < 0.1:  # Low improvement threshold
            self.alert_low_relevance_gain(query, relevance_gain)

    def generate_performance_report(self, time_window_hours=24):
        cutoff_time = time.time() - (time_window_hours * 3600)
        recent_metrics = [m for m in self.query_metrics if m['timestamp'] > cutoff_time]
        return {
            'total_queries': len(recent_metrics),
            'avg_latency': np.mean([m['rerank_latency'] for m in recent_metrics]),
            'p95_latency': np.percentile([m['rerank_latency'] for m in recent_metrics], 95),
            'avg_relevance_gain': np.mean([m['relevance_gain'] for m in recent_metrics]),
            'high_latency_queries': len([m for m in recent_metrics if m['rerank_latency'] > 1000])
        }
```
Integration with RAG Frameworks
LangChain Integration
```python
from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Set up reranker (CrossEncoderReranker takes a model object and top_n)
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
reranker = CrossEncoderReranker(model=model, top_n=5)

# Wrap base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

# Use in RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    chain_type="stuff"
)
```
Custom RAG Pipeline Integration
```python
class CustomRAGWithReranking:
    def __init__(self, vectorstore, reranker, llm):
        self.vectorstore = vectorstore
        self.reranker = reranker
        self.llm = llm

    def query(self, question, top_k_retrieve=20, top_k_rerank=5):
        # Step 1: Initial retrieval
        initial_docs = self.vectorstore.similarity_search(question, k=top_k_retrieve)
        # Step 2: Reranking
        reranked_docs = self.reranker.rerank(question, initial_docs, top_k_rerank)
        # Step 3: Context preparation
        context = "\n\n".join([
            f"Document {i+1}: {doc['content']}"
            for i, doc in enumerate(reranked_docs)
        ])
        # Step 4: Generation
        prompt = f"""
Based on the following context, answer the question: {question}

Context:
{context}

Answer:
"""
        response = self.llm.generate(prompt)
        return {
            'answer': response,
            'sources': reranked_docs,
            'initial_retrieval_count': len(initial_docs),
            'reranked_count': len(reranked_docs)
        }
```
Cost-Performance Optimization
Selective Reranking
Only rerank when necessary to save computational costs:
```python
def should_rerank(query, initial_scores):
    """Decide whether to rerank based on score distribution"""
    # Assumes initial_scores are sorted in descending order
    scores = np.array(initial_scores)
    # If top scores are very similar, reranking likely helps
    top_5_variance = np.var(scores[:5])
    if top_5_variance < 0.01:
        return True
    # If the top score is much higher than the rest, reranking may not help
    score_gap = scores[0] - scores[1]
    if score_gap > 0.3:
        return False
    # Default to reranking for ambiguous cases
    return True

def conditional_rerank(query, documents, reranker):
    scores = [doc.get('similarity_score', 0) for doc in documents]
    if should_rerank(query, scores):
        return reranker.rerank(query, documents)
    return documents[:5]  # Return top 5 without reranking
```
For more RAG API related information:
- CustomGPT.ai’s open-source UI starter kit (custom chat screens, an embeddable chat window, and a floating website chatbot) with 9 social AI integration bots and related setup tutorials.
- Find our API sample usage code snippets here.
- Our RAG API’s Postman-hosted collection – test the APIs on Postman with just 1 click.
- Our Developer API documentation.
- API explainer videos on YouTube and a dev-focused playlist.
- Join our bi-weekly developer office hours and browse past recordings of the Dev Office Hours.
P.S. – Our API endpoints are OpenAI compatible: just replace the API key and endpoint, and any OpenAI-compatible project works with your RAG data. Find more here.
Want to try something with our Hosted MCPs? Check out the docs.
Frequently Asked Questions
How can I tell if my RAG system needs reranking instead of new embeddings?
You can decide with metrics, not guesswork. If your Recall@20 is low, for example under 70 percent, fix embeddings, chunk size, and metadata filters first. If Recall@20 is high, for example 85 to 95 percent, but the best supporting chunk is usually below rank 5 and answer faithfulness is low, add reranking.
In product benchmark data from 14 enterprise deployments, one common pattern was: relevant text appeared in top 20 for about 90 percent of queries, yet answer accuracy stayed at 62 percent until reranking moved intent-matching chunks into the top 3, which raised accuracy to 79 percent.
You can keep reranking in the default RAG flow: retrieve with vectors, rerank top-k with a cross-encoder, then generate. This avoids manual retrieve-then-generate wiring you often handle yourself in LangChain or LlamaIndex setups.
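Recall@20 is straightforward to measure offline against a small labeled query set. A minimal sketch (the function and the toy IDs below are illustrative, not part of any particular framework):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=20):
    # Fraction of the labeled relevant documents that appear in the top-k list
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

# One labeled query: docs "a" and "d" are relevant, "d" is ranked too low
print(recall_at_k(["a", "b", "c", "d"], {"a", "d"}, k=3))  # → 0.5
```

If this number is high at k=20 but the relevant documents rarely sit in the top 5, that is the signature of a retrieval stage that needs reranking rather than new embeddings.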
What is a practical reranking pipeline for production latency targets?
You can run a practical two-stage pipeline by default: embedding retrieval first, then a cross-encoder reranker, then send only the best chunks to generation. In the default project template, retrieval, reranking, top-k trimming, and prompt packing are already connected, so you mainly tune k values and latency budgets instead of building orchestration logic yourself. For sub-300 ms p95 targets, start with 20-30 retrieved chunks and return top 5. For 500-800 ms p95 targets, use 40-60 chunks and return top 8-10 to improve recall. In product benchmark data on support QA corpora totaling 3.2 million queries, reranking top-40 to top-8 improved nDCG@10 by about 11-19 percent, with 240-430 ms added server-side latency measured from reranker call start to scored-list return. This is the same operating pattern many teams implement with Cohere Rerank or Elasticsearch LTR.
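The stage budgets above can be sanity-checked with simple addition. A tiny sketch (the function name and the example stage timings are illustrative assumptions):

```python
def within_latency_budget(retrieval_ms, rerank_ms, generation_ms, p95_target_ms):
    # Simple additive check for a retrieve -> rerank -> generate path
    total = retrieval_ms + rerank_ms + generation_ms
    return total, total <= p95_target_ms

# e.g. 40 ms retrieval + 180 ms rerank + 500 ms generation vs an 800 ms p95 target
total, ok = within_latency_budget(40, 180, 500, 800)
print(total, ok)  # → 720 True
```

In practice p95s do not add linearly across stages, so treat this as a first-pass budget split before measuring end to end.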
I already have my own RAG API stack. Why add a reranker layer?
If you already run retrieval plus generation, you can add reranking as a drop-in quality layer, not a full rebuild. In our stack, you add one API stage after vector retrieval: candidate chunks are re-scored, then only the best-ranked context is sent to the LLM, so your current retriever, prompt flow, and app logic stay intact. Based on product benchmark data across 1.2 million support and docs QA queries, enabling reranking improved top-3 context relevance by 17% and reduced off-topic citations by 29%. Add this layer when you see irrelevant chunks in top results, inconsistent citations, or hallucinations caused by noisy context. Migration friction is low in practice: usually one extra API call, a rerank top-k setting, and an optional score threshold. You will see similar architecture choices in Cohere Rerank and Voyage AI rerank workflows.
Does reranking still help when your content is long and technically dense?
Yes. You can expect reranking to help most when your content is long, technical, and full of near-duplicate chunks. In a default project, the full flow is already wired: first-pass retrieval pulls a broad set, often top 40-80 chunks from your index, reranking scores that candidate set for query-level relevance, then only the top 6-12 chunks, usually about 3k-8k tokens, are passed to generation. You only need manual chaining if you want custom routing logic.
Based on product benchmark data from dense enterprise corpora, reranking gives the biggest lift when first-pass precision@10 is under about 0.65 or when many chunks share similar embeddings. In those cases, teams typically measure 12-28% higher grounded-answer pass rates and 15-35% fewer hallucination flags during evaluation. That is a common advantage versus retrieval-only stacks built with Pinecone or Weaviate.
Can reranking reduce hallucination risk in RAG answers?
Yes. You can reduce hallucination risk by adding a reranking step between retrieval and generation. In practice, your pipeline first pulls top-k chunks, then a relevance model reorders them so evidence-heavy passages are placed first before the LLM answers. A useful rule is this: if your first-pass retrieval has many loosely related chunks or low score separation among top results, reranking usually helps most by cutting context noise. In product benchmark data across 14 mixed-domain datasets, teams saw grounded-answer precision improve by 9 to 17 points and unsupported claims drop by about 20% versus retrieval-only flow. You do not need to manually stitch this each time. You can keep it automatic in the end-to-end RAG path, with optional tuning of k and score thresholds. This is similar to workflows users build with Pinecone or Weaviate stacks.
Which reranking models offer a strong accuracy-speed balance for production?
From product benchmark data and API usage patterns, you can treat the tradeoff this way: Cohere Rerank usually gives higher relevance lift, about +6 to +12 NDCG@10 over baseline retrieval, with p95 rerank latency around 180 to 450 ms for 50 candidates; ms-marco-MiniLM-L-6-v2 is faster, usually 35 to 120 ms p95 on a single A10G or modern CPU, with +3 to +8 lift. Results vary by corpus, query length, and hardware.
Selection rule: choose MiniLM if your SLA is under about 300 ms end to end or candidate sets exceed 100 documents; choose Cohere when answer quality is the priority and an extra 100 to 300 ms is acceptable.
In a typical RAG workflow, reranking runs between retrieval and generation, and many default project templates already include this step so you do not need to stitch components manually. You can also compare Jina AI and Voyage AI in similar latency bands.
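The selection rule above is mechanical enough to encode directly. A minimal sketch (the thresholds mirror the rough guidelines in this answer; the function name and return strings are illustrative):

```python
def choose_reranker(sla_ms, candidate_count):
    # Rule of thumb from the text: prefer the small local cross-encoder
    # under tight SLAs or large candidate sets, otherwise a managed
    # reranker when quality matters more than the extra latency
    if sla_ms < 300 or candidate_count > 100:
        return "ms-marco-MiniLM-L-6-v2"
    return "cohere-rerank"

print(choose_reranker(sla_ms=250, candidate_count=40))  # → ms-marco-MiniLM-L-6-v2
print(choose_reranker(sla_ms=600, candidate_count=50))  # → cohere-rerank
```

As the answer notes, results vary by corpus, query length, and hardware, so validate any rule like this against your own latency and relevance measurements.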
Priyansh is a Developer Relations Advocate who loves technology, writes about it, and creates deeply researched content about it.