RAG vs CAG: Can Cache-Augmented Generation Really Replace Retrieval?

A recent VentureBeat article highlights a new Cache-Augmented Generation (CAG) method that promises no retrieval overhead and even better performance than Retrieval-Augmented Generation (RAG). 

Sounds too good to be true? 

We decided to find out by running our own tests on KV-Cache (a popular CAG implementation) versus RAG (represented by CustomGPT.ai, a popular RAG-as-a-Service platform).

Below are our insights on what happens when you apply these methods to real workloads.

1. Setting the Stage: RAG vs. KV-Cache (CAG)

RAG

What It Is
A Retrieval-Augmented Generation approach that uses a retriever to find relevant documents, then passes them to a large language model for final answers.
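
To make that concrete, here is a minimal retrieve-then-generate sketch. It is purely illustrative: the embedding model, prompt wording, and OpenRouter client below are our stand-ins, not CustomGPT.ai's actual pipeline.

```python
# Minimal retrieve-then-generate sketch (illustrative only; not CustomGPT.ai's actual pipeline).
# Assumes `pip install sentence-transformers openai` and an OpenRouter-compatible API key.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

docs = [
    "Rosie Mac served as Emilia Clarke's body double on Game of Thrones.",
    "HotpotQA is a question-answering dataset that requires multi-hop reasoning.",
    # ... the rest of your knowledge base ...
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=min(top_k, len(docs)))[0]
    return [docs[hit["corpus_id"]] for hit in hits]

def rag_answer(question: str) -> str:
    """Prompt the LLM with only the retrieved documents, not the whole corpus."""
    context = "\n\n".join(retrieve(question))
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder key
    resp = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

In our benchmarks, CustomGPT.ai plays this retrieval role with top_k=5.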

Where It Shines

  • Handles larger or frequently updated datasets without loading everything at once.
  • Avoids massive prompts, which can lead to truncation or context overload.

Key Limitations

  • Adds a retrieval step, which introduces extra latency on every query.
  • Often relies on external APIs or indexing overhead.

KV-Cache (CAG)

What It Is
A method that aims for near-zero retrieval time by loading all documents directly into the model’s context window. In principle, it cuts out the retriever entirely.

Note: In our benchmarks, we used a “No Cache” version of KV-Cache because the model was too large to run locally. Instead, we mimicked the same behavior via an API (OpenRouter) by feeding all documents each time. We’re not comparing retrieval speed here, since KV-Cache would obviously win if run locally on a suitable setup.
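
Concretely, our “No Cache” setup boiled down to something like the sketch below. Everything except the model slug and the stuff-everything-into-context behavior is a placeholder.

```python
# "No Cache" CAG sketch: every request carries the entire knowledge base.
# The client setup and prompt wording are placeholders; only the model slug and the
# stuff-everything-into-context behavior mirror our benchmark setup.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder key

def cag_answer(question: str, docs: list[str]) -> str:
    """No retriever and no local cache: all documents ride along on every call."""
    full_context = "\n\n".join(docs)  # with ~500 HotpotQA documents this can blow past the context window
    resp = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n\n{full_context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

Because nothing is cached on the provider's side, the full context is re-sent and re-processed on every single question.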

Where It Shines

  • If your entire knowledge base easily fits in the model’s context, you get almost instant answers (no retrieval step).
  • Best for stable datasets that rarely change.

Key Limitations

  • Context Size: If you exceed the model’s capacity, you must truncate or compress, killing accuracy.
  • Local Requirement: Real caching needs control over memory, meaning you must run the model on your own infrastructure.
  • Frequent Updates: Reloading the entire knowledge base into context is impractical for dynamic data.

2. The BIG BUT (and We Cannot Lie)

Long-context LLMs (like Google Gemini or Claude, with context windows of hundreds of thousands of tokens) are emerging, making CAG more appealing for some workloads.

But there’s a big condition:

  • You must run the model locally and have access to its memory to enable caching. Many high-powered LLMs are hosted, impose context-length limits, and give you no way to manipulate the model’s memory through an API.
  • Once your dataset grows past a certain size, it exceeds the context window. When that happens, the method breaks outright or forces you to truncate vital information, tanking accuracy.

This snippet from one error log says it all:

"error":{"message":"This endpoint's maximum context length is 131072 tokens. However, you requested about 719291 tokens…"}

Translation: you’re out of luck unless you compress or chunk your data, which can significantly degrade accuracy.
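
If you are weighing CAG, a rough pre-flight check like the one below can save you from that error. It is only a sketch: tiktoken's cl100k_base encoding is not Llama's tokenizer, so treat the count as a ballpark estimate.

```python
# Rough pre-flight check: will the stuffed prompt fit in the model's context window?
# tiktoken's cl100k_base is not Llama's tokenizer, so treat the count as a ballpark estimate.
import tiktoken

MODEL_CONTEXT_LIMIT = 131_072  # the limit reported by the endpoint in the error above

def estimated_tokens(docs: list[str], question: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "\n\n".join(docs) + "\n\nQuestion: " + question
    return len(enc.encode(prompt))

def fits_in_context(docs: list[str], question: str, reserve_for_answer: int = 1_024) -> bool:
    """Leave headroom for the generated answer; if False, truncate, compress, or retrieve instead."""
    return estimated_tokens(docs, question) + reserve_for_answer <= MODEL_CONTEXT_LIMIT
```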

3. Our Benchmark Setup

We used the HotpotQA dataset (known for multi-hop QA) and ran our tests on the meta-llama/llama-3.1-8b-instruct model. We posed 50 questions each to two knowledge sizes—50 documents and 500 documents—to see how each method performs at different scales.

Because we used an API (OpenRouter) for KV-Cache, there was no actual “cache” or local memory optimization happening; we simply passed all documents in each request (a sketch of the full evaluation loop appears at the end of this section).

  • top_k=5 for CustomGPT.ai, and no top_k for KV-Cache (it loads everything).
  • No retrieval time comparison: Our focus is on semantic accuracy, since KV-Cache would trivially have zero retrieval overhead if it were truly caching locally.
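
The evaluation loop looked roughly like the sketch below. The helper calls are placeholders (ask_customgpt stands in for the CustomGPT.ai API; cag_answer is the no-cache sketch from section 1), not our exact harness.

```python
# Sketch of the evaluation loop. ask_customgpt is a placeholder for the CustomGPT.ai API
# (top_k=5 retrieval); cag_answer is the "No Cache" sketch from section 1. Not our exact harness.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(answer: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of the answer and the reference."""
    a, r = scorer.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(a, r).item()

def ask_customgpt(question: str) -> str:
    """Placeholder for a CustomGPT.ai project query (see their API docs for the real client)."""
    raise NotImplementedError

def run_benchmark(questions: list[str], references: list[str], docs: list[str]) -> dict[str, float]:
    scores = {"rag": [], "cag": []}
    for question, reference in zip(questions, references):
        scores["rag"].append(semantic_similarity(ask_customgpt(question), reference))
        scores["cag"].append(semantic_similarity(cag_answer(question, docs), reference))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}  # average score per method
```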

4. Results

Our benchmark tests on the HotpotQA dataset revealed interesting insights into the performance of RAG (CustomGPT.ai) and KV-Cache (CAG) under different knowledge sizes. 

Below are the key findings:

Figure 1: Average semantic similarity scores for KV-Cache (No Cache) and CustomGPT.ai (RAG) across knowledge sizes (k=50 and k=500). Tests were conducted on the HotpotQA dataset using the meta-llama/llama-3.1-8b-instruct model, with 50 questions per knowledge size. KV-Cache used an API (OpenRouter) without local caching, while CustomGPT.ai employed top_k=5 for retrieval.

Key Takeaways

  • KV-Cache Struggles with Scale: As the dataset grows, KV-Cache faces context size limits, which require prompt truncation or compression.
  • RAG Handles Complexity: CustomGPT.ai’s retrieval mechanism ensures only relevant documents are used, avoiding context overload and maintaining accuracy.

The Bottom Line

While KV-Cache shines with small, stable datasets, RAG proves more robust for larger, dynamic knowledge bases, making it a better fit for real-world, enterprise-level tasks.

5. KV-Cache (CAG): Pros & Cons

CAG can appear unbeatable in early or small-scale tests (e.g., ~50 documents). But scaling up to 500+ documents reveals some crucial issues:

Context Overflow

When you exceed the model’s max context window, you risk prompt truncation or outright token-limit errors. Vital information gets cut, and accuracy suffers.

Local Hardware

To truly leverage KV-Cache, you need direct access to the model’s memory. If you rely on a hosted or API-driven model, there’s no way to manage caching yourself.
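
For contrast, here is roughly what genuine cache reuse looks like when you do control the model, based on the cache-reuse pattern documented for Hugging Face transformers. Cache APIs vary across library versions, and the model and prompts below are placeholders, so treat this as a sketch rather than a drop-in recipe.

```python
# Local KV-cache reuse sketch, following Hugging Face transformers' documented cache-reuse
# pattern. Cache APIs vary by library version; model name and prompts are placeholders.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumes hardware that can host the model locally
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

docs = ["<document 1 text>", "<document 2 text>"]  # your (stable) knowledge base

# 1) Pay the cost of encoding the knowledge base exactly once.
knowledge_prompt = "Answer questions using the documents below.\n\n" + "\n\n".join(docs)
knowledge_inputs = tokenizer(knowledge_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    knowledge_cache = model(**knowledge_inputs, past_key_values=DynamicCache()).past_key_values

# 2) For each question, reuse a copy of the precomputed cache instead of re-reading the documents.
def answer(question: str) -> str:
    prompt = knowledge_prompt + f"\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    cache = copy.deepcopy(knowledge_cache)  # keep the original cache pristine for the next question
    output = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
    return tokenizer.decode(output[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
```

Note that every change to the knowledge base forces you to rebuild knowledge_cache from scratch, which is exactly the problem discussed next.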

Frequent Updates

Every time your data changes, you have to rebuild the entire cache. This overhead can undermine the supposed “instant” advantage that KV-Cache promises.

6. Quizzing Time: Score Wars – Why ‘Rosie Mac’ is the Winner

Not all scores tell the full story. When evaluating model responses, similarity metrics compare generated answers to a reference text. But what happens when one answer is more detailed than the reference? Does it get rewarded—or penalized? Let’s look at a real example from our benchmark.

The Question:

Q: Who was the body double for Emilia Clarke playing Daenerys Targaryen in Game of Thrones?

Two Correct Answers:

Answer A

“Rosie Mac was the body double for Emilia Clarke in her portrayal of Daenerys Targaryen in Game of Thrones.”

Answer B

“Rosie Mac.”

Which one do you think scored higher on our similarity metric? Most people might assume the more detailed answer (A) wins. But here are the actual scores:

  • Answer A: 0.60526
  • Answer B: 0.98361

Yes, the shorter “Rosie Mac.” received the higher score. Why? Because the ground truth reference answer was simply “Rosie Mac”—so the more detailed response introduced extra words that lowered the alignment score.
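
You can reproduce the effect yourself with the same sentence-embedding model we used for scoring; the exact numbers will drift a little across model and library versions.

```python
# Why the terse answer wins: cosine similarity of sentence embeddings rewards answers
# that look like the reference. Exact numbers will drift with model/library versions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Rosie Mac"
answer_a = ("Rosie Mac was the body double for Emilia Clarke in her portrayal "
            "of Daenerys Targaryen in Game of Thrones.")
answer_b = "Rosie Mac."

ref_emb, a_emb, b_emb = model.encode([reference, answer_a, answer_b], convert_to_tensor=True)
print("Answer A vs. reference:", util.cos_sim(a_emb, ref_emb).item())  # lower: extra detail dilutes the match
print("Answer B vs. reference:", util.cos_sim(b_emb, ref_emb).item())  # close to 1.0: nearly identical strings
```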

This doesn’t mean longer answers are worse—often, they provide better context. But it highlights why similarity metrics should be interpreted with caution, especially in nuanced or multi-hop reasoning tasks. Our overall results remain valid, but it’s important to look beyond raw scores to gain a comprehensive, unbiased perspective on how these models truly perform.

7. Final Thoughts: No Free Lunch

Yes, Cache-Augmented Generation can truly offer zero retrieval overhead, provided your entire knowledge base fits comfortably in the context window of an LLM you run and control locally. But for many enterprise or multi-hop tasks, that’s a big “if.”

If your data is large or updates frequently, RAG approaches like CustomGPT.ai may remain the more robust and flexible choice.

8. Frequently Asked Questions

  1. What is Retrieval-Augmented Generation (RAG)?

It’s a technique that fetches external documents at inference time to enrich a model’s responses, allowing you to handle bigger or changing data sets without overloading the model’s context.

  2. How did you measure semantic similarity?

We used an embedding-based similarity score, computed with the all-MiniLM-L6-v2 sentence-embedding model, to compare generated answers with ground-truth references.

  3. What does “No Cache” KV-Cache mean in your diagrams?

It indicates we didn’t run an actual local caching mechanism. Instead, we replicated the effect by passing all documents via an API request each time, so we could compare its semantic accuracy without focusing on speed.

  4. Why was HotpotQA used?

HotpotQA requires retrieving multiple documents to answer a single question, making it ideal for testing retrieval methods like RAG and highlighting KV-Cache’s limitations with large knowledge bases.

  5. When is multi-hop retrieval needed?

When no single document contains the full answer—common in research, legal analysis, and complex reasoning tasks requiring fact linking.
