RAG vs CAG: Can Cache-Augmented Generation Really Replace Retrieval?

A recent VentureBeat article highlights a new Cache-Augmented Generation (CAG) method that promises no retrieval overhead and even better performance than Retrieval-Augmented Generation (RAG). 

Sounds too good to be true? 

We decided to find out by running our own tests on KV-Cache (a popular CAG implementation) versus RAG (represented by CustomGPT.ai, a popular RAG-as-a-Service platform).

Below are our insights on what happens when you apply these methods to real workloads.

1. Setting the Stage: RAG vs. KV-Cache (CAG)

RAG

What It Is
A Retrieval-Augmented Generation approach that uses a retriever to find relevant documents, then passes them to a large language model for final answers.
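
To make that concrete, here is a minimal retrieve-then-generate sketch. It is purely illustrative: the embedding model, prompt wording, and OpenRouter client below are our stand-ins, not CustomGPT.ai's actual pipeline.

```python
# Minimal retrieve-then-generate sketch (illustrative only; not CustomGPT.ai's actual pipeline).
# Assumes `pip install sentence-transformers openai` and an OpenRouter-compatible API key.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

docs = [
    "Rosie Mac served as Emilia Clarke's body double on Game of Thrones.",
    "HotpotQA is a question-answering dataset that requires multi-hop reasoning.",
    # ... the rest of your knowledge base ...
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=min(top_k, len(docs)))[0]
    return [docs[hit["corpus_id"]] for hit in hits]

def rag_answer(question: str) -> str:
    """Prompt the LLM with only the retrieved documents, not the whole corpus."""
    context = "\n\n".join(retrieve(question))
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder key
    resp = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

In our benchmarks, CustomGPT.ai plays this retrieval role with top_k=5.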

Where It Shines

  • Handles larger or frequently updated datasets without loading everything at once.
  • Avoids massive prompts, which can lead to truncation or context overload.

Key Limitations

  • Adds a retrieval step, which introduces extra latency on every query.
  • Often relies on external APIs or indexing overhead.

KV-Cache (CAG)

What It Is
A method that aims for near-zero retrieval time by loading all documents directly into the model’s context window. In principle, it cuts out the retriever entirely.

Note: In our benchmarks, we used a “No Cache” version of KV-Cache because the model was too large to run locally. Instead, we mimicked the same behavior via an API (OpenRouter) by feeding all documents each time. We’re not comparing retrieval speed here, since KV-Cache would obviously win if run locally on a suitable setup.
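
Concretely, our “No Cache” setup boiled down to something like the sketch below. Everything except the model slug and the stuff-everything-into-context behavior is a placeholder.

```python
# "No Cache" CAG sketch: every request carries the entire knowledge base.
# The client setup and prompt wording are placeholders; only the model slug and the
# stuff-everything-into-context behavior mirror our benchmark setup.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder key

def cag_answer(question: str, docs: list[str]) -> str:
    """No retriever and no local cache: all documents ride along on every call."""
    full_context = "\n\n".join(docs)  # with ~500 HotpotQA documents this can blow past the context window
    resp = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n\n{full_context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

Because nothing is cached on the provider's side, the full context is re-sent and re-processed on every single question.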

Where It Shines

  • If your entire knowledge base easily fits in the model’s context, you get almost instant answers (no retrieval step).
  • Best for stable datasets that rarely change.

Key Limitations

  • Context Size: If you exceed the model’s capacity, you must truncate or compress, killing accuracy.
  • Local Requirement: Real caching needs control over memory, meaning you must run the model on your own infrastructure.
  • Frequent Updates: Reloading the entire knowledge base into context is impractical for dynamic data.

2. The BIG BUT (and We Cannot Lie)

Long-context LLMs (like Google Gemini or Claude, with context windows of hundreds of thousands of tokens) are emerging, making CAG more appealing for some workloads.

But there’s a big condition:

  • You must run the model locally and have access to its memory to enable caching. Many high-powered LLMs are hosted, impose context-length limits, and give you no way to manipulate the model’s memory through an API.
  • Once your dataset grows past a certain size, it exceeds the context window. When that happens, the method breaks outright or forces you to truncate vital information, tanking accuracy.

This snippet from one error log says it all:

"error":{"message":"This endpoint's maximum context length is 131072 tokens. However, you requested about 719291 tokens…"}

Translation: you’re out of luck unless you compress or chunk your data, which can significantly degrade accuracy.
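
If you are weighing CAG, a rough pre-flight check like the one below can save you from that error. It is only a sketch: tiktoken's cl100k_base encoding is not Llama's tokenizer, so treat the count as a ballpark estimate.

```python
# Rough pre-flight check: will the stuffed prompt fit in the model's context window?
# tiktoken's cl100k_base is not Llama's tokenizer, so treat the count as a ballpark estimate.
import tiktoken

MODEL_CONTEXT_LIMIT = 131_072  # the limit reported by the endpoint in the error above

def estimated_tokens(docs: list[str], question: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "\n\n".join(docs) + "\n\nQuestion: " + question
    return len(enc.encode(prompt))

def fits_in_context(docs: list[str], question: str, reserve_for_answer: int = 1_024) -> bool:
    """Leave headroom for the generated answer; if False, truncate, compress, or retrieve instead."""
    return estimated_tokens(docs, question) + reserve_for_answer <= MODEL_CONTEXT_LIMIT
```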

3. Our Benchmark Setup

We used the HotpotQA dataset (known for multi-hop QA) and ran our tests on the meta-llama/llama-3.1-8b-instruct model. We posed 50 questions each to two knowledge sizes—50 documents and 500 documents—to see how each method performs at different scales.

Because we used an API (OpenRouter) for KV-Cache, there was no actual “cache” or local memory optimization happening; we simply passed all documents in each request (a sketch of the full evaluation loop appears at the end of this section).

  • top_k=5 for CustomGPT.ai, and no top_k for KV-Cache (it loads everything).
  • No retrieval time comparison: Our focus is on semantic accuracy, since KV-Cache would trivially have zero retrieval overhead if it were truly caching locally.
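
The evaluation loop looked roughly like the sketch below. The helper calls are placeholders (ask_customgpt stands in for the CustomGPT.ai API; cag_answer is the no-cache sketch from section 1), not our exact harness.

```python
# Sketch of the evaluation loop. ask_customgpt is a placeholder for the CustomGPT.ai API
# (top_k=5 retrieval); cag_answer is the "No Cache" sketch from section 1. Not our exact harness.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(answer: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of the answer and the reference."""
    a, r = scorer.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(a, r).item()

def ask_customgpt(question: str) -> str:
    """Placeholder for a CustomGPT.ai project query (see their API docs for the real client)."""
    raise NotImplementedError

def run_benchmark(questions: list[str], references: list[str], docs: list[str]) -> dict[str, float]:
    scores = {"rag": [], "cag": []}
    for question, reference in zip(questions, references):
        scores["rag"].append(semantic_similarity(ask_customgpt(question), reference))
        scores["cag"].append(semantic_similarity(cag_answer(question, docs), reference))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}  # average score per method
```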

4. Results

Our benchmark tests on the HotpotQA dataset revealed interesting insights into the performance of RAG (CustomGPT.ai) and KV-Cache (CAG) under different knowledge sizes. 

Below are the key findings:

Figure 1: Average semantic similarity scores for KV-Cache (No Cache) and CustomGPT.ai (RAG) across knowledge sizes (k=50 and k=500). Tests were conducted on the HotpotQA dataset using the meta-llama/llama-3.1-8b-instruct model, with 50 questions per knowledge size. KV-Cache used an API (OpenRouter) without local caching, while CustomGPT.ai employed top_k=5 for retrieval.

Key Takeaways

  • KV-Cache Struggles with Scale: As the dataset grows, KV-Cache faces context size limits, which require prompt truncation or compression.
  • RAG Handles Complexity: CustomGPT.ai’s retrieval mechanism ensures only relevant documents are used, avoiding context overload and maintaining accuracy.

The Bottom Line

While KV-Cache shines with small, stable datasets, RAG proves more robust for larger, dynamic knowledge bases, making it a better fit for real-world, enterprise-level tasks.

5. KV-Cache (CAG): Pros & Cons

CAG can appear unbeatable in early or small-scale tests (e.g., ~50 documents). But scaling up to 500+ documents reveals some crucial issues:

Context Overflow

When you exceed the model’s max context window, you risk prompt truncation or outright token-limit errors. Vital information gets cut, and accuracy suffers.

Local Hardware

To truly leverage KV-Cache, you need direct access to the model’s memory. If you rely on a hosted or API-driven model, there’s no way to manage caching yourself.
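
For contrast, here is roughly what genuine cache reuse looks like when you do control the model, based on the cache-reuse pattern documented for Hugging Face transformers. Cache APIs vary across library versions, and the model and prompts below are placeholders, so treat this as a sketch rather than a drop-in recipe.

```python
# Local KV-cache reuse sketch, following Hugging Face transformers' documented cache-reuse
# pattern. Cache APIs vary by library version; model name and prompts are placeholders.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumes hardware that can host the model locally
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

docs = ["<document 1 text>", "<document 2 text>"]  # your (stable) knowledge base

# 1) Pay the cost of encoding the knowledge base exactly once.
knowledge_prompt = "Answer questions using the documents below.\n\n" + "\n\n".join(docs)
knowledge_inputs = tokenizer(knowledge_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    knowledge_cache = model(**knowledge_inputs, past_key_values=DynamicCache()).past_key_values

# 2) For each question, reuse a copy of the precomputed cache instead of re-reading the documents.
def answer(question: str) -> str:
    prompt = knowledge_prompt + f"\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    cache = copy.deepcopy(knowledge_cache)  # keep the original cache pristine for the next question
    output = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
    return tokenizer.decode(output[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
```

Note that every change to the knowledge base forces you to rebuild knowledge_cache from scratch, which is exactly the problem discussed next.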

Frequent Updates

Every time your data changes, you have to rebuild the entire cache. This overhead can undermine the supposed “instant” advantage that KV-Cache promises.

6. Quizzing Time: Score Wars – Why ‘Rosie Mac’ is the Winner

Not all scores tell the full story. When evaluating model responses, similarity metrics compare generated answers to a reference text. But what happens when one answer is more detailed than the reference? Does it get rewarded—or penalized? Let’s look at a real example from our benchmark.

The Question:

Q: Who was the body double for Emilia Clarke playing Daenerys Targaryen in Game of Thrones?

Two Correct Answers:

Answer A

“Rosie Mac was the body double for Emilia Clarke in her portrayal of Daenerys Targaryen in Game of Thrones.”

Answer B

“Rosie Mac.”

Which one do you think scored higher on our similarity metric? Most people might assume the more detailed answer (A) wins. But here are the actual scores:

  • Answer A: 0.60526
  • Answer B: 0.98361

Yes, the shorter “Rosie Mac.” received the higher score. Why? Because the ground truth reference answer was simply “Rosie Mac”—so the more detailed response introduced extra words that lowered the alignment score.
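
You can reproduce the effect yourself with the same sentence-embedding model we used for scoring; the exact numbers will drift a little across model and library versions.

```python
# Why the terse answer wins: cosine similarity of sentence embeddings rewards answers
# that look like the reference. Exact numbers will drift with model/library versions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Rosie Mac"
answer_a = ("Rosie Mac was the body double for Emilia Clarke in her portrayal "
            "of Daenerys Targaryen in Game of Thrones.")
answer_b = "Rosie Mac."

ref_emb, a_emb, b_emb = model.encode([reference, answer_a, answer_b], convert_to_tensor=True)
print("Answer A vs. reference:", util.cos_sim(a_emb, ref_emb).item())  # lower: extra detail dilutes the match
print("Answer B vs. reference:", util.cos_sim(b_emb, ref_emb).item())  # close to 1.0: nearly identical strings
```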

This doesn’t mean longer answers are worse—often, they provide better context. But it highlights why similarity metrics should be interpreted with caution, especially in nuanced or multi-hop reasoning tasks. Our overall results remain valid, but it’s important to look beyond raw scores to gain a comprehensive, unbiased perspective on how these models truly perform.

7. Final Thoughts: No Free Lunch

Yes, Cache-Augmented Generation can truly offer zero retrieval overhead, provided your entire knowledge base fits comfortably in the context window of an LLM you run and control locally. But for many enterprise or multi-hop tasks, that’s a big “if.”

If your data is large or updates frequently, RAG approaches like CustomGPT.ai may remain the more robust and flexible choice.

8. Frequently Asked Questions

  1. What is Retrieval-Augmented Generation (RAG)?

It’s a technique that fetches external documents at inference time to enrich a model’s responses, allowing you to handle bigger or changing data sets without overloading the model’s context.

  2. How did you measure semantic similarity?

We used an embedding-based similarity score, computed with the all-MiniLM-L6-v2 sentence-embedding model, to compare generated answers with ground-truth references.

  3. What does “No Cache” KV-Cache mean in your diagrams?

It indicates we didn’t run an actual local caching mechanism. Instead, we replicated the effect by passing all documents via an API request each time, so we could compare its semantic accuracy without focusing on speed.

  4. Why was HotpotQA used?

HotpotQA requires retrieving multiple documents to answer a single question, making it ideal for testing retrieval methods like RAG and highlighting KV-Cache’s limitations with large knowledge bases.

  5. When is multi-hop retrieval needed?

When no single document contains the full answer—common in research, legal analysis, and complex reasoning tasks requiring fact linking.
