
CustomGPT.ai Blog

RAG Implementation with LLMs from Scratch: A Step-by-Step Guide (Part 2)


Implementing Retrieval-Augmented Generation (RAG) can significantly enhance the capabilities of large language models (LLMs), making their answers more accurate and contextually relevant. This step-by-step guide walks through implementing RAG with an LLM, explains the RAG framework, and shows how the approach applies in platforms like LangChain and CustomGPT.ai. Let’s get started!

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI approach that combines information retrieval techniques with generative models to enhance the accuracy and relevance of AI-generated content.

The RAG framework operates in two main steps, retrieval and generation:

Retrieval: RAG Implementation

In the retrieval step, relevant information is sourced from external knowledge bases using techniques such as keyword-based search, semantic similarity search, or neural network-based retrieval. This process involves scanning through a collection of documents and identifying the most relevant ones.

Generation: RAG Implementation

After retrieval, the relevant information is used to augment the generation process. The generative model incorporates this information to produce more accurate, contextually relevant, and fluent responses. Usually, a transformer-based generative model such as BART, GPT-2, or GPT-3 produces human-like text conditioned on the retrieved documents.

RAG Framework: A Technical Deep Dive

Figure: RAG architecture. The query x is encoded as q(x) by the Query Encoder, passed to the MIPS retriever pη, and then to the Generator pθ, which produces the output y.

As noted above, a RAG model operates in a two-step process:

Retrieval Step

When a user submits a query, the Query Encoder (q) converts it into a numerical vector called an embedding. During the retrieval step, the system scans the corpus and selects the N most relevant documents, typically using a similarity metric such as cosine similarity.

Figure: Jupyter notebook screenshot of the retrieval code, which uses TfidfVectorizer and cosine_similarity.
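The notebook code for this step appears only as an image. What follows is a minimal sketch consistent with the explanation that comes next, not the original notebook. It assumes scikit-learn is installed, and the toy corpus, the query, and the names best_index and most_relevant_document are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query: illustrative assumptions, not data from the original notebook.
corpus = [
    "RAG combines retrieval with text generation.",
    "TF-IDF weighs terms by how rare they are across documents.",
    "Cosine similarity measures the angle between two vectors.",
]
query = "How does cosine similarity compare vectors?"

# Learn the vocabulary and IDF weights from the corpus, then vectorize it.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Project the query into the same TF-IDF space (transform, not fit_transform).
query_vector = vectorizer.transform([query])

# One row per query, one column per corpus document.
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)

# Pick the document with the highest score.
best_index = int(similarity_scores.argmax())
most_relevant_document = corpus[best_index]
print(most_relevant_document)
```

To return the N most relevant documents instead of one, sort the scores in descending order and keep the first N indices.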

Here’s an explanation of the code snippet:

  • Import necessary modules as shown in the first two lines of the above code snippet.
  • The modules will be used to convert user queries into vectors using TfidfVectorizer and to find the most relevant document to the user query among the collection of documents using cosine_similarity.
  • vectorizer = TfidfVectorizer() creates an instance of the TfidfVectorizer class, which will be used to convert text data into TF-IDF features.
  • TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used in natural language processing and information retrieval to reflect the importance of a word in a document relative to a collection of documents, typically a corpus.
  • tfidf_matrix = vectorizer.fit_transform(corpus) converts the corpus to a TF-IDF matrix.
  • query_vector = vectorizer.transform([query]) transforms the query into a TF-IDF vector.
  • similarity_scores = cosine_similarity(query_vector, tfidf_matrix) computes the cosine similarity between the TF-IDF vector of the query and the TF-IDF matrix of the corpus.
  • The resulting similarity_scores array contains the cosine similarity scores between the query and each document in the corpus.

The document with the highest similarity score is the one most relevant to the query, and it is the document the retrieval step returns.

Generation Step

The generator then processes the N retrieved documents along with the original query to generate a relevant response.

Figure: Jupyter notebook screenshot of the generation code, which sets up RagTokenizer, RagRetriever, and RagTokenForGeneration with the facebook/rag-token-base checkpoint.
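The generation code is also only available as an image. The sketch below is a hedged reconstruction that follows the explanation coming next, not the original notebook. It assumes the Hugging Face transformers library together with torch, datasets, and faiss; the function name generate_answer and its parameters are illustrative. Recent library releases may change or restrict these classes, so check them against the version you have installed.

```python
def generate_answer(query: str, model_name: str = "facebook/rag-token-base") -> str:
    """Retrieve passages and generate an answer with a pretrained RAG checkpoint."""
    # Imported lazily because transformers, torch, datasets and faiss are heavy
    # dependencies and the checkpoint download is large.
    from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

    # Tokenizer wrapping the question-encoder and generator tokenizers.
    tokenizer = RagTokenizer.from_pretrained(model_name)

    # index_name="exact" with a dummy dataset keeps the demo index small.
    retriever = RagRetriever.from_pretrained(
        model_name, index_name="exact", use_dummy_dataset=True
    )

    # The generation model calls the retriever internally during generate().
    model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

    # Tokenize the query into input_ids as PyTorch tensors.
    input_ids = tokenizer(query, return_tensors="pt")["input_ids"]

    # Generate token ids, then decode them into readable text.
    generated_ids = model.generate(input_ids=input_ids)
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```

Calling generate_answer("your question") downloads the checkpoint and the dummy index on first use. Because the dummy dataset is only a small demonstration index, answers from it are not expected to be accurate.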

Here’s an explanation of the provided code in points:

  • The code imports necessary modules in the first line from the transformers library to work with RAG models.
  • Initializing Tokenizer, Retriever, and Model: It initializes three components required for RAG: tokenizer, retriever, and model.
  • RagTokenizer.from_pretrained() initializes the tokenizer for RAG models.
  • RagRetriever.from_pretrained() initializes the retriever component for RAG models. It specifies the index_name as “exact” and uses a dummy dataset for demonstration purposes.
  • RagTokenForGeneration.from_pretrained() initializes the RAG model for text generation, specifying the previously initialized retriever.
  • Tokenizing the Query: The query is tokenized using the initialized tokenizer. tokenizer() method takes the query as input and returns the tokenized representation as input_ids.
  • Generating Response: The model generates a response based on the tokenized query and retrieved documents.
  • generate() method of the model generates the response based on the input_ids.
  • The generated response is decoded into human-readable text using the decode() method of the tokenizer, skipping special tokens.

This code sets up a RAG model and generates a response for a given query. With the RAG model defined, you can now set it up with a large language model.

Setting Up RAG with LLM

Before configuring RAG for Large Language Models (LLMs) you will require:

Data Corpus

Gather a dataset from sources such as SQL databases, Elasticsearch indexes, or JSON files. This corpus serves as the knowledge base for retrieving relevant information.

Machine Learning Framework

Choose a machine learning framework like TensorFlow or PyTorch to implement and train the RAG model.

Computational Resources

Ensure access to sufficient computational resources, including CPUs or GPUs, for both training and inference tasks. These resources are necessary to handle the computational demands of RAG implementations.

RAG Approach with LLM: Steps to Implement RAG in LLMs

To implement the RAG technique with LLMs, you need to follow a series of steps. Here’s how you can set up the RAG model with LLM:

Data preparation

Ensure your dataset is in a searchable format. If utilizing Elasticsearch, index your data appropriately.

Select Model

Choose the retriever and generator models. You can opt for pre-trained models or train your own based on your specific requirements.

Train Model

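Train the retriever and generator models separately. The calls below are schematic placeholders for your own training routines; in PyTorch, for example, calling train() on a module only switches it to training mode and does not run a training loop.

  • retriever.train()
  • generator.train()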

Integrate LLM Models

Combine the trained retriever and generator models to create a unified RAG model. The constructor below is schematic; the exact class and arguments depend on the library you use.

  • rag_model = RagModel(retriever, generator)
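To make the integration step concrete, here is a minimal, library-free sketch of a wrapper that chains a retriever and a generator. Every name in it (RagPipeline, keyword_retriever, template_generator) is hypothetical, and the toy retriever and generator stand in for trained models.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: any retriever or generator with these call shapes fits.
Retriever = Callable[[str, int], List[str]]   # (query, top_k) -> passages
Generator = Callable[[str, List[str]], str]   # (query, passages) -> answer


@dataclass
class RagPipeline:
    retriever: Retriever
    generator: Generator
    top_k: int = 2

    def answer(self, query: str) -> str:
        # Step 1: retrieve supporting passages. Step 2: generate from them.
        passages = self.retriever(query, self.top_k)
        return self.generator(query, passages)


def keyword_retriever(query: str, top_k: int) -> List[str]:
    """Toy retriever: rank documents by the number of shared lowercase words."""
    docs = [
        "RAG retrieves documents before generating an answer.",
        "BLEU scores compare generated text with references.",
        "Elasticsearch indexes documents for fast search.",
    ]
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def template_generator(query: str, passages: List[str]) -> str:
    """Stand-in for an LLM call: echoes the query and the retrieved context."""
    return f"Q: {query} | Context: {' '.join(passages)}"


rag_model = RagPipeline(keyword_retriever, template_generator, top_k=1)
result = rag_model.answer("what does rag do before generating")
print(result)
```

Keeping the two components behind plain callables means either one can be swapped for a trained model without changing the pipeline code.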

Test Your Model

Validate the model’s performance using metrics such as BLEU for text generation quality and recall for retrieval accuracy.
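As a rough illustration of the retrieval side of this check, the sketch below computes recall at k in plain Python. The function name recall_at_k and the document IDs are assumptions for the example. BLEU is omitted here because it is usually taken from an NLP library rather than written by hand.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranked retriever output and ground-truth relevant set for one query.
retrieved_ids = ["doc3", "doc7", "doc1", "doc9"]
relevant_ids = {"doc1", "doc3", "doc5"}

recall_3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
print(recall_3)
```

In practice, you would average this value over a set of test queries instead of reading it from a single query.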

By following these straightforward steps, you can develop a robust RAG model ready to enhance your LLMs for improved performance.

Utility Function to evaluate RAG model performance

You can evaluate RAG model performance with a small utility function of your own, such as get_retrieval_score(). Such a function measures the effectiveness of the retriever with metrics such as Precision or NDCG (Normalized Discounted Cumulative Gain).

  • from sklearn.metrics import ndcg_score
  • ndcg = ndcg_score(y_true, y_score)

By employing this function, you can efficiently fine-tune your retriever’s performance, ensuring it accurately retrieves the most relevant documents from the corpus.
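One way such a helper could look is sketched below, assuming scikit-learn is installed. The body of get_retrieval_score() is an assumption, since the text names the function but does not define it, and the relevance labels and scores are made up for illustration. Note that ndcg_score expects 2D inputs with one row per query.

```python
from sklearn.metrics import ndcg_score


def get_retrieval_score(relevance_labels, retriever_scores, k=None):
    """NDCG for a single query (hypothetical helper).

    relevance_labels: graded ground-truth relevance per document.
    retriever_scores: scores the retriever assigned to the same documents.
    """
    # Wrap both lists so they form a single-row 2D input, as ndcg_score requires.
    return float(ndcg_score([relevance_labels], [retriever_scores], k=k))


# Illustrative data: four candidate documents for one query.
y_true = [3, 2, 0, 1]           # human relevance judgments
y_score = [0.9, 0.8, 0.1, 0.4]  # retriever scores, ranked in the same order as y_true

ndcg = get_retrieval_score(y_true, y_score)
print(ndcg)
```

A score of 1.0 means the retriever ranked the documents in the ideal order; lower values indicate that relevant documents were ranked too low.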

Technologies to implement RAG with LLM

Several technologies can support RAG implementation. LangChain provides open-source building blocks for retrieval pipelines, while CustomGPT.ai provides a no-code path for building RAG agents from connected content.

Technologies available to implement RAG:

  • LangChain
  • CustomGPT

Implementing RAG with LangChain


LangChain is an open-source framework for building applications with language models, including retrieval-augmented generation workflows. To implement RAG in LangChain, start from the current LangChain RAG tutorial and adapt the retriever, vector store, and generation chain to your dataset.

Implementing RAG in LangChain usually involves these steps:

Installation: Install LangChain and the provider packages your project needs, then configure your development environment.

  • pip install langchain

Index your content: Load documents, split them into chunks, create embeddings, and store them in a vector database or retriever.

  • Use LangChain document loaders, text splitters, embeddings, and retrievers for your source content.
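To show what the splitting step does conceptually, here is a plain-Python sketch of word-based chunking with overlap. It is not LangChain's API; LangChain's text splitters offer richer options. The function name split_into_chunks and its defaults are assumptions.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of up to chunk_size words; neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny synthetic document: word0 ... word11.
document = " ".join(f"word{i}" for i in range(12))
chunks = split_into_chunks(document, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of some duplicated storage.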

Build the chain: Connect the retriever to the model so the answer is generated from retrieved passages.

  • Create a retrieval chain that passes the user question and retrieved context to the model.
  • Test grounding quality with sample questions before deploying.
  • Keep source citations visible so reviewers can inspect answer support.
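The chain-building step amounts to assembling a prompt from the question and the retrieved passages. The sketch below shows one possible way to do that in plain Python with numbered citations; the function name, the prompt wording, and the passage fields (source, text) are assumptions rather than LangChain's API.

```python
from typing import Dict, List


def build_grounded_prompt(question: str, passages: List[Dict[str, str]]) -> str:
    """Format retrieved passages as numbered, source-tagged context for the model."""
    context_lines = [
        f"[{i}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below and cite sources as [n]. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical retrieved passages.
passages = [
    {"source": "policy.md", "text": "Refunds are issued within 5 days."},
    {"source": "faq.md", "text": "Annual plans can be cancelled anytime."},
]
prompt = build_grounded_prompt("How long do refunds take?", passages)
print(prompt)
```

Numbering the passages lets reviewers map each [n] in an answer back to its source file.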

Query execution: Run queries through your retrieval chain so the model answers from retrieved context instead of model memory alone.

  • response = retrieval_chain.invoke({"input": "your query"})

By following these steps, you can build a LangChain-based RAG workflow while keeping retrieval, grounding, and citation behavior testable.

CustomGPT.ai using RAG: A no-code platform

CustomGPT.ai builds RAG agents from your content. The platform retrieves relevant information from connected knowledge sources and uses that context to generate grounded responses.


Its biggest strength is its no-code convenience, which makes it a versatile choice for a wide audience.

  • CustomGPT.ai is designed to be a no-code platform, making it accessible to both non-technical and technical users. Teams can build content-grounded agents without writing retrieval pipeline code.
  • CustomGPT.ai is useful when teams want to connect documents, files, and webpages, then let users ask questions with answers grounded in those sources.

Read the full blog on How you can train the chatbot with external data sources with no coding.

Conclusion

RAG is most useful when teams need LLM answers grounded in trusted source material. You can build a RAG stack yourself with frameworks such as LangChain, or use CustomGPT.ai to build a RAG agent from connected content without wiring the retrieval pipeline by hand.

FAQs: Setting Up RAG with LLM

Do you need both retrieval and generation steps in a RAG implementation?

Yes. You need both steps for true RAG: retrieve evidence, then generate from that evidence. In CustomGPT.ai, retrieval settings, context assembly, and fallback behavior are managed for you. If you wire APIs yourself in OpenAI Assistants or Microsoft Copilot Studio, you typically configure top-k retrieval, truncation rules, and fallback behavior on your own. At ingestion time, your content is chunked and embedded; at query time, the system retrieves relevant context before generation. If relevance confidence is low, a grounded system should ask for clarification or return a source-grounded not-found response instead of guessing.
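If you wire the pipeline yourself, top-k selection and low-confidence fallback could be sketched as below. The threshold value, the function names, and the stand-in echo_generate are all assumptions; a real system would call an LLM and tune the threshold on its own data.

```python
from typing import Callable, List, Tuple

NOT_FOUND_MESSAGE = "I could not find this in the connected sources."


def select_context(
    scored_passages: List[Tuple[str, float]], top_k: int = 3, min_score: float = 0.3
) -> List[str]:
    """Keep the top_k passages by score, dropping any below min_score."""
    ranked = sorted(scored_passages, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked[:top_k] if score >= min_score]


def answer_with_fallback(
    question: str,
    scored_passages: List[Tuple[str, float]],
    generate: Callable[[str, List[str]], str],
    top_k: int = 3,
    min_score: float = 0.3,
) -> str:
    """Generate only when enough evidence was retrieved; otherwise return not-found."""
    context = select_context(scored_passages, top_k, min_score)
    if not context:
        return NOT_FOUND_MESSAGE
    return generate(question, context)


def echo_generate(question: str, context: List[str]) -> str:
    """Stand-in for an LLM call."""
    return f"{len(context)} passages used"


strong = [("Refund policy text", 0.82), ("Billing FAQ text", 0.41), ("Unrelated text", 0.12)]
weak = [("Unrelated text", 0.10)]
print(answer_with_fallback("How do refunds work?", strong, echo_generate))
print(answer_with_fallback("How do refunds work?", weak, echo_generate))
```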

What retrieval methods are used in RAG systems?

RAG systems commonly use keyword search, semantic retrieval with embeddings, metadata filters, and reranking. For broad factual questions, hybrid retrieval can improve recall; for team knowledge-base queries, metadata filters such as team, document type, and date can improve precision. CustomGPT.ai handles ingestion, chunking, indexing, retrieval, citation mapping, and grounded response generation as one pipeline.
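As a simplified illustration of combining metadata filters with hybrid scoring, consider the sketch below. It is not how any particular product implements retrieval. The semantic_score field stands in for an embedding similarity already computed for this query, and the weighting parameter alpha and all document data are assumptions.

```python
from typing import Dict, List, Tuple


def hybrid_search(
    query: str,
    documents: List[Dict],
    filters: Dict[str, str],
    alpha: float = 0.5,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Filter by metadata, then rank by a weighted mix of keyword and semantic scores."""
    query_terms = set(query.lower().split())
    results = []
    for doc in documents:
        # Metadata filter: skip documents that do not match every requested field.
        if any(doc["metadata"].get(key) != value for key, value in filters.items()):
            continue
        doc_terms = set(doc["text"].lower().split())
        keyword_score = len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0
        score = alpha * keyword_score + (1 - alpha) * doc["semantic_score"]
        results.append((doc["id"], round(score, 3)))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top_k]


documents = [
    {"id": "a", "text": "refund policy for annual plans", "semantic_score": 0.8,
     "metadata": {"team": "support"}},
    {"id": "b", "text": "refund checklist for engineers", "semantic_score": 0.9,
     "metadata": {"team": "engineering"}},
    {"id": "c", "text": "holiday schedule", "semantic_score": 0.2,
     "metadata": {"team": "support"}},
]
hits = hybrid_search("refund policy", documents, filters={"team": "support"})
print(hits)
```

Document b has the highest semantic score but is excluded by the team filter, which is how metadata filters improve precision for team-specific queries.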

Why does RAG use an external knowledge base?

You can use RAG with an external knowledge base so answers come from your documents, not just model memory. In CustomGPT.ai, your query is embedded, the system retrieves relevant chunks from connected files or URLs, and that evidence is injected into generation. The flow is retrieval, then generation with source-grounded context, so replies are tied to retrieved text and can include citations instead of unsupported guesses. If you are comparing options like OpenAI Assistants or Azure AI Studio, architecture transparency is a key early evaluation check.

What is the practical benefit of RAG for LLM outputs?

The practical benefit of RAG is answer reliability, not just better wording. You can ground each response in retrieved passages, which helps reduce unsupported claims and raises relevance on company-specific questions. You should track this with before-and-after metrics such as grounding accuracy, citation coverage, and unsupported-statement rate. RAG matters most for private or fast-changing docs such as policies, pricing, and runbooks. For general world facts, gains are usually smaller. Teams often compare this approach with Azure OpenAI or Vertex AI stacks.
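One way to track these metrics is sketched below, under a deliberately simple assumption: citation coverage is the share of sentences that carry a [n] citation marker, and the unsupported-statement rate is its complement. These definitions and the function names are assumptions; production evaluations usually also check whether the cited passage actually supports the sentence.

```python
import re
from typing import List


def split_sentences(answer: str) -> List[str]:
    """Naive sentence split on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def citation_coverage(answer: str) -> float:
    """Share of sentences that contain at least one [n] citation marker."""
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    return cited / len(sentences)


def unsupported_statement_rate(answer: str) -> float:
    """Share of sentences with no citation marker (0.0 for an empty answer)."""
    if not split_sentences(answer):
        return 0.0
    return 1.0 - citation_coverage(answer)


answer = (
    "Refunds are issued within 5 days [1]. "
    "Annual plans are prorated [2]. "
    "Support is always friendly."
)
coverage = citation_coverage(answer)
print(coverage, unsupported_statement_rate(answer))
```

Comparing these numbers before and after adding retrieval gives a first signal of whether grounding has improved.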

Can you implement RAG in frameworks like LangChain?

Yes. You can implement RAG with LangChain. In plain terms, your documents are ingested and chunked, embeddings are indexed, top passages are retrieved at query time, then the model response is grounded in those passages with citations. If you prefer a managed path, you can call a CustomGPT.ai agent through an OpenAI-compatible Chat Completions endpoint and keep retrieval orchestration in CustomGPT.ai. Before you build, confirm that your trial includes the full retrieval-to-answer flow and compare that against LlamaIndex or Haystack.

What does the retrieval step do before generation in RAG?

Before generation, CustomGPT.ai can run retrieval across connected files and URLs, combine semantic search with keyword matching, rerank results, and inject source-backed context into the model prompt. Unlike a hand-built RAG stack in tools like LangChain or LlamaIndex, you do not need to wire separate retrieve and generate calls, tune prompt assembly code, or manage retrieval routing yourself. If the available evidence is too sparse, a grounded system should narrow scope, ask for clarification, or avoid unsupported answers.

Related Resources

These guides expand on the core ideas behind implementing RAG with CustomGPT.ai.

  • Mastering Custom RAG — Learn how to tailor retrieval-augmented generation pipelines for more accurate, domain-specific responses.
  • RAG Vs. CRAG — Explore how CRAG builds on traditional RAG approaches and where each method fits best.
  • How CustomGPT.ai Works — Get a practical overview of the platform’s workflow for building and deploying AI agents.
  • Understanding RAG — Review the fundamentals of retrieval-augmented generation and why it improves generative AI outputs.
  • Core RAG Components — Break down the main parts of a RAG system, from retrieval and indexing to generation.
  • API-backed RAG chatbot — Learn how an API-backed RAG chatbot connects retrieval, external knowledge, and AI responses to power Custom GPT-style experiences.
