Retrieval Augmented Generation: How We Designed and Implemented an On-Premise RAG System for RidgeRun
- Adrian Araya
- Mar 31
- 14 min read

With the rapid advancement of artificial intelligence and machine learning, numerous natural language generation tools based on Large Language Models (LLMs) have emerged. Among these, ChatGPT, developed by OpenAI, has quickly become one of the most popular, serving millions of users worldwide. If you are reading this, you have probably already used ChatGPT. But here's the question: have you ever asked a chatbot (an LLM) about a topic, only to get an outdated or incorrect response? Or maybe the chatbot replied with something like, "I don't have the information you're asking for"?
The first scenario is called a "hallucination" in LLMs: a situation where the model confidently provides an answer even when it is incorrect, due to limitations of the autoregressive neural network architecture. Simply put, hallucinations happen when LLMs fill in gaps in their knowledge, often inventing details when factual data is lacking. In the second scenario, the LLM was trained only on publicly available data, and the information you are asking about was private or otherwise unreachable the last time the LLM was trained.
If you've encountered these scenarios, you might have tried supplying the LLM with the necessary details within your question. The language model then reads the supplemental information you gave it and uses it to formulate the answer. This capability is called "in-context learning", and it is something modern LLMs excel at. It is one of many techniques from the emerging discipline of Prompt Engineering. More specifically, this technique involves crafting a question that includes contextual hints to guide the LLM toward an accurate answer. In other words, you're adding context to your query to help the model get it right. For example, a prompt for retrieving a detailed answer from external information might look like this:
"Based on the context I'll provide, answer the question: Why is the demand for renewable energy sources increasing?"
Context:
In 2024, global awareness about climate change has reached unprecedented levels, with governments and industries worldwide pushing for sustainable alternatives to fossil fuels. New technologies have emerged, making renewable energy sources more affordable and efficient. Policy changes, like tax incentives and stricter emissions regulations, are also encouraging businesses to adopt cleaner energy solutions. Additionally, the economic impact of climate-related natural disasters has forced companies to reconsider their energy consumption patterns to prevent further environmental damage.
So, what is a RAG (Retrieval Augmented Generation) system and how does this in-context learning technique relate to it? Well, if you’ve used prompt engineering, you might know it works quite well. In fact, many LLMs now offer features that let users add documents to the model’s context for answering specific questions. However, these techniques have two main issues: (1) Can we trust third-party apps (like ChatGPT) with our private documents, especially if they’re confidential? And (2) Adding context for each question can be time-consuming and inefficient. This is where Retrieval-Augmented Generation (RAG) systems come in, designed to solve precisely these challenges.
What is a RAG and Its Application in Knowledge Bases

A Retrieval-Augmented Generation (RAG) system combines the best of two AI worlds: the precision of information retrieval and the creativity of text generation. Together, they create a seamless pipeline that retrieves relevant knowledge and generates human-like, context-aware responses. Here's how the magic happens:
Information Retrieval: Think of this as a supercharged search engine. The system combs through vast databases or knowledge bases to pinpoint the most relevant chunks of information. It ensures that the generated responses are grounded in factual data, keeping the output accurate and reliable.
Text Generation: This is where creativity meets computation. Using advanced language models (LLM), the system takes the retrieved information and crafts it into coherent, natural-sounding text. It’s like having a skilled writer who knows exactly how to weave data into engaging narratives or helpful answers.
In essence, a RAG system is designed to enhance LLMs by supplying them with relevant information from external sources, allowing them to generate answers based on real-time data rather than relying solely on their fixed knowledge base. Here's how a RAG works:
User Query: The process starts when a user poses a question to the system.
Information Retrieval:
The RAG system searches a pre-defined database or knowledge base to retrieve relevant documents or text passages (called chunks).
These knowledge bases could include articles, reports, or internal documents.
The retrieved documents provide context and act as a reference for generating the response.
Input Prompt Preparation:
The retrieved chunks are incorporated into the input prompt of the large language model (LLM).
This step ensures the LLM generates answers that are specific, accurate, and directly related to the query.
Response Generation:
Guided by the retrieved data, the LLM formulates a response that is not only natural-sounding but also grounded in the provided context.
This minimizes the risk of "hallucinations" (factually incorrect or irrelevant outputs) by basing the answers on current and reliable data.
Powerful Integration:
By combining retrieval and generation, RAG systems deliver precise, contextually relevant answers, making them ideal for scenarios like customer support, research, and enterprise knowledge management.
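To make these steps concrete, here is a minimal, self-contained sketch of the pipeline in Python. Keyword overlap stands in for real embedding similarity, the LLM call is a stub, and all names and sample data are illustrative rather than taken from RidgeRun's implementation:

```python
# Minimal RAG flow sketch: retrieve -> build prompt -> generate.
# Keyword overlap replaces embedding similarity and the LLM call is a
# stub; in the real system these are separate services.

KNOWLEDGE_BASE = [  # illustrative chunks
    {"url": "wiki/page-a", "text": "Library X depends on CUDA, OpenCV and ONNX."},
    {"url": "wiki/page-b", "text": "Library X provides GStreamer elements for inference."},
]

def retrieve(query: str, n_results: int = 2) -> list[dict]:
    """Return the chunks that share the most words with the query."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(c["text"].lower().split())), c) for c in KNOWLEDGE_BASE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:n_results] if score > 0]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Place instructions and retrieved context before the question."""
    context = "\n".join(f"- {c['text']} (source: {c['url']})" for c in chunks)
    return ("Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}")

def generate(prompt: str) -> str:
    """Stub standing in for the call to an LLM."""
    return f"[LLM answer grounded in a prompt of {len(prompt)} characters]"

if __name__ == "__main__":
    question = "What are the dependencies for library X?"
    print(generate(build_prompt(question, retrieve(question))))
```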
To tackle the two issues mentioned at the beginning, our team developed an on-premise (meaning “hosted on our own infrastructure”) RAG system built on a service-oriented architecture. This setup allows the system to use open-source LLMs (such as Gemma, LLaMA, etc.) or even private ones (like those from OpenAI or Anthropic). In its initial version, our RAG system can cache documents from MediaWiki-based pages, but can be easily expanded to other knowledge sources, like scraped web pages and PDFs. Now, with that context, let’s dive into how we built this on-premise RAG system.
On-premise RAG System Architecture
The RAG system uses a service-oriented architecture to enhance scalability and maintainability. This modular setup allows each component to operate independently, making updates and troubleshooting easier. The following image shows the architecture; each service and module will be explained below.

Crawler Service
The Crawler Service extracts content from knowledge bases so it can be used by other system components. In its initial version, it supports only MediaWiki-based sources and provides two key resources:
An index of all available Wiki pages.
The content and metadata of each page.
It includes two main modules:
MediaWiki Scraper: Extracts content, URLs, and modification dates from Wiki pages using a paginator to iterate efficiently.
Crawler: Coordinates the extraction process and supports configurable scrapers, enabling future compatibility with sources beyond MediaWiki.
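As a rough illustration, the pagination performed by the MediaWiki Scraper could look like the following sketch against the standard MediaWiki Action API, using requests as the HTTP client; the actual module's interface, configuration, and error handling are not shown and may differ:

```python
import requests

def iter_wiki_pages(api_url: str, limit: int = 50):
    """Paginate over every page title via MediaWiki's allpages list."""
    params = {"action": "query", "list": "allpages",
              "aplimit": limit, "format": "json"}
    while True:
        data = requests.get(api_url, params=params, timeout=30).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carry the pagination token forward

def fetch_page(api_url: str, title: str) -> dict:
    """Return the wikitext and last edit date of a single page."""
    params = {"action": "query", "prop": "revisions",
              "rvprop": "content|timestamp", "rvslots": "main",
              "titles": title, "format": "json"}
    data = requests.get(api_url, params=params, timeout=30).json()
    page = next(iter(data["query"]["pages"].values()))
    revision = page["revisions"][0]
    return {"title": title,
            "last_edit": revision["timestamp"],
            "text": revision["slots"]["main"]["*"]}
```

A configurable Crawler would then loop over iter_wiki_pages and hand each fetch_page result to the rest of the system.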
This service enables structured access to all knowledge base content, preparing it for retrieval and processing. Next comes the Retrieval Service, which processes and stores this information and retrieves the relevant document chunks needed for accurate, context-driven responses; its modules are explained below.
Retrieval Service
The Retrieval Service processes and stores knowledge base content to enable the system to retrieve the most relevant chunks when answering user queries. It performs two main tasks:
Store content in a structured, vector-based format.
Retrieve relevant chunks based on similarity to the input query.
Its main modules include:
Cache Module: Avoids reprocessing unchanged documents by checking metadata like document ID and last edit time.
Text Splitter: Divides documents into structured, token-aware chunks, preserving code and tables during the process.
Retrievers Module: Converts text into embeddings using an LLM. These embeddings allow similarity-based search between user queries and document chunks.
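A minimal sketch of the check the Cache Module is described as performing, keyed on document ID and last edit time; the JSON file used for the metadata is purely illustrative:

```python
import json
from pathlib import Path

CACHE_FILE = Path("cache_metadata.json")  # illustrative metadata store

def load_cache() -> dict:
    """Map of document ID -> last edit time seen during the previous run."""
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def needs_reprocessing(doc_id: str, last_edit: str, cache: dict) -> bool:
    """A document is reprocessed only if it is new or has been edited."""
    return cache.get(doc_id) != last_edit

def mark_cached(doc_id: str, last_edit: str, cache: dict) -> None:
    """Record the document as processed so the next run can skip it."""
    cache[doc_id] = last_edit
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```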
Three retrievers were implemented to assess performance, with their differences and results to be detailed later in the blog. Overall, the Retrieval Service enables fast, relevant responses by preparing content for efficient similarity-based searches through the Vector Service.
Vector Service
The Vector Service is responsible for managing the storage and retrieval of vectorized knowledge base content, enabling efficient similarity-based searches between query embeddings and stored document embeddings. It is composed of two primary modules: the Vector Store and the Late Interaction Index, each catering to different retrievers within the system.

Vector Store: This module utilizes Weaviate as its underlying storage system and includes a custom database manager. The Vector Store handles the storage and retrieval of embeddings generated by the Simple and RAPTOR retrievers. The main functions are:
Add Chunk: Adds each chunk’s embeddings and metadata (such as document ID, last edit date, URL, and text).
Find Chunks by Vector Distance: Retrieves the top N chunks based on the cosine similarity between query embeddings and stored embeddings, using the parameters query embeddings and n_results.
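Assuming the Weaviate Python client (v3 API) and an illustrative Chunk class, these two functions could be sketched roughly as follows; the actual database manager's schema and naming may differ:

```python
import weaviate  # weaviate-client v3 API assumed

client = weaviate.Client("http://localhost:8080")  # on-premise Weaviate instance

def add_chunk(doc_id: str, url: str, last_edit: str, text: str,
              embedding: list[float]) -> None:
    """Store one chunk's metadata together with its precomputed embedding."""
    client.data_object.create(
        data_object={"doc_id": doc_id, "url": url,
                     "last_edit": last_edit, "text": text},
        class_name="Chunk",
        vector=embedding,
    )

def find_chunks_by_vector(query_embedding: list[float], n_results: int = 5) -> list[dict]:
    """Return the n_results chunks closest to the query embedding."""
    result = (client.query
              .get("Chunk", ["doc_id", "url", "text"])
              .with_near_vector({"vector": query_embedding})
              .with_limit(n_results)
              .do())
    return result["data"]["Get"]["Chunk"]
```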
Late Interaction Index: This module is specifically designed for the ColBERTv2 retriever, supporting fine-grained, token-level searches. Built using the RAGatouille library, this module enables late interaction, which allows token-to-token comparison for more precise retrieval. Key functions include:
Add Chunks: Adds all document chunks to the index simultaneously, generating token-level embeddings across the corpus. This indexing process uses the same chunk structure as the Vector Store, enabling compatibility across retrievers.
Retrieve Relevant Chunks: Processes queries by computing token-level embeddings and retrieving the top N most relevant chunks using ColBERTv2’s late interaction mechanism, matching query tokens to stored document tokens for highly accurate retrieval.
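A minimal sketch of how indexing and searching look through RAGatouille's pretrained ColBERTv2 model; the sample chunks and index name are illustrative only:

```python
from ragatouille import RAGPretrainedModel

# Load the pretrained ColBERTv2 checkpoint used for late interaction.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

chunks = [  # illustrative document chunks
    "Library X depends on CUDA, OpenCV and ONNX.",
    "Library X provides GStreamer elements for inference.",
]

# Add Chunks: index the corpus in one pass, producing token-level embeddings.
RAG.index(collection=chunks, index_name="knowledge_base")

# Retrieve Relevant Chunks: MaxSim matching of query tokens against document tokens.
for hit in RAG.search("What are the dependencies for library X?", k=2):
    print(hit["score"], hit["content"])
```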
The Vector Service plays a critical role for the Retrieval Service by managing how information is stored and accessed within the knowledge base. Additionally, the LLM Service is integral to the system, handling embedding generation and text processing tasks, as detailed in the following section.
LLM Service
The LLM Service is dedicated to generating embeddings and summaries for document chunks during the caching process. It also handles generating answers to questions posed to the Q&A Agent module of the RAG Service, which will be discussed in later sections. This service leverages Ollama to load open-source models and includes a communication interface compatible with the OpenAI API, providing flexibility for users to choose between open-source and OpenAI models. The primary functions of the LLM Service are as follows:
Create Embeddings from Text: This function generates embeddings for a given text using the system's configured embedding model.
Create Summaries: This function creates summaries for given texts using the LLM model configured in the system.
Answer Questions Based on a Given Context: This function generates answers to questions using only the provided context, ensuring that responses are relevant to the query.
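As a rough sketch of the embedding and context-grounded answering functions, assuming a local Ollama instance on its default port and illustrative model names (the OpenAI-compatible path would simply swap in that client); summary generation follows the same chat pattern with a different instruction:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama port, assumed

def create_embedding(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Create embeddings from text with the configured embedding model."""
    response = requests.post(f"{OLLAMA_URL}/api/embeddings",
                             json={"model": model, "prompt": text}, timeout=60)
    return response.json()["embedding"]

def answer_with_context(question: str, context: str,
                        model: str = "llama3.1:8b") -> str:
    """Answer a question using only the provided context."""
    prompt = ("Use only the context below to answer the question.\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    response = requests.post(f"{OLLAMA_URL}/api/chat",
                             json={"model": model, "stream": False,
                                   "messages": [{"role": "user", "content": prompt}]},
                             timeout=120)
    return response.json()["message"]["content"]
```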
With these capabilities, the system can cache knowledge base information and retrieve relevant chunks in response to a query. The next step is integrating these functions to answer natural language questions using the retrieved chunks, a task managed by the RAG Service, which will be detailed in the following section.
RAG Service
The RAG (Retrieval-Augmented Generation) Service is responsible for answering user questions by leveraging information from the knowledge base. It works by:
Requesting the most relevant chunks for a given query from the Retrieval Service.
Constructing an optimized prompt using the question and the retrieved chunks.
Sending the prompt to the LLM Service to generate a response, along with the sources used.
Each of the RAG Service modules is described below:
Prompt Builder: Creates the prompt sent to the LLM. It includes instructions, the relevant chunks, and the user's question at the end to focus the response.

Q&A Agent: Coordinates the entire RAG pipeline. It receives the question from the chatbot, requests relevant chunks from the Retrieval Service, builds the prompt, sends it to the LLM Service, and attaches the source URLs to the final answer before returning it to the chatbot.
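A compact sketch of the prompt layout described above (instructions first, retrieved chunks next, the user's question last), refining the toy prompt from the earlier pipeline sketch, together with the Q&A Agent step that attaches the source URLs; the llm argument is a placeholder for the call into the LLM Service:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Prompt Builder: instructions, then the chunks, then the question last."""
    context = "\n\n".join(c["text"] for c in chunks)
    return ("Answer the question using only the context below. If the context "
            "does not contain the answer, say so.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")

def answer(question: str, chunks: list[dict], llm) -> str:
    """Q&A Agent step that appends the source URLs to the generated answer."""
    reply = llm(build_prompt(question, chunks))
    sources = "\n".join(f"- {c['url']}" for c in chunks)
    return f"{reply}\n\nSources:\n{sources}"
```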
With the RAG Service in place, the system is now fully equipped to retrieve relevant knowledge chunks and deliver context-aware responses to user queries, ensuring transparency and traceability. This functionality is seamlessly integrated into the User Interface, allowing users to interact with the system through an intuitive chat interface that enhances accessibility and usability.
User Interface and Experience
The RAG system features an intuitive online chat interface, developed using OpenWebUI, which enables users to interact seamlessly with the system. This user-friendly interface supports the entry of questions, to which the RAG system responds by retrieving and structuring relevant knowledge base content. Through this interface, users can quickly access structured answers backed by source documentation, ensuring both accessibility and transparency.

Upon entering a query, the system retrieves the most contextually relevant information from the backend and formats it into a clear response. For example, if a user queries the dependencies of a particular library, the interface returns a list of requirements (such as CUDA, OpenCV, ONNX) along with source links to RidgeRun’s official documentation, providing immediate access to verified information.
The User Interface provides a streamlined interaction point where users can effortlessly access precise, context-driven answers backed by verified sources. With the interface facilitating effective knowledge retrieval, the next step is to evaluate the system, analyzing the performance and effectiveness of each component to ensure robust, reliable results.
Contextualized Chunks
One of the challenges in splitting documents into smaller, manageable pieces (chunks) is ensuring that each chunk retains enough context to remain useful. This is where Contextualized Chunking comes into play—a technique designed to provide additional context by adding a summary of the original document at the beginning of each chunk.
Why Contextualized Chunking Matters
Consider a document about a software library called X. The document might have one paragraph describing the library and another listing its dependencies. When splitting the document into chunks, each paragraph could become a separate chunk. This separation creates a problem: the chunk listing the dependencies might not mention the library's name, leaving it isolated and ambiguous.
For example, if you were to ask the system, "What are the dependencies for library X?", the chunk about dependencies might lack the necessary context, causing the system to fail to provide an accurate response.
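Contextualized Chunking addresses this by prefixing each chunk with document-level context. A minimal sketch, assuming the summary has already been generated by the LLM Service:

```python
def contextualize_chunks(doc_title: str, summary: str,
                         chunks: list[str]) -> list[str]:
    """Prepend a short document-level summary to every chunk so that a
    chunk such as a bare dependency list still names the library it
    belongs to."""
    header = f"Document: {doc_title}\nSummary: {summary}\n\n"
    return [header + chunk for chunk in chunks]
```

With this header in place, the dependencies chunk in the example above still names library X even after the document is split.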

Information Retrieval techniques
This section presents the three retrieval strategies implemented in the system to extract relevant information from the knowledge base. Each retriever uses a different approach to represent and compare document content with user queries. By evaluating their performance, the system aims to identify which method provides the most accurate and efficient results in a RAG setting.
Simple Retriever
The Simple Retriever offers a straightforward approach that emphasizes ease of implementation and efficiency. It avoids complex structures or specialized indexing, instead relying on conventional chunking and embedding techniques. It’s ideal for simpler applications or as a baseline for comparison with more advanced methods.

Summarizes documents and replaces components like tables and code blocks with identifiers before chunking.
Splits content into manageable chunks, restores original components, and ensures they meet size limits.
Stores these chunks using the Vector Service.
During retrieval, it generates query embeddings, compares them with stored embeddings, and returns the most relevant chunks.
Optionally supports re-ranking using ColBERT to refine results.
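The component-replacement step could look like the following sketch, which stashes <pre> blocks behind identifiers before splitting and restores them afterwards; the regular expression handles only this one case and is purely illustrative, since the real Text Splitter also preserves tables and other markup:

```python
import re

def protect_components(text: str) -> tuple[str, dict]:
    """Replace <pre> blocks with identifiers so the splitter cannot cut
    through them; tables would be stashed the same way."""
    components: dict[str, str] = {}
    def stash(match: re.Match) -> str:
        key = f"__COMPONENT_{len(components)}__"
        components[key] = match.group(0)
        return key
    protected = re.sub(r"<pre>.*?</pre>", stash, text, flags=re.DOTALL)
    return protected, components

def restore_components(chunk: str, components: dict) -> str:
    """Put the original blocks back into a finished chunk."""
    for key, original in components.items():
        chunk = chunk.replace(key, original)
    return chunk
```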
RAPTOR Retriever
The RAPTOR retriever introduces a hierarchical approach that enhances contextual relevance by generating summaries at different levels of abstraction. It clusters related chunks and summarizes them, forming a tree-like structure. This design is well-suited for handling large documents or collections with diverse topics.

Clusters chunks into groups and generates summaries for each cluster, building a multi-level tree.
Flattens the tree and stores the summarized nodes for retrieval.
Retrieves relevant chunks by comparing query embeddings with stored summaries.
Can also apply re-ranking to refine the output.
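One level of the tree could be built roughly as sketched below; k-means is used only to keep the example short (the RAPTOR paper itself uses soft Gaussian-mixture clustering), and the summarize callable stands in for a call to the LLM Service:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_raptor_level(chunk_embeddings: np.ndarray, chunks: list[str],
                       summarize, n_clusters: int = 4) -> list[dict]:
    """Build one level of the tree: cluster the chunks, then create one
    summary node per cluster. Repeating this on the summaries (and their
    embeddings) yields the next level up."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(chunk_embeddings)
    level = []
    for cluster_id in range(n_clusters):
        members = [chunk for chunk, label in zip(chunks, labels) if label == cluster_id]
        level.append({"children": members, "summary": summarize(members)})
    return level
```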
ColBERT Retriever
The ColBERT (Contextualized Late Interaction over BERT) retriever focuses on fine-grained, token-level matching for high-accuracy retrieval. Unlike the other retrievers, which generate one embedding per chunk, ColBERT creates an embedding for each token, allowing deeper semantic alignment between the query and the content. This makes it highly effective for complex or ambiguous information needs.

Summarizes and chunks documents similarly to the Simple Retriever.
Generates token-level embeddings for each chunk, preserving word-level context.
Builds a special index optimized for late interaction (MaxSim).
For a given query, creates token embeddings and calculates similarity to document tokens using MaxSim.
Ranks chunks based on cumulative token-to-token similarity scores and returns the top results.
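The MaxSim scoring at the heart of late interaction can be written in a few lines of NumPy; this sketch assumes the token embeddings for the query and a candidate chunk have already been computed:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction (MaxSim): for every query token embedding, take its
    best cosine similarity against all document token embeddings, then sum
    those maxima to score the chunk."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best match per query token, summed
```

Chunks are then ranked by this cumulative score and the top results are returned.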
Each retriever is integrated into the system through an Orchestrator, which manages communication with the Crawler Service, coordinates caching processes, and enables dynamic selection of retrieval strategies based on task requirements. With the retrievers in place, the next step was to evaluate their performance across key dimensions. The following section outlines the methodology and results of this evaluation, focusing on both retrieval quality and caching efficiency.
Retrievers Evaluation
The evaluation of the implemented retrievers focused on two main tasks: Information Retrieval and Information Caching. For retrieval performance, metrics like Precision, Recall, F1 Score, and MRR (Mean Reciprocal Rank) were used to measure how effectively each retriever returned relevant chunks for a user query. For a detailed explanation of these metrics, please refer to this blog post. For caching, the system's resource usage (CPU, memory, GPU, and disk consumption) and runtime were analyzed to assess efficiency.
Information Retrieval Performance
For Information Retrieval, each retriever was evaluated using a subset of 10 questions and answers. This limited subset was chosen due to the significant time required to evaluate these metrics, as explained in this blog post. The Simple + Reranker retriever showed the highest recall and MRR, demonstrating its capability to effectively retrieve and rank relevant chunks. RAPTOR + Reranker also performed well, particularly in precision, leveraging its hierarchical tree structure to organize and refine relevant content. Despite its simpler design, the Simple retriever proved efficient, balancing accuracy with minimal processing demands. Meanwhile, ColBERT’s late interaction mechanism provided precise token-level matches but showed limitations in recall due to its flat indexing approach.

Information Caching Performance
The Information Caching Performance analysis revealed distinct resource demands for each retriever. The Simple retriever, requiring 3.68 hours for initial caching, showed the lowest CPU and memory consumption, making it suitable for systems with limited resources. While it takes 3.68 hours initially, this retriever only requires full caching once; new documents can be added individually without reprocessing the entire dataset. In contrast, RAPTOR required more time and memory due to its tree-building process, which improves accuracy but at a higher processing cost. ColBERT had memory and disk usage comparable to RAPTOR but completed caching faster. However, both RAPTOR and ColBERT must re-cache the entire dataset when a document is added or modified, limiting efficiency in dynamic environments.
Retriever | CPU (%) | Memory (MB) | GPU (MB) | Disk Usage (MB) | Processing Time (h) |
Simple | 23.60 | 2292.61 | 7172.00 | 97.17 | 3.68 |
RAPTOR | 30.50 | 5827.45 | 7206.00 | 268.04 | 6.22 |
ColBERTv2 | 26.50 | 5573.66 | 7262.00 | 107.928 | 3.01 |
Resource usage and processing time for the implemented retrievers.
This evaluation highlights that while RAPTOR retrievers offer strong precision and recall with their structured approach, the Simple retriever demonstrates superior efficiency in caching, as it requires a full caching process only once and allows for incremental document additions. With these insights into retriever performance, we now move on to the RAG System Evaluation, where we assess the system’s overall effectiveness in delivering accurate, context-driven responses using the RAGAS metrics.
RAG System Evaluation
The evaluation of the RAG system was performed using the RAGAS metrics: Faithfulness, Answer Relevance, Answer Semantic Similarity, and Answer Correctness. For a detailed understanding of these metrics, please refer to this blog post. These metrics were calculated with two different LLM approaches—first using the open-source LLM Llama 3.1 8B and then with the third-party LLM GPT-4o—to assess the system’s effectiveness in delivering accurate, context-driven responses.
RAGAS metrics results using Llama3.1 8B LLM
Using Llama 3.1 8B, the RAPTOR + Reranker retriever achieved the highest scores in faithfulness and relevance, benefiting from its hierarchical retrieval structure that organizes information for enhanced accuracy. However, the Simple + Reranker retriever outperformed others in MRR, highlighting its efficiency in ranking relevant chunks quickly. Across metrics, the RAPTOR retrievers demonstrated strong performance in capturing relevant content, while the ColBERT retriever showed advantages in token-level matching but lagged in overall correctness and recall.

RAGAS metrics results using GPT-4o LLM
In contrast, the GPT-4o model improved performance across all retrievers. Simple + Reranker and ColBERT showed notable increases in faithfulness and correctness scores due to GPT-4o’s larger parameter size and improved handling of complex contexts. With GPT-4o, RAPTOR maintained high answer relevance and correctness scores, while the Simple retriever gained from efficient caching, making it an ideal choice for lower-resource implementations.

The improvements with GPT-4o, compared to Llama 3.1 8B, can be attributed to its significantly larger parameter count, allowing for better handling of complex retrieval augmented generation tasks. The performance gap between the two models further highlights GPT-4o’s capacity for processing and retrieving more accurate and relevant information.
Conclusions and Future Work
The project successfully developed an on-premise Retrieval-Augmented Generation system tailored to enhance information retrieval and response generation using MediaWiki-based knowledge bases. Key findings demonstrate that a straightforward retrieval approach, such as the Simple retriever, can yield efficient and relevant responses, especially suited for dynamic knowledge bases like wikis. Additionally, while advanced retrievers like RAPTOR and ColBERT brought improvements in contextual relevance and precision, they demanded higher computational resources and complex configurations, underscoring the importance of aligning retriever selection with resource availability and intended application.
Future work will focus on expanding the capabilities of the RAG system, incorporating additional knowledge sources beyond MediaWiki-based content, such as PDFs and web scraping, to diversify the system’s accessible information. Further research could also explore advanced models and retrievers to assess their efficacy in maintaining response accuracy and resource efficiency.
Ready to Take Your Project to the Next Level? Contact Us!
In this project, we used information retrieval techniques and LLMs to build an on-premise RAG system. Do you need help with your project? Let's have a chat: contact us at support@ridgerun.ai