RAG Architecture

Last Updated : 09 Jun, 2025

Retrieval-Augmented Generation (RAG) is an architecture that enhances the capabilities of Large Language Models (LLMs) by integrating them with external knowledge sources. This integration gives LLMs access to up-to-date, domain-specific information, improving the accuracy and relevance of generated responses. RAG is particularly effective at addressing challenges such as hallucinations and outdated knowledge.

RAG Architecture

The Retrieval-Augmented Generation (RAG) architecture is a two-part process involving a retriever component and a generator component.

1. Retrieval Component: The retrieval component identifies data relevant to the query to support accurate response generation. Dense Passage Retrieval (DPR) is a common model used for this step. Let's see how DPR works:

  • Query Encoding: When a user submits a query such as a question or prompt, it is converted into a dense vector by an encoder. This vector represents the query's semantic meaning in a high-dimensional space.
  • Passage Encoding: Each document in the knowledge base is also encoded into a vector. This encoding is done offline and the vectors are stored in a way that allows fast retrieval when a query arrives.
  • Retrieval: Upon receiving the query, the system compares the query vector with the vectors of all documents in the knowledge base and retrieves the most relevant passages.
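The retrieval steps above can be sketched with a toy bag-of-words encoder standing in for a trained DPR encoder; the `tokenize`/`embed` functions and the example passages are illustrative, not part of DPR itself:

```python
import numpy as np

# Toy knowledge base; a real system would hold many documents.
passages = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
    "RAG combines retrieval with generation.",
]

def tokenize(text):
    return text.lower().replace("?", "").replace(".", "").split()

# Bag-of-words "encoder" standing in for a trained neural encoder.
vocab = sorted({w for p in passages for w in tokenize(p)})

def embed(text):
    vec = np.array([tokenize(text).count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Passage encoding happens offline; the vectors are stored for lookup.
passage_vecs = np.stack([embed(p) for p in passages])

# At query time, encode the query and rank passages by cosine similarity.
query_vec = embed("What does RAG combine with?")
scores = passage_vecs @ query_vec  # dot product of normalised vectors
top_idx = int(np.argmax(scores))
print(passages[top_idx])
```

A production system would replace `embed` with a neural encoder and the linear scan with a vector index, but the shape of the computation is the same.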

2. Generative Component: Once the retrieval component identifies the relevant passages, they are passed to the generative component, which is based on a Transformer architecture such as BART or GPT. The generated response combines the retrieved information with newly generated output from the model.

The generative component uses one of two main fusion strategies: Fusion-in-Decoder (FiD) and Fusion-in-Encoder (FiE). Both combine the retrieved information with the user's input to generate the final response; they differ in where that fusion happens.

  • FiD (Fusion-in-Decoder): The retrieval and generation processes are kept separate. The generative model only merges the retrieved information during the decoding phase. This allows the model to focus on the most relevant parts of each document when generating the final response, offering greater flexibility in the integration of retrieved data.
  • FiE (Fusion-in-Encoder): FiE combines the query and the retrieved passages at the beginning of the process. Both are processed simultaneously by the encoder. While this method can be more efficient, it offers less flexibility in integrating the retrieved information compared to FiD.
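A schematic sketch of the difference: FiD builds one encoder input per retrieved passage and fuses them in the decoder, while FiE concatenates everything into a single encoder input. The prompt templates below are illustrative, not the exact formats used by any particular model:

```python
query = "Who wrote Hamlet?"
retrieved = [
    "Hamlet is a tragedy by William Shakespeare.",
    "Shakespeare wrote 39 plays.",
]

# FiD: each passage is paired with the query and encoded independently;
# the decoder later attends over all encoded passages jointly.
fid_encoder_inputs = [f"question: {query} context: {p}" for p in retrieved]

# FiE: query and all passages are concatenated into one sequence
# and encoded together in a single pass.
fie_encoder_input = f"question: {query} context: " + " ".join(retrieved)

print(len(fid_encoder_inputs))  # one encoder input per passage
print(fie_encoder_input)
```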

Let's see the key differences between FiD and FiE:

| Aspect | Fusion-in-Decoder (FiD) | Fusion-in-Encoder (FiE) |
| --- | --- | --- |
| Fusion Point | Fusion occurs in the decoding phase. | Fusion happens in the encoding phase, before decoding. |
| Process Separation | Retrieval and generation are kept separate. | Retrieval and generation are processed together. |
| Efficiency | Slower due to separate retrieval and generation steps. | Faster due to simultaneous processing in the encoder phase. |
| Complexity | More complex | Simpler |
| Performance | Higher-quality responses | Quicker response generation |

Workflow of a Retrieval-Augmented Generation (RAG) system

The RAG architecture’s workflow can be broken down into the following steps:

[Figure: Retrieval-Augmented Generation architecture]
  1. Query Processing: The input query, which could be a natural language question or prompt, is first pre-processed (e.g. cleaned and tokenised).
  2. Embedding Model: The pre-processed query is passed through an embedding model, which transforms it into a high-dimensional vector that captures its semantic meaning.
  3. Vector Database Retrieval: The query vector is used to search a vector database, comparing it against stored document vectors to find the closest matches.
  4. Retrieved Contexts: The documents closest to the query are retrieved and forwarded to the generative model to help it craft a response.
  5. LLM Response Generation: The LLM combines the original query with the additional retrieved context using its internal mechanisms to generate a response. It uses its trained knowledge alongside the fresh data to create a contextually accurate and coherent answer.
  6. Response: A response that blends the model's inherent knowledge with the up-to-date information retrieved during the process is then presented. This makes the response more accurate and detailed.
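The six steps above can be wired together in a minimal sketch. The bag-of-words `embed` function stands in for a real embedding model, and `build_prompt` merely assembles the prompt a real LLM would receive in step 5:

```python
import numpy as np

# Toy document store; in practice this is a vector database.
docs = [
    "FAISS is a library for efficient similarity search.",
    "RAG grounds LLM answers in retrieved documents.",
    "The capital of France is Paris.",
]

def tokenize(text):
    return text.lower().replace("?", "").replace(".", "").split()

vocab = sorted({w for d in docs for w in tokenize(d)})

def embed(text):
    # Steps 1-2: pre-process the query and map it to a vector.
    vec = np.array([tokenize(text).count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query, k=1):
    # Steps 3-4: search the vector store, return the closest documents.
    scores = doc_vecs @ embed(query)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query, contexts):
    # Steps 5-6: a real system would send this prompt to an LLM.
    return "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}\nAnswer:"

query = "What is the capital of France?"
contexts = retrieve(query)
print(build_prompt(query, contexts))
```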

Techniques for Optimisation of RAG

The efficiency of Retrieval-Augmented Generation systems can be enhanced with various optimisation techniques. These improve performance, reduce latency and ensure the relevance of the system's responses.

  1. Query Expansion: It enhances retrieval accuracy by adding related terms or synonyms to the original query. Using semantic embeddings, queries can be enriched with contextually similar terms which helps in improving the chances of finding the best match in large datasets.
  2. Early Stopping in Generative Models: Early stopping in generative models prevents unnecessary token generation by stopping the response once it is sufficiently complete. This helps to avoid redundancy while keeping responses relevant and concise.
  3. Pipeline Parallelism: It optimises the RAG process by running retrieval and generation tasks simultaneously on different hardware, reducing wait times. While one part of the system retrieves documents, another can generate the response. Parallel execution cuts down processing time and maximises hardware utilisation.
  4. Multi-Stage Retrieval: It involves an initial broad search followed by a refined search for the most relevant documents. This step-by-step approach ensures the retrieval process is efficient and precise. This helps in reducing overhead by narrowing the focus progressively.
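Multi-stage retrieval, for example, can be sketched as a cheap lexical filter followed by a vector rerank. Both stages here are toy implementations of the idea, not production retrievers:

```python
import numpy as np

docs = [
    "RAG pipelines retrieve documents before generation.",
    "Pipeline parallelism overlaps retrieval and generation.",
    "Bananas are rich in potassium.",
    "Multi-stage retrieval narrows candidates progressively.",
]

def tokenize(text):
    return text.lower().replace("?", "").replace(".", "").split()

def stage1(query, corpus):
    # Stage 1: broad, cheap filter — keep docs sharing any word with the query.
    q = set(tokenize(query))
    return [d for d in corpus if q & set(tokenize(d))]

def stage2(query, candidates):
    # Stage 2: refined rerank of the survivors by (toy) vector similarity.
    vocab = sorted({w for t in candidates + [query] for w in tokenize(t)})
    def emb(t):
        v = np.array([tokenize(t).count(w) for w in vocab], dtype=float)
        n = np.linalg.norm(v)
        return v / n if n else v
    qv = emb(query)
    return sorted(candidates, key=lambda d: -float(emb(d) @ qv))

query = "How does multi-stage retrieval work?"
candidates = stage1(query, docs)
ranked = stage2(query, candidates)
print(ranked[0])
```

The expensive similarity computation in stage 2 only runs on the small candidate set, which is where the overhead reduction comes from.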

Performance

Let's see how the RAG architecture performs on various metrics:

  • Accuracy: RAG models tend to have higher accuracy on tasks that require dynamic knowledge, since they use current and domain-specific data.
  • Latency: They may experience higher latency due to the retrieval step, but efficient indexing systems such as FAISS can minimise this issue.
  • Cost Efficiency: RAG models are more cost-effective where the knowledge base needs frequent updates, as they do not require the model to be retrained.
  • Memory Usage: They require more memory to store external knowledge but, unlike plain transformer models that store everything in their parameters, they can offload retrieval tasks.
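To illustrate the latency point: a brute-force nearest-neighbour scan like the one below costs O(n·d) per query, and this is the step that libraries such as FAISS accelerate with approximate indexes (e.g. IVF or HNSW). The random vectors are placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 10_000
index_vectors = rng.normal(size=(n, d)).astype(np.float32)  # stored corpus
query = rng.normal(size=d).astype(np.float32)

# Exact (brute-force) nearest-neighbour search: one distance per document.
dists = np.linalg.norm(index_vectors - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest, float(dists[nearest]))
```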

Applications of RAG

Let's see the applications of the RAG architecture in various fields:

  • Customer Support: By integrating RAG with chatbots, businesses can provide more accurate and context-aware answers by pulling in real-time information from company knowledge bases, FAQs and support documents.
  • Healthcare: Medical professionals benefit from RAG by retrieving the latest research, treatment guidelines and clinical data to assist decision-making, ensuring the information is current and evidence-based.
  • Legal: Legal professionals use RAG to retrieve up-to-date case laws, regulations and legal precedents to generate responses that are grounded in the most relevant and recent legal information.
  • Finance: Financial analysts use RAG systems to access the latest market data and financial reports which helps them make informed decisions quickly and accurately.

Advantages of RAG Architecture

  • Up-to-Date Responses: RAG enables LLMs to generate answers based on the most current external data rather than being limited to pre-trained knowledge that may be outdated.
  • Reduced Hallucinations: By grounding the LLM's response in reliable external knowledge, RAG reduces the risk of hallucinations or generation of incorrect data, ensuring the model's output is more factually accurate.
  • Domain-Specific Responses: RAG allows LLMs to provide answers that are more relevant to specific organizational needs or industries without retraining.
  • Efficiency: RAG is cost-effective compared to traditional fine-tuning as it allows models to be updated with new data without needing retraining.

Challenges

  • Data Quality: The accuracy of RAG’s output depends heavily on the quality of the retrieved documents. If the retrieval mechanism pulls in irrelevant or incorrect data, the response may be affected.
  • Latency: The retrieval process can introduce delays, especially when dealing with large datasets. Optimizing retrieval efficiency is crucial to maintaining a responsive system.
  • Complexity: RAG systems involve multiple moving parts: embedding models, retrieval mechanisms and generative models. Ensuring these components work seamlessly together can be technically complex.
  • Scalability: As the volume of data grows, scaling the retrieval and generative components to handle larger datasets becomes increasingly challenging.

Author: mohammap46h
Article Tags: NLP, AI-ML-DS