Beyond Basic RAG – Using advanced RAG techniques to build production-grade systems
This blog provides architectural components for implementing Advanced Retrieval-Augmented Generation patterns and covers techniques for enhancing context understanding, improving retrieval accuracy, and seamlessly integrating diverse data sources.
Richa Gupta
Amazon Employee
Published Apr 23, 2025
This article was co-written by Porus Arora and Rajnish Shaw
The Retrieval-Augmented Generation (RAG) paradigm has emerged as a powerful technique enhancing Large Language Models (LLMs) by incorporating external knowledge without model retraining. While basic RAG implementations work for simple scenarios, production-grade systems face several challenges:
- Question ambiguity: This occurs when user queries are vague, incomplete, or open to multiple interpretations, which can lead to irrelevant retrieval and inaccurate responses, reducing user satisfaction.
- Unoptimized data: Extracting valuable information from a diverse set of documents can be a complex and challenging task. Determining the most effective approach for breaking down these documents is crucial to ensure efficient data ingestion.
- Low retrieval accuracy: This occurs when the system fails to fetch the most relevant information from the knowledge base in response to a query.
- Context window limitations: This issue arises from the finite capacity of language models to process information within a single context window. Attempting to retrieve too much information can exceed the context window, leading to truncated or incomplete inputs to the language model.
Users frequently pose queries that are open to multiple interpretations or lack specific context. For instance, in a financial services chatbot, a simple question like "What are the rates?" could refer to interest rates, exchange rates, or inflation rates.
On the other hand, advanced RAG patterns introduce improvements such as enhanced context understanding, sophisticated handling of nuanced queries, and integration of diverse data sources. This blog post delves into the advanced techniques and architectures required to build robust, production-grade RAG systems capable of handling the complexities and challenges of real-world applications.
At its core, the RAG architecture is built upon two primary pipelines that work in concert to deliver intelligent, context-aware responses. The data ingestion pipeline serves as the foundation, transforming raw information into a queryable knowledge base. This process involves several intricate steps, including document processing and cleaning, strategic content chunking, vector embedding generation, and efficient storage in vector databases. Each of these steps plays a crucial role in preparing the information for rapid and accurate retrieval.
Complementing the ingestion pipeline is the retrieval pipeline, which handles real-time information access and response generation. This component is responsible for processing and optimizing user queries, implementing semantic search capabilities, assembling and filtering relevant context, and ultimately leveraging the power of LLMs to generate coherent and accurate responses. The seamless integration of these two pipelines forms the backbone of an effective RAG system.

Advanced RAG architectures significantly enhance the basic retrieval-augmented generation framework by implementing sophisticated techniques across multiple system components. These improvements span data ingestion, data retrieval, post-retrieval processing, and query translation strategies. Let's break them down:
In the world of RAG, how we process and chunk data can make or break system performance. Let's dive into the fascinating evolution of chunking strategies that are revolutionizing how we handle information.

Fixed-size chunking has evolved from a simple text-splitting method into a sophisticated data ingestion strategy for modern RAG systems. While it maintains its fundamental principle of creating uniform-sized chunks, today's implementations incorporate intelligent features like optimal size determination, calculated overlaps, and sentence boundary preservation. The system carefully handles special characters and formatting elements while ensuring content coherence. Despite the emergence of more complex chunking methods, fixed-size chunking remains a reliable choice for many organizations, particularly when dealing with consistent content types. The key to success lies in finding the optimal chunk size that balances context preservation with retrieval efficiency.
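To make this concrete, here is a minimal sketch of a fixed-size chunker with overlap and basic sentence-boundary preservation; the chunk_size and overlap values are illustrative, not recommendations:

```python
# Minimal sketch of fixed-size chunking with overlap and sentence-boundary preservation.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Prefer ending a chunk at a nearby sentence boundary
        boundary = text.rfind(". ", start, end)
        if boundary > start and end < len(text):
            end = boundary + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # overlap preserves context; max() guarantees progress
    return chunks
```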
Recursive text splitting represents a more intelligent approach to document chunking in RAG systems by respecting the natural hierarchy and structure of content. Unlike fixed-size methods, it employs a multi-level analysis that uses various separators (like paragraphs, sentences, and subsections) to determine optimal splitting points. This hierarchical approach dynamically adjusts chunk sizes based on the document's inherent structure, preserving crucial relationships between different content segments. By maintaining these natural boundaries and connections, recursive splitting creates more coherent chunks that better retain context, ultimately leading to more accurate information retrieval and generation in RAG applications.
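LangChain's RecursiveCharacterTextSplitter is one widely used implementation of this idea. A minimal sketch, assuming the langchain-text-splitters package is installed and document_text holds your raw document string:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,                            # target chunk size in characters (illustrative)
    chunk_overlap=100,                          # overlap between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],   # tried in order: paragraphs, lines, sentences, words
)
chunks = splitter.split_text(document_text)     # document_text: the raw document string
```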
Semantic chunking elevates RAG systems by prioritizing meaning over mechanical text division. This approach leverages embedding-based similarity analysis to identify natural semantic boundaries within content, ensuring that related concepts stay together. Unlike traditional methods, it creates chunks based on contextual coherence rather than arbitrary size limits. By analyzing the semantic relationships between different text segments, this method produces self-contained, meaningful chunks that preserve the original content's intent and context. The result is more intelligent information retrieval, as chunks maintain their semantic integrity, leading to more accurate and contextually relevant responses to user queries.
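A minimal sketch of the underlying idea, assuming an embed() callable that maps a list of sentences to one embedding vector each (from any sentence-embedding model) and that the document has already been split into sentences:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, starting a new chunk when the cosine similarity
    between neighbouring sentence embeddings drops below the threshold.
    `embed` is assumed to map a list of sentences to a list of vectors."""
    vectors = [np.asarray(v) for v in embed(sentences)]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity < threshold:          # semantic boundary detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```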
Agentic chunking represents the cutting edge of RAG data ingestion, employing LLM-based agents as intelligent document analysts. These AI agents actively analyze both structure and meaning, making sophisticated decisions about content boundaries and relationships. Unlike simpler chunking methods, agentic systems can identify complex logical connections, maintain cross-references, and create optimally-sized information units tailored for specific retrieval needs. The agents work like skilled editors, understanding document context and preserving critical relationships while ensuring each chunk is self-contained and retrieval-optimized. This approach significantly enhances the RAG system's ability to deliver precise, contextually accurate responses.
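Implementations vary widely; one illustrative sketch asks the LLM to propose sentence groupings, using a placeholder call_llm function and assuming the model returns valid JSON:

```python
import json

# Illustrative sketch of agentic chunking: an LLM proposes topic boundaries.
# `call_llm` is a placeholder for whatever LLM client you use.
PROMPT = """You are a document analyst. Group the numbered sentences below into
coherent, self-contained sections. Return JSON: a list of lists of sentence numbers.

Sentences:
{numbered}"""

def agentic_chunks(sentences: list[str], call_llm) -> list[str]:
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    groups = json.loads(call_llm(PROMPT.format(numbered=numbered)))
    return [" ".join(sentences[i] for i in group) for group in groups]
```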
Hierarchical chunking stands out as a powerful method for structured documents in RAG systems. This approach preserves the inherent organizational structure of content, maintaining parent-child relationships between different sections. By creating a multi-level representation of the document, it enables sophisticated retrieval capabilities that can navigate through various levels of information granularity. Chunks are nested within larger sections, allowing the system to understand context at multiple scales. This method excels in handling complex documents like technical manuals, legal texts, or academic papers, where the relationship between sections is crucial. Hierarchical chunking ensures that retrieved information retains its original context.
Custom transformation strategies represent tailored solutions for organizations dealing with unique document formats and specialized information structures in RAG systems. These bespoke approaches go beyond standard chunking methods by aligning specifically with business requirements and industry-specific formats. They integrate seamlessly with existing systems while supporting custom metadata requirements and specialized content processing rules. Whether handling proprietary document formats, industry-specific terminology, or unique organizational hierarchies, custom transformations ensure that chunking methods adapt to specific needs rather than forcing content into predetermined formats. This flexibility proves crucial for developing successful production-grade RAG systems that meet precise business objectives.
After chunk segmentation, implementing an effective indexing strategy is crucial for optimizing retrieval performance in production RAG systems. Let's explore two powerful indexing approaches:

Vector store indexing manages embedded data representations, enabling efficient similarity search using specialized algorithms like ANN, graph-based, and inverted file indices.
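As an illustration of vector store indexing, here is a minimal sketch using FAISS; it assumes faiss-cpu and numpy are installed, and that embeddings and query_vector are float32 NumPy arrays produced by your embedding model:

```python
import faiss
import numpy as np

d = embeddings.shape[1]               # embeddings: (n, d) float32 array of chunk vectors
index = faiss.IndexFlatIP(d)          # exact inner-product search; swap in IndexHNSWFlat for ANN
faiss.normalize_L2(embeddings)        # normalize so inner product equals cosine similarity
index.add(embeddings)

faiss.normalize_L2(query_vector)      # query_vector: (1, d) float32 array
scores, ids = index.search(query_vector, k=5)   # top-5 most similar chunks
```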
Hierarchical indexing uses a multi-tier architecture with top-level and detail-level indexes. The top-level index enables broad-scope filtering, while the detail-level index supports precise content matching.
Consider these factors when choosing your indexing approach:
- Query complexity requirements
- Latency vs accuracy trade-offs
- Maintenance overhead
Best practices suggest starting with vector store indexing for smaller datasets, while hierarchical indexing becomes invaluable as your data scale increases and query patterns become more complex.
Traditional RAG implementations use naive retrieval methods, relying on nearest neighbor search and cosine similarity. Various advanced retrieval strategies address some of the challenges of basic RAG retrieval. Below are four of the most widely used approaches in industry.

Parent Document Retrieval is a hierarchical document management system that combines two storage layers: parent documents stored in memory and their smaller chunks stored in a vector database. When a query is processed, the system first performs similarity search on the child chunks in the vector store, then identifies and retrieves the associated parent documents, providing both precise matching and comprehensive context. This dual-layer approach offers several advantages, including better context preservation, improved retrieval accuracy, and reduced fragmentation.
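LangChain ships a ParentDocumentRetriever that implements this pattern. A minimal sketch, assuming docs is a list of Document objects, embeddings is an embeddings model instance, and the vector store and splitter settings are illustrative:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # small chunks for precise matching
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # large chunks returned as context

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="chunks", embedding_function=embeddings),
    docstore=InMemoryStore(),          # parent documents live here
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
results = retriever.invoke("What are the current interest rates?")
```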
Self-Query Retrieval is an intelligent search system that combines semantic similarity search with metadata-based filtering and dynamic query interpretation. This approach automatically processes complex queries by parsing user input, identifying filtering conditions, and extracting semantic components, then executes a hybrid search that combines vector similarity matching with metadata filters. The system leverages both structured and unstructured data elements, making it particularly effective for complex multi-criteria searches in various applications
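LangChain's SelfQueryRetriever is one implementation of this pattern. A minimal sketch, assuming an existing llm and vectorstore, the lark package installed, and illustrative metadata fields:

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(name="product", description="The financial product the document covers", type="string"),
    AttributeInfo(name="year", description="The year the document was published", type="integer"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Descriptions of financial products and their rates",
    metadata_field_info=metadata_field_info,
)
# The retriever parses the query into a semantic part plus metadata filters
results = retriever.invoke("Mortgage rates from documents published after 2023")
```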
Time-Weighted Retrieval is a sophisticated retrieval system that prioritizes and balances both recent and historical content through temporal scoring mechanisms. The system implements decay functions and access history tracking to assign higher weights to recently accessed or created documents while maintaining the significance of historically important information. By combining traditional similarity scores with temporal weights, it creates a dynamic scoring system that adapts to usage patterns and time-based relevance. This approach is particularly valuable for applications like news archives, social media content etc.
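LangChain's TimeWeightedVectorStoreRetriever implements this behaviour on top of a vector store. A minimal sketch, assuming an embeddings model instance and illustrative decay settings:

```python
import faiss
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

embedding_size = 1536  # depends on your embedding model
vectorstore = FAISS(embeddings, faiss.IndexFlatL2(embedding_size), InMemoryDocstore({}), {})

retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore,
    decay_rate=0.01,   # higher values make older or less-accessed documents fade faster
    k=4,
)
retriever.add_documents(docs)   # documents get a last_accessed_at timestamp
results = retriever.invoke("latest policy announcement")
```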
Contextual Compression is a sophisticated approach that optimizes document retrieval and processing through intelligent compression and reranking mechanisms. The system first retrieves relevant documents, then employs specialized compression algorithms that reduce content while maintaining semantic meaning and contextual relationships. Through dynamic reranking workflows, it scores and evaluates document segments, filtering out non-essential content while preserving critical information. This approach is particularly effective for processing large documents and optimizing knowledge bases, as it significantly reduces processing overhead while maintaining accuracy.
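LangChain's ContextualCompressionRetriever combined with an LLM-based extractor is one way to implement this. A minimal sketch, assuming llm and base_retriever already exist:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)   # extracts only query-relevant passages
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
results = compression_retriever.invoke("What penalties apply to early withdrawal?")
```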
Consider these factors when choosing your retrieval approach:
- Type of Knowledge base (Time based relevance, Parent-child relationship etc)
- Type of queries (simple, complex)
- Vector store configuration
When choosing a retrieval methodology for your RAG system, consider your specific application needs and data characteristics. Parent Document Retrieval is usually ideal for large, interconnected documents, while Self-Query Retrieval excels in complex, structured data scenarios. Time-Weighted Retrieval is best for time-sensitive information, and Contextual Compression shines when extracting precise information from lengthy documents. To select the right method, analyze your data structure and size, understand user requirements, evaluate available computational resources, and consider hybrid approaches. Ultimately, real-world testing and user feedback will help refine your choice and optimize performance.
Post Retrieval Strategies in RAG systems play a crucial role in refining and optimizing the retrieved content before it's passed to the language model for response generation. While initial retrieval brings back potentially relevant chunks of information, these strategies act as sophisticated filtering and enhancement mechanisms to ensure the highest quality input context. They address common challenges like information redundancy, relevance ranking, and context optimization, ultimately improving the accuracy and reliability of the generated responses.
Two key components form the backbone of post-retrieval processing: filtering and reranking. These mechanisms work in tandem to ensure that only the most pertinent and high-quality information reaches the final generation stage. Let's explore each of these strategies in detail:
3.1 Filtering: Post-retrieval filtering is a crucial step in the RAG pipeline, ensuring that search results are not only relevant but also safe, diverse, and respectful of privacy and confidentiality. Before generating the final answer, filtering out irrelevant or harmful content is essential to maintaining the integrity of the system; in this way, filtering acts as a quality control mechanism on the context that reaches the generator.
3.2 Reranking: Re-ranking is a technique used to enhance the relevance of search results by leveraging the advanced language understanding capabilities of LLMs or dedicated reranking models.
Initially, a set of candidate documents or passages is retrieved using traditional information retrieval methods like BM25 or vector similarity search. These candidates are then fed into a reranking model (such as Cohere Rerank) that analyzes the semantic relevance between the query and each document. The model assigns relevance scores, enabling the re-ordering of documents to prioritize the most pertinent ones. Common options include Cohere Rerank, semantic boosting, and Reciprocal Rank Fusion (RRF).
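For illustration, here is a minimal sketch of Reciprocal Rank Fusion combining two ranked result lists, for example one from BM25 and one from vector search; the document IDs are illustrative and k=60 is the commonly used smoothing constant:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs: each document scores 1/(k + rank)
    per list it appears in, and documents are re-ordered by total score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc4", "doc3"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])   # e.g. ['doc1', 'doc3', ...]
```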
The goal of this step is to make sure the user is asking questions within the scope of our system (and not trying to "jailbreak" the system to make it do something unintended) and to prepare the user's query to increase the likelihood that it will locate the best possible article chunks using the cosine similarity / "nearest neighbor" search. There are several methods to achieve this.

Step-back prompting asks the LLM to first derive high-level concepts and principles from specific details, allowing the model to understand the broader context before tackling the specific question. By focusing on general principles first, the LLM can follow a more logical reasoning path towards the solution, rather than getting stuck on irrelevant details.
Example: If a car travels at a speed of 100 km/h, how long does it take to travel 200 kilometers?
Step back question: Given speed and distance, what is the basic formula for calculating time?
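A minimal sketch of how this can be wired up in a single prompt; call_llm is a placeholder for whatever LLM client you use:

```python
# Illustrative step-back prompting flow; `call_llm` is a placeholder LLM client.
STEP_BACK_PROMPT = (
    "You are an expert tutor. Given the question below, first write a more "
    "general 'step-back' question about the underlying principle, answer that, "
    "and then use it to answer the original question.\n\nQuestion: {question}"
)

def step_back_answer(question: str, call_llm) -> str:
    return call_llm(STEP_BACK_PROMPT.format(question=question))

# step_back_answer("If a car travels at 100 km/h, how long does it take to travel 200 km?", call_llm)
# -> derives "time = distance / speed" first, then answers "2 hours".
```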
HyDE (Hypothetical Document Embeddings) is a variation of step-back prompting in which the LLM generates a hypothetical document or answer based on the query. By generating hypothetical documents, HyDE can better capture nuanced or multi-faceted query intents, potentially retrieving more relevant information for complex questions. HyDE also performs well across multiple languages, making it valuable for multilingual applications.
However, one challenge is that using a language model to generate these documents introduces an extra step in the retrieval process, which may increase latency.
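A minimal sketch of the HyDE flow, where call_llm, embed, and vector_index are placeholders for your LLM client, embedding model, and vector store:

```python
# Minimal sketch of HyDE: generate a hypothetical answer, embed it, and retrieve with
# that embedding. `call_llm`, `embed`, and `vector_index` are placeholders.
def hyde_retrieve(query: str, call_llm, embed, vector_index, k: int = 5):
    hypothetical_doc = call_llm(
        f"Write a short passage that plausibly answers the question:\n{query}"
    )
    # Search with the embedding of the hypothetical document instead of the raw query
    return vector_index.search(embed(hypothetical_doc), k=k)
```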
RAG Fusion is an advanced retrieval-augmented generation technique that enhances traditional RAG by employing multiple query generation and reciprocal rank fusion (RRF) to improve accuracy and context awareness. Unlike standard RAG, it generates multiple search queries for a single input, combines their results, and reranks them to provide more comprehensive and precise responses, albeit at the cost of increased processing time and system complexity.
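An illustrative sketch of the RAG Fusion flow, reusing the reciprocal_rank_fusion helper sketched in the re-ranking section; call_llm and search are placeholders, with search(q) assumed to return a ranked list of document IDs:

```python
def rag_fusion_retrieve(query: str, call_llm, search, n_variants: int = 3) -> list[str]:
    """Generate query variants with an LLM, retrieve for each, then fuse the rankings.
    `search(q)` is assumed to return a ranked list of document IDs; `call_llm` is
    assumed to return one generated query per line."""
    variants = [
        q.strip()
        for q in call_llm(f"Generate {n_variants} different search queries for: {query}").splitlines()
        if q.strip()
    ]
    ranked_lists = [search(q) for q in [query, *variants]]
    return reciprocal_rank_fusion(ranked_lists)  # fuse with RRF as sketched earlier
```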

Before troubleshooting and iterating on the RAG application by optimizing each component, it's crucial to start with a benchmark to compare against, as is usually done for machine learning projects. Without a benchmark, you will not be able to evaluate whether the solution you have built is performing well or demonstrate business impact to your stakeholders.
In order to build a benchmark for RAG, you should first deploy RAG components in the most basic setup, without tuning configurations. Once this benchmark is in place, you can iterate and monitor the quality of responses.
Evaluating the performance of RAG systems involves several metrics, such as the relevance of the retrieved context, the groundedness of the answers (i.e., how well the responses are supported by the context), and the overall relevance of the answers to the user’s query. Effective evaluation frameworks not only consider these metrics individually but also examine how they interact within the complete RAG pipeline.
Since a satisfactory LLM output depends entirely on the quality of the retriever and generator, RAG evaluation focuses on evaluating the retriever and generator in your RAG pipeline separately.
deepeval offers three LLM evaluation metrics to evaluate retrieval:
•ContextualPrecisionMetric: evaluates whether the reranker in your retriever ranks more relevant nodes in your retrieval context higher than irrelevant ones.
•ContextualRecallMetric: evaluates whether the embedding model in your retriever is able to accurately capture and retrieve relevant information based on the context of the input.
•ContextualRelevancyMetric: evaluates whether the text chunk size and top-K of your retriever are able to retrieve information without too much irrelevant content.
It also offers two metrics to evaluate generation:
•AnswerRelevancyMetric: evaluates whether the prompt template in your generator is able to instruct your LLM to output relevant and helpful responses based on the retrieval_context.
•FaithfulnessMetric: evaluates whether the LLM used in your generator outputs information that does not hallucinate or contradict any factual information presented in the retrieval_context.
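For example, here is a minimal sketch of running a few of these metrics on a single RAG interaction with deepeval (assuming the deepeval package is installed and configured with an evaluation model; the data is illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import ContextualRelevancyMetric, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# One illustrative RAG interaction: the user query, the generated answer,
# and the chunks that were retrieved as context.
test_case = LLMTestCase(
    input="What are the current mortgage rates?",
    actual_output="Our 30-year fixed mortgage rate is currently 6.5%.",
    retrieval_context=["30-year fixed mortgage: 6.5% APR as of April 2025."],
)

evaluate(
    test_cases=[test_case],
    metrics=[ContextualRelevancyMetric(), AnswerRelevancyMetric(), FaithfulnessMetric()],
)
```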
Advanced RAG architectures represent a significant evolution from basic implementations, providing robust solutions for production environments. Through optimized chunking strategies, intelligent indexing methods, and sophisticated retrieval techniques, organizations can build more reliable and effective systems.

The success of RAG implementations depends on carefully choosing and combining these components based on specific use cases and requirements. Key considerations include:
- Balancing performance and accuracy
- Selecting appropriate chunking and retrieval strategies
- Implementing effective evaluation frameworks
- Maintaining scalability for production workloads
As RAG technology continues to evolve, these advanced techniques will become increasingly crucial for organizations seeking to leverage LLMs effectively in their specific domains while ensuring consistent, high-quality results.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.