RAGAS evaluation to mitigate failure points in Retrieval-Augmented Generation architectures


Justin McGinnity, Alfredo Castillo

Published Oct 10, 2024
Last Modified Dec 3, 2024

Introduction

RAG (Retrieval-Augmented Generation) pipelines consist of two key components:
1. Retriever: Responsible for extracting the most pertinent information to address the query.
2. Generator: Tasked with formulating a response using the retrieved information.
To effectively evaluate a RAG pipeline, it's crucial to assess these components both individually and collectively. This approach yields an overall performance score while also providing specific metrics for each component, allowing for targeted improvements. For instance:
- Enhancing the Retriever: This can be achieved through improved chunking strategies or by employing more advanced embedding models.
- Optimizing the Generator: Experimenting with different language models or refining prompts can lead to better generation outcomes.
However, this raises several important questions: What metrics should be used to measure and benchmark these components? Which datasets are most suitable for evaluation? How can Amazon Bedrock be integrated with RAGAS for this purpose?
In the following sections, we'll delve into these critical aspects and show you how to use a framework called RAGAS to evaluate and optimize RAG pipelines.

RAG Failure Points

This blog presents an experience report on the failure points of Retrieval Augmented Generation (RAG) systems, which combine information retrieval capabilities with the generative prowess of large language models (LLMs).
Barnett et al. identify seven critical failure points in the development and implementation of Retrieval Augmented Generation (RAG) systems, shown in Figure 1. These failure points represent key areas where system performance can be compromised, leading to suboptimal results or system failures. The authors’ analysis provides valuable insights for practitioners and researchers working on RAG systems.
Figure 1: Indexing and Query processes in a Retrieval Augmented Generation (RAG) system.
Failure Point 1: Incorrect or Incomplete Knowledge Base
The foundation of any RAG system is its knowledge base. An incorrect or incomplete knowledge base can lead to erroneous or insufficient information retrieval, ultimately affecting the quality of generated outputs. The authors stress the importance of comprehensive and accurate data curation, regular updates, and rigorous quality control measures to maintain the integrity of the knowledge base. An incomplete knowledge base can also cause LLM hallucination: a query may be semantically similar to data stored in the knowledge base, and the LLM responds with an answer it believes to be correct even though the supporting content is missing. An incorrect or incomplete knowledge base leads to the “Missing Content” failure point in Fig 1.
Failure Point 2: Ineffective Chunking Strategies
Barnett et al. highlight the critical role of chunking strategies in RAG systems. Ineffective chunking can result in loss of context, fragmented information, or oversized chunks that hinder efficient retrieval. Optimal chunking should balance granularity with contextual preservation, adapting to the specific needs of the application and the nature of the data. When retrieving context from a knowledge base, only a limited number of chunks can be returned. Given that limit, a non-optimal chunking strategy can cause the chunk containing the relevant context to be missed entirely; this is illustrated as “Missed Top Ranked” in Fig 1.
Failure Point 3: Large Retrieved Context
Context retrieved from a knowledge base is appended to the query in the prompt before being processed by the LLM. Chunks with large token sizes can flood the input context window of the LLM. Some systems will implement a consolidation step after context retrieval before invoking an LLM to decrease the number of input tokens. The consolidation step has the possibility of removing the relevant context before the LLM processes the query and is illustrated as “Not in Context” in Fig 1.
Failure Point 4: Improperly Formatted Answer
Part of prompt engineering is specifying a desired format or structure of the answer. As the input prompt token sizes increase due to failure points 2 and 3, it is possible the LLM is unable to recall the format specified in the original prompt and returns the answer in a format of its choice. This is shown as “Wrong Format” in Fig 1.
Failure Point 5: Incomplete Response
Incomplete answers are not incorrect, but they omit important information that was present in the input prompt. This can happen for a variety of reasons, notably failure points 2 and 3 introducing large amounts of irrelevant context into the input context window. This is shown as “Incomplete” in Fig 1.
Failure Point 6: Failure to Extract
When an LLM processes a prompt with a large number of input tokens, it may fail to find the relevant context needed to properly answer the question. This is referred to as a “noisy” prompt, where the noise is extraneous information that is irrelevant to answering the question. This is illustrated as “Not Extracted” in Fig 1.
Failure Point 7: Incorrect Specificity
The final failure point identified is the returned answer being too specific or not specific enough. This failure can be caused by issues at many points throughout the entire RAG process, but it ultimately results in an answer that is too specific or too general for the user asking the question. This is illustrated as “Incorrect Specificity” in Fig 1.
In conclusion, Barnett et al. provide a comprehensive analysis of potential failure points in RAG systems, offering valuable insights for improving system design, implementation, and maintenance. Their work serves as a crucial guide for practitioners and researchers in the field of RAG, highlighting areas that require particular attention to ensure robust and effective system performance. Many of these failure points stem from failure points 2 and 3, where the retrieved context is irrelevant or too large, or the relevant context is missed entirely.

Chunking and Embedding

Chunking, the process of segmenting documents into smaller units, is presented as a non-trivial task with significant implications for downstream processes. Two primary approaches are identified: heuristic-based chunking, which relies on syntactic markers, and semantic chunking, which considers textual meaning. The authors suggest that further research is needed to evaluate the comparative efficacy of these methods, particularly in terms of their impact on embedding quality and retrieval performance. A recommendation is made for developing a systematic evaluation framework to assess chunking techniques based on metrics such as query relevance and retrieval accuracy.

Fixed Size Chunking:

Fixed-size chunking is a common technique employed in Retrieval-Augmented Generation (RAG) architectures, which combine information retrieval and language generation models. Input text is divided into fixed-size chunks, enabling efficient indexing and retrieval of relevant passages from a large corpus. However, dividing the input into fixed-size chunks can fragment the contextual flow, making it challenging to capture long-range dependencies or understand the full context. Additionally, important information may span across chunk boundaries, leading to potential information loss or incorrect retrieval. While fixed-size chunking simplifies the retrieval process in RAG architectures, it introduces trade-offs between efficiency and context preservation, necessitating careful consideration of chunk size and overlap percentage between chunks to mitigate context fragmentation issues for optimal performance.
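To make the trade-offs concrete, here is a minimal sketch of fixed-size chunking with overlap. The chunk size and overlap values are illustrative assumptions, not recommendations, and character counts stand in for tokens for simplicity.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with a sliding overlap.

    Character counts are used in place of tokens here; a real pipeline would
    typically chunk on tokens produced by the embedding model's tokenizer.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks


# Example usage
document = "France, in Western Europe, encompasses medieval cities and alpine villages. " * 40
print(len(fixed_size_chunks(document)))
```

Increasing the overlap reduces the chance that a fact is split across a chunk boundary, at the cost of storing and embedding more redundant text.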

Semantic Chunking:

Semantic chunking involves dividing the input text into chunks based on semantic or contextual boundaries rather than fixed sizes. This method aims to preserve the contextual flow and meaning of the input by grouping related information together, potentially mitigating the issues of context fragmentation and information loss across chunk boundaries. However, semantic chunking introduces its own set of challenges, including the need for sophisticated natural language processing techniques to accurately identify semantic boundaries, which can be computationally expensive and prone to errors. Additionally, the variable chunk sizes resulting from semantic chunking may complicate the indexing and retrieval processes, potentially impacting efficiency and parallelization opportunities. While semantic chunking offers the potential for improved context preservation, its implementation requires careful consideration of the trade-offs between contextual accuracy and computational complexity, as well as the development of robust algorithms for semantic boundary detection.
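The sketch below illustrates one common way to implement this idea: embed each sentence, then start a new chunk whenever the similarity between adjacent sentences drops below a threshold. The `embed` callable and the similarity threshold are assumptions; in practice you would plug in an embedding model (for example, one served through Amazon Bedrock) and tune the threshold on your own data.

```python
import numpy as np

def semantic_chunks(sentences, embed, similarity_threshold=0.75):
    """Group consecutive sentences into chunks, starting a new chunk whenever
    the cosine similarity between adjacent sentence embeddings falls below
    the threshold (a simple proxy for a semantic boundary)."""
    if not sentences:
        return []
    embeddings = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_emb, emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        cosine = float(np.dot(prev_emb, emb) /
                       (np.linalg.norm(prev_emb) * np.linalg.norm(emb)))
        if cosine < similarity_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```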

Hierarchical Chunking:

Hierarchical chunking is a powerful technique for understanding complex, nested documents like legal papers, technical manuals, and academic articles. It automatically groups related content into coherent chunks based on semantic similarity, ensuring contextually related information stays together. Hierarchical chunking organizes documents into parent and child chunk levels, performing semantic searches on granular child chunks but presenting results at the broader parent level. This approach provides fewer, more relevant search results by encompassing multiple related child chunks within parent chunks. The hierarchical structure captures relationships between content, enabling contextually relevant responses. Hierarchical chunking allows defining parent chunk sizes, child chunk sizes, and overlap, retrieving granular details but summarizing with higher-level parent chunks for efficiency and conciseness.
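As a simplified illustration of the parent/child idea (not the exact mechanism of any particular managed service), the sketch below indexes small child chunks for search but returns the enclosing parent chunk as context. The sizes and the naive term-overlap scoring are placeholders; a real system would use a vector store and semantic search over the child chunks.

```python
def build_hierarchy(text, parent_size=1500, child_size=300):
    """Split text into large parent chunks, then split each parent into
    smaller child chunks that are indexed for search."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    child_index = []  # list of (child_chunk, parent_id)
    for parent_id, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            child_index.append((parent[j:j + child_size], parent_id))
    return parents, child_index


def retrieve_parent(query_terms, parents, child_index):
    """Score child chunks (naive term overlap as a stand-in for semantic search)
    and return the broader parent chunk containing the best-matching child."""
    def score(chunk):
        return sum(term.lower() in chunk.lower() for term in query_terms)
    best_child, parent_id = max(child_index, key=lambda item: score(item[0]))
    return parents[parent_id]
```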
The discussion on embeddings highlights their importance in representing various content types, including multimedia and multimodal data. It notes that chunk embeddings are typically generated during system development or document indexing. The paper emphasizes the critical role of query preprocessing in RAG system performance, especially when dealing with negative or ambiguous queries. The authors advocate for additional research into architectural patterns and approaches to address the domain-specific limitations inherent in embedding-based systems.
In essence, the interconnected nature of chunking and embeddings in RAG systems is critical, thus it is important to have adequate tools and frameworks to evaluate and optimize these processes to enhance overall system performance and adaptability across diverse domains.

RAGAS - RAG Assessment

In this section, we will discuss RAGAS (RAG Assessment), a framework that helps you evaluate your Retrieval Augmented Generation (RAG) applications. There are existing tools and frameworks that help you build GenAI systems that leverage RAG capabilities for semantic search among other use cases, but evaluating them and quantifying your application's performance can be hard. This is where RAGAS comes in.
RAGAS provides a number of metrics that evaluate each component of a RAG process. We can classify these metrics into three categories: 1) Retrieval and Generation, 2) End-to-End Natural Language Comparison, and 3) Aspect Critic. Figure 2 shows a list of some of the key metrics RAGAS provides.
Figure 2: RAGAS Metrics

Component-wise Metrics

Retrieval Step
In Retrieval-Augmented Generation (RAG), the retrieval step ensures that the information retrieved is relevant to the query or question. This is done by using a retriever model to fetch information from external sources (e.g., a vector database), which is then provided to an LLM to generate a readable response.
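For example, if your RAG pipeline uses an Amazon Bedrock Knowledge Base, the retrieval step might look like the hedged sketch below; the knowledge base ID and the number of results are placeholders.

```python
import boto3

# Retrieve context chunks from an Amazon Bedrock Knowledge Base.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "Where is France and what is its capital?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

# These retrieved chunks become the "contexts" that RAGAS evaluates.
contexts = [result["content"]["text"] for result in response["retrievalResults"]]
```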
The following metrics can be used to analyze the performance of this step, which will help with the analysis of Failure Points 1, 2, 4, and 6.
Context Recall: Context Recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed using the question, the ground truth, and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance. To estimate context recall from the ground truth answer, each claim in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the ground truth answer should be attributable to the retrieved context.
The formula for calculating context recall is as follows:
Context Recall = (Number of claims in the ground truth that can be attributed to the retrieved context) / (Total number of claims in the ground truth)
Example:
Question: Where is France and what is its capital?
Ground truth: France is in Western Europe and its capital is Paris.
High context recall: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.
Low context recall: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
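As a rough illustration of the attribution step (not an actual RAGAS output): the ground truth above contains two claims, that France is in Western Europe and that its capital is Paris. The high-recall context supports both claims, giving a recall of 2/2 = 1.0, while the low-recall context supports only the first, giving a recall of 1/2 = 0.5.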
Context Precision: Context Precision is a metric that evaluates whether all of the ground-truth-relevant items present in the contexts are ranked at the top. Ideally, all the relevant chunks must appear at the top ranks. This metric is computed using the question, the ground truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
Context Precision@K = (Σ from k=1 to K of Precision@k × v_k) / (Total number of relevant items in the top K results)
Where K is the total number of chunks in the contexts and v_k ∈ {0, 1} is the relevance indicator at rank k.
Example:
Question: Where is France and what is its capital?
Ground truth: France is in Western Europe and its capital is Paris.
High context precision: [“France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”, “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.”]
Low context precision: [“The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”, “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”]
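As a rough worked example of the rank-weighted formula: both orderings contain the same two chunks, only one of which is relevant to the ground truth. In the high-precision ordering the relevant chunk is at rank 1, so Precision@1 = 1 and Context Precision = 1.0. In the low-precision ordering it is at rank 2, so Precision@1 = 0, Precision@2 = 1/2, and Context Precision = (0 + 0.5) / 1 = 0.5.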
The next steps in the RAG process are Augmentation and Generation. In the Augmentation step, the pre-processed information obtained during the Retrieval step is added to the user's query to create an augmented prompt. This step uses prompt engineering to communicate with the LLM to generate a response to the user's query. The response can be tailored for different tasks, such as question answering, summarization, or creative text generation.
The following metrics can be used to analyze the performance of this step, which will help with the analysis of the following Failure Points:
· Failure Point 1: Missing Content
· Failure Point 3: Not in Context
· Failure Point 5: Incomplete
· Failure Point 7: Incorrect Specificity
Faithfulness: This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0,1) range, with higher values being better.
The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. The faithfulness score is given by:
Faithfulness = (Number of claims in the answer that can be inferred from the given context) / (Total number of claims in the answer)
Example:
Question: Where and when was Einstein born?
Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
High faithfulness answer: Einstein was born in Germany on 14th March 1879.
Low faithfulness answer: Einstein was born in Germany on 20th March 1879.
Answer Relevancy: This metric measures how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.
The Answer Relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions, which are generated (reverse engineered) from the answer:
Answer Relevancy = (1/N) × Σ from i=1 to N of cos(E_gi, E_o)
Where:
E_gi = embedding of the generated question i
E_o = embedding of the original question
N = number of generated questions (default = 3)
Answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details.
Example:
Question: Where is France and what is its capital?
Low relevance answer: France is in western Europe.
High relevance answer: France is in western Europe and Paris is its capital.

End-to-End (Natural Language Comparison) Metrics

RAGAS provides some additional metrics to evaluate the final output and measure the semantic quality and accuracy of the generated response. These metrics can also help evaluate the end-to-end process and the impact of any failure point on the final response:
· Failure Point 4: Wrong Format
· Failure Point 5: Incomplete
· Failure Point 6: Not Extracted
· Failure Point 7: Incorrect Specificity
Answer semantic similarity: The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
Example:
Ground truth: Albert Einstein’s theory of relativity revolutionized our understanding of the universe.
High similarity answer: Einstein’s groundbreaking theory of relativity transformed our comprehension of the cosmos.
Low similarity answer: Isaac Newton’s laws of motion greatly influenced classical physics.
Answer Correctness: The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a ‘threshold’ value to round the resulting score to binary, if desired.
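As a rough sketch of that weighting (the exact defaults are configurable in RAGAS, so treat these values as an assumption to verify against your version): Answer Correctness ≈ 0.75 × factual score (an F1 computed over claims the answer shares with, misses from, or contradicts in the ground truth) + 0.25 × answer semantic similarity.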
Example:
Ground truth: Einstein was born in 1879 in Germany.
High answer correctness: In 1879, Einstein was born in Germany.
Low answer correctness: Einstein was born in Spain in 1879.

Aspect Critic

This set of metrics is designed to assess responses based on predefined aspects such as harmlessness and correctness. The output of aspect critiques is binary, indicating whether the answer aligns with the defined aspect or not. This evaluation is performed using the ‘answer’ as input.
RAGAS Critiques offer a range of predefined aspects: harmfulness, maliciousness, coherence, correctness, and conciseness.
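As a hedged sketch of how these predefined aspects can be scored (import paths and metric names differ across RAGAS versions; this assumes a 0.1.x-style API where the predefined critiques live in ragas.metrics.critique):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics.critique import harmfulness, conciseness

# Minimal dataset for aspect critiques; each aspect is scored 0 or 1.
dataset = Dataset.from_dict({
    "question": ["Where is France and what is its capital?"],
    "answer": ["France is in Western Europe and Paris is its capital."],
    "contexts": [["France, in Western Europe, encompasses medieval cities."]],
})

# By default RAGAS uses its configured judge LLM; you can pass llm=... to use
# an Amazon Bedrock model, as in the evaluation-flow sketch later in this post.
result = evaluate(dataset, metrics=[harmfulness, conciseness])
print(result)
```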

RAGAS Evaluation Flow

The solution we present leverages the Amazon Bedrock API to invoke the LLMs and Knowledge Bases used to perform the evaluation. RAGAS requires the following components:
1) Question and Ground_Truth pairs: You need to create a list of questions (queries or prompts) with their corresponding ideal responses. These pairs are used to create the evaluation dataset.
2) Context: A list of the relevant text chunks returned by the retrieval step within the RAG process and presented to the LLM so it can generate its response.
3) Answer: The generated response at the output of the RAG pipeline.
4) LLMs and Knowledge Bases: The list of LLMs and Knowledge Bases under analysis. Here you can experiment with different embeddings, chunking strategies, LLMs, re-ranker settings, etc., and use RAGAS to evaluate the performance.
5) LLM for evaluation: RAGAS leverages an LLM as a judge.
RAGAS uses these components to calculate the evaluation metrics. For example, Context Precision takes the Question and Context (i.e., the retrieved chunks) to create a score, while the Context and the Answer are used to compute the Faithfulness score. See Figure 3 for more details on the metrics and calculation flow.
Figure 3: RAGAS Evaluation Flow
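Putting the pieces together, the following is a hedged sketch of a RAGAS evaluation that uses Amazon Bedrock models (via the langchain-aws wrappers) as the judge LLM and the embedding model. The model IDs, column names, and wrapper classes are assumptions that depend on your RAGAS and LangChain versions, so verify them against your environment.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from langchain_aws import ChatBedrock, BedrockEmbeddings

# Evaluation dataset: question/ground_truth pairs plus the contexts and answer
# produced by the RAG pipeline under test.
eval_dataset = Dataset.from_dict({
    "question": ["Where is France and what is its capital?"],
    "ground_truth": ["France is in Western Europe and its capital is Paris."],
    "contexts": [[
        "France, in Western Europe, encompasses medieval cities, alpine villages "
        "and Mediterranean beaches. Paris, its capital, is famed for its fashion houses."
    ]],
    "answer": ["France is in Western Europe and Paris is its capital."],
})

result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    llm=ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0"),    # judge LLM (placeholder model ID)
    embeddings=BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0"),  # placeholder model ID
)
print(result.to_pandas())
```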

Try This for Yourself

The following reference architecture was used to implement the integration of RAGAS with Amazon Bedrock:
Two repositories have been created to deploy this solution and to measure performance of your RAG applications.
GenAI Model Evaluator – this repository has a new feature which allows you to evaluate your Bedrock knowledge bases using the RAGAS framework. Simply provide a “Questions.csv” and an “Answers.csv” file containing the relevant questions and corresponding ground truths, respectively, and it will run the RAGAS framework and generate the results.
Figure 4: RAGAS metrics output from GenAI Model Evaluator repo
Evaluate your RAG Pipeline: Update coming soon! We are working on getting this published to aws-samples. This repository gets you hands-on with RAGAS in Jupyter notebooks to evaluate different embedding models, chunking strategies, and large language models to optimize performance.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
