Evaluating LLM Systems: Best Practices

Guide to evaluating LLM systems: Covers embedding models, reranking, information retrieval, LLM calls, and end-to-end metrics.

Tanner McRae
Amazon Employee
Published Sep 6, 2024

Overview:

Large Language Model (LLM) system evaluation best practices are a moving target. New techniques, frameworks, and libraries come out almost daily. However, many of the concepts have been around much longer than the term “GenAI,” and the solutions end up borrowing a lot from what already exists. In this document, we’ll discuss evaluation over the lifecycle of an LLM-augmented system.

Code Samples

This GitHub repo contains code examples for each of the following touch points. When starting out, it’s recommended to work through the notebooks in order. You can jump around, but the concepts build off each other. The notebooks are meant to be instructive and build much of the validation framework from scratch so readers can understand how everything works under the hood.

Background:

The majority of this section will focus on post-training (fine-tuning) evaluation and end-to-end system evaluation, since these are the most common scenarios.

Task Specific Evaluation

Resist the urge to use metrics and evaluation data created by someone else or that are publicly available. It’s more important to understand the concepts behind building good metrics and tailoring them to your business needs.
A good starting place for any evaluation is to build out at least 100 examples of human-curated datapoints that represent your customers’ needs. These examples should be extended over time. As your business needs and solution change, your eval set should be updated to reflect those changes.
This concept applies to all stages of the system’s lifecycle. Before we go any further, it’s important to break down the components in an LLM augmented system and discuss how you evaluate each step.

Touchpoints

Validation and evaluation happen at multiple stages in the lifecycle of a model/solution/prompt. Let’s diagram out where the touch points are. We’ll start with a basic retrieval augmented generation (RAG) architecture.
In this example, validation and evaluation should happen for each of the five touch points: the embedding model, the re-rank model, the end-to-end information retrieval system, the LLM call, and the end-to-end system. The naive approach is to just focus on touch point 5 and “vibe check” it. That works “okay” for a POC, but doesn’t instill confidence when going to production.
In the next couple of sections, we’ll talk about how to evaluate each of these touch points to create a robust evaluation framework, and discuss the lifecycle of each touch point.
Note: I chose a RAG architecture because it’s common. The concept of evaluating your system’s touch points applies to other architectures as well. The concept is the same, but the metrics chosen and the “how to” change slightly depending on your use case. In all the retrieval steps, the key is to define a validation dataset containing the correct answers. Whether the answers are images (multi-modal embeddings), example code snippets (code generation), or a hybrid retrieval system using a graph database, the process is the same. Set up a retrieval task and evaluate it against your validation dataset.

Touch Point 1: Embeddings

Before evaluating an embedding model, it’s important to understand what we’re using the embedding model for. The most popular public benchmark is the Massive Text Embedding Benchmark (MTEB). HuggingFace maintains a leaderboard to compare general-purpose embedding models against each other across a wide range of tasks.
This is a decent starting place, but you have to ask yourself: how well does this dataset complement the task I really care about? If I’m creating a RAG solution for lawyers, I’m much more interested in how well the embedding model works for comparing legal text vs. how well it works for medical text. This is why it’s important to build out your own evaluation. A model that might not rank high on a general-purpose benchmark could rank very high on your specific use case. If none of them work very well, then you can make a case for fine-tuning an existing model on your data.
What Metrics Should You Care About?
To answer this question, we need to understand what we’re using the embedding model for. For an information retrieval use case, we care about different metrics than we would for a clustering use case. Because retrieval tends to be the most important use case in RAG, let’s focus on retrieval.
Classic information retrieval metrics like recall@k and precision@k apply to embedding models. During this stage, we’re trying to increase the number of relevant items retrieved.
How to Evaluate
To perform this evaluation, you need to set up a retrieval task. Generate vector representations of items (documents or chunks) in a shared semantic space and perform a k-nearest-neighbor search on them using a similarity measure (e.g., cosine similarity or dot product). This gives you the top-k retrieved items for each query.
You need a dataset of relevance judgments that indicate which documents are relevant to each query. These are typically created by human annotators or derived from click data in production systems.
For each query, count the number of relevant items in the top-k retrieved results. Calculate precision@k as (number of relevant retrieved documents / k). Average the precision values across all queries.
Apply these same techniques to other metrics like recall, NDCG, or MAP for a more comprehensive evaluation.
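As a rough sketch of what this looks like in practice, the snippet below computes precision@k and recall@k over a retrieval task using plain NumPy. It assumes you’ve already embedded your queries and documents with the model under test and collected relevance judgments; the variable names are illustrative, not taken from the accompanying repo.

```python
import numpy as np

def precision_recall_at_k(query_embs, doc_embs, relevant_ids, k=10):
    """Compute mean precision@k and recall@k for a retrieval task.

    query_embs:   (num_queries, dim) array of query embeddings
    doc_embs:     (num_docs, dim) array of document/chunk embeddings
    relevant_ids: list of sets; relevant_ids[i] holds the indices of the
                  documents judged relevant to query i
    """
    # Normalize so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

    sims = q @ d.T                            # (num_queries, num_docs)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of top-k docs per query

    precisions, recalls = [], []
    for retrieved, relevant in zip(top_k, relevant_ids):
        hits = len(set(retrieved.tolist()) & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))
```

Running this for several candidate embedding models (and chunking strategies) against the same judgments gives you directly comparable numbers.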
Compare Experiments
Once you’ve set up your retrieval task, you can compare different embedding models and chunking strategies through experimentation to pick the combination that fits your use case best. I find it useful to build visualizations that compare the different experiments side by side.
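For example, a simple grouped bar chart makes it easy to eyeball how different embedding model and chunking strategy combinations stack up. The numbers below are placeholders, not real benchmark results:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical results for each embedding model / chunking combination
experiments = ["model-a / 256 tokens", "model-a / 512 tokens",
               "model-b / 256 tokens", "model-b / 512 tokens"]
recall_at_10 = [0.62, 0.71, 0.68, 0.74]      # placeholder numbers
precision_at_10 = [0.35, 0.41, 0.39, 0.44]   # placeholder numbers

x = np.arange(len(experiments))
width = 0.35
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, recall_at_10, width, label="recall@10")
ax.bar(x + width / 2, precision_at_10, width, label="precision@10")
ax.set_xticks(x)
ax.set_xticklabels(experiments, rotation=20, ha="right")
ax.set_ylabel("score")
ax.set_title("Embedding model / chunking strategy comparison")
ax.legend()
plt.tight_layout()
plt.show()
```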

Touch Point 2: Re-Rank

When evaluating a rerank model, it’s crucial to focus on metrics that reflect both the quality and relevance of the reranked results, as well as the model’s ability to improve upon the initial ranking. Normalized Discounted Cumulative Gain (NDCG) is often considered one of the most important metrics, as it accounts for the position of relevant items in the ranked list and can handle graded relevance judgments. Mean Average Precision (MAP) is another valuable metric that provides a single figure of merit for the overall ranking quality across multiple queries. For scenarios where the top results are particularly important, metrics like Precision@k and Mean Reciprocal Rank (MRR) can offer insights into the model’s performance at specific cut-off points.
In addition to these standard information retrieval metrics, it’s beneficial to consider comparative metrics that directly measure the improvement over the base ranking. This can include the percentage of queries improved, the average change in relevant document positions, or a paired statistical test comparing the reranked results to the original ranking. It’s also important to evaluate the model’s efficiency, considering factors like inference time and computational resources required, especially for applications with strict latency requirements. Ultimately, the choice of metrics should align with the specific goals of your reranking task and the priorities of your system, balancing between relevance, user satisfaction, and operational constraints.
How to Evaluate
To evaluate a rerank model using these metrics, start by preparing a test set consisting of queries, their corresponding initial rankings, and human-annotated relevance judgments for each query-document pair. Run your rerank model on the initial rankings to produce a new set of reranked results. Then, calculate the chosen metrics for both the initial and reranked results. For example, to compute NDCG@k, sort the documents for each query by their relevance scores, calculate the Discounted Cumulative Gain (DCG) for the top k results, and normalize it by the Ideal DCG. For MAP, calculate the average precision for each query at every position where a relevant document is retrieved, then take the mean across all queries. To assess improvement, compare the metric scores between the initial and reranked results. It’s also valuable to analyze per-query performance to identify where the rerank model excels or struggles. Finally, consider evaluating on different subsets of your data to ensure consistent performance across various query types or document categories.
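As a quick illustration, here’s a minimal NDCG@k sketch you could use to compare the initial ranking against the reranked results for a single query; the relevance grades below are made up. Average the per-query scores across your whole test set to get the final numbers.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain for a ranked list of graded relevance scores."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the ranking normalized by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance grades (e.g. 0-3) in the order the documents were ranked,
# before and after reranking, for one example query
initial_order = [0, 2, 1, 0, 3]
reranked_order = [3, 2, 1, 0, 0]

print("NDCG@5 initial :", ndcg_at_k(initial_order, 5))
print("NDCG@5 reranked:", ndcg_at_k(reranked_order, 5))
```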

Touch Point 3: End to End Information Retrieval System

When evaluating the effectiveness of the entire information retrieval system in a RAG setup (including the vector DB, embedding model, and retrieval mechanism, but excluding the final language model call), the primary focus is on the relevance and comprehensiveness of the retrieved information. This is not too different from evaluating your embedding model. Similar metrics are used, but instead of setting up a retrieval task just for your embeddings, you’re combining the vector DB, embedding model, re-rank model, and any other components that touch the retrieval portion of the system.
Key Metrics
Key metrics include recall@k, which measures the proportion of all relevant documents retrieved in the top k results, and is crucial for ensuring the system captures a high percentage of pertinent information. Precision@k complements this by measuring the proportion of retrieved documents that are relevant. Mean Average Precision (MAP) provides an overall measure of retrieval quality across different recall levels. Normalized Discounted Cumulative Gain (NDCG) is particularly valuable as it considers both the relevance and ranking of retrieved documents.
How to Evaluate
To measure these metrics effectively, you would typically use a test set of queries with known relevant documents. For each query, run it through your entire retrieval pipeline — from embedding the query, searching the vector database, to retrieving the top k documents. Then compare the retrieved documents against the known relevant documents. Calculate Recall@k and Precision@k for various values of k (e.g., 5, 10, 20) to understand performance at different retrieval depths. Compute NDCG using graded relevance judgments if available. To get a single performance figure, calculate the mean of these metrics across all test queries. It’s also valuable to measure query latency and throughput to ensure the system meets any speed requirements. Finally, consider evaluating on different subsets of your data to ensure consistent performance across various query types or document categories.
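One way to structure this is to treat the whole retrieval pipeline as a black box behind a single function. The sketch below assumes a hypothetical retrieve(query, k) callable that wraps your embedding model, vector DB, and reranker:

```python
import time
import numpy as np

def evaluate_pipeline(retrieve, test_queries, k_values=(5, 10, 20)):
    """Evaluate an end-to-end retrieval pipeline.

    retrieve:     callable(query, k) -> ordered list of document ids, wrapping
                  your embedding model, vector DB, and reranker (hypothetical)
    test_queries: list of (query_text, set_of_relevant_doc_ids) pairs
    """
    precisions = {k: [] for k in k_values}
    recalls = {k: [] for k in k_values}
    latencies = []

    max_k = max(k_values)
    for query, relevant in test_queries:
        start = time.perf_counter()
        retrieved = retrieve(query, max_k)
        latencies.append(time.perf_counter() - start)

        for k in k_values:
            hits = len(set(retrieved[:k]) & relevant)
            precisions[k].append(hits / k)
            recalls[k].append(hits / len(relevant) if relevant else 0.0)

    return {
        "precision": {k: float(np.mean(v)) for k, v in precisions.items()},
        "recall": {k: float(np.mean(v)) for k, v in recalls.items()},
        "p95_latency_s": float(np.percentile(latencies, 95)),
    }
```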

Touch Point 4: LLM Call

For the purpose of this document, let’s assume you’re prompting an LLM that has been instruction tuned, meaning it is fairly generalized at following instructions. These models are prompted using instructions, but how do you know if the instructions you’re passing in work? This falls under the category of prompt tuning or prompt engineering.
Note: If you are fine-tuning an LLM, you’ll do similar things to benchmark the model, with the additional step of collecting datapoints that fix any issues you find with the prompts being passed in, for use in future model iterations.
What Metrics Should I Care About?
Metrics for LLM calls can be broken up into two categories: subjective and absolute. Absolute metrics like latency and throughput are easier to calculate. Subjective metrics are more difficult. These range from truthfulness, faithfulness, and answer relevancy to any custom metric your business cares about. If you’re writing a Text2SQL app, you might care about whether the correct SQL is generated.
All of the subjective metrics typically require a level of human reasoning to arrive at a numeric score.

Techniques

For subjective metrics you have two options: (1) use human annotators or (2) use another LLM to judge your responses. Both have pros and cons, and we’ll discuss how to do each below.
Human Evaluators
This is a time-intensive process. It requires humans to go through and evaluate your answers. You need to select the humans carefully and make sure their instructions on how to grade are clear. Typically, you give the evaluators a rubric, just like a teacher might use when grading a task in school.
LLM As A Judge
This is a newer technique where you give an LLM a grading rubric and it performs the same evaluation that the human annotators would do in the section above. An example rubric might look like this:
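The rubric below is purely illustrative; adapt the criteria, scale, and output format to whatever your business actually cares about:

```python
# Illustrative grading rubric for an LLM judge. The criteria, scale, and output
# format here are examples only, not a standard.
GRADING_RUBRIC = """\
You are grading a system's answer to a user question against a gold-standard answer.
Score the answer on each criterion from 1 (poor) to 5 (excellent):

1. Faithfulness: every claim in the answer is supported by the provided context.
2. Answer relevancy: the answer directly addresses the user's question.
3. Completeness: the answer covers the key points from the gold-standard answer.
4. Clarity: the answer is well organized and easy to follow.

Return a JSON object with an integer score and a one-sentence justification
for each criterion.
"""
```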
How do we trust an LLM to grade the answers correctly?
To use LLM-as-a-judge, you have to iterate on a prompt until the human annotators generally agree with the LLM’s grades. An evaluation dataset should be created and graded by a human. That same dataset is then run through an LLM using the grading rubric. If the grades align, the prompt is ready to be used. If not, you need to iterate on the prompt until the humans and the LLM agree.
When building out this validation, make sure to choose both good and bad examples. It’s equally important to test both.
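One way to quantify that alignment is to compare the two sets of grades directly. The helper below is a rough sketch; the names and thresholds are illustrative:

```python
import numpy as np

def judge_agreement(human_scores, llm_scores, tolerance=1):
    """Compare human grades against LLM-judge grades for the same responses.

    human_scores, llm_scores: equal-length lists of integer grades (e.g. 1-5)
    for the same evaluation examples, in the same order.
    tolerance: how far apart two grades can be and still count as agreement.
    """
    human = np.asarray(human_scores)
    llm = np.asarray(llm_scores)
    return {
        "exact_agreement": float(np.mean(human == llm)),
        "agreement_within_tolerance": float(np.mean(np.abs(human - llm) <= tolerance)),
        "pearson_correlation": float(np.corrcoef(human, llm)[0, 1]),
    }

# Iterate on the judge prompt until these numbers are acceptably high for your
# use case (e.g. most grades within one point of the human grade).
```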
How to Evaluate
Typically this is done in a similar way as the information retrieval steps. You run through a validation dataset that reflects your customers’ expectations. After a model change or prompt change, this validation dataset should be re-run to see if the changes improve the outcomes.
Evaluating in Production
An important benefit of LLM-as-a-judge is that it can be run in production. You often won’t have the correct answer when seeing net new requests, so the grading rubric needs to be tweaked to not include the gold-standard answer. The LLM can output a metric on the system’s response, which can be pushed into an observability solution to monitor its performance on production traffic.
This can become costly. You’re increasing the number of invocations to the LLM, so it’s common practice to sample responses and only run the evaluation on a subset of traffic to keep costs down.
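A simple way to do that sampling might look like the sketch below, where run_llm_judge and emit_metric are hypothetical hooks into your own judge prompt and observability backend (e.g. CloudWatch, Prometheus, or a logging pipeline):

```python
import random

SAMPLE_RATE = 0.05  # judge roughly 5% of production traffic to control cost

def maybe_grade_response(question, retrieved_context, answer,
                         run_llm_judge, emit_metric):
    """Sample a fraction of production responses and grade them with an LLM judge.

    run_llm_judge: calls your judge model with a reference-free rubric
                   (no gold-standard answer is available in production)
    emit_metric:   pushes a named score to your observability backend
    """
    if random.random() > SAMPLE_RATE:
        return None

    scores = run_llm_judge(question=question,
                           context=retrieved_context,
                           answer=answer)
    for name, value in scores.items():
        emit_metric(metric_name=f"llm_judge.{name}", value=value)
    return scores
```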
Note: An important consideration is how you want to improve the system. Being able to quickly find poor performing prompts / inputs is key to improving it. It’s common to take poor performing examples, correct them, and use them as dynamic few shot examples in future prompts.

Touch Point 5: End To End System Metrics

Similar to the information retrieval end-to-end metrics, we’ll want to run evaluations on the entire system. This differs slightly from the information retrieval metrics because we’re combining both the accuracy of the information retrieved and the LLM’s response.
For this step, you’ll need a validation dataset that reflects customer expectations. This should be a comprehensive view of the different types of asks customers have or will have on your system.
What Metrics Should I Care About?
For RAG based results, you’ll want to measure faithfulness, answer relevance, context precision, and other metrics that make sense for your use case. Using the same Text2SQL example, if your customers are querying a SQL database, you’ll want to know if the results are correct and the right query was run. These metrics are often use case specific.
How to Evaluate
At this stage, we’re evaluating the system as a whole, not just the prompt to the LLM. Set up a retrieval task like you did in the previous touch points, but this time pass the retrieved chunks into the context of the model and evaluate the overall results of the model through a validation dataset.
You can use your LLM-as-a-judge prompt or rely on third-party tools like Ragas. Both are useful and not mutually exclusive.
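Pulling it together, an end-to-end evaluation run might look like the sketch below, where retrieve, generate, and grade are hypothetical wrappers around your own pipeline and your judge prompt (or a library like Ragas):

```python
import numpy as np

def evaluate_rag_system(validation_set, retrieve, generate, grade):
    """End-to-end evaluation over a validation dataset.

    validation_set: list of dicts with "question" and "reference_answer"
    retrieve: callable(question) -> list of context chunks
    generate: callable(question, chunks) -> model answer
    grade:    callable(question, chunks, answer, reference) -> dict of scores,
              e.g. faithfulness, answer relevancy, context precision
    (all four callables are hypothetical hooks around your own system)
    """
    per_metric = {}
    for example in validation_set:
        chunks = retrieve(example["question"])
        answer = generate(example["question"], chunks)
        scores = grade(example["question"], chunks, answer,
                       example["reference_answer"])
        for name, value in scores.items():
            per_metric.setdefault(name, []).append(value)

    return {name: float(np.mean(values)) for name, values in per_metric.items()}
```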

Validation Lifecycle

Now that we’ve discussed the touch points that should be evaluated and what metrics we care about, we’ll discuss the lifecycle of evaluation.

Pre-Release

At this point, it should be clear that you need validation datasets (holdouts) that you can run to validate changes to the information retrieval portion or the LLM portion of your system. It’s common to run these inside a CI/CD pipeline that’s deploying changes to the system. As you uncover more issues and understand the way users expect to interact with your system, these datasets will change and be added to over time.
A drop in these metrics should block the pipeline from deploying.
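A minimal gate step in that pipeline might look like the following sketch; the metric names and thresholds are illustrative and should come from your own baselines:

```python
import sys

# Minimum acceptable scores for a release; tune these to your own baselines.
THRESHOLDS = {"recall@10": 0.70, "faithfulness": 0.85, "answer_relevancy": 0.80}

def gate_release(metrics):
    """Exit non-zero (blocking the CI/CD pipeline) if any metric drops below its threshold."""
    failed = False
    for name, minimum in THRESHOLDS.items():
        value = metrics.get(name, 0.0)
        if value < minimum:
            print(f"FAIL {name}: {value:.3f} < {minimum:.2f}")
            failed = True
    if failed:
        sys.exit(1)
    print("All evaluation gates passed.")

# Example: gate_release(evaluate_rag_system(validation_set, retrieve, generate, grade))
```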

Production Validation & Observability

Production validation bleeds into observability a bit. In production, you often don’t know the correct answer like you do in a validation set, so there’s a different set of metrics you want to look at. If you’re building a RAG solution, user feedback, both implicit and explicit, can be a good source of metrics.

User Driven Business Metrics

User driven metrics indicate how successful users are in performing tasks with your system. If these metrics go down, that’s generally an indicator that something is not working correctly in your system. Some examples of user driven metrics include:
User Engagement:
  1. Session duration: How long users interact with the system.
  2. Return rate: Frequency of users coming back to use the system.
  3. Query volume: Number of queries processed over time.
User Satisfaction:
  1. Explicit feedback: User ratings or thumbs up/down on responses.
  2. Implicit feedback: Click-through rates on suggested actions or links.
  3. Net Promoter Score (NPS): If applicable to your product.
Task Completion Rate:
  1. Percentage of queries that lead to successful task completion.
  2. Abandonment rate: Proportion of sessions where users leave without completing their intended task.
Time Efficiency:
  1. Time to first response: How quickly the system provides an initial answer.
  2. Time saved: Estimated time saved by users compared to alternative methods.
Changes in these metrics indicate a problem, but we need additional metrics to diagnose and fix the problem.

LLM Driven Metrics

Understanding the “why” behind drops in user driven business metrics is where LLM-As-A-Judge becomes a powerful technique. By providing one or more rubrics to an LLM that has been tuned to agree with human evaluators, we can run those rubrics on a subset of user interactions to gain more contextual information about the system’s behavior. If user satisfaction goes down and truthfulness goes down, we know where to investigate.

Candidate Release

For new candidate releases, we can shadow a subset of user traffic and capture these LLM Driven Metrics. If the metrics improve, it provides confidence that the candidate release should move into production. Once shadowing is completed, we can begin A/B testing the solution which allows us to add in the user driven business metrics. If the A/B test was successful, we can be more confident that the candidate release should roll out to all new customers.

Conclusion

In this post we went through various touch points in an LLM-augmented system and described how to evaluate the system at each stage. If there’s one takeaway, it’s that no single metric is comprehensive. You need to look at the system as a whole and use appropriate metrics for the part of the system you’re evaluating.

Contributions

This blog post and the accompanying code base were contributed to by Tanner McRae and Felix Huthmacher. Both are Generative AI SAs working for AWS. All opinions shared are the authors’ personal opinions, and may not represent the official view of AWS.
 
