Evaluating Amazon Bedrock Embeddings Models

In this guide, we present a methodology to evaluate Amazon Bedrock embeddings models for a given task.

Eduardo Yong
Amazon Employee
Published Dec 2, 2024

Introduction

If you are working on a RAG (Retrieval Augmented Generation) based application and want to evaluate the precision of your retriever by comparing it against other models, this post is for you. It provides a guide on how to evaluate embeddings models for a given task and dataset using MTEB (Massive Text Embedding Benchmark).
MTEB is a framework that measures how well an embeddings model performs on a variety of tasks. It consists of 58 datasets covering 112 languages across 8 embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, STS (semantic textual similarity), and summarization.
We are going to measure the performance of embeddings models on the retrieval task, using a dataset in Spanish, and the metric we will focus on is nDCG (Normalized Discounted Cumulative Gain).
The dataset we are going to use is PRES (Spanish Passage Retrieval), a test collection for passage retrieval from health-related Web resources in Spanish. The task being modeled is that of a Spanish-speaking user with information needs on the subjects of "baby care", "vaccination", or "low back pain" searching reputable health-related websites in Spanish, with the system retrieving relevant short passages.
The test collection contains 10,037 health-related Web documents in Spanish, 37 topics representing complex information needs formulated in a total of 167 natural language questions, and manual relevance assessments of text passages, pooled from multiple systems.
The models we are going to measure are available through Amazon Bedrock.
Amazon Bedrock is a managed service that offers foundation models (FMs) from leading AI companies (AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon) through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Hands on

First we need to install the mteb library, and then we can import it:
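A minimal setup could look like the sketch below. Besides mteb, we import boto3 to call the Bedrock runtime, along with json and numpy, which the wrapper classes in the next step rely on.

```python
# Assumes the library was installed first, e.g. with `pip install mteb`.
import json

import boto3
import mteb
import numpy as np
```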
We need to define a class that implements an encode method; we will define one class for the Amazon Titan models and another for the Cohere models.
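A sketch of such a wrapper for the Titan models is shown below. The request and response fields (inputText, embedding) follow the Amazon Titan Embeddings API on Bedrock; the class name TitanModel matches the text, while the default region is an assumption you should adapt.

```python
class TitanModel:
    """MTEB-compatible wrapper around Amazon Titan embeddings on Bedrock."""

    def __init__(self, model_id: str, region_name: str = "us-east-1"):
        self.model_id = model_id
        self.client = boto3.client("bedrock-runtime", region_name=region_name)

    def encode(self, sentences, **kwargs):
        # Titan embeddings take one text per request via the `inputText` field
        # and return the vector in the `embedding` field of the response.
        embeddings = []
        for text in sentences:
            response = self.client.invoke_model(
                modelId=self.model_id,
                body=json.dumps({"inputText": text}),
                contentType="application/json",
                accept="application/json",
            )
            payload = json.loads(response["body"].read())
            embeddings.append(payload["embedding"])
        return np.array(embeddings)
```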
The Cohere models use the input_type parameter, which takes the value search_document for encoding documents and search_query for encoding queries.
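One way to handle this is to expose encode_queries and encode_corpus methods in addition to encode; if your mteb version detects these methods it will use them for retrieval, so the right input_type is sent for each side, otherwise the plain encode fallback applies. The sketch below does exactly that; the class name CohereModel, the default region, and the batch size of 96 texts per call are assumptions to adapt.

```python
class CohereModel:
    """Wrapper for Cohere Embed on Bedrock, using input_type for queries vs. documents."""

    def __init__(self, model_id: str, region_name: str = "us-east-1", batch_size: int = 96):
        self.model_id = model_id
        self.batch_size = batch_size
        self.client = boto3.client("bedrock-runtime", region_name=region_name)

    def _embed(self, texts, input_type):
        embeddings = []
        for i in range(0, len(texts), self.batch_size):
            body = json.dumps(
                {"texts": texts[i : i + self.batch_size], "input_type": input_type}
            )
            response = self.client.invoke_model(
                modelId=self.model_id,
                body=body,
                contentType="application/json",
                accept="application/json",
            )
            payload = json.loads(response["body"].read())
            embeddings.extend(payload["embeddings"])
        return np.array(embeddings)

    def encode(self, sentences, **kwargs):
        return self._embed(list(sentences), input_type="search_document")

    def encode_queries(self, queries, **kwargs):
        return self._embed(list(queries), input_type="search_query")

    def encode_corpus(self, corpus, **kwargs):
        # MTEB retrieval passes the corpus as dicts with "title" and "text" fields.
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self._embed(texts, input_type="search_document")
```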
First, let's get the task and define the evaluation object from it:
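With recent versions of mteb this could look like the following; the task identifier below is an assumption, so verify it against the retrieval tasks mteb lists for Spanish.

```python
# The task name is assumed; check mteb.get_tasks(languages=["spa"]) for the exact identifier.
tasks = mteb.get_tasks(tasks=["SpanishPassageRetrievalS2P"])
evaluation = mteb.MTEB(tasks=tasks)
```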
Now we are going to initialize the TitanModel class for Amazon Titan Embeddings V1 and run the evaluation:
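A sketch of both steps, assuming the TitanModel wrapper defined above and an output folder of your choice:

```python
titan_v1 = TitanModel(model_id="amazon.titan-embed-text-v1")
results_titan_v1 = evaluation.run(titan_v1, output_folder="results/titan-v1")
```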
Then we do the same for Amazon Titan Embeddings V2:
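The same pattern, swapping in the Titan V2 model ID:

```python
titan_v2 = TitanModel(model_id="amazon.titan-embed-text-v2:0")
results_titan_v2 = evaluation.run(titan_v2, output_folder="results/titan-v2")
```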
And lastly for the Cohere Embed Multilingual model:
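And once more with the CohereModel wrapper and the multilingual Embed v3 model ID on Bedrock:

```python
cohere_multilingual = CohereModel(model_id="cohere.embed-multilingual-v3")
results_cohere = evaluation.run(cohere_multilingual, output_folder="results/cohere-multilingual")
```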
In this case we are interested in nDCG, a measure of ranking quality in information retrieval in which the position of each relevant document is taken into account when computing the score.
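For reference, with rel_i the graded relevance of the document at rank i and IDCG@k the DCG of the ideal ordering, the metric at cutoff k is:

$$
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
$$

Scores lie between 0 and 1, with 1 meaning the relevant passages are ranked at the very top.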

Deep Dive

Dataset

The dataset consists of 3 files (a short loading sketch follows the list):
  1. docs.json, which contains all the documents, each with a docNo identifier.
  2. relevance_passages.json, which contains all the queries and, for each query, the passages (extracts of a document) that are relevant to it.
  3. topics.json, which groups the queries into topics, each topic consisting of a set of related natural language questions.
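MTEB downloads and parses the collection for you when you run the task; the sketch below only shows how the three files could be loaded for manual inspection, assuming they are available locally under these names.

```python
import json

with open("docs.json", encoding="utf-8") as f:
    docs = json.load(f)           # documents, each with a docNo identifier
with open("relevance_passages.json", encoding="utf-8") as f:
    relevance = json.load(f)      # query -> relevant passages
with open("topics.json", encoding="utf-8") as f:
    topics = json.load(f)         # topic -> related natural language questions
```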

Methodology

The methodology for this task consists of first encoding the documents and the queries, then computing the cosine similarity between them and returning, for each query, the documents with the highest similarity.
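The sketch below illustrates the idea with plain numpy; it is a simplification for clarity, not the benchmark's actual retrieval code. You would call it with a query vector and the matrix of document vectors produced by one of the wrappers above.

```python
import numpy as np

def retrieve(query_embedding: np.ndarray, doc_embeddings: np.ndarray, top_k: int = 10):
    """Return the indices and scores of the documents most similar to the query."""
    # Normalize so that the dot product equals the cosine similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]
```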

Conclusion

In this post I explained how to evaluate embeddings models available in Amazon Bedrock following the MTEB methodology for a retrieval task. Since choosing the right model can improve the overall correctness of your system, I encourage you to evaluate the system both in each of its parts and as a whole.

Contributions

This blog post and the accompanying code base were contributed by Eduardo Yong, a Solutions Architect working for AWS. All opinions shared are the author's personal opinions and may not represent the official view of AWS.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
