Evaluating Amazon Bedrock Embeddings Models
In this guide we present a methodology to evaluate Bedrock embeddings models for a given task
Eduardo Yong
Amazon Employee
Published Dec 2, 2024
If you are working on a RAG (Retrieval Augmented Generation) based application and you want to evaluate the precision of your retriever by comparing it against other models, this blog aims to help with that. It provides a guide on how to evaluate embeddings models for a given task and dataset using MTEB (Massive Text Embedding Benchmark).
MTEB is a framework that measures how well an embeddings model performs on a variety of tasks. MTEB consists of 58 datasets covering 112 languages across 8 embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, STS (semantic textual similarity), and summarization.
We are going to measure embeddings model performance on the retrieval task, for a dataset in Spanish, and the metric we are going to focus on is nDCG (Normalized Discounted Cumulative Gain). The dataset we are going to use is the PRES (Spanish Passage Retrieval) dataset, a test collection for passage retrieval from health-related Web resources in Spanish. The task being modeled is that of a Spanish-speaking user with information needs on the subjects of "baby care", "vaccination", or "low back pain" in reputable health-related websites in Spanish, where the system retrieves relevant short passages.
The test collection contains 10,037 health-related Web documents in Spanish, 37 topics representing complex information needs formulated in a total of 167 natural language questions, and manual relevance assessments of text passages, pooled from multiple systems.
The models we are going to measure are available through Amazon Bedrock.
Amazon Bedrock is a managed service that offers foundation models (FMs) from AI companies (AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon) through an API, along with capabilities to build generative AI applications around security, privacy, and responsible AI.
First we need to install the mteb library and then import it
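A minimal setup check, assuming a Python environment (the exact pip invocation depends on your setup; boto3, the AWS SDK, is also needed for the Bedrock calls later):

```python
import importlib.util

# mteb is installed from PyPI with:    pip install mteb
# boto3 (the AWS SDK) is installed with:    pip install boto3
missing = [pkg for pkg in ("mteb", "boto3")
           if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    import mteb  # the benchmark library used throughout this post
```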
We need to define a class that implements the `encode` method; we will define one class for the Amazon Titan models and another for the Cohere models. The Cohere models use the `input_type` parameter, which takes the value `search_document` for encoding documents and `search_query` for encoding queries.

First, let's get the task and define the evaluation object from it
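In recent versions of the library the task is looked up by name and wrapped in an evaluation object. `SpanishPassageRetrievalS2P` is my assumption for this dataset's MTEB task name (the sentence-to-passage variant); verify it against your installed mteb version, e.g. by listing Spanish tasks:

```python
def build_evaluation(task_name: str = "SpanishPassageRetrievalS2P"):
    """Look up an MTEB task by name and wrap it in an evaluation object."""
    # Deferred import so the sketch can be read without mteb installed.
    import mteb

    tasks = mteb.get_tasks(tasks=[task_name])
    return mteb.MTEB(tasks=tasks)
```

Calling `build_evaluation()` fetches the task metadata on first use.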
Now we are going to initialize the TitanModel class for Amazon Titan Embeddings V1
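A sketch of the wrapper, assuming the Titan embeddings request shape (a JSON body with `inputText`, a response carrying `embedding`) and the `amazon.titan-embed-text-v1` model ID:

```python
import json


class TitanModel:
    """Minimal wrapper exposing the `encode` method MTEB expects."""

    def __init__(self, model_id: str = "amazon.titan-embed-text-v1",
                 region_name: str = "us-east-1"):
        # Deferred import: boto3 and AWS credentials are only needed
        # when embeddings are actually requested.
        import boto3

        self.model_id = model_id
        self.client = boto3.client("bedrock-runtime", region_name=region_name)

    def encode(self, sentences, **kwargs):
        """Return one embedding vector per input sentence."""
        embeddings = []
        for text in sentences:
            response = self.client.invoke_model(
                modelId=self.model_id,
                body=json.dumps({"inputText": text}),
            )
            embeddings.append(json.loads(response["body"].read())["embedding"])
        return embeddings
```

Instantiating it as `model = TitanModel()` requires AWS credentials with Bedrock access in the chosen region.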
And run the evaluation
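The run itself is one call; here `evaluation` is the object built from the task and `model` is any wrapper exposing `encode` (the output folder name is an arbitrary choice of mine):

```python
def run_model(evaluation, model, output_folder: str = "results/titan-embed-v1"):
    """Evaluate a model on the configured MTEB tasks.

    Scores (nDCG@10 among them) are returned and also written as JSON
    files under `output_folder`.
    """
    return evaluation.run(model, output_folder=output_folder)
```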
Then for the Amazon Titan V2
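For Titan Embeddings V2 the same wrapper can be reused with a different model ID; V2 additionally accepts optional `dimensions` and `normalize` fields in the request body (field names and values below follow the Bedrock documentation, but treat them as assumptions against your region's model listing):

```python
# Bedrock model identifier for Titan Text Embeddings V2.
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"


def titan_v2_request_body(text: str, dimensions: int = 1024,
                          normalize: bool = True) -> dict:
    """Request body for Titan V2; dimensions may be 256, 512, or 1024."""
    return {"inputText": text, "dimensions": dimensions, "normalize": normalize}
```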
And lastly for the Cohere Embed Multilingual model
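A sketch of the Cohere wrapper, assuming the Bedrock request shape for Cohere Embed (a `texts` list plus `input_type`) and the `cohere.embed-multilingual-v3` model ID; the separate query/corpus methods exist because MTEB's retrieval evaluator uses `encode_queries`/`encode_corpus` when a model provides them:

```python
import json


class CohereModel:
    """Wrapper for Cohere Embed on Bedrock.

    Cohere distinguishes document and query encoding through the
    `input_type` field: "search_document" vs "search_query".
    """

    def __init__(self, model_id: str = "cohere.embed-multilingual-v3",
                 region_name: str = "us-east-1"):
        # Deferred import: boto3 and AWS credentials are only needed
        # when embeddings are actually requested.
        import boto3

        self.model_id = model_id
        self.client = boto3.client("bedrock-runtime", region_name=region_name)

    def _embed(self, texts, input_type):
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({"texts": list(texts), "input_type": input_type}),
        )
        return json.loads(response["body"].read())["embeddings"]

    def encode_corpus(self, corpus, **kwargs):
        # Corpus entries may arrive as dicts with a "text" field or as strings.
        texts = [d["text"] if isinstance(d, dict) else d for d in corpus]
        return self._embed(texts, "search_document")

    def encode_queries(self, queries, **kwargs):
        return self._embed(queries, "search_query")

    def encode(self, sentences, **kwargs):
        return self._embed(sentences, "search_document")
```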
In this case we are interested in nDCG, a measure of ranking quality in information retrieval in which the position of each result is taken into account when computing the score.

The dataset consists of 3 files:
- docs.json, which contains all the documents, each with an identifier docNo.
- relevance_passages.json, which contains all the queries and which passages (extracts of a document) are relevant to each query.
- topics.json, which groups related queries into topics.
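To make the metric concrete, here is a small self-contained nDCG computation using the linear-gain variant of DCG and toy relevance scores (not PRES data):

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: gains are discounted by log2 of position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# A perfect ranking scores 1.0; placing an irrelevant document first
# lowers the score, because position matters.
print(ndcg([3, 2, 1, 0]))  # perfect ordering -> 1.0
print(ndcg([0, 2, 1, 3]))  # misordered -> below 1.0
```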
The methodology for this task consists of: first, encoding the documents and the queries; then computing the cosine similarity and retrieving, for each query, the documents with the highest similarity.
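That retrieval step reduces to a nearest-neighbor search under cosine similarity. A dependency-free sketch with toy vectors (real vectors would come from an embeddings model):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, top_k=2):
    """Return document indices sorted by descending similarity to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]


# Toy 3-dimensional "embeddings": the first two documents point roughly the
# same way as the query, the third is orthogonal to it.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(retrieve(query, docs))  # -> [0, 1]
```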
In this post I explained how to evaluate an embeddings model in Amazon Bedrock following the MTEB methodology for a retrieval task. Since retriever quality directly affects overall system correctness, I encourage you to evaluate the system both in each of its parts and as a whole.
This blog post and the accompanying code base were contributed to by Eduardo Yong, a Solutions Architect working for AWS. All opinions shared are the author's personal opinions, and may not represent the official view of AWS.