
Implementing a reranker for your RAG

Deploying a reranker on a SageMaker endpoint with Hugging Face Text Embeddings Inference

gengis
Amazon Employee
Published Nov 28, 2024
Last Modified Dec 2, 2024
Update, December 2024: Bedrock Knowledge Bases now supports reranking directly through the API using Amazon Rerank 1.0 or Cohere Rerank 3.5. So if you are using Bedrock Knowledge Bases, I'd recommend trying those models first; if you prefer open-source or a custom reranker, deploying on SageMaker remains a valid option.
You connected your LLM to your internal enterprise data (maybe using Bedrock Knowledge Bases or a vector DB) and it is starting to show promising results. But sometimes it doesn't give a relevant answer, and when you look at the details, the chunks being retrieved from your documents are not always the most relevant. Welcome to the information retrieval world!
There are multiple mechanisms you can deploy to improve search relevancy. Anthropic has a very nice introductory post on search and RAG (and a very nice contextual retrieval feature) that I recommend anyone working with RAG to read. One of those strategies, and the one we will cover today, is reranking. In this article we will look at deploying a reranker on a SageMaker endpoint, making it accessible for various Retrieval-Augmented Generation (RAG) workflows.

Reranking 101

At its core, reranking is a two-step process that refines initial search results to provide more accurate and relevant information. Let's break down the key components of reranking:
  • Initial Candidates: When a query is made, the system first retrieves a set of potentially relevant documents or pieces of information based on traditional retrieval methods (like keyword matching or vector similarity).
  • Reranking Process: After this initial retrieval, reranking takes this initial set of results and applies a more sophisticated model (such as a BGE reranker) to re-order them.
  • The goal is to bring the most relevant results to the top of the list, improving the overall quality and usefulness of the context passed to the LLM.
(Figure: reranking with BGE)
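To make the two-step flow concrete, here is a minimal Python sketch. The retriever and reranker below are hypothetical stand-ins, not code from this post: in practice the first step would query your vector DB and the second would call your reranker endpoint.

```python
def vector_search(query: str, top_k: int) -> list[str]:
    # Placeholder for a real retriever (keyword matching, vector similarity, ...)
    corpus = ["passage about RAG", "passage about reranking", "unrelated passage"]
    return corpus[:top_k]

def rerank(query: str, texts: list[str]) -> list[float]:
    # Placeholder for a real reranker model scoring each text against the query
    return [float(any(word in t for word in query.split())) for t in texts]

def retrieve_and_rerank(query: str, top_k: int = 25, final_k: int = 5) -> list[str]:
    candidates = vector_search(query, top_k)   # step 1: broad, cheap recall
    scores = rerank(query, candidates)         # step 2: precise relevance scoring
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [text for text, _ in ranked[:final_k]]

print(retrieve_and_rerank("reranking"))
```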
A key finding from Anthropic's testing is the consistent superiority of reranking strategies over approaches without reranking. This observation leads to an important consideration for AWS Bedrock Knowledge Bases (KB) users: at the time of writing, Bedrock KB didn't support direct reranking (though see the update at the top of this post).
To add reranking to your RAG pipeline you can use SageMaker endpoints. Recently, SageMaker endpoints gained the ability to scale to zero, which could also make this approach more cost-effective if you experiment with it.
However, it's important to note that reranking isn't without trade-offs. While it can significantly improve relevancy, it also introduces latency and can impact Time to First Token (TTFT). Therefore, it's crucial to consider whether increased relevancy outweighs the need for faster response times in your specific use case.
For those using Bedrock Knowledge Bases, implementing reranking requires deploying the reranking model on a SageMaker endpoint.

Deploy a reranker on SageMaker using Hugging Face TEI

For this implementation, we chose to deploy our reranker using SageMaker endpoints with the Hugging Face Embedding Container. This container, which uses Text Embeddings Inference (TEI), offers an efficient solution for deploying embedding models in a managed environment. It provides several advantages that align well with our reranking needs, such as quick startup times, dynamic batching, and optimised inference using techniques like Flash Attention. The container's production-ready features, including distributed tracing and Prometheus metrics, make it suitable for scaling our RAG system on SageMaker while maintaining good performance and ease of management. This approach allows us to leverage SageMaker's robust infrastructure while benefiting from TEI's optimised embedding capabilities.
There are multiple ways to deploy a reranker on SageMaker, and you will need to assess which option is best for your workload, but I believe one of the simplest ways to deploy a reranker endpoint is to use Hugging Face TEI with BAAI/bge-reranker-v2-m3 (right now it is one of the few open-source rerankers supported by TEI). It only takes a few lines of Python to deploy:
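A minimal sketch of that deployment with the SageMaker Python SDK might look like the following; the container version, instance type, and endpoint name are illustrative choices, not values prescribed by this post.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Retrieve the Hugging Face TEI container image for SageMaker
# (the version is an example; pick the latest available in your region)
image_uri = get_huggingface_llm_image_uri("huggingface-tei", version="1.2.3")

# Point TEI at the open-source reranker model
model = HuggingFaceModel(
    image_uri=image_uri,
    env={"HF_MODEL_ID": "BAAI/bge-reranker-v2-m3"},
    role=role,
)

# Deploy a single-instance endpoint (instance type is an example choice)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="bge-reranker-v2-m3",
)
```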
The full code is available here; it's mostly the same, with a few delete functions to clean up if we redeploy the endpoint:
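A cleanup helper along these lines (a sketch, which looks up the endpoint config and model names via the SageMaker API rather than assuming them) lets you re-run the deployment script without name collisions:

```python
import boto3

sm_client = boto3.client("sagemaker")

def cleanup(endpoint_name: str) -> None:
    # Look up the endpoint's config and model, then delete all three resources
    config_name = sm_client.describe_endpoint(
        EndpointName=endpoint_name
    )["EndpointConfigName"]
    model_name = sm_client.describe_endpoint_config(
        EndpointConfigName=config_name
    )["ProductionVariants"][0]["ModelName"]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)
```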
This example uses a single instance with limited concurrent requests for demonstration. For production, there are multiple things to consider, such as auto-scaling, load balancing, and distributed processing. Consider using larger instance types or multi-model endpoints for increased throughput, implement robust monitoring, logging, and error handling, and adjust configurations based on your specific load patterns, performance requirements, and cost constraints.

Testing the endpoint

It seems there is a documentation error on the Hugging Face side: the model expects as input a query and an array of texts, such as:
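For example, a minimal payload (the query and passages here are made up for illustration):

```python
# TEI's rerank input format: a single query plus a list of candidate texts
payload = {
    "query": "What is reranking?",
    "texts": [
        "Reranking reorders retrieved passages by relevance to the query.",
        "Paris is the capital of France.",
        "Deep learning is a subset of machine learning based on neural networks.",
    ],
}
```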
To call the endpoint:
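A minimal sketch using boto3's SageMaker runtime client; the endpoint name assumes the deployment sketch above.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="bge-reranker-v2-m3",  # name used in the deployment sketch
    ContentType="application/json",
    Body=json.dumps(payload),
)
results = json.loads(response["Body"].read())
print(results)  # a list of {"index": ..., "score": ...} objects
```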
Now, it returns indices and scores, so you need to reorder the texts yourself:
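For instance, a small snippet that maps the returned indices back to the original texts and sorts by score:

```python
# The endpoint returns (index, score) pairs, not the texts themselves,
# so map the indices back to the original passages and sort by score
ranked = sorted(results, key=lambda r: r["score"], reverse=True)

for r in ranked:
    print(f"{r['score']:.4f}  {payload['texts'][r['index']]}")
```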
And this is what I get when running this Python script:

A few words

It is essential to validate our approach through robust evaluations. Implementing reranking should be done thoughtfully, with careful testing to ensure it truly enhances system performance. These evaluations will help determine whether the additional complexity and computational cost of reranking yield meaningful improvements in relevance and accuracy.
Reranking is a powerful technique to enhance the relevance of retrieved results in RAG systems. By deploying a reranker model on SageMaker, we can improve the quality of information fed to the LLM, potentially leading to more accurate and contextually appropriate responses. However, this approach comes with trade-offs in terms of latency and complexity. The key is to balance these factors against the improved relevancy, always keeping in mind the specific needs of your application and users. As the field evolves, we will surely see more integrated solutions emerge. For now, combining vector data stores (like Bedrock KB / OpenSearch), AWS Bedrock, and SageMaker endpoints provides a solid foundation for building sophisticated RAG systems.
If you want to dive deeper, there is a nice article from NVIDIA on reranking algorithms (which also showcases the one we are using in this sample): Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG
You can also look at how to deploy Cohere Rerank (another great reranker) in a managed fashion: https://aws.amazon.com/blogs/machine-learning/improve-rag-performance-using-cohere-rerank/
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
