Evaluation for AWS Bedrock Model - A Serverless Approach to Enhance Efficiency and Scalability

Evaluating LLMs for text summarization is complex and resource-intensive. A tool is needed to streamline workflows, score models, and provide performance insights before production.

Published Aug 20, 2024

Introduction

Evaluating multiple large language models (LLMs) for tasks like text summarization can be resource-intensive and complex. Developers often face significant challenges in orchestrating workflows, managing parallelization, and storing results in a scalable and cost-effective manner. These issues can lead to increased overhead and inefficiencies. Internally, there is a need to develop a tool that assists developers in understanding the scoring of models before moving to production. This tool aims to address these challenges, streamline the evaluation process, and provide valuable insights into model performance.

Problem Statement

Evaluating LLMs is a multifaceted process that demands significant computational resources and careful orchestration. Developers must parallelize evaluation workflows across models and store results in a way that is both scalable and cost-effective; doing this by hand leads to operational inefficiency and elevated overhead. An internal tool is therefore needed that lets developers comprehensively assess model performance before production deployment.

Solution Overview

To tackle these challenges, we leveraged AWS's serverless services to build an automated, scalable, and cost-effective architecture for evaluating LLMs. This solution includes several key features:
  • Automatic Deployment and Scaling: The evaluation workflow is triggered automatically by changes pushed to the GitHub repository. AWS Step Functions manage the orchestration of parallel evaluations for multiple LLMs, ensuring efficient resource utilization without manual intervention.
  • Scalability and Cost-Effectiveness: By using AWS serverless services such as AWS Step Functions, Amazon API Gateway, and Amazon Bedrock, the system scales dynamically with demand and incurs minimal cost when idle.
  • Centralized Result Storage: Evaluation results for each LLM are stored in a centralized database, facilitating easy analysis and comparison of results.
  • Modular Architecture: Designed to be modular with loosely coupled components, the architecture allows easy integration of new LLMs or the replacement of existing ones, promoting flexibility and extensibility in the evaluation process.

Technical Architecture Overview

Short Demo: Youtube Link
GitHub Link: here

Code Deployment

Developers can either push code changes to a GitHub repository or deploy directly with the Amplify CLI. The Amplify web page UI reads the necessary API endpoints from .env configuration files and sends a GET request to Amazon API Gateway. API Gateway then invokes an AWS Step Function, which orchestrates the evaluation process.
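The API Gateway-to-Step Functions hand-off can be sketched as a small Lambda proxy. This is a minimal illustration, not the project's actual handler: the `STATE_MACHINE_ARN` environment variable and the input shape are assumptions, and the real project may instead use a direct API Gateway service integration with Step Functions.

```python
import json
import os


def build_execution_input(model_ids):
    """Build the Step Function input listing the candidate models to evaluate."""
    return json.dumps({"models": model_ids})


def handler(event, context):
    """Hypothetical Lambda proxy behind API Gateway: starts the evaluation workflow."""
    import boto3  # AWS SDK; available in the Lambda runtime

    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed env var
        input=build_execution_input(
            ["amazon.titan-text-express-v1", "meta.llama3-8b-instruct-v1:0"]
        ),
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"executionArn": response["executionArn"]}),
    }
```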

Parallel Evaluation

The Step Function parallelizes the evaluation by triggering one instance of the Call Bedrock LLM for Model Evaluation step per model. Evaluation results are stored in a centralized database, in this case Amazon DynamoDB.
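The fan-out/aggregate shape of this step can be mirrored in plain Python. The sketch below stands in for the Step Functions Map state: `evaluate_model` is a placeholder for the real Bedrock call and returns a dummy score, so only the parallelization pattern is shown, not the actual evaluation.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_model(model_id, dataset_name):
    """Placeholder for the 'Call Bedrock LLM for Model Evaluation' step.

    In the real workflow this invokes the Bedrock Evaluation API; here it
    returns a dummy record so the fan-out/aggregate shape stays visible.
    """
    return {"model": model_id, "dataset": dataset_name, "score": None}


def evaluate_in_parallel(model_ids, dataset_name):
    """Mirror of the Step Functions Map state: one branch per model,
    results aggregated into a single list for centralized storage."""
    with ThreadPoolExecutor(max_workers=len(model_ids)) as pool:
        return list(pool.map(lambda m: evaluate_model(m, dataset_name), model_ids))
```

`pool.map` preserves input order, so the aggregated list lines up with the model list, much as the Map state collects branch outputs into one array.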

Request Routing

The Amazon API Gateway acts as a proxy, receiving requests from the Amplify WebHosting service and routing them to the appropriate Step Function.

Detailed Implementation

AWS Bedrock: Amazon Bedrock offers foundation models for various machine learning tasks, including text summarization. This architecture leverages Bedrock to perform LLM evaluations by integrating its capabilities into the serverless workflow. For more details, refer to the AWS Bedrock Documentation.
Bedrock Evaluation API: The Bedrock Evaluation API is crucial to this architecture. It facilitates efficient execution and management of LLM evaluations, ensuring each model is tested against the same dataset under consistent conditions. More information about the Bedrock Evaluation API can be found here. Note that the boto3 version bundled with the AWS Lambda runtime is currently too old to call this API and throws an error; a Lambda layer containing the latest boto3 must be configured manually.
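One way to fail fast on the outdated-boto3 problem is to guard client creation with a version check. The minimum version below is an assumption for illustration; check the boto3 changelog for the release that actually added Bedrock evaluation support.

```python
def version_tuple(version):
    """Turn a version string like '1.34.90' into (1, 34, 90) for comparison."""
    return tuple(int(p) for p in version.split(".")[:3])


# Assumed minimum -- verify against the boto3 changelog for your target API.
MIN_BOTO3 = (1, 34, 0)


def bedrock_client_or_raise():
    """Create a Bedrock control-plane client only if boto3 is new enough."""
    import boto3

    if version_tuple(boto3.__version__) < MIN_BOTO3:
        raise RuntimeError(
            f"boto3 {boto3.__version__} predates the Bedrock Evaluation API; "
            "attach a Lambda layer with an up-to-date boto3."
        )
    return boto3.client("bedrock")
```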
Detailed Workflow:
  1. Initial Trigger: Code changes pushed to GitHub directly or deployed with the Amplify CLI automatically trigger the Amplify WebHosting service, which serves the React UI.
  2. API Gateway: Amplify WebHosting sends a GET request to Amazon API Gateway, invoking an AWS Step Function.
  3. Orchestration with Step Functions: The Step Function manages the evaluation workflow, initiating parallel evaluations for each LLM by invoking the Bedrock Evaluation API per model (e.g., Amazon Titan, LLAMA 3, Claude 3.5). Developers can also select an evaluator model or candidate model.
  4. Parallel Processing: Each model is evaluated in parallel, with results sent back to the Step Function for aggregation.
  5. Centralized Storage: Results are stored in a centralized database (DynamoDB), allowing for easy access and comparison.
  6. User Interface: Evaluation results are displayed in the Amplify-hosted web application, providing developers with a user-friendly interface to view and analyze the outcomes.
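The centralized-storage step (step 5) can be sketched as a small item builder for DynamoDB. The key schema and attribute names here (`pk`/`sk`, `metrics`) are illustrative assumptions; the project's actual table design may differ.

```python
import time


def build_result_item(model_id, run_id, metrics):
    """Hypothetical DynamoDB item shape for one model's evaluation result."""
    return {
        "pk": f"RUN#{run_id}",        # partition key: one evaluation run
        "sk": f"MODEL#{model_id}",    # sort key: one item per evaluated model
        "metrics": metrics,           # e.g. ROUGE / BERTScore summaries
        "evaluated_at": int(time.time()),
    }


def put_result(table, item):
    """table is a boto3 DynamoDB Table resource, e.g.
    boto3.resource('dynamodb').Table('EvaluationResults')."""
    table.put_item(Item=item)
```

Keying items by run and model means one query per run returns every model's result, which is what the comparison UI in step 6 needs.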

Scalability and Cost-Effectiveness

The use of AWS serverless technologies, such as AWS Step Functions and API Gateway, ensures that the architecture can scale according to demand. This approach eliminates unnecessary costs when the system is idle, making it a cost-effective solution for enterprises.

Centralized Result Storage

Storing evaluation results in a centralized database (DynamoDB) not only facilitates easy comparison and analysis but also ensures that data is readily available for future reference. This centralized approach improves data governance and security.
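With results in one table, cross-model comparison becomes a simple reduction over the stored items. The item shape assumed below (`model` plus a `metrics` map) is hypothetical and matches the illustrative schema, not a confirmed project format.

```python
def best_model(items, metric):
    """Pick the highest-scoring model from a list of stored result items.

    Each item is assumed to look like {'model': ..., 'metrics': {metric: float}}.
    """
    return max(items, key=lambda item: item["metrics"][metric])["model"]
```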

Conclusion

The serverless architecture designed by Ayyanar(AJ) and Aadhityaa revolutionizes the evaluation of LLMs by automating deployment, ensuring scalability, and centralizing result storage. This innovative approach not only reduces overhead costs but also enhances the efficiency and flexibility of LLM evaluation, making it an invaluable asset for enterprises.