
Automate Amazon Bedrock Evaluations with AWS Step Functions

Run multiple evaluation jobs in parallel to quickly evaluate chunking strategies, embeddings, and vector databases

Rohan Mehta
Amazon Employee
Published Feb 25, 2025
Just want the source code? Check it out here.
As organizations scale their use of Retrieval Augmented Generation (RAG) applications, maintaining consistent quality becomes increasingly challenging. Evaluating your RAG ingestion pipelines is crucial to keeping your application's use of RAG aligned with business requirements. New chunking approaches, new models, and new document structures all take critical work hours for your team to evaluate, so an automated evaluation framework saves significant time.
A common RAG workload such as a chatbot requires updates in production as new documents are added to its dataset; ensuring that the chatbot provides correct and helpful answers for new types of inquiries is essential to customer satisfaction. An automated evaluation approach helps you maintain high-quality responses while reducing the manual effort required for RAG pipeline evaluation, enabling faster iteration and optimization of your AI applications.
As previously announced during re:Invent 2024, Amazon Bedrock Knowledge Bases now supports RAG evaluation. You can run automatic evaluations to assess your RAG applications, comparing different configurations and tuning as necessary. Please note that as of February 2025 the Evaluations feature remains in preview.
The solution in this blog post automates the steps required to create multiple Knowledge Bases with different configurations and run Evaluations on all of them to compare results. Running an Evaluation for a single Knowledge Base takes at least 10-20 minutes; running multiple Evaluations in parallel accelerates your GenAI development velocity.
We’ll use AWS Step Functions to manage our testing workflow. First, Step Functions lets us invoke AWS service APIs without needing to write or maintain any code. Second, Step Functions makes it easy to create and evaluate multiple Knowledge Bases at the same time with the Parallel state. Finally, the Step Functions console lets us visualize the progress of the testing pipeline in action, showing us where any errors occurred.
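To make the parallelism concrete, here is a minimal Amazon States Language skeleton of the top-level workflow. It is only a sketch: the branch and state names are invented for illustration, and each Pass state stands in for the full create/ingest/evaluate chain defined in the linked repository.

{
  "Comment": "Skeleton for illustration only. Branch and state names are invented; each Pass state stands in for the create/ingest/evaluate chain in the actual workflow.",
  "StartAt": "EvaluateKnowledgeBaseConfigurations",
  "States": {
    "EvaluateKnowledgeBaseConfigurations": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "FixedSizeChunkingBranch",
          "States": {
            "FixedSizeChunkingBranch": {
              "Type": "Pass",
              "Comment": "Placeholder for creating, ingesting, and evaluating the fixed-size chunking Knowledge Base",
              "End": true
            }
          }
        },
        {
          "StartAt": "HierarchicalChunkingBranch",
          "States": {
            "HierarchicalChunkingBranch": {
              "Type": "Pass",
              "Comment": "Placeholder for creating, ingesting, and evaluating the hierarchical chunking Knowledge Base",
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}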

Workflow Steps

Let’s walk through the Step Functions workflow to create and evaluate the Knowledge Bases. Feel free to update the workflow with your desired configurations or add additional Knowledge Bases to evaluate more than two configurations in a single workflow.
  1. Create Knowledge Bases with desired configurations.
    1. In this post, I create one Knowledge Base with the default chunking strategy of a fixed token length and one Knowledge Base with a hierarchical chunking strategy; the chunking configuration is illustrated in the workflow fragment after this list. You can change the Step Functions definition to test your desired configuration, or add additional Knowledge Base creation and evaluation flows to test three or more configurations in parallel.
    2. Creating a Knowledge Base requires the associated vector database to already be deployed. The OpenSearch Serverless collection is deployed prior to this step.
  2. Associate a Data Source with each Knowledge Base
    1. In our case, we’ll associate the same S3 bucket with both Knowledge Bases to ensure we’re comparing two different ingestion approaches with the same source data.
  3. Start the Ingestion Job to pull the S3 data into our vector database's indices
    1. Each Knowledge Base ingests data into its own vector index within the Amazon OpenSearch Serverless collection, ensuring the Knowledge Base data isn’t cross-contaminated.
  4. Poll the results of the Ingestion Job
    1. We’ll poll the results of the Ingestion Job every minute until the job is complete, making sure our data is stored in the vector database before moving on to the evaluation job; the polling loop is shown in the workflow fragment after this list.
  5. Create Evaluation Jobs
    1. We’ll execute the Evaluation Job on each Knowledge Base, providing a list of example prompt and expected answer pairs. The Evaluation Job compares the expected answers against the actual answers, grading them on a list of desired metrics.
    2. Here’s the JSONL file I created for my evaluation prompts. I used the Step Functions and DynamoDB Developer Guides for my document dataset, so the questions are related to DynamoDB features.
    3. {"conversationTurns":[{"referenceResponses":[{"content":[{"text":"It offers two streaming models for CDC: DynamoDB Streams and Kinesis Data Streams for DynamoDB."}]}],"prompt":{"content":[{"text":"What are the two streaming models for CDC?"}]}}]}
      {"conversationTurns":[{"referenceResponses":[{"content":[{"text":"Customers migrate to DynamoDB for reasons such as scalability, performance, the fully-managed nature of DynamoDB, the flexibility of NoSQL data models, and more."}]}],"prompt":{"content":[{"text":"What are some common reasons customers migrate to DynamoDB?"}]}}]}
  6. Poll the results of the Evaluation Job
    1. We’ll poll the results every ten minutes until both evaluation jobs are complete.
  7. View and compare results
    1. Once both Evaluation Jobs are complete, we can use the Bedrock Console to view the results of the two Evaluations.
    2. For more information on the evaluation criteria, check out the Bedrock documentation.
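To make one branch of the workflow more concrete, here is a hedged Amazon States Language fragment showing roughly how the hierarchical-chunking branch can associate its data source, start the ingestion job, and poll for completion. Everything here is illustrative rather than lifted from the repository: the state names, bucket ARN, token sizes, and JSONPath references are assumptions, the CreateKnowledgeBase task (with the embedding model and OpenSearch Serverless storage configuration) is assumed to run immediately before this fragment, and the request and response field casing of the SDK integrations may differ slightly from what's shown.

{
  "CreateHierarchicalDataSource": {
    "Type": "Task",
    "Comment": "Illustrative only. Assumes the preceding CreateKnowledgeBase task left its response at $.KnowledgeBase.",
    "Resource": "arn:aws:states:::aws-sdk:bedrockagent:createDataSource",
    "Parameters": {
      "KnowledgeBaseId.$": "$.KnowledgeBase.KnowledgeBaseId",
      "Name": "hierarchical-chunking-source",
      "DataSourceConfiguration": {
        "Type": "S3",
        "S3Configuration": {
          "BucketArn": "arn:aws:s3:::example-source-documents-bucket"
        }
      },
      "VectorIngestionConfiguration": {
        "ChunkingConfiguration": {
          "ChunkingStrategy": "HIERARCHICAL",
          "HierarchicalChunkingConfiguration": {
            "LevelConfigurations": [
              { "MaxTokens": 1500 },
              { "MaxTokens": 300 }
            ],
            "OverlapTokens": 60
          }
        }
      }
    },
    "Next": "StartIngestionJob"
  },
  "StartIngestionJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:bedrockagent:startIngestionJob",
    "Parameters": {
      "KnowledgeBaseId.$": "$.DataSource.KnowledgeBaseId",
      "DataSourceId.$": "$.DataSource.DataSourceId"
    },
    "Next": "WaitForIngestion"
  },
  "WaitForIngestion": {
    "Type": "Wait",
    "Seconds": 60,
    "Next": "GetIngestionJob"
  },
  "GetIngestionJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:bedrockagent:getIngestionJob",
    "Parameters": {
      "KnowledgeBaseId.$": "$.IngestionJob.KnowledgeBaseId",
      "DataSourceId.$": "$.IngestionJob.DataSourceId",
      "IngestionJobId.$": "$.IngestionJob.IngestionJobId"
    },
    "Next": "IsIngestionComplete"
  },
  "IsIngestionComplete": {
    "Type": "Choice",
    "Comment": "Loop back to the Wait state until the ingestion job finishes, then move on to the evaluation job.",
    "Choices": [
      {
        "Variable": "$.IngestionJob.Status",
        "StringEquals": "COMPLETE",
        "Next": "CreateEvaluationJob"
      },
      {
        "Variable": "$.IngestionJob.Status",
        "StringEquals": "FAILED",
        "Next": "IngestionFailed"
      }
    ],
    "Default": "WaitForIngestion"
  },
  "IngestionFailed": {
    "Type": "Fail",
    "Error": "IngestionJobFailed"
  }
}

The evaluation half of the branch follows the same Wait/Get/Choice pattern: a task that creates the Evaluation Job, a ten-minute Wait, a task that fetches the job status, and a Choice that loops until the job reports completion.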

What’s in the code?

Source Code link.
Implementation Requirements
  1. You can use the included SAM definitions for the S3 Buckets, IAM Roles, and OpenSearch Serverless Collection, or reference those resources from other CloudFormation stacks. The State Machine only creates the Bedrock Knowledge Bases, Ingestion Jobs, and Evaluation Jobs.
  2. Permission to create IAM roles for Amazon Bedrock Knowledge Bases and Evaluations
Cost Considerations
  1. Implementation costs stem from:
    1. The number of steps in your Step Functions Standard Workflow. If your data takes a long time to ingest or you have many test cases to evaluate, you can increase the polling intervals to reduce your costs.
    2. The amount of data in your Knowledge Base Ingestion and Evaluation Jobs.
    3. The amount of data in your S3 buckets for the raw documents and evaluation prompt and answer pairs.
    4. Your usage of the Amazon OpenSearch Serverless collection.

Technical Implementation Notes

  • I’ve only tested this solution with Amazon OpenSearch Serverless, but it should work with any vector database supported by Knowledge Bases.
  • This solution creates IAM roles for the Knowledge Base resources. The Knowledge Base Evaluation IAM role is allowed access to any Knowledge Base because the role is created and deployed before the Knowledge Bases are created during the workflow. Feel free to modify the IAM role during the workflow to scope it to the newly created Knowledge Base ARNs.
  • Creating Knowledge Bases takes a few seconds, so I’ve added a Wait state in that part of the flow to account for that.
  • Creating multiple Knowledge Bases at the same time led to a few “Too Many Requests” errors, so I staggered the Wait states by 5 seconds to avoid that error.
  • You can adjust the polling intervals as needed; the defaults are 1 minute for the Ingestion Jobs and 10 minutes for the Evaluation Jobs.
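As a rough illustration of those last three bullets, here is what the relevant Wait states can look like. The state names and Next targets are hypothetical; the Seconds values simply mirror the 5-second stagger and the 1-minute and 10-minute polling defaults described above.

{
  "StaggerSecondKnowledgeBase": {
    "Type": "Wait",
    "Comment": "Hypothetical state: delays the second branch slightly so both branches do not call CreateKnowledgeBase at the same instant.",
    "Seconds": 5,
    "Next": "CreateHierarchicalChunkingKnowledgeBase"
  },
  "WaitForIngestion": {
    "Type": "Wait",
    "Comment": "Ingestion polling interval; increase this if your documents take a long time to ingest.",
    "Seconds": 60,
    "Next": "GetIngestionJob"
  },
  "WaitForEvaluation": {
    "Type": "Wait",
    "Comment": "Evaluation polling interval; increase this if you have many prompt and answer pairs to evaluate.",
    "Seconds": 600,
    "Next": "GetEvaluationJob"
  }
}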
     

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
