Bedrock Model Comparison: Optimizing Your GenAI Choices

A tool for comparing AWS Bedrock models that helps you make data-driven decisions based on latency, cost, and quality to find the best model for your use case

Svetozar Miučin
Amazon Employee
Published May 19, 2025

Prerequisites

  • AWS CLI tool connected to your AWS account
  • git
  • Python 3.8+ with:
    • streamlit
    • boto3
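A quick way to confirm everything is wired up is to list the foundation models your account can see. This is a small sanity check, not part of the tool itself; it assumes your default AWS credentials are already configured and uses the same BEDROCK_REGION convention the tool uses:

```python
import os

import boto3

# Sanity check: list the Bedrock foundation models visible to your account.
# Uses the BEDROCK_REGION environment variable, defaulting to us-east-1.
region = os.environ.get("BEDROCK_REGION", "us-east-1")
bedrock = boto3.client("bedrock", region_name=region)

models = bedrock.list_foundation_models()["modelSummaries"]
for model in models[:10]:
    print(model["modelId"])
```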

Intended audience

People building GenAI products on Bedrock who are interested in optimizing their product for cost and latency while preserving quality

Motivation

With GenAI working its way into more and more products, it becomes increasingly important to pick the right tools for the job. In this blog post, I will help you explore model selection on Bedrock using a practical tool I've built.
Choosing the right model boils down to a tradeoff among the following:
  • Answer quality - this is arguably the most important characteristic of a model. If we pick poorly, the tasks we execute using the Foundation Model (FM) can suffer from imprecise or outright wrong results. We don't want this.
  • Latency - when our LLM call is on the critical path of a time-sensitive workflow, the latency from invoking a model to getting the answer can contribute to breached SLAs and customer dissatisfaction with our product. We don't want this either.
  • Cost - Amazon Bedrock prices Foundation Model use per input and output token. Not all models are priced equally. In a hand-wavy way, cost is a function (among other things) of model complexity. Choosing a more complex model may in some cases give us the needed edge in answer quality, but doing so for tasks that could be served by a simpler model eats away at our profit margins. We obviously don't want this either.
As with any design space involving tradeoffs, there is no universal answer here. Tradeoffs are tough, but understanding them can help us design systems that are faster and cheaper while maintaining output quality.
Coming from a systems performance optimization background, I'm a big fan of measuring things. There are frameworks and tools (such as Bedrock Model Evaluation) that help us build comprehensive tests for FMs. That's great, but a systematic approach like that takes time and effort, and sometimes all we want is a quick way to improve our understanding of the tradeoff space as it relates to our concrete use case.

Why model choice matters

First, let's take a look at a scenario that illustrates why model choice is important.
Let's assume we are building a system to fit a piece of text into a JSON schema. For this example, I'm using a simple JSON schema document for event descriptions, paired with copy-pasted text for an upcoming event on startups.aws.
The prompt pairs the schema with the copied event text and asks the model to fill in the schema's fields.
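Purely as an illustration (this is not the exact prompt from the post, and the schema and event text below are made up), a schema-filling prompt of this kind might look something like the following:

```python
# Illustrative stand-in for the schema-filling prompt described above.
# The schema and event text are invented for this example.
schema = """{
  "name": "string",
  "date": "YYYY-MM-DD",
  "location": "string",
  "audience": "string",
  "summary": "string"
}"""

event_text = (
    "Join us for an evening of startup talks and networking on June 12 in Berlin. "
    "Founders and early-stage builders welcome."
)

prompt = (
    "Fill in the following JSON schema using only the information in the text below. "
    "Return ONLY valid JSON, with no additional commentary.\n\n"
    f"Schema:\n{schema}\n\n"
    f"Text:\n{event_text}"
)
```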
I've chosen to run this on two FMs from the Amazon Nova family - Pro and Micro.
Results from Pro:
Results from Micro:
The results are identical. However, the Micro results were returned after around 300ms, and the Pro results after around 1s. That is a 3x difference in latency. Moreover, Micro tokens cost over 20x less than Pro tokens (both input and output).
Picking the wrong model would thus roughly triple the latency of this part of my system and increase its cost by more than 20x!
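To put rough numbers on that, here is a back-of-the-envelope cost comparison. The per-token prices below are placeholders in the right ballpark, not authoritative figures; check the Amazon Bedrock pricing page for current values in your region:

```python
# Back-of-the-envelope per-request cost comparison between two models.
# Prices are placeholders (USD per 1K tokens); look up current values on the
# Amazon Bedrock pricing page for your region.
def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

tokens_in, tokens_out = 1500, 400  # roughly the size of a schema-filling request

pro_cost = request_cost(tokens_in, tokens_out, price_in_per_1k=0.0008, price_out_per_1k=0.0032)
micro_cost = request_cost(tokens_in, tokens_out, price_in_per_1k=0.000035, price_out_per_1k=0.00014)

print(f"Nova Pro:   ${pro_cost:.6f} per request")
print(f"Nova Micro: ${micro_cost:.6f} per request")
print(f"Ratio:      {pro_cost / micro_cost:.0f}x")
```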

Navigating model choice

I frequently have conversations with customers where I help them understand the tradeoffs involved in picking the correct model. When I build, I usually resort to the following methodology:
  • Grab the Converse API sample code from here and put it in a script.
  • Change the model list to the models I'm considering (looking up their IDs from the documentation)
  • Change the prompts to the prompt I'm trying to make the choice for
  • Run
  • Observe
  • Choose
You get it, I write code. I get it, there might be better ways.
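For reference, that throwaway script is usually just a loop over model IDs calling the Converse API, roughly like the sketch below. It is a minimal version; the model IDs and prompt are stand-ins to swap for the ones you are evaluating:

```python
import os
import time

import boto3

# Minimal sketch of the "compare one prompt across models" script.
# Model IDs and the prompt are stand-ins - replace them with the ones you're evaluating.
region = os.environ.get("BEDROCK_REGION", "us-east-1")
client = boto3.client("bedrock-runtime", region_name=region)

model_ids = ["amazon.nova-micro-v1:0", "amazon.nova-pro-v1:0"]
prompt = "Fill in the JSON schema using the text below..."  # placeholder prompt

for model_id in model_ids:
    start = time.perf_counter()
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000

    usage = response["usage"]
    text = response["output"]["message"]["content"][0]["text"]
    print(f"=== {model_id} ===")
    print(f"wall-clock latency: {elapsed_ms:.0f} ms, "
          f"input tokens: {usage['inputTokens']}, output tokens: {usage['outputTokens']}")
    print(text[:500])  # first part of the answer, for a quick quality check
```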
So, working backwards from my customers' needs to quickly pick a good tool (FM) for the job, I went ahead and did some vibe coding to build a small tool. After some 15-20 minutes of iterating in my IDE of choice, hooked up to Bedrock FMs for assistance, I arrived at a simple UI built with Streamlit that helps me make these choices much quicker, without touching any code in the process.
You can find the code for the tool here.
Screenshot: the Bedrock Model Comparison tool
Behold! Bedrock Model Comparison.
The UI is very simple. Let me walk you through it.
On the left-hand side is the model selection panel. It comes with some preselected models - the ones I most commonly test out first. A drop-down selector lets you pick any model available in the given region (configurable through the BEDROCK_REGION environment variable, defaulting to us-east-1). Click models in the drop-down list to add them to your selection.
In the main panel, there is a prompt entry field. Paste your prompt here and press Cmd+Enter (or the equivalent on your OS). Once you do that, the Compare Models button becomes clickable, and you're ready to go.
After clicking Compare Models, you will see a loading indicator, and after a while you will get the results. Below are two sections:
  • Model Comparison Results - a list of the selected models, where each entry is expandable to show the output of that FM invocation
  • Comparison Table - a table that shows all of the selected models with their latency and token count.
And that's it! Very simple, but it turns out to be very useful.
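If you're curious what a UI like this boils down to, here is a stripped-down sketch of the general shape. This is not the actual tool's code; the model list and labels are made up for illustration:

```python
import os
import time

import boto3
import streamlit as st

# Stripped-down sketch of a Streamlit comparison UI - not the actual tool's code.
region = os.environ.get("BEDROCK_REGION", "us-east-1")
client = boto3.client("bedrock-runtime", region_name=region)

# Sidebar: model selection (a real tool would list models dynamically via the Bedrock API).
models = st.sidebar.multiselect(
    "Models to compare",
    ["amazon.nova-micro-v1:0", "amazon.nova-lite-v1:0", "amazon.nova-pro-v1:0"],
    default=["amazon.nova-micro-v1:0", "amazon.nova-pro-v1:0"],
)

prompt = st.text_area("Prompt")

if st.button("Compare Models", disabled=not (prompt and models)):
    rows = []
    for model_id in models:
        start = time.perf_counter()
        response = client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        latency_ms = (time.perf_counter() - start) * 1000
        text = response["output"]["message"]["content"][0]["text"]

        # Model Comparison Results: one expandable entry per model.
        with st.expander(model_id):
            st.write(text)
        rows.append({
            "model": model_id,
            "latency_ms": round(latency_ms),
            "output_tokens": response["usage"]["outputTokens"],
        })

    # Comparison Table: latency and token count per selected model.
    st.dataframe(rows)
```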
Now, let's see it in action!

Bedrock Model Comparison Tool

For testing out the tool, let's use the JSON schema filling example from the start of this post, but let's put a twist on it to exercise model differences a little bit more:
I'm sure this prompt could be better, but it'll do for now.
I'm going to add a couple of models to the roster and let it run.
After a while, the results are in!
Comparison results: a table showing latency, token consumption, and price for each selected model
Looking at latencies, it is immediately clear that there is huge variance in the time to get an answer. Token counts also show a difference in the verbosity of the answers (correlated with latency, as expected).
(I'm omitting the full answers and leaving experimentation as an exercise for the reader.)
Being very cost-conscious, I'm going to take a look at the lowest latency results - Amazon Nova Pro:
If I check the higher-latency models for this example (e.g. Claude 3.7), I can see that the quality of the results is not radically better:
However, it's costing me more money, and taking more time. I'm left with the understanding that for this task I can choose simpler models to improve my system's performance and cost while keeping the quality sufficiently high to serve the use case.
What did we learn from this?
Model choice is complex. It's dependent on the use case you're trying to serve and the price-performance characteristics of the models. Given the statistical nature of FMs, it is difficult to make good judgement calls without measuring. For anything that requires rigor and scale, investing time into a full-blown evaluation framework is a good idea. For cases where we need one-off (or a few-off) results quickly, simple tools like the one shared in this post can help navigate the tradeoff space efficiently.
Actions for the reader:
  • Check out the code
  • Run it using the instructions in the README.md file
  • Pick your models
  • Input your prompt
  • Make an informed choice of model to serve your specific use case
Have fun prompting!
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
