Choose the best foundational model for your AI applications

Modality, speed, pricing, fine-tuning & more factor in when you're trying to find the best foundational model for your generative AI-powered app.

Suman Debnath
Amazon Employee
Published Apr 22, 2024
I've always been fascinated by the potential of generative AI, and my journey to find the perfect foundational model (FM) or large language model (LLM) has been both exciting and challenging. Along the way, I've picked up some invaluable insights, and I'm excited to share them with you, especially if you're new to this field.
Selecting the right foundational model for your AI-powered projects is more than just a technical decision; it's about finding a good strategic fit. In this guide, I'll walk you through the key criteria I consider important when choosing a foundational model, with practical examples to make things clearer. But first, let me tell you about the initial spark that got me started.

The Initial Spark

It all began with a simple idea: a generative AI-powered application that could change how we interact with technology and make our applications genuinely smarter. Diving into the world of foundational models, I realized it was like building a house – you need a solid foundation that aligns with your blueprint and meets your specific requirements. Just as you wouldn't construct a skyscraper on a foundation meant for a single-family home, choosing the right foundational model is crucial for your project's success. You need a model that can support your vision, accommodate your needs, and provide a sturdy base for future growth and expansion.

Defining the Criteria

Before I could begin my search, I needed to establish a set of criteria – a checklist of qualities that would define the ideal foundational model for my project's foundation. Let's go through each of them one by one:

1. Understanding Modality

What it is: Modality refers to the type of data the model processes—text, images (vision), or embeddings.
Why it matters: The choice of modality should align with the data you're working with. For instance, if your project involves processing natural language, text-based models such as Claude, Mistral, Llama 2, or Titan Text G1 are suitable. If you just want to create embeddings, you may prefer models like Cohere Embed or Titan Embeddings G1. Similarly, for image-related tasks, models like Claude (which can process images as input), Stability AI's SDXL 1.0, and Titan Image Generator G1 (which generate images) are more apt.
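To make the idea concrete, here is a minimal sketch of matching a task's modality to candidate models. The catalogue below is a hand-written sample for illustration, not a live query – in practice you could list available models programmatically via the Amazon Bedrock control-plane API (boto3's `list_foundation_models`), which can filter by input and output modality.

```python
# Illustrative only: a hand-maintained modality -> candidate-model mapping.
# The model names are examples from this post, not a complete or current list.
MODEL_CATALOGUE = {
    "text": ["Claude", "Mistral", "Llama 2", "Titan Text G1"],
    "embedding": ["Cohere Embed", "Titan Embeddings G1"],
    "image": ["SDXL 1.0", "Titan Image Generator G1"],
}

def candidates_for(modality: str) -> list[str]:
    """Return candidate models for a given modality."""
    try:
        return MODEL_CATALOGUE[modality]
    except KeyError:
        raise ValueError(f"Unsupported modality: {modality!r}")

print(candidates_for("embedding"))
```

A real application would replace the static dictionary with a call to the Bedrock API and then apply the remaining criteria below to narrow the shortlist.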

2. Model Size and Its Implications

What it is: This criterion refers to the number of parameters in a model. If you are new to machine learning, here is a brief note on parameters and why they matter, especially in the context of LLMs. A parameter is a configuration variable internal to the model whose value is estimated (trained) during the training phase from the given training data. Parameters are crucial because they directly define the model's capacity to learn from data. Large models often have over 50 billion parameters.
Why it matters: The number of parameters is a key indicator of the model's complexity. More parameters mean the model can capture more intricate patterns and nuances in the data, which generally leads to better performance. However, these models are not only expensive to train but also require more computational resources to operate. It's like choosing a car; a bigger engine might be more powerful but also uses more fuel.
For instance, Mixtral 8x7B Instruct has around 46B parameters; Anthropic's models, like Claude 3, have billions of parameters and are known for state-of-the-art performance across a wide range of natural language processing tasks, including text generation, question answering, and language understanding. Similarly, the latest Llama 3 comes in parameter sizes of 8B and 70B and supports a broad range of use cases, with improvements in reasoning, code generation, and instruction following.

3. Inference Speed or Latency

What it is: Inference speed, or latency, is the time it takes for a model to process input (often measured in tokens) and return an output. This processing time is crucial when the model's responses are part of an interactive system, like a chatbot.
Why it matters: Quick response times are essential for real-time applications such as interactive chatbots or instant translation services. These applications depend on the model's ability to process and respond to prompts rapidly to maintain a smooth user experience. Although larger foundational models typically offer more detailed and accurate responses, their complex architectures can lead to slower inference speeds. This slower processing might frustrate users expecting immediate interaction.
To address this challenge, you as a developer might choose models optimized for quicker responses, even if it means compromising somewhat on the depth or accuracy of the responses. For instance, streamlined models designed specifically for speed can handle interactions more swiftly, thus improving the overall user experience.
For example, Claude 3 from Anthropic is known for its large context window of up to 200K tokens, making it capable of handling extensive and complex prompts and producing high-quality outputs. However, processing that much context takes time, so a long-context request may not return as quickly as one sent to a model with a smaller window, such as Mistral Large, whose context window is around 32K tokens. So Mistral Large could be more appropriate for scenarios requiring fast interactions over moderate-sized prompts, whereas Claude 3 could be better for applications where depth of understanding and comprehensive context are crucial, even if the responses take slightly longer.
Inference speed is so important that many models are explicitly optimized for it. For example, Mixtral 8x7B Instruct is a sparse mixture-of-experts model: it has around 46B parameters in total but only activates about 13B of them during inference, leading to better inference throughput at the cost of more vRAM. As a result, it processes input and generates output at roughly the speed and cost of a ~13B dense model.
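When comparing candidates on latency, it helps to measure rather than guess. Below is a small, hedged sketch of a timing harness: `generate` stands in for any model invocation (for example, a Bedrock `invoke_model` call), and the stub `fake_model` exists only so the example runs without a real endpoint.

```python
import time
from statistics import mean

def measure_latency(generate, prompt, runs=5):
    """Time several calls to a generate() function and return the
    mean wall-clock seconds per call. In a real benchmark you would
    also warm up the client and report percentiles, not just the mean."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return mean(timings)

# Stub model so the sketch is runnable; swap in your actual model call.
def fake_model(prompt):
    time.sleep(0.01)  # pretend inference takes ~10 ms
    return "response"

avg = measure_latency(fake_model, "Hello", runs=3)
```

Running the same harness against two candidate models with your real prompts gives you an apples-to-apples latency comparison for your workload.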

4. Maximizing Context Window

What it is: Before diving into why context windows are important, let's understand what they are. In the context of large language models, a context window refers to the amount of text (in tokens) the model can consider at any one time when generating responses. Think of it like the model's memory during a single instance of processing.
For example, the sentence "Hello, world!" might be split into the tokens ["Hello", ",", "world", "!"]. The process of converting raw text into tokens is called tokenization. The specific rules and methods for tokenization can vary. Some models might break text into words and punctuation, while others use subwords (parts of words) to handle a broader range of vocabulary without needing a separate token for every possible word.
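For intuition, here is a naive word-and-punctuation tokenizer. Note this is only an illustration: production LLMs use learned subword tokenizers (e.g. BPE or SentencePiece), so their token counts will differ from this simple split.

```python
import re

def naive_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens.
    Real model tokenizers use learned subword vocabularies;
    this word-level split is only for building intuition."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```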
The context window, also known as the attention window, is then the maximum number of tokens from the input that the model can consider at one time when making predictions. This is a crucial aspect because it determines how much information the model can use to understand the context and generate responses or predictions. For instance, if a language model has a context window of 512 tokens, it can only consider the most recent 512 tokens when generating the next part of the text. Basically, it acts like the model's short-term memory during a task, much like a good conversationalist who remembers everything you've said :)
Why it matters: Larger context windows enable the model to remember and process more information in a single go. This ability is particularly valuable in complex tasks such as understanding long documents, engaging in detailed conversations, or generating coherent and contextually accurate text over larger spans.
For instance, in a conversation, a model with a larger context window would remember more of the earlier dialogue, allowing it to provide responses that are more relevant and connected to the entire conversation. This leads to a more natural and satisfying user experience, as the model can maintain the thread of discussion without losing context.
Anthropic's Claude models, for instance, have a massive context window of up to 200K tokens, allowing them to handle complex, long-form inputs with ease. However, it's important to note that larger context windows often come at the cost of increased computational requirements and relatively slower inference times.
When selecting a foundational model, consider the trade-off between context window size and other factors like inference speed or computational resources, based on the specific requirements of your application.
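A common practical consequence of a finite context window is having to trim conversation history. Here is a minimal sketch that keeps only the most recent messages that fit a token budget; the default token counter is a whitespace split purely for illustration – a real application should count tokens with the target model's own tokenizer.

```python
def trim_to_window(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages whose combined token count fits
    within max_tokens. Walks the history from newest to oldest and
    stops at the first message that would overflow the budget."""
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

history = ["a b", "c d e", "f"]
print(trim_to_window(history, 4))  # ['c d e', 'f']
```

Strategies like summarizing older turns instead of dropping them are also common when the full thread matters.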

5. Pricing Considerations

What it is: The cost of using a foundational model, influenced by the model's complexity and the model provider’s pricing structure.
Why it matters: Deploying high-performance models often comes with high costs due to increased computational needs. While these models provide advanced capabilities, their operational expenses can be steep, particularly for startups or smaller projects on tight budgets.
On the other hand, smaller, less resource-intensive models offer a more budget-friendly option without significantly compromising performance. It's essential to weigh a model's cost against its benefits to ensure it fits within your project's financial constraints, so you get the best value for your investment without overspending. It's like dining out: sometimes a gourmet meal is worth it, but other times a simple dinner is good enough :)
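Since most providers bill input and output tokens separately (typically per 1,000 tokens), a back-of-the-envelope cost estimate is easy to sketch. The prices in the example call below are placeholders, not real Bedrock rates – always check the provider's current pricing page.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Rough per-request cost in dollars, assuming the common
    per-1,000-token billing model with separate input/output rates."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# e.g. 2,000 input tokens and 500 output tokens at
# placeholder rates of $0.003 / $0.015 per 1K tokens:
cost = estimate_cost(2000, 500, 0.003, 0.015)
print(f"${cost:.4f}")  # $0.0135
```

Multiplying this per-request figure by your expected monthly traffic is a quick way to compare candidate models on total cost, not just per-token price.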

6. Fine-tuning and Continuous Pre-training Capability

What it is: Fine-tuning is a specialized training process where a pre-trained model (a model that has been trained on a large, generic dataset) is further trained (or fine-tuned) on a smaller, specific dataset. This process adapts the model to the particularities of the new data, improving its performance on related tasks. Continuous pre-training, on the other hand, involves extending the initial pre-training phase with additional training on new, emerging data that wasn't part of the original training set, helping the model stay relevant as data evolves; the data used here is typically unlabeled.
Why it matters: With fine-tuning, you can increase model accuracy by providing your own task-specific labeled training dataset and further specialize your FMs. With continued pre-training, you can train models using your own unlabeled data in a secure and managed environment. Continued pre-training helps models become more domain-specific by accumulating more robust knowledge and adaptability, beyond their original training.
For instance, Amazon Bedrock supports both fine-tuning and continuous pre-training, giving you powerful tools to not only personalize but also evolve your customized FMs over time. Keep in mind, though, that not all models support fine-tuning or continuous pre-training, so pick a model that supports these capabilities if that is what you are looking for.
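As a sketch of what a Bedrock customization request can look like, the dictionary below mirrors the shape of boto3's `create_model_customization_job` call. All ARNs, bucket names, and hyperparameter values are placeholders, and field names follow the Bedrock API as I understand it at the time of writing, so verify them against the current API reference before use.

```python
# Placeholder values throughout -- substitute your own job name, IAM role,
# base model ID, and S3 locations. The request is only constructed here,
# not submitted, so the sketch runs without AWS credentials.
fine_tune_request = {
    "jobName": "my-fine-tune-job",
    "customModelName": "my-custom-model",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    "baseModelIdentifier": "amazon.titan-text-express-v1",  # must support fine-tuning
    "customizationType": "FINE_TUNING",  # or "CONTINUED_PRE_TRAINING" for unlabeled data
    "trainingDataConfig": {"s3Uri": "s3://my-bucket/train.jsonl"},
    "outputDataConfig": {"s3Uri": "s3://my-bucket/output/"},
    "hyperParameters": {"epochCount": "2", "learningRate": "0.00001"},
}

# With credentials configured, you would submit it along these lines:
#   import boto3
#   bedrock = boto3.client("bedrock")
#   bedrock.create_model_customization_job(**fine_tune_request)
print(fine_tune_request["customizationType"])
```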

7. Quality of Response

What it is: Lastly, the criterion that matters most at the end of the day is the quality of response. This is where you evaluate a model's output on several quality metrics, including accuracy, relevance, toxicity, fairness, and robustness against adversarial attacks.
Accuracy measures how often the model's responses are correct according to some standard. Relevance assesses how appropriate the responses are to the context or question posed. Toxicity checks for harmful biases or inappropriate content in the model's outputs. Similarly, fairness evaluates whether the model's responses are unbiased across different groups, and finally, robustness indicates how well the model can handle intentionally misleading or malicious inputs designed to confuse it.
Why it matters: The reliability and safety of model outputs are paramount, especially in applications that interact directly with users or make automated decisions that can affect people's lives. High-quality responses ensure user trust and satisfaction, reducing the risk of miscommunications and enhancing the overall user experience, thus earning the trust of your customers.
For instance, in a customer service scenario, a model that consistently provides accurate and relevant responses can significantly improve resolution times and customer satisfaction rates. Conversely, if a model outputs toxic or biased responses, it could lead to customer alienation and harm the company's reputation. Thus, robust mechanisms to detect and mitigate toxicity and ensure fairness are crucial to prevent the perpetuation of biases and ensure equitable treatment of all user groups.
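To ground this, here is a toy evaluation harness that scores a batch of (expected, actual) response pairs on two deliberately simple metrics: exact-match accuracy and a keyword-overlap "relevance" score. Real evaluation – including what Amazon Bedrock's Model Evaluation feature offers – uses far richer metrics, covering robustness, toxicity, and human judgment; this sketch only shows the general shape of batch scoring.

```python
def evaluate_responses(cases):
    """Score (expected, actual) response pairs on two toy metrics:
    - accuracy: fraction of case-insensitive exact matches
    - relevance: mean fraction of expected keywords found in the actual answer
    """
    exact = sum(
        1 for expected, actual in cases
        if expected.strip().lower() == actual.strip().lower()
    )

    def overlap(expected, actual):
        e = set(expected.lower().split())
        a = set(actual.lower().split())
        return len(e & a) / len(e) if e else 0.0

    relevance = sum(overlap(e, a) for e, a in cases) / len(cases)
    return {"accuracy": exact / len(cases), "relevance": relevance}

cases = [
    ("paris", "Paris"),                               # exact match (ignoring case)
    ("the capital is paris", "paris is nice"),        # partial keyword overlap
]
print(evaluate_responses(cases))
```

Swapping in an embedding-based similarity or an LLM-as-judge score for `overlap` is a common next step once this skeleton is in place.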
So to summarize, when selecting a foundational model, it’s vital to assess not just its primary capabilities but also its additional features and the quality of its responses. These factors will significantly influence the model’s applicability and success in specific environments and use cases.

Utilizing Amazon Bedrock's Model Evaluation

To streamline the selection process, you may like to use Amazon Bedrock's Model Evaluation feature, which allows for both automated and human evaluations of models. This feature helps you assess models based on predefined metrics and subjective criteria, aiding a more informed decision-making process. While this is the topic of the next blog in this series, in brief, Amazon Bedrock offers a choice of automatic evaluation and human evaluation. You can use automatic evaluation with predefined metrics such as accuracy, robustness, consistency, and many more, and you can create your own metrics as well.
As a developer, you now have Amazon Bedrock's model evaluation feature as a tool for building generative AI-powered applications. You can start by experimenting with different models in the playground environment. To iterate faster, add automatic evaluations of the models. Then, when you prepare for an initial launch or limited release, you can incorporate human reviews to help ensure quality. More on this in the next blog, stay tuned...


In this blog, we learned about the key criteria to consider when evaluating and selecting the right foundational model for your generative AI project. From understanding modality and model size to assessing inference speed, context window, pricing, fine-tuning capabilities, and quality of response, each factor plays a crucial role in finding the perfect fit for your specific use case. Choosing the right foundational model is not just a technical decision but a strategic one that can significantly impact the success of your application. By carefully weighing the trade-offs and aligning the model's capabilities with your project's requirements, you can lay a solid foundation for your generative AI endeavors.
Remember, the journey doesn't end with selecting the model. As the field of generative AI continues to evolve rapidly, it's essential to remain open-minded, adaptable, and informed about the latest advancements. Leveraging Amazon Bedrock and its Model Evaluation feature can help streamline the selection process and support informed decisions based on comprehensive assessments. With the right foundational model as your base, you'll be well-equipped to unlock the transformative potential of generative AI. So, embrace the excitement of this journey, and let your curiosity guide you towards finding your generative AI soulmate – a model that not only meets your current needs but also supports your vision for growth and innovation :) Now go and build!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.