Choose the best foundational model for your AI applications
Modality, speed, pricing, fine-tuning & more factor in when you're trying to find the best foundational model for your generative AI-powered app.
My journey of choosing the right foundational model (FM) or large language model (LLM) has been both exciting and challenging. Along the way, I've picked up some invaluable insights, and I'm excited to share them with you, especially if you're new to this field.
To guide my search, I put together a set of criteria – a checklist of qualities that would define the ideal foundational model for my project. Let's go through each of them one by one.

The first thing to consider is modality: does your application work with text, images (vision), or embeddings? For text-based tasks, models like Claude, Mistral, Llama 2, and Titan Text G1 are suitable. If you just want to create embeddings, you may prefer models like Cohere Embed and Titan Embeddings G1. Similarly, for image-related tasks, models like Claude (which can process images), Stability AI's SDXL 1.0, and Titan Image Generator G1 (which generates images) are more apt.

The next thing to look at is the number of parameters
in a model. If you are new to machine learning, here is a brief overview of parameters and their importance, especially in the context of LLMs. A parameter is a configuration variable internal to the model whose value is estimated (trained) during the training phase from the given training data. Parameters are crucial because they directly define the model's capacity to learn from data. Large models often have over 50 billion parameters. Mixtral 8x7B Instruct has around 46B parameters; Anthropic's models, like Claude 3, have billions of parameters and are known for state-of-the-art performance across a wide range of natural language processing tasks, including text generation, question answering, and language understanding. Similarly, the latest Llama 3 comes in parameter sizes of 8B and 70B and supports a broad range of use cases, with improvements in reasoning, code generation, and instruction following.
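To get an intuition for how parameter counts add up, here is a toy sketch in plain Python that counts the weights in transformer-style feed-forward blocks. The layer sizes are hypothetical, chosen only to show how quickly the numbers reach the billions:

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameters in a dense layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + (n_out if bias else 0)

def ffn_block_params(d_model: int, d_hidden: int) -> int:
    """A transformer feed-forward block: d_model -> d_hidden -> d_model."""
    return linear_params(d_model, d_hidden) + linear_params(d_hidden, d_model)

# Hypothetical sizes, loosely in the range used by large models.
d_model, d_hidden, n_layers = 4096, 14336, 32
total = n_layers * ffn_block_params(d_model, d_hidden)
print(f"{total:,} parameters in the feed-forward blocks alone")
```

Even this partial count lands in the billions before attention layers, embeddings, and output heads are included, which is why full models reach tens of billions of parameters.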
Inference speed is another key factor: the model's processing time is crucial when its responses are part of an interactive system, like a chatbot. Speed matters most in real-time applications such as interactive chatbots or instant translation services, which depend on the model's ability to process and respond to prompts rapidly to maintain a smooth user experience. Although larger foundational models typically offer more detailed and accurate responses, their complex architectures can lead to slower inference, and that delay can frustrate users expecting immediate interaction. For example, Claude 3 Haiku from Anthropic, known for its large context window of up to 200K tokens, can handle extensive and complex prompts and produce high-quality outputs; however, given the amount of context it can process, it might not always respond as quickly as a model like Mistral Large, which trades a smaller context window of around 32K tokens for faster inference. So Mistral Large could be more appropriate for scenarios requiring fast interactions, whereas Claude 3 Haiku could be better for applications where depth of understanding and comprehensive context are crucial, even if responses take slightly longer. Mixtral 8x7B Instruct takes a different approach: it has around 46B parameters in total but activates only about 12B during inference, so it processes input and generates output at roughly the speed and cost of a 12B model, at the expense of more vRAM.
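If inference speed matters for your use case, it's worth measuring it rather than guessing. Here is a minimal benchmarking sketch; the `invoke` argument is a hypothetical stand-in for whatever SDK call actually reaches your model:

```python
import time

def measure_latency(invoke, prompt: str, runs: int = 5) -> float:
    """Return the average seconds per call for a model-invocation function."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke(prompt)  # stand-in for your real SDK request
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example with a dummy "model" that just sleeps, standing in for a real endpoint.
avg = measure_latency(lambda p: time.sleep(0.01), "Hello")
print(f"average latency: {avg * 1000:.1f} ms")
```

Running the same prompt set against each candidate model gives you a like-for-like latency comparison before you commit.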
The context window is the next consideration. It refers to the amount of text (in tokens) the model can consider at any one time when generating responses – think of it as the model's memory during a single instance of processing. A token is a small chunk of text: for example, "Hello, world!" might be split into the tokens [Hello, ,, world, !]. The process of converting raw text into tokens is called tokenization, and the specific rules and methods vary. Some models break text into words and punctuation, while others use subwords (parts of words) to handle a broader vocabulary without needing a separate token for every possible word.
The context window, also known as the attention window, is the maximum number of tokens from the input that the model can consider at one time when making predictions. This is crucial because it determines how much information the model can use to understand context and generate responses. For instance, if a language model has a context window of 512 tokens, it can only consider the last 512 tokens it has seen when generating the next part of the text. In effect, it acts like the model's short-term memory during a task, much like a good conversationalist who remembers everything you've said. A model with a larger context window can remember and process more information in a single pass, which is particularly valuable in complex tasks such as understanding long documents, engaging in detailed conversations, or generating coherent and contextually accurate text over larger spans. A chatbot with a larger context window remembers more of the earlier dialogue, allowing it to provide responses that are more relevant and connected to the entire conversation. This leads to a more natural and satisfying user experience, as the model can maintain the thread of discussion without losing context.
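In practice, a chatbot whose conversation outgrows the context window has to drop the oldest turns. Here is a minimal sketch of that trimming logic, assuming a crude whitespace-based token count rather than a real tokenizer:

```python
def trim_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined (toy) token count fits the window."""
    kept, used = [], 0
    for message in reversed(messages):   # walk backwards from the newest message
        n = len(message.split())         # crude token count: whitespace-separated words
        if used + n > max_tokens:
            break
        kept.append(message)
        used += n
    return list(reversed(kept))          # restore chronological order

history = ["I want to book a flight", "Sure, where to?", "Paris next Friday"]
print(trim_to_window(history, max_tokens=7))
```

With a seven-token budget, only the two most recent turns survive, which is exactly the "lost context" effect users notice with small-window models.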
Also weigh the model's pricing structure. Most hosted foundational models are billed per input and output token, so estimate your expected request volume and prompt sizes before committing to a model.
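Since per-token billing is the common case, a quick back-of-the-envelope estimate goes a long way. The prices below are hypothetical placeholders; check your provider's current price list:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request under simple per-1,000-token pricing."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical prices, for illustration only.
cost = request_cost(2000, 500, price_in_per_1k=0.003, price_out_per_1k=0.015)
print(f"${cost:.4f} per request")
```

Multiplying the per-request figure by your expected daily traffic quickly shows whether a larger, pricier model fits your budget.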
Fine-tuning support is another criterion. Fine-tuning is a specialized training process where a pre-trained model (a model that has been trained on a large, generic dataset) is further trained (or fine-tuned) on a smaller, task-specific dataset. This process adapts the model to the particularities of the new data, improving its performance on related tasks. Continued pre-training, on the other hand, extends the initial pre-training phase with additional training on new, emerging data that wasn't part of the original training set – typically unlabeled data – helping the model stay relevant as data evolves. With fine-tuning, you provide a labeled training dataset to further specialize your FMs. With continued pre-training, you can train models using your own unlabeled data in a secure and managed environment; this helps models become more domain-specific by accumulating knowledge and adaptability beyond their original training.
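The practical difference shows up in the training data you prepare: fine-tuning expects labeled prompt/completion pairs, while continued pre-training expects raw domain text. The field names below (`prompt`, `completion`, `input`) are illustrative; check your provider's expected schema:

```python
import json

# Labeled record for fine-tuning: each example pairs a prompt with the desired completion.
fine_tuning_record = {"prompt": "Summarize: The meeting covered Q3 results and hiring plans.",
                      "completion": "Q3 results were discussed."}

# Unlabeled record for continued pre-training: raw domain text only, no target output.
pretraining_record = {"input": "Our claims process begins when a policyholder files a report."}

# Training files are commonly JSON Lines: one JSON object per line.
for record in (fine_tuning_record, pretraining_record):
    print(json.dumps(record))
```

The asymmetry is the point: labeled pairs teach the model a behavior, while raw text teaches it a domain.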
Finally, assess the quality of response. Evaluate a model's output against several quality metrics, including accuracy, relevance, toxicity, fairness, and robustness against adversarial attacks. High-quality responses build user trust and satisfaction, reducing the risk of miscommunication and enhancing the overall user experience. Mechanisms to detect and mitigate toxicity and to ensure fairness are crucial to prevent the perpetuation of biases and to treat all user groups equitably. Evaluation typically takes two forms: automatic evaluation and human evaluation. Automatic evaluation uses predefined metrics such as accuracy, robustness, and consistency, and you can define your own metrics as well.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.