
How To Choose Your LLM

Large language models (LLMs) can generate new stories, summarize texts, and even perform advanced tasks like reasoning and problem solving. That is impressive on its own, and their accessibility and easy integration into applications make them even more remarkable. In this blog, I will provide you with the tools to understand how LLMs work and select the optimal one for your needs.

Elizabeth Fuentes
Amazon Employee
Published Dec 5, 2023

Generative Artificial Intelligence (Generative AI) made remarkable progress in 2022, pushing boundaries with its ability to generate content that mimics human creativity in text, images, audio, and video.

The abilities of Generative AI stem from deep learning models (Fig. 1), which are trained using vast amounts of data. Deep learning models, after extensive training on billions of examples, become what are called "foundation models" (FMs). Large language models (LLMs) are one kind of FM, specialized in language, with generative capabilities like reasoning, problem-solving and creative expression that can approach a human level. They are capable of understanding language and performing complex tasks through natural conversation.

Fig 1. Where does Gen AI come from?

Over the past few decades, artificial intelligence has been steadily advancing. However, what makes the recent advances in generative AI remarkable is their accessibility and easy integration into applications.

In this blog, I'll provide you with the tools to understand the workings of LLMs and select the optimal one for your needs.

There are many popular LLMs, and some of the more advanced ones have been trained on far more data than others. The additional training empowers them to tackle complex tasks and engage in advanced conversations.

Nonetheless, their operation remains the same: users provide instructions or tasks in natural language, and the LLM generates a response based on what the model "thinks" could be the continuation of the prompt (Fig. 2).

Fig 2. How LLMs Work
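
To make the idea of "continuing the prompt" concrete, here is a deliberately tiny sketch. It is not how a real LLM is built: it only counts which word follows which in a made-up corpus and greedily appends the most frequent follower. The function names and the corpus are my own illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def continue_prompt(followers, prompt, max_new_words=5):
    """Greedily append the most frequent follower of the last word."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_new_words):
        candidates = followers.get(words[-1])
        if not candidates:
            break  # the toy model has never seen this word
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Made-up training text, only for illustration
corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigrams(corpus)
print(continue_prompt(model, "the cat", max_new_words=3))
```

A real LLM predicts tokens with a neural network trained on vastly more data, but the loop is similar in spirit: look at the text so far, pick a likely next piece, and repeat.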

The art of building a good prompt is called prompt engineering. It is a discipline with specific techniques for developing and refining prompts so that language models produce efficient and helpful responses.

With a well-designed prompt, the model's pre-trained abilities can be leveraged to serve novel queries within its scope. Two of the most well-known prompt engineering techniques are zero-shot learning and few-shot learning.

Zero-shot learning is for tasks that do not require prior examples to understand the context of the task. For example, classification.

Example of Zero-shot Learning.
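
As a rough sketch of what a zero-shot classification prompt can look like in code, the helper below builds the prompt text without including any solved examples. The function name, the wording of the instruction and the labels are assumptions of mine, not a required format.

```python
def build_zero_shot_prompt(text, labels):
    """Build a classification prompt that contains no solved examples."""
    label_list = ", ".join(labels)
    return (
        f"Classify the following text as one of: {label_list}.\n"
        f"Text: {text}\n"
        "Category:"
    )


prompt = build_zero_shot_prompt(
    "The hotel room was spotless and the staff were kind.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

The resulting string is what you would send as the prompt to the model of your choice.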

Zero-shot capabilities refer to the ability of large language models to complete tasks they were not explicitly trained on. However, they still face limitations when performing complex tasks with only a short initial prompt and no guidance. Few-shot learning improves model performance on difficult tasks by including demonstrations in the prompt, also known as in-context learning.

📚 Tip: Give the LLM context about its role, for example: "You are a travel assistant".

Example of Few-shot Learning.
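
A few-shot prompt can be assembled the same way, by placing a role and a handful of solved demonstrations before the new question. This is a minimal sketch; the `Q:`/`A:` layout, the function name and the sample questions are illustrative assumptions.

```python
def build_few_shot_prompt(role, examples, query):
    """Put the role first, then solved examples, then the new question."""
    lines = [role, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model is expected to continue from here
    return "\n".join(lines)


examples = [
    ("I want beaches and warm weather.", "Consider Cancun."),
    ("I want museums and old architecture.", "Consider Rome."),
]
few_shot_prompt = build_few_shot_prompt(
    "You are a travel assistant.",
    examples,
    "I want mountains and hiking trails.",
)
print(few_shot_prompt)
```

The demonstrations show the model the expected style and length of the answer, which is the guidance a zero-shot prompt lacks.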

Learn about prompt engineering:

To choose the LLM that best fits your application, I am going to list the aspects that I consider most important:

What need will the LLM solve in the application? The most commonly used functionalities are:

  • Summarization
  • Classification
  • Question answering
  • Code generation
  • Content writing
  • Instruction following
  • Multilingual tasks
  • Embedding: translate the text into a vector representation.

As I mentioned before, there are advanced models capable of handling complex tasks and multitasking. For example, Llama-2-13b-chat is a powerful LLM for managing conversations, but only in English.

You can select a model that can satisfy all your requirements at once, or create decoupled applications with multiple specialized models for each task.
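
The decoupled option can be sketched as a small routing table that maps each task to a specialized model. Only `ai21.j2-ultra-v1` appears in this post; the other model identifiers and the task names are assumptions for illustration, so check the identifiers available in your account before using them.

```python
# Task names and most model IDs here are illustrative assumptions
TASK_TO_MODEL = {
    "chat": "meta.llama2-13b-chat-v1",
    "summarization": "ai21.j2-ultra-v1",
    "embedding": "amazon.titan-embed-text-v1",
}
DEFAULT_MODEL = "anthropic.claude-v2"  # assumed general-purpose fallback


def pick_model(task):
    """Return the model assigned to a task, or the fallback model."""
    return TASK_TO_MODEL.get(task.lower(), DEFAULT_MODEL)


print(pick_model("Summarization"))
print(pick_model("code generation"))
```

Keeping the mapping in one place makes it easy to swap a model for a single task without touching the rest of the application.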

📚 Remember: Use prompt engineering to generate the desired outputs.

Some LLMs are specialized in certain tasks, and some speak one language while others speak several. It's important to define whether your application will work in one language or more before choosing the LLM. For example, Titan Text Express is multilingual, unlike Titan Text Lite, which only works in English.

📚 Tip: If the LLM you need doesn't support the desired language, try using a multilingual LLM for translation, or Amazon Translate, before sending the prompt.
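
Here is a minimal sketch of the "translate first" idea using the Amazon Translate `translate_text` call from boto3. The helper name and the `FakeTranslateClient` stand-in are hypothetical; the stand-in only exists so the example runs without AWS credentials.

```python
def translate_prompt(prompt, source_lang, target_lang, translate_client):
    """Translate the prompt before sending it to an English-only LLM."""
    if source_lang == target_lang:
        return prompt  # nothing to translate
    response = translate_client.translate_text(
        Text=prompt,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    return response["TranslatedText"]


class FakeTranslateClient:
    """Offline stand-in (hypothetical) that mimics the response shape."""

    def translate_text(self, Text, SourceLanguageCode, TargetLanguageCode):
        canned = {"Hola Mundo": "Hello World"}
        return {"TranslatedText": canned.get(Text, Text)}


# With AWS credentials you would pass a real client instead:
#   import boto3
#   client = boto3.client("translate", region_name="us-east-1")
english_prompt = translate_prompt("Hola Mundo", "es", "en", FakeTranslateClient())
print(english_prompt)
```

The same helper can be applied in the other direction to translate the model's answer back into the user's language.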

A context window refers to the length of text an AI model can handle and reply to at once. In most LLMs, this text is measured in tokens.

Fig 3. Context Window
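
Since the context window covers both the input and the generated output, a simple budget check can tell you whether a request will fit. This is a sketch with hypothetical function names; the 8,192 value is the Jurassic-2 Ultra limit listed in the comparison chart of this post.

```python
def fits_context_window(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def max_output_budget(prompt_tokens, context_window):
    """Tokens left for the answer once the prompt is accounted for."""
    return max(context_window - prompt_tokens, 0)


CONTEXT_WINDOW = 8192  # Jurassic-2 Ultra, per the chart in this post

print(fits_context_window(7000, 200, CONTEXT_WINDOW))
print(fits_context_window(8000, 200, CONTEXT_WINDOW))
print(max_output_budget(8000, CONTEXT_WINDOW))
```

If the check fails, you can shorten the prompt, lower the maximum output length, or pick a model with a larger window.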

Tokens are like the individual building blocks that make up words. For example:

  • In English, a single token is typically around 4 characters long.

  • A token is approximately 3/4 of a word.

  • 100 tokens equate to roughly 75 words.
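
These rules of thumb can be turned into a quick estimator. It is only an approximation for English text, since the real count depends on each model's tokenizer; the function names are my own.

```python
import math


def estimate_tokens_from_chars(text):
    """Rule of thumb: one token is about 4 characters of English text."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text):
    """Rule of thumb: one token is about 3/4 of a word."""
    return math.ceil(len(text.split()) * 4 / 3)


sample = " ".join(["word"] * 75)  # a 75-word sample text
print(estimate_tokens_from_words(sample))
print(estimate_tokens_from_chars("Hola Mundo"))
```

For an exact count, use the token information returned by the model itself.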

This code snippet shows how to determine the token count using Jurassic-2 Ultra with Amazon Bedrock.

import json

import boto3

# Bedrock runtime client used to invoke the model
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)
model_id = "ai21.j2-ultra-v1"

prompt = "Hola Mundo"

# Build the body with json.dumps so quotes in the prompt are escaped correctly
body = json.dumps({
    "prompt": prompt,
    "maxTokens": 200,
    "temperature": 0.7,
    "topP": 1,
    "stopSequences": [],
    "countPenalty": {"scale": 0},
    "presencePenalty": {"scale": 0},
    "frequencyPenalty": {"scale": 0},
})

kwargs = {
    "modelId": model_id,
    "contentType": "application/json",
    "accept": "*/*",
    "body": body,
}

response = bedrock_runtime.invoke_model(**kwargs)

Breaking down the response:

response_body = json.loads(response.get("body").read())
completion = response_body.get("completions")[0].get("data").get("text")
print(completion)

Bonjourno! How can I assist you today?

Let's find out the token count in both the Prompt Input and Generated Output(completion):

Prompt Input:

from pandas import json_normalize  # flattens the nested token records
tokens_prompt = response_body.get('prompt').get('tokens')
df_tokens_prompt = json_normalize(tokens_prompt)[["generatedToken.token"]]

Prompt Tokens count.

Generated Output:

tokens_completion = response_body.get("completions")[0].get('data')["tokens"]
df_tokens_completion = json_normalize(tokens_completion)[["generatedToken.token"]]

Completion Tokens count.

Some LLMs are open source, while others are paid. Pricing depends on the provider, modality and model; however, all of them take the number of tokens into consideration.

Referring to the modality of paid LLMs:

✅ Only Inference: When you invoke the model as an API, the pricing corresponds to the number of incoming and outgoing tokens (Fig. 5). Amazon Bedrock is a fully managed service that offers the option to use LLMs through an API call, with a choice between on-demand or Provisioned Throughput to save costs; see the Amazon Bedrock pricing page and its pricing examples.

Fig 5. Only Inference Modality
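
To get a feel for token-based pricing, the sketch below computes the cost of a single on-demand request. The prices are hypothetical placeholders, not real Amazon Bedrock prices; replace them with the values from the pricing page of the model you choose.

```python
def inference_cost(input_tokens, output_tokens,
                   price_in_per_1k, price_out_per_1k):
    """Cost of one request when input and output tokens are billed per 1,000."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)


# Hypothetical prices in USD per 1,000 tokens, for illustration only
PRICE_IN = 0.01
PRICE_OUT = 0.03

cost = inference_cost(2000, 500, PRICE_IN, PRICE_OUT)
print(f"Estimated cost: ${cost:.3f}")
```

Multiplying this figure by the expected number of requests per month gives a first estimate of the inference bill.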

✅ Customization (fine-tuning): When it is necessary to fine-tune the model to a specific need (Fig. 6). In this type of pricing, you must add the cost of the new training and the storage of the new model to the inference cost. Amazon Bedrock also offers a mode for customization (fine-tuning).

Fig 6. Customization (fine-tuning) Modality
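
For the customization modality, the total can be sketched as a one-time training cost plus recurring storage and inference costs. All names and figures below are hypothetical and only show how the pieces add up.

```python
def total_customization_cost(training_cost, storage_per_month,
                             inference_per_month, months=1):
    """One-time training plus monthly storage and inference over a period."""
    recurring = (storage_per_month + inference_per_month) * months
    return training_cost + recurring


# Hypothetical figures in USD, for illustration only
total = total_customization_cost(
    training_cost=100.0,
    storage_per_month=2.0,
    inference_per_month=35.0,
    months=3,
)
print(f"Estimated total over 3 months: ${total:.2f}")
```

Comparing this total with the inference-only estimate helps decide whether fine-tuning is worth it for your use case.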

For those who want to experiment more, there is Amazon SageMaker JumpStart, which, among several functionalities, allows you to train and tune models before deployment with a Jupyter notebook. See the Amazon SageMaker JumpStart documentation for the available models and pricing.

Take a look at this chart of some available Amazon Bedrock models for a broader perspective when making comparisons.

| Provider | Model | Supported use cases | Languages | Max tokens (context window) |
| --- | --- | --- | --- | --- |
| Anthropic | Claude v2 | Thoughtful dialogue, content creation, complex reasoning, creativity, and coding | English and multiple other languages | ~100K |
| Anthropic | Claude v1.3 | Text generation, conversational, coding | English and multiple other languages | ~100K |
| Cohere | Command | Chat, text generation, text summarization | English | 4K |
| AI21 Labs | Jurassic-2 Ultra | Question answering, summarization, draft generation, advanced information extraction, ideation for tasks requiring intricate reasoning and logic | English, Spanish, French, German, Portuguese, Italian, Dutch | 8,192 |
| AI21 Labs | Jurassic-2 Mid | Question answering, summarization, draft generation, advanced information extraction, ideation | English, Spanish, French, German, Portuguese, Italian, Dutch | 8,192 |
| Amazon | Titan Text Generation 1 (G1) - Lite | Open-ended text generation, brainstorming, summarization, code generation, table creation, data formatting, paraphrasing, chain of thought, rewrite, extraction, Q&A, and chat | English | 4K |
| Amazon | Titan Text Generation 1 (G1) - Express | Retrieval augmented generation, open-ended text generation, brainstorming, summarization, code generation, table creation, data formatting, paraphrasing, chain of thought, rewrite, extraction, Q&A, and chat | 100+ languages | 8K |
| Amazon | Titan Embeddings | Translates text into a numerical representation; text retrieval, semantic similarity, and clustering | 25+ languages | 8K |
| Meta | Llama-2-13b-chat | Assistant-like chat | English | 4K |
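
A chart like this one can also be expressed as data and filtered by your requirements. The sketch below encodes a few rows from the chart; the field names are my own, and "4K", "8K" and "~100K" are rounded to 4,000, 8,000 and 100,000 tokens as an approximation.

```python
# A few rows from the chart; rounded context sizes are approximations
MODELS = [
    {"model": "Claude v2", "multilingual": True, "context_tokens": 100_000},
    {"model": "Command", "multilingual": False, "context_tokens": 4_000},
    {"model": "Jurassic-2 Ultra", "multilingual": True, "context_tokens": 8_192},
    {"model": "Titan Text G1 - Lite", "multilingual": False, "context_tokens": 4_000},
    {"model": "Titan Text G1 - Express", "multilingual": True, "context_tokens": 8_000},
]


def find_models(min_context_tokens, needs_multilingual):
    """Return the names of models that satisfy both requirements."""
    return [
        m["model"]
        for m in MODELS
        if m["context_tokens"] >= min_context_tokens
        and (m["multilingual"] or not needs_multilingual)
    ]


candidates = find_models(min_context_tokens=8_000, needs_multilingual=True)
print(candidates)
```

From the resulting shortlist, pricing and the quality of the responses for your task can guide the final choice.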

Thank you for reading this post, where I explained how LLMs work and how to improve responses using prompt engineering techniques. You learned how to choose the best LLM for your application based on features such as:

  • The LLM's mission in the application: What problem will the LLM help me solve?

  • The language: Do I need the LLM to understand multiple languages?

  • Length of context window: The amount of text in the input request and generated output.

  • Pricing: What is the cost of the LLM that fits my needs? Are the available LLMs sufficient for what I need? If not, do I need to do fine-tuning?

Finally, you saw what a comparison chart built with some of the available Amazon Bedrock models looks like.

🚀 Some links for you to continue learning and building:


Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.