How to Prompt Mistral AI models, and Why
There are some peculiar things about prompting with Mistral AI Instruct models, what are they? And why?
Mike Chambers
Amazon Employee
Published Mar 5, 2024
Last Modified Apr 5, 2024
UPDATE: Since this post was written another model, Mistral Large, has launched on Amazon Bedrock. Mistral Large in Amazon Bedrock uses the same prompting method as discussed here.
This is a two-in-one post. In the first section I will step through how to prompt Mistral AI's instruction fine-tuned 7B and 8x7B models. Then in the second section, for those who are interested, I will dive deeper and explain some of the finer prompting points, including what the `<s>` is all about, and more.
Mistral AI's 7B and 8x7B models are available in base foundation model and instruction fine-tuned variants. I covered instruction fine-tuning in a post a couple of weeks ago. Read that article for more, but in short, instruction fine-tuned models have been specifically trained to understand that they exist to be instructed to perform tasks. As such we use instruction fine-tuned models almost by default, and certainly for tasks like chatbots and agents.
In this post, all the samples given will be run through Mistral AI 7B Instruct. The prompting technique is the same for Mistral AI 8x7B Instruct. I will be using Amazon Bedrock to access the Mistral models - and I will talk more about that later.
Firstly, these models have not completely forgotten their heritage as a base model before instruction fine-tuning was applied. It is possible to simply send a text prompt into the model and get a generation out. For example, this prompt:
Generated this response:
The generation included two line breaks before the start of the text (and it went on for much longer too; I truncated the generation in this sample, and in most of the other samples in this post).
And if we're expecting a chatbot-like response, we might be disappointed. Take this example:
And this generation:
Probably not what we want back from the chatbot.
According to Mistral AI's own documentation, this super simple prompt may "generate sub-optimal outputs", as we are not following the template used to build a prompt for the Instruct model. If we are expecting a chatbot-like response, we need to follow the instruction template. And here it is, copied directly from the documentation:
This provided template looks kinda simple, but there's some intriguing detail in it. I will dive deeper in the next section. And if all you want to know is how to prompt, then these examples are almost certainly what you need:
This can be as simple as our test question or a full few-shot, chain-of-thought prompt with a huge context. Whatever it is, it's one request we are making, and not a chat session:
The `<s>` is a special token; we will get into that later, just include it for now. The `[INST]` and `[/INST]` 'tags' indicate to the instruction tuned model where the INSTruction is. What you might not have noticed is that there is a space after `[/INST]`; if you don't include this then the model will likely generate a space at the start of the generation. The generation from this prompt will be the text string response the model generates, just as before, but a little tidier, with no line breaks, and with the first character being the first letter of the text. Nice.
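If you want to try this out with code, here is a minimal sketch using boto3 and Amazon Bedrock (which I cover at the end of this post). The model ID and the request/response fields are the ones documented for Mistral models on Bedrock at the time of writing; double-check the current Bedrock documentation if anything has changed.

```python
import json

import boto3

# Bedrock runtime client in a region where the Mistral models are available
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

# Build the prompt following the instruction template, including the trailing space after [/INST]
prompt = "<s>[INST] What is the capital of Australia? [/INST] "

body = json.dumps({
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0.5,
})

response = bedrock_runtime.invoke_model(
    modelId="mistral.mistral-7b-instruct-v0:2",
    body=body,
)

# The generated text comes back in the 'outputs' list of the response body
response_body = json.loads(response["body"].read())
print(response_body["outputs"][0]["text"])
```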
I borrowed this title idea from Anthropic Claude's useful prompting hacks. This technique guides the direction of the generation by (sort of) starting the generation for the model. In this case I want the model to start its answer in a specific way; let's make it sound a little more Aussie:
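The prompt for that (the question here is just the running example from this post) looks something like this, with `G'day` placed straight after the instruction:

```
<s>[INST] What is the capital of Australia? [/INST] G'day
```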
In this example (possibly not the most useful example) the model will answer the question, and the answer it gives will carry on from the word `G'day` (which is a colloquial greeting where I come from). Here is the response I got:
Let's look at possibly a more useful example:
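For example, a prompt shaped like this (the exact task and wording are just an illustration), ending with the opening brace we want the model to continue from:

```
<s>[INST] What is the capital of Australia? Respond only in JSON, with the keys "country" and "capital". [/INST] {
```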
Here we prompt the model to immediately start with the JSON format we want by including the `{` to start it off. This should avoid the model starting with something like `Sure, here is the JSON you asked for...` or wrapping the JSON in markdown, which, as friendly as that may be, is less useful when it comes to parsing the output. Note however that the generation does not include anything that was in the prompt, and this includes the `{`, so we are still left with performing some string manipulation to stitch it back on.
You may also want to add in a relevant stop token, such as `}`, but obviously with more complex JSON that wouldn't work. There are tricks to get around this that involve one- or few-shot learning and the use of something like `<json></json>` XML-like tags to denote the start and end of the generation.
So far we have looked at methods to send a single prompt to the model for the purpose of getting a single response back. But what about chatting with the model? If you're not familiar, the approach to chatting with a model is: after each input from the human, you send the entire chat history to the model, so that it can generate the 'assistant' response with all the context of the chat so far. So ask a question, get a response, ask a follow-up question, add it to the first question and response, and then send all of that. This is often handled by libraries such as LangChain, but for the purposes of this post, we will look at how these libraries do (or need to) manage the data in the prompt.
For this example, imagine that we are in a new chat session. We start off in the same way as before with a single prompt:
And this time we get a much better response for a chatbot (truncated, as the actual answer was quite long):
Now if we want to ask a follow-up question we need to add the response we got to our first question, add a special `</s>` string to show the end of the answer (why? see the second part of this post), and then add the next question as another instruction, like this:
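The shape of that prompt, with placeholders standing in for the real questions and answers, is:

```
<s>[INST] first question [/INST] first answer from the model</s>[INST] follow-up question [/INST] 
```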
To which you might get a generation like this (a list of steps, truncated here, I like where this is going):
And, following this pattern, we can carry on:
Etc, etc.
If this is all looking a bit complex, then help is at hand. As I said previously, instead of directly prompting the model like this, you can abstract away the complexity of chat prompting and chat memory through libraries like LangChain. Or we can code our own, with a little help from HuggingFace.
I found this handy Mistral AI chat prompt hack while researching this post, and if nothing else, it's interesting.
PSST: We're done with the "how to prompt" post at this point; from now on, we enter full-on nerd-fest! And soon we will get into the special tokens. Also, in this section I introduce some code. All this code is in Python, because... Python.
Hosting your own version of Mistral AI models is way beyond the scope of this post, and is not the subject we're focused on anyway (spoiler: you actually don't need to host the model - more on this later). But HuggingFace Transformers, a set of libraries that we might use to host the models, also has some convenient library code we can borrow for chat prompting. Inside the `AutoTokenizer`, when initialised with the `mistralai/Mistral-7B-Instruct-v0.2` pre-trained model, we can find a chat template. This chat template is a Jinja2 template that converts a 'role', 'content' based chat history into a Mistral AI instruct prompt.
I'll walk through an example. For this we will consider a data structure that is commonly used to contain a chat history, with roles and content:
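For example, the short exchange about Australian capitals that we render later in this post can be stored like this:

```python
# A chat history in the common 'role'/'content' format (the answers are truncated here,
# just as they are in the rest of this post)
messages = [
    {"role": "user", "content": "What is the capital of Australia?"},
    {"role": "assistant", "content": "The capital city of Australia is Canberra..."},
    {"role": "user", "content": "But I thought it was Sydney!"},
    {"role": "assistant", "content": "Sydney is indeed a large and famous city in Australia, but it is not the capital city..."},
    {"role": "user", "content": "Thanks for putting me straight."},
]
```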
We can transform this into a Mistral AI Instruct prompt using the chat template in the HuggingFace tokenizer, but without using the tokenizer itself. This is useful if we are sending prompts to API endpoints like Amazon Bedrock, as Bedrock has its own tokenizer and we want to send text prompts like the ones earlier in this post.
First I will extract the template. I will do this, show you what I did, and then you can just copy and paste the output from this post, as you won't need to do it again.
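Here is a minimal way to do that with the HuggingFace transformers library (it will download the tokenizer files on the first run):

```python
from transformers import AutoTokenizer

# Load just the tokenizer for the instruct model and print its built-in Jinja2 chat template
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(tokenizer.chat_template)
```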
And the Jinja2 template you get looks like this (you can copy and paste this part if you want to play along).
"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
What does this template do? Well, to find out I asked another of my favorite LLMs, Anthropic Claude v2. Jumping into the Amazon Bedrock text playground, I prompted `Explain this Jinja2 template: [pasted template]` and this is what I got:
Well that saves me a job! There are a couple of terms to dig into, but we get the idea. (I love gen AI!)
To use the template we will need a little more than just the messages list we defined above. Specifically, we need to tell the template what it calls the `bos_token` and `eos_token` tokens. These are the `<s>` and the `</s>` that we have been using. We will look into these in more detail later; for now, let's add them in with our messages to the data structure for Jinja to use:
Now we can use Jinja to render the template. First we load the template into a string, and then we can run the rendering code (you will need Jinja2 for this: `pip install jinja2`):
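Here is a sketch of both steps, assuming the `messages` list from earlier; paste the full Jinja2 template from above into `chat_template_string`:

```python
from jinja2 import Template

# Paste the full chat template copied above between the quotes
chat_template_string = "{{ bos_token }}{% for message in messages %}..."

# The messages list from earlier, plus the special tokens the template expects
data = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "messages": messages,
}

# The template calls raise_exception() if the roles don't alternate, so provide one
def raise_exception(message):
    raise ValueError(message)

template = Template(chat_template_string)
prompt = template.render(**data, raise_exception=raise_exception)
print(prompt)
```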
And the result we get is:
<s>[INST] What is the capital of Australia? [/INST]The capital city of Australia is Canberra...</s>[INST] But I thought it was Sydney! [/INST]Sydney is indeed a large and famous city in Australia, but it is not the capital city...</s>[INST] Thanks for putting me straight. [/INST]
Which is exactly what we want. So we can use this prompt, get the generation, add it to the end of our `messages` list in our `data`, and keep the conversation going.
Let's now dive a bit deeper, and get the intuition behind these special tokens and strings. And in the process discover that you might not need the `<s>`, as it depends on the way the server you're using has been configured. (psst: For Amazon Bedrock you SHOULD include the `<s>`.)
So what *are* the `<s>` and `</s>`? Well, they are in the documentation from Mistral AI. After that template we read...
...the documentation goes on to say:
In order to train a language model, the engineers first needed to convert the training text from words to numbers, because no matter what type of machine learning model we use, they all only work with numbers. For LLMs, this process is called tokenization. We train a tokenizer to convert words to numbers and output a lookup table so that once it's done, we can easily convert in both directions: words to numbers, and numbers to words. We then use this tokenization both during the training of the LLM, and when we are making generations with the LLM - called inference. During inference the prompt is tokenized to token ids (numbers) and run through the LLM, which then generates token ids (numbers), which are de-tokenized back to words for our human squishy brains to read.
If we look at the tokenizer table, we would see whole words, partial words, and letters allocated a token id. The job of the tokenizer is to efficiently represent any conceivable text using a compact set of tokens, which may include these whole words, or partial words, and finally, if there are no other matches, the individual characters.
Here is an example of what you will see if you dig into the tokenizer used for Mistral AI models.
Token ID | maps to |
---|---|
100 | a |
101 | b |
102 | c |
... | ... |
8701 | Amazon |
12266 | Bed |
16013 | rock |
15599 | rocks |
28808 | ! |
(Interesting: 'rock' and 'rocks' are different token ids; 'rocks' is its own thing, not 'rock' + 's'.)
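If you want to poke around in the tokenizer yourself, here is a quick sketch using the HuggingFace tokenizer (the exact ids you see will depend on the tokenizer files you download):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Tokenize a phrase and look at the token ids and the sub-word pieces they map to
ids = tokenizer.encode("Amazon Bedrock rocks!", add_special_tokens=False)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```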
To help the LLM during training, engineers include some special tokens that signify special places in the input sequence. These are 'beginning of sequence' (bos), 'end of sequence' (eos) and 'padding' (pad) tokens. *(Interesting: Mistral AI call them 'beginning of string', 'end of string' but they are the same thing.)*
These tokens are well named and are largely self explanatory.
Beginning of sequence tokens: denote the start of a sequence. If the LLM sees a bos token id, then it knows it's not missing anything that came before, and it indicates a fresh start, free from any previous context that might influence the generated output.
End of sequence tokens: denote the end of a sequence. If this is in the training data, then the data loader might stop here, and pad the rest of the context window. At inference, if this token id is generated by the LLM, then it's an indication that the model is done, and the application that is managing the LLM can stop right there; there is nothing more to be said.
Padding token: LLMs operate on the concept of a context window. This is the number of tokens that the model works with. Broadly speaking this is a set, fixed number. So if your prompt maps to a smaller number of tokens than the context window, you (or the library of code that you're using) will need to pad out the rest of the context with something, and that something is the padding token.
In the tokenizer used by Mistral AI, these special tokens are:
Token ID | maps to |
---|---|
1 | bos_token |
2 | eos_token |
_* | pad_token |
(* The padding token is not set, it's basically 'nothing'.)
But wait, there's more! In most tokenizers, for most models you might use, if you're just sending in prompts as text to an API endpoint (for example), you can all but forget about the bos and eos token ids. They will automagically happen for you. But with the tokenizer used by Mistral AI Instruct models, things are a little different: you can add them in yourself, and in fact, not only can you, you should*.
(* I say 'should' here not 'must', as things may still work without them, but if you want to do things properly, you 'should' add them.)
And how do you add such special tokens to the prompt? Well, that's what we have been doing when we add in the `<s>` and `</s>` strings. You see, these strings get mapped to these special tokens.
Token ID | maps to string |
---|---|
1 | <s> |
2 | </s> |
There are a few things to note here.
First, these are not XML tags. They are simply strings that look like XML tags, but are not used in a way that conforms to XML syntax.
Secondly, remember that the end of sequence token is used *within* the chat prompt at the end of each generation response. This is not that common, but it can be useful, as it acts as a default stop token. When the model is done responding to your question it should generate an eos token id, and the application will stop the generation. Without this, models often keep generating, hallucinating the follow-up 'human' question. In that case we need to work with other stop conditions; for example the commonly used stop condition with Anthropic's Claude models is `\n\nHuman:`, in other words stop if you start putting words in the human's 'mouth'.
Lastly, these are just strings, so if the user input includes either `<s>` or `</s>` and that gets inserted into the prompt, it could confuse the model. We need to experiment with this, and consider sanitising the input for such strings.
If you send a prompt to Mistral AI models without the bos token id, there is evidence to suggest that it will still work. There is also evidence to suggest that the model will generate sub-optimal outputs. So let's agree that the bos token id should be present, yes.
However, do *YOU* need to add it with the `<s>` string? That depends on the code that you're using (or is being used for you) to run the tokenizer.
Let's spin up some tokenizer code and take a look. Here are some of my tests (as per the beginning of March 2024):
For this test I downloaded the tokenizer, within the Mistral AI 7B Instruct download from Mistral AI's own site (WARNING LARGE DOWNLOAD: CDN), and grabbed `tokenizer.py` from the `mistralai-src` repo.
Output:
Result:
This tokenizer code generates a bos token id of `1` for us. So if you are hosting your own model using this code from Mistral AI's repo, the answer is no, you don't have to add the `<s>`. However also note that for some reason this tokenizer did not map `1` back to `<s>` so something strange is happening here. Anyone?
For this test, which is a bit easier, I used the HuggingFace transformers library, and just loaded the tokenizer component. (Note that the code on the page I just linked to shows using the tokenizer with `apply_chat_template`, and as we discussed earlier, this is a different thing altogether, as we pass that function a chat history, not a prompt string.)
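A sketch of that test looks like this; the thing to look for is whether the leading `1` (the bos token id) shows up in the output:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# By default the tokenizer prepends the bos token id (1)...
print(tokenizer.encode("What is the capital of Australia?"))

# ...but not when special tokens are switched off
print(tokenizer.encode("What is the capital of Australia?", add_special_tokens=False))
```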
Output:
Result:
By default this tokenizer code generates a bos token id of `1` for us. So if you are hosting your own model using this code from HuggingFace, the answer is no, you don't have to add the `<s>`. UNLESS you are using `add_special_tokens=False` on the tokenizer, in which case the answer is yes.
Amazon Bedrock hosts Mistral 7B and Mixtral 8x7B models as a fully managed serverless API. You don't have to run the model or the tokenizer yourself; it's all done for you. As such there is no way to peek inside the tokenizer code, and we rely on the documentation. OR you can rely on ME :) - the tokenizer used for Amazon Bedrock does not add a bos token id, so make sure to add the `<s>` to your prompt as we did throughout the whole of the first part of this post.
(At the end of this post I will walk you through getting started with Mistral AI, on Amazon Bedrock.)
You might have noticed that whenever we have talked about `[INST]` and `[/INST]` we've referred to them as strings, not tokens. The simple reason for this is that these are strings, not tokens. :)
Let's adapt the code from above to prove this:
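A sketch of that check, again with the HuggingFace tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# '<s>' is a single special token, with id 1...
print(tokenizer.convert_tokens_to_ids("<s>"))

# ...whereas '[INST]' is broken up into several ordinary sub-word tokens
ids = tokenizer.encode("[INST]", add_special_tokens=False)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```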
Output:
The tokenizer used was defined when the base model was trained. So for example, when Mistral 7B was trained, the engineers created or chose a tokenizer to use, and that was that, done. When the model was instruction fine-tuned, there was no need (or ability) to change the tokenizer, as it's ingrained into the model already. The `[INST]` and `[/INST]` strings arrived in the instruction fine-tuning dataset just like any other word in that dataset. Want to use these strings in the non-instruction fine-tuned model? Go ahead, it might even do what you expect, but only if you're lucky.
I have written an article about the concepts of instruction fine-tuning, so you can read up on that there. In short, to perform instruction fine-tuning on a model, you need to create a dataset providing some prompts, and the generation (answer) you expect/want. For example:
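An illustrative (and entirely made up) pair of examples from such a dataset might look like this:

```python
# Hypothetical instruction fine-tuning examples: an instruction plus the response we want
training_pairs = [
    {"instruction": "What is the capital of Australia?",
     "response": "The capital of Australia is Canberra."},
    {"instruction": "How do you make a cup of tea?",
     "response": "1. Choose the right water..."},
]
```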
This is not to teach the model facts (although this might be a secondary outcome), but rather to teach the model what questions and answers look like, and that if asked a 'question', the appropriate response is an 'answer'.
In the Mistral AI documentation, they provide, as a reference, "the format used to tokenize instructions during fine-tuning":
We can unpack this and surmise that the instruction fine-tuning dataset used by Mistral AI to create their instruction fine-tuned models would be as follows:
<s>[INST] What is the capital of Australia? [/INST] The capital of Australia is Canberra.</s>[INST] What is the capital of Queensland? [/INST] The capital of Queensland is Brisbane.</s>[INST] How do you make a cup of tea? [/INST] 1. Choose the Right Water...</s>
Looks familiar? It's the same as the chat prompt we built earlier. Note how the `<s>` used for the bos token id is only used once at the very start of the data, and that `</s>` is used after each answer. The question (or more accurately 'instruction') is wrapped in the `[INST]` and `[/INST]` tags. The format and the capitalisation of all of these tags is set, and we must follow this when we prompt.
If you read this far, thanks! Reach out to me on LinkedIn and let me know! If you found it useful then consider sharing it and letting other people know. If you have feedback please also get in touch!
In this post we looked at how to prompt the Mistral AI instruct fine-tuned models. We found out what the `<s>` is all about, and that you need to add it in some cases and not in others.
Once you're ready to move your project to production, head over to Amazon Bedrock. Bedrock provides a single API endpoint to connect to a variety of generative AI models from leading AI providers such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and also Amazon, and now Mistral AI.
Here is how to enable access to these models in your AWS account:
From the AWS console, navigate to the Amazon Bedrock page. Mistral models have launched in Oregon, so make sure to be in the `us-west-2` region. More regions are coming, so check to see if they're now available in other regions.
Expand the menu on the left hand side, scroll down and select "Model access".
Select the orange "Manage model access" button, and scroll down to see the new Mistral AI models. If you're happy with the licence, then select the checkboxes next to the models, and click 'Save changes'.
You can now access the models! Head to the Amazon Bedrock text playground to start experimenting with your prompts. When you're ready to write some code, take a look at the code samples we have here, and here.
Happy Prompting!
Huge thanks to EK, Sr Applied Scientist, GenAI at AWS for being my sounding board while writing this post. And to Matt Coles, Principal Engineer at AWS, for reviewing my words!
I am a Senior Developer Advocate for Amazon Web Services, specialising in Generative AI. You can reach me directly through LinkedIn, come and connect, and help grow the community.
Thanks - Mike
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.