Training and Deploying LLMs on AWS Trainium and AWS Inferentia2 with Optimum Neuron
This post is written by Jonathan Etiz, Solutions Architect AWS, and John Gray, Solutions Architect AWS
Published Mar 16, 2025
Last Modified Mar 22, 2025
Introduction
With the massive increase in companies experimenting with or implementing generative AI (GenAI) in their applications, it has become essential for any machine learning engineer or enthusiast to understand how GenAI models actually work. Despite this popularity, training and deploying these models can still be challenging. Beyond finding and deploying a suitable foundation model, producing outputs that are meaningful to one's business objectives often requires domain adaptation through fine-tuning, which can be both time-consuming and computationally expensive.
Fine-tuning is the process of performing additional training on a pre-trained foundation model to adapt it to produce a desired output. For example, a large language model can be fine-tuned to answer business-specific questions, generate code, or solve math problems more effectively. After a model has been trained and fine-tuned, it can be deployed for inferencing, where the model is provided with input and generates an output based on its training.
These processes are computationally intensive, so AWS has developed specialized hardware to accelerate specific tasks such as training and inferencing. Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances, powered by AWS Trainium chips, are purpose-built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs). Likewise, Amazon EC2 Inf2 instances, powered by AWS Inferentia2, are purpose-built for DL inference. For a deeper dive into their technical aspects, see the AWS Trainium and AWS Inferentia2 architecture documentation. Additionally, Hugging Face has produced an SDK, Optimum Neuron, which allows customers to quickly and easily extend Transformers-based training and inference code to run on Amazon EC2 Trn1 and Amazon EC2 Inf2 instances.
In this blog, we will show how to quickly fine-tune and deploy a large language model using Hugging Face's Optimum Neuron with AWS Trainium and AWS Inferentia2 hardware.
This solution consists of an Amazon EC2 Trn1 instance, Amazon EC2 Inf2 instance, and an Amazon Simple Storage Service (S3) bucket, which is used to store the trained model. Both Amazon EC2 instances are deployed with the HuggingFace Deep Learning AMI for AWS Neuron, which preloads the instances with the Ubuntu operating system, the Neuron SDK and drivers, and various Python libraries which will be used to train and test the model.
Hugging Face Transformers is one of the Python libraries included in the DLAMI; it provides a robust API enabling PyTorch, TensorFlow, and JAX to interface with Hugging Face models. The environment also relies on Optimum Neuron, the Hugging Face interface between the Transformers library and the Neuron SDK, which we use to compile and run the model on the AWS Trainium and AWS Inferentia2 accelerators.
For convenience, the whole environment can be deployed with AWS CloudFormation, provided the prerequisites below are met:
- You must have a public subnet in us-east-1, in an Availability Zone (AZ) that has available Amazon EC2 Trn1 and Amazon EC2 Inf2 instances.
- You must have an Amazon EC2 key pair. For information on creating an EC2 key pair, see Amazon EC2 key pairs and Linux instances.
- You must have sufficient vCPU capacity to run an Amazon EC2 inf2.24xlarge and Amazon EC2 trn1.32xlarge instance.
For this walkthrough, launch a Trn1 instance and an Inf2 instance with the HuggingFace Deep Learning AMI.
Begin by connecting to the Amazon EC2 Trn1 instance, where we will prepare a dataset and fine-tune the model.
To download gated datasets and models, we need to authenticate with a user access token. Create a Hugging Face account, and go to the User Access Tokens page. Create a Read token with any name, and keep it somewhere safe. Back on the Amazon EC2 Trn1 instance, use the Hugging Face CLI to log in:
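For example, with the Hugging Face CLI that comes with the preinstalled Hugging Face libraries on the DLAMI:

```bash
huggingface-cli login
```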
Enter your token, and type “n” when prompted to add it as a git credential.
Once we have authenticated with Hugging Face, we are able to do fine-tuning. In our example, we will use OpenAI’s Grade School Math 8K (gsm8k) dataset to fine-tune Mistral-7B-Instruct-v0.3 towards helping with grade-school math problems. Mistral-7B-Instruct is a version of Mistral-7B that has been fine-tuned to follow instructions, which in turn makes it better for question answering. This is notable because a foundation model that hasn’t been instruction-tuned will simply continue the prompts provided by users.
To fine-tune the model on gsm8k, we need to format each question-answer pair into a single text string with the appropriate control tokens. Mistral-7B-Instruct works with prompts that consist of an instruction (the user prompt) followed by the response, wrapped in control tokens: the user prompt starts with the `<s>` token, and the model response ends with the `</s>` token. These are used by the model to keep track of individual exchanges. Additionally, the `[INST]` and `[/INST]` tokens let the model recognize the actual “question”, or instruction, posed to it. The dataset we use, gsm8k, is formatted with two columns, question and answer, so we need to convert each training sample into this format; we will use a simple Python script. Begin a blank Python file with the imports below:
The training script that will be used later uses the “text” column of a dataset as training data. We define a function that builds this text column, wrapping each sample with the control tokens:
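For example (the function name format_sample is an assumption):

```python
def format_sample(sample):
    # Wrap each question/answer pair in Mistral's instruction control tokens
    # and store the result in the "text" column used by the training script.
    sample["text"] = f"<s>[INST] {sample['question']} [/INST] {sample['answer']}</s>"
    return sample
```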
Now add the code to load the dataset and format it using the above function:
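Continuing the sketch; gsm8k’s default configuration on the Hugging Face Hub is named “main”:

```python
# Load gsm8k and apply the formatting function to every sample,
# dropping the original columns since training only needs "text".
dataset = load_dataset("gsm8k", "main")
dataset = dataset.map(format_sample, remove_columns=["question", "answer"])
```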
Lastly, save the dataset:
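For example:

```python
# Write the formatted dataset to the dataset_formatted directory
# that the training step will read from.
dataset.save_to_disk("dataset_formatted")
```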
Save and close the file, then run it. Notice a new subdirectory, `dataset_formatted`, which contains the dataset we will use for training.

Once the dataset has been prepared, we are ready to fine-tune. Mistral on Hugging Face is gated, so make sure to visit the model page and agree to the terms to authorize your account to download the model. Create a file `train.sh` whose contents follow the sketch after this explanation. This script sets a variety of parameters and launches a distributed training process on our Trainium instance. First, it sets various compiler flags needed for training. Next, it sets the distributed training configuration, including the number of processes, the world size, and the parallelism configuration. For this example we use a parallelism configuration of `TP_DEGREE = 8` and no pipeline parallelism. On a single trn1.32xlarge instance, which has 32 NeuronCores, this gives us 4 data parallel (DP) workers. All of these parameters are passed to `run_clm.py`, which can be found in `~/examples`.
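A minimal sketch of what `train.sh` can look like is shown below. The compiler flags, hyperparameters, dataset argument, and output directory here are assumptions; check them against the copy of `run_clm.py` in `~/examples`, which defines the exact arguments it accepts.

```bash
#!/bin/bash
# Neuron compiler flags commonly used for LLM training (illustrative, not exhaustive)
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training"
export MALLOC_ARENA_MAX=64

# Distributed training configuration
PROCESSES_PER_NODE=32   # one worker per NeuronCore on a trn1.32xlarge
TP_DEGREE=8             # tensor parallelism; with no pipeline parallelism this yields 4 DP workers
BS=1                    # per-device batch size
GRAD_ACCUM=8            # gradient accumulation steps

# Launch distributed training; the dataset argument and flag names may differ
# depending on the version of run_clm.py shipped with the AMI.
torchrun --nproc_per_node=$PROCESSES_PER_NODE ~/examples/run_clm.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.3 \
  --dataset_name dataset_formatted \
  --do_train \
  --bf16 \
  --per_device_train_batch_size $BS \
  --gradient_accumulation_steps $GRAD_ACCUM \
  --tensor_parallel_size $TP_DEGREE \
  --output_dir mistral_trained_checkpoints
```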
To launch distributed training on the instance, run the training script with the following command:
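For example, from the directory containing the script:

```bash
bash train.sh
```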
The training run will take roughly 30 to 40 minutes. During training, the model is split into a number of shards determined by the tensor and pipeline parallelism degrees, and each worker within a data parallelism rank loads its respective shard before commencing training.
The training process produces sharded checkpoints to allow users to quickly resume from them. Once training is complete, however, we need to consolidate the shards into a single model. This can be done with the following optimum-cli command:
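A sketch of the consolidation step, assuming the checkpoints were written to the hypothetical `mistral_trained_checkpoints` directory used in the training sketch above:

```bash
# Merge the sharded checkpoints into a single set of weights under mistral_trained/
optimum-cli neuron consolidate mistral_trained_checkpoints mistral_trained
```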
To move the trained model between instances, the AWS CloudFormation template created an Amazon S3 bucket and an AWS Identity and Access Management (IAM) role that grants our Amazon EC2 instances the necessary permissions on the bucket. The template also set the S3 bucket name as an environment variable, so you can upload the model to S3 with the following command:
aws s3 cp mistral_trained/ $S3_BUCKET --recursive
Once the model has been uploaded to S3, we are done with the Amazon EC2 Trn1 instance. Log into the Amazon EC2 Inf2 instance, and in the home directory, pull the trained model from S3:
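For example, mirroring the upload command:

```bash
# Pull the consolidated model from the bucket into a local mistral_trained/ directory
aws s3 cp $S3_BUCKET mistral_trained --recursive
```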
We now need to compile the model for inferencing. Create a file, `compile.py`, along the lines of the sketch below. Running the compiler will take up to 10 minutes.
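A minimal sketch of `compile.py`, assuming the model was downloaded to `mistral_trained/` and writing the compiled artifacts to a hypothetical `mistral_compiled/` directory; the batch size, sequence length, and core count are assumptions to adjust for your workload:

```python
from optimum.neuron import NeuronModelForCausalLM

# Export (compile) the fine-tuned model for AWS Inferentia2.
# These settings are assumptions; tune them for your instance and use case.
compiler_args = {"num_cores": 8, "auto_cast_type": "bf16"}  # 8 of the 12 NeuronCores on an inf2.24xlarge
input_shapes = {"batch_size": 1, "sequence_length": 2048}

model = NeuronModelForCausalLM.from_pretrained(
    "mistral_trained",   # consolidated model pulled from S3
    export=True,
    **compiler_args,
    **input_shapes,
)
model.save_pretrained("mistral_compiled")
```

Run the compiler with:

```bash
python compile.py
```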
Copy the tokenizer files from the trained model to the compiled model:
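For example (the exact tokenizer file names can vary between model versions):

```bash
cp mistral_trained/tokenizer* mistral_trained/special_tokens_map.json mistral_compiled/
```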
Now that we have the model compiled, you can create a basic script, `test.py`, to test the model; a sketch follows. For more info on generation parameters, view the Transformers GenerationConfig documentation.
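A minimal `test.py` sketch, assuming the compiled model is in `mistral_compiled/`; the sample question and generation parameters are placeholders:

```python
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Load the compiled model and its tokenizer
model = NeuronModelForCausalLM.from_pretrained("mistral_compiled")
tokenizer = AutoTokenizer.from_pretrained("mistral_compiled")

# Format the question with the same control tokens used during fine-tuning.
# add_special_tokens=False avoids adding a second <s> token.
prompt = "<s>[INST] A classroom has 4 rows of 6 desks. How many desks are there in total? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```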
Your output will be similar to the following:
Below is output from a non-fine-tuned Mistral-7B-Instruct-v0.3, and you can see the fine-tuning had a positive effect:
Now that we have seen that the fine-tuning worked, we can take a variety of steps to deploy the fine-tuned model. From writing a custom server application on Amazon EC2, to deploying on Amazon EKS, to building an Amazon SageMaker endpoint leveraging Hugging Face Text Generation Inference (TGI) containers or SageMaker Large Model Inference containers, there are a variety of ways to leverage AWS services to performantly and cost-effectively deploy generative AI models. For examples of deployment options, refer to the following:
Additionally, the examples folder in the home directory includes a simple chat program, `chat.py`, which can be used as a framework for a large language model chat assistant with contextual memory.

When you’re done with the environment, be sure to delete the AWS CloudFormation stack to save costs.
For information on training, deploying, and maintaining Generative AI models, we invite you to refer to the following documentation for future projects:
Bios:
Jonathan Etiz is a Solutions Architect Intern at AWS. He is a senior at San José State University pursuing a bachelor’s degree in Data Science, with a minor in Mathematics. He has experience with the software development lifecycle, AWS services, embedded systems, and previously worked in the information technology field. He enjoys swimming, hiking, and skydiving in his free time.
John Gray is a Sr. Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.