Training and Deploying LLMs on AWS Trainium and AWS Inferentia2 with Optimum Neuron

Learn how to fine-tune and deploy a Mistral model on Inferentia and Trainium instances with Optimum-Neuron

John Gray
Amazon Employee
Published Mar 16, 2025
(This post is written by Jon Etiz, Solutions Architect AWS, and John Gray, Solutions Architect AWS)

Introduction

With the massive increase in companies experimenting with or implementing Generative AI (GenAI) in their applications, it has become essential for any machine learning engineer or enthusiast to understand how GenAI models actually work. Despite this popularity, training and deploying these models can still be challenging. Beyond finding and deploying a suitable foundation model, producing outputs that are meaningful to one’s business objectives often requires domain adaptation through fine-tuning, which can be both time-consuming and computationally expensive.
Fine-tuning is the process of performing additional training on a pre-trained foundation model to adapt it to provide a desired output. For example, a large language model can be fine-tuned to answer business-specific questions, generate code, or solve math questions more effectively. After a model has been trained and fine-tuned, it can be deployed for inferencing, where the model is given an input and generates an output based on its training.
Because these processes are computationally intensive, AWS has developed specialized hardware to accelerate them. Amazon Elastic Compute Cloud (EC2) Trn1 instances, powered by AWS Trainium chips, are purpose-built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs). Likewise, Amazon EC2 Inf2 instances, powered by AWS Inferentia2, are purpose-built for DL inference. For a deeper dive into their technical aspects, see the AWS Trainium and AWS Inferentia2 architecture documentation. Additionally, Hugging Face has produced an SDK, Optimum Neuron, which allows customers to quickly and easily extend Transformers-based training and inference code to run on Amazon EC2 Trn1 and Amazon EC2 Inf2 instances.
In this blog, we will show how to quickly fine-tune and deploy a large language model using Hugging Face’s Optimum Neuron with AWS Trainium and AWS Inferentia2 hardware.

Solution Overview

This solution consists of an Amazon EC2 Trn1 instance, an Amazon EC2 Inf2 instance, and an Amazon Simple Storage Service (S3) bucket, which is used to store the trained model. Both Amazon EC2 instances are deployed with the HuggingFace Deep Learning AMI (DLAMI) for AWS Neuron, which preloads the instances with the Ubuntu operating system, the Neuron SDK and drivers, and the Python libraries that will be used to train and test the model.
Hugging Face Transformers is one of the Python libraries included in the DLAMI; it provides a robust API that lets PyTorch, TensorFlow, and JAX code work with Hugging Face models. The environment also relies on Optimum Neuron, the Hugging Face interface between the Transformers library and the Neuron SDK, which we will use to compile and run the model on the AWS Trainium and AWS Inferentia2 accelerators.
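To illustrate that relationship, here is a minimal sketch of how Optimum Neuron mirrors the familiar Transformers API. The model classes are near drop-in replacements for their Auto* counterparts, with extra arguments that control Neuron compilation; the model name and shapes below are illustrative only, not part of this walkthrough.
# Minimal sketch: Optimum Neuron mirrors the Transformers API.
# The model name and shapes below are illustrative, not part of this walkthrough.

# Plain Transformers (CPU/GPU):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

# Optimum Neuron (compiles for and runs on Trainium/Inferentia2 NeuronCores):
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    export=True,            # compile the model with the Neuron compiler
    batch_size=1,           # Neuron requires static input shapes
    sequence_length=2048,
    num_cores=2,            # number of NeuronCores to shard the model across
    auto_cast_type="bf16",  # compiler argument, as used later in compile.py
)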

Environment Setup

For convenience, the whole environment can be deployed with AWS CloudFormation using the steps below:

Prerequisites

  • You must have a public subnet in us-east-1, in an Availability Zone (AZ) that has available Amazon EC2 Trn1 and Amazon EC2 Inf2 instances.
  • You must have an Amazon EC2 key pair. For information on creating an EC2 key pair, see Amazon EC2 key pairs and Linux instances.
  • You must have sufficient vCPU capacity to run an Amazon EC2 inf2.24xlarge and Amazon EC2 trn1.32xlarge instance.

Setting up resources

For this walkthrough, launch a Trn1 and an Inf2 instance with the HuggingFace Deep Learning AMI.

Log in to Amazon EC2

Begin by connecting to the Amazon EC2 Trn1 instance, where we will prepare a dataset and fine-tune the model.
To download gated datasets and models, we need to authenticate with a user access token. Create a Hugging Face account, and go to the User Access Tokens page. Create a Read token with any name, and keep it somewhere safe. Back on the Amazon EC2 Trn1 instance, use:
huggingface-cli login
Enter your token, and type “n” when prompted to add the token as a git credential.
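If you prefer to authenticate from Python rather than interactively, the huggingface_hub library (a dependency of Transformers) offers a login() helper. A minimal sketch, assuming you have exported your token as an environment variable named HF_TOKEN:
# Minimal sketch: programmatic Hugging Face authentication, equivalent to `huggingface-cli login`.
# Assumes the access token has been exported as the environment variable HF_TOKEN.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])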

Dataset Preparation and Overview

Once we have authenticated with Hugging Face, we can start fine-tuning. In our example, we will use OpenAI’s Grade School Math 8K (gsm8k) dataset to fine-tune Mistral-7B-Instruct-v0.3 toward helping with grade-school math problems. Mistral-7B-Instruct is a fine-tuned version of Mistral-7B optimized for following instructions, which in turn makes it better suited for question answering. This is notable because a foundation model that hasn’t been instruction-tuned will simply continue the prompts provided by users.
To fine-tune the model on gsm8k, we need to format each question-answer pair into a single text string with the appropriate control tokens. Mistral-7B-Instruct works with prompts in the format below, where the first string is the instruction (user prompt) and the second string is the response. Note how the user prompt starts with the <s> token and the model response ends with the </s> token; these mark the beginning and end of a sequence, and the model uses them to keep track of individual exchanges. Additionally, the [INST] and [/INST] tokens let the model recognize the actual “question”, or instruction, posed to it.
"<s>[INST] What is AWS? [/INST]\n\n",
"Amazon Web Services (AWS) is a comprehensive, secure cloud services platform
offered by Amazon that provides computer infrastructure and application services
on a pay-as-you-go basis. It offers a wide variety of services, including
computing power, storage options, networking, databases, analytics, machine
learning, and more.</s>"
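As an aside, tokenizers that define a chat template can produce this format for you. The walkthrough below builds the formatted strings by hand, but a minimal sketch of the alternative, using Transformers’ apply_chat_template, looks like this:
# Minimal sketch: let the tokenizer's chat template insert the control tokens.
# The walkthrough below formats the strings manually instead.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
messages = [{"role": "user", "content": "What is AWS?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # roughly "<s>[INST] What is AWS? [/INST]"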
The dataset we use, gsm8k, has two columns, question and answer, so we need to convert each training sample into the format above; we will use a simple Python script. Begin a blank Python file with the imports below:
from datasets import DatasetDict, load_dataset
The training script that will be used later reads the “text” column of a dataset for its training data, so we define a function that builds the text column from each sample and adds the control tokens:
def format(sample):
    sample['text'] = f"<s>[INST] {sample['question']} [/INST]\n\n{sample['answer']}</s>"
    return sample
Now add the code to load the dataset and format it using the above function:
# Downloads the gsm8k dataset directly from Hugging Face.
dataset = load_dataset("gsm8k", "main")

# We need to split the dataset into a training and a validation set.
# Note that gsm8k has a 'test' split, which we rename to 'validation' for our training script.
train = dataset['train']
validation = dataset['test']

# Map the format function on all elements of the training and validation splits.
# Also removes the question and answer columns we no longer need.
train = train.map(format, remove_columns=list(train.features))
validation = validation.map(format, remove_columns=list(validation.features))

# Create a new DatasetDict with our train and validation splits.
dataset = DatasetDict({"train": train, "validation": validation})
Lastly, save the dataset:
dataset.save_to_disk('dataset_formatted')
Save and close the file, then run it. You will notice a new subdirectory, dataset_formatted, which contains the dataset we will use for training.
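Optionally, you can sanity-check the saved dataset before moving on to training. A minimal sketch that reloads it from disk and prints one formatted sample:
# Minimal sketch: verify the formatted dataset saved above.
from datasets import load_from_disk

dataset = load_from_disk("dataset_formatted")
print(dataset)                      # shows the train/validation splits and row counts
print(dataset["train"][0]["text"])  # one sample in the <s>[INST] ... [/INST] ... </s> format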

Fine-Tune Mistral-7B-Instruct with Hugging Face Optimum Neuron

Once the dataset has been prepared, we are ready to fine-tune. The Mistral model on Hugging Face is gated, so make sure to visit the model page and agree to the terms so that your account is authorized to download it. Create a file train.sh and enter the following contents:
#!/bin/bash
set -ex
# In PT2.1, functionalization is needed to close 3% convergence gap compared to PT1.13 for ZeRO1
export XLA_DISABLE_FUNCTIONALIZATION=1

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
# Limit memory allocation to prevent crashes
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

# Distributed configs
PROCESSES_PER_NODE=32
WORLD_SIZE=1
DISTRIBUTED_ARGS="--nproc_per_node $PROCESSES_PER_NODE"
LOG_PATH=logs

# Create the log path
mkdir -p $LOG_PATH
echo $DISTRIBUTED_ARGS

# Parallelism configuration
GBS=512
NUM_EPOCHS=10
TP_DEGREE=8
PP_DEGREE=1
DP=$(($PROCESSES_PER_NODE * $WORLD_SIZE / $TP_DEGREE / $PP_DEGREE))
BS=1
GRADIENT_ACCUMULATION_STEPS=1
BLOCK_SIZE=2048
LOGGING_STEPS=1
MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.3"
OUTPUT_DIR="mistral_trained"

MAX_STEPS=-1

# Our script will first look in the working directory for a dataset matching the name, or download it from the Hugging Face hub
DATASET_NAME="dataset_formatted"

XLA_USE_BF16=1 torchrun $DISTRIBUTED_ARGS examples/run_clm.py \
--model_name_or_path $MODEL_NAME \
--num_train_epochs $NUM_EPOCHS \
--dataset_name $DATASET_NAME \
--do_train \
--learning_rate 8e-6 \
--warmup_steps 30 \
--max_steps $MAX_STEPS \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--gradient_checkpointing \
--block_size $BLOCK_SIZE \
--bf16 \
--zero_1 false \
--tensor_parallel_size $TP_DEGREE \
--pipeline_parallel_size $PP_DEGREE \
--logging_steps $LOGGING_STEPS \
--save_total_limit 1 \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir
This script sets a variety of parameters and launches a distributed training process on our Trainium instance. It first sets several compiler flags to configure training, then sets the distributed training configuration, including the number of processes, the world size, and the parallelism configuration. For this example we use a tensor parallelism degree of 8 (TP_DEGREE = 8) and no pipeline parallelism. On one trn1.32xlarge instance, which exposes 32 NeuronCores, this gives us 4 data parallel (DP) workers, as sketched below. All of these parameters are passed to run_clm.py, which can be found in ~/examples.
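For reference, here is the data parallel arithmetic from train.sh, written out as a minimal Python sketch using the values set in the script above:
# Minimal sketch of the data parallel (DP) arithmetic from train.sh.
processes_per_node = 32  # PROCESSES_PER_NODE: one worker per NeuronCore on a trn1.32xlarge
world_size = 1           # WORLD_SIZE: a single instance
tp_degree = 8            # TP_DEGREE: tensor parallelism
pp_degree = 1            # PP_DEGREE: no pipeline parallelism

dp_degree = (processes_per_node * world_size) // (tp_degree * pp_degree)
print(dp_degree)  # 4 data parallel workers, each with a per-device batch size of 1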
To launch distributed training on the instance, run the training script with the following command:
chmod +x train.sh
./train.sh
The training run will take roughly 30 to 40 minutes. During training, the model is split into a number of shards determined by the tensor parallelism and pipeline parallelism degrees, and each worker within a data parallel rank loads its respective shard before training commences.
The training process produces sharded checkpoints so that users can quickly resume from them. Once training is complete, however, we need to consolidate the shards into a single model. This can be done with the optimum-cli using the following command:
optimum-cli neuron consolidate mistral_trained/shards/ mistral_trained/
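Optionally, you can confirm that the consolidation wrote a checkpoint next to the shards. A minimal sketch that simply lists the output directory (the exact file names depend on your Optimum Neuron version, so we do not assume them here):
# Minimal sketch: list the consolidated output directory written by optimum-cli above.
# File names vary by Optimum Neuron version, so we only print whatever is present.
import os

for name in sorted(os.listdir("mistral_trained")):
    print(name)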

Save the Trained Model to Amazon S3

To move the trained model between instances, the AWS CloudFormation template created an Amazon S3 bucket and an AWS Identity and Access Management (IAM) role that grants our Amazon EC2 instances the necessary permissions on the bucket. The template also set the S3 bucket name as an environment variable, so you can upload the model to S3 with the following command:
aws s3 cp mistral_trained/ $S3_BUCKET --recursive

Test our Fine-Tuned Model

Once the model has been uploaded to S3, we are done with the Amazon EC2 Trn1 instance. Log into the Amazon EC2 Inf2 instance, and in the home directory, pull the trained model from S3:
aws s3 sync $S3_BUCKET mistral_trained/

Compile the Model for Inference

We now need to compile the model for inferencing. Create a file, compile.py with the following contents:
from optimum.neuron import NeuronModelForCausalLM

# num_cores is the number of neuron cores. Find this with the command neuron-ls
compiler_args = {"num_cores": 12, "auto_cast_type": 'bf16'}
input_shapes = {"batch_size": 1, "sequence_length": 4096}

# Compiles an Optimum Neuron model from the previously trained (uncompiled) model
model = NeuronModelForCausalLM.from_pretrained(
    "mistral_trained",
    export=True,
    **compiler_args,
    **input_shapes)

# Saves the compiled model to the directory mistral_neuron
model.save_pretrained("mistral_neuron")
Run the compilation with the command below. This process will take up to 10 minutes.
python compile.py
Copy the tokenizer files from the trained model directory to the compiled model directory:
cp mistral_trained/tokenizer* mistral_neuron/
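Equivalently, this can be done from Python by loading the tokenizer from the trained model directory and saving it next to the compiled artifacts. A minimal sketch, assuming the tokenizer files were written to mistral_trained during training, as the cp command above implies:
# Minimal sketch: save the tokenizer alongside the compiled model so that
# AutoTokenizer.from_pretrained("./mistral_neuron") in test.py below can find it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistral_trained")
tokenizer.save_pretrained("mistral_neuron")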
Now that the model is compiled, create a basic script, test.py, to test it.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Load the compiled model.
model = NeuronModelForCausalLM.from_pretrained("./mistral_neuron", local_files_only=True)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("./mistral_neuron")
tokenizer.pad_token_id = tokenizer.eos_token_id

# Set a message to send to the model for inferencing
message = f"[INST] Two girls each got 1/6 of the 24 liters of water. Then a boy got 6 liters of water. How many liters of water were left? [/INST]\n\n"
tokenized_message = tokenizer(message, return_tensors="pt")

# Do the inferencing
outputs = model.generate(
    **tokenized_message,
    max_new_tokens=512,  # How many tokens the model can generate in the response
    do_sample=True,      # Use sampling instead of greedy decoding
    temperature=0.9,     # The value used to modulate next-token probabilities
    top_k=50,            # The number of highest-probability vocabulary tokens to keep for top-k filtering
    top_p=0.9            # If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation
)

# Decode the output from a tensor array to text.
answer = tokenizer.decode(outputs[0][len(tokenized_message[0]):], skip_special_tokens=True)

print(answer)
For more info on generation parameters, view the Transformers GenerationConfig documentation.
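As a variation, the same sampling parameters can be bundled into a GenerationConfig object and passed to generate(). A minimal sketch, assuming the compiled model’s generate() accepts a generation_config argument just like the standard Transformers API:
# Minimal sketch: the same generation parameters expressed as a GenerationConfig.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.9,
    top_k=50,
    top_p=0.9,
)
outputs = model.generate(**tokenized_message, generation_config=generation_config)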
Your output will be similar to the following:
Let's break it down:

1. The first girl got 1/6 of the water, and the second girl also got 1/6 of the water. Since 1/6 + 1/6 = 2/6 or 1/3 of the water, they both together got 1/3 of the water.

2. The remaining water is 24 liters - (1/3 * 24) = 24 - 8 = 16 liters.

3. The boy then took 6 liters of water, leaving 16 - 6 = 10 liters of water left.

So, there were 10 liters of water left after the boy took his share.
Below is output from a non-fine-tuned Mistral-7B-Instruct-v0.3, and you can see the fine-tuning had a positive effect:
First, let's find out how much water each girl had:
subs = 24 / 6
subs = 4 liters
So each girl got 4 liters of water. Since there were two girls, then they got a total of 4 * 2 = 8 liters of water.
After giving the boy 6 liters of water, the remaining amount is:
leftover = 24 - 8 - 6
leftover = 12 liters
So there are 12 liters of water left.

Deployment Options & Next Steps

Now that we have seen that the fine-tuning worked, we can take a variety of paths to deploy the fine-tuned model: writing a custom server application on Amazon EC2, deploying on Amazon EKS, or building an Amazon SageMaker endpoint that leverages Hugging Face Text Generation Inference (TGI) containers or SageMaker Large Model Inference containers. There are many ways to use AWS services to deploy Generative AI models performantly and cost-effectively. For examples of deployment options, refer to the following:
Additionally, the examples folder in the home directory includes a simple chat program, chat.py, which can be used as a framework for a large language model chat assistant with contextual memory.
When you’re done with the environment, be sure to delete the AWS CloudFormation stack to save costs.
For information on training, deploying, and maintaining Generative AI models, we invite you to refer to the following documentation for future projects:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
