Training and Deploying LLMs on AWS Trainium and AWS Inferentia2 with Optimum Neuron
Learn how to fine-tune and deploy a Mistral model on Inferentia and Trainium instances with Optimum-Neuron
- You must have a public subnet in us-east-1, in an Availability Zone (AZ) that has available Amazon EC2 Trn1 and Amazon EC2 Inf2 instances.
- You must have an Amazon EC2 key pair. For information on creating an EC2 key pair, see Amazon EC2 key pairs and Linux instances.
- You must have sufficient vCPU capacity to run an Amazon EC2 inf2.24xlarge instance and an Amazon EC2 trn1.32xlarge instance.
huggingface-cli login
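If you prefer to authenticate from Python rather than the CLI, the Hugging Face Hub client offers an equivalent call (a minimal sketch; the token value below is a placeholder for your own access token):

from huggingface_hub import login

# "hf_xxx" is a placeholder; use a read-access token from your Hugging Face account settings.
login(token="hf_xxx")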
The prompt begins with the <s> token, and the model response ends with the </s> token. These are used by the model to keep track of individual exchanges (sentences). Additionally, the [INST] and [/INST] tokens are used for the model to recognize the actual “question”, or instruction, posed to the model. A formatted exchange looks like this:
"<s>[INST] What is AWS? [/INST]\n\n",
"Amazon Web Services (AWS) is a comprehensive, secure cloud services platform
offered by Amazon that provides computer infrastructure and application services
on a pay-as-you-go basis. It offers a wide variety of services, including
computing power, storage options, networking, databases, analytics, machine
learning, and more.</s>"
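For reference, recent versions of the transformers library can also produce this format from the tokenizer's built-in chat template. The sketch below is an optional check (the exact whitespace may differ slightly from the hand-built strings used later):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
messages = [{"role": "user", "content": "What is AWS?"}]
# tokenize=False returns the formatted prompt string instead of token IDs.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # expected to look roughly like: <s>[INST] What is AWS? [/INST]

The rest of this walkthrough builds the strings manually with the same tokens, which keeps the data preparation explicit. Next, prepare the training data with the Hugging Face datasets library.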
from datasets import DatasetDict, load_dataset
# Formats a gsm8k sample into the Mistral instruction format described above,
# storing the result in a new 'text' column.
def format(sample):
    sample['text'] = f"<s>[INST] {sample['question']} [/INST]\n\n{sample['answer']}</s>"
    return sample
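As a quick illustration, applying the function to a made-up sample produces a single text field in the instruction format (the question and answer here are placeholders, not taken from gsm8k):

sample = {"question": "What is 2 + 2?", "answer": "2 + 2 = 4"}
print(format(sample)["text"])
# <s>[INST] What is 2 + 2? [/INST]
#
# 2 + 2 = 4</s>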
# Downloads the gsm8k dataset directly from Hugging Face.
dataset = load_dataset("gsm8k", "main")
# We need to split the dataset into a training and a validation set.
# Note: gsm8k provides a 'test' split, which we use as 'validation' for our training script.
train = dataset['train']
validation = dataset['test']
# Map the format function over all elements of the training and validation splits,
# and remove the question and answer columns we no longer need.
train = train.map(format, remove_columns=list(train.features))
validation = validation.map(format, remove_columns=list(validation.features))
# Create a new DatasetDict with our train and validation splits.
dataset = DatasetDict({"train": train, "validation": validation})
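Before saving, it is worth printing one formatted example to confirm the special tokens are in place (a quick check, not part of the original script):

# Show the first formatted training example and the resulting splits.
print(dataset["train"][0]["text"][:200])
print(dataset)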
dataset.save_to_disk('dataset_formatted')
This creates a new directory named dataset_formatted, which contains the dataset we will use for training. Next, create a file named train.sh and enter the following contents:
set -ex
# In PT2.1, functionalization is needed to close 3% convergence gap compared to PT1.13 for ZeRO1
export XLA_DISABLE_FUNCTIONALIZATION=1
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
# Limit memory allocation to prevent crashes
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"
# Distributed configs
PROCESSES_PER_NODE=32
WORLD_SIZE=1
DISTRIBUTED_ARGS="--nproc_per_node $PROCESSES_PER_NODE"
LOG_PATH=logs
# Create the log path
mkdir -p $LOG_PATH
echo $DISTRIBUTED_ARGS
# Parallelism configuration
GBS=512
NUM_EPOCHS=10
TP_DEGREE=8
PP_DEGREE=1
DP=$(($PROCESSES_PER_NODE * $WORLD_SIZE / $TP_DEGREE / $PP_DEGREE))
BS=1
GRADIENT_ACCUMULATION_STEPS=1
BLOCK_SIZE=2048
LOGGING_STEPS=1
MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.3"
OUTPUT_DIR="mistral_trained"
MAX_STEPS=-1
# Our script will first look in the working directory for a dataset matching the name, or download it from the Hugging Face hub
DATASET_NAME="dataset_formatted"
XLA_USE_BF16=1 torchrun $DISTRIBUTED_ARGS examples/run_clm.py \
--model_name_or_path $MODEL_NAME \
--num_train_epochs $NUM_EPOCHS \
--dataset_name $DATASET_NAME \
--do_train \
--learning_rate 8e-6 \
--warmup_steps 30 \
--max_steps $MAX_STEPS \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--gradient_checkpointing \
--block_size $BLOCK_SIZE \
--bf16 \
--zero_1 false \
--tensor_parallel_size $TP_DEGREE \
--pipeline_parallel_size $PP_DEGREE \
--logging_steps $LOGGING_STEPS \
--save_total_limit 1 \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir
The script uses Tensor Parallelism with TP_DEGREE = 8 and no Pipeline Parallelism. Using one trn1.32xlarge instance, which has 32 Neuron cores, this means we will have 4 DP (data parallel) workers. All of these parameters are passed to run_clm.py, which can be found in ~/examples.
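As a quick check of the parallelism math, the following standalone calculation mirrors the DP line in the script:

processes_per_node = 32   # one worker per NeuronCore on a trn1.32xlarge
world_size = 1            # single instance
tp_degree = 8             # tensor parallel degree
pp_degree = 1             # no pipeline parallelism

dp = (processes_per_node * world_size) // (tp_degree * pp_degree)
print(dp)  # 4 data-parallel workers

# Effective global batch size with per-device batch size 1 and no gradient accumulation:
print(dp * 1 * 1)  # 4

Make the script executable and start training: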
chmod +x train.sh
./train.sh
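Once training finishes, the sharded checkpoint written under mistral_trained/shards/ can be consolidated into a single set of weights and copied to Amazon S3 so it can be pulled down on the inference instance. The $S3_BUCKET variable below is assumed to be set to a bucket you own (for example, export S3_BUCKET=s3://your-bucket/mistral):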
optimum-cli neuron consolidate mistral_trained/shards/ mistral_trained/
aws s3 cp mistral_trained/ $S3_BUCKET --recursive
aws s3 sync $S3_BUCKET mistral_trained/
Next, create a file named compile.py with the following contents:
from optimum.neuron import NeuronModelForCausalLM
# num_cores is the number of neuron cores. Find this with the command neuron-ls
compiler_args = {"num_cores": 12, "auto_cast_type": 'bf16'}
input_shapes = {"batch_size": 1, "sequence_length": 4096}
# Compiles an Optimum Neuron model from the previously trained (uncompiled) model
model = NeuronModelForCausalLM.from_pretrained(
    "mistral_trained",
    export=True,
    **compiler_args,
    **input_shapes)
# Saves the compiled model to the directory mistral_neuron
model.save_pretrained("mistral_neuron")
python compile.py
cp mistral/tokenizer* mistral_neuron/
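If the original mistral/ directory with the tokenizer files is not available on the inference instance, an alternative (a small sketch, assuming network access and the same base model used for fine-tuning) is to save the tokenizer directly into the compiled model directory:

from transformers import AutoTokenizer

# Assumes the base model ID used for training; adjust if you fine-tuned a different model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.save_pretrained("mistral_neuron")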
Now create a file named test.py to test the model:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM
# Load the compiled model.
model = NeuronModelForCausalLM.from_pretrained("./mistral_neuron", local_files_only=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("./mistral_neuron")
tokenizer.pad_token_id = tokenizer.eos_token_id
# Set a message to send to the model for inferencing
message = f"[INST] Two girls each got 1/6 of the 24 liters of water. Then a boy got 6 liters of water. How many liters of water were left? [/INST]\n\n"
tokenized_message = tokenizer(message, return_tensors="pt")
# Do the inferencing
outputs = model.generate(
    **tokenized_message,
    max_new_tokens=512,  # How many tokens the model can generate in the response
    do_sample=True,      # Use sampling or greedy decoding
    temperature=0.9,     # The value used to modulate next token probabilities
    top_k=50,            # The number of highest probability vocabulary tokens to keep for top-k-filtering
    top_p=0.9            # If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
)
# Decode the output from a tensor array to text.
answer = tokenizer.decode(outputs[0][len(tokenized_message[0]):], skip_special_tokens=True)
print(answer)
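Run the script with python test.py. The model should respond with an answer similar to the following: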
Let's break it down:
1. The first girl got 1/6 of the water, and the second girl also got 1/6 of the water. Since 1/6 + 1/6 = 2/6 or 1/3 of the water, they both together got 1/3 of the water.
2. The remaining water is 24 liters - (1/3 * 24) = 24 - 8 = 16 liters.
3. The boy then took 6 liters of water, leaving 16 - 6 = 10 liters of water left.
So, there were 10 liters of water left after the boy took his share.
First, let's find out how much water each girl had:
subs = 24 / 6
subs = 4 liters
So each girl got 4 liters of water. Since there were two girls, then they got a total of 4 * 2 = 8 liters of water.
After giving the boy 6 liters of water, the remaining amount is:
leftover = 24 - 8 - 6
leftover = 12 liters
So there are 12 liters of water left.
Also included is chat.py, which can be used as a framework for a large language model chat assistant with contextual memory.
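A minimal sketch of such a contextual loop is shown here; it is an illustration built on the pieces of this walkthrough, not the contents of chat.py itself:

# Illustrative contextual chat loop; not the chat.py referenced above.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("./mistral_neuron", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained("./mistral_neuron")
tokenizer.pad_token_id = tokenizer.eos_token_id

history = ""  # accumulated [INST] ... [/INST] exchanges (the model's contextual memory)
while True:
    question = input("You: ")
    if question.strip().lower() in ("quit", "exit"):
        break
    # Include all previous turns so the model can refer back to them.
    prompt = f"{history}[INST] {question} [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.9,
        top_k=50,
        top_p=0.9,
    )
    answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"Assistant: {answer}")
    # Close the exchange with </s> so the next turn continues the instruction format.
    history = f"{prompt}{answer}</s>"

Keep in mind that the model was compiled with a 4,096-token sequence length, so the accumulated history must stay within that budget; a production version would trim or summarize older turns.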
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.