How to Speed Up Model Training and Cut Down Billing Time with Amazon SageMaker
Optimizing the compilation and training of the open source GPT-2 model on the Stanford Sentiment Treebank v2 (SST2) dataset, using the features of the Amazon SageMaker Training Compiler.
- You can run this code on Amazon SageMaker Studio, an Amazon SageMaker notebook instance (the way we're using it now), or on your local computer where the AWS CLI is set up. If you use Amazon SageMaker Studio or an Amazon SageMaker notebook instance, make sure to select one of the PyTorch-based kernels, namely PyTorch 3 or conda_pytorch_p38, respectively.
- This notebook uses 2 x ml.g4dn.12xlarge instances with multiple GPUs. If you don't have enough quota, please refer to "Supported Regions and Quotas" to request an increase in service quotas for Amazon SageMaker resources.
!pip install "sagemaker>=2.108.0" botocore boto3 awscli pandas numpy --upgrade
import botocore
import boto3
import sagemaker
import pandas as pd
print(f"sagemaker: {sagemaker.__version__}")
print(f"boto3: {boto3.__version__}")
print(f"botocore: {botocore.__version__}")
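The printed versions can also be checked programmatically against the minimum from the pip requirement. The following is a minimal sketch, not part of the original notebook; the helper names `version_tuple` and `meets_minimum` are made up for illustration. It compares numeric components rather than raw strings, since a plain string comparison would rank "2.99" above "2.108".

```python
import re

MIN_SAGEMAKER_VERSION = "2.108.0"  # taken from the pip requirement above


def version_tuple(version: str) -> tuple:
    """Turn '2.108.0' (or '2.108.0.dev0') into (2, 108, 0) using the leading numeric parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_SAGEMAKER_VERSION) -> bool:
    """True when the installed version is at least the minimum, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)
```

In the notebook this could be called as `meets_minimum(sagemaker.__version__)`. If it returns False right after the upgrade, restarting the kernel so the new package is picked up is a reasonable first thing to try.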
import sagemaker
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
The dataset is selected with dataset_config_name in the notebook code, and the entry_point file of the Hugging Face estimator (run_clm.py in this example) reads that setting. The estimator arguments shared by both training jobs are defined as follows:
estimator_args = dict(
    source_dir="scripts",
    entry_point="run_clm.py",
    instance_type="ml.g4dn.12xlarge",
    instance_count=1,
    role=role,
    py_version="py38",
    volume_size=100,
    disable_profiler=True,  # Disabling SageMaker Profiler to avoid overheads during benchmarking
    debugger_hook_config=False,  # Disabling SageMaker Debugger to avoid overheads during benchmarking
    base_job_name="trcomp-pt-example",
    metric_definitions=[
        {"Name": "summary_train_runtime", "Regex": "'train_runtime': ([0-9.]*)"},
        {
            "Name": "summary_train_samples_per_second",
            "Regex": "'train_samples_per_second': ([0-9.]*)",
        },
        {"Name": "summary_train_steps_per_second", "Regex": "'train_steps_per_second': ([0-9.]*)"},
        {"Name": "summary_train_loss", "Regex": "'train_loss': ([0-9.]*)"},
        {"Name": "epoch", "Regex": "'epoch': ([0-9.]*)"},
        {"Name": "train_loss", "Regex": "'loss': ([0-9.]*)"},
        {"Name": "learning_rate", "Regex": "'learning_rate': ([0-9.]*)"},
    ],
)

# Since ml.g4dn.12xlarge instance has 4 GPUs, we set num_gpus_per_instance to 4
num_gpus_per_instance = 4
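To see what the metric_definitions regular expressions capture, it can help to try them locally before launching a job. The sketch below applies a subset of them with Python's `re` module to two log lines. The log lines are invented for illustration and only imitate the dictionary-style output a training script might print; the helper `extract_metrics` is likewise hypothetical and only approximates how SageMaker applies the patterns.

```python
import re

# Subset of the metric definitions from estimator_args above.
metric_definitions = [
    {"Name": "summary_train_loss", "Regex": "'train_loss': ([0-9.]*)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "'loss': ([0-9.]*)"},
    {"Name": "learning_rate", "Regex": "'learning_rate': ([0-9.]*)"},
]

# Hypothetical log output, invented for illustration only.
sample_log = (
    "{'loss': 3.2145, 'learning_rate': 4.5e-05, 'epoch': 1.0}\n"
    "{'train_runtime': 812.4, 'train_loss': 2.9871, 'epoch': 100.0}\n"
)


def extract_metrics(log_text, definitions):
    """Return {metric name: [captured values]} for each definition."""
    metrics = {}
    for definition in definitions:
        captured = re.findall(definition["Regex"], log_text)
        # [0-9.]* can match an empty string, so skip empty captures.
        metrics[definition["Name"]] = [float(value) for value in captured if value]
    return metrics


metrics = extract_metrics(sample_log, metric_definitions)
for name, values in metrics.items():
    print(name, values)
```

Two details show up in this sketch. The `'loss'` pattern does not match inside `'train_loss'`, because the pattern requires a quote directly before `loss`. The `learning_rate` pattern captures only `4.5` from `4.5e-05`, because the character class `[0-9.]` excludes the exponent; whether that matters depends on how the training script formats the value.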
hyperparameters = {
    "model_type": "gpt2",
    "tokenizer_name": "gpt2",
    "dataset_name": "glue",
    "dataset_config_name": "sst2",
    "do_train": True,
    "do_eval": False,
    "fp16": True,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 100,
    "block_size": 512,
    "overwrite_output_dir": True,
    "save_strategy": "no",
    "evaluation_strategy": "no",
    "logging_strategy": "epoch",
    "output_dir": "/opt/ml/model",
    "dataloader_drop_last": True,
}
per_device_train_batch_size sets the number of samples each GPU processes per training step. The value used here is the largest batch size that fits into the GPU memory of the ml.g4dn.12xlarge instance. If you change the model version, instance type, sequence length, or other parameters that affect memory consumption, you'll need to find the corresponding maximum batch size.
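One way to look for that maximum is to double the batch size until it no longer fits and then binary-search the gap. The sketch below is not from the original notebook. It assumes a hypothetical callback `fits(batch_size)` that you would implement yourself, for example by running a short training step and reporting whether it ran out of GPU memory, and it assumes that if a batch size fits, every smaller one fits too.

```python
def find_max_batch_size(fits, start=1, limit=1024):
    """Largest batch size in [start, limit] for which fits(size) is True.

    `fits` is a hypothetical, user-supplied callback and is assumed to be
    monotonic. Returns 0 if even `start` does not fit.
    """
    if not fits(start):
        return 0
    low = start        # largest size known to fit
    high = start * 2
    # Phase 1: double until a size fails or the limit is passed.
    while high <= limit and fits(high):
        low = high
        high *= 2
    # Smallest size known to fail, or the first size beyond the limit.
    high = min(high, limit + 1)
    # Phase 2: binary search between low (fits) and high (treated as not fitting).
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

Each call to `fits` may be expensive if it launches real work, so the doubling phase keeps the number of probes logarithmic in the final answer.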
distribution={"pytorchddp": {"enabled": True}}
The training script is run_clm.py, which you can find in the scripts folder.
from sagemaker.pytorch import PyTorch

# The original learning rate was set for a batch of 32. Here we scale learning rate linearly with an adjusted batch size
per_device_train_batch_size = 10
global_batch_size = (
    per_device_train_batch_size * num_gpus_per_instance * estimator_args["instance_count"]
)
learning_rate = float("5e-5") / 32 * global_batch_size

# Configure the training job
native_estimator = PyTorch(
    framework_version="1.11",
    hyperparameters=dict(
        **hyperparameters,
        **{
            "per_device_train_batch_size": per_device_train_batch_size,
            "learning_rate": learning_rate,
        },
    ),
    distribution={"pytorchddp": {"enabled": True}},
    **estimator_args,
)

# Start the training job
native_estimator.fit(wait=False)
native_estimator.latest_training_job.name
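The linear learning-rate scaling used in the code above can be written as a small standalone helper, which makes the arithmetic easy to check for other batch sizes. This is a sketch for illustration; the function name `scaled_learning_rate` and the two constants are names chosen here, with values taken from the code above.

```python
BASE_LEARNING_RATE = 5e-5  # reference learning rate from the code above
BASE_BATCH_SIZE = 32       # batch size that reference rate was set for


def scaled_learning_rate(per_device_batch_size, gpus_per_instance, instance_count):
    """Return (global_batch_size, learning_rate) under linear scaling."""
    global_batch_size = per_device_batch_size * gpus_per_instance * instance_count
    learning_rate = BASE_LEARNING_RATE / BASE_BATCH_SIZE * global_batch_size
    return global_batch_size, learning_rate


batch, lr = scaled_learning_rate(per_device_batch_size=10, gpus_per_instance=4, instance_count=1)
print(batch, lr)
```

With 10 samples per GPU, 4 GPUs, and 1 instance, the global batch size is 40 and the learning rate comes out at approximately 6.25e-05. Doubling the per-device batch size doubles both values.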
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

new_per_device_train_batch_size = 20
global_batch_size = (
    new_per_device_train_batch_size * num_gpus_per_instance * estimator_args["instance_count"]
)
learning_rate = float("5e-5") / 32 * global_batch_size

# Configure the training job
optimized_estimator = HuggingFace(
    compiler_config=TrainingCompilerConfig(),
    transformers_version="4.21",
    pytorch_version="1.11",
    hyperparameters=dict(
        **hyperparameters,
        **{
            "per_device_train_batch_size": new_per_device_train_batch_size,
            "learning_rate": learning_rate,
        },
    ),
    distribution={"pytorchxla": {"enabled": True}},
    **estimator_args,
)

# Start the training job
optimized_estimator.fit(wait=False)
optimized_estimator.latest_training_job.name
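Because the job is started with wait=False, its timing has to be read back after it finishes. The sketch below shows one possible way to do that with the boto3 `describe_training_job` call, which returns fields such as `TrainingTimeInSeconds`, `BillableTimeInSeconds`, and `FinalMetricDataList`. The helper names `summarize_job`, `billable_time_ratio`, and `fetch_summary` are hypothetical and not part of the original notebook, and the timing fields are only meaningful once the job has completed.

```python
def summarize_job(description):
    """Pick timing fields and final metrics out of a DescribeTrainingJob response dict."""
    final_metrics = {
        metric["MetricName"]: metric["Value"]
        for metric in description.get("FinalMetricDataList", [])
    }
    return {
        "status": description.get("TrainingJobStatus"),
        "training_seconds": description.get("TrainingTimeInSeconds"),
        "billable_seconds": description.get("BillableTimeInSeconds"),
        "metrics": final_metrics,
    }


def billable_time_ratio(baseline_summary, optimized_summary):
    """Baseline billable seconds divided by optimized billable seconds.

    Values above 1 mean the optimized job was billed for less time.
    """
    return baseline_summary["billable_seconds"] / optimized_summary["billable_seconds"]


def fetch_summary(job_name):
    """Describe a training job by name. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("sagemaker")
    return summarize_job(client.describe_training_job(TrainingJobName=job_name))
```

In the notebook, `fetch_summary(optimized_estimator.latest_training_job.name)` would return the summary for the job started above. Calling it for a second job as well and passing both summaries to `billable_time_ratio` gives a single number for comparing billing time.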
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.