Compiling 70B+ LLMs with NxDI on AWS Trainium using EKS
A step-by-step guide to compiling large models like DeepSeek-R1-Distill-Llama-70B on AWS Trainium instances via EKS.
This guide shows how to compile a large model such as DeepSeek-R1-Distill-Llama-70B on a trn1.32xlarge EC2 instance with optimal tensor parallelism and NxDI compilation, while monitoring progress using CloudWatch Container Insights. We recommend reviewing "Get started with DeepSeek R1 on AWS Inferentia and Trainium" before reading this guide. For more details on NxDI, please refer to the NxDI Overview.
- Deploy a NodePool and an EC2NodeClass that provision trn1.32xlarge instances. First, the NodePool:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: amd-neuron-trn1
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["trn1"]
...
Then deploy an EC2NodeClass that uses the EKS-managed DLAMI and allocates large disk space to store the compiled model artifacts (graph and weights):
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: amd-neuron
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-myrole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "mycluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "mycluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 900Gi
        volumeType: gp3
        encrypted: true
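Apply both manifests and confirm they are registered; Karpenter provisions the trn1.32xlarge node on demand once the compile Job below is pending (the file names here are illustrative):

# File names are illustrative; save the NodePool and EC2NodeClass manifests above first.
kubectl apply -f nodepool-amd-neuron-trn1.yaml -f ec2nodeclass-amd-neuron.yaml

# Confirm the Karpenter resources exist.
kubectl get nodepools.karpenter.sh amd-neuron-trn1
kubectl get ec2nodeclasses.karpenter.k8s.aws amd-neuron

# After the compile Job is scheduled, verify the instance type of the provisioned node.
kubectl get nodes -L node.kubernetes.io/instance-type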
- Build an OCI image based on the latest Neuron Containers that includes neuronx_distributed_inference and a script that will compile the model:
import os
from huggingface_hub import create_repo, upload_folder, login

hf_token = os.environ['HUGGINGFACE_TOKEN'].strip()
max_model_len = int(os.environ['MAX_MODEL_LEN'])
max_num_seqs = int(os.environ['MAX_NUM_SEQS'])
tensor_parallel_size = int(os.environ['TENSOR_PARALLEL_SIZE'])
model_name = os.environ['MODEL_NAME']
compiled_model_name = os.environ['COMPILED_MODEL_NAME']

# Use the NeuronX Distributed Inference backend and store the compiled
# artifacts under a directory named after the model.
os.environ['VLLM_NEURON_FRAMEWORK'] = "neuronx-distributed-inference"
os.environ['NEURON_COMPILED_ARTIFACTS'] = model_name

login(hf_token, add_to_git_credential=True)


def push_compiled_model_to_hf(
    local_dir: str,
    repo_id: str,
    commit_message: str,
    token: str = None,
):
    create_repo(
        repo_id=repo_id,
        token=token,
        exist_ok=True,
        private=False,
    )
    upload_folder(
        folder_path=local_dir,
        path_in_repo="",
        repo_id=repo_id,
        commit_message=commit_message,
    )


from vllm import LLM, SamplingParams

# Instantiating the LLM triggers NxDI compilation for the configured
# tensor parallelism, sequence length, and batch size.
llm = LLM(
    model=model_name,
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    device="neuron",
    override_neuron_config={},
    tensor_parallel_size=tensor_parallel_size)

# Quick sanity check on the compiled model.
prompts = [
    "The president of the United States is",
]
sampling_params = SamplingParams(top_k=10, temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Push the compiled artifacts so serving jobs can skip compilation.
push_compiled_model_to_hf(
    local_dir=model_name,
    repo_id=compiled_model_name,
    commit_message=f"Add NxD compiled model {compiled_model_name} from {model_name} for vLLM; max_num_seqs={max_num_seqs},max_model_len={max_model_len},tensor_parallel_size={tensor_parallel_size}",
)
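A minimal sketch of building and pushing that OCI image; the Neuron DLC base image tag is an assumption (use the latest image from the Neuron Containers documentation), and the script above is assumed to be saved as compile-vllm.py so the Job below can invoke it as /compile-vllm.py:

# Hypothetical Dockerfile; the base image tag is illustrative.
cat > Dockerfile <<'EOF'
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
RUN pip install --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx-distributed-inference
COPY compile-vllm.py /compile-vllm.py
EOF

# Build and push to the ECR repository referenced in the Job manifest below.
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin myaccountid.dkr.ecr.us-west-2.amazonaws.com
docker build -t myaccountid.dkr.ecr.us-west-2.amazonaws.com/myociimage:mylecrtag .
docker push myaccountid.dkr.ecr.us-west-2.amazonaws.com/myociimage:mylecrtag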
- Deploy a Kubernetes Job that uses the NodePool, the DLAMI, the Neuron Containers image, and the compile script:
apiVersion: batch/v1
kind: Job
metadata:
  name: compile-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/nodepool: amd-neuron-trn1
      serviceAccountName: appsimulator
      schedulerName: my-scheduler
      containers:
        - name: app
          image: myaccountid.dkr.ecr.us-west-2.amazonaws.com/myociimage:mylecrtag
          imagePullPolicy: Always
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
          command:
            - /bin/bash
            - "-exc"
            - |
              set -x
              pip install --upgrade pip
              pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
              pip install --upgrade neuronx-cc transformers_neuronx neuronx_distributed sentence_transformers transformers torch-neuronx accelerate triton protobuf
              # Temporary clone: in the future, this repository will be included as a dependency, eliminating the need for cloning; more info in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html#setup
              git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
              cd upstreaming-to-vllm
              pip install -r requirements-neuron.txt
              VLLM_TARGET_DEVICE="neuron" pip install -e .
              python /compile-vllm.py
          resources:
            limits:
              memory: "465Gi"
              aws.amazon.com/neuron: "16"
            requests:
              memory: "465Gi"
              aws.amazon.com/neuron: "16"
          env:
            - name: PYTHONWARNINGS
              value: "ignore::UserWarning"
            - name: MODEL_NAME
              value: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
            - name: MAX_MODEL_LEN
              value: "128"
            - name: MAX_NUM_SEQS
              value: "1"
            - name: TENSOR_PARALLEL_SIZE
              value: "32"
            - name: COMPILED_MODEL_NAME
              value: "myhfuser/DeepSeek-R1-Distill-Llama-70B-nxd-tp32-len128-bs1"
            - name: HUGGINGFACE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secrets
                  key: HUGGINGFACE_TOKEN
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
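Submit the Job and follow its logs (the manifest file name is illustrative); Karpenter should provision a trn1.32xlarge node from the NodePool above before the pod starts:

kubectl apply -f compile-job.yaml

# Wait for the node to be provisioned and the pod to start.
kubectl get pods -l job-name=compile-job -w

# Stream the compilation logs.
kubectl logs -f job/compile-job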
- Discover the memory consumption of the compile job in CloudWatch Container Insights, as shown in the plot below.
[Figure: Tracking compilation progress with CloudWatch Container Insights]
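If Container Insights is not yet enabled on the cluster, one option (an assumption about your setup; adjust to however you manage EKS add-ons) is the Amazon CloudWatch Observability add-on:

# Assumes the cluster is named "mycluster", matching the karpenter.sh/discovery tags above.
aws eks create-addon \
  --cluster-name mycluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-west-2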
You can also follow progress in the job logs by counting the "initializing tensor model parallel" messages, which reflect the TENSOR_PARALLEL_SIZE setting. By the end of this job, expect to see the compiled model under your Hugging Face user:
$ kubectl logs compile-job-p8qwv | grep "initializing tensor model parallel" | wc -l
6
- Now you can run another job that downloads the compiled model from the Hugging Face repo and invokes it, without waiting for the model to be compiled again; see the sketch below.
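A minimal sketch of that serving-side job, assuming the same Neuron vLLM fork installed by the compile Job and that pointing NEURON_COMPILED_ARTIFACTS at the downloaded repo lets vLLM load the precompiled graph instead of recompiling; the settings must match the compile-time values (MAX_MODEL_LEN=128, MAX_NUM_SEQS=1, TENSOR_PARALLEL_SIZE=32):

import os
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams

# Download the precompiled artifacts pushed by the compile job (repo name from above).
compiled_dir = snapshot_download("myhfuser/DeepSeek-R1-Distill-Llama-70B-nxd-tp32-len128-bs1")

# Assumption: these environment variables make vLLM reuse the compiled graph.
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["NEURON_COMPILED_ARTIFACTS"] = compiled_dir

# Configuration must match the values used at compile time.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    max_model_len=128,
    max_num_seqs=1,
    tensor_parallel_size=32,
    device="neuron",
    override_neuron_config={},
)

outputs = llm.generate(
    ["The president of the United States is"],
    SamplingParams(top_k=10, temperature=0.8, top_p=0.95),
)
print(outputs[0].outputs[0].text)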
- For details on the configuration parameters for vLLM, refer to the Neuron continuous batching guide.
- For a full code sample on serving models with vLLM or HuggingFace pipelines, please refer to our published AWS sample. We’re showcasing specific examples like this one and welcome your feedback on adding more.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.