Compiling 70B+ LLMs with NxDI on AWS Trainium using EKS

Step-by-step guide to compile large models like DeepSeek-R1-Distill-Llama-70B on AWS Trainium instances via EKS.

Yahav Biran
Amazon Employee
Published Mar 25, 2025
A customer aiming to serve models with tens of billions of parameters, while saving up to 50% on accelerated compute costs, can use EC2 Trainium instances alongside Amazon EKS for scalable, efficient container orchestration and seamless AWS integration, with real-time monitoring and proactive troubleshooting through services like CloudWatch or OpenTelemetry. Model compilation is memory intensive and requires careful configuration to maximize available resources while leaving sufficient capacity for monitoring tools.

This post demonstrates how to compile the DeepSeek R1 70B distilled model on a trn1.32xlarge EC2 instance with the appropriate tensor parallelism and NxDI compilation settings, while monitoring progress with CloudWatch Container Insights. We recommend reviewing "Get started with DeepSeek R1 on AWS Inferentia and Trainium" before reading this guide. For more details on NxDI, please refer to the NxDI Overview.
  1. Deploy a NodePool and an EC2NodeClass that provision trn1.32xlarge instances:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: amd-neuron-trn1
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["trn1"]
...
Also configure an EC2NodeClass that uses the EKS-managed DLAMI and allocates large disk space to store the compiled model artifacts (graph and weights):
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: amd-neuron
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-myrole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "mycluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "mycluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 900Gi
        volumeType: gp3
        encrypted: true
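With both manifests saved locally (the file names below are placeholders), apply them and confirm the resources are registered; Karpenter launches a trn1.32xlarge only once a pending pod targets this NodePool:

# Apply the provisioning resources (file names are illustrative)
kubectl apply -f nodepool-amd-neuron-trn1.yaml
kubectl apply -f nodeclass-amd-neuron.yaml

# Confirm the resources exist; a node appears only after a pod that
# selects this NodePool is scheduled
kubectl get nodepools,ec2nodeclasses
kubectl get nodes -l karpenter.sh/nodepool=amd-neuron-trn1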
  2. Build an OCI image based on the latest Neuron Containers that includes neuronx_distributed_inference and a script that will compile the model (a minimal, illustrative build flow follows the script below):
import os
from huggingface_hub import create_repo, upload_folder, login

# Compilation parameters are injected through the Job's environment variables
hf_token = os.environ['HUGGINGFACE_TOKEN'].strip()
max_model_len = int(os.environ['MAX_MODEL_LEN'])
max_num_seqs = int(os.environ['MAX_NUM_SEQS'])
tensor_parallel_size = int(os.environ['TENSOR_PARALLEL_SIZE'])
model_name = os.environ['MODEL_NAME']
compiled_model_name = os.environ['COMPILED_MODEL_NAME']
os.environ['VLLM_NEURON_FRAMEWORK'] = "neuronx-distributed-inference"
os.environ['NEURON_COMPILED_ARTIFACTS'] = model_name

login(hf_token, add_to_git_credential=True)

def push_compiled_model_to_hf(
    local_dir: str,
    repo_id: str,
    commit_message: str,
    token: str = None,
):
    create_repo(
        repo_id=repo_id,
        token=token,
        exist_ok=True,
        private=False
    )

    upload_folder(
        folder_path=local_dir,
        path_in_repo="",
        repo_id=repo_id,
        commit_message=commit_message
    )

# Instantiating the LLM triggers the NxDI compilation for the requested
# tensor-parallel degree, sequence length, and batch size
from vllm import LLM, SamplingParams
llm = LLM(
    model=model_name,
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    device="neuron",
    override_neuron_config={},
    tensor_parallel_size=tensor_parallel_size)
prompts = [
    "The president of the United States is",
]
sampling_params = SamplingParams(top_k=10, temperature=0.8, top_p=0.95)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Upload the compiled artifacts (graph and weights) to the Hugging Face Hub
push_compiled_model_to_hf(
    local_dir=model_name,
    repo_id=compiled_model_name,
    commit_message=f"Add NxD compiled model {compiled_model_name} from {model_name} for vLLM; max_num_seqs={max_num_seqs},max_model_len={max_model_len},tensor_parallel_size={tensor_parallel_size}"
)
  3. Deploy a Kubernetes Job that uses the NodePool, the DLAMI, the Neuron Containers image, and the compile script:
apiVersion: batch/v1
kind: Job
metadata:
  name: compile-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/nodepool: amd-neuron-trn1
      serviceAccountName: appsimulator
      schedulerName: my-scheduler
      containers:
        - name: app
          image: myaccountid.dkr.ecr.us-west-2.amazonaws.com/myociimage:mylecrtag
          imagePullPolicy: Always
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
          command:
            - /bin/bash
            - "-exc"
            - |
              set -x
              pip install --upgrade pip
              pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
              pip install --upgrade neuronx-cc transformers_neuronx neuronx_distributed sentence_transformers transformers torch-neuronx accelerate triton protobuf
              # Temporary clone: in the future, this repository will be included as a dependency, eliminating the need for cloning; more info in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html#setup
              git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
              cd upstreaming-to-vllm
              pip install -r requirements-neuron.txt
              VLLM_TARGET_DEVICE="neuron" pip install -e .
              python /compile-vllm.py
          resources:
            limits:
              memory: "465Gi"
              aws.amazon.com/neuron: "16"
            requests:
              memory: "465Gi"
              aws.amazon.com/neuron: "16"
          env:
            - name: PYTHONWARNINGS
              value: "ignore::UserWarning"
            - name: MODEL_NAME
              value: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
            - name: MAX_MODEL_LEN
              value: "128"
            - name: MAX_NUM_SEQS
              value: "1"
            - name: TENSOR_PARALLEL_SIZE
              value: "32"
            - name: COMPILED_MODEL_NAME
              value: "myhfuser/DeepSeek-R1-Distill-Llama-70B-nxd-tp32-len128-bs1"
            - name: HUGGINGFACE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secrets
                  key: HUGGINGFACE_TOKEN
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
Note that setting resources.limits.memory to "465Gi" is crucial because the default memory requests are much lower than what the compilation process requires. In Kubernetes, the kubelet's eviction manager monitors memory usage closely: if a pod exceeds its allocated memory or node-level memory pressure is detected, it may terminate the pod to reclaim resources. The high request and limit ensure that the memory-intensive compilation process is not preempted by other workloads (such as monitoring agents) competing for resources, preventing unwanted evictions.
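With the node provisioning and image in place, submit the Job and keep an eye on the pod; the manifest file name below is a placeholder:

# Submit the compile Job and watch the pod start on the trn1.32xlarge node
kubectl apply -f compile-job.yaml
kubectl get pods -l job-name=compile-job -w

# If the pod disappears, check whether it was evicted under memory pressure
kubectl get events --field-selector reason=Evicted
kubectl top pod -l job-name=compile-job   # requires metrics-server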
  4. Track the memory consumption of the compile job in CloudWatch Container Insights, shown in the plot below.
    Figure: Tracking compilation progress with CloudWatch Container Insights
The number of spikes in memory utilization is expected to match the value of the configured TENSOR_PARALLEL_SIZE. By the end of the job, you should see the compiled model in your Hugging Face account.
You can also track the progress via kubectl:
$ kubectl logs compile-job-p8qwv | grep "initializing tensor model parallel" | wc -l
6
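You can also pull the same memory series with the AWS CLI. The sketch below assumes Container Insights publishes the standard pod_memory_utilization metric for the cluster "mycluster"; verify the exact dimensions with list-metrics before querying:

# Discover the dimensions Container Insights publishes for the compile job's pod
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --metric-name pod_memory_utilization \
  --dimensions Name=ClusterName,Value=mycluster

# Fetch the last hour of peak memory utilization (dimension values are assumptions)
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name pod_memory_utilization \
  --dimensions Name=ClusterName,Value=mycluster Name=Namespace,Value=default Name=PodName,Value=compile-job \
  --statistics Maximum --period 60 \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"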
  5. Now you can run another job that downloads the compiled model from the Hugging Face repo and invokes it, without repeating the compilation step; a minimal sketch of the inference side follows below.
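The Python side of that inference job could look like the following sketch. It assumes the pushed repository contains both the model configuration and the NxDI-compiled artifacts, and it reuses the same settings the artifacts were compiled with (batch size 1, sequence length 128, tensor parallelism 32):

# Minimal sketch (assumed names): run inside a similar Neuron pod or Job
import os
from huggingface_hub import snapshot_download

# The repo name matches COMPILED_MODEL_NAME from the compile Job
compiled_repo = "myhfuser/DeepSeek-R1-Distill-Llama-70B-nxd-tp32-len128-bs1"
local_dir = snapshot_download(repo_id=compiled_repo)

# Point vLLM at the precompiled NxDI artifacts so it skips compilation
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["NEURON_COMPILED_ARTIFACTS"] = local_dir

from vllm import LLM, SamplingParams

# The runtime settings must match the values the artifacts were compiled with
llm = LLM(
    model=local_dir,
    max_num_seqs=1,
    max_model_len=128,
    tensor_parallel_size=32,
    device="neuron",
    override_neuron_config={},
)
outputs = llm.generate(["The president of the United States is"],
                       SamplingParams(top_k=10, temperature=0.8, top_p=0.95))
for output in outputs:
    print(output.outputs[0].text)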
Learn more:
  • For details on the configuration parameters for vLLM, refer to the Neuron continuous batching guide.
  • For a full code sample on serving models with vLLM or HuggingFace pipelines, please refer to our published AWS sample. We’re showcasing specific examples like this one and welcome your feedback on adding more.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
