Compiling 70B+ LLMs with NxDI on AWS Trainium using EKS

Step-by-step guide to compile large models like DeepSeek-R1-Distill-Llama-70B on AWS Trainium instances via EKS.

Yahav Biran
Amazon Employee
Published Mar 25, 2025
A customer aiming to serve models with tens of billions of parameters, while saving up to 50% on accelerated compute costs, can use EC2 Trainium instances alongside Amazon EKS for scalable, efficient container orchestration and seamless AWS integration, with real-time monitoring and proactive troubleshooting through services like CloudWatch or OpenTelemetry. Model compilation is memory intensive and requires careful configuration to maximize available resources while leaving sufficient capacity for monitoring tools.

This post demonstrates how to compile the DeepSeek R1 70B distilled model on a trn1.32xlarge EC2 instance with the appropriate tensor parallelism and NxDI compilation settings, while monitoring progress with CloudWatch Container Insights. We recommend reviewing "Get started with DeepSeek R1 on AWS Inferentia and Trainium" before reading this guide. For more details on NxDI, please refer to the NxDI Overview.
  1. Deploy a NodePool and an EC2NodeClass that provision trn1.32xlarge instances:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: amd-neuron-trn1
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["trn1"]
...
Also configure an EC2NodeClass that uses the EKS-managed DLAMI and allocates large disk space to store the compiled model artifacts (graph and weights):
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: amd-neuron
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-myrole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "mycluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "mycluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 900Gi
        volumeType: gp3
        encrypted: true
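With both manifests saved locally (the file names below are placeholders), apply them and confirm the resources are registered; Karpenter launches a trn1.32xlarge only once a pending pod targets this NodePool:

# Apply the provisioning resources (file names are illustrative)
kubectl apply -f nodepool-amd-neuron-trn1.yaml
kubectl apply -f nodeclass-amd-neuron.yaml

# Confirm the resources exist; a node appears only after a pod that
# selects this NodePool is scheduled
kubectl get nodepools,ec2nodeclasses
kubectl get nodes -l karpenter.sh/nodepool=amd-neuron-trn1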
  2. Build an OCI image based on the latest Neuron Containers that includes neuronx_distributed_inference and a script that will compile the model (a minimal, illustrative build flow follows the script below):
import os
from huggingface_hub import create_repo, upload_folder, login

# Compilation parameters are injected through the Job's environment variables
hf_token = os.environ['HUGGINGFACE_TOKEN'].strip()
max_model_len = int(os.environ['MAX_MODEL_LEN'])
max_num_seqs = int(os.environ['MAX_NUM_SEQS'])
tensor_parallel_size = int(os.environ['TENSOR_PARALLEL_SIZE'])
model_name = os.environ['MODEL_NAME']
compiled_model_name = os.environ['COMPILED_MODEL_NAME']
os.environ['VLLM_NEURON_FRAMEWORK'] = "neuronx-distributed-inference"
os.environ['NEURON_COMPILED_ARTIFACTS'] = model_name

login(hf_token, add_to_git_credential=True)

def push_compiled_model_to_hf(
    local_dir: str,
    repo_id: str,
    commit_message: str,
    token: str = None,
):
    create_repo(
        repo_id=repo_id,
        token=token,
        exist_ok=True,
        private=False
    )

    upload_folder(
        folder_path=local_dir,
        path_in_repo="",
        repo_id=repo_id,
        commit_message=commit_message
    )

# Instantiating the LLM triggers the NxDI compilation for the requested
# tensor-parallel degree, sequence length, and batch size
from vllm import LLM, SamplingParams
llm = LLM(
    model=model_name,
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    device="neuron",
    override_neuron_config={},
    tensor_parallel_size=tensor_parallel_size)
prompts = [
    "The president of the United States is",
]
sampling_params = SamplingParams(top_k=10, temperature=0.8, top_p=0.95)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Upload the compiled artifacts (graph and weights) to the Hugging Face Hub
push_compiled_model_to_hf(
    local_dir=model_name,
    repo_id=compiled_model_name,
    commit_message=f"Add NxD compiled model {compiled_model_name} from {model_name} for vLLM; max_num_seqs={max_num_seqs},max_model_len={max_model_len},tensor_parallel_size={tensor_parallel_size}"
)
  3. Deploy a Kubernetes Job that uses the NodePool, the DLAMI, the Neuron Containers image, and the compile script:
apiVersion: batch/v1
kind: Job
metadata:
  name: compile-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/nodepool: amd-neuron-trn1
      serviceAccountName: appsimulator
      schedulerName: my-scheduler
      containers:
        - name: app
          image: myaccountid.dkr.ecr.us-west-2.amazonaws.com/myociimage:mylecrtag
          imagePullPolicy: Always
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
          command:
            - /bin/bash
            - "-exc"
            - |
              set -x
              pip install --upgrade pip
              pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
              pip install --upgrade neuronx-cc transformers_neuronx neuronx_distributed sentence_transformers transformers torch-neuronx accelerate triton protobuf
              # Temporary clone: in the future, this repository will be included as a dependency, eliminating the need for cloning; more info in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html#setup
              git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
              cd upstreaming-to-vllm
              pip install -r requirements-neuron.txt
              VLLM_TARGET_DEVICE="neuron" pip install -e .
              python /compile-vllm.py
          resources:
            limits:
              memory: "465Gi"
              aws.amazon.com/neuron: "16"
            requests:
              memory: "465Gi"
              aws.amazon.com/neuron: "16"
          env:
            - name: PYTHONWARNINGS
              value: "ignore::UserWarning"
            - name: MODEL_NAME
              value: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
            - name: MAX_MODEL_LEN
              value: "128"
            - name: MAX_NUM_SEQS
              value: "1"
            - name: TENSOR_PARALLEL_SIZE
              value: "32"
            - name: COMPILED_MODEL_NAME
              value: "myhfuser/DeepSeek-R1-Distill-Llama-70B-nxd-tp32-len128-bs1"
            - name: HUGGINGFACE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secrets
                  key: HUGGINGFACE_TOKEN
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
Note that setting resources.limits.memory to "465Gi" is crucial because the default memory requests are much lower than what the compilation process requires. In Kubernetes, the kubelet's eviction manager monitors memory usage closely: if a pod exceeds its allocated memory or node-level memory pressure is detected, it may terminate the pod to reclaim resources. The high request and limit ensure that the memory-intensive compilation process is not preempted by other workloads (such as monitoring agents) competing for resources, preventing unwanted evictions.
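With the node provisioning and image in place, submit the Job and keep an eye on the pod; the manifest file name below is a placeholder:

# Submit the compile Job and watch the pod start on the trn1.32xlarge node
kubectl apply -f compile-job.yaml
kubectl get pods -l job-name=compile-job -w

# If the pod disappears, check whether it was evicted under memory pressure
kubectl get events --field-selector reason=Evicted
kubectl top pod -l job-name=compile-job   # requires metrics-server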
  4. Track the memory consumption of the compile job in CloudWatch Container Insights, shown in the plot below.
    Figure: Tracking compilation progress with CloudWatch Container Insights
The number of spikes in memory utilization is expected to match the value of the configured TENSOR_PARALLEL_SIZE. By the end of the job, you should see the compiled model in your Hugging Face account.
You can also track the progress via kubectl:
$ kubectl logs compile-job-p8qwv | grep "initializing tensor model parallel" | wc -l
6
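You can also pull the same memory series with the AWS CLI. The sketch below assumes Container Insights publishes the standard pod_memory_utilization metric for the cluster "mycluster"; verify the exact dimensions with list-metrics before querying:

# Discover the dimensions Container Insights publishes for the compile job's pod
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --metric-name pod_memory_utilization \
  --dimensions Name=ClusterName,Value=mycluster

# Fetch the last hour of peak memory utilization (dimension values are assumptions)
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name pod_memory_utilization \
  --dimensions Name=ClusterName,Value=mycluster Name=Namespace,Value=default Name=PodName,Value=compile-job \
  --statistics Maximum --period 60 \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"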
  5. Now you can run another job that downloads the compiled model from the Hugging Face repo and invokes it, without repeating the compilation step; a minimal sketch of the inference side follows below.
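The Python side of that inference job could look like the following sketch. It assumes the pushed repository contains both the model configuration and the NxDI-compiled artifacts, and it reuses the same settings the artifacts were compiled with (batch size 1, sequence length 128, tensor parallelism 32):

# Minimal sketch (assumed names): run inside a similar Neuron pod or Job
import os
from huggingface_hub import snapshot_download

# The repo name matches COMPILED_MODEL_NAME from the compile Job
compiled_repo = "myhfuser/DeepSeek-R1-Distill-Llama-70B-nxd-tp32-len128-bs1"
local_dir = snapshot_download(repo_id=compiled_repo)

# Point vLLM at the precompiled NxDI artifacts so it skips compilation
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["NEURON_COMPILED_ARTIFACTS"] = local_dir

from vllm import LLM, SamplingParams

# The runtime settings must match the values the artifacts were compiled with
llm = LLM(
    model=local_dir,
    max_num_seqs=1,
    max_model_len=128,
    tensor_parallel_size=32,
    device="neuron",
    override_neuron_config={},
)
outputs = llm.generate(["The president of the United States is"],
                       SamplingParams(top_k=10, temperature=0.8, top_p=0.95))
for output in outputs:
    print(output.outputs[0].text)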
Learn more:
  • For details on the configuration parameters for vLLM, refer to the Neuron continuous batching guide.
  • For a full code sample on serving models with vLLM or HuggingFace pipelines, please refer to our published AWS sample. We’re showcasing specific examples like this one and welcome your feedback on adding more.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
