
Compiling 70B+ LLMs with NxDI on AWS Trainium using EKS

A step-by-step guide to compiling large models like DeepSeek-R1-Distill-Llama-70B on AWS Trainium instances via EKS.

Yahav Biran
Amazon Employee
Published Mar 25, 2025
A customer aiming to serve models with tens of billions of parameters, while saving up to 50% on accelerated compute costs, can use EC2 Trainium alongside EKS for scalable, efficient container orchestration and seamless AWS integration, with real-time monitoring and proactive troubleshooting through services like CloudWatch or OpenTelemetry. Model compilation is memory intensive and requires careful configuration to make the most of the instance's resources while leaving sufficient capacity for monitoring tools. This post demonstrates how to compile the DeepSeek-R1-Distill-Llama-70B model on a trn1.32xlarge EC2 instance with the appropriate tensor parallelism and NxDI compilation settings, while monitoring progress with CloudWatch Container Insights. We recommend reviewing "Get started with DeepSeek R1 on AWS Inferentia and Trainium" before reading this guide. For more details on NxDI, please refer to the NxDI Overview.
  1. Deploy a NodePool and EC2NodeClass that provision trn1.32xlarge instances
Also configure the EC2NodeClass to use the EKS-managed DLAMI and allocate a large disk to store the compiled model artifacts (graph and weights), for example:
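A minimal sketch of the two resources is shown below, assuming the Karpenter v1 API; the resource names, IAM role, discovery tags, and the 500Gi root volume size are placeholders to adapt to your cluster.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: trn1-compile                  # placeholder name
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: trn1-compile
          requirements:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["trn1.32xlarge"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["on-demand"]
    ---
    apiVersion: karpenter.k8s.aws/v1
    kind: EC2NodeClass
    metadata:
      name: trn1-compile
    spec:
      amiSelectorTerms:
        - alias: al2@latest               # accelerated (Neuron) AMI variant is used for trn1 instance types
      role: "<karpenter-node-role>"       # placeholder IAM role
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: "<cluster-name>"
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: "<cluster-name>"
      blockDeviceMappings:
        - deviceName: /dev/xvda
          ebs:
            volumeSize: 500Gi             # large root volume for the compiled graph and weights
            volumeType: gp3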
  2. Build an OCI image based on the latest Neuron Containers that includes neuronx_distributed_inference and a script that will compile the model:
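For example, a minimal Dockerfile sketch; the base-image tag and the compile_model.py script are placeholders (pin the tag to the latest Neuron DLC release, and note that recent Neuron containers may already ship neuronx-distributed-inference):
    # Base image from the AWS Neuron Deep Learning Containers; pin the tag to the latest release
    FROM public.ecr.aws/neuron/pytorch-inference-neuronx:<latest-neuron-sdk-tag>

    # Install NxDI from the Neuron pip repository (skip if the base image already includes it)
    RUN pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
        neuronx-distributed-inference

    # compile_model.py is a hypothetical script that traces/compiles the model with NxDI
    # and uploads the resulting graph and weights to your Hugging Face repository
    COPY compile_model.py /opt/compile_model.py
    ENTRYPOINT ["python", "/opt/compile_model.py"]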
  3. Deploy a Kubernetes Job that uses the NodePool, the DLAMI, the Neuron container image, and the compile script
Note that setting resources.limits.memory to "465Gi" is crucial because the default memory requests are far lower than what the compilation process requires. In Kubernetes, the admission controller and the kubelet's eviction logic monitor memory usage closely: if a pod exceeds its allocated memory or node-level memory pressure is detected, the kubelet's eviction manager may terminate the pod to reclaim resources. This high limit ensures that the memory-intensive compilation process is not preempted by other workloads (such as monitoring agents) competing for resources, preventing unwanted evictions.
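For reference, a minimal sketch of such a Job, assuming the image from the previous step was pushed to ECR; the job name, image URI, MODEL_ID, TENSOR_PARALLEL_SIZE value, and the hf-token secret are illustrative placeholders:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: compile-deepseek-r1-70b
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node.kubernetes.io/instance-type: trn1.32xlarge
          containers:
            - name: compile
              image: <account>.dkr.ecr.<region>.amazonaws.com/neuron-nxdi-compile:latest
              env:
                - name: MODEL_ID
                  value: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
                - name: TENSOR_PARALLEL_SIZE       # 32 NeuronCores on trn1.32xlarge
                  value: "32"
                - name: HF_TOKEN                   # placeholder secret holding your Hugging Face token
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
              resources:
                requests:
                  aws.amazon.com/neuron: 16        # all 16 Neuron devices on the instance
                  memory: "465Gi"
                limits:
                  aws.amazon.com/neuron: 16
                  memory: "465Gi"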
  4. Observe the memory consumption of the compile job in CloudWatch Container Insights, as shown in the plot below
    Figure: Tracking compilation progress with CloudWatch Container Insights (memory usage of the compilation job)
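You can also query the same data directly; a CloudWatch Logs Insights sketch against the Container Insights performance log group (/aws/containerinsights/<cluster-name>/performance) is shown below, where the pod-name filter is a placeholder:
    fields @timestamp, PodName, pod_memory_utilization
    | filter Type = "Pod" and PodName like /compile-deepseek/
    | sort @timestamp asc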
The number of spikes in memory utilization is expected to match the configured TENSOR_PARALLEL_SIZE value.
By the end of this job, expect to see the compiled model in your Hugging Face account.
You can also track the progress via kubectl:
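For example, assuming the placeholder job name from the sketch above:
    # stream the compilation logs
    kubectl logs -f job/compile-deepseek-r1-70b
    # watch the Job status until it completes
    kubectl get job compile-deepseek-r1-70b -w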
  5. Now you can run another job that downloads the compiled model from the Hugging Face repo and invokes it, without waiting for another compilation
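A minimal, illustrative sketch of such a Job; the image, the invoke_model.py script, the COMPILED_MODEL_REPO variable, and the Hugging Face repo name are hypothetical placeholders (see the AWS sample linked below for a complete serving example):
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: invoke-deepseek-r1-70b
    spec:
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node.kubernetes.io/instance-type: trn1.32xlarge
          containers:
            - name: invoke
              image: <account>.dkr.ecr.<region>.amazonaws.com/neuron-nxdi-compile:latest
              command: ["python", "/opt/invoke_model.py"]   # hypothetical companion script to compile_model.py
              env:
                - name: COMPILED_MODEL_REPO                 # hypothetical variable read by the script
                  value: <your-hf-user>/DeepSeek-R1-Distill-Llama-70B-neuron
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
              resources:
                limits:
                  aws.amazon.com/neuron: 16
                  memory: "465Gi"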
Learn more:
  • For details on the configuration parameters for vLLM, refer to the Neuron continuous batching guide.
  • For a full code sample on serving models with vLLM or HuggingFace pipelines, please refer to our published AWS sample. We’re showcasing specific examples like this one and welcome your feedback on adding more.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
