
Deploying the DeepSeek-V3 Model (full version) in Amazon EKS Using vLLM and LWS

This guide provides a streamlined process to deploy the full 671B parameter DeepSeek-V3 MoE model on Amazon EKS using vLLM and LeaderWorkerSet API

Published Apr 21, 2025

Who is this Guide for?

This guide assumes you:
  • Have intermediate Kubernetes experience (kubectl, Helm)
  • Are familiar with AWS CLI and EKS
  • Understand basic GPU concepts
This guide provides a streamlined process to deploy the 671B parameter DeepSeek-V3 MoE model on Amazon EKS using vLLM and the LeaderWorkerSet API (LWS). We will be deploying on Amazon EC2 G6e instances as they are more readily available, and because we want to see how to load a model across multiple nodes.
The main idea here is to peel the onion: a practical demonstration of exactly how folks are deploying these large models, so we can understand all the pieces and how they fit together.
The latest versions of the files are available here:
https://github.com/dims/skunkworks/tree/main/v3

Prerequisites

  • AWS CLI: For managing AWS resources.
  • eksctl/eksdemo: To create and manage EKS clusters.
  • kubectl: The command-line tool for Kubernetes.
  • helm: Kubernetes’ package manager.
  • jq: For parsing JSON.
  • Docker: For building container images.
  • Hugging Face Hub access: You’ll need a token to download the model.

Creating a suitable EKS Cluster

We will use an AWS account with sufficient quota for four g6e.48xlarge instances (192 vCPUs, 1,536 GiB RAM, and 8 NVIDIA L40S Tensor Core GPUs with 48 GB of memory each).
You can use eksdemo for example:
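Something along these lines works; the cluster name is arbitrary, and flag names can vary between eksdemo releases, so double-check with eksdemo create cluster --help:
  # Hedged sketch: adjust the node volume size and EFA settings in the
  # generated configuration before actually creating the cluster.
  eksdemo create cluster deepseek-v3 \
    --instance g6e.48xlarge \
    --nodes 4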
If you want to use eksctl instead, run the same command as above with --dry-run to get the equivalent eksctl command and configuration YAML.
Essentially, ensure you have enough GPU nodes, allocate a large volume size per node, and enable EFA. You can use any tool of your choice, but remember that you may have to adjust the deployment YAML (for example, for taints) as needed.
🔍 Why EFA? Elastic Fabric Adapter accelerates inter-node communication, critical for multi-GPU inference.

A container image with EFA

Ideally you would just use a public image from the vllm folks:
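The vLLM project publishes an OpenAI-compatible server image on Docker Hub; the tag below is illustrative, so pin whichever release you actually want:
  docker pull vllm/vllm-openai:latest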
However, we want to use EFA because Elastic Fabric Adapter (EFA) enhances inter-node communication for high-performance computing and machine learning applications within Amazon EKS clusters.
In the following Dockerfile, we start by grabbing a powerful CUDA base image, then go on an installation spree, pulling in EFA, NCCL, and AWS-OFI-NCCL, while instructing apt to hang onto its downloaded packages. Once everything’s compiled, we carefully graft these freshly built libraries onto the vLLM image above.
🛠 GPU Compatibility: The COMPUTE_CAPABILITY_VERSION=90 setting is specific to L40S GPUs. Adjust this for your hardware.
Note that we also install huggingface_hub with the high-speed hf_transfer component and update the ray package. There is also a ray_init.sh script that starts vLLM and Ray on the leader and worker nodes brought up by LWS.
Both these files are adaptations of code written by various folks and are available here and here.
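Once the Dockerfile and ray_init.sh are in place, building and pushing the image to ECR looks roughly like this (the repository name, region, and <account-id> are placeholders; the image reference in deepseek-lws.yaml must match whatever you push):
  aws ecr create-repository --repository-name vllm-efa --region us-west-2
  aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-2.amazonaws.com
  docker build -t <account-id>.dkr.ecr.us-west-2.amazonaws.com/vllm-efa:latest .
  docker push <account-id>.dkr.ecr.us-west-2.amazonaws.com/vllm-efa:latest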

Verify the cluster

Step 1: Check Daemonsets

Check if the NVIDIA and EFA device plugin daemonsets are running:
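A broad filter is the safest bet, since the exact daemonset names depend on how the device plugins were installed:
  kubectl get daemonsets -A | grep -Ei 'nvidia|efa'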

Step 2: Verify Node Resources

Check if the nodes correctly advertise the GPU count and EFA capacity as allocatable resources:
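For example, with jq (nvidia.com/gpu and vpc.amazonaws.com/efa are the resource names advertised by the NVIDIA and EFA device plugins):
  kubectl get nodes -o json | jq -r '
    .items[] | [
      .metadata.name,
      (.status.allocatable["nvidia.com/gpu"]        // "0"),
      (.status.allocatable["vpc.amazonaws.com/efa"] // "0")
    ] | @tsv'
Each g6e.48xlarge node should report 8 GPUs.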

Step 3: Inspect Hardware

Install the node-shell kubectl/krew plugin to peek into the nodes (it will be handy later) using kubectl krew install node-shell
Now check if the GPU and EFA devices are actually present on each node:
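One way is to run nvidia-smi and list the EFA (InfiniBand uverbs) devices on the host; <node-name> is a placeholder for a name from kubectl get nodes:
  kubectl node-shell <node-name> -- bash -c 'nvidia-smi -L && ls -l /dev/infiniband/'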

Run the deepseek-v3 workload

Install LWS Controller

Use helm to install LWS:
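The LWS chart is published as an OCI artifact on registry.k8s.io; the version below is illustrative, so substitute the current release:
  helm install lws oci://registry.k8s.io/lws/charts/lws \
    --version 0.6.1 \
    --namespace lws-system \
    --create-namespace --wait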
Check if the LWS pods are running using kubectl get pods -n lws-system
Edit deepseek-lws.yaml to insert your hugging face token (ensure it's base64 encoded):
Important: Replace "PASTE_BASE_64_VERSION_OF_YOUR_HF_TOKEN_HERE" with the base64 encoded version of your Hugging Face token. To base64 encode it, you can use something like echo -n 'your_token' | base64
A couple of other things to point out: if you look at the vllm command line in the yaml, you will notice that across the 4 nodes we have 32 GPUs, which we split into 8-way tensor parallelism and 4 pipeline stages, for a total of 8 x 4 = 32 (read about these params here).
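The relevant flags look roughly like this; this is a sketch of the vllm invocation, not the exact command from the manifest:
  # 8-way tensor parallelism within each node x 4 pipeline stages across nodes = 32 GPUs
  vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --trust-remote-code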
Apply the yaml using kubectl:
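  kubectl apply -f deepseek-lws.yaml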

Check on the vllm pods

You will need to wait until vllm-0 gets to 1/1. You can check on what is happening inside the main (leader) pod, for example by following its logs:
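  kubectl get pods                 # wait for vllm-0 to reach 1/1
  kubectl logs -f vllm-0           # stream the leader pod's logs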
In the deepseek-lws.yaml, you will notice that we have turned all the logging way up so you get an idea of everything happening (or not!) in the system. Once you get familiar, you can turn the settings down as much as you wish.
You will see the model being downloaded:
If you inspect the deepseek-lws.yaml, you will see that the /root/local directory on the host is used to store the model. So even if a pod fails for some reason, the next pod will pick up the download from where the previous one left off.
After a while you will see the following:
Once you see the following, the vllm openapi endpoint is ready!

Take it for a spin!

Access the API

To access the DeepSeek-V3 model using your localhost, use the following command:
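One option is to port-forward the leader pod directly (8000 is vLLM's default API port and 8265 the Ray dashboard used in the Bonus section below; if your copy of the yaml defines a Service, forward that instead):
  kubectl port-forward pod/vllm-0 8000:8000 8265:8265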
To check if the model is registered using the openapi spec, use the following command:
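  curl -s http://localhost:8000/v1/models | jq
The id field in the response is the model name to use in the requests below.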
To test the deployment use the following command:
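Here is a sketch of a chat completion request; the model name is an assumption, so use whatever /v1/models returned:
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "deepseek-ai/DeepSeek-V3",
          "messages": [{"role": "user", "content": "Say hello from DeepSeek on EKS"}],
          "max_tokens": 64
        }' | jq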
You will see something like:
Now feel free to tweak the deepseek-lws.yaml and re-apply the changes using:
Just to be sure, you can clean up using kubectl delete -f deepseek-lws.yaml and use kubectl get pods to make sure all the pods are gone before you run kubectl apply.
Happy Hacking!!

Bonus

If you were a keen observer, you may have noticed that we forwarded port 8265 as well; point your browser at http://localhost:8265 to look at the Ray dashboard!
You can see the GPU usage specifically when you are running an inference.
Thanks
This post is based on Bryant Biggs's work in various repositories, thanks Bryant. Also thanks to Arush Sharma for a quick review and suggestions. Kudos to the folks in the Ray, vLLM, LWS, and Kubernetes communities for making it easier to compose these complex scenarios.

Things to try

As mentioned earlier, we are relying on a host node directory to persist model downloads across pod restarts. There are other options you can try as well; see the mozilla.ai link below, which uses Persistent Volumes, for example. Yet another option to store and load the model is the Amazon FSx for Lustre Container Storage Interface (CSI) driver.
The terraform-aws-eks-blueprints GitHub repo has a Terraform-based setup you can try too.

Links
