Deploying the DeepSeek-V3 Model (full version) in Amazon EKS Using vLLM and LWS
This guide provides a streamlined process to deploy the full 671B parameter DeepSeek-V3 MoE model on Amazon EKS using vLLM and LeaderWorkerSet API
Published Apr 21, 2025
This guide assumes you:
- Have intermediate Kubernetes experience (kubectl, Helm)
- Are familiar with AWS CLI and EKS
- Understand basic GPU concepts
We will be deploying the 671B parameter DeepSeek-V3 MoE model with vLLM and the LeaderWorkerSet API (LWS) on Amazon EC2 G6e instances, as they are a bit more accessible/available and we want to see how to load a model across multiple nodes.
The main idea here is to peel the onion to see how exactly folks are deploying these large models with a practical demonstration to understand all the pieces and how they fit together.
The latest versions of the files are available here:
https://github.com/dims/skunkworks/tree/main/v3
You will need the following tools:
- AWS CLI: For managing AWS resources.
- eksctl/eksdemo: To create and manage EKS clusters.
- kubectl: The command-line tool for Kubernetes.
- helm: Kubernetes’ package manager.
- jq: For parsing JSON.
- Docker: For building container images.
- Hugging Face Hub access: You’ll need a token to download the model.
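As a quick sanity check before starting, you can verify the CLIs above are on your PATH (a minimal sketch; adjust the list to the tools you actually use):

```bash
# Report any missing command-line tools.
for tool in aws eksctl kubectl helm jq docker; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```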
We will use an AWS account with sufficient quota for four g6e.48xlarge instances (192 vCPUs, 1536GB RAM, 8x L40S Tensor Core GPUs that come with 48 GB of memory per GPU).
You can use eksdemo for example:
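A minimal sketch of what that could look like (the cluster name is a placeholder, and you should double-check the current eksdemo flags for node volume size and EFA before running it):

```bash
# Four g6e.48xlarge nodes, as discussed above.
eksdemo create cluster deepseek-v3 --instance g6e.48xlarge --nodes 4
```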
If you want to use eksctl instead, run the same command above with `--dry-run` to get the equivalent eksctl command and configuration YAML. Essentially, ensure you have enough GPU nodes, allocate a large volume size per node, and enable EFA. You can use any tool of your choice, but remember that you will have to adjust the deployment YAML as needed, for example for taints.
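For reference, an eksctl configuration along these lines (a sketch, not the exact file from the repo; the name, region, and volume size are placeholders) covers those three requirements:

```bash
# Sketch of an eksctl ClusterConfig: 4 GPU nodes, large root volumes, EFA enabled.
cat <<'EOF' > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: deepseek-v3     # placeholder name
  region: us-west-2     # pick a region where g6e.48xlarge is available
managedNodeGroups:
  - name: gpu-g6e
    instanceType: g6e.48xlarge
    desiredCapacity: 4
    volumeSize: 1000    # GiB; large enough to hold the model weights
    efaEnabled: true    # attach Elastic Fabric Adapter interfaces
EOF
eksctl create cluster -f cluster.yaml
```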
🔍 Why EFA? Elastic Fabric Adapter accelerates inter-node communication, critical for multi-GPU inference.
Ideally you would just use a public image from the vllm folks:
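For example, the upstream OpenAI-compatible server image published on Docker Hub (pick a current release tag rather than `latest` for reproducibility):

```bash
docker pull vllm/vllm-openai:latest
```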
However, we want EFA support in the image: Elastic Fabric Adapter (EFA) enhances inter-node communication for high-performance computing and machine learning applications within Amazon EKS clusters.
In the following Dockerfile, we start by grabbing a powerful CUDA base image, then go on an installation spree, pulling in EFA, NCCL, and AWS-OFI-NCCL, while instructing apt to hang onto its downloaded packages. Once everything’s compiled, we carefully graft these freshly built libraries onto the vLLM image above.
🛠 GPU Compatibility: The COMPUTE_CAPABILITY_VERSION=90 setting is specific to L40S GPUs. Adjust this for your hardware.
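A sketch of building and publishing the resulting image (the registry and tag are placeholders, and whether `COMPUTE_CAPABILITY_VERSION` is passed as a build argument depends on the Dockerfile in the linked repo):

```bash
# Build the EFA-enabled vLLM image and push it somewhere your nodes can pull from.
docker build --build-arg COMPUTE_CAPABILITY_VERSION=90 \
  -t <your-registry>/vllm-efa:latest .
docker push <your-registry>/vllm-efa:latest
```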
Note that we also install `huggingface_hub` with the high-speed `hf_transfer` component and update the `ray` package. There is a `ray_init.sh` script which helps us start `vllm` and `ray` on the leader and worker nodes brought up by LWS.

Check if the NVIDIA and EFA daemonsets are running using:
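For example (the daemonset names and namespace depend on how the device plugins were installed; this is just one way to check):

```bash
# The NVIDIA and EFA device-plugin daemonsets typically run in kube-system.
kubectl get daemonsets -n kube-system | grep -Ei 'nvidia|efa'
```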
Check if the nodes correctly advertise the GPU count and EFA capacity:
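One way to do this is to look for the extended resources the device plugins register on each node (`nvidia.com/gpu` and `vpc.amazonaws.com/efa`):

```bash
# Each g6e.48xlarge node should report 8 GPUs plus its EFA interfaces.
kubectl describe nodes | grep -E 'nvidia.com/gpu|vpc.amazonaws.com/efa'
```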
Install the `node-shell` kubectl/krew plugin to peek into the nodes (it will be handy later) using `kubectl krew install node-shell`.
Now check if the devices are correctly present on each node:
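For example, using the plugin we just installed (the node name is a placeholder; EFA devices show up under `/dev/infiniband`):

```bash
# Open a shell on one of the GPU nodes and list the GPU and EFA device files.
kubectl node-shell <node-name> -- bash -c 'ls /dev/nvidia* /dev/infiniband'
```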
Use helm to install LWS:
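At the time of writing, the LWS chart is published as an OCI artifact; check the LWS documentation for the current chart location and version before running this:

```bash
# Install the LeaderWorkerSet controller into its own namespace.
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --namespace lws-system --create-namespace --wait
```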
Check if the LWS pods are running using `kubectl get pods -n lws-system`.
Edit `deepseek-lws.yaml` to insert your Hugging Face token (ensure it is base64 encoded). Important: replace "PASTE_BASE_64_VERSION_OF_YOUR_HF_TOKEN_HERE" with the base64-encoded version of your Hugging Face token. To base64 encode it, you can use something like `echo -n 'your_token' | base64`.
A couple of other things to point out: if you look at the `vllm` command line, you will notice that across the 4 nodes we have 32 GPUs, and we are splitting these into 8-way tensor parallelism and 4 pipeline stages for a total of 32 (read about these parameters here).
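Conceptually, the relevant flags look like this (a sketch only; the actual command line, model path, and any additional options live in `deepseek-lws.yaml`):

```bash
# 8-way tensor parallel x 4 pipeline stages = 32 GPUs across the 4 nodes.
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --trust-remote-code
```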
Apply the yaml using kubectl:
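Using the file we just edited:

```bash
kubectl apply -f deepseek-lws.yaml
```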
You will need to wait until `vllm-0` gets to `1/1`. You can check in on what is happening inside the main pod using:
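For example (`vllm-0` is the leader pod created by LWS):

```bash
# Watch the pods come up, then follow the leader pod's logs.
kubectl get pods -w
kubectl logs -f vllm-0
```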
In `deepseek-lws.yaml`, you will notice that we have turned the logging way up so you get an idea of all the things happening (or not!) in the system. Once you get familiar, you can turn the settings down as much as you wish. You will see the model being downloaded:
If you inspect `deepseek-lws.yaml`, you will see that the `/root/local` directory on the host is used to store the model. So even if a pod fails for some reason, the next pod will pick up the download from where the previous one left off. After a while you will see the following:
Once you see the following, the vLLM OpenAI-compatible endpoint is ready!
To access the DeepSeek-V3 model from your localhost, use the following command:
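One way to do this is a port-forward (a sketch: 8000 is vLLM's default API port, 8265 is the Ray dashboard port we will use later, and the pod name comes from the deployment above):

```bash
# Forward the OpenAI-compatible API and the Ray dashboard to localhost.
kubectl port-forward pod/vllm-0 8000:8000 8265:8265
```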
To check if the model is registered with the OpenAI-compatible API, use the following command:
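Assuming the port-forward above is still running:

```bash
# List the models the server has registered.
curl -s http://localhost:8000/v1/models | jq .
```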
To test the deployment, use the following command:
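A sketch of a chat completion request (use whichever model id `/v1/models` returned):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }' | jq .
```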
You will see something like:
Now feel free to tweak `deepseek-lws.yaml` and re-apply the changes using `kubectl apply`. Just to be sure, you can clean up first using `kubectl delete -f deepseek-lws.yaml` and use `kubectl get pods` to make sure all the pods are gone before you run `kubectl apply`.
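Putting those steps together:

```bash
kubectl delete -f deepseek-lws.yaml
kubectl get pods                 # repeat until all the vllm pods are gone
kubectl apply -f deepseek-lws.yaml
```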
Happy Hacking!!
If you were a keen observer and noticed that we forwarded port `8265` as well, point your browser at it to look at the Ray dashboard! You can see the GPU usage, specifically when you are running an inference.
Thanks
This post is based on Bryant Biggs's work in various repositories, thanks Bryant. Also thanks to Arush Sharma for a quick review and suggestions. Kudos to the folks in the Ray, vLLM, LWS, and Kubernetes communities for making it easier to compose these complex scenarios.
As mentioned earlier, we are relying on a host node directory to persist model downloads across pod restarts. There are other options you can try as well; see the mozilla.ai link below, which uses Persistent Volumes, for example. Yet another option to store/load the model is the FSx for Lustre Container Storage Interface (CSI) driver.
The `terraform-aws-eks-blueprints` GitHub repo has a Terraform-based setup you can try too.