
Deploying DeepSeek models on AWS Inferentia with vLLM
A step-by-step guide to deploying the DeepSeek-R1-Distill-Llama-8B/70B models on an Amazon EC2 Inferentia2 (Inf2) instance using vLLM.
In this guide, you will:
- Launch a Neuron Deep Learning AMI (DLAMI) based Amazon EC2 instance.
- Install vLLM.
- Download the model from Hugging Face.
- Launch and test the inference server.
You will need the following prerequisites:
- An active AWS account with appropriate permissions to launch EC2 instances and manage related resources.
- An approved Service Quotas value of 96 or above for "Running On-Demand Inf instances" under Amazon Elastic Compute Cloud (Amazon EC2); you can verify this with the CLI check below.
- A key pair created for connecting to the EC2 instance.
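If you want to confirm the Inf quota from the command line, here is a minimal sketch using the AWS CLI, assuming it is configured for the region where you plan to launch (the quota name is matched as it appears in the Service Quotas console):

# Show the current "Running On-Demand Inf instances" quota value for EC2.
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?QuotaName=='Running On-Demand Inf instances'].{Name:QuotaName,Value:Value}" \
  --output table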
To launch the instance from the console:
- Go to the AWS Management Console and open the EC2 service.
- On the EC2 Dashboard, click the "Launch instance" button to start the instance creation process.
- Select the AMI (Amazon Machine Image): under Application and OS Images (Amazon Machine Image), click "Browse more AMIs" and search for "Deep Learning AMI Neuron". Select the Deep Learning AMI Neuron (Ubuntu 22.04).
- Choose the instance type: select an instance type that is compatible with AWS Neuron, such as inf2.24xlarge for this use case. Select a larger instance such as inf2.48xlarge if you plan to run the 70B model (its serve command below uses a tensor parallel size of 16).
- Select your key pair and configure the network settings to allow secure SSH access to the instance.
- Under Configure storage, make sure the root volume is set to 500 GB of type gp3.
- You can keep the rest of the configuration as you prefer and follow the EC2 documentation for detailed instructions, or launch an equivalent instance from the CLI as sketched below.
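The following is a rough CLI equivalent of the console steps above, not an exact reproduction of this guide: the AMI name filter is an assumption you should verify for your region, and <your-key-pair> and <your-security-group-id> are placeholders.

# Sketch only: look up a recent Neuron DLAMI for Ubuntu 22.04 by name.
AMI_ID=$(aws ec2 describe-images --owners amazon \
  --filters "Name=name,Values=Deep Learning AMI Neuron*Ubuntu 22.04*" \
  --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)

# Launch an inf2.24xlarge with a 500 GB gp3 root volume.
aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --instance-type inf2.24xlarge \
  --key-name <your-key-pair> \
  --security-group-ids <your-security-group-id> \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":500,"VolumeType":"gp3"}}]'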
Once the instance is running, connect to it over SSH:
ssh -i /path/key-pair-name.pem instance-user-name@instance-public-dns-name
Activate the Neuron PyTorch virtual environment that ships with the DLAMI:
source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate
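The serve commands later in this guide assume vLLM has been built with Neuron support inside a ~/vllm/ directory. The install commands are not shown above, so here is a minimal sketch based on vLLM's Neuron installation flow; the branch to check out and the requirements file name vary between vLLM releases, so match them to your Neuron SDK version.

# Optional: confirm the Inferentia2 devices are visible (neuron-ls ships with the Neuron DLAMI).
neuron-ls

# Sketch: build vLLM from source for the Neuron device.
cd ~
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-neuron.txt    # file name differs in newer vLLM releases
VLLM_TARGET_DEVICE="neuron" pip install -e .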
Install Git LFS, log in to Hugging Face, and download the model (use deepseek-ai/DeepSeek-R1-Distill-Llama-70B instead if you are deploying the 70B model):
sudo apt-get install git-lfs
git lfs install
huggingface-cli login
<Enter your Hugging Face access token when prompted>
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B
The model is cached under a snapshot directory, for example:
/home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/d66bcfc2f3fd52799f95943264f32ba15ca0003d/
Export this path so vLLM can load the model from it:
export MODEL_PATH=/home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/d66bcfc2f3fd52799f95943264f32ba15ca0003d/
Make sure to change this command to match the snapshot path on your instance, or resolve it automatically as shown below.
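If your snapshot hash differs from the one above, a small convenience sketch resolves the path with a shell glob instead of copying it by hand:

# Resolve the 8B snapshot directory without hard-coding the hash.
export MODEL_PATH=$(ls -d /home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/*/ | head -n 1)
echo "$MODEL_PATH"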
To deploy the DeepSeek-R1-Distill-Llama-8B model, run the following command inside the ~/vllm/ directory:
cd ~/vllm/
python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--served-model-name DeepSeek-R1-Distill-Llama-8B \
--tensor-parallel-size 8 \
--max-model-len 2048 \
--max-num-seqs 4 \
--block-size 8 \
--use-v2-block-manager \
--device neuron \
--port 8080
To deploy the DeepSeek-R1-Distill-Llama-70B model, run the following command inside the ~/vllm/ directory:
cd ~/vllm/
python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--served-model-name DeepSeek-R1-Distill-Llama-70B \
--tensor-parallel-size 16 \
--max-model-len 2048 \
--max-num-seqs 4 \
--block-size 8 \
--use-v2-block-manager \
--device neuron \
--port 8080
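The first start typically compiles the model for the Neuron cores, so it can take several minutes before the server is ready. From another terminal on the instance, you can confirm it is up before sending prompts; both endpoints below are part of vLLM's OpenAI-compatible server:

# Liveness check (returns HTTP 200 when the server is ready) and the list of served model names.
curl localhost:8080/health
curl localhost:8080/v1/models | jq '.data[].id'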
# Send a curl request to the model for 8B
curl localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "DeepSeek-R1-Distill-Llama-8B", "prompt": "What is DeepSeek R1?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
# Send a curl request to the model for 70B
curl localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "DeepSeek-R1-Distill-Llama-70B", "prompt": "What is DeepSeek R1?", "temperature":0, "max_tokens": 512}' | jq '.choices[0].text'
- For details on the configuration parameters for vLLM, refer to the Neuron continuous batching guide.
- Get started with [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index)
- Getting started with SageMaker Large Model Inference Containers
- Other resources related to DeepSeek on AWS
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.