Deploying DeepSeek model on AWS Inferentia with vLLM

Step-by-step guide to deploying the DeepSeek-R1-Distill-Llama-8B/70B models on an Amazon EC2 Inferentia2 (Inf2) instance using vLLM

Irshad Chohan
Amazon Employee
Published Jan 30, 2025
Last Modified Feb 1, 2025
DeepSeek, a Chinese artificial intelligence (AI) company, has recently garnered significant attention for its innovative AI models, which rival leading Western counterparts in performance while being more cost-effective. Its recently launched model, DeepSeek-R1, delivers the performance of leading Western AI systems at a fraction of the cost.
DeepSeek-R1 matches the capabilities of OpenAI's o1 reasoning model across a variety of tasks, including math, coding, and general reasoning, while costing less than 10% as much as its Western counterpart. DeepSeek-R1 is also completely open source, empowering developers worldwide to leverage its cutting-edge technology and integrate it into their own systems.
This guide walks you through the step-by-step process of deploying this model on Amazon EC2 Inf2 instances to get the best price-performance ratio. We will deploy the DeepSeek-R1-Distill-Llama-8B model and set up inference using the open-source tool vLLM. The same guide can be used to deploy the DeepSeek-R1-Distill-Llama-70B model as well.

Outline

  1. Launch a DLAMI Neuron-based Amazon EC2 instance
  2. Install vLLM
  3. Download the model from Hugging Face
  4. Launch and test the inference model

Stage 1. Launch a DLAMI Neuron-based Amazon EC2 instance

Prerequisites:
  1. An active AWS account with appropriate permissions to launch EC2 instances and manage related resources.
  2. An approved Service Quotas limit for Running On-Demand Inf instances under Amazon Elastic Compute Cloud (Amazon EC2) with a value of 96 or above (a CLI check is sketched after this list).
  3. A key pair created to connect to the EC2 instance.
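If you want to verify this quota from the command line, the AWS CLI's Service Quotas commands can look it up by name; the quota name below is assumed to match what the console shows, so confirm it in the Service Quotas console if the query returns nothing.
# Look up the current "Running On-Demand Inf instances" quota value (name assumed to match the console)
aws service-quotas list-service-quotas --service-code ec2 \
--query "Quotas[?QuotaName=='Running On-Demand Inf instances'].{Name:QuotaName,Value:Value}"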
Steps:
  1. Go to the AWS Management Console and open the EC2 service.
  2. On the EC2 Dashboard, click the "Launch instance" button to start the instance creation process.
  3. Select the AMI (Amazon Machine Image): under Application and OS Images (Amazon Machine Image), click "Browse more AMIs" and search for "Deep Learning AMI Neuron" in the search bar. Select the Deep Learning AMI Neuron (Ubuntu 22.04).
  4. Choose the instance type: in the "Choose Instance Type" step, select an instance type that is compatible with AWS Neuron, such as inf2.24xlarge for this use case. Select a larger instance (for example, inf2.48xlarge) if you plan to run the 70B model.
  5. Select your key pair name and network settings, allowing secure SSH access to this EC2 instance.
  6. Under Configure storage, make sure the root volume is configured with 500 GB of gp3 storage.
  7. You can keep the rest of the configuration as per your convenience and follow this guide for detailed instructions. If you prefer the command line, an AWS CLI alternative is sketched after these steps.
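For reference, the same instance can be launched with the AWS CLI along the lines of the sketch below. The AMI ID, key pair name, and security group ID are placeholders you must replace with your own values, and the root device name can differ per AMI, so verify it on the AMI details page.
# Placeholders: replace the AMI ID, key pair name, and security group ID with your own values
aws ec2 run-instances \
--image-id ami-xxxxxxxxxxxxxxxxx \
--instance-type inf2.24xlarge \
--key-name my-key-pair \
--security-group-ids sg-xxxxxxxxxxxxxxxxx \
--block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":500,"VolumeType":"gp3"}}]'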
Your EC2 instance will be up within minutes. Once the launched EC2 instance is up and running, the next step is to connect to it using an SSH client. You can follow this guide for more instructions on how to connect to an EC2 instance with an SSH client.
You will use a command similar to the one below. Make sure your SSH key pair file has its permissions set to 400 (read-only).
ssh -i /path/key-pair-name.pem instance-user-name@instance-public-dns-name
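If ssh refuses the key with a warning that the private key file is unprotected, tighten the file's permissions first; the path below is the same placeholder used in the SSH command above.
# Make the key file readable only by its owner (permission 400)
chmod 400 /path/key-pair-name.pem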
Once you are connected to your launched EC2 instance, proceed with the next step.

Stage 2. Install vLLM

Before installation, make sure you have activated the inference environment that is already configured inside the machine. You can execute the command below to do so:
source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate
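Optionally, you can confirm that the instance's Neuron devices are visible before going further; neuron-ls is part of the Neuron tools preinstalled on this DLAMI.
# List the Inferentia2 devices and NeuronCores available on this instance
neuron-ls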
We also need to install and enable git-lfs to download large model files from the source. To install and enable it, run the commands below:
sudo apt-get install git-lfs
git lfs install
vLLM is an open-source tool that provides an easy, fast way of serving Large Language Models (LLMs). From version 0.3.3 onwards, vLLM supports model inference and serving on AWS Trainium/Inferentia with the Neuron SDK. The vLLM documentation provides a detailed guide on how to do the installation with Neuron. As we have used the Neuron DLAMI, you can skip Step 0 and Step 1 of that guide and start from the remaining steps, which install the Neuron dependencies and build vLLM from source.
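As a reference, the build-from-source portion of that guide looked roughly like the commands below at the time of writing. Treat this as a sketch: file names such as requirements-neuron.txt and the VLLM_TARGET_DEVICE flag may change between vLLM releases, so follow the current vLLM Neuron installation page for the exact steps.
# Clone vLLM and build it for the Neuron device (run inside the activated inference environment)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install .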

Stage 3. Download the model from Hugging Face

Once vLLM is installed, let's download the model using huggingface-cli, which comes pre-installed on the machine. You may need to set up a Hugging Face account and generate an access token. You can follow this guide on how to generate your access token after logging in to your Hugging Face account.
Once you have your access token, execute the commands below inside your machine.
huggingface-cli login
<Use your huggingface token>
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B
You can change 8B to 70B to download the larger model. The download may take 5-10 minutes depending on network speed. Once it completes successfully, you will see the model downloaded inside a folder similar to the one below:
/home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/d66bcfc2f3fd52799f95943264f32ba15ca0003d/
The path may differ slightly depending on your cache location. Copy this path and set up a MODEL_PATH environment variable with a command like:
export MODEL_PATH=/home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/d66bcfc2f3fd52799f95943264f32ba15ca0003d/
Make sure to change this command based on your path; alternatively, the sketch below resolves the snapshot path automatically.
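If you prefer not to copy the path by hand, a small sketch like the one below resolves the snapshot directory for you, assuming the default Hugging Face cache location under ~/.cache/huggingface:
# Pick the first (and usually only) snapshot directory in the local cache for this model
export MODEL_PATH=$(ls -d ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/*/ | head -n 1)
echo $MODEL_PATH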

Stage 4. Launch and test the inference model

To serve the downloaded DeepSeek-R1-Distill-Llama-8B model, you can run the command below inside the ~/vllm/ directory.
cd ~/vllm/

python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--served-model-name DeepSeek-R1-Distill-Llama-8B \
--tensor-parallel-size 8 \
--max-model-len 2048 \
--max-num-seqs 4 \
--block-size 8 \
--use-v2-block-manager \
--device neuron \
--port 8080
To serve the DeepSeek-R1-Distill-Llama-70B model, you can run the command below inside the ~/vllm/ directory.
cd ~/vllm/

python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--served-model-name DeepSeek-R1-Distill-Llama-70B \
--tensor-parallel-size 16 \
--max-model-len 2048 \
--max-num-seqs 4 \
--block-size 8 \
--use-v2-block-manager \
--device neuron \
--port 8080
To test the model and run inference, open a separate terminal, SSH into the same machine, and run the command below.
# Send a curl request to the model for 8B
curl localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "DeepSeek-R1-Distill-Llama-8B", "prompt": "What is DeepSeek R1?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
# Send a curl request to the model for 70B
curl localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "DeepSeek-R1-Distill-Llama-70B", "prompt": "What is DeepSeek R1?", "temperature":0, "max_tokens": 512}' | jq '.choices[0].text'


Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
