Deploying DeepSeek-R1-Distill-Llama-8B on AWS m7i Instance

This is a community post discussing how to deploy DeepSeek-R1-Distill-Llama-8B on Amazon EC2 M7i instances.

Dylan
Amazon Employee
Published Mar 6, 2025
Last Modified Mar 10, 2025
Authors:
Anish Kumar, AI Software Engineering Manager, Intel
Supporting authors:
Dylan Souvage, PSA, AWS
Vishwa Gopinath Kurakundi, PSA, AWS
DeepSeek R1 distilled models have rapidly gained popularity since their launch, particularly for their efficiency and cost-effectiveness. They are designed to run on standard hardware while maintaining a significant portion of the original model's reasoning capabilities. These models have quickly become popular due to their resource efficiency as well as their open source license, which allows them to be deployed freely in applications.
This tutorial focuses on deploying a distilled version of DeepSeek-R1: a smaller model trained to generate responses that match the quality of a larger “teacher” large language model (LLM). I will deploy it using vLLM* (version 0.7.0 or higher), which has become a popular serving engine due to its high throughput, easy integration with the Hugging Face hub, and support for a variety of CPUs, GPUs, and AI accelerators. The efficiency of the DeepSeek-R1-Distill model makes it possible to perform inference with fewer computing resources. In this case, I’ll show how to deploy on an Amazon Elastic Compute Cloud (Amazon EC2) m7i.2xlarge instance, which uses Intel® Xeon® Scalable processors and provides 8 vCPUs and 32 GB of memory.
For this example, I’ll utilize the DeepSeek-R1-Distill-Llama-8B version; however, these steps can be applied to deploy other DeepSeek-R1-Distill models of varying parameter sizes and LLM distillations available on the Hugging Face hub.
Deploy Docker Engine on m7i.2xlarge EC2 Instances
The first thing you’ll need to do is install the Docker* Engine on your Amazon EC2 instance. You can follow the Docker setup instructions.
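The exact commands depend on your operating system; as a minimal sketch, assuming an Amazon Linux 2023 AMI, the installation looks like this:
# Install and start Docker Engine on Amazon Linux 2023 (adjust the package manager for other distributions)
sudo dnf install -y docker
sudo systemctl enable --now docker
# Optional: allow the default ec2-user to run docker without sudo (log out and back in for this to take effect)
sudo usermod -aG docker ec2-user
docker --version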
Deploy vLLM on AWS m7i.2xlarge EC2 Instances
In this section, I’ll show you how to install vLLM on an EC2 instance to deploy the DeepSeek-R1-Distill-Llama-8B model. You’ll learn how to request access to the model, create a Docker container to use vLLM to deploy the model, and run online inference on the model.
Step 1: Generate a Hugging Face token (optional step for gated model access)
This step is required only for accessing gated models like meta-llama. For accessing gated models, you’ll need to log into your Hugging Face account to generate an access token, which you can get by following these steps. When you get to the “Save your Access Token” screen, as shown in Figure 1, make sure to copy the token because it will not be shown again.
Figure 1: Generate and copy your Hugging Face token
Step 2: Clone the vLLM GitHub repository
To build the vLLM Docker container, you’ll need to clone the official GitHub repository onto your EC2 instance. Run the following commands in your terminal after connecting to your EC2 instance:
git clone https://github.com/vllm-project/vllm.git
cd vllm
Step 3: Export the Hugging Face token
As noted in Step 1, if you’re using gated models such as the meta-llama models, you need to export the token you generated to the HF_TOKEN environment variable, as shown below:
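For example (the placeholder below stands for the token you generated in Step 1):
export HF_TOKEN="YOUR_TOKEN_HERE"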
Figure 2: Clone the vLLM GitHub repository and export HF_TOKEN="YOUR_TOKEN_HERE"
Step 4: Build the vLLM CPU Docker container
I will build the vLLM CPU Docker container as described in the vLLM documentation.
sudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
This command builds the vLLM Docker container for CPU using the instructions in Dockerfile.cpu. The container includes Intel’s CPU optimizations, including the Intel® Extension for PyTorch*, which ensures LLM inference is optimized to run on 4th Generation Intel® Xeon® processors or later.
The --shm-size flag allows the container to access the host’s shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
Figure 3: Documentation on vLLM setup using Docker containers for CPU
Figure 4: Building the vLLM Docker container using the Dockerfile.cpu
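Once the build completes, you can optionally confirm that the image is available locally before moving on:
sudo docker images vllm-cpu-env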
Test Inference on AWS m7i.2xlarge EC2 instances
In this section, I’ll explore how to test inference of the DeepSeek-R1-Distill-Llama-8B model using vLLM for CPU in online mode with OpenAI-style REST API endpoints.
Step 1: Start the vLLM CPU Docker container
When you start the Docker container with the command below, it starts the inference serving engine on port 8000. I also use host networking via the --network=host parameter.
Once the command is running, you should see that the model server with deepseek-ai/DeepSeek-R1-Distill-Llama-8B is listening on port 8000, ready to serve inference requests.
sudo docker run -it --rm --network=host vllm-cpu-env --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --device cpu --max_model_len 500
Here you’ll see the various parameters that are available when initiating a vLLM serving engine via a Docker container:
Figure 5: Initiating the Docker container with the DeepSeek-R1-Distill-Llama-8B model
Figure 6: vLLM has been initiated and port 8000 is now ready to serve
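Before sending an inference request, you can optionally verify from a second terminal that the server is up by listing the models it serves; the /v1/models endpoint is part of vLLM’s OpenAI-compatible API:
curl http://localhost:8000/v1/models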
Step 2: Initiate an inference request with a cURL POST request
You’ll need to connect to the same EC2 instance in a new terminal to execute the inference POST request using cURL:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "prompt": "Explain why the sun emits immense energy", "temperature": 0, "max_tokens": 32}'
In the above sample request, I am prompting the DeepSeek-R1-Distill-Llama-8B model with the query, “Explain why the sun emits immense energy.” I set the maximum number of output tokens to 32. Based on these parameters and the prompt, the model returns the response: “The sun emits energy because it’s a massive star with a high internal temperature and nuclear fusion happening in.”
Note: For any given model prompt, the response from the model might vary slightly each time.
Figure 7: A cURL HTTP request on port 8000 for inference to the DeepSeek-R1-Distill-Llama-8B model and its response.
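vLLM’s OpenAI-compatible server also exposes a chat-style endpoint. As a minimal sketch, the same prompt can be sent as a chat request using the OpenAI chat completions message format:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "messages": [{"role": "user", "content": "Explain why the sun emits immense energy"}], "temperature": 0, "max_tokens": 32}'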
Resources
This tutorial shows how simple it is to deploy a DeepSeek-R1-Distill-Llama-8B model on an AWS m7i.2xlarge instance powered by Intel® Xeon® CPUs, using vLLM as the serving engine. These steps will work for different versions of the DeepSeek-R1 model, as well as for different AWS instances based on Intel® Xeon® CPUs. See the resources below to learn more about what is available and how to get started.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
