Deploying DeepSeek-R1-Distill-Llama-8B on AWS m7i Instance

This is a community post discussing how to deploy DeepSeek-R1-Distill-Llama-8B on Amazon EC2 M7i instances.

Dylan
Amazon Employee
Published Mar 6, 2025
Last Modified Mar 10, 2025
Authors:
Anish Kumar, AI Software Engineering Manager, Intel
Supporting authors:
Dylan Souvage, PSA, AWS
Vishwa Gopinath Kurakundi, PSA, AWS
DeepSeek R1 distilled models have rapidly gained popularity since their launch, particularly for their efficiency and cost-effectiveness. They are designed to run on standard hardware while maintaining a significant portion of the original model's reasoning capabilities. These models have quickly become popular due to their resource efficiency as well as their open source license, which allows them to be deployed freely in applications.
This tutorial focuses on deploying a distilled version of DeepSeek-R1: a smaller model trained to generate responses that match the quality of a larger “teacher” large language model (LLM). I will deploy it using vLLM* (version 0.7.0 or higher), which has become a popular serving engine due to its high throughput, easy integration with the Hugging Face hub, and support for a variety of CPUs, GPUs, and AI accelerators. The efficiency of the DeepSeek-R1-Distill model makes it possible to perform inference with fewer computing resources. In this case, I’ll show how to deploy on an Amazon Elastic Compute Cloud (Amazon EC2) m7i.2xlarge instance, which uses Intel® Xeon® Scalable processors and provides 8 vCPUs and 32 GB of memory.
For this example, I’ll utilize the DeepSeek-R1-Distill-Llama-8B version; however, these steps can be applied to deploy other DeepSeek-R1-Distill models of varying parameter sizes and LLM distillations available on the Hugging Face hub.
Deploy Docker Engine on m7i.2xlarge EC2 Instances
The first thing you’ll need to do is install the Docker* Engine on your Amazon EC2 instance. You can follow the Docker setup instructions.
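The exact commands depend on your operating system; as a minimal sketch, assuming an Amazon Linux 2023 AMI, the installation looks like this:
# Install and start Docker Engine on Amazon Linux 2023 (adjust the package manager for other distributions)
sudo dnf install -y docker
sudo systemctl enable --now docker
# Optional: allow the default ec2-user to run docker without sudo (log out and back in for this to take effect)
sudo usermod -aG docker ec2-user
docker --version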
Deploy vLLM on AWS m7i.2xlarge EC2 Instances
In this section, I’ll show you how to install vLLM on an EC2 instance to deploy the DeepSeek-R1-Distill-Llama-8B model. You’ll learn how to request access to the model, create a Docker container to use vLLM to deploy the model, and run online inference on the model.
Step 1: Generate a Hugging Face token (optional step for gated model access)
This step is required only for accessing gated models like meta-llama. For accessing gated models, you’ll need to log into your Hugging Face account to generate an access token, which you can get by following these steps. When you get to the “Save your Access Token” screen, as shown in Figure 1, make sure to copy the token because it will not be shown again.
Figure 1: Generate and copy your Hugging Face token
Step 2: Clone the vLLM GitHub repository
To build the vLLM Docker container, you’ll need to clone the official GitHub repository onto your EC2 instance. Run the following commands in your terminal after connecting to your EC2 instance:
git clone https://github.com/vllm-project/vllm.git
cd vllm
Step 3: Export the Hugging Face token
As noted in Step 1, if you’re using gated models such as the meta-llama models, you need to export the token you generated to the HF_TOKEN environment variable, as shown below:
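For example (the placeholder below stands for the token you generated in Step 1):
export HF_TOKEN="YOUR_TOKEN_HERE"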
Figure 2: Clone the vLLM GitHub repository and export HF_TOKEN="YOUR_TOKEN_HERE"
Step 4: Build the vLLM CPU Docker container
I will build the vLLM CPU Docker container as described in the vLLM documentation.
sudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
This command builds the vLLM Docker container for CPU using the instructions in Dockerfile.cpu. The container includes Intel’s CPU optimizations, including the Intel® Extension for PyTorch*, which ensures LLM inference is optimized to run on 4th Generation Intel® Xeon® processors or later.
The --shm-size flag allows the container to access the host’s shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
Figure 3: Documentation on vLLM setup using Docker containers for CPU
Figure 4: Building the vLLM Docker container using the Dockerfile.cpu
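Once the build completes, you can optionally confirm that the image is available locally before moving on:
sudo docker images vllm-cpu-env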
Test Inference on AWS m7i.2xlarge EC2 instances
In this section, I’ll explore how to test inference of the DeepSeek-R1-Distill-Llama-8B model using vLLM for CPU in online mode with OpenAI-style REST API endpoints.
Step 1: Start the vLLM CPU Docker container
When you start the Docker container with the command below, it starts the inference serving engine on port 8000. I also use host networking via the --network=host parameter.
Once the command is running, you should see that the model server with deepseek-ai/DeepSeek-R1-Distill-Llama-8B is listening on port 8000, ready to serve inference requests.
sudo docker run -it --rm --network=host vllm-cpu-env --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --device cpu --max_model_len 500
Here you’ll see the various parameters that are available when initiating a vLLM serving engine via a Docker container:
Figure 5: Initiating the Docker container with the DeepSeek-R1-Distill-Llama-8B model
Figure 6: vLLM has been initiated and port 8000 is now ready to serve
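Before sending an inference request, you can optionally verify from a second terminal that the server is up by listing the models it serves; the /v1/models endpoint is part of vLLM’s OpenAI-compatible API:
curl http://localhost:8000/v1/models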
Step 2: Initiate an inference request with a cURL POST request
You’ll need to connect to the same EC2 instance in a new terminal to execute the inference POST request using cURL:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "prompt": "Explain why the sun emits immense energy", "temperature": 0, "max_tokens": 32}'
In the above sample request, I am prompting the DeepSeek-R1-Distill-Llama-8B model with the query, “Explain why the sun emits immense energy.” I set the maximum number of output tokens to 32. Based on these parameters and the prompt, the model returns the response: “The sun emits energy because it’s a massive star with a high internal temperature and nuclear fusion happening in.”
Note: For any given model prompt, the response from the model might vary slightly each time.
Figure 7: A cURL HTTP request on port 8000 for inference to the DeepSeek-R1-Distill-Llama-8B model and its response.
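vLLM’s OpenAI-compatible server also exposes a chat-style endpoint. As a minimal sketch, the same prompt can be sent as a chat request using the OpenAI chat completions message format:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "messages": [{"role": "user", "content": "Explain why the sun emits immense energy"}], "temperature": 0, "max_tokens": 32}'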
Resources
This tutorial shows how simple it is to deploy a DeepSeek-R1-Distill-Llama-8B model on an AWS m7i.2xlarge instance powered by Intel® Xeon® CPUs, using vLLM as the serving engine. These steps will work for different versions of the DeepSeek-R1 model, as well as for different AWS instances based on Intel® Xeon® CPUs. See the resources below to learn more about what is available and how to get started.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
