DeepSeek-R1 Distill Model on CPU with AWS Graviton4
The DeepSeek-R1-Distill-Llama-70B model can be deployed on AWS Graviton4-powered EC2 instances to achieve price-performance suitable for batch tasks.
# Install build tools
sudo yum -y groupinstall "Development Tools"
sudo yum -y install cmake

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with cmake
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
cmake --build . -v --config Release -j $(nproc)
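Once the build finishes, the llama-cli and llama-server binaries are placed under build/bin. The quick sanity check below is a minimal sketch run from the build directory; --version simply prints the build information and exits.

# Confirm the binaries were produced (run from the build directory)
ls bin/llama-cli bin/llama-server
./bin/llama-cli --version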
# Download the model
cd bin
wget https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf
Running model inference with the llama.cpp CLI
./llama-cli -m DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64 -no-cnv
You can adjust the prompt and other parameters passed to llama-cli in the script.
Running model inference with the llama.cpp server
./llama-server -m DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf --host 0.0.0.0 --port 8080
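Once the server is up, you can send requests to it over HTTP from another shell. The call below is a minimal sketch against llama-server's built-in /completion endpoint; the host, port, prompt, and n_predict values are examples and should be adapted to your deployment.

# Send a completion request to the running server
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Building a visually appealing website can be done in ten simple steps:", "n_predict": 128}'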
Alternatively, you can use the --hf-repo and --hf-file options with the llama-cli or llama-server command to load the model directly from the Hugging Face repo.
Running model inference with the llama.cpp CLI
./llama-cli --hf-repo bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF --hf-file DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64 -no-cnv
Running model inference with the llama.cpp server
./llama-server --hf-repo bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF --hf-file DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf --host 0.0.0.0 --port 8080
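llama-server also exposes an OpenAI-compatible chat endpoint, so existing OpenAI-style clients can usually be pointed at it by changing only the base URL. The request below is a sketch assuming the server from the previous command is listening on port 8080; the model field is informational for a single-model server.

# Send a chat request to the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Llama-70B-Q4_0", "messages": [{"role": "user", "content": "List ten simple steps for building a visually appealing website."}], "max_tokens": 256}'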
The table below summarizes completed requests per minute for different input and output token lengths:

| Input tokens | Output tokens | Completed requests per minute |
|---|---|---|
| 40 | 100 | 4.5 |
| 200 | 100 | 3.03 |
| 200 | 150 | 2.36 |
| 800 | 150 | 2 |
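To collect comparable numbers on your own instance, llama.cpp ships a llama-bench utility that measures prompt-processing and token-generation throughput (reported as tokens per second rather than requests per minute). The invocation below is a sketch: the prompt and generation lengths mirror one row of the table, and the thread and repetition counts are example values.

# Benchmark 200 prompt tokens and 150 generated tokens, 3 repetitions, 64 threads
./llama-bench -m DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf -p 200 -n 150 -t 64 -r 3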
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.