DeepSeek-R1 Distill Model on CPU with AWS Graviton4
The DeepSeek-R1-Distill-Llama-70B model can be deployed on AWS Graviton4-powered EC2 instances to achieve price-performance suitable for batch tasks.
# Install build tools
sudo yum -y groupinstall "Development Tools"
sudo yum -y install cmake

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with cmake
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
cmake --build . -v --config Release -j $(nproc)
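Once the build finishes, the llama-cli and llama-server binaries are placed under build/bin. The quick sanity check below is a minimal sketch run from the build directory; --version simply prints the build information and exits.

# Confirm the binaries were produced (run from the build directory)
ls bin/llama-cli bin/llama-server
./bin/llama-cli --version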
# Download the model
cd bin
wget https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf
Running model inference with the llama.cpp CLI
./llama-cli -m DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64 -no-cnv
You can adjust the prompt and other parameters passed to llama-cli in the script.
Running model inference with the llama.cpp server
./llama-server -m DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf --host 0.0.0.0 --port 8080
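Once the server is up, you can send requests to it over HTTP from another shell. The call below is a minimal sketch against llama-server's built-in /completion endpoint; the host, port, prompt, and n_predict values are examples and should be adapted to your deployment.

# Send a completion request to the running server
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Building a visually appealing website can be done in ten simple steps:", "n_predict": 128}'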
Alternatively, you can use the --hf-repo and --hf-file options with the llama-cli or llama-server command to load the model directly from the Hugging Face repo.
Running model inference with the llama.cpp CLI
./llama-cli --hf-repo bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF --hf-file DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64 -no-cnv
Running model inference with the llama.cpp server
./llama-server --hf-repo bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF --hf-file DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf --host 0.0.0.0 --port 8080
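llama-server also exposes an OpenAI-compatible chat endpoint, so existing OpenAI-style clients can usually be pointed at it by changing only the base URL. The request below is a sketch assuming the server from the previous command is listening on port 8080; the model field is informational for a single-model server.

# Send a chat request to the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Llama-70B-Q4_0", "messages": [{"role": "user", "content": "List ten simple steps for building a visually appealing website."}], "max_tokens": 256}'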
The table below summarizes completed requests per minute for different input and output token lengths:

| Input tokens | Output tokens | Completed requests per minute |
|---|---|---|
| 40 | 100 | 4.5 |
| 200 | 100 | 3.03 |
| 200 | 150 | 2.36 |
| 800 | 150 | 2 |
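To collect comparable numbers on your own instance, llama.cpp ships a llama-bench utility that measures prompt-processing and token-generation throughput (reported as tokens per second rather than requests per minute). The invocation below is a sketch: the prompt and generation lengths mirror one row of the table, and the thread and repetition counts are example values.

# Benchmark 200 prompt tokens and 150 generated tokens, 3 repetitions, 64 threads
./llama-bench -m DeepSeek-R1-Distill-Llama-70B-Q4_0.gguf -p 200 -n 150 -t 64 -r 3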
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.