
Small Language Model (SLM) inference with llama.cpp on Graviton4
In this blog post, we'll explore the powerful combination of three technologies: Llama 3.2, llama.cpp, and AWS Graviton4. We'll walk through setting up and running SLM inference with llama.cpp on Graviton4 processors, analyze the performance benefits, and discuss the implications for AI deployment in cloud and edge computing environments.
To follow along, you'll need:
1. An AWS account with access to EC2 instances
2. Basic familiarity with AWS EC2 and the command line
First, set up your Graviton4-powered instance:
1. Log in to your AWS Console and navigate to EC2.
2. Launch a new instance, selecting a C8g instance type (e.g., c8g.16xlarge).
3. Choose an Amazon Linux 2023 ARM64 AMI.
4. Configure your instance details, storage (a minimum of 50 GB is recommended), and security group as needed.
5. Launch the instance and connect to it via SSH. (An equivalent AWS CLI launch is sketched below.)
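If you prefer to script the launch instead of using the console, a roughly equivalent AWS CLI invocation looks like the sketch below. The AMI ID, key pair name, and security group ID are placeholders you'll need to replace with values for your own account and region:
# Placeholders: substitute <al2023-arm64-ami-id>, <your-key-pair>, and <your-sg-id> for your account/region
aws ec2 run-instances \
  --image-id <al2023-arm64-ami-id> \
  --instance-type c8g.16xlarge \
  --key-name <your-key-pair> \
  --security-group-ids <your-sg-id> \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]'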
Once connected to your instance, install the necessary dependencies:
sudo yum update -y
sudo yum groupinstall "Development Tools" -y
sudo yum -y install cmake
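Before building, you can confirm the toolchain is in place by checking the compiler and CMake versions:
gcc --version
cmake --version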
Clone the llama.cpp repository and build it with optimizations for Graviton4:
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# build with cmake
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
cmake --build . -v --config Release -j$(nproc)
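With a CMake build like the one above, the resulting binaries (llama-cli, llama-bench, llama-server) typically land in the build's bin directory; a quick listing from the build directory confirms they were produced:
ls bin/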
Download the Llama 3.2 model of your choice. For this example, we'll use the 3B parameter model:
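The download method isn't prescribed here; one option is the Hugging Face CLI. The repository and file names below are taken from the quantized GGUF build referenced later in this post (SanctumAI/Llama-3.2-3B-Instruct-GGUF, Q4_0 quantization); adjust them if you choose a different quantization:
pip install -U "huggingface_hub[cli]"
huggingface-cli download SanctumAI/Llama-3.2-3B-Instruct-GGUF llama-3.2-3b-instruct.Q4_0.gguf --local-dir .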
Now you can run inference using the built llama.cpp and the downloaded model:
./llama-cli -m llama-3.2-3b-instruct.Q4_0.gguf -p "Explain how generative AI models create new content from training data." -n 512 -t $(nproc)
To benchmark performance, you can use the llama-bench tool built alongside llama-cli.

Text generation:
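A generation-only run can be requested by zeroing out the prompt-processing test; the token and thread counts below are illustrative:
./llama-bench -m llama-3.2-3b-instruct.Q4_0.gguf -p 0 -n 256 -t 64 -o csv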
Prompt processing + text generation:
./llama-bench -m llama-3.2-3b-instruct.Q4_0.gguf -pg 256,1024 -t 64 -o csv
To serve the model over an OpenAI-compatible HTTP API, start llama-server:
./llama-server -m ./llama-3.2-3b-instruct.Q4_0.gguf --port 8080 --host 0.0.0.0 -t 64
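Once the server is up, you can sanity-check its OpenAI-compatible endpoint with a quick request from the instance (the prompt and max_tokens here are illustrative):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'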
To measure serving throughput and latency from a client, use the token_benchmark_ray.py script from the LLMPerf toolkit, pointed at the server's OpenAI-compatible endpoint (the API key is a dummy value, since the server does not validate it by default):
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="http://<private IP address of the Graviton instance>:8080/v1"
python token_benchmark_ray.py --model "SanctumAI/Llama-3.2-3B-Instruct-GGUF" --mean-input-tokens 550 --stddev-input-tokens 50 --mean-output-tokens 250 --stddev-output-tokens 10 --max-num-completed-requests 100 --timeout 600 --num-concurrent-requests 2 --results-dir "result_outputs" --llm-api openai --additional-sampling-params '{}'
Chart B. Latency performance, lower is better
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.