
Small Language Models (SLMs) inference with llama.cpp on Graviton4

Vincent Wang
Amazon Employee
Published Jan 13, 2025

The landscape of artificial intelligence is evolving rapidly, with Small Language Models (SLMs) emerging as powerful and efficient alternatives to their larger counterparts. A prime example of this trend is Meta's recently introduced Llama 3.2 collection, which includes compact yet highly capable models designed for various applications, including edge computing.
Llama 3.2 offers a range of models, with its smaller variants exemplifying the SLM approach. The collection includes lightweight 1B and 3B parameter models, which are particularly well suited for edge devices and applications where computational resources are limited. These SLMs strike an impressive balance between performance and efficiency, making them ideal for a wide range of AI applications.
While the development of these SLMs is groundbreaking, their true potential can only be realized when paired with efficient inference software and powerful yet energy-efficient hardware. This is where llama.cpp and the AWS Graviton4 processor enter the picture.
llama.cpp, an innovative open-source project, has gained substantial traction in the AI community for its ability to enable efficient inference of language models, including SLMs like Llama 3.2, on various hardware platforms. By optimizing the inference process, llama.cpp has democratized access to state-of-the-art language models, opening up new possibilities for deploying these sophisticated models in resource-constrained environments.
AWS Graviton4, the latest generation of AWS's Arm-based processors, promises unprecedented performance and energy efficiency for AI workloads. When combined with the optimized inference capabilities of llama.cpp and the compact yet powerful nature of Llama 3.2's smaller models, it presents an exciting opportunity for high-performance, cost-effective AI deployment.
In this blog post, we'll explore the powerful combination of these three technologies: Llama 3.2 as an example of SLMs, llama.cpp, and Graviton4. We'll dive into the process of setting up and running SLM inference using llama.cpp on Graviton4 processors, analyze the performance benefits, and discuss the potential implications for AI deployment in cloud and edge computing environments.

How we deploy it

This guide will walk you through the process of setting up and running Llama 3.2 inference using llama.cpp on an Amazon EC2 C8g instance powered by Graviton4 processors.
Prerequisites
1. An AWS account with access to EC2 instances
2. Basic familiarity with AWS EC2 and command line interfaces
Step 1: Launch a Graviton4 C8g Instance
1. Log in to your AWS Console and navigate to EC2.
2. Launch a new instance, selecting a C8g instance type (e.g., c8g.16xlarge).
3. Choose an Amazon Linux 2023 ARM64 AMI.
4. Configure your instance details, storage (a minimum of 50 GB is recommended), and security group as needed.
5. Launch the instance and connect to it via SSH.
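If you prefer to script the launch, here is a minimal AWS CLI sketch. The AMI ID, key pair, security group, and subnet values below are placeholders that you should replace with your own; the block device mapping requests the recommended 50 GB root volume.
# Minimal sketch: launch a c8g.16xlarge with an Amazon Linux 2023 ARM64 AMI
# (replace the placeholder AMI, key pair, security group, and subnet IDs with your own)
aws ec2 run-instances \
  --instance-type c8g.16xlarge \
  --image-id ami-0123456789abcdef0 \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=50}' \
  --count 1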
Step 2: Install Dependencies
Once connected to your instance, install the necessary dependencies:
sudo yum update -y
sudo yum groupinstall -y "Development Tools"
sudo yum -y install cmake
Step 3: Clone and Build llama.cpp
Clone the llama.cpp repository and build it with optimizations for Graviton4:
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# build with cmake
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
cmake --build . -v --config Release -j $(nproc)
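Note that recent revisions of llama.cpp place the built binaries (such as llama-cli, llama-server, and llama-bench) under build/bin rather than the repository root. A small sketch to make the commands in the following steps runnable from the llama.cpp directory as written:
# Assumption: the CMake build above produced binaries under build/bin
cd ..   # back to the llama.cpp repository root
cp build/bin/llama-cli build/bin/llama-server build/bin/llama-bench .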
Step 4: Download Llama 3.2 Model
Download the Llama 3.2 model of your choice. For this example, we'll use the 3B parameter model:
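One way to obtain a quantized GGUF file is to download it from Hugging Face. The sketch below assumes the SanctumAI/Llama-3.2-3B-Instruct-GGUF repository (the same one referenced in the benchmark section later) and the Q4_0 quantization; check the repository's file list for the exact file name, and note that some repositories require accepting a license and authenticating with a Hugging Face token.
# Assumption: pulling the Q4_0 quantization from the SanctumAI GGUF repository on Hugging Face
sudo yum -y install python3-pip
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download SanctumAI/Llama-3.2-3B-Instruct-GGUF \
  llama-3.2-3b-instruct.Q4_0.gguf --local-dir .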
Note: GGUF (GPT-Generated Unified Format) is a file format designed for efficient storage and loading of large language models. It offers advantages such as improved compatibility, easier distribution, and better performance. To learn more about GGUF and its benefits, you can refer to the [GGUF documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md).
Step 5: Run Inference
Now you can run inference using the built llama.cpp and the downloaded model:
./llama-cli -m llama-3.2-3b-instruct.Q4_0.gguf -p "Explain how generative AI models create new content from training data." -n 512 -t $(nproc)
Here is an example of the output:

How we benchmark it

You can use llama-bench, the native benchmark tool in the llama.cpp toolkit, for performance testing.
llama-bench can perform three types of tests:
Prompt processing
Text generation
Prompt processing + text generation
Here is an example of running a throughput performance test on the llama-3.2-3b model, with a 256-token prompt and a 1024-token generated sequence. Remember to set the -t parameter to the number of vCPUs on the Graviton instance to achieve the best parallel processing performance on the CPU:
./llama-bench -m llama-3.2-3b-instruct.Q4_0.gguf -pg 256,1024 -t 64 -o csv
Benchmark Result:
Chart A. Throughput performance, higher is better
For a latency benchmark, you can use LLMPerf, a tool for evaluating the performance of LLM APIs. Install LLMPerf on a large EC2 instance, such as c6i.16xlarge, and set up a Graviton instance in the same subnet to act as the llama.cpp inference server.
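A minimal sketch of the LLMPerf setup on the benchmark instance, assuming the ray-project/llmperf repository and a Python 3 environment with pip available:
# Install LLMPerf from source on the c6i.16xlarge benchmark instance
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip3 install -e .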
Here is an example of running an end-to-end latency test on the llama-3.2-3b model, with a 550-token prompt and a 250-token generated sequence. First, run the following command on the Graviton instance; remember to set the -t parameter to the number of vCPUs on the instance to achieve the best parallel processing performance on the CPU:
./llama-server -m ./llama-3.2-3b-instruct.Q4_0.gguf --port 8080 --host 0.0.0.0 -t 64
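Before starting the benchmark, you can optionally verify from the LLMPerf instance that the server is reachable. This assumes the security group allows inbound traffic on port 8080 from the LLMPerf instance; <graviton-private-ip> is a placeholder for the Graviton instance's private IP.
# Optional sanity check from the LLMPerf instance; llama-server exposes a /health endpoint
curl http://<graviton-private-ip>:8080/health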
Then run the following commands on the LLMPerf instance to start the test:
export OPENAI_API_KEY=secret_abcdefg
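# Note: LLMPerf's OpenAI-compatible client reads the server endpoint from OPENAI_API_BASE.
# Point it at the llama.cpp server; <graviton-private-ip> is a placeholder for the Graviton
# instance's private IP. The llama.cpp server does not validate the API key, so the
# placeholder key above is sufficient.
export OPENAI_API_BASE="http://<graviton-private-ip>:8080/v1"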
python token_benchmark_ray.py --model "SanctumAI/Llama-3.2-3B-Instruct-GGUF" --mean-input-tokens 550 --stddev-input-tokens 50 --mean-output-tokens 250 --stddev-output-tokens 10 --max-num-completed-requests 100 --timeout 600 --num-concurrent-requests 2 --results-dir "result_outputs" --llm-api openai --additional-sampling-params '{}'
Benchmark Result:

Chart B. Latency performance, lower is better

Conclusion

From the benchmark results, we can see that compared to Graviton3, Graviton4 delivers a 30.6% throughput advantage and a 22% latency advantage. AWS Graviton instances excel at SLM inference thanks to their high memory bandwidth and large memory capacity, which are essential for handling model inference efficiently. Powered by custom AWS Graviton processors, these instances offer optimized performance for machine learning workloads while maintaining energy efficiency. They provide a cost-effective alternative to traditional x86 instances, with a range of instance sizes to meet various needs. Their strong network performance, seamless integration with AWS services, and overall design tailored for ML tasks make Graviton instances a compelling choice for organizations looking to balance performance, cost-effectiveness, and scalability in their SLM inference operations.

About authors

Vincent Wang is a GTM Solutions Architect for Efficient Compute at AWS, with more than 8 years of cloud computing and customer-facing experience. He specializes in cloud hypervisors and elastic distributed computing systems, follows AWS silicon innovation, and provides go-to-market technical leadership for AWS Graviton-based instances.
Shining Ma is a cloud support engineer on the premium support team at Amazon Web Services (AWS). He is an accredited subject-matter expert (SME) in EC2 Linux, Elastic Block Store (EBS), and ElastiCache. Shining thrives on resolving the intricate issues and challenges that customers encounter while using AWS services.
Dennis Lin is a Cloud Architect at AWS ProServe, where he helps customers solve their business problems by designing and running innovative solutions on AWS. His areas of interest are cloud infrastructure, security, and AI/ML.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
