
Finding Your LLM’s Breaking Point: Load Testing SageMaker Real-time Inference Endpoints with Locust
Learn how variables such as model size, instance choice, deployment configuration, and inference parameters impact peak requests-per-second and latency using Locust, a load testing tool.
Published May 28, 2025
Every large language model (LLM) has its limits, but how do you find them? Whether deploying new cutting-edge AI features or scaling established models, understanding the performance boundaries of your inference endpoints is crucial for delivering reliable, low-latency applications. Yet, with so many variables — model size, instance choice, hosting framework, deployment configuration, and inference parameters — pinpointing bottlenecks can feel like searching for a needle in a haystack.
In this post, we’ll show you how to use Locust, an open-source load testing tool, to benchmark SageMaker real-time inference endpoints. You’ll learn how to simulate user demand, measure peak requests-per-second and latency, and see how different deployment choices impact your model’s responsiveness.
According to the documentation, Amazon SageMaker real-time inference is ideal for inference workloads with real-time, interactive, low-latency requirements. You can deploy your model to SageMaker AI hosting services and get an endpoint for inference. These endpoints are fully managed and support autoscaling. You control the variables — model choice, instance type, instance count, auto-scaling, deployment container, hosting framework, deployment configuration, and the size of input and output payloads (input and output token count). All of these variables can impact inference performance.
SageMaker’s real-time inference documentation discusses two essential characteristics of inference performance: peak requests-per-second (RPS) and latency. A SageMaker endpoint can process only a fixed number of requests per second, determined by its available resources. Once that capacity is reached, RPS plateaus even as latency climbs; the flat RPS curve is simply a consequence of hitting the endpoint’s processing limits. As additional users send requests, those requests begin queuing, resulting in:
- Longer wait times for responses (increased latency)
- Consistent RPS that fails to increase despite more incoming traffic
- Potential request failures once queue times exceed the 60-second invocation timeout
This behavior is common in systems with fixed processing capacity — once saturated, adding more traffic increases latency without improving throughput.
The behavior of SageMaker’s real-time inference endpoints can be visualized in Locust’s UI below. Notice how latency continues to increase from ~2–9.5 seconds (middle chart) as more users are added, up to a maximum of 50 (bottom chart). However, the requests-per-second rate peaks and then holds constant at ~5.4 RPS.

According to the documentation, Locust is an open-source performance tool (aka load testing tool) for HTTP and other protocols. Its developer-friendly approach lets you define your tests in regular Python code. You can run Locust tests from the command line or using its web-based UI. With Locust, you can view throughput, response times, and errors in real time and/or export for later analysis. You can import regular Python libraries into your tests, and with Locust’s pluggable architecture, it is infinitely expandable.

All of the open-source code for this article can be found on GitHub. The repository includes a Jupyter Notebook to deploy the models and the Locust load testing files.
Previous articles have been written on the topic of load testing SageMaker with Locust, including Best practices for load testing Amazon SageMaker real-time inference endpoints, and Achieve high performance at scale for model serving using Amazon SageMaker multi-model endpoints with GPU.
Examples of code for using Locust with SageMaker are available on GitHub, including the repositories SageMaker Endpoint Load Testing and Best practices for load testing Amazon SageMaker real-time inference endpoints. I’ve drawn upon these existing resources and the latest Locust documentation, Testing other systems/protocols, to author the code for this article, updating, simplifying, and optimizing it for the models we will test.
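The original load-test script is not reproduced here, but the sketch below, which follows the pattern in Locust’s Testing other systems/protocols documentation, shows the general shape of such a test: a Locust User wraps boto3’s invoke_endpoint call and reports each request’s timing back to Locust. The endpoint name, prompt, and payload schema are illustrative assumptions rather than the exact values used in the tests.

```python
# locustfile.py -- a minimal sketch; endpoint name, prompt, and payload schema are assumptions
import json
import time

import boto3
from locust import User, between, task


class SageMakerUser(User):
    """Simulates a user sending prompts to a SageMaker real-time inference endpoint."""

    wait_time = between(1, 2)                 # pause 1-2 seconds between requests
    endpoint_name = "llm-load-test-endpoint"  # assumed endpoint name

    def on_start(self):
        # One boto3 SageMaker Runtime client per simulated user
        self.client = boto3.client("sagemaker-runtime")

    @task
    def generate_text(self):
        # Chat-style payload; whether top_k and repetition_penalty pass through
        # depends on the serving container's request schema
        payload = {
            "messages": [
                {
                    "role": "user",
                    "content": "Name three popular places to visit in London, with descriptions.",
                }
            ],
            "max_tokens": 256,       # varied across tests (see Test 2)
            "temperature": 0.7,      # held constant across tests
            "top_p": 0.9,
            "top_k": 50,
            "repetition_penalty": 1.1,
        }
        start = time.perf_counter()
        exception, length = None, 0
        try:
            response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                ContentType="application/json",
                Body=json.dumps(payload),
            )
            length = len(response["Body"].read())
        except Exception as e:       # report failures (e.g., timeouts) to Locust
            exception = e
        # Report the request to Locust so it appears in the UI and statistics
        self.environment.events.request.fire(
            request_type="sagemaker",
            name="invoke_endpoint",
            response_time=(time.perf_counter() - start) * 1000,  # milliseconds
            response_length=length,
            response=None,
            context={},
            exception=exception,
        )
```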
In the script above, we will gradually increase the max_tokens parameter while we explore the impact on RPS and latency. All other inference parameters will remain consistent, including temperature, top_p, top_k, and repetition_penalty.

To call the Python script, we use the locust command with a specific locust.conf file. The config file defines the correct Python script in its locustfile parameter, as shown in the sketch below. Locust’s documentation provides a list of all available configuration options.

We will gradually increase the concurrent users parameter in the locust.conf file from 10 to 100. According to the documentation, a processes parameter value of -1 means Locust will auto-detect the number of cores in your machine and launch one worker for each. For all the load tests in this post, we launched eight workers, one on each available core. We will also change the host parameter, which sets the title at the top of the Locust UI, each time we launch a new SageMaker real-time inference endpoint with a different configuration. All other parameters in the file will remain the same, including spawn-rate, which determines how many new users are spawned per second.
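The exact command and configuration file are not reproduced here; a minimal sketch might look like the following, where the file name, host label, and values are assumptions (users and host were varied between runs as described above).

```ini
# locust.conf -- illustrative values; launch with: locust --config locust.conf

# the Python test script shown earlier
locustfile = locustfile.py

# label displayed at the top of the Locust UI; changed for each endpoint configuration
host = g6e.xlarge-llama-1b

# concurrent users; varied from 10 to 100 across tests
users = 25

# new users started per second
spawn-rate = 1

# -1 launches one worker process per available CPU core
processes = -1
```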
According to the blog post, Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15, Amazon recently announced the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine (released April 13, 2025). This version now supports the latest open-source models, such as Meta’s Llama 4 Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R1, and many more. This LMI container also includes Deep Java Library (DJL) v0.33.0, the open-source, high-level, engine-agnostic Java framework for deep learning. Lastly, this container contains NVIDIA’s CUDA Toolkit 12.8, which includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.
We will be using the LMI container recommended in that blog post: 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128. Update the AWS Region if you are not in us-east-1.

vLLM is an open-source library that makes running large language models faster and more efficient, especially for handling many requests simultaneously. It uses more intelligent memory management, mainly through PagedAttention, which helps the model use GPU memory better and reduces wasted resources. This allows vLLM to serve LLMs with much higher speed and lower cost compared to traditional methods. vLLM is easy to use, supports many popular LLMs, and is designed to work well in production environments where quick responses and scalability are important. In this post, we will be using the vLLM configuration recommended in the blog post mentioned above, with some slight modifications:
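The exact settings are not reproduced here. With the LMI container, vLLM is configured through environment variables on the SageMaker model; a minimal sketch might look like the following, where the model ID and values are illustrative assumptions rather than the configuration used in the original tests.

```python
# Illustrative LMI/vLLM configuration passed as container environment variables;
# the model ID and values are assumptions, not the exact settings used in the tests
vllm_config = {
    "HF_MODEL_ID": "meta-llama/Llama-3.2-1B-Instruct",  # model to download from Hugging Face
    "OPTION_DTYPE": "bf16",                             # BF16 precision, no further quantization
    "OPTION_MAX_MODEL_LEN": "8192",                     # maximum context length
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32",              # maximum concurrent requests per model copy
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",             # use all GPUs on the instance
}
```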
Using the SageMaker Python SDK, we then use the vLLM configuration as part of the code to create the SageMaker model, endpoint configuration, and real-time inference endpoint:
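The original deployment code is not shown here; a minimal sketch under the assumptions above might look like this, where model.deploy() creates the model, endpoint configuration, and endpoint in one call (the names and timeout value are illustrative).

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# LMI v15 container image referenced above (us-east-1)
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

model = Model(
    image_uri=image_uri,
    env=vllm_config,        # environment-variable configuration from the previous step
    role=role,
    name="llama-1b-lmi15",  # assumed model name
    sagemaker_session=session,
)

# deploy() creates the SageMaker model, endpoint configuration, and real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.xlarge",
    endpoint_name="llm-load-test-endpoint",      # assumed name, matching the Locust script
    container_startup_health_check_timeout=900,  # allow time for model download and load
)
```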
We will explore several load testing scenarios to demonstrate how Locust can measure an Amazon SageMaker real-time endpoint’s peak RPS and latency. To save time, money, and resources, we will use smaller, open-weight text generation models from 1–14 billion parameters, all available on Hugging Face. The same scaling and performance principles should apply to larger models and more powerful instance types. Tests include:
- Test 1: Impact of Model Size on Peak RPS and Latency
- Test 2: Impact of Generated Token Count on Peak RPS and Latency
- Test 3: Impact of Instance Count on Peak RPS and Latency

First, we will examine Locust load test results for four lightweight open-weight models with parameters ranging from 1–14B, ordered by model size:
All four models have a tensor type of BF16 (bfloat16), which refers to the precision with which the model’s data, such as weights, activations, and gradients, are stored and computed. No further quantization was used in the tests. Also, note that the Llama models have a max_model_len of 128k, whereas the Qwen models are ~32k.

Each test was run on (1) ml.g6e.xlarge instance, which contains (1) NVIDIA L40S GPU with 48 GB of memory and (4) vCPUs with 32 GiB of memory. First, we start with a baseline of one user to assess the endpoint’s responsiveness under minimal load. The p95 Latency column shows the response speed of each model with no load.


Once we have our single-user baseline, we can test the endpoint under a reasonable load, starting with 25 users. Examining the load tests reveals a sharp decrease in peak RPS and a corresponding increase in p95 latency (measured in seconds) as the model size and number of parameters increase.


The peak RPS and p95 latency are approximate values, as fluctuations in performance occur even after peak RPS is reached. We typically waited 1–2 minutes after peak RPS was reached before recording the approximate performance metrics.

The maximum number of tokens to generate in the completion is controlled through the max_tokens parameter of the invoke_endpoint method. Assuming the model’s response is equal to the max_tokens value, increasing the max_tokens value will decrease the peak RPS while increasing the p95 latency. A model’s response may be shorter or longer than the max_tokens value depending on a number of variables (prompt, temperature, repetition_penalty). If the response is longer, it is truncated to the max_tokens value. For example, if the model’s response to the prompt “Name three popular places to visit in London, with descriptions.” is ~300 tokens, but the max_tokens value is set to 64, the response will be truncated mid-sentence after the word “Changing”:

“1. Buckingham Palace: Buckingham Palace is the official residence and administrative headquarters of the British monarch, located in the City of Westminster, London. The palace is known for its grandeur and historical significance, with a mix of Gothic, Baroque, and Neo-Classical architectural styles. Visitors can watch the Changing”
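As an illustration of the truncation described above, the call below caps generation at 64 tokens; the endpoint name and chat-style payload schema are assumptions carried over from the earlier sketches.

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Name three popular places to visit in London, with descriptions.",
        }
    ],
    "max_tokens": 64,  # generation stops after 64 tokens, truncating the ~300-token answer
}

response = runtime.invoke_endpoint(
    EndpointName="llm-load-test-endpoint",  # assumed endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```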
We tested max_tokens values of 32–4,096 with 25 users on (1) ml.g6e.xlarge instance. Looking at the results, we can observe a sharp decrease in peak RPS and a sharp increase in p95 latency as the max_tokens value increases.

Tests with max_tokens values of 2,048 and 4,096 failed, with latency exceeding the endpoint’s 60-second invocation timeout at 25 users, as highlighted below by the red box. The number of generated tokens exceeded the compute capacity of the ml.g6e.xlarge instance. To overcome this limit and allow longer responses from this model, we could decrease the number of concurrent users or increase the instance size.
In the third test, we will increase the number of instances behind the SageMaker real-time endpoint, increasing the peak RPS while decreasing the p95 latency.
We tested an increasing number of ml.g5.12xlarge instances, which contain (4) NVIDIA A10G Tensor Core GPUs with 96 GB of memory and (48) vCPUs with 192 GiB of memory. The number of concurrent users was tested at 25 and 50.


We can set the initial instance count during deployment, increase the count manually at any time post-deployment, or use variant automatic scaling.
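As a sketch of the latter two options (endpoint, variant, and capacity values are assumptions): the instance count behind an existing endpoint can be raised with update_endpoint_weights_and_capacities, and the production variant can be registered as a scalable target with Application Auto Scaling.

```python
import boto3

sm = boto3.client("sagemaker")

# Manually raise the instance count behind an existing endpoint variant
# (endpoint and variant names are assumptions)
sm.update_endpoint_weights_and_capacities(
    EndpointName="llm-load-test-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "AllTraffic", "DesiredInstanceCount": 2}
    ],
)

# Or register the variant as a scalable target with Application Auto Scaling;
# a scaling policy would then be attached with put_scaling_policy
autoscaling = boto3.client("application-autoscaling")
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/llm-load-test-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
```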


Load testing SageMaker real-time inference endpoints with Locust reveals that endpoint performance is closely tied to model size, instance type, deployment configuration, and inference settings. SageMaker maintains steady peak RPS up to its processing limit, after which additional traffic only increases latency and risks request timeouts. Locust makes it easy to simulate real-world loads, monitor key metrics, and pinpoint bottlenecks. By leveraging these insights and best practices, you can fine-tune your SageMaker deployments for optimal reliability and high-performance model inference at scale.

This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, images, logos, and brands are the property of their respective owners.