
Cost-Effective LLM Serving with Optimal TP/DP on EKS using Inf2.48xl and g5.12xl

Learn how to cut LLM inference costs using EKS with Inf2.48xl and g5.12xl. We tuned tensor and data parallelism with vLLM, achieving up to 38% better RPS per dollar on Inferentia2—all while keeping latency under 1.5s and fully utilizing hardware.

Yahav Biran
Amazon Employee
Published May 20, 2025

To help a customer reduce inference costs and improve infrastructure resilience beyond GPU-only deployments, we evaluated meta/llama-3.1-8B with 16K input tokens and 512 output tokens on both NVIDIA GPU (g5.12xlarge) and AWS Inferentia2 (inf2.48xlarge) instances. Starting from their existing g5.12xlarge setup, we tuned tensor parallelism (TP) to meet a 1.5-second latency target, then scaled out via data parallelism (DP) to maximize throughput. Using EKS with Karpenter and the AWS Load Balancer Controller, we declaratively orchestrated hardware-specific deployments, selecting the optimal instance types and resource constraints for each accelerator. The software stack, centered on vLLM and backed by AWS Deep Learning Containers (DLCs) and Deep Learning AMIs (DLAMIs), abstracts away accelerator differences and enables a unified, production-ready inference pipeline across GPU and Neuron environments.
We began our evaluation using the customer’s existing setup on g5.12xlarge, which provides access to 4 A10G GPUs. The EKS cluster was provisioned using Karpenter, allowing dynamic scaling of compute nodes. We defined dedicated GPU node pools and EC2NodeClass resources, which are available in the scalable-hw-agnostic-inference GitHub repository. To monitor pod- and node-level resource usage during benchmarking, we enabled Amazon CloudWatch Container Insights, which provided visibility into GPU utilization, memory, and CPU metrics. With the infrastructure and observability in place, we deployed multiple configurations of the model using vLLM, varying batch size (bs) and tensor parallelism (tp), for example bs8-tp4, bs16-tp4, and bs32-tp4. By measuring throughput and p95 latency for each configuration, we identified the TP setting that met our latency constraint (≤1.5 seconds) while maximizing throughput per dollar.
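For illustration, here is a minimal sketch of what one of these vLLM variants (a10g-bs8-tp4) might look like as a Kubernetes Deployment. The actual manifests live in the scalable-hw-agnostic-inference repository and may differ; the container image is a placeholder, the Hugging Face model ID is assumed, and mapping the "bs" value to vLLM's --max-num-seqs flag is our interpretation.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: a10g-bs8-tp4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: a10g-bs8-tp4
  template:
    metadata:
      labels:
        app: a10g-bs8-tp4
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: g5.12xlarge   # lands on the Karpenter GPU node pool
      containers:
        - name: vllm
          image: <vllm-openai-server-image>             # placeholder, e.g. a vLLM DLC image
          args:
            - --model=meta-llama/Llama-3.1-8B           # assumed Hugging Face model ID
            - --tensor-parallel-size=4                  # tp4: shard across all 4 A10G GPUs
            - --max-num-seqs=8                          # bs8: concurrent sequences per batch
            - --max-model-len=16896                     # room for a 16K prompt + 512 output tokens
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4                         # pin the pod to a full g5.12xlarge
```

Requesting all 4 GPUs per pod keeps each replica aligned with the TP degree, so one pod fully occupies one g5.12xlarge.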
To evaluate model performance under load, we launched the load generator defined in load.yaml, which issues inference requests with fixed input and output token lengths to the model’s load balancer endpoint. The deployed service is fronted by an Application Load Balancer provisioned by the AWS Load Balancer Controller, so we used its built-in CloudWatch metrics: TargetResponseTime for request latency and HTTPCode_Target_2XX_Count for successful inference throughput. To systematically identify the maximum sustainable throughput of each configuration, we deployed the controller defined in load-ctl.yaml. This load-control component gradually increases the request rate until the observed latency exceeds our 1.5-second threshold, allowing us to identify the "breaking point" of each configuration and select the optimal tensor parallelism setup for downstream scaling.
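To make the setup concrete, here is one plausible shape for such a load job, in the spirit of load.yaml. The image, environment variable names, and endpoint are placeholders; the real manifest in the repository may be structured differently.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-load-test
spec:
  parallelism: 4                              # fan out several request generators
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: load
          image: <load-generator-image>       # placeholder
          env:
            - name: TARGET_URL                # ALB endpoint fronting the vLLM service
              value: http://<alb-dns-name>/v1/completions
            - name: INPUT_TOKENS
              value: "16384"                  # fixed 16K-token prompt
            - name: OUTPUT_TOKENS
              value: "512"                    # fixed 512-token completion
            - name: REQUESTS_PER_SECOND
              value: "10"                     # ramped by the load controller until p95 latency > 1.5 s
```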
Figure: Latency and throughput under load for LLM inference on GPU vs. Inferentia2.

The chart above shows the results of the load test as we gradually increased the request rate using the controller. Across all deployment variations, latency remained within the acceptable range up to approximately 37 requests per second (RPS). Beyond that point, several configurations began to exceed the 1.5-second response-time threshold, so we set 37 RPS as the effective throughput ceiling for configurations that met the latency constraint. While most variants performed similarly, we ruled out inf2-bs32-tp8 because it exhibited consistently higher latency without offering any throughput advantage over more efficient configurations. Based on this data, we selected the optimal tensor parallelism (TP) setting for each platform: a10g-bs8-tp4, utilizing all 4 NVIDIA A10G GPUs on g5.12xlarge, and inf2-bs8-tp8, using 4 Neuron devices (8 NeuronCores) on inf2.48xlarge. Although throughput across several options was comparable, these configurations offered the best balance of latency, utilization, and scalability, forming the baseline for our data-parallel scaling strategy.
As a further validation of the selected TP configurations, we monitored NeuronCore and GPU utilization during peak load. The charts below show that both inf2-bs8-tp8 on Inferentia2 and a10g-bs8-tp4 on g5.12xlarge consistently drove high utilization across all allocated cores. On Inferentia2, the assigned NeuronCores operated near full capacity with only minimal idle time, while the four A10G GPUs on g5.12xlarge showed similarly dense activity. In other words, the chosen TP settings are effectively saturating the underlying hardware rather than leaving compute idle, which further supports their selection as the baseline for scaling via data parallelism.
Figure: NeuronCore vs. GPU utilization during LLM inference on Inferentia2 and A10G.

With the optimal TP configurations in place, we scaled each setup using data parallelism (DP) to maximize overall throughput per instance. On g5.12xlarge, we deployed a single a10g-bs8-tp4 pod per instance, achieving approximately 36 RPS. On inf2.48xlarge, we packed three inf2-bs8-tp8 pods, each using 4 Neuron devices (8 NeuronCores), fully occupying the instance's 12 Inferentia2 devices and reaching an aggregate of 114 RPS per instance. From a cost-efficiency perspective, dividing throughput by the hourly on-demand rate yields 36 / $5.672 ≈ 6.35 RPS per dollar for g5.12xlarge and 114 / $12.9813 ≈ 8.78 RPS per dollar for inf2.48xlarge. Despite the higher hourly cost of Inferentia2, it delivers ~3.17× more throughput per instance and ~38% better RPS per dollar, making it the more scalable and cost-effective choice for production inference at scale.
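As a sketch of how this data-parallel packing can be expressed declaratively (the image and resource names assume the Neuron device plugin is installed and exposing aws.amazon.com/neuron; the repository manifests may differ), the Inferentia2 variant simply raises the replica count while each pod requests 4 Neuron devices:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inf2-bs8-tp8
spec:
  replicas: 3                                 # DP: three TP-8 replicas per inf2.48xlarge
  selector:
    matchLabels:
      app: inf2-bs8-tp8
  template:
    metadata:
      labels:
        app: inf2-bs8-tp8
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: inf2.48xlarge
      containers:
        - name: vllm
          image: <vllm-neuron-image>          # placeholder, e.g. a Neuron DLC with vLLM
          args:
            - --model=meta-llama/Llama-3.1-8B # assumed Hugging Face model ID
            - --tensor-parallel-size=8        # tp8: shard across 8 NeuronCores
            - --max-num-seqs=8                # bs8
            - --max-model-len=16896
          resources:
            limits:
              aws.amazon.com/neuron: 4        # 4 Inferentia2 devices = 8 NeuronCores per pod
```

With each replica requesting 4 devices, the three replicas together claim all 12 Inferentia2 devices, so the scheduler bin-packs them onto a single inf2.48xlarge and the instance delivers the aggregate 114 RPS described above.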
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
