
Strategies and options to get GPU capacity on AWS
Startups use GPUs for ML, GenAI, HPC, Graphics, and more. Learn how to access EC2 GPU capacity on AWS - explore the options here.
Mehdi Yosofie
Amazon Employee
Published Apr 9, 2025
As Solutions Architects in the AWS Startups team, we work with many startups that run diverse workloads on AWS. These startups build technology for different domains such as HCLS, Automotive, B2C, B2B, ISV, Fintech, Robotics, and Media and Entertainment, or they are AI startups. They use AWS for compute, networking, databases, storage, analytics, ML training and inference, and much more.
One of the topics startups regularly ask us about is GPU capacity. GPUs, as part of Amazon EC2's accelerated computing platform, are used for machine learning (ML), generative AI applications, high performance computing (HPC), and graphics, rendering, and gaming applications. In machine learning, GPUs are typically used to train or fine-tune models, or to host models for inference. There are multiple options for getting access to GPU instances on the AWS cloud. Here are the options.

- EC2 Capacity Blocks for ML
- On-Demand Capacity Reservation
- Consider other GPU instance types and sizes
- AWS Inferentia and AWS Trainium
- Consider CPUs instead of GPUs
- Spot
- Talk to your AWS Account Team (TAM, SA, AM)
- Diversification + Flexibility is key (Instance type and size flexible, AZ and Region flexible, time flexible)
- Increase AWS service limits and quotas
- Amazon SageMaker via SageMaker training plans
- Amazon EC2 UltraServers and Amazon EC2 UltraClusters
One option to get access to the popular Amazon EC2 P4, P5, P5e, and P5en instances is “Amazon EC2 Capacity Blocks for ML”. Capacity Blocks is an Amazon EC2 feature that lets you reserve popular GPU instances in advance. You can reserve capacity for up to six months, in cluster sizes of 1 to 64 instances, and up to 8 weeks ahead of time.
Check out the webpage or the AWS Console to see which instance types are currently supported. The supported instances are only available in some AWS Regions; check the AWS Console or the docs here. Amazon EC2 Capacity Blocks for ML are charged up front at the time of reservation. Prices for Capacity Blocks can be found here. Prices may change on a regular basis, so verify the price before requesting your Capacity Block.
You can find Amazon EC2 Capacity Blocks in the Amazon EC2 Console as shown in the screenshot.

- Capacity Blocks and which instance types are supported in which AWS Regions
- Read more about Capacity Blocks pricing and billing
- Capacity Blocks Getting Started
CLI commands to find and purchase a Capacity Block (more info):
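As a minimal sketch (instance type, dates, and count are placeholders), you would first list available Capacity Block offerings and then purchase one of the returned offerings:

```
# Find available Capacity Block offerings for a given instance type and duration
aws ec2 describe-capacity-block-offerings \
  --instance-type p5.48xlarge \
  --instance-count 1 \
  --capacity-duration-hours 24 \
  --start-date-range 2025-05-01T00:00:00Z \
  --end-date-range 2025-05-15T00:00:00Z

# Purchase one of the offerings returned by the previous command
aws ec2 purchase-capacity-block \
  --capacity-block-offering-id <offering-id-from-previous-output> \
  --instance-platform Linux/UNIX
```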
Another option is “Amazon EC2 On-Demand Capacity Reservations”. With an On-Demand Capacity Reservation, you reserve EC2 instances for your AWS account and pay the regular on-demand price. While, as of April 2025, EC2 Capacity Blocks for ML supports only P4 and P5 instances, you can use On-Demand Capacity Reservations for other instance types as well. You can either create an immediate reservation, which you can modify or cancel at any time, or start at a future date (see blog article), in which case you specify the amount of capacity and the duration you commit to. Once the commitment period is over, you can modify or cancel the reservation to release it. An On-Demand Capacity Reservation is charged at on-demand pricing whether you run instances in the reserved capacity or not.
Read more about On-Demand Capacity Reservation:
- Difference between Capacity Blocks for ML and On-Demand Capacity Reservation
- On-Demand Capacity Reservation concepts
- On-Demand Capacity Reservation pricing and billing
- On-Demand Capacity Reservation Getting Started
CLI command to create a Capacity Reservation for immediate use (more info):
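A minimal sketch (instance type, Availability Zone, and count are placeholders):

```
# Create an immediate Capacity Reservation; it can be modified or cancelled at any time
aws ec2 create-capacity-reservation \
  --instance-type g6e.xlarge \
  --instance-platform Linux/UNIX \
  --availability-zone eu-central-1a \
  --instance-count 2
```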
CLI command to create a future-dated Capacity Reservation (more info):
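A sketch with placeholder dates and counts; the exact parameter set for future-dated reservations is worth verifying against the current CLI reference:

```
# Create a future-dated Capacity Reservation for a committed time window
# (future-dated reservations use targeted instance match criteria)
aws ec2 create-capacity-reservation \
  --instance-type g6e.xlarge \
  --instance-platform Linux/UNIX \
  --availability-zone eu-central-1a \
  --instance-count 8 \
  --start-date 2025-06-01T09:00:00Z \
  --end-date-type limited \
  --end-date 2025-06-15T09:00:00Z \
  --instance-match-criteria targeted
```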
Besides Amazon EC2 P4 and P5 instances, don't forget about other cost-effective GPU instances for your deep learning workloads. Evaluate and test other (cost-effective) GPU and accelerated computing instances. Look at the AWS “Recommended GPU instances” page and the “Amazon EC2 instance types” page → “Accelerated Computing” to find the full lists. G instances are usually a better choice than P instances for ML inference, for development before you run your training at scale, or for specific use cases such as high-performance computing for large-scale protein simulations.

Click on “Accelerated Computing” on the Amazon EC2 instance types page to see a bigger list of accelerated computing instances.

You can see in the next screenshot that there are different instance sizes with different capabilities, such as single-GPU and multi-GPU configurations. Let's look at EC2 G6e as an example.
Example: EC2 g6e instance sizes and details
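If you prefer the CLI over the console, you can compare sizes programmatically (a sketch; the query fields assume the standard GpuInfo response shape of describe-instance-types):

```
# Compare GPU count, GPU model, and total GPU memory across a few G6e sizes
aws ec2 describe-instance-types \
  --instance-types g6e.xlarge g6e.12xlarge g6e.48xlarge \
  --query "InstanceTypes[].[InstanceType, GpuInfo.Gpus[0].Count, GpuInfo.Gpus[0].Name, GpuInfo.TotalGpuMemoryInMiB]" \
  --output table
```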

Amazon EC2 Inf1/Inf2 and Trn1/Trn2 instances are powered by AWS Inferentia and AWS Trainium chips, which are designed and built by AWS and provide better price performance than comparable GPU instances. You have to use the Neuron SDK and run compiled code on these chips, which requires some engineering effort. Neuron supports frontier models such as Llama 3.3 70B and Llama 3.1 405B. The Neuron SDK integrates with PyTorch and JAX, and offers the NxD Training and NxD Inference PyTorch libraries for distributed workflows. It also supports third-party libraries such as Hugging Face Optimum Neuron, PyTorch Lightning, and the AXLearn library for JAX model training. Read more in the Neuron SDK documentation.
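To give a feel for the setup effort, installing the PyTorch Neuron packages on an Inf2 or Trn1/Trn2 instance looks roughly like this (a sketch based on the AWS Neuron pip repository; verify package versions against the current Neuron setup guide):

```
# Install PyTorch for Neuron and the Neuron compiler from the AWS Neuron pip repository
python -m pip install torch-neuronx neuronx-cc \
  --extra-index-url https://pip.repos.neuron.amazonaws.com
```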
You can find examples of Llama model deployments on Inferentia chips on Amazon EKS, and of Llama 3 fine-tuning on Trainium chips, in the Data on EKS repository.

Check out these two videos from Julien Simon, who benchmarked Amazon Trainium and NVIDIA chips in 2023. Even though the videos are from 2023 and the benchmarks were done on chip generations from that time, the presenter concludes that Trainium chips show significant advantages in both speed and cost-effectiveness across his benchmarks, though results may vary depending on specific models and workloads.
Consider and evaluate whether CPUs could run your ML workloads. Memory-optimized Amazon EC2 R instances can be used for ML inference due to their large memory capacity. If the model fits into memory, consider running ML inference workloads on EC2 R8g instances. You can make the model fit through quantization and flash attention (read more here), or another technique such as distillation. Some frameworks take advantage of Intel's MKL-DNN, which speeds up training and inference on EC2 C5 CPU instance types. EC2 C5 instances have up to 72 Intel vCPUs. Read more here. It's worth mentioning that Intel Advanced Vector Extensions (AVX) often improves performance on Intel CPUs, and the Scalable Vector Extension 2 (SVE2), supported on Graviton4, accelerates vector multiplication and loops.
- https://aws.amazon.com/ec2/instance-types/ → Memory Optimized
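As a quick sanity check on a running Linux instance, you can inspect which vector extensions the CPU actually exposes (a sketch; flag names vary by CPU generation):

```
# On an Intel-based instance: look for avx, avx2, and avx512 feature flags
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u

# On a Graviton4-based (Arm) instance: look for sve and sve2
lscpu | grep -io 'sve[0-9]*' | sort -u
```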
If your workload is stateless, interruptible, and flexible, you should consider Spot Instances, also with G instances. Spot is another purchase option besides on-demand: it uses spare EC2 capacity and offers discounts of up to 90%, which makes it a great cost optimization strategy for your machine learning workloads. When running inference workloads on Amazon EKS, Karpenter is particularly valuable as a cluster autoscaling tool because it supports multiple capacity types; if configured accordingly, it will prioritize reserved capacity, followed by Spot, then finally on-demand. Specifically for inference on Amazon EKS, I highly suggest using Karpenter. It is a very valid way of getting capacity at cost-effective rates. And since CPUs are more available than GPUs, you can often get them via Spot.
Purchase options overview

Even though Spot Instances can be reclaimed by Amazon EC2, there are Amazon EC2 Spot best practices for leveraging Spot at scale and even in production. If your workload is interruptible and fault-tolerant, Spot is a good option.
- Amazon EC2 billing and purchasing options
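To get a feel for current Spot discounts before you commit to an architecture, you can query the Spot price history (a sketch; instance type and Region are placeholders, and a start time of “now” returns only the most recent price per Availability Zone):

```
# Show the current Spot price for a GPU instance type in one Region
aws ec2 describe-spot-price-history \
  --instance-types g6e.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --region eu-central-1
```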
If you have tried some of the options and recommendations in this list and still couldn't launch GPU instances and run your machine learning workloads, ping your AWS account team, such as your Account Manager, Solutions Architect, or Technical Account Manager. They will talk with you to understand your use case and try to find ways to help you get access to GPU instances.
- Read more about “Meet your AWS account team”
If possible, be flexible regarding Amazon EC2 instance type and size. When you don't succeed in launching GPU instances, also try a different Availability Zone, and especially another AWS Region. Every AWS Region has its own capacity. Popular GPU instances in particular are often in use by other customers, and you might have more luck in another AWS Region. If you need to stay within the EU due to GDPR or other compliance regulations, consider the different AWS Regions in Europe, and check and double-check with your legal team whether, for example, US Regions could be an option for you. Capacity can also differ based on time of day and regional usage, so it is sometimes worth running workloads at different times to utilise spare capacity.
As general advice, try not to stick to a single EC2 instance type and size if possible; diversify and be flexible regarding type and size, and especially Availability Zone and AWS Region, when it comes to GPU instances.
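To see where a given instance type is offered at all before you chase capacity, you can list its offerings per Availability Zone (a sketch; instance type and Region are placeholders):

```
# List the Availability Zones in a Region that offer a given instance type
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g6e.48xlarge \
  --region eu-central-1 \
  --query "InstanceTypeOfferings[].Location" \
  --output text
```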
Startups want to get started fast. However, a newly created AWS account has not run anything yet, and limits and quotas are set to a minimum. The AWS service quotas documentation even states: “Your account's actual quota value may be less than the AWS default quota value if the account was recently created or if you use the account minimally.” Your usage over time builds trust with AWS: some quotas will increase slightly and automatically, and you can request increases for others.
One of the things we regularly educate startup customers about is increasing the limits and quotas for the AWS services in their account when they need more headroom. Almost all AWS services have their own limits and quotas. This applies not only to GPU instances but also to other AWS services.
If you run into insufficient capacity issues when spinning up EC2 instances, or if you hit a limit or quota in any other AWS service, make sure to check whether you have reached your limits and quotas. For example, there is a certain default limit for EC2 G and P instances.
With the following example links, I want to make sure you understand that individual AWS services usually have their own page for limits and quotas, typically containing a table. Note that many of those tables have an “adjustable” column, which means you can request increases for those adjustable limits and quotas.
Examples:
- Search limits and quotas for other AWS services as well
Note that many of the limits and quotas are adjustable. You can go to the Service Quotas service in the AWS console and ask for a limit increase for a certain metric. Make sure to ask for a reasonable increase, and don't request a very high one without careful consideration or business need: the request goes through the AWS support and service teams, who decide whether to approve it based on factors such as your AWS usage history and previous engagements. In some cases, you have to explain to AWS Support why you need the increase and provide a business justification. Some limit increase requests might be approved automatically.
Limit increase example:
The screenshot shows a request quota increase for running On-Demand G and VT instances:

You can find service codes with the following CLI command:
aws service-quotas list-services
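Building on that, here is a sketch of checking and requesting the EC2 quota for G and VT instances from the CLI (the quota code is my assumption for “Running On-Demand G and VT instances”; verify it in your account, and note that the value is a vCPU count):

```
# Look up the current value of the "Running On-Demand G and VT instances" quota
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA

# Request an increase; the desired value is in vCPUs (e.g. 64 vCPUs)
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --desired-value 64
```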
Amazon SageMaker is the go-to AWS service for machine learning training, inference, and more. It provides a unified platform for data, analytics, and AI. From preparing and sharing your data, reserving GPU capacity via training plans, training machine learning models, and deploying your models for inference, to monitoring your model quality, Amazon SageMaker can support you across your entire AI journey.
I would like to emphasise SageMaker training plans. “SageMaker training plans” let you reserve compute capacity. Currently supported instances are Amazon EC2 P4, P5, Trn1, and Trn2 instances. You can reserve training plans for “SageMaker training jobs” or for “HyperPod clusters” (read more here).
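As a rough sketch of the flow (operation and parameter names are my reading of the SageMaker API and worth double-checking against the current CLI reference), you first search the available offerings and then create a plan from one:

```
# Search for available training plan offerings for a given instance type
aws sagemaker search-training-plan-offerings \
  --instance-type ml.trn2.48xlarge \
  --instance-count 4 \
  --target-resources training-job

# Create a training plan from an offering returned above
aws sagemaker create-training-plan \
  --training-plan-name my-training-plan \
  --training-plan-offering-id <offering-id-from-previous-output>
```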
Read more here:
- SageMaker training plans — supported instance types
- Amazon SageMaker HyperPod introduces Amazon EKS support
Not applicable to most of our startups, but Amazon EC2 UltraServers and Amazon EC2 UltraClusters are two other options for getting access to GPU or Trainium instances. With EC2 UltraClusters, you can get access to an on-demand supercomputer by scaling to thousands of GPUs or purpose-built ML/AI chips such as AWS Trainium. The instances are interconnected with the Elastic Fabric Adapter (EFA) network interface to provide high-bandwidth inter-node communication. EC2 UltraServers provide a similar experience. Speaking of supercomputers, there is a groundbreaking project between AWS and NVIDIA to build one of the world's fastest AI supercomputers in the cloud, called Project Ceiba.
- Read more about EC2 UltraClusters: https://aws.amazon.com/ec2/ultraclusters/
- Read more about EC2 UltraServers: https://aws.amazon.com/ec2/ultraservers/
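If you are designing a tightly coupled training cluster, you can check which instance types support EFA (a sketch; the filter name assumes the documented network-info fields of describe-instance-types):

```
# List instance types that support the Elastic Fabric Adapter (EFA)
aws ec2 describe-instance-types \
  --filters Name=network-info.efa-supported,Values=true \
  --query "InstanceTypes[].InstanceType" \
  --output text
```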
I hope these options and suggestions make sense and help you out. In conclusion, if you can plan ahead, the reservation options are a good working solution. Otherwise, build trust in your newly created AWS account over time. If your AWS account is not new, use the other options, e.g. be flexible in terms of instance type, size, AZ, and Region, and try other services such as Amazon SageMaker. Give it a go! AWS is investing heavily in GPU and AI infrastructure globally!
Read about ways to monitor your GPU usage to make sure you are using the right resources:
I would love to read more about your use cases in the comments. Feel free to share!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.