
Scaling with Amazon Bedrock: A practical guide on managing service quotas and throughput
In this post, you will learn about essential service quotas in Amazon Bedrock. You'll discover how to effectively manage Tokens per Minute (TPM) and Requests per Minute (RPM) for your AI applications. We'll guide you through the process of monitoring, requesting increases, and optimizing your quota usage.
Key service quotas for Amazon Bedrock
- Tokens Per Minute (TPM): Controls the total token throughput, per model and per region
- Requests Per Minute (RPM): Limits the number of model invocation API calls, per model and per region
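If you'd rather check these values programmatically, the Service Quotas API exposes them. Here's a minimal sketch with boto3, assuming your AWS credentials and region are already configured:

```python
import boto3

# Service Quotas is regional: point at the region whose quotas you care about
client = boto3.client("service-quotas", region_name="us-east-1")

# Page through all Amazon Bedrock quotas and print the per-minute (TPM/RPM) ones
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "per minute" in quota["QuotaName"].lower():
            print(f"{quota['QuotaName']}: {quota['Value']} ({quota['QuotaCode']})")
```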
How do I manage service quotas in the AWS Console?
1. Navigate to AWS Service Quotas in the AWS Console.
2. Select Amazon Bedrock from the services list.
3. Ensure you're viewing the correct region.
4. Search for the appropriate quota, such as TPM or RPM for your specific foundation model (e.g., "Tokens per minute for Anthropic Claude 3.5 Sonnet v2").
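My applied limit is below the default quota. Why is this?
In Service Quotas, each quota has an AWS default value and an applied account-level value. The applied value is what's actually enforced for your account, and it can start below the default, for example on newer accounts or accounts with limited usage history. If your applied value is lower than you need, submit a quota increase request.

You can compare the two values programmatically. A minimal sketch with boto3; the quota code below is a placeholder, so look up the real one with list_service_quotas first:

```python
import boto3

client = boto3.client("service-quotas", region_name="us-east-1")

# Placeholder quota code: find the real one via list_service_quotas
quota_code = "L-XXXXXXXX"

# Applied (account-level) value vs. the AWS default value
applied = client.get_service_quota(ServiceCode="bedrock", QuotaCode=quota_code)
default = client.get_aws_default_service_quota(ServiceCode="bedrock", QuotaCode=quota_code)

print("Applied:", applied["Quota"]["Value"])
print("Default:", default["Quota"]["Value"])
```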
What makes a good service quota increase request?
A strong request includes the following details:
- Engagement with your AWS account team: let them know the support case ID.
- Use case description:
- Name of AWS Account Manager:
- Model ID(s):
- Region(s):
- Cross-region inference profile ID(s):
- For most use cases, consider cross-region inference: a seamless way to get higher throughput and performance while managing incoming traffic spikes. See inference profile IDs.
- Requests Per Minute (RPM):
  - Steady State (p50):
  - Peak (p90):
- Tokens Per Minute (TPM):
  - Steady State (p50):
  - Peak (p90):
- Average Input Tokens per Request:
- Average Output Tokens per Request:
- Percentage of Requests with Input Tokens greater than 25k:
- Usage growth projections:
- Monitoring and alarms:
- Describe how you implemented monitoring for usage patterns.
- Previous provider:
- Historical usage and billing:
For high growth and migration use cases, please contact your AWS account team.
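How do I easily monitor support cases?
You can track the status of a quota increase case in the AWS Support Center console, and you'll receive email updates as the case progresses. If your account has a Business, Enterprise On-Ramp, or Enterprise support plan, you can also poll cases with the AWS Support API. A minimal sketch:

```python
import boto3

# The AWS Support API requires a Business, Enterprise On-Ramp, or Enterprise
# support plan, and its endpoint lives in us-east-1.
support = boto3.client("support", region_name="us-east-1")

# List open cases so you can spot quota requests awaiting your reply
for case in support.describe_cases(includeResolvedCases=False)["cases"]:
    print(case["caseId"], "|", case["status"], "|", case["subject"])
```

How do I proactively monitor quotas?
Amazon Bedrock publishes runtime metrics to CloudWatch under the AWS/Bedrock namespace, including Invocations, InvocationThrottles, InputTokenCount, and OutputTokenCount. Alarming on InvocationThrottles tells you when you're hitting RPM or TPM limits. A minimal sketch; the model ID is an example, so substitute the one your application calls:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm as soon as any invocation is throttled within a one-minute window.
# Example model ID: replace with the model your application invokes.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-invocation-throttles",
    Namespace="AWS/Bedrock",
    MetricName="InvocationThrottles",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-5-sonnet-20241022-v2:0"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```

Add AlarmActions pointing at an SNS topic if you want a notification rather than just an alarm state change.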
How do I design a resilient application that can handle traffic spikes and optimize throughput?
- Retry the request: It's a best practice to use retries with exponential backoff and random jitter (see the sketch after this list). If you use the AWS SDKs, see Retry behavior.
- Use cross-region inference: Implementing cross-region inference is as simple as replacing the modelId with an inference profile ID (see the example after this list). This capability lets you seamlessly leverage capacity across multiple AWS regions via the AWS backbone network, ensuring optimal latency while maintaining data security by keeping all traffic within AWS.
- Queue management or load balancing: Implement queuing to smooth out request spikes. For example, you can use Amazon SQS to buffer model invocation requests, or use open-source frameworks such as LiteLLM to handle load balancing and fallbacks. This approach helps maintain consistent performance within quota limits.
- Provisioned throughput: For high-volume use cases, purchasing Provisioned Throughput lets you reserve a higher level of throughput for a model at a fixed cost, increasing your model invocation capacity.
- Multi-model and agentic architectures: Break down complex tasks into subtasks handled by multiple, smaller models. Leverage orchestration frameworks such as LangGraph, CrewAI, or the Multi-Agent Orchestrator framework to coordinate model interactions. This strategy can reduce the load on any single model, optimizing quota usage across multiple resources. It may also improve accuracy by leveraging the right model for the right task.
- Asynchronous and event-driven batch processing: For non-urgent tasks, consider incorporating asynchronous and batch processes into your application. AWS provides various services to build event-driven architectures. For example, you can build a scalable and efficient pipeline for Amazon Bedrock batch inference (see the sketch after this list); batch inference is charged at a 50% discount compared to on-demand pricing. This strategy can reduce costs while increasing overall throughput by smoothing out spikes in usage.
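To make the first two strategies concrete, here is a minimal sketch with boto3. The adaptive retry mode layers client-side rate limiting on top of the SDK's exponential backoff with jitter, and the modelId is swapped for a cross-region inference profile ID (here the US profile for Claude 3.5 Sonnet v2; substitute the profile for your model and geography):

```python
import boto3
from botocore.config import Config

# "adaptive" retry mode = exponential backoff with jitter plus
# client-side rate limiting; max_attempts caps the total tries.
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)

# Cross-region inference: pass an inference profile ID where a modelId would go.
response = bedrock.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Hello!"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```

For the batch-processing strategy, a batch inference job reads JSONL records from S3 and writes results back to S3. A sketch with placeholder bucket names and role ARN, which you would replace with your own:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder S3 URIs and IAM role: the role must allow Bedrock to read
# the input bucket and write to the output bucket.
bedrock.create_model_invocation_job(
    jobName="nightly-summaries",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-output/"}},
)
```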
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.