
Why Claude 4 API Hits Rate Limits: Token Burndown Explained
Learn why your Bedrock Claude 4 API requests hit rate limits faster than expected. Understand burndown with examples and solutions for Bedrock users.
Jonathan Evans
Amazon Employee
Published May 27, 2025
Token burndown rates determine how your API usage counts against your quota limits. Think of it like exchange rates - but for tokens.
Important: This is about quota management, not pricing. Quota consumption and billing are separate - this article focuses only on how tokens count toward your rate limits.
The Simple Rule for Claude 4:
- Input tokens: 1 token used = 1 token of quota consumed
- Output tokens: 1 token generated = 5 tokens of quota consumed
With a 100,000 tokens-per-minute (TPM) quota, here's what happens:
Scenario 1: You make a request that generates 10,000 output tokens
- Quota burned: 10,000 × 5 = 50,000 tokens
- Remaining quota this minute: 50,000 tokens
- You can still make more requests!
Scenario 2: You make a request that generates 20,000 output tokens
- Quota burned: 20,000 × 5 = 100,000 tokens
- Remaining quota this minute: 0 tokens
- You must wait for the next minute to make another request
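To make the arithmetic concrete, here's a minimal sketch of the burndown calculation (plain Python, no SDK needed); the 1:1 and 1:5 multipliers are the Claude 4 rates described above:

```python
# Claude 4 burndown on Bedrock: input counts 1:1, output counts 1:5 against quota
INPUT_MULTIPLIER = 1
OUTPUT_MULTIPLIER = 5

def quota_consumed(input_tokens: int, output_tokens: int) -> int:
    """Tokens counted against a tokens-per-minute (TPM) quota for one request."""
    return input_tokens * INPUT_MULTIPLIER + output_tokens * OUTPUT_MULTIPLIER

print(quota_consumed(0, 10_000))  # Scenario 1: 50,000 quota consumed
print(quota_consumed(0, 20_000))  # Scenario 2: 100,000 quota consumed (full quota)
```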
Request: "Write a 500-word blog post" (10 tokens)
Response: 500-word output (~750 tokens)
Quota consumed: 10 + (750 × 5) = 3,760 tokens
(Remember: This is quota usage, not billing cost)
Request: 2,000-token code snippet + "Find bugs" (2,005 tokens)
Response: Brief analysis (~200 tokens)
Quota consumed: 2,005 + (200 × 5) = 3,005 tokens
(Remember: This is quota usage, not billing cost)
Request: "Hi, how are you?" (6 tokens)
Response: "I'm doing well, thanks! How can I help?" (12 tokens)
Response: "I'm doing well, thanks! How can I help?" (12 tokens)
Quota consumed: 6 + (12 × 5) = 66 tokens
(Remember: This is quota usage, not billing cost)
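To track burndown from actual responses instead of estimates, a minimal sketch (assuming boto3 and the Bedrock Converse API, whose response includes an inputTokens/outputTokens usage block; the model ID and region are placeholders) might look like this:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Hi, how are you?"}]}],
    inferenceConfig={"maxTokens": 100},
)

usage = response["usage"]
# Claude 4 burndown: input counts 1:1, output counts 1:5 against the TPM quota
quota_used = usage["inputTokens"] + usage["outputTokens"] * 5
print(f"Input: {usage['inputTokens']}, Output: {usage['outputTokens']}, "
      f"Quota consumed: {quota_used}")
```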
AWS Bedrock reserves quota based on your max_tokens parameter BEFORE processing your request. This pre-allocation ensures fair access but can cause unexpected throttling. For example, asking for a simple haiku (a 3-line poem) while leaving max_tokens at a large default still reserves the full max_tokens worth of quota, even though the response is only a few dozen tokens (a sketch of the problem and a better approach follows below).
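A minimal sketch of the problem, assuming boto3 and the Bedrock Converse API (the model ID, region, and token counts are illustrative placeholders, not exact values):

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Asking for a haiku (a few dozen output tokens at most), but reserving 4,096
# output tokens. With the 1:5 output burndown, that can tie up as much as
# 4,096 x 5 = 20,480 tokens of quota for the duration of the request.
response = bedrock.converse(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Write a haiku about rate limits."}]}],
    inferenceConfig={"maxTokens": 4096},
)
```

Better approach:

```python
# Size max_tokens to the output you actually expect. Reserving 100 output
# tokens ties up only 100 x 5 = 500 tokens of quota, leaving headroom for
# other requests in the same minute.
response = bedrock.converse(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Write a haiku about rate limits."}]}],
    inferenceConfig={"maxTokens": 100},
)
```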
Burndown rates by model:

| Model | Input Burndown (token : quota) | Output Burndown (token : quota) |
| --- | --- | --- |
| Claude Opus 4 | 1:1 | 1:5 |
| Claude Sonnet 4 | 1:1 | 1:5 |
| Other Bedrock Models | 1:1 | 1:1 |
Estimate your actual output needs. Don't default to 4096. Remember: quota is reserved for the entire request duration.
Lower max_tokens = more concurrent requests possible. Critical for high-throughput applications.
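As a rough back-of-the-envelope sketch (assuming each request's reservation is roughly max_tokens at the 1:5 output rate plus its input, which is an approximation rather than documented behavior), you can estimate how many requests fit into one minute of quota:

```python
def requests_per_minute(tpm_quota: int, max_tokens: int,
                        avg_input_tokens: int = 0,
                        output_multiplier: int = 5) -> int:
    """Rough estimate of requests that fit in one minute of TPM quota,
    if each request reserves max_tokens * output_multiplier plus its input."""
    per_request = avg_input_tokens + max_tokens * output_multiplier
    return tpm_quota // per_request

# With a 100,000 TPM quota:
print(requests_per_minute(100_000, max_tokens=4096))  # 4
print(requests_per_minute(100_000, max_tokens=200))   # 100
```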
Track your input:output ratio to optimize quota allocation.
- High input, low output (analysis): More efficient
- Low input, high output (generation): Reserves more quota longer
- Many concurrent requests: Keep max_tokens low
If you consistently hit limits, request quota increases based on your actual burndown patterns.
Every output token from Claude 4 consumes 5x the quota of an input token. Plan accordingly, set appropriate max_tokens values, and your API calls will run smoothly without unexpected throttling.

Have questions about AWS Bedrock and Anthropic models? Connect with me on LinkedIn or reach out through AWS support channels.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.