
Unlock Speed and Savings: Prompt Caching for Amazon Bedrock
Explore prompt caching in Amazon Bedrock
kalaivanan
Amazon Employee
Published May 17, 2025
Hey AWS Builders!
As a Solutions Architect, I'm constantly talking to customers who are pushing the boundaries with Generative AI on AWS. Amazon Bedrock has been a game-changer, simplifying access to a range of foundation models (FMs). But as applications scale, optimizing for latency and cost, especially with consistent prompt elements, becomes key.
Let's dive into Prompt Caching for Amazon Bedrock. This isn't a minor tweak; it's a significant enhancement that can improve the performance and cost-effectiveness of your Bedrock-powered applications.
Many GenAI applications use prompts with substantial static components. Think of:
• System Prompts: "You are a helpful assistant..."
• RAG Context: "Based on the following document excerpts: [long text]..."
• Few-Shot Examples: "Example 1: ... Example 2: ..."
The FM re-processes these static parts every time, consuming tokens (cost) and adding latency.
Prompt Caching allows Bedrock to cache the processed tokens of the initial, static part of your prompt.
First Call (Cache Write): Bedrock processes the entire prompt. If caching is enabled, it stores the processed tokens from this initial part.
Subsequent Calls (Cache Read): If the beginning of a new prompt exactly matches a cached entry, Bedrock reuses those cached tokens. It then only processes the new, dynamic part.
Crucially, this caches the prompt's processed tokens, not the model's completion. You still get fresh, unique generations.
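To make this concrete, here's a minimal sketch using boto3 and the Bedrock Converse API. It assumes your chosen model supports prompt caching and that a `cachePoint` block marks the end of the static prefix; the model ID, region, and prompt text are placeholders, so check the Bedrock documentation for the exact request shape your model expects.

```python
import boto3

# Placeholder region and model ID; pick a caching-supported model available in your account.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

# Static prefix: a long system prompt, closed off by a cachePoint so Bedrock can cache it.
# (In practice the prefix must meet the model's minimum cacheable token count.)
system = [
    {"text": "You are a helpful assistant for the Example Corp support team. ..."},
    {"cachePoint": {"type": "default"}},
]

def ask(question: str) -> str:
    # Only the user message changes between calls; the system prefix stays byte-for-byte identical.
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=system,
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(ask("How do I reset my password?"))   # first call: cache write
print(ask("What are your support hours?"))  # later calls: cache read on the system prefix
```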
In practice, this translates into:
• Reduced Latency: Faster "time to first token," since the cached prefix doesn't have to be re-processed.
• Lower Costs: On a cache hit, you're charged only for processing the new, uncached part of the prompt (plus output tokens).
• Improved Application Efficiency: Handle more users or more complex interactions with the same capacity.
Typical use cases include:
• Chatbots with System Prompts & Conversation History.
• Retrieval Augmented Generation (RAG) with static context.
• Few-Shot Learning with fixed examples.
• Templated Prompts with large static sections.
The key is identifying the static, reusable prefix.
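For a RAG-style prompt, the same idea applies inside the message content: keep the long document excerpts as the unchanging prefix, place the cache checkpoint right after them, and append the dynamic question last. Here's a hedged sketch reusing the `bedrock` client and `MODEL_ID` from the example above (the `cachePoint` placement inside the content list is an assumption to verify against the documentation for your model):

```python
def ask_over_documents(document_excerpts: str, question: str) -> str:
    # Static prefix: the retrieved context, closed off by a cache checkpoint.
    # Dynamic suffix: the user's question, which changes on every call.
    messages = [
        {
            "role": "user",
            "content": [
                {"text": f"Based on the following document excerpts:\n\n{document_excerpts}"},
                {"cachePoint": {"type": "default"}},
                {"text": f"Question: {question}"},
            ],
        }
    ]
    response = bedrock.converse(modelId=MODEL_ID, messages=messages)
    return response["output"]["message"]["content"][0]["text"]
```

As long as the excerpts stay identical across calls, only the question portion is processed on each cache hit.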
To understand the effectiveness of prompt caching, Amazon Bedrock publishes two specific metrics to Amazon CloudWatch for each model:
• CacheReadInputTokens: The number of input tokens in a request that were found in the cache and therefore were not processed again by the foundation model.
• CacheWriteInputTokens: The number of input tokens in a request that were written to the cache for the first time.
How to Interpret and Use These Metrics:
CacheReadInputTokens - Measuring Cache Hits and Savings:
A high value for CacheReadInputTokens (e.g., using SUM aggregation) directly indicates the volume of input tokens successfully served from the cache. Each token counted here represents a token that did not need to be re-processed by the foundation model, translating directly to potential cost savings and latency reduction for that portion of the prompt.
If this value is consistently low or zero when you expect cache hits, it strongly suggests that the static prefixes of your prompts are not matching existing cache entries. This is often caused by subtle variations in the text, such as extra whitespace, timestamps, or request-specific IDs slipping into the supposedly static prefix.
CacheWriteInputTokens - Measuring New Cache Entries:
This metric tells you how many tokens are being newly added to the cache. You'll see this spike when new unique static prefixes are encountered by Bedrock. If you have a fixed set of static prompts, you'd expect to see CacheWriteInputTokens primarily during the initial "warm-up" phase. Afterwards, for those same static prompts, you should see CacheReadInputTokens increase instead. A continuously high CacheWriteInputTokens might indicate that your "static" prefixes are, in fact, changing frequently, preventing effective cache utilization for reads.
By closely monitoring CacheReadInputTokens and CacheWriteInputTokens in CloudWatch, you gain precise insights into how effectively your prompt caching strategy is working, allowing you to optimize prompt design and quantify the benefits.
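As a starting point, here's a hedged boto3 sketch for pulling both metrics out of CloudWatch. It assumes the metrics are published in the `AWS/Bedrock` namespace with a `ModelId` dimension, so adjust the namespace, dimensions, and model ID to what you actually see in your CloudWatch console:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"            # placeholder model ID

def token_sum(metric_name: str) -> float:
    """Sum a Bedrock cache token metric over the last 24 hours, in 1-hour buckets."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",                                   # assumed namespace
        MetricName=metric_name,
        Dimensions=[{"Name": "ModelId", "Value": MODEL_ID}],       # assumed dimension
        StartTime=now - timedelta(hours=24),
        EndTime=now,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

reads = token_sum("CacheReadInputTokens")
writes = token_sum("CacheWriteInputTokens")
print(f"Last 24h: {reads:.0f} tokens read from cache, {writes:.0f} tokens written to cache")
```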
1. Maximize the Static Portion: A longer, consistent static prefix leads to higher CacheReadInputTokens.
2. Ensure "Exactness": Variations kill cache reads. Aim for perfect consistency in static parts.
3. Test and Measure: Use CacheReadInputTokens, CacheWriteInputTokens, and latency metrics to quantify impact.
4. Not a Silver Bullet for All Prompts: Mostly dynamic prompts won't see large CacheReadInputTokens.
Prompt Caching for Amazon Bedrock offers a robust way to enhance performance and reduce costs.
Explore this feature, enable it, and set up your CloudWatch dashboards to track these key metrics. See firsthand how many tokens you're saving!
What are your thoughts? Share your experiences and questions below!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.