Amazon Bedrock: Real-Time Token Usage Monitoring
This blog introduces a solution to monitor Amazon Bedrock token usage: Real-time tracking per user/app, alerts, and cost insights for LLM management.
Anil Chinnam
Amazon Employee
Published Oct 8, 2024
As enterprises increasingly integrate large language models (LLMs) into their applications using Amazon Bedrock, the need for precise cost allocation and consumption monitoring becomes paramount. Organizations typically want to track and manage token usage for each logged-in user or specific application to maintain operational efficiency and cost-effectiveness. In this context, two critical challenges emerge in managing token consumption:
1. Real-time monitoring and alerting: Organizations need to track token usage in real-time, either per user or per application, and receive immediate notifications when consumption exceeds predefined thresholds.
2. Granular cost visibility: Detailed insights into token consumption, broken down by user ID, application ID, and model ID, are essential for accurate cost allocation and budget management.
Amazon Bedrock's pay-per-use pricing model, based on input and output tokens processed, necessitates a robust monitoring solution that addresses both these challenges across diverse applications and use cases. Without such a system, organizations struggle with effective cost management, resource allocation, and usage metering. For instance, a SaaS provider offering an AI-powered document analysis tool needs a solution that not only alerts them to usage spikes but also provides detailed cost breakdowns, enabling usage-based pricing, cost optimization, and transparent billing for their customers.
In this post, I’ll introduce a monitoring solution leveraging purpose-built AWS services and open-source tools. This solution enables real-time tracking of token usage at granular levels (per application, user ID, and foundation model), provides proactive notifications for threshold breaches, and offers detailed cost breakdowns. By implementing this solution, organizations can effectively manage costs, optimize resource allocation, and enable usage-based pricing models for their LLM-powered services.
The architecture combines LangChain, an open-source framework for developing LLM-powered applications, with AWS services to capture and process usage data from Amazon Bedrock. It uses Amazon SQS for reliable message queuing, AWS Lambda for serverless computing, and Amazon SNS for notifications. The data is stored in Amazon DynamoDB and visualized using Amazon QuickSight dashboards.
This monitoring solution demonstrates how to track and manage token consumption in Amazon Bedrock. Throughout this post, I'll focus on this specific use case: alerting when a user's token consumption for a particular model exceeds a set threshold within an hour. The solution leverages serverless, pay-as-you-go AWS services, making it cost-effective and highly scalable.
This scenario illustrates how this solution can:
- Capture fine-grained usage data at the user, application and model level
- Process and store this data in real-time
- Evaluate consumption against predefined thresholds
- Trigger immediate notifications when thresholds are breached
While we use this example, the principles and architecture can be easily adapted to monitor various metrics and thresholds, making it versatile for different organizational needs.
The following diagram illustrates the architecture of the solution.
Here is how it works:
- Users interact with an AI-powered application that leverages large language models (LLMs) through Amazon Bedrock, providing a seamless and scalable interface for AI-powered functionalities.
- After each interaction, the application code captures detailed usage metrics, including user ID, model ID, application ID, and token consumption (both input and output), and immediately sends this data to an Amazon SQS queue.
- An AWS Lambda function continuously monitors the SQS queue, processing events in real-time to extract critical usage data.
- The Lambda function stores all the extracted information in a DynamoDB table.
- The Lambda function monitors usage patterns and triggers an Amazon SNS notification when consumption exceeds predefined thresholds.
This solution's technical foundation is built on the LangChain framework, which provides a sophisticated and flexible interface for interacting with Amazon Bedrock models. LangChain is an open-source framework designed to simplify the development of applications powered by large language models (LLMs), including those offered by Amazon Bedrock.
LangChain's ChatBedrock class, part of the langchain_aws module, provides a streamlined interface for interacting with Amazon Bedrock language models, abstracting away the complexities of the API.
Leveraging Callbacks for Token Usage Tracking
One of the most powerful features of the ChatBedrock wrapper is its callback system. To harness the full potential of LangChain's callback API, I implemented a custom callback handler that inherits from BaseCallbackHandler. This custom handler allows us to:
- Capture detailed token usage metrics
- Process usage data in real-time
- Trigger actions based on specific events or thresholds
Here's a focused look at how we leverage the callback feature:
- CustomCallbackHandler:
This solution implements a CustomCallbackHandler class that hooks into LangChain's callback system. The initialize_chat function demonstrates how to set up the ChatBedrock wrapper with custom parameters, allowing for easy adaptation to different models and use cases. The on_llm_end method extracts token usage data and immediately publishes it to an Amazon SQS queue after each bedrock invocation.
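A minimal sketch of such a handler follows. To keep it self-contained it accepts a plain dict; in the real solution the class subclasses `langchain_core.callbacks.BaseCallbackHandler` and `on_llm_end` receives LangChain's `LLMResult`, whose `llm_output` carries the Bedrock usage counts. The field names are assumptions based on the description above.

```python
class CustomCallbackHandler:
    """Sketch of a handler that captures token usage after each Bedrock call."""

    def __init__(self, user_id, app_id):
        self.user_id = user_id
        self.app_id = app_id
        self.events = []  # captured usage events (published to SQS in the real solution)

    def on_llm_end(self, response, **kwargs):
        # LangChain surfaces Bedrock token counts in llm_output["usage"];
        # here we accept a plain dict with the same assumed shape.
        usage = response["llm_output"]["usage"]
        event = {
            "userId": self.user_id,
            "applicationId": self.app_id,
            "modelId": response["llm_output"]["model_id"],
            "inputTokens": usage["prompt_tokens"],
            "outputTokens": usage["completion_tokens"],
        }
        self.events.append(event)  # real handler: publish the event to SQS here
        return event
```

The handler keeps no LangChain dependency of its own, so the extraction logic can be unit-tested with a plain dict before wiring it into ChatBedrock.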
- ChatBedrock Initialization
The ChatBedrock wrapper is set up with custom parameters:
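A sketch of an `initialize_chat` function. The model ID, parameters, and region below are illustrative assumptions, and the `langchain_aws` import is deferred into the function body so the configuration can be read without LangChain installed.

```python
# Illustrative configuration; not the post's exact values.
CHAT_CONFIG = {
    "model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
    "model_kwargs": {"temperature": 0.2, "max_tokens": 1024},
    "region_name": "us-east-1",
}

def initialize_chat(callbacks=None, config=CHAT_CONFIG):
    """Create a ChatBedrock instance with usage-tracking callbacks attached."""
    from langchain_aws import ChatBedrock  # deferred: requires langchain-aws
    return ChatBedrock(
        model_id=config["model_id"],
        model_kwargs=config["model_kwargs"],
        region_name=config["region_name"],
        callbacks=callbacks or [],
    )
```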
- Finally, put it all together to invoke the foundation model via Bedrock.
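The final wiring can be sketched as below; the model ID is an illustrative assumption, and the function requires the langchain-aws package plus AWS credentials at runtime, so the import is deferred.

```python
def invoke_with_tracking(prompt, handler,
                         model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Invoke the Bedrock model with a usage-tracking callback attached.

    `handler` is the custom callback handler; each invocation fires its
    on_llm_end hook with the token counts for that call.
    """
    from langchain_aws import ChatBedrock  # deferred: needs langchain-aws + credentials
    chat = ChatBedrock(model_id=model_id, callbacks=[handler])
    return chat.invoke(prompt)
```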
This setup allows us to capture token usage data every time the model is invoked, providing the foundation for our monitoring solution.
After capturing the token usage data in our custom callback handler, the next step is to publish this information to an Amazon SQS queue. This allows us to process the usage data asynchronously and reliably.
Here's code to publish the token usage data to SQS:
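A minimal sketch of this publish step, assuming boto3 and a hypothetical queue URL; the payload builder is kept separate (and dependency-free) from the AWS call.

```python
import json
from datetime import datetime, timezone

# Hypothetical queue URL; replace with your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/token-usage-queue"

def build_usage_payload(user_id, app_id, model_id, input_tokens, output_tokens):
    """Format the token-usage metrics as a JSON string for SQS."""
    return json.dumps({
        "userId": user_id,
        "applicationId": app_id,
        "modelId": model_id,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

def publish_token_usage(user_id, app_id, model_id, input_tokens, output_tokens):
    """Send the usage event to the SQS queue (called from on_llm_end)."""
    import boto3  # deferred so the builder above runs without AWS dependencies
    sqs = boto3.client("sqs")
    body = build_usage_payload(user_id, app_id, model_id, input_tokens, output_tokens)
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```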
This function is called from the on_llm_end method in our custom callback handler. It formats the token usage data into a JSON payload and sends it to the specified SQS queue.
Here's an example of the JSON event published to SQS after each call to the LLM:
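An illustrative event, with field names assumed from the metrics described above:

```json
{
  "userId": "user-1",
  "applicationId": "doc-analyzer",
  "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",
  "inputTokens": 120,
  "outputTokens": 45,
  "timestamp": "2024-10-08T14:32:10Z"
}
```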
By publishing these events to SQS, we create a reliable stream of token usage data that can be processed and analyzed in real-time, forming the backbone of our Bedrock token usage monitoring solution.
To efficiently track token usage, this monitoring solution leverages Amazon DynamoDB, a highly scalable and flexible NoSQL database service. Whenever an event is published to Amazon SQS, an AWS Lambda function is triggered to process the event in real time and write the data to a DynamoDB table.
The DynamoDB table uses a composite partition key, which includes the user ID and the model ID (userId#modelId). It also uses a sort key with a timestamp (YYYY-MM-DD-HH) to keep track of the event time. This structure allows for easy storage and retrieval of events by user ID, model ID, and hour of the day.
Here is the code snippet that stores the event in DynamoDB:
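A sketch under the key schema described above, with an assumed table name; the key and update builders are kept pure so they can be checked without AWS, and boto3 is deferred into the write function.

```python
from datetime import datetime, timezone

TABLE_NAME = "BedrockTokenUsage"  # assumed table name

def build_key(user_id, model_id, event_time):
    """Composite partition key (userId#modelId) plus hourly sort key."""
    return {
        "pk": f"{user_id}#{model_id}",
        "sk": event_time.strftime("%Y-%m-%d-%H"),
    }

def build_update(input_tokens, output_tokens, event_time):
    """ADD aggregates token counts for events landing in the same hour."""
    return {
        "UpdateExpression": "ADD inputTokens :in, outputTokens :out SET lastUpdated = :ts",
        "ExpressionAttributeValues": {
            ":in": input_tokens,
            ":out": output_tokens,
            ":ts": event_time.isoformat(),
        },
    }

def store_usage_event(event):
    """Write/aggregate one SQS event into DynamoDB (Lambda handler body)."""
    import boto3  # deferred: not needed for the pure helpers above
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    now = datetime.now(timezone.utc)
    table.update_item(
        Key=build_key(event["userId"], event["modelId"], now),
        **build_update(event["inputTokens"], event["outputTokens"], now),
    )
```

Because `update_item` with ADD creates the item if it does not exist, the first event of an hour and every subsequent event in that hour go through the same code path.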
This function updates the DynamoDB table with the latest token consumption data. For events occurring within the same hour for a specific user and model, the solution aggregates the token information and updates the last updated timestamp.
The table's structure enables easy identification of users who exceed their token consumption thresholds on an hourly basis. This design is crucial for implementing real-time monitoring and alerting based on token usage patterns.
By storing the event data in DynamoDB, we create a persistent and queryable record of token usage, which forms the foundation for our analysis and visualization processes in the later stages of our monitoring solution.
Whenever an event is processed, the Lambda function checks if the total token consumption for a user and model within an hour has exceeded the predefined threshold. If the threshold is exceeded, the Lambda function triggers an Amazon SNS event.
Below is the code logic to check whether the token consumption for a user and model ID exceeds the hourly threshold.
Here is the sample code logic that calculates the token consumption for a user and model in the last hour.
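The two pieces of logic can be sketched together as follows. This assumes the hourly-bucket key schema described earlier and an example threshold of 10,000 tokens; boto3 is deferred so the threshold check itself stays pure.

```python
from datetime import datetime, timezone

HOURLY_TOKEN_THRESHOLD = 10_000  # example threshold from the post

def exceeds_threshold(input_tokens, output_tokens, threshold=HOURLY_TOKEN_THRESHOLD):
    """True when combined hourly consumption breaches the threshold."""
    return (input_tokens + output_tokens) > threshold

def tokens_in_current_hour(user_id, model_id, table_name="BedrockTokenUsage"):
    """Read the current hour's aggregated counts for this user/model."""
    import boto3  # deferred: only needed when actually querying DynamoDB
    table = boto3.resource("dynamodb").Table(table_name)
    sk = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H")
    item = table.get_item(
        Key={"pk": f"{user_id}#{model_id}", "sk": sk}
    ).get("Item", {})
    return int(item.get("inputTokens", 0)), int(item.get("outputTokens", 0))
```

In the Lambda function, the two are combined: after each write, read back the hourly totals and, if `exceeds_threshold` returns true, publish the SNS alert.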
This solution leverages Amazon SNS to provide real-time notifications. The Lambda function publishes an SNS event when token consumption thresholds are exceeded. This event-driven approach ensures timely alerts and enables swift action.
Amazon SNS (Simple Notification Service) is responsible for delivering notifications to the appropriate recipients. To complete the notification setup:
- Create an SNS topic.
- Configure subscriptions to the SNS topic. Options include:
- Email: For direct notifications to team members
- SMS: For urgent text message alerts
- AWS Lambda: To trigger automated responses
- HTTP/HTTPS endpoints: For integration with other systems
Here is the sample code to send the event to the SNS topic:
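A sketch of that publish step, assuming boto3 and a hypothetical topic ARN; the message builder is kept separate from the AWS call so it can be inspected directly.

```python
import json

# Hypothetical topic ARN; replace with your own.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:token-usage-alerts"

def build_alert(user_id, model_id, total_tokens, threshold):
    """Subject and message body for the threshold alert."""
    return {
        "Subject": f"Bedrock token threshold exceeded for {user_id}",
        "Message": json.dumps({
            "userId": user_id,
            "modelId": model_id,
            "totalTokens": total_tokens,
            "threshold": threshold,
        }),
    }

def send_alert(user_id, model_id, total_tokens, threshold):
    """Publish the alert to the SNS topic from the Lambda function."""
    import boto3  # deferred so build_alert stays dependency-free
    sns = boto3.client("sns")
    alert = build_alert(user_id, model_id, total_tokens, threshold)
    sns.publish(TopicArn=TOPIC_ARN, **alert)
```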
With this setup, when a user's token consumption exceeds the defined threshold (e.g., 10,000 tokens per hour), the Lambda function publishes to the SNS topic, and subscribers receive immediate notifications through their chosen delivery method.
Sample Email Notification:
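An alert of this kind might read as follows (illustrative values):

```
Subject: Bedrock token threshold exceeded for user-1

{"userId": "user-1", "modelId": "anthropic.claude-v2",
 "totalTokens": 12500, "threshold": 10000}
```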
This event-driven approach ensures timely alerts, enabling swift action when usage patterns deviate from expected norms.
After storing the usage data in DynamoDB, you can leverage Amazon QuickSight to visualize the insights as shown in Figure 2. Amazon QuickSight is a cloud-powered business analytics service that enables organizations to build visualizations, perform ad hoc analysis, and quickly get business insights from their data, anytime, on any device.
To make data in DynamoDB accessible to Amazon QuickSight, you can use Amazon Athena with the open-source Amazon Athena DynamoDB connector. Athena is a serverless, interactive service that allows you to query data from a variety of sources in heterogeneous formats, with no provisioning effort. Athena accesses the data stored in DynamoDB via the connector; table metadata, such as column names and data types, is stored in the AWS Glue Data Catalog.
This solution includes the following steps:
- The Athena DynamoDB connector runs in a pre-built, serverless AWS Lambda function. You don’t need to write any code.
- AWS Glue provides supplemental metadata from the DynamoDB table. In particular, an AWS Glue Crawler is run to infer and store the DynamoDB table format, schema, and associated properties in the Glue Data Catalog.
- The Athena editor is used to test the connector and perform analysis via SQL queries.
- QuickSight uses the Athena connector to visualize BI insights from DynamoDB.
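For example, a query like the following could be run in the Athena editor to verify the connector; the catalog name and column names are assumptions based on the key schema described earlier.

```sql
-- Hourly token totals per user/model; "dynamodb" is the name you gave
-- the connector's data source when registering it in Athena.
SELECT pk AS user_and_model,
       sk AS hour_bucket,
       inputTokens + outputTokens AS total_tokens
FROM "dynamodb"."default"."BedrockTokenUsage"
ORDER BY total_tokens DESC
LIMIT 10;
```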
Follow this blog for detailed steps on setting up the Amazon Athena DynamoDB connector and its integration with Amazon QuickSight.
Once the DynamoDB data is available in QuickSight, it is ready to be visualized. The following QuickSight dashboard was created with these visualizations:
- Vertical Stacked Bar Chart: This chart shows total token usage by application and LLM, helping identify high-consumption areas and optimize resource allocation.
- Pie Chart: This chart displays the relative usage of different language models, informing decisions on model selection.
- Heatmap: The heatmap visualizes token consumption by application and user, enabling the detection of usage patterns, potential abuse, and the ability to set appropriate user-level token limits.
These visualizations can be analyzed over time, allowing organizations to track consumption trends, manage budgets, forecast future usage, and optimize their LLM deployments on Amazon Bedrock for maximum efficiency and cost-effectiveness.
This comprehensive monitoring solution for Amazon Bedrock offers organizations a powerful tool to gain real-time visibility into their token usage across various applications, users, and foundation models. The system triggers alerts for threshold breaches, allowing operations teams to take immediate action. These notifications empower teams to implement further measures, such as limiting or metering user consumption for specific applications, ensuring effective cost management.
With insightful visualizations and actionable alerts, this solution enables organizations to make data-driven decisions about their LLM deployments. By implementing this monitoring system, businesses can confidently scale their AI-powered applications while maintaining precise control over token usage and optimizing costs.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.