Transforming Amazon API Gateway Access Log Into Prometheus Metrics
This post will walk you through a solution to convert Amazon API Gateway access logs into Prometheus metrics.
Published Jul 5, 2023
Last Modified May 2, 2024
When working with Amazon API Gateway, there are several challenges you may encounter, particularly in terms of observability:
- Metrics and Monitoring: API Gateway offers some basic metrics and monitoring capabilities, but they might not be sufficient for in-depth observability. You may need to integrate with AWS CloudWatch or other monitoring solutions to collect and analyze metrics like latency, error rates, and request throughput.
- Debugging: Debugging issues within API Gateway can be tricky due to limited visibility into the inner workings of the service. It's crucial to utilize logging, metrics, and distributed tracing to identify and troubleshoot issues effectively. Additionally, thorough testing and monitoring of API Gateway configurations are essential.
- Error Handling: API Gateway can return various types of errors and status codes. Proper error handling and monitoring mechanisms are crucial to ensure you can identify and address errors promptly. Configuring appropriate alarms or integrating with other monitoring tools can help in this regard.
- Scalability and Performance: As API traffic grows, ensuring scalability and optimal performance becomes crucial. Monitoring and observability can help you identify bottlenecks, optimize resource allocation, and monitor the overall health and performance of backend services of the API Gateway.
To overcome these challenges, it is essential to leverage additional AWS services, third-party monitoring tools, and open-source frameworks to enhance observability. By combining these tools and best practices, you can gain better insights into your API Gateway, identify performance issues, debug problems, and ensure a seamless experience for your users.
Let's look at a situation. I have a SaaS product that offers APIs to clients through API Gateway. Each client can generate multiple API keys for various purposes. Furthermore, each API key can be linked to a different predefined usage plan.
I would like to obtain metrics that provide insight into client behavior, including:
- Which clients and API keys have the highest consumption rates? What are the rates at which they make requests?
- Which API exhibits the slowest performance? What are the latency values for the 99th, 90th, and 50th percentiles?
- What are the success and error rates for a specific API?
Unfortunately, API Gateway does not offer metrics specifically for tracking such insights. Generating those metrics would require implementing custom instrumentation code within the backend services, which can be time-consuming and require significant effort.
As the API Gateway access log already includes all the essential information, I would utilize it to obtain insightful metrics. Given that I am utilizing the Prom stack (Prometheus, Alertmanager, Grafana) as my observability platform, I would like to find a solution to convert the API Gateway access log into Prometheus metrics. This will enable seamless integration with my current tools and framework.
There are numerous tools available that facilitate the transformation of logs into Prometheus metrics. Additionally, it is possible to develop a custom tool for this purpose. Personally, I have a preference for open-source solutions, and in my experience, I have found vector.dev to be a powerful and high-performance tool for observability purposes. It effectively caters to my requirements in this particular use case.
The following diagram illustrates my solution for transforming API Gateway access logs into Prometheus metrics:
The solution's basic flow for converting API Gateway access logs into Prometheus metrics is as follows:
- API Gateway is configured to generate access logs, which are then sent to CloudWatch Logs
- CloudWatch Logs Subscription Filter is employed to forward the access logs to a Lambda function
- The Lambda function performs additional enhancements on the access logs and dispatches them to an SQS queue
- Vector deployment is implemented, utilizing either ECS tasks or EKS pods. This deployment retrieves the access logs from the SQS queue and transforms them into Prometheus metrics, while also exposing these metrics
- Prometheus scrapes the metrics exposed by Vector
- Lastly, Grafana queries the Prometheus metrics and visualizes them in the form of graphs
You will need an AWS account and basic knowledge of the AWS Console to begin. Below are the high-level steps. Let's get started!
We first need to enable access logging for the API Gateway. Let's say we already have an existing API Gateway.
Here are the log formats:
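For a REST API, an access log format that captures the fields used later in this post might look like the following; the field names on the left are my own choices, while the $context variables are standard API Gateway access logging variables:

```json
{
  "requestId": "$context.requestId",
  "requestTime": "$context.requestTime",
  "gatewayId": "$context.apiId",
  "apiKeyId": "$context.identity.apiKeyId",
  "method": "$context.httpMethod",
  "path": "$context.path",
  "status": "$context.status",
  "responseLatency": "$context.responseLatency",
  "sourceIp": "$context.identity.sourceIp"
}
```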
There is a minor distinction in the log format between a REST API and an HTTP API. The key difference lies in the absence of $context.identity.apiKey support in the log format of an HTTP API; it is exclusively available in a REST API. As a result, when utilizing an HTTP API, it becomes impossible to retrieve metrics related to API key usage.
Before we create the subscription filter to forward the logs to the SQS queue, we need to create a Lambda function.
Here is the function:
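Below is a minimal Python sketch of what such a function might look like. It assumes a QUEUE_URL environment variable pointing at the SQS queue and the JSON log format shown earlier; the enrichment step is purely illustrative:

```python
import base64
import gzip
import json
import os

import boto3

sqs = boto3.client("sqs")
# Hypothetical environment variable holding the destination SQS queue URL
QUEUE_URL = os.environ["QUEUE_URL"]


def handler(event, context):
    # CloudWatch Logs delivers subscription data as base64-encoded, gzipped JSON
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    # Ignore the control messages CloudWatch Logs sends when the subscription is created
    if payload.get("messageType") != "DATA_MESSAGE":
        return

    entries = []
    for log_event in payload.get("logEvents", []):
        record = json.loads(log_event["message"])
        # Illustrative enhancement: attach the source log group so downstream
        # consumers can tell which API Gateway the record came from
        record["logGroup"] = payload.get("logGroup")
        entries.append({"Id": log_event["id"], "MessageBody": json.dumps(record)})

    # SQS accepts at most 10 messages per batch
    for i in range(0, len(entries), 10):
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries[i : i + 10])
```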
To ensure the proper functioning of the function, it is essential to grant the necessary permissions by utilizing both the Execution role and the Resource-based policy statements. By combining these two components, the function can access the required resources and perform its tasks effectively.
These are the permission policies of the execution role of the function:
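A minimal sketch of such a policy (the region, account ID, and queue name are placeholders) grants the function permission to write its own logs and to send messages to the SQS queue:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:123456789012:apigw-access-log-queue"
    }
  ]
}
```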
And this is a resource-based policy statement that enables CloudWatch Logs to invoke the function:
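Expressed as a policy statement (the ARNs and names below are placeholders), it might look like this:

```json
{
  "Sid": "AllowCloudWatchLogsInvoke",
  "Effect": "Allow",
  "Principal": {
    "Service": "logs.amazonaws.com"
  },
  "Action": "lambda:InvokeFunction",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:apigw-access-log-forwarder",
  "Condition": {
    "ArnLike": {
      "AWS:SourceArn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/apigateway/my-api-access-logs:*"
    }
  }
}
```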
Now that we have completed the necessary steps, we are ready to create a CloudWatch Logs subscription filter.
Go to CloudWatch Logs, navigate to the API Gateway access log group, then click on "Create Lambda subscription filter".
We need to choose the Lambda function created in the previous step and give the subscription filter a name.
The deployment configuration of Vector varies depending on the deployment location (ECS or EKS). However, the configuration of Vector itself remains consistent.
Here is the configuration for Vector:
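A minimal sketch of this configuration in Vector's YAML format follows. The component names match the ones discussed below, while the region, queue URL, exact path-rewriting rule, and histogram buckets are assumptions you will need to adjust:

```yaml
sources:
  apigw_access_log_queue:
    type: aws_sqs
    region: us-east-1                   # placeholder region
    queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/apigw-access-log-queue"

transforms:
  apigw_access_log_transformed:
    type: remap
    inputs:
      - apigw_access_log_queue
    source: |
      # The SQS message body is the JSON access log record forwarded by the Lambda function
      . = parse_json!(string!(.message))
      # Collapse numeric path segments into {id} to keep the "path" label low-cardinality
      # (simplified; the rule described in this post targets segments with more than one digit)
      .path = replace(string!(.path), r'/[0-9]+', "/{id}")
      # Access log values arrive as strings; the histogram needs a number
      .responseLatency = to_float!(.responseLatency)

  apigw_access_log_2_metrics:
    type: log_to_metric
    inputs:
      - apigw_access_log_transformed
    metrics:
      - type: counter
        field: status
        name: http_request_count_total
        tags:
          method: "{{ method }}"
          path: "{{ path }}"
          status: "{{ status }}"
          gatewayId: "{{ gatewayId }}"
          apiKeyId: "{{ apiKeyId }}"
      - type: histogram
        field: responseLatency
        name: http_response_latency_milliseconds
        tags:
          method: "{{ method }}"
          path: "{{ path }}"
          status: "{{ status }}"
          gatewayId: "{{ gatewayId }}"
          apiKeyId: "{{ apiKeyId }}"

sinks:
  apigw_access_log_metrics:
    type: prometheus_exporter
    inputs:
      - apigw_access_log_2_metrics
    address: "0.0.0.0:18687"
    default_namespace: apigw            # becomes the metric name prefix
    buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]  # milliseconds; adjust to your latency profile
```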
Let's delve deeper into the configuration details:
- First, we have the source configuration named apigw_access_log_queue. This configuration enables the Vector deployment instances to poll messages from a specified SQS queue for further processing.
- Next, we have two transforms configured: apigw_access_log_transformed and apigw_access_log_2_metrics. These transforms serve specific purposes:
  - The apigw_access_log_transformed transform is responsible for rewriting any path segment that contains more than one digit into {id}. Here's an example: a path like /clients/123/orders/456 before transforming becomes /clients/{id}/orders/{id} after transforming.
  - The apigw_access_log_2_metrics transform serves the purpose of converting messages from the previous transform into Prometheus metrics. This transform exposes two metrics: http_request_count_total and http_response_latency_milliseconds. These metrics include a set of labels such as method, path, status, gatewayId, and apiKeyId. For more details, please refer to the Vector log-to-metrics configuration documentation.
- Lastly, we have the sink configuration named apigw_access_log_metrics, which is responsible for exposing the metrics on port 18687 of the Vector deployment instances. Each metric name will have a prefix based on the value configured at default_namespace. This configuration allows the metrics to be accessible and collected by external systems or monitoring tools.
This is an example ECS task definition for the Vector deployment. Note that in the entryPoint field, a bash script creates the /etc/vector/vector.yaml file by decoding the BASE64 text included in the script. In order to use this configuration, you'll need to modify the vector.yaml file and encode it to BASE64, replacing the part of the entryPoint field below that begins with c291cm.
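An abbreviated Fargate-style sketch of such a task definition follows; the image tag, names, ARNs, resource sizes, and the <BASE64_ENCODED_VECTOR_YAML> placeholder (which stands in for that BASE64 text) are all assumptions:

```json
{
  "family": "vector-apigw-access-log",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "taskRoleArn": "arn:aws:iam::123456789012:role/vector-apigw-access-log-task-role",
  "containerDefinitions": [
    {
      "name": "vector",
      "image": "timberio/vector:latest-debian",
      "essential": true,
      "portMappings": [
        { "containerPort": 18687, "protocol": "tcp" }
      ],
      "entryPoint": [
        "bash",
        "-c",
        "echo <BASE64_ENCODED_VECTOR_YAML> | base64 -d > /etc/vector/vector.yaml && exec vector --config /etc/vector/vector.yaml"
      ]
    }
  ]
}
```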
Please note that you have to configure proper permissions to enable the ECS tasks to poll messages from SQS. Here is an example of the permission configuration:
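A sketch of the task role policy, with a placeholder queue ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:apigw-access-log-queue"
    }
  ]
}
```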
Let's proceed with configuring Prometheus to scrape the metrics from Vector.
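A minimal static scrape configuration might look like the following; the target address is a placeholder, and in practice you would likely rely on ECS or Kubernetes service discovery instead:

```yaml
scrape_configs:
  - job_name: vector_apigw_access_log
    scrape_interval: 30s
    static_configs:
      - targets: ["vector.internal.example:18687"]
```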
First, let's verify that everything integrates well and works as expected. We can try sending a request to API Gateway and then get the metrics for that request via the Grafana Explore web interface.
Then run the following PromQL query to get the metrics related to that request:
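The query depends on the namespace configured in the sink; assuming the apigw prefix from the Vector sketch above, something like this surfaces the request together with all of its labels:

```promql
sum by (method, path, status, gatewayId, apiKeyId) (apigw_http_request_count_total)
```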
We will get a result that looks like this:
Alright, everything works well. Let's simulate increased traffic and generate more meaningful graphs on Grafana. By generating more data and visualizations, we’ll gain deeper insights into client behavior as well as the system's performance, and be able to make informed decisions based on the metrics collected.
Here are some graphs:
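Panels like these can be built from queries along the following lines (again assuming the apigw prefix); they cover the request-rate, latency-percentile, and error-rate questions posed at the beginning of this post:

```promql
# Request rate per API key over the last 5 minutes
sum by (apiKeyId) (rate(apigw_http_request_count_total[5m]))

# p99 latency per path (swap the quantile for 0.90 or 0.50 as needed)
histogram_quantile(0.99, sum by (le, path) (rate(apigw_http_response_latency_milliseconds_bucket[5m])))

# Error rate (5xx responses) per path
sum by (path) (rate(apigw_http_request_count_total{status=~"5.."}[5m]))
  / sum by (path) (rate(apigw_http_request_count_total[5m]))
```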
With just two basic metrics extracted from the access log, we have gained valuable insights into both the performance of the system and the behavior of our clients. These metrics can be easily integrated into the existing monitoring and alarm tools, enhancing the overall system visibility.
The solution we have implemented demonstrates a high level of availability and scalability. For instance, when it comes to lambda functions, we can optimize their performance and minimize cold starts by appropriately configuring provisioned concurrency. Additionally, the Vector deployment can be scaled either vertically or horizontally, enabling us to allocate resources efficiently as per our needs.
Furthermore, this solution is not limited to a single account; it can be expanded to include multiple accounts. This capability becomes particularly advantageous when a centralized monitoring solution is required for multiple accounts across the organization. It allows for streamlined management and oversight, simplifying the overall monitoring process.