Transforming Amazon API Gateway Access Log Into Prometheus Metrics
This post will walk you through a solution that converts Amazon API Gateway access logs into Prometheus metrics.
- Metrics and Monitoring: API Gateway offers some basic metrics and monitoring capabilities, but they might not be sufficient for in-depth observability. You may need to integrate with AWS CloudWatch or other monitoring solutions to collect and analyze metrics such as latency, error rates, and request throughput.
- Debugging: Debugging issues within API Gateway can be tricky due to limited visibility into the inner workings of the service. It's crucial to utilize logging, metrics, and distributed tracing to identify and troubleshoot issues effectively. Additionally, thorough testing and monitoring of API Gateway configurations are essential.
- Error Handling: API Gateway can return various types of errors and status codes. Proper error handling and monitoring mechanisms are crucial to ensure you can identify and address errors promptly. Configuring appropriate alarms or integrating with other monitoring tools can help in this regard.
- Scalability and Performance: As API traffic grows, ensuring scalability and optimal performance becomes crucial. Monitoring and observability can help you identify bottlenecks, optimize resource allocation, and track the overall health and performance of the backend services behind API Gateway.
- Which clients and API keys have the highest consumption rates? What are the rates at which they make requests?
- Which API exhibits the slowest performance? What are the latency values for the 99th, 90th, and 50th percentiles?
- What are the success and error rates for a specific API?
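As an illustration of the percentile question above, once latencies are available as raw samples, p50/p90/p99 can be computed with a nearest-rank calculation. This is a standalone sketch (the sample values are made up), not part of the pipeline itself:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample >= p percent of all samples."""
    if not samples:
        raise ValueError("percentile of empty list")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 18, 20, 25, 30, 45, 60, 120, 500]
for p in (50, 90, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 25 ms, p90 = 120 ms, p99 = 500 ms
```

In the solution below, this computation is not done by hand: latency is exported as a Prometheus histogram, and Prometheus derives the percentiles at query time.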
- API Gateway is configured to generate access logs, which are then sent to CloudWatch Logs
- CloudWatch Logs Subscription Filter is employed to forward the access logs to a Lambda function
- The Lambda function performs additional enhancements on the access logs and dispatches them to an SQS queue
- Vector deployment is implemented, utilizing either ECS tasks or EKS pods. This deployment retrieves the access logs from the SQS queue and transforms them into Prometheus metrics, while also exposing these metrics
- Prometheus scrapes the metrics exposed by Vector
- Lastly, Grafana queries the Prometheus metrics and visualizes them in the form of graphs
{
    "apiKeyId": "$context.identity.apiKey",
    "gatewayId": "$context.apiId",
    "integrationLatency": "$context.integration.latency",
    "ip": "$context.identity.sourceIp",
    "latency": "$context.responseLatency",
    "method": "$context.httpMethod",
    "path": "$context.path",
    "principalId": "$context.authorizer.principalId",
    "protocol": "$context.protocol",
    "requestId": "$context.requestId",
    "responseLength": "$context.responseLength",
    "routeKey": "$context.routeKey",
    "stage": "$context.stage",
    "status": "$context.status",
    "time": "$context.requestTimeEpoch",
    "userAgent": "$context.identity.userAgent"
}
Note that $context.identity.apiKey is not supported in the log format of an HTTP API; it is exclusively available in a REST API. As a result, when utilizing an HTTP API, it becomes impossible to retrieve metrics related to the usage of API keys.
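For reference, each request is logged as one JSON object rendered from the format above. The sketch below (with made-up values) shows such an entry and highlights that every $context value arrives as a string, which is why the Lambda function later converts numeric fields explicitly:

```python
import json

# Hypothetical rendered access-log entry; all $context values are strings
raw_entry = '''{
    "apiKeyId": "abc123",
    "method": "GET",
    "path": "/prod/api/v1/user/12345",
    "status": "200",
    "latency": "42",
    "time": "1700000000000"
}'''

entry = json.loads(raw_entry)
# Numeric-looking fields stay strings until explicitly converted
print(type(entry["latency"]).__name__)  # str
print(int(entry["latency"]) + 1)        # 43
```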
import json
import os
import zlib
from base64 import b64decode

import boto3

SQS_QUEUE_URL = os.getenv("SQS_QUEUE_URL")
SQS_QUEUE_REGION = os.getenv("SQS_QUEUE_REGION")
CONVERT_TO_INT_FIELDS = ["integrationLatency",
                         "latency", "status", "time", "responseLength"]

sqs = boto3.session.Session().client(
    'sqs', region_name=SQS_QUEUE_REGION, use_ssl=True)


def lambda_handler(event, context):
    decoded_data = decode_cwl_event(event["awslogs"]["data"])
    count = 0
    messages = []
    for log_event in decoded_data["logEvents"]:
        message_payload = json.loads(log_event["message"])
        # Access-log values arrive as strings; convert numeric fields to int
        for k, v in message_payload.items():
            if k in CONVERT_TO_INT_FIELDS:
                try:
                    message_payload[k] = int(v)
                except (TypeError, ValueError):
                    message_payload[k] = 0
        messages.append({
            "Id": str(count),
            "MessageBody": json.dumps(message_payload)
        })
        count = count + 1
        # SQS accepts at most 10 messages per batch
        if count == 10:
            sqs.send_message_batch(
                QueueUrl=SQS_QUEUE_URL,
                Entries=messages
            )
            print(f"{count} message(s) sent to SQS at {SQS_QUEUE_URL}")
            count = 0
            messages = []
    # Flush any remaining messages from a partial batch
    if len(messages) > 0:
        sqs.send_message_batch(
            QueueUrl=SQS_QUEUE_URL,
            Entries=messages
        )
        print(f"{count} message(s) sent to SQS at {SQS_QUEUE_URL}")


def decode_cwl_event(encoded_data: str) -> dict:
    # CloudWatch Logs delivers base64-encoded, gzip-compressed payloads
    compressed_data = b64decode(encoded_data)
    json_payload = zlib.decompress(compressed_data, 16 + zlib.MAX_WBITS)
    return json.loads(json_payload)
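To test the handler logic locally, a CloudWatch Logs subscription event can be simulated by reversing what decode_cwl_event does. The make_cwl_event helper below is hypothetical (real events carry additional metadata such as logGroup and owner), but it produces a payload the decoding path accepts:

```python
import base64
import gzip
import json

def make_cwl_event(log_messages: list) -> dict:
    """Build a minimal awslogs event like the one the Lambda function receives."""
    payload = {"logEvents": [{"message": json.dumps(m)} for m in log_messages]}
    # CloudWatch Logs gzip-compresses and base64-encodes the payload
    compressed = gzip.compress(json.dumps(payload).encode())
    return {"awslogs": {"data": base64.b64encode(compressed).decode()}}

event = make_cwl_event([{"path": "/prod/api/v1/user/1", "latency": "42"}])
# Decode it the same way decode_cwl_event does
decoded = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
print(decoded["logEvents"][0]["message"])
```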
The Lambda function requires both an Execution role and Resource-based policy statements. By combining these two components, the function can access the required resources and perform its tasks effectively.
{
    "Statement": [
        {
            "Action": [
                "sqs:SendMessage",
                "sqs:GetQueueUrl"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:<region>:<account-id>:apigw-access-log-demo",
            "Sid": "sqs"
        },
        {
            "Action": [
                "logs:PutLogEvents",
                "logs:CreateLogStream",
                "logs:CreateLogGroup"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/apigw-access-log-demo:*:*",
                "arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/apigw-access-log-demo:*"
            ]
        }
    ],
    "Version": "2012-10-17"
}
sources:
  apigw_access_log_queue:
    type: aws_sqs
    region: <region>
    queue_url: <sqs-queue-url>
    client_concurrency: 1
    decoding:
      codec: json
transforms:
  apigw_access_log_transformed:
    type: remap
    inputs:
    - apigw_access_log_queue
    source: >-
      .path, err = replace(.path, r'(\d{2,})', "{id}")
  apigw_access_log_2_metrics:
    type: log_to_metric
    inputs:
    - apigw_access_log_transformed
    metrics:
    - field: path
      name: http_request_count_total
      type: counter
      tags:
        method: "{{method}}"
        path: "{{path}}"
        status: "{{status}}"
        gatewayId: "{{gatewayId}}"
        apiKeyId: "{{apiKeyId}}"
    - field: latency
      name: http_response_latency_milliseconds
      type: histogram
      tags:
        method: "{{method}}"
        path: "{{path}}"
        status: "{{status}}"
        gatewayId: "{{gatewayId}}"
        apiKeyId: "{{apiKeyId}}"
sinks:
  apigw_access_log_metrics:
    type: prometheus_exporter
    inputs:
    - apigw_access_log_2_metrics
    address: 0.0.0.0:18687
    default_namespace: aws_apigw
    distributions_as_summaries: true
- First, we have the source configuration named apigw_access_log_queue. It enables the Vector deployment instances to poll messages from the specified SQS queue for further processing.
- Next, we have two transforms configured: apigw_access_log_transformed and apigw_access_log_2_metrics. These transforms serve specific purposes:
  - The apigw_access_log_transformed transform replaces any run of two or more digits in the path field with {id}. Here's an example:
    - Before transforming:
      {
          ...
          "method": "POST",
          "path": "/prod/api/v1/user/8761/friend/271",
          ...
      }
    - After transforming:
      {
          ...
          "method": "POST",
          "path": "/prod/api/v1/user/{id}/friend/{id}",
          ...
      }
  - The apigw_access_log_2_metrics transform converts messages from the previous transform into Prometheus metrics. It exposes two metrics, http_request_count_total and http_response_latency_milliseconds, each labeled with method, path, status, gatewayId, and apiKeyId. For more details, please refer to the Vector log_to_metric configuration documentation.
- Lastly, we have the sink configuration named apigw_access_log_metrics, which exposes the metrics on port 18687 of the Vector deployment instances. Each metric name is prefixed with the value configured at default_namespace. This allows the metrics to be collected by external systems or monitoring tools.
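The path normalization performed by the remap transform is an ordinary regex substitution, so the pattern can be sanity-checked outside Vector. The equivalent in Python:

```python
import re

def normalize_path(path: str) -> str:
    # Replace every run of two or more digits with "{id}",
    # mirroring the VRL: replace(.path, r'(\d{2,})', "{id}")
    return re.sub(r"\d{2,}", "{id}", path)

print(normalize_path("/prod/api/v1/user/8761/friend/271"))
# /prod/api/v1/user/{id}/friend/{id}
```

Note that single digits (such as the "1" in /v1/) are intentionally left untouched, since version segments are part of the route rather than request-specific identifiers.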
In the entryPoint field, a bash script creates the /etc/vector/vector.yaml file by decoding the BASE64 text embedded in the script. To use this configuration, you'll need to modify the vector.yaml file and encode it to BASE64, replacing the part of the entryPoint field below that begins with c291cm.
{
    "containerDefinitions": [
        {
            "name": "config-init",
            "image": "ubuntu:23.10",
            "cpu": 0,
            "portMappings": [],
            "essential": false,
            "entryPoint": [
                "bash",
                "-c",
                "set -ueo pipefail; mkdir -p /etc/vector/; echo c291cmNlczoKICBhcGlnd19hY2Nlc3NfbG9nX3F1ZXVlOgogICAgdHlwZTogYXdzX3NxcwogICAgcmVnaW9uOiA8cmVnaW9uPgogICAgcXVldWVfdXJsOiBodHRwczovL3Nxcy48cmVnaW9uPi5hbWF6b25hd3MuY29tLzxhY2NvdW50LWlkPi9hcGlndy1hY2Nlc3MtbG9nLWRlbW8KICAgIGNsaWVudF9jb25jdXJyZW5jeTogMQogICAgZGVjb2Rpbmc6CiAgICAgIGNvZGVjOiBqc29uCnRyYW5zZm9ybXM6CiAgYXBpZ3dfYWNjZXNzX2xvZ190cmFuc2Zvcm1lZDoKICAgIHR5cGU6IHJlbWFwCiAgICBpbnB1dHM6CiAgICAtIGFwaWd3X2FjY2Vzc19sb2dfcXVldWUKICAgIHNvdXJjZTogPi0KICAgICAgLnBhdGgsIGVyciA9IHJlcGxhY2UoLnBhdGgsIHInKFxkezIsfSknLCAie2lkfSIpCiAgYXBpZ3dfYWNjZXNzX2xvZ18yX21ldHJpY3M6CiAgICB0eXBlOiBsb2dfdG9fbWV0cmljCiAgICBpbnB1dHM6CiAgICAtIGFwaWd3X2FjY2Vzc19sb2dfdHJhbnNmb3JtZWQKICAgIG1ldHJpY3M6CiAgICAtIGZpZWxkOiBwYXRoCiAgICAgIG5hbWU6IGh0dHBfcmVxdWVzdF9jb3VudF90b3RhbAogICAgICB0eXBlOiBjb3VudGVyCiAgICAgIHRhZ3M6CiAgICAgICAgbWV0aG9kOiAie3ttZXRob2R9fSIKICAgICAgICBwYXRoOiAie3twYXRofX0iCiAgICAgICAgc3RhdHVzOiAie3tzdGF0dXN9fSIKICAgICAgICBnYXRld2F5SWQ6ICJ7e2dhdGV3YXlJZH19IgogICAgICAgIGFwaUtleUlkOiAie3thcGlLZXlJZH19IgogICAgLSBmaWVsZDogbGF0ZW5jeQogICAgICBuYW1lOiBodHRwX3Jlc3BvbnNlX2xhdGVuY3lfbWlsbGlzZWNvbmRzCiAgICAgIHR5cGU6IGhpc3RvZ3JhbQogICAgICB0YWdzOgogICAgICAgIG1ldGhvZDogInt7bWV0aG9kfX0iCiAgICAgICAgcGF0aDogInt7cGF0aH19IgogICAgICAgIHN0YXR1czogInt7c3RhdHVzfX0iCiAgICAgICAgZ2F0ZXdheUlkOiAie3tnYXRld2F5SWR9fSIKICAgICAgICBhcGlLZXlJZDogInt7YXBpS2V5SWR9fSIKc2lua3M6CiAgYXBpZ3dfYWNlZXNzX2xvZ19tZXRyaWNzOgogICAgdHlwZTogcHJvbWV0aGV1c19leHBvcnRlcgogICAgaW5wdXRzOgogICAgLSBhcGlnd19hY2Nlc3NfbG9nXzJfbWV0cmljcwogICAgYWRkcmVzczogMC4wLjAuMDoxODY4NwogICAgZGVmYXVsdF9uYW1lc3BhY2U6IGF3c19hcGlndwogICAgZGlzdHJpYnV0aW9uc19hc19zdW1tYXJpZXM6IHRydWUK | base64 -d > /etc/vector/vector.yaml; cat /etc/vector/vector.yaml"
            ],
            "environment": [],
            "mountPoints": [
                {
                    "sourceVolume": "config",
                    "containerPath": "/etc/vector"
                }
            ],
            "volumesFrom": [],
            "stopTimeout": 10,
            "privileged": false,
            "readonlyRootFilesystem": false,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/aws/ecs/apigw-access-log-demo",
                    "awslogs-region": "<region>",
                    "awslogs-stream-prefix": "vector"
                }
            }
        },
        {
            "name": "vector",
            "image": "timberio/vector:0.30.0-distroless-libc",
            "cpu": 0,
            "portMappings": [
                {
                    "containerPort": 8686,
                    "hostPort": 8686,
                    "protocol": "tcp"
                },
                {
                    "containerPort": 18687,
                    "hostPort": 18687,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "command": [
                "--config-dir",
                "/etc/vector/"
            ],
            "environment": [],
            "mountPoints": [
                {
                    "sourceVolume": "config",
                    "containerPath": "/etc/vector"
                }
            ],
            "volumesFrom": [],
            "dependsOn": [
                {
                    "containerName": "config-init",
                    "condition": "SUCCESS"
                }
            ],
            "privileged": false,
            "readonlyRootFilesystem": false,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/aws/ecs/apigw-access-log-demo",
                    "awslogs-region": "<region>",
                    "awslogs-stream-prefix": "vector"
                }
            }
        }
    ],
    "family": "vector",
    "taskRoleArn": "arn:aws:iam::<account-id>:role/vector-ecs-task-role",
    "executionRoleArn": "arn:aws:iam::<account-id>:role/apigw-access-log-demo-ecs-task-role",
    "networkMode": "awsvpc",
    "volumes": [
        {
            "name": "config",
            "host": {}
        }
    ],
    "compatibilities": [
        "EC2",
        "FARGATE"
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "256",
    "memory": "512",
    "runtimePlatform": {
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX"
    }
}
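To regenerate the BASE64 payload embedded in the config-init entryPoint above, encode your vector.yaml content as shown in this sketch. The inline yaml_text here is a truncated stand-in for the full configuration file; in practice you would read the real file from disk:

```python
import base64

def encode_config(yaml_text: str) -> str:
    """Encode Vector config text to the BASE64 string pasted into entryPoint."""
    return base64.b64encode(yaml_text.encode()).decode()

# Stand-in for the full vector.yaml contents
yaml_text = "sources:\n  apigw_access_log_queue:\n    type: aws_sqs\n"
encoded = encode_config(yaml_text)
print(encoded)

# Round-trip check: decoding must reproduce the original text
assert base64.b64decode(encoded).decode() == yaml_text
```

Because the configuration starts with "sources:", the encoded output always begins with c291cm, which matches the prefix mentioned earlier.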
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        },
        {
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:<region>:<account-id>:apigw-access-log-demo"
        }
    ]
}
scrape_configs:
  ...
  - job_name: apigw-access-log
    scrape_interval: 15s
    static_configs:
    - targets: [<vector-address>:18687]
  ...
curl -X GET <API-Gateway-URL>/api/v1/user/12345
{job="apigw-access-log", __name__=~".+", method="GET", path=~".*/api/v1/user/{id}"}