Monitor VPC NAT gateways using CloudWatch metrics and alarms
This blog post shows which CloudWatch metrics and alarms to use for monitoring VPC NAT gateways.
Published Feb 4, 2024
Many VPC designs utilize both public and private subnets. To enable communication from a private subnet to the Internet, a NAT gateway is required.
A VPC NAT gateway has finite capacity (source ports, bandwidth, and packets per second) that can be exhausted. It is therefore important to set up monitoring so that you are alerted when the NAT gateway becomes a bottleneck.
Each NAT gateway sends metrics to CloudWatch that can be monitored using CloudWatch alarms. I recommend creating alarms for the following metrics:
ErrorPortAllocation: Represents the number of times the NAT gateway was unable to allocate a source port.
PacketsDropCount: Represents the number of packets dropped by the NAT gateway.
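As a minimal sketch, the two alarms could be described in Python as shown here. The parameter names follow the CloudWatch PutMetricAlarm API, and the namespace AWS/NATGateway and the dimension NatGatewayId are the ones NAT gateways report under. The NAT gateway ID, the SNS topic ARN, the threshold of 0, and the one-minute period are assumptions for illustration, not recommendations from AWS; adjust them to your environment.

```python
# Sketch: build CloudWatch alarm definitions for the two NAT gateway metrics.
# The helper name, the IDs, and the threshold are hypothetical placeholders.

def build_alarm_params(nat_gateway_id, metric_name, sns_topic_arn,
                       threshold=0, period_seconds=60, evaluation_periods=1):
    """Return keyword arguments suitable for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{nat_gateway_id}-{metric_name}",
        "Namespace": "AWS/NATGateway",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        # Assumption: alarm as soon as the sum within one period exceeds 0.
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points means no errors, so missing data is not a breach.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


NAT_GATEWAY_ID = "nat-0123456789abcdef0"                         # placeholder
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:nat-alerts"  # placeholder

alarms = [
    build_alarm_params(NAT_GATEWAY_ID, metric, SNS_TOPIC_ARN)
    for metric in ("ErrorPortAllocation", "PacketsDropCount")
]

# With boto3 installed and AWS credentials configured, each definition
# could be created like this:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   for params in alarms:
#       cloudwatch.put_metric_alarm(**params)
```

Building the parameters separately from the API call keeps the definitions easy to review and test before anything is created in the account.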
Unfortunately, NAT gateways do not provide a single metric for throughput utilization, either for bandwidth or for packets. The limits are 100 Gbit/s of bandwidth and 10,000,000 packets per second. However, we can calculate the utilization ourselves using CloudWatch metric math.
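To calculate the bandwidth utilization, we can use the following metrics:
BytesOutToDestination
BytesOutToSource
BytesInFromDestination
BytesInFromSource
And the following expressions:
(BytesOutToDestination + BytesOutToSource + BytesInFromDestination + BytesInFromSource) * 8 / time period in seconds: Converts the bytes transferred per period (bytes/min with a one-minute period) to bit/s.
Result / 1,000,000,000: Converts bit/s to Gbit/s.
Gbit/s / 100 * 100: Converts Gbit/s to %; 100 Gbit/s is the hard limit.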
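The calculation above can be sketched in Python as follows. The first function only reproduces the arithmetic so the numbers can be checked by hand. The second builds a list of metric data queries in the shape that the CloudWatch GetMetricData and PutMetricAlarm APIs accept for metric math. The function names, the query IDs (m1 to m4, gbits, utilization), and the 60-second period are assumptions made for this example.

```python
# Sketch: NAT gateway bandwidth utilization from the four byte metrics.
# Function names, query IDs, and the 60-second period are illustrative choices.

PERIOD_SECONDS = 60   # assumed aggregation period (bytes per minute)
LIMIT_GBITS = 100     # hard limit named in the post, in Gbit/s

BYTE_METRICS = [
    "BytesOutToDestination",
    "BytesOutToSource",
    "BytesInFromDestination",
    "BytesInFromSource",
]


def bandwidth_utilization_percent(byte_sums, period_seconds=PERIOD_SECONDS,
                                  limit_gbits=LIMIT_GBITS):
    """Convert per-period byte sums to utilization in percent of the limit."""
    bits_per_second = sum(byte_sums) * 8 / period_seconds
    gbits_per_second = bits_per_second / 1_000_000_000
    return gbits_per_second * 100 / limit_gbits


def build_metric_queries(nat_gateway_id, period_seconds=PERIOD_SECONDS):
    """Build metric math queries that compute the utilization in CloudWatch."""
    queries = []
    for index, name in enumerate(BYTE_METRICS, start=1):
        queries.append({
            "Id": f"m{index}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/NATGateway",
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "NatGatewayId", "Value": nat_gateway_id}
                    ],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,  # raw inputs are not returned themselves
        })
    queries.append({
        "Id": "gbits",
        "Expression": f"(m1+m2+m3+m4)*8/{period_seconds}/1000000000",
        "Label": "Bandwidth in Gbit/s",
        "ReturnData": False,
    })
    queries.append({
        "Id": "utilization",
        "Expression": f"gbits/{LIMIT_GBITS}*100",
        "Label": "Bandwidth utilization in %",
        "ReturnData": True,   # only the final percentage is returned
    })
    return queries


# Worked example: 18.75 GB per metric in one minute is 75 GB in total,
# which is 10 Gbit/s, or 10 % of the 100 Gbit/s limit.
example = bandwidth_utilization_percent([18_750_000_000] * 4)
queries = build_metric_queries("nat-0123456789abcdef0")  # placeholder ID
```

The query list could be passed as the Metrics argument when creating an alarm on the utilization, for example one that fires above 80 %; that threshold would be your own choice.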
If you need more than 100 Gbit/s of bandwidth, split your resources across multiple subnets and create multiple NAT gateways. For optimal performance, deploy your EC2 instances in private subnets within the same Availability Zone as your NAT gateway.
In conclusion, monitoring VPC NAT gateways with CloudWatch metrics and alarms, as well as monitoring throughput utilization, allows us to ensure the optimal performance and availability of NAT gateways in our VPC designs.