
Troubleshoot slow performance after moving to AWS RDS

How to identify RDS storage issues and how to fix them.

Published May 22, 2024
Last Modified May 23, 2024
Storage is a critical aspect that is often overlooked when migrating on-premises SQL Server databases to AWS RDS. Matching CPU and memory specifications is common, but neglecting to analyze and optimize storage requirements can lead to unexpected performance problems and potentially high costs in the cloud. In an on-premises data center, you buy storage and can use all of its capacity at no additional cost; a cloud environment charges you based on the IOPS and throughput you consume and may also limit you depending on the instance type. So, while a spike from 1,000 to 16,000 IOPS can go unnoticed in an on-premises data center, it can have a huge impact on an RDS instance’s performance (and consequently on your clients) if the instance isn’t configured to handle such a load. Let’s look at the symptoms that usually point to this behaviour and how to fix them.

Symptoms

An increase in disk queue depth, read/write IOPS, and read/write throughput.

Root Causes Identification

  1. Amazon CloudWatch Metrics:
    a. Use Amazon CloudWatch to monitor the key storage metrics of your RDS instance, such as “ReadIOPS”, “WriteIOPS”, “ReadThroughput”, “WriteThroughput”, and “DiskQueueDepth”.
    b. An abrupt increase in read/write IOPS or read/write throughput, or a consistently high disk queue depth, may indicate storage throttling. (A minimal script for pulling these metrics is sketched right after this list.)
  2. Performance Insights:
    a. AWS Performance Insights offers a visual representation of database performance, including storage-related metrics such as IOPS (input/output operations per second) and throughput.
    b. Check for spikes or patterns in IOPS and/or throughput that coincide with periods of reduced database performance.
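As a complement to the console graphs, here is a minimal sketch in Python with boto3 that pulls those RDS metrics from CloudWatch and prints their recent peaks. The region and the instance identifier are placeholders; it assumes your AWS credentials are already configured.

```python
import boto3
from datetime import datetime, timedelta

# Placeholder region and instance identifier; adjust to your environment.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
DB_INSTANCE = "my-sqlserver-instance"

def rds_metric(metric_name, instance_id, hours=3):
    """Return 1-minute average datapoints for an RDS CloudWatch metric."""
    end = datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

# ReadThroughput/WriteThroughput are reported in bytes/second,
# ReadIOPS/WriteIOPS in operations/second, DiskQueueDepth as a count.
for name in ["DiskQueueDepth", "ReadIOPS", "WriteIOPS",
             "ReadThroughput", "WriteThroughput"]:
    points = rds_metric(name, DB_INSTANCE)
    peak = max((p["Average"] for p in points), default=0)
    print(f"{name}: peak 1-minute average over the last {3} hours = {peak:,.1f}")
```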
Using the metrics gathered above, it’s time to compare them with the instance type, storage type, and their limits.
Let’s use a real case to illustrate that.
A company moved a 2 TB SQL Server instance that was running on an on-premises server with 32 vCPUs and about 250 GB of RAM. To get a similar server in the cloud, they chose a db.r5.8xlarge RDS instance with EBS storage configured as gp2 (General Purpose SSD).
The underlying instance supports the following EBS limits:
Max bandwidth on EBS: 6,800 Mbps
Max throughput on EBS: 850 MB/s
Max IOPS: 30,000
Baseline bandwidth on EBS: 6,800 Mbps
Baseline throughput on EBS: 850 MB/s
Baseline IOPS: 30,000
And gp2 storage has the following per-volume limits:
3,000–16,000 IOPS (the baseline is 3 IOPS per GiB, so a 2 TB volume gets roughly 6,000 IOPS)
250 MiB/s throughput
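The limit that actually applies is the lower of the instance-level and volume-level numbers. Here is a small illustrative helper that does that comparison, ignoring gp2 burst credits; the function name and the 2 TiB example values are just for illustration.

```python
def effective_storage_limits(volume_gib, instance_max_iops, instance_max_throughput_mbs):
    """The effective limit is the lower of the volume's own capability
    and the instance's EBS-optimized capability."""
    gp2_iops = min(max(3 * volume_gib, 100), 16000)  # gp2 baseline: 3 IOPS/GiB, min 100, max 16,000
    gp2_throughput_mbs = 250                         # gp2 per-volume throughput ceiling (MiB/s)
    return {
        "iops": min(gp2_iops, instance_max_iops),
        "throughput_mbs": min(gp2_throughput_mbs, instance_max_throughput_mbs),
    }

# db.r5.8xlarge allows 30,000 IOPS and 850 MB/s at the instance level,
# but a ~2 TiB gp2 volume is the bottleneck here.
print(effective_storage_limits(2048, 30000, 850))
# {'iops': 6144, 'throughput_mbs': 250}
```

In this setup, both the IOPS and the throughput ceilings come from the gp2 volume, not from the instance.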
After the cutover, their customers started facing higher response times and timeouts.
Looking at the RDS instance metrics, CPU and memory utilization were fine. However, the graph below shows a consistently high “DiskQueueDepth.”
This metric is the number of I/O operations waiting to be processed by the underlying storage system, i.e. the depth of the queue of pending disk operations.
So we can conclude that something unexpected was happening at the storage level.
The next step is to check if the IOPS and/or throughput is throttled.
As shown in the graph below, throughput appears to be throttled at 250 MB/s:
So, even though the instance-level throughput limit is 850 MB/s, the workload can’t reach that number because the underlying gp2 EBS volume caps out at 250 MB/s. In other words, some processes perform large disk scans or insert large volumes of data, and the storage can’t handle that load in a timely manner.
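One way to confirm this from the metrics, rather than eyeballing the graph, is to look for minutes where combined read and write throughput sits at the volume ceiling. A rough sketch, reusing the rds_metric helper and DB_INSTANCE placeholder from the earlier snippet and the 250 MiB/s gp2 limit:

```python
def throttled_minutes(instance_id, limit_mib_s=250, tolerance=0.95):
    """Return the minutes where read + write throughput sat near the volume limit."""
    read = {p["Timestamp"]: p["Average"] for p in rds_metric("ReadThroughput", instance_id)}
    hits = []
    for p in rds_metric("WriteThroughput", instance_id):
        total_mib_s = (read.get(p["Timestamp"], 0) + p["Average"]) / (1024 * 1024)
        if total_mib_s >= tolerance * limit_mib_s:
            hits.append((p["Timestamp"], round(total_mib_s, 1)))
    return hits

for ts, mib_s in throttled_minutes(DB_INSTANCE):
    print(f"{ts:%Y-%m-%d %H:%M} combined throughput ~{mib_s} MiB/s (at the gp2 ceiling)")
```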

Solution

The easiest and fastest (but not the cheapest) solution in this situation is to change the storage type to gp3 or io1, allowing the disk to deliver higher throughput and IOPS.
Each of these storage types has its limits. This document can help you evaluate the one that best fits each workload.
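For reference, here is a hedged sketch of what that change could look like with boto3. The instance identifier, region, and the provisioned IOPS/throughput values are placeholders; check the gp3/io1 ranges documented for your engine and volume size before applying anything.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Placeholder identifier and target values; size them for your workload and
# keep them within the gp3 limits documented for your engine.
rds.modify_db_instance(
    DBInstanceIdentifier="my-sqlserver-instance",
    StorageType="gp3",
    Iops=12000,              # provisioned IOPS for the gp3 volume
    StorageThroughput=500,   # MiB/s; on gp3 this can be set independently of IOPS
    ApplyImmediately=True,   # otherwise the change waits for the next maintenance window
)

# While the change is applied, the instance stays available but its status
# moves to "storage-optimization".
status = rds.describe_db_instances(
    DBInstanceIdentifier="my-sqlserver-instance"
)["DBInstances"][0]["DBInstanceStatus"]
print(status)
```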
Note that once you change the storage type, the DB instance status changes to storage-optimization, and you won’t be able to modify the storage again for at least six hours. Even though changing or scaling storage doesn’t cause an outage, you may see some additional latency (within the single-digit millisecond range) during storage optimization.
That is a short-term solution: it puts a band-aid on the wound, but it won’t heal it. To fix the root cause, you need to identify the queries and/or database configurations that generate such a load. We’ll see how to do that in the next post.
Hope it helps!
 
