Amazon Aurora is a fully managed, cloud-native relational database engine built by AWS. It is compatible with both MySQL and PostgreSQL and offers up to five times the throughput of standard MySQL and three times that of standard PostgreSQL, making it a popular choice for production workloads that demand high availability and scalability.
But high performance does not mean zero failure risk. Without proper AWS Aurora monitoring, a spike in CPU usage, a connection leak, a storage bottleneck, or a replica falling behind can all go undetected until they become outages. Setting up the right metrics and alerts is not optional; it is a core operational responsibility.
This guide walks you through every layer of Aurora monitoring: the key metrics you should track, the alerting thresholds that matter, the native AWS tools available, and how to extend visibility with third-party platforms. By the end, you will have a clear, actionable plan for keeping your Aurora clusters healthy.
- ✓ Aurora sends metrics to Amazon CloudWatch by default at 1-minute granularity, available for 15 days.
- ✓ The five critical metric categories to monitor are query throughput, query latency, resource utilization (CPU, memory, and I/O), connections, and replication lag.
- ✓ Set CloudWatch alarms on CPUUtilization (>80%), FreeableMemory (<20% of instance RAM), DatabaseConnections (>80% of maximum), AuroraReplicaLag (>1 second), and DiskQueueDepth.
- ✓ Enable Enhanced Monitoring (1-second granularity) and Performance Insights for deeper OS-level and query-level visibility.
- ✓ For Aurora Serverless v2, additionally monitor ACUUtilization to detect capacity ceiling pressure.
- ✓ Establish a performance baseline during normal operations before setting alert thresholds.
- ✓ Third-party tools like Datadog, SolarWinds, Grafana, and CubeAPM extend native CloudWatch capabilities with richer dashboards and cross-service correlation.
What Is AWS Aurora Monitoring?
AWS Aurora monitoring refers to the continuous collection, analysis, and alerting of performance and health data from your Amazon Aurora database clusters. It covers both cluster-level metrics (which apply to the entire DB cluster) and instance-level metrics (which apply to individual DB instances within the cluster).
Aurora integrates natively with Amazon CloudWatch, which serves as the primary metrics repository. CloudWatch collects and processes raw data from each active Aurora instance at 1-minute intervals by default, and stores these metrics for 15 days. All instance-level metrics are published to the AWS/RDS namespace in CloudWatch.
Beyond CloudWatch, AWS offers several complementary monitoring tools:
- Enhanced Monitoring: OS-level metrics at up to 1-second granularity
- Performance Insights: Query-level profiling and DB load analysis
- CloudWatch Logs Insights: Log search and analysis for slow query and error logs
- AWS CloudTrail: API call audit logs for compliance and security tracking
- Amazon DevOps Guru for RDS: ML-powered anomaly detection
Why Aurora Monitoring Matters
Aurora is often the data layer behind critical applications. A poorly monitored database can cause:
- Application slowdowns due to undetected query latency increases
- Cascading failures when connection pools are exhausted
- Data inconsistency if replica lag grows too large before a failover
- Unexpected cost spikes from runaway ACU scaling in Aurora Serverless v2
- Missed SLAs when storage or CPU bottlenecks go unaddressed
Effective Aurora monitoring lets you detect issues early, establish normal performance baselines, and respond to incidents before they reach end users. AWS recommends establishing a performance baseline by capturing average, minimum, and maximum values for all key metrics at multiple intervals (1 hour, 24 hours, 1 week, 2 weeks) during normal operations, before configuring alert thresholds.
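The baseline step can be sketched in a few lines of Python. This is a minimal illustration, not an AWS tool: the datapoints are hypothetical CPUUtilization values, and the helper names (`summarize_baseline`, `baseline_by_window`) are invented for this example. In practice the values would come from CloudWatch.

```python
from statistics import mean

def summarize_baseline(samples):
    """Return average, minimum, and maximum for one metric's datapoints."""
    if not samples:
        raise ValueError("need at least one datapoint")
    return {"avg": mean(samples), "min": min(samples), "max": max(samples)}

def baseline_by_window(series, windows):
    """Summarize a metric over several look-back windows.

    series  -- list of (minutes_ago, value) pairs
    windows -- mapping of label -> window length in minutes
    """
    result = {}
    for label, minutes in windows.items():
        values = [value for age, value in series if age <= minutes]
        result[label] = summarize_baseline(values)
    return result

# Hypothetical CPUUtilization datapoints as (minutes ago, percent).
cpu_series = [(10, 35.0), (50, 42.0), (600, 55.0), (5000, 30.0), (15000, 61.0)]

# The four intervals named in the text: 1 hour, 24 hours, 1 week, 2 weeks.
WINDOWS = {"1h": 60, "24h": 1440, "1w": 10080, "2w": 20160}

baseline = baseline_by_window(cpu_series, WINDOWS)
for label, stats in baseline.items():
    print(label, stats)
```

Comparing the windows side by side shows whether a value that looks high in the last hour is still inside the range seen over two weeks.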
Key Metrics to Monitor for AWS Aurora
The metrics below are organized by category. Together they give you a complete picture of your Aurora cluster health. These are sourced from Amazon CloudWatch, Aurora-specific CloudWatch metrics, and the MySQL or PostgreSQL database engine.
1. Query Throughput
Query throughput tells you how much work your database is doing. A sudden drop in throughput can indicate a crash, a blocked session, or a network issue. A sudden spike can indicate an application bug or a traffic event.
| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Queries per second | Queries | Total query execution rate | Alert on sudden drops (>30% below baseline) |
| Read throughput | SelectThroughput | SELECT statement rate | Compare vs. baseline |
| Write throughput | DMLThroughput | INSERT + UPDATE + DELETE rate | Compare vs. baseline |
| Commit throughput | CommitThroughput | COMMIT statement rate | Drop = potential lock contention |
Tip: A drop in Queries per second that is not accompanied by a drop in incoming application traffic is a strong signal of a database-side problem and should trigger an immediate investigation.
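The tip above can be expressed as a small check. This is a sketch under stated assumptions: the 30% drop comes from the table, while the function names and the idea of comparing against an application request rate from your own load balancer or APM data are illustrative.

```python
def throughput_drop(current, baseline, max_drop=0.30):
    """Return True when `current` is more than `max_drop` below `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    drop = (baseline - current) / baseline
    return drop > max_drop

def database_side_suspect(current_qps, baseline_qps,
                          current_requests, baseline_requests):
    """Queries dropped sharply while application traffic held steady."""
    queries_dropped = throughput_drop(current_qps, baseline_qps)
    traffic_dropped = throughput_drop(current_requests, baseline_requests)
    return queries_dropped and not traffic_dropped

# Hypothetical readings: queries fell 40% while incoming requests fell 2%.
print(database_side_suspect(600, 1000, 980, 1000))
# Queries and requests fell together, which points at traffic, not the database.
print(database_side_suspect(600, 1000, 550, 1000))
```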
2. Query Latency
Latency is one of the most user-visible performance indicators. Aurora exposes read and write latency metrics that are exclusive to the Aurora engine and not available for other RDS database engines.
| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Read query latency | SelectLatency | Average SELECT execution time (ms) | >10ms sustained or >50ms spikes |
| Write query latency | DMLLatency | Average DML execution time (ms) | >5ms sustained |
| Slow queries | Engine: Slow_queries | Queries exceeding long_query_time | Any sustained nonzero count |
| Commit latency | CommitLatency | Average COMMIT time (ms) | >5ms is worth investigating |
Slow queries are defined by the long_query_time database parameter. You can adjust this via the RDS parameter group in the AWS Console. To view slow query details, enable the slow query log and publish it to CloudWatch Logs, then query it using CloudWatch Logs Insights.
To find the slowest queries using the MySQL sys schema (if Performance Schema is enabled):
mysql> SELECT * FROM sys.statements_with_runtimes_in_95th_percentile\G
3. Resource Utilization (CPU, Memory, Disk, Network)
Aurora, like any database, depends on four fundamental hardware resources: CPU, memory, disk, and network. A bottleneck in any one of these will degrade query performance even if the database engine itself is healthy.
| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| CPU utilization | CPUUtilization | Percentage of CPU in use | >80% for 5 minutes |
| Freeable memory | FreeableMemory | Available RAM (bytes) | <20% of total instance RAM |
| Read IOPS | ReadIOPS | Disk read operations per second | Persistently high = working set not in memory |
| Write IOPS | WriteIOPS | Disk write operations per second | Compare to provisioned IOPS limit |
| Disk queue depth | DiskQueueDepth | Pending I/O operations | >1 sustained is worth investigating |
| Read latency (disk) | ReadLatency | Average time per disk read operation (seconds) | >0.001 s (1 ms) at disk level is elevated |
| Write latency (disk) | WriteLatency | Average time per disk write operation (seconds) | >0.001 s (1 ms) at disk level is elevated |
| Network receive throughput | NetworkReceiveThroughput | Incoming network traffic (bytes/s) | Compare to instance type limits |
| Network transmit throughput | NetworkTransmitThroughput | Outgoing network traffic (bytes/s) | Compare to instance type limits |
Important note on ReadIOPS: If ReadIOPS is high and stable while your application is under load, it often means your working data set is too large to fit in the InnoDB buffer pool. Upgrading to a larger instance class (more RAM) is usually more effective than optimizing queries in this case.
Important note on storage: Unlike standard MySQL on RDS, Aurora uses auto-scaling shared distributed storage. It does not expose a FreeStorageSpace metric. Storage grows automatically up to 128 TiB. You can track VolumeBytesUsed at the cluster level to see how much storage has been consumed.
4. Database Connection Metrics
Connection management is critical in Aurora. Each database connection consumes memory. Aurora has a configurable maximum connection limit, and exceeding it causes applications to receive “Too many connections” errors, which can cause a cascading failure across your application layer.
In Aurora MySQL, the default max_connections value is derived from instance memory using a formula of the form log(DBInstanceClassMemory / 8187281408) x 1000, so the limit grows with the instance class. You can check or override it in the DB parameter group.
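To make the formula concrete, here is a minimal sketch that evaluates it and derives an 80% alerting threshold. It assumes that log means log base 2 and that the input is the DBInstanceClassMemory value in bytes (which is somewhat lower than the nominal instance RAM). Treat the result as an estimate and confirm it with SELECT @@max_connections on the instance.

```python
import math

def default_max_connections(db_instance_class_memory_bytes):
    """Evaluate log(DBInstanceClassMemory / 8187281408) x 1000.

    Assumption: log is base 2. The argument is DBInstanceClassMemory in
    bytes, not the nominal RAM of the instance class.
    """
    ratio = db_instance_class_memory_bytes / 8187281408
    return int(math.log2(ratio) * 1000)

def connection_alarm_threshold(max_connections, fraction=0.80):
    """Connection count at which a DatabaseConnections alarm should fire."""
    return int(max_connections * fraction)

# Hypothetical memory value chosen so the ratio is exactly 2.
memory_bytes = 2 * 8187281408
limit = default_max_connections(memory_bytes)
print(limit, connection_alarm_threshold(limit))
```

Deriving the alarm threshold from the limit, rather than hard-coding a connection count, keeps the alarm meaningful when the instance class changes.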
| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Open connections | DatabaseConnections | Currently open client connections | >80% of max_connections |
| Login failures | LoginFailures | Failed connection attempts per second | >0 should be investigated |
| Active threads | Engine: Threads_running | Threads actively executing queries | Sudden spikes = concurrency issue |
| Connection errors | Engine: Connection_errors_internal | Errors caused by server-side issues | Any nonzero = investigate immediately |
To check current and maximum connections directly on Aurora MySQL:
mysql> SHOW GLOBAL STATUS LIKE 'Threads_connected';
mysql> SELECT @@max_connections;
Connection_errors_internal is particularly important: it increments when errors originate from the server itself, such as an out-of-memory condition or thread creation failure. Any nonzero value deserves immediate attention.
5. Read Replica and Replication Metrics
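Aurora supports up to 15 Aurora Replicas per DB cluster. Replica lag is the delay between a write being committed on the writer and becoming readable on a replica. High replica lag causes stale reads on reader instances. Because Aurora Replicas in the same Region share the cluster's storage volume with the writer, lag mainly affects read freshness; for cross-Region global databases, replication lag also bounds how much recent data could be lost if a secondary Region is promoted.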
| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Replica lag | AuroraReplicaLag | Lag in milliseconds behind primary | >1000ms (1 second) |
| Max replica lag | AuroraReplicaLagMaximum | Highest lag across all replicas | >1000ms |
| Min replica lag | AuroraReplicaLagMinimum | Lowest lag across all replicas | Useful for baseline tracking |
For Aurora global databases (cross-region replication), also monitor:
- AuroraGlobalDBReplicationLag: Replication lag across AWS regions in milliseconds
- AuroraGlobalDBRPOLag: Recovery Point Objective lag, indicating potential data loss window
6. Aurora Serverless v2 Specific Metrics
Aurora Serverless v2 automatically scales compute capacity in Aurora Capacity Units (ACUs). Each ACU provides approximately 2 GiB of memory along with corresponding CPU and networking. In addition to the metrics you would track for provisioned Aurora, you must monitor ACU consumption to detect capacity ceiling pressure.
| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| ACU utilization | ACUUtilization | Percentage of max ACU capacity in use | >80% (approaching max ACU limit) |
| CPU utilization | CPUUtilization | CPU usage within allocated ACUs | >80% for 5 minutes |
| Serverless capacity | ServerlessDatabaseCapacity | Current ACU capacity allocated | Monitor for unexpected scaling patterns |
If both ACUUtilization and CPUUtilization are near 100%, your cluster has hit its maximum ACU capacity and is under extreme load. Increase the maximum ACU limit in the cluster configuration or optimize your workload.
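The rule above can be sketched as a small classifier. The 80% warning level comes from the table. The 95% cut-off used for "near 100%" and the status labels are assumptions made for this example.

```python
def serverless_capacity_status(acu_utilization, cpu_utilization,
                               near_max=95.0, warn=80.0):
    """Classify Serverless v2 capacity pressure from two percentages.

    near_max -- assumed cut-off for "near 100%" (not an AWS-defined value)
    warn     -- ACUUtilization alert threshold from the table above
    """
    if acu_utilization >= near_max and cpu_utilization >= near_max:
        # Capacity ceiling reached while the workload is CPU-bound:
        # raise the maximum ACU setting or reduce the load.
        return "at-capacity"
    if acu_utilization > warn:
        return "approaching-max-acu"
    return "ok"

# Hypothetical readings.
print(serverless_capacity_status(99.0, 98.0))
print(serverless_capacity_status(85.0, 60.0))
print(serverless_capacity_status(40.0, 35.0))
```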
AWS Aurora Monitoring Tools
AWS provides a layered set of native monitoring tools, each suited to a different level of observability. You do not need to use all of them; choose based on your operational requirements.
Amazon CloudWatch
CloudWatch is the foundational monitoring tool for Aurora. It collects instance-level metrics automatically at 1-minute intervals at no extra cost. Key capabilities include:
- Metrics dashboards for visualizing Aurora performance over time
- Alarms that trigger SNS notifications or Auto Scaling actions when thresholds are breached
- CloudWatch Logs Insights for querying Aurora error logs, slow query logs, and audit logs
- Metric Insights for aggregating metrics across large fleets of Aurora instances
Enhanced Monitoring
Enhanced Monitoring provides OS-level metrics at granularities from 1 second to 60 seconds. It is delivered via CloudWatch Logs rather than standard CloudWatch Metrics. To enable it, you must attach an IAM role to your Aurora instance with the AmazonRDSEnhancedMonitoringRole policy.
Enhanced Monitoring adds visibility into:
- CPU steal and CPU wait times
- Per-process CPU and memory usage (useful for identifying rogue database processes)
- File system read/write activity at higher resolution than CloudWatch
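Because Enhanced Monitoring arrives as JSON log events, it is straightforward to post-process. The sketch below pulls CPU steal and wait values and the busiest processes out of one event. The sample event is abridged, its values are hypothetical, and the field names (`cpuUtilization`, `processList`, `cpuUsedPc`) reflect the layout as understood here. Verify them against a real event in your own log group before relying on them.

```python
import json

# Abridged, hypothetical Enhanced Monitoring event.
SAMPLE_EVENT = json.dumps({
    "instanceID": "aurora-prod-instance-1",
    "cpuUtilization": {"total": 81.2, "user": 60.1, "system": 9.4,
                       "wait": 7.3, "steal": 4.4},
    "processList": [
        {"name": "mysqld", "cpuUsedPc": 72.5, "memoryUsedPc": 61.0},
        {"name": "RDS processes", "cpuUsedPc": 5.1, "memoryUsedPc": 4.2},
        {"name": "OS processes", "cpuUsedPc": 3.6, "memoryUsedPc": 2.9},
    ],
})

def cpu_pressure(event_json):
    """Return the CPU steal and wait percentages from one event."""
    cpu = json.loads(event_json)["cpuUtilization"]
    return {"steal": cpu["steal"], "wait": cpu["wait"]}

def top_processes(event_json, n=3):
    """Return (name, cpu percent) for the n busiest processes."""
    processes = json.loads(event_json).get("processList", [])
    ranked = sorted(processes, key=lambda p: p.get("cpuUsedPc", 0.0),
                    reverse=True)
    return [(p["name"], p["cpuUsedPc"]) for p in ranked[:n]]

print(cpu_pressure(SAMPLE_EVENT))
print(top_processes(SAMPLE_EVENT, n=2))
```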
Performance Insights
Performance Insights is an RDS feature that expands on standard monitoring to show database load in terms of active sessions and wait events. It visualizes the “DB Load” as the number of average active sessions (AAS), broken down by:
- SQL statement (which queries are contributing to load)
- Wait events (what resources sessions are waiting on)
- Users and hosts (which application users are generating the most load)
Performance Insights is particularly useful for identifying query-level performance problems that do not show up in aggregate CloudWatch metrics. The free tier retains 7 days of Performance Insights data; a paid tier extends this to 2 years.
Amazon Managed Grafana
For teams that prefer Grafana dashboards, AWS offers Amazon Managed Grafana with built-in support for CloudWatch as a data source. You can use pre-built RDS and Aurora dashboards to visualize all CloudWatch metrics alongside metrics from other AWS services in a unified view.
Third-Party Monitoring Tools
Several third-party platforms extend Aurora monitoring beyond what CloudWatch provides natively:
- Datadog: Offers a dedicated Aurora integration that combines CloudWatch metrics with database engine metrics, correlates them with APM traces, and provides pre-built dashboards. Source: https://www.datadoghq.com/blog/monitoring-amazon-aurora-performance-metrics/
- SolarWinds Database Performance Analyzer: Provides query-level performance analysis and wait time breakdowns for Aurora MySQL and PostgreSQL.
- Site24x7: Supports Aurora cluster monitoring with custom dashboards, trend analysis, and alerting. Source: https://www.site24x7.com/
- Sumo Logic: Offers log analytics and infrastructure monitoring for Aurora, combining metrics and log data in a single platform.
How to Set Up CloudWatch Alarms for Aurora
CloudWatch alarms watch a single metric over a time window you define. When the metric crosses a threshold for a specified number of consecutive periods, the alarm changes state and can send a notification via Amazon SNS or trigger an Auto Scaling action.
Important: An alarm does not invoke its actions simply because it is in the ALARM state. Actions run only on a state change, and the new state must hold for the specified number of consecutive evaluation periods before any action is triggered.
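The consecutive-period behavior is easier to see in a small simulation. This is a simplified model written for illustration, not CloudWatch's actual evaluation logic: it ignores missing data, the INSUFFICIENT_DATA state, and "M out of N" settings.

```python
def alarm_states(datapoints, threshold, evaluation_periods):
    """Return 'ALARM' or 'OK' for each period.

    The state is ALARM only once the most recent `evaluation_periods`
    datapoints have all been above the threshold.
    """
    states = []
    consecutive = 0
    for value in datapoints:
        consecutive = consecutive + 1 if value > threshold else 0
        states.append("ALARM" if consecutive >= evaluation_periods else "OK")
    return states

def action_points(states):
    """Indexes where the state changes into ALARM (a notification is sent)."""
    return [i for i, state in enumerate(states)
            if state == "ALARM" and (i == 0 or states[i - 1] != "ALARM")]

# Hypothetical 1-minute CPUUtilization averages.
cpu = [70, 85, 90, 75, 82, 88, 91, 95, 93, 60]
states = alarm_states(cpu, threshold=80, evaluation_periods=3)
print(states)
print(action_points(states))
```

The two-minute breach early in the series never reaches ALARM, and the later sustained breach produces a single notification rather than one per minute.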
Step-by-Step: Creating a CloudWatch Alarm for Aurora
1. Open the Amazon RDS console at https://console.aws.amazon.com/rds/ and choose Databases.
2. Select your DB instance and navigate to Logs & events.
3. In the CloudWatch alarms section, choose Create alarm.
4. Configure a notification: enable “Send notifications” and specify an SNS topic or create a new one with email/SMS recipients.
5. Select the metric, statistic (Average is typical), and alarm condition (greater than / less than / equal to threshold).
6. Set the evaluation period (e.g., 5 consecutive 1-minute periods = 5 minutes of sustained threshold breach before alarm fires).
7. Name the alarm clearly (e.g., “aurora-prod-cpu-high”) and choose Create Alarm.
Recommended CloudWatch Alarm Thresholds for Aurora
| Alarm | Metric | Condition | Evaluation Period | Action |
|---|---|---|---|---|
| High CPU | CPUUtilization | >80% | 5 consecutive minutes | SNS alert, investigate queries |
| Low memory | FreeableMemory | <500MB (or <20% RAM) | 5 consecutive minutes | SNS alert, plan instance upgrade |
| High connections | DatabaseConnections | >80% of max_connections | 3 consecutive minutes | SNS alert, check connection pooling |
| High replica lag | AuroraReplicaLag | >1000ms | 3 consecutive minutes | SNS alert, investigate replication |
| High disk queue | DiskQueueDepth | >10 | 5 consecutive minutes | SNS alert, check I/O patterns |
| Login failures | LoginFailures | >0 | 1 period | SNS alert, check credentials |
| High ACU (Serverless) | ACUUtilization | >80% | 5 consecutive minutes | SNS alert, increase max ACU |
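The table above can also be kept as data and turned into alarm definitions. The sketch below builds dictionaries shaped like the arguments of CloudWatch's PutMetricAlarm API (for example, boto3's `put_metric_alarm`). It only constructs and prints the definitions. The instance identifier, SNS topic ARN, and max_connections value are hypothetical placeholders, and the per-instance DBInstanceIdentifier dimension is an assumption that may not suit cluster-level alarms.

```python
MIB = 1024 * 1024
GT, LT = "GreaterThanThreshold", "LessThanThreshold"

def build_alarm(name, metric, operator, threshold, periods,
                instance_id, topic_arn):
    """Return keyword arguments shaped for CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/RDS",
        "MetricName": metric,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,  # 1-minute periods, as in the table
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": operator,
        "AlarmActions": [topic_arn],
    }

def recommended_alarms(instance_id, topic_arn, max_connections):
    """Build one alarm definition per row of the threshold table."""
    rows = [
        ("cpu-high", "CPUUtilization", GT, 80.0, 5),
        ("memory-low", "FreeableMemory", LT, 500 * MIB, 5),  # metric is in bytes
        ("connections-high", "DatabaseConnections", GT, 0.8 * max_connections, 3),
        ("replica-lag-high", "AuroraReplicaLag", GT, 1000.0, 3),
        ("disk-queue-high", "DiskQueueDepth", GT, 10.0, 5),
        ("login-failures", "LoginFailures", GT, 0.0, 1),
        ("acu-high", "ACUUtilization", GT, 80.0, 5),
    ]
    return [build_alarm(f"aurora-prod-{suffix}", metric, op, threshold,
                        periods, instance_id, topic_arn)
            for suffix, metric, op, threshold, periods in rows]

alarms = recommended_alarms(
    "aurora-prod-instance-1",                            # hypothetical
    "arn:aws:sns:us-east-1:123456789012:aurora-alerts",  # hypothetical
    max_connections=1000,
)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["ComparisonOperator"], alarm["Threshold"])

# With boto3 installed and credentials configured, each definition could be
# applied with: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```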
Monitoring Aurora Logs
Metrics tell you what is happening in aggregate. Logs tell you why. Aurora supports several log types that can be published to Amazon CloudWatch Logs for real-time search and analysis.
Log Types Available in Aurora
- Error log: Database startup, shutdown, and runtime errors
- Slow query log: Queries exceeding the long_query_time threshold (MySQL) or log_min_duration_statement (PostgreSQL)
- General log: All SQL statements executed (high volume; use selectively)
- Audit log: Database activity including connections, queries, and table access (requires Advanced Auditing feature for Aurora MySQL)
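For the slow query log in particular, each MySQL entry carries a header line with a Query_time value in seconds, which makes it possible to rank entries once they have been exported. The sketch below assumes the standard MySQL slow log header format and uses hypothetical sample entries. It is an offline illustration and not a replacement for querying the logs in CloudWatch.

```python
import re

# Matches the Query_time field in a MySQL slow query log header line.
QUERY_TIME = re.compile(r"Query_time:\s*([0-9.]+)")

def query_time_seconds(entry):
    """Return the Query_time of one log entry, or None if absent."""
    match = QUERY_TIME.search(entry)
    return float(match.group(1)) if match else None

def slowest(entries, n=10):
    """Return the n entries with the largest Query_time, slowest first."""
    timed = [(query_time_seconds(entry), entry) for entry in entries]
    timed = [(seconds, entry) for seconds, entry in timed if seconds is not None]
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return timed[:n]

# Hypothetical slow query log entries.
entries = [
    "# Query_time: 1.250000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 90000\n"
    "SELECT * FROM orders WHERE status = 'open';",
    "# Query_time: 3.200000  Lock_time: 0.000200 Rows_sent: 10  Rows_examined: 500000\n"
    "SELECT * FROM events ORDER BY created_at;",
    "# Query_time: 2.000000  Lock_time: 0.000000 Rows_sent: 5  Rows_examined: 120000\n"
    "UPDATE accounts SET balance = balance - 1 WHERE id = 7;",
]

for seconds, entry in slowest(entries, n=2):
    print(seconds, entry.splitlines()[-1])
```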
Querying Aurora Logs with CloudWatch Logs Insights
Once logs are published to CloudWatch Logs, you can query them interactively. Example: list the 10 most recent slow query log entries (set the query time range to the last 24 hours):
fields @timestamp, @message
| filter @logStream like /slowquery/
| sort @timestamp desc
| limit 10
AWS Aurora Monitoring Best Practices
Applying the right strategy matters as much as choosing the right tools. Follow these best practices to get the most out of your Aurora monitoring setup.
- Establish a baseline first: Run your workload under typical load and record average, minimum, and maximum values for all key metrics at multiple time intervals. Use this data to set meaningful alert thresholds rather than arbitrary ones.
- Monitor at both the cluster and instance level: Cluster-level metrics like VolumeBytesUsed and replication lag give an overall picture, while instance-level metrics (CPU, memory, connections) are specific to each writer or reader instance.
- Enable Enhanced Monitoring for production clusters: The 1-second granularity gives you faster detection of transient CPU spikes and memory pressure events that 1-minute CloudWatch averages can miss.
- Use Performance Insights for query-level root cause analysis: When aggregate metrics are elevated, Performance Insights helps you identify which specific SQL statements are driving the load and what they are waiting on.
- Alert on percentage, not absolute values: A DatabaseConnections alarm at 1000 connections means nothing without knowing your max_connections limit. Alert at >80% of the configured max instead.
- Correlate metrics across layers: A latency spike in your application may correspond to a DiskQueueDepth spike in Aurora. Correlating metrics across layers speeds up root-cause identification significantly.
- Publish logs to CloudWatch Logs: Enable slow query logs, error logs, and audit logs and send them to CloudWatch Logs so you can use Logs Insights for interactive analysis without logging into the database directly.
- Test your alerts regularly: Periodically verify that your CloudWatch alarms are configured correctly and that your SNS notifications reach the right people. Stale or misconfigured alarms are worse than no alarms.
Conclusion
Effective AWS Aurora monitoring requires visibility across multiple layers: query throughput and latency, resource utilization, connection health, replication lag, and, for Serverless v2, ACU consumption. Amazon CloudWatch provides the foundation, and tools like Enhanced Monitoring, Performance Insights, and CloudWatch Logs Insights give you deeper visibility when you need it.
The most important first step is not setting up dashboards; it is establishing a performance baseline. Once you know what normal looks like for your specific workload, you can configure alert thresholds with confidence and avoid both missed alerts and alert fatigue.
Whether you use native AWS tools, a third-party platform like Datadog or Grafana, or a purpose-built observability solution like CubeAPM, the goal is the same: detect issues early, resolve them quickly, and keep your applications running smoothly.
Disclaimer: The information in this article is provided for general educational purposes only. Metric thresholds and best practices may vary depending on your specific workload, instance class, Aurora engine version, and business requirements. Always refer to the official AWS Aurora documentation and test configurations in a non-production environment before applying them to production systems. CubeAPM product details are based on publicly available information at the time of publication.
FAQs
1. What is the difference between Aurora metrics and RDS metrics in CloudWatch?
Amazon Aurora publishes both standard RDS metrics (available for all RDS engines) and Aurora-specific metrics to CloudWatch in the AWS/RDS namespace. Aurora-specific metrics include SelectLatency, DMLLatency, CommitLatency, CommitThroughput, AuroraReplicaLag, and several others that are not available for MySQL, PostgreSQL, or other RDS engines. Standard RDS metrics like CPUUtilization, FreeableMemory, DatabaseConnections, and ReadIOPS are available for all engines.
2. How do I enable slow query logging in Amazon Aurora MySQL?
To enable slow query logs in Aurora MySQL: Navigate to your DB cluster’s parameter group in the RDS console. Set slow_query_log to 1 and configure long_query_time to your desired threshold in seconds (e.g., 1 for queries slower than 1 second). Set log_output to FILE to write to the database log, or to TABLE to write to the mysql.slow_log table. Optionally, enable publishing to CloudWatch Logs by modifying the DB cluster and enabling the slow query log under Additional configuration. Queries that exceed long_query_time will then appear in the slow query log.
3. What are the most important CloudWatch alarms to set for Aurora?
The highest-priority alarms to configure for a production Aurora cluster are: CPUUtilization (>80%), FreeableMemory (below a threshold appropriate for your instance size), DatabaseConnections (>80% of max_connections), AuroraReplicaLag (>1000ms), and DiskQueueDepth (>10 sustained). For Aurora Serverless v2, also add ACUUtilization (>80%). These five to six alarms cover the most common failure modes.
4. What is ACUUtilization and why does it matter for Aurora Serverless v2?
ACUUtilization measures the percentage of the maximum Aurora Capacity Unit (ACU) limit that is currently being used. Each ACU provides approximately 2 GiB of memory with corresponding CPU. If ACUUtilization and CPUUtilization are both near 100%, your cluster has reached its configured maximum capacity and cannot scale further. This can cause queries to queue and latency to spike. Increasing the maximum ACU limit in your cluster configuration resolves this; alternatively, you can optimize the workload to reduce resource consumption.
5. Does Aurora automatically scale storage? Do I need to monitor free storage space?
Yes, Aurora storage scales automatically in 10 GB increments, up to a maximum of 128 TiB. Because of this, Aurora does not expose a FreeStorageSpace metric. Instead, you can monitor VolumeBytesUsed at the cluster level to track how much storage has been consumed. There is no need to set a free storage alarm as you would for standard RDS engines, but tracking VolumeBytesUsed helps you understand growth trends and project costs.
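As a rough illustration of tracking growth trends, the sketch below projects storage use from a series of VolumeBytesUsed readings with a simple straight-line extrapolation. The readings are hypothetical and assumed to be taken one day apart. Real growth is rarely this regular, so treat the projection as a planning estimate only.

```python
GIB = 1024 ** 3

def average_daily_growth(samples):
    """Average growth per day in bytes.

    samples -- VolumeBytesUsed readings taken one day apart, oldest first
    """
    if len(samples) < 2:
        raise ValueError("need at least two readings")
    return (samples[-1] - samples[0]) / (len(samples) - 1)

def projected_bytes(samples, days_ahead):
    """Straight-line projection of storage use `days_ahead` days from now."""
    return samples[-1] + average_daily_growth(samples) * days_ahead

# Hypothetical daily readings: 500, 510, 520, 530 GiB.
readings = [500 * GIB, 510 * GIB, 520 * GIB, 530 * GIB]

print(average_daily_growth(readings) / GIB, "GiB per day")
print(projected_bytes(readings, 30) / GIB, "GiB in 30 days")
```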





