CubeAPM

How to Monitor AWS Aurora: Key Metrics and Alerts


Amazon Aurora is a fully managed, cloud-native relational database engine built by AWS. It is compatible with both MySQL and PostgreSQL and offers up to five times the throughput of standard MySQL and three times that of standard PostgreSQL, making it a popular choice for production workloads that demand high availability and scalability.

But high performance does not mean zero failure risk. Without proper AWS Aurora monitoring, a spike in CPU usage, a connection leak, a storage bottleneck, or a replica falling behind can all go undetected until they become outages. Setting up the right metrics and alerts is not optional; it is a core operational responsibility.

This guide walks you through every layer of Aurora monitoring: the key metrics you should track, the alerting thresholds that matter, the native AWS tools available, and how to extend visibility with third-party platforms. By the end, you will have a clear, actionable plan for keeping your Aurora clusters healthy.

Key Takeaways
  • ✓ Aurora sends metrics to Amazon CloudWatch by default at 1-minute granularity; data points at that resolution are retained for 15 days.
  • ✓ The five critical metric categories to monitor are query throughput, query latency, resource utilization (CPU, memory, and I/O), connections, and replication lag.
  • ✓ Set CloudWatch alarms on CPUUtilization (>80%), FreeableMemory (<20% of instance RAM), DatabaseConnections (>80% of max_connections), AuroraReplicaLag (>1000 ms), and DiskQueueDepth (>10 sustained).
  • ✓ Enable Enhanced Monitoring (1-second granularity) and Performance Insights for deeper OS-level and query-level visibility.
  • ✓ For Aurora Serverless v2, additionally monitor ACUUtilization to detect capacity ceiling pressure.
  • ✓ Establish a performance baseline during normal operations before setting alert thresholds.
  • ✓ Third-party tools like Datadog, SolarWinds, Grafana, and CubeAPM extend native CloudWatch capabilities with richer dashboards and cross-service correlation.

What Is AWS Aurora Monitoring?

AWS Aurora monitoring refers to the continuous collection, analysis, and alerting of performance and health data from your Amazon Aurora database clusters. It covers both cluster-level metrics (which apply to the entire DB cluster) and instance-level metrics (which apply to individual DB instances within the cluster).

Aurora integrates natively with Amazon CloudWatch, which serves as the primary metrics repository. CloudWatch collects and processes raw data from each active Aurora instance at 1-minute intervals by default and retains those 1-minute data points for 15 days. All instance-level metrics are published to the AWS/RDS namespace in CloudWatch.

Beyond CloudWatch, AWS offers several complementary monitoring tools:

  • Enhanced Monitoring: OS-level metrics at up to 1-second granularity
  • Performance Insights: Query-level profiling and DB load analysis
  • CloudWatch Logs Insights: Log search and analysis for slow query and error logs
  • AWS CloudTrail: API call audit logs for compliance and security tracking
  • Amazon DevOps Guru for RDS: ML-powered anomaly detection

Why Aurora Monitoring Matters

Aurora is often the data layer behind critical applications. A poorly monitored database can cause:

  • Application slowdowns due to undetected query latency increases
  • Cascading failures when connection pools are exhausted
  • Stale reads when replica lag grows too large, and potential data loss after a cross-Region failover in a global database
  • Unexpected cost spikes from runaway ACU scaling in Aurora Serverless v2
  • Missed SLAs when storage or CPU bottlenecks go unaddressed

Effective Aurora monitoring lets you detect issues early, establish normal performance baselines, and respond to incidents before they reach end users. AWS recommends establishing a performance baseline by capturing average, minimum, and maximum values for all key metrics at multiple intervals (1 hour, 24 hours, 1 week, 2 weeks) during normal operations, before configuring alert thresholds.
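
As a rough illustration of the baseline step, the sketch below summarizes a series of 1-minute datapoints over several trailing windows. The function names and the sample values are hypothetical; in practice the datapoints would come from CloudWatch, and the windows would be the longer intervals mentioned above (1 hour, 24 hours, and so on).

```python
from statistics import mean


def summarize(samples):
    """Return the average, minimum, and maximum of a list of datapoints."""
    if not samples:
        raise ValueError("need at least one datapoint")
    return {"avg": mean(samples), "min": min(samples), "max": max(samples)}


def baseline_by_window(datapoints, windows):
    """Summarize the most recent N datapoints for each named window.

    datapoints: 1-minute metric values, oldest first.
    windows: mapping of label -> window length in minutes (must be > 0).
    """
    return {
        label: summarize(datapoints[-minutes:])
        for label, minutes in windows.items()
    }


# Hypothetical CPUUtilization values (percent), one per minute.
cpu = [20.0, 30.0, 40.0, 50.0]
report = baseline_by_window(cpu, {"last_2_min": 2, "last_4_min": 4})
print(report)
```

Comparing the short and long windows side by side makes it easier to see whether a value is unusual for the workload or simply part of its normal daily range.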

Key Metrics to Monitor for AWS Aurora

The metrics below are organized by category. Together they give you a complete picture of your Aurora cluster health. These are sourced from Amazon CloudWatch, Aurora-specific CloudWatch metrics, and the MySQL or PostgreSQL database engine.

1. Query Throughput

Query throughput tells you how much work your database is doing. A sudden drop in throughput can indicate a crash, a blocked session, or a network issue. A sudden spike can indicate an application bug or a traffic event.

| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Queries per second | Queries | Total query execution rate | Alert on sudden drops (>30% below baseline) |
| Read throughput | SelectThroughput | SELECT statement rate | Compare vs. baseline |
| Write throughput | DMLThroughput | INSERT + UPDATE + DELETE rate | Compare vs. baseline |
| Commit throughput | CommitThroughput | COMMIT statement rate | Drop = potential lock contention |

Tip: A drop in Queries per second that is not accompanied by a drop in incoming application traffic is a strong signal of a database-side problem and should trigger an immediate investigation.
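
The tip above can be expressed as a simple check. This is a minimal sketch, assuming you already have a baseline and a current value for both the Queries metric and some application-side request rate; the function names and the 30% default are illustrative, not part of any AWS API.

```python
def drop_pct(current, baseline):
    """Percentage by which the current value sits below the baseline (0 if above)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline * 100.0)


def is_database_side_drop(current_qps, baseline_qps,
                          current_requests, baseline_requests,
                          threshold_pct=30.0):
    """True when query throughput fell past the threshold but app traffic did not."""
    db_drop = drop_pct(current_qps, baseline_qps)
    app_drop = drop_pct(current_requests, baseline_requests)
    return db_drop > threshold_pct and app_drop <= threshold_pct


# Hypothetical numbers: queries fell 40% while incoming requests fell only 2%.
print(is_database_side_drop(600, 1000, 980, 1000))
```

When both rates fall together, the drop is more likely a traffic change than a database problem, which is why the check compares the two.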

2. Query Latency

Latency is one of the most user-visible performance indicators. Aurora exposes read and write latency metrics that are exclusive to the Aurora engine and not available for other RDS database engines.

| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Read query latency | SelectLatency | Average SELECT execution time (ms) | >10ms sustained or >50ms spikes |
| Write query latency | DMLLatency | Average DML execution time (ms) | >5ms sustained |
| Slow queries | Engine: Slow_queries | Queries exceeding long_query_time | Any sustained nonzero count |
| Commit latency | CommitLatency | Average COMMIT time (ms) | >5ms is worth investigating |

Slow queries are defined by the long_query_time database parameter. You can adjust this via the RDS parameter group in the AWS Console. To view slow query details, enable the slow query log and publish it to CloudWatch Logs, then query it using CloudWatch Logs Insights.

To find the slowest queries using the MySQL sys schema (if Performance Schema is enabled):

mysql> SELECT * FROM sys.statements_with_runtimes_in_95th_percentile\G

3. Resource Utilization (CPU, Memory, Disk, Network)

Aurora, like any database, depends on four fundamental hardware resources: CPU, memory, disk, and network. A bottleneck in any one of these will degrade query performance even if the database engine itself is healthy.

| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| CPU utilization | CPUUtilization | Percentage of CPU in use | >80% for 5 minutes |
| Freeable memory | FreeableMemory | Available RAM (reported in bytes) | <20% of total instance RAM |
| Read IOPS | ReadIOPS | Disk read operations per second | Persistently high = working set not in memory |
| Write IOPS | WriteIOPS | Disk write operations per second | Compare vs. baseline (Aurora storage has no provisioned IOPS setting) |
| Disk queue depth | DiskQueueDepth | Pending I/O operations | >1 sustained is worth investigating |
| Read latency (disk) | ReadLatency | Average time per disk read operation (reported in seconds) | >0.001 s (1 ms) at disk level is elevated |
| Write latency (disk) | WriteLatency | Average time per disk write operation (reported in seconds) | >0.001 s (1 ms) at disk level is elevated |
| Network receive throughput | NetworkReceiveThroughput | Incoming network traffic (bytes/s) | Compare to instance type limits |
| Network transmit throughput | NetworkTransmitThroughput | Outgoing network traffic (bytes/s) | Compare to instance type limits |

Important note on ReadIOPS: If ReadIOPS is high and stable while your application is under load, it often means your working data set is too large to fit in the InnoDB buffer pool. Upgrading to a larger instance class (more RAM) is usually more effective than optimizing queries in this case.

Important note on storage: Unlike standard MySQL on RDS, Aurora uses auto-scaling shared distributed storage. It does not expose a FreeStorageSpace metric. Storage grows automatically up to 128 TiB. You can track VolumeBytesUsed at the cluster level to see how much storage has been consumed.
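
The FreeableMemory threshold in this section is expressed as a percentage of instance RAM, while CloudWatch reports FreeableMemory in bytes, so an alarm needs an absolute byte value. The sketch below shows that conversion; the function names are hypothetical and the 16 GiB instance size is only an example.

```python
GIB = 1024 ** 3  # bytes per GiB


def freeable_memory_threshold_bytes(instance_ram_gib, min_free_fraction=0.20):
    """Byte value below which FreeableMemory should raise an alarm."""
    return int(instance_ram_gib * GIB * min_free_fraction)


def memory_is_low(freeable_bytes, instance_ram_gib, min_free_fraction=0.20):
    """True when the reported FreeableMemory is under the computed threshold."""
    threshold = freeable_memory_threshold_bytes(instance_ram_gib, min_free_fraction)
    return freeable_bytes < threshold


# Example: a hypothetical instance class with 16 GiB of RAM.
print(freeable_memory_threshold_bytes(16))
print(memory_is_low(2 * GIB, 16))
```

The computed value is what you would enter as the alarm threshold for that instance class; it needs to be recalculated whenever the instance class changes.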

4. Database Connection Metrics

Connection management is critical in Aurora. Each database connection consumes memory. Aurora has a configurable maximum connection limit, and exceeding it causes applications to receive “Too many connections” errors, which can cause a cascading failure across your application layer.

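For Aurora MySQL, the default max_connections value is derived from instance memory: GREATEST(log(DBInstanceClassMemory / 805306368) x 45, log(DBInstanceClassMemory / 8187281408) x 1000), where log is base 2. On larger instance classes the second term dominates. Aurora PostgreSQL uses a different formula. You can check or override the value via the instance’s RDS parameter group.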
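
To make the arithmetic concrete, here is a small sketch of the Aurora MySQL default as AWS documents it: the greater of two logarithmic terms, of which the 8187281408 term is the one that applies on larger instance classes. It assumes log means base 2 and that DBInstanceClassMemory is given in bytes (this value is somewhat lower than the nominal RAM of the instance class). The function names are hypothetical; treat the result as an estimate and confirm it with SELECT @@max_connections on the instance.

```python
import math


def aurora_mysql_default_max_connections(db_instance_class_memory_bytes):
    """Estimate the default max_connections for Aurora MySQL.

    Assumes base-2 logarithms and memory expressed in bytes.
    """
    m = db_instance_class_memory_bytes
    small_term = math.log2(m / 805306368) * 45
    large_term = math.log2(m / 8187281408) * 1000
    return int(max(small_term, large_term))


def connection_alarm_threshold(max_connections, fraction=0.80):
    """DatabaseConnections value at which to alarm (80% of the limit by default)."""
    return int(max_connections * fraction)


# Example input chosen so the second term evaluates to exactly 1000.
limit = aurora_mysql_default_max_connections(2 * 8187281408)
print(limit, connection_alarm_threshold(limit))
```

Deriving the alarm threshold from the limit, rather than hard-coding a connection count, keeps the alarm meaningful when the instance class or parameter group changes.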

| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Open connections | DatabaseConnections | Currently open client connections | >80% of max_connections |
| Login failures | LoginFailures | Failed connection attempts per second | >0 should be investigated |
| Active threads | Engine: Threads_running | Threads actively executing queries | Sudden spikes = concurrency issue |
| Connection errors | Engine: Connection_errors_internal | Errors caused by server-side issues | Any nonzero = investigate immediately |

To check the maximum and current connection counts directly on Aurora MySQL:

mysql> SELECT @@max_connections;
mysql> SHOW GLOBAL STATUS LIKE 'Threads_connected';

Connection_errors_internal is particularly important: it increments when errors originate from the server itself, such as an out-of-memory condition or thread creation failure. Any nonzero value deserves immediate attention.

5. Read Replica and Replication Metrics

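Aurora supports up to 15 Aurora Replicas per DB cluster. Replica lag is the delay between a write being committed on the primary and becoming readable on a replica. High replica lag causes stale reads on the replicas. Because Aurora Replicas in the same cluster share the cluster's storage volume with the writer, in-cluster lag mainly affects read freshness; for cross-Region global databases, replication lag also bounds how much recent data could be lost in an unplanned cross-Region failover.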

| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| Replica lag | AuroraReplicaLag | Lag in milliseconds behind primary | >1000ms (1 second) |
| Max replica lag | AuroraReplicaLagMaximum | Highest lag across all replicas | >1000ms |
| Min replica lag | AuroraReplicaLagMinimum | Lowest lag across all replicas | Useful for baseline tracking |

For Aurora global databases (cross-region replication), also monitor:

  • AuroraGlobalDBReplicationLag: Replication lag across AWS regions in milliseconds
  • AuroraGlobalDBRPOLag: Recovery Point Objective lag, indicating potential data loss window

6. Aurora Serverless v2 Specific Metrics

Aurora Serverless v2 automatically scales compute capacity in Aurora Capacity Units (ACUs). Each ACU provides approximately 2 GiB of memory along with corresponding CPU and networking. Unlike provisioned Aurora, you must additionally monitor ACU consumption to detect capacity ceiling pressure.

| Metric | CloudWatch Name | What It Measures | Alert Threshold |
|---|---|---|---|
| ACU utilization | ACUUtilization | Percentage of max ACU capacity in use | >80% (approaching max ACU limit) |
| CPU utilization | CPUUtilization | CPU usage within allocated ACUs | >80% for 5 minutes |
| Serverless capacity | ServerlessDatabaseCapacity | Current ACU capacity allocated | Monitor for unexpected scaling patterns |

If both ACUUtilization and CPUUtilization are near 100%, your cluster has hit its maximum ACU capacity and is under extreme load. Increase the maximum ACU limit in the cluster configuration or optimize your workload.
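
The relationships described in this section can be sketched in a few lines: approximate memory per ACU, utilization as current capacity divided by the configured maximum, and the "both near 100%" ceiling condition. The function names and the 95% cutoff for "near 100%" are assumptions made for illustration.

```python
def acu_memory_gib(acus, gib_per_acu=2.0):
    """Approximate memory available at a given ACU level (about 2 GiB per ACU)."""
    return acus * gib_per_acu


def acu_utilization_pct(current_acus, max_acus):
    """Current capacity as a percentage of the configured maximum ACUs."""
    if max_acus <= 0:
        raise ValueError("max_acus must be positive")
    return current_acus / max_acus * 100.0


def at_capacity_ceiling(acu_util_pct, cpu_util_pct, near_full_pct=95.0):
    """True when both ACUUtilization and CPUUtilization are close to 100%."""
    return acu_util_pct >= near_full_pct and cpu_util_pct >= near_full_pct


# Hypothetical cluster capped at 16 ACUs and currently running at 12.
print(acu_memory_gib(12), acu_utilization_pct(12, 16))
print(at_capacity_ceiling(100.0, 98.0))
```

High ACUUtilization with low CPUUtilization is a different situation from the ceiling case: the cluster is scaled up but not saturated, so raising the maximum is less urgent.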

AWS Aurora Monitoring Tools

AWS provides a layered set of native monitoring tools, each suited to a different level of observability. You do not need to use all of them; choose based on your operational requirements.

Amazon CloudWatch

CloudWatch is the foundational monitoring tool for Aurora. It collects instance-level metrics automatically at 1-minute intervals at no extra cost. Key capabilities include:

  • Metrics dashboards for visualizing Aurora performance over time
  • Alarms that trigger SNS notifications or Auto Scaling actions when thresholds are breached
  • CloudWatch Logs Insights for querying Aurora error logs, slow query logs, and audit logs
  • Metric Insights for aggregating metrics across large fleets of Aurora instances

Enhanced Monitoring

Enhanced Monitoring provides OS-level metrics at granularities from 1 second to 60 seconds. It is delivered via CloudWatch Logs rather than standard CloudWatch Metrics. To enable it, you must attach an IAM role to your Aurora instance with the AmazonRDSEnhancedMonitoringRole policy.

Enhanced Monitoring adds visibility into:

  • CPU steal and CPU wait times
  • Per-process CPU and memory usage (useful for identifying rogue database processes)
  • File system read/write activity at higher resolution than CloudWatch

Performance Insights

Performance Insights is an RDS feature that expands on standard monitoring to show database load in terms of active sessions and wait events. It visualizes the “DB Load” as the number of average active sessions (AAS), broken down by:

  • SQL statement (which queries are contributing to load)
  • Wait events (what resources sessions are waiting on)
  • Users and hosts (which application users are generating the most load)

Performance Insights is particularly useful for identifying query-level performance problems that do not show up in aggregate CloudWatch metrics. The free tier retains 7 days of Performance Insights data; a paid tier extends this to 2 years.

Amazon Managed Grafana

For teams that prefer Grafana dashboards, AWS offers Amazon Managed Grafana with built-in support for CloudWatch as a data source. You can use pre-built RDS and Aurora dashboards to visualize all CloudWatch metrics alongside metrics from other AWS services in a unified view.

Third-Party Monitoring Tools

Several third-party platforms extend Aurora monitoring beyond what CloudWatch provides natively:

  • Datadog: Offers a dedicated Aurora integration that combines CloudWatch metrics with database engine metrics, correlates them with APM traces, and provides pre-built dashboards. Source: https://www.datadoghq.com/blog/monitoring-amazon-aurora-performance-metrics/
  • SolarWinds Database Performance Analyzer: Provides query-level performance analysis and wait time breakdowns for Aurora MySQL and PostgreSQL
  • Site24x7: Supports Aurora cluster monitoring with custom dashboards, trend analysis, and alerting. Source: https://www.site24x7.com/
  • Sumo Logic: Offers log analytics and infrastructure monitoring for Aurora, combining metrics and log data in a single platform

How to Set Up CloudWatch Alarms for Aurora

CloudWatch alarms watch a single metric over a time window you define. When the metric crosses a threshold for a specified number of consecutive periods, the alarm changes state and can send a notification via Amazon SNS or trigger an Auto Scaling action.

Important: CloudWatch alarms invoke actions only on sustained state changes. An alarm does not repeat its action just because it remains in the ALARM state; the state must change and hold for the configured number of evaluation periods before an action is triggered.
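
The consecutive-period behavior can be illustrated with a simplified model. This sketch is not how CloudWatch is implemented (the real service also supports "M out of N" datapoints and missing-data handling); it only shows why a brief spike does not trigger an action while a sustained breach does. All names and sample values are hypothetical.

```python
def alarm_states(datapoints, threshold, evaluation_periods):
    """Return 'OK' or 'ALARM' per period.

    The state becomes ALARM only after the metric has exceeded the
    threshold for `evaluation_periods` consecutive datapoints.
    """
    states = []
    consecutive = 0
    for value in datapoints:
        consecutive = consecutive + 1 if value > threshold else 0
        states.append("ALARM" if consecutive >= evaluation_periods else "OK")
    return states


def action_indexes(states):
    """Indexes where the state changes into ALARM, i.e. where an action would run."""
    return [
        i for i, state in enumerate(states)
        if state == "ALARM" and (i == 0 or states[i - 1] != "ALARM")
    ]


# Hypothetical 1-minute CPUUtilization values with a short spike, then a sustained one.
cpu = [70, 85, 90, 75, 82, 88, 91, 95]
states = alarm_states(cpu, threshold=80, evaluation_periods=3)
print(states)
print(action_indexes(states))
```

In this model the two-minute spike at the start never reaches three consecutive breaches, so the only action is taken once, when the later sustained breach first pushes the state into ALARM.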

Step-by-Step: Creating a CloudWatch Alarm for Aurora

  1. Open the Amazon RDS console at https://console.aws.amazon.com/rds/ and choose Databases.
  2. Select your DB instance and navigate to Logs & events.
  3. In the CloudWatch alarms section, choose Create alarm.
  4. Configure a notification: enable “Send notifications” and specify an SNS topic or create a new one with email/SMS recipients.
  5. Select the metric, statistic (Average is typical), and alarm condition (greater than / less than / equal to threshold).
  6. Set the evaluation period (e.g., 5 consecutive 1-minute periods = 5 minutes of sustained threshold breach before alarm fires).
  7. Name the alarm clearly (e.g., “aurora-prod-cpu-high”) and choose Create Alarm.
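
The same alarm can also be defined programmatically. The sketch below only builds the parameter dictionary, so it runs without AWS credentials; the keys follow the CloudWatch PutMetricAlarm API, and with boto3 installed and configured the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. The instance identifier and SNS topic ARN are placeholders, and the helper function name is hypothetical.

```python
def build_cpu_alarm_params(db_instance_id, sns_topic_arn,
                           threshold_pct=80.0, minutes=5):
    """Build PutMetricAlarm parameters for a sustained high-CPU alarm."""
    return {
        "AlarmName": f"aurora-{db_instance_id}-cpu-high",
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id},
        ],
        "Statistic": "Average",
        "Period": 60,                  # one datapoint per minute
        "EvaluationPeriods": minutes,  # consecutive periods that must breach
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder identifiers; replace with your own instance and topic.
params = build_cpu_alarm_params(
    "prod-writer-1",
    "arn:aws:sns:us-east-1:123456789012:aurora-alerts",
)
print(params["AlarmName"], params["EvaluationPeriods"])
```

Keeping alarm definitions in code makes it easier to review thresholds and to apply the same settings to every instance in a cluster.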

Recommended CloudWatch Alarm Thresholds for Aurora

| Alarm | Metric | Condition | Evaluation Period | Action |
|---|---|---|---|---|
| High CPU | CPUUtilization | >80% | 5 consecutive minutes | SNS alert, investigate queries |
| Low memory | FreeableMemory | <500MB (or <20% RAM) | 5 consecutive minutes | SNS alert, plan instance upgrade |
| High connections | DatabaseConnections | >80% of max_connections | 3 consecutive minutes | SNS alert, check connection pooling |
| High replica lag | AuroraReplicaLag | >1000ms | 3 consecutive minutes | SNS alert, investigate replication |
| High disk queue | DiskQueueDepth | >10 | 5 consecutive minutes | SNS alert, check I/O patterns |
| Login failures | LoginFailures | >0 | 1 period | SNS alert, check credentials |
| High ACU (Serverless) | ACUUtilization | >80% | 5 consecutive minutes | SNS alert, increase max ACU |

Monitoring Aurora Logs

Metrics tell you what is happening in aggregate. Logs tell you why. Aurora supports several log types that can be published to Amazon CloudWatch Logs for real-time search and analysis.

Log Types Available in Aurora

  • Error log: Database startup, shutdown, and runtime errors
  • Slow query log: Queries exceeding the long_query_time threshold (MySQL) or log_min_duration_statement (PostgreSQL)
  • General log: All SQL statements executed (high volume; use selectively)
  • Audit log: Database activity including connections, queries, and table access (requires Advanced Auditing feature for Aurora MySQL)

Querying Aurora Logs with CloudWatch Logs Insights

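Once logs are published to CloudWatch Logs, you can query them interactively. Example: find the 10 slowest queries in the selected time range (set the range to the last 24 hours in the console). The query assumes the standard MySQL slow query log format, in which each entry carries a Query_time field, and is run against the cluster's slow query log group:

fields @timestamp, @message
| filter @message like /Query_time/
| parse @message /Query_time:\s*(?<query_time>[0-9.]+)/
| sort query_time desc
| limit 10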
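
To rank entries by duration outside the console, for example on exported log events, a short script can parse the Query_time field and sort on it. This is a minimal sketch that assumes the standard MySQL slow query log header format; the function name and the sample messages are made up for illustration.

```python
import re

# Matches the duration in a MySQL slow query log header line.
QUERY_TIME = re.compile(r"Query_time:\s*([0-9.]+)")


def slowest_entries(messages, limit=10):
    """Return (seconds, message) pairs for the slowest entries, slowest first."""
    timed = []
    for message in messages:
        match = QUERY_TIME.search(message)
        if match:  # skip lines that carry no Query_time field
            timed.append((float(match.group(1)), message))
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return timed[:limit]


# Hypothetical log events in slow query log format.
events = [
    "# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 50000\nSELECT * FROM orders;",
    "# Query_time: 7.250000  Lock_time: 0.000200 Rows_sent: 10  Rows_examined: 900000\nSELECT * FROM events;",
    "# User@Host: app[app] @  [10.0.0.5]",
]
top = slowest_entries(events, limit=2)
for seconds, message in top:
    print(seconds, message.splitlines()[-1])
```

Converting the captured value to a float before sorting ensures the ordering is numeric rather than lexical.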

AWS Aurora Monitoring Best Practices

Applying the right strategy matters as much as choosing the right tools. Follow these best practices to get the most out of your Aurora monitoring setup.

  • Establish a baseline first: Run your workload under typical load and record average, minimum, and maximum values for all key metrics at multiple time intervals. Use this data to set meaningful alert thresholds rather than arbitrary ones.
  • Monitor at both the cluster and instance level: Cluster-level metrics like VolumeBytesUsed and replication lag give an overall picture, while instance-level metrics (CPU, memory, connections) are specific to each writer or reader instance.
  • Enable Enhanced Monitoring for production clusters: The 1-second granularity gives you faster detection of transient CPU spikes and memory pressure events that 1-minute CloudWatch averages can miss.
  • Use Performance Insights for query-level root cause analysis: When aggregate metrics are elevated, Performance Insights helps you identify which specific SQL statements are driving the load and what they are waiting on.
  • Alert on percentage, not absolute values: A DatabaseConnections alarm at 1000 connections means nothing without knowing your max_connections limit. Alert at >80% of the configured max instead.
  • Correlate metrics across layers: A latency spike in your application may correspond to a DiskQueueDepth spike in Aurora. Correlating metrics across layers speeds up root-cause identification significantly.
  • Publish logs to CloudWatch Logs: Enable slow query logs, error logs, and audit logs and send them to CloudWatch Logs so you can use Logs Insights for interactive analysis without logging into the database directly.
  • Test your alerts regularly: Periodically verify that your CloudWatch alarms are configured correctly and that your SNS notifications reach the right people. Stale or misconfigured alarms are worse than no alarms.
AWS Aurora Observability
Monitor Smarter with CubeAPM
Tired of piecing together CloudWatch dashboards manually? CubeAPM gives you a unified observability platform with out-of-the-box AWS Aurora monitoring, pre-built dashboards, intelligent alerting, and correlated traces, metrics, and logs—all in one place.
With CubeAPM, you can:
  • ✓ Set up Aurora monitoring in minutes with no manual metric wiring required
  • ✓ Get smart alerts on CPU spikes, replica lag, connection storms, and more
  • ✓ Correlate slow queries with infrastructure metrics for faster root-cause analysis
  • ✓ Scale from a single cluster to thousands of instances with no extra effort
Start your free trial today and get full visibility into your Aurora clusters.

Conclusion

Effective AWS Aurora monitoring requires visibility across multiple layers: query throughput and latency, resource utilization, connection health, replication lag, and, for Serverless v2, ACU consumption. Amazon CloudWatch provides the foundation, and tools like Enhanced Monitoring, Performance Insights, and CloudWatch Logs Insights give you deeper visibility when you need it.

The most important first step is not setting up dashboards; it is establishing a performance baseline. Once you know what normal looks like for your specific workload, you can configure alert thresholds with confidence and avoid both missed alerts and alert fatigue.

Whether you use native AWS tools, a third-party platform like Datadog or Grafana, or a purpose-built observability solution like CubeAPM, the goal is the same: detect issues early, resolve them quickly, and keep your applications running smoothly.

Disclaimer: The information in this article is provided for general educational purposes only. Metric thresholds and best practices may vary depending on your specific workload, instance class, Aurora engine version, and business requirements. Always refer to the official AWS Aurora documentation and test configurations in a non-production environment before applying them to production systems. CubeAPM product details are based on publicly available information at the time of publication.

FAQs

1. What is the difference between Aurora metrics and RDS metrics in CloudWatch?

Amazon Aurora publishes both standard RDS metrics (available for all RDS engines) and Aurora-specific metrics to CloudWatch in the AWS/RDS namespace. Aurora-specific metrics include SelectLatency, DMLLatency, CommitLatency, CommitThroughput, AuroraReplicaLag, and several others that are not available for RDS for MySQL, RDS for PostgreSQL, or other RDS engines. Standard RDS metrics like CPUUtilization, FreeableMemory, DatabaseConnections, and ReadIOPS are available for all engines.

2. How do I enable slow query logging in Amazon Aurora MySQL?

To enable slow query logs in Aurora MySQL:

  1. Navigate to your DB cluster’s parameter group in the RDS console.
  2. Set slow_query_log to 1 and configure long_query_time to your desired threshold in seconds (e.g., 1 for queries slower than 1 second).
  3. Set log_output to FILE to write to the database log, or to TABLE to write to the mysql.slow_log table.
  4. Optionally, enable publishing to CloudWatch Logs by modifying the DB cluster and enabling the slow query log under Additional configuration.

Queries that exceed long_query_time will then appear in the slow query log.

3. What are the most important CloudWatch alarms to set for Aurora?

The highest-priority alarms to configure for a production Aurora cluster are: CPUUtilization (>80%), FreeableMemory (below a threshold appropriate for your instance size), DatabaseConnections (>80% of max_connections), AuroraReplicaLag (>1000ms), and DiskQueueDepth (>10 sustained). For Aurora Serverless v2, also add ACUUtilization (>80%). These five to six alarms cover the most common failure modes.

4. What is ACUUtilization and why does it matter for Aurora Serverless v2?

ACUUtilization measures the percentage of the maximum Aurora Capacity Unit (ACU) limit that is currently being used. Each ACU provides approximately 2 GiB of memory with corresponding CPU. If ACUUtilization and CPUUtilization are both near 100%, your cluster has reached its configured maximum capacity and cannot scale further. This can cause queries to queue and latency to spike. Increasing the maximum ACU limit in your cluster configuration resolves this; alternatively, you can optimize the workload to reduce resource consumption.

5. Does Aurora automatically scale storage? Do I need to monitor free storage space?

Yes, Aurora storage scales automatically in 10 GB increments, up to a maximum of 128 TiB. Because of this, Aurora does not expose a FreeStorageSpace metric. Instead, you can monitor VolumeBytesUsed at the cluster level to track how much storage has been consumed. There is no need to set a free storage alarm as you would for standard RDS engines, but tracking VolumeBytesUsed helps you understand growth trends and project costs.
