Amazon Redshift is a fully managed cloud data warehouse that uses columnar storage and massively parallel processing (MPP) to run analytic queries across petabytes of data. It distributes work across a leader node and multiple compute nodes, each processing data in parallel slices.
That distributed architecture is what makes Redshift fast. It is also what makes performance problems hard to diagnose. A slow report could come from a bad query plan, a full WLM slot queue, table skew, or a silent disk spill that nobody noticed.
This guide covers every monitoring layer available in AWS Redshift, from CloudWatch infrastructure metrics to raw system table queries. You will know exactly what to watch, what queries to run, and when to use a dedicated tool.
💡 Key takeaways
- Redshift monitoring has four main layers: CloudWatch, Redshift Console, system tables/SYS views, and audit logging.
- CloudWatch shows cluster health, while system tables and SYS views help identify the exact query, lock, WLM queue, or spill behind a slowdown.
STV_INFLIGHT,STV_RECENTS, andSTV_WLM_QUERY_STATEare useful first checks for live query issues.- Enhanced query monitoring, announced in January 2025, works for both provisioned clusters and Redshift Serverless.
WLMQueueWaitTimeis an early signal that query concurrency or workload pressure is becoming a problem.- Run
ANALYZEafter large data changes and useVACUUMwhen tables become heavily unsorted.
What Is AWS Redshift and Why Does Monitoring Matter?
Redshift stores data in a columnar format and spreads it across compute node slices for parallel processing. The leader node receives queries, builds execution plans, and distributes steps across compute nodes before reassembling results.
Each stage of that pipeline can become a bottleneck. A query waiting for a WLM slot never touches compute. A query with stale statistics builds a bad execution plan before it even starts. Without monitoring you will only know something is wrong when a user complains.
AWS recommends monitoring three areas continuously: cluster performance, query and load performance, and configuration events. The sections below cover each one in detail.
Layer 1: Infrastructure Monitoring with Amazon CloudWatch
CloudWatch is the first tool to set up. Redshift publishes cluster-level and node-level metrics to CloudWatch automatically, with most metrics collected at one-minute intervals unless stated otherwise.
How to Access CloudWatch Metrics for Redshift
Step 1: Open CloudWatch in the AWS Console
Console: Go to AWS Management Console > CloudWatch > Metrics > AWS/Redshift. Select your ClusterIdentifier to filter.
AWS CLI: List available metrics with:
aws cloudwatch list-metrics --namespace AWS/RedshiftStep 2: Pull CPU utilization for the last 24 hours
Console: In CloudWatch Metrics, select CPUUtilization, set the period to 1 hour, and add it to a dashboard.
AWS CLI: Run:
aws cloudwatch get-metric-statistics \
--namespace AWS/Redshift \
--metric-name CPUUtilization \
--start-time 2025-01-01T00:00:00 \
--end-time 2025-01-02T00:00:00 \
--period 3600 --statistics Average \
--dimensions Name=ClusterIdentifier,Value=your-cluster-idStep 3: Create an alarm for a critical threshold
Console: In CloudWatch, click Alarms > Create Alarm, choose your metric, set the threshold, and configure an SNS topic for notifications.
AWS CLI: Example alarm for CPU > 80%:
aws cloudwatch put-metric-alarm \
--alarm-name redshift-cpu-high \
--metric-name CPUUtilization \
--namespace AWS/Redshift \
--statistic Average --period 300 \
--threshold 80 --comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:your-topic \
--dimensions Name=ClusterIdentifier,Value=your-cluster-idCritical CloudWatch Metrics Reference
| Metric | Measures | Alert Threshold |
| CPUUtilization | % CPU across nodes | > 80% for 5+ min |
| PercentageDiskSpaceUsed | Cluster storage used | > 70% |
| QueryDuration | Avg query exec time | Spike vs rolling baseline |
| WLMQueueWaitTime | Avg WLM queue wait | Persistently > 0 |
| WLMQueriesQueued | Queries waiting for slot | Any non-zero value |
| QueriesCompletedPerSecond | Query throughput | Drop vs baseline |
| ReadIOPS / WriteIOPS | Disk ops per second | Correlated spike with slow queries |
Note: CloudWatch shows cluster health only. It cannot tell you which query caused the CPU spike. For that, use the layers below.
Layer 2: Query Performance Monitoring via the Redshift Console
The Redshift Console provides real-time and historical query visibility that CloudWatch cannot. You can see running queries, completed queries, execution plans, and WLM states without writing any SQL.
How to Access Query Monitoring in the Console
Step 1: Navigate to Query Monitoring
Console: Open AWS Console > Amazon Redshift > select your cluster > Query monitoring.
You will see tabs for Queries, Loads, and WLM. Switch between them to view different activity types.
Step 2: View running and completed queries
Console: Under the Queries tab, filter by Status: Running to see active queries. Sort by Elapsed time descending to find the slowest ones.
Click any query row to open its detail view with execution time, WLM queue, and query text.
Step 3: Open the Query Profiler for a slow query
Console: From the query detail view, click Query profiler. This opens the visual execution plan showing rows estimated vs actual at each step, plus time spent at each node.
A large gap between estimated and actual rows signals stale statistics. Run ANALYZE on the affected tables.
Step 4: Check the Workload Execution Breakdown chart
Console: Available for multi-node clusters. Go to Cluster > Monitoring > Workload execution breakdown. The chart shows how time is split across queue wait, planning, reading, sorting, and returning data.
If queue wait is the dominant segment, your WLM configuration needs adjustment. See Layer 3 for how to dig deeper.
Enhanced query monitoring, launched in January 2025, adds performance history for trend analysis, workload change detection, and drill-down from aggregate trends to individual query plans. It works on both provisioned clusters and Redshift Serverless.
Administrators with the SYS:MONITOR role can view queries for all users. Regular users see only their own. Grant it with:
GRANT ROLE SYS:MONITOR TO <username>;Layer 3: Deep Monitoring with System Tables
System tables give you direct access to Redshift query execution data. External tools can surface some of this data, but system tables and SYS views remain the source of truth for deep query diagnostics.
System Tables Quick Reference
| Table | Type | Use It To |
| STV_INFLIGHT | Live | See queries running right now |
| STV_RECENTS | Live + Recent | Find waiting or recently finished queries |
| STV_WLM_QUERY_STATE | Live | Check WLM slot state per query |
| STV_EXEC_STATE | Live | Track row progress on a running query |
| SVV_TRANSACTIONS | Live | Identify lock owners and blockers |
| STL_QUERY | Historical | Get duration of finished queries |
| SVL_QUERY_SUMMARY | Historical | Step-level stats across all slices |
| SVL_QUERY_METRICS_SUMMARY | Historical | CPU, spill, rows scanned for a query |
| STL_WLM_QUERY | Historical | WLM queue time vs execution time |
Step-by-Step: Diagnose a Live Slow Query
Step 1: Check what is running
Run this to see all active queries ordered by start time:
SELECT query, pid, starttime, text FROM stv_inflight ORDER BY starttime;Note the query ID of any query running longer than expected.
Step 2: Check if the query is waiting, not running
A query in STV_RECENTS but not in STV_INFLIGHT is waiting for a WLM slot or a table lock:
SELECT pid, starttime, duration, trim(query) AS query_text
FROM stv_recents
WHERE status <> 'Done'
AND pid NOT IN (SELECT pid FROM stv_inflight)
ORDER BY duration DESC;Step 3: Check its WLM state
Use the query ID from Step 1:
SELECT * FROM stv_wlm_query_state WHERE query = <query_id>;If state is QueuedWaiting, all WLM slots in that service class are occupied. Review concurrency settings.
Step 4: Check if the query is progressing
Run this query twice, 60 seconds apart. If the SUM(rows) value does not change, the query is hung:
SELECT SUM(rows) FROM stv_exec_state WHERE query = <query_id>;A static count means a table lock is likely blocking progress. Proceed to Step 5.
Step 5: Detect and resolve lock contention
This query identifies which session holds a lock and is blocking others:
SELECT a.txn_owner AS user_name, a.pid AS session_id,
nvl(trim(c.relname),'') AS table_name,
a.granted, b.pid AS blocking_session_id,
datediff(s, a.txn_start, getdate()) AS lock_age_seconds
FROM svv_transactions a
LEFT JOIN (SELECT pid, relation, granted FROM pg_locks
GROUP BY pid, relation, granted) b
ON a.relation = b.relation AND a.granted = 'f' AND b.granted = 't'
LEFT JOIN pg_class c ON a.relation = c.oid
WHERE a.relation IS NOT NULL ORDER BY a.txn_start;If granted = false and a blocking_session_id appears, that session is holding the lock. Terminate it with:
SELECT pg_terminate_backend(<blocking_pid>);Step 6: Review post-query metrics for a completed query
Once a query finishes, pull its resource usage:
SELECT query, query_cpu_usage_percent,
query_temp_blocks_to_disk AS spill_blocks,
scan_row_count, return_row_count
FROM svl_query_metrics_summary
WHERE query = <query_id>;Any spill_blocks value above 0 means the query wrote temporary data to disk, which can slow performance.
Weekly WLM Trend Report
Run this weekly to catch regression before users notice:
SELECT TRUNC(w.exec_start_time) AS exec_day,
COUNT(*) AS queries_count,
AVG(w.total_queue_time/1000000) AS avg_queue_sec,
AVG(w.total_exec_time/1000000) AS avg_exec_sec,
AVG(m.query_cpu_usage_percent) AS avg_cpu_pct,
SUM(m.query_temp_blocks_to_disk) AS total_spill_blocks
FROM stl_wlm_query AS w
LEFT JOIN svl_query_metrics_summary AS m
USING (userid, service_class, query)
WHERE service_class > 5
AND w.exec_start_time > (sysdate - 7)
GROUP BY 1 ORDER BY 1;Watch total_spill_blocks across days. A rising trend means queries are outgrowing their WLM memory allocation.
Layer 4: Auditing and Configuration Monitoring
AWS CloudTrail
CloudTrail records API activity against your Redshift cluster. It is useful for security reviews, audit trails, and compliance evidence.
Step 1: Enable CloudTrail for Redshift
Console: Go to CloudTrail > Trails > Create Trail. Choose All Regions, enable for management events, and set an S3 bucket destination.
AWS CLI: Create a trail via CLI:
aws cloudtrail create-trail --name redshift-audit --s3-bucket-name your-audit-bucketCommon API events to alert on: CreateCluster, DeleteCluster, ModifyClusterParameterGroup, AuthorizeSnapshotAccess, CreateClusterSecurityGroup.
Database Audit Logging
Step 1: Enable audit logging
Console: In the Redshift Console, select your cluster > Properties > scroll to Database audit logging > Edit. Choose an S3 bucket and enable connection and user activity logging.
AWS CLI: Enable via CLI:
aws redshift enable-logging \
--cluster-identifier your-cluster \
--bucket-name your-audit-bucket \
--s3-key-prefix redshift-logs/Audit logs capture connections, authentication failures, and queries. Query them with Amazon Athena once in S3, or load them into your SIEM.
AWS Config for Compliance Drift
AWS Config continuously evaluates cluster configuration against rules you define. Unlike CloudTrail, which records what changed, Config tells you whether the cluster is currently compliant. Pre-built rules cover encryption at rest, public accessibility, and parameter group settings.
Understanding WLM and Query Queues
Workload Management (WLM) controls how queries share cluster resources. Each queue (service class) has a concurrency level and a memory allocation. When all slots in a queue are full, new queries wait.
How to Configure WLM
Step 1: Open WLM configuration
Console: In the Redshift Console, go to Clusters > your cluster > Properties > Workload management. Click Edit to modify queues.
AWS CLI: Describe current parameter group:
aws redshift describe-cluster-parameter-groups --parameter-group-name your-groupStep 2: Switch to automatic WLM (recommended)
Console: In WLM configuration, select Auto WLM. Redshift will dynamically allocate memory and concurrency based on query workload.
With Auto WLM enabled, you do not need to set manual slot counts. Redshift manages concurrency dynamically.
Step 3: Check queue depth in real time
Count running vs queued queries per service class:
SELECT service_class, state, COUNT(query) AS query_count
FROM stv_wlm_query_state
WHERE service_class >= 6
GROUP BY 1, 2 ORDER BY 1, 2;Service classes 1 to 5 are internal. User queues start at 6. Persistent QueuedWaiting counts indicate saturation.
Redshift supports up to eight user-defined WLM queues plus a default queue. Separate heavy analytical workloads from light dashboard queries to keep both responsive.
Performance Best Practices
1. Set CloudWatch Alarms Before Problems Happen
Do not wait for user complaints. Set alarms on these four metrics as a minimum:
- CPUUtilization > 80% sustained for 5 minutes
- PercentageDiskSpaceUsed > 70%
- WLMQueueWaitTime > 30 seconds average
- QueryDuration rising more than 2x over the rolling 7-day baseline
2. Run VACUUM and ANALYZE on a Schedule
Find tables that need attention first:
SELECT schema_name, table_name, unsorted_pct, stats_off
FROM svv_table_info
WHERE unsorted_pct > 5 OR stats_off > 10
ORDER BY unsorted_pct DESC;Run VACUUM SORT ONLY on tables with unsorted_pct > 5. Run ANALYZE after any bulk load that changes more than 5% of a table’s rows. Schedule both during off-peak hours.
3. Detect and Fix Data Skew
Data skew means some node slices process far more rows than others, which breaks parallelism. Check for it with:
SELECT TRIM(name) AS table_name,
MAX(num_values) AS max_rows,
MIN(num_values) AS min_rows,
(MAX(num_values) - MIN(num_values)) * 100.0
/ NULLIF(MAX(num_values),0) AS skew_pct
FROM svv_diskusage
WHERE name NOT LIKE 'Internal%'
GROUP BY 1 ORDER BY skew_pct DESC LIMIT 20;Tables with skew_pct above 30 should have their distribution key reviewed.
4. Use Redshift Advisor
Redshift Advisor analyzes your workload and produces recommendations for distribution keys, sort keys, and tables needing VACUUM.
Step 1: Open Advisor
Console: In the Redshift Console, select your cluster > Advisor. Recommendations are ranked by estimated performance impact.
AWS CLI: Describe recommendations via CLI:
aws redshift list-recommendationsThird-Party Monitoring Tools
Native AWS tools require significant SQL knowledge and manual setup. These tools provide dashboards, anomaly detection, and alerting with less effort.
| Tool | Strength | Best For |
| CubeAPM | Query tracing, WLM visibility, anomaly alerts | APM-grade Redshift observability |
| Datadog | AI anomaly detection, multi-cloud dashboards | Multi-cloud enterprise environments |
| Dynatrace | Full-stack observability, AI root cause analysis | Large enterprises with complex infrastructure |
| New Relic | Query-level performance tracking, alerting | Dev and ops teams wanting a unified platform |
| Sumo Logic | Log analytics, pre-built Redshift dashboards | Security-focused teams |
| ManageEngine | Unified infra and database monitoring | IT ops in mixed environments |
| Amazon CloudWatch | Native metrics, alarms, dashboards | Infrastructure baseline monitoring |
Troubleshooting Checklist: When a Query Is Slow
Work through these steps in order when a report or query is slow:
- Is it running? Query STV_INFLIGHT. If it appears, note the query ID and starttime.
- Is it waiting? Check STV_RECENTS for queries not in STV_INFLIGHT. Waiting queries are blocked on a WLM slot or a table lock.
- What is the WLM state? Query STV_WLM_QUERY_STATE. QueuedWaiting means your queue concurrency limit is hit.
- Is it progressing? Run STV_EXEC_STATE twice, 60 seconds apart. Static row count means it is hung on a lock.
- Is there a lock? Run the SVV_TRANSACTIONS lock query. Terminate the blocking session if needed.
- Is the cluster under pressure? Check CloudWatch CPUUtilization and PercentageDiskSpaceUsed.
- Is it spilling to disk? Query SVL_QUERY_METRICS_SUMMARY for query_temp_blocks_to_disk. Any value above 0 is degrading performance.
- Is the execution plan bad? Open the Query Profiler in the console. Look for rows estimated vs actual mismatches and redistribution operations.
Tired of Writing SQL Just to Debug a Slow Report?
Most engineers spend more time diagnosing Redshift issues than fixing them. CubeAPM changes that by turning raw query telemetry into clear, actionable signals — without any SQL.
What you get out of the box:
- ✔ Query-level latency breakdown with execution plan context
- ✔ WLM queue depth and slot wait time, live and historical
- ✔ Automatic alerts when disk spill or CPU thresholds are breached
- ✔ Works with both Redshift Serverless and provisioned clusters
Conclusion
Effective AWS Redshift monitoring is not a single tool. It is a stack: CloudWatch for cluster-level signals, the Redshift Console and Query Profiler for query-level visibility, system tables for deep diagnostic queries, and CloudTrail with audit logging for security and compliance.
Most performance problems fall into a small set of categories: WLM queue saturation, disk spill from insufficient memory, lock contention, and stale statistics causing bad execution plans. The step-by-step queries in this guide cover all of them.
Start with the Key Takeaways at the top of this article. Set CloudWatch alarms today for the four critical metrics. Run VACUUM and ANALYZE on a schedule. And when something does go wrong, work through the eight-step troubleshooting checklist before anything else.
Disclaimer: This article reflects AWS Redshift features and behavior as of May 2025. AWS updates its services regularly, so check the official documentation before applying any steps to a live cluster.
FAQs
1. What is the difference between CloudWatch and Redshift system tables for monitoring?
CloudWatch shows cluster-level health, such as CPU, disk usage, network throughput, and query-related metrics. It helps you see when the cluster is under pressure.
Redshift system tables and SYS views show query-level details, such as WLM state, lock contention, disk spill, execution steps, and query history. CloudWatch tells you something is wrong; system tables help you find the exact query behind it.
2. How do I find slow queries in Amazon Redshift?
For live slow queries, check STV_INFLIGHT. For waiting or recent queries, use STV_RECENTS. For completed queries, use STL_QUERY or the Redshift Console Query and Database Monitoring page.
The Query Profiler adds a visual execution plan for troubleshooting. Enhanced query monitoring, announced in January 2025, adds query history, trend analysis, and deeper drill-downs.
3. What causes WLM queue wait time to spike?
WLM queue wait time rises when all available slots in a WLM service class are busy. New queries must wait until a slot becomes free.
Common causes include peak workload, long-running queries, too few slots, disk spill extending query time, or heavy and light queries sharing the same queue.
4. How do I check if a Redshift query is stuck or hung?
First, confirm the query appears in STV_INFLIGHT. Then run:
SELECT SUM(rows)
FROM stv_exec_state
WHERE query = <query_id>;
Run it twice, about 60 seconds apart. If the row count does not change, the query may not be progressing. Check SVV_TRANSACTIONS for blocking locks before terminating any session.
5. What is disk spill and how do I prevent it?
Disk spill happens when a query needs more memory than its WLM allocation and writes temporary data to disk. This can slow the query because disk I/O is slower than in-memory processing.
Detect it with query_temp_blocks_to_disk in SVL_QUERY_METRICS_SUMMARY. To reduce spill, review WLM memory, tune joins and sort keys, avoid unnecessary SELECT * queries, and keep table statistics updated.





