How to Monitor AWS Redshift Query Performance?

Amazon Redshift is a fully managed cloud data warehouse that uses columnar storage and massively parallel processing (MPP) to run analytic queries across petabytes of data. It distributes work across a leader node and multiple compute nodes, each processing data in parallel slices.

That distributed architecture is what makes Redshift fast. It is also what makes performance problems hard to diagnose. A slow report could come from a bad query plan, a full WLM slot queue, table skew, or a silent disk spill that nobody noticed.

This guide covers every monitoring layer available in AWS Redshift, from CloudWatch infrastructure metrics to raw system table queries. You will know exactly what to watch, what queries to run, and when to use a dedicated tool.

💡 Key takeaways

Redshift monitoring has four main layers: CloudWatch, Redshift Console, system tables/SYS views, and audit logging.
CloudWatch shows cluster health, while system tables and SYS views help identify the exact query, lock, WLM queue, or spill behind a slowdown.
STV_INFLIGHT, STV_RECENTS, and STV_WLM_QUERY_STATE are useful first checks for live query issues.
Enhanced query monitoring, announced in January 2025, works for both provisioned clusters and Redshift Serverless.
WLMQueueWaitTime is an early signal that query concurrency or workload pressure is becoming a problem.
Run ANALYZE after large data changes and use VACUUM when tables become heavily unsorted.

What Is AWS Redshift and Why Does Monitoring Matter?

Redshift stores data in a columnar format and spreads it across compute node slices for parallel processing. The leader node receives queries, builds execution plans, and distributes steps across compute nodes before reassembling results.

Each stage of that pipeline can become a bottleneck. A query waiting for a WLM slot never touches compute. A query with stale statistics builds a bad execution plan before it even starts. Without monitoring you will only know something is wrong when a user complains.

AWS recommends monitoring three areas continuously: cluster performance, query and load performance, and configuration events. The sections below cover each one in detail.

Layer 1: Infrastructure Monitoring with Amazon CloudWatch

CloudWatch is the first tool to set up. Redshift publishes cluster-level and node-level metrics to CloudWatch automatically, with most metrics collected at one-minute intervals unless stated otherwise.

How to Access CloudWatch Metrics for Redshift

Step 1: Open CloudWatch in the AWS Console

Console: Go to AWS Management Console > CloudWatch > Metrics > AWS/Redshift. Select your ClusterIdentifier to filter.

AWS CLI: List available metrics with:

aws cloudwatch list-metrics --namespace AWS/Redshift

aws cloudwatch list-metrics --namespace AWS/Redshift

Step 2: Pull CPU utilization for the last 24 hours

Console: In CloudWatch Metrics, select CPUUtilization, set the period to 1 hour, and add it to a dashboard.

AWS CLI: Run:

aws cloudwatch get-metric-statistics \

  --namespace AWS/Redshift \

  --metric-name CPUUtilization \

  --start-time 2025-01-01T00:00:00 \

  --end-time 2025-01-02T00:00:00 \

  --period 3600 --statistics Average \

  --dimensions Name=ClusterIdentifier,Value=your-cluster-id

aws cloudwatch get-metric-statistics \

  --namespace AWS/Redshift \

  --metric-name CPUUtilization \

  --start-time 2025-01-01T00:00:00 \

  --end-time 2025-01-02T00:00:00 \

  --period 3600 --statistics Average \

  --dimensions Name=ClusterIdentifier,Value=your-cluster-id

Step 3: Create an alarm for a critical threshold

Console: In CloudWatch, click Alarms > Create Alarm, choose your metric, set the threshold, and configure an SNS topic for notifications.

AWS CLI: Example alarm for CPU > 80%:

aws cloudwatch put-metric-alarm \

  --alarm-name redshift-cpu-high \

  --metric-name CPUUtilization \

  --namespace AWS/Redshift \

  --statistic Average --period 300 \

  --threshold 80 --comparison-operator GreaterThanThreshold \

  --evaluation-periods 2 \

  --alarm-actions arn:aws:sns:us-east-1:123456789:your-topic \

  --dimensions Name=ClusterIdentifier,Value=your-cluster-id

aws cloudwatch put-metric-alarm \

  --alarm-name redshift-cpu-high \

  --metric-name CPUUtilization \

  --namespace AWS/Redshift \

  --statistic Average --period 300 \

  --threshold 80 --comparison-operator GreaterThanThreshold \

  --evaluation-periods 2 \

  --alarm-actions arn:aws:sns:us-east-1:123456789:your-topic \

  --dimensions Name=ClusterIdentifier,Value=your-cluster-id

Critical CloudWatch Metrics Reference

Metric	Measures	Alert Threshold
CPUUtilization	% CPU across nodes	> 80% for 5+ min
PercentageDiskSpaceUsed	Cluster storage used	> 70%
QueryDuration	Avg query exec time	Spike vs rolling baseline
WLMQueueWaitTime	Avg WLM queue wait	Persistently > 0
WLMQueriesQueued	Queries waiting for slot	Any non-zero value
QueriesCompletedPerSecond	Query throughput	Drop vs baseline
ReadIOPS / WriteIOPS	Disk ops per second	Correlated spike with slow queries

Note: CloudWatch shows cluster health only. It cannot tell you which query caused the CPU spike. For that, use the layers below.

Layer 2: Query Performance Monitoring via the Redshift Console

The Redshift Console provides real-time and historical query visibility that CloudWatch cannot. You can see running queries, completed queries, execution plans, and WLM states without writing any SQL.

How to Access Query Monitoring in the Console

Step 1: Navigate to Query Monitoring

Console: Open AWS Console > Amazon Redshift > select your cluster > Query monitoring.

You will see tabs for Queries, Loads, and WLM. Switch between them to view different activity types.

Step 2: View running and completed queries

Console: Under the Queries tab, filter by Status: Running to see active queries. Sort by Elapsed time descending to find the slowest ones.

Click any query row to open its detail view with execution time, WLM queue, and query text.

Step 3: Open the Query Profiler for a slow query

Console: From the query detail view, click Query profiler. This opens the visual execution plan showing rows estimated vs actual at each step, plus time spent at each node.

A large gap between estimated and actual rows signals stale statistics. Run ANALYZE on the affected tables.

Step 4: Check the Workload Execution Breakdown chart

Console: Available for multi-node clusters. Go to Cluster > Monitoring > Workload execution breakdown. The chart shows how time is split across queue wait, planning, reading, sorting, and returning data.

If queue wait is the dominant segment, your WLM configuration needs adjustment. See Layer 3 for how to dig deeper.

Enhanced query monitoring, launched in January 2025, adds performance history for trend analysis, workload change detection, and drill-down from aggregate trends to individual query plans. It works on both provisioned clusters and Redshift Serverless.

Administrators with the SYS:MONITOR role can view queries for all users. Regular users see only their own. Grant it with:

GRANT ROLE SYS:MONITOR TO <username>;

GRANT ROLE SYS:MONITOR TO <username>;

Layer 3: Deep Monitoring with System Tables

System tables give you direct access to Redshift query execution data. External tools can surface some of this data, but system tables and SYS views remain the source of truth for deep query diagnostics.

System Tables Quick Reference

Table	Type	Use It To
STV_INFLIGHT	Live	See queries running right now
STV_RECENTS	Live + Recent	Find waiting or recently finished queries
STV_WLM_QUERY_STATE	Live	Check WLM slot state per query
STV_EXEC_STATE	Live	Track row progress on a running query
SVV_TRANSACTIONS	Live	Identify lock owners and blockers
STL_QUERY	Historical	Get duration of finished queries
SVL_QUERY_SUMMARY	Historical	Step-level stats across all slices
SVL_QUERY_METRICS_SUMMARY	Historical	CPU, spill, rows scanned for a query
STL_WLM_QUERY	Historical	WLM queue time vs execution time

Step-by-Step: Diagnose a Live Slow Query

Step 1: Check what is running

Run this to see all active queries ordered by start time:

SELECT query, pid, starttime, text FROM stv_inflight ORDER BY starttime;

SELECT query, pid, starttime, text FROM stv_inflight ORDER BY starttime;

Note the query ID of any query running longer than expected.

Step 2: Check if the query is waiting, not running

A query in STV_RECENTS but not in STV_INFLIGHT is waiting for a WLM slot or a table lock:

SELECT pid, starttime, duration, trim(query) AS query_text

FROM stv_recents

WHERE status <> 'Done'

AND pid NOT IN (SELECT pid FROM stv_inflight)

ORDER BY duration DESC;

SELECT pid, starttime, duration, trim(query) AS query_text

FROM stv_recents

WHERE status <> 'Done'

AND pid NOT IN (SELECT pid FROM stv_inflight)

ORDER BY duration DESC;

Step 3: Check its WLM state

Use the query ID from Step 1:

SELECT * FROM stv_wlm_query_state WHERE query = <query_id>;

SELECT * FROM stv_wlm_query_state WHERE query = <query_id>;

If state is QueuedWaiting, all WLM slots in that service class are occupied. Review concurrency settings.

Step 4: Check if the query is progressing

Run this query twice, 60 seconds apart. If the SUM(rows) value does not change, the query is hung:

SELECT SUM(rows) FROM stv_exec_state WHERE query = <query_id>;

SELECT SUM(rows) FROM stv_exec_state WHERE query = <query_id>;

A static count means a table lock is likely blocking progress. Proceed to Step 5.

Step 5: Detect and resolve lock contention

This query identifies which session holds a lock and is blocking others:

SELECT a.txn_owner AS user_name, a.pid AS session_id,

       nvl(trim(c.relname),'') AS table_name,

       a.granted, b.pid AS blocking_session_id,

       datediff(s, a.txn_start, getdate()) AS lock_age_seconds

FROM svv_transactions a

LEFT JOIN (SELECT pid, relation, granted FROM pg_locks

           GROUP BY pid, relation, granted) b

       ON a.relation = b.relation AND a.granted = 'f' AND b.granted = 't'

LEFT JOIN pg_class c ON a.relation = c.oid

WHERE a.relation IS NOT NULL ORDER BY a.txn_start;

SELECT a.txn_owner AS user_name, a.pid AS session_id,

       nvl(trim(c.relname),'') AS table_name,

       a.granted, b.pid AS blocking_session_id,

       datediff(s, a.txn_start, getdate()) AS lock_age_seconds

FROM svv_transactions a

LEFT JOIN (SELECT pid, relation, granted FROM pg_locks

           GROUP BY pid, relation, granted) b

       ON a.relation = b.relation AND a.granted = 'f' AND b.granted = 't'

LEFT JOIN pg_class c ON a.relation = c.oid

WHERE a.relation IS NOT NULL ORDER BY a.txn_start;

If granted = false and a blocking_session_id appears, that session is holding the lock. Terminate it with:

SELECT pg_terminate_backend(<blocking_pid>);

SELECT pg_terminate_backend(<blocking_pid>);

Step 6: Review post-query metrics for a completed query

Once a query finishes, pull its resource usage:

SELECT query, query_cpu_usage_percent,

       query_temp_blocks_to_disk AS spill_blocks,

       scan_row_count, return_row_count

FROM svl_query_metrics_summary

WHERE query = <query_id>;

SELECT query, query_cpu_usage_percent,

       query_temp_blocks_to_disk AS spill_blocks,

       scan_row_count, return_row_count

FROM svl_query_metrics_summary

WHERE query = <query_id>;

Any spill_blocks value above 0 means the query wrote temporary data to disk, which can slow performance.

Weekly WLM Trend Report

Run this weekly to catch regression before users notice:

SELECT TRUNC(w.exec_start_time)       AS exec_day,

       COUNT(*)                        AS queries_count,

       AVG(w.total_queue_time/1000000) AS avg_queue_sec,

       AVG(w.total_exec_time/1000000)  AS avg_exec_sec,

       AVG(m.query_cpu_usage_percent)  AS avg_cpu_pct,

       SUM(m.query_temp_blocks_to_disk) AS total_spill_blocks

FROM stl_wlm_query AS w

LEFT JOIN svl_query_metrics_summary AS m

  USING (userid, service_class, query)

WHERE service_class > 5

  AND w.exec_start_time > (sysdate - 7)

GROUP BY 1 ORDER BY 1;

SELECT TRUNC(w.exec_start_time)       AS exec_day,

       COUNT(*)                        AS queries_count,

       AVG(w.total_queue_time/1000000) AS avg_queue_sec,

       AVG(w.total_exec_time/1000000)  AS avg_exec_sec,

       AVG(m.query_cpu_usage_percent)  AS avg_cpu_pct,

       SUM(m.query_temp_blocks_to_disk) AS total_spill_blocks

FROM stl_wlm_query AS w

LEFT JOIN svl_query_metrics_summary AS m

  USING (userid, service_class, query)

WHERE service_class > 5

  AND w.exec_start_time > (sysdate - 7)

GROUP BY 1 ORDER BY 1;

Watch total_spill_blocks across days. A rising trend means queries are outgrowing their WLM memory allocation.

Layer 4: Auditing and Configuration Monitoring

AWS CloudTrail

CloudTrail records API activity against your Redshift cluster. It is useful for security reviews, audit trails, and compliance evidence.

Step 1: Enable CloudTrail for Redshift

Console: Go to CloudTrail > Trails > Create Trail. Choose All Regions, enable for management events, and set an S3 bucket destination.

AWS CLI: Create a trail via CLI:

aws cloudtrail create-trail --name redshift-audit --s3-bucket-name your-audit-bucket

aws cloudtrail create-trail --name redshift-audit --s3-bucket-name your-audit-bucket

Common API events to alert on: CreateCluster, DeleteCluster, ModifyClusterParameterGroup, AuthorizeSnapshotAccess, CreateClusterSecurityGroup.

Database Audit Logging

Step 1: Enable audit logging

Console: In the Redshift Console, select your cluster > Properties > scroll to Database audit logging > Edit. Choose an S3 bucket and enable connection and user activity logging.

AWS CLI: Enable via CLI:

aws redshift enable-logging \

  --cluster-identifier your-cluster \

  --bucket-name your-audit-bucket \

  --s3-key-prefix redshift-logs/

aws redshift enable-logging \

  --cluster-identifier your-cluster \

  --bucket-name your-audit-bucket \

  --s3-key-prefix redshift-logs/

Audit logs capture connections, authentication failures, and queries. Query them with Amazon Athena once in S3, or load them into your SIEM.

AWS Config for Compliance Drift

AWS Config continuously evaluates cluster configuration against rules you define. Unlike CloudTrail, which records what changed, Config tells you whether the cluster is currently compliant. Pre-built rules cover encryption at rest, public accessibility, and parameter group settings.

Understanding WLM and Query Queues

Workload Management (WLM) controls how queries share cluster resources. Each queue (service class) has a concurrency level and a memory allocation. When all slots in a queue are full, new queries wait.

How to Configure WLM

Step 1: Open WLM configuration

Console: In the Redshift Console, go to Clusters > your cluster > Properties > Workload management. Click Edit to modify queues.

AWS CLI: Describe current parameter group:

aws redshift describe-cluster-parameter-groups --parameter-group-name your-group

aws redshift describe-cluster-parameter-groups --parameter-group-name your-group

Step 2: Switch to automatic WLM (recommended)

Console: In WLM configuration, select Auto WLM. Redshift will dynamically allocate memory and concurrency based on query workload.

With Auto WLM enabled, you do not need to set manual slot counts. Redshift manages concurrency dynamically.

Step 3: Check queue depth in real time

Count running vs queued queries per service class:

SELECT service_class, state, COUNT(query) AS query_count

FROM stv_wlm_query_state

WHERE service_class >= 6

GROUP BY 1, 2 ORDER BY 1, 2;

SELECT service_class, state, COUNT(query) AS query_count

FROM stv_wlm_query_state

WHERE service_class >= 6

GROUP BY 1, 2 ORDER BY 1, 2;

Service classes 1 to 5 are internal. User queues start at 6. Persistent QueuedWaiting counts indicate saturation.

Redshift supports up to eight user-defined WLM queues plus a default queue. Separate heavy analytical workloads from light dashboard queries to keep both responsive.

Performance Best Practices

1. Set CloudWatch Alarms Before Problems Happen

Do not wait for user complaints. Set alarms on these four metrics as a minimum:

CPUUtilization > 80% sustained for 5 minutes
PercentageDiskSpaceUsed > 70%
WLMQueueWaitTime > 30 seconds average
QueryDuration rising more than 2x over the rolling 7-day baseline

2. Run VACUUM and ANALYZE on a Schedule

Find tables that need attention first:

SELECT schema_name, table_name, unsorted_pct, stats_off

FROM svv_table_info

WHERE unsorted_pct > 5 OR stats_off > 10

ORDER BY unsorted_pct DESC;

SELECT schema_name, table_name, unsorted_pct, stats_off

FROM svv_table_info

WHERE unsorted_pct > 5 OR stats_off > 10

ORDER BY unsorted_pct DESC;

Run VACUUM SORT ONLY on tables with unsorted_pct > 5. Run ANALYZE after any bulk load that changes more than 5% of a table’s rows. Schedule both during off-peak hours.

3. Detect and Fix Data Skew

Data skew means some node slices process far more rows than others, which breaks parallelism. Check for it with:

SELECT TRIM(name) AS table_name,

       MAX(num_values) AS max_rows,

       MIN(num_values) AS min_rows,

       (MAX(num_values) - MIN(num_values)) * 100.0

         / NULLIF(MAX(num_values),0) AS skew_pct

FROM svv_diskusage

WHERE name NOT LIKE 'Internal%'

GROUP BY 1 ORDER BY skew_pct DESC LIMIT 20;

SELECT TRIM(name) AS table_name,

       MAX(num_values) AS max_rows,

       MIN(num_values) AS min_rows,

       (MAX(num_values) - MIN(num_values)) * 100.0

         / NULLIF(MAX(num_values),0) AS skew_pct

FROM svv_diskusage

WHERE name NOT LIKE 'Internal%'

GROUP BY 1 ORDER BY skew_pct DESC LIMIT 20;

Tables with skew_pct above 30 should have their distribution key reviewed.

4. Use Redshift Advisor

Redshift Advisor analyzes your workload and produces recommendations for distribution keys, sort keys, and tables needing VACUUM.

Step 1: Open Advisor

Console: In the Redshift Console, select your cluster > Advisor. Recommendations are ranked by estimated performance impact.

AWS CLI: Describe recommendations via CLI:

aws redshift list-recommendations

aws redshift list-recommendations

Third-Party Monitoring Tools

Native AWS tools require significant SQL knowledge and manual setup. These tools provide dashboards, anomaly detection, and alerting with less effort.

Tool	Strength	Best For
CubeAPM	Query tracing, WLM visibility, anomaly alerts	APM-grade Redshift observability
Datadog	AI anomaly detection, multi-cloud dashboards	Multi-cloud enterprise environments
Dynatrace	Full-stack observability, AI root cause analysis	Large enterprises with complex infrastructure
New Relic	Query-level performance tracking, alerting	Dev and ops teams wanting a unified platform
Sumo Logic	Log analytics, pre-built Redshift dashboards	Security-focused teams
ManageEngine	Unified infra and database monitoring	IT ops in mixed environments
Amazon CloudWatch	Native metrics, alarms, dashboards	Infrastructure baseline monitoring

Troubleshooting Checklist: When a Query Is Slow

Work through these steps in order when a report or query is slow:

Is it running? Query STV_INFLIGHT. If it appears, note the query ID and starttime.
Is it waiting? Check STV_RECENTS for queries not in STV_INFLIGHT. Waiting queries are blocked on a WLM slot or a table lock.
What is the WLM state? Query STV_WLM_QUERY_STATE. QueuedWaiting means your queue concurrency limit is hit.
Is it progressing? Run STV_EXEC_STATE twice, 60 seconds apart. Static row count means it is hung on a lock.
Is there a lock? Run the SVV_TRANSACTIONS lock query. Terminate the blocking session if needed.
Is the cluster under pressure? Check CloudWatch CPUUtilization and PercentageDiskSpaceUsed.
Is it spilling to disk? Query SVL_QUERY_METRICS_SUMMARY for query_temp_blocks_to_disk. Any value above 0 is degrading performance.
Is the execution plan bad? Open the Query Profiler in the console. Look for rows estimated vs actual mismatches and redistribution operations.

Tired of Writing SQL Just to Debug a Slow Report?

Most engineers spend more time diagnosing Redshift issues than fixing them. CubeAPM changes that by turning raw query telemetry into clear, actionable signals — without any SQL.

What you get out of the box:

✔ Query-level latency breakdown with execution plan context
✔ WLM queue depth and slot wait time, live and historical
✔ Automatic alerts when disk spill or CPU thresholds are breached
✔ Works with both Redshift Serverless and provisioned clusters

→ Try CubeAPM Today

Conclusion

Effective AWS Redshift monitoring is not a single tool. It is a stack: CloudWatch for cluster-level signals, the Redshift Console and Query Profiler for query-level visibility, system tables for deep diagnostic queries, and CloudTrail with audit logging for security and compliance.

Most performance problems fall into a small set of categories: WLM queue saturation, disk spill from insufficient memory, lock contention, and stale statistics causing bad execution plans. The step-by-step queries in this guide cover all of them.

Start with the Key Takeaways at the top of this article. Set CloudWatch alarms today for the four critical metrics. Run VACUUM and ANALYZE on a schedule. And when something does go wrong, work through the eight-step troubleshooting checklist before anything else.

Disclaimer: This article reflects AWS Redshift features and behavior as of May 2025. AWS updates its services regularly, so check the official documentation before applying any steps to a live cluster.

FAQs

1. What is the difference between CloudWatch and Redshift system tables for monitoring?

CloudWatch shows cluster-level health, such as CPU, disk usage, network throughput, and query-related metrics. It helps you see when the cluster is under pressure.
Redshift system tables and SYS views show query-level details, such as WLM state, lock contention, disk spill, execution steps, and query history. CloudWatch tells you something is wrong; system tables help you find the exact query behind it.

2. How do I find slow queries in Amazon Redshift?

For live slow queries, check STV_INFLIGHT. For waiting or recent queries, use STV_RECENTS. For completed queries, use STL_QUERY or the Redshift Console Query and Database Monitoring page.
The Query Profiler adds a visual execution plan for troubleshooting. Enhanced query monitoring, announced in January 2025, adds query history, trend analysis, and deeper drill-downs.

3. What causes WLM queue wait time to spike?

WLM queue wait time rises when all available slots in a WLM service class are busy. New queries must wait until a slot becomes free.
Common causes include peak workload, long-running queries, too few slots, disk spill extending query time, or heavy and light queries sharing the same queue.

4. How do I check if a Redshift query is stuck or hung?

First, confirm the query appears in STV_INFLIGHT. Then run:
SELECT SUM(rows)
FROM stv_exec_state
WHERE query = <query_id>;
Run it twice, about 60 seconds apart. If the row count does not change, the query may not be progressing. Check SVV_TRANSACTIONS for blocking locks before terminating any session.

5. What is disk spill and how do I prevent it?

Disk spill happens when a query needs more memory than its WLM allocation and writes temporary data to disk. This can slow the query because disk I/O is slower than in-memory processing.
Detect it with query_temp_blocks_to_disk in SVL_QUERY_METRICS_SUMMARY. To reduce spill, review WLM memory, tune joins and sort keys, avoid unnecessary SELECT * queries, and keep table statistics updated.