What Are the Key AWS SQS Metrics to Monitor?

Amazon SQS automatically publishes metrics to CloudWatch under the AWS/SQS namespace, at no extra charge, every minute for any queue that is active. All SQS metrics use a single dimension: QueueName. The challenge is not access – it is knowing which metrics predict problems before they cascade into application failures, and what the right thresholds are for each.

The 8 AWS SQS metrics below cover the full health picture of an SQS queue: backlog depth, message age, processing throughput, inflight limits, producer health, consumer efficiency, message size, and dead-letter queue status. Together, they tell you whether your queue is processing normally, backing up, or silently losing messages.

Key Takeaways

  • All 8 metrics are free and live in the AWS/SQS namespace – no Enhanced Monitoring required
  • SQS metrics have up to 15 minutes of latency – they are not real-time. For time-sensitive alerting, instrument your producers and consumers directly in addition to monitoring SQS
  • ApproximateNumberOfMessagesVisible and ApproximateAgeOfOldestMessage are the two most important metrics – one measures backlog depth, the other measures processing lag
  • Messages moved to a Dead Letter Queue do NOT increment NumberOfMessagesSent – you must monitor the DLQ separately with its own alarm
  • When a queue is inactive for more than 6 hours, SQS stops publishing metrics to CloudWatch. A gap in your metric graph is not necessarily an alert – it may mean the queue was simply idle
  • Both standard and FIFO queues support up to 120,000 inflight messages – reaching this limit causes new ReceiveMessage calls to fail with an OverLimit error on standard queues

Quick Reference: The 8 Metrics That Matter

| Metric | What it tells you | Alert threshold |
| --- | --- | --- |
| ApproximateNumberOfMessagesVisible | Messages waiting to be processed (queue backlog) | Depends on workload – alert when it grows continuously |
| ApproximateAgeOfOldestMessage | How long the oldest unprocessed message has been waiting | Depends on SLA – alert above your acceptable processing delay |
| ApproximateNumberOfMessagesNotVisible | Messages currently being processed by consumers | > 108,000 for standard queues (90% of 120,000 limit) |
| NumberOfMessagesSent | Messages added to the queue by producers | Drops to 0 unexpectedly, or spikes far above baseline |
| NumberOfMessagesDeleted | Messages successfully processed and removed | Drops significantly below NumberOfMessagesSent |
| NumberOfEmptyReceives | Poll attempts that found no messages | Consistently high means over-polling and unnecessary API cost |
| SentMessageSize | Size of messages being sent to the queue | Average approaching your configured maximum message size |
| ApproximateNumberOfMessagesVisible on DLQ | Messages that failed processing and could not be retried | Any value above 0 |

1. ApproximateNumberOfMessagesVisible

What it is: The number of messages available for retrieval from the queue – the backlog. This is the primary indicator of whether your consumers are keeping pace with your producers.

What good looks like: Roughly stable or trending toward zero. A queue that processes messages as fast as they arrive will hover near zero. Some backlog is expected during traffic spikes.

What bad looks like: A continuously growing value. If ApproximateNumberOfMessagesVisible climbs steadily over time, your consumers are falling behind your producers. Left unchecked, this delays downstream processing and can eventually cause messages to reach the retention period and be deleted before anyone processes them.

Alert threshold to set: This depends entirely on your workload and SLA. The practical approach is to establish a baseline over one week and alert when the value exceeds 2x that baseline for more than 10 minutes. For queues feeding time-sensitive workflows, set a hard threshold based on how many messages represent unacceptable lag.
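The baseline approach above can be sketched in a few lines of Python. This is only an illustration: it assumes you have already exported a week of ApproximateNumberOfMessagesVisible samples into a list, it uses the mean as the baseline, and the function names are hypothetical.

```python
from statistics import mean


def baseline_threshold(samples, multiplier=2.0):
    """Return the alert threshold: multiplier x the mean of the baseline samples."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    return multiplier * mean(samples)


def sustained_breach(recent, threshold, minutes=10):
    """True if the last `minutes` one-minute datapoints all exceed the threshold."""
    if len(recent) < minutes:
        return False
    return all(value > threshold for value in recent[-minutes:])


# Hypothetical baseline: samples that average 500 visible messages.
week = [400, 500, 600]
threshold = baseline_threshold(week)  # 2 x 500
```

A single spike does not trip `sustained_breach`; every one of the last ten datapoints has to be above the threshold, which mirrors the "for more than 10 minutes" condition.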

Note on approximation: AWS calls this metric “approximate” because SQS is a distributed system. The value is accurate enough for operational decisions but should not be used as an exact count.

aws cloudwatch put-metric-alarm \
  --alarm-name "SQS-HighQueueDepth-your-queue" \
  --metric-name ApproximateNumberOfMessagesVisible \
  --namespace AWS/SQS \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=QueueName,Value=your-queue-name \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alert-topic

2. ApproximateAgeOfOldestMessage

What it is: The age in seconds of the oldest message currently in the queue – that is, how long the message has been waiting without being processed.

What good looks like: Near zero for queues with active consumers. Some age is expected during batch processing windows or when consumers are intentionally delayed.

What bad looks like: A value that keeps climbing. A growing ApproximateAgeOfOldestMessage means something is stuck – either consumers are too slow, a message is unprocessable and blocking the queue, or consumers have stopped entirely.

Alert threshold to set: Set based on your SLA. If your application commits to processing messages within 60 seconds, alarm at 45 seconds. If same-day processing is acceptable, alarm at several hours.

Use Maximum, not Average: A single stuck message raises the maximum significantly while the average may look acceptable across the 5-minute window. Always monitor ApproximateAgeOfOldestMessage with the Maximum statistic.

The DLQ relationship: If you do not have a Dead Letter Queue configured, a poison message that no consumer can process will cause ApproximateAgeOfOldestMessage to climb continuously until the message hits its retention period and is dropped. Configure a DLQ on every production queue.
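As a rough sketch, the alarm settings described in this section could be assembled like this in Python. The function only builds the keyword arguments for CloudWatch's PutMetricAlarm call. The alarm naming scheme, the 75% fraction (taken from the 60-second SLA / 45-second alarm example), and the period values are assumptions to adjust.

```python
def age_alarm_params(queue_name, sla_seconds, sns_topic_arn, fraction=0.75):
    """Build keyword arguments for CloudWatch put_metric_alarm (e.g. via boto3).

    The threshold is a fraction of the SLA, so a 60-second SLA alarms at 45 seconds.
    """
    return {
        "AlarmName": f"SQS-OldestMessageAge-{queue_name}",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",  # a single stuck message must not be averaged away
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": sla_seconds * fraction,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.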

3. ApproximateNumberOfMessagesNotVisible

What it is: The number of messages currently in flight – received by a consumer but not yet deleted from the queue. These messages are within their visibility timeout period and invisible to other consumers.

What good looks like: Proportional to your consumer count and message processing time. For a consumer pool processing 100 messages at a time with a 30-second timeout, you expect roughly 100 inflight messages.

What bad looks like: Approaching the inflight limit. Both standard and FIFO queues support a maximum of 120,000 inflight messages. When this limit is reached on a standard queue, new ReceiveMessage calls return an OverLimit error and consumers cannot pick up new messages.

Alert threshold to set:

  • Warning: ApproximateNumberOfMessagesNotVisible > 100,000 (83% of the 120,000 limit)
  • Critical: > 108,000 (90% of the limit)
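The two tiers above amount to a simple classification. A minimal sketch, assuming the 120,000 limit quoted in this article and using a hypothetical function name:

```python
INFLIGHT_LIMIT = 120_000     # in-flight limit quoted in this article
WARNING_THRESHOLD = 100_000  # about 83% of the limit
CRITICAL_THRESHOLD = 108_000 # 90% of the limit


def inflight_severity(not_visible):
    """Classify an ApproximateNumberOfMessagesNotVisible reading into a severity tier."""
    if not_visible > CRITICAL_THRESHOLD:
        return "critical"
    if not_visible > WARNING_THRESHOLD:
        return "warning"
    return "ok"
```

In practice each tier would be its own CloudWatch alarm with a different notification target. The function only makes the boundaries explicit.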

The poison message trap: If a consumer receives a message and crashes before deleting it, the message becomes visible again after the visibility timeout expires and another consumer picks it up. If no consumer can process it, it cycles through receive attempts until it hits the max receive count and moves to the DLQ. During this cycle, ApproximateNumberOfMessagesNotVisible stays elevated even though no real processing is happening.

4. NumberOfMessagesSent

What it is: The number of messages added to the queue per period – your producer throughput.

What good looks like: Consistent with your expected traffic pattern. For a queue fed by user activity, you expect higher values during business hours and lower at night. For a queue fed by batch jobs, spikes at scheduled intervals are normal.

What bad looks like: A drop to zero when traffic is expected (producers have stopped or failed), or a sustained spike far above baseline (upstream retry storms or misconfigured producers flooding the queue).

Alert threshold to set:

  • Alert on zero: NumberOfMessagesSent = 0 during expected traffic windows means producers are down
  • Alert on anomaly: Use CloudWatch Anomaly Detection on this metric to catch both unexpected drops and unexpected spikes without manual threshold tuning

Important caveat: Messages that fail processing and are moved to a Dead Letter Queue do NOT increment NumberOfMessagesSent on the DLQ. Monitoring NumberOfMessagesSent on your DLQ only shows messages that were manually sent to it – not failed messages redirected from the source queue.

5. NumberOfMessagesDeleted

What it is: The number of messages deleted from the queue per period – your consumer throughput. Successfully processed messages are deleted; messages that expire or are moved to a DLQ are not counted here.

What good looks like: Close to NumberOfMessagesSent. A healthy queue deletes roughly as many messages as it receives.

What bad looks like: NumberOfMessagesDeleted consistently lower than NumberOfMessagesSent. The gap is the rate at which your backlog grows. If producers send 100 messages per minute and consumers delete 80, your backlog grows by 20 messages every minute.
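The arithmetic in that example generalizes to a quick estimate of how long you have before the backlog reaches a given depth. A small sketch with hypothetical function names, assuming send and delete rates stay constant:

```python
def backlog_growth_per_minute(sent_per_minute, deleted_per_minute):
    """Net messages added to the backlog each minute (negative means it is draining)."""
    return sent_per_minute - deleted_per_minute


def minutes_until_depth(current_depth, target_depth, sent_per_minute, deleted_per_minute):
    """Minutes until the backlog reaches target_depth, or None if it is not growing."""
    growth = backlog_growth_per_minute(sent_per_minute, deleted_per_minute)
    if growth <= 0:
        return None
    remaining = max(target_depth - current_depth, 0)
    return remaining / growth
```

For the rates above (100 sent, 80 deleted per minute), an empty queue would reach a 1,000-message backlog in 50 minutes.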

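The monitoring limitation: SQS emits NumberOfMessagesDeleted for every successful delete call made with a valid receipt handle, including duplicate deletions of the same message, so the value can overstate how many distinct messages were actually processed. This makes it unreliable as an exact throughput counter. Combine it with ApproximateNumberOfMessagesVisible trends for a fuller picture.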

6. NumberOfEmptyReceives

What it is: The number of ReceiveMessage API calls that returned no messages because the queue was empty at the time of the poll.

What good looks like: Some empty receives are normal, especially with short polling. With long polling (WaitTimeSeconds=20), empty receives should be rare.

What bad looks like: A high and sustained empty receive rate means your consumers are polling more frequently than messages arrive. This wastes API calls and increases SQS costs, since each ReceiveMessage call is billed regardless of whether it returns messages.

What to do when it is high: Switch to long polling if you have not already. Set WaitTimeSeconds=20 on your ReceiveMessage calls. Long polling holds the connection for up to 20 seconds, waiting for a message to arrive before returning an empty response, dramatically reducing the empty receive count and cost.
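A minimal long-polling consumer step might look like the sketch below. It assumes `sqs_client` behaves like `boto3.client("sqs")` and that `handler` is your own function returning True when a message was processed successfully. Both names are illustrative.

```python
def poll_once(sqs_client, queue_url, handler, wait_seconds=20):
    """Long-poll the queue once and delete each message the handler accepts.

    Returns the number of messages processed and deleted.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # largest batch a single call can return
        WaitTimeSeconds=wait_seconds,  # long polling: wait up to 20 s for a message
    )
    processed = 0
    # An empty receive has no "Messages" key in the response.
    for message in response.get("Messages", []):
        if handler(message["Body"]):
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            processed += 1
    return processed
```

Messages the handler rejects are left undeleted, so they become visible again after the visibility timeout and eventually move to the DLQ if one is configured.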

7. SentMessageSize

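What it is: The size in bytes of messages being sent to the queue. SQS enforces a hard maximum message size, taken in this article as 256 KB (262,144 bytes); confirm the current quota and your queue's MaximumMessageSize setting before relying on that figure. Messages exceeding the limit are rejected.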

What good looks like: Well below the maximum. Messages consistently near the limit are fragile – any payload growth breaks the integration.

What bad looks like: Average message size approaching 200 KB or more. A single deployment that adds extra fields to the payload can push messages over 256 KB, at which point send calls are rejected. If the producer does not handle that error, those messages never reach the queue.

Alert threshold to set:

  • Warning: Average SentMessageSize > 200,000 bytes (approximately 78% of the 256 KB limit)
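Producers can also check size before sending rather than waiting for the CloudWatch average to move. A sketch under the 256 KB assumption used in this section. It measures only the UTF-8 encoded body, whereas SQS also counts message attributes toward the size, and the function names are hypothetical.

```python
SQS_MAX_BYTES = 262_144  # 256 KB ceiling assumed by this article
WARN_BYTES = 200_000     # warning threshold from the bullet above


def body_size_bytes(body):
    """Size of a message body once UTF-8 encoded (message attributes are not counted)."""
    return len(body.encode("utf-8"))


def size_status(body):
    """Return 'ok', 'warning', or 'too_large' for a message body string."""
    size = body_size_bytes(body)
    if size > SQS_MAX_BYTES:
        return "too_large"
    if size > WARN_BYTES:
        return "warning"
    return "ok"
```

Counting encoded bytes rather than characters matters for non-ASCII payloads, where one character can take several bytes.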

For large payloads: Use the SQS Extended Client Library (available for Java and Python) to store the message body in S3 and send only a reference in SQS. This supports payloads up to 2 GB.

8. ApproximateNumberOfMessagesVisible on the DLQ

What it is: The backlog count on your Dead Letter Queue – the queue that receives messages that failed processing after exhausting their maximum receive count.

What good looks like: Zero. Any message in the DLQ represents a failure that your application could not handle automatically.

What bad looks like: Any value above zero. Even a single DLQ message means something went wrong that your consumers could not recover from – a bug, a malformed message, a downstream dependency that was unavailable, or a transient error that exceeded the retry limit.

Alert threshold to set:

  • Alert: ApproximateNumberOfMessagesVisible on DLQ > 0

aws cloudwatch put-metric-alarm \
  --alarm-name "SQS-DLQ-MessagesDetected-your-queue-dlq" \
  --metric-name ApproximateNumberOfMessagesVisible \
  --namespace AWS/SQS \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=QueueName,Value=your-queue-name-dlq \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alert-topic

If you do not have a DLQ configured: You should. Any production queue that processes messages asynchronously needs a DLQ. Without one, failed messages cycle through retries until they hit the message retention period (up to 14 days) and are permanently dropped with no record of what failed.
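Attaching a DLQ comes down to setting a redrive policy on the source queue. A minimal sketch that builds the attribute payload; the default of 5 receives is an assumption, so pick a value that matches your retry tolerance.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes mapping that attaches a DLQ to a source queue.

    Intended for sqs.set_queue_attributes(QueueUrl=..., Attributes=...) in boto3.
    """
    policy = {
        "deadLetterTargetArn": dlq_arn,
        # Receives allowed before SQS moves the message to the DLQ.
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not a nested mapping.
    return {"RedrivePolicy": json.dumps(policy)}
```

The DLQ must be the same queue type as its source queue: a FIFO queue needs a FIFO DLQ, and a standard queue needs a standard DLQ.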

FIFO Queue-Specific Metrics

FIFO queues support all the metrics above and add FIFO-only metrics relevant to ordering and deduplication. The one that matters most for ordering bottlenecks:

ApproximateNumberOfGroupsWithInflightMessages – The number of message groups that currently have messages in flight. In a FIFO queue, messages within a single group are processed one at a time in order. If one group has a stuck or slow-processing message, no other messages in that group can be processed until it resolves. A high value here means many groups are active simultaneously, which is healthy. A persistently low value combined with high ApproximateNumberOfMessagesVisible suggests a message group bottleneck.

FIFO throughput limits to watch: FIFO queues in their default (non-high-throughput) mode support 300 API calls per second per API action, which is 3,000 messages per second when batching 10 messages per call. High throughput FIFO mode supports up to 70,000 TPS in US East (N. Virginia). If NumberOfMessagesSent approaches the throughput ceiling, requests will be throttled. Enable high throughput mode via the SQS console if your workload requires it.

The 15-Minute Latency Problem

SQS CloudWatch metrics can lag by up to 15 minutes. AWS documents this delay for queues that are activated from an inactive state, and because SQS is a distributed system, metric aggregation is never instantaneous in any case. What this means in practice:

  • A queue backlog that starts growing at 14:00 may not appear in CloudWatch alarms until 14:15
  • A consumer crash at 14:00 may not trigger your NumberOfMessagesDeleted = 0 alarm until 14:15

For queues where a 15-minute detection gap is unacceptable, instrument your consumers and producers directly. Track processing time and success rates as custom CloudWatch metrics from your application code – these have lower latency because you publish them yourself at the moment events occur.
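One way to shape such a custom metric is sketched below. The function builds a single MetricData entry for CloudWatch's PutMetricData call. The metric name, the `Outcome` dimension, and the namespace in the usage note are hypothetical choices, not AWS-defined names.

```python
from datetime import datetime, timezone


def processing_time_metric(queue_name, elapsed_ms, succeeded):
    """Build one MetricData entry describing how long a message took to process."""
    return {
        "MetricName": "MessageProcessingTime",  # hypothetical custom metric name
        "Dimensions": [
            {"Name": "QueueName", "Value": queue_name},
            {"Name": "Outcome", "Value": "success" if succeeded else "failure"},
        ],
        "Timestamp": datetime.now(timezone.utc),  # stamped when the event occurs
        "Value": float(elapsed_ms),
        "Unit": "Milliseconds",
    }
```

With boto3, the entry could be published as `boto3.client("cloudwatch").put_metric_data(Namespace="MyApp/SQS", MetricData=[entry])`. Custom metrics are billed separately from the free AWS/SQS metrics.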

When CloudWatch SQS Metrics Are Not Enough

CloudWatch SQS metrics tell you the state of the queue: how deep the backlog is, how old the oldest message is, how many are in flight. What they do not tell you is what happened to a specific message – which consumer processed it, how long it took, what the downstream call was, or why it failed and ended up in the DLQ.

When a DLQ alarm fires, CloudWatch shows you that messages are there. It does not show you the application trace that explains why they failed – which service threw the exception, which database call timed out, or whether it was a transient error or a code bug.

CubeAPM instruments the application services that produce and consume SQS messages via OpenTelemetry, and captures the full request trace for each message – from when it was sent, through the consumer’s processing logic, to every downstream service it called. When a DLQ alarm fires, the trace in CubeAPM shows you exactly what happened to the messages before they failed: which exception was thrown, which database query ran too long, and whether the same failure pattern is happening across other messages in the queue. CloudWatch shows you the symptom. CubeAPM shows you the cause. Self-hosted in your own AWS account, no data leaves your environment.

Summary

| Metric | Namespace | Alert threshold | Priority |
| --- | --- | --- | --- |
| ApproximateNumberOfMessagesVisible | AWS/SQS | 2x baseline over 10 min | High |
| ApproximateAgeOfOldestMessage | AWS/SQS | Based on SLA (Maximum statistic) | High |
| ApproximateNumberOfMessagesNotVisible | AWS/SQS | > 100,000 | High |
| ApproximateNumberOfMessagesVisible on DLQ | AWS/SQS | > 0 | High |
| NumberOfMessagesSent | AWS/SQS | Drops to 0 during expected traffic | Medium |
| NumberOfMessagesDeleted | AWS/SQS | Consistently below NumberOfMessagesSent | Medium |
| NumberOfEmptyReceives | AWS/SQS | High sustained rate (switch to long polling) | Low |
| SentMessageSize | AWS/SQS | Average > 200,000 bytes | Low |

Start with ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage, and a DLQ alarm – these three cover the most critical failure modes. Add the in-flight limit alarm and producer/consumer throughput metrics once the baseline is in place. Enable a DLQ on every production queue before you need it.

Disclaimer: Configurations, thresholds, and CLI examples are for guidance only – verify against current AWS SQS CloudWatch metrics documentation before applying to production. SQS quotas and metric behavior change over time. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.
