CubeAPM

What Are the Most Important RabbitMQ Metrics to Track?

RabbitMQ exposes hundreds of metrics through its Prometheus endpoint and management API. Most of them are not relevant to daily operations. The metrics that predict failures are not always the most obvious ones – queue depth grows quietly before memory alarms fire, consumer utilisation saturates before backlogs become visible, and connection counts spike from application bugs before throughput degrades.

The RabbitMQ team recommends Prometheus combined with Grafana as the primary monitoring approach for production clusters. The rabbitmq_prometheus plugin exposes metrics on port 15692 and is designed to be scraped every 15 to 30 seconds. The management plugin (rabbitmq_management, port 15672) is convenient for development but has higher overhead and is not recommended as the primary monitoring source in production.

As of RabbitMQ 4.3 (released April 2026), all Prometheus metric names below apply to 3.8 and later, including all 4.x releases.

Key Takeaways

  • Enable rabbitmq_prometheus plugin for production monitoring – it has low overhead and is the recommended approach. The management plugin is for development only
  • Queue depth (rabbitmq_queue_messages_ready) is the single most important metric – a steadily growing value is almost always the first sign of a problem
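  • Consumer utilisation (called consumer capacity in current RabbitMQ documentation) is the fraction of time the queue can deliver to its consumers immediately. A value at or near 1.0 (100%) means consumers are keeping up. A sustained value below 0.5 (50%) means the queue is regularly waiting on busy consumers or a full prefetch window – investigate why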
  • Memory and disk alarms are cluster-wide – when one node triggers an alarm, all nodes block publishing connections
  • File descriptor exhaustion causes connection failures silently – monitor fd usage alongside message metrics
  • The recommended scrape interval is 30 seconds for production. Scraping more frequently increases CPU overhead on the broker
  • rabbitmq_queue_consumer_utilisation returns NaN for queues with no consumers – handle this in your alerting rules

Quick Reference: The Key Metrics

| Prometheus metric | What it tells you | Alert threshold |
| --- | --- | --- |
| rabbitmq_queue_messages_ready | Messages waiting in queue (depth) | Continuously growing above baseline |
| rabbitmq_queue_messages_unacked | Messages delivered to consumers but not yet acknowledged | Growing without corresponding delivery rate increase |
| rabbitmq_queue_consumer_utilisation | Fraction of time the queue can deliver to its consumers immediately (0.0 to 1.0) | < 0.5 sustained on a queue with consumers attached |
| rabbitmq_process_resident_memory_bytes | Memory used by the RabbitMQ node | > 60% of rabbitmq_resident_memory_limit_bytes |
| rabbitmq_disk_space_available_bytes | Free disk space on the node data directory | Approaching the disk alarm threshold |
| rabbitmq_process_open_fds | File descriptors in use | > 80% of rabbitmq_process_max_fds |
| rabbitmq_connections | Total client connections | Sustained rapid growth or unexpected drop to 0 |
| rabbitmq_queue_messages | Total queue depth (ready + unacked) | Continuously growing |
| rabbitmq_node_mem_alarm | Memory alarm state (0 or 1) | = 1 (alarm active) |
| rabbitmq_node_disk_free_alarm | Disk alarm state (0 or 1) | = 1 (alarm active) |

1. Queue Depth: rabbitmq_queue_messages_ready

What it is: The number of messages sitting in a queue waiting to be delivered to a consumer. This is the primary indicator of consumer lag – if this value is growing, consumers are not processing messages as fast as publishers are sending them.

What good looks like: Near zero for queues with active consumers, or a stable value that reflects normal batching behavior. A queue that drains as fast as it fills is healthy.

What bad looks like: Continuous growth over time. Even slow growth – 100 messages per minute across an eight-hour shift – means 48,000 messages are waiting by the end of the day. At some point, this will exhaust node memory or disk space.

Alert threshold: Alert when the queue depth has grown continuously for more than 10 minutes, or when it exceeds an absolute threshold based on your expected processing SLA. There is no universal number – set the threshold relative to your normal operating baseline.

# Alert: queue is non-empty and deeper than it was 10 minutes ago
rabbitmq_queue_messages_ready > 0
  and (rabbitmq_queue_messages_ready - rabbitmq_queue_messages_ready offset 10m) > 0

The practical trap: Aggregate queue depth across all queues can look fine while one specific queue is filling up. Always monitor per-queue depth with the queue label, not just the cluster total.
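The per-queue check can be sketched outside Prometheus as well. The snippet below is a minimal illustration, not a production detector: the function name `growing_queues`, the `min_points` parameter, and the sample values are all hypothetical, and the input is assumed to be a series of rabbitmq_queue_messages_ready readings per queue, one per scrape.

```python
from typing import Dict, List


def growing_queues(samples: Dict[str, List[float]], min_points: int = 3) -> List[str]:
    """Return the queues whose depth rose at every step of the sampled window."""
    flagged = []
    for queue, depths in samples.items():
        if len(depths) < min_points:
            continue  # too few samples to call it a trend
        # Strictly rising: each reading is higher than the one before it.
        if all(later > earlier for earlier, later in zip(depths, depths[1:])):
            flagged.append(queue)
    return sorted(flagged)


# Hypothetical readings, oldest first.
samples = {
    "orders": [120, 180, 260, 410],  # climbing at every scrape
    "emails": [40, 10, 55, 20],      # noisy but draining
    "audit": [0, 0, 0, 0],           # idle
}

print(growing_queues(samples))  # ['orders']
```

A cluster-wide total of these samples would hide the fact that only one queue is responsible for the growth, which is why the check runs per queue.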

2. Unacknowledged Messages: rabbitmq_queue_messages_unacked

What it is: Messages that have been delivered to a consumer but not yet acknowledged. RabbitMQ holds these in memory until the consumer sends an ack. If a consumer crashes without acknowledging, these messages are requeued.

What good looks like: Proportional to consumer count and prefetch configuration. If each consumer has a prefetch of 10 and you have 5 consumers, you expect up to 50 unacknowledged messages as a normal steady state.
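The steady-state ceiling is simple arithmetic, sketched below. The helper name `prefetch_saturation` is hypothetical, and the sketch assumes every consumer on the queue uses the same per-consumer prefetch value. A result near 1.0 means every consumer's prefetch window is full.

```python
from typing import Optional


def prefetch_saturation(unacked: int, consumers: int, prefetch: int) -> Optional[float]:
    """Fraction of the combined prefetch window currently filled with unacked messages."""
    if consumers <= 0 or prefetch <= 0:
        # No consumers, or prefetch 0 (unlimited): there is no finite ceiling.
        return None
    ceiling = consumers * prefetch  # e.g. 5 consumers x prefetch 10 = 50
    return unacked / ceiling


print(prefetch_saturation(unacked=20, consumers=5, prefetch=10))  # 0.4
print(prefetch_saturation(unacked=50, consumers=5, prefetch=10))  # 1.0
```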

What bad looks like: Unacked count growing without a corresponding increase in delivery rate. This usually means consumers are receiving messages but taking unusually long to process them – possibly blocked on a downstream service call, a database query, or an exception being swallowed silently.

Alert threshold: Alert when rabbitmq_queue_messages_unacked is growing over time rather than staying stable. A sudden spike that then recovers is often a transient consumer delay. A value that grows continuously is a stuck or slow consumer.

Practical note: When a consumer disconnects, RabbitMQ automatically requeues that consumer's unacknowledged messages – they appear back in rabbitmq_queue_messages_ready. This is expected behavior. If you see a sudden jump in ready messages accompanied by a matching drop in unacked, consumers disconnected; publishers did not generate a burst of new messages.

3. Consumer Utilisation: rabbitmq_queue_consumer_utilisation

What it is: A value between 0.0 and 1.0 representing the fraction of time the queue is able to deliver messages to its consumers immediately. Current RabbitMQ documentation calls this consumer capacity. A value of 1.0 means the queue never has to wait for its consumers – they accept messages as fast as the queue can hand them over. A value of 0.5 means that half the time the queue cannot deliver because every consumer is busy or has a full prefetch window.

Why it matters: Consumer utilisation is the metric that tells you where your throughput ceiling is. If utilisation is well below 1.0 and your queue is growing, consumers are the bottleneck: add more consumers, speed up processing (a slow downstream dependency or a processing exception is a common cause), or increase prefetch if consumers are fast and the bottleneck is network round-trips. If utilisation is close to 1.0 and your queue is still growing, consumers are accepting everything the queue offers, so the limit is more likely on the broker side or in the publish rate.

What good looks like: At or close to 1.0 in steady state, meaning the queue is not waiting on its consumers and there is headroom for traffic spikes.

Alert threshold:

  • Warning: rabbitmq_queue_consumer_utilisation < 0.5 sustained for 5 minutes on a queue with consumers attached (the queue is waiting on its consumers)
  • Warning: rabbitmq_queue_consumer_utilisation < 0.1 for a queue that should have active consumers (consumers may be stuck)

The NaN gotcha: rabbitmq_queue_consumer_utilisation returns NaN for queues with no consumers. Build your alerting rules to handle NaN values – otherwise an alert on < 0.5 will fire for every queue that has no consumers at all, including idle or inactive queues.

# Only alert on queues that have consumers attached
rabbitmq_queue_consumer_utilisation < 0.5
  and rabbitmq_queue_consumers > 0
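The same guard can be written explicitly when alert logic lives in application code rather than in Prometheus rules. This is a sketch under stated assumptions: `queues_to_alert`, its parameters, and the sample values are hypothetical, and the two dictionaries stand in for per-queue readings of rabbitmq_queue_consumer_utilisation and rabbitmq_queue_consumers.

```python
import math
from typing import Dict, List


def queues_to_alert(
    utilisation: Dict[str, float],
    consumers: Dict[str, int],
    threshold: float = 0.5,
) -> List[str]:
    """Return queues with attached consumers whose utilisation is below the threshold."""
    alerts = []
    for queue, value in utilisation.items():
        if math.isnan(value):
            continue  # no meaningful reading, typically a queue without consumers
        if consumers.get(queue, 0) == 0:
            continue  # nothing is attached, so low utilisation is not actionable
        if value < threshold:
            alerts.append(queue)
    return sorted(alerts)


# Hypothetical readings.
utilisation = {"orders": 0.2, "idle": float("nan"), "emails": 0.9}
consumers = {"orders": 3, "idle": 0, "emails": 2}

print(queues_to_alert(utilisation, consumers))  # ['orders']
```

Skipping NaN explicitly keeps the intent visible even though a NaN comparison would evaluate to false on its own.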

4. Memory: rabbitmq_process_resident_memory_bytes and Alarms

What it is: The amount of memory currently used by the RabbitMQ node’s Erlang process. This is compared against rabbitmq_resident_memory_limit_bytes – the memory high-watermark that triggers the memory alarm.

The memory alarm behavior: When a node’s memory usage exceeds the high-watermark (default is 40% of total system RAM), RabbitMQ raises a memory alarm. The effect is cluster-wide: every node blocks connections that publish, even if only one node is over its limit. Connections that only consume are not blocked, so consumers keep draining queues.

Alert thresholds:

Monitor the available memory headroom rather than the raw usage:

# Available memory before alarm triggers
rabbitmq_resident_memory_limit_bytes - rabbitmq_process_resident_memory_bytes

Alert when this value approaches zero. A threshold of less than 20% of the limit remaining gives you time to investigate before the alarm fires.
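As a worked example of the headroom calculation, the sketch below expresses the remaining memory as a fraction of the limit. The function names and the 8 GiB limit are illustrative assumptions; the two inputs correspond to rabbitmq_process_resident_memory_bytes and rabbitmq_resident_memory_limit_bytes.

```python
GIB = 1024 ** 3


def memory_headroom(resident_bytes: int, limit_bytes: int) -> float:
    """Remaining memory before the alarm, as a fraction of the limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return (limit_bytes - resident_bytes) / limit_bytes


def headroom_alert(resident_bytes: int, limit_bytes: int, min_fraction: float = 0.2) -> bool:
    """True when less than min_fraction of the limit remains."""
    return memory_headroom(resident_bytes, limit_bytes) < min_fraction


# Hypothetical node with an 8 GiB high-watermark.
print(memory_headroom(7 * GIB, 8 * GIB))  # 0.125
print(headroom_alert(7 * GIB, 8 * GIB))   # True
print(headroom_alert(4 * GIB, 8 * GIB))   # False
```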

Also monitor the alarm state directly:

# Alert immediately when memory alarm is active
rabbitmq_node_mem_alarm == 1

rabbitmq_node_mem_alarm is 1 when the alarm is active (publishers blocked), 0 when clear.

Why memory alarms happen: Queues with lazy-mode disabled store messages in memory before writing to disk. A sudden burst of messages to a queue without adequate consumers will consume node memory fast. Classic queues that fill up are a common cause. Quorum queues and streams manage memory more predictably.

5. Disk Space: rabbitmq_disk_space_available_bytes and Disk Alarm

What it is: Free disk space on the partition used for the RabbitMQ node data directory. RabbitMQ uses disk to persist durable messages, quorum queue logs, and stream data.

The disk alarm behavior: When free disk space falls below the disk alarm threshold (default 50 MB; configurable as an absolute value or relative to the node's total RAM), RabbitMQ triggers a disk alarm and blocks all publishing connections cluster-wide. Like the memory alarm, this is a cluster-wide effect when any single node triggers it.

Alert threshold: Alert well before the disk alarm threshold. A practical rule is to alert when free space falls below 20% of total disk capacity.

# Alert when disk alarm is active
rabbitmq_node_disk_free_alarm == 1

Practical note: Quorum queues and streams write to disk continuously as part of their design. If you use these queue types heavily, disk space management is critical, and disk monitoring becomes more urgent than for workloads using only classic queues.
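To judge how urgent a shrinking disk is, it can help to estimate the time left before the alarm threshold is reached. The sketch below is a rough linear projection, not a forecast: `seconds_until_disk_alarm` is a hypothetical helper, and the consumption rate is assumed to be measured separately (for example, from the change in rabbitmq_disk_space_available_bytes over a window).

```python
from typing import Optional


def seconds_until_disk_alarm(
    free_bytes: float, alarm_limit_bytes: float, bytes_per_second: float
) -> Optional[float]:
    """Linear estimate of the time left before free space hits the alarm limit."""
    if free_bytes <= alarm_limit_bytes:
        return 0.0  # already at or below the threshold
    if bytes_per_second <= 0:
        return None  # free space is stable or growing, no projected alarm
    return (free_bytes - alarm_limit_bytes) / bytes_per_second


# Hypothetical numbers: 1,050 bytes free, limit of 50 bytes, losing 10 bytes/s.
print(seconds_until_disk_alarm(1_050, 50, 10))  # 100.0
```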

6. File Descriptors: rabbitmq_process_open_fds

What it is: The number of file descriptors currently in use by the RabbitMQ process. Every TCP connection consumes at least one file descriptor. Every open file (for message storage) also consumes one. Erlang sockets and internal processes consume additional descriptors.

Why this matters: File descriptor exhaustion causes new connection attempts to fail with an error. Existing connections are not affected. This can cause silent partial failures – some clients cannot connect while the broker continues to serve existing connections normally.

Alert threshold:

# Alert when FD usage exceeds 80% of maximum
rabbitmq_process_open_fds / rabbitmq_process_max_fds > 0.8

The system limit: The maximum file descriptor count is set at the OS level. On Linux, the default limit per process is often 1,024 – far too low for production RabbitMQ with many connections. The RabbitMQ networking guide recommends setting this to at least 65,536 for production clusters. Monitor rabbitmq_process_max_fds as well as rabbitmq_process_open_fds to verify the OS limit has been set correctly.
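Both checks, usage ratio and a limit that was never raised, can be combined in one small report. The sketch below is illustrative: `fd_report` and its field names are hypothetical, and the inputs stand in for readings of rabbitmq_process_open_fds and rabbitmq_process_max_fds.

```python
from typing import Dict, Union

RECOMMENDED_MIN_FDS = 65_536  # production minimum cited above


def fd_report(open_fds: int, max_fds: int, warn_ratio: float = 0.8) -> Dict[str, Union[float, bool]]:
    """Summarise file descriptor usage and whether the OS limit looks too low."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    ratio = open_fds / max_fds
    return {
        "ratio": ratio,
        "near_exhaustion": ratio > warn_ratio,
        "limit_too_low": max_fds < RECOMMENDED_MIN_FDS,
    }


# Hypothetical node still running with the 1,024 default.
print(fd_report(open_fds=900, max_fds=1024))
```

In this example both flags are set: usage is above 80% and the limit is still the low OS default.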

7. Connections: rabbitmq_connections

What it is: The total number of client connections to the broker. Connection count is a health signal in two directions: a sudden, unexpected drop means clients are disconnecting, and sustained rapid growth means a connection leak.

What good looks like: Stable and proportional to your application’s connection pooling configuration.

What bad looks like: Connections climbing steadily without corresponding traffic growth (a connection leak – clients are opening connections without closing them). Or connections dropping to zero unexpectedly (broker or network failure from the client’s perspective).

Alert threshold:

  • Alert on unexpected drops: use anomaly detection or alert when connections drop more than 50% in 5 minutes
  • Alert on unexpected growth: alert when connections grow more than 2x the baseline over 10 minutes
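The two thresholds above can be sketched as a single check. The function name `connection_alerts` and the returned labels are hypothetical; `previous` is assumed to be the rabbitmq_connections reading from 5 minutes ago, and `baseline` is whatever normal value you have established for the cluster.

```python
from typing import List


def connection_alerts(previous: int, current: int, baseline: int) -> List[str]:
    """Flag a >50% drop against the earlier reading or growth beyond 2x baseline."""
    alerts = []
    if previous > 0 and current < previous * 0.5:
        alerts.append("drop")
    if baseline > 0 and current > baseline * 2:
        alerts.append("growth")
    return alerts


print(connection_alerts(previous=200, current=80, baseline=200))   # ['drop']
print(connection_alerts(previous=200, current=450, baseline=200))  # ['growth']
print(connection_alerts(previous=200, current=210, baseline=200))  # []
```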

Connection-related metric to watch: rabbitmq_channels grows in proportion to connections. A very high channel-to-connection ratio (many channels per connection) is not harmful but can indicate application code that opens channels without closing them.

8. Message Rates: Global Counters

RabbitMQ 3.8 introduced global counters that aggregate message rates correctly across reconnections and channel churn. Use these instead of per-channel or per-connection metrics for cluster-wide rate monitoring.

Key global rate metrics (all counters – use rate() in Prometheus):

| Metric | What it counts |
| --- | --- |
| rabbitmq_global_messages_received_total | Messages published into exchanges |
| rabbitmq_global_messages_delivered_consume_auto_ack_total | Messages delivered with auto-ack |
| rabbitmq_global_messages_delivered_consume_manual_ack_total | Messages delivered requiring manual ack |
| rabbitmq_global_messages_acknowledged_total | Messages acknowledged by consumers |
| rabbitmq_global_messages_dead_lettered_maxlen_total | Messages dead-lettered due to queue length limit |
| rabbitmq_global_messages_dead_lettered_expired_total | Messages dead-lettered due to TTL expiry |

Publish vs ack rate balance:

# If this is positive and growing, your queue is filling up
rate(rabbitmq_global_messages_received_total[1m])
  - rate(rabbitmq_global_messages_acknowledged_total[1m])

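A sustained positive difference means messages are accumulating – either consumers are too slow, or there are no consumers. A value near zero means the system is in balance. The comparison assumes manual acknowledgements and one queue per published message: auto-ack deliveries never produce an ack, and a message routed to several queues is received once but acknowledged once per queue.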
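To make the arithmetic behind the rate comparison concrete, the sketch below derives per-second rates from two raw counter readings. It is a simplified stand-in for what rate() does, not a reimplementation: the function names and sample values are hypothetical, and a counter reset is handled by treating the later reading as the whole increase.

```python
from typing import Tuple


def per_second_rate(earlier: float, later: float, seconds: float) -> float:
    """Average per-second increase of a counter between two readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    increase = later - earlier
    if increase < 0:
        # The counter went backwards, e.g. after a node restart:
        # count only what accumulated since the reset.
        increase = later
    return increase / seconds


def backlog_rate(received: Tuple[float, float], acked: Tuple[float, float], seconds: float) -> float:
    """Publish rate minus ack rate; positive means messages are accumulating."""
    return per_second_rate(*received, seconds) - per_second_rate(*acked, seconds)


# Hypothetical readings taken 60 seconds apart.
print(backlog_rate(received=(10_000, 16_000), acked=(9_000, 13_800), seconds=60))  # 20.0
```

Here publishers add 100 messages per second while consumers acknowledge 80, so the backlog grows by 20 messages per second.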

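Dead-lettered message rate: Any sustained rate of dead-lettered messages (either by TTL expiry or queue length overflow) means messages are leaving queues without being processed by their intended consumers. If no dead-letter exchange is configured for the queue, those messages are discarded, which is data loss. Alert on any non-zero rate.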

9. Quorum Queue Specific: Raft Metrics

If you use quorum queues (recommended over classic mirrored queues since RabbitMQ 3.8), monitor the Raft consensus metrics:

rabbitmq_raft_entry_commit_latency_seconds – Time for a log entry to be committed across the quorum. Sustained high values indicate inter-node network issues or a lagging follower.

rabbitmq_raft_log_last_applied_index per node – If one node’s applied index is significantly behind others, it is lagging. This causes the quorum queue leader to hold messages waiting for the follower to catch up.
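A lagging follower can be spotted by comparing each node's applied index with the highest one. The sketch below assumes readings of rabbitmq_raft_log_last_applied_index for a single quorum queue, keyed by node name; `lagging_nodes`, the `max_lag` value of 1,000 entries, and the node names are hypothetical and should be tuned to your own workload.

```python
from typing import Dict


def lagging_nodes(applied_index: Dict[str, int], max_lag: int = 1000) -> Dict[str, int]:
    """Return nodes whose applied index trails the most advanced node by more than max_lag."""
    if not applied_index:
        return {}
    highest = max(applied_index.values())
    return {
        node: highest - index
        for node, index in applied_index.items()
        if highest - index > max_lag
    }


# Hypothetical three-node cluster.
readings = {
    "rabbit@node-1": 50_000,
    "rabbit@node-2": 49_990,
    "rabbit@node-3": 42_000,
}

print(lagging_nodes(readings))  # {'rabbit@node-3': 8000}
```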

Enable the RabbitMQ-Raft Grafana dashboard alongside the main overview dashboard when using quorum queues.

Setting Up the Prometheus Plugin

# Enable the Prometheus plugin
rabbitmq-plugins enable rabbitmq_prometheus

# Verify it is serving metrics (default port 15692)
curl -s http://localhost:15692/metrics | head -20

The plugin serves two endpoints:

  • /metrics – aggregated metrics (recommended for most deployments)
  • /metrics/per-object – per-queue, per-connection metrics (useful for detailed debugging but generates larger payloads with many queues)
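For a quick sanity check of what either endpoint returns, the text output can be parsed with a few lines of code. This is a minimal sketch and not a full parser for the Prometheus exposition format: it assumes each sample line ends with the value and carries no trailing timestamp, `parse_metrics` is a hypothetical helper, and the sample payload is made up for illustration.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' string to its value, skipping comments and blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        # Split on the last space so label values containing spaces stay intact.
        series, _, raw_value = line.rpartition(" ")
        values[series] = float(raw_value)
    return values


# Hypothetical excerpt of a /metrics response.
sample = """\
# TYPE rabbitmq_connections gauge
rabbitmq_connections 42
# TYPE rabbitmq_process_open_fds gauge
rabbitmq_process_open_fds 118
rabbitmq_process_max_fds 65536
"""

metrics = parse_metrics(sample)
print(metrics["rabbitmq_connections"])  # 42.0
print(metrics["rabbitmq_process_open_fds"] / metrics["rabbitmq_process_max_fds"])
```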

Prometheus scrape configuration:

scrape_configs:
  - job_name: rabbitmq
    scrape_interval: 30s
    scrape_timeout: 25s
    static_configs:
      - targets:
          - rabbitmq-node-1:15692
          - rabbitmq-node-2:15692
          - rabbitmq-node-3:15692

The RabbitMQ team provides pre-built Grafana dashboards (ID 10991 for RabbitMQ-Overview, with companion dashboards for Raft and Erlang runtime metrics) that cover the metrics above with sensible default thresholds.

How Do I Find Which Application Is Causing a Queue Backlog?

rabbitmq_queue_messages_ready tells you a queue is filling up. It does not tell you which publisher is sending messages faster than consumers can process them, whether the bottleneck is the consumer processing logic, a downstream service the consumer is calling, or a configuration issue like incorrect prefetch.

When a queue depth alarm fires and the Grafana dashboard shows a growing backlog, the next question – which part of the application is responsible – requires tracing into the application layer.

CubeAPM instruments your producer and consumer application services via OpenTelemetry and captures each RabbitMQ publish and consume operation as a span in the full request trace. When a queue depth alarm fires, the trace in CubeAPM shows which API endpoint or background job is generating the publish rate, which consumer service is processing the messages, how long each message takes to process end-to-end, and whether the bottleneck is within the consumer’s own logic or in a downstream service it calls. The queue metric identifies the symptom. The trace identifies where in the application the imbalance originated. CubeAPM is self-hosted inside your own infrastructure, so no data leaves your environment.

Summary

| Metric | Alert when | Priority |
| --- | --- | --- |
| rabbitmq_queue_messages_ready | Growing continuously above baseline | Critical |
| rabbitmq_node_mem_alarm | = 1 | Critical |
| rabbitmq_node_disk_free_alarm | = 1 | Critical |
| rabbitmq_process_open_fds / rabbitmq_process_max_fds | > 0.8 | High |
| rabbitmq_queue_messages_unacked | Growing without delivery rate increase | High |
| rabbitmq_queue_consumer_utilisation | < 0.5 sustained, or < 0.1, with consumers present | High |
| rabbitmq_connections | Unexpected 50%+ drop or rapid growth | Medium |
| Dead-lettered message rate | Any non-zero sustained rate | Medium |

Start with the memory alarm, disk alarm, queue depth, and file descriptor alerts – these four cover the most common production failure modes. Add consumer utilisation and unacked message monitoring once the baseline is in place. Use the official RabbitMQ Grafana dashboards (ID 10991) as a starting point and adjust thresholds based on your actual workload over two to four weeks of observation.

Disclaimer: Metric names and alert thresholds are for guidance only – verify against the current RabbitMQ monitoring documentation and Prometheus plugin documentation before applying to production. Metric names and behavior may vary between RabbitMQ major versions. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.
