RabbitMQ exposes hundreds of metrics through its Prometheus endpoint and management API. Most of them are not relevant to daily operations. The metrics that predict failures are not always the most obvious ones – queue depth grows quietly before memory alarms fire, consumer utilisation saturates before backlogs become visible, and connection counts spike from application bugs before throughput degrades.
The RabbitMQ team recommends Prometheus combined with Grafana as the primary monitoring approach for production clusters. The rabbitmq_prometheus plugin exposes metrics on port 15692 and is designed to be scraped every 15 to 30 seconds. The management plugin (rabbitmq_management, port 15672) is convenient for development but has higher overhead and is not recommended as the primary monitoring source in production.
As of RabbitMQ 4.3 (released April 2026), all Prometheus metric names below apply to 3.8 and later, including all 4.x releases.
Key Takeaways
- Enable the rabbitmq_prometheus plugin for production monitoring – it has low overhead and is the recommended approach. Treat the management plugin as a development convenience, not as the primary monitoring source in production
- Queue depth (rabbitmq_queue_messages_ready) is the single most important metric – a steadily growing value is almost always the first sign of a problem
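- Consumer utilisation (called consumer capacity in current RabbitMQ documentation) is the fraction of time a queue can deliver to its consumers immediately. At or near 1.0 (100%) means consumers are keeping up. A sustained value below 0.5 (50%) on a queue with consumers attached means consumers are limiting throughput – investigate why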
- Memory and disk alarms are cluster-wide – when one node triggers an alarm, all nodes block publishing connections
- File descriptor exhaustion causes connection failures silently – monitor fd usage alongside message metrics
- Scrape every 15 to 30 seconds in production; 30 seconds is a reasonable default. Scraping more frequently increases CPU overhead on the broker
- rabbitmq_queue_consumer_utilisation returns NaN for queues with no consumers – handle this in your alerting rules
Quick Reference: The Key Metrics
| Prometheus metric | What it tells you | Alert threshold |
| rabbitmq_queue_messages_ready | Messages waiting in queue (depth) | Continuously growing above baseline |
| rabbitmq_queue_messages_unacked | Messages delivered to consumers but not yet acknowledged | Growing without corresponding delivery rate increase |
| rabbitmq_queue_consumer_utilisation | Fraction of time the queue can deliver to consumers immediately (0.0 to 1.0) | < 0.5 sustained on a queue with consumers attached |
| rabbitmq_process_resident_memory_bytes | Memory used by the RabbitMQ node | > 80% of rabbitmq_resident_memory_limit_bytes (less than 20% headroom) |
| rabbitmq_disk_space_available_bytes | Free disk space on the node data directory | Below 20% of total disk capacity, well before the disk alarm threshold |
| rabbitmq_process_open_fds | File descriptors in use | > 80% of rabbitmq_process_max_fds |
| rabbitmq_connections | Total client connections | Sustained rapid growth or unexpected drop to 0 |
| rabbitmq_queue_messages | Total queue depth (ready + unacked) | Continuously growing |
| rabbitmq_node_mem_alarm | Memory alarm state (0 or 1) | = 1 (alarm active) |
| rabbitmq_node_disk_free_alarm | Disk alarm state (0 or 1) | = 1 (alarm active) |
1. Queue Depth: rabbitmq_queue_messages_ready
What it is: The number of messages sitting in a queue waiting to be delivered to a consumer. This is the primary indicator of consumer lag – if this value is growing, consumers are not processing messages as fast as publishers are sending them.
What good looks like: Near zero for queues with active consumers, or a stable value that reflects normal batching behavior. A queue that drains as fast as it fills is healthy.
What bad looks like: Continuous growth over time. Even slow growth – 100 messages per minute across an eight-hour shift – means 48,000 messages are waiting by the end of the shift. Left unchecked, this will eventually exhaust node memory or disk space.
Alert threshold: Alert when the queue depth has grown continuously for more than 10 minutes, or when it exceeds an absolute threshold based on your expected processing SLA. There is no universal number – set the threshold relative to your normal operating baseline.
# Alert: queue is non-empty and deeper than it was 10 minutes ago
rabbitmq_queue_messages_ready > 0
and (rabbitmq_queue_messages_ready - rabbitmq_queue_messages_ready offset 10m) > 0
The practical trap: Aggregate queue depth across all queues can look fine while one specific queue is filling up. Always monitor per-queue depth with the queue label, not just the cluster total.
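The offset comparison only looks at two points in time. As a rough alternative sketch, PromQL's deriv() fits a slope across every sample in the window, and topk() surfaces the individual queues that a cluster total would hide. The 1000-message floor is a placeholder rather than a recommendation. The vhost and queue label names are assumptions that apply only when per-object metrics are being scraped.

```promql
# Per-queue slope of ready messages over the last 10 minutes (messages per second).
# Matches only queues that are trending upward AND already above a placeholder floor.
deriv(rabbitmq_queue_messages_ready[10m]) > 0
  and rabbitmq_queue_messages_ready > 1000

# The five deepest queues right now, so one filling queue is not hidden by a healthy total.
topk(5, sum by (vhost, queue) (rabbitmq_queue_messages_ready))
```

Pairing the first expression with a `for:` duration in an alerting rule would require the trend to persist before anyone is paged.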
2. Unacknowledged Messages: rabbitmq_queue_messages_unacked
What it is: Messages that have been delivered to a consumer but not yet acknowledged. RabbitMQ holds these in memory until the consumer sends an ack. If a consumer crashes without acknowledging, these messages are requeued.
What good looks like: Proportional to consumer count and prefetch configuration. If each consumer has a prefetch of 10 and you have 5 consumers, you expect up to 50 unacknowledged messages as a normal steady state.
What bad looks like: Unacked count growing without a corresponding increase in delivery rate. This usually means consumers are receiving messages but taking unusually long to process them – possibly blocked on a downstream service call, a database query, or an exception being swallowed silently.
Alert threshold: Alert when rabbitmq_queue_messages_unacked is growing over time rather than staying stable. A sudden spike that then recovers is often a transient consumer delay. A value that grows continuously is a stuck or slow consumer.
Practical note: When a consumer disconnects, RabbitMQ automatically requeues its unacknowledged messages – they appear back in rabbitmq_queue_messages_ready. This is expected behavior. If you see a sudden jump in ready messages accompanied by a drop in unacked, the likely cause is consumers disconnecting, not a burst of newly published messages.
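The two conditions described in this section can be sketched as follows. The first expression flags queues whose unacked count has trended upward; the second flags queues where every consumer appears to be holding a full prefetch window. The prefetch of 10 is borrowed from the example above and is an assumption about your consumers, as is the 15-minute window.

```promql
# Unacked messages trending upward over 15 minutes (positive least-squares slope).
deriv(rabbitmq_queue_messages_unacked[15m]) > 0

# Unacked count has reached prefetch x consumers (prefetch assumed to be 10),
# which suggests consumers are sitting on full prefetch windows.
rabbitmq_queue_messages_unacked >= 10 * rabbitmq_queue_consumers
  and rabbitmq_queue_consumers > 0
```

A full prefetch window is normal under load, so the second expression is more useful as a dashboard panel or as one input to an alert than as an alert by itself.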
3. Consumer Utilisation: rabbitmq_queue_consumer_utilisation
What it is: A value between 0.0 and 1.0 representing the fraction of time the queue can deliver messages to its consumers immediately. Current RabbitMQ documentation calls this consumer capacity. A value of 1.0 means the queue never has to wait for a consumer to become ready – consumers are keeping up. A value of 0.5 means that half the time the queue could not hand messages off, because consumers were busy or their prefetch windows were full.
Why it matters: Consumer utilisation tells you whether consumers are what limits throughput. If utilisation is low and your queue is growing, consumers cannot take messages fast enough. The fix is to add more consumers, to raise prefetch if consumers are fast and the bottleneck is network round-trips, or to find what is slowing consumers down – a downstream dependency, a processing exception, or incorrect prefetch configuration. If utilisation stays near 1.0, consumers are not the limiting factor.
What good looks like: At or near 1.0 in steady state. The metric is a hint rather than a precise measurement, so brief dips are normal.
Alert threshold:
- Warning: rabbitmq_queue_consumer_utilisation < 0.5 sustained for 5 minutes on a queue with consumers attached (consumers are limiting throughput)
- Warning: rabbitmq_queue_consumer_utilisation < 0.1 for a queue that should have active consumers (consumers may be stuck)
The NaN gotcha: rabbitmq_queue_consumer_utilisation returns NaN for queues with no consumers. Build your alerting rules to handle this explicitly. In PromQL a comparison against NaN never matches, so a bare < 0.5 rule silently skips those queues; if your version reports 0 for consumerless queues instead, the same rule fires for every one of them, including idle or inactive queues. Requiring rabbitmq_queue_consumers > 0 makes the intent explicit in both cases.
# Only alert on queues that have consumers attached
rabbitmq_queue_consumer_utilisation < 0.5
and rabbitmq_queue_consumers > 0
4. Memory: rabbitmq_process_resident_memory_bytes and Alarms
What it is: The amount of memory currently used by the RabbitMQ node’s Erlang process. This is compared against rabbitmq_resident_memory_limit_bytes – the memory high-watermark that triggers the memory alarm.
The memory alarm behavior: When a node’s memory usage exceeds the high-watermark (set by vm_memory_high_watermark; the default has historically been 40% of system RAM, so verify it for your version), RabbitMQ triggers a memory alarm. Because alarms are cluster-wide, every node then blocks publishing connections, not just the node that crossed the limit. Consumers continue to receive messages unaffected.
Alert thresholds:
Monitor the available memory headroom rather than the raw usage:
# Available memory before alarm triggers
rabbitmq_resident_memory_limit_bytes - rabbitmq_process_resident_memory_bytes
Alert when this value approaches zero. A threshold of less than 20% of the limit remaining gives you time to investigate before the alarm fires.
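The 20% figure can be written as a ratio so that one rule works across nodes with different amounts of RAM. This is a minimal sketch using only the two metrics named above; the 0.2 threshold comes from the text and should be tuned to how quickly memory grows in your workload.

```promql
# Fraction of the memory limit still available on each node.
# A result below 0.2 means less than 20% headroom before the alarm fires.
(rabbitmq_resident_memory_limit_bytes - rabbitmq_process_resident_memory_bytes)
  / rabbitmq_resident_memory_limit_bytes
  < 0.2
```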
Also monitor the alarm state directly:
# Alert immediately when memory alarm is active
rabbitmq_node_mem_alarm == 1
rabbitmq_node_mem_alarm is 1 when the alarm is active (publishers blocked), 0 when clear.
Why memory alarms happen: Queues with lazy-mode disabled store messages in memory before writing to disk. A sudden burst of messages to a queue without adequate consumers will consume node memory fast. Classic queues that fill up are a common cause. Quorum queues and streams manage memory more predictably.
5. Disk Space: rabbitmq_disk_space_available_bytes and Disk Alarm
What it is: Free disk space on the partition used for the RabbitMQ node data directory. RabbitMQ uses disk to persist durable messages, quorum queue logs, and stream data.
The disk alarm behavior: When free disk space falls below the disk alarm threshold (default 50 MB; it can also be set relative to the node’s total RAM), RabbitMQ triggers a disk alarm and blocks all publishing connections. Like the memory alarm, the effect is cluster-wide when any single node triggers it.
Alert threshold: Alert well before the disk alarm threshold. A practical rule is to alert when free space falls below 20% of total disk capacity.
# Alert when disk alarm is active
rabbitmq_node_disk_free_alarm == 1
Practical note: Quorum queues and streams write to disk continuously as part of their design. If you use these queue types heavily, disk space management is critical, and disk monitoring becomes more urgent than for workloads using only classic queues.
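The alarm metric only changes once publishers are already blocked. The broker metrics covered here do not include total disk size, so a percentage-of-capacity rule needs another source, such as a host-level exporter. One option that uses only the broker's own metric is to extrapolate the recent trend with predict_linear(). The one-hour lookback, the four-hour horizon, and the 50 MB floor (the default alarm threshold mentioned above) are illustrative assumptions.

```promql
# Linear extrapolation of free disk space: based on the last hour of samples,
# will this node have less than roughly 50 MB free four hours from now?
predict_linear(rabbitmq_disk_space_available_bytes[1h], 4 * 3600) < 50 * 1000 * 1000
```

A trend-based rule like this tends to be noisy on bursty workloads, so a longer lookback window may suit nodes whose disk usage fluctuates.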
6. File Descriptors: rabbitmq_process_open_fds
What it is: The number of file descriptors currently in use by the RabbitMQ process. Every TCP connection consumes at least one file descriptor. Every open file (for message storage) also consumes one. Erlang sockets and internal processes consume additional descriptors.
Why this matters: File descriptor exhaustion causes new connection attempts to fail with an error. Existing connections are not affected. This can cause silent partial failures – some clients cannot connect while the broker continues to serve existing connections normally.
Alert threshold:
# Alert when FD usage exceeds 80% of maximum
rabbitmq_process_open_fds / rabbitmq_process_max_fds > 0.8
The system limit: The maximum file descriptor count is set at the OS level. On Linux, the default limit per process is often 1,024 – far too low for production RabbitMQ with many connections. The RabbitMQ networking guide recommends setting this to at least 65,536 for production clusters. Monitor rabbitmq_process_max_fds as well as rabbitmq_process_open_fds to verify the OS limit has been set correctly.
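Checking that the OS limit was actually raised can be expressed as a one-line query. This sketch simply compares the reported maximum against the 65,536 figure quoted above; treat that number as a starting point, not a requirement.

```promql
# Nodes whose file descriptor limit is below 65,536.
# Any result here means the OS-level limit was not raised on that node.
rabbitmq_process_max_fds < 65536
```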
7. Connections: rabbitmq_connections
What it is: The total number of client connections to the broker. Connection count is a health signal in two directions: a sudden, unexpected drop means clients are disconnecting, and sustained rapid growth means a connection leak.
What good looks like: Stable and proportional to your application’s connection pooling configuration.
What bad looks like: Connections climbing steadily without corresponding traffic growth (a connection leak – clients are opening connections without closing them). Or connections dropping to zero unexpectedly (broker or network failure from the client’s perspective).
Alert threshold:
- Alert on unexpected drops: use anomaly detection or alert when connections drop more than 50% in 5 minutes
- Alert on unexpected growth: alert when connections grow more than 2x the baseline over 10 minutes
Connection-related metric to watch: rabbitmq_channels normally grows in proportion to connections. Many channels per connection is not harmful in itself, but a ratio that keeps rising can indicate application code that opens channels without closing them.
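The drop and growth thresholds above can be sketched with offset comparisons. These expressions treat "baseline" as the value a fixed interval earlier, which is an assumption. A longer-term average may suit workloads with strong daily patterns.

```promql
# Connection count is less than half of what it was 5 minutes ago.
rabbitmq_connections < 0.5 * (rabbitmq_connections offset 5m)

# Connection count is more than double what it was 10 minutes ago.
rabbitmq_connections > 2 * (rabbitmq_connections offset 10m)

# Channels per connection on each node; a steadily rising ratio hints at a channel leak.
rabbitmq_channels / rabbitmq_connections
```

If a node stops responding to scrapes altogether, these series have no current sample and none of the expressions return a result, so a separate target-health alert is still needed.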
8. Message Rates: Global Counters
RabbitMQ 3.8 introduced global counters that aggregate message rates correctly across reconnections and channel churn. Use these instead of per-channel or per-connection metrics for cluster-wide rate monitoring.
Key global rate metrics (all counters – use rate() in Prometheus):
| Metric | What it counts |
| rabbitmq_global_messages_received_total | Messages published into exchanges |
| rabbitmq_global_messages_delivered_consume_auto_ack_total | Messages delivered with auto-ack |
| rabbitmq_global_messages_delivered_consume_manual_ack_total | Messages delivered requiring manual ack |
| rabbitmq_global_messages_acknowledged_total | Messages acknowledged by consumers |
| rabbitmq_global_messages_dead_lettered_maxlen_total | Messages dead-lettered due to queue length limit |
| rabbitmq_global_messages_dead_lettered_expired_total | Messages dead-lettered due to TTL expiry |
Publish vs ack rate balance:
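# If this is positive and stays positive, your queues are filling up.
# sum() collapses per-node and per-label series so the two sides can be subtracted.
sum(rate(rabbitmq_global_messages_received_total[1m]))
- sum(rate(rabbitmq_global_messages_acknowledged_total[1m]))
A sustained positive difference means messages are accumulating – either consumers are too slow, or there are no consumers. A value near zero means the system is in balance. Deliveries to auto-ack consumers are never acknowledged, so this comparison only holds for manual-ack workloads.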
Dead-lettered message rate: Any sustained rate of dead-lettered messages (either by TTL expiry or queue length overflow) means messages are leaving their queue without being consumed. Unless a dead-letter exchange routes them somewhere you process them, this is data loss. Alert on any non-zero sustained rate.
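A non-zero dead-letter rate can be detected with the two counters from the table. This is a sketch: the 5-minute window is an assumption, and `or` is used so that the expression still returns a result when only one of the two counters is reporting.

```promql
# Cluster-wide dead-letter rate in messages per second.
# Returns a value when either the length-limit or the TTL-expiry rate is above zero.
sum(rate(rabbitmq_global_messages_dead_lettered_maxlen_total[5m])) > 0
or
sum(rate(rabbitmq_global_messages_dead_lettered_expired_total[5m])) > 0
```

Because both sides are summed to a single series, the result does not say which reason triggered it. Graph the two rates separately to tell them apart.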
9. Quorum Queue Specific: Raft Metrics
If you use quorum queues (recommended over classic mirrored queues since RabbitMQ 3.8), monitor the Raft consensus metrics:
rabbitmq_raft_entry_commit_latency_seconds – Time for a log entry to be committed across the quorum. Sustained high values indicate inter-node network issues or a lagging follower.
rabbitmq_raft_log_last_applied_index per node – If one node’s applied index is significantly behind others, it is lagging. This causes the quorum queue leader to hold messages waiting for the follower to catch up.
Enable the RabbitMQ-Raft Grafana dashboard alongside the main overview dashboard when using quorum queues.
Setting Up the Prometheus Plugin
# Enable the Prometheus plugin
rabbitmq-plugins enable rabbitmq_prometheus
# Verify it is serving metrics (default port 15692)
curl http://localhost:15692/metrics | head -20
The plugin serves two endpoints:
- /metrics – aggregated metrics (recommended for most deployments)
- /metrics/per-object – per-queue, per-connection metrics (needed for any rule that filters or groups by the queue label, and useful for detailed debugging, but generates larger payloads with many queues)
Prometheus scrape configuration:
scrape_configs:
  - job_name: rabbitmq
    scrape_interval: 30s
    scrape_timeout: 25s
    static_configs:
      - targets:
          - rabbitmq-node-1:15692
          - rabbitmq-node-2:15692
          - rabbitmq-node-3:15692
The official RabbitMQ team provides pre-built Grafana dashboards (ID 10991 for RabbitMQ-Overview, with companion dashboards for Raft and Erlang runtime metrics) that cover all the metrics above with sensible default thresholds.
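When a node or the plugin goes down, its broker metrics stop arriving, and threshold rules on those metrics stay silent. Prometheus records a built-in up series for every scrape target, so a minimal health check for the job configured above could look like this. The job label value assumes the job_name shown in the scrape configuration.

```promql
# Any RabbitMQ scrape target that Prometheus failed to scrape on its most recent attempt.
up{job="rabbitmq"} == 0
```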
How Do I Find Which Application Is Causing a Queue Backlog?
rabbitmq_queue_messages_ready tells you a queue is filling up. It does not tell you which publisher is sending messages faster than consumers can process them, whether the bottleneck is the consumer processing logic, a downstream service the consumer is calling, or a configuration issue like incorrect prefetch.
When a queue depth alarm fires and the Grafana dashboard shows a growing backlog, the next question – which part of the application is responsible – requires tracing into the application layer.
CubeAPM instruments your producer and consumer application services via OpenTelemetry and captures each RabbitMQ publish and consume operation as a span in the full request trace. When a queue depth alarm fires, the trace in CubeAPM shows which API endpoint or background job is generating the publish rate, which consumer service is processing the messages, how long each message takes to process end-to-end, and whether the bottleneck is within the consumer’s own logic or in a downstream service it calls. The queue metric identifies the symptom. The trace identifies where in the application the imbalance originated. Self-hosted inside your own infrastructure, no data leaves your environment.
Summary
| Metric | Alert when | Priority |
| rabbitmq_queue_messages_ready | Growing continuously above baseline | Critical |
| rabbitmq_node_mem_alarm | = 1 | Critical |
| rabbitmq_node_disk_free_alarm | = 1 | Critical |
| rabbitmq_process_open_fds / rabbitmq_process_max_fds | > 0.8 | High |
| rabbitmq_queue_messages_unacked | Growing without delivery rate increase | High |
| rabbitmq_queue_consumer_utilisation | < 0.5 sustained, or < 0.1, with consumers present | High |
| rabbitmq_connections | Unexpected 50%+ drop or rapid growth | Medium |
| Dead-lettered message rate | Any non-zero sustained rate | Medium |
Start with the memory alarm, disk alarm, queue depth, and file descriptor alerts – these four cover the most common production failure modes. Add consumer utilisation and unacked message monitoring once the baseline is in place. Use the official RabbitMQ Grafana dashboards (ID 10991) as a starting point and adjust thresholds based on your actual workload over two to four weeks of observation.
Disclaimer: Metric names and alert thresholds are for guidance only – verify against the current RabbitMQ monitoring documentation and Prometheus plugin documentation before applying to production. Metric names and behavior may vary between RabbitMQ major versions. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.