Alerting on RabbitMQ queue depth and consumer count with Prometheus requires understanding one critical distinction first: the default /metrics endpoint returns aggregated metrics with no queue name label, which means rabbitmq_queue_messages_ready tells you the total across all queues but cannot fire per-queue alerts.
Per-queue depth and consumer-count alerting requires the /metrics/detailed endpoint, which uses the rabbitmq_detailed_ metric prefix and must be scraped as a second job alongside /metrics. This guide covers both levels end-to-end.
Key Takeaways
- The default /metrics endpoint returns aggregated metrics (totals summed across queues). There is no queue label, so per-queue alerting is not possible from this endpoint alone
- Per-queue alerts require scraping /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count as a second Prometheus job. Metrics from this endpoint use the rabbitmq_detailed_ prefix
- On clusters, detailed metrics for a queue are only reported from the node that hosts the leader replica of that queue. Scrape each node individually, never via a load-balanced service endpoint
- rabbitmq_detailed_queue_messages_ready gives per-queue ready message count with queue and vhost labels
- rabbitmq_detailed_queue_consumer_count gives per-queue consumer count with queue and vhost labels
- Always alert separately on zero consumers AND on absolute queue depth. A queue with no consumers grows silently — no depth alert fires until messages have already accumulated
Step 1: Understand the Two Metric Tiers
Before writing any alert rules, understand which endpoint gives you what.
| Endpoint | Metric prefix | Has queue label | Use for |
|---|---|---|---|
| /metrics | rabbitmq_ | No (aggregated) | Cluster health, node memory, disk, connection counts, overall message rates |
| /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count | rabbitmq_detailed_ | Yes | Per-queue depth alerts, per-queue consumer count alerts |
| /metrics/per-object | rabbitmq_ | Yes | Full per-object metrics — avoid on large clusters, very high overhead |
For queue depth and consumer count alerting, you need the /metrics/detailed endpoint. It was designed specifically for this use case: on a test system with 10,000 queues, /metrics/per-object took over two minutes to respond, while /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count took two seconds.
Step 2: Configure Prometheus to Scrape Both Endpoints
Add two scrape jobs to your prometheus.yml. The first scrapes the standard aggregated endpoint for cluster-level monitoring. The second scrapes the detailed endpoint for per-queue alerting:
```yaml
scrape_configs:
  # Job 1: Aggregated cluster metrics (standard)
  - job_name: rabbitmq
    scrape_interval: 15s
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
          - rabbitmq-node3:15692
        labels:
          cluster: production

  # Job 2: Per-queue metrics for depth and consumer count alerting
  - job_name: rabbitmq-detailed
    scrape_interval: 30s
    metrics_path: /metrics/detailed
    params:
      family:
        - queue_coarse_metrics
        - queue_consumer_count
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
          - rabbitmq-node3:15692
        labels:
          cluster: production
```

A 30-second scrape interval for the detailed endpoint is sufficient for alerting and reduces load. Do not point this job at a load-balanced service endpoint. Detailed metrics for a queue are only returned by the node hosting that queue's leader replica. A load-balanced scrape therefore returns only the queues led by whichever node answered, and per-queue series appear and disappear between scrapes.
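For reference, the params block in the rabbitmq-detailed job is encoded as a repeated query key. The Python sketch below builds the per-node URLs that job requests, with one URL per node rather than one shared service address. The helper name is hypothetical, and the hostnames are the placeholders from the config above. `doseq=True` is what makes `urlencode` repeat the `family` key once per list entry.

```python
from urllib.parse import urlencode

# Placeholder node addresses, mirroring the static_configs targets above.
NODES = ["rabbitmq-node1:15692", "rabbitmq-node2:15692", "rabbitmq-node3:15692"]
FAMILIES = ["queue_coarse_metrics", "queue_consumer_count"]


def detailed_scrape_urls(nodes, families):
    """Return one /metrics/detailed URL per node (never a shared service URL).

    doseq=True expands the list into repeated keys (family=a&family=b),
    the same form as the params list in the scrape job.
    """
    query = urlencode({"family": families}, doseq=True)
    return [f"http://{node}/metrics/detailed?{query}" for node in nodes]


if __name__ == "__main__":
    for url in detailed_scrape_urls(NODES, FAMILIES):
        print(url)
```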
For Kubernetes with Prometheus Operator, add a second endpoint to your ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rabbitmq
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: rabbitmq
  namespaceSelector:
    matchNames:
      - rabbitmq
  endpoints:
    # Standard aggregated endpoint
    - port: prometheus
      interval: 15s
    # Per-queue detailed endpoint
    - port: prometheus
      interval: 30s
      path: /metrics/detailed
      params:
        family:
          - queue_coarse_metrics
          - queue_consumer_count
```

After applying, navigate to http://prometheus:9090/targets and confirm both jobs show targets as UP.
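The same check can be scripted against Prometheus's HTTP API, which lists scrape targets at /api/v1/targets. The sketch below assumes Prometheus is reachable at http://prometheus:9090 and that the job labels are rabbitmq and rabbitmq-detailed, as in the static configuration. With Prometheus Operator the job label is assigned by the Operator, so adjust the set of job names to match what the targets page shows. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed Prometheus address (same host as the targets page above).
PROMETHEUS_URL = "http://prometheus:9090"


def fetch_targets(base_url=PROMETHEUS_URL):
    """GET /api/v1/targets and return the decoded JSON response."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def unhealthy_targets(targets_json, jobs):
    """Return (job, scrapeUrl, health) for targets of `jobs` that are not up."""
    problems = []
    for target in targets_json["data"]["activeTargets"]:
        job = target["labels"].get("job")
        # health is "up", "down" or "unknown"
        if job in jobs and target["health"] != "up":
            problems.append((job, target["scrapeUrl"], target["health"]))
    return problems
```

For example, `unhealthy_targets(fetch_targets(), {"rabbitmq", "rabbitmq-detailed"})` returns an empty list when every RabbitMQ target is up.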
Step 3: Verify the Metrics Are Populating
Before writing alert rules, confirm the right metrics are present in Prometheus. Run these queries in the Prometheus UI:
```promql
# Aggregated ready messages (no queue label: totals across queues only)
rabbitmq_queue_messages_ready

# Per-queue ready messages (has queue and vhost labels)
rabbitmq_detailed_queue_messages_ready

# Per-queue consumer count (has queue and vhost labels)
rabbitmq_detailed_queue_consumer_count
```
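These queries can also be run from a script through the Prometheus HTTP API endpoint /api/v1/query. The sketch below assumes the same Prometheus address as the targets page and uses illustrative helper names. It builds the query URL and flattens an instant-vector result into a (vhost, queue) to value map. It expects detailed metrics that carry vhost and queue labels.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address


def instant_query_url(base_url, promql):
    """URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def queues_from_vector(response):
    """Map (vhost, queue) -> value from an instant-vector API response.

    If two instances report the same queue (e.g. during leader failover),
    the last one wins.
    """
    result = {}
    for series in response["data"]["result"]:
        labels = series["metric"]
        _, value = series["value"]  # [unix_timestamp, "value as string"]
        result[(labels["vhost"], labels["queue"])] = float(value)
    return result


def run_query(promql, base_url=PROMETHEUS_URL):
    """Execute the query and return the flattened per-queue map."""
    with urlopen(instant_query_url(base_url, promql), timeout=10) as resp:
        return queues_from_vector(json.load(resp))
```

For instance, `run_query("rabbitmq_detailed_queue_consumer_count == 0")` lists the queues that currently have no consumers.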
You can also verify directly against the endpoint:
```shell
curl -s "http://rabbitmq-node1:15692/metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count" | grep rabbitmq_detailed_queue_messages_ready
```
You should see output like:
```text
rabbitmq_detailed_queue_messages_ready{vhost="/",queue="orders"} 75
rabbitmq_detailed_queue_messages_ready{vhost="/",queue="notifications"} 2
```
If rabbitmq_detailed_queue_messages_ready returns nothing in Prometheus, check that the detailed job targets show UP at http://prometheus:9090/targets and that you are not scraping via a load-balanced service endpoint.
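When Prometheus is not in the loop, you can check the raw endpoint output from every node directly. The parser below is a simplified sketch, not a full implementation of the Prometheus text format. It assumes label values contain no `}` characters and does not unescape them. It reads sample lines like the ones above and merges the scrapes from several nodes. Merging works because each queue is reported only by the node hosting its leader. The function names are hypothetical.

```python
import re

# One labeled sample line, e.g.
# rabbitmq_detailed_queue_messages_ready{vhost="/",queue="orders"} 75
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric(text, metric):
    """Return {(vhost, queue): value} for `metric` in exposition-format text."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP / TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        out[(labels.get("vhost"), labels.get("queue"))] = float(match.group("value"))
    return out


def merge_nodes(per_node_texts, metric):
    """Combine the scrape output of every node into one per-queue map."""
    merged = {}
    for text in per_node_texts:
        merged.update(parse_metric(text, metric))
    return merged
```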
Label reference for detailed metrics:
| Label | Value |
|---|---|
| queue | The queue name |
| vhost | The virtual host (e.g. /) |
| instance | The scrape target (node host:port) that reported this metric; this label is added by Prometheus, not by RabbitMQ |
Step 4: Write the Alert Rules
Create rabbitmq-alerts.yml. The four rules below cover the signals that matter most for queue depth and consumer count monitoring.
```yaml
groups:
  - name: rabbitmq-queue-depth
    rules:
      # RULE 1: Queue depth exceeds threshold
      # Set the threshold based on your write rate and SLA, not an
      # arbitrary number. Formula: SLA seconds × write rate (msgs/sec).
      # Example: 500 msgs/sec × 30s tolerance = 15,000 threshold.
      - alert: RabbitMQQueueDepthHigh
        expr: rabbitmq_detailed_queue_messages_ready > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue depth is high"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has {{ $value | humanize }} ready messages.
            Node: {{ $labels.instance }}.

      # RULE 2: Queue depth has been increasing for a sustained period
      # Uses deriv(), which is correct for gauges (unlike rate(), which
      # is for counters). Fires only after sustained growth, not during
      # temporary bursts or catch-up after consumer restarts.
      - alert: RabbitMQQueueDepthGrowing
        expr: deriv(rabbitmq_detailed_queue_messages_ready[10m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue depth is continuously growing"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has been growing for at least 15 minutes.
            Current rate of change: {{ $value | humanize }} messages/second.
            Node: {{ $labels.instance }}.

      # RULE 3: Queue has no consumers
      # A consumer drop is almost never benign for more than 2 minutes.
      # This is a silent failure: the broker produces no error.
      - alert: RabbitMQQueueNoConsumers
        expr: rabbitmq_detailed_queue_consumer_count == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ queue has no consumers"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has zero consumers. Messages are accumulating with no one
            processing them. Node: {{ $labels.instance }}.

      # RULE 4: Consumer count dropped below minimum expected
      # Adjust the threshold (< 2) to match your deployment. The > 0
      # filter leaves the zero-consumer case to RULE 3, so both rules
      # do not fire for the same queue at once.
      - alert: RabbitMQConsumerCountLow
        expr: rabbitmq_detailed_queue_consumer_count < 2 and rabbitmq_detailed_queue_consumer_count > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ consumer count is below minimum"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has only {{ $value }} consumer(s). Expected at least 2.
            Node: {{ $labels.instance }}.
```
Reference your rule file from prometheus.yml:
```yaml
rule_files:
  - "rabbitmq-alerts.yml"
```
Alert Threshold Guidance
There is no universal threshold for queue depth. The right value depends on your write rate and SLA.
Formula: threshold = acceptable delay in seconds × write rate in messages per second
| Scenario | Write rate | Acceptable delay | Threshold (messages) |
|---|---|---|---|
| High-volume order processing | 1,000 msgs/sec | 10 seconds | 10,000 |
| Background job queue | 50 msgs/sec | 60 seconds | 3,000 |
| Low-volume notification queue | 5 msgs/sec | 60 seconds | 300 |
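The formula and the table rows can be checked with a few lines of Python. The second helper, `minutes_until_depth_alert` (a hypothetical name), roughly estimates how long the depth alert takes to fire after every consumer disappears. It assumes the queue starts empty and ignores scrape and evaluation intervals. The estimate shows why the zero-consumer rule is kept separate. Suppose traffic drops off-peak to 5 msgs/sec on a queue tuned for 1,000 msgs/sec. The 10,000-message threshold then takes about 33 minutes to reach, plus the 5-minute `for:` window.

```python
def depth_threshold(write_rate_per_sec: float, acceptable_delay_sec: float) -> int:
    """threshold = acceptable delay in seconds x write rate in messages/sec."""
    return round(write_rate_per_sec * acceptable_delay_sec)


def minutes_until_depth_alert(threshold: float, write_rate_per_sec: float,
                              for_minutes: float = 5) -> float:
    """Rough time for the depth alert to fire once all consumers vanish.

    With no consumers, ready messages grow at the write rate, so the queue
    crosses the threshold after threshold / write_rate seconds; the rule's
    for: window then adds for_minutes. Assumes the queue starts empty.
    """
    return threshold / write_rate_per_sec / 60 + for_minutes


# Rows from the table above: (write rate in msgs/sec, acceptable delay in s)
SCENARIOS = {
    "High-volume order processing": (1_000, 10),
    "Background job queue": (50, 60),
    "Low-volume notification queue": (5, 60),
}

if __name__ == "__main__":
    for name, (rate, delay) in SCENARIOS.items():
        threshold = depth_threshold(rate, delay)
        wait = minutes_until_depth_alert(threshold, rate)
        print(f"{name}: threshold={threshold}, depth alert after ~{wait:.1f} min")
    # Off-peak case: queue tuned for 1,000 msgs/sec now receives 5 msgs/sec.
    print(f"Off-peak: ~{minutes_until_depth_alert(10_000, 5):.1f} min")
```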
For the growing-depth alert (deriv > 0 for 15m), the for duration is the key tuning parameter. 15 minutes is a safe baseline that filters out the short-lived growth that builds up while consumers restart and then catch up. If your consumers restart often and take longer than that to recover, raise it to 20 minutes to avoid noise.
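The reason the rule uses deriv() rather than rate() can be seen in a small simulation. deriv() fits a least-squares line through the samples in the window. rate() assumes a monotonically increasing counter and treats every drop as a reset. The sketch below reimplements both in simplified form, without the edge extrapolation Prometheus applies, and the function names are hypothetical. It then applies both to a queue that is draining.

```python
def deriv(samples):
    """Per-second slope of (timestamp_s, value) samples via least squares.

    A simplified model of PromQL deriv(): simple linear regression.
    """
    n = len(samples)
    if n < 2:
        return None  # need at least two samples
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


def counter_style_rate(samples):
    """Roughly what rate() assumes: any drop in value is a counter reset.

    Omits the extrapolation Prometheus applies at the window edges.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a "reset" the counter restarted from 0, so all of cur is new.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])
```

On a draining queue sampled every 30 seconds, `[(0, 100), (30, 90), (60, 80), (90, 70)]`, `deriv` reports -1/3 messages per second. `counter_style_rate` reports about +2.67, which is growth that is not happening.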
For the zero consumer alert, a 2-minute duration is appropriate. A consumer drop is almost never benign for more than 2 minutes in production.
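The `for:` clause is what stops a brief consumer reconnect from paging anyone. The sketch below is a simplified model of Prometheus's pending and firing states; the helper name is hypothetical. In the model, an alert becomes pending on the first evaluation where the expression holds and fires once it has held for the full duration. A single evaluation where the expression no longer holds resets it. Real rule evaluation happens on the configured evaluation_interval (1 minute by default), so actual firing can lag slightly.

```python
def first_firing_time(evaluations, for_seconds):
    """Return the evaluation timestamp at which the alert fires, or None.

    `evaluations` is a list of (timestamp_s, condition_true) pairs taken at
    the rule evaluation interval.
    """
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None  # condition cleared: back to inactive
            continue
        if active_since is None:
            active_since = ts  # alert enters the pending state
        if ts - active_since >= for_seconds:
            return ts  # held for the whole for: window, so it fires
    return None
```

With evaluations every 30 seconds and consumers dropping to zero at t=60, `for: 2m` (120 seconds) fires at t=180. If the consumer count briefly recovers after the first drop, the 2-minute clock starts over.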
Step 5: Configure Alertmanager Routing
Group by queue and vhost to prevent alert storms when the same condition affects multiple nodes in a cluster simultaneously:
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'queue', 'vhost']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: default
  routes:
    # matchers replaces the deprecated match syntax
    - matchers:
        - alertname="RabbitMQQueueNoConsumers"
      receiver: rabbitmq-critical
      repeat_interval: 10m
    - matchers:
        - alertname="RabbitMQQueueDepthHigh"
      receiver: rabbitmq-warning

receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: rabbitmq-critical
    slack_configs:
      - channel: '#rabbitmq-incidents'
        title: ':fire: RabbitMQ Critical: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
  - name: rabbitmq-warning
    slack_configs:
      - channel: '#rabbitmq-monitoring'
        title: ':warning: RabbitMQ Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
Common Setup Problems
| Problem | Likely cause | Fix |
|---|---|---|
| rabbitmq_detailed_ metrics missing in Prometheus | Detailed scrape job not configured, or targeting a load-balanced endpoint | Add the second scrape job from Step 2 and point it at each node individually, not at a Kubernetes service |
| Per-queue metrics missing for some queues intermittently | Scraping via a load-balanced endpoint rather than per-pod | In Kubernetes, use the ServiceMonitor per-pod approach so each node is scraped directly |
| Zero consumer alert fires on intentionally inactive queues | Queue is dormant by design during off-hours | Add a label exclusion filter: rabbitmq_detailed_queue_consumer_count{queue!~"deferred-.*"} == 0 |
| Growing depth alert fires every time consumers restart | for duration shorter than your restart and catch-up cycle | Increase for: 15m to for: 20m to absorb longer catch-up periods |
| Alert fires on the same queue from multiple nodes | Leader failover causing the reporting node to change mid-window | Group by queue and vhost in Alertmanager routing, not by instance |
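Grouping is what collapses those duplicates. The sketch below models only the group_by step of Alertmanager routing and is simplified: a missing label is treated as an empty string, and the function name is hypothetical. It shows that two alerts for the same queue reported by different nodes land in one group. Adding instance to the grouping would split them again.

```python
from collections import defaultdict

# Mirrors route.group_by in the Alertmanager config from Step 5.
GROUP_BY = ("alertname", "queue", "vhost")


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alert label sets by the values of the group_by labels.

    Labels not listed in group_by (such as instance) do not split groups.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```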
How Do You Know Why the Queue Is Growing, Not Just That It Is?
When RabbitMQQueueDepthHigh fires, the alert tells you which queue and vhost are affected. It does not tell you whether the depth is growing because consumers crashed, because each message is taking too long to process, or because a downstream service the consumer calls has slowed down.
These three root causes look identical on a queue depth chart. A queue accumulating because of a crashed consumer looks exactly the same as one accumulating because each message now takes three seconds to process instead of 300ms.
CubeAPM instruments your RabbitMQ consumer application via OpenTelemetry and captures each message processing cycle as a span in the full distributed trace. When a queue depth alert fires, CubeAPM shows you which consumer instance is slowest, how long each message takes end-to-end through the system, which downstream service call is responsible for the slowdown, and whether the issue is isolated to one consumer pod or affecting all of them. The Prometheus alert tells you something is wrong. CubeAPM tells you what and where. It runs self-hosted inside your own infrastructure at $0.15/GB ingestion pricing, so no data leaves your environment.
Summary
Alerting on RabbitMQ queue depth and consumer count requires two Prometheus scrape jobs: the standard /metrics endpoint for cluster-level health, and /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count for per-queue alerting. Metrics from the detailed endpoint use the rabbitmq_detailed_ prefix and carry queue and vhost labels.
Alert on absolute depth (threshold-based), on growing depth using deriv() (not rate() – queue depth is a gauge, not a counter), and on zero consumers as a separate critical signal.
| Step | What to configure | Key detail |
|---|---|---|
| Add detailed scrape job | Second job in prometheus.yml targeting /metrics/detailed | Use family=queue_coarse_metrics&family=queue_consumer_count. Scrape each node individually |
| Verify metrics | Query rabbitmq_detailed_queue_messages_ready in Prometheus UI | Must have queue and vhost labels. If absent, check scrape job targets are UP |
| Alert on absolute depth | rabbitmq_detailed_queue_messages_ready > [threshold] for: 5m | Threshold = SLA seconds × write rate. Set per-queue, not universally |
| Alert on growing depth | deriv(rabbitmq_detailed_queue_messages_ready[10m]) > 0 for: 15m | Use deriv() for gauges, not rate(). Catches slow build-ups below a fixed threshold |
| Alert on zero consumers | rabbitmq_detailed_queue_consumer_count == 0 for: 2m | Critical severity. Always alert on this separately |
| Alert on low consumer count | rabbitmq_detailed_queue_consumer_count < N for: 5m | Set N to your expected minimum per queue |
| Configure Alertmanager | Group by queue and vhost | Prevents alert storms when multiple nodes report the same queue condition |
Disclaimer: Metric names, endpoint paths, and alert expressions are verified against RabbitMQ 4.3 documentation (rabbitmq.com/docs/prometheus), the rabbitmq-server GitHub repository, and direct responses from the RabbitMQ core team in the official Google Group as of May 2026.