
RabbitMQ Alerts: How to Alert on RabbitMQ Queue Depth and Consumer Count 


Alerting on RabbitMQ queue depth and consumer count with Prometheus requires understanding one critical distinction first: the default /metrics endpoint returns aggregated metrics with no queue name label, which means rabbitmq_queue_messages_ready tells you the total across all queues but cannot fire per-queue alerts. 

Per-queue and per-consumer alerting requires the /metrics/detailed endpoint, which uses a rabbitmq_detailed_ metric prefix and must be scraped as a second job alongside /metrics. This guide covers both levels end-to-end.

Key Takeaways

  • The default /metrics endpoint returns aggregated cluster-wide metrics. There is no queue label — per-queue alerting is not possible from this endpoint alone
  • Per-queue alerts require scraping /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count as a second Prometheus job. Metrics from this endpoint use the rabbitmq_detailed_ prefix
  • On clusters, detailed metrics for a queue are only reported from the node that hosts the leader replica of that queue. Scrape each node individually, never via a load-balanced service endpoint
  • rabbitmq_detailed_queue_messages_ready gives per-queue ready message count with queue and vhost labels
  • rabbitmq_detailed_queue_consumer_count gives per-queue consumer count with queue and vhost labels
  • Always alert separately on zero consumers AND on absolute queue depth. A queue with no consumers grows silently — no depth alert fires until messages have already accumulated

Step 1: Understand the Two Metric Tiers

Before writing any alert rules, understand which endpoint gives you what.

| Endpoint | Metric prefix | Has queue label | Use for |
| --- | --- | --- | --- |
| /metrics | rabbitmq_ | No (aggregated) | Cluster health, node memory, disk, connection counts, overall message rates |
| /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count | rabbitmq_detailed_ | Yes | Per-queue depth alerts, per-queue consumer count alerts |
| /metrics/per-object | rabbitmq_ | Yes | Full per-object metrics. Avoid on large clusters: very high overhead |

For queue depth and consumer count alerting, you need the /metrics/detailed endpoint. It was designed specifically for this use case: on a test system with 10,000 queues, /metrics/per-object took over two minutes to respond, while /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count took two seconds.
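
The detailed endpoint takes one family parameter per metric family, repeated in the query string. As a minimal sketch of how that URL is assembled, the helper below (a hypothetical name, not part of RabbitMQ or Prometheus) builds it with the Python standard library, assuming the node hostname and port used elsewhere in this guide:

```python
from urllib.parse import urlencode


def detailed_metrics_url(host, families, port=15692):
    """Build the /metrics/detailed URL with one family= pair per entry."""
    # doseq=True expands the list into repeated family=... parameters
    query = urlencode({"family": families}, doseq=True)
    return f"http://{host}:{port}/metrics/detailed?{query}"


url = detailed_metrics_url(
    "rabbitmq-node1",
    ["queue_coarse_metrics", "queue_consumer_count"],
)
print(url)
```

Requesting only the families you alert on is what keeps the response small, so the list passed in should stay as short as your alert rules allow.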

Step 2: Configure Prometheus to Scrape Both Endpoints

Add two scrape jobs to your prometheus.yml. The first scrapes the standard aggregated endpoint for cluster-level monitoring. The second scrapes the detailed endpoint for per-queue alerting:

scrape_configs:
  # Job 1: Aggregated cluster metrics (standard)
  - job_name: rabbitmq
    scrape_interval: 15s
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
          - rabbitmq-node3:15692
        labels:
          cluster: production

  # Job 2: Per-queue metrics for depth and consumer count alerting
  - job_name: rabbitmq-detailed
    scrape_interval: 30s
    metrics_path: /metrics/detailed
    params:
      family:
        - queue_coarse_metrics
        - queue_consumer_count
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
          - rabbitmq-node3:15692
        labels:
          cluster: production

A 30-second scrape interval for the detailed endpoint is sufficient for alerting and reduces load. Do not point this job at a load-balanced service endpoint – detailed metrics for a queue are only returned by the node hosting that queue’s leader replica, so load-balancing will cause most scrapes to miss queues on a multi-node cluster.
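
To make the load-balancing problem concrete, here is a small Python sketch of the behavior described above. The leader placement and queue names are hypothetical; the only thing taken from this guide is the rule that a node reports detailed metrics solely for queues whose leader it hosts.

```python
def queues_visible_from(node, leaders):
    """Queues one scrape of `node` returns: only those whose leader lives there."""
    return {queue for queue, leader in leaders.items() if leader == node}


def queues_visible_from_all(nodes, leaders):
    """Union of what every node reports when each node is scraped directly."""
    seen = set()
    for node in nodes:
        seen |= queues_visible_from(node, leaders)
    return seen


# Hypothetical leader placement on a three-node cluster
leaders = {
    "orders": "rabbitmq-node1",
    "notifications": "rabbitmq-node2",
    "emails": "rabbitmq-node3",
}
nodes = ["rabbitmq-node1", "rabbitmq-node2", "rabbitmq-node3"]

# A load balancer sends each scrape to a single node, so two queues are missing
print(sorted(queues_visible_from("rabbitmq-node1", leaders)))  # ['orders']
# Scraping every node individually covers all queues
print(sorted(queues_visible_from_all(nodes, leaders)))
```

With a load balancer in front, which queues vanish depends on where each scrape happens to land, which is why the gaps look intermittent in Prometheus.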

For Kubernetes with Prometheus Operator, add a second endpoint to your ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rabbitmq
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: rabbitmq
  namespaceSelector:
    matchNames:
      - rabbitmq
  endpoints:
    # Standard aggregated endpoint
    - port: prometheus
      interval: 15s
    # Per-queue detailed endpoint
    - port: prometheus
      interval: 30s
      path: /metrics/detailed
      params:
        family:
          - queue_coarse_metrics
          - queue_consumer_count

After applying, navigate to http://prometheus:9090/targets and confirm both jobs show targets as UP.
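
The same check can be scripted against the Prometheus HTTP API, which exposes target health at /api/v1/targets. The sketch below is one possible way to do it: the function names are illustrative, the job names assume the static prometheus.yml from this step (a ServiceMonitor generates different job labels), and the sample payload is hand-written rather than captured from a real server.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://prometheus:9090"):
    """Fetch the target list from the Prometheus HTTP API (needs network access)."""
    with urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def unhealthy_targets(payload, jobs):
    """Return (job, instance) pairs for active targets in `jobs` that are not up."""
    down = []
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        if labels.get("job") in jobs and target["health"] != "up":
            down.append((labels["job"], labels.get("instance")))
    return down


# Hand-written sample in the shape of an /api/v1/targets response
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "rabbitmq", "instance": "rabbitmq-node1:15692"}, "health": "up"},
            {"labels": {"job": "rabbitmq-detailed", "instance": "rabbitmq-node1:15692"}, "health": "up"},
            {"labels": {"job": "rabbitmq-detailed", "instance": "rabbitmq-node2:15692"}, "health": "down"},
        ]
    },
}

print(unhealthy_targets(sample, {"rabbitmq", "rabbitmq-detailed"}))
```

Against a live server, pass the result of fetch_targets() instead of the sample; an empty list means both jobs are healthy on every node.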

Step 3: Verify the Metrics Are Populating

Before writing alert rules, confirm the right metrics are present in Prometheus. Run these queries in the Prometheus UI:

# Aggregated ready messages (no queue label – cluster total only)
rabbitmq_queue_messages_ready

# Per-queue ready messages (has queue and vhost labels)
rabbitmq_detailed_queue_messages_ready

# Per-queue consumer count (has queue and vhost labels)
rabbitmq_detailed_queue_consumer_count

You can also verify directly against the endpoint:

curl -s "http://rabbitmq-node1:15692/metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count" | grep rabbitmq_detailed_queue_messages_ready

You should see output like:

rabbitmq_detailed_queue_messages_ready{vhost="/",queue="orders"} 75
rabbitmq_detailed_queue_messages_ready{vhost="/",queue="notifications"} 2

If rabbitmq_detailed_queue_messages_ready returns nothing in Prometheus, check that the detailed job targets show UP at http://prometheus:9090/targets and that you are not scraping via a load-balanced service endpoint.
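
If you want to inspect the endpoint output in a script rather than with grep, the lines can be split into metric name, labels, and value. The following is a simplified sketch, not a full parser for the Prometheus exposition format: it assumes label values contain no escaped quotes or closing braces, and the function name is made up for this example.

```python
import re

# metric_name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value) for a labelled sample line, else None."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None  # comments (# HELP, # TYPE), blank lines, unlabelled samples
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


line = 'rabbitmq_detailed_queue_messages_ready{vhost="/",queue="orders"} 75'
name, labels, value = parse_sample(line)
print(name, labels["queue"], labels["vhost"], value)
```

For anything beyond a quick check, a maintained parser such as the one in the official Prometheus Python client is a safer choice than a regular expression.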

Label reference for detailed metrics:

| Label | Value |
| --- | --- |
| queue | The queue name |
| vhost | The virtual host (e.g. /) |
| instance | The RabbitMQ node that reported this metric |

Step 4: Write the Alert Rules

Create rabbitmq-alerts.yml. The four rules below cover the signals that matter most for queue depth and consumer count monitoring.

groups:
  - name: rabbitmq-queue-depth
    rules:
      # RULE 1: Queue depth exceeds threshold
      # Set the threshold based on your write rate and SLA, not an
      # arbitrary number. Formula: SLA seconds × write rate (msgs/sec).
      # Example: 500 msgs/sec × 30s tolerance = 15,000 threshold.
      - alert: RabbitMQQueueDepthHigh
        expr: rabbitmq_detailed_queue_messages_ready > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue depth is high"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has {{ $value | humanize }} ready messages.
            Node: {{ $labels.instance }}.
            

      # RULE 2: Queue depth has been increasing for a sustained period
      # Uses deriv() which is correct for gauges (unlike rate(), which
      # is for counters). Fires only after sustained growth, not during
      # temporary bursts or catch-up after consumer restarts.
      # Note: $value here is the slope returned by deriv(), in messages
      # per second, not the current queue depth.
      - alert: RabbitMQQueueDepthGrowing
        expr: deriv(rabbitmq_detailed_queue_messages_ready[10m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue depth is continuously growing"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has been growing for 15 consecutive minutes.
            Rate of change: {{ $value | humanize }} messages/second.
            Node: {{ $labels.instance }}.
            

      # RULE 3: Queue has no consumers
      # A consumer drop is almost never benign for more than 2 minutes.
      # This is a silent failure: the broker produces no error.
      - alert: RabbitMQQueueNoConsumers
        expr: rabbitmq_detailed_queue_consumer_count == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ queue has no consumers"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has zero consumers. Messages are accumulating with no one
            processing them. Node: {{ $labels.instance }}.
            

      # RULE 4: Consumer count dropped below minimum expected
      # Adjust the threshold (< 2) to match your deployment.
      # Note: this expression also matches queues with zero consumers,
      # so it overlaps with RULE 3 once both have been pending long enough.
      - alert: RabbitMQConsumerCountLow
        expr: rabbitmq_detailed_queue_consumer_count < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ consumer count is below minimum"
          description: >
            Queue {{ $labels.queue }} on vhost {{ $labels.vhost }}
            has only {{ $value }} consumer(s). Expected at least 2.
            Node: {{ $labels.instance }}.

Reference your rule file from prometheus.yml:

rule_files:
  - "rabbitmq-alerts.yml"

Alert Threshold Guidance

There is no universal threshold for queue depth. The right value depends on your write rate and SLA.

Formula: threshold = acceptable delay in seconds × write rate in messages per second

| Scenario | Write rate | Acceptable delay | Threshold |
| --- | --- | --- | --- |
| High-volume order processing | 1,000 msgs/sec | 10 seconds | 10,000 |
| Background job queue | 50 msgs/sec | 60 seconds | 3,000 |
| Low-volume notification queue | 5 msgs/sec | 60 seconds | 300 |
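
The formula is simple enough to keep as a small helper next to your rule files. This sketch reproduces the three scenarios above; the function name and the scenario keys are illustrative only.

```python
def depth_threshold(write_rate_per_sec, acceptable_delay_sec):
    """Queue depth threshold = acceptable delay (s) × write rate (msgs/s)."""
    if write_rate_per_sec < 0 or acceptable_delay_sec < 0:
        raise ValueError("rate and delay must be non-negative")
    return round(write_rate_per_sec * acceptable_delay_sec)


# (write rate in msgs/sec, acceptable delay in seconds)
scenarios = {
    "order-processing": (1000, 10),
    "background-jobs": (50, 60),
    "notifications": (5, 60),
}

for name, (rate, delay) in scenarios.items():
    print(f"{name}: alert above {depth_threshold(rate, delay)} ready messages")
```

Because the result differs by orders of magnitude between queues, a single global threshold will be either too noisy for the busy queues or too slow for the quiet ones.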

For the growing depth alert (deriv > 0 for 15m), the duration is the key tuning parameter. 15 minutes is a safe baseline that filters out catch-up bursts after consumer restarts. If your consumers frequently restart and catch up within minutes, keep it at 15 to 20 minutes to avoid noise.
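
To see what the deriv() condition measures, it helps to compute the same kind of value by hand. Prometheus documents deriv() as a per-second derivative estimated with simple linear regression; the sketch below applies an ordinary least-squares slope to hypothetical samples and is an illustration of that idea, not Prometheus's own implementation.

```python
def slope_per_second(samples):
    """Least-squares slope of (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("timestamps must differ")
    return covariance / variance


# Hypothetical queue depth sampled every 30 seconds
growing = [(0, 100), (30, 110), (60, 120), (90, 130)]
draining = [(0, 130), (30, 120), (60, 110), (90, 100)]

print(slope_per_second(growing) > 0)   # True: depth rises by about 1/3 msg/sec
print(slope_per_second(draining) > 0)  # False: consumers are catching up
```

A gauge can move in both directions, so the sign of the slope is meaningful. rate() assumes a counter and treats any decrease as a reset, which is why it gives misleading results on queue depth.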

For the zero consumer alert, a 2-minute duration is appropriate. A consumer drop is almost never benign for more than 2 minutes in production.
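
The effect of the for duration can be sketched in a few lines. This is a simplified model of the pending-then-firing behavior, assuming evenly spaced rule evaluations and hypothetical consumer counts; the function name is not a Prometheus API.

```python
def first_firing_time(evaluations, condition, for_seconds):
    """Return the first timestamp at which the condition has held
    continuously for at least `for_seconds`, or None if it never does."""
    pending_since = None
    for timestamp, value in evaluations:
        if condition(value):
            if pending_since is None:
                pending_since = timestamp  # alert becomes pending
            if timestamp - pending_since >= for_seconds:
                return timestamp  # alert fires
        else:
            pending_since = None  # condition cleared, timer resets
    return None


def no_consumers(count):
    return count == 0


# Consumer count evaluated every 30 seconds (hypothetical values)
blip = [(0, 2), (30, 0), (60, 0), (90, 2)]
outage = blip + [(120, 0), (150, 0), (180, 0), (210, 0), (240, 0)]

print(first_firing_time(blip, no_consumers, 120))    # None
print(first_firing_time(outage, no_consumers, 120))  # 240
```

The short drop at 30 to 60 seconds never fires because the timer resets when consumers return, while the sustained outage starting at 120 seconds fires once it has lasted the full two minutes.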

Step 5: Configure Alertmanager Routing

Group by queue and vhost to prevent alert storms when the same condition affects multiple nodes in a cluster simultaneously:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'queue', 'vhost']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: default
  routes:
    - match:
        alertname: RabbitMQQueueNoConsumers
      receiver: rabbitmq-critical
      repeat_interval: 10m
    - match:
        alertname: RabbitMQQueueDepthHigh
      receiver: rabbitmq-warning

receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: rabbitmq-critical
    slack_configs:
      - channel: '#rabbitmq-incidents'
        title: ':fire: RabbitMQ Critical: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
  - name: rabbitmq-warning
    slack_configs:
      - channel: '#rabbitmq-monitoring'
        title: ':warning: RabbitMQ Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
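
The effect of grouping on queue and vhost rather than instance can be illustrated with a small model. This is a sketch of the grouping idea only, not Alertmanager's code, and the alert label sets are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# The same queue reported by two nodes, for example around a leader failover
alerts = [
    {"alertname": "RabbitMQQueueDepthHigh", "queue": "orders", "vhost": "/",
     "instance": "rabbitmq-node1:15692"},
    {"alertname": "RabbitMQQueueDepthHigh", "queue": "orders", "vhost": "/",
     "instance": "rabbitmq-node2:15692"},
]

by_queue = group_alerts(alerts, ["alertname", "queue", "vhost"])
by_instance = group_alerts(alerts, ["alertname", "queue", "vhost", "instance"])
print(len(by_queue), len(by_instance))  # 1 2
```

One group means one notification that lists both alerts. Adding instance to group_by would split them into two separate notifications for the same queue.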

Common Setup Problems

| Problem | Likely cause | Fix |
| --- | --- | --- |
| rabbitmq_detailed_ metrics missing in Prometheus | Detailed scrape job not configured, or targeting a load-balanced endpoint | Add the second scrape job from Step 2 and point it at each node individually, not at a Kubernetes service |
| Per-queue metrics missing for some queues intermittently | Scraping via a load-balanced endpoint rather than per-pod | In Kubernetes, use the ServiceMonitor per-pod approach so each node is scraped directly |
| Zero consumer alert fires on intentionally inactive queues | Queue is dormant by design during off-hours | Add a label exclusion filter: rabbitmq_detailed_queue_consumer_count{queue!~"deferred-.*"} == 0 |
| Growing depth alert fires every time consumers restart | for duration shorter than your restart and catch-up cycle | Increase for: 15m to for: 20m to absorb longer catch-up periods |
| Alert fires on the same queue from multiple nodes | Leader failover causing the reporting node to change mid-window | Group by queue and vhost in Alertmanager routing, not by instance |
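
When writing an exclusion filter like the one in the table, remember that PromQL regex matchers are fully anchored: the pattern has to match the entire label value. The sketch below approximates that with Python's re.fullmatch. Prometheus uses RE2 syntax, which agrees with Python for a simple pattern like this one, and the queue names are hypothetical.

```python
import re


def excluded_by_matcher(queue, pattern="deferred-.*"):
    """True if queue!~"<pattern>" would drop this queue from the result.

    PromQL anchors the regex at both ends, so fullmatch is the closest
    equivalent; re.search would wrongly exclude names that merely
    contain the pattern.
    """
    return re.fullmatch(pattern, queue) is not None


for name in ["deferred-emails", "orders", "not-deferred-jobs", "deferred"]:
    status = "excluded" if excluded_by_matcher(name) else "still alerting"
    print(f"{name}: {status}")
```

Only deferred-emails is excluded here. A queue named not-deferred-jobs keeps alerting because the pattern must match from the first character.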

How Do You Know Why the Queue Is Growing, Not Just That It Is?

When RabbitMQQueueDepthHigh fires, the alert tells you which queue and vhost are affected. It does not tell you whether the depth is growing because consumers crashed, because each message is taking too long to process, or because a downstream service the consumer calls has slowed down.

These three root causes look identical on a queue depth chart. A queue accumulating because of a crashed consumer looks exactly the same as one accumulating because each message now takes three seconds to process instead of 300ms.
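
A back-of-the-envelope model shows why the chart cannot distinguish these cases. The sketch assumes each consumer processes one message at a time at a fixed duration, and all numbers are hypothetical, chosen so the two failure modes produce the same growth.

```python
def backlog_growth_per_sec(arrival_rate, consumers, seconds_per_message):
    """Net queue growth in msgs/sec: arrivals minus total consumer capacity."""
    capacity = consumers / seconds_per_message if consumers else 0.0
    return max(0.0, arrival_rate - capacity)


ARRIVAL_RATE = 10.0  # msgs/sec published to the queue (hypothetical)

healthy = backlog_growth_per_sec(ARRIVAL_RATE, consumers=4, seconds_per_message=0.3)
lost_consumers = backlog_growth_per_sec(ARRIVAL_RATE, consumers=1, seconds_per_message=0.3)
slow_messages = backlog_growth_per_sec(ARRIVAL_RATE, consumers=4, seconds_per_message=1.2)

print(f"healthy: {healthy:.2f} msgs/sec")
print(f"3 of 4 consumers crashed: {lost_consumers:.2f} msgs/sec")
print(f"processing 4x slower: {slow_messages:.2f} msgs/sec")
```

Losing three of four consumers and a fourfold slowdown per message cut capacity by the same amount, so the depth curve rises at the same rate in both cases. Only the consumer count metric or per-message timing tells them apart.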

CubeAPM instruments your RabbitMQ consumer application via OpenTelemetry and captures each message processing cycle as a span in the full distributed trace. When a queue depth alert fires, CubeAPM shows you which consumer instance is slowest, how long each message takes end-to-end through the system, which downstream service call is responsible for the slowdown, and whether the issue is isolated to one consumer pod or affecting all of them. The Prometheus alert tells you something is wrong. CubeAPM tells you what and where. It runs self-hosted inside your own infrastructure at $0.15/GB ingestion pricing, so no data leaves your environment.

Summary

Alerting on RabbitMQ queue depth and consumer count requires two Prometheus scrape jobs: the standard /metrics endpoint for cluster-level health, and /metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count for per-queue alerting. Metrics from the detailed endpoint use the rabbitmq_detailed_ prefix and carry queue and vhost labels.

Alert on absolute depth (threshold-based), on growing depth using deriv() (not rate() – queue depth is a gauge, not a counter), and on zero consumers as a separate critical signal.

| Step | What to configure | Key detail |
| --- | --- | --- |
| Add detailed scrape job | Second job in prometheus.yml targeting /metrics/detailed | Use family=queue_coarse_metrics&family=queue_consumer_count. Scrape each node individually |
| Verify metrics | Query rabbitmq_detailed_queue_messages_ready in Prometheus UI | Must have queue and vhost labels. If absent, check scrape job targets are UP |
| Alert on absolute depth | rabbitmq_detailed_queue_messages_ready > [threshold] for: 5m | Threshold = SLA seconds × write rate. Set per-queue, not universally |
| Alert on growing depth | deriv(rabbitmq_detailed_queue_messages_ready[10m]) > 0 for: 15m | Use deriv() for gauges, not rate(). Catches slow build-ups below a fixed threshold |
| Alert on zero consumers | rabbitmq_detailed_queue_consumer_count == 0 for: 2m | Critical severity. Always alert on this separately |
| Alert on low consumer count | rabbitmq_detailed_queue_consumer_count < N for: 5m | Set N to your expected minimum per queue |
| Configure Alertmanager | Group by queue and vhost | Prevents alert storms when multiple nodes report the same queue condition |

Disclaimer: Metric names, endpoint paths, and alert expressions are verified against RabbitMQ 4.3 documentation (rabbitmq.com/docs/prometheus), the rabbitmq-server GitHub repository, and direct responses from the RabbitMQ core team in the official Google Group as of May 2026.

