Monitoring RabbitMQ, Kafka, and ActiveMQ: The Engineer’s Handbook (2026)

Monitoring RabbitMQ, Kafka, and ActiveMQ is harder than basic uptime checks because each broker fails in a different way. Most messaging broker incidents build up before they become visible. By the time an alert fires, the issue may already be affecting producers, consumers, or downstream services.

The harder problem is that RabbitMQ, Kafka, and ActiveMQ have different architectures and different failure modes. A monitoring setup built for one does not transfer to the others. Queue depth is a critical signal in RabbitMQ. In Kafka, it tells you almost nothing, because consumers pull at their own pace and the broker has no opinion on whether they are keeping up.

This handbook breaks down the metrics, alerts, dashboards, and setup steps engineers need to monitor RabbitMQ, Kafka, and ActiveMQ in 2026. 

Why Message Broker Monitoring Matters in 2026

Three changes in the last two years make existing monitoring setups worth revisiting.

Apache Kafka 4.0, released March 2025, removed ZooKeeper entirely. KRaft is now the only supported coordination mode. Any runbook referencing ZooKeeper connection counts or session errors is obsolete.

RabbitMQ also changed the metadata-store path. Khepri became fully supported in RabbitMQ 4.0 and became the default backend in RabbitMQ 4.2, while upgraded clusters may still run on Mnesia unless Khepri is enabled. That means teams need to know which metadata store their cluster is actually using before trusting old dashboards.

ActiveMQ is still widely used, but its PeerSpot user-engagement mindshare fell from 26.4% to 19.8% by May 2026. Many teams now run ActiveMQ alongside Artemis, Kafka, or other brokers during migrations, which creates separate JMX namespaces, different metrics, and more monitoring work.

Quick Reference: Architecture and Monitoring Interface by Broker

Each broker has a distinct failure profile. That profile determines which metrics are worth alerting on.

| Dimension | RabbitMQ 4.1 | Apache Kafka 4.1 | Apache ActiveMQ 5.x |
|---|---|---|---|
| Architecture | Exchange-and-queue broker | Partitioned distributed log | JMS message broker |
| Throughput profile | High for queues; higher with Streams | Very high with partitions/batching | Moderate; JVM and disk dependent |
| Latency profile | Low, but persistence adds overhead | Batching can add delay | Low to moderate |
| Main failure risks | Memory/disk alarms, blocked publishers | Consumer lag, under-replication | Heap pressure, blocked producers |
| Cluster coordination | Mnesia default; Khepri optional | KRaft only; no ZooKeeper | Shared file, JDBC, or broker network |
| Monitoring interface | HTTP API, UI, Prometheus plugin | JMX and metrics reporters | JMX, Jolokia, JMX exporter |

Monitoring RabbitMQ: The Complete 2026 Guide

RabbitMQ is a push-based broker. Producers send to exchanges, exchanges route into queues, and the broker pushes messages to consumers.

The failure mode behind most RabbitMQ incidents is resource exhaustion. Once the memory or disk threshold is breached, RabbitMQ raises a resource alarm and blocks every connection that publishes. Producers stall before you know about it.

Effective RabbitMQ monitoring means watching resource levels and queue states well before they reach alarm territory.

Critical RabbitMQ Metrics

Queue Depth and Message States

Queue depth is the most actionable signal in RabbitMQ. Messages are in one of three states at any moment:

  • Ready: waiting to be delivered to a consumer
  • Unacknowledged: delivered but not yet acknowledged, meaning the consumer is still processing
  • Total: the sum of both

The split between ready and unacknowledged messages is often more useful than the total queue depth. If unacknowledged messages keep rising while throughput stays flat, consumers may be stuck, slow, or failing to acknowledge messages. If ready messages keep building while unacknowledged stays at zero, the queue may have no active consumers, or consumers may not be receiving messages.
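The ready/unacknowledged split above can be turned into a simple triage rule. The following is a minimal sketch, assuming you already collect two samples of the same queue a few minutes apart (for example the `messages_ready`, `messages_unacknowledged`, and `consumers` fields from /api/queues). The `QueueSample` type, the `diagnose` function, and the returned labels are hypothetical names for illustration, and the sketch ignores throughput, which a fuller check would also compare.

```python
from typing import NamedTuple


class QueueSample(NamedTuple):
    ready: int      # messages waiting for delivery
    unacked: int    # messages delivered but not yet acknowledged
    consumers: int  # consumers currently attached to the queue


def diagnose(prev: QueueSample, curr: QueueSample) -> str:
    """Classify a queue from two samples taken a few minutes apart."""
    ready_growing = curr.ready > prev.ready
    unacked_growing = curr.unacked > prev.unacked

    if curr.consumers == 0 and curr.ready > 0:
        return "no-consumers"
    if ready_growing and curr.unacked == 0:
        # Consumers are attached but nothing is in flight.
        return "consumers-not-receiving"
    if unacked_growing:
        # In-flight messages pile up: slow, stuck, or not acknowledging.
        return "consumers-slow-or-stuck"
    if ready_growing:
        return "backlog-growing"
    return "ok"
```

The order of the checks matters: a queue with zero consumers is reported first because it is the cheapest condition to confirm and fix.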

Memory and Disk Alarms

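RabbitMQ blocks all publishing connections when a node's memory use crosses the high watermark (vm_memory_high_watermark), expressed as a fraction of total system RAM. The long-standing default is 0.4 (40%); recent 4.x releases default to 0.6, so check the value your version actually uses. The disk alarm fires when free disk falls below disk_free_limit, which is 50 MB by default. Both are configurable. Monitor via the HTTP Management API at /api/nodes or the Prometheus endpoint: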

  • rabbitmq_process_resident_memory_bytes: raw memory consumption
  • rabbitmq_disk_space_available_bytes: free disk remaining
  • rabbitmq_alarms_memory_used_watermark: 1 when the memory alarm is active. Page immediately.
  • rabbitmq_alarms_free_disk_space_watermark: 1 when the disk alarm is active
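Because the alarm metrics only flip to 1 once publishers are already blocked, it can help to track how close a node is to each limit. Here is a rough sketch, assuming a node object as returned by /api/nodes with the `mem_used`, `mem_limit`, `disk_free`, `disk_free_limit`, `mem_alarm`, and `disk_free_alarm` fields. The function name and the shape of the result are illustrative choices, not part of RabbitMQ.

```python
def resource_headroom(node: dict) -> dict:
    """Summarise how close one RabbitMQ node is to its resource alarms."""
    # Fraction of the memory watermark already consumed (1.0 == alarm).
    mem_ratio = node["mem_used"] / node["mem_limit"]
    # Free disk as a multiple of the disk alarm limit (1.0 == alarm).
    disk_ratio = node["disk_free"] / node["disk_free_limit"]
    return {
        "memory_fraction_of_watermark": round(mem_ratio, 3),
        "disk_free_multiple_of_limit": round(disk_ratio, 1),
        "alarm_active": bool(node.get("mem_alarm") or node.get("disk_free_alarm")),
    }


if __name__ == "__main__":
    sample_node = {
        "mem_used": 3_000_000_000,
        "mem_limit": 4_000_000_000,
        "disk_free": 500_000_000,
        "disk_free_limit": 50_000_000,
        "mem_alarm": False,
        "disk_free_alarm": False,
    }
    print(resource_headroom(sample_node))
```

Alerting on these ratios (for example, memory above 0.8 of the watermark) is one way to get a warning before the alarm itself fires; the right threshold depends on the workload.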

Connection and Channel Churn

Every RabbitMQ connection consumes an Erlang process and an OS file descriptor. Applications that open connections per request rather than pooling them will eventually exhaust both.

  • rabbitmq_connections: total open connections
  • rabbitmq_channels: total open channels, should scale proportionally with connections
  • rabbitmq_connections_opened_total: counter of connections opened; alert on its rate. A sudden spike almost always indicates a reconnection storm from a brief network partition or broker restart

Consumer Utilisation

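Consumer utilisation (reported as consumer capacity in recent RabbitMQ releases) is the fraction of time the queue can deliver messages to its consumers immediately. A value of 1.0 means consumers are always ready for the next message. A value of 0.3 means the queue could deliver without waiting only 30% of the time, which usually points to a low prefetch count or slow processing and acknowledgement.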

Available per-queue through the Management API. Track it on any queue where depth is growing despite consumers being present.

Message Rates

  • rabbitmq_queue_messages_published_total: messages published per queue
  • rabbitmq_queue_messages_delivered_total: messages delivered to consumers
  • rabbitmq_queue_messages_redelivered_total: redeliveries. A rising rate points to a consumer-side processing bug, not a capacity problem

Setting Up Prometheus and Grafana for RabbitMQ

RabbitMQ has shipped a native Prometheus plugin since version 3.8. One command enables it:

rabbitmq-plugins enable rabbitmq_prometheus

Metrics are available at http://<node>:15692/metrics. Add a Prometheus scrape job and import Grafana Dashboard ID 10991 as a starting point. 
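Before wiring up Prometheus, it can be useful to confirm what the endpoint returns. The sketch below fetches the text exposition and lists any active alarm series. It assumes each sample line has the form `name{labels} value` with no trailing timestamp; `scrape`, `parse_metrics`, and `active_alarms` are hypothetical helper names, and the URL is a placeholder for your own node.

```python
import urllib.request


def scrape(url: str, timeout: float = 5.0) -> str:
    """Fetch the raw Prometheus text exposition from a node."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict:
    """Parse sample lines into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def active_alarms(samples: dict) -> list:
    """Return alarm series whose value is 1 (alarm currently active)."""
    return sorted(
        name for name, value in samples.items()
        if name.startswith("rabbitmq_alarms_") and value == 1.0
    )


if __name__ == "__main__":
    # Placeholder host; replace with a real node name.
    text = scrape("http://localhost:15692/metrics")
    print(active_alarms(parse_metrics(text)))
```

This is a debugging aid rather than a replacement for a Prometheus scrape job, which also handles label parsing, timestamps, and retention.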

The HTTP Management API

The Management API at port 15672 exposes full broker state in JSON. Most useful endpoints: /api/overview for a cluster summary, /api/queues for per-queue state and consumer details, /api/nodes for per-node memory and file descriptor usage.

Use the API for investigation. Use Prometheus scraping for alerting.
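For ad hoc investigation, the Management API can be queried with nothing more than the standard library. This is a minimal sketch, assuming HTTP basic authentication and a read-only monitoring user; the base URL, user name, and password shown are placeholders, and `management_request` and `fetch_json` are hypothetical helper names.

```python
import base64
import json
import urllib.request


def management_request(base_url: str, path: str, user: str, password: str):
    """Build an authenticated GET request for a Management API path."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Basic {token}"},
    )


def fetch_json(request, timeout: float = 5.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder credentials for a read-only monitoring user.
    req = management_request("http://localhost:15672", "/api/queues",
                             "monitor", "change-me")
    for queue in fetch_json(req):
        print(queue["name"], queue.get("messages_ready"),
              queue.get("messages_unacknowledged"), queue.get("consumers"))
```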

Alerting Thresholds

| Metric | When to warn | When to page |
|---|---|---|
| Memory alarm | No warning needed | Alarm is active |
| Disk alarm | No warning needed | Alarm is active |
| Ready messages | Above the queue’s normal range | Sustained growth or 2x baseline |
| Unacknowledged messages | Growing for 10 minutes | No meaningful drop for 30 minutes |
| Critical queue consumers | Below the redundancy target | 0 active consumers |
| File descriptors | Above 70% of OS limit | Above 90% of OS limit |

Security Monitoring for RabbitMQ

RabbitMQ’s native audit logs capture basic user actions but lack filtering and correlation. Supplement them:

  • Enable rabbitmq_auth_backend_ldap and alert on failed LDAP bind attempts
  • Track connection creation rate by source IP. A spike from a single IP on port 5672 is usually credential stuffing or a misconfigured reconnect loop
  • RabbitMQ supports OAuth 2.0 natively. Monitor token expiry and forced disconnections; both look like ordinary consumer churn until you check the reason codes
  • Alert on any new virtual host created in production. It is a rare event outside deployment pipelines
  • Monitor TLS handshake failures. A spike usually means a client certificate misconfiguration or a port scanner

Monitoring Apache Kafka: Consumer Lag, Partition Health, and KRaft

Kafka has a different operating model from RabbitMQ. Producers write records to topic partitions, while consumers read those partitions at their own pace and track progress through offsets.

That makes consumer lag one of the first Kafka signals to watch. If a consumer group stops keeping up, the broker will continue accepting writes, but downstream services may start working with delayed data long before the issue is visible to users.

Critical Kafka Metrics

Consumer Lag

Consumer lag is the gap between the latest offset written to a partition and the offset a consumer group has committed.

Lag = Log End Offset - Current Committed Offset

Steady growth means consumer throughput is consistently below producer throughput. A spike that recovers is usually transient: a slow batch, a GC pause, a brief dependency blip.
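The lag formula and the distinction between steady growth and a recovering spike can be sketched in a few lines. This is an illustration only: it assumes you already have log end offsets and committed offsets keyed by (topic, partition), treats a missing committed offset as 0 (a simplification), and uses hypothetical function names and labels.

```python
def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per (topic, partition) = log end offset - committed offset."""
    return {
        tp: max(end - committed.get(tp, 0), 0)
        for tp, end in end_offsets.items()
    }


def lag_trend(samples: list) -> str:
    """Classify a series of lag readings taken at regular intervals."""
    if len(samples) < 3:
        return "insufficient-data"
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if all(d > 0 for d in deltas):
        # Every interval got worse: consumers are not keeping up.
        return "steady-growth"
    if max(samples) > samples[0] and samples[-1] <= samples[0]:
        # Lag rose and then returned to (or below) where it started.
        return "recovered-spike"
    return "stable-or-mixed"


if __name__ == "__main__":
    ends = {("orders", 0): 120, ("orders", 1): 80}
    done = {("orders", 0): 100, ("orders", 1): 80}
    print(partition_lag(ends, done))
    print(lag_trend([10, 50, 120, 300]))
    print(lag_trend([10, 400, 60, 8]))
```

Classifying the trend rather than the absolute value is the same idea Burrow applies, although its evaluation rules are more elaborate than this sketch.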

Track per consumer group and per partition using:

  • JMX (on the consumer client JVM, not the broker): kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=*, attribute records-lag
  • Burrow (https://github.com/linkedin/Burrow): evaluates lag trends over time rather than absolute values, significantly reducing false positives from normal traffic variability
  • kafka_exporter (https://github.com/danielqsj/kafka_exporter): exposes per-group per-partition lag as Prometheus metrics, which the JMX Exporter alone does not provide

Under-Replicated Partitions

Under-replicated partitions are the most urgent broker health signal in Kafka. A partition is under-replicated when one or more of its replicas has fallen behind the leader and dropped out of the in-sync replica (ISR) set. The effective replication factor drops, and a further broker failure at that point risks data loss.

The main signal to monitor is:

  • JMX: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  • Target: 0 at all times
  • Common causes: disk I/O pressure on a broker, a recently restarted broker still catching up, or a network partition between brokers

Active Controller Count

Exactly one node should hold the active controller role at any time. In KRaft mode that node is the current leader of the controller quorum. The active controller manages partition leadership and handles broker failures.

Use this metric to confirm controller health:

  • JMX: kafka.controller:type=KafkaController,name=ActiveControllerCount
  • Value of 0: no controller elected. The cluster cannot reassign partition leadership
  • Value greater than 1: split-brain condition requiring immediate intervention

Broker Throughput

The core throughput metrics are:

  • kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec: incoming byte rate
  • kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec: outgoing byte rate
  • kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec: message ingestion rate

A drop in BytesInPerSec without a matching drop in producer activity often points to a partition leadership problem on that broker. A spike in BytesOutPerSec without a corresponding business event usually means a consumer is replaying from an older offset.

Request Latency

Start with these request metrics:

  • kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce: end-to-end produce latency
  • kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer: consumer fetch latency
  • For acks=all with three-way replication on healthy hardware, a realistic p99 baseline is 100 to 300 ms. Above 500 ms warrants investigation

KRaft-Specific Metrics (Kafka 4.0 and Later)

With ZooKeeper removed in Kafka 4.0, these KRaft metrics replace the old ZooKeeper health checks:

  • kafka.controller:type=KafkaController,name=LastAppliedRecordOffset: how far each controller has applied the metadata log. A growing gap between controllers indicates a stressed controller quorum
  • kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs: how long events wait before the controller processes them. Elevated values mean the controller is overloaded

Setting Up JMX to Prometheus for Kafka

Kafka exposes broker metrics through JMX, so the usual setup is to run the Prometheus JMX Exporter as a Java agent with each broker JVM. The exporter opens an HTTP endpoint that Prometheus can scrape. Use a Kafka-specific rules file instead of exposing every MBean, otherwise the metric set can become noisy and hard to manage.

Consumer lag needs separate treatment. The JMX Exporter can expose broker and JVM metrics, but group-level lag is usually easier to monitor with kafka_exporter or Burrow. These tools read consumer group offsets and expose lag in a form that works better for Prometheus alerts and Grafana dashboards.

For Amazon MSK, CloudWatch already includes broker and consumer-lag metrics, but the level of detail depends on the monitoring tier. DEFAULT gives basic cluster metrics, while PER_BROKER, PER_TOPIC_PER_BROKER, and PER_TOPIC_PER_PARTITION add more granular broker, topic, and partition-level visibility.

Alerting Thresholds

| Metric | Warning | Critical |
|---|---|---|
| UnderReplicatedPartitions | >0 for more than 5 minutes | >0 outside maintenance; page immediately |
| ActiveControllerCount | Not equal to 1 | 0 or >1; page immediately |
| Consumer lag per group | Growing for 10 minutes | Exceeds SLA processing window |
| OfflinePartitionsCount | >0 | >0; page immediately |
| JVM heap used | >70% | >85% |
| Disk used per broker | >60% | >80% |
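The thresholds in this table can be expressed as a small evaluation function, which is handy for unit-testing alert logic before encoding it in an alerting system. This is a sketch under stated assumptions: the input dictionary keys are hypothetical, `active_controller_count` is taken to be the sum of ActiveControllerCount across all nodes, and consumer lag is left out because its critical threshold depends on your SLA.

```python
def level(value: float, warn: float, crit: float) -> str:
    """Map a percentage to ok / warning / critical."""
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "ok"


def evaluate_kafka(state: dict) -> dict:
    """Apply the warning/critical thresholds to one cluster snapshot."""
    urp = state["under_replicated_partitions"]
    if urp > 0 and not state["in_maintenance"]:
        under_replicated = "critical"
    elif urp > 0 and state["urp_minutes"] > 5:
        # Tolerated briefly during maintenance, but not indefinitely.
        under_replicated = "warning"
    else:
        under_replicated = "ok"

    return {
        "under_replicated": under_replicated,
        # Summed across the cluster, exactly one controller must be active.
        "controller": "ok" if state["active_controller_count"] == 1 else "critical",
        "offline_partitions": "critical" if state["offline_partitions"] > 0 else "ok",
        "heap": level(state["heap_used_pct"], 70, 85),
        "disk": level(state["disk_used_pct"], 60, 80),
    }
```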

Security Monitoring for Kafka

Kafka supports SSL/TLS for encryption and SASL for authentication. Key signals:

  • Failed authentication attempts, exposed per listener as failed-authentication-total and failed-authentication-rate under kafka.server:type=socket-server-metrics. A spike usually means rotated or expired client credentials
  • For SASL/OAUTHBEARER: token refresh failures. A consumer that cannot refresh stops consuming. The resulting lag looks identical to a slow consumer unless you are also tracking auth errors
  • Connections per authenticated principal. An unfamiliar principal or a known one from an unexpected IP warrants a look
  • AWS MSK Audit Logging and Confluent Platform both provide principal-level event tracking. Enable it and route logs to your SIEM

Monitoring Apache ActiveMQ: JVM Health, Blocked Producers, and Store Usage

ActiveMQ is a JMS-centric broker running on the JVM. Its failure modes are JVM heap exhaustion, KahaDB index corruption, and producers blocked by destination memory limits.

ActiveMQ Classic exposes everything through JMX. There is no native Prometheus endpoint without a third-party exporter.

Critical ActiveMQ Metrics

BlockedProducerWaitTime is the most important ActiveMQ metric to watch. When a queue or topic hits its memory limit, ActiveMQ blocks the producer thread until space opens. A non-zero value means producers are stalled right now.

Monitor this at the destination level:

  • JMX MBean: org.apache.activemq:type=Broker,brokerName=*,destinationType=Queue,destinationName=*
  • Attribute: BlockedProducerWaitTime
  • Target: 0. Any non-zero reading is an immediate investigation

ActiveMQ runs on the JVM, so heap pressure can quickly become broker pressure. Messages may sit in memory before being paged to disk, and when garbage collection cannot keep up, the broker can pause, slow down, or block producers.

Start with the standard JVM memory metric:

  • JMX: java.lang:type=Memory, HeapMemoryUsage (used vs max)
  • Warning above 70%; critical above 85%
  • Enable GC logging and correlate pause duration with throughput drops. Long pauses are usually the root cause of intermittent blocking

When a queue has zero consumers, messages accumulate indefinitely. ActiveMQ does not reroute automatically. A queue with no consumers is a silent failure.

Track this on every critical queue:

  • JMX attribute: ConsumerCount on the destination MBean
  • Alert when ConsumerCount on any critical queue drops to zero

Queue depth matters more when it is viewed together with enqueue and dequeue rates. A queue can be large but healthy if it is draining normally; it becomes risky when messages arrive faster than they leave.

Use these destination-level counters together:

  • QueueSize: current depth. Growing depth with healthy consumers signals a volume spike
  • EnqueueCount: cumulative messages written since last restart
  • DequeueCount: cumulative messages acknowledged

Track the rate of EnqueueCount minus the rate of DequeueCount. Sustained positive divergence means the queue is growing faster than it drains.
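Since EnqueueCount and DequeueCount are cumulative and reset on restart, the rate calculation needs to tolerate a counter that goes backwards. Here is a minimal sketch, assuming two JMX readings of the same destination taken a known number of seconds apart; the function names are hypothetical.

```python
def counter_rate(prev: int, curr: int, seconds: float) -> float:
    """Per-second rate of a cumulative counter between two readings."""
    # A lower current value means the broker restarted and the counter
    # began again from zero, so the current value is the whole delta.
    delta = curr - prev if curr >= prev else curr
    return delta / seconds


def net_growth_rate(prev: dict, curr: dict, seconds: float) -> float:
    """Enqueue rate minus dequeue rate; positive means the queue grows."""
    enqueue = counter_rate(prev["EnqueueCount"], curr["EnqueueCount"], seconds)
    dequeue = counter_rate(prev["DequeueCount"], curr["DequeueCount"], seconds)
    return enqueue - dequeue


if __name__ == "__main__":
    before = {"EnqueueCount": 1000, "DequeueCount": 900}
    after = {"EnqueueCount": 1600, "DequeueCount": 1200}
    # 10 msg/s in, 5 msg/s out over 60 seconds.
    print(net_growth_rate(before, after, 60))
```

A single positive reading is not alarming by itself; the signal described above is a positive value that persists across several consecutive intervals.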

KahaDB writes messages sequentially and maintains a separate index. Under heavy writes the index becomes a contention point. After an unclean shutdown, index recovery can take minutes.

Monitor both broker store usage and filesystem usage:

  • StorePercentUsage: percentage of the configured store limit in use. Alert above 70%
  • JMX: org.apache.activemq:type=Broker,brokerName=*,service=PersistenceAdapter,instanceName=*
  • File system usage of the KahaDB data directory. When disk fills, the broker stalls entirely

MemoryPercentUsage and TempPercentUsage show how close ActiveMQ is to its configured memory and temporary storage limits. They usually rise before producers are fully blocked, so they are useful early-warning signals.

Track:

  • MemoryPercentUsage: how much of the broker memory limit is in use
  • TempPercentUsage: how much temporary storage is being used for non-persistent messages

Treat either metric above 90% as urgent. At that point, the broker is close to producer flow control, even if BlockedProducerWaitTime has not started climbing yet.

Setting Up Prometheus Monitoring for ActiveMQ

ActiveMQ Classic does not expose Prometheus metrics by default, so most teams use the Prometheus JMX Exporter.

A basic setup looks like this:

  1. Run the JMX Exporter as a Java agent with the ActiveMQ JVM.
  2. Use an ActiveMQ-specific rules file so you only collect the broker, destination, store, and JVM metrics you need.
  3. Configure Prometheus to scrape the exporter endpoint.
  4. Build alerts for memory usage, temp usage, store usage, queue depth, consumer count, and blocked producers.

The ActiveMQ Web Console at http://<host>:8161/admin is helpful during troubleshooting. It should not replace automated alerts, dashboards, or long-term metric storage.

ActiveMQ Classic vs Artemis: Monitoring Differences

Monitoring is one of the areas teams often underestimate during an ActiveMQ Classic to Artemis migration. Both brokers support JMX, but their metric names and management paths are different.

The main differences are:

  • Classic uses the org.apache.activemq MBean namespace.
  • Artemis uses the org.apache.activemq.artemis namespace.
  • Classic advisory topics such as ActiveMQ.Advisory.* do not have a direct one-to-one replacement in Artemis.
  • Dashboards, alert rules, and runbooks built for Classic usually need to be rebuilt for Artemis.

This does not mean Artemis is harder to monitor. It just means Classic monitoring should not be copied over unchanged.

Unified Observability: Monitoring All Three Brokers Together

Most production environments run more than one broker. Analytics on Kafka. Task queues on RabbitMQ. Legacy integrations on ActiveMQ.

The observability stack needs to cover all of them without creating separate silos that no one crosses during an incident.

Prometheus and Grafana

The standard open-source setup for 2026:

  • RabbitMQ: enable the rabbitmq_prometheus plugin and scrape http://<node>:15692/metrics
  • Kafka: deploy the JMX Exporter as a Java agent on each broker, and run kafka_exporter separately for consumer group lag
  • ActiveMQ: deploy the JMX Exporter with an ActiveMQ-specific MBean rules file

Grafana Dashboard ID 10991 covers RabbitMQ. The Confluent community maintains Kafka dashboards. For ActiveMQ environments, meshIQ’s Infrared360 provides a unified console covering both Classic and Artemis.

OpenTelemetry for Distributed Tracing

Prometheus shows the broker-side symptoms. OpenTelemetry helps connect those symptoms to the producer, consumer, or downstream service involved.

Instrument producers and consumers with OpenTelemetry, then propagate trace context through message headers. This allows a trace to follow the path from message publish to broker interaction to consumer processing, instead of treating each service as a separate incident. OpenTelemetry’s messaging conventions are designed for this kind of producer-consumer workflow.

On the consumer side, one of the most useful signals is the time between message publication and processing start. It shows how long the message waited before a consumer picked it up, which is often the missing link between broker metrics and user-facing delay.
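To make the header propagation and the publish-to-processing delay concrete, here is a hand-rolled sketch using only the standard library. In practice the OpenTelemetry SDK's propagators handle the `traceparent` header for you; this example just shows the W3C Trace Context format it carries. The `x-published-at-ms` header is a hypothetical name chosen for illustration, and the wait time is only meaningful if producer and consumer clocks are reasonably synchronised.

```python
import secrets
import time


def inject_headers(headers: dict, now_ms=None) -> dict:
    """Producer side: add trace context and a publish timestamp."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    out = dict(headers)
    # W3C Trace Context: version-traceid-parentid-flags (01 = sampled).
    out["traceparent"] = f"00-{trace_id}-{span_id}-01"
    stamp = now_ms if now_ms is not None else int(time.time() * 1000)
    out["x-published-at-ms"] = str(stamp)  # hypothetical header name
    return out


def extract_context(headers: dict, now_ms=None) -> dict:
    """Consumer side: recover trace ids and compute the queue wait."""
    _version, trace_id, parent_id, _flags = headers["traceparent"].split("-")
    now = now_ms if now_ms is not None else int(time.time() * 1000)
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "queue_wait_ms": now - int(headers["x-published-at-ms"]),
    }


if __name__ == "__main__":
    sent = inject_headers({"content-type": "application/json"})
    time.sleep(0.05)  # stand-in for time spent in the broker
    print(extract_context(sent))
```

Recording `queue_wait_ms` as a span attribute or histogram lets you compare it directly with broker-side lag and queue depth.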

CubeAPM for Broker Monitoring

CubeAPM integrates with the Prometheus pipeline for all three brokers and adds:

  • Correlated alerting: surface a consumer lag spike alongside the upstream service error rate that caused it
  • Baseline-relative anomaly detection: alert when queue depth deviates from historical patterns rather than fixed thresholds, reducing false positives during expected traffic spikes
  • Unified broker view: RabbitMQ, Kafka, and ActiveMQ side-by-side in one dashboard
  • Trace-to-metric correlation: link OpenTelemetry spans to broker metric events without switching tools

Monitoring Tools Comparison for 2026

| Tool | RabbitMQ | Kafka | ActiveMQ | Best suited for |
|---|---|---|---|---|
| Prometheus + Grafana | Prometheus plugin | JMX Exporter + kafka_exporter | JMX Exporter | Open-source monitoring stacks |
| Datadog | Agent integration | Agent/JMX integration | Agent/JMX integration | Teams already using Datadog |
| Confluent Control Center | No | Yes | No | Confluent Platform users |
| meshIQ Infrared360 | Yes | Yes | Yes | Enterprise messaging estates |
| CubeAPM | Via Prometheus | Via Prometheus | Via JMX Exporter | APM plus broker monitoring |
| Burrow | No | Consumer lag only | No | Kafka lag trend analysis |
| AWS CloudWatch | Amazon MQ RabbitMQ | Amazon MSK | Amazon MQ ActiveMQ | AWS-managed brokers |

Alerting Best Practices

Alerting on raw metrics without context is how teams learn to ignore alerts. A RabbitMQ queue at 50,000 messages may be normal for one workload and critical for another. Alert on baseline changes, growth rate, and business impact, not only fixed thresholds. In Prometheus, functions like predict_linear() and deriv() can help catch queues, lag, or disk usage trending toward failure.
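The idea behind trend-based alerting is a straight-line fit over recent samples, extrapolated forward. The sketch below shows that calculation with a least-squares fit; it is an illustration of the technique, not Prometheus's own code, and it projects from the last sample's timestamp rather than from an evaluation time. The function name and sample data are hypothetical.

```python
def extrapolate(samples: list, horizon_s: float) -> float:
    """Fit a least-squares line to (timestamp, value) pairs and return
    the projected value horizon_s seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


if __name__ == "__main__":
    # Free disk in GB sampled every 60 seconds, falling 10 GB per sample.
    free_gb = [(0, 100.0), (60, 90.0), (120, 80.0)]
    projected = extrapolate(free_gb, horizon_s=480)
    if projected <= 0:
        print("disk projected to fill within 8 minutes")
```

An alert built this way fires on where the metric is heading, which is what makes it useful for disks, queues, and lag that degrade gradually.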

Before adding an alert, define the first action the on-call engineer should take. “Check the dashboard” is not enough. The conditions that most often need a defined first action are:

  • RabbitMQ: memory alarm, disk alarm, zero consumers, unacknowledged messages not dropping
  • Kafka: under-replicated partitions, controller count not equal to 1, consumer lag growing
  • ActiveMQ: blocked producers, heap above 85%, zero consumers, store usage above safe limits

Each alert should point to the likely cause, the dashboard to open, and the command or service owner needed for the next step.

Not every broker alert needs the same urgency. A payment queue falling behind can affect revenue immediately and should page the on-call engineer. A Kafka broker trending high on CPU may be a capacity issue unless it is already affecting throughput, latency, or consumer lag.

Keep customer-impacting alerts separate from infrastructure trend alerts. When both go into the same channel, real incidents get buried in noise.

A growing DLQ usually points to a consumer-side processing bug and deserves its own alert.

  • RabbitMQ: monitor the dead-letter queue behind the configured dead-letter exchange.
  • Kafka: use a dedicated DLQ topic and alert when failed messages are not being reviewed or drained.
  • ActiveMQ: monitor ActiveMQ.DLQ depth through JMX.

In production, even a small DLQ count is worth checking.

In regulated environments, monitoring should prove that delivery guarantees, access controls, and audit trails are working. Native broker metrics help, but they are not enough on their own.

Track delivery assurance signals:

  • Kafka: producer error rate, idempotency issues, and transaction coordinator errors
  • RabbitMQ: publish rate versus confirm rate
  • ActiveMQ: JMS transaction rollback trends

Centralize broker logs in a SIEM or log platform with the right retention policy. Use read-only monitoring credentials, and avoid giving observability tools permission to produce or consume messages unless there is a strong operational reason.

Conclusion

RabbitMQ, Kafka, and ActiveMQ each fail in their own way and surface that failure through different signals. Getting monitoring right means understanding those differences rather than applying the same checklist to all three.

The tooling is mature. The Prometheus pipeline covers all three through the exporter ecosystem. Platforms like CubeAPM add application-level correlation that turns broker metrics into something actionable during an incident rather than another dashboard to scroll through.

Start with the critical alerts listed for each broker under Alerting Best Practices. Get those firing first, then build toward unified observability across your full messaging stack.

Disclaimer: Monitoring recommendations and alert thresholds can vary by broker version, workload, deployment model, and traffic pattern. Treat the examples in this article as a starting point, and tune them against your own RabbitMQ, Kafka, or ActiveMQ environment before using them in production.

FAQs

1. What is the most important metric when monitoring RabbitMQ?

Memory and disk alarms are the most dangerous because they block all producers the moment they fire. After alarms, unacknowledged message count is the most useful day-to-day signal. A climbing unacknowledged count with no rise in throughput tells you consumers are hung, often well before queue depth triggers a threshold alert.

2. How do I monitor Kafka consumer lag?

Kafka provides no built-in consumer lag alerting. kafka_exporter exposes per-group per-partition lag as Prometheus metrics you can alert on in Grafana. Burrow evaluates lag trends over time and reduces false positives from normal traffic variability. For AWS MSK, consumer lag is available natively in CloudWatch under enhanced monitoring.

3. How is monitoring ActiveMQ different from monitoring Artemis?

The interface is the same (JMX), but the MBean namespaces differ entirely. ActiveMQ Classic uses org.apache.activemq; Artemis uses org.apache.activemq.artemis. Advisory messages, a common foundation for Classic monitoring tooling, have no direct equivalent in Artemis. Dashboards and alert rules built for Classic are not portable to Artemis.

4. What changed in Kafka monitoring after the move to KRaft?

ZooKeeper-specific monitoring (connection counts, session errors, ZooKeeper latency) is obsolete. Replace those checks with KRaft metrics: metadata log offset lag between controller nodes, controller event queue time, and active controller count. The controller count alert still applies; the active controller is now elected by the KRaft quorum rather than through ZooKeeper.

5. Do I need a paid tool to monitor all three brokers?

No. The Prometheus and Grafana stack handles all three with open-source components. RabbitMQ ships its own plugin. Kafka uses the JMX Exporter and kafka_exporter. ActiveMQ uses the JMX Exporter. Commercial tools like Datadog, Confluent Control Center, meshIQ Infrared360, and CubeAPM reduce setup overhead and add cross-broker correlation, but they are not a prerequisite.
