How to Set Up a Kafka Consumer Lag Alert with Prometheus

Setting up a Kafka consumer lag alert with Prometheus requires three components working together: kafka_exporter (by danielqsj) to expose consumer group lag as Prometheus metrics, Prometheus to scrape and evaluate alerting rules, and Alertmanager to route and deliver notifications. This guide covers each step end-to-end.

Key Takeaways

  • kafka_exporter is the standard tool for exposing Kafka consumer group lag to Prometheus. It connects directly to Kafka brokers via the Admin API – no JMX configuration required
  • kafka_exporter exposes metrics on port 9308 at /metrics by default
  • The most important lag metric is kafka_consumergroup_lag – per partition, per consumer group, per topic
  • Alert on lag growth rate, not just the absolute lag value. A fixed threshold such as lag > 10000 produces false positives during normal catch-up periods after a consumer restart. A growing-lag alert carries more signal and less noise
  • Do not treat kafka_consumergroup_lag == 0 as proof of health on its own. Zero lag can mean the group is caught up, but a group that is absent or has stopped committing can hide the fact that no one is processing the topic. Pair every lag alert with a check that committed offsets are still advancing
  • Consumer group metrics are not available from kafka_exporter for a group that has no active members and no committed offsets – the metric simply will not be present
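
Because kafka_consumergroup_lag is a gauge that moves in both directions, the function used to measure "growth" matters. The sketch below is a simplified Python model, not Prometheus code: `counter_style_increase` mimics how counter functions such as rate() read any drop as a reset, while `slope_per_second` is a least-squares slope, similar in spirit to deriv(). The function names and sample values are invented for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, lag in records)


def counter_style_increase(samples: List[Sample]) -> float:
    """Simplified counter semantics: a drop is read as a reset to zero, so the
    full post-drop value is counted as new increase (no extrapolation)."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def slope_per_second(samples: List[Sample]) -> float:
    """Least-squares slope of lag over time, in records per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


# Hypothetical consumer catching up after a restart: lag drains by 200 every 30s.
draining = [(0, 1000), (30, 800), (60, 600), (90, 400), (120, 200), (150, 0)]

print(counter_style_increase(draining))  # positive, although lag is falling
print(slope_per_second(draining))        # negative: lag is shrinking
```

In this model the counter-style calculation reports growth while the backlog is actually draining, which is why a slope-based measure is the safer choice for a gauge.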

Step 1: Deploy kafka_exporter

kafka_exporter connects to Kafka brokers using the Admin API and exposes consumer group offsets, log-end offsets, and derived lag values in Prometheus format. It does not require JMX access to the brokers.

Docker (quickstart):

docker run -d \
  --name kafka-exporter \
  -p 9308:9308 \
  danielqsj/kafka-exporter:latest \
  --kafka.server=broker-1:9092 \
  --kafka.server=broker-2:9092 \
  --kafka.server=broker-3:9092

Verify metrics are being served:

curl http://localhost:9308/metrics | grep kafka_consumergroup_lag
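
The same check can be scripted. The sketch below parses Prometheus text-format output and pulls out the kafka_consumergroup_lag samples. The sample payload, group name, and topic name are made up, and the regular expressions are an assumption that only covers simple lines with no escaped quotes and no trailing timestamp.

```python
import re
from typing import Dict, Tuple

LagKey = Tuple[str, str, str]  # (consumergroup, topic, partition)

LINE_RE = re.compile(r'^kafka_consumergroup_lag\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_lag(metrics_text: str) -> Dict[LagKey, float]:
    """Return lag per (consumergroup, topic, partition) from exposition text."""
    result: Dict[LagKey, float] = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels["consumergroup"], labels["topic"], labels["partition"])
        result[key] = float(match.group("value"))
    return result


# Hypothetical payload, shaped like the output of the curl command above.
SAMPLE = """\
# TYPE kafka_consumergroup_lag gauge
kafka_consumergroup_lag{consumergroup="orders-service",partition="0",topic="orders"} 42
kafka_consumergroup_lag{consumergroup="orders-service",partition="1",topic="orders"} 0
kafka_topic_partition_current_offset{partition="0",topic="orders"} 1050
"""

for key, lag in parse_lag(SAMPLE).items():
    print(key, lag)
```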

Docker Compose (with a Kafka cluster):

services:
  kafka-exporter:
    image: danielqsj/kafka-exporter:latest
    command:
      - "--kafka.server=kafka-1:9092"
      - "--kafka.server=kafka-2:9092"
      - "--kafka.server=kafka-3:9092"
      - "--topic.filter=.*"
      - "--group.filter=.*"
    ports:
      - "9308:9308"
    restart: unless-stopped

Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-exporter
  template:
    metadata:
      labels:
        app: kafka-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9308"
    spec:
      containers:
        - name: kafka-exporter
          image: danielqsj/kafka-exporter:latest
          args:
            - --kafka.server=kafka-0.kafka-headless.kafka.svc.cluster.local:9092
            - --kafka.server=kafka-1.kafka-headless.kafka.svc.cluster.local:9092
            - --kafka.server=kafka-2.kafka-headless.kafka.svc.cluster.local:9092
            - --topic.filter=^[^_].*
            - --group.filter=.*
          ports:
            - containerPort: 9308
              name: metrics
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-exporter
  namespace: monitoring
spec:
  ports:
    - port: 9308
      targetPort: 9308
      name: metrics
  selector:
    app: kafka-exporter

The --topic.filter=^[^_].* flag excludes internal Kafka topics (which start with _ such as __consumer_offsets and __transaction_state) from the metrics output. This reduces noise and avoids alerting on internal bookkeeping topics.

Note on consumer group metrics availability: kafka_exporter only exposes consumer group lag metrics when there is an active consumer group. If a consumer group has no active members and no committed offsets, the kafka_consumergroup_lag metric for that group will simply not appear in the output. Design your alerts to handle missing data appropriately.
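
One way to handle missing data is to keep an explicit list of the consumer groups you expect to see and flag any that have no lag series at all. The sketch below shows that comparison in Python. The group names and the `missing_groups` helper are hypothetical. In Prometheus itself, a comparable check would typically be written with absent() for each expected group.

```python
from typing import Iterable, List, Set, Tuple

LagKey = Tuple[str, str, str]  # (consumergroup, topic, partition)


def missing_groups(expected: Iterable[str], observed: Iterable[LagKey]) -> List[str]:
    """Return expected consumer groups that have no lag series at all."""
    seen: Set[str] = {group for group, _topic, _partition in observed}
    return sorted(set(expected) - seen)


# Hypothetical inventory of groups that should always be consuming.
EXPECTED_GROUPS = ["orders-service", "billing-service"]

# Hypothetical series currently exposed by the exporter.
observed_series = [
    ("orders-service", "orders", "0"),
    ("orders-service", "orders", "1"),
]

print(missing_groups(EXPECTED_GROUPS, observed_series))  # ['billing-service']
```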

Step 2: Configure Prometheus to Scrape kafka_exporter

Add a scrape job to your prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "kafka-alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: kafka-exporter
    scrape_interval: 30s
    scrape_timeout: 25s
    static_configs:
      - targets:
          - kafka-exporter:9308

A scrape interval of 30 seconds is a reasonable starting point for kafka_exporter. Scraping more frequently increases load on the broker’s Admin API without adding precision that matters for lag alerting, since the rules in this guide evaluate trends over minutes.

For Kubernetes with Prometheus Operator, use a ServiceMonitor instead:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: kafka-exporter
  endpoints:
    - port: metrics
      interval: 30s
      scrapeTimeout: 25s

Step 3: Understand the Key Metrics

Before writing alert rules, confirm these metrics are populating correctly by running PromQL queries in your Prometheus UI:

# All consumer group lag values currently being tracked
kafka_consumergroup_lag

# Log-end offset per partition
kafka_topic_partition_current_offset

# Current committed offset per consumer group
kafka_consumergroup_current_offset

# Total lag per consumer group across all partitions of a topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)

Label reference for kafka_consumergroup_lag:

  • consumergroup – the consumer group ID
  • topic – the topic name
  • partition – the partition number (string)

Step 4: Write the Alert Rules

Create kafka-alerts.yml with production-grade rules. The rules below progress from most critical to supporting signals.

groups:
  - name: kafka-consumer-lag
    rules:
      # -------------------------------------------------------
      # WARNING: Consumer lag is continuously growing
      # This is more reliable than a fixed threshold because it
      # fires only when lag is genuinely increasing, not during
      # normal catch-up after a consumer restart.
      # deriv() is used because lag is a gauge; rate() is meant
      # for counters and reads every drop in lag as a reset.
      # -------------------------------------------------------
      - alert: KafkaConsumerLagGrowing
        expr: |
          deriv(kafka_consumergroup_lag[10m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag is growing"
          description: >
            Consumer group {{ $labels.consumergroup }} is falling behind
            on topic {{ $labels.topic }} partition {{ $labels.partition }}.
            Lag is growing at {{ $value | humanize }} records/second.
            Current lag: {{ with query (printf "kafka_consumergroup_lag{consumergroup='%s',topic='%s',partition='%s'}" $labels.consumergroup $labels.topic $labels.partition) }}{{ . | first | value | humanize }}{{ end }} records.

      # -------------------------------------------------------
      # CRITICAL: Lag exceeds absolute threshold
      # Set this to a value meaningful for your SLA.
      # The 'for' duration prevents false positives during
      # normal traffic spikes that briefly exceed the threshold.
      # -------------------------------------------------------
      - alert: KafkaConsumerLagHigh
        expr: |
          kafka_consumergroup_lag > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag is critically high"
          description: >
            Consumer group {{ $labels.consumergroup }} has lag of
            {{ $value | humanize }} records on topic {{ $labels.topic }}
            partition {{ $labels.partition }}.

      # -------------------------------------------------------
      # WARNING: Consumer group has a backlog but is not committing
      # A lag value alone cannot distinguish a slow consumer from
      # a stopped one. Check offset movement to tell them apart:
      # a consumer group that hasn't committed recently on a
      # partition with outstanding lag is likely stalled or absent.
      # -------------------------------------------------------
      - alert: KafkaConsumerGroupStalled
        expr: |
          (
            kafka_consumergroup_lag > 0
          ) and (
            increase(kafka_consumergroup_current_offset[10m]) == 0
          )
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer group is not making progress"
          description: >
            Consumer group {{ $labels.consumergroup }} has not committed
            any new offsets in the last 10 minutes on topic
            {{ $labels.topic }} partition {{ $labels.partition }},
            but lag is {{ $value | humanize }} records.

      # -------------------------------------------------------
      # WARNING: Topic has messages but no consumer group is
      # tracking it at all. This fires when a partition has a
      # non-zero log-end offset but no consumer group has a
      # committed offset for it.
      # -------------------------------------------------------
      - alert: KafkaTopicHasNoConsumers
        expr: |
          kafka_topic_partition_current_offset > 0
          unless on(topic, partition)
          kafka_consumergroup_current_offset
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka topic partition has no consumer groups"
          description: >
            Topic {{ $labels.topic }} partition {{ $labels.partition }}
            has messages (offset {{ $value | humanize }}) but no
            consumer group is tracking it.

Step 5: Configure Alertmanager Routing

Create alertmanager.yml to route Kafka lag alerts to the right channel:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'topic', 'consumergroup']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: default
  routes:
    - match:
        alertname: KafkaConsumerLagHigh
      receiver: kafka-critical
      repeat_interval: 15m
    - match:
        alertname: KafkaConsumerLagGrowing
      receiver: kafka-warning

receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: kafka-critical
    slack_configs:
      - channel: '#kafka-incidents'
        title: ':fire: Kafka Critical: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
  - name: kafka-warning
    slack_configs:
      - channel: '#kafka-monitoring'
        title: ':warning: Kafka Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

Grouping by topic and consumergroup prevents alert storms when many partitions of the same topic are lagging simultaneously – they are grouped into one notification rather than firing individually per partition.
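
To make the effect of group_by concrete, the sketch below buckets a set of alerts by label values in plain Python. It is only an illustration of the grouping idea, not Alertmanager's implementation, and the `group_alerts` helper and label values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # label name -> label value


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels (missing label -> "")."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical: all 20 partitions of one topic are lagging for the same group.
alerts = [
    {"alertname": "KafkaConsumerLagHigh", "topic": "orders",
     "consumergroup": "orders-service", "partition": str(p)}
    for p in range(20)
]

grouped = group_alerts(alerts, ["alertname", "topic", "consumergroup"])
per_partition = group_alerts(alerts, ["alertname", "topic", "consumergroup", "partition"])
print(len(grouped), len(per_partition))  # 1 20
```

Leaving partition out of the grouping key collapses the 20 alerts into a single bucket, which corresponds to one notification.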

Alert Threshold Guidance

There is no universal lag threshold. The right value depends entirely on your topic’s write rate and your SLA.

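  • For growing lag alerts: Use deriv(kafka_consumergroup_lag[10m]) > 0 for 15m as your baseline. Lag is a gauge, so deriv() (the per-second slope over the window) is the appropriate function; rate() is meant for counters and misreads every drop in lag as a counter reset. The alert fires only when the 10-minute trend has stayed positive for 15 consecutive minutes – not during normal catch-up periods. Tune the for duration based on how quickly your team can respond.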
  • For absolute lag alerts: Calculate the threshold from your SLA, not from an arbitrary number:

Threshold = Acceptable delay in seconds × Write rate (records/second)

Example: If you can tolerate 60 seconds of delay and your topic receives 500 records per second, your threshold is 30,000 records. A topic receiving 5 records per second with the same 60-second tolerance gives a threshold of 300 records.
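
The arithmetic above can be captured in a small helper. This is a minimal sketch: the function names are hypothetical, and the per-partition split assumes writes are spread evenly across partitions, which may not hold for keyed topics.

```python
def lag_threshold(acceptable_delay_s: float, write_rate_per_s: float) -> float:
    """Records of lag that correspond to the acceptable processing delay."""
    if acceptable_delay_s < 0 or write_rate_per_s < 0:
        raise ValueError("delay and write rate must be non-negative")
    return acceptable_delay_s * write_rate_per_s


def per_partition_threshold(total_threshold: float, partitions: int) -> float:
    """Assumption: writes are spread evenly across partitions."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return total_threshold / partitions


print(lag_threshold(60, 500))  # 30000
print(lag_threshold(60, 5))    # 300
print(per_partition_threshold(lag_threshold(60, 500), 20))  # 1500.0
```

The per-partition figure matters when the alert expression compares kafka_consumergroup_lag directly, because that metric is reported per partition.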

Per-partition vs total lag: kafka_consumergroup_lag is reported per partition. A value of 10,000 on a topic with 20 partitions means one partition is 10,000 records behind; it says nothing about the other 19, and it is not a topic-wide total. Use sum by (consumergroup, topic) to get the total topic lag.
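
The difference between a per-partition value and a topic total can be seen in a short aggregation sketch. It mirrors what sum by (consumergroup, topic) does, using made-up lag values and a hypothetical `total_lag` helper.

```python
from collections import defaultdict
from typing import Dict, Tuple

LagKey = Tuple[str, str, str]  # (consumergroup, topic, partition)


def total_lag(per_partition: Dict[LagKey, float]) -> Dict[Tuple[str, str], float]:
    """Sum per-partition lag into one value per (consumergroup, topic)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for (group, topic, _partition), lag in per_partition.items():
        totals[(group, topic)] += lag
    return dict(totals)


# Hypothetical: 19 partitions nearly caught up, one hot partition far behind.
lags: Dict[LagKey, float] = {
    ("orders-service", "orders", str(p)): 50.0 for p in range(20)
}
lags[("orders-service", "orders", "7")] = 10_000.0

print(total_lag(lags))  # {('orders-service', 'orders'): 10950.0}
```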

Grafana Dashboard

Once the alerts are in place, add a Grafana dashboard to visualize lag trends. Grafana dashboard ID 7589 (“Kafka Exporter Overview”) is the community dashboard referenced in the kafka_exporter README. Import it via Dashboards > Import > Enter ID 7589.

Key panels to verify after setup:

  • Consumer group lag per partition (time series)
  • Lag rate of change (derivative panel)
  • Consumer group offset progress (ensures the group is committing)
  • Topic partition current offset (log-end, shows producer activity)

Common Setup Problems

  • Metrics not appearing for a consumer group: kafka_exporter only shows consumer group metrics when the group has active members or committed offsets that Kafka can report. A group that has never connected will not appear. Run a consumer to generate activity, then check again.
  • for duration causing delayed alerts: The for clause in Prometheus alerting rules requires the condition to be continuously true for the specified duration before the alert fires. A lag spike that recovers within the for window will never fire. This is usually desirable – it prevents false positives during consumer restarts. If you need faster alerting, reduce the for duration but expect more noise.
  • Alert storms from per-partition firing: Without group_by in Alertmanager, a topic with 20 partitions all lagging will generate 20 separate alert notifications. Always group by consumergroup and topic at a minimum.
  • kafka_exporter cannot connect to brokers: If running in Kubernetes, ensure the exporter pod can reach the Kafka broker service on port 9092. If TLS is enabled on the brokers, pass the appropriate --tls.* flags to kafka_exporter.
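
The behaviour of the for clause described in the list above can be modelled in a few lines. This is a simplified sketch of the pending-then-firing logic, not Prometheus code; the `first_firing_time` helper and the evaluation sequences are hypothetical.

```python
from typing import List, Optional


def first_firing_time(condition: List[bool], interval_s: int, for_s: int) -> Optional[int]:
    """Return the evaluation time (seconds) at which the alert first fires, or None.

    Simplified model: the alert becomes pending at the first true evaluation,
    fires once the condition has held for at least for_s, and resets on any false.
    """
    pending_since: Optional[int] = None
    for i, is_true in enumerate(condition):
        now = i * interval_s
        if not is_true:
            pending_since = None  # condition cleared: start over
            continue
        if pending_since is None:
            pending_since = now
        if now - pending_since >= for_s:
            return now
    return None


# Hypothetical: evaluations every 15s, 'for: 5m' as in KafkaConsumerLagHigh.
spike = [True] * 16 + [False] * 10  # condition true for under 4 minutes
sustained = [True] * 30             # condition true for over 7 minutes

print(first_firing_time(spike, 15, 300))      # None
print(first_firing_time(sustained, 15, 300))  # 300
```

A spike shorter than the for window never fires in this model, while a sustained condition fires once the window has elapsed.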

How Do I Know Which Consumer Is Slow, Not Just That Lag Exists?

kafka_consumergroup_lag fires your alert and tells you which consumer group and topic are affected. It does not tell you which instance of the consumer application is processing slowest, which message type is taking longest to handle, or whether the bottleneck is within the consumer’s own code or a downstream service it is calling.

When a lag alert fires and the Grafana dashboard shows the lag graph climbing, the next question is always: is this a throughput problem (not enough consumer instances), a processing problem (each message takes too long), or a downstream dependency problem (the consumer is waiting on something external)?

CubeAPM instruments your Kafka consumer application via OpenTelemetry and captures each message processing cycle as a span in the full distributed trace. When a lag alert fires, CubeAPM shows you which consumer instance is slowest, how long each message takes to process end-to-end, which downstream service calls are consuming the most time per message, and whether lag is concentrated on specific message types or is evenly distributed. The Prometheus alert tells you something is wrong. CubeAPM tells you what and where. Self-hosted inside your own infrastructure, no data leaves your environment.

Summary

Step | What to configure | Key decision
Deploy kafka_exporter | Docker, Compose, or Kubernetes | Port 9308, connect to all brokers
Configure scraping | Add job to prometheus.yml or ServiceMonitor | 30s scrape interval
Write growing-lag alert | deriv(kafka_consumergroup_lag[10m]) > 0 for 15m | Most reliable lag signal
Write absolute-lag alert | Threshold = SLA in seconds × write rate | Topic-specific, not universal
Write stalled consumer alert | lag > 0 AND offset not changing | Catches stopped consumers with a backlog
Configure Alertmanager | Group by consumergroup and topic | Prevents per-partition alert storms

Disclaimer: Configurations, metric names, and alert thresholds are for guidance only – verify against the current kafka_exporter documentation and Prometheus alerting documentation before applying to production. kafka_exporter metric names and flags may change between releases. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.
