CubeAPM

How Do I Monitor Kafka Consumer Group Lag?

If you run Apache Kafka in production, Kafka consumer group lag is one of the most important metrics to watch. When lag grows, your consumers are falling behind your producers. Messages pile up, real-time guarantees break, and downstream systems suffer. Left unattended, growing lag can cascade into data pipeline failures and missed SLAs.

This guide explains what Kafka consumer group lag is, why it happens, how to measure it using multiple methods, and what you can do to bring it under control.

Key Takeaways
  • Kafka consumer group lag is the gap between the log-end offset (latest message produced) and the committed offset (last message consumed) across all partitions in a consumer group.
  • A stable lag of a few hundred messages is normal. A continuously growing lag signals a problem.
  • You can check lag instantly with kafka-consumer-groups.sh, or continuously via JMX, Prometheus with kafka_lag_exporter, or Burrow.
  • Common causes include slow message processing, partition skew, insufficient consumer instances, and misconfigured fetch parameters.
  • Fixes include adding consumers (up to the partition count), optimizing processing logic, increasing partition count, and tuning configuration.

What Is Kafka Consumer Group Lag?

In Apache Kafka, every partition in a topic maintains a sequence of messages. Each message has a monotonically increasing integer called an offset. Two offsets are critical for understanding lag:

  • Log-End Offset (LEO): The offset of the latest message written to the partition by the producer.
  • Consumer Committed Offset: The offset of the last message that the consumer group has processed and committed.

Lag for a single partition is simply:

Lag = Log-End Offset - Consumer Committed Offset

For a consumer group reading a topic with multiple partitions, total lag is the sum of per-partition lags across all partitions assigned to that group.

Example: A topic has 3 partitions. Partition P1 has lag 0, P2 has lag 300, P3 has lag 200. Total consumer group lag = 500. This means 500 messages are waiting to be consumed.
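The formula and the three-partition example can be reproduced in a few lines of Python. This is a minimal sketch: the offset values are hypothetical and were chosen only so that the per-partition lags match the example (0, 300, 200).

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition, clamped at zero in case the two
    offsets were read at slightly different moments."""
    return max(log_end_offset - committed_offset, 0)


# Hypothetical (log-end offset, committed offset) pairs per partition.
offsets = {
    "P1": (45230, 45230),
    "P2": (45200, 44900),
    "P3": (43700, 43500),
}

per_partition = {
    name: partition_lag(leo, committed)
    for name, (leo, committed) in offsets.items()
}
total_lag = sum(per_partition.values())

print(per_partition)  # lag per partition
print(total_lag)      # total consumer group lag
```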

Some lag is always present because producers and consumers run asynchronously. A stable lag of a few hundred messages in a high-throughput system is normal. What you should worry about is lag that grows continuously over time, because that means your consumers are structurally unable to keep up.

Common Causes of Kafka Consumer Group Lag

Understanding why lag happens is the first step toward fixing it. The most frequent causes are:

Slow message processing is the most common cause. If a consumer takes 50 ms to process each message, it tops out at 20 messages per second. If 1,000 messages per second arrive on a single partition, that consumer needs 50 seconds to process what arrived in one second, so lag grows without bound. Typical slow-processing culprits include synchronous database writes per message instead of batching, HTTP calls to external APIs inside the processing loop, heavy transformations such as parsing large JSON objects or running regex, and locking or contention in multi-threaded consumers.
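The arithmetic behind that example is easy to check. The sketch below assumes a single-threaded consumer that handles one message at a time; the function names are illustrative, not part of any Kafka API.

```python
import math


def max_throughput_per_consumer(processing_ms: float) -> float:
    """Messages per second one single-threaded consumer can process."""
    return 1000.0 / processing_ms


def consumers_needed(arrival_rate_per_s: float, processing_ms: float) -> int:
    """Smallest number of parallel consumers that keeps up with the arrival rate."""
    return math.ceil(arrival_rate_per_s / max_throughput_per_consumer(processing_ms))


def lag_growth_per_s(arrival_rate_per_s: float, processing_ms: float, consumers: int = 1) -> float:
    """How many messages per second are added to the backlog (never negative)."""
    capacity = consumers * max_throughput_per_consumer(processing_ms)
    return max(arrival_rate_per_s - capacity, 0.0)


print(max_throughput_per_consumer(50))   # capacity of one consumer
print(consumers_needed(1000, 50))        # consumers required to keep up
print(lag_growth_per_s(1000, 50))        # backlog growth with one consumer
```

Keep in mind that a single partition can only be read by one consumer in a group, so needing many consumers also means needing at least that many partitions, or faster processing per message.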

Partition skew is another cause. If your producer uses message keys for partitioning and certain keys appear far more frequently than others, some partitions receive significantly more traffic than others. Because each partition is consumed by exactly one consumer in a group, the consumer handling the hot partition will lag even when others are idle.
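A small simulation shows how a single hot key skews partition load. This is a sketch under stated assumptions: the traffic mix is made up, and CRC32 is used as a stand-in hash (Kafka's default partitioner uses murmur2), so the exact partition numbers will differ from a real cluster while the skew effect is the same.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def messages_per_partition(keys, num_partitions: int) -> list:
    """Count how many messages land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical traffic: one hot key produces 70% of all messages.
keys = ["customer-hot"] * 700 + [f"customer-{i}" for i in range(300)]
load = messages_per_partition(keys, 3)

print(load)  # one partition carries at least 700 of the 1,000 messages
```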

Too few consumer instances also cause lag. Within a consumer group, each partition is read by at most one consumer. If you have 12 partitions and only 3 consumers, each consumer processes 4 partitions. Adding more consumers (up to the partition count) distributes the work. Beyond that limit, additional consumers sit idle.
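The relationship between partition count and consumer count can be sketched as follows. The helper is hypothetical and assumes the assignor spreads partitions as evenly as possible, which is typical but depends on the assignment strategy in use.

```python
import math


def assignment_summary(partitions: int, consumers: int) -> dict:
    """Summarize how many consumers do work and how loaded the busiest one is."""
    active = min(partitions, consumers)   # at most one consumer per partition
    idle = consumers - active             # extra consumers receive nothing
    busiest = math.ceil(partitions / active) if active else 0
    return {
        "active_consumers": active,
        "idle_consumers": idle,
        "max_partitions_per_consumer": busiest,
    }


print(assignment_summary(12, 3))   # the example from the text
print(assignment_summary(12, 15))  # more consumers than partitions
```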

A sudden surge in producer throughput, such as a batch import or an event-driven spike, can temporarily push lag higher. If consumers can process fast enough under normal conditions, lag typically recovers. If it does not recover, you have an underlying capacity problem.

Kafka consumer performance is sensitive to configuration. Key parameters that affect throughput include:

  • fetch.min.bytes: Minimum data the broker should return in a fetch request. Too high causes unnecessary waiting.
  • fetch.max.wait.ms: Maximum time the broker waits before responding to a fetch request.
  • max.poll.records: Maximum number of records returned in a single poll() call. Setting this too low limits throughput.
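As a rough illustration, the snippet below starts from Kafka's documented defaults for these three settings and layers a hypothetical set of overrides on top. The override values are placeholders for a throughput-oriented consumer, not recommendations; the right numbers depend on message size and processing time.

```python
# Kafka consumer defaults for the three settings discussed above.
DEFAULTS = {
    "fetch.min.bytes": 1,
    "fetch.max.wait.ms": 500,
    "max.poll.records": 500,
}

# Hypothetical overrides; values are illustrative only.
OVERRIDES = {
    "fetch.min.bytes": 65536,    # wait for larger batches per fetch
    "max.poll.records": 2000,    # hand more records to each poll() call
}


def to_properties(config: dict) -> str:
    """Render a config dict in Java .properties format, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


effective = {**DEFAULTS, **OVERRIDES}
print(to_properties(effective))
```

If you raise max.poll.records, check that a full batch can still be processed within max.poll.interval.ms, otherwise the larger batches can trigger rebalances.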

When consumers join or leave a group, Kafka triggers a rebalance. During a rebalance, all consumers in the group pause consumption temporarily. Frequent rebalances (caused by slow poll loops, long processing times exceeding max.poll.interval.ms, or frequent restarts) accumulate lag quickly.

How to Monitor Kafka Consumer Group Lag

There are several reliable methods to monitor Kafka consumer group lag, from simple CLI tools to production-grade observability stacks.

Method 1: kafka-consumer-groups.sh (Built-In CLI)

The fastest way to check lag is with the kafka-consumer-groups.sh script that ships with every Kafka installation. Run the following command:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group your-consumer-group

The output looks like this:

GROUP       TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
your-group  orders  0          45230           45230           0    consumer-1
your-group  orders  1          44900           45200           300  consumer-2
your-group  orders  2          43100           43700           600  consumer-3

Key columns to watch:

  • CURRENT-OFFSET: The last committed offset for that partition.
  • LOG-END-OFFSET: The latest offset produced to that partition.
  • LAG: The difference. This is your per-partition lag.
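If you want to script a quick check on top of the CLI, the output can be parsed with plain Python. This sketch assumes the first non-empty line is the column header and locates columns by name, since real output may contain additional columns such as HOST and CLIENT-ID. Rows whose LAG value is not a number are skipped.

```python
SAMPLE = """\
GROUP       TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
your-group  orders  0          45230           45230           0    consumer-1
your-group  orders  1          44900           45200           300  consumer-2
your-group  orders  2          43100           43700           600  consumer-3
"""


def parse_lag(describe_output: str) -> dict:
    """Return {(topic, partition): lag} parsed from --describe output."""
    lines = [line for line in describe_output.splitlines() if line.strip()]
    header = lines[0].split()
    topic_i = header.index("TOPIC")
    part_i = header.index("PARTITION")
    lag_i = header.index("LAG")

    lags = {}
    for line in lines[1:]:
        cols = line.split()
        # Skip rows without a numeric lag value.
        if len(cols) <= lag_i or not cols[lag_i].isdigit():
            continue
        lags[(cols[topic_i], int(cols[part_i]))] = int(cols[lag_i])
    return lags


lags = parse_lag(SAMPLE)
print(lags)
print(sum(lags.values()))  # total lag for the group
```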

To list all consumer groups first, run:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

This method is useful for quick manual checks but is not suitable for continuous monitoring or alerting.

Method 2: JMX Metrics

Kafka exposes consumer group metrics over JMX (Java Management Extensions). The primary metric for lag is:

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*
  records-lag-max
  records-lag-avg

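records-lag-max reports the maximum lag across all partitions consumed by that client. This is a client-side metric exposed by the consumer application itself, not the broker. To use it, enable JMX on the JVM that runs your consumer. For the console consumer that ships with Kafka, setting JMX_PORT before starting it is enough; your own application needs the equivalent com.sun.management.jmxremote JVM options:

export JMX_PORT=9999
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --group your-consumer-group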

Important caveat: JMX metrics are scoped to a single consumer instance. They do not give you a consumer-group-wide view across all partitions in one place. For group-level visibility, use a dedicated lag monitoring tool.

Method 3: Prometheus with kafka_lag_exporter or kminion

For production monitoring with alerting, the most common approach is to expose lag as a Prometheus metric and visualize it in Grafana.

kafka_lag_exporter (open-sourced by Lightbend) connects to your Kafka cluster, fetches consumer group offsets, and exposes them as Prometheus metrics. Install it and it will expose metrics such as:

kafka_consumergroup_group_lag{group="your-group",topic="orders",partition="0"} 300

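kminion is a popular alternative that also exposes Kafka consumer group lag as Prometheus metrics with low overhead. Both tools compare the group's committed offsets (obtained through the Kafka Admin API or, in kminion's case, optionally by consuming the __consumer_offsets internal topic) with the log-end offsets reported by the brokers.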

A useful PromQL query to calculate total consumer group lag:

sum(kafka_consumergroup_group_lag{group="your-group"}) by (group, topic)

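To detect growing lag, look at the slope of the metric (a positive value means consumers are falling behind). Because lag is a gauge rather than a counter, use deriv() instead of rate():

deriv(kafka_consumergroup_group_lag[5m])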
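One way to quantify whether lag is growing is the slope of a straight-line fit over recent samples, which is the idea behind PromQL's deriv() function for gauges. The sketch below computes that slope in plain Python; the sample values are made up for illustration.

```python
def lag_slope(samples) -> float:
    """Least-squares slope, in messages per second, of (timestamp_s, lag) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    numerator = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must have distinct timestamps")
    return numerator / denominator


# Hypothetical samples taken once a minute.
growing = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340)]
stable = [(0, 300), (60, 300), (120, 300)]

print(lag_slope(growing))  # positive: consumers are falling behind
print(lag_slope(stable))   # zero: lag is steady
```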

Method 4: Burrow by LinkedIn

Burrow is a dedicated Kafka consumer lag monitoring companion open-sourced by LinkedIn. Unlike simple offset-difference checks, Burrow evaluates lag over a sliding window of offsets to classify consumer group status as OK, WARNING, or ERROR. This means it distinguishes between a consumer that is stable (lag is constant because the topic has no new messages) and a consumer that is genuinely falling behind.
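To make the sliding-window idea concrete, here is a deliberately simplified classifier loosely modelled on it. It is not Burrow's actual evaluator, which has more rules and more states; the status names follow this article and the decision rules are assumptions for illustration.

```python
def classify(window) -> str:
    """Classify a partition from (committed_offset, lag) samples, oldest first.

    Simplified illustration only, not Burrow's real rule set.
    """
    if len(window) < 2:
        return "OK"
    offsets = [offset for offset, _ in window]
    lags = [lag for _, lag in window]

    # The consumer caught up at least once inside the window.
    if any(lag == 0 for lag in lags):
        return "OK"
    # Messages are waiting but the consumer committed nothing new.
    if offsets[-1] == offsets[0]:
        return "ERROR"
    # The consumer is committing, but lag rose between every pair of samples.
    if all(later > earlier for earlier, later in zip(lags, lags[1:])):
        return "WARNING"
    return "OK"


print(classify([(100, 0), (100, 0), (100, 0)]))       # idle topic, caught up
print(classify([(100, 50), (100, 80), (100, 120)]))   # stalled consumer
print(classify([(100, 50), (150, 80), (200, 120)]))   # working but falling behind
```

The point of the window is that a single offset difference cannot tell these three cases apart, while a short history can.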

Burrow exposes an HTTP API. A basic query to check a consumer group:

GET /v3/kafka/{cluster}/consumer/{group}/lag

Burrow does not store metrics itself but can integrate with notification systems and can be paired with Prometheus via a Burrow exporter for dashboards.

Method 5: OpenTelemetry Collector with Kafka Metrics Receiver

If your observability stack is built on OpenTelemetry, the Kafka metrics receiver in the OpenTelemetry Collector contrib distribution can scrape consumer group lag directly. A minimal configuration:

receivers:
  kafkametrics:
    brokers:
      - kafka-1:9092
      - kafka-2:9092
    protocol_version: 2.8.0
    collection_interval: 15s
    scrapers:
      - topics
      - consumers

Key metrics exposed:

  • kafka.consumer_group.lag: Per-partition lag for each consumer group
  • kafka.consumer_group.offset: Current committed offset
  • kafka.consumer_group.members: Number of members in the consumer group

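Total consumer group lag across all partitions (shown with the OTel metric name; a Prometheus-compatible backend typically exposes it with underscores, as kafka_consumer_group_lag):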

sum(kafka.consumer_group.lag{group="my-consumer-group"})

Method 6: Confluent Control Center (Confluent Platform)

If you use Confluent Platform, consumer group lag is visible directly in the Confluent Control Center dashboard; Confluent Cloud shows the same information in its web console. Both show per-consumer-group lag, partition-level detail, and lag trends over time without any additional tooling. Confluent also exposes lag metrics via its Metrics API for programmatic access.

Method 7: Python Script for Custom Lag Monitoring

For lightweight or custom use cases, you can query consumer group offsets programmatically using the kafka-python or confluent-kafka Python libraries. The pattern is:

  • Connect to the Kafka cluster using AdminClient.
  • Call list_consumer_group_offsets() to get committed offsets for each partition in the group.
  • Call get_watermark_offsets() or use TopicPartition metadata to get the log-end offset for each partition.
  • Calculate lag as LEO minus committed offset for each partition.
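The steps above can be sketched with kafka-python as follows. This is a minimal sketch, not a production monitor: it uses kafka-python's KafkaConsumer.end_offsets() for the log-end offsets rather than confluent-kafka's get_watermark_offsets(), the broker address and group name in the usage comment are placeholders, and error handling is omitted. The lag calculation is kept in a separate pure function so it can be tested without a cluster.

```python
def compute_lag(committed: dict, end_offsets: dict) -> dict:
    """Per-partition lag from {partition: committed offset} and
    {partition: log-end offset}; partitions missing an end offset are skipped."""
    return {
        tp: max(end_offsets[tp] - offset, 0)
        for tp, offset in committed.items()
        if tp in end_offsets
    }


def fetch_group_lag(bootstrap_servers: str, group_id: str) -> dict:
    """Query a live cluster and return {TopicPartition: lag} for one group."""
    # Imported lazily so compute_lag stays usable without kafka-python installed.
    from kafka import KafkaAdminClient, KafkaConsumer

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        # {TopicPartition: OffsetAndMetadata} for each partition with a commit.
        group_offsets = admin.list_consumer_group_offsets(group_id)
        committed = {
            tp: meta.offset for tp, meta in group_offsets.items() if meta.offset >= 0
        }
        # {TopicPartition: log-end offset} for the same partitions.
        end_offsets = consumer.end_offsets(list(committed))
        return compute_lag(committed, end_offsets)
    finally:
        consumer.close()
        admin.close()


# Example usage against a reachable cluster (placeholder values):
#   lag = fetch_group_lag("localhost:9092", "your-consumer-group")
#   print(sum(lag.values()))
```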

This approach is useful for feeding lag data into custom dashboards, Airflow DAGs, or alerting pipelines when a dedicated exporter is not available.

Method 8: CubeAPM (OpenTelemetry-Native, Unified Visibility)

CubeAPM is a full-stack, OpenTelemetry-native observability platform that monitors Kafka consumer group lag alongside traces, logs, and infrastructure metrics in a single workspace. Because it is built on OTel from the ground up, there is no proprietary agent to manage: if you already run an OpenTelemetry Collector, you are most of the way there.

What makes CubeAPM different from a standalone Prometheus setup is context correlation. When a lag spike appears in the dashboard, you can pivot directly into the associated broker logs or consumer service traces to see why lag grew, without switching tools. It also provides trend-aware lag alerting so you are paged on growing lag, not just a raw threshold.

How to set it up

Step 1: Deploy CubeAPM. Install CubeAPM on bare metal, Docker, or Kubernetes (Helm). Once the UI is reachable, note your OTLP ingestion endpoint (default port 4317).

Step 2: Configure the OpenTelemetry Collector. Deploy the OTel Collector contrib distribution alongside your Kafka brokers. Add a Prometheus receiver to scrape your JMX exporter endpoints and export to CubeAPM over OTLP:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kafka-brokers'
          static_configs:
            - targets:
                - 'kafka-0.kafka:7071'
                - 'kafka-1.kafka:7071'
                - 'kafka-2.kafka:7071'
exporters:
  otlp:
    endpoint: http://<cubeapm-host>:4317
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]

Step 3: Add the Kafka metrics receiver for consumer group lag. The OTel Collector contrib distribution also includes a native Kafka metrics receiver that directly scrapes consumer group offsets and log-end offsets from your brokers:

receivers:
  kafkametrics:
    brokers:
      - kafka-1:9092
      - kafka-2:9092
    protocol_version: 2.8.0
    collection_interval: 15s
    scrapers:
      - consumers   # scrapes consumer group lag
      - topics      # scrapes topic/partition offsets
exporters:
  otlp:
    endpoint: http://<cubeapm-host>:4317
service:
  pipelines:
    metrics:
      receivers: [kafkametrics]
      exporters: [otlp]

This surfaces the kafka.consumer_group.lag metric per partition and per consumer group directly in CubeAPM dashboards.

Step 4: Set a consumer lag alert. In CubeAPM’s alerting UI (or as a YAML rule), configure a lag threshold alert:

- alert: KafkaConsumerLagHigh
  expr: kafka_consumergroup_lag > 1500
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High Kafka Consumer Lag Detected"
    description: "Consumer lag has exceeded 1500 messages for 5 min. Check consumer throughput."

CubeAPM fires this alert and delivers context-rich notifications to Slack or email within about 30 seconds of threshold breach, including metric snapshots so you know the scope before you open the dashboard.

Step 5: Correlate lag with traces and logs. Instrument your producer and consumer services with OTel SDKs and export traces to CubeAPM via OTLP. Forward broker logs through Fluent Bit or the OTel logs receiver. From any lag spike in the dashboard, you can jump to the corresponding consumer service trace or broker log entry to find the root cause, whether it is a slow downstream database call, a GC pause, or a rebalance event.

Key consumer lag metrics CubeAPM surfaces:

  • kafka_consumergroup_lag: Per-partition, per-consumer-group lag (the primary lag signal).
  • records_lag_max: Maximum lag across all partitions for a single consumer client.
  • kafka.consumer_group.members: Number of active consumers in the group, useful for detecting unexpected rebalances.
  • Total group lag: Sum across all partitions, visible in the cluster summary panel.

Key Kafka Consumer Group Lag Metrics at a Glance

When setting up monitoring, focus on these metrics regardless of the tool you use:

Metric | What It Tells You | Where to Find It
Consumer group lag (per partition) | Number of unconsumed messages in a partition | CLI, JMX, Prometheus, Burrow
Total group lag | Sum of lag across all partitions in the group | Prometheus (sum query), Control Center
Lag rate (rate of change) | Whether lag is growing or shrinking | Prometheus deriv() or delta() function
Log-End Offset | Latest message produced to a partition | CLI, Kafka Admin API
Consumer committed offset | Last message the group has processed | CLI, __consumer_offsets topic
records-lag-max (JMX) | Max lag across partitions for a single consumer instance | JMX on consumer application
Consumer group members | How many consumers are active in the group | CLI, OpenTelemetry, Control Center

How to Reduce Kafka Consumer Group Lag

Once you have identified that lag is growing, the right fix depends on the root cause.

Adding consumers is the most straightforward fix if your topic has more partitions than consumers. Each consumer in a group can handle multiple partitions, but each partition is handled by only one consumer at a time. Adding consumers distributes the processing load. Remember: you cannot have more active consumers than partitions.

If you are already at the consumer limit (one per partition) and still lagging, you need more partitions to enable more parallel consumers. Increasing partition count allows you to scale horizontally. Note that changing the number of partitions on an existing topic changes which partition each key maps to, so per-key ordering is not guaranteed across the change.

If the bottleneck is slow processing rather than throughput, rework the processing logic:

  • Replace per-message database writes with batch inserts.
  • Remove synchronous HTTP calls from the critical path, or parallelize them.
  • Use async processing where message ordering allows.
  • Reduce unnecessary deserialization or transformation overhead.
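The first item, batching database writes, can be sketched as follows. The example uses an in-memory SQLite database from the standard library purely as a stand-in for your real datastore; the table name, batch size, and message shape are assumptions.

```python
import sqlite3


def write_batched(conn: sqlite3.Connection, messages, batch_size: int = 500) -> int:
    """Insert messages in batches, one transaction per batch instead of per message.
    Returns the number of batches written."""
    batches = 0
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        conn.executemany("INSERT INTO events (key, value) VALUES (?, ?)", batch)
        conn.commit()  # a single commit covers the whole batch
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (key TEXT, value TEXT)")

# Hypothetical payloads standing in for records returned by one poll() call.
messages = [(f"k{i}", f"v{i}") for i in range(1200)]
batches_written = write_batched(conn, messages)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

print(batches_written, row_count)
```

In a real consumer you would commit the Kafka offsets only after the database commit for that batch succeeds, so a crash replays the batch rather than losing it.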

Adjust Kafka consumer configuration to improve fetch efficiency:

  • Increase max.poll.records to fetch more messages per poll cycle.
  • Tune fetch.min.bytes and fetch.max.wait.ms to reduce round trips.
  • Increase max.poll.interval.ms if processing takes longer than the default 5 minutes, to avoid unnecessary rebalances.

If hot partitions are the cause, change your partitioning strategy to distribute messages more evenly. Options include using a random partitioner, using composite keys, or implementing a custom partitioner that distributes high-frequency keys across multiple partitions.
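The composite-key option can be sketched as key salting: a rotating suffix is appended to known hot keys so the partitioner spreads them over several partitions. The helper name, the "#" separator, and the bucket count are all hypothetical.

```python
def salted_key(key: str, sequence: int, hot_keys: set, buckets: int) -> str:
    """Append a rotating bucket suffix to hot keys only; other keys are unchanged."""
    if key in hot_keys:
        return f"{key}#{sequence % buckets}"
    return key


hot_keys = {"customer-hot"}

# 100 messages for the hot key now carry 4 distinct composite keys.
composite_keys = {salted_key("customer-hot", i, hot_keys, 4) for i in range(100)}
print(sorted(composite_keys))

# Keys that are not hot keep their original value and their ordering guarantee.
print(salted_key("customer-7", 3, hot_keys, 4))
```

The trade-off is ordering: messages for a salted key may land on different partitions, so consumers can no longer rely on per-key order for those keys.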

If frequent consumer group rebalances are the problem, switch to the cooperative sticky assignor (available since Kafka 2.4), which allows consumers to continue processing unaffected partitions during a rebalance. Also ensure your processing loop completes within max.poll.interval.ms.

Which Monitoring Method Should You Use?

Method | Best For | Limitation
kafka-consumer-groups.sh | Quick one-off checks | Not suitable for continuous monitoring or alerting
JMX metrics | Single-instance consumer monitoring | No group-wide view; requires JMX access to consumer app
Prometheus + kafka_lag_exporter / kminion | Production monitoring with alerting | Requires additional exporter deployment and Grafana for dashboards
Burrow | Intelligent lag classification | More complex setup; does not store metrics natively
OpenTelemetry Collector (standalone) | OTel-native metric collection | Metrics only; no built-in dashboards or alerting without a backend
CubeAPM | Unified lag monitoring with traces, logs, and alerts in one place | Self-hosted; requires OTel Collector setup (but works with your existing one)
Confluent Control Center | Confluent Platform or Cloud users | Confluent-specific; not available for vanilla Apache Kafka
Python script | Custom pipelines and automation | Requires development and maintenance effort

Conclusion

Kafka consumer group lag caught early is a configuration fix. Caught late, it is an incident. The methods in this guide range from a quick CLI check to full production observability, so pick what fits your stack and get visibility in place.

If you want lag correlated with traces and logs without stitching together multiple tools, CubeAPM is worth a look. It is OpenTelemetry-native, so if you already run an OTel Collector, setup takes minutes.

FAQs

1. What is an acceptable amount of Kafka consumer group lag?

There is no universal number. A stable lag of a few hundred or even a few thousand messages is acceptable in many high-throughput systems, as long as it is not growing. What matters is the trend: lag that grows continuously over time is the problem to solve.

2. Can I have more consumers than partitions?

You can add more consumers than partitions to a consumer group, but the extra consumers will be idle. Kafka assigns one partition to at most one consumer in a group at any time. To benefit from more consumers, you need at least as many partitions as consumers.

3. Does Kafka consumer group lag affect message ordering?

Kafka guarantees message ordering within a single partition. Lag does not break ordering within a partition. However, if you process messages from multiple partitions concurrently, cross-partition ordering is not guaranteed by Kafka itself.

4. What is the __consumer_offsets topic?

The __consumer_offsets topic is an internal Kafka topic where consumer groups store their committed offsets. When a consumer commits an offset, Kafka writes a record to this topic. It is the authoritative source of truth for consumer group progress, and it is what group-level lag monitoring tools ultimately rely on, whether they consume it directly or ask the brokers through the Admin API.

5. How is Kafka consumer group lag different from Kafka latency?

Consumer group lag is measured in number of messages (offset difference). Latency is measured in time, specifically how long it takes a message to travel from producer to consumer. A large lag can contribute to increased latency, but they are distinct measures: a fast consumer can carry a lag of thousands of messages and still deliver each one within milliseconds, while a slow consumer can add seconds of delay with a lag of only a few messages.
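A rough way to connect the two measures is to divide lag by the consumer's processing rate, which estimates how long a newly produced message waits before it is consumed. This is a back-of-the-envelope sketch with made-up numbers; it assumes a steady consumption rate and ignores network and batching delays.

```python
def estimated_delay_seconds(lag_messages: int, consume_rate_per_s: float) -> float:
    """Approximate wait time for a new message given the backlog ahead of it."""
    if consume_rate_per_s <= 0:
        return float("inf")  # a stopped consumer never reaches the message
    return lag_messages / consume_rate_per_s


# High lag, fast consumer: 10,000 messages behind at 100,000 messages per second.
print(estimated_delay_seconds(10_000, 100_000))

# Low lag, slow consumer: 5 messages behind at one message every 2 seconds.
print(estimated_delay_seconds(5, 0.5))
```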
