CubeAPM
Pub/Sub Monitoring: Backlog, Dead-Letter Queues, and Subscriber Lag

Google Cloud Pub/Sub sits at the center of most event-driven architectures on GCP, routing messages between publishers and subscribers at scale. When Pub/Sub experiences issues, the effects cascade through your entire system: messages pile up in subscriptions, consumers fall behind on processing, data pipelines stall, and end users experience delays or errors. According to the CNCF’s 2024 Annual Survey, 87% of organizations now use event-driven architectures in production, making reliable message queue monitoring a baseline requirement for distributed systems.

This guide covers the three core signals that reveal Pub/Sub health: backlog size, dead letter queue accumulation, and subscriber lag. It also covers the specific metrics to track, alert thresholds that catch problems early, and how to connect Pub/Sub telemetry to your broader observability stack.

What Is Pub/Sub Monitoring and Why It Matters

Pub/Sub monitoring is the practice of tracking message queue health across topics, subscriptions, and subscribers in real time. It answers three questions at every moment: Are messages being delivered? Are consumers keeping up? Are failed messages being handled correctly?

Without monitoring, silent failures accumulate. A subscription backlog grows for hours before anyone notices. A misconfigured dead letter queue silently drops critical order confirmations. A slow subscriber falls minutes behind, processing stale data that breaks downstream business logic.

Pub/Sub monitoring surfaces these failures immediately by tracking three core signals: backlog size (how many undelivered messages exist), dead letter queue depth (how many messages failed processing), and subscriber lag (how far behind consumers are running). Each signal maps to a specific failure mode. Backlog growth indicates consumer throughput problems. Dead letter accumulation signals processing bugs or configuration errors. Subscriber lag reveals scaling issues or resource contention.

For teams running event driven systems, Pub/Sub monitoring is the difference between detecting an issue in seconds versus discovering it when customers report broken features.

How Pub/Sub Works: Publishers, Topics, Subscriptions, and Subscribers

Understanding the Pub/Sub architecture clarifies what to monitor and why each component matters.

Publishers send messages to a topic. Topics are named channels that accept and route messages. A single topic can have multiple subscriptions attached. Each subscription represents an independent message stream that consumers pull from or that Pub/Sub pushes to an endpoint.

Subscribers consume messages from subscriptions. A subscriber can be a Cloud Function, a Kubernetes pod running a worker service, or any application that processes events. Subscribers acknowledge messages after successful processing. If a message is not acknowledged before its acknowledgment deadline expires, Pub/Sub redelivers it, and keeps doing so until the message is acknowledged, its retention period ends, or, when a dead letter topic is configured, it reaches its maximum delivery attempts.

Dead letter topics capture messages that exceed retry limits. When a message cannot be acknowledged after multiple attempts, Pub/Sub moves it to a configured dead letter topic instead of retrying indefinitely. This prevents poison messages from blocking the subscription.

Each layer creates a monitoring requirement. Topics reveal publisher health. Subscriptions surface consumer lag and backlog growth. Dead letter topics expose processing failures. Monitoring all three layers gives complete visibility into message flow.

Key Pub/Sub Metrics to Track

Pub/Sub exposes metrics through Cloud Monitoring that map directly to the three failure modes teams encounter most often.

Subscription backlog size

The num_undelivered_messages metric counts messages waiting in a subscription that have not been acknowledged by any subscriber. Backlog growth indicates consumers are falling behind publisher throughput. A backlog that grows steadily means consumer capacity is insufficient. A backlog that spikes then recovers suggests temporary resource contention or a deployment issue.

Alert on sustained backlog growth over 5 minutes. The right threshold depends on your baseline: if your subscription normally holds 100 messages and jumps to 10,000, that is a clear signal. For high-throughput systems, track the rate of change rather than the absolute count.

Oldest unacknowledged message age

The oldest_unacked_message_age metric measures how long the oldest message in the subscription has been waiting for acknowledgment. This is the most direct measure of subscriber lag. A message age of 30 seconds is normal for batch oriented consumers. A message age of 5 minutes signals processing delays. A message age measured in hours indicates a stuck consumer or broken subscription configuration.

Alert before the oldest message age reaches your processing SLA. If messages must be processed within 2 minutes, alert at around 90 seconds to leave time for investigation before the SLA is violated.

Dead letter message count

The dead_letter_message_count metric tracks how many messages have been forwarded to the dead letter topic from a subscription. Any non-zero count requires investigation. Messages reach the dead letter queue because they repeatedly failed processing, typically because of application bugs, schema mismatches, or infrastructure failures.

Alert immediately when dead letter count increases. Each message represents a processing failure that may require manual intervention to resolve.

Publish request count and latency

The topic/send_request_count metric shows publisher throughput. Sudden drops indicate publisher failures or upstream issues. The topic/send_message_operation_count metric reveals successful versus failed publish attempts. High failure rates point to quota limits, network issues, or topic configuration problems.

Track publish latency with topic/send_request_latencies. Increased latency suggests quota throttling or regional capacity constraints.

Subscription pull request count and acknowledgment rate

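The subscription/pull_request_count metric shows how often subscribers are polling the subscription. A drop in pull requests means subscribers are down or stuck. Subscribers that use StreamingPull, as the high-level client libraries typically do, are reported under the streaming pull metrics instead, such as subscription/streaming_pull_response_count. The subscription/ack_message_count metric reveals how many messages subscribers successfully acknowledged. Compare this to publish throughput to verify consumers are keeping pace.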

Low acknowledgment rates combined with growing backlogs confirm consumer throughput problems.

Understanding Backlog and When It Becomes a Problem

Backlog size alone does not indicate a problem. A subscription with 10,000 undelivered messages is healthy if consumers process 100,000 messages per minute and the backlog shrinks over time. The same backlog is critical if consumers process 1,000 messages per minute and the backlog grows.

Backlog becomes a problem when it grows faster than consumers can drain it. This happens in three scenarios: consumer capacity is lower than publisher throughput, consumers are processing messages slowly due to resource contention or inefficient code, or consumers are stuck entirely due to deployment issues or infrastructure failures.

Monitor backlog growth rate rather than absolute size. Calculate the rate by comparing backlog size at two points in time. A positive rate (backlog increasing) sustained over 5 minutes indicates a capacity or processing issue. A negative rate (backlog decreasing) confirms consumers are catching up.
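The growth-rate check described above can be sketched in a few lines of Python. This is a minimal illustration, not a Cloud Monitoring client: the `Sample` tuple and both function names are hypothetical, and the samples are assumed to be `num_undelivered_messages` readings you have already fetched, ordered by time.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, num_undelivered_messages); a hypothetical shape
Sample = Tuple[float, int]


def growth_rate_per_minute(older: Sample, newer: Sample) -> float:
    """Messages per minute added to (positive) or drained from (negative) the backlog."""
    elapsed = newer[0] - older[0]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (newer[1] - older[1]) / elapsed * 60.0


def sustained_growth(samples: List[Sample], window_seconds: float = 300.0) -> bool:
    """True if every consecutive pair of samples in the trailing window shows growth."""
    if len(samples) < 2:
        return False
    cutoff = samples[-1][0] - window_seconds
    recent = [s for s in samples if s[0] >= cutoff]
    # Require enough history to cover the whole window before alerting.
    if len(recent) < 2 or recent[-1][0] - recent[0][0] < window_seconds:
        return False
    return all(growth_rate_per_minute(a, b) > 0 for a, b in zip(recent, recent[1:]))
```

Requiring growth across every pair in the window keeps a single spike that recovers from triggering the alert, at the cost of reacting more slowly to noisy but rising backlogs.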

Backlog also becomes a problem when it causes messages to expire. By default, a subscription retains unacknowledged messages for 7 days. If the backlog grows so large that messages sit unprocessed past the retention period, the oldest messages expire and are lost before consumers reach them. Track message age alongside backlog size to detect this scenario before data loss occurs.
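To make the expiry risk concrete, the sketch below estimates how much time remains before the oldest message expires and how long the backlog would take to drain. The function names and the 24-hour headroom are assumptions for illustration; the 7-day default matches the retention period mentioned above, and rates are in messages per second.

```python
from typing import Optional

DEFAULT_RETENTION_S = 7 * 24 * 3600  # 7-day retention, per the text above


def seconds_until_expiry(oldest_unacked_age_s: float,
                         retention_s: float = DEFAULT_RETENTION_S) -> float:
    """Time left before the oldest unacknowledged message is dropped."""
    return max(0.0, retention_s - oldest_unacked_age_s)


def drain_time_seconds(backlog: int, publish_rate: float, ack_rate: float) -> Optional[float]:
    """Seconds needed to clear the backlog, or None if it is not shrinking."""
    net_drain = ack_rate - publish_rate
    if net_drain <= 0:
        return None
    return backlog / net_drain


def at_risk_of_expiry(oldest_unacked_age_s: float,
                      headroom_s: float = 24 * 3600,
                      retention_s: float = DEFAULT_RETENTION_S) -> bool:
    """True when the oldest message is within headroom_s of expiring."""
    return seconds_until_expiry(oldest_unacked_age_s, retention_s) < headroom_s
```

A drain time of `None` combined with a shrinking expiry margin is the case to escalate: consumers are not catching up and data loss is approaching.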

Dead Letter Queues: Configuration and Monitoring

A dead letter queue (DLQ) is a Pub/Sub topic configured to receive messages that cannot be processed after repeated delivery attempts. Configuring a DLQ prevents poison messages from blocking the subscription indefinitely.

How dead letter queues work

When you configure a subscription with a dead letter topic, Pub/Sub tracks delivery attempts for each message. If a message is not acknowledged after the maximum delivery attempts threshold (minimum 5 attempts), Pub/Sub forwards the message to the dead letter topic and removes it from the original subscription.
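The forwarding behavior can be illustrated with a toy in-memory model. This is not how Pub/Sub is implemented and it uses no Pub/Sub API; `deliver_with_dead_lettering` and its handler convention (return `True` to acknowledge) are hypothetical, and the 5 to 100 range is the limit Pub/Sub places on maximum delivery attempts.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def deliver_with_dead_lettering(
    messages: List[str],
    handler: Callable[[str], bool],
    max_delivery_attempts: int = 5,
) -> Tuple[List[str], List[str]]:
    """Deliver each message until acked or dead-lettered; returns (acked, dead_lettered)."""
    if not 5 <= max_delivery_attempts <= 100:
        raise ValueError("max_delivery_attempts must be between 5 and 100")
    queue: Deque[Tuple[str, int]] = deque((m, 1) for m in messages)  # (payload, attempt)
    acked: List[str] = []
    dead_lettered: List[str] = []
    while queue:
        payload, attempt = queue.popleft()
        if handler(payload):
            acked.append(payload)  # acknowledged: leaves the subscription
        elif attempt >= max_delivery_attempts:
            dead_lettered.append(payload)  # forwarded to the dead letter topic
        else:
            queue.append((payload, attempt + 1))  # redelivered later
    return acked, dead_lettered
```

In the real service the attempt count is tracked on a best-effort basis, so treat the exact number of deliveries in this model as an approximation.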

The dead letter topic is a standard Pub/Sub topic. Attach a subscription to it so that forwarded messages are retained; you can then process failed messages separately, send them to a logging service, or archive them for later analysis.

Configuring dead letter topics

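Set the deadLetterPolicy on a subscription to enable dead letter forwarding. Specify the dead letter topic and the maximum delivery attempts threshold. The Pub/Sub service account also needs permission to publish to the dead letter topic and to acknowledge messages on the source subscription.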

A minimum of 5 delivery attempts is required, and the maximum is 100. Setting a low threshold (5-10 attempts) moves failed messages to the DLQ quickly, so they stop consuming subscriber capacity. Setting a high threshold (20+ attempts) gives transient failures more time to resolve but risks queue buildup if failures are persistent.
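As a rough sketch, the policy can be assembled and validated before it is sent to the API. The field names `deadLetterPolicy`, `deadLetterTopic`, and `maxDeliveryAttempts` follow the subscription resource; the helper name and the project and topic IDs in the usage note are hypothetical.

```python
def dead_letter_policy(project_id: str, dead_letter_topic_id: str,
                       max_delivery_attempts: int = 5) -> dict:
    """Build the deadLetterPolicy fragment of a subscription definition."""
    if not 5 <= max_delivery_attempts <= 100:
        raise ValueError("maxDeliveryAttempts must be between 5 and 100")
    return {
        "deadLetterPolicy": {
            # Topics are referenced by their full resource name.
            "deadLetterTopic": f"projects/{project_id}/topics/{dead_letter_topic_id}",
            "maxDeliveryAttempts": max_delivery_attempts,
        }
    }
```

For example, `dead_letter_policy("my-project", "orders-dlq", 10)` produces a fragment you could merge into a subscription definition managed through the REST API or an infrastructure-as-code tool.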

Monitoring dead letter topics

Track the dead_letter_message_count metric on the source subscription to see how many messages are being forwarded. Monitor the backlog (num_undelivered_messages) of the subscription attached to the dead letter topic to confirm failed messages are being processed or archived.

A growing dead letter backlog indicates failed messages are accumulating faster than your DLQ processing logic can handle them. This often means the same bug or configuration issue is affecting multiple messages.

Alert on any increase in dead letter message count. Each message represents a processing failure that needs investigation. High dead letter rates signal application bugs, schema mismatches, or infrastructure issues that require immediate attention.

Subscriber Lag: Causes and How to Detect It

Subscriber lag is the delay between when a message is published and when it is acknowledged by a consumer. Lag directly impacts system latency. A message published at 10:00:00 and acknowledged at 10:00:05 has 5 seconds of lag. For real time systems, lag measured in minutes breaks SLAs and user experience.
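The lag arithmetic in the example above is a simple timestamp difference. The sketch below reproduces it with hypothetical helper names and an arbitrary date; in practice the publish time comes from the message and the acknowledgment time from your subscriber.

```python
from datetime import datetime, timezone


def lag_seconds(publish_time: datetime, ack_time: datetime) -> float:
    """Seconds between publish and acknowledgment; both must be timezone-aware."""
    return (ack_time - publish_time).total_seconds()


def violates_sla(lag_s: float, sla_s: float) -> bool:
    """True when a message took longer than the processing SLA allows."""
    return lag_s > sla_s


# The example from the text: published at 10:00:00, acknowledged at 10:00:05.
published = datetime(2025, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
acked = datetime(2025, 1, 1, 10, 0, 5, tzinfo=timezone.utc)
example_lag = lag_seconds(published, acked)  # 5.0
```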

Causes of subscriber lag

Lag occurs when consumers cannot process messages as fast as publishers produce them. The root causes fall into four categories.

Insufficient consumer capacity: Too few subscriber instances or worker threads to handle message volume. Auto scaling that lags behind traffic spikes. Fixed capacity that was sized for average load but cannot absorb peak traffic.

Slow message processing: Inefficient code that takes too long per message. Synchronous database queries or external API calls that block worker threads. Large batch sizes that delay acknowledgment until an entire batch completes.

Resource contention: CPU or memory saturation on subscriber instances. Network bandwidth limits. Downstream service bottlenecks (databases, caches, third party APIs) that slow processing.

Configuration issues: Acknowledgment deadlines set too short, causing messages to be redelivered before processing completes. Max outstanding messages or bytes limits that throttle consumer pull rate below capacity.

Detecting subscriber lag

The oldest_unacked_message_age metric is the direct measure of lag. When this metric increases, consumers are falling behind. Track the metric per subscription to isolate which consumers are lagging.

Compare oldest_unacked_message_age to your processing SLA. If messages must be processed within 30 seconds and the oldest message is 2 minutes old, lag is violating the SLA.

Correlate lag with backlog size and pull request count. Lag combined with growing backlog and steady pull requests suggests slow processing. Lag combined with low pull request count suggests consumer downtime or stuck workers.
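The correlation rules above can be written down as a small triage function. This is one possible encoding under stated assumptions: the `SubscriptionSnapshot` fields, the 50% pull-drop ratio, and the returned labels are all hypothetical, and the inputs are values you have already derived from the metrics.

```python
from dataclasses import dataclass


@dataclass
class SubscriptionSnapshot:
    oldest_unacked_age_s: float
    backlog_growth_per_min: float
    pull_requests_per_min: float
    baseline_pull_requests_per_min: float


def triage(s: SubscriptionSnapshot, sla_seconds: float, pull_drop_ratio: float = 0.5) -> str:
    """Map the three correlated signals to a likely cause."""
    if s.oldest_unacked_age_s <= sla_seconds:
        return "healthy"
    # Lag with few pull requests: consumers are down or stuck.
    if s.pull_requests_per_min < s.baseline_pull_requests_per_min * pull_drop_ratio:
        return "consumers down or stuck"
    # Lag with steady pulls and a growing backlog: processing is too slow.
    if s.backlog_growth_per_min > 0:
        return "slow processing or insufficient capacity"
    return "lagging but recovering"
```

The pull-request check runs before the backlog check because a stalled consumer also produces a growing backlog, and downtime is the more urgent diagnosis.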

Pub/Sub Monitoring with CubeAPM

CubeAPM provides full stack observability for Pub/Sub workloads by connecting Pub/Sub metrics to distributed traces, logs, and infrastructure telemetry in a single platform. It runs on premises or inside your VPC, keeping all telemetry data within your infrastructure while surfacing the same visibility SaaS tools provide.

CubeAPM ingests Pub/Sub metrics via OpenTelemetry or Prometheus exporters, correlates them with application traces to show message flow across services, and alerts on backlog growth, subscriber lag, or dead letter accumulation with context tied to the exact service or deployment causing the issue.

For Pub/Sub monitoring specifically, CubeAPM tracks subscription backlog size, oldest unacknowledged message age, dead letter message count, publish request latency, and subscriber pull and acknowledgment rates. These metrics are searchable with high cardinality filters — by subscription, topic, service, or deployment.

CubeAPM correlates Pub/Sub metrics with APM traces so you can see which service published a message, how long it sat in the subscription, and which consumer processed it. This correlation is automatic when using OpenTelemetry instrumentation for both Pub/Sub and application code.

Alerts in CubeAPM use the same metrics as Cloud Monitoring but tie them to trace context and infrastructure events. An alert for high subscriber lag includes the service graph showing which downstream dependency is slow, the logs from the subscriber pod, and the Kubernetes events that show whether the pod was recently restarted or throttled.

Pricing is $0.15/GB for all ingested telemetry with unlimited retention and no per-seat fees. Pub/Sub metrics, traces, and logs count toward the same ingestion total. For a subscription processing 50,000 messages per minute with traces enabled, monthly costs are predictable and scale linearly with data volume.
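To see how the linear scaling works out, the sketch below multiplies message volume by an assumed telemetry footprint per message. The $0.15/GB rate is the figure quoted above; the 2,000 bytes of telemetry per message is purely a hypothetical input, so measure your own footprint before relying on the result.

```python
PRICE_PER_GB_USD = 0.15  # rate quoted in the text


def monthly_ingest_cost(messages_per_minute: float,
                        telemetry_bytes_per_message: float,
                        days: int = 30) -> float:
    """Estimated monthly cost in USD, using decimal gigabytes (1 GB = 1e9 bytes)."""
    messages_per_month = messages_per_minute * 60 * 24 * days
    total_gb = messages_per_month * telemetry_bytes_per_message / 1e9
    return total_gb * PRICE_PER_GB_USD


# 50,000 messages per minute at an assumed 2,000 bytes each:
# 2.16 billion messages, 4,320 GB, roughly 648 USD per month.
estimate = monthly_ingest_cost(50_000, 2_000)
```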

Best Practices for Pub/Sub Monitoring

Track all three core metrics — backlog size, oldest unacked message age, and dead letter message count — together to get complete visibility into subscription health. No single metric reveals the full picture. Backlog size without message age misses lag. Message age without dead letter count misses persistent failures.

Set alert thresholds based on your processing SLA, not arbitrary numbers. If your system must process messages within 1 minute, alert when oldest message age exceeds 45 seconds, before the SLA is breached. If backlog normally holds 500 messages, alert when it crosses 5,000 for more than 5 minutes.

Monitor publisher and subscriber health independently. Track publish request count and latency to detect publisher failures. Track pull request count and acknowledgment rate to detect subscriber failures. Correlate both sides to understand whether issues originate upstream or downstream.

Configure dead letter topics for every subscription. Without a DLQ, poison messages are redelivered until their retention period ends, wasting consumer capacity and inflating the backlog. With a DLQ, failed messages are isolated automatically and the backlog continues draining.

Use high cardinality filtering to isolate issues. Monitor metrics by subscription, topic, service, and deployment. A spike in backlog size across all subscriptions indicates a publisher issue. A spike in one subscription indicates a consumer issue for that workload.
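The heuristic above, applied to the subscriptions of one topic, might look like the sketch below. The function name, the 5x spike factor, and the labels are assumptions; the inputs are current and baseline backlog sizes keyed by subscription name.

```python
from typing import Dict, List, Tuple


def classify_backlog_spike(current: Dict[str, int],
                           baseline: Dict[str, int],
                           spike_factor: float = 5.0) -> Tuple[str, List[str]]:
    """Return (likely_origin, spiking_subscriptions) for one topic's subscriptions."""
    spiking = sorted(
        name for name, backlog in current.items()
        # max(..., 1) avoids treating any backlog as a spike when the baseline is 0.
        if backlog >= spike_factor * max(baseline.get(name, 0), 1)
    )
    if not spiking:
        return ("none", [])
    if len(current) > 1 and len(spiking) == len(current):
        return ("publisher-side", spiking)  # every subscription is affected
    return ("consumer-side", spiking)  # only some workloads are affected
```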

Correlate Pub/Sub metrics with application traces and logs. Metrics tell you a problem exists. Traces show you the message flow and where processing slows down. Logs surface the errors causing failures. Infrastructure monitoring tools that connect all three layers reduce mean time to resolution by giving complete context in a single view.

Automate alerts but include context in notifications. A Slack alert that says “Subscription backlog high” requires manual investigation to understand which service is affected and what changed. An alert that says “Orders subscription backlog at 15,000 messages, oldest message 8 minutes old, consumer pods restarted 10 minutes ago” gives the context needed to act immediately.
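A notification with that level of context is easy to assemble once the values are available. The helper below is a hypothetical sketch that reproduces the example message; how you obtain the restart time (for instance from Kubernetes events) is left out.

```python
from typing import Optional


def format_backlog_alert(subscription: str, backlog: int, oldest_age_s: float,
                         minutes_since_restart: Optional[int] = None) -> str:
    """Build a one-line alert that states what is affected and what changed."""
    oldest_minutes = int(oldest_age_s // 60)
    parts = [
        f"{subscription} subscription backlog at {backlog:,} messages",
        f"oldest message {oldest_minutes} minutes old",
    ]
    if minutes_since_restart is not None:
        parts.append(f"consumer pods restarted {minutes_since_restart} minutes ago")
    return ", ".join(parts)
```

Calling `format_backlog_alert("Orders", 15000, 480, 10)` yields the contextual message quoted above.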

Test your monitoring during load tests and chaos engineering exercises. Simulate publisher spikes, kill subscriber pods, inject processing delays, and verify alerts fire before SLAs are violated. Monitoring that works during steady state but misses failures during incidents is not reliable.

Tools for Pub/Sub Monitoring

Several platforms provide Pub/Sub monitoring capabilities with different trade-offs in deployment model, cost, and feature depth.

Cloud Monitoring

Cloud Monitoring is Google’s native observability platform. It collects Pub/Sub metrics automatically when you create topics and subscriptions. Metrics are available in the console or via the Cloud Monitoring API. Dashboards can be created using JSON or Terraform.

Alerting policies in Cloud Monitoring support threshold based alerts on any Pub/Sub metric. Notifications can be sent to email, Slack, PagerDuty, or webhooks.

Cloud Monitoring is included with GCP usage. There is no separate cost for collecting Pub/Sub metrics. Alert policy evaluations and API queries are billed separately.

Limitations: Cloud Monitoring does not correlate Pub/Sub metrics with application traces or logs by default. Building this correlation requires exporting metrics to a tool that supports unified telemetry.

Datadog

Datadog integrates with GCP to collect Pub/Sub metrics and correlate them with logs, traces, and infrastructure data. The GCP integration pulls metrics via the Cloud Monitoring API. Pub/Sub dashboards are available out of the box.

Datadog’s APM can trace messages across services if you instrument publishers and subscribers with Datadog agents or OpenTelemetry. This gives end to end visibility from publish to acknowledgment.

Pricing is based on the number of monitored hosts plus ingested logs and APM traces. For Pub/Sub workloads, this includes the instances running publisher and subscriber code. The Datadog pricing calculator helps model costs for specific workloads.

Limitations: Datadog is SaaS only. All telemetry data is sent to Datadog’s cloud. This creates egress costs for large volumes and is not suitable for teams with strict data residency requirements.

Grafana with Prometheus

Grafana combined with Prometheus can monitor Pub/Sub using a Cloud Monitoring exporter such as the community stackdriver_exporter. The exporter queries Cloud Monitoring APIs and exposes Pub/Sub metrics in Prometheus format. Grafana dashboards visualize the metrics.

This setup is self hosted. You run Prometheus and Grafana in your own infrastructure. Pub/Sub metrics stay within your environment. Grafana Cloud offers a hosted option if you prefer managed services.

Pricing for self hosted setups depends on infrastructure costs. Grafana Cloud pricing is based on metrics, logs, and trace volume ingested.

Limitations: Prometheus and Grafana require manual configuration for alerts, dashboards, and exporters. Correlating Pub/Sub metrics with traces requires additional setup using Tempo or another tracing backend. Synthetic monitoring tools can extend Grafana’s coverage to include proactive health checks for Pub/Sub endpoints.

New Relic

New Relic’s GCP integration collects Pub/Sub metrics and displays them in pre-built dashboards. APM tracing can follow messages across services if applications are instrumented with New Relic agents.

New Relic is SaaS only. Pricing is based on data ingested. The New Relic pricing calculator provides cost estimates based on ingestion volume.

Limitations: New Relic does not support on premises deployment. All telemetry is sent to New Relic’s cloud. Pricing can become unpredictable at scale as ingestion grows with traffic spikes.

Frequently Asked Questions

How often should Pub/Sub metrics be collected?

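Collect Pub/Sub metrics at least once every 60 seconds. Cloud Monitoring samples Pub/Sub metrics every 60 seconds, so polling its API more often does not produce fresher data. Higher-frequency signals (every 10-30 seconds) have to come from your own publisher or subscriber instrumentation; they improve detection speed for fast-growing backlogs but increase monitoring costs and query load.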

What is a normal backlog size for a Pub/Sub subscription?

Normal backlog size depends on consumer throughput and publisher rate. A subscription processing 1,000 messages per second may hold 5,000 messages at any moment if consumers run in micro batches. Monitor backlog growth rate rather than absolute size to detect problems.

How do I configure dead letter topics in Pub/Sub?

Set the `deadLetterPolicy` on a subscription with the dead letter topic name and maximum delivery attempts. Minimum 5 attempts required. Use `gcloud pubsub subscriptions update SUBSCRIPTION_NAME --dead-letter-topic=TOPIC_NAME --max-delivery-attempts=5`.

What causes high subscriber lag in Pub/Sub?

Subscriber lag is caused by insufficient consumer capacity, slow message processing, resource contention, or configuration issues like short acknowledgment deadlines. Use the `oldest_unacked_message_age` metric to measure lag and correlate it with consumer CPU, memory, and processing time.

Can I monitor Pub/Sub with OpenTelemetry?

Yes. Expose Pub/Sub metrics in Prometheus format with a Cloud Monitoring exporter, then scrape them with Prometheus or with the OpenTelemetry Collector's Prometheus receiver. Instrument publishers and subscribers with OpenTelemetry SDKs to generate traces that correlate with metrics.

How do I alert on Pub/Sub backlog growth?

Create an alert policy in Cloud Monitoring on `num_undelivered_messages` with a threshold based on your baseline. Alert when backlog exceeds the threshold for more than 5 minutes to avoid noise from temporary spikes.
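A policy along those lines can be expressed as a JSON document for the Cloud Monitoring API. The sketch below only builds the document and does not call any API; the helper name and display names are hypothetical, and you should check the field values against the current AlertPolicy reference before applying it.

```python
import json


def backlog_alert_policy(subscription_id: str, threshold: int,
                         duration_seconds: int = 300) -> dict:
    """Build an alert policy that fires when backlog stays above threshold."""
    metric_filter = (
        'resource.type = "pubsub_subscription" AND '
        'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" AND '
        f'resource.labels.subscription_id = "{subscription_id}"'
    )
    return {
        "displayName": f"Pub/Sub backlog high: {subscription_id}",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"Backlog above {threshold} for {duration_seconds}s",
            "conditionThreshold": {
                "filter": metric_filter,
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold,
                # The condition must hold this long, which filters out short spikes.
                "duration": f"{duration_seconds}s",
                "aggregations": [
                    {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                ],
            },
        }],
    }


policy_json = json.dumps(backlog_alert_policy("orders-sub", 5000), indent=2)
```

The resulting JSON could be saved to a file and created with your usual tooling, such as Terraform or the gcloud CLI.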

What is the minimum backoff duration in Pub/Sub?

The minimum backoff duration for a Pub/Sub retry policy defaults to 10 seconds, and the maximum backoff duration defaults to 600 seconds; both can be set between 0 and 600 seconds. With exponential backoff, the delay between redeliveries grows from the minimum toward the maximum. Configure the retry policy on the subscription to control backoff behavior.
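To get a feel for how the two bounds interact, the sketch below generates an illustrative delay schedule. The doubling multiplier is an assumption made for this example; Pub/Sub does not promise a specific growth factor, only that delays stay between the configured minimum and maximum.

```python
from typing import List


def backoff_schedule(attempts: int,
                     minimum_backoff_s: float = 10.0,
                     maximum_backoff_s: float = 600.0,
                     multiplier: float = 2.0) -> List[float]:
    """Illustrative redelivery delays: grow by multiplier, capped at the maximum."""
    if not 0 <= minimum_backoff_s <= maximum_backoff_s <= 600:
        raise ValueError("backoff values must satisfy 0 <= min <= max <= 600 seconds")
    return [
        min(maximum_backoff_s, minimum_backoff_s * multiplier ** i)
        for i in range(attempts)
    ]
```

With the defaults and a doubling assumption, the cap is reached on the seventh redelivery, so a persistently failing message would be retried roughly every 10 minutes from then on.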

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.
