NATS is fast, but without the right monitoring, a misconfigured connection pool or a failing JetStream consumer can quietly degrade message delivery for hours. According to the CNCF Annual Survey 2024, 42% of organizations now use message queues or event streaming platforms in production, making visibility into these systems a production reliability requirement, not an operational nice-to-have.
NATS monitoring gives you real-time visibility into server health, connection count, message throughput, JetStream stream state, and consumer lag. When something breaks in a distributed messaging layer, the failure is rarely obvious from application logs alone. NATS monitoring surfaces the signal before messages start piling up or dropping entirely.
This guide covers what NATS monitoring is, how it works, which metrics actually matter in production, and how to set up monitoring using native endpoints, Prometheus exporters, and full-stack observability platforms.
What Is NATS Monitoring
NATS monitoring is the practice of continuously tracking the health, performance, and behavior of NATS servers and clusters in real time. It answers questions like: Are messages being delivered? How many connections are active? Is JetStream stream storage approaching limits? Are there slow consumers causing backpressure?
NATS is designed to be lightweight and fast. That design philosophy means NATS itself does not ship with a built-in UI, persistent dashboards, or automated alerting. Monitoring fills that gap: the server exposes telemetry through HTTP endpoints, and external systems collect, visualize, and alert on that data.
NATS monitoring typically covers three layers:
Server health metrics: CPU, memory, connection count, uptime, configuration state
Message flow metrics: inbound/outbound message rates, byte rates, slow consumer events
JetStream-specific metrics: stream count, consumer lag, storage usage, pending messages, replication state
Without monitoring, you only find out something is wrong when users report message delays or engineers notice that a consumer has stopped processing entirely. With monitoring, you catch those issues early with alerts tied to specific thresholds like consumer lag exceeding 10,000 messages or memory usage crossing 80%.
How NATS Monitoring Works
NATS monitoring works by exposing telemetry data through a built-in HTTP server that runs on a dedicated monitoring port. This server provides JSON-formatted metrics at specific endpoints that external monitoring tools can poll.
When you start a NATS server with monitoring enabled, it listens on a port you specify (commonly 8222) and serves metrics at predefined URL paths. Tools like Prometheus, Grafana, Datadog, or infrastructure monitoring platforms scrape these endpoints at regular intervals to collect the data and store it for querying, visualization, and alerting.
The core mechanism has three parts:
NATS server exposes HTTP endpoints: /varz for general server stats, /connz for connection details, /routez for cluster routing info, /subsz for subscription stats, /jsz for JetStream metrics
Monitoring agent polls those endpoints: Prometheus exporter, Telegraf, native Datadog integration, or custom scripts query the endpoints every 10 to 60 seconds
Metrics are stored and visualized: collected data flows into time series databases like Prometheus, Grafana Cloud, or observability platforms where you build dashboards and set up alerts
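The polling step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular agent is implemented. The names fetch_json and poll_once are hypothetical, and the base URL in the usage note assumes the common 8222 monitoring port.

```python
import json
import urllib.request

# Endpoint paths named in the text above.
ENDPOINTS = ("/varz", "/connz", "/routez", "/subsz", "/jsz")


def fetch_json(base_url, path, timeout=5.0):
    """GET one monitoring endpoint and decode its JSON body."""
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def poll_once(base_url, paths=ENDPOINTS, fetch=fetch_json):
    """Collect one snapshot per endpoint, recording errors instead of raising."""
    snapshot = {}
    for path in paths:
        try:
            snapshot[path] = fetch(base_url, path)
        except (OSError, ValueError) as exc:
            # Network failures are OSError subclasses; invalid JSON is a ValueError.
            snapshot[path] = {"error": str(exc)}
    return snapshot
```

A scheduler would call poll_once("http://localhost:8222") every 10 to 60 seconds and forward each snapshot to whatever storage backend is in use. The fetch parameter is injectable so the loop can be exercised without a running server.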
For JetStream specific monitoring, /jsz exposes stream level and consumer level details like total messages, pending messages, consumer delivery count, and acknowledgment state. Without this endpoint, you have no visibility into whether a JetStream stream is filling up or a consumer has fallen behind.
Key NATS Metrics to Monitor
Not every metric NATS exposes requires active monitoring. Some are useful for debugging specific incidents but do not need dashboards or alerts. The metrics below are the ones that matter in production and should be tracked continuously.
Server Health Metrics
CPU and memory usage — track how much system resource the NATS server consumes. NATS is designed to be lightweight, but under load or with misconfigured streams, memory can climb unexpectedly. Set alerts when memory usage crosses 75% to 80% of available capacity.
Uptime — how long the server has been running since the last restart. A frequent restart pattern can indicate crashes or configuration issues that need investigation.
Connection count — total active client connections. A sudden drop can mean a network partition or a downstream service failing to reconnect. A sudden spike can indicate a connection leak or retry storm.
Slow consumer events — NATS disconnects slow consumers to protect the server from backpressure. Track how often this happens. If slow consumer disconnects are frequent, it means clients cannot keep up with message rates and need optimization.
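As a rough sketch of how these server health checks might be scripted, the function below evaluates one parsed /varz payload. It assumes the payload carries mem (bytes) and slow_consumers (a cumulative count) fields. The function name, the caller-supplied mem_capacity_bytes, and the 80% default are illustrative choices, not part of NATS.

```python
def check_server_health(varz, mem_capacity_bytes, prev_slow_consumers=0, mem_ratio=0.8):
    """Return a list of human-readable warnings for one /varz snapshot."""
    warnings = []

    # Memory check: compare server memory against the capacity the caller budgets for it.
    mem = varz.get("mem", 0)
    if mem_capacity_bytes > 0 and mem / mem_capacity_bytes >= mem_ratio:
        warnings.append(f"memory at {mem / mem_capacity_bytes:.0%} of capacity")

    # Slow consumer check: the counter is cumulative, so only the increase
    # since the previous snapshot signals new events.
    new_slow = varz.get("slow_consumers", 0) - prev_slow_consumers
    if new_slow > 0:
        warnings.append(f"{new_slow} new slow consumer event(s)")

    return warnings
```

Passing the previous snapshot's slow_consumers value on each call turns the running total into a per-interval signal, which is usually what an alert should key on.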
Message Flow Metrics
Inbound message rate — messages per second arriving at the server. This tells you how much load the server is handling and whether message traffic is increasing over time.
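Outbound message rate: messages per second being delivered to subscribers. Compare this to the inbound rate, keeping fan-out in mind: a message delivered to several subscribers counts once inbound and once per delivery outbound. If outbound drops below its usual ratio to inbound, messages are piling up somewhere, likely in JetStream streams or behind slow consumers.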
Inbound byte rate and outbound byte rate — similar to message rates but measured in bytes per second. Useful for understanding bandwidth consumption and detecting unusually large messages that could impact performance.
Pending bytes — how many bytes are queued and waiting to be delivered. High pending bytes indicate subscribers are not consuming messages fast enough.
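The server reports message and byte totals as running counters, so rates have to be derived from two snapshots. The sketch below assumes /varz payloads with in_msgs, out_msgs, in_bytes, and out_bytes fields. The function name and the reset handling are illustrative assumptions.

```python
RATE_FIELDS = ("in_msgs", "out_msgs", "in_bytes", "out_bytes")


def message_rates(prev, curr, interval_seconds):
    """Convert two cumulative /varz snapshots into per-second rates."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for field in RATE_FIELDS:
        delta = curr[field] - prev[field]
        if delta < 0:
            # Counters restart from zero when the server restarts;
            # treat the current value as the growth since the restart.
            delta = curr[field]
        rates[field + "_per_sec"] = delta / interval_seconds
    return rates
```

With snapshots taken 15 seconds apart, an in_msgs increase of 1,500 works out to 100 messages per second.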
JetStream Metrics
Stream count — total number of JetStream streams on the server. Growth here should be intentional. Unplanned growth can indicate configuration drift or runaway stream creation.
Stream storage usage: bytes consumed by each stream. Monitor this closely for streams with size limits. When storage reaches the configured limit, the default discard policy (old) deletes the oldest messages to make room for new ones, while the new policy rejects incoming publishes instead. Either outcome may not be the intended behavior.
Consumer lag — the difference between the last message in the stream and the last message acknowledged by a consumer. High lag means the consumer is falling behind. Set alerts when lag exceeds a threshold that matches your SLA, like 10,000 messages or 5 minutes of data.
Pending messages per consumer — messages waiting to be delivered to a specific consumer. This is the most direct indicator of whether a consumer is keeping up with the stream.
Redelivery count — how many times a message has been redelivered to a consumer. High redelivery counts suggest consumers are failing to process messages successfully, either due to bugs or resource constraints.
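A lag threshold stated as "5 minutes of data" needs a conversion from message counts to time. One simple approach, sketched below, divides pending messages by the consumer's recent acknowledgment rate. Both function names are hypothetical, and the rate is assumed to be measured by the caller, for example from two successive readings.

```python
def estimated_lag_seconds(pending_messages, ack_rate_per_sec):
    """Estimate how long the consumer needs to drain its backlog."""
    if ack_rate_per_sec <= 0:
        # A consumer that acknowledges nothing never catches up.
        return float("inf")
    return pending_messages / ack_rate_per_sec


def lag_breaches(pending_messages, ack_rate_per_sec, alert_after_seconds=300):
    """True when the estimated drain time reaches the alert threshold."""
    return estimated_lag_seconds(pending_messages, ack_rate_per_sec) >= alert_after_seconds
```

For example, 15,000 pending messages at 50 acknowledgments per second is 300 seconds of backlog, which reaches a 5 minute threshold. A fixed count such as 10,000 messages means very different things for a consumer handling 10 messages per second and one handling 1,000.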
Built In NATS Monitoring Endpoints
NATS provides several HTTP endpoints out of the box. These return JSON payloads with server and stream telemetry. You query these endpoints directly using curl or integrate them with monitoring tools.
/varz — general server variables including uptime, memory usage, CPU, connection count, message rates, slow consumer events. This is the first endpoint to check when diagnosing server health.
/connz — detailed list of all active connections including client IP, connection time, pending bytes, inbound/outbound message counts per connection. Useful for identifying which clients are slow or generating the most traffic.
/routez — routing information for clustered NATS servers. Shows which routes are active and their connection state. Critical for diagnosing cluster connectivity issues.
/subsz — subscription details including subject names and subscriber counts. Helps identify over subscription or unused subscriptions that could be cleaned up.
/jsz — JetStream specific metrics covering streams, consumers, storage usage, pending messages, and replication state. This endpoint is essential for teams using JetStream in production.
To enable the monitoring server, start NATS with the monitoring flag and port:
nats-server -m 8222
Or set it in the NATS configuration file:
monitor_port: 8222
Once enabled, query any endpoint:
curl http://localhost:8222/varz
The response is JSON formatted and can be parsed by monitoring agents or custom scripts.
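As one example of a custom script working on these JSON responses, the sketch below ranks connections from a parsed /connz payload by pending bytes to surface likely slow clients. It assumes the payload has a connections list whose entries carry cid, ip, and pending_bytes. The function name is hypothetical.

```python
def top_pending_connections(connz, limit=5):
    """Return (cid, ip, pending_bytes) for the connections with the most queued data."""
    conns = connz.get("connections", [])
    ranked = sorted(conns, key=lambda c: c.get("pending_bytes", 0), reverse=True)
    return [(c.get("cid"), c.get("ip"), c.get("pending_bytes", 0)) for c in ranked[:limit]]
```

The input would typically come from json.loads applied to the body returned by curl http://localhost:8222/connz.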
NATS Monitoring with Prometheus
Prometheus is one of the most common ways to monitor NATS at scale. The nats-surveyor tool from the NATS project acts as a Prometheus exporter: it gathers server statistics over a NATS connection and converts the data into Prometheus metrics format. (A separate project, prometheus-nats-exporter, takes the other route and scrapes the HTTP monitoring endpoints directly.)
How nats-surveyor works
nats-surveyor connects to a NATS server as a client using system account credentials and requests server and JetStream statistics over NATS system subjects, covering much of the same data that /varz and /jsz report over HTTP. It then exposes a /metrics endpoint in Prometheus format, which Prometheus scrapes at a defined interval (commonly every 15 to 30 seconds).
This approach gives you time series data for every metric NATS exposes, which you can query using PromQL, visualize in Grafana, and use to trigger alerts in Alertmanager.
Setting up nats-surveyor
Install nats-surveyor:
go install github.com/nats-io/nats-surveyor@latest
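Run it with your NATS server URL. Surveyor needs to connect with system account credentials, so check the nats-surveyor README for the credential flags that match your setup:
nats-surveyor -s nats://localhost:4222 -p 7777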
nats-surveyor now exposes metrics at http://localhost:7777/metrics.
Add this target to your Prometheus configuration:
scrape_configs:
  - job_name: 'nats'
    static_configs:
      - targets: ['localhost:7777']
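Restart or reload Prometheus. Metrics such as nats_core_mem_bytes, along with per-stream and per-consumer JetStream metrics, will now appear in Prometheus. Exact metric names vary between nats-surveyor versions, so confirm them against the /metrics output before building queries.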
Build Grafana dashboards using these metrics. Set up alerts in Alertmanager for thresholds like consumer lag exceeding 10,000 messages or memory usage crossing 80%.
NATS JetStream Monitoring
JetStream adds persistence and replay to NATS, which means monitoring requirements expand beyond simple message rates. You need visibility into stream state, consumer progress, and storage limits.
Stream Level Metrics
Total messages in stream — how many messages are stored. Compare this to your retention policy. If this number grows unbounded, you may have a retention misconfiguration.
Stream storage bytes: actual disk or memory usage by the stream. If this reaches your configured limit, NATS deletes the oldest messages under the default discard policy (old), or rejects new publishes if the stream uses the new discard policy.
First sequence and last sequence — the sequence number of the oldest and newest message in the stream. Useful for understanding message retention and whether old messages are being purged as expected.
Consumer Level Metrics
Delivered messages — total messages delivered to the consumer since it started. This should increase steadily if the consumer is active.
Acknowledged messages — messages the consumer has successfully acknowledged. Compare this to delivered messages. If the gap widens, the consumer is receiving messages but failing to process them.
Pending messages — messages waiting to be delivered to the consumer. This is consumer lag in absolute terms. High pending counts mean the consumer cannot keep up.
Redelivered messages — how many times messages have been redelivered after the consumer failed to acknowledge them. High redelivery counts indicate processing failures or timeouts that need investigation.
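Consumer state: whether the consumer is active, idle, or stalled. NATS does not report this as a single field, so infer it from whether the delivered and acknowledged sequences keep advancing while pending messages remain. A stalled consumer is a production incident.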
Query JetStream metrics using the /jsz endpoint or pull them from Prometheus if you are using nats-surveyor.
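To show what reading these values from /jsz might look like, the sketch below flattens a parsed response into one row per consumer. It assumes the endpoint was queried with account, stream, and consumer detail enabled (for example /jsz?accounts=true&streams=true&consumers=true) and that the payload nests account_details, stream_detail, and consumer_detail lists with num_pending, num_ack_pending, and num_redelivered fields. Verify those names against your server's output; the function names are hypothetical.

```python
def consumer_lag_report(jsz):
    """Flatten a detailed /jsz payload into one dict per consumer."""
    report = []
    for account in jsz.get("account_details", []):
        for stream in account.get("stream_detail", []):
            for consumer in stream.get("consumer_detail", []):
                report.append({
                    "stream": stream.get("name"),
                    "consumer": consumer.get("name"),
                    # Messages not yet delivered to this consumer.
                    "pending": consumer.get("num_pending", 0),
                    # Messages delivered but not yet acknowledged.
                    "ack_pending": consumer.get("num_ack_pending", 0),
                    "redelivered": consumer.get("num_redelivered", 0),
                })
    return report


def lagging(report, max_pending=10_000):
    """Select consumers whose undelivered backlog exceeds the threshold."""
    return [row for row in report if row["pending"] > max_pending]
```

Keeping pending and ack_pending separate helps distinguish a consumer that is not pulling messages from one that pulls them but fails to acknowledge.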
Best Practices for NATS Monitoring
Set up monitoring before you go to production. Retrofitting monitoring after an incident is both harder and more expensive than building it in from the start.
Monitor at multiple layers — track server health, message flow, and JetStream state separately. A healthy server does not mean healthy message delivery.
Set alerts based on SLAs — if your service can tolerate 5 minutes of consumer lag, set your alert threshold at 4 minutes so you catch issues before they breach your SLA.
Use dashboards that reflect real workflows — build dashboards for each critical stream and consumer, not just aggregate server stats. When an incident happens, you need to see the exact stream or consumer that is failing, not just a server level average.
Monitor slow consumer events — these are early warnings that clients cannot keep up. If you see frequent slow consumer disconnects, investigate client performance or increase client resources.
Track redelivery counts per consumer — high redelivery means messages are failing to process. This usually points to application bugs or resource constraints, not NATS configuration issues.
Monitor memory and CPU trends over time — NATS is lightweight, but memory usage can climb with large JetStream streams or high message volumes. Detecting trends early prevents outages caused by resource exhaustion.
Use retention policies intentionally: if your stream is capped at 1 million messages and your consumer processes 10,000 messages per day while publishers add more than that, the unprocessed backlog grows until it reaches the limit, at which point the stream starts discarding messages the consumer has not yet seen. Monitor stream size and pending messages to catch this before it happens.
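The retention example can be turned into a quick estimate of how much time remains. The sketch below is plain arithmetic, not a NATS API. The function name is hypothetical, and the publish rate of 30,000 messages per day used in the note is an assumed figure that does not appear in the text.

```python
def days_until_backlog_hits_limit(max_msgs, backlog, published_per_day, processed_per_day):
    """Estimate days until the unprocessed backlog reaches the stream's message limit.

    Returns 0.0 if the backlog is already at the limit and None if it is not growing.
    """
    if backlog >= max_msgs:
        return 0.0
    growth_per_day = published_per_day - processed_per_day
    if growth_per_day <= 0:
        # The consumer keeps pace, so the backlog never reaches the limit.
        return None
    return (max_msgs - backlog) / growth_per_day
```

With a 1 million message limit, an empty backlog, 30,000 messages published per day, and 10,000 processed per day, the backlog grows by 20,000 per day and reaches the limit in 50 days.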
Tools and Platforms for NATS Monitoring
Several tools integrate with NATS to provide monitoring, alerting, and visualization. Each fits different team sizes, budgets, and deployment models.
NATS Surveyor + Prometheus + Grafana
Open source, self hosted, full control over data and configuration. Best for teams already running Prometheus and Grafana who want to own the entire monitoring stack. Requires setup and maintenance but offers the most flexibility.
CubeAPM
Full stack observability platform that includes infrastructure monitoring and APM alongside message queue monitoring. CubeAPM supports NATS monitoring through Prometheus integration and can correlate NATS metrics with application traces and logs. It runs on your infrastructure, so NATS telemetry data stays local with no egress fees. Pricing at $0.15/GB makes it predictable for teams ingesting large volumes of metrics and logs. Best for teams that want unified observability without SaaS lock in or unpredictable billing.
Datadog
Managed SaaS platform with native NATS integration. Collects metrics from NATS monitoring endpoints, provides pre-built dashboards, and integrates with 700+ other services. Best for teams that want fully managed monitoring and can absorb the cost. Pricing is host based plus ingestion fees, which compounds as infrastructure scales.
New Relic
Another managed observability platform with NATS monitoring support. New Relic’s pricing model is based on data ingest, which can be unpredictable for high volume NATS deployments. Best for teams already using New Relic who want to consolidate monitoring in one platform.
Grafana Cloud
Hosted Grafana with Prometheus backend. You run nats-surveyor locally, and metrics are sent to Grafana Cloud for storage and visualization. Best for teams that want the Prometheus and Grafana stack without managing the infrastructure themselves. Pricing is based on active series and ingestion volume.
Netdata
Real time monitoring with automatic metric collection and dashboard generation. Netdata can monitor NATS through its Prometheus integration. Best for teams that want instant setup with minimal configuration. Limited alerting and long term retention compared to Prometheus based stacks.
Migrating NATS Monitoring to a New Platform
If you are switching from one monitoring tool to another, the migration process is straightforward because NATS exposes standard HTTP endpoints that any tool can consume.
Step 1: Set up the new monitoring stack — install Prometheus, Grafana, or your chosen platform in parallel with your current setup. Do not disable the old system yet.
Step 2: Configure nats-surveyor or native integrations — point nats-surveyor at your NATS servers and expose metrics to the new monitoring tool. Most platforms provide a NATS integration guide.
Step 3: Build equivalent dashboards — recreate your critical dashboards in the new system. Focus on the metrics you actually use during incidents, not every metric NATS exposes.
Step 4: Set up alerts — migrate alert rules from the old system to the new one. Test alerts by triggering known conditions like stopping a consumer or filling a stream past a threshold.
Step 5: Run both systems in parallel for at least one week — compare data between the old and new systems. Confirm the new system is collecting metrics correctly and firing alerts as expected.
Step 6: Disable the old system — once you confirm the new system works, shut down the old monitoring stack.
The entire migration typically takes one to two weeks depending on how many custom dashboards and alert rules you have. The key is running both systems in parallel so you can validate the new setup without risking a monitoring gap.
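During the parallel run, the comparison between the two systems can be partly automated. The sketch below takes readings exported from each system as plain name-to-value mappings and reports metrics that are missing on one side or differ by more than a tolerance. The function name, the metric names in the example, and the 5% default are all illustrative assumptions.

```python
def compare_readings(old, new, tolerance=0.05):
    """Report metrics that are missing in one system or differ beyond the tolerance."""
    mismatches = {}
    for name in sorted(set(old) | set(new)):
        if name not in old or name not in new:
            mismatches[name] = "missing in one system"
            continue
        # Relative difference against the larger magnitude avoids dividing by zero
        # when both readings are zero.
        baseline = max(abs(old[name]), abs(new[name]))
        if baseline and abs(old[name] - new[name]) / baseline > tolerance:
            mismatches[name] = f"old={old[name]} new={new[name]}"
    return mismatches
```

An empty result for the same time window across the metrics you rely on during incidents is a reasonable signal that the new system is collecting data correctly.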
NATS monitoring is not optional for production deployments. Without it, you are blind to message delivery failures, consumer lag, and resource exhaustion until users start reporting problems. The monitoring endpoints NATS provides make setup straightforward, and the tools available cover every deployment model from open source self hosted stacks to fully managed SaaS platforms.
Set up monitoring before you need it. Build dashboards that reflect your actual workflows. Set alerts that catch issues before they breach your SLA. And choose a platform that fits your team size, budget, and data control requirements.
Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.
Frequently Asked Questions
How do I enable NATS monitoring?
Start the NATS server with the monitoring flag and port: nats-server -m 8222. Or add monitor_port: 8222 to your NATS configuration file. The monitoring server will expose metrics at endpoints like /varz, /connz, and /jsz.
What is the difference between /varz and /jsz?
/varz provides general server metrics like uptime, memory, CPU, connection count, and message rates. /jsz provides JetStream specific metrics including stream count, consumer lag, storage usage, and pending messages. You need both for complete NATS monitoring.
How do I monitor NATS with Prometheus?
Use nats-surveyor to collect server and JetStream statistics over a NATS system account connection and expose them in Prometheus format. Configure Prometheus to scrape the nats-surveyor /metrics endpoint. Build Grafana dashboards and set up Alertmanager rules using the collected metrics.
What metrics should I alert on for NATS?
Alert on consumer lag exceeding your SLA threshold, slow consumer events, memory usage crossing 75% to 80%, and high redelivery counts per consumer. These are the metrics that indicate production issues before they cause user facing failures.
Can I monitor NATS without Prometheus?
Yes. You can query NATS monitoring endpoints directly using HTTP clients, integrate with Datadog or New Relic using their native NATS integrations, or use synthetic monitoring tools to check endpoint availability and response times.
How do I monitor JetStream consumer lag?
Query the /jsz endpoint and look for pending messages per consumer. This tells you how many messages are waiting to be delivered. Set alerts when pending messages exceed a threshold that matches your processing SLA, like 10,000 messages or 5 minutes of data.
What does a slow consumer event mean?
A slow consumer event means a client could not keep up with the message rate, so NATS disconnected it to prevent backpressure from affecting the server. Monitor this metric closely. Frequent slow consumer disconnects mean clients need performance optimization or more resources.





