Qdrant is a high-performance vector database built for semantic search, recommendation engines, and AI applications. As teams scale from prototype to production, monitoring becomes critical. Without visibility into query latency, memory consumption, and shard health, a slow indexing operation or a memory spike can silently degrade search quality or cause outages before anyone notices.
Qdrant exposes metrics in Prometheus format and provides telemetry endpoints for cluster health, collection statistics, and operational visibility. This guide covers what Qdrant monitoring is, how to instrument it with Prometheus and Grafana, which metrics matter most, and how to set up alerts that catch issues before they cascade.
What Is Qdrant Monitoring
Qdrant monitoring is the practice of continuously tracking vector database performance, resource utilization, and operational health to ensure fast query responses, stable indexing, and reliable availability for AI driven applications.
Monitoring a vector database differs from monitoring a traditional relational database. Query latency depends on vector dimensionality, index type (HNSW vs flat), and similarity metric (cosine, dot product, Euclidean). Memory usage grows with collection size and depends on whether vectors and indexes are kept in memory or on disk. Distributed Qdrant clusters introduce shard replication, consensus state, and node health metrics that do not exist in single node deployments.
Qdrant provides two primary monitoring endpoints: /metrics for Prometheus compatible metrics from each node, and /telemetry for detailed cluster state, collection statistics, and shard distribution. These endpoints surface everything from REST API response times and gRPC latency to HNSW index build progress and memory allocator statistics.
Effective Qdrant monitoring answers four operational questions: Is query latency acceptable for end users? Are resources (CPU, memory, disk) trending toward saturation? Are all shards healthy and replicated? Did the last indexing operation complete successfully without errors?
How Qdrant Monitoring Works
Qdrant exposes metrics through HTTP endpoints that tools like Prometheus can scrape at regular intervals. The /metrics endpoint returns node level telemetry in OpenMetrics format, including request counts, response durations, memory usage, and collection statistics. The /sys_metrics endpoint, available only in Qdrant Cloud, adds infrastructure metrics like CPU utilization, disk I/O, and load balancer telemetry.
Prometheus scrapes these endpoints every 15 to 60 seconds depending on your configuration and stores time series data for querying and visualization. Grafana connects to Prometheus as a data source and renders dashboards showing query rate, latency percentiles, error counts, and resource trends over time.
For distributed Qdrant clusters, each node exposes its own /metrics endpoint. It is critical to scrape every node individually rather than through a load balancer. Scraping through a load balancer produces inconsistent metrics because each request may land on a different node, making time series data unreliable for alerting or trending.
Qdrant also provides a /telemetry endpoint that returns JSON formatted cluster state, including the number of vectors per collection, shard distribution across nodes, and indexing progress. The /cluster/telemetry endpoint aggregates telemetry from all peers in a cluster, giving a unified view of shard transfer progress and replica health without querying each node separately.
Alerts are typically configured in Prometheus Alertmanager or Grafana. Common alert rules fire when query latency exceeds a threshold, memory usage crosses 80%, a node becomes unhealthy, or the number of dead replicas increases. Alerts route to Slack, PagerDuty, email, or incident management systems with full metric context to accelerate troubleshooting.
Key Qdrant Metrics to Track
Qdrant exposes dozens of metrics across application performance, collection health, API response behavior, and infrastructure resource usage. Monitoring all of them creates noise. Focus on the metrics that directly impact user experience or indicate impending failure.
Collection Metrics
Collection metrics track the number of vectors stored, indexed, and excluded from search results across all collections in your cluster.
collections_total reports the total number of collections. More than 500 collections usually signals an anti-pattern for Qdrant. A large number of collections creates significant resource overhead and can degrade resilience or cause outages. For multi-tenant data, a single collection partitioned by an indexed payload field is usually a better fit than many small per-tenant collections.
collection_points tracks the number of points (vectors with metadata) per collection. This metric helps you understand collection size and predict memory or disk requirements as data grows.
collection_vectors reports the number of vectors per collection and vector name. In Qdrant, a single point can contain multiple named vectors. This metric lets you track growth separately for each vector type.
collection_indexed_only_excluded_points counts points excluded from indexed only search mode per collection and vector name. If this number is high, users may see incomplete search results when querying with indexed only enabled.
Replica and Shard Health Metrics
Distributed Qdrant clusters rely on shard replication for availability. These metrics surface replication lag and unhealthy replicas that can lead to data loss or query failures.
collection_active_replicas_min reports the minimum number of active replicas across all collections and shards. If this drops below your configured replication factor, you have lost redundancy and are at risk of data loss during node failure.
collection_active_replicas_max reports the maximum number of active replicas. This helps you verify that replication is balanced and no collection has accumulated extra replicas due to configuration drift.
collection_dead_replicas counts non-active replicas across all collections and shards. Non-zero values indicate replication problems. Dead replicas should be investigated immediately and either recovered or removed to restore cluster health.
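As a rough sketch, the replica metrics above could back a Prometheus alert rule like the one below. The group name, alert name, hold time, and the replication factor of 2 are assumptions for illustration; substitute the replication factor your collections actually use.

```yaml
groups:
  - name: qdrant-replicas          # hypothetical group name
    rules:
      - alert: QdrantReplicationDegraded
        # Assumption: collections are configured with a replication factor of 2.
        # Fires when any node reports fewer active replicas than that.
        expr: collection_active_replicas_min < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant has fewer active replicas than the replication factor"
          description: "{{ $labels.instance }} reports a minimum of {{ $value }} active replicas"
```

Keeping the comparison per node (no aggregation) preserves the instance label, so the notification can say which node reported the shortfall.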
API Response Metrics
API response metrics track query performance and error rates across REST and gRPC interfaces. These are the most direct indicators of user experience.
rest_responses_total counts the number of REST API responses. Break this down by endpoint and status code to identify which operations are most frequent and whether error rates are increasing.
rest_responses_fail_total counts failed REST API responses. A rising failure rate often indicates upstream application errors, malformed requests, or resource exhaustion in Qdrant.
rest_responses_duration_seconds is a histogram of REST API response durations. Use this to calculate latency percentiles (p50, p95, p99) and alert when the p99 crosses acceptable thresholds for your use case.
grpc_responses_total and grpc_responses_duration_seconds provide equivalent visibility for gRPC calls. If your application uses gRPC for search or upsert operations, monitor these metrics with the same priority as REST metrics.
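One way to give gRPC latency the same priority is to precompute its p99 with a Prometheus recording rule, so dashboards and alerts can read a single series. This is a sketch only: the recorded series name is hypothetical, and it assumes the gRPC histogram exposes the usual _bucket series with an le label.

```yaml
groups:
  - name: qdrant-latency           # hypothetical group name
    rules:
      # Hypothetical recorded series name. The expression applies the standard
      # histogram_quantile pattern to the gRPC duration histogram, per node.
      - record: qdrant:grpc_responses_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (le, instance) (rate(grpc_responses_duration_seconds_bucket[5m])))
```

Summing by le and instance merges all endpoints on a node into one percentile; drop the aggregation if you want a separate percentile per endpoint.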
Memory and Resource Metrics
Qdrant uses a custom memory allocator and exposes allocator statistics that help you understand memory pressure before the operating system starts killing processes.
memory_active_bytes reports the total number of bytes in active pages allocated by the application. This is the working set size and the most accurate indicator of current memory consumption.
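memory_allocated_bytes reports the total number of bytes currently allocated by the application. Compare this to memory_active_bytes: a widening gap between the two points to allocator fragmentation.
memory_resident_bytes shows the maximum number of bytes in physically resident data pages mapped by the allocator. It includes pages the allocator is holding but has not yet returned to the operating system, so of the three it is closest to what the OS counts against the process. If it keeps growing while memory_active_bytes stays flat, the allocator is retaining memory rather than the workload needing more.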
process_threads counts the number of system threads in use by the Qdrant process. A sudden spike in thread count can indicate a locking problem or resource contention.
process_open_fds reports the number of open file descriptors. If this approaches process_max_fds, Qdrant will fail to open new files for indexing or logging. Alert when usage crosses 80% of the system limit.
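The 80% file descriptor threshold can be written as an alert rule along these lines. It is a minimal sketch that assumes process_max_fds is exported next to process_open_fds, as described above; the alert name and the 10 minute hold time are illustrative.

```yaml
groups:
  - name: qdrant-process           # hypothetical group name
    rules:
      - alert: QdrantFileDescriptorsNearLimit
        # Ratio of open descriptors to the process limit, per node.
        expr: process_open_fds / process_max_fds > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant is close to its file descriptor limit"
          description: "{{ $labels.instance }} is using {{ $value | humanizePercentage }} of its file descriptor limit"
```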
Cluster Consensus Metrics
If you are running a distributed Qdrant cluster, consensus metrics report the health of the Raft protocol that coordinates shard placement and replication.
cluster_enabled is a binary gauge indicating whether distributed mode is enabled. In a distributed production deployment it should be 1 on every node; a node reporting 0 is running standalone.
cluster_term reports the current Raft consensus term. A rapidly increasing term suggests frequent leader elections, which can be caused by network instability or node health issues.
cluster_pending_operations_total counts the number of pending consensus operations. A backlog here indicates that the cluster is struggling to replicate changes, often due to slow nodes or network latency.
cluster_voter is a binary gauge showing whether the node is a consensus voter (1) or learner (0). Only voters participate in leader elections. Learners replicate data but do not vote. Ensure your cluster has enough voters to maintain quorum.
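The consensus signals above translate into alert rules roughly as follows. The thresholds (more than 3 term changes in 15 minutes, a backlog that persists for 10 minutes) are assumptions to tune against your own cluster, not recommended values.

```yaml
groups:
  - name: qdrant-consensus         # hypothetical group name
    rules:
      - alert: QdrantFrequentLeaderElections
        # changes() counts how often the gauge value changed in the window.
        # Assumption: more than 3 term changes in 15 minutes is abnormal.
        expr: changes(cluster_term[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Raft term is changing frequently on {{ $labels.instance }}"
      - alert: QdrantConsensusBacklog
        # Assumption: pending operations should drain within 10 minutes.
        expr: cluster_pending_operations_total > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} consensus operations pending on {{ $labels.instance }}"
```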
Monitoring Qdrant with Prometheus and Grafana
Prometheus and Grafana form the standard open source stack for monitoring Qdrant. Prometheus scrapes metrics from Qdrant nodes at regular intervals and stores them as time series data. Grafana queries Prometheus and visualizes metrics in dashboards with graphs, heatmaps, and alerts.
Setting Up Prometheus to Scrape Qdrant Metrics
Prometheus requires a scrape configuration that tells it where to find Qdrant metrics and how often to collect them. Add a scrape target to your prometheus.yml configuration file:
scrape_configs:
  - job_name: 'qdrant'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'qdrant-node-1:6333'
          - 'qdrant-node-2:6333'
          - 'qdrant-node-3:6333'
Replace the target hostnames with the actual addresses of your Qdrant nodes. Each node exposes metrics on port 6333 by default at the /metrics endpoint. Do not scrape through a load balancer. Scrape each node individually to ensure metric consistency.
If you are running Qdrant in Kubernetes, use Prometheus service discovery to automatically detect Qdrant pods and scrape them without hardcoding IP addresses. Add a Kubernetes service discovery configuration:
scrape_configs:
  - job_name: 'qdrant'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: qdrant
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:6333
This configuration discovers all pods with the label app=qdrant and scrapes their /metrics endpoint on port 6333.
Once Prometheus is configured, restart it and verify that Qdrant metrics are being scraped by querying Prometheus for up{job="qdrant"}. If the result is 1, the scrape is working. If the result is 0 or the query returns no data, check your firewall rules, Qdrant configuration, and Prometheus logs for errors.
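The same up series can drive a node-down alert, so a failed scrape does not depend on someone running the query by hand. A minimal sketch, assuming the job name qdrant from the scrape configuration above; the alert name and the 1 minute hold time are illustrative.

```yaml
groups:
  - name: qdrant-availability      # hypothetical group name
    rules:
      - alert: QdrantNodeDown
        # up is 0 when Prometheus could not scrape the target.
        expr: up{job="qdrant"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape Qdrant node {{ $labels.instance }}"
```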
Building Grafana Dashboards for Qdrant
Grafana connects to Prometheus as a data source and visualizes Qdrant metrics in customizable dashboards. To get started, add Prometheus as a data source in Grafana by navigating to Configuration, Data Sources, and selecting Prometheus. Enter your Prometheus server URL and click Save & Test.
Qdrant provides an official Grafana dashboard that you can import directly. In Grafana, go to Dashboards, Import, and paste the dashboard ID or upload the JSON file. The official dashboard includes panels for query latency, memory usage, collection size, replica health, and API response rates.
If you want to build custom dashboards, start with these core panels:
Query latency (p95 and p99): Use the rest_responses_duration_seconds histogram to calculate 95th and 99th percentile latency. Add a Grafana panel with the query histogram_quantile(0.95, rate(rest_responses_duration_seconds_bucket[5m])) for p95 and histogram_quantile(0.99, rate(rest_responses_duration_seconds_bucket[5m])) for p99.
Memory usage trend: Plot memory_active_bytes over time to track working set size. Add a threshold line at 80% of total node memory to visualize headroom before saturation.
API error rate: Calculate the ratio of failed requests to total requests with the query rate(rest_responses_fail_total[5m]) / rate(rest_responses_total[5m]). Multiply by 100 to show error rate as a percentage.
Dead replica count: Create a stat panel showing sum(collection_dead_replicas). This should be zero in a healthy cluster. Any non-zero value requires immediate investigation.
Cluster pending operations: Plot cluster_pending_operations_total to detect replication lag. A rising trend indicates that the cluster is falling behind on consensus operations, often due to slow nodes or network issues.
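The error rate panel above also works as an alert. The sketch below aggregates per node before dividing, so the two sides of the ratio carry matching labels; the 5% threshold and the rule names are assumptions.

```yaml
groups:
  - name: qdrant-api               # hypothetical group name
    rules:
      - alert: QdrantHighRestErrorRate
        # Failed responses as a fraction of all responses, per node.
        # Assumption: more than 5% failures over 5 minutes is worth a warning.
        expr: |
          sum by (instance) (rate(rest_responses_fail_total[5m]))
            /
          sum by (instance) (rate(rest_responses_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "REST error rate on {{ $labels.instance }} is {{ $value | humanizePercentage }}"
```

When a node serves no traffic the division yields NaN, which never satisfies the comparison, so idle nodes do not trigger the alert.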
Setting Up Alerts in Prometheus and Grafana
Alerts notify you when Qdrant metrics cross thresholds that indicate performance degradation or impending failure. Prometheus Alertmanager handles alert routing and deduplication. Define alert rules in a Prometheus rules file:
groups:
  - name: qdrant
    rules:
      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, rate(rest_responses_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant p99 query latency above 1 second"
          description: "p99 latency is {{ $value }} seconds on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        # The memory metrics described in this guide do not include total node
        # memory, so the limit is written out. Assumption: nodes have 16 GiB of
        # RAM; replace the figure with your own node size.
        expr: memory_active_bytes > 0.8 * 16 * 1024 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant memory usage above 80% of node memory"
          description: "Node {{ $labels.instance }} has {{ $value | humanize1024 }}B of active memory"
      - alert: DeadReplicas
        expr: sum(collection_dead_replicas) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Qdrant cluster has dead replicas"
          description: "{{ $value }} replicas are non-active across all collections"
Reload Prometheus after adding alert rules. Alerts fire when the condition is met for the specified duration (for: 5m means the condition must be true for 5 consecutive minutes before the alert triggers). Configure Alertmanager to route alerts to Slack, PagerDuty, email, or your incident management system.
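Routing by the severity label might look like the Alertmanager configuration below. Treat it as a sketch: the receiver names, Slack channel, webhook URL, and PagerDuty key are placeholders, and it assumes alerts carry severity labels of warning or critical.

```yaml
route:
  receiver: slack-default          # fallback for anything not matched below
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'   # placeholder
        channel: '#qdrant-alerts'                                          # placeholder
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_INTEGRATION_KEY'                        # placeholder
```

Warnings go to Slack through the default receiver, while critical alerts such as dead replicas page the on-call engineer.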
Grafana also supports alerting directly from dashboard panels. For simpler setups, you can define alerts in Grafana instead of Prometheus. Edit any panel, go to the Alert tab, and define the condition. Grafana evaluates the rule at each evaluation interval and sends notifications through its built-in alerting system.
Best Practices for Qdrant Monitoring
Monitoring Qdrant effectively requires more than scraping metrics and building dashboards. Apply these practices to catch issues early and maintain reliable vector search performance at scale.
Monitor Each Node Individually in Distributed Clusters
Never scrape Qdrant metrics through a load balancer. Load balancers distribute requests across nodes, so each scrape may land on a different node. This produces inconsistent time series data that breaks trending, alerting, and debugging. Configure Prometheus to scrape each Qdrant node directly by listing all node endpoints in the scrape configuration.
Set Alerts Based on Percentiles, Not Averages
Average latency hides outliers. A query that takes 10 seconds affects user experience even if the average latency is 100 milliseconds. Use p95 or p99 latency for alerting instead of mean latency. The histogram_quantile function in Prometheus calculates percentiles from Qdrant’s rest_responses_duration_seconds histogram.
Track Collection Count and Alert Above 500
Qdrant performance degrades with large numbers of collections. Each collection consumes memory for metadata and indexing structures even when empty. If your cluster has more than 500 collections, consider consolidating them into a single collection with payload partitioning or splitting collections across multiple clusters. Set an alert when collections_total crosses 500 to catch this anti-pattern early.
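The collection count alert could be expressed as follows. This is a minimal sketch; the alert name, severity, and hold time are illustrative.

```yaml
groups:
  - name: qdrant-collections       # hypothetical group name
    rules:
      - alert: QdrantTooManyCollections
        # max() collapses the per-node series into one cluster-wide value.
        expr: max(collections_total) > 500
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant has {{ $value }} collections; consider consolidating tenants into one collection"
```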
Monitor Dead Replicas and Pending Consensus Operations
collection_dead_replicas should always be zero. Any non-zero value means replication has failed and you have lost redundancy. Investigate immediately and either recover the dead replica or remove it to restore cluster health.
cluster_pending_operations_total should stay low and stable. A rising trend indicates replication lag. Check network latency between nodes, disk I/O saturation, or slow consensus voters that are holding up the cluster.
Use Per-Collection Metrics for Multi-Tenant Workloads
If you are running a multi-tenant Qdrant cluster, enable per-collection API metrics by adding ?per_collection=true to the Prometheus scrape URL. This breaks down rest_responses_total and grpc_responses_total by collection name, letting you identify which tenant is driving high query rates or errors.
Note that enabling per-collection metrics replaces the global rest_responses_total metric entirely. The unlabeled metric will not be returned when per-collection mode is enabled.
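In Prometheus, query string parameters are set with the params field of a scrape job rather than by editing the target address. The sketch below assumes your Qdrant version accepts the per_collection parameter as described above; the job name and hostnames are placeholders.

```yaml
scrape_configs:
  - job_name: 'qdrant-per-collection'   # hypothetical job name
    scrape_interval: 15s
    metrics_path: /metrics
    params:
      per_collection: ['true']          # sent as /metrics?per_collection=true
    static_configs:
      - targets:
          - 'qdrant-node-1:6333'        # placeholder hostnames
          - 'qdrant-node-2:6333'
```

Because per-collection mode changes which series are returned, check dashboards and alert rules that query rest_responses_total without a collection label before switching an existing job over.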
Retain Metrics for Capacity Planning
Store Prometheus metrics for at least 30 days to support capacity planning and trend analysis. Short retention windows make it impossible to detect slow growth in memory usage or query latency. Use Prometheus remote write to push metrics to long term storage like Thanos, Cortex, or Mimir if you need retention beyond 30 days.
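Local retention is controlled by a Prometheus startup flag, while long-term storage is configured with remote_write in prometheus.yml. A minimal sketch, assuming a Mimir-compatible push endpoint at a placeholder URL:

```yaml
# Local retention is a command-line flag, not a prometheus.yml setting:
#   prometheus --storage.tsdb.retention.time=30d
remote_write:
  - url: 'https://mimir.example.internal/api/v1/push'   # placeholder endpoint
    queue_config:
      max_samples_per_send: 2000    # illustrative value; tune for your volume
```

Thanos typically takes a different route (a sidecar uploading TSDB blocks, or its own receive component), so the endpoint and setup depend on which backend you choose.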
Test Alerts in Staging Before Production
Alert fatigue kills on-call effectiveness. Test alert thresholds in a staging environment under realistic load before deploying them to production. If an alert fires too often with no actionable issue, raise the threshold or increase the for duration to reduce noise.
Monitoring Qdrant with CubeAPM
CubeAPM is a self hosted observability platform that runs inside your cloud or on-premises infrastructure. It natively supports Prometheus metrics, OpenTelemetry traces, and log ingestion, making it a unified alternative to managing separate Prometheus and Grafana instances for Qdrant monitoring.
CubeAPM connects to Qdrant via Prometheus scrape configuration the same way you would configure Prometheus directly. It ingests metrics from Qdrant’s /metrics endpoint, stores them with unlimited retention, and provides pre-built dashboards and alerting for vector database workloads.
What makes CubeAPM different from the Prometheus + Grafana stack is that it combines metrics, traces, and logs in a single platform with correlated search. If a Qdrant query latency spike triggers an alert, you can drill from the metric spike directly into application traces that show which API call caused the slow query, and then into logs that capture the exact error message or resource contention event.
CubeAPM supports custom dashboards and alert rules using the same PromQL query language you would use in Prometheus. You can migrate existing Prometheus alert rules to CubeAPM without rewriting them. Alerts route to Slack, PagerDuty, email, and webhooks with full metric and trace context.
Because CubeAPM runs on your infrastructure, there are no data egress fees, no per-seat licensing, and no external SaaS dependency. Telemetry from Qdrant stays inside your VPC, meeting data residency and compliance requirements. For teams running Qdrant in regulated industries or environments with strict data control policies, this deployment model removes the compliance risk of sending telemetry to external SaaS platforms.
CubeAPM pricing is $0.15 per GB of data ingested, covering metrics, traces, and logs with unlimited retention. There are no per-user fees or per-host charges. For a Qdrant cluster generating 500 MB of metrics per day, monthly cost would be approximately $2.25 for metrics alone. Most teams also ingest application traces and logs, bringing total monthly cost to a predictable figure based on data volume rather than infrastructure scale.
CubeAPM provides native Kubernetes support with cluster, pod, and node level visibility. If you are running Qdrant in Kubernetes, CubeAPM automatically discovers Qdrant pods and correlates Qdrant metrics with Kubernetes resource metrics (CPU, memory, disk I/O) without requiring separate integrations or exporters.
Common Qdrant Monitoring Challenges and How to Fix Them
Monitoring Qdrant in production surfaces patterns and problems that are not obvious in development. These are the issues teams hit most often and how to resolve them.
High Memory Usage Leading to OOMKill
Qdrant stores vector indexes and metadata in memory for fast query performance. If memory usage grows beyond available RAM, the operating system kills the Qdrant process with an out-of-memory (OOMKill) error. This causes downtime and data loss if the cluster is not highly available.
Monitor memory_active_bytes and memory_resident_bytes. Set an alert when memory_active_bytes crosses 80% of total node memory. If memory usage is growing faster than expected, check collection_points and collection_vectors to identify which collections are expanding. Consider enabling on-disk storage for large collections, scaling vertically to nodes with more RAM, or scaling horizontally by adding nodes and re-sharding collections.
Query Latency Spikes During Indexing
HNSW index builds are CPU and memory intensive. When Qdrant is indexing a large batch of new vectors, query latency often spikes because indexing and search operations compete for resources. This is especially noticeable in single node deployments or clusters without enough CPU headroom.
Monitor rest_responses_duration_seconds and collection_running_optimizations. If latency spikes correlate with non-zero collection_running_optimizations, indexing is the cause. Optimize by scheduling bulk upserts during low traffic windows, increasing CPU allocation, or adjusting Qdrant’s optimizer configuration to reduce indexing aggressiveness.
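The correlation described above can be encoded in one rule that fires only when both conditions hold on the same node. This is a sketch under assumptions: the 1 second latency threshold and the rule names are illustrative, and it relies on collection_running_optimizations carrying an instance label from the scrape.

```yaml
groups:
  - name: qdrant-indexing          # hypothetical group name
    rules:
      - alert: QdrantLatencyDuringOptimization
        # Left side: at least one optimization running on the node.
        # Right side: p99 REST latency on the same node above 1 second.
        expr: |
          max by (instance) (collection_running_optimizations) > 0
          and on (instance)
          histogram_quantile(0.99, sum by (le, instance) (rate(rest_responses_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "High latency on {{ $labels.instance }} while optimizations are running"
```

A low severity suits this rule: it explains a latency alert rather than adding a second page for the same incident.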
Dead Replicas After Node Restart
When a Qdrant node restarts, replicas on that node temporarily become unavailable. If the node takes too long to rejoin the cluster or encounters errors during startup, those replicas remain dead even after the node is back online.
Monitor collection_dead_replicas and set a critical alert when the count is above zero. Investigate logs on the affected node for errors during startup. Common causes include corrupted shard data, network partitions preventing the node from contacting the cluster, or misconfigured Raft consensus settings. Use the Qdrant cluster API to manually recover or remove dead replicas once the root cause is fixed.
Cluster Consensus Lag
Raft consensus requires a majority of voters to commit operations. If one voter is slow (due to disk I/O saturation, network latency, or CPU throttling), the entire cluster slows down because consensus operations wait for the slow voter to respond.
Monitor cluster_pending_operations_total. A rising trend indicates consensus lag. Check disk latency, network round trip time between nodes, and CPU usage on each voter. If one node is consistently slow, consider replacing it or demoting it from voter to learner status so it no longer participates in consensus decisions.
Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.
Frequently Asked Questions
What metrics should I monitor first in Qdrant?
Start with query latency (p95 and p99 from `rest_responses_duration_seconds`), memory usage (`memory_active_bytes`), API error rate (`rest_responses_fail_total`), and replica health (`collection_dead_replicas`). These four metrics cover user experience, resource pressure, and cluster stability.
Can I monitor Qdrant without Prometheus?
Yes. Qdrant exposes metrics in OpenMetrics format, so any tool that scrapes Prometheus compatible endpoints works. Alternatives include Grafana Cloud Agent, Datadog Agent, New Relic infrastructure agent, or any OpenTelemetry collector configured to scrape Prometheus targets.
How do I monitor Qdrant Cloud clusters?
Qdrant Cloud provides metrics and logs in the Qdrant Cloud Console under the Metrics, Logs, and Request sections of the Cluster Details page. Qdrant Cloud also exposes a `/sys_metrics` endpoint with infrastructure telemetry including CPU, memory, disk utilization, and load balancer metrics. You can scrape this endpoint with Prometheus or forward it to your observability platform.
Why should I scrape each Qdrant node individually?
Scraping through a load balancer produces inconsistent metrics because each scrape may land on a different node. Time series data from different nodes cannot be reliably trended or alerted on. Always configure Prometheus to scrape each Qdrant node directly using static targets or Kubernetes service discovery.
What does a high collection_dead_replicas count mean?
It means one or more shard replicas are not active. This reduces redundancy and increases the risk of data loss if another node fails. Investigate logs on the affected nodes, check network connectivity, and use the Qdrant cluster API to recover or remove dead replicas.
How do I calculate Qdrant query latency percentiles?
Use the `histogram_quantile` function in Prometheus with the `rest_responses_duration_seconds` histogram. For p95 latency, query `histogram_quantile(0.95, rate(rest_responses_duration_seconds_bucket[5m]))`. For p99, use 0.99 instead of 0.95.
Can I monitor multiple Qdrant clusters in one Grafana dashboard?
Yes. Add each cluster as a separate Prometheus data source in Grafana or use Prometheus federation to aggregate metrics from multiple Prometheus instances. Label each cluster with a unique identifier in the Prometheus scrape configuration so you can filter and group metrics by cluster in Grafana queries.
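A cluster label can be attached at scrape time with the labels field of static_configs, as in this sketch. The cluster names and hostnames are placeholders.

```yaml
scrape_configs:
  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant-a-1:6333', 'qdrant-a-2:6333']   # placeholder hosts
        labels:
          cluster: 'search-prod'                          # hypothetical cluster name
      - targets: ['qdrant-b-1:6333']
        labels:
          cluster: 'recs-prod'                            # hypothetical cluster name
```

With the label in place, a query such as sum by (cluster) (collection_dead_replicas) returns one value per cluster, and a Grafana dashboard variable on cluster can switch between them.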





