Aerospike is a distributed NoSQL database built for real-time data at scale. When a single Aerospike cluster handles millions of transactions per second across dozens of nodes, a slow disk write or silent memory pressure on one node can cascade into user-facing latency spikes that are difficult to trace without proper instrumentation. Production teams running Aerospike at scale need continuous visibility into cluster health, namespace capacity, node performance, and replication lag.
This guide covers what Aerospike monitoring is, how the Aerospike monitoring stack works, what metrics to track, and how to choose the right monitoring approach for your infrastructure.
What Is Aerospike Monitoring
Aerospike monitoring is the practice of continuously tracking operational metrics from Aerospike database clusters to maintain performance, detect failures early, and prevent capacity bottlenecks. Unlike traditional relational databases, Aerospike’s distributed architecture means that a problem on one node can affect cluster wide availability, replication health, and query latency.
Aerospike monitoring typically covers four areas: cluster health and availability, namespace memory and storage usage, node performance including CPU, memory, and disk IO, and replication lag across data centers. Without monitoring, teams discover problems only after they affect users. With monitoring, alerts fire when disk usage crosses 90%, when eviction rates spike, or when client connection failures indicate a network partition.
Aerospike itself provides rich operational metrics through multiple interfaces. The main telemetry export mechanism is the Aerospike Prometheus Exporter, which scrapes metrics from the Aerospike server’s info protocol and exposes them in Prometheus format. This exporter is the core of the official Aerospike Monitoring Stack, which combines Prometheus for metric storage and Grafana for visualization.
How Aerospike Monitoring Works
The Aerospike monitoring workflow begins with the Aerospike server itself. Every Aerospike node exposes operational metrics via its info protocol, a simple telnet-style text interface that returns real-time statistics on demand. These metrics include everything from cluster state and node health to namespace capacity, replication status, and client connection counts.
The Aerospike Prometheus Exporter connects to each Aerospike node’s info port, scrapes the metrics, transforms them into Prometheus exposition format, and makes them available at an HTTP endpoint that Prometheus can pull from. Prometheus stores these time series metrics and handles alerting rules. Grafana connects to Prometheus as a data source and renders dashboards that show cluster health, namespace trends, node performance, and alert history.
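To make the exposition format concrete, here is a minimal Python sketch that parses a few lines of Prometheus text output into a dictionary. The sample payload, metric names, and label names are illustrative assumptions, not a guaranteed copy of what the exporter emits, and the parser only handles simple lines without timestamps or escaped label values.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]

# Hypothetical sample of an exporter response; names are illustrative only.
SAMPLE = """\
# HELP aerospike_namespace_memory_used_bytes illustrative metric
# TYPE aerospike_namespace_memory_used_bytes gauge
aerospike_namespace_memory_used_bytes{ns="test",service="10.0.0.1:3000"} 1048576
aerospike_node_stats_client_connections{service="10.0.0.1:3000"} 42
"""


def parse_exposition(text: str) -> Dict[Tuple[str, Labels], float]:
    """Parse simple Prometheus text-format lines into {(name, labels): value}."""
    samples: Dict[Tuple[str, Labels], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = []
        for pair in label_part.rstrip("}").split(","):
            if not pair:
                continue  # metric without labels
            key, _, raw = pair.partition("=")
            labels.append((key.strip(), raw.strip().strip('"')))
        samples[(name, tuple(sorted(labels)))] = float(value)
    return samples


parsed = parse_exposition(SAMPLE)
for (name, labels), value in parsed.items():
    print(name, dict(labels), value)
```

In practice Prometheus does this parsing itself. The sketch is only meant to show that each scraped line is a metric name, a set of labels, and a numeric value.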
This stack runs alongside your Aerospike cluster. In typical production deployments, the Prometheus exporter runs as a sidecar container or dedicated service with network access to all Aerospike nodes. Prometheus itself runs as a separate service, scraping the exporter at regular intervals, usually every 10 to 30 seconds. Grafana provides the frontend for SRE teams to visualize real time cluster behavior and investigate incidents.
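Once Prometheus has scraped the exporter, the stored series can be read back through the Prometheus HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint. The host name and the metric name aerospike_node_up are assumptions chosen for illustration, so substitute whatever your Prometheus instance and exporter actually expose.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_base: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query HTTP endpoint."""
    query_string = urlencode({"query": promql})
    return prometheus_base.rstrip("/") + "/api/v1/query?" + query_string


# Hypothetical host and metric name, for illustration only.
url = instant_query_url("http://prometheus.example:9090", "aerospike_node_up == 0")
print(url)
```

Fetching this URL with any HTTP client returns a JSON document listing the series that match the expression.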
For teams using OpenTelemetry, the OpenTelemetry Collector can also import monitoring data from Aerospike. The Collector connects to the Aerospike Prometheus Exporter, ingests the metrics, and forwards them to any observability backend that supports OpenTelemetry such as Datadog, New Relic, or infrastructure monitoring tools that support OTel natively. This approach is useful for teams standardizing on OpenTelemetry across multiple infrastructure components.
The Aerospike Monitoring Stack also includes pre built Grafana dashboards tailored to Aerospike’s architecture. These dashboards cover multi cluster views, namespace memory usage, replication health, client operations, and node level resource consumption. Some dashboards depend on external Grafana plugins like grafana-polystat-panel and jdbranham-diagram-panel to render specific visualizations. Teams can deploy the full stack using Docker Compose for local testing or use tools like AeroLab for automated cluster and monitoring deployment.
Key Metrics to Monitor in Aerospike
Aerospike exposes hundreds of metrics. Knowing which ones matter most for alerting and capacity planning prevents metric overload and ensures teams focus on signals that actually predict failures.
Cluster Health Metrics
Cluster integrity is the first thing to check. The cluster size metric shows how many nodes are currently part of the cluster. If this drops unexpectedly, it means a node has left the cluster due to a crash, network partition, or configuration error. The cluster state metric indicates whether the cluster is healthy or degraded. A degraded state often signals that some nodes cannot reach each other.
The nodes online metric tracks how many nodes are reachable. A sudden drop here indicates network failures or node crashes that require immediate attention. The cluster clock skew metric shows time drift between nodes. Aerospike relies on synchronized clocks for conflict resolution in multi-datacenter replication. Small skew is normal even with NTP in place, but large or steadily growing skew can cause data inconsistency and should be investigated.
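A cluster size check can be sketched in a few lines of Python. The function below assumes you already have, for each reachable node, the cluster size that node reports. The input shape and node names are hypothetical, and the expected size is something you supply from your own inventory.

```python
from typing import Dict, List


def cluster_size_issues(expected_size: int, reported: Dict[str, int]) -> List[str]:
    """Compare each node's reported cluster size with the expected size."""
    issues = []
    missing = expected_size - len(reported)
    if missing > 0:
        issues.append(f"{missing} node(s) not reporting")
    for node, size in sorted(reported.items()):
        if size != expected_size:
            issues.append(f"{node} sees cluster_size={size}, expected {expected_size}")
    return issues


# Hypothetical three-node cluster where one node has dropped out.
print(cluster_size_issues(3, {"node-a": 2, "node-b": 2}))
```

Nodes that disagree about the cluster size are a hint that the cluster has split, which is worth alerting on separately from a plain node loss.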
Namespace Memory and Storage
Namespaces in Aerospike are logical containers for data, similar to databases in relational systems. Each namespace has defined memory and storage limits. The memory used metric shows how much RAM the namespace is consuming. When this crosses the configured high-water mark, Aerospike begins evicting records that have an expiration set, according to the eviction policy, which can surprise teams if not monitored.
The disk used metric tracks storage consumption. Aerospike can store data in memory only, on disk with memory index, or entirely on disk. For disk backed namespaces, crossing storage thresholds triggers stop writes mode, where the namespace refuses new writes to prevent data loss. The eviction count metric shows how many records Aerospike has evicted to free memory. A sudden spike in evictions often correlates with a memory leak or unexpected traffic surge.
The objects metric counts total records in the namespace. Tracking this over time helps with capacity planning and identifying unusual growth patterns. The stop writes flag is a critical alert condition. When true, the namespace has hit a hard limit and is refusing writes. This typically happens when disk or memory usage crosses the configured stop-writes threshold. The parameter names and default values differ across server versions, so confirm them in your namespace configuration.
Node Performance Metrics
Node-level metrics reveal bottlenecks at the infrastructure layer. The node CPU usage metric shows processor utilization. Sustained CPU usage above 80% can slow query processing and cause latency spikes. The node memory usage metric tracks RAM consumption on the host. Running out of memory forces the OS to swap, which severely degrades Aerospike performance.
The disk read latency and disk write latency metrics measure storage IO performance. Aerospike is designed for sub millisecond disk access. If disk latency exceeds 1 millisecond consistently, it indicates slow storage, misconfigured SSDs, or hardware failure. The batch index queue metric shows how many batch requests are queued. A growing queue means the node cannot keep up with batch read requests.
The client connections metric counts active client connections to the node. A sudden drop signals network issues or application crashes. The proxy in use metric shows how many proxy requests are active. Proxy requests occur when a client sends a request to the wrong node and it must be forwarded to the correct node. Some proxying is expected while the cluster is migrating data, but sustained high proxy counts on a stable cluster indicate poor client-side routing.
Replication and Cross Datacenter Replication
Aerospike supports two replication models: intra cluster replication for high availability and cross datacenter replication for disaster recovery. The replication factor metric defines how many copies of each record exist. The default is 2, meaning every record is stored on two nodes. If the effective replication factor drops below the configured value, it means some partitions are under replicated, usually due to node failures.
The migrate progress metric shows the percentage of data migration completed. Migration happens during cluster changes such as adding or removing nodes. A migration that stops advancing partway, for example one that sits at 50%, indicates a problem with data transfer. The XDR lag metric measures how far behind the remote datacenter is in replication. High XDR lag means the remote cluster is not keeping up with writes, often due to network bandwidth limits or remote cluster overload.
The XDR success rate metric shows the percentage of successfully replicated records. A success rate below 100% means some records are failing to replicate, which can cause data inconsistency across datacenters. The remote lag time metric measures the time difference between when a record was written locally and when it was replicated remotely. Tracking this helps teams understand replication delay in multi region deployments.
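As a rough illustration, the success rate can be derived from two cumulative counts: records shipped successfully and records that failed. The function and argument names below are hypothetical and do not correspond to specific exporter metric names.

```python
def xdr_success_rate(successes: int, failures: int) -> float:
    """Return the percentage of replication attempts that succeeded."""
    total = successes + failures
    if total == 0:
        return 100.0  # nothing attempted, so nothing has failed
    return 100.0 * successes / total


# Hypothetical counts over one observation window.
rate = xdr_success_rate(successes=9990, failures=10)
print(f"{rate:.1f}%")
```

Computing the rate over a recent window rather than over the lifetime of the node keeps old failures from masking, or exaggerating, the current state.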
Client Operation Metrics
Client facing metrics show how Aerospike responds to application requests. The read operations per second metric shows read throughput. The write operations per second metric shows write throughput. Tracking both helps teams understand traffic patterns and detect anomalies.
The read latency and write latency metrics measure how long operations take. Aerospike aims for sub millisecond latency. If read latency spikes above 5 milliseconds or write latency above 10 milliseconds, it signals a performance problem at the storage, network, or query layer. The timeout errors metric counts operations that exceeded the client timeout. High timeout rates often indicate slow queries, overloaded nodes, or network congestion.
The unavailable errors metric shows operations that failed because the requested partition was unavailable. This happens during node failures or migrations. The client error rate metric tracks all client visible errors. A sudden spike here requires immediate investigation as it directly affects application reliability.
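Throughput and error rates are usually derived from cumulative counters rather than read directly. The sketch below assumes two samples of a monotonically increasing counter, which is an assumption about how the values are exposed, and turns them into a per-second rate. It treats a decrease as a counter reset, for example after a node restart.

```python
def per_second_rate(prev_value: float, prev_ts: float,
                    curr_value: float, curr_ts: float) -> float:
    """Convert two cumulative counter samples into a per-second rate."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter went backwards: assume it restarted from zero.
        delta = curr_value
    return delta / elapsed


# Hypothetical read counter sampled 30 seconds apart.
print(per_second_rate(1000, 0, 4000, 30))
```

Prometheus's rate() function applies the same idea across many samples, so in a Prometheus-based setup this calculation normally lives in the query rather than in application code.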
Aerospike Monitoring Tools and Platforms
Teams monitoring Aerospike can choose between the official Aerospike Monitoring Stack, integration with existing observability platforms, or third party tools that simplify deployment and correlation.
Aerospike Monitoring Stack with Prometheus and Grafana
The official Aerospike Monitoring Stack is the most common starting point. It consists of the Aerospike Prometheus Exporter, Prometheus for metric storage, and Grafana for dashboards. The exporter is maintained by Aerospike and receives regular updates to support new metrics and configurations. The stack is fully open source and can run anywhere Docker or Kubernetes is available.
Teams using this stack get access to pre built Grafana dashboards that cover multi cluster views, namespace health, replication status, and node performance. The dashboards are hosted in the Aerospike monitoring GitHub repository. Some dashboards depend on Grafana plugins like grafana-polystat-panel for multi cluster visualization and jdbranham-diagram-panel for topology views. These plugins need to be installed separately using grafana-cli.
The main limitation of the Aerospike Monitoring Stack is that it only handles metrics. Logs and distributed traces require separate tooling. Teams running microservices alongside Aerospike often need a unified observability platform that correlates Aerospike metrics with application traces and infrastructure logs.
Monitoring Aerospike with New Relic
New Relic offers a quickstart integration for Aerospike that collects key metrics like uptime, info stats, memory usage, and client connections. The integration uses the New Relic infrastructure agent to scrape metrics from Aerospike nodes and forward them to New Relic’s cloud platform.
This integration simplifies deployment for teams already using New Relic for application performance monitoring. However, New Relic’s pricing model based on data ingest and user seats can become expensive at scale. A 50 node Aerospike cluster generating 15 TB of telemetry per month can cost over $5,000 per month on New Relic’s Standard plan before adding APM or logs. Teams should verify current rates at New Relic’s pricing page.
Monitoring Aerospike with CubeAPM
CubeAPM provides infrastructure monitoring for Aerospike clusters with native support for Prometheus and OpenTelemetry exporters. It runs inside your own cloud or on premises environment, meaning Aerospike metrics never leave your infrastructure. This solves both data residency requirements and egress cost problems that affect cloud based SaaS platforms.
CubeAPM ingests metrics from the Aerospike Prometheus Exporter or via the OpenTelemetry Collector. Once ingested, all Aerospike metrics are automatically indexed and searchable without additional configuration. Teams can create custom dashboards for namespace health, replication lag, and node performance using the same interface they use for application traces and logs. Alerts on Aerospike metrics can be routed to Slack, PagerDuty, or email with full trace context when issues correlate with application errors.
Because CubeAPM uses a flat $0.15 per GB pricing model with no per host or per user fees, scaling Aerospike monitoring from 10 nodes to 100 nodes does not multiply costs. A 50 node cluster generating 8 TB of metrics per month costs $1,200 per month with unlimited retention and full search capability. This is often 60% to 70% lower than SaaS platforms that charge per host or per metric series.
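The arithmetic behind that figure is straightforward. The sketch below uses the per-GB rate quoted above and assumes decimal units (1 TB = 1,000 GB). It is a back-of-the-envelope check, not a pricing calculator.

```python
def flat_monthly_cost_usd(tb_per_month: float, cents_per_gb: int = 15) -> float:
    """Monthly cost under flat per-GB pricing, using 1 TB = 1000 GB."""
    gb_per_month = tb_per_month * 1000
    return gb_per_month * cents_per_gb / 100


# 8 TB per month at $0.15 per GB, as in the example above.
print(flat_monthly_cost_usd(8))
```

Node count does not appear in the formula, which is why growing the cluster only changes the bill to the extent that it changes the volume of data ingested.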
CubeAPM also correlates Aerospike metrics with application traces. If a slow database query in your application code correlates with a spike in Aerospike disk latency on a specific node, CubeAPM surfaces that connection automatically. This cross layer visibility helps teams troubleshoot faster without switching between multiple monitoring tools.
Monitoring Aerospike with Datadog
Datadog supports Aerospike monitoring through its Aerospike integration, which collects metrics via the Aerospike Prometheus Exporter. The integration provides pre built dashboards for cluster health, namespace usage, and replication status. Teams already using Datadog for infrastructure and APM can add Aerospike metrics to the same unified view.
Datadog’s strength is breadth of integrations and ease of use. However, its pricing model based on hosts and ingested data can scale unpredictably. A 50 host Aerospike cluster monitored via Datadog Infrastructure Monitoring at $18 per host per month costs $900 per month before adding custom metrics, APM, or logs. Verify current rates at Datadog’s pricing page.
Monitoring Aerospike with Elastic APM and the ELK Stack
Teams running the Elastic Stack for logs and application monitoring can add Aerospike metrics using the Aerospike Prometheus Exporter and Elastic’s Prometheus integration. The integration collects the Prometheus metrics and stores them in Elasticsearch alongside logs and Elastic APM traces. This provides a unified query interface using Kibana.
The main trade off with Elastic is operational complexity. Running a production grade Elasticsearch cluster for metrics storage requires dedicated infrastructure, ongoing tuning, and expertise in managing shards, replicas, and query performance. For teams already invested in the ELK stack, this is a natural fit. For teams new to Elastic, the learning curve is steep.
Best Practices for Aerospike Monitoring
Effective Aerospike monitoring requires more than deploying a metrics exporter and dashboards. These practices help teams catch problems early and reduce mean time to resolution when incidents occur.
Set Alerts on Critical Thresholds
Alerts should fire before problems affect users. For namespace memory usage, set an alert at 85% of the configured limit. This gives time to add capacity or tune eviction policies before stop writes mode kicks in. For disk usage, alert at 80% and escalate at 90%. The stop-writes thresholds themselves are configurable and their defaults vary by server version, so check your namespace configuration and set alert levels comfortably below them.
Alert on replication factor drops. If the effective replication factor falls below the configured value, it means some partitions are under replicated. This is a data availability risk. Alert on XDR lag exceeding 60 seconds in multi region deployments. Sustained lag means the remote datacenter is falling behind and may miss writes during a failover.
Alert on disk latency exceeding 1 millisecond. Aerospike is designed for sub millisecond storage. Latencies above this threshold indicate hardware problems or misconfigured storage. Alert on client timeout rates exceeding 1%. This is a direct signal of user facing performance degradation.
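The thresholds in this section can be collected into a single evaluation function. This is a minimal sketch: the dictionary keys are hypothetical names for values you would pull from your metrics backend, and the numbers are the ones suggested above, which you should tune for your own workload.

```python
from typing import Dict, List


def evaluate_alerts(sample: Dict[str, float]) -> List[str]:
    """Apply the static thresholds from this section to one metrics sample."""
    alerts = []
    if sample["memory_used_pct"] >= 85:
        alerts.append("warning: namespace memory at or above 85%")
    if sample["disk_used_pct"] >= 90:
        alerts.append("critical: disk usage at or above 90%")
    elif sample["disk_used_pct"] >= 80:
        alerts.append("warning: disk usage at or above 80%")
    if sample["effective_replication_factor"] < sample["configured_replication_factor"]:
        alerts.append("critical: partitions under-replicated")
    if sample["xdr_lag_seconds"] > 60:
        alerts.append("warning: XDR lag above 60 seconds")
    if sample["disk_latency_ms"] > 1:
        alerts.append("warning: disk latency above 1 ms")
    if sample["client_timeout_pct"] > 1:
        alerts.append("warning: client timeouts above 1%")
    return alerts


# Hypothetical sample with disk usage in the warning band.
healthy_except_disk = {
    "memory_used_pct": 70, "disk_used_pct": 82,
    "effective_replication_factor": 2, "configured_replication_factor": 2,
    "xdr_lag_seconds": 5, "disk_latency_ms": 0.4, "client_timeout_pct": 0.1,
}
print(evaluate_alerts(healthy_except_disk))
```

In a Prometheus deployment these checks would normally be written as alerting rules. The Python version is only meant to make the logic explicit.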
Monitor Across Cluster, Namespace, and Node Levels
Aerospike metrics exist at three levels and monitoring all three prevents blind spots. Cluster level metrics like nodes online and cluster state show overall health. Namespace level metrics like memory used and eviction count show logical data behavior. Node level metrics like CPU usage and disk latency show physical resource constraints.
A namespace wide eviction spike might indicate traffic growth. A node level disk latency spike might indicate a failing SSD on one server. Correlating across levels helps teams distinguish between application load issues and infrastructure failures.
Track Trends Over Time for Capacity Planning
Aerospike monitoring is not just for alerting. Historical trends help teams plan capacity before hitting limits. Track namespace memory growth weekly. If memory usage increases 10% per month, extrapolate when the cluster will hit capacity and plan node additions ahead of time.
Track disk usage trends across namespaces. If a namespace grows from 2 TB to 4 TB in three months, plan for additional storage or enable compression. Track client operation rates over time. If read operations per second have doubled in six months, evaluate whether current cluster size can handle the next doubling.
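The extrapolation described above can be sketched with compound growth. The example assumes growth continues at a constant monthly rate, which is a simplification, and the 60 GB and 100 GB figures are hypothetical.

```python
import math


def monthly_growth_rate(start: float, end: float, months: float) -> float:
    """Constant monthly rate that takes `start` to `end` in `months` months."""
    return (end / start) ** (1 / months) - 1


def months_until_limit(current: float, limit: float, monthly_rate: float) -> int:
    """Whole months until `current` reaches `limit` at a compound monthly rate."""
    if current >= limit:
        return 0
    if monthly_rate <= 0:
        raise ValueError("usage is not growing")
    return math.ceil(math.log(limit / current) / math.log(1 + monthly_rate))


# Hypothetical namespace: 60 GB used, 100 GB limit, growing 10% per month.
print(months_until_limit(60, 100, 0.10))

# Growth from 2 TB to 4 TB in three months implies roughly 26% per month.
print(round(monthly_growth_rate(2, 4, 3), 3))
```

A projection like this is most useful as a lead-time check: if the answer is shorter than the time it takes to procure and add nodes, capacity work should start now.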
Integrate Aerospike Metrics with Application Traces
Aerospike is infrastructure, but it supports applications. Monitoring Aerospike in isolation misses the connection between database performance and user experience. Correlating Aerospike metrics with application traces reveals causality. A slow API response might correlate with a spike in Aerospike read latency on a specific namespace. An error spike in your application might correlate with a node failure in the Aerospike cluster.
Tools like CubeAPM that unify metrics, traces, and logs make this correlation automatic. Teams can view an application trace showing a slow database query and immediately see the corresponding Aerospike node disk latency spike without switching tools. This reduces troubleshooting time from hours to minutes.
Use Pre-Built Dashboards and Customize for Your Workload
The Aerospike Monitoring Stack includes pre-built Grafana dashboards for common use cases. Start with these and customize based on your specific workload. If your application does heavy batch reads, add a dashboard panel tracking batch index queue depth. If you use cross-datacenter replication heavily, create dedicated panels for XDR lag and success rate.
Document your custom dashboards and share them across teams. When a production incident occurs, everyone should know which dashboard to check first.
Test Monitoring During Failure Scenarios
Monitoring tools that work in steady state often fail during real incidents. Test your Aerospike monitoring setup during controlled failure scenarios. Simulate a node failure by stopping one Aerospike process. Verify that alerts fire, dashboards update correctly, and the cluster state reflects the missing node.
Simulate disk full conditions by writing data until a namespace hits stop writes mode. Confirm that your alerts fire before writes are blocked. Simulate network partitions between nodes and verify that cluster integrity metrics detect the partition. These tests build confidence that monitoring will work when it matters most.
Common Aerospike Monitoring Challenges
Even with the right tools, teams encounter recurring challenges when monitoring Aerospike at scale.
Metrics Explosion at Scale
A 100-node Aerospike cluster with 10 namespaces generates tens of thousands of unique metric series. Each namespace exposes 50+ metrics per node, so 100 nodes times 10 namespaces times 50 metrics is already 50,000 series before node-level and replication metrics are counted. At that volume the total metric count becomes difficult to manage. Traditional monitoring platforms charge per metric series or per host, making large Aerospike clusters expensive to monitor.
The solution is choosing a monitoring platform with flat pricing or unlimited metric cardinality. CubeAPM indexes all Aerospike metrics without additional cost, regardless of cluster size. Prometheus stores high cardinality metrics efficiently but requires careful retention tuning to avoid storage bloat.
Correlating Aerospike Metrics with Application Behavior
Aerospike metrics show what the database is doing. Application traces show what the application is doing. Connecting the two requires tooling that ingests both signal types and correlates them automatically. Without correlation, teams troubleshoot in two separate tools and miss the connection between a slow query and a disk latency spike.
Using an observability platform that handles metrics and traces together solves this. Synthetic monitoring tools can also help by simulating application requests and alerting when Aerospike backed endpoints slow down, even if Aerospike metrics look healthy.
Alert Fatigue from Noisy Metrics
Aerospike exposes metrics that fluctuate naturally during normal operation. Eviction counts spike during peak traffic. Disk latency varies with IO load. Setting static thresholds on these metrics generates false positives that train teams to ignore alerts.
The solution is using anomaly detection or dynamic baselines. Instead of alerting when disk latency exceeds 1 millisecond, alert when disk latency exceeds the 95th percentile of the past 7 days by more than 50%. This adapts to normal traffic patterns and reduces noise.
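A dynamic baseline of this kind can be sketched as follows. The example uses a nearest-rank percentile over a window of historical samples and flags the current value only when it exceeds that baseline by the chosen margin. The sample values are hypothetical, and a production system would typically compute the percentile in the metrics backend rather than in application code.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def exceeds_baseline(history_ms: Sequence[float], current_ms: float,
                     pct: int = 95, margin: float = 0.5) -> bool:
    """True when current latency is more than `margin` above the baseline."""
    baseline = percentile(history_ms, pct)
    return current_ms > baseline * (1 + margin)


# Hypothetical 7-day history: mostly 0.2 ms with a few 0.4 ms samples.
history = [0.2] * 95 + [0.4] * 5
print(exceeds_baseline(history, 0.35))
print(exceeds_baseline(history, 0.25))
```

Because the baseline moves with the history window, a node that is always a little slower than its peers stops paging on every peak while a genuine regression still stands out.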
Monitoring Aerospike in Kubernetes
Running Aerospike in Kubernetes adds complexity. Nodes are pods that can reschedule across hosts. Pod IPs change during restarts. The Aerospike Prometheus Exporter needs to discover and scrape these dynamic endpoints. The Aerospike Kubernetes Operator simplifies deployment, but monitoring requires Kubernetes aware service discovery in Prometheus.
Using Prometheus Operator with ServiceMonitor resources automates discovery. CubeAPM’s Kubernetes monitoring integrates Aerospike pod metrics with cluster level Kubernetes signals like pod restarts, resource limits, and node pressure, providing full context during incidents.
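For reference, a ServiceMonitor is a small custom resource that tells Prometheus Operator which Services to scrape. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts as a manifest format. The resource name, namespace, label, and port name are hypothetical, so match them to the Service that fronts your exporter.

```python
import json
from typing import Dict


def service_monitor(name: str, namespace: str, match_labels: Dict[str, str],
                    port: str = "metrics", interval: str = "30s") -> dict:
    """Build a ServiceMonitor manifest for Prometheus Operator."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects Services (not pods) carrying these labels.
            "selector": {"matchLabels": match_labels},
            # `port` is the named port on the selected Service.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


# Hypothetical names and labels, for illustration only.
manifest = service_monitor("aerospike-exporter", "aerospike",
                           {"app": "aerospike-exporter"})
print(json.dumps(manifest, indent=2))
```

The Prometheus resource managed by the operator also has to be configured to select this ServiceMonitor, so check its selector settings if the target does not appear.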
Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.
Frequently Asked Questions
What is the Aerospike Prometheus Exporter?
The Aerospike Prometheus Exporter is a service that scrapes operational metrics from Aerospike nodes via the info protocol and exposes them in Prometheus format. It is the core component of the Aerospike Monitoring Stack and enables monitoring through Prometheus and Grafana.
How do I monitor Aerospike cluster health?
Monitor cluster health by tracking metrics like cluster size, nodes online, cluster state, and clock skew. Set alerts when nodes drop from the cluster or when cluster state changes to degraded. Use Grafana dashboards to visualize cluster topology and node availability in real time.
What are the most important Aerospike metrics to alert on?
Alert on namespace memory usage above 85%, disk usage above 80%, replication factor drops, XDR lag exceeding 60 seconds, disk latency above 1 millisecond, and client timeout rates above 1%. These metrics predict failures before they affect users.
Can I monitor Aerospike with OpenTelemetry?
Yes. The OpenTelemetry Collector can import metrics from the Aerospike Prometheus Exporter and forward them to any observability backend that supports OpenTelemetry. This approach works well for teams standardizing on OTel across multiple infrastructure components.
How does CubeAPM monitor Aerospike?
CubeAPM ingests Aerospike metrics via the Prometheus Exporter or OpenTelemetry Collector. All metrics are automatically indexed and searchable. CubeAPM correlates Aerospike metrics with application traces and logs, helping teams troubleshoot faster by connecting database performance to user experience.
What is XDR in Aerospike and how do I monitor it?
XDR stands for Cross Datacenter Replication. It replicates data from one Aerospike cluster to another for disaster recovery or multi region deployments. Monitor XDR by tracking lag time, success rate, and remote lag. High XDR lag or low success rates indicate replication problems that can cause data inconsistency.
How do I set up Aerospike monitoring in Kubernetes?
Deploy the Aerospike Prometheus Exporter as a sidecar or dedicated service with access to all Aerospike pods. Use Prometheus Operator with ServiceMonitor resources for automated pod discovery. Monitor Aerospike pod metrics alongside Kubernetes signals like pod restarts, resource limits, and node pressure for full context during incidents.





