Monitor Valkey on AWS ElastiCache: Metrics, Tools & Best Practices

AWS ElastiCache for Valkey runs as a managed service, which means most infrastructure concerns are handled by AWS. But that does not mean monitoring is optional. A slow command, a memory spike above eviction thresholds, or a connection surge can still degrade performance and impact users, even on a managed service. The difference is that with ElastiCache, you do not have direct access to the underlying host OS or the Valkey process, so monitoring relies entirely on CloudWatch metrics, event logs, and application layer instrumentation.

According to the CNCF 2024 Annual Survey, 87% of organizations use logs for observability, while 57% use distributed traces, but only 32% report having complete visibility into their production cache layer. That gap is especially costly for Valkey clusters that sit in the critical path of every request.

This guide covers what metrics matter, how to collect them from ElastiCache, how to instrument your application for cache visibility, and which tools make monitoring Valkey easier without building everything from scratch.

What Is Valkey and Why AWS ElastiCache Supports It

Valkey is an open source, high performance key value data store. It originated as a fork of Redis 7.2.4 after Redis Ltd. (formerly Redis Labs) changed the Redis licensing terms in March 2024. The Valkey project is hosted by the Linux Foundation and is designed to remain fully open source under the BSD 3-Clause license.

AWS announced support for Valkey in ElastiCache in October 2024 as an alternative to Redis OSS. ElastiCache for Valkey offers the same managed deployment, scaling, and operational model as ElastiCache for Redis OSS, with full compatibility for Redis clients and commands. Teams can migrate from Redis OSS to Valkey on ElastiCache with minimal code changes, and AWS has committed to supporting both engines long term.

From a monitoring perspective, Valkey clusters on ElastiCache expose the same metrics namespace and behavior as Redis OSS clusters. CloudWatch metrics, command latency tracking, memory usage, and connection counts all function identically. The monitoring strategies and tooling discussed in this guide apply to both Valkey and Redis OSS deployments on ElastiCache.

How ElastiCache for Valkey Works: Architecture and Data Flow

ElastiCache for Valkey runs as a fully managed service inside AWS. You provision a cluster by specifying node type, number of shards, replication factor, and parameter group settings. AWS handles OS patching, Valkey version upgrades, automated backups, and failover.

Each cluster consists of one or more shards. Each shard contains a primary node and zero or more replica nodes. Writes go to the primary, while replicas serve reads and provide high availability. If the primary fails, ElastiCache promotes a replica to primary automatically.

Data flow works as follows. Application instances connect to the cluster endpoint or individual node endpoints via the Valkey protocol on port 6379. Commands are sent to the primary for writes or to replicas for reads. The Valkey process executes each command, updates in memory data structures, and returns results. Replication happens asynchronously between primary and replicas within each shard.

CloudWatch metrics are emitted every minute. AWS collects these from the Valkey INFO command and from instance level resource metrics. Metrics include memory usage, CPU utilization, network bytes in and out, active connections, evictions, replication lag, and command latency for reads and writes. These metrics are published to the AWS/ElastiCache namespace.

You cannot SSH into ElastiCache nodes. You cannot run custom agents on them. All monitoring happens via CloudWatch, application layer instrumentation, or third party tools that integrate with CloudWatch or ingest telemetry from your application.

Key Metrics to Monitor for Valkey on ElastiCache

ElastiCache publishes over 50 CloudWatch metrics for Valkey clusters. Not all are equally important. The metrics below represent the core signals that indicate cluster health, capacity limits, and user facing performance.

Memory and Eviction Metrics

DatabaseMemoryUsagePercentage tracks the percentage of total data capacity in use. On data tiered instances this includes SSD usage. On standard instances it reflects used_memory / maxmemory from the Valkey INFO command. If this metric approaches 100%, the cluster starts evicting keys based on the eviction policy set in your parameter group.

DatabaseMemoryUsageCountedForEvictPercentage excludes overhead and replication buffers from the calculation. This is a more accurate measure of how close you are to triggering evictions. When this metric crosses 80%, you should either scale up your node type or add more shards to distribute data.

Evictions counts the number of keys evicted due to memory pressure. Evictions are not always bad. If your workload uses an LRU policy and treats the cache as disposable, some evictions are normal. But a sudden spike in evictions during normal traffic indicates memory capacity is exhausted. This often correlates with increased cache miss rates and higher latency as your application falls back to slower data sources.
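One way to tell a normal eviction rate from a sudden spike is to compare the current value against a baseline built from recent history. The sketch below is a minimal illustration, not an ElastiCache feature: the function name, the three sigma rule, and the `min_floor` parameter are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def eviction_spike(history, current, sigmas=3.0, min_floor=1.0):
    """Return True when `current` evictions per minute sit well above the baseline.

    `history` is a list of recent per-minute Evictions values taken during
    normal traffic. The threshold rule is an assumption, not an AWS default.
    """
    if not history:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = mean(history)
    spread = pstdev(history)
    # The floor keeps a perfectly flat history (spread == 0) from flagging noise.
    threshold = baseline + max(sigmas * spread, min_floor)
    return current > threshold


normal_minutes = [10, 12, 11, 9, 10, 13, 11]
print(eviction_spike(normal_minutes, 14))
print(eviction_spike(normal_minutes, 400))
```

With the sample history, a reading of 14 stays inside the band while 400 is flagged. Tune `sigmas` to how noisy your workload's eviction rate is.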

OOMKills are not a standard CloudWatch metric but can appear in ElastiCache events when the Valkey process is killed due to out of memory conditions. These are rare on ElastiCache because Valkey respects maxmemory limits, but they can happen if replica buffers or fragmentation push total memory usage beyond the instance limit.

CPU and Performance Metrics

EngineCPUUtilization measures CPU usage by the Valkey process itself. Valkey is single threaded for command execution, so high EngineCPU means the Valkey thread is saturated. On nodes with 4 or more vCPUs, this metric is more useful than the instance level CPUUtilization metric, which includes OS and monitoring overhead. If EngineCPUUtilization consistently exceeds 80%, the cluster cannot handle current command throughput and will start queuing requests.

CPUUtilization measures total CPU across all cores on the instance. For smaller node types with 2 vCPUs or fewer, background monitoring processes can consume a meaningful portion of CPU. On these instances, monitor both EngineCPU and total CPUUtilization to catch cases where the instance is overloaded even if the Valkey process is not.
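The rule of thumb above, engine CPU alone on larger nodes and both metrics on smaller ones, can be written as a small decision function. This is a sketch under the thresholds stated in this guide. The function name, the return labels, and the 80% default are illustrative assumptions.

```python
def cpu_signal(engine_cpu, total_cpu, vcpus, limit=80.0):
    """Classify CPU pressure from EngineCPUUtilization and CPUUtilization.

    Both CPU arguments are percentages (0-100). `vcpus` is the vCPU count
    of the node type. Labels and the default limit are assumptions.
    """
    if engine_cpu > limit:
        # The single command execution thread is the bottleneck.
        return "engine-saturated"
    if vcpus < 4 and total_cpu > limit:
        # On small nodes, background processes can overload the host
        # even when the Valkey thread itself still has headroom.
        return "host-overloaded"
    return "ok"


print(cpu_signal(engine_cpu=92, total_cpu=35, vcpus=8))
print(cpu_signal(engine_cpu=40, total_cpu=90, vcpus=2))
```

The first call describes a large node where total CPU looks moderate but the Valkey thread is saturated. The second describes a small node where the host is overloaded even though the engine is not.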

SuccessfulReadRequestLatency and SuccessfulWriteRequestLatency were added in late 2024. These metrics track server side latency in microseconds for read and write commands. They exclude network round trip time and client side processing. Average, p50, p99, and max latencies are available. A spike in p99 write latency often indicates disk I/O pressure on data tiered nodes or replication lag impacting the primary.

Command level latency metrics are also available, grouped by command family rather than by individual command, for example GetTypeCmdsLatency, SetTypeCmdsLatency, HashBasedCmdsLatency, and SortedSetBasedCmdsLatency. These are calculated from the INFO commandstats output and represent CPU time consumed per command family. Use these to identify which kinds of commands are slowest when overall latency spikes.

Replication and Availability Metrics

ReplicationLag measures how far behind replicas are from the primary in seconds. Low lag, under 1 second, is normal. Lag above 5 seconds means replicas are not keeping up, often due to heavy write load or network partition. High replication lag increases the risk of data loss if the primary fails before replicas catch up.

CurrConnections tracks active client connections to the node. ElastiCache sets a per node connection limit based on node type. If connections approach this limit, new connection attempts fail. Connection spikes usually indicate application connection pool misconfiguration or a connection leak. Monitor this alongside NetworkBandwidthOutAllowanceExceeded to catch network saturation.

Throughput and Network Metrics

NetworkBytesIn and NetworkBytesOut measure data volume transferred to and from the node. High bytes out can trigger NetworkBandwidthOutAllowanceExceeded, which counts packets dropped due to network throttling. When this metric is non-zero, clients experience timeouts or degraded performance even if the Valkey process is healthy.

CacheHits and CacheMisses track how often read commands find the requested key. High miss rates increase load on backend databases and degrade application performance. A sudden drop in hit rate often means keys are being evicted due to memory pressure or that traffic patterns have changed.
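As a minimal sketch, the hit rate and a simple drop check can be computed from the two counters like this. The function names, the expected rate, and the tolerance are hypothetical values you would replace with numbers observed for your own workload.

```python
def hit_rate(hits, misses):
    """Return CacheHits / (CacheHits + CacheMisses), or None with no traffic."""
    total = hits + misses
    if total == 0:
        return None  # no reads in the window, so the ratio is undefined
    return hits / total


def hit_rate_dropped(hits, misses, expected, tolerance=0.05):
    """Return True when the observed hit rate falls below expected - tolerance."""
    rate = hit_rate(hits, misses)
    if rate is None:
        return False
    return rate < expected - tolerance


print(hit_rate(900, 100))
print(hit_rate_dropped(700, 300, expected=0.9))
```

Returning `None` for an empty window avoids a division by zero and keeps quiet periods from being reported as a 0% hit rate.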

Error and Failure Metrics

CommandFailures counts commands that failed during the measurement period. Failures can happen due to syntax errors, out of memory conditions, or client disconnections. A spike in this metric during normal operation signals a problem, often related to memory pressure or misconfigured commands.

SearchIndexMemory, SearchTotalDocs, and other search specific metrics apply only if you are using Valkey’s search and query capabilities. These track memory consumed by search indexes and document counts.

Setting Up CloudWatch Monitoring for ElastiCache Valkey Clusters

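AWS publishes ElastiCache metrics to CloudWatch automatically. No agent installation or configuration is required. Metrics are published every 60 seconds. CloudWatch keeps 1 minute data points for 15 days, then rolls them up to 5 minute data points kept for 63 days and 1 hour data points kept for 15 months.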

To view metrics in the CloudWatch console, navigate to CloudWatch > Metrics > All metrics, then select the AWS/ElastiCache namespace. Filter by CacheClusterId or ReplicationGroupId to see metrics for your specific cluster. Each node in a cluster publishes metrics separately, so for a 3 node cluster you will see metrics tagged with each node ID.
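The same per node view can be requested programmatically. The sketch below only builds the query structure that CloudWatch's GetMetricData API expects, so it runs without AWS credentials. The helper name and the node IDs are hypothetical, and the naming pattern of your own nodes may differ.

```python
def build_metric_queries(cache_cluster_ids, metric_name, stat="Average", period=60):
    """Build one GetMetricData query per node for a single ElastiCache metric."""
    queries = []
    for index, cluster_id in enumerate(cache_cluster_ids):
        queries.append({
            "Id": f"m{index}",  # query IDs must start with a lowercase letter
            "Label": f"{cluster_id} {metric_name}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ElastiCache",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "CacheClusterId", "Value": cluster_id},
                    ],
                },
                "Period": period,  # seconds, matching the 60 second publish interval
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


# Hypothetical node IDs for a 3 node cluster.
nodes = ["my-valkey-0001-001", "my-valkey-0001-002", "my-valkey-0001-003"]
queries = build_metric_queries(nodes, "EngineCPUUtilization")
print(len(queries), queries[0]["Id"])
```

With boto3 installed and credentials configured, the list can be passed as `MetricDataQueries` to `boto3.client("cloudwatch").get_metric_data(...)` along with `StartTime` and `EndTime`.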

ElastiCache publishes both host level metrics, like CPUUtilization and NetworkBytesIn, and engine level metrics, like EngineCPUUtilization, DatabaseMemoryUsagePercentage, ReplicationLag, and CacheHits. Both kinds are reported per node, so to get a view of a whole replication group you aggregate across its primaries and replicas in CloudWatch.

For production workloads, set up CloudWatch alarms on the following thresholds:

DatabaseMemoryUsageCountedForEvictPercentage > 80%: Indicates memory capacity is running out. Scale up or add shards.

EngineCPUUtilization > 80%: Valkey thread is saturated. Scale to a larger node type.

ReplicationLag > 5 seconds: Replicas are falling behind. Investigate write load or network issues.

Evictions > baseline: Define a baseline eviction rate for your workload. Alert when evictions spike above that baseline.

NetworkBandwidthOutAllowanceExceeded > 0: Network throttling is happening. Scale to a node type with higher network capacity.

CurrConnections > 80% of max connections: Connection pool is nearing limit. Investigate connection leaks or scale up.

CloudWatch alarms can trigger SNS notifications, Lambda functions, or auto scaling actions. For Valkey clusters with cluster mode enabled, ElastiCache auto scaling can add or remove shards and replicas automatically based on memory or CPU metrics. Changing the node type is a separate scaling operation that you start yourself.
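The fixed thresholds listed above can be turned into alarm definitions in code. The sketch below builds keyword arguments in the shape that CloudWatch's PutMetricAlarm API accepts, without calling AWS. The helper names, the alarm naming scheme, the five period evaluation window, and the example ARN are assumptions. Evictions and CurrConnections are left out because their thresholds depend on your workload baseline and node type.

```python
# metric name -> (threshold, statistic); values taken from the list above
THRESHOLDS = {
    "DatabaseMemoryUsageCountedForEvictPercentage": (80, "Average"),
    "EngineCPUUtilization": (80, "Average"),
    "ReplicationLag": (5, "Maximum"),
    "NetworkBandwidthOutAllowanceExceeded": (0, "Sum"),
}


def alarm_params(cluster_id, metric, threshold, statistic, sns_topic_arn, periods=5):
    """Build PutMetricAlarm keyword arguments for one node and one metric."""
    return {
        "AlarmName": f"{cluster_id}-{metric}-high",  # hypothetical naming scheme
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": statistic,
        "Period": 60,
        "EvaluationPeriods": periods,  # assumption: alert after 5 breaching minutes
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }


def alarms_for(cluster_id, sns_topic_arn):
    """Return one alarm definition per metric in THRESHOLDS."""
    return [
        alarm_params(cluster_id, metric, threshold, statistic, sns_topic_arn)
        for metric, (threshold, statistic) in THRESHOLDS.items()
    ]


# Hypothetical node ID and SNS topic ARN.
alarms = alarms_for("my-valkey-0001-001", "arn:aws:sns:us-east-1:123456789012:cache-alerts")
print([a["AlarmName"] for a in alarms])
```

Each dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)` once boto3 and credentials are available.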

Monitoring Valkey Application Layer Performance with Distributed Tracing

CloudWatch metrics tell you what is happening inside ElastiCache, but they do not show how cache latency affects your application or which services are calling Valkey. For that, you need distributed tracing.

Distributed tracing instruments your application code to record every cache operation as a span. Each span includes operation type (GET, SET, HGET), key accessed, latency, and whether the operation succeeded or failed. Spans are linked to parent request traces, so you can see cache calls in the context of the full user request.
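To make the shape of a cache span concrete, the sketch below records one using only the Python standard library. It is a stand-in for what a tracing library does, not the OpenTelemetry API. `cache_span`, `FakeCache`, `traced_get`, and the in-memory `SPANS` list are all hypothetical names, and the dict-backed cache exists only so the example runs offline.

```python
import time
from contextlib import contextmanager

SPANS = []  # a real tracer would export spans instead of keeping a list


@contextmanager
def cache_span(operation, key, trace_id):
    """Record operation type, key, latency, and success for one cache call."""
    span = {"operation": operation, "key": key, "trace_id": trace_id, "ok": True}
    start = time.perf_counter()
    try:
        yield span
    except Exception:
        span["ok"] = False  # mark the span failed, then let the error propagate
        raise
    finally:
        span["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(span)


class FakeCache:
    """Dict-backed stand-in for a Valkey client so the sketch runs offline."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def traced_get(cache, key, trace_id):
    """Run GET inside a span linked to the parent request's trace ID."""
    with cache_span("GET", key, trace_id) as span:
        value = cache.get(key)
        span["hit"] = value is not None
        return value


cache = FakeCache()
traced_get(cache, "user:1", trace_id="req-123")  # miss
cache.set("user:1", "alice")
traced_get(cache, "user:1", trace_id="req-123")  # hit
print([(s["operation"], s["hit"]) for s in SPANS])
```

Because both spans carry the same `trace_id`, a backend can show the miss and the later hit inside the same request.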

OpenTelemetry provides language specific instrumentation libraries that automatically instrument Redis and Valkey client libraries. For example, the OpenTelemetry Python instrumentation for redis-py emits spans for every command. In Go, go-redis can be instrumented through its OpenTelemetry hook. In Java, the OpenTelemetry Java agent instruments Jedis and Lettuce.

When a trace shows high latency on a database query, you can see if a cache miss preceded it. When a trace shows slow API response time, you can see if Valkey latency contributed. This visibility is critical for diagnosing performance issues in microservices architectures where Valkey sits between multiple services.

Trace data can be sent to observability backends that support OpenTelemetry, including infrastructure monitoring platforms that provide unified visibility across application, cache, and infrastructure layers.

Best Practices for Monitoring Valkey on ElastiCache in Production

Monitor DatabaseMemoryUsageCountedForEvictPercentage instead of raw memory usage. This metric excludes overhead and replication buffers, so it gives a more accurate picture of eviction risk. Set alerts at 70% for warning and 85% for critical.

Track EngineCPUUtilization separately from total CPUUtilization on nodes with 4 or more vCPUs. This isolates Valkey process saturation from OS overhead. If EngineCPU stays high while total CPU is moderate, the single threaded Valkey process is the bottleneck.

Use SuccessfulReadRequestLatency and SuccessfulWriteRequestLatency p99 percentiles to catch tail latency issues. Median latency may look fine while the slowest 1% of requests are timing out. Monitor p99 and p999 to surface these outliers.
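A small worked example shows why the median can hide tail latency. The sketch below uses the nearest rank method on made up latency samples. CloudWatch computes its own percentile statistics, so this is only an illustration of the idea, and the sample values are assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in microseconds: 98 fast requests and 2 very slow ones.
latencies_us = [200] * 98 + [50_000] * 2

print("p50:", percentile(latencies_us, 50))
print("p99:", percentile(latencies_us, 99))
```

Here the median reports 200 microseconds while p99 reports 50,000, which is the kind of outlier that an average or median based alarm would miss.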

Set up ReplicationLag alerts at 5 seconds. Lag above this threshold increases data loss risk during failover and indicates the cluster cannot keep up with write load. Investigate whether the primary node type is too small or if network issues are delaying replication.

Monitor CacheHits and CacheMisses to track cache effectiveness. A sudden drop in hit rate can indicate keys are being evicted due to memory pressure or that application logic has changed. Calculate hit rate as CacheHits / (CacheHits + CacheMisses) and alert when it drops below expected levels.

Do not rely on CloudWatch for sub-minute visibility. ElastiCache metrics publish every 60 seconds, and because you cannot install agents on the nodes there is no host level option for finer granularity. To diagnose transient spikes that do not show up in 1 minute aggregates, use client side latency timings and traces from your application, which record every cache call individually.

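Review ElastiCache events alongside your metrics. ElastiCache publishes events for node failures, failovers, backup completions, and parameter changes, which you can view in the console, retrieve with the DescribeEvents API, or receive through SNS notifications. If you enable log delivery, you can also query the slow log and engine log with CloudWatch Logs Insights. These events correlate with metric spikes and provide context during incident response.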

Instrument your application with OpenTelemetry or a similar tracing library to capture cache operation latency, hit rate, and error rate at the application layer. This telemetry fills the gap between CloudWatch infrastructure metrics and user facing performance.

Correlate cache metrics with downstream database metrics. If Valkey evictions spike, database query load will rise as cache misses increase. Monitoring both layers together helps you understand the full impact of cache performance on application behavior.

Set up dashboards that show memory usage, CPU, replication lag, and latency side by side. When an incident happens, seeing all signals together reduces time to root cause. Synthetic monitoring can also help by running scripted cache operations from external locations to verify availability and latency independent of application traffic.

Tools for Monitoring Valkey on AWS ElastiCache

CloudWatch is the default monitoring tool for ElastiCache, but several third party platforms extend CloudWatch with better query performance, unified dashboards, and application layer correlation.

CubeAPM provides full stack observability including ElastiCache monitoring via OpenTelemetry and CloudWatch integration. CubeAPM runs on your infrastructure, so telemetry data including cache metrics, application traces, and logs stays inside your cloud. This makes it suitable for teams with data residency or compliance requirements. CubeAPM pricing is $0.15/GB of ingested telemetry with unlimited retention and no per-seat fees, making it predictable as data volume scales. It correlates Valkey cache spans with distributed traces automatically, so you can see cache latency in the context of the full request. For teams already using AWS monitoring tools, CubeAPM unifies ElastiCache metrics with EC2, RDS, Lambda, and ECS signals in one platform.

Datadog integrates with ElastiCache via the AWS integration and can also instrument Valkey clients at the application layer. Datadog provides pre-built dashboards for ElastiCache and supports anomaly detection on cache metrics. Pricing is host based, starting at $15/host/month for infrastructure monitoring and $31/host/month for APM. For a 50 node cluster, this can reach $2,300/month before logs or custom metrics.

Grafana with Prometheus can scrape CloudWatch metrics via the CloudWatch Exporter or ingest metrics from application instrumentation. Grafana is open source and self-hosted, which gives full control over data but requires significant operational effort to maintain.

New Relic offers ElastiCache monitoring through its AWS integration and supports distributed tracing for cache operations. New Relic pricing is consumption based, charging $0.40/GB for data ingest beyond the free tier. For high volume telemetry, costs can scale faster than CubeAPM’s flat $0.15/GB model.

Dynatrace provides automated topology mapping and AI assisted anomaly detection for ElastiCache clusters. It is best suited for large enterprises with complex multi-cloud environments. Pricing is based on host units and can exceed $20/host/month.
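The pricing figures quoted above can be checked with a few lines of arithmetic. The sketch below reproduces the host based estimate for a 50 node cluster and compares two per GB rates. The 1,000 GB monthly volume is a hypothetical input, and `free_gb` defaults to zero because this guide does not state the size of any free tier.

```python
def host_based_monthly(hosts, per_host_rates):
    """Monthly cost when each host is billed for every listed product."""
    return hosts * sum(per_host_rates)


def ingest_based_monthly(gb_ingested, rate_per_gb, free_gb=0.0):
    """Monthly cost when billing is per GB of telemetry beyond a free allowance."""
    billable = max(0.0, gb_ingested - free_gb)
    return round(billable * rate_per_gb, 2)


# 50 hosts at $15 (infrastructure) plus $31 (APM) per host.
print(host_based_monthly(50, [15, 31]))

# Hypothetical volume of 1,000 GB per month at the two per GB rates quoted above.
print(ingest_based_monthly(1000, 0.15))
print(ingest_based_monthly(1000, 0.40))
```

The first result matches the $2,300 per month figure. The per GB results depend entirely on your real ingest volume, so substitute your own number before comparing vendors.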

Monitoring Valkey on ElastiCache with CubeAPM

CubeAPM monitors ElastiCache Valkey clusters by combining CloudWatch metric ingestion with application layer tracing via OpenTelemetry. This gives visibility into both infrastructure health and how cache performance affects user requests.

CubeAPM connects to your AWS account via read-only IAM role to pull CloudWatch metrics for ElastiCache clusters. Metrics are ingested every 60 seconds and stored with unlimited retention. You can query historical metrics to analyze trends, compare performance across time windows, or correlate cache spikes with deployment events.

Application layer visibility comes from OpenTelemetry instrumentation in your services. CubeAPM receives traces and spans from your application, automatically parsing Valkey operation spans to surface latency, error rate, and hit/miss ratio per service and endpoint. This telemetry is correlated with CloudWatch metrics so you can see infrastructure signals and application signals together.

CubeAPM provides pre-built dashboards for ElastiCache showing memory usage, CPU, replication lag, evictions, and command latency. You can create custom dashboards to combine cache metrics with application traces, logs, and database query performance. Alerts can be configured on any metric or trace attribute, with routing to Slack, PagerDuty, or email.

Because CubeAPM deploys inside your VPC, all telemetry data including sensitive cache keys or values captured in traces stays within your infrastructure. This simplifies compliance with GDPR, HIPAA, or data residency requirements. CubeAPM’s pricing is $0.15/GB of telemetry ingested, with no additional charges for users, hosts, or retention.

Disclaimer: CubeAPM pricing and features reflect the latest information available at the time of publication and may change. Always verify current details at [CubeAPM pricing](https://cubeapm.com/pricing/) before deployment decisions.

Migration Considerations When Switching from Redis OSS to Valkey on ElastiCache

AWS ElastiCache supports in-place upgrades from Redis OSS to Valkey for clusters running Redis 7.2 or lower. The upgrade process is similar to a Valkey version upgrade and can be triggered via the AWS Console, CLI, or API.

Before upgrading, verify that your application clients are compatible with Valkey. Most Redis client libraries work with Valkey without modification because Valkey maintains wire protocol compatibility. However, some clients check the server version string and may log warnings or fail if they do not recognize Valkey. Test your application against a non-production Valkey cluster first.
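One compatibility check you can script is how your tooling interprets the server's self reported identity. The sketch below parses the text of an INFO server reply and decides which engine it describes. The sample reply is illustrative and uses field names reported by open source Valkey (`redis_version`, `server_name`, `valkey_version`). Confirm against your own cluster's output, since the helper names and the sample values here are assumptions.

```python
def parse_info(text):
    """Parse the 'name:value' lines of an INFO reply into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as '# Server'
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def engine_identity(info_text):
    """Return (engine, version), preferring the Valkey specific field when present."""
    fields = parse_info(info_text)
    if "valkey_version" in fields:
        return ("valkey", fields["valkey_version"])
    return ("redis", fields.get("redis_version", "unknown"))


# Illustrative reply, not captured from a real cluster.
SAMPLE_INFO = """# Server
redis_version:7.2.4
server_name:valkey
valkey_version:8.0.1
"""

print(engine_identity(SAMPLE_INFO))
```

A client that reads only `redis_version` would see a Redis compatible version in this sample, which is why many libraries keep working unchanged, while stricter checks may need to look at the Valkey specific fields.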

During the upgrade, ElastiCache applies changes to one node at a time. For clusters with replicas, the upgrade starts with replicas, then fails over to a replica before upgrading the old primary. This minimizes downtime but may cause brief connection resets. Plan the upgrade during a maintenance window or low traffic period.

After the upgrade, monitor the same CloudWatch metrics discussed earlier. Valkey and Redis OSS expose identical metrics, so existing dashboards and alarms continue to work. If you were using Redis specific features like RedisJSON or RedisTimeSeries modules, verify that equivalent functionality exists in Valkey or migrate to alternative solutions.

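For new clusters, you can provision Valkey directly by selecting the Valkey engine when creating a cluster in ElastiCache. The available instance types are the same as Redis OSS. At launch AWS priced Valkey below Redis OSS (announced as 20% lower for node based clusters and 33% lower for serverless), so check the current ElastiCache pricing page before budgeting.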

Conclusion

Monitoring Valkey on AWS ElastiCache requires visibility into CloudWatch infrastructure metrics, application layer cache operations, and the correlation between the two. Memory usage, CPU saturation, replication lag, and latency percentiles are the core signals that indicate cluster health and capacity limits. CloudWatch provides these metrics natively, but integrating application tracing and centralized dashboards makes troubleshooting faster and reduces the operational burden of managing multiple monitoring tools.

For teams running Valkey in production, the monitoring strategy should include CloudWatch alarms on critical thresholds, distributed tracing instrumentation at the application layer, and a unified observability platform that correlates cache, application, and infrastructure signals. Tools like CubeAPM, Datadog, and Grafana each offer different tradeoffs on cost, deployment model, and feature depth.

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.

Frequently Asked Questions

What metrics are most important for Valkey monitoring on ElastiCache?

DatabaseMemoryUsageCountedForEvictPercentage, EngineCPUUtilization, ReplicationLag, and SuccessfulReadRequestLatency p99 are the core metrics. These indicate memory pressure, CPU saturation, replication health, and user facing latency.

How does ElastiCache expose Valkey metrics to CloudWatch?

ElastiCache collects metrics from the Valkey INFO command and instance level resource usage every 60 seconds and publishes them to the AWS/ElastiCache namespace. No agent installation is required.

Can I monitor cache hit rate for Valkey on ElastiCache?

Yes. CloudWatch publishes CacheHits and CacheMisses metrics. Calculate hit rate as CacheHits / (CacheHits + CacheMisses). Monitor this to detect eviction pressure or changes in application access patterns.

What is the difference between EngineCPUUtilization and CPUUtilization?

EngineCPUUtilization measures CPU used by the Valkey process itself. CPUUtilization measures total CPU across all cores including OS and monitoring overhead. For nodes with 4+ vCPUs, EngineCPU is the better signal for Valkey saturation.

How do I monitor Valkey latency from the application side?

Instrument your application with OpenTelemetry or a similar tracing library that captures spans for cache operations. This records latency, error rate, and operation type for every cache call.

What happens if DatabaseMemoryUsagePercentage reaches 100 percent?

Valkey starts evicting keys based on the eviction policy set in your parameter group. Evictions increase cache miss rate and degrade application performance. Scale up or add shards before reaching 100 percent.

Does ElastiCache support real user monitoring for Valkey?

ElastiCache does not provide real user monitoring natively, and the browser never talks to Valkey directly. To see how cache latency affects real users, link frontend session traces to the backend traces that contain the cache spans.
