Database Monitoring Guide: PostgreSQL, MySQL, MongoDB, Redis, and Elasticsearch (2026)

Author: Vineet Chirania
Category: Monitoring
Published Date: May 27, 2026

Database monitoring is the practice of continuously tracking the health, performance, and availability of your database systems and acting on that data before users feel the impact. Most teams instrument their application layer carefully, then treat the database as an afterthought. According to ITIC’s 2024 Hourly Cost of Downtime report, over 90% of large and mid-size enterprises report that a single hour of downtime costs upward of $300,000, and 4 in 10 put that number at $1 million or more.

The problem is specificity. Tracking CPU and disk tells you something is wrong. It does not tell you which query caused it, whether it is a replication lag widening over hours, a Redis eviction policy silently dropping session data, or an Elasticsearch heap creeping toward its limit. Without the right database-native signals, you are always reacting after users have already noticed.

This guide covers database monitoring for five engines: PostgreSQL, MySQL, MongoDB, Redis, and Elasticsearch, with the specific metrics, alert thresholds, and observability practices that matter in production.

What Database Monitoring Actually Means in 2026


Database monitoring has evolved well past tracking uptime and disk space. Most organizations start with infrastructure metrics: CPU, memory, and connection counts. That gives enough signal to notice something is wrong, but rarely enough to understand why, or to catch the problem before users feel it.

The Four Signal Types That Matter

Effective database monitoring in 2026 means collecting all four of the following simultaneously. Missing any one of them creates blind spots that hurt most during incidents.

Metrics: numeric counters and gauges such as query throughput, connection counts, and cache hit ratios
Events: discrete occurrences like schema changes, failovers, or autovacuum completions
Logs: slow query logs, error logs, and audit logs that explain what happened and when
Traces: the full path a request takes from application code through the database and back

This is the MELT model (Metrics, Events, Logs, Traces). It has become the baseline for serious database observability work.

Why Metrics Alone Are Not Enough

A CPU spike on a PostgreSQL host tells you the server is stressed. It does not tell you which query caused it, whether a missing index is forcing a full table scan, or whether the spike coincides with a batch job that ran at the same time yesterday without issue. Query-level visibility is the gap most teams do not close until an incident forces them to.

The Role of OpenTelemetry

OpenTelemetry has become the instrumentation standard for databases in 2026. The Grafana Labs 2025 Observability Survey found that more than two-thirds of organizations use Prometheus in production, and OpenTelemetry is tracking close behind with 41% already in production use and another 38% actively investigating or building proofs of concept.

It unifies your telemetry pipeline so that the database spans land in the same backend as your application traces. During an incident, you can follow a slow HTTP request directly to the offending query without switching tools or joining data across systems.

What Good Database Monitoring Should Answer in Under Five Minutes

Which query is responsible for this latency spike?
Is the slowdown coming from the database itself or from connection pool exhaustion?
Did a recent deployment or schema change correlate with the degradation?
Is this a single-instance problem or a cluster-wide pattern?

If your current setup cannot answer those questions quickly, the sections below will show you exactly what to add for each engine.

PostgreSQL Monitoring: The Metrics That Actually Matter

PostgreSQL is the most widely deployed open-source relational database in production environments. It exposes an exceptionally rich statistics system through its internal views, and understanding which views to query is the first step to meaningful monitoring.

Connection Management

PostgreSQL uses one process per connection. Above roughly 200 active connections, performance degrades measurably due to context-switching and memory pressure. The view to watch is pg_stat_activity, grouped by state.

Track these specific counts:

Active connections (state = 'active'): queries currently executing
Idle connections (state = 'idle'): connections open but doing nothing, consuming memory
Idle in transaction (state = 'idle in transaction'): the dangerous one; open transactions holding locks

Idle-in-transaction connections are a leading indicator of deadlock storms and lock waits. An alert at more than 10% of your max_connections sitting idle in transaction is a reasonable starting threshold. If you use PgBouncer or a similar connection pooler, track both the pooler’s client connections and the underlying PostgreSQL connections separately. The two can diverge dramatically under load.
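As a rough sketch of the counts above, the Python below summarizes connection states. It assumes you have already fetched the state column from pg_stat_activity and know your max_connections value; the function name, the sample data, and the 10% default are illustrative, not part of PostgreSQL.

```python
from collections import Counter

def summarize_connections(states, max_connections, idle_in_txn_alert_fraction=0.10):
    """Summarize pg_stat_activity states and flag idle-in-transaction buildup.

    states: values of the pg_stat_activity.state column (None for
    background processes that report no state).
    """
    counts = Counter(s for s in states if s is not None)
    idle_in_txn = counts.get("idle in transaction", 0)
    return {
        "active": counts.get("active", 0),
        "idle": counts.get("idle", 0),
        "idle_in_transaction": idle_in_txn,
        # Share of max_connections currently occupied by client sessions.
        "utilization": sum(counts.values()) / max_connections,
        "idle_in_txn_alert": idle_in_txn > idle_in_txn_alert_fraction * max_connections,
    }

# Hypothetical sample: max_connections = 100, 12 sessions idle in transaction.
sample = ["active"] * 8 + ["idle"] * 30 + ["idle in transaction"] * 12 + [None]
summary = summarize_connections(sample, max_connections=100)
print(summary)
```

When a pooler sits in front of PostgreSQL, the same summary can be run against the database side only; the pooler's client counts need their own source.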

Query Performance: pg_stat_statements Is Non-Negotiable

Enable the pg_stat_statements extension. It is the single most valuable diagnostic tool PostgreSQL ships. It normalizes all queries to their parameter-free form and tracks cumulative execution time, call count, rows returned, and shared buffer hits and misses per query digest.

The metrics that matter most from this view:

total_exec_time / calls: average execution time per query, sorted descending to find your worst offenders
shared_blks_hit / (shared_blks_hit + shared_blks_read): your effective buffer cache hit rate per query; below 0.95 for a frequently-called query is a strong signal of missing indexes or undersized shared_buffers
rows / calls: average rows returned; unusually high values often indicate missing WHERE clause indexes or poorly written joins

Alert on any query digest whose mean execution time exceeds your p99 application SLO. If your API responses need to complete within 200ms, any query regularly exceeding 50ms warrants investigation, since a single request often runs several queries.
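The three ratios above can be computed per digest once the rows are in hand. The sketch below assumes each row is a dict holding columns selected from pg_stat_statements (total_exec_time is the column name in PostgreSQL 13 and later); the function name, sample rows, and default cutoffs are illustrative assumptions.

```python
def digest_report(rows, slow_ms=50.0, min_hit_ratio=0.95):
    """Rank pg_stat_statements rows by mean execution time and flag outliers."""
    report = []
    for r in rows:
        calls = r["calls"]
        if calls == 0:
            continue
        blocks = r["shared_blks_hit"] + r["shared_blks_read"]
        mean_ms = r["total_exec_time"] / calls
        # A digest that touched no shared buffers has nothing to miss.
        hit_ratio = r["shared_blks_hit"] / blocks if blocks else 1.0
        report.append({
            "query": r["query"],
            "mean_ms": mean_ms,
            "hit_ratio": hit_ratio,
            "rows_per_call": r["rows"] / calls,
            "slow": mean_ms > slow_ms,
            "cache_poor": hit_ratio < min_hit_ratio,
        })
    return sorted(report, key=lambda d: d["mean_ms"], reverse=True)

# Hypothetical rows, shaped like a SELECT from pg_stat_statements.
rows = [
    {"query": "SELECT 1", "calls": 5000, "total_exec_time": 50.0,
     "shared_blks_hit": 0, "shared_blks_read": 0, "rows": 5000},
    {"query": "SELECT * FROM orders WHERE customer_id = $1", "calls": 1000,
     "total_exec_time": 80000.0, "shared_blks_hit": 9000,
     "shared_blks_read": 1000, "rows": 250000},
]
report = digest_report(rows)
for entry in report:
    print(entry)
```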

Autovacuum and Bloat

PostgreSQL’s MVCC model creates dead row versions on every UPDATE and DELETE. Autovacuum reclaims them. When autovacuum cannot keep up during write-heavy periods or when tables lack tuned thresholds, table bloat accumulates, query plans degrade, and eventually transaction ID wraparound becomes a real threat.

Monitor these from pg_stat_user_tables:

n_dead_tup: dead tuple count per table
last_autovacuum / last_autoanalyze: when vacuum last ran; gaps longer than a few hours on busy tables are a warning
n_dead_tup / n_live_tup: bloat ratio; consistently above 10-20% deserves attention

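The hard limit to understand: PostgreSQL stops assigning new transaction IDs, which effectively blocks writes, once only a few million remain before wraparound (the exact safety margin varies by version). Wraparound occurs at roughly 2 billion transactions of age. Monitor age(datfrozenxid) from pg_database. Alert at 1 billion, take urgent action at 1.5 billion.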
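A minimal sketch of the two checks above, assuming the inputs come from pg_stat_user_tables (n_dead_tup, n_live_tup) and from SELECT datname, age(datfrozenxid) FROM pg_database. The function names are hypothetical; the cutoffs mirror the 1 billion and 1.5 billion figures in the text.

```python
XID_WARN_AGE = 1_000_000_000
XID_URGENT_AGE = 1_500_000_000

def dead_tuple_ratio(n_dead_tup, n_live_tup):
    """Return n_dead_tup / n_live_tup, guarding against empty tables."""
    if n_live_tup == 0:
        return float("inf") if n_dead_tup else 0.0
    return n_dead_tup / n_live_tup

def xid_age_severity(age):
    """Map age(datfrozenxid) to the alert levels suggested in the text."""
    if age >= XID_URGENT_AGE:
        return "urgent"
    if age >= XID_WARN_AGE:
        return "warning"
    return "ok"

# Hypothetical values for one table and one database.
ratio = dead_tuple_ratio(n_dead_tup=1_500, n_live_tup=10_000)
severity = xid_age_severity(1_200_000_000)
print(f"dead tuple ratio: {ratio:.0%}, xid age severity: {severity}")
```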

Replication Lag

For any high-availability PostgreSQL setup, replica lag is a critical metric. Query pg_stat_replication on the primary for write_lag, flush_lag, and replay_lag. Alert thresholds depend heavily on your RPO, but for most production systems, replay_lag exceeding 30 seconds warrants a page.

Alert Thresholds for PostgreSQL

Metric	Warning	Critical
Connection utilization (% of max_connections)	70%	90%
Idle in transaction connections	>5	>20
Replication replay_lag	>10s	>60s
Cache hit ratio (pg_stat_database)	<95%	<90%
Transaction ID age	>1B	>1.5B
Long-running queries	>30s	>5min

MySQL Monitoring: What InnoDB Tells You That Slow Logs Don’t

MySQL is the backbone of countless web applications. Its monitoring story runs through two separate information sources: Performance Schema, the preferred source for runtime query statistics and wait-event analysis, and the InnoDB engine metrics. Both matter and they answer different questions.

InnoDB Buffer Pool

The InnoDB buffer pool is MySQL’s primary memory cache. Every row read from disk passes through it. The buffer pool hit ratio tells you how often queries find their data in memory versus reading from disk.

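Calculate the miss ratio as: Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests. The hit ratio is 1 minus that value. Innodb_buffer_pool_read_requests counts all logical reads; Innodb_buffer_pool_reads counts the subset that could not be served from the buffer pool and had to go to disk.

A healthy production system holds the miss ratio below 1% (meaning 99%+ of reads come from cache). If the miss ratio climbs, the most common causes are: buffer pool too small for your working dataset, a new query pattern doing large sequential scans, or schema growth that pushed hot data out of cache.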

Track Innodb_buffer_pool_pages_dirty alongside the hit ratio. High dirty page counts with low checkpoint activity can indicate I/O bottlenecks that will eventually surface as write stalls.
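Because the status counters are cumulative since server start, the ratio is more useful when computed over an interval. The sketch below takes two samples of SHOW GLOBAL STATUS as dicts and treats Innodb_buffer_pool_reads as the reads that went to disk and Innodb_buffer_pool_read_requests as all logical reads, so the miss ratio is the first divided by the second. The function name and sample values are assumptions.

```python
def buffer_pool_miss_ratio(prev, curr):
    """Miss ratio over the interval between two SHOW GLOBAL STATUS samples."""
    disk_reads = curr["Innodb_buffer_pool_reads"] - prev["Innodb_buffer_pool_reads"]
    requests = (curr["Innodb_buffer_pool_read_requests"]
                - prev["Innodb_buffer_pool_read_requests"])
    if requests <= 0:
        return 0.0  # no read traffic in the interval (or the server restarted)
    return disk_reads / requests

# Hypothetical samples taken one collection interval apart.
prev = {"Innodb_buffer_pool_reads": 1_000, "Innodb_buffer_pool_read_requests": 1_000_000}
curr = {"Innodb_buffer_pool_reads": 1_500, "Innodb_buffer_pool_read_requests": 1_100_000}
miss = buffer_pool_miss_ratio(prev, curr)
print(f"miss ratio {miss:.2%}, hit ratio {1 - miss:.2%}")
```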

Thread and Connection Metrics

From SHOW GLOBAL STATUS, the metrics that matter:

Threads_running: actively running queries at this moment; this is your real concurrency indicator, not Threads_connected
Threads_connected: total connections; alert when this approaches max_connections
Connection_errors_max_connections: cumulative counter; any increase means clients are being rejected

The distinction between Threads_running and Threads_connected is critical. A system with 500 connections and only 10 running threads is under minimal load. A system with 50 connections and 48 running threads is on the edge of a crisis.

Slow Query Log vs. Performance Schema

The slow query log captures queries exceeding long_query_time but loses its value at high query volumes due to logging overhead. Performance Schema is the better choice for production: it captures query digests, wait events, and execution stage breakdowns with minimal overhead.

Query performance_schema.events_statements_summary_by_digest and sort by SUM_TIMER_WAIT DESC to find your slowest query digests by total wall time. This differs from finding your slowest individual queries. A query that takes 2ms but runs 50,000 times per minute contributes far more to total latency than one that takes 500ms and runs twice a day.
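The comparison between the two queries is plain arithmetic, worked through here using the figures from the paragraph above. The helper name is illustrative.

```python
def total_ms_per_day(mean_ms, calls_per_day):
    """Cumulative execution time a query digest contributes in one day."""
    return mean_ms * calls_per_day

# 2 ms query running 50,000 times per minute, all day.
frequent = total_ms_per_day(mean_ms=2, calls_per_day=50_000 * 60 * 24)
# 500 ms query running twice a day.
rare = total_ms_per_day(mean_ms=500, calls_per_day=2)

print(frequent / 1000, "seconds per day for the 2 ms query")
print(rare / 1000, "seconds per day for the 500 ms query")
```

The frequent query accumulates 144,000 seconds (40 hours) of execution time per day against 1 second for the rare one, which is why sorting by total time surfaces a different list than sorting by per-call time.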

Replication Monitoring

MySQL replication monitoring requires watching both the primary and replica. On the replica, Seconds_Behind_Source (introduced in MySQL 8.0.22, replacing the older Seconds_Behind_Master) gives lag in seconds but can be misleading. It measures the timestamp difference between the last applied event and now, not network or I/O lag.

For precise replication monitoring, enable GTID-based replication and track transaction lag by comparing gtid_executed between primary and replica. Alert on any replica where lag exceeds your recovery point objective.

InnoDB Deadlocks

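Track the deadlock counter: Innodb_deadlocks in SHOW GLOBAL STATUS where your distribution exposes it (Percona Server does), or the lock_deadlocks counter in information_schema.INNODB_METRICS on stock MySQL. Deadlocks are normal in small quantities. InnoDB resolves them automatically by rolling back the victim transaction. A rising deadlock rate indicates application-level concurrency issues, usually from transactions acquiring locks in inconsistent order.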

Parse SHOW ENGINE INNODB STATUS or query information_schema.innodb_trx during incidents to understand what transactions are blocking what.

Alert Thresholds for MySQL

Metric	Warning	Critical
InnoDB buffer pool hit ratio	<98%	<95%
Threads_running (% of max connections)	60%	80%
Seconds_Behind_Source	>10s	>60s
Deadlocks per minute	>5	>20
Connection errors (max_connections)	>0	Any sustained rate

MongoDB Monitoring: The Atlas and Self-Hosted Divide

MongoDB monitoring splits into two distinct contexts: Atlas-managed clusters, where many metrics are collected automatically, and self-hosted deployments, where instrumentation is entirely your responsibility. The metrics that matter are the same in both cases.

Operation Counters and Latency

MongoDB exposes its runtime statistics through db.serverStatus(). The most actionable metrics live in the opcounters document and the opLatencies document.

Track operations per second by type: insert, query, update, delete, getmore, command. Sudden shifts in the ratio, such as a spike in update operations with no corresponding insert growth, can indicate runaway update loops or application bugs before they surface as user-facing slowness.

opLatencies gives you cumulative operation latency in microseconds alongside a cumulative operation count. Between two samples, divide the change in latency by the change in ops to get mean latency per operation type over your collection interval.
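A minimal sketch of that delta calculation, assuming two db.serverStatus() documents have been captured as Python dicts one collection interval apart. The function name and sample numbers are illustrative; the latency and ops fields under opLatencies.reads, writes, and commands are the ones serverStatus reports.

```python
def mean_latency_us(prev, curr, op_type):
    """Mean latency in microseconds for op_type between two serverStatus samples.

    op_type is one of the opLatencies keys, such as "reads", "writes", or "commands".
    """
    before, after = prev["opLatencies"][op_type], curr["opLatencies"][op_type]
    d_latency = after["latency"] - before["latency"]
    d_ops = after["ops"] - before["ops"]
    # No operations in the interval means there is no mean to report.
    return d_latency / d_ops if d_ops > 0 else None

# Hypothetical samples.
prev = {"opLatencies": {"reads": {"latency": 1_000_000, "ops": 10_000}}}
curr = {"opLatencies": {"reads": {"latency": 1_600_000, "ops": 12_000}}}
read_latency = mean_latency_us(prev, curr, "reads")
print(read_latency, "microseconds per read")
```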

For any serious MongoDB deployment, enable the Profiler selectively on slow operations. Set db.setProfilingLevel(1, { slowms: 100 }) to capture queries exceeding 100ms without enabling the full query log overhead. Captured operations land in system.profile and contain full query shape, execution plan, and index usage.

WiredTiger Cache

MongoDB’s WiredTiger storage engine has its own cache, separate from OS page cache. Monitor wiredTiger.cache.bytes currently in the cache against the configured cache size (default: the larger of 50% of (RAM minus 1GB) or 256MB).

The critical metric is cache eviction. When dirty data eviction rate climbs, WiredTiger starts blocking application writes while it flushes dirty pages to disk. This surfaces as sudden write latency spikes with no corresponding CPU or network explanation.

Watch for:

wiredTiger.cache.pages evicted by application threads: any non-zero sustained rate indicates cache pressure
wiredTiger.cache.tracked dirty bytes in the cache / configured cache size: alert above 20%
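The dirty-cache ratio can be read straight out of a serverStatus document. The sketch below assumes the document is available as a dict and uses the "maximum bytes configured" field as the configured cache size; the function name, sample values, and 20% cutoff (taken from the list above) are illustrative.

```python
def dirty_cache_ratio(server_status):
    """Fraction of the configured WiredTiger cache currently holding dirty data."""
    cache = server_status["wiredTiger"]["cache"]
    return cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"]

GIB = 1024 ** 3
# Hypothetical sample: 8 GiB cache with 2 GiB dirty.
status = {"wiredTiger": {"cache": {
    "tracked dirty bytes in the cache": 2 * GIB,
    "maximum bytes configured": 8 * GIB,
    "pages evicted by application threads": 0,
}}}
ratio = dirty_cache_ratio(status)
needs_alert = ratio > 0.20
print(f"dirty cache: {ratio:.0%}, alert: {needs_alert}")
```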

Replication and Oplog

For replica sets, collect the figures that rs.printReplicationInfo() prints by calling db.getReplicationInfo(), which returns them as a document you can read programmatically. The critical figure is the oplog window: how many hours of operations the oplog retains. If a secondary falls behind and the oplog window closes before it catches up, that secondary needs a full resync.

Alert when timeDiff (the oplog window in seconds) is less than 24 hours. For high write environments, configure oplog size explicitly using the replication.oplogSizeMB setting in your MongoDB config file.

Track secondary replication lag through rs.status(), specifically the optimeDate difference between primary and each secondary. Alert when any secondary lags more than 30 seconds.
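One way to turn rs.status() into a per-secondary lag figure is sketched below. It assumes the status document has been fetched through a driver so that optimeDate values arrive as datetime objects; the function name and member names are hypothetical.

```python
from datetime import datetime, timedelta

def secondary_lag_seconds(rs_status):
    """Seconds each SECONDARY trails the PRIMARY, based on optimeDate."""
    members = rs_status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary visible, for example during an election
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

# Hypothetical three-member replica set.
now = datetime(2026, 5, 27, 12, 0, 0)
status = {"members": [
    {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=4)},
    {"name": "db3:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=45)},
]}
lags = secondary_lag_seconds(status)
lagging = [name for name, lag in lags.items() if lag > 30]
print(lags, lagging)
```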

Index Effectiveness

MongoDB query performance depends almost entirely on index coverage. Run db.collection.aggregate([{$indexStats:{}}]) periodically to audit which indexes are being used. Indexes with zero accesses after weeks of production traffic are candidates for removal, as they consume write overhead without benefit.

On a collection receiving 10,000 inserts per second, a single unused compound index adds measurable write latency. This is a straightforward optimization that many teams skip until a query regression forces them to investigate.

Alert Thresholds for MongoDB

Metric	Warning	Critical
Query latency (p95)	>100ms	>500ms
Replication lag (secondary)	>15s	>60s
WiredTiger dirty cache	>15%	>25%
Oplog window	<48 hours	<12 hours
Connections (% of max)	70%	90%

Redis Monitoring: Memory Is Everything

Redis works differently from every other database in this guide. It operates entirely in memory. The consequence is simple: when Redis runs out of memory, your entire caching layer is compromised. Every Redis monitoring strategy starts with memory and builds outward from there.

Memory Usage and Eviction Policy

The fundamental Redis memory metrics come from INFO memory:

used_memory: current memory allocated by Redis for data
used_memory_rss: memory as reported by the OS (normally larger than used_memory; the gap is mostly fragmentation)
mem_fragmentation_ratio: used_memory_rss / used_memory; healthy range is 1.0 to 1.5; above 1.5 indicates significant fragmentation; below 1.0 means Redis is using swap

Know your eviction policy. If maxmemory-policy is set to allkeys-lru or similar, Redis silently evicts keys when memory is full. This is acceptable for pure caching but catastrophic if Redis also holds application session state or distributed locks that must not expire unexpectedly.

Monitor evicted_keys from INFO stats. Any evictions in a system where Redis holds non-cache data should trigger an immediate investigation. For pure cache use cases, a moderate eviction rate is expected but should be stable.
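INFO returns plain text in key:value lines grouped under # section headers, so a small parser is enough to pull out these fields. The sketch below is one way to do it; the function names and the sample text are assumptions. Redis also reports mem_fragmentation_ratio directly, so computing it by hand is mainly useful as a cross-check.

```python
def parse_info(text):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines and "# Section" headers
        key, _, value = line.partition(":")
        info[key] = value
    return info

def fragmentation_ratio(info):
    """used_memory_rss / used_memory, as described in the list above."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])

# Hypothetical, heavily trimmed INFO output.
sample = "# Memory\r\nused_memory:1000000\r\nused_memory_rss:1600000\r\n\r\n# Stats\r\nevicted_keys:42\r\n"
info = parse_info(sample)
frag = fragmentation_ratio(info)
evicted = int(info["evicted_keys"])
print(f"fragmentation {frag:.2f}, evicted keys {evicted}")
```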

Keyspace Hit Rate

From INFO stats, track keyspace_hits and keyspace_misses. Your cache hit rate is:

keyspace_hits / (keyspace_hits + keyspace_misses)

A well-tuned Redis cache should hold above 90% hit rate for most workloads. A degrading hit rate suggests TTL values set too aggressively, cache invalidation happening too frequently, or a memory pressure situation causing excessive evictions.
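keyspace_hits and keyspace_misses are cumulative since the server started, so a hit rate computed from raw totals can hide a recent regression. A minimal sketch of an interval-based hit rate, assuming two INFO stats samples already converted to integer dicts (the function name and sample values are illustrative):

```python
def interval_hit_rate(prev, curr):
    """Cache hit rate for the interval between two INFO stats samples."""
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    total = hits + misses
    # No lookups in the interval: report None instead of dividing by zero.
    return hits / total if total > 0 else None

# Hypothetical samples one collection interval apart.
prev = {"keyspace_hits": 900, "keyspace_misses": 100}
curr = {"keyspace_hits": 1800, "keyspace_misses": 300}
rate = interval_hit_rate(prev, curr)
print(f"interval hit rate: {rate:.1%}")
```

In this sample the lifetime hit rate is still about 86%, while the most recent interval has dropped to roughly 82%.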

Slow Queries and Blocking Commands

Redis is single-threaded for command execution (excluding I/O threads in Redis 6+). A single slow command blocks all subsequent commands. Enable the slowlog with a reasonable threshold. The threshold is expressed in microseconds, so 10000 means 10ms:

CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 128

Query SLOWLOG GET 10 regularly. Commands that consistently appear in the slowlog are candidates for optimization or replacement with more targeted data structures.

Watch for blocking commands: KEYS * in production is never acceptable. One KEYS * call on a large keyspace can block Redis for seconds. Always use SCAN with cursor iteration instead.
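Cursor iteration with SCAN looks roughly like the sketch below. It assumes a client object whose scan method follows the redis-py convention of returning a (next_cursor, keys) pair; FakeClient is a hypothetical stand-in so the example runs without a server, and it ignores the match pattern.

```python
def scan_keys(client, pattern="*", batch=1000):
    """Yield keys matching pattern using SCAN, one small batch at a time."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch)
        yield from keys
        if cursor == 0:  # the server returns cursor 0 when iteration is complete
            break

class FakeClient:
    """Hypothetical stand-in that pages through a fixed key list."""
    def __init__(self, keys, page=2):
        self._keys, self._page = keys, page

    def scan(self, cursor=0, match=None, count=None):
        nxt = cursor + self._page
        chunk = self._keys[cursor:nxt]
        return (0 if nxt >= len(self._keys) else nxt, chunk)

found = list(scan_keys(FakeClient(["a", "b", "c", "d", "e"])))
print(found)
```

Against a real server, SCAN can return the same key more than once, so callers that need uniqueness should deduplicate. redis-py also ships a scan_iter helper that wraps the same loop.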

Replication and Persistence

For Redis replicas, monitor master_last_io_seconds_ago from INFO replication. This shows how recently the replica heard from the primary. Alert above 10 seconds.

If using RDB persistence, monitor rdb_last_bgsave_status. Any value other than “ok” means your last backup failed. If using AOF, track aof_rewrite_in_progress during rewrites, as they add memory pressure.

Alert Thresholds for Redis

Metric	Warning	Critical
Memory usage (% of maxmemory)	70%	85%
Mem fragmentation ratio	>1.5	>2.0
Evicted keys (non-cache use)	>0	Any sustained rate
Cache hit rate	<85%	<70%
Connected clients (% of maxclients)	70%	85%
Replication offset (bytes behind)	>100KB	>1MB

Elasticsearch Monitoring: Cluster State Is Your First Signal

Elasticsearch operates as a distributed cluster of nodes, and its monitoring has a cluster-first orientation. A single unhealthy node can degrade the entire cluster. Your monitoring must start with cluster state, then drill into node-level and index-level metrics.

Cluster Health

The Elasticsearch Cluster Health API (GET /_cluster/health) returns one of three states:

Green: all primary and replica shards are assigned and operational
Yellow: all primary shards are assigned, but one or more replica shards are unassigned
Red: one or more primary shards are unassigned; some data may be unavailable

Red cluster status means data loss risk is active. Yellow means resilience is compromised; the cluster can serve requests but cannot tolerate a node failure for the affected indexes. Your monitoring should treat any non-green cluster state as an alert-worthy event.

Poll GET /_cluster/health at your shortest feasible interval, 15 to 30 seconds in production. The status field is your headline metric. Track it alongside relocating_shards, initializing_shards, and unassigned_shards.
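A poller only needs a few fields from the health response. The sketch below assumes the JSON body of GET /_cluster/health has already been decoded into a dict; the function name and the severity labels are illustrative.

```python
def classify_cluster_health(health):
    """Reduce a cluster health response to the fields worth alerting on."""
    severity = {"green": "ok", "yellow": "warning", "red": "critical"}.get(
        health["status"], "unknown")
    return {
        "severity": severity,
        "shards_in_motion": health["relocating_shards"] + health["initializing_shards"],
        "unassigned_shards": health["unassigned_shards"],
    }

# Hypothetical response from a cluster that has lost a node.
response = {
    "status": "yellow",
    "relocating_shards": 0,
    "initializing_shards": 2,
    "unassigned_shards": 6,
}
result = classify_cluster_health(response)
print(result)
```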

JVM Heap Pressure

Elasticsearch is built on the JVM. Heap pressure is the most common root cause of Elasticsearch performance degradation and crashes. Pull JVM stats from GET /_nodes/stats/jvm:

jvm.mem.heap_used_percent: alert at 75%, take urgent action at 85%
jvm.gc.collectors.old.collection_count and collection_time_in_millis: track the rate of increase; full GC events exceeding 10 seconds will cause cluster timeouts

Elasticsearch recommends setting heap to no more than 50% of available RAM and keeping it below the compressed object pointer threshold of roughly 32GB; 31GB is a common safe ceiling. Above that threshold, the JVM cannot use compressed object pointers and memory efficiency drops sharply. If your Elasticsearch nodes have 128GB of RAM and heap is set to 64GB, you are risking more frequent full GCs and leaving performance on the table.
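The sizing rule reduces to taking the smaller of half the RAM and the 31GB ceiling. A tiny sketch, with a hypothetical helper name:

```python
def suggested_heap_gb(ram_gb, ceiling_gb=31):
    """Heap size following the rule above: half of RAM, capped at the ceiling."""
    return min(ram_gb / 2, ceiling_gb)

for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {suggested_heap_gb(ram)} GB heap")
```

For the 128GB node in the example, the rule gives 31GB of heap and leaves the rest of the memory to the operating system's filesystem cache.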

Indexing and Search Throughput

From GET /_nodes/stats/indices:

indices.indexing.index_total and index_time_in_millis: derive indexing throughput (docs/sec) and mean indexing latency
indices.search.query_total and query_time_in_millis: derive query throughput and mean query latency
indices.search.fetch_total and fetch_time_in_millis: the fetch phase happens after query; slow fetches often indicate returning too many fields or large source documents
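Each pair of counters above yields a throughput and a mean latency once you difference two samples. A minimal sketch, assuming the relevant section of GET /_nodes/stats/indices has been extracted into dicts; the function name, sample values, and 30-second interval are assumptions.

```python
def rate_and_mean_latency(prev, curr, total_key, time_key, interval_s):
    """Return (operations per second, mean milliseconds per operation)."""
    d_total = curr[total_key] - prev[total_key]
    d_millis = curr[time_key] - prev[time_key]
    ops_per_s = d_total / interval_s
    mean_ms = d_millis / d_total if d_total > 0 else None
    return ops_per_s, mean_ms

# Hypothetical indexing counters sampled 30 seconds apart.
prev = {"index_total": 1_000_000, "index_time_in_millis": 400_000}
curr = {"index_total": 1_030_000, "index_time_in_millis": 415_000}
docs_per_s, mean_index_ms = rate_and_mean_latency(
    prev, curr, "index_total", "index_time_in_millis", interval_s=30)
print(docs_per_s, "docs/sec,", mean_index_ms, "ms per doc")
```

The same function works for query_total with query_time_in_millis and fetch_total with fetch_time_in_millis. It yields means only; percentiles such as p99 have to come from request-level timing.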

For search latency, p99 matters more than mean. Elasticsearch search results are assembled from multiple shards in parallel, so tail latencies drive user experience. Alert when p99 search latency exceeds your SLO, typically 500ms to 1s for most interactive search applications.

Thread Pool Queues and Rejections

Each Elasticsearch node has thread pools for different operations: search, write, analyze, and others. When a thread pool is saturated, incoming requests are queued. When the queue fills, requests are rejected.

Track GET /_nodes/stats/thread_pool for:

thread_pool.search.rejected: any rejections mean queries are being dropped; alert immediately
thread_pool.write.rejected: same for indexing; alert immediately
thread_pool.search.queue: a growing queue predicts rejections; alert when queue depth exceeds 50

Thread pool rejections are one of the clearest signals of a cluster under load it cannot handle. They appear in application logs as 429 errors and are often the first thing users report as “search not working.”

Disk and Shard Allocation

Elasticsearch has built-in disk watermarks that automatically restrict shard allocation to protect disk space. Know the defaults:

cluster.routing.allocation.disk.watermark.low (default 85%): Elasticsearch stops allocating new shards to this node
cluster.routing.allocation.disk.watermark.high (default 90%): Elasticsearch begins relocating shards away from this node
cluster.routing.allocation.disk.watermark.flood_stage (default 95%): Elasticsearch puts all indexes on this node into read-only mode

Alert well before these thresholds. If your monitoring alerts at 85%, you have no reaction time before Elasticsearch starts managing its own shard allocation. Alert at 70% disk usage on Elasticsearch nodes.
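The relationship between your own alert level and the built-in watermarks can be sketched as a simple classifier. The percentages are the defaults listed above plus the suggested 70% alert; the function name and return strings are illustrative, and clusters with customized watermarks would need different values.

```python
# Default watermarks from the list above, highest first.
WATERMARKS = [(95.0, "flood_stage"), (90.0, "high"), (85.0, "low")]

def disk_state(used_percent, alert_percent=70.0):
    """Classify a node's disk usage against the alert level and watermarks."""
    for threshold, name in WATERMARKS:
        if used_percent >= threshold:
            return f"{name} watermark exceeded"
    if used_percent >= alert_percent:
        return "alert: approaching low watermark"
    return "ok"

for pct in (55.0, 72.5, 86.0, 91.0, 96.0):
    print(pct, "->", disk_state(pct))
```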

Alert Thresholds for Elasticsearch

Metric	Warning	Critical
Cluster status	Yellow	Red
JVM heap used	>75%	>85%
Disk usage	>70%	>82%
Thread pool rejections (search/write)	>0	Any sustained rate
Unassigned shards	>0	>5
Search p99 latency	>500ms	>2s

Cross-Database Monitoring Practices

The five databases above each have unique internals, but the operational practices around monitoring them share common ground.

Correlate Database Metrics with Application Traces

The most valuable insight in database monitoring comes from correlation: understanding whether a latency spike in your database explains a latency spike in your application. This requires distributed tracing that spans both layers.

When an OpenTelemetry-instrumented application makes a database call, the resulting trace span carries the query text, duration, and database attributes alongside the parent HTTP span. During an incident, you can start from an elevated error rate in your application, drill into traces, find the slow database span, and pivot directly to the corresponding database metrics without context-switching tools.

This is the architecture that separates reactive from proactive database operations. Teams that have built this correlation catch issues during gradual degradation, not during sudden outages.

Establish Baseline Behavior Before Setting Alert Thresholds

Alert thresholds are meaningless without baselines. A MySQL buffer pool hit rate of 97% might be normal for one application and a serious regression for another. Before setting thresholds, run your monitoring for two to four weeks and understand:

Peak vs. off-peak resource utilization patterns
Expected query latency ranges by query type
Normal connection counts during batch jobs vs. interactive traffic
Replication lag behavior during heavy write periods

Thresholds set below peak normal levels generate alert fatigue. Thresholds set above what the database can sustain generate missed incidents. The tables in this guide are starting points; calibrate them against your own baselines.

Log Retention for Post-Incident Analysis

During a live incident, your monitoring dashboard shows current state. Post-incident, you need historical telemetry to understand root cause. High-cardinality telemetry (query-level traces, slow query logs) generates significant data volume, and retention decisions involve cost tradeoffs.

For most production environments, a practical retention strategy is:

Metrics (aggregated): 13 months (covers year-over-year comparison)
Slow query logs and traces: 30 to 90 days
Infrastructure metrics (CPU, memory, disk): 13 months

The worst time to discover you have two-week telemetry retention is during a post-mortem for an incident that started three weeks ago.

Putting It Together: Monitoring a Mixed Database Stack

Most production environments do not run a single database. A common stack looks something like this: PostgreSQL handles transactional data, Redis sits in front as a cache, MongoDB stores user-generated content, and Elasticsearch powers product or log search. Running separate monitoring tools for each engine is technically possible but operationally exhausting.

What Fragmented Monitoring Looks Like in Practice

Consider an engineering team supporting an e-commerce platform. During a Black Friday traffic spike, checkout latency climbs from 180ms to 900ms. The on-call engineer opens four dashboards: the PostgreSQL monitoring tool, the Redis metrics UI, the Elasticsearch Kibana console, and the APM tool for application traces. None of them talk to each other.

The PostgreSQL tool shows connection pool saturation. The Redis tool shows normal memory usage. The APM tool shows slow HTTP spans, but cannot drill to the database query level. It takes 40 minutes to establish that a missing index on the PostgreSQL orders table, introduced by a schema migration two days earlier, is responsible. With unified telemetry, that correlation should have taken five minutes.

This is the cost of fragmentation: not just operational overhead, but mean time to resolution that multiplies under incident pressure.

What Unified Database Monitoring Enables

Unified monitoring changes the incident workflow in three specific ways:

Cross-layer correlation: A slow HTTP trace links directly to the offending database span, the query digest, and the infrastructure metric (CPU, disk I/O) on the same node, in one view
Single alert routing: One platform sends pages with full context, not five separate alerts that engineers must manually correlate
Unified retention: Historical telemetry from all engines is queryable together, which matters during post-mortems that span multiple systems

Choosing a Platform for Multi-Database Visibility

The architecture that works is simple: one OpenTelemetry-native platform ingesting telemetry from all engines into a shared store, with database spans and application traces living in the same backend. CubeAPM is built on this model, covering MySQL, PostgreSQL, MongoDB, Redis, and Elasticsearch alongside application traces and infrastructure metrics, with on-prem or VPC deployment for teams with data residency requirements.

Delhivery, a large-scale logistics company, reported a 75% reduction in monitoring costs after consolidating onto CubeAPM, with faster root-cause detection across their database and application layers. The architectural decisions here determine whether an incident takes five minutes or fifty to resolve.

A relevant data point on the cost of getting this wrong: Splunk and Oxford Economics found that unplanned downtime costs Global 2000 companies $400 billion annually, with each company averaging $200 million per year in losses.

How CubeAPM Helps With Database Monitoring

Most database monitoring problems are not a shortage of data. They are a shortage of context. Metrics in one tool, query traces in another, logs somewhere else. When an incident hits, the investigation becomes a manual exercise of correlating timestamps across three dashboards instead of following a single thread from symptom to root cause.


One Platform Across All Five Engines

CubeAPM is an OpenTelemetry-native observability platform that covers PostgreSQL, MySQL, MongoDB, Redis, and Elasticsearch in a single interface, alongside application traces, infrastructure metrics, and logs. For each engine, it surfaces the signals that matter most:

PostgreSQL: slow query detection, connection state breakdown, autovacuum tracking, and replication lag
MySQL: InnoDB buffer pool metrics, thread concurrency, and query digest analysis via Performance Schema
MongoDB: WiredTiger cache pressure, oplog window, replication lag, and index usage auditing
Redis: memory utilization, eviction rate, keyspace hit ratio, and slowlog visibility
Elasticsearch: cluster health state, JVM heap pressure, thread pool rejections, and shard allocation status

Correlation Between Database and Application Layers

Because CubeAPM ingests database telemetry and application traces through the same OpenTelemetry pipeline, a slow query shows up in the same trace as the HTTP request that triggered it. Engineers start from the user-facing error, follow the trace to the database span, see the query digest and execution time, and pivot to the infrastructure metrics on the same host, without leaving the platform.

Predictable Cost at Scale

Datadog and New Relic price database monitoring per host or monitored instance, which scales unpredictably as fleets grow. CubeAPM uses per-GB ingestion pricing at $0.15/GB with unlimited retention, so teams running five database engines across dozens of nodes can forecast their observability spend without surprises. Data stays inside your own VPC, which matters for teams with compliance or data residency requirements.

Conclusion

Each database in this guide fails differently and exposes different signals. PostgreSQL surfaces problems through its statistics views. MySQL reveals pressure through InnoDB internals. MongoDB warns you through replication lag and cache eviction. Redis tells the story through memory and hit rate. Elasticsearch speaks through cluster state and JVM heap. The job of database monitoring is to know which signal to watch for each engine, set thresholds against your own baselines, and correlate what the database is doing with what the application is experiencing.

The teams that catch issues before users notice are not running more sophisticated infrastructure. They are watching the right metrics, in one place, with enough historical context to tell signal from noise. If your current setup leaves gaps across any of the five engines covered here, start with the alert threshold tables in each section and build from there.

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. The alert thresholds and metric recommendations in this guide are starting points based on common production patterns. Every database workload is different. Validate all thresholds against your own baseline before applying them in production.

FAQs

What is database monitoring?

Database monitoring is the continuous collection, analysis, and alerting on performance and health metrics from a database system. It covers query execution time, connection utilization, memory usage, replication lag, error rates, and storage capacity, tracked over time so teams detect anomalies before they cause outages or user impact.

What are the most important metrics to monitor in PostgreSQL?

The highest-signal PostgreSQL metrics are query execution time from pg_stat_statements, connection state distribution from pg_stat_activity, replication lag from pg_stat_replication, autovacuum activity from pg_stat_user_tables, and transaction ID age from pg_database. Enabling the pg_stat_statements extension is non-negotiable for any production PostgreSQL instance.

How do you monitor MySQL replication lag accurately?

In MySQL 8.0.22 and later, Seconds_Behind_Source from SHOW REPLICA STATUS is the primary indicator, but it can mislead: it is derived from event timestamps rather than from measured network or I/O delay. For a more precise view, enable GTID-based replication and compare gtid_executed on the primary and the replica (for example with GTID_SUBTRACT()), which gives lag as a count of transactions, independent of timestamp skew.
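The gtid_executed comparison can be sketched client-side as follows. This is a minimal example that assumes untagged GTID sets in the usual `uuid:start-end:...` text form; the function names are hypothetical, and on the server MySQL's GTID_SUBTRACT() performs the same set difference.

```python
def parse_gtid_set(gtid_set):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: [(start, end), ...]} (inclusive)."""
    parsed = {}
    for part in gtid_set.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = parsed.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return parsed


def _merge(ranges):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def _overlap(a, b):
    """Count the integers covered by both merged interval lists."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        low, high = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if low <= high:
            total += high - low + 1
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def gtid_lag(primary_executed, replica_executed):
    """Number of transactions executed on the primary but not on the replica."""
    primary = parse_gtid_set(primary_executed)
    replica = parse_gtid_set(replica_executed)
    missing = 0
    for uuid, ranges in primary.items():
        p = _merge(ranges)
        r = _merge(replica.get(uuid, []))
        missing += sum(end - start + 1 for start, end in p) - _overlap(p, r)
    return missing
```

Feed it the gtid_executed value read from each server. Interval arithmetic is used instead of expanding every ID so that sets covering millions of transactions stay cheap to compare.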

Why does Elasticsearch cluster health turn yellow and what should you do?

Yellow status means one or more replica shards are unassigned, usually because there are not enough nodes to place all replicas. Run GET /_cluster/allocation/explain to understand why shards are unassigned, verify node availability, and check disk watermark thresholds if disk pressure is triggering allocation restrictions.

How much memory should Redis use before you increase capacity?

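Alert when Redis used_memory reaches 70% of maxmemory, and treat 85% as the point to act. Once used_memory reaches maxmemory, Redis evicts keys under the configured policy (or rejects writes under noeviction), and those evictions can reach data outside your intended cache scope. Use used_memory_rss, the actual OS-reported footprint, as your true capacity constraint when planning headroom.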
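A minimal sketch of that check, assuming you already have the raw text returned by INFO memory. The function names and the "unbounded" label are illustrative; the 70% and 85% defaults come from the answer above and should be validated against your own baseline.

```python
def parse_info(text):
    """Parse Redis INFO output ('key:value' lines, '#' section headers)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_status(info_text, warn=0.70, critical=0.85):
    """Classify used_memory against maxmemory.

    Returns (level, ratio). maxmemory of 0 means no limit is configured,
    in which case the ratio is None and the level is 'unbounded'.
    """
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", 0))
    if limit == 0:
        return "unbounded", None
    ratio = used / limit
    if ratio >= critical:
        level = "critical"
    elif ratio >= warn:
        level = "warning"
    else:
        level = "ok"
    return level, round(ratio, 3)


if __name__ == "__main__":
    sample = "# Memory\r\nused_memory:750\r\nused_memory_rss:900\r\nmaxmemory:1000\r\n"
    print(memory_status(sample))
```

The sketch deliberately reports an unset maxmemory as its own state, since a percentage-of-limit alert is meaningless on an instance with no limit.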

What is the difference between MongoDB’s WiredTiger cache and OS page cache?

WiredTiger maintains its own cache for decompressed, deserialized documents and index entries. The OS page cache provides a secondary layer for raw disk pages not in the WiredTiger cache. For engine-level pressure, monitor the "bytes currently in the cache" statistic under wiredTiger.cache in serverStatus against "maximum bytes configured"; for system-level pressure, monitor overall free memory. High WiredTiger eviction rates while OS memory is still available usually indicate that the WiredTiger cache is configured too small.
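To make the engine-level side concrete, the sketch below computes cache fill and dirty ratios from a serverStatus document that has already been fetched as a dictionary. The function name and the sample numbers are assumptions for illustration; the statistic names are those reported under wiredTiger.cache.

```python
def wiredtiger_cache_pressure(server_status):
    """Return fill and dirty ratios of the WiredTiger cache.

    server_status: dict shaped like the output of the serverStatus command.
    """
    cache = server_status["wiredTiger"]["cache"]
    configured = cache["maximum bytes configured"]
    return {
        "fill_ratio": cache["bytes currently in the cache"] / configured,
        "dirty_ratio": cache["tracked dirty bytes in the cache"] / configured,
    }


if __name__ == "__main__":
    # Hypothetical numbers, scaled down for readability.
    sample = {
        "wiredTiger": {
            "cache": {
                "maximum bytes configured": 1000,
                "bytes currently in the cache": 800,
                "tracked dirty bytes in the cache": 50,
            }
        }
    }
    print(wiredtiger_cache_pressure(sample))
```

Tracking both ratios over time, alongside free OS memory, helps separate an undersized WiredTiger cache from genuine host-level memory pressure.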

Should you use a single monitoring tool for all five databases or separate tools per database?

A single unified platform significantly reduces operational complexity and, more importantly, enables cross-layer correlation between database telemetry and application traces. Separate tools make that correlation difficult or impossible. Teams running mixed stacks get the most value from OpenTelemetry-native platforms that ingest telemetry from all engines into a shared store with unified dashboards and alert routing.

How often should database monitoring alerts be reviewed and tuned?

Review and tune alert thresholds at least quarterly, and always after major version upgrades, schema changes, or significant traffic changes. After any incident, check whether your monitoring caught the issue early or missed it, and adjust accordingly. Monitoring is a living system, not a set-and-forget configuration.
