CubeAPM
CubeAPM CubeAPM

How to Monitor TimescaleDB for Slow Queries and Chunk Health

How to Monitor TimescaleDB for Slow Queries and Chunk Health

Table of Contents

Standard PostgreSQL monitoring catches none of this. Compression policies stall without errors. Chunk counts balloon without alerts. Background jobs exhaust their worker pool and start queueing up, again silently.

This guide covers the six areas that matter most when you monitor TimescaleDB in production: slow query detection, chunk health, compression job status, continuous aggregate freshness, background worker health, and setting up alerts so these problems surface before they become incidents.

What Makes TimescaleDB Monitoring Different from PostgreSQL

TimescaleDB adds three layers of complexity that standard PostgreSQL monitoring does not cover.

  • Hypertables and chunks: Every hypertable is physically partitioned into chunks, one per time interval. Too many chunks cause query planning overhead. A hypertable with 4,000 chunks can add 400ms or more of planning time to a query that executes in 2ms, because the planner must evaluate constraints for every chunk on every query. Standard PostgreSQL monitoring reports the slow query, but not its cause.
  • Compression jobs: TimescaleDB compresses older chunks automatically via background jobs. When a compression job fails silently, uncompressed chunks accumulate. Three months of uncompressed data can represent a 10x to 20x disk overage relative to what compression would have produced, without triggering a single log error visible to standard monitoring.
  • Continuous aggregates: Continuous aggregates are materialized views that refresh automatically as new data arrives. When the refresh job stalls, the aggregate serves stale data with no error. Dashboards built on the aggregate show outdated values, and no alert fires unless you are explicitly monitoring the refresh lag.

Step 1: Enable pg_stat_statements for Slow Query Detection

pg_stat_statements is the foundation of slow query monitoring in TimescaleDB. It is a PostgreSQL contrib extension that aggregates execution statistics for every query run against the database. TimescaleDB inherits it fully.

Add pg_stat_statements to shared_preload_libraries in postgresql.conf:

shared_preload_libraries = 'timescaledb,pg_stat_statements'

pg_stat_statements.track = all

pg_stat_statements.max = 10000

track_io_timing = on

Restart PostgreSQL, then enable the extension:

sql

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Step 2: Find Slow Queries

The following query returns the top 10 slowest queries by mean execution time. High mean time with many calls is a tuning priority.

SELECT

  queryid,

  calls,

  ROUND(total_exec_time::numeric / 1000, 2AS total_exec_sec,

  ROUND(mean_exec_time::numeric / 1000, 4)   AS mean_exec_sec,

  ROUND(stddev_exec_time::numeric / 1000, 4) AS stddev_exec_sec,

  rows,

  shared_blks_hit,

  shared_blks_read,

  LEFT(query, 120)                            AS query_preview

FROM pg_stat_statements

WHERE calls > 10

ORDER BY mean_exec_time DESC

LIMIT 10;

Column reference (PostgreSQL 13+):
ColumnDescription
total_exec_timeTotal time spent executing, in milliseconds
mean_exec_timeAverage execution time per call, in milliseconds
stddev_exec_timeStandard deviation of execution time (high = inconsistent)
shared_blks_hitBlocks served from cache
shared_blks_readBlocks read from disk (high = cache misses)
callsNumber of times the query was executed

To find queries with high variance (occasionally very slow but usually fast), sort by stddev_exec_time:

SELECT

  calls,

  ROUND(mean_exec_time::numeric / 1000, 4)   AS mean_exec_sec,

  ROUND(stddev_exec_time::numeric / 1000, 4) AS stddev_exec_sec,

  LEFT(query, 120) AS query_preview

FROM pg_stat_statements

WHERE calls > 10

  AND stddev_exec_time > mean_exec_time

ORDER BY stddev_exec_time DESC

LIMIT 10;

To reset statistics after making changes and starting a clean baseline:

sql

SELECT pg_stat_statements_reset();

Step 3: Check Active Slow Queries in Real Time

pg_stat_statements tracks historical aggregates. To catch queries that are slow right now, use pg_stat_activity:

SELECT

  pid,

  now() - query_start  AS duration,

  state,

  wait_event_type,

  wait_event,

  LEFT(query, 200)      AS query_preview

FROM pg_stat_activity

WHERE state = 'active'

  AND query_start < now() - INTERVAL '5 seconds'

ORDER BY duration DESC;

This surfaces queries that have been running for more than 5 seconds. Adjust the interval threshold to match your SLO. Common causes in TimescaleDB: chunk decompression mid-query, missing index on a large chunk, or a continuous aggregate refresh blocking a read.

Step 4: Monitor Chunk Health

Chunk count and size are the most important structural health signals for a TimescaleDB hypertable. Run this query to get an overview per hypertable:

SELECT

  hypertable_schema,

  hypertable_name,

  COUNT(*)                                    AS chunk_count,

  pg_size_pretty(SUM(total_bytes))            AS total_size,

  pg_size_pretty(AVG(total_bytes)::bigint)    AS avg_chunk_size,

  COUNT(*) FILTER (WHERE is_compressed)       AS compressed_chunks,

  COUNT(*) FILTER (WHERE NOT is_compressed)   AS uncompressed_chunks

FROM timescaledb_information.chunks

GROUP BY hypertable_schema, hypertable_name

ORDER BY SUM(total_bytes) DESC;

Thresholds to alert on:

SignalWarningCritical
Total chunk count> 500> 1,000
Uncompressed chunks older than compression policy> 10> 50
Average chunk size< 1 MB or > 5 GBDepends on interval

High chunk count is a common cause of slow queries even when indexes are correct. If your total chunk count is above 1,000, consider increasing the chunk time interval using set_chunk_time_interval().

To inspect individual chunks and spot oversized or undersized ones:

SELECT

  chunk_schema,

  chunk_name,

  range_start,

  range_end,

  is_compressed,

  pg_size_pretty(total_bytes)   AS total_size,

  pg_size_pretty(heap_bytes)    AS heap_size,

  pg_size_pretty(index_bytes)   AS index_size

FROM timescaledb_information.chunks

WHERE hypertable_name = 'your_hypertable'

ORDER BY range_start DESC

LIMIT 20;

Step 5: Monitor Compression Job Health

A failed compression job leaves chunks uncompressed. The failure is silent unless you query the job stats directly.

SELECT

  job_id,

  application_name,

  scheduled,

  last_run_status,

  last_run_started_at,

  next_start,

  total_failures,

  total_successes

FROM timescaledb_information.jobs

WHERE application_name LIKE 'Compression%'

ORDER BY total_failures DESC;

Check compression ratios for a specific hypertable:

SELECT

  chunk_name,

  pg_size_pretty(before_compression_total_bytes) AS before,

  pg_size_pretty(after_compression_total_bytes)  AS after,

  ROUND(

    before_compression_total_bytes::numeric /

    NULLIF(after_compression_total_bytes, 0), 1

  ) AS compression_ratio

FROM chunk_compression_stats('your_hypertable')

ORDER BY chunk_name DESC

LIMIT 20;

What to alert on:
SignalAction
last_run_status = ‘Failed’Alert immediately
total_failures increasingInvestigate compression settings
compression_ratio < 5xReview segmentby and orderby column config
next_start overdue by > 2x schedule intervalJob may be stalled or worker pool exhausted

When a compression job fails, temporarily pause it while you investigate:

SELECT alter_job(<job_id>, scheduled => false);

Re-enable once resolved:

SELECT alter_job(<job_id>, scheduled => true);

Step 6: Monitor Continuous Aggregate Freshness

A stale continuous aggregate silently serves outdated data. Query timescaledb_information.continuous_aggregate_stats directly, which contains both materialization progress and job status in one place:

SELECT

  view_name,

  completed_threshold,

  invalidation_threshold,

  last_run_started_at,

  last_successful_finish,

  last_run_status,

  next_scheduled_run,

  total_runs,

  total_failures

FROM timescaledb_information.continuous_aggregate_stats

ORDER BY view_name;

Calculate refresh lag explicitly:

SELECT

  view_name,

  completed_threshold,

  now() - completed_threshold::timestamptz AS refresh_lag

FROM timescaledb_information.continuous_aggregate_stats

ORDER BY refresh_lag DESC;

Alert thresholds:

Refresh intervalWarning lagCritical lag
1 hour> 2 hours> 6 hours
1 day> 2 days> 4 days
Custom> 2x schedule interval> 4x schedule interval

Step 7: Monitor Background Worker Health

TimescaleDB runs all background jobs (compression, retention, continuous aggregate refresh) using a fixed pool of background workers. If the pool is exhausted, jobs queue up, and nothing runs. Check the pool size:

SHOW timescaledb.max_background_workers;

Check all job statuses at once:

SELECT

  job_id,

  application_name,

  schedule_interval,

  last_run_status,

  last_run_started_at,

  next_start,

  total_runs,

  total_failures,

  total_successes

FROM timescaledb_information.jobs

ORDER BY total_failures DESC, last_run_started_at DESC;

If total_failures is incrementing across multiple jobs simultaneously, the worker pool is likely exhausted. Increase timescaledb.max_background_workers in postgresql.conf and reload:

timescaledb.max_background_workers = 16

Step 8: Set Up Alerts with CubeAPM

CubeAPM monitors for TimescaleDB
How to Monitor TimescaleDB for Slow Queries and Chunk Health 2

Running these queries manually is a start, but production monitoring requires automated alerts that fire before problems compound. CubeAPM connects to your TimescaleDB instance via its infrastructure monitoring agent, collects PostgreSQL and TimescaleDB metrics, and lets you define alert thresholds across all the signals above.

What CubeAPM monitors for TimescaleDB:

  • Query execution time from pg_stat_statements (mean, max, p95 execution time per query)
  • Active query duration from pg_stat_activity
  • Per-hypertable chunk count and total size
  • Compression job success/failure rate and last run status
  • Continuous aggregate refresh lag
  • Background worker job failure counts
  • PostgreSQL-level metrics: connections, cache hit ratio, replication lag, WAL write rate

Key alerts to configure for TimescaleDB:

AlertConditionSeverity
Slow queryMean execution time > 5s for any query with calls > 50/minWarning
Long-running queryActive query duration > 30sCritical
Chunk count spikeChunk count for any hypertable > 1,000Warning
Compression job failureAny compression job last_run_status = ‘Failed’Critical
Stale continuous aggregateRefresh lag > 2x schedule intervalWarning
Background worker failureAny job total_failures incrementingWarning
Low compression ratioCompression ratio drops below 5xWarning

Read the docs at https://docs.cubeapm.com/ to set up PostgreSQL and infrastructure monitoring.

Summary

TimescaleDB’s failure modes are observable well before they cause incidents, but only if you are watching the right signals. Standard PostgreSQL monitoring misses compression job stalls, chunk count bloat, and continuous aggregate staleness entirely.

Monitoring areaPrimary sourceKey signal
Slow queries (historical)pg_stat_statementsmean_exec_time, stddev_exec_time
Slow queries (real-time)pg_stat_activityQuery duration > threshold
Chunk healthtimescaledb_information.chunksChunk count, uncompressed chunk count
Compression jobstimescaledb_information.jobslast_run_status, total_failures
Continuous aggregatestimescaledb_information.continuous_aggregate_statscompleted_threshold, refresh_lag
Background workerstimescaledb_information.jobsSimultaneous failures across job types

Disclaimer: All SQL in this article uses column names valid for PostgreSQL 13 and later. The total_exec_time and mean_exec_time columns in pg_stat_statements replaced total_time and mean_time in PostgreSQL 13. Current TimescaleDB (2.26 as of March 2026) requires PostgreSQL 15 or 16+. PostgreSQL 15 support is scheduled to be dropped in the next TimescaleDB release (June 2026), so PostgreSQL 16 or higher is strongly recommended for new deployments.

Also read:

What Are the Best New Relic Alternatives with OpenTelemetry Support?

What are the Best New Relic Alternatives for Java Applications?

What Are the Best New Relic Alternatives for Microservices?

×
×