CubeAPM

How to Monitor Cron Jobs for Silent Failures 


A cron job that does not run is a silent failure. Traditional cron has no built-in alerting, no success reporting, and no visibility. When a job crashes, times out, exits with a non-zero code, or never runs because the server was rebooted, the only signal you get is the absence of an outcome: a missing backup, unprocessed payments, stale data, often discovered hours or days later, long after damage is done.

The problem is structural. Cron’s design is intentional: it is a scheduler, not a monitoring system. It dispatches commands. It does not know or care whether those commands succeeded. Detecting silent failures requires adding an external monitoring layer. Let’s understand how to monitor cron jobs for these failures.

Key Takeaways

  • A cron job can be “running”, in the sense that the process was started, while doing nothing useful. Exit code 0 means the process exited cleanly, not that the job did what it was supposed to do.
  • The most reliable detection mechanism for silent failures is heartbeat monitoring (also called the dead man’s switch technique): the job sends a ping to an external service on successful completion. If the ping does not arrive within the expected window, an alert fires.
  • The success ping must be sent only after successful completion, never at the start of the job in its place. Adding a separate start ping makes it possible to detect stuck or hung jobs, but a single ping at the end is the minimum viable implementation.
  • Log files do not replace heartbeat monitoring. Logs tell you what happened when the job ran. Heartbeat monitoring tells you when the job did not run at all.
  • For Kubernetes CronJobs, exit codes, pod restart counts, and missed schedules are natively trackable. The OTel Collector’s Kubernetes objects receiver (k8sobjects) captures the related events without code changes.
  • Grace periods prevent false alerts from scheduling jitter. A job expected at 02:00 that runs at 02:04 due to server load should not fire an alert. Set grace periods based on measured execution variance, not guesswork.

Why Cron Fails Silently

Cron fails silently in ways that are invisible to standard monitoring:

  • Exit code 0 does not mean success: A script that encounters a partial failure, processes zero records, or writes to a disk that is full can still exit 0. Unless you explicitly check outcomes and exit non-zero on failure, cron has no way to know anything went wrong.
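  • Jobs can be skipped entirely: If the server is down or rebooting during a scheduled window, the job does not run, and cron does not catch up afterwards. If the cron daemon is not running (happens more often than expected after OS updates), no jobs run. If a previous run is still executing when the next one is due, cron starts a second, overlapping instance by default; scripts that guard against overlap with a lock (for example flock -n) then exit immediately, so the new run is effectively skipped.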
  • Long-running jobs can hang: A job that normally completes in 5 minutes can wait indefinitely for a database lock, a network response, or an external API. The process is alive, consuming resources, not completing, and not alerting anyone.
  • Disk full kills jobs silently: When the target partition is full, tools like mysqldump, rsync, and tar exit with non-zero codes.

If your cron command does not check exit codes explicitly, the failure is invisible:

# This looks fine but fails silently if backup.sh exits non-zero
0 2 * * * /usr/local/bin/backup.sh

# This is explicit: exit code is checked
0 2 * * * /usr/local/bin/backup.sh || echo "BACKUP FAILED" | mail -s "Backup failure" [email protected]

MAILTO is not enough. Cron can email a job’s output via MAILTO, but this relies on a working mail server, the right address being configured, and someone reading the emails. Cron also mails whenever a job prints anything, including on successful runs, which creates noise that leads to the emails being ignored.

The Heartbeat (Dead Man’s Switch) Technique

Heartbeat monitoring inverts the traditional monitoring model. Instead of a monitor checking whether a service is up, the service itself reports when it has completed successfully. If the report does not arrive within the expected window, the monitoring service concludes that something went wrong and sends an alert.

The workflow has three components:

  1. A unique ping URL per job: Each monitored job gets a unique HTTPS endpoint. The job sends an HTTP request to this endpoint after successful completion.
  2. An expected schedule: The monitoring service knows the job should ping every hour, every day at 02:00, or every 15 minutes. It calculates the next expected ping time after each received ping.
  3. An alert on missed ping: If the next expected ping does not arrive within the grace period, the monitoring service sends alerts via email, Slack, PagerDuty, or any configured channel.
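The decision the monitoring service makes for each check can be sketched in a few lines of Python. This is only an illustration of the three components above, not how any particular service implements it; the function name is_overdue and the example dates are assumptions chosen for the sketch.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping: datetime, period: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """Return True when the next expected ping is late by more than the grace period."""
    # Next expected ping is one period after the last one received.
    deadline = last_ping + period + grace
    return now > deadline


# Hypothetical daily job that last pinged at 02:00 UTC, with a 30 minute grace period.
last = datetime(2026, 5, 1, 2, 0, tzinfo=timezone.utc)
daily = timedelta(days=1)
grace = timedelta(minutes=30)

# 02:10 the next day is still inside the grace window; 02:45 is not.
print(is_overdue(last, daily, grace, datetime(2026, 5, 2, 2, 10, tzinfo=timezone.utc)))
print(is_overdue(last, daily, grace, datetime(2026, 5, 2, 2, 45, tzinfo=timezone.utc)))
```

A real service would evaluate this on a timer for every check and send the alert once, when the result first flips to True.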

Implementation: Shell Scripts

The simplest implementation uses curl to send the ping after the job exits successfully:

#!/bin/bash
set -euo pipefail

# Your job logic here
/usr/local/bin/backup.sh

# Ping only reaches this line if the job exits 0
curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null

The --retry 3 flag retries the ping up to three times on transient errors such as timeouts. The -f flag causes curl to return a non-zero exit code on HTTP errors, -sS hides the progress meter while still printing error messages, and > /dev/null suppresses the response body.
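For jobs written in Python rather than shell, roughly the same behavior can be approximated with the standard library. The sketch below is an assumption-laden illustration: ping_with_retry is a hypothetical helper, the opener and sleep parameters exist only so the function can be exercised without a network, and unlike curl it retries on any error rather than only transient ones.

```python
import time
import urllib.request


def ping_with_retry(url, retries=3, timeout=10,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Send a heartbeat ping, retrying up to `retries` times after a failure.

    Returns True if any attempt succeeded, False otherwise. It never raises,
    so a monitoring outage cannot change the outcome of the job itself.
    """
    for attempt in range(retries + 1):  # first try plus `retries` retries
        try:
            # urlopen raises on connection errors and on HTTP error statuses,
            # which is comparable to curl's -f flag.
            with opener(url, timeout=timeout):
                return True
        except Exception:
            if attempt < retries:
                sleep(2 ** attempt)  # back off 1 s, 2 s, 4 s between tries
    return False
```

Call it as ping_with_retry("https://hc-ping.com/YOUR-UUID-HERE") after the job logic has finished successfully.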

Capturing the exit code to ping on failure too:

#!/bin/bash

/usr/local/bin/backup.sh
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
    curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null
else
    # Some services support a /fail suffix for explicit failure pings
    curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/fail" > /dev/null
fi

exit $EXIT_CODE

Using start and finish pings to detect stuck jobs:

#!/bin/bash

# Signal job started
curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/start" > /dev/null

# Run the job
/usr/local/bin/backup.sh
EXIT_CODE=$?

# Signal completion with exit code
curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/$EXIT_CODE" > /dev/null

exit $EXIT_CODE

With start and finish pings, the monitoring service can alert if the time between start and finish exceeds an expected maximum duration, catching jobs that are running but hung.
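A complementary, job-side safeguard is to bound the run time so that a hung job turns into an explicit failure instead of a process that lingers forever. This is a different technique from the start and finish pings, shown here only as a minimal sketch. The helper name run_with_deadline is hypothetical, and returning 124 is an assumption that mirrors the exit code of the GNU timeout command.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # assumption: same value GNU `timeout` reports


def run_with_deadline(cmd, max_seconds):
    """Run cmd and return its exit code, or TIMEOUT_EXIT_CODE if it overruns.

    subprocess.run kills the child process when the timeout expires,
    so a job stuck on a lock or a network call does not stay alive.
    """
    try:
        return subprocess.run(cmd, timeout=max_seconds).returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE


if __name__ == "__main__":
    # Hypothetical usage: allow the backup one hour, then report the result.
    code = run_with_deadline(["/usr/local/bin/backup.sh"], 3600)
    sys.exit(code)
```

The returned code can then be appended to the ping URL in the same way as $EXIT_CODE in the shell script above.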

Protecting the primary cron command from monitoring failures:

0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null

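The && ensures the ping is sent only if backup.sh exits 0. Because curl runs after the backup has finished, a monitoring outage cannot interrupt or undo the job itself. A failed curl does become the exit status of the whole command line, and cron will mail its error output if MAILTO is set; append || true after the curl command if ping failures should be ignored entirely.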

Implementation: Python

import subprocess
import urllib.request
import sys

PING_URL = "https://hc-ping.com/YOUR-UUID-HERE"

def run_job():
    result = subprocess.run(
        ["/usr/local/bin/process_data.py"],
        capture_output=True,
        text=True
    )
    return result.returncode

def ping(url):
    try:
        urllib.request.urlopen(url, timeout=10)
    except Exception:
        pass  # Do not let monitoring failure affect the job

# Signal start
ping(f"{PING_URL}/start")

exit_code = run_job()

if exit_code == 0:
    ping(PING_URL)
else:
    ping(f"{PING_URL}/fail")
    sys.exit(exit_code)

Implementation: Kubernetes CronJobs

Kubernetes CronJobs add complexity because pods are ephemeral and job execution is managed by the Kubernetes CronJob controller and scheduler, not by a cron daemon. Silent failures take different forms: pods that complete with non-zero exit codes, jobs that never schedule because of insufficient cluster resources, jobs that exceed their activeDeadlineSeconds, and missed schedules when the CronJob controller falls behind.

Add a heartbeat curl to your job container’s command:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: backup-tool:latest
            command:
            - /bin/sh
            - -c
            - |
              curl -fsS "https://hc-ping.com/YOUR-UUID-HERE/start" || true
              /usr/local/bin/backup.sh
              EXIT=$?
              curl -fsS "https://hc-ping.com/YOUR-UUID-HERE/$EXIT" || true
              exit $EXIT
          restartPolicy: OnFailure
      activeDeadlineSeconds: 3600

Kubernetes-native monitoring via OTel Collector:

For teams already running the OTel Collector in their cluster, the k8sobjectsreceiver captures Kubernetes events including job completions, pod failures, and missed schedules without adding curl calls to each container:

receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch
        group: events.k8s.io
        namespaces: [default, production]

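This captures events such as BackoffLimitExceeded (job failed after all retries). A receiver cannot report something that never happened, so missed runs are detected in the backend, by alerting when no Complete event arrives within the expected window.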
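Once events reach a backend, picking out failed jobs is a simple filter. The sketch below assumes each event has already been unwrapped into a plain dictionary using the events.k8s.io/v1 field names reason and regarding; the exact record shape depends on your pipeline and backend, and the function name failed_jobs and the sample data are hypothetical. Including DeadlineExceeded alongside BackoffLimitExceeded is also an assumption about which reasons you want to treat as failures.

```python
# Assumption: event reasons treated as job failures in this sketch.
FAILURE_REASONS = {"BackoffLimitExceeded", "DeadlineExceeded"}


def failed_jobs(events):
    """Return the sorted names of Jobs that have a failure event."""
    failed = set()
    for event in events:
        regarding = event.get("regarding") or {}
        if regarding.get("kind") == "Job" and event.get("reason") in FAILURE_REASONS:
            failed.add(regarding.get("name"))
    return sorted(failed)


# Hypothetical sample events, not real cluster output.
sample = [
    {"reason": "BackoffLimitExceeded", "regarding": {"kind": "Job", "name": "daily-backup-001"}},
    {"reason": "Completed", "regarding": {"kind": "Job", "name": "daily-backup-002"}},
    {"reason": "BackOff", "regarding": {"kind": "Pod", "name": "daily-backup-001-abcde"}},
]
print(failed_jobs(sample))
```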

Grace Periods: Avoiding False Alerts

Jobs do not always run at exactly their scheduled time. Server load, scheduling jitter, and time zone differences can all cause a job to ping a few minutes late. Alerting on any delay, however brief, creates false positive noise that leads teams to ignore alerts.

Set grace periods based on measured variance:

| Job frequency | Suggested grace period |
| --- | --- |
| Every minute | 1 to 2 minutes |
| Every 5 minutes | 2 to 3 minutes |
| Hourly | 5 to 10 minutes |
| Daily | 15 to 30 minutes |
| Weekly | 1 to 2 hours |

A grace period that is too short creates false alerts. A grace period that is too long means real failures are detected late. Most monitoring tools let you configure this per check.
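One way to turn measured variance into a number is to record how late each past ping arrived and derive the grace period from that history. The rule used here (mean lateness plus three standard deviations, with a one minute floor) is an assumed heuristic for illustration, not a recommendation from any monitoring tool, and suggest_grace_seconds is a hypothetical helper.

```python
import math
import statistics


def suggest_grace_seconds(lateness_seconds, sigmas=3.0, floor=60):
    """Suggest a grace period from historical ping lateness, in seconds.

    lateness_seconds: how many seconds after the scheduled time each past
    ping arrived. The result is mean + sigmas * standard deviation,
    rounded up and never lower than `floor`.
    """
    mean = statistics.fmean(lateness_seconds)
    spread = statistics.pstdev(lateness_seconds)
    return max(floor, math.ceil(mean + sigmas * spread))


# Hypothetical history: pings arrived between 0 and 4 minutes late.
history = [0, 60, 120, 240, 180]
print(suggest_grace_seconds(history))
```

Recompute the value periodically, since a job that grows slower over time will also drift later.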

What to Monitor Beyond the Heartbeat

Heartbeat monitoring confirms whether a job ran and exited 0. It does not tell you whether the job did its full work correctly. Complement heartbeats with:

| Signal | What it catches | How to capture it |
| --- | --- | --- |
| Row count or record count | Job ran but processed zero records | Log the count, assert it is above a minimum threshold before pinging success |
| Execution duration trend | Job is getting slower over time, indicating a scaling problem | Log start and end timestamps, send duration in the ping body |
| Output file size | Backup produced an empty or suspiciously small file | Check file size before pinging success |
| Exit code (non-zero) | Script-level failure | Use the /fail ping or check exit code before pinging |
| External dependency health | Job failed because the database or API it depends on was unavailable | Log the dependency error with structured fields for later correlation |
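The record count and file size checks can be combined into one validation step that runs before the success ping. The following is a minimal sketch: outcome_problems is a hypothetical helper, and the default thresholds of 1 record and 1024 bytes are placeholder assumptions that should be replaced with values measured for the actual job.

```python
import os


def outcome_problems(record_count, output_path, min_records=1, min_bytes=1024):
    """Return a list of reasons the run should not be reported as a success."""
    problems = []
    if record_count < min_records:
        problems.append(
            f"processed {record_count} records, expected at least {min_records}"
        )
    if not os.path.isfile(output_path):
        problems.append(f"output file missing: {output_path}")
    elif os.path.getsize(output_path) < min_bytes:
        problems.append(f"output file smaller than {min_bytes} bytes: {output_path}")
    return problems
```

If the list is empty, send the success ping; otherwise send the failure ping and exit non-zero so the exit code also reflects the problem.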

Common Setup Mistakes

| Mistake | What happens | Fix |
| --- | --- | --- |
| Pinging at the start instead of end | You know the job started, not that it succeeded | Ping at the end, or use both start and finish pings and alert on missing finish |
| Not checking exit codes before pinging | A failed job pings success | Use command && curl ping_url or explicitly check $? before pinging |
| Using one check URL for multiple jobs | One missed ping could be any of the jobs | Create one check per job, per environment |
| Grace period too short | False alerts from scheduling jitter | Set grace period based on measured historical variance |
| Monitoring only production | Silent failures in staging go unnoticed | Create separate checks per environment |
| Not testing the monitor | Monitor is set up but has never fired | Intentionally skip a ping and confirm you receive an alert |

Beyond “Did It Run?”: Knowing Why It Failed

A heartbeat tells you a job completed and exited 0. What it cannot tell you is why a job failed, whether resource contention on the same host caused it to run slowly, or whether an infrastructure event at the same time explains the failure.

CubeAPM monitors Kubernetes CronJob execution natively, tracking pod runs, exit codes, durations, retries, and missed schedules with alerts on failed or long-running jobs. What separates it from standalone heartbeat services is correlation: when a job misses a run or fails, CubeAPM connects the event to the application logs from that pod, the infrastructure metrics on the node at that moment, and any distributed traces from services the job called. The result is not just “this job failed” but “this job failed because the database connection pool was exhausted, which also caused these three API requests to fail at the same time.” It runs self-hosted inside your own infrastructure at $0.15/GB ingestion with no per-user fees.

Summary

Silent cron failures happen because cron is a scheduler, not a monitoring system. The minimum viable monitoring layer is a heartbeat: the job pings an external service after successful completion, and an alert fires if the ping does not arrive within a grace period. Implement it with a single curl call after your job command, or with start and finish pings for stuck-job detection. For Kubernetes CronJobs, combine heartbeat pings with native Kubernetes event monitoring. Set grace periods based on measured scheduling variance to avoid false alert noise.

| Signal layer | What it catches | How to implement |
| --- | --- | --- |
| Heartbeat ping on success | Job did not run, job crashed, job exited non-zero | command && curl ping_url or explicit exit code check |
| Start and finish pings | Job started but hung or timed out | Ping at start and end, alert if finish ping never arrives |
| Outcome assertion | Partial failure that exits 0 without doing useful work | Validate outcome before pinging success |
| Duration tracking | Job is getting slower, approaching timeout thresholds | Log duration in ping body or structured log |
| Kubernetes event monitoring | Missed schedules, BackoffLimitExceeded, pod eviction | OTel Collector k8sobjectsreceiver on events API |
| Infrastructure correlation | Job failed because of host-level resource problem | APM platform correlating job events with host metrics |

Disclaimer: Healthchecks.io pricing (Hobbyist free/20 checks, Business $20/month/100 checks, Business Plus $80/month/1000 checks, unlimited team members from February 18, 2026) verified directly from healthchecks.io/pricing as of May 2026. Better Stack free plan (10 heartbeats included) and Responder license pricing ($29/month annual) verified from betterstack.com as of May 2026. CubeAPM cron job monitoring capabilities verified from cubeapm.com/blog/top-cron-job-monitoring-tools as of May 2026.

