How to Monitor Cron Jobs for Silent Failures

A cron job that does not run is a silent failure. Traditional cron has no built-in alerting, no success reporting, and no visibility. When a job crashes, times out, exits with a non-zero code, or never runs because the server was rebooted, the only signal you get is the absence of an outcome: a missing backup, unprocessed payments, stale data, often discovered hours or days later, long after damage is done.

The problem is structural. Cron’s design is intentional: it is a scheduler, not a monitoring system. It dispatches commands. It does not know or care whether those commands succeeded. Detecting silent failures requires adding an external monitoring layer. Let’s understand how to monitor cron jobs for these failures.

Key Takeaways

A cron job can be “running” in the sense that the process was started while doing nothing useful. Exit code 0 means the process exited cleanly, not that the job did what it was supposed to do
The most reliable detection mechanism for silent failures is heartbeat monitoring (also called the dead man’s switch technique): the job sends a ping to an external service on successful completion. If the ping does not arrive within the expected window, an alert fires
The ping must be sent only after successful completion, not at the start of the job. Pinging at both start and finish is more precise for detecting stuck or hung jobs, but pinging only at the end is the minimum viable implementation
Log files do not replace heartbeat monitoring. Logs tell you what happened when the job ran. Heartbeat monitoring tells you when the job did not run at all
For Kubernetes CronJobs, exit codes, pod restart counts, and missed schedules are natively trackable. The OTel Collector’s Kubernetes receiver captures these without code changes
Grace periods prevent false alerts from scheduling jitter. A job expected at 02:00 that runs at 02:04 due to server load should not fire an alert. Set grace periods based on measured execution variance, not guesswork

Why Cron Fails Silently

Cron fails silently in ways that are invisible to standard monitoring:

Exit code 0 does not mean success: A script that encounters a partial failure, processes zero records, or writes to a disk that is full can still exit 0. Unless you explicitly check outcomes and exit non-zero on failure, cron has no way to know anything went wrong.
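Jobs can be skipped entirely: If the server is down or rebooting during a scheduled window, the job does not run, and standard cron does not catch up on the missed run afterward. If the cron daemon is not running (which happens more often than expected after OS updates), no jobs run. Overlap is a related problem: if a previous run is still executing when the next one is due, cron starts a second instance anyway unless you add locking (for example with flock), and the two runs can contend for the same files or database locks.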
Long-running jobs can hang: A job that normally completes in 5 minutes can wait indefinitely for a database lock, a network response, or an external API. The process is alive, consuming resources, not completing, and not alerting anyone.
Disk full kills jobs silently: When the target partition is full, tools like mysqldump, rsync, and tar exit with non-zero codes.

If your cron command does not check exit codes explicitly, the failure is invisible:

# This looks fine but fails silently if backup.sh exits non-zero

0 2 * * * /usr/local/bin/backup.sh


# This is explicit: exit code is checked

0 2 * * * /usr/local/bin/backup.sh || echo "BACKUP FAILED" | mail -s "Backup failure" ops@example.com

MAILTO is not enough. Cron can email a job's output via MAILTO, but this relies on a working mail server, the right address being configured, and someone reading the emails. Cron sends mail whenever a job writes anything to stdout or stderr, including on successful runs, which creates noise that leads to the emails being ignored. A job that fails without printing anything sends no mail at all.
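
Hung jobs and overlapping runs can also be handled explicitly in a small wrapper rather than left to cron. The following is a minimal sketch, not a definitive implementation: the function name run_guarded, the lock path, and the return codes 75 and 124 are assumptions chosen for illustration, and it relies on fcntl, so it only works on Unix-like systems.

```python
import fcntl
import subprocess

LOCKED = 75      # assumed convention: a previous run still holds the lock
TIMED_OUT = 124  # assumed convention: the job exceeded its time limit


def run_guarded(cmd, lock_path, timeout_s):
    """Run cmd under an exclusive file lock and a hard timeout.

    Returns the command's own exit code, LOCKED if another run holds the
    lock, or TIMED_OUT if the command ran longer than timeout_s seconds.
    """
    with open(lock_path, "a") as lock_file:
        try:
            # Non-blocking: fail immediately instead of queueing behind a run
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return LOCKED
        try:
            # subprocess.run kills the child when the timeout expires
            return subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            return TIMED_OUT
```

Calling run_guarded(["/usr/local/bin/backup.sh"], "/tmp/backup.lock", 3600) from the cron entry turns both a hang and an overlap into a non-zero exit code that the rest of the monitoring can act on.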

The Heartbeat (Dead Man’s Switch) Technique

Heartbeat monitoring inverts the traditional monitoring model. Instead of a monitor checking whether a service is up, the service itself reports when it has completed successfully. If the report does not arrive within the expected window, the monitoring service concludes that something went wrong and sends an alert.

The workflow has three components:

A unique ping URL per job: Each monitored job gets a unique HTTPS endpoint. The job sends an HTTP request to this endpoint after successful completion.
An expected schedule: The monitoring service knows the job should ping every hour, every day at 02:00, or every 15 minutes. It calculates the next expected ping time after each received ping.
An alert on missed ping: If the next expected ping does not arrive within the grace period, the monitoring service sends alerts via email, Slack, PagerDuty, or any configured channel.
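
The decision the monitoring service makes can be sketched in a few lines. This is an illustration of the idea only, assuming a fixed interval between pings; real services typically also understand cron expressions, and the function names here are hypothetical.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping, period, grace):
    """Latest moment the next ping may arrive before an alert fires."""
    return last_ping + period + grace


def is_overdue(last_ping, period, grace, now):
    """True when the expected ping has not arrived within the grace period."""
    return now > alert_deadline(last_ping, period, grace)


# A daily job that last pinged at 02:00 with a 30-minute grace period
last = datetime(2026, 5, 1, 2, 0)
day = timedelta(days=1)
grace = timedelta(minutes=30)
on_time = is_overdue(last, day, grace, datetime(2026, 5, 2, 2, 10))
missed = is_overdue(last, day, grace, datetime(2026, 5, 2, 2, 31))
```

Here on_time is False because 02:10 is still inside the grace window, while missed is True because 02:31 is past the 02:30 deadline.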

Implementation: Shell Scripts

The simplest implementation uses curl to send the ping after the job exits successfully:

#!/bin/bash

set -euo pipefail

# Your job logic here

/usr/local/bin/backup.sh

# Ping only reaches this line if the job exits 0

curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null

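The --retry 3 flag retries the ping up to three times on transient errors such as timeouts or an unreliable network. The -f flag causes curl to return a non-zero exit code on HTTP errors, -s hides the progress meter, and -S still prints an error message if the request fails. The > /dev/null redirect discards the response body.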

Capturing the exit code to ping on failure too:

#!/bin/bash

/usr/local/bin/backup.sh

EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then

    curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null

else

    # Some services support a /fail suffix for explicit failure pings

    curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/fail" > /dev/null

fi

exit $EXIT_CODE

Using start and finish pings to detect stuck jobs:

#!/bin/bash

# Signal job started

curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/start" > /dev/null

# Run the job

/usr/local/bin/backup.sh

EXIT_CODE=$?

# Signal completion with exit code

curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/$EXIT_CODE" > /dev/null

exit $EXIT_CODE

With start and finish pings, the monitoring service can alert if the time between start and finish exceeds an expected maximum duration, catching jobs that are running but hung.
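
On the monitoring side, the stuck-job check amounts to comparing the two timestamps against a maximum duration. A rough sketch, assuming the service records the time of the latest start ping and the latest finish ping (the function name and state labels are hypothetical):

```python
from datetime import datetime, timedelta


def run_state(started_at, finished_at, now, max_duration):
    """Classify a run from its start and finish ping timestamps.

    finished_at is None when no finish ping has arrived for this run.
    """
    if finished_at is not None and finished_at >= started_at:
        return "finished"
    if now - started_at > max_duration:
        # Start ping seen, no finish ping, and the time budget is used up
        return "stuck"
    return "running"


start = datetime(2026, 5, 1, 2, 0)
limit = timedelta(minutes=30)
early = run_state(start, None, datetime(2026, 5, 1, 2, 10), limit)
late = run_state(start, None, datetime(2026, 5, 1, 2, 45), limit)
done = run_state(start, datetime(2026, 5, 1, 2, 5), datetime(2026, 5, 1, 2, 45), limit)
```

With a 30-minute limit, early evaluates to "running", late to "stuck", and done to "finished".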

Protecting the primary cron command from monitoring failures:

0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null

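The && ensures the ping is sent only if backup.sh exits 0. Because curl runs after backup.sh has already finished, a failed ping (for example, when the monitoring service is down) cannot interrupt or undo the backup itself. The overall exit status of the command line does become curl's in that case, but cron itself does not act on it.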

Implementation: Python

import subprocess

import urllib.request

import sys

PING_URL = "https://hc-ping.com/YOUR-UUID-HERE"

def run_job():

    result = subprocess.run(

        ["/usr/local/bin/process_data.py"],

        capture_output=True,

        text=True

    )

    return result.returncode

def ping(url):

    try:

        urllib.request.urlopen(url, timeout=10)

    except Exception:

        pass  # Do not let monitoring failure affect the job

# Signal start

ping(f"{PING_URL}/start")

exit_code = run_job()

if exit_code == 0:

    ping(PING_URL)

else:

    ping(f"{PING_URL}/fail")

    sys.exit(exit_code)
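
The shell examples use curl's --retry flag, while the ping() helper above tries only once. A retrying variant could look like the sketch below. The function name, the linear backoff, and the injectable opener and sleep parameters are assumptions added for illustration and testability; they are not part of any monitoring service's client library.

```python
import time
import urllib.request


def ping_with_retry(url, attempts=3, delay_s=1.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Send a heartbeat ping, retrying on any error.

    Returns True if a request succeeded and False otherwise. It never
    raises, so a monitoring outage cannot fail the job itself.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=10):
                return True
        except Exception:
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure
                sleep(delay_s * attempt)
    return False
```

Returning a boolean instead of raising keeps the original design choice intact: the job's exit code reflects the job, not the state of the monitoring service.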

Implementation: Kubernetes CronJobs

Kubernetes CronJobs add complexity because pods are ephemeral and job execution is managed by the Kubernetes control plane (the CronJob controller creates Jobs and the scheduler places their pods), not by a cron daemon. Silent failures take different forms: pods that complete with non-zero exit codes, jobs that never schedule because of insufficient cluster resources, jobs that exceed their activeDeadlineSeconds, and missed schedules when the CronJob controller falls behind.

Add a heartbeat curl to your job container’s command:

apiVersion: batch/v1

kind: CronJob

metadata:

  name: daily-backup

spec:

  schedule: "0 2 * * *"

  jobTemplate:

    spec:

      template:

        spec:

          containers:

          - name: backup

            image: backup-tool:latest

            command:

            - /bin/sh

            - -c

            - |

              curl -fsS "https://hc-ping.com/YOUR-UUID-HERE/start" || true

              /usr/local/bin/backup.sh

              EXIT=$?

              curl -fsS "https://hc-ping.com/YOUR-UUID-HERE/$EXIT" || true

              exit $EXIT

          restartPolicy: OnFailure

      activeDeadlineSeconds: 3600

Kubernetes-native monitoring via OTel Collector:

For teams already running the OTel Collector in their cluster, the k8sobjectsreceiver captures Kubernetes events including job completions, pod failures, and missed schedules without adding curl calls to each container:

receivers:

  k8sobjects:

    objects:

      - name: events

        mode: watch

        group: events.k8s.io

        namespaces: [default, production]

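This captures events such as BackoffLimitExceeded (the job failed after all retries). The absence of Complete events within the expected window is not an event in itself, so detecting it requires an alert rule on the backend that fires when no completion is recorded in time.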
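
Once the events reach a backend, picking out failed jobs is a matter of filtering on the event reason. The sketch below assumes the events have been exported as plain dictionaries in the events.k8s.io/v1 shape, with reason and regarding fields. The function name is hypothetical, and the set of reasons is an assumption to adjust against what your cluster actually emits.

```python
# Assumed failure reasons; verify against the events seen in your cluster
FAILURE_REASONS = {"BackoffLimitExceeded", "DeadlineExceeded"}


def failed_jobs(events):
    """Return (namespace, job name, reason) for each Job failure event."""
    failed = []
    for event in events:
        regarding = event.get("regarding", {})
        if regarding.get("kind") != "Job":
            continue
        if event.get("reason") in FAILURE_REASONS:
            failed.append((regarding.get("namespace"),
                           regarding.get("name"),
                           event["reason"]))
    return failed


# Hypothetical sample events for illustration only
sample = [
    {"reason": "BackoffLimitExceeded",
     "regarding": {"kind": "Job", "namespace": "production", "name": "daily-backup-1"}},
    {"reason": "Completed",
     "regarding": {"kind": "Job", "namespace": "production", "name": "daily-backup-2"}},
    {"reason": "Pulled",
     "regarding": {"kind": "Pod", "namespace": "production", "name": "daily-backup-2-abc"}},
]
failures = failed_jobs(sample)
```

Only the first sample event is reported, since it is the only one that refers to a Job and carries a failure reason.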

Grace Periods: Avoiding False Alerts

Jobs do not always run at exactly their scheduled time. Server load and scheduling jitter can cause a job to ping a few minutes late, and a time zone mismatch between the server and the monitoring service can shift the expected window even further. Alerting on any delay, however brief, creates false positive noise that leads teams to ignore alerts.

Set grace periods based on measured variance:

Job frequency	Suggested grace period
Every minute	1 to 2 minutes
Every 5 minutes	2 to 3 minutes
Hourly	5 to 10 minutes
Daily	15 to 30 minutes
Weekly	1 to 2 hours

A grace period that is too short creates false alerts. A grace period that is too long means real failures are detected late. Most monitoring tools let you configure this per check.
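
One way to derive a grace period from measured variance is to take the average lateness of past runs and add a multiple of its spread. The sketch below is a heuristic under stated assumptions, not a rule from any monitoring tool: the function name, the multiplier k=3, and the 60-second floor are all illustrative choices.

```python
import statistics


def suggest_grace_seconds(delays_s, k=3.0, floor_s=60.0):
    """Suggest a grace period from observed lateness, in seconds.

    delays_s holds how late each past run pinged relative to its schedule.
    """
    if not delays_s:
        # No history yet: fall back to the minimum
        return floor_s
    mean = statistics.mean(delays_s)
    spread = statistics.pstdev(delays_s)
    return max(floor_s, mean + k * spread)
```

Feeding it a few weeks of recorded ping delays gives a starting value that can then be compared against the suggested ranges in the table above.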

What to Monitor Beyond the Heartbeat

Heartbeat monitoring confirms whether a job ran and exited 0. It does not tell you whether the job did its full work correctly. Complement heartbeats with:

Signal	What it catches	How to capture it
Row count or record count	Job ran but processed zero records	Log the count, assert it is above a minimum threshold before pinging success
Execution duration trend	Job is getting slower over time, indicating a scaling problem	Log start and end timestamps, send duration in the ping body
Output file size	Backup produced an empty or suspiciously small file	Check file size before pinging success
Exit code (non-zero)	Script-level failure	Use the /fail ping or check exit code before pinging
External dependency health	Job failed because the database or API it depends on was unavailable	Log the dependency error with structured fields for later correlation
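
The row count and output file size checks from the table can be combined into a single gate that runs before the success ping. A minimal sketch, assuming the job knows its output path and how many records it processed (the function name and thresholds are hypothetical):

```python
import os


def outcome_looks_valid(path, min_bytes, row_count, min_rows):
    """Return True only if the output file and record count look plausible."""
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) < min_bytes:
        # Empty or suspiciously small output
        return False
    return row_count >= min_rows
```

The job would ping the success URL only when this returns True and send the /fail ping otherwise, so that a run that exits 0 after doing no useful work still raises an alert.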

Common Setup Mistakes

Mistake	What happens	Fix
Pinging at the start instead of end	You know the job started, not that it succeeded	Ping at the end, or use both start and finish pings and alert on missing finish
Not checking exit codes before pinging	A failed job pings success	Use command && curl ping_url or explicitly check $? before pinging
Using one check URL for multiple jobs	One missed ping could be any of the jobs	Create one check per job, per environment
Grace period too short	False alerts from scheduling jitter	Set grace period based on measured historical variance
Monitoring only production	Silent failures in staging go unnoticed	Create separate checks per environment
Not testing the monitor	Monitor is set up but has never fired	Intentionally skip a ping and confirm you receive an alert

Beyond “Did It Run?”: Knowing Why It Failed

A heartbeat tells you a job completed and exited 0. What it cannot tell you is why a subsequent job is failing, whether a resource contention event on the same host caused the job to run slowly, or whether an infrastructure event that happened at the same time explains the failure.

CubeAPM monitors Kubernetes CronJob execution natively, tracking pod runs, exit codes, durations, retries, and missed schedules with alerts on failed or long-running jobs. What separates it from standalone heartbeat services is correlation: when a job misses a run or fails, CubeAPM connects the event to the application logs from that pod, the infrastructure metrics on the node at that moment, and any distributed traces from services the job called. The result is not just “this job failed” but “this job failed because the database connection pool was exhausted, which also caused these three API requests to fail at the same time.” It runs self-hosted inside your own infrastructure at $0.15/GB ingestion with no per-user fees.

Summary

Silent cron failures happen because cron is a scheduler, not a monitoring system. The minimum viable monitoring layer is a heartbeat: the job pings an external service after successful completion, and an alert fires if the ping does not arrive within a grace period. Implement it with a single curl call after your job command, or with start and finish pings for stuck-job detection. For Kubernetes CronJobs, combine heartbeat pings with native Kubernetes event monitoring. Set grace periods based on measured scheduling variance to avoid false alert noise.

Signal layer	What it catches	How to implement
Heartbeat ping on success	Job did not run, job crashed, job exited non-zero	command && curl ping_url or explicit exit code check
Start and finish pings	Job started but hung or timed out	Ping at start and end, alert if finish ping never arrives
Outcome assertion	Partial failure that exits 0 without doing useful work	Validate outcome before pinging success
Duration tracking	Job is getting slower, approaching timeout thresholds	Log duration in ping body or structured log
Kubernetes event monitoring	Missed schedules, BackoffLimitExceeded, pod eviction	OTel Collector k8sobjectsreceiver on events API
Infrastructure correlation	Job failed because of host-level resource problem	APM platform correlating job events with host metrics

Disclaimer: Healthchecks.io pricing (Hobbyist free/20 checks, Business $20/month/100 checks, Business Plus $80/month/1000 checks, unlimited team members from February 18, 2026) verified directly from healthchecks.io/pricing as of May 2026. Better Stack free plan (10 heartbeats included) and Responder license pricing ($29/month annual) verified from betterstack.com as of May 2026. CubeAPM cron job monitoring capabilities verified from cubeapm.com/blog/top-cron-job-monitoring-tools as of May 2026.
