A cron job that does not run is a silent failure. Traditional cron has no built-in alerting, no success reporting, and no visibility. When a job crashes, times out, exits with a non-zero code, or never runs because the server was rebooted, the only signal you get is the absence of an outcome: a missing backup, unprocessed payments, stale data, often discovered hours or days later, long after damage is done.
The problem is structural. Cron’s design is intentional: it is a scheduler, not a monitoring system. It dispatches commands. It does not know or care whether those commands succeeded. Detecting silent failures requires adding an external monitoring layer. Let’s understand how to monitor cron jobs for these failures.
Key Takeaways
- A cron job can be “running” in the sense that the process was started while doing nothing useful. Exit code 0 means the process exited cleanly, not that the job did what it was supposed to do
- The most reliable detection mechanism for silent failures is heartbeat monitoring (also called the dead man’s switch technique): the job sends a ping to an external service on successful completion. If the ping does not arrive within the expected window, an alert fires
- The ping must be sent only after successful completion, not at the start of the job. Pinging at both start and finish is more precise for detecting stuck or hung jobs, but pinging only at the end is the minimum viable implementation
- Log files do not replace heartbeat monitoring. Logs tell you what happened when the job ran. Heartbeat monitoring tells you when the job did not run at all
- For Kubernetes CronJobs, exit codes, pod restart counts, and missed schedules are natively trackable. The OTel Collector’s Kubernetes receiver captures these without code changes
- Grace periods prevent false alerts from scheduling jitter. A job expected at 02:00 that runs at 02:04 due to server load should not fire an alert. Set grace periods based on measured execution variance, not guesswork
Why Cron Fails Silently
Cron fails silently in ways that are invisible to standard monitoring:
- Exit code 0 does not mean success: A script that encounters a partial failure, processes zero records, or writes to a disk that is full can still exit 0. Unless you explicitly check outcomes and exit non-zero on failure, cron has no way to know anything went wrong.
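- Jobs can be skipped or can overlap: If the server is down or rebooting at the scheduled time, the job does not run, and cron does not catch up afterwards. If the cron daemon is not running (happens more often than expected after OS updates), no jobs run. If a previous run is still executing when the next one is due, cron does not skip or queue the new run; it starts a second instance alongside the first, and the two can contend for the same files or locks unless the job guards against overlap itself (for example with flock).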
- Long-running jobs can hang: A job that normally completes in 5 minutes can wait indefinitely for a database lock, a network response, or an external API. The process is alive, consuming resources, not completing, and not alerting anyone.
- Disk full kills jobs silently: When the target partition is full, tools like mysqldump, rsync, and tar exit with non-zero codes.
If your cron command does not check exit codes explicitly, the failure is invisible:
# This looks fine but fails silently if backup.sh exits non-zero
0 2 * * * /usr/local/bin/backup.sh
# This is explicit: exit code is checked
0 2 * * * /usr/local/bin/backup.sh || echo "BACKUP FAILED" | mail -s "Backup failure" [email protected]
MAILTO is not enough. Cron can email output via MAILTO, but this relies on a working mail server, the right address being configured, and someone reading the emails. It also sends mail whenever a run produces any output, including successful runs, creating noise that leads to the emails being ignored.
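Two of the failure modes above, a previous run that is still executing and a process that hangs indefinitely, can be turned into explicit non-zero exits by a small wrapper. The following is a minimal sketch, assuming a Unix host; the function name `run_guarded`, the lock path, and the exit codes 75 and 124 are illustrative choices, not part of cron or of any monitoring service.

```python
import fcntl
import subprocess

EXIT_LOCKED = 75    # assumed convention: another run still holds the lock
EXIT_TIMEOUT = 124  # assumed convention: the command exceeded its time limit


def run_guarded(cmd, lock_path, timeout_s):
    """Run cmd under an exclusive lock and a hard timeout.

    Returns the command's own exit code, EXIT_LOCKED if a previous run
    is still active, or EXIT_TIMEOUT if the command ran too long.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return EXIT_LOCKED
        try:
            return subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising.
            return EXIT_TIMEOUT
```

A cron entry could call a script that passes `["/usr/local/bin/backup.sh"]`, a lock file path, and a limit such as 1800 seconds to `run_guarded`, then exits with the returned code so that downstream checks see the failure.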
The Heartbeat (Dead Man’s Switch) Technique
Heartbeat monitoring inverts the traditional monitoring model. Instead of a monitor checking whether a service is up, the service itself reports when it has completed successfully. If the report does not arrive within the expected window, the monitoring service concludes that something went wrong and sends an alert.
The workflow has three components:
- A unique ping URL per job: Each monitored job gets a unique HTTPS endpoint. The job sends an HTTP request to this endpoint after successful completion.
- An expected schedule: The monitoring service knows the job should ping every hour, every day at 02:00, or every 15 minutes. It calculates the next expected ping time after each received ping.
- An alert on missed ping: If the next expected ping does not arrive within the grace period, the monitoring service sends alerts via email, Slack, PagerDuty, or any configured channel.
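Seen from the monitoring service's side, the three components reduce to a deadline check. This is a minimal sketch of that logic, not the implementation of any particular service; `is_overdue` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def is_overdue(last_ping, period, grace, now):
    """Return True once the next ping is later than period + grace allows.

    last_ping: datetime of the most recent successful ping
    period:    timedelta between expected pings (the job's schedule)
    grace:     timedelta of tolerated lateness before alerting
    now:       datetime at which the check is evaluated
    """
    deadline = last_ping + period + grace
    return now > deadline


# Usage: a daily 02:00 job with a 30-minute grace period.
last = datetime(2026, 1, 1, 2, 0)
late_but_fine = is_overdue(last, timedelta(days=1), timedelta(minutes=30),
                           datetime(2026, 1, 2, 2, 4))
missed = is_overdue(last, timedelta(days=1), timedelta(minutes=30),
                    datetime(2026, 1, 2, 2, 31))
```

Here `late_but_fine` is False because 02:04 falls inside the grace window, while `missed` is True because 02:31 is past the deadline.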
Implementation: Shell Scripts
The simplest implementation uses curl to send the ping after the job exits successfully:
#!/bin/bash
set -euo pipefail
# Your job logic here
/usr/local/bin/backup.sh
# Ping only reaches this line if the job exits 0
curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null
The --retry 3 flag retries the ping up to three times if the network is unreliable. The > /dev/null suppresses output. The -f flag causes curl to return a non-zero exit code on HTTP errors.
Capturing the exit code to ping on failure too:
#!/bin/bash
/usr/local/bin/backup.sh
EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 0 ]; then
    curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null
else
    # Some services support a /fail suffix for explicit failure pings
    curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/fail" > /dev/null
fi
exit "$EXIT_CODE"
Using start and finish pings to detect stuck jobs:
#!/bin/bash
# Signal job started
curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/start" > /dev/null
# Run the job
/usr/local/bin/backup.sh
EXIT_CODE=$?
# Signal completion with exit code
curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE/$EXIT_CODE" > /dev/null
exit $EXIT_CODE
With start and finish pings, the monitoring service can alert if the time between start and finish exceeds an expected maximum duration, catching jobs that are running but hung.
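How a service might classify a run from its start and finish pings can be sketched as follows. This is an illustration under assumed names (`job_state`, `max_runtime`), not the logic of any specific product.

```python
from datetime import datetime, timedelta


def job_state(started_at, finished_at, max_runtime, now):
    """Classify a run as 'finished', 'running', or 'hung'.

    started_at:  datetime of the most recent start ping
    finished_at: datetime of the most recent finish ping, or None
    max_runtime: timedelta the job is allowed to take
    now:         datetime at which the check is evaluated
    """
    # A finish ping only counts if it belongs to the latest start.
    if finished_at is not None and finished_at >= started_at:
        return "finished"
    if now - started_at > max_runtime:
        return "hung"
    return "running"
```

A job that normally takes 5 minutes could be given a `max_runtime` of 15 minutes, so a run blocked on a database lock is reported as hung well before the next scheduled ping is missed.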
Protecting the primary cron command from monitoring failures:
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 "https://hc-ping.com/YOUR-UUID-HERE" > /dev/null
The && ensures the ping is sent only if backup.sh exits 0. Because the ping runs after backup.sh has already finished, a failed curl (for example, when the monitoring service is down) cannot interrupt or undo the job itself; the worst outcome is a missed ping and a false alert.
Implementation: Python
import subprocess
import sys
import urllib.request

PING_URL = "https://hc-ping.com/YOUR-UUID-HERE"


def run_job():
    result = subprocess.run(
        ["/usr/local/bin/process_data.py"],
        capture_output=True,
        text=True
    )
    return result.returncode


def ping(url):
    try:
        urllib.request.urlopen(url, timeout=10)
    except Exception:
        pass  # Do not let monitoring failure affect the job


# Signal start
ping(f"{PING_URL}/start")
exit_code = run_job()
if exit_code == 0:
    ping(PING_URL)
else:
    ping(f"{PING_URL}/fail")
sys.exit(exit_code)

Implementation: Kubernetes CronJobs
Kubernetes CronJobs add complexity because pods are ephemeral and job execution is managed by the Kubernetes control plane (the CronJob controller creates Jobs and the scheduler places their pods), not cron. Silent failures take different forms: pods that complete with non-zero exit codes, jobs that never schedule because of insufficient cluster resources, jobs that exceed their activeDeadlineSeconds, and missed schedules when the CronJob controller falls behind.
Add a heartbeat curl to your job container’s command:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: backup-tool:latest
              command:
                - /bin/sh
                - -c
                - |
                  curl -fsS "https://hc-ping.com/YOUR-UUID-HERE/start" || true
                  /usr/local/bin/backup.sh
                  EXIT=$?
                  curl -fsS "https://hc-ping.com/YOUR-UUID-HERE/$EXIT" || true
                  exit $EXIT
          restartPolicy: OnFailure
      activeDeadlineSeconds: 3600
Kubernetes-native monitoring via OTel Collector:
For teams already running the OTel Collector in their cluster, the k8sobjectsreceiver captures Kubernetes events including job completions, pod failures, and missed schedules without adding curl calls to each container:
receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch
        group: events.k8s.io
        namespaces: [default, production]
This captures events like BackoffLimitExceeded (job failed after all retries). Because Complete events are captured too, you can also alert when none arrives within the expected window.
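Once events are collected, picking out failed Jobs is a simple filter. The sketch below assumes each event has already been decoded into a dictionary with the `reason` and `regarding` fields of the events.k8s.io/v1 Event object; how your pipeline unwraps the collector's log records is not shown, and `failed_job_events` is a hypothetical helper name.

```python
def failed_job_events(events, reasons=("BackoffLimitExceeded",)):
    """Return (namespace, job_name, reason) for Job events that signal failure.

    events:  iterable of dicts shaped like events.k8s.io/v1 Event objects
    reasons: event reasons to treat as failures (assumed default)
    """
    failures = []
    for event in events:
        regarding = event.get("regarding") or {}
        if regarding.get("kind") == "Job" and event.get("reason") in reasons:
            failures.append(
                (regarding.get("namespace"), regarding.get("name"), event["reason"])
            )
    return failures


# Usage with two hand-written sample events (illustrative data only).
sample = [
    {"reason": "BackoffLimitExceeded",
     "regarding": {"kind": "Job", "namespace": "production", "name": "daily-backup-1"}},
    {"reason": "Completed",
     "regarding": {"kind": "Job", "namespace": "production", "name": "daily-backup-2"}},
]
failed = failed_job_events(sample)
```

Passing a wider `reasons` tuple lets the same filter cover other failure reasons your cluster emits.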
Grace Periods: Avoiding False Alerts
Jobs do not always run at exactly their scheduled time. Server load, scheduling jitter, and time zone differences can all cause a job to ping a few minutes late. Alerting on any delay, however brief, creates false positive noise that leads teams to ignore alerts.
Set grace periods based on measured variance:
| Job frequency | Suggested grace period |
| --- | --- |
| Every minute | 1 to 2 minutes |
| Every 5 minutes | 2 to 3 minutes |
| Hourly | 5 to 10 minutes |
| Daily | 15 to 30 minutes |
| Weekly | 1 to 2 hours |
A grace period that is too short creates false alerts. A grace period that is too long means real failures are detected late. Most monitoring tools let you configure this per check.
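One way to derive a grace period from measurements rather than guesswork is to take a high percentile of observed lateness and add headroom. This is a sketch under assumptions: the 99th percentile, the 1.5x safety factor, and the 60-second floor are illustrative defaults, and `suggest_grace_seconds` is a hypothetical helper.

```python
import math


def suggest_grace_seconds(observed_delays_s, percentile=0.99,
                          safety_factor=1.5, floor_s=60):
    """Suggest a grace period from historical ping delays (in seconds).

    observed_delays_s: how late each past ping arrived versus its schedule
    percentile:        fraction of past runs the grace period should cover
    safety_factor:     multiplier for headroom above the chosen percentile
    floor_s:           minimum grace period to return
    """
    if not observed_delays_s:
        raise ValueError("need at least one observed delay")
    ordered = sorted(observed_delays_s)
    # Nearest-rank percentile: smallest value covering the requested fraction.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return max(floor_s, math.ceil(ordered[rank - 1] * safety_factor))
```

For example, delays of 10, 20, 240, 30, and 15 seconds give a 99th-percentile delay of 240 seconds and a suggested grace period of 360 seconds.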
What to Monitor Beyond the Heartbeat
Heartbeat monitoring confirms whether a job ran and exited 0. It does not tell you whether the job did its full work correctly. Complement heartbeats with:
| Signal | What it catches | How to capture it |
| --- | --- | --- |
| Row count or record count | Job ran but processed zero records | Log the count, assert it is above a minimum threshold before pinging success |
| Execution duration trend | Job is getting slower over time, indicating a scaling problem | Log start and end timestamps, send duration in the ping body |
| Output file size | Backup produced an empty or suspiciously small file | Check file size before pinging success |
| Exit code (non-zero) | Script-level failure | Use the /fail ping or check exit code before pinging |
| External dependency health | Job failed because the database or API it depends on was unavailable | Log the dependency error with structured fields for later correlation |
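The record-count and file-size checks in the table can run just before the success ping. A minimal sketch, assuming the job knows its output path and how many records it processed; the thresholds and the name `validate_outcome` are illustrative.

```python
import os


def validate_outcome(output_path, records_processed,
                     min_bytes=1024, min_records=1):
    """Return a list of problems; an empty list means it is safe to ping success."""
    problems = []
    if not os.path.isfile(output_path):
        problems.append(f"output file missing: {output_path}")
    elif os.path.getsize(output_path) < min_bytes:
        problems.append(f"output file smaller than {min_bytes} bytes")
    if records_processed < min_records:
        problems.append(f"only {records_processed} records processed")
    return problems
```

If the returned list is non-empty, the job should log the problems, send the failure ping instead of the success ping, and exit non-zero.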
Common Setup Mistakes
| Mistake | What happens | Fix |
| --- | --- | --- |
| Pinging at the start instead of end | You know the job started, not that it succeeded | Ping at the end, or use both start and finish pings and alert on missing finish |
| Not checking exit codes before pinging | A failed job pings success | Use command && curl ping_url or explicitly check $? before pinging |
| Using one check URL for multiple jobs | One missed ping could be any of the jobs | Create one check per job, per environment |
| Grace period too short | False alerts from scheduling jitter | Set grace period based on measured historical variance |
| Monitoring only production | Silent failures in staging go unnoticed | Create separate checks per environment |
| Not testing the monitor | Monitor is set up but has never fired | Intentionally skip a ping and confirm you receive an alert |
Beyond “Did It Run?”: Knowing Why It Failed
A heartbeat tells you a job completed and exited 0. What it cannot tell you is why a subsequent job is failing, whether a resource contention event on the same host caused the job to run slowly, or whether an infrastructure event that happened at the same time explains the failure.
CubeAPM monitors Kubernetes CronJob execution natively, tracking pod runs, exit codes, durations, retries, and missed schedules with alerts on failed or long-running jobs. What separates it from standalone heartbeat services is correlation: when a job misses a run or fails, CubeAPM connects the event to the application logs from that pod, the infrastructure metrics on the node at that moment, and any distributed traces from services the job called. The result is not just “this job failed” but “this job failed because the database connection pool was exhausted, which also caused these three API requests to fail at the same time.” It runs self-hosted inside your own infrastructure at $0.15/GB ingestion with no per-user fees.
Summary
Silent cron failures happen because cron is a scheduler, not a monitoring system. The minimum viable monitoring layer is a heartbeat: the job pings an external service after successful completion, and an alert fires if the ping does not arrive within a grace period. Implement it with a single curl call after your job command, or with start and finish pings for stuck-job detection. For Kubernetes CronJobs, combine heartbeat pings with native Kubernetes event monitoring. Set grace periods based on measured scheduling variance to avoid false alert noise.
| Signal layer | What it catches | How to implement |
| --- | --- | --- |
| Heartbeat ping on success | Job did not run, job crashed, job exited non-zero | command && curl ping_url or explicit exit code check |
| Start and finish pings | Job started but hung or timed out | Ping at start and end, alert if finish ping never arrives |
| Outcome assertion | Partial failure that exits 0 without doing useful work | Validate outcome before pinging success |
| Duration tracking | Job is getting slower, approaching timeout thresholds | Log duration in ping body or structured log |
| Kubernetes event monitoring | Missed schedules, BackoffLimitExceeded, pod eviction | OTel Collector k8sobjectsreceiver on events API |
| Infrastructure correlation | Job failed because of host-level resource problem | APM platform correlating job events with host metrics |
Disclaimer: Healthchecks.io pricing (Hobbyist free/20 checks, Business $20/month/100 checks, Business Plus $80/month/1000 checks, unlimited team members from February 18, 2026) verified directly from healthchecks.io/pricing as of May 2026. Better Stack free plan (10 heartbeats included) and Responder license pricing ($29/month annual) verified from betterstack.com as of May 2026. CubeAPM cron job monitoring capabilities verified from cubeapm.com/blog/top-cron-job-monitoring-tools as of May 2026.