Kubernetes has become the backbone of modern cloud infrastructure. As of 2024, over 60% of enterprises were running Kubernetes in production, making it the de facto standard for orchestrating containerized workloads. But with adoption rising, keeping clusters stable is a growing challenge. A common issue is the CrashLoopBackOff state, where pods restart endlessly, wasting resources, causing downtime, and slowing development while driving up costs.
CubeAPM helps teams resolve CrashLoopBackOff quickly by unifying pod metrics, container logs, and Kubernetes events into a single view. With the OpenTelemetry Collector running in daemonset and deployment modes, it links restarts to resource spikes or probe failures, so engineers can identify root causes quickly and restore stability before users are affected.
In this article, we’ll explain what CrashLoopBackOff is, why it happens, how to fix it, and how to monitor it effectively with CubeAPM.
What is CrashLoopBackOff Error in Kubernetes?

CrashLoopBackOff is a pod state in Kubernetes that indicates a container is repeatedly failing after startup. Instead of running normally, the container crashes, Kubernetes restarts it, and the cycle continues with longer back-off delays between attempts.
When you run kubectl get pods, you might see:
STATUS: CrashLoopBackOff
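The full line in the pod listing typically looks like this (illustrative output; names, restart counts, and ages will differ in your cluster):
NAME        READY   STATUS             RESTARTS   AGE
web-app-1   0/1     CrashLoopBackOff   5          3m42s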
This is not a single error but a signal from Kubernetes that something deeper is wrong — whether in your application code, resource allocation, or environment configuration. Common triggers include:
- Application-level problems – such as missing dependencies, unhandled errors, or crashes at startup.
- Infrastructure constraints – like tight CPU or memory limits that kill containers.
- Misconfigured health checks – liveness, readiness, or startup probes that are too strict.
- Configuration mistakes – invalid environment variables, missing Secrets, or broken ConfigMaps.
- External dependencies – databases, APIs, or services that aren’t reachable when the app starts.
Left unresolved, CrashLoopBackOff leads to wasted compute resources, slower deployments, and degraded user experience. That’s why understanding its causes and monitoring it effectively are critical for production-grade Kubernetes environments.
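A quick way to narrow down which trigger you are dealing with is to inspect the container's restart count and last termination state. A minimal check, assuming a pod named web-app-1 in the default namespace:
# Print restart count, last termination reason, and exit code
kubectl get pod web-app-1 --namespace default \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'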
Why CrashLoopBackOff Happens
1. Application Errors
The most common cause is an error inside the application itself. If the code has bugs, dependencies are missing, or environment variables are misconfigured, the container may crash immediately on startup. For example, an app expecting a database connection string might fail if that secret wasn’t mounted correctly, forcing Kubernetes to restart the pod repeatedly.
Check logs from the last failed run:
kubectl logs web-app-1 --previous --namespace default
This shows whether the app failed due to missing dependencies, a crash loop, or a misconfiguration like an unset environment variable.
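If the logs point to a missing or incorrect environment variable, confirm what is actually set on the deployment before changing code. A quick check, assuming a deployment named web-app:
# List the environment variables defined on the deployment's containers
kubectl set env deployment/web-app --list --namespace default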
2. Resource Limits
CrashLoopBackOff often appears when pods are constrained by strict CPU or memory limits. If a container exceeds its memory allocation, the operating system kills it (OOMKilled), and Kubernetes restarts it. Similarly, under-provisioned CPU can cause timeouts or slow startups, making the pod unstable.
Describe the pod to confirm:
kubectl describe pod web-app-1 | grep -i "OOMKilled"
If confirmed, adjust limits in your manifest:
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"
Apply the fix:
kubectl apply -f deployment.yaml
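After the rollout, it helps to confirm that live usage fits comfortably inside the new limits. A quick check, assuming metrics-server is installed and the pods carry an app=web-app label:
# Compare current usage against the new 1Gi memory limit
kubectl top pod -l app=web-app --namespace default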
3. Probe Failures
Kubernetes relies on liveness, readiness, and startup probes to determine if a container is healthy. If these checks are misconfigured—say, the probe timeout is too aggressive or points to the wrong endpoint—the pod may restart even when the app is functioning. Overly strict probes create false negatives that trigger unnecessary restarts, resulting in a CrashLoopBackOff state.
Example of a safe liveness probe:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
This gives apps time to start before probes trigger restarts.
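To confirm probes are the culprit rather than the app itself, look for Unhealthy events against the pod. A minimal check, again using web-app-1 as the pod name:
# Probe failures surface as "Unhealthy" events that include the probe's error message
kubectl get events --namespace default \
  --field-selector involvedObject.name=web-app-1,reason=Unhealthy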
4. Image or Configuration Issues
Containers can also crash due to faulty images or bad configurations. A Docker image missing required files, an invalid ConfigMap, or a misreferenced Secret can all prevent the application from starting. In such cases, Kubernetes doesn’t know the difference between a build issue and a runtime one—it just restarts the pod, trapping it in a loop until the image or config is fixed.
Check for bad secrets or configs:
kubectl describe pod web-app-1 | grep -i "ConfigMap\|Secret"
If the image itself is broken, rebuild locally:
docker run -it myapp:latest
Fix the entrypoint, rebuild, then redeploy.
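It also pays to verify that every ConfigMap and Secret the pod references actually exists in the namespace, since a missing one blocks container creation outright. A quick check with illustrative object names:
# Confirm the referenced objects exist (replace names with those from your manifest)
kubectl get configmap app-config --namespace default
kubectl get secret app-secrets --namespace default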
5. Permissions and Security Constraints
Even if the application is correct, permission errors can stop it from running. Problems like restricted RBAC policies, blocked service accounts, or misconfigured volume mounts can prevent the container from accessing critical resources. Since these errors typically happen at startup, the container fails immediately, pushing the pod into CrashLoopBackOff.
Check events:
kubectl describe pod web-app-1 | grep -i "permission"
Update your role or service account:
serviceAccountName: my-app-sa
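You can also test whether the service account has the permissions the app needs without redeploying, using impersonation. A sketch, with the resource and verb adjusted to whatever your app accesses:
# Check whether the service account may read Secrets in the namespace
kubectl auth can-i get secrets \
  --as=system:serviceaccount:default:my-app-sa --namespace default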
6. Networking and Dependencies
Finally, many pods crash when they can’t reach the external services they depend on. If DNS resolution fails, a database isn’t ready, or network policies block traffic, the app may exit on startup. Without retries or proper error handling, Kubernetes simply restarts the pod, creating a loop until the dependency becomes available or the network issue is fixed.
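Before touching the app, confirm the dependency is actually reachable from inside the cluster. A throwaway pod works well for this; db and port 5432 below are placeholders for your own dependency:
# DNS resolution plus a TCP reachability check from inside the cluster
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  sh -c 'nslookup db && nc -zv db 5432'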
How to Fix CrashLoopBackOff
This section is a branching runbook. Don’t do everything—triage first, then follow the one branch that matches your symptoms. All commands are standard, copy-paste safe once you set the variables.
1. Before you begin (set once)
# === Set your context ===
NS=default
DEPLOY=web-app
CONTAINER=app
# pick a failing pod from the deployment (adjust label if needed)
POD=$(kubectl get pod -n "$NS" -l app="$DEPLOY" -o jsonpath='{.items[0].metadata.name}')
echo "NS=$NS DEPLOY=$DEPLOY CONTAINER=$CONTAINER POD=$POD"
2. 30-second triage (run every time)
# See failing pod & status
kubectl get pods -n "$NS"
# Events, exit code, probe results (save for handoff)
kubectl describe pod "$POD" -n "$NS" | tee describe.$POD.txt
# Logs from the last crashed run (key for CrashLoopBackOff)
kubectl logs "$POD" -c "$CONTAINER" --previous -n "$NS" | tee logs.$POD.prev.txt
# Most recent events in order
kubectl get events -n "$NS" --field-selector="involvedObject.name=$POD" \
--sort-by=.metadata.creationTimestamp | tail -n 20 | tee events.$POD.txt
3. Pick one branch (symptom → fix)
| What you saw in triage | Go to this branch |
| --- | --- |
| Stack trace / runtime exception in app logs | 1. App & env |
| Reason: OOMKilled, memory near limit, heavy GC | 2. Resources |
| Frequent liveness/readiness/startup probe failures in events | 3. Probes |
| Forbidden / permission denied / volume mount error | 4. RBAC & mounts |
| ImagePullBackOff, wrong entrypoint, missing binary, missing Secret/ConfigMap | 5. Image & config |
| Connection/DNS errors, dependency not ready | 6. Networking & deps |
1. App & env issues (missing env, bad command, crash at startup)
You’re in the right place if: logs show exceptions, missing env vars, or mis-wired startup args.
Inspect the exact error:
kubectl logs "$POD" -c "$CONTAINER" --previous -n "$NS"
Set/fix env (example: DB URL) and roll:
kubectl set env deploy/"$DEPLOY" DB_URL='postgres://db:5432/app' -n "$NS"
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
If entrypoint/image is wrong, update to known-good:
kubectl set image deploy/"$DEPLOY" "$CONTAINER"=myrepo/web-app:fixed -n "$NS"
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
Success: STATUS=Running, RESTARTS stop increasing, no new fatal logs.
2. Resources (OOMKilled / CPU throttling)
You’re in the right place if: describe shows Reason: OOMKilled or CPU/memory pressure.
Confirm OOM hint:
kubectl describe pod "$POD" -n "$NS" | grep -i oomkilled -n || true
Bump requests/limits (safe baseline):
kubectl set resources deploy/"$DEPLOY" -n "$NS" \
--containers="$CONTAINER" \
--requests=cpu=250m,memory=512Mi \
--limits=cpu=500m,memory=1Gi
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
# If metrics-server is available:
kubectl top pod -n "$NS"
Success: no new OOM events; restarts level off; p95 memory < ~80% of limit.
3. Probes (too strict/wrong endpoint)
You’re in the right place if: events show frequent liveness/readiness/startup probe failures.
Quick patch (single command):
kubectl patch deploy "$DEPLOY" -n "$NS" --type='json' -p='[
{"op":"add","path":"/spec/template/spec/containers/0/livenessProbe","value":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":20,"periodSeconds":10,"timeoutSeconds":5,"failureThreshold":3}},
{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe","value":{"httpGet":{"path":"/ready","port":8080},"initialDelaySeconds":10,"periodSeconds":5,"timeoutSeconds":3,"failureThreshold":6}},
{"op":"add","path":"/spec/template/spec/containers/0/startupProbe","value":{"httpGet":{"path":"/startup","port":8080},"failureThreshold":30,"periodSeconds":5}}
]'
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
Success: probe failures stop; pod becomes Ready and stays there.
4. RBAC & mounts (permissions, service account, volumes)
You’re in the right place if: you see Forbidden, permission denied, or mount errors.
Bind minimal RBAC and attach a ServiceAccount:
# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata: { name: web-app-sa, namespace: default }
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { name: web-app-role, namespace: default }
rules:
  - apiGroups: [""]
    resources: ["configmaps","secrets"]
    verbs: ["get","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: web-app-rb, namespace: default }
subjects: [{ kind: ServiceAccount, name: web-app-sa, namespace: default }]
roleRef: { apiGroup: rbac.authorization.k8s.io, kind: Role, name: web-app-role }
kubectl apply -f rbac.yaml
kubectl patch deploy "$DEPLOY" -n "$NS" -p '{"spec":{"template":{"spec":{"serviceAccountName":"web-app-sa"}}}}'
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
Success: permission/mount errors disappear; pod reaches Ready.
5. Image & configuration (bad image, missing Secret/ConfigMap)
You’re in the right place if: events mention a missing Secret/ConfigMap or image/entrypoint issues.
Check config refs quickly:
kubectl describe pod "$POD" -n "$NS" | grep -Ei "configmap|secret|image"
Create Secret and mount via env (example):
kubectl create secret generic app-secrets -n "$NS" \
--from-literal=APP_API_KEY='REDACTED'
kubectl patch deploy "$DEPLOY" -n "$NS" --type='json' -p='[
{"op":"add","path":"/spec/template/spec/containers/0/envFrom","value":[{"secretRef":{"name":"app-secrets"}}]}
]'
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
Sanity-check image locally, then pin a fixed tag:
docker run --rm myrepo/web-app:fixed /app/healthcheck || echo "image/startup problem"
kubectl set image deploy/"$DEPLOY" "$CONTAINER"=myrepo/web-app:fixed -n "$NS"
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
Success: no missing resource/image errors; the container starts cleanly.
6. Networking & dependencies (DB/API not reachable, DNS/NetworkPolicy)
You’re in the right place if: logs show connection/DNS errors or late dependencies.
Quick connectivity check (throwaway pod):
kubectl run -it netcheck --rm --restart=Never -n "$NS" \
--image=busybox:1.36 -- sh -lc 'nc -zv db 5432 && echo ok || echo fail'
(Ephemeral debug if enabled)
kubectl debug -it "$POD" -n "$NS" --image=busybox:1.36 --target="$CONTAINER"
Delay app until deps are ready (init container gate):
# init-wait.yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: web-app, namespace: default }
spec:
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ["sh","-c","until nc -z db 5432; do echo waiting for db; sleep 2; done"]
      containers:
        - name: app
          image: myrepo/web-app:latest
kubectl apply -f init-wait.yaml
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
Success: app reaches dependencies; no new restarts tied to timeouts.
Emergency mitigation (if a fresh release caused the loop)
kubectl rollout undo deploy/"$DEPLOY" -n "$NS"
kubectl rollout status deploy/"$DEPLOY" -n "$NS"
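If you are not sure which revision to roll back to, list the deployment's rollout history first (revision numbers and change causes come from your own releases):
kubectl rollout history deploy/"$DEPLOY" -n "$NS"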
Verify resolution (always)
kubectl get pods -n "$NS"
kubectl describe pod "$POD" -n "$NS" | grep -Ei 'Reason|Probe|BackOff|OOM' || true
kubectl logs deploy/"$DEPLOY" -c "$CONTAINER" --tail=200 -n "$NS"
Prevent repeats: keep probes realistic, right-size resources from observed usage, use init containers for late dependencies, and watch restart rate, probe failures, and memory headroom in your CubeAPM dashboards and alerts.
Monitoring CrashLoopBackOff Error with CubeAPM
Fixing a CrashLoopBackOff once is good, but preventing it from recurring is even better. That’s where monitoring comes in. CubeAPM gives Kubernetes teams a complete view of pod restarts, probe failures, and resource pressure in one platform, helping them detect problems early and resolve them before users feel the impact. Here’s how CubeAPM makes CrashLoopBackOff easier to handle:
1. Make sure CubeAPM is running (Helm)
helm repo add cubeapm https://charts.cubeapm.com
helm repo update cubeapm
helm show values cubeapm/cubeapm > values.yaml
# edit values.yaml as needed, then:
helm install cubeapm cubeapm/cubeapm -f values.yaml
2. Install OpenTelemetry Collector in both modes
CubeAPM’s Kubernetes guide recommends running the collector as both a DaemonSet and a Deployment for full coverage (container logs, kubelet/node metrics, cluster metrics, and events).
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update open-telemetry
Create two values files and set your CubeAPM OTLP endpoints.
otel-collector-daemonset.yaml (per-node: kubelet/host metrics + container logs)
otel-collector-deployment.yaml (cluster metrics + Kubernetes events)
In both files, point exporters to CubeAPM (replace host/token as applicable):
exporters:
  otlphttp/metrics: { endpoint: "https://<CUBE_HOST>:4318/v1/metrics" }
  otlphttp/logs: { endpoint: "https://<CUBE_HOST>:4318/v1/logs" }
  otlphttp/traces: { endpoint: "https://<CUBE_HOST>:4318/v1/traces" }
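As an illustration, the DaemonSet values file can lean on the chart's built-in presets instead of hand-writing every receiver. The sketch below assumes the preset names from the upstream opentelemetry-collector chart; the exporters above still go under the chart's config key and must be wired into its pipelines:
# Minimal sketch of otel-collector-daemonset.yaml (adapt before use)
cat > otel-collector-daemonset.yaml <<'EOF'
mode: daemonset
presets:
  logsCollection:
    enabled: true        # tail container logs on each node
  kubeletMetrics:
    enabled: true        # pod/container CPU and memory from the kubelet
  hostMetrics:
    enabled: true        # node-level CPU, memory, and disk
  kubernetesAttributes:
    enabled: true        # enrich telemetry with pod/namespace metadata
EOF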
Install:
helm install otel-collector-daemonset open-telemetry/opentelemetry-collector -f otel-collector-daemonset.yaml
helm install otel-collector-deployment open-telemetry/opentelemetry-collector -f otel-collector-deployment.yaml
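Before moving on, do a quick sanity check that both releases are running; the label values below follow the release names used in the install commands above:
# Both collectors should show Running pods
kubectl get pods -l app.kubernetes.io/instance=otel-collector-daemonset
kubectl get pods -l app.kubernetes.io/instance=otel-collector-deployment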
3. (Optional) Pull in Prometheus exporter metrics (e.g., restarts)
Add this to either collector values file to scrape your exporters (e.g., kube-state-metrics) so you can chart/alert on restart counters:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          scrape_interval: 60s
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp/metrics]
4. Verify in CubeAPM (what you should see)
- Infrastructure → Kubernetes: pods/nodes with CPU/memory & restarts (from kubelet/host metrics).
- Logs → Kubernetes events: filter for reason="CrashLoopBackOff" / BackOff / Killing to spot loops. (Events are shipped by the Deployment pipeline.)
- Dashboards/Alerts: if you enabled Prometheus, use restart counters (e.g., from kube-state-metrics) to chart hot spots and alert on spikes.
5. Wire an alert quickly
- Events-based: alert when CrashLoopBackOff events persist in a namespace.
- Metrics-based: alert on a rising container restart rate or shrinking memory headroom using scraped Prometheus metrics. (CubeAPM supports alerting from stored metrics/logs.)
That’s it—two OTel collectors pointing at CubeAPM (DaemonSet + Deployment), optional Prometheus scrapes, then use Events + Metrics + Logs in CubeAPM to see and alert on CrashLoopBackOff.
Example Alert Rules
CrashLoopBackOff issues can’t just be spotted after the fact — they need proactive detection. The right alerts catch restart loops early, highlight the likely cause, and guide engineers on what action to take. Since CubeAPM ingests both Prometheus scrapes and Kubernetes events, these rules can be set up quickly and tied directly to your dashboards and on-call workflows.
1. Pod stuck in CrashLoopBackOff
This alert fires when a pod remains in a CrashLoopBackOff state for more than a few minutes. It’s the most direct signal that something is wrong and ensures teams respond quickly instead of letting pods silently churn. By surfacing the affected pod, namespace, and last exit code, engineers can immediately dive into CubeAPM’s pod view to check recent probe outcomes and deployments. This is implemented as shown:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cubeapm-crashloopbackoff
  namespace: monitoring
spec:
  groups:
    - name: crashloopbackoff.core
      interval: 30s
      rules:
        # 1) Pod stuck in CrashLoopBackOff
        - alert: PodCrashLoopBackOff
          expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is in CrashLoopBackOff. Check last exit code, probes, and recent deploys in CubeAPM."
2. High container restart rate
Sometimes pods flip in and out of CrashLoopBackOff without staying there long. Monitoring the restart rate—like more than three restarts in five minutes—catches these hidden loops. With context like container name, image tag, and recent logs, teams can link restarts to resource pressure or configuration changes before they escalate. This is implemented as shown:
        # 2) High container restart rate
        - alert: HighContainerRestartRate
          expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
          for: 0m
          labels:
            severity: warning
          annotations:
            description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} restarted >3 times in 5m. Correlate with CPU/memory and deploys in CubeAPM."
3. Probe failure storm
Readiness and liveness probes that fail too often create restart storms. An alert that tracks probe failures over time highlights whether probes are too strict or pointing to the wrong endpoint. Engineers can then fine-tune probe thresholds or delays, preventing unnecessary restarts triggered by misconfigured health checks. This is implemented as shown:
        # 3) Probe failure storm (readiness flapping)
        - alert: ProbeFailureStorm
          expr: avg_over_time(kube_pod_status_ready{condition="false"}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} NotReady >10% over 5m (likely readiness probe issues). Tune initialDelay/timeout/thresholds and verify endpoints in CubeAPM."
4. Memory pressure or OOMKilled
A common cause of CrashLoopBackOff is memory exhaustion. Alerts that trigger when containers approach their memory limits—or when Kubernetes reports OOMKilled events—help engineers right-size resources proactively. This reduces wasted cycles and keeps pods running smoothly instead of crashing under load. This is implemented as shown:
        # 4) Memory pressure precursor (before OOMKilled)
        - alert: ContainerMemoryPressure
          expr: |
            (container_spec_memory_limit_bytes{container!=""} > 0)
            and on (namespace,pod,container)
            (container_memory_working_set_bytes{container!=""} / container_spec_memory_limit_bytes{container!=""}) > 0.90
          for: 10m
          labels:
            severity: warning
          annotations:
            description: "High memory (>90% of limit) for {{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.container }}. Right-size limits/requests; confirm memory graphs & OOM events in CubeAPM."
5. Dependency-related restarts
Applications often rely on databases or APIs that may not always be available. Alerts that correlate restart loops with spikes in service latency or errors make these dependencies visible. If a pod crashes every time a database connection fails, CubeAPM’s traces and metrics confirm the link, guiding engineers to fix the upstream issue rather than chasing false leads. This is implemented as shown:
        # 5) Dependency-related restarts (restarts + sustained NotReady)
        - alert: DependencyRelatedRestarts
          expr: |
            (increase(kube_pod_container_status_restarts_total[10m]) > 3)
            and on (namespace,pod)
            (avg_over_time(kube_pod_status_ready{condition="false"}[10m]) > 0.5)
          for: 0m
          labels:
            severity: warning
          annotations:
            description: "Likely dependency issue: {{ $labels.namespace }}/{{ $labels.pod }} has restart spike with NotReady >50% over 10m. In CubeAPM, pivot from logs to traces to check DB/API connectivity and NetworkPolicies."
Conclusion
CrashLoopBackOff isn’t a root cause—it’s Kubernetes signaling that something deeper is wrong. Whether it’s application bugs, misconfigured probes, or resource limits, resolving the loop requires systematic debugging. But fixing once isn’t enough: teams need visibility to prevent repeat incidents.
CubeAPM gives Kubernetes teams that edge. By unifying metrics, logs, traces, and events, it makes root causes obvious and cuts downtime. With flat $0.15/GB pricing and full OpenTelemetry coverage, you get enterprise-grade Kubernetes observability without hidden costs.