
Kubernetes OOMKilled Error: Deep Dive into Memory Leaks, Limits, and Kernel Signals


Kubernetes OOMKilled error occurs when a container exceeds its memory limit and is terminated by the kernel’s Out-Of-Memory (OOM) killer. In practice, such failures contribute to the wider reliability challenges in Kubernetes: a recent survey found that 87% of organizations experienced outages in the last year, and 83% said those were Kubernetes-related. For businesses running large-scale microservices, these incidents can cascade into downtime, performance degradation, and revenue loss.

CubeAPM brings Kubernetes-specific visibility that makes OOMKilled errors easier to detect and resolve. It tracks pod and node memory metrics in real time, correlates OOMKilled events with deployments, autoscaling actions, and node pressure, and captures container logs to confirm when the kernel OOM killer terminated a process.

In this guide, we’ll break down what the OOMKilled error means, the top reasons it occurs, proven fixes with step-by-step commands, and how to set up monitoring and alerting with CubeAPM to prevent future incidents.

What is Kubernetes OOMKilled Error 


An OOMKilled state in Kubernetes means a container was terminated because it exceeded the memory resources assigned to it. When this happens, the Linux kernel invokes its Out-Of-Memory (OOM) killer, which forcefully ends the process to keep the node itself from crashing. Kubernetes then marks the pod’s container with the “OOMKilled” reason and typically attempts to restart it.

From a technical perspective, this event is tied to exit code 137, which indicates termination due to a SIGKILL signal. Logs may show messages like “Killed process <pid> (java)” in the node’s dmesg, confirming that the OOM killer acted.

OOMKilled errors often point to deeper issues:

  • Improper memory requests and limits in the pod spec, causing workloads to run out of space. 
  • Memory leaks in the application code, which slowly consume all available memory. 
  • Unexpected load spikes from traffic surges or batch jobs that temporarily exceed allocated resources.

In production systems, repeated OOMKilled events can degrade reliability by forcing constant pod restarts, breaking connections, and even triggering cascading failures in distributed applications. For stateful services like databases or caching layers, an OOMKilled container can also lead to data inconsistency or lost transactions.

Why OOMKilled in Kubernetes Pods Happens

1. Misconfigured Memory Limits

If a container’s memory limit is set too low, normal workload operations may exceed it and trigger the OOM killer. For example, a Java application often requires more heap than expected, leading to frequent OOMKilled events.

Quick check:

Bash
kubectl describe pod <pod-name>

 

Look for the container’s Limits section under Resources to confirm if the memory allocation is too restrictive.
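
For reference, this is roughly how requests and limits appear in a pod or deployment manifest. A minimal sketch; the name, image, and sizes are placeholders, not values from this guide:

YAML
# Illustrative container spec; name, image, and sizes are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: example/app:1.0
      resources:
        requests:
          memory: "512Mi"   # what the scheduler reserves on the node
        limits:
          memory: "1Gi"     # exceeding this cgroup limit triggers the OOM killer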

2. Memory Leaks in the Application

Applications that fail to release memory properly will slowly consume all available memory. This is common in long-running services where objects pile up in memory, causing the container to be killed even if limits are generous.

Quick check:

Bash
kubectl top pod <pod-name>

 

Watch memory usage grow continuously without dropping, even when traffic subsides.

3. Sudden Traffic Spikes or Heavy Workloads

Batch jobs, large queries, or sudden increases in request volume can cause containers to momentarily exceed memory limits. Even well-tuned applications may face OOMKilled events if scaling rules don’t react fast enough.

Quick check:

Bash
kubectl get hpa

 

Verify if the Horizontal Pod Autoscaler (HPA) is configured to scale based on memory metrics, not just CPU.

4. Node-Level Resource Pressure

Sometimes the container’s limits are fine, but the node itself is under pressure because other pods are competing for memory. In this case, the kernel may OOM-kill containers (or the kubelet may evict pods) even though the workload itself was behaving normally.

Quick check:

Bash
kubectl describe node <node-name>

 

Check the Allocatable vs Used memory and review which pods are consuming most resources.

How to Fix OOMKilled in Kubernetes Pods

Fixing OOMKilled errors requires validating each possible failure point and adjusting workloads accordingly. Below are the most common fixes with quick checks and commands:

1. Adjust Memory Requests and Limits

If limits are too low, Kubernetes will repeatedly kill the container. Increase them based on actual workload usage.

Quick check:

Bash
kubectl describe pod <pod-name>

 

Look at the Limits section and compare with observed usage from kubectl top pod.

Fix:

Bash
kubectl set resources deployment <deploy-name> -n <namespace> --containers=<container-name> --limits=memory=1Gi --requests=memory=512Mi

 

2. Identify and Fix Memory Leaks

Use application profiling or tracing to detect memory leaks in code. A container with steadily growing memory usage likely has a leak.

Quick check:

Bash
kubectl top pod <pod-name>

 

If memory keeps rising even when load is stable, investigate the app.

Fix: Patch the application, upgrade leaky libraries, or rebuild the image with an updated runtime and tuned memory settings (e.g., JVM heap configuration for Java apps, as sketched below).
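
For Java workloads, one common mitigation is to size the heap from the container’s cgroup limit rather than the node’s total memory. A minimal sketch, assuming a JDK 10+ runtime that reads JAVA_TOOL_OPTIONS; the image name, percentage, and limit are placeholders:

YAML
# Deployment container excerpt: cap the JVM heap relative to the container limit
# so heap plus native memory stays under the cgroup limit.
containers:
  - name: app
    image: example/java-app:1.0              # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"   # leave headroom for non-heap memory
    resources:
      limits:
        memory: "1Gi"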

3. Enable Autoscaling Based on Memory

Sometimes spikes are legitimate workload bursts. Ensure the Horizontal Pod Autoscaler (HPA) scales on memory usage, not just CPU.

Quick check:

Bash
kubectl get hpa

 

Fix:

Bash
kubectl autoscale deployment <deploy-name> --min=2 --max=10 --cpu-percent=70

 

Then extend it to scale on memory as well; the autoscaling/v2 API supports memory directly as a resource metric, as sketched below.
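
A minimal HPA sketch under that assumption (the metrics server must be installed; the deployment name, replica bounds, and utilization targets are placeholders):

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # placeholder deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory           # scale out before pods approach their limits
        target:
          type: Utilization
          averageUtilization: 75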

4. Spread Pods Across Nodes

If OOMKilled happens because a node is overloaded, reschedule workloads. Use taints, tolerations, or pod anti-affinity to balance memory-heavy pods.

Fix:

Bash
kubectl cordon <node-name>

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

 

Then redeploy pods to healthier nodes.
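
To keep replicas of the same memory-heavy workload from piling onto one node in the first place, a soft pod anti-affinity rule in the pod template is one option. A sketch; the label selector is a placeholder:

YAML
# Pod template excerpt: prefer scheduling replicas onto different nodes.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: example-app           # placeholder label
          topologyKey: kubernetes.io/hostname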

5. Monitor Kernel OOM Logs

Sometimes limits are fine but the kernel kills a process due to overall node stress.

Quick check:

Bash
kubectl logs <pod-name> -c <container> --previous

 

The previous container logs show what the application was doing before it was killed; the kernel’s own Killed process <pid> message appears in the node’s system logs (dmesg or journalctl on the node).

Fix: Adjust node sizing, increase cluster resources, or isolate heavy workloads.
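
If you isolate heavy workloads onto a dedicated node pool, the pod spec needs a matching node selector and toleration. A sketch, assuming you have already labeled and tainted those nodes; the label and taint names are placeholders:

YAML
# Pod template excerpt for a dedicated "memory-heavy" node pool.
spec:
  nodeSelector:
    workload: memory-heavy               # placeholder node label
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "memory-heavy"              # placeholder taint
      effect: "NoSchedule"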

Monitoring OOMKilled in Kubernetes Pods with CubeAPM

When a Pod is terminated with OOMKilled, the fastest way to root cause is by correlating four signal streams: Kubernetes Events (e.g., OOMKilled, Evicted), pod & node memory metrics (requests, limits, usage), container logs (Killed process <pid> messages from the OOM killer), and deployment rollouts or scaling actions. CubeAPM ingests all of these via the OpenTelemetry Collector and stitches them into timelines so you can see what drove the container over its limit—leak, spike, or node pressure.

Step 1 — Install CubeAPM (Helm)

Install (or upgrade) CubeAPM with your values file (endpoint, auth, retention, etc.):

Bash
helm install cubeapm cubeapm/cubeapm -f values.yaml

 

(Upgrade if already installed:)

Bash
helm upgrade cubeapm cubeapm/cubeapm -f values.yaml

 

Step 2 — Deploy the OpenTelemetry Collector (DaemonSet + Deployment)

Run the Collector both as a DaemonSet (for node-level stats, kubelet scraping, and events) and a Deployment (for central pipelines).

Bash
helm install otel-collector-daemonset open-telemetry/opentelemetry-collector -f otel-collector-daemonset.yaml

helm install otel-collector-deployment open-telemetry/opentelemetry-collector -f otel-collector-deployment.yaml
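
The two values files drive the chart’s mode. For the DaemonSet, you also need to expose the node name that the kubeletstats endpoint in Step 3a references. A minimal values excerpt, assuming the open-telemetry/opentelemetry-collector chart’s mode and extraEnvs keys (verify key names against the chart version you install):

YAML
# otel-collector-daemonset.yaml (values excerpt)
mode: daemonset
extraEnvs:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName   # lets the kubeletstats receiver target the local kubelet
# The receivers/processors/exporters from Step 3a go under the chart's `config:` key.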

 

Step 3 — Collector Configs Focused on OOMKilled

Keep configs separate for the DaemonSet and central Deployment.

3a) DaemonSet config (otel-collector-daemonset.yaml)
Key idea: collect kubelet stats (memory), Kubernetes events (OOMKilled, Evicted), and kube-state-metrics if present.

YAML
receivers:
  k8s_events: {}
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    endpoint: https://${NODE_NAME}:10250
    insecure_skip_verify: true
  prometheus:
    config:
      scrape_configs:
        - job_name: "kube-state-metrics"
          scrape_interval: 30s
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]

processors:
  batch: {}

exporters:
  otlp:
    endpoint: ${CUBEAPM_OTLP_ENDPOINT}
    headers:
      x-api-key: ${CUBEAPM_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [kubeletstats, prometheus]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [k8s_events]
      processors: [batch]
      exporters: [otlp]


  • k8s_events captures OOMKilled and Evicted events from the Kubernetes API server. 
  • kubeletstats surfaces per-pod memory usage vs. limits. 
  • prometheus (optional) scrapes kube-state-metrics for restart counts and pod phase. 
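
As an alternative to hand-writing these receivers, newer versions of the open-telemetry/opentelemetry-collector chart ship presets that enable similar telemetry. The exact receivers each preset wires up vary by chart version, so treat this excerpt as a sketch and verify against your chart’s values:

YAML
# Values excerpt using chart presets instead of hand-written receivers.
mode: daemonset
presets:
  kubeletMetrics:
    enabled: true        # per-pod memory usage from the kubelet
  kubernetesEvents:
    enabled: true        # cluster events, including eviction/OOM-related ones
  kubernetesAttributes:
    enabled: true        # enrich telemetry with pod and namespace metadata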

3b) Central Deployment config (otel-collector-deployment.yaml)
Key idea: receive OTLP, enrich with metadata, and export to CubeAPM.

YAML
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  resource:
    attributes:
      - key: cube.env
        value: production
        action: upsert
  batch: {}

exporters:
  otlp:
    endpoint: ${CUBEAPM_OTLP_ENDPOINT}
    headers:
      x-api-key: ${CUBEAPM_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]


Set CUBEAPM_OTLP_ENDPOINT and CUBEAPM_API_KEY via Helm values or Kubernetes Secrets.
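
One way to wire that up is a Kubernetes Secret referenced from the Collector values via extraEnvs, so that ${CUBEAPM_OTLP_ENDPOINT} and ${CUBEAPM_API_KEY} resolve at runtime. The Secret name, namespace, and values below are illustrative placeholders:

YAML
apiVersion: v1
kind: Secret
metadata:
  name: cubeapm-credentials              # placeholder name
  namespace: observability               # use the Collector's namespace
type: Opaque
stringData:
  CUBEAPM_OTLP_ENDPOINT: "<your-cubeapm-otlp-endpoint>"
  CUBEAPM_API_KEY: "<your-api-key>"
---
# Collector Helm values excerpt (assumes the chart's extraEnvs key):
extraEnvs:
  - name: CUBEAPM_OTLP_ENDPOINT
    valueFrom:
      secretKeyRef:
        name: cubeapm-credentials
        key: CUBEAPM_OTLP_ENDPOINT
  - name: CUBEAPM_API_KEY
    valueFrom:
      secretKeyRef:
        name: cubeapm-credentials
        key: CUBEAPM_API_KEY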

Step 4 — One-Line Helm Installs for kube-state-metrics (if missing)

Bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update && helm install kube-state-metrics prometheus-community/kube-state-metrics -n kube-system --create-namespace

 

Step 5 — Verification (What You Should See in CubeAPM)

After a few minutes of ingestion, validate that OOMKilled signals are flowing in:

  • Events timeline: OOMKilled and Evicted events aligned with workload or node pressure. 
  • Memory graphs: Pods showing usage climbing past their configured limits. 
  • Restart counts: From kube-state-metrics, affected pods restarting with reason=OOMKilled. 
  • Logs: Kernel messages like “Out of memory: Killed process <pid> (<process-name>)”. 
  • Rollout context: If triggered by a deployment, ReplicaSet changes appear in the same timeline. 

Example Alert Rules for OOMKilled in Kubernetes Pods

1. Pods Restarting Due to OOMKilled

Repeated restarts are the clearest sign of a workload that keeps breaching memory limits. The restart counter itself carries no reason label, so the rule joins it with kube_pod_container_status_last_terminated_reason from kube-state-metrics. Tracking restarts over a short window helps teams intervene before services spiral into instability.

PromQL
increase(kube_pod_container_status_restarts_total[5m]) > 3
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
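
If you also keep alerts as Prometheus-style rule files (CubeAPM’s own alert configuration may differ, so treat this purely as a packaging sketch), the expression above can be wrapped like this; the group name, alert name, and annotations are placeholders:

YAML
groups:
  - name: kubernetes-oomkilled                 # placeholder group name
    rules:
      - alert: PodOOMKilledRestarts            # placeholder alert name
        expr: |
          increase(kube_pod_container_status_restarts_total[5m]) > 3
          and on (namespace, pod, container)
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is being OOMKilled repeatedly"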

 

2. Pods Nearing Memory Limits

Instead of waiting for a crash, it’s better to alert when a container’s working-set memory (what the kernel and kubelet consider for OOM decisions) approaches its configured limit. This gives engineers time to scale, optimize, or adjust limits before the kernel kills the process.

PromQL
(container_memory_working_set_bytes{container!=""} / container_spec_memory_limit_bytes{container!=""}) > 0.9
and container_spec_memory_limit_bytes{container!=""} > 0

 

3. Nodes Under Memory Pressure

OOMKilled can also be triggered at the node level when memory is exhausted across multiple pods. This rule alerts when the kubelet reports MemoryPressure, signaling that the entire node is at risk.

PromQL
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1

 

4. High Cluster-Wide OOMKilled Events

A single OOMKilled may not warrant escalation, but a spike across many pods in a short time window usually means misconfigured workloads or bad deployments. This alert helps catch widespread failures early.

PromQL
sum(
  increase(kube_pod_container_status_restarts_total[10m])
  and on (namespace, pod, container)
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
) > 10

Conclusion

OOMKilled errors are one of the most disruptive issues in Kubernetes, as they silently restart containers when memory usage exceeds defined limits. Left unchecked, they can break user sessions, cause data loss in stateful workloads, and trigger cascading outages across clusters.

The key to preventing them lies in a mix of right-sizing memory, fixing leaks in application code, and scaling workloads effectively. But manual troubleshooting with kubectl commands is not enough when running dozens or hundreds of services in production.

With CubeAPM, teams gain end-to-end visibility into pod memory metrics, OOMKilled events, and node pressure. By correlating logs, traces, and events in one place, CubeAPM helps engineers pinpoint root causes faster and stop repeat crashes. Start monitoring OOMKilled errors today with CubeAPM to keep your Kubernetes workloads stable, resilient, and cost-efficient.

 

FAQs

1. What does OOMKilled mean in Kubernetes?

OOMKilled means a container was terminated because it exceeded its memory limit. The Linux kernel’s Out-Of-Memory (OOM) killer stops the process to protect the node, and Kubernetes marks the container with the reason “OOMKilled.” With CubeAPM, these events are automatically captured and visualized, so you can quickly see which pod was affected and why.

2. How do I check if a pod was OOMKilled?

You can check by describing the pod and reviewing Kubernetes events, which will show the reason as OOMKilled. CubeAPM simplifies this by pulling events, restart counts, and container logs into a single timeline, making it clear when and where the termination happened.

3. What are the most common causes of OOMKilled errors?

The most common causes are low memory limits, application memory leaks, unexpected traffic spikes, or node-level memory pressure. CubeAPM helps identify which of these factors is responsible by correlating memory metrics with deployments, scaling actions, and node health.

4. How do I fix OOMKilled errors in Kubernetes?

You can fix OOMKilled errors by adjusting memory requests and limits, profiling applications for leaks, and enabling autoscaling based on memory usage. With CubeAPM monitoring, you get early warnings when pods approach their limits, helping you fix issues before they trigger restarts.

5. How does CubeAPM help monitor OOMKilled errors?

CubeAPM continuously collects pod memory metrics, OOMKilled events, and container logs. It correlates them with deployment rollouts and node pressure, giving engineers clear timelines of why a pod was killed. With prebuilt dashboards and smart alerts, CubeAPM helps teams detect issues early, reduce downtime, and prevent repeat crashes.
