
Kubernetes 502 Bad Gateway Error Explained: Upstream Failures, Pod Crashes, and Network Timeouts

Author: | Published: October 6, 2025 | Kubernetes Errors

The Kubernetes 502 Bad Gateway error occurs when a request passes through an ingress or gateway (like NGINX or Envoy) but the upstream Pod returns an invalid or empty response. Even short-lived 502s can ripple through microservices—breaking APIs, frontends, and downstream calls—and with major outages costing companies a median of $2M per hour, minimizing these errors is critical.

CubeAPM enables teams to trace Kubernetes 502 Bad Gateway errors end-to-end by correlating ingress spikes with Pod restarts, container logs, and rollout events. This makes it clear whether the issue stems from failing upstream Pods, readiness probe misfires, or misrouted Services—cutting resolution time from hours to minutes.

In this guide, we’ll explain what the Kubernetes 502 Bad Gateway error is, why it happens, how to fix it, and—critically—how CubeAPM can help you detect, correlate, and resolve it faster. 

What is Kubernetes ‘502 Bad Gateway’ Error


Kubernetes 502 Bad Gateway error means that an ingress or gateway component (such as NGINX Ingress Controller, Envoy, or HAProxy) successfully forwarded a client request, but the upstream Pod or Service failed to respond with a valid payload. Instead, the gateway receives an invalid, empty, or incomplete response, which is then surfaced to the client as a 502 error.

This issue typically arises in distributed microservice setups where traffic must pass through multiple layers—Ingress, Services, and Pods—before reaching an application container. Any failure along this chain (such as a Pod crash, readiness probe failure, or misrouted Service) can cause the gateway to return a 502.
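
To make that chain concrete, here is a minimal sketch of the layers a request crosses; the names and ports (demo-app, 8080) are placeholders rather than values from a real cluster. A 502 can surface whenever the next hop has no healthy target, for example when the Service selector matches no Pods or targetPort points at the wrong container port.

YAML
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-app          # must match the Service below
                port:
                  number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app                       # must match the Pod labels
  ports:
    - port: 80
      targetPort: 8080                  # must match the container port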

Key characteristics of Kubernetes 502 Bad Gateway error:

  • Intermittent or sudden: May appear only under load spikes or during Pod restarts
  • Ingress-level visibility: Error is surfaced by NGINX, Envoy, or HAProxy, not the Pod itself
  • Tied to upstream health: Often linked to failing readiness probes, crashes, or timeouts
  • Cascading effect: Can propagate across dependent services if not resolved quickly
  • Difficult to isolate: Root cause is often buried in Pod logs or rollout history

Why Kubernetes ‘502 Bad Gateway’ Error Happens

1. Readiness probe failures

A readiness probe failure often leads to the Kubernetes 502 Bad Gateway error. When readiness probes are misconfigured, Kubernetes may mark a Pod as “Ready” before the application is actually prepared to handle traffic; when they are too strict, Pods never become Ready at all and the Service is left with no healthy endpoints. Requests routed in either state fail because the container process is still warming up, dependencies aren’t available, or there is simply nothing to route to. Over time, this leads to repeated 502 responses that appear randomly, especially after rollouts or scaling events.

Quick check:

Bash
kubectl describe pod <pod-name>

 

What to look for: Events showing repeated Readiness probe failed messages, indicating the Pod was marked as ready too soon.

2. Pod crashes or restarts

Containers that crash mid-request leave the ingress controller waiting on a response that never arrives. During this downtime, Kubernetes attempts to restart the Pod, but until it’s healthy again, all traffic routed there results in 502 errors. This pattern is common with memory leaks, OOMKilled events, or application-level exceptions.

Quick check:

Bash
kubectl get pods

 

What to look for: Pods with a high RESTARTS count, which signals instability and repeated container crashes.

3. Service misconfiguration

If a Service selector doesn’t align with Pod labels, the Service ends up with no endpoints. This misconfiguration is surprisingly easy to miss during deployments, especially when labels or selectors change in manifests. The ingress continues forwarding traffic, but with no upstream Pod, every request returns a 502 error until the mismatch is fixed.

Quick check:

Bash
kubectl describe svc <service-name>

 

What to look for: Endpoints: <none> or empty endpoint lists, showing the Service isn’t routing to any Pods.

4. Network policies blocking traffic

Strict or misapplied NetworkPolicy rules can unintentionally block ingress-to-Pod communication. From the cluster’s perspective, Pods may look perfectly healthy, but the gateway cannot reach them. This disconnect causes clients to see 502 responses even when application logs show no errors, making it a tricky issue to diagnose without looking at policies.

Quick check:

Bash
kubectl describe networkpolicy

 

What to look for: Rules that omit or block ingress traffic from the ingress namespace to backend Pods.

5. Backend timeouts under load

When Pods are overloaded—due to insufficient resources, exhausted connection pools, or slow query handling—they may fail to respond within expected timeouts. The ingress controller, waiting for an upstream reply, eventually terminates the request and surfaces it as a 502. This problem usually spikes during high traffic periods when scaling can’t keep up.

Quick check:

Bash
kubectl logs -n ingress-nginx <controller-pod>

 

What to look for: upstream timed out messages that appear during traffic peaks, confirming backend slowness.

6. TLS/SSL handshake issues

If the ingress and backend Pods use mismatched TLS versions, unsupported ciphers, or expired certificates, handshakes between them fail. These failed attempts never establish a proper connection, so the ingress surfaces them as 502 errors. This issue often happens in environments where custom certificates are rotated manually or services enforce stricter TLS settings than the ingress.

Quick check:

Bash
openssl s_client -connect <pod-ip>:<port>

 

What to look for: Handshake failure logs in the ingress or certificate expiry/mismatch errors from the openssl command.

7. Rolling updates without surge capacity

During rolling updates, Kubernetes may shut down too many old Pods before the new ones are ready if the deployment strategy isn’t tuned. For a short period, there are no valid endpoints available, and the ingress returns 502 responses. While this usually resolves once new Pods come online, it creates noticeable outages if surge and availability settings are misconfigured.

Quick check:

Bash
kubectl rollout status deployment <name>

 

What to look for: Gaps in availability where no Pods are running during rollout, especially with low maxSurge or high maxUnavailable values.

How to Fix Kubernetes ‘502 Bad Gateway’ Error

1. Fix readiness probe misconfigurations

To resolve a Kubernetes 502 Bad Gateway error, the readiness probe must reflect when the app can actually serve traffic; premature “Ready” states route requests too early and trigger 502s.

Quick check:

Bash
kubectl describe pod <pod-name>

 

Fix:

Bash
kubectl patch deployment <deploy-name> -n <ns> --type='json' -p='[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe","value":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":15,"periodSeconds":5,"timeoutSeconds":2,"failureThreshold":6,"successThreshold":1}}]'
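
For reference, the same probe expressed declaratively inside the Deployment’s Pod template (spec.template.spec.containers). This is a sketch: the /healthz path and port 8080 mirror the patch above and should be replaced with your application’s real health endpoint and startup timing.

YAML
containers:
  - name: app
    image: <your-image>        # placeholder
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz         # a path the app serves only when it can take traffic
        port: 8080
      initialDelaySeconds: 15  # allow warm-up before the first check
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 6
      successThreshold: 1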

 

2. Stabilize crashing Pods

Frequent Pod crashes are one of the main causes of Kubernetes ingress 502 errors. When a container restarts mid-request, the ingress layer receives no valid upstream response, which surfaces as a Kubernetes 502 Bad Gateway error.

Quick check:

Bash
kubectl get pods -n <ns>

 

Fix:

Bash
kubectl set resources deployment <deploy-name> -n <ns> --limits=cpu=1000m,memory=1Gi --requests=cpu=300m,memory=512Mi
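
The equivalent resources block inside the Pod template, using the same values as the command above. These numbers are starting points, not recommendations; size requests and limits from your own usage metrics so the container is neither throttled nor OOMKilled.

YAML
containers:
  - name: app
    image: <your-image>   # placeholder
    resources:
      requests:
        cpu: 300m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1Gi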

 

3. Correct Service selectors

Service misconfigurations are another common root cause of the Kubernetes 502 Bad Gateway error: when Service selectors don’t match the labels on running Pods, the Service has no valid endpoints, and the ingress forwards traffic into a void.

Quick check:

Bash
kubectl describe svc <service-name> -n <ns>

 

Fix:

Bash
kubectl patch svc <service-name> -n <ns> --type='merge' -p='{"spec":{"selector":{"app":"<pod-label-app>","tier":"<pod-label-tier>"}}}'
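
For clarity, here is what the corrected Service looks like in manifest form. The selector keys and values are placeholders; the only rule that matters is that they match the Pod template’s metadata.labels exactly, and that targetPort matches the container’s listening port.

YAML
apiVersion: v1
kind: Service
metadata:
  name: <service-name>
  namespace: <ns>
spec:
  selector:
    app: <pod-label-app>     # must equal the Pod label "app"
    tier: <pod-label-tier>   # must equal the Pod label "tier"
  ports:
    - port: 80
      targetPort: 8080       # must match the container's listening port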

 

4. Adjust NetworkPolicy rules

Over-restrictive policies can block ingress-to-Pod traffic, making healthy Pods unreachable.

Quick check:

Bash
kubectl describe networkpolicy -n <ns>

 

Fix:

Bash
kubectl patch networkpolicy <np-name> -n <ns> --type='merge' -p='{"spec":{"ingress":[{"from":[{"namespaceSelector":{"matchLabels":{"kubernetes.io/metadata.name":"ingress-nginx"}}}],"ports":[{"protocol":"TCP","port":80},{"protocol":"TCP","port":443}]}]}}'
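
If you manage policies declaratively, this sketch expresses the same rule as a full NetworkPolicy. The podSelector label is an assumption about how your backend Pods are labeled; the namespace label kubernetes.io/metadata.name: ingress-nginx matches a default ingress-nginx installation.

YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-backend
  namespace: <ns>
spec:
  podSelector:
    matchLabels:
      app: <backend-app>     # placeholder: the Pods behind the Service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443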

 

5. Resolve backend timeouts

When backends are slow or overloaded, upstream timeouts at the ingress manifest as 502s.

Quick check:

Bash
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep -i "upstream timed out"

 

Fix:

Bash
kubectl annotate ingress <ingress-name> -n <ns> nginx.ingress.kubernetes.io/proxy-read-timeout="60" nginx.ingress.kubernetes.io/proxy-send-timeout="60" --overwrite
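
The same timeouts as they appear in the Ingress manifest itself, for teams that keep annotations in version control. Sixty seconds is an example value; raising timeouts buys headroom but does not fix slow backends, so pair it with scaling or query tuning.

YAML
metadata:
  name: <ingress-name>
  namespace: <ns>
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"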

 

6. Fix TLS/SSL handshake problems

Protocol or certificate mismatches break the ingress↔backend handshake and bubble up as 502s.

Quick check:

Bash
openssl s_client -connect <pod-ip>:<port> -servername <svc-hostname>

 

Fix:

Bash
kubectl annotate ingress <ingress-name> -n <ns> nginx.ingress.kubernetes.io/ssl-protocols="TLSv1.2 TLSv1.3" nginx.ingress.kubernetes.io/ssl-prefer-server-ciphers="true" --overwrite
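
The matching annotations in manifest form. If the backend terminates TLS itself, you may also need the backend-protocol annotation shown commented out; that depends on your setup and is an assumption, not a requirement for every cluster.

YAML
metadata:
  name: <ingress-name>
  namespace: <ns>
  annotations:
    nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
    nginx.ingress.kubernetes.io/ssl-prefer-server-ciphers: "true"
    # nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"   # only if the backend serves HTTPS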

 

7. Tune rolling update strategy

If too many old Pods terminate before new Pods are ready, the ingress has no upstreams and clients see 502s mid-rollout.

Quick check:

Bash
kubectl rollout status deployment <deploy-name> -n <ns>

 

Fix:

Bash
kubectl patch deployment <deploy-name> -n <ns> --type='merge' -p='{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'
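
In the Deployment manifest, the same strategy looks like this. With maxUnavailable: 0 and maxSurge: 1, Kubernetes brings one new Pod up and waits for it to become Ready before removing an old one, so the ingress always has at least one healthy endpoint during the rollout.

YAML
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # add at most one extra Pod during the update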

 

Monitoring Kubernetes ‘502 Bad Gateway’ Error with CubeAPM

Fastest path to root cause: CubeAPM enables you to correlate ingress-layer 502 spikes with Pod logs, container restarts, Kubernetes events, and rollout history in one view. For 502s, the four signals you’ll lean on are: Events (readiness flaps, rollout gaps), Metrics (ingress and pod health), Logs (ingress controller + app errors), and Rollouts (deployment strategy & surge gaps).

Step 1 — Install CubeAPM (Helm)

Bash
helm repo add cubeapm https://charts.cubeapm.com && helm repo update cubeapm && helm show values cubeapm/cubeapm > values.yaml && helm install cubeapm cubeapm/cubeapm -f values.yaml

 

If already installed:

Bash
helm upgrade cubeapm cubeapm/cubeapm -f values.yaml

 

Step 2 — Deploy the OpenTelemetry Collector (DaemonSet + Deployment)

Run both modes:

  • DaemonSet → per-node, captures pod logs and kubelet/host metrics
  • Deployment → cluster-level events and metrics
Bash
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts && helm repo update open-telemetry

 

Bash
helm install otel-collector-daemonset open-telemetry/opentelemetry-collector -f otel-collector-daemonset.yaml && helm install otel-collector-deployment open-telemetry/opentelemetry-collector -f otel-collector-deployment.yaml

 

Step 3 — Collector Configs Focused on 502 Bad Gateway

  1. DaemonSet (logs + kubelet + host metrics):
YAML
# otel-collector-daemonset.yaml
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib
presets:
  kubernetesAttributes: { enabled: true }
  hostMetrics: { enabled: true }
  kubeletMetrics: { enabled: true }
  logsCollection: { enabled: true, storeCheckpoints: true }
config:
  exporters:
    otlphttp/metrics:
      metrics_endpoint: http://<cubeapm_endpoint>:3130/api/metrics/v1/save/otlp
      retry_on_failure: { enabled: false }
    otlphttp/logs:
      logs_endpoint: http://<cubeapm_endpoint>:3130/api/logs/insert/opentelemetry/v1/logs
      headers: { Cube-Stream-Fields: "k8s.namespace.name,k8s.deployment.name" }
    otlp/traces:
      endpoint: <cubeapm_endpoint>:4317
      tls: { insecure: true }
  processors:
    batch: {}
    resourcedetection: { detectors: ["system"] }
  receivers:
    otlp: { protocols: { grpc: {}, http: {} } }
    kubeletstats:
      collection_interval: 60s
      metric_groups: [container, node, pod]
    hostmetrics:
      collection_interval: 60s
      scrapers: { cpu: {}, memory: {}, network: {} }
service:
  pipelines:
    metrics: { receivers: [hostmetrics, kubeletstats], processors: [batch], exporters: [otlphttp/metrics] }
    logs:    { receivers: [otlp], processors: [batch], exporters: [otlphttp/logs] }
    traces:  { receivers: [otlp], processors: [batch], exporters: [otlp/traces] }

 

  • logsCollection → captures ingress/app logs with 502 lines
  • kubeletstats → surfaces Pod restarts tied to 502s
  • hostmetrics → CPU/mem/network saturation that correlates to upstream timeouts
  2. Deployment (cluster events + metrics):
YAML
# otel-collector-deployment.yaml
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib
presets:
  kubernetesEvents: { enabled: true }
  clusterMetrics: { enabled: true }
config:
  exporters:
    otlphttp/metrics:
      metrics_endpoint: http://<cubeapm_endpoint>:3130/api/metrics/v1/save/otlp
    otlphttp/k8s-events:
      logs_endpoint: http://<cubeapm_endpoint>:3130/api/logs/insert/opentelemetry/v1/logs
      headers: { Cube-Stream-Fields: event.domain }
  processors:
    batch: {}
  receivers:
    k8s_cluster:
      collection_interval: 60s
      metrics:
        k8s.node.condition: { enabled: true }
service:
  pipelines:
    metrics: { receivers: [k8s_cluster], processors: [batch], exporters: [otlphttp/metrics] }
    logs:    { receivers: [k8sobjects], processors: [batch], exporters: [otlphttp/k8s-events] }

 

  • kubernetesEvents → readiness flaps, rollout gaps
  • clusterMetrics → node health, scheduling context
  • event logs → timeline of probe failures alongside 502 spikes

Step 4 — Supporting Components

If you want ingress metrics directly, add Prometheus scraping:

YAML
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ingress"
          static_configs:
            - targets: ["ingress-nginx-controller.ingress-nginx.svc.cluster.local:10254"]
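
The receiver only takes effect once it is wired into a metrics pipeline. A sketch of the matching service section is below, reusing the exporter and receiver names from the DaemonSet config above; it also assumes the ingress-nginx controller has its metrics endpoint (port 10254) enabled.

YAML
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, kubeletstats, prometheus]
      processors: [batch]
      exporters: [otlphttp/metrics]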

 

Step 5 — Verification in CubeAPM

  • Events: Readiness probe failed, rollout start/stop, Pod crash events
  • Metrics: CPU/memory pressure near 502 spikes, ingress request error rates
  • Logs: Ingress logs with 502 or upstream timeout entries, app logs for backend errors
  • Restarts: Containers in crash loop or restarting during load
  • Rollouts: Deployment gaps where Pods dropped before new ones came online

Example Alert Rules for Kubernetes ‘502 Bad Gateway’ Error

1. Spike in ingress 502 responses

A sudden increase in 502 responses from the ingress controller usually means backend Pods are failing, unreachable, or timing out. This is often the first external signal customers see when services are unhealthy. Catching this spike early allows teams to investigate Pod readiness, network policies, or resource bottlenecks before downtime escalates.

YAML
- alert: High502ErrorRate
  expr: sum(rate(nginx_ingress_controller_requests{status="502"}[5m])) by (namespace) > 5
  for: 2m
  labels: { severity: critical }
  annotations:
    summary: "High 502 error rate detected in namespace"

 

2. No healthy endpoints in a Service

If a Service has no available endpoints, traffic routed through it will always fail, resulting in 502s at the ingress layer. This usually happens due to label mismatches or Pods failing readiness checks. Monitoring for Services without endpoints helps catch silent configuration issues before they cause widespread client errors.

YAML
- alert: ServiceNoEndpoints
  expr: kube_endpoint_address_available == 0
  for: 1m
  labels: { severity: warning }
  annotations:
    summary: "Service has no endpoints available"

 

3. Backend Pod restarts during traffic

When Pods restart frequently, they often drop connections mid-request, producing incomplete or failed responses. The ingress controller surfaces these as 502 errors because the upstream suddenly goes offline. Tracking restart frequency ensures unstable Pods are flagged before they impact user-facing traffic.

YAML
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Pod restarting frequently, may cause 502s"

 

4. Pods not becoming Ready (readiness probe trouble)

If Pods stay NotReady after rollout or scale-up, Services will route to backends that can’t serve traffic, surfacing 502s at the ingress layer. This often points to incorrect probe paths/thresholds, slow startups, or missing dependencies. Alerting on stuck NotReady Pods gives you time to fix probes or add warm-up delays before customers feel it.

YAML
- alert: PodsStuckNotReady
  expr: count(kube_pod_status_ready{condition="true"} == 0) by (namespace) > 0
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "One or more Pods remain NotReady, downstream 502s likely"

5. Rollout gaps (no surge capacity)

During a deployment, if too many old Pods terminate before new ones are Ready, the ingress temporarily has no upstream endpoints, causing waves of 502 responses. This alert catches rollout windows where availability drops below the intended capacity, usually due to an aggressive maxUnavailable or too-low maxSurge. Tune the strategy before pushing large releases.

YAML
- alert: DeploymentReplicasUnavailable
  expr: sum(kube_deployment_status_replicas_unavailable) by (namespace, deployment) > 0
  for: 3m
  labels: { severity: critical }
  annotations:
    summary: "Deployment has unavailable replicas during rollout; risk of 502s"

 

Conclusion

The Kubernetes 502 Bad Gateway error is more than just a failed HTTP response—it’s a sign that something deeper in your cluster isn’t aligned. Whether it’s misconfigured readiness probes, Pod crashes, missing Service endpoints, or rollout gaps, the result is the same: traffic drops, APIs fail, and users experience broken workflows.

The fixes require a systematic approach: validating probes, stabilizing Pods, checking Services, and tightening NetworkPolicies. With alert rules in place, you can catch 502 patterns before they turn into widespread outages.

CubeAPM makes this process faster by tying together events, metrics, logs, and rollout history into one unified view. Instead of chasing 502s through multiple tools, teams get clear visibility into root cause and resolution paths—reducing downtime and keeping Kubernetes workloads resilient at scale.
