A Kubernetes 502 Bad Gateway error occurs when a request passes through an ingress or gateway (such as NGINX or Envoy) but the upstream Pod returns an invalid or empty response. Even short-lived 502s can ripple through microservices, breaking APIs, frontends, and downstream calls; with major outages costing companies a median of $2M per hour, minimizing these errors is critical.
CubeAPM enables teams to trace Kubernetes 502 Bad Gateway errors end-to-end by correlating ingress spikes with Pod restarts, container logs, and rollout events. This makes it clear whether the issue stems from failing upstream Pods, readiness probe misfires, or misrouted Services—cutting resolution time from hours to minutes.
In this guide, we’ll explain what the Kubernetes 502 Bad Gateway error is, why it happens, how to fix it, and—critically—how CubeAPM can help you detect, correlate, and resolve it faster.
What is Kubernetes ‘502 Bad Gateway’ Error
A Kubernetes 502 Bad Gateway error means that an ingress or gateway component (such as NGINX Ingress Controller, Envoy, or HAProxy) successfully forwarded a client request, but the upstream Pod or Service failed to respond with a valid payload. Instead, the gateway receives an invalid, empty, or incomplete response, which is then surfaced to the client as a 502 error.
This issue typically arises in distributed microservice setups where traffic must pass through multiple layers—Ingress, Services, and Pods—before reaching an application container. Any failure along this chain (such as a Pod crash, readiness probe failure, or misrouted Service) can cause the gateway to return a 502.
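To make the chain concrete, here is a minimal, hypothetical sketch of the three layers a request crosses; the names (web, web-svc, web-ingress), image, and ports are placeholders, and the point is that the Ingress backend, the Service selector/ports, and the Pod labels/containerPort must all line up:
# Hypothetical manifests showing the Ingress -> Service -> Pod chain a 502 travels through
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc          # must match the Service name
                port:
                  number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: web                           # must match the Pod labels
  ports:
    - port: 80
      targetPort: 8080                 # must match the containerPort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: <your-app-image>      # placeholder image
          ports:
            - containerPort: 8080
A break at any of these links, such as a selector that matches no Pods or a targetPort the container does not listen on, is enough to surface as errors at the edge.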
Key characteristics of Kubernetes 502 Bad Gateway error:
- Intermittent or sudden: May appear only under load spikes or during Pod restarts
- Ingress-level visibility: Error is surfaced by NGINX, Envoy, or HAProxy, not the Pod itself
- Tied to upstream health: Often linked to failing readiness probes, crashes, or timeouts
- Cascading effect: Can propagate across dependent services if not resolved quickly
- Difficult to isolate: Root cause is often buried in Pod logs or rollout history
Why Kubernetes ‘502 Bad Gateway’ Error Happens
1. Readiness probe failures
A readiness probe failure often leads to the Kubernetes 502 Bad Gateway error. When readiness probes are misconfigured (for example, probing an endpoint that responds before dependencies are up, or using thresholds that pass too early), Kubernetes may mark a Pod as “Ready” before the application is actually prepared to handle traffic. Requests routed at this stage often fail because the container process is still warming up or dependencies aren’t available. Over time, this leads to repeated 502 responses that appear randomly, especially after rollouts or scaling events.
Quick check:
kubectl describe pod <pod-name>
What to look for: Events showing repeated Readiness probe failed messages, indicating the Pod was marked as ready too soon.
2. Pod crashes or restarts
Containers that crash mid-request leave the ingress controller waiting on a response that never arrives. During this downtime, Kubernetes attempts to restart the Pod, but until it’s healthy again, all traffic routed there results in 502 errors. This pattern is common with memory leaks, OOMKilled events, or application-level exceptions.
Quick check:
kubectl get pods
What to look for: Pods with a high RESTARTS count, which signals instability and repeated container crashes.
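To dig one level deeper, it can help to check why the container last terminated; this is a generic kubectl query (pod name and namespace are placeholders), and a reason such as OOMKilled points straight at memory limits:
kubectl get pod <pod-name> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'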
3. Service misconfiguration
If a Service selector doesn’t align with Pod labels, the Service ends up with no endpoints. This misconfiguration is surprisingly easy to miss during deployments, especially when labels or selectors change in manifests. The ingress continues forwarding traffic, but with no upstream Pod, every request returns a 502 error until the mismatch is fixed.
Quick check:
kubectl describe svc <service-name>
What to look for: Endpoints: <none> or empty endpoint lists, showing the Service isn’t routing to any Pods.
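A quick way to confirm the mismatch is to put the Service selector and the Pod labels side by side (names and namespace are placeholders); the selector printed by the first command must appear in the labels printed by the second:
kubectl get svc <service-name> -n <ns> -o jsonpath='{.spec.selector}'
kubectl get pods -n <ns> --show-labels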
4. Network policies blocking traffic
Strict or misapplied NetworkPolicy rules can unintentionally block ingress-to-Pod communication. From the cluster’s perspective, Pods may look perfectly healthy, but the gateway cannot reach them. This disconnect causes clients to see 502 responses even when application logs show no errors, making it a tricky issue to diagnose without looking at policies.
Quick check:
kubectl describe networkpolicy
What to look for: Rules that omit or block ingress traffic from the ingress namespace to backend Pods.
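One way to test the path itself, assuming your policy allows traffic by namespace, is to curl a backend Pod IP from a throwaway client Pod launched in the ingress namespace; the namespace, image, Pod IP, and port below are placeholders to adapt:
kubectl run np-test --rm -it --restart=Never -n ingress-nginx --image=curlimages/curl --command -- curl -sv --max-time 5 http://<pod-ip>:<port>/
If this times out while the same curl from inside the backend’s own namespace succeeds, a NetworkPolicy (or CNI rule) is the likely culprit.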
5. Backend timeouts under load
When Pods are overloaded—due to insufficient resources, exhausted connection pools, or slow query handling—they may fail to respond within expected timeouts. The ingress controller, waiting for an upstream reply, eventually terminates the request and surfaces it as a 502. This problem usually spikes during high traffic periods when scaling can’t keep up.
Quick check:
kubectl logs -n ingress-nginx <controller-pod>
What to look for: upstream timeout messages that appear during traffic peaks, confirming backend slowness.
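It is also worth checking live resource usage around the time of the spike; this requires metrics-server to be installed in the cluster, and the namespace is a placeholder:
kubectl top pods -n <ns> --sort-by=cpu
kubectl top pods -n <ns> --sort-by=memory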
6. TLS/SSL handshake issues
If the ingress and backend Pods use mismatched TLS versions, unsupported ciphers, or expired certificates, handshakes between them fail. These failed attempts never establish a proper connection, so the ingress surfaces them as 502 errors. This issue often happens in environments where custom certificates are rotated manually or services enforce stricter TLS settings than the ingress.
Quick check:
openssl s_client -connect <pod-ip>:<port>
What to look for: Handshake failure logs in the ingress or certificate expiry/mismatch errors from the openssl command.
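To check certificate validity dates specifically, you can pipe the handshake output through openssl x509; host, port, and hostname are placeholders:
openssl s_client -connect <pod-ip>:<port> -servername <hostname> </dev/null 2>/dev/null | openssl x509 -noout -subject -dates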
7. Rolling updates without surge capacity
During rolling updates, Kubernetes may shut down too many old Pods before the new ones are ready if the deployment strategy isn’t tuned. For a short period, there are no valid endpoints available, and the ingress returns 502 responses. While this usually resolves once new Pods come online, it creates noticeable outages if surge and availability settings are misconfigured.
Quick check:
kubectl rollout status deployment <name>
What to look for: Gaps in availability where no Pods are running during rollout, especially with low maxSurge or high maxUnavailable values.
How to Fix Kubernetes ‘502 Bad Gateway’ Error
1. Fix readiness probe misconfigurations
To resolve a Kubernetes 502 Bad Gateway error, the readiness probe must reflect when the app can actually serve traffic; a premature “Ready” state routes requests too early and triggers 502s.
Quick check:
kubectl describe pod <pod-name>
Fix:
kubectl patch deployment <deploy-name> -n <ns> --type='json' -p='[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe","value":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":15,"periodSeconds":5,"timeoutSeconds":2,"failureThreshold":6,"successThreshold":1}}]'
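If you manage manifests declaratively (for example via GitOps), the same probe belongs in the Deployment spec rather than a patch; this is an illustrative excerpt, and the /healthz path and port 8080 are assumptions to replace with your app’s real health endpoint:
# Deployment spec excerpt (illustrative values)
spec:
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 6
            successThreshold: 1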
2. Stabilize crashing Pods
Frequent Pod crashes are one of the main causes of Kubernetes ingress 502 errors. When a container restarts mid-request, the ingress layer receives no valid upstream response, which surfaces as a Kubernetes 502 Bad Gateway error.
Quick check:
kubectl get pods -n <ns>
Fix:
kubectl set resources deployment <deploy-name> -n <ns> --limits=cpu=1000m,memory=1Gi --requests=cpu=300m,memory=512Mi
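The declarative equivalent in the Deployment manifest looks like the excerpt below; the values mirror the command above and should be sized from your app’s real usage, not copied blindly:
# Deployment spec excerpt (illustrative values)
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: 300m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi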
3. Correct Service selectors
Service misconfiguration is another common root cause of the Kubernetes 502 Bad Gateway error: when Service selectors don’t match the labels on running Pods, the Service has no valid endpoints, and the ingress forwards traffic into a void.
Quick check:
kubectl describe svc <service-name> -n <ns>
Fix:
kubectl patch svc <service-name> -n <ns> --type='merge' -p='{"spec":{"selector":{"app":"<pod-label-app>","tier":"<pod-label-tier>"}}}'
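In manifest form, the fix is simply keeping the Service selector in lockstep with the Pod template labels; the app/tier values here are placeholders:
# Service spec excerpt (placeholder labels)
spec:
  selector:
    app: <pod-label-app>
    tier: <pod-label-tier>
Then verify that endpoints are populated again:
kubectl get endpoints <service-name> -n <ns>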
4. Adjust NetworkPolicy rules
Over-restrictive policies can block ingress-to-Pod traffic, making healthy Pods unreachable.
Quick check:
kubectl describe networkpolicy -n <ns>
Fix:
kubectl patch networkpolicy <np-name> -n <ns> --type='merge' -p='{"spec":{"ingress":[{"from":[{"namespaceSelector":{"matchLabels":{"kubernetes.io/metadata.name":"ingress-nginx"}}}],"ports":[{"protocol":"TCP","port":80},{"protocol":"TCP","port":443}]}]}}'
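The same rule as a standalone manifest, assuming the controller runs in the ingress-nginx namespace (which carries the kubernetes.io/metadata.name label automatically on recent Kubernetes versions) and that your backends listen on ports 80/443:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: <ns>
spec:
  podSelector: {}                # applies to all Pods in <ns>; narrow with labels if needed
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443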
5. Resolve backend timeouts
When backends are slow or overloaded, upstream timeouts at the ingress manifest as 502s.
Quick check:
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep -i "upstream timeout"
Fix:
kubectl annotate ingress <ingress-name> -n <ns> nginx.ingress.kubernetes.io/proxy-read-timeout="60" nginx.ingress.kubernetes.io/proxy-send-timeout="60" --overwrite
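The same annotations can live in the Ingress manifest; 60 seconds mirrors the command above and should reflect your slowest legitimate backend response. Raising timeouts only buys headroom, so pair it with scaling or query fixes:
# Ingress excerpt (NGINX Ingress Controller annotations)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"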
6. Fix TLS/SSL handshake problems
Protocol or certificate mismatches break the ingress↔backend handshake and bubble up as 502s.
Quick check:
openssl s_client -connect <pod-ip>:<port> -servername <svc-hostname>
Fix:
kubectl annotate ingress <ingress-name> -n <ns> nginx.ingress.kubernetes.io/backend-protocol="HTTPS" --overwrite
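The backend-protocol annotation tells the NGINX Ingress Controller to speak HTTPS to the backend instead of plain HTTP; apply it only if the Pod actually serves TLS. If the root cause is an expired or rotated certificate, renewing the TLS secret that the ingress (or the backend) references usually clears the 502s; secret name, namespace, and file paths below are placeholders:
kubectl create secret tls <tls-secret-name> -n <ns> --cert=<path-to-tls.crt> --key=<path-to-tls.key> --dry-run=client -o yaml | kubectl apply -f -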
7. Tune rolling update strategy
If too many old Pods terminate before new Pods are ready, the ingress has no upstreams and clients see 502s mid-rollout.
Quick check:
kubectl rollout status deployment <deploy-name> -n <ns>
Fix:
kubectl patch deployment <deploy-name> -n <ns> --type='merge' -p='{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'
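Declaratively, the same strategy sits in the Deployment spec; zero unavailable Pods plus one surge Pod keeps at least the full replica count serving during every rollout:
# Deployment spec excerpt
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1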
Monitoring Kubernetes ‘502 Bad Gateway’ Error with CubeAPM
Fastest path to root cause: CubeAPM enables you to correlate ingress-layer 502 spikes with Pod logs, container restarts, Kubernetes events, and rollout history in one view. For 502s, the four signals you’ll lean on are: Events (readiness flaps, rollout gaps), Metrics (ingress and pod health), Logs (ingress controller + app errors), and Rollouts (deployment strategy & surge gaps).
Step 1 — Install CubeAPM (Helm)
helm repo add cubeapm https://charts.cubeapm.com && helm repo update cubeapm && helm show values cubeapm/cubeapm > values.yaml && helm install cubeapm cubeapm/cubeapm -f values.yaml
If already installed:
helm upgrade cubeapm cubeapm/cubeapm -f values.yaml
Step 2 — Deploy the OpenTelemetry Collector (DaemonSet + Deployment)
Run both modes:
- DaemonSet → per-node, captures pod logs and kubelet/host metrics
- Deployment → cluster-level events and metrics
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts && helm repo update open-telemetry
helm install otel-collector-daemonset open-telemetry/opentelemetry-collector -f otel-collector-daemonset.yaml && helm install otel-collector-deployment open-telemetry/opentelemetry-collector -f otel-collector-deployment.yaml
Step 3 — Collector Configs Focused on 502 Bad Gateway
- DaemonSet (logs + kubelet + host metrics):
# otel-collector-daemonset.yaml
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib
presets:
  kubernetesAttributes: { enabled: true }
  hostMetrics: { enabled: true }
  kubeletMetrics: { enabled: true }
  logsCollection: { enabled: true, storeCheckpoints: true }
config:
  exporters:
    otlphttp/metrics:
      metrics_endpoint: http://<cubeapm_endpoint>:3130/api/metrics/v1/save/otlp
      retry_on_failure: { enabled: false }
    otlphttp/logs:
      logs_endpoint: http://<cubeapm_endpoint>:3130/api/logs/insert/opentelemetry/v1/logs
      headers:
        Cube-Stream-Fields: k8s.namespace.name,k8s.deployment.name
    otlp/traces:
      endpoint: <cubeapm_endpoint>:4317
      tls: { insecure: true }
  processors:
    batch: {}
    resourcedetection: { detectors: ["system"] }
  receivers:
    otlp: { protocols: { grpc: {}, http: {} } }
    kubeletstats:
      collection_interval: 60s
      metric_groups: [container, node, pod]
    hostmetrics:
      collection_interval: 60s
      scrapers: { cpu: {}, memory: {}, network: {} }
  service:
    pipelines:
      metrics: { receivers: [hostmetrics, kubeletstats], processors: [batch], exporters: [otlphttp/metrics] }
      logs: { receivers: [otlp], processors: [batch], exporters: [otlphttp/logs] }
      traces: { receivers: [otlp], processors: [batch], exporters: [otlp/traces] }
- logsCollection → captures ingress/app logs with 502 lines
- kubeletstats → surfaces Pod restarts tied to 502s
- hostmetrics → CPU/mem/network saturation that correlates with upstream timeouts
- Deployment (cluster events + metrics):
# otel-collector-deployment.yaml
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib
presets:
  kubernetesEvents: { enabled: true }
  clusterMetrics: { enabled: true }
config:
  exporters:
    otlphttp/metrics:
      metrics_endpoint: http://<cubeapm_endpoint>:3130/api/metrics/v1/save/otlp
    otlphttp/k8s-events:
      logs_endpoint: http://<cubeapm_endpoint>:3130/api/logs/insert/opentelemetry/v1/logs
      headers:
        Cube-Stream-Fields: event.domain
  processors:
    batch: {}
  receivers:
    k8s_cluster:
      collection_interval: 60s
      metrics:
        k8s.node.condition: { enabled: true }
  service:
    pipelines:
      metrics: { receivers: [k8s_cluster], processors: [batch], exporters: [otlphttp/metrics] }
      logs: { receivers: [k8sobjects], processors: [batch], exporters: [otlphttp/k8s-events] }
- kubernetesEvents → readiness flaps, rollout gaps
- clusterMetrics → node health, scheduling context
- event logs → timeline of probe failures alongside 502 spikes
Step 4 — Supporting Components
If you want ingress metrics directly, add Prometheus scraping:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ingress"
          static_configs:
            - targets: ["ingress-nginx-controller.ingress-nginx.svc.cluster.local:10254"]
Step 5 — Verification in CubeAPM
- Events: Readiness probe failed, rollout start/stop, Pod crash events
- Metrics: CPU/memory pressure near 502 spikes, ingress request error rates
- Logs: Ingress logs with 502 or upstream timeout entries, app logs for backend errors
- Restarts: Containers in crash loop or restarting during load
- Rollouts: Deployment gaps where Pods dropped before new ones came online
Example Alert Rules for Kubernetes ‘502 Bad Gateway’ Error
1. Spike in ingress 502 responses
A sudden increase in 502 responses from the ingress controller usually means backend Pods are failing, unreachable, or timing out. This is often the first external signal customers see when services are unhealthy. Catching this spike early allows teams to investigate Pod readiness, network policies, or resource bottlenecks before downtime escalates.
- alert: High502ErrorRate
  expr: sum(rate(nginx_ingress_controller_requests{status="502"}[5m])) by (namespace) > 5
  for: 2m
  labels: { severity: critical }
  annotations:
    summary: "High 502 error rate detected in namespace"
2. No healthy endpoints in a Service
If a Service has no available endpoints, traffic routed through it will always fail, resulting in 502s at the ingress layer. This usually happens due to label mismatches or Pods failing readiness checks. Monitoring for Services without endpoints helps catch silent configuration issues before they cause widespread client errors.
- alert: ServiceNoEndpoints
  expr: kube_endpoint_address_available == 0
  for: 1m
  labels: { severity: warning }
  annotations:
    summary: "Service has no endpoints available"
3. Backend Pod restarts during traffic
When Pods restart frequently, they often drop connections mid-request, producing incomplete or failed responses. The ingress controller surfaces these as 502 errors because the upstream suddenly goes offline. Tracking restart frequency ensures unstable Pods are flagged before they impact user-facing traffic.
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Pod restarting frequently, may cause 502s"
4. Pods not becoming Ready (readiness probe trouble)
If Pods stay NotReady after rollout or scale-up, Services will route to backends that can’t serve traffic, surfacing 502s at the ingress layer. This often points to incorrect probe paths/thresholds, slow startups, or missing dependencies. Alerting on stuck NotReady Pods gives you time to fix probes or add warm-up delays before customers feel it.
- alert: PodsStuckNotReady
  expr: count(kube_pod_status_ready{condition="true"} == 0) by (namespace) > 0
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "One or more Pods remain NotReady, downstream 502s likely"
5. Rollout gaps (no surge capacity)
During a deployment, if too many old Pods terminate before new ones are Ready, the ingress temporarily has no upstream endpoints, causing waves of 502 responses. This alert catches rollout windows where availability drops below the intended capacity, usually due to an aggressive maxUnavailable or too-low maxSurge. Tune the strategy before pushing large releases.
- alert: DeploymentReplicasUnavailable
  expr: sum(kube_deployment_status_replicas_unavailable) by (namespace, deployment) > 0
  for: 3m
  labels: { severity: critical }
  annotations:
    summary: "Deployment has unavailable replicas during rollout; risk of 502s"
Conclusion
The Kubernetes 502 Bad Gateway error is more than just a failed HTTP response—it’s a sign that something deeper in your cluster isn’t aligned. Whether it’s misconfigured readiness probes, Pod crashes, missing Service endpoints, or rollout gaps, the result is the same: traffic drops, APIs fail, and users experience broken workflows.
The fixes require a systematic approach: validating probes, stabilizing Pods, checking Services, and tightening NetworkPolicies. With alert rules in place, you can catch 502 patterns before they turn into widespread outages.
CubeAPM makes this process faster by tying together events, metrics, logs, and rollout history into one unified view. Instead of chasing 502s through multiple tools, teams get clear visibility into root cause and resolution paths—reducing downtime and keeping Kubernetes workloads resilient at scale.