CubeAPM
GKE Node Pool and Workload Identity Issues: Troubleshooting Guide

GKE Workload Identity eliminates the need for service account key files by letting Kubernetes service accounts authenticate directly to Google Cloud APIs. When configured correctly, pods get short lived credentials automatically rotated by the GKE metadata server. When misconfigured, workloads fail authentication with cryptic metadata endpoint timeouts or 403 errors that surface only in production.

The most common failure pattern: Workload Identity is enabled at the cluster level but not on individual node pools. Pods on those node pools never reach the GKE metadata server. Their metadata requests are answered by the Compute Engine metadata server instead, so workloads silently fall back to the node's service account (often the Compute Engine default service account), which may lack the necessary IAM roles. According to the CNCF 2024 security practices survey, 68% of Kubernetes users cite authentication and secret management as top security concerns, which Workload Identity is designed to address when configured properly.

This guide covers what Workload Identity is, how it works at the node pool and cluster level, the most common misconfigurations that break authentication, step by step fixes for each failure mode, and how to monitor Workload Identity health in production.

What Is GKE Workload Identity and Why It Matters

GKE Workload Identity is a mechanism that allows Kubernetes service accounts (KSAs) to act as Google Cloud IAM service accounts (GSAs) without storing credentials as secrets in the cluster. It replaces the older pattern of downloading service account JSON keys, storing them in Kubernetes secrets, and mounting them into pods.

Workload Identity works through a trust relationship managed by the GKE metadata server running on each node. When a pod makes a request to the metadata server for credentials, the metadata server verifies the pod’s Kubernetes service account, checks if it is bound to a Google Cloud service account, and returns short lived OAuth tokens if the binding exists.

This approach provides three core benefits:

Eliminates key sprawl: No JSON keys to rotate, leak, or expire. Credentials are ephemeral and scoped to the workload lifetime.

Granular IAM control: Each workload can have a distinct Google Cloud service account with precisely scoped IAM permissions. This limits blast radius if a pod is compromised.

Audit trail: Every API call traces back to a specific Kubernetes service account and namespace, giving clear visibility into which workloads accessed which Google Cloud resources.

The tradeoff: Workload Identity introduces configuration complexity that does not exist with static keys. A single missing IAM binding or node pool flag can silently break authentication for an entire deployment.

How GKE Workload Identity Works Across Clusters and Node Pools

Workload Identity has three layers that must all be configured correctly for authentication to succeed: cluster level configuration, node pool metadata mode, and IAM policy bindings.

Cluster level Workload Identity pool

When you enable Workload Identity on a GKE cluster, Google Cloud creates a workload identity pool tied to your project. The pool follows the format PROJECT_ID.svc.id.goog. This pool is permanent: even if you delete all clusters in the project, Google Cloud does not remove it.

The cluster level setting establishes that the cluster participates in Workload Identity, but it does not switch existing node pools to the GKE metadata server. Each existing node pool needs its own metadata mode update.

Node pool metadata mode: GKE_METADATA vs GCE_METADATA

Each node pool has a workloadMetadataConfig setting that controls how the GKE metadata server behaves on those nodes. There are two modes:

GKE_METADATA: The GKE metadata server intercepts metadata requests from pods and enforces Workload Identity bindings. This is the required mode for Workload Identity to work.

GCE_METADATA: The default Compute Engine metadata server responds to pod requests. Pods authenticate as the node’s service account, bypassing Workload Identity entirely. This is the legacy mode and the mode that causes authentication failures when Workload Identity is expected.

If a cluster has Workload Identity enabled but a node pool remains in GCE_METADATA mode, pods on that node pool will not get Workload Identity credentials. They will either use the node service account, which likely lacks the necessary IAM roles, or fail authentication entirely.

This is the single most common misconfiguration: cluster level Workload Identity on, node pool metadata mode left at GCE_METADATA on pools that predate the change.

IAM policy binding: linking KSA to GSA

After the cluster and node pool are configured, you must create an IAM policy binding that grants the Kubernetes service account permission to impersonate the Google Cloud service account. The binding uses this format:

gcloud iam service-accounts add-iam-policy-binding GSA_EMAIL \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

This binding is directional: it grants the KSA the ability to act as the GSA, not the reverse. If the binding is missing, the metadata server is still reachable, but token requests return an error stating the KSA is not authorized to impersonate the GSA.
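The member string is easy to mistype, because the namespace and KSA name sit inside square brackets after the pool name. The following is a minimal Python sketch, not part of any Google tooling, that assembles the member string and the gcloud argument list from their parts. The project, namespace, and account names are hypothetical placeholders.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Return the IAM member string for a Kubernetes service account."""
    for label, value in (
        ("project_id", project_id),
        ("namespace", namespace),
        ("ksa_name", ksa_name),
    ):
        # Reject empty parts and characters that would corrupt the bracket syntax.
        if not value or any(ch in value for ch in "[]/ "):
            raise ValueError(f"invalid {label}: {value!r}")
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def binding_command(gsa_email: str, member: str) -> list[str]:
    """Build the gcloud argument list; pass it to subprocess.run to execute it."""
    return [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding", gsa_email,
        "--role", "roles/iam.workloadIdentityUser",
        "--member", member,
    ]


# Hypothetical example values.
member = workload_identity_member("my-project", "payments", "payments-ksa")
print(member)
print(" ".join(binding_command("payments-gsa@my-project.iam.gserviceaccount.com", member)))
```

Building the argument list rather than a shell string avoids quoting problems with the square brackets when the command is run from a script.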

Common GKE Workload Identity Failures and Root Causes

Workload Identity failures fall into four categories: node pool misconfiguration, IAM binding errors, metadata server failures, and network policy blocks.

Node pool not configured for GKE_METADATA

Symptom: Pods log metadata endpoint timeout errors or 403 Forbidden responses when trying to access Google Cloud APIs. Application logs show authentication failures but no clear root cause.

Root cause: The node pool where the pod is scheduled has workloadMetadataConfig set to GCE_METADATA or unset. The GKE metadata server is not running on those nodes, so Workload Identity bindings are never checked.

How to verify:

gcloud container node-pools describe NODEPOOL_NAME \
  --cluster CLUSTER_NAME \
  --region REGION \
  --format="value(config.workloadMetadataConfig.mode)"

If the output is empty or GCE_METADATA, Workload Identity is not active on that node pool.

Fix:

gcloud container node-pools update NODEPOOL_NAME \
  --cluster CLUSTER_NAME \
  --region REGION \
  --workload-metadata=GKE_METADATA

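Updating the metadata mode recreates the nodes in the pool, so plan for pods to be rescheduled. Once the pods are running on the updated nodes, their credential requests go to the GKE metadata server. Workloads that relied on the node service account lose that access when the change lands, so set up their bindings first.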

A production incident at a large SaaS platform documented on Reddit showed a 3 node pool cluster where 2 pools had GKE_METADATA enabled and 1 did not. Deployments that landed on the misconfigured pool failed authentication unpredictably, appearing as intermittent API failures that correlated with pod scheduling.

IAM binding missing or incorrectly scoped

Symptom: The metadata server is reachable, but token requests return an error stating the Kubernetes service account is not authorized to act as the Google Cloud service account. Application logs show 403 Forbidden errors on Google Cloud API calls.

Root cause: The IAM policy binding linking the KSA to the GSA is missing, uses the wrong namespace or service account name, or was created in a different Google Cloud project than the one where the cluster runs.

How to verify:

gcloud iam service-accounts get-iam-policy GSA_EMAIL \
  --format=json | jq '.bindings[] | select(.role=="roles/iam.workloadIdentityUser")'

Check that the member list includes serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME] with the exact namespace and KSA name your pod uses.
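That check can be scripted. The sketch below assumes the policy was saved with gcloud iam service-accounts get-iam-policy GSA_EMAIL --format=json and loaded into a dict. It reports whether the exact member is present and lists other Workload Identity members under the same role, which helps spot a wrong namespace or KSA name. The function names and sample values are illustrative.

```python
ROLE = "roles/iam.workloadIdentityUser"


def expected_member(project_id: str, namespace: str, ksa_name: str) -> str:
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def has_binding(policy: dict, project_id: str, namespace: str, ksa_name: str) -> bool:
    """True if the exact KSA member holds roles/iam.workloadIdentityUser."""
    member = expected_member(project_id, namespace, ksa_name)
    return any(
        b.get("role") == ROLE and member in b.get("members", [])
        for b in policy.get("bindings", [])
    )


def other_workload_identity_members(policy: dict, project_id: str,
                                    namespace: str, ksa_name: str) -> list[str]:
    """List Workload Identity members under the role that are not the expected one."""
    member = expected_member(project_id, namespace, ksa_name)
    found = []
    for b in policy.get("bindings", []):
        if b.get("role") != ROLE:
            continue
        for m in b.get("members", []):
            if ".svc.id.goog[" in m and m != member:
                found.append(m)
    return found


# Hypothetical policy in which the namespace was mistyped as "payment".
policy = {"bindings": [{
    "role": ROLE,
    "members": ["serviceAccount:my-project.svc.id.goog[payment/payments-ksa]"],
}]}
print(has_binding(policy, "my-project", "payments", "payments-ksa"))
print(other_workload_identity_members(policy, "my-project", "payments", "payments-ksa"))
```

A false result combined with a near-identical member in the second list usually points to a typo rather than a missing binding.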

Fix:

gcloud iam service-accounts add-iam-policy-binding GSA_EMAIL \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

After adding the binding, credential requests succeed once the IAM change has propagated, which can take a few minutes. No pod restart is required.

Common variant: The GSA exists in a different project than the GKE cluster. In this case, you must use the GSA email from the other project and ensure cross project IAM bindings are allowed by your organization policies.

Kubernetes service account missing annotation

Symptom: IAM binding exists, node pool uses GKE_METADATA, but pods still fail authentication. Metadata server returns an error stating no Google Cloud service account is associated with the Kubernetes service account.

Root cause: The Kubernetes service account resource does not have the iam.gke.io/gcp-service-account annotation pointing to the GSA email.

How to verify:

kubectl get serviceaccount KSA_NAME -n NAMESPACE -o yaml

Check for this annotation under metadata.annotations:

metadata:
  annotations:
    iam.gke.io/gcp-service-account: GSA_EMAIL

If the annotation is missing, the metadata server has no way to know which GSA the KSA should impersonate.
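As a rough illustration, the annotation check can be automated against the JSON form of the service account (kubectl get serviceaccount KSA_NAME -n NAMESPACE -o json). The function name and return values below are choices made for this sketch, and the sample object is hypothetical.

```python
ANNOTATION = "iam.gke.io/gcp-service-account"


def check_ksa_annotation(ksa: dict, expected_gsa_email: str) -> str:
    """Return "ok", "missing", or "mismatch: <actual value>" for a ServiceAccount object."""
    annotations = (ksa.get("metadata") or {}).get("annotations") or {}
    actual = annotations.get(ANNOTATION)
    if actual is None:
        return "missing"
    if actual != expected_gsa_email:
        return f"mismatch: {actual}"
    return "ok"


# Hypothetical ServiceAccount with no annotations at all.
ksa = {"apiVersion": "v1", "kind": "ServiceAccount",
       "metadata": {"name": "payments-ksa", "namespace": "payments"}}
print(check_ksa_annotation(ksa, "payments-gsa@my-project.iam.gserviceaccount.com"))
```

Distinguishing a missing annotation from a mismatched one matters, because a mismatch often means the KSA points at a GSA from another environment.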

Fix:

kubectl annotate serviceaccount KSA_NAME \
  -n NAMESPACE \
  iam.gke.io/gcp-service-account=GSA_EMAIL

Existing pods must be restarted to pick up the annotation change. New pods will use the updated service account configuration immediately.

Network policy blocking metadata server access

Symptom: Pods log metadata endpoint connection timeout or connection refused errors. Other Google Cloud API calls from the same pod also fail if they rely on instance metadata.

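Root cause: A Kubernetes NetworkPolicy is blocking egress traffic to the metadata server. On clusters running GKE Dataplane V2 the address is 169.254.169.254 on port 80; on other clusters running GKE 1.21.0-gke.1000 or later it is 169.254.169.252 on port 988. The metadata server runs locally on each node, but network policies can still block access.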

How to verify:

kubectl get networkpolicies -n NAMESPACE

Review egress rules to confirm traffic to 169.254.169.254 is allowed. You can test from a pod:

kubectl exec -it POD_NAME -n NAMESPACE -- curl -H "Metadata-Flavor: Google" \
  http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

If the request times out or is refused, a network policy is the likely cause.

Fix: Add an egress rule allowing traffic to the metadata server:

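apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metadata-server
  namespace: NAMESPACE
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  # GKE Dataplane V2 clusters: metadata server at 169.254.169.254 on port 80
  - to:
    - ipBlock:
        cidr: 169.254.169.254/32
    ports:
    - protocol: TCP
      port: 80
  # Other clusters on GKE 1.21.0-gke.1000 or later: 169.254.169.252 on port 988
  - to:
    - ipBlock:
        cidr: 169.254.169.252/32
    ports:
    - protocol: TCP
      port: 988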

Apply the policy and test metadata server access again.
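Before applying a policy, it can help to check offline whether its egress rules would permit the metadata server address. The sketch below evaluates a single NetworkPolicy object (for example from kubectl get networkpolicy NAME -o json) against one IP and port. It is a simplified model: it ignores named ports, and because policies are additive in a real cluster, traffic is allowed if any policy selecting the pod allows it. The helper names are made up for this example.

```python
import ipaddress


def _ip_block_matches(block, addr) -> bool:
    """True if addr falls inside the ipBlock cidr and outside its except list."""
    if not block:
        return False
    if addr not in ipaddress.ip_network(block["cidr"]):
        return False
    return not any(addr in ipaddress.ip_network(e) for e in block.get("except", []))


def rule_allows(rule: dict, ip: str, port: int, protocol: str = "TCP") -> bool:
    """Return True if one egress rule permits traffic to ip:port."""
    addr = ipaddress.ip_address(ip)
    peers = rule.get("to") or []
    if peers:
        # Only ipBlock peers can match the node-local metadata address;
        # podSelector and namespaceSelector peers select pods.
        if not any(_ip_block_matches(p.get("ipBlock"), addr) for p in peers):
            return False
    ports = rule.get("ports") or []
    if not ports:
        return True  # no port list means every port is allowed
    for entry in ports:
        if entry.get("protocol", "TCP") != protocol:
            continue
        low = entry.get("port")
        if low is None:
            return True  # a protocol-only entry covers all ports
        # Named (string) ports are skipped: they cannot be resolved offline.
        if isinstance(low, int) and low <= port <= entry.get("endPort", low):
            return True
    return False


def policy_allows_egress(policy: dict, ip: str, port: int) -> bool:
    """Evaluate one NetworkPolicy; True also means 'does not restrict egress'."""
    spec = policy.get("spec") or {}
    types = spec.get("policyTypes")
    if types is None:
        types = ["Ingress"] + (["Egress"] if "egress" in spec else [])
    if "Egress" not in types:
        return True
    return any(rule_allows(r, ip, port) for r in spec.get("egress") or [])


# Hypothetical policy that allows only 169.254.169.254 on TCP port 80.
policy = {"spec": {"podSelector": {}, "policyTypes": ["Egress"], "egress": [
    {"to": [{"ipBlock": {"cidr": "169.254.169.254/32"}}],
     "ports": [{"protocol": "TCP", "port": 80}]}]}}
print(policy_allows_egress(policy, "169.254.169.254", 80))
print(policy_allows_egress(policy, "169.254.169.254", 988))
```

Running the check for each address and port pair your cluster type uses shows which rule is missing before any pod is affected.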

Step by Step Fix: Enabling Workload Identity on an Existing Cluster

If you have an existing GKE cluster that was created without Workload Identity, enabling it requires changes at the cluster level, node pool level, and IAM level.

Step 1: Enable Workload Identity on the cluster

gcloud container clusters update CLUSTER_NAME \
  --region REGION \
  --workload-pool=PROJECT_ID.svc.id.goog

This change applies immediately and does not disrupt running workloads. The cluster now participates in the workload identity pool, but existing node pools do not enforce Workload Identity until their metadata mode is updated.

Step 2: Update each node pool to use GKE_METADATA

For every node pool in the cluster:

gcloud container node-pools update NODEPOOL_NAME \
  --cluster CLUSTER_NAME \
  --region REGION \
  --workload-metadata=GKE_METADATA

This change recreates the nodes in the pool, so pods are rescheduled onto updated nodes and use the GKE metadata server from then on. Workloads that still rely on the node service account lose that access once the pool switches, so complete Steps 3 through 6 for those workloads before updating their pool.

Node pools created after cluster level Workload Identity is enabled default to GKE_METADATA. Setting the metadata mode explicitly during creation still documents the intent and guards against overrides in templates or infrastructure as code:

gcloud container node-pools create NEW_NODEPOOL \
  --cluster CLUSTER_NAME \
  --region REGION \
  --workload-metadata=GKE_METADATA

Only node pools that existed before Workload Identity was enabled keep their previous metadata mode.

Step 3: Create Google Cloud service accounts for your workloads

For each workload that needs Google Cloud API access, create a dedicated GSA:

gcloud iam service-accounts create GSA_NAME \
  --project PROJECT_ID

Grant the GSA the minimum IAM roles required for the workload. For example, if the workload reads from Cloud Storage:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member "serviceAccount:GSA_EMAIL" \
  --role roles/storage.objectViewer

Avoid granting roles/editor or roles/owner. Workload Identity is designed for least privilege, and overly broad roles negate the security benefit.

Step 4: Create Kubernetes service accounts and annotate them

In each namespace where a workload runs:

kubectl create serviceaccount KSA_NAME -n NAMESPACE
kubectl annotate serviceaccount KSA_NAME \
  -n NAMESPACE \
  iam.gke.io/gcp-service-account=GSA_EMAIL

Step 5: Bind the KSA to the GSA using IAM

gcloud iam service-accounts add-iam-policy-binding GSA_EMAIL \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

Step 6: Update pod specs to use the annotated KSA

In your Deployment, StatefulSet, or Pod manifest, specify the KSA:

apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-test
  namespace: NAMESPACE
spec:
  serviceAccountName: KSA_NAME
  containers:
  - name: app
    image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
    command: ["sleep", "infinity"]

Deploy the pod and test authentication:

kubectl exec -it workload-identity-test -n NAMESPACE -- gcloud auth list

The active account should show the GSA email, not the Compute Engine default service account.
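If the image does not include gcloud, the same check can be made from application code by asking the metadata server which account it is serving. The Python sketch below uses only the standard library and the metadata address from the curl example earlier in this guide. The function names are illustrative, and active_service_account only works when run inside a pod.

```python
import urllib.request

METADATA_EMAIL_URL = (
    "http://169.254.169.254/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_email_request() -> urllib.request.Request:
    """Build the metadata request; the Metadata-Flavor header is mandatory."""
    return urllib.request.Request(
        METADATA_EMAIL_URL, headers={"Metadata-Flavor": "Google"}
    )


def active_service_account(timeout: float = 5.0) -> str:
    """Return the account email the metadata server serves to this pod."""
    with urllib.request.urlopen(build_email_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def is_expected_identity(actual_email: str, expected_gsa_email: str) -> bool:
    """Compare emails case-insensitively, ignoring surrounding whitespace."""
    return actual_email.strip().lower() == expected_gsa_email.strip().lower()
```

Inside the test pod, calling is_expected_identity(active_service_account(), GSA_EMAIL) should return True. A different email means the pod is not using the intended binding.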

Monitoring Workload Identity Health in Production

Once Workload Identity is configured, three signals are critical to monitor: metadata server request success rate, IAM policy binding changes, and credential refresh failures.

Metadata server request metrics

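GKE exposes metadata server metrics through Cloud Monitoring. The most important metric is kubernetes.io/node/metadata_server/request_count on the k8s_node resource, with dimensions for response code and service account. Metric availability varies by GKE version, so confirm the name in Metrics Explorer before building alerts on it.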

A spike in 403 responses indicates IAM binding issues. A spike in 500 responses indicates metadata server failures or misconfigurations. Timeouts suggest network policy blocks or metadata server unavailability.

Query for metadata server errors:

fetch k8s_node
| metric 'kubernetes.io/node/metadata_server/request_count'
| filter response_code != '200'
| group_by [response_code], sum(value.request_count)

Alert when non-200 response codes exceed 1% of total requests over a 5 minute window.
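The threshold arithmetic is simple, but it is worth pinning down how an empty window and the boundary value are handled. A minimal sketch, assuming the request counts for one 5 minute window have already been grouped by response code (the input dict is hypothetical sample data):

```python
def non_200_ratio(counts: dict[str, int]) -> float:
    """Fraction of requests in the window whose response code is not 200."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic in the window: nothing to alert on
    errors = sum(n for code, n in counts.items() if code != "200")
    return errors / total


def should_alert(counts: dict[str, int], threshold: float = 0.01) -> bool:
    """Alert only when the error ratio is strictly above the threshold."""
    return non_200_ratio(counts) > threshold


# Hypothetical window: 20 failures out of 1000 requests.
window = {"200": 980, "403": 15, "500": 5}
print(non_200_ratio(window), should_alert(window))
```

Treating an empty window as healthy avoids false alerts on idle node pools. If silence itself is suspicious for a workload, alert on missing data separately.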

IAM policy audit logs

Google Cloud logs every IAM policy change as an AuditLog event. Monitor for SetIamPolicy calls on service accounts tied to Workload Identity bindings. Unexpected removals of the roles/iam.workloadIdentityUser binding will break authentication for all pods using that KSA.

Create a log based metric in Cloud Logging:

resource.type="service_account"
protoPayload.methodName="google.iam.admin.v1.SetIAMPolicy"
protoPayload.request.policy.bindings.role="roles/iam.workloadIdentityUser"

Alert on any binding removal unless it is part of a planned change.
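When audit entries are exported for further processing, a small filter can separate binding removals from additions. The sketch below assumes each entry carries the change set under protoPayload.serviceData.policyDelta.bindingDeltas, with action, role, and member fields. That layout is an assumption to verify against a real entry from your project, and the sample entry is invented.

```python
ROLE = "roles/iam.workloadIdentityUser"


def removed_workload_identity_members(entry: dict) -> list[str]:
    """Return members whose workloadIdentityUser binding was removed by this change."""
    payload = entry.get("protoPayload") or {}
    # Assumed location of the change set; adjust if your entries differ.
    service_data = payload.get("serviceData") or {}
    deltas = (service_data.get("policyDelta") or {}).get("bindingDeltas") or []
    return [
        d.get("member", "")
        for d in deltas
        if d.get("action") == "REMOVE" and d.get("role") == ROLE
    ]


# Invented sample entry: one removal and one unrelated addition.
entry = {"protoPayload": {"serviceData": {"policyDelta": {"bindingDeltas": [
    {"action": "REMOVE", "role": ROLE,
     "member": "serviceAccount:my-project.svc.id.goog[payments/payments-ksa]"},
    {"action": "ADD", "role": "roles/iam.serviceAccountUser",
     "member": "user:alice@example.com"},
]}}}}
print(removed_workload_identity_members(entry))
```

The returned member strings contain the namespace and KSA name, so an alert can name the affected workload directly.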

Pod authentication failure logs

Application logs are the earliest signal of Workload Identity failure. Most Google Cloud client libraries log authentication errors before retrying. Parse application logs for patterns like Error 403: Forbidden, could not get token from metadata server, or invalid_grant.

Aggregate these errors by namespace and service account to identify which workloads are affected. Cross reference with recent IAM changes or node pool updates to find the root cause.
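The aggregation step might look like the following sketch. It assumes log records have already been parsed into dicts with message, namespace, and service_account fields. Those field names are hypothetical and depend on your logging pipeline.

```python
import re
from collections import Counter

# The error patterns listed above, matched case-insensitively.
AUTH_ERROR = re.compile(
    r"Error 403: Forbidden|could not get token from metadata server|invalid_grant",
    re.IGNORECASE,
)


def count_auth_failures(records: list[dict]) -> Counter:
    """Count matching log lines per (namespace, service account) pair."""
    counts = Counter()
    for record in records:
        if AUTH_ERROR.search(record.get("message", "")):
            key = (
                record.get("namespace", "unknown"),
                record.get("service_account", "unknown"),
            )
            counts[key] += 1
    return counts


# Hypothetical parsed log records.
records = [
    {"namespace": "payments", "service_account": "payments-ksa",
     "message": "googleapi: Error 403: Forbidden"},
    {"namespace": "payments", "service_account": "payments-ksa",
     "message": "oauth2: invalid_grant"},
    {"namespace": "web", "service_account": "web-ksa",
     "message": "request completed in 12ms"},
]
for (namespace, ksa), n in count_auth_failures(records).most_common():
    print(namespace, ksa, n)
```

Sorting by count puts the most affected workload first, which is usually the quickest lead when several namespaces share a node pool.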

CubeAPM’s Kubernetes monitoring correlates pod logs with infrastructure events like node pool updates or IAM binding changes, surfacing the exact timeline of when authentication started failing and what changed. This removes the manual cross referencing step and cuts troubleshooting time from hours to minutes.

Best Practices for Managing Workload Identity at Scale

Workload Identity misconfiguration risk compounds as clusters, namespaces, and workloads grow. These patterns reduce failure modes:

Use a consistent naming convention: Map each Kubernetes service account to a Google Cloud service account with a predictable name, like app-name-environment-ksa and app-name-environment-gsa. This makes it trivial to verify bindings and spot mismatches.

Automate IAM binding creation with infrastructure as code: Store GSA creation, IAM role grants, and Workload Identity bindings in Terraform or a similar tool. This ensures bindings are versioned, reviewed, and reproducible. Manual gcloud commands lead to drift and missed bindings.

Test Workload Identity in CI before production: Deploy a test pod in your CI pipeline that uses Workload Identity to call a Google Cloud API. If the pod fails authentication, fail the pipeline. This catches IAM binding errors or annotation mistakes before they reach production.

Monitor IAM policy changes with alerts: Set up alerts on SetIamPolicy audit logs for service accounts tied to Workload Identity. Unexpected binding removals should trigger an immediate investigation.

Document which workloads use which GSAs: Maintain a mapping of Kubernetes service accounts to Google Cloud service accounts in a shared knowledge base. When an incident occurs, responders need to know which GSA controls access for a given workload without reverse engineering IAM policies.

Avoid reusing GSAs across namespaces: Each namespace should have its own set of GSAs even if workloads have similar permissions. This limits blast radius if a namespace is compromised and makes it easier to audit which namespace accessed which Google Cloud resource.

Use separate GSAs for different environments: Do not share the same GSA between dev, staging, and production. If a dev workload is compromised, it should not have access to production resources. Environment specific GSAs enforce this boundary at the IAM level.
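Several of these practices can be checked mechanically once the KSA to GSA mapping exists as data. As one hedged example, the sketch below flags GSAs that are referenced from more than one namespace. The input format, a list of (namespace, KSA name, GSA email) tuples, is an assumption. It could be produced from the annotations on all service accounts in the cluster.

```python
from collections import defaultdict


def gsas_shared_across_namespaces(
    mappings: list[tuple[str, str, str]],
) -> dict[str, list[str]]:
    """Map each GSA used in more than one namespace to those namespaces."""
    namespaces_by_gsa = defaultdict(set)
    for namespace, _ksa_name, gsa_email in mappings:
        namespaces_by_gsa[gsa_email].add(namespace)
    return {
        gsa: sorted(namespaces)
        for gsa, namespaces in namespaces_by_gsa.items()
        if len(namespaces) > 1
    }


# Hypothetical mapping: one GSA is reused by two namespaces.
mappings = [
    ("payments", "payments-ksa", "shared-gsa@my-project.iam.gserviceaccount.com"),
    ("billing", "billing-ksa", "shared-gsa@my-project.iam.gserviceaccount.com"),
    ("web", "web-ksa", "web-gsa@my-project.iam.gserviceaccount.com"),
]
print(gsas_shared_across_namespaces(mappings))
```

The same grouping works for environments if the tuple carries an environment label instead of a namespace.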

Troubleshooting Workload Identity Across Multiple Node Pools

Clusters with multiple node pools introduce a failure mode where Workload Identity works on some nodes but not others. Pods scheduled to misconfigured node pools fail authentication unpredictably, appearing as intermittent errors that are hard to reproduce.

How to identify the problem: Check the metadata mode for every node pool in the cluster:

gcloud container node-pools list \
  --cluster CLUSTER_NAME \
  --region REGION \
  --format="table(name, config.workloadMetadataConfig.mode)"

If any node pool shows GCE_METADATA or an empty value, pods scheduled there will not use Workload Identity.
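For clusters with many pools, the same check is easy to automate. The sketch below assumes the output of gcloud container node-pools list --format=json has been loaded into a list of dicts and reads the same config.workloadMetadataConfig.mode path used in the command above. The pool names in the sample are hypothetical.

```python
import json


def pools_without_workload_identity(node_pools: list[dict]) -> list[str]:
    """Return names of node pools whose metadata mode is not GKE_METADATA."""
    misconfigured = []
    for pool in node_pools:
        config = pool.get("config") or {}
        mode = (config.get("workloadMetadataConfig") or {}).get("mode")
        if mode != "GKE_METADATA":
            # Covers both an explicit GCE_METADATA and a missing setting.
            misconfigured.append(pool.get("name", "<unnamed>"))
    return misconfigured


# Hypothetical node pool list: one correct pool and two misconfigured ones.
sample = json.loads("""
[
  {"name": "pool-a", "config": {"workloadMetadataConfig": {"mode": "GKE_METADATA"}}},
  {"name": "pool-b", "config": {"workloadMetadataConfig": {"mode": "GCE_METADATA"}}},
  {"name": "pool-c", "config": {}}
]
""")
print(pools_without_workload_identity(sample))
```

A scheduled job or CI check that fails when this list is non-empty can catch a drifting pool before pods land on it.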

Why this happens: When you enable Workload Identity on a cluster, existing node pools are not automatically updated. GKE does not propagate the cluster level setting to node pools created before Workload Identity was enabled. Each node pool must be updated individually.

Fix: Update each misconfigured node pool:

gcloud container node-pools update NODEPOOL_NAME \
  --cluster CLUSTER_NAME \
  --region REGION \
  --workload-metadata=GKE_METADATA

How to prevent it: Node pools created after Workload Identity is enabled on the cluster default to GKE_METADATA, but specifying the metadata mode explicitly keeps templates and infrastructure as code from overriding it:

gcloud container node-pools create NEW_NODEPOOL \
  --cluster CLUSTER_NAME \
  --region REGION \
  --workload-metadata=GKE_METADATA

Do not rely on defaults for pools that predate the change or that are created from older templates. A pool left at GCE_METADATA bypasses Workload Identity.

Tools for Managing and Debugging Workload Identity

Several tools simplify Workload Identity setup and troubleshooting.

gcloud auth application-default login with --impersonate-service-account: When testing Workload Identity locally, you can impersonate a GSA to verify it has the correct IAM roles before deploying to GKE:

gcloud auth application-default login --impersonate-service-account=GSA_EMAIL

Run your application locally. If it fails to access Google Cloud APIs, the GSA lacks the necessary IAM roles. Fix the roles before deploying to GKE.

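kubectl auth can-i for Kubernetes RBAC verification: Workload Identity controls Google Cloud API access, while Kubernetes RBAC controls what a Kubernetes service account can do inside the cluster. RBAC does not restrict which pods may run as a given KSA: any principal allowed to create pods in a namespace can run them as any KSA in that namespace, so limit pod creation rights in namespaces with privileged KSAs. To list what a KSA is permitted to do:

kubectl auth can-i --list \
  --namespace NAMESPACE \
  --as system:serviceaccount:NAMESPACE:KSA_NAME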

Cloud Asset Inventory for IAM binding audits: Query all Workload Identity bindings across your organization:

gcloud asset search-all-iam-policies \
  --scope=organizations/ORGANIZATION_ID \
  --query="policy:roles/iam.workloadIdentityUser" \
  --format=json

This surfaces every KSA to GSA binding, making it easy to audit which workloads have access to which Google Cloud resources.

IAM recommender for overly permissive bindings: Use IAM role recommendations to find GSAs that hold permissions they do not use. If a GSA has roles/editor but only reads from Cloud Storage, reduce the role to roles/storage.objectViewer. Policy Analyzer complements this by showing which principals can access a given resource.

CubeAPM for unified observability across GKE and Google Cloud APIs: CubeAPM monitors Kubernetes pod health, metadata server request rates, and Google Cloud API call success rates in a single view. When a workload starts failing Google Cloud API calls, CubeAPM correlates the failure with recent node pool updates, IAM policy changes, or pod scheduling events, cutting root cause time from hours to minutes.

Migrating from JSON Key Files to Workload Identity

Teams moving from service account key files to Workload Identity can migrate incrementally without downtime.

Migration strategy

Step 1: Enable Workload Identity on the cluster and update node pools to GKE_METADATA mode as described earlier. Existing workloads continue using key files.

Step 2: For one workload, create a GSA, bind it to the workload’s KSA, and annotate the KSA. Do not remove the key file secret yet.

Step 3: Deploy a test pod that uses the KSA. Verify it can authenticate to Google Cloud APIs by checking logs or running a test command inside the pod.

Step 4: Once verified, update the production deployment to use the annotated KSA. Remove the GOOGLE_APPLICATION_CREDENTIALS environment variable and the secret volume mount from the pod spec.

Step 5: Monitor the workload for authentication errors over 24 hours. If no errors occur, the migration is successful. Delete the key file secret from Kubernetes.

Step 6: Repeat for the next workload until all workloads use Workload Identity.

Common migration pitfall: Forgetting to remove the GOOGLE_APPLICATION_CREDENTIALS environment variable. If the variable is set, Google Cloud client libraries will use the key file instead of Workload Identity, even if the KSA is annotated. The workload appears to work, but it is not using Workload Identity.
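This pitfall can be caught before rollout by scanning the pod spec. The sketch below takes the spec section of a pod template as a dict (for example from kubectl get deployment NAME -o json, under spec.template.spec) and returns the containers that still set the variable directly. It does not inspect variables injected through envFrom, so treat an empty result as a first pass only. The container names in the sample are hypothetical.

```python
KEY_FILE_ENV = "GOOGLE_APPLICATION_CREDENTIALS"


def containers_with_key_file_env(pod_spec: dict) -> list[str]:
    """Return names of containers that set GOOGLE_APPLICATION_CREDENTIALS in env."""
    found = []
    containers = (pod_spec.get("initContainers") or []) + (pod_spec.get("containers") or [])
    for container in containers:
        for env in container.get("env") or []:
            if env.get("name") == KEY_FILE_ENV:
                found.append(container.get("name", "<unnamed>"))
                break  # one hit per container is enough
    return found


# Hypothetical pod spec: the sidecar still points at a mounted key file.
pod_spec = {
    "serviceAccountName": "payments-ksa",
    "containers": [
        {"name": "app", "env": [{"name": "LOG_LEVEL", "value": "info"}]},
        {"name": "sidecar", "env": [
            {"name": KEY_FILE_ENV, "value": "/var/secrets/google/key.json"}]},
    ],
}
print(containers_with_key_file_env(pod_spec))
```

Sidecars and init containers are easy to miss during migration, which is why the scan covers both container lists.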

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.

Frequently Asked Questions

What is the difference between Workload Identity and Workload Identity Federation for GKE?

Workload Identity Federation for GKE is the current official name for what was previously called Workload Identity. The terms refer to the same mechanism. Google Cloud rebranded it to align with the broader Workload Identity Federation feature that supports workloads running outside Google Cloud, like AWS or Azure. For GKE, the functionality is identical.

Can I use Workload Identity with GKE Autopilot clusters?

Yes. GKE Autopilot clusters have Workload Identity enabled by default. Node pools in Autopilot are fully managed by Google and automatically use GKE_METADATA mode. You still need to create IAM bindings and annotate Kubernetes service accounts, but you do not need to configure node pool settings manually.

Why does my pod fail authentication even though the IAM binding exists?

Three common causes: the Kubernetes service account is missing the `iam.gke.io/gcp-service-account` annotation, the node pool is not set to GKE_METADATA mode, or a network policy is blocking traffic to the metadata server IP. Verify all three before investigating further.

How do I test Workload Identity without deploying a real workload?

Deploy a test pod with the `gcloud` CLI installed. Use `kubectl exec` to run `gcloud auth list` inside the pod. The active account should show the GSA email if Workload Identity is configured correctly. If it shows the Compute Engine default service account or an error, the configuration is incomplete.

Does enabling Workload Identity affect existing workloads that use service account key files?

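No. Enabling Workload Identity at the cluster and node pool level does not break workloads that authenticate with key files: pods that use key files stored in Kubernetes secrets continue to work. The exception is workloads that rely on the node's service account through the metadata server. Once their node pool switches to GKE_METADATA they lose that access, so give them an annotated Kubernetes service account with an IAM binding before updating the pool.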

Can multiple Kubernetes service accounts share the same Google Cloud service account?

Yes, but it is not recommended. If two KSAs are bound to the same GSA, both workloads have identical Google Cloud API permissions. This makes it harder to audit which workload accessed which resource and increases blast radius if one workload is compromised. Use separate GSAs per workload for better isolation.

What happens if I delete a Google Cloud service account that is bound to a Kubernetes service account?

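Pods using that Kubernetes service account fail authentication as soon as they need a new token. The KSA annotation still points to the deleted GSA, but the Workload Identity binding lived on the GSA's IAM policy rather than in Kubernetes, and metadata server requests return an error stating the GSA does not exist. If the deletion was recent, undeleting the GSA (possible for 30 days) is the fastest recovery. Otherwise recreate it and restore both its IAM roles and the roles/iam.workloadIdentityUser binding.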
