How to Monitor GKE Clusters with Prometheus and Grafana

You have two paths for monitoring GKE clusters with Prometheus and Grafana. The first is self-hosted: deploy the kube-prometheus-stack Helm chart into your cluster, which installs Prometheus, Grafana, Alertmanager, kube-state-metrics, and node-exporter together with pre-built dashboards and alert rules. The second is Google Managed Service for Prometheus (GMP): let Google run the Prometheus collectors and storage, and connect Grafana to GMP as a data source.

Both approaches use PromQL, support the same Grafana dashboards, and give you the same GKE metrics. The choice is an operational one: self-hosted gives you full control and no additional GCP cost for metrics ingestion; GMP removes the operational overhead of running Prometheus at scale but bills per metric sample ingested.

Key Takeaways

GKE requires an extra RBAC step before deploying kube-prometheus-stack – without it, ClusterRole creation fails with a Forbidden error
kube-prometheus-stack is the fastest path to a complete self-hosted setup – it deploys Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics, and default alert rules in a single Helm command
Google Managed Service for Prometheus managed collection is enabled by default on recent GKE clusters (per Google's documentation, Autopilot from version 1.25 and Standard from version 1.27) – check whether it is already collecting metrics before deploying a second Prometheus stack
kube-state-metrics and cAdvisor cover different things: kube-state-metrics exposes object state (desired vs actual replica count, pod status), cAdvisor exposes resource usage (CPU, memory per container) – you need both
Grafana dashboard ID 315 (Kubernetes cluster monitoring) and ID 6417 (Kubernetes Cluster) are the most widely used community dashboards for GKE
Enable persistent storage for both Prometheus and Grafana in production – by default the chart uses ephemeral emptyDir volumes, so data is lost when a pod is deleted or rescheduled

Option 1: Self-Hosted kube-prometheus-stack

Step 1: Fix GKE RBAC Before Installing

GKE restricts the ability to create ClusterRoles and ClusterRoleBindings unless your Google identity explicitly has cluster-admin. This is a known GKE-specific requirement. Without this step, the kube-prometheus-stack installation fails with a Forbidden error on ClusterRole creation.

# Look up the account that gcloud is currently authenticated as
ACCOUNT=$(gcloud info --format='value(config.account)')

# Bind that account to cluster-admin so it can create ClusterRoles
kubectl create clusterrolebinding owner-cluster-admin-binding \
  --clusterrole cluster-admin \
  --user "$ACCOUNT"

Run this once before any Helm installation that creates ClusterRoles.
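
One way to confirm the binding took effect before running Helm is to ask the API server directly. The sketch below wraps kubectl auth can-i in a small helper; the function name can_install_stack is hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: succeeds (exit code 0) only if the current identity
# is allowed to create both ClusterRoles and ClusterRoleBindings.
can_install_stack() {
  kubectl auth can-i create clusterroles >/dev/null 2>&1 &&
    kubectl auth can-i create clusterrolebindings >/dev/null 2>&1
}
```

After Step 1, running can_install_stack && echo "RBAC ok" against the cluster should print the message; if it prints nothing, the Helm install is likely to hit the Forbidden error described above.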

Step 2: Install kube-prometheus-stack

# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a dedicated monitoring namespace
kubectl create namespace monitoring

# Install the stack with persistent storage
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi \
  --set grafana.persistence.storageClassName=standard-rwo

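The standard-rwo storage class is the default on current GKE clusters and provisions balanced persistent disks. If Prometheus or Grafana needs more disk throughput, on Autopilot or Standard, premium-rwo provisions SSD persistent disks instead.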

What gets installed:

Prometheus server with 15-day retention and 50Gi persistent storage
Grafana with 10Gi persistent storage (default credentials: admin/prom-operator)
Alertmanager
kube-state-metrics
prometheus-node-exporter on every node
A full set of pre-configured dashboards and alert rules via the Kubernetes Monitoring Mixin
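
The same Step 2 overrides can also live in a values file, which is easier to review and keep in version control than a long list of --set flags. This is a minimal sketch: the file name values-gke.yaml is an arbitrary choice, and the keys mirror the flags used above.

```shell
# Write the Step 2 overrides to a values file (file name is an assumption).
cat > values-gke.yaml <<'EOF'
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
grafana:
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: standard-rwo
EOF

# Print the file so it can be reviewed before installing.
cat values-gke.yaml
```

To use it, replace the --set flags in the install command with -f values-gke.yaml, for example: helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f values-gke.yaml.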

Step 3: Verify the Installation

# Check all pods are running
kubectl get pods -n monitoring

# Check Prometheus targets (should show all as UP)
# port-forward blocks, so run each port-forward in its own terminal
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open http://localhost:3000 - login: admin / prom-operator

Change the default Grafana password immediately after the first login. The default password prom-operator is publicly known – anyone who can reach your Grafana endpoint can log in with it. Go to Profile > Change Password as soon as you access the UI for the first time.

Step 4: Expose Grafana with a LoadBalancer or Ingress

For persistent access without port-forwarding:

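# Upgrade with LoadBalancer service type
# --reuse-values keeps the retention and persistence settings from the install;
# without it, passing --set resets every other value to the chart defaults
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set grafana.service.type=LoadBalancer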

Or deploy an Ingress if you have an ingress controller configured in the cluster. For internal-only access, annotate the LoadBalancer service with networking.gke.io/load-balancer-type: "Internal" (older GKE versions use cloud.google.com/load-balancer-type: "Internal").
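
For the internal-only case, the annotation can be set through the Grafana subchart's service values. The sketch below assumes a values file named values-grafana-internal.yaml (an arbitrary name) and uses the networking.gke.io form of the annotation; older GKE versions use the cloud.google.com/load-balancer-type key instead.

```shell
# Write a values overlay that exposes Grafana on an internal load balancer.
# File name is an assumption; the annotation key depends on the GKE version.
cat > values-grafana-internal.yaml <<'EOF'
grafana:
  service:
    type: LoadBalancer
    annotations:
      networking.gke.io/load-balancer-type: "Internal"
EOF

# Print the overlay for review.
cat values-grafana-internal.yaml
```

Applying it with helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --reuse-values -f values-grafana-internal.yaml keeps the settings from the original install and layers the service change on top.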

Option 2: Google Managed Service for Prometheus (GMP)

GMP is enabled by default on new GKE clusters. It replaces the self-hosted Prometheus server with Google-managed collectors that forward metrics to Google’s Monarch backend, queryable via the Prometheus API.

Check if GMP is already enabled on your cluster:

# Shows whether managed collection is enabled (use --region for regional clusters)
gcloud container clusters describe your-cluster-name \
  --zone your-zone \
  --format="value(monitoringConfig.managedPrometheusConfig.enabled)"

Enable GMP on an existing cluster:

gcloud container clusters update your-cluster-name \
  --zone your-zone \
  --enable-managed-prometheus

Connect Grafana to GMP as a data source:

GMP exposes a Prometheus-compatible API endpoint. Create a GCP service account with the Monitoring Viewer role and configure it in Grafana:

# Create a service account for Grafana
gcloud iam service-accounts create grafana-reader \
  --display-name="Grafana GMP Reader" \
  --project=your-project-id

# Grant Monitoring Viewer role
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:grafana-reader@your-project-id.iam.gserviceaccount.com" \
  --role="roles/monitoring.viewer"

# Create a key file for Grafana authentication
gcloud iam service-accounts keys create grafana-key.json \
  --iam-account=grafana-reader@your-project-id.iam.gserviceaccount.com

In Grafana, add a Prometheus data source with the URL: https://monitoring.googleapis.com/v1/projects/your-project-id/location/global/prometheus

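Authenticate the data source as the grafana-reader service account using the JSON key file. How the key is supplied depends on your Grafana version: where the Prometheus data source has no built-in Google service account option, Google's GMP documentation describes a data source syncer that refreshes OAuth2 credentials for Grafana.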

What GMP gives you over self-hosted:

24-month metric retention included in the price
No Prometheus servers to scale, shard, or maintain
Global querying across all clusters and regions from a single Grafana instance
PromQL queries work identically – existing dashboards and alerts need no changes
Free GKE system metrics (CPU, memory, pod status) are included without sending data to GMP

What GMP costs: GMP bills per metric sample ingested. For large GKE clusters with many pods and custom metrics, this can add up. Run the GCP pricing calculator with your expected sample volume before committing.
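
Because billing follows sample count, a rough volume estimate is a useful input for the pricing calculator. The sketch below multiplies active time series by the number of scrapes in a 30-day month; the helper name monthly_samples and the 100,000-series figure are illustrative assumptions, not measurements, and no price is applied.

```shell
# Hypothetical helper: samples ingested over a 30-day month.
#   $1 = number of active time series
#   $2 = scrape interval in seconds
monthly_samples() {
  echo $(( $1 * 30 * 24 * 3600 / $2 ))
}

# Example inputs (assumed): 100,000 active series.
monthly_samples 100000 30   # 30-second scrape interval
monthly_samples 100000 60   # 60-second scrape interval
```

With these inputs the helper prints 8640000000 and 4320000000, so doubling the scrape interval halves the billable sample volume for the same set of series.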

What kube-state-metrics and cAdvisor Each Cover

A common point of confusion when setting up GKE monitoring is what each component actually measures.

Source	What it measures	Example metrics
kube-state-metrics	Kubernetes object state – desired vs actual	kube_deployment_spec_replicas, kube_pod_status_phase, kube_node_status_condition
cAdvisor (via kubelet)	Container resource usage – actual consumption	container_cpu_usage_seconds_total, container_memory_working_set_bytes
node-exporter	Node-level OS and hardware metrics	node_cpu_seconds_total, node_filesystem_avail_bytes

You need all three for complete GKE visibility. kube-prometheus-stack installs all of them. GMP’s managed collection also covers all three when fully configured.

Key PromQL Queries for GKE

CPU usage by namespace:

sum(rate(container_cpu_usage_seconds_total{
  namespace!="kube-system",
  container!=""
}[5m])) by (namespace)

Memory usage by pod:

sum(container_memory_working_set_bytes{
  container!="",
  namespace!="kube-system"
}) by (pod, namespace)

Pods not in Running state:

kube_pod_status_phase{
  phase!="Running",
  phase!="Succeeded"
} == 1

Node CPU saturation (alert threshold: > 80%). node-exporter series are normally keyed by the instance label; group by node only if your relabeling adds that label:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

Deployment replicas available vs desired:

kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0

Container restart rate in restarts per minute (useful for crashloop detection):

rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0
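
Any of these expressions can be turned into an alert in the self-hosted stack by wrapping it in a PrometheusRule resource. The sketch below does this for the deployment replica query. The file name, rule name, 10-minute "for" duration, and severity label are assumptions, and the release: prometheus label assumes the Helm release is named prometheus and that the chart's default rule selector is in use.

```shell
# Write a PrometheusRule manifest for the replica-availability query.
# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > gke-alerts.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gke-basic-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: gke-basic
      rules:
        - alert: DeploymentReplicasUnavailable
          expr: kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
EOF

# Print the manifest for review.
cat gke-alerts.yaml
```

Applying the file with kubectl apply -f gke-alerts.yaml should make the rule appear under Alerts in the Prometheus UI once the operator reloads its configuration.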

Grafana Dashboard IDs for GKE

Import these directly in Grafana via Dashboards > Import > Enter dashboard ID:

Dashboard	Grafana ID	What it shows
Kubernetes Cluster Monitoring	315	Cluster-wide CPU, memory, pods, nodes
Kubernetes Cluster (kube-prometheus)	6417	Nodes, pods, deployments overview
Kubernetes Namespace Resources	7249	Per-namespace CPU and memory
Node Exporter Full	1860	Per-node OS metrics
kube-state-metrics	13332	Kubernetes object states

Note: Dashboard ID 315 requires node-exporter to be running. kube-prometheus-stack installs node-exporter automatically.

GKE-Specific Gotchas

Private cluster webhook firewall: If you run a private GKE cluster, the kube-prometheus-stack admission webhooks require a firewall rule allowing the GKE control plane to reach the webhook port on your Prometheus Operator pod (8443 here; the port depends on the chart version, so check prometheusOperator.tls.internalPort in your values). Without this, calls to the webhook time out and the resources it validates, such as PrometheusRules, cannot be created. Either add the firewall rule or disable webhooks with --set prometheusOperator.admissionWebhooks.enabled=false for non-production clusters.
Do not run GMP and self-hosted Prometheus scraping the same targets: If GMP is already enabled on your cluster and you also deploy kube-prometheus-stack without disabling GMP collection, you will have duplicate metric ingestion. Either disable GMP (--no-enable-managed-prometheus at cluster creation, or --disable-managed-prometheus with gcloud container clusters update on an existing cluster) or configure kube-prometheus-stack to scrape only your application metrics while GMP handles system metrics.
Grafana storage defaults to ephemeral: The default kube-prometheus-stack Grafana installation has no persistent storage and uses an emptyDir volume. Any dashboards you create or import through the UI are lost when the Grafana pod is deleted or rescheduled. Always set grafana.persistence.enabled=true in production.
Scrape interval for GKE: The default scrape interval in kube-prometheus-stack is 30 seconds, which is appropriate for most GKE workloads. For high-cardinality environments with many pods, increasing it to 60 seconds halves the sample ingestion rate, which eases load on Prometheus.
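
The scrape interval change from the last item can be expressed as a small values overlay. This is a sketch under assumptions: the file name values-scrape.yaml is arbitrary, and the evaluation interval is raised alongside the scrape interval so rules are not evaluated more often than new data arrives.

```shell
# Write a values overlay that raises the global scrape interval to 60s.
# File name is an assumption; evaluationInterval is matched by choice.
cat > values-scrape.yaml <<'EOF'
prometheus:
  prometheusSpec:
    scrapeInterval: 60s
    evaluationInterval: 60s
EOF

# Print the overlay for review.
cat values-scrape.yaml
```

Apply it with helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --reuse-values -f values-scrape.yaml so the storage settings from the original install are preserved. Alert rules that use short rate windows may need wider ranges after the change.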

When Prometheus and Grafana Metrics Are Not Enough

Prometheus and Grafana give you exceptional visibility into cluster resource health: CPU saturation, memory pressure, pod restarts, deployment availability, node status. These are the right tools for infrastructure-level alerting and capacity planning.

What they do not show is what is happening inside individual requests. When CPU spikes on a node, Prometheus tells you the node is hot. It does not tell you which service is generating the load, which API endpoint is being hit, or which downstream database call is taking 3 seconds on every request. Metrics tell you that something is wrong. Traces tell you why.

Teams running GKE typically already have a Prometheus and Grafana stack for infrastructure metrics. CubeAPM layers distributed tracing on top of that existing setup via the OpenTelemetry SDK – no agents to deploy on nodes, no changes to your Prometheus configuration. When a CPU alert fires in Grafana, engineers switch to CubeAPM to navigate directly to the service generating the load, the specific endpoint, and the trace that shows the slow downstream call responsible. The two tools answer different questions from the same incident. CubeAPM can be self-hosted inside your GKE cluster, keeping all trace data within your GCP project.

Summary

Approach	Best for	Operational overhead
kube-prometheus-stack (self-hosted)	Full control, no extra GCP cost	You manage Prometheus scaling and storage
Google Managed Prometheus + Grafana	Large clusters, multi-cluster, long retention	Minimal – Google manages collection and storage
Both combined	App metrics in self-hosted, system metrics in GMP	Medium – requires careful target scoping to avoid duplicates

Start with kube-prometheus-stack if you want a self-contained setup in under 10 minutes. Use GMP if you are managing multiple GKE clusters and want unified querying across all of them without running Prometheus servers per cluster. In either case, the GKE-specific RBAC step and persistent storage configuration are non-negotiable for a production-ready deployment.

Disclaimer: Commands, Helm values, and configuration examples are for guidance only – verify against current kube-prometheus-stack documentation and Google Managed Service for Prometheus documentation before applying to production. GKE behavior and GCP pricing change over time. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.
