CubeAPM

How to Monitor EKS Pods and Nodes with Grafana

You monitor EKS pods and nodes with Grafana by deploying a Prometheus stack inside the cluster to scrape metrics from kube-state-metrics, cAdvisor, and node-exporter, then connecting Grafana to Prometheus as a data source. The kube-prometheus-stack Helm chart does all of this in a single command and ships with pre-built dashboards for pods, nodes, deployments, and namespaces.

There are two deployment paths. The first is fully self-hosted: run Prometheus and Grafana inside your EKS cluster with EBS persistent storage. The second uses AWS managed services: Amazon Managed Service for Prometheus (AMP) handles metric collection and storage, and you connect either self-hosted Grafana or Amazon Managed Grafana (AMG) to it as a data source. Both paths use PromQL and the same Grafana dashboards.

Key Takeaways

  • EKS requires the EBS CSI driver for persistent storage – without it, Prometheus and Grafana pods use ephemeral storage and lose all data on restart. This is the most commonly missed prerequisite
  • The EBS CSI driver now recommends EKS Pod Identity for IAM permissions over the older IRSA method – EKS Pod Identity requires Kubernetes 1.24+ and EC2 nodes; Fargate workloads still require IRSA
  • The default Grafana admin password from kube-prometheus-stack is prom-operator – this is publicly known, change it immediately after first login
  • EKS runs the control plane (API server, etcd, controller manager, scheduler) in an AWS-managed account, so etcd, the controller manager, and the scheduler are not available as in-cluster scrape targets for self-hosted Prometheus – plan on AMP or CloudWatch Container Insights for control plane visibility
  • kube-state-metrics measures object state (replica counts, pod phase), cAdvisor measures actual resource usage (CPU, memory per container) – you need both for complete pod visibility
  • Amazon Managed Service for Prometheus only accepts SigV4-signed requests – self-hosted Grafana supports SigV4 from version 7.3.5 onward

Prerequisites

Before installing the monitoring stack, verify these are in place:

  • EKS cluster running with managed node groups on EC2 (Fargate nodes cannot mount EBS volumes – Prometheus requires EC2 nodes)
  • kubectl configured against the cluster
  • Helm 3 installed
  • AWS CLI configured with credentials for your account
  • OIDC provider associated with the cluster (required only if using IRSA – EKS Pod Identity does not require OIDC setup)
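
A quick way to confirm the command-line prerequisites is to check that each tool is on your PATH. The helper below is a minimal sketch; the function name missing_tools is illustrative and not part of any of these CLIs. It checks only that the binaries exist, not that they are configured against the right cluster or account.

```shell
# Print the name of every listed CLI tool that is not on PATH.
# The tool names are passed as arguments so the check can be reused.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the tool cannot be found
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Usage (run before starting the install):
#   missing_tools kubectl helm aws eksctl
# Empty output means every tool was found.
```

If anything is printed, install that tool before continuing with Step 1.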

If you plan to use IRSA rather than EKS Pod Identity, associate the OIDC provider:

eksctl utils associate-iam-oidc-provider \
  --region your-region \
  --cluster your-cluster-name \
  --approve

EKS Pod Identity eliminates the need for OIDC entirely – this is one of its key advantages over IRSA.

Step 1: Install the EBS CSI Driver (Required for Persistent Storage)

Without the EBS CSI driver, Prometheus and Grafana use emptyDir volumes that are wiped every time a pod restarts. This is not acceptable for production. The EBS CSI driver enables dynamic provisioning of EBS volumes as Kubernetes PersistentVolumes.

AWS now recommends EKS Pod Identity over the older IRSA method for attaching IAM permissions to the CSI driver. EKS Pod Identity requires Kubernetes 1.24 or later and EC2 nodes – Fargate workloads must continue using IRSA.

Install via eksctl using EKS Pod Identity (recommended for new clusters):

# Step 1: Install the EKS Pod Identity agent add-on (if not already present)
eksctl create addon \
  --name eks-pod-identity-agent \
  --cluster your-cluster-name \
  --region your-region

# Step 2: Create the IAM role for the EBS CSI driver
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster your-cluster-name \
  --region your-region \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicyV2 \
  --approve

# Step 3: Install the EBS CSI driver add-on referencing the IAM role
eksctl create addon \
  --name aws-ebs-csi-driver \
  --cluster your-cluster-name \
  --region your-region \
  --service-account-role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole \
  --force

Replace YOUR_ACCOUNT_ID with your AWS account ID. The IAM role step is required – installing the add-on without the role will result in UnauthorizedOperation errors when PVCs are created.

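If you prefer IRSA (or are on Fargate): replace Step 1 with eksctl utils associate-iam-oidc-provider and skip the Pod Identity agent installation. Steps 2 and 3 remain the same. Note that eksctl create iamserviceaccount builds the role's trust policy around the cluster's OIDC provider, so the provider must be associated before Step 2 runs.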

Verify the driver is running:

kubectl get pods -n kube-system | grep ebs-csi

You should see the ebs-csi-controller deployment and ebs-csi-node DaemonSet pods in Running state.

Create a gp3 StorageClass (gp3 provides better baseline performance and lower cost than gp2):

# gp3-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp3
  encrypted: "true"

Apply the manifest:

kubectl apply -f gp3-storageclass.yaml

# If gp2 is currently the default, remove that annotation
kubectl patch storageclass gp2 \
  -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
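
After these two commands, exactly one StorageClass should carry the default marker. As a rough check, the sketch below filters the output of kubectl get storageclass --no-headers, where kubectl prints "(default)" after the name of a default class. The function name default_storage_classes is illustrative.

```shell
# Read `kubectl get storageclass --no-headers` output on stdin and print
# the name of every class that kubectl marks with "(default)".
default_storage_classes() {
  awk '$2 == "(default)" { print $1 }'
}

# Usage:
#   kubectl get storageclass --no-headers | default_storage_classes
# A single line reading "gp3" is the result you want.
```

If both gp2 and gp3 are listed, the gp2 patch did not take effect and is worth re-running before you install the chart.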

Step 2: Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=gp3 \
  --set grafana.persistence.size=10Gi \
  --set grafana.adminPassword=your-secure-password

Set grafana.adminPassword to a strong password at install time rather than relying on the default prom-operator.
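
One way to produce a strong password is to read random bytes from the operating system rather than inventing one. The sketch below assumes head, base64, and /dev/urandom are available, which holds on typical Linux and macOS shells; generate_password and GRAFANA_PASSWORD are illustrative names.

```shell
# Generate a random password from the kernel's random source.
# 24 random bytes encode to exactly 32 base64 characters (no padding).
generate_password() {
  head -c 24 /dev/urandom | base64 | tr -d '\n'
}

# Usage:
#   GRAFANA_PASSWORD="$(generate_password)"
#   helm install ... --set grafana.adminPassword="$GRAFANA_PASSWORD"
# Store the value in your secrets manager; it is not shown again.
```

Base64 output contains no commas, so the value can be passed through --set without escaping.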

What gets installed:

  • Prometheus with 15-day retention on a 50Gi EBS gp3 volume
  • Grafana with 10Gi persistent storage and pre-configured Prometheus data source
  • Alertmanager
  • kube-state-metrics
  • prometheus-node-exporter DaemonSet on every EC2 node
  • Pre-built Kubernetes dashboards and alert rules via the Kubernetes Monitoring Mixin

Verify all pods are running:

kubectl get pods -n monitoring

All pods should show Running. If Prometheus pods are stuck in Pending, the EBS CSI driver is usually the cause – check PVC status with kubectl get pvc -n monitoring.
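
To narrow down a Pending pod, it can help to list only the claims that have not bound to a volume. The sketch below filters kubectl get pvc --no-headers output on the STATUS column; unbound_pvcs is an illustrative name.

```shell
# Read `kubectl get pvc --no-headers` output on stdin and print the name
# of every claim whose STATUS column is anything other than "Bound".
unbound_pvcs() {
  awk 'NF && $2 != "Bound" { print $1 }'
}

# Usage:
#   kubectl get pvc -n monitoring --no-headers | unbound_pvcs
# Then inspect the events of any claim it prints:
#   kubectl describe pvc <claim-name> -n monitoring
```

The events on a Pending claim usually name the cause, such as a missing StorageClass or a permissions error from the CSI driver.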

Step 3: Access Grafana

# Port-forward Grafana to your local machine
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Open http://localhost:3000 and log in as admin with the password you set at install. To read the password back from the release's Secret:

kubectl get secret -n monitoring prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo

Change the default password immediately if you did not set one at installation. The default prom-operator is publicly known – go to Profile > Change Password before exposing Grafana externally.

Step 4: Expose Grafana (Production Access)

For persistent access without port-forwarding, expose Grafana via a LoadBalancer. For an internal-only endpoint, use the AWS internal load balancer annotation:

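# --reuse-values keeps the storage and password settings from the original install;
# --set-string keeps the annotation value a string, which Kubernetes requires
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set grafana.service.type=LoadBalancer \
  --set-string grafana.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-internal"=true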

Remove the annotation if you need a public endpoint, and ensure your security groups restrict access appropriately.
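
A Helm upgrade that passes new values without --reuse-values starts again from the chart defaults, so a long list of --set flags is easy to lose between commands. One option is to keep the settings in a values file and pass it with -f on every install or upgrade. The sketch below writes a file that mirrors the storage flags from Step 2; the function name and file path are illustrative, and the Grafana password is deliberately left out so it does not end up on disk.

```shell
# Write the storage settings from Step 2 to a Helm values file.
# $1: output path, $2: StorageClass name
write_monitoring_values() {
  cat > "$1" <<EOF
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: $2
          resources:
            requests:
              storage: 50Gi
grafana:
  persistence:
    enabled: true
    storageClassName: $2
    size: 10Gi
EOF
}

# Usage:
#   write_monitoring_values monitoring-values.yaml gp3
#   helm upgrade prometheus prometheus-community/kube-prometheus-stack \
#     --namespace monitoring -f monitoring-values.yaml
```

Keeping the file in version control also gives you a reviewable record of what the release is supposed to look like.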

What kube-state-metrics, cAdvisor, and node-exporter Each Cover

| Source | What it measures | Key metrics |
|---|---|---|
| kube-state-metrics | Kubernetes object state | kube_pod_status_phase, kube_deployment_spec_replicas, kube_node_status_condition |
| cAdvisor (via kubelet) | Container resource consumption | container_cpu_usage_seconds_total, container_memory_working_set_bytes |
| node-exporter | EC2 node OS and hardware | node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_filesystem_avail_bytes |

All three are installed automatically by kube-prometheus-stack.

Key PromQL Queries for EKS Pods and Nodes

Pod CPU usage by namespace:

sum(rate(container_cpu_usage_seconds_total{
  container!="",
  namespace!="kube-system"
}[5m])) by (pod, namespace)

Pod memory usage (working set):

sum(container_memory_working_set_bytes{
  container!="",
  namespace!="kube-system"
}) by (pod, namespace)

Pods not in Running or Succeeded phase:

kube_pod_status_phase{
  phase!="Running",
  phase!="Succeeded"
} == 1

Node CPU utilization (percent busy, per node-exporter instance):

100 - (avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)

Node memory available percentage:

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Pod container restart rate (crashloop detection):

rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0

Deployment replicas available vs desired:

kube_deployment_spec_replicas
  - kube_deployment_status_replicas_available > 0
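
The restart-rate query above multiplies a per-second rate by 60 to get restarts per minute, which can be hard to read at a glance. The sketch below reproduces that arithmetic for a known restart count and window so you can sanity-check alert thresholds. It is a simplification: Prometheus extrapolates rate() to the window edges, so real values differ slightly. The function name is illustrative.

```shell
# Convert a restart count observed over a window into restarts per minute,
# mirroring rate(...[window]) * 60 from the query above.
# $1: restarts observed, $2: window length in seconds
restarts_per_minute() {
  awk -v n="$1" -v w="$2" 'BEGIN { printf "%.2f\n", n / w * 60 }'
}

# Usage: 3 restarts within the 15-minute (900 s) window
#   restarts_per_minute 3 900    # prints 0.20
```

A pod in a crash loop with backoff tends to sit well below one restart per minute, which is why the query alerts on any value above zero rather than on a higher threshold.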

Grafana Dashboard IDs for EKS

Import directly in Grafana via Dashboards > Import > Enter ID:

| Dashboard | Grafana ID | What it shows |
|---|---|---|
| Kubernetes EKS Cluster (Prometheus) | 17119 | EKS-specific cluster overview |
| Kubernetes Cluster Monitoring | 315 | Cluster-wide CPU, memory, pods, nodes |
| Node Exporter Full | 1860 | Per-node OS and hardware metrics |
| Kubernetes Namespace Resources | 7249 | Per-namespace CPU and memory |
| Kubernetes Pod Resources | 6336 | Per-pod CPU, memory, restarts |
| kube-state-metrics | 13332 | Kubernetes object states |

Dashboards 315 and 1860 require node-exporter to be running, which kube-prometheus-stack installs automatically. Dashboard 17119 is built specifically for EKS and is a good starting point for EKS-focused teams.

Option 2: Amazon Managed Service for Prometheus (AMP) + Grafana

If you prefer not to manage Prometheus infrastructure or need longer metric retention, AMP stores metrics in a fully managed, highly available backend replicated across three Availability Zones in the same region.

Create an AMP workspace:

aws amp create-workspace \
  --alias my-eks-monitoring \
  --region us-east-1

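Note the workspaceId from the output. The prometheusEndpoint for the workspace is returned by aws amp describe-workspace --workspace-id your-workspace-id once the workspace is active.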
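
Both AMP URLs used in this section follow the same pattern, so they can be derived from the region and workspace ID rather than copied by hand. The sketch below builds them; the function names are illustrative, and the URL shapes are the ones this guide already uses.

```shell
# Build the AMP remote_write URL used by Prometheus.
# $1: AWS region, $2: AMP workspace ID
amp_remote_write_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' "$1" "$2"
}

# Build the AMP query URL used as the Grafana data source URL.
amp_query_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s\n' "$1" "$2"
}

# Usage (captures the ID straight from the create call):
#   WORKSPACE_ID="$(aws amp create-workspace --alias my-eks-monitoring \
#     --region us-east-1 --query workspaceId --output text)"
#   amp_remote_write_url us-east-1 "$WORKSPACE_ID"
```

The --query and --output flags are general AWS CLI options for extracting one field as plain text.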

Configure Prometheus to remote_write to AMP. Add the remote write configuration to your kube-prometheus-stack installation:

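WORKSPACE_ID=your-workspace-id
REGION=us-east-1

# --reuse-values keeps the storage and Grafana settings from the original install;
# each --set argument is quoted so the shell does not interpret the [0] index
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set "prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.${REGION}.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write" \
  --set "prometheus.prometheusSpec.remoteWrite[0].sigv4.region=${REGION}"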

The Prometheus service account needs the AmazonPrometheusRemoteWriteAccess IAM policy attached via EKS Pod Identity or IRSA.

Connect Grafana to AMP:

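AMP requires SigV4 authentication. Self-hosted Grafana 7.3.5 and later includes SigV4 support, but it is off by default: set sigv4_auth_enabled = true in the [auth] section of grafana.ini (or the GF_AUTH_SIGV4_AUTH_ENABLED=true environment variable) so that the SigV4 option appears in the data source settings.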

  1. In Grafana, go to Configuration > Data Sources > Add data source > Prometheus
  2. Set the URL to your AMP query endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/your-workspace-id
  3. Enable SigV4 Auth
  4. Set the Default Region to match your AMP workspace region
  5. Click Save and Test
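
The same data source can also be declared as a Grafana provisioning file instead of being clicked through in the UI. The sketch below writes such a file. Treat it as an assumption-laden starting point: the data source name AMP, the function name, and the choice of sigV4AuthType: default (use the AWS SDK's default credential chain) are illustrative, and the jsonData keys should be checked against the Grafana version you run.

```shell
# Write a Grafana data source provisioning file for an AMP workspace.
# $1: output path, $2: AWS region, $3: AMP workspace ID
write_amp_datasource() {
  cat > "$1" <<EOF
apiVersion: 1
datasources:
  - name: AMP
    type: prometheus
    access: proxy
    url: https://aps-workspaces.$2.amazonaws.com/workspaces/$3
    jsonData:
      httpMethod: POST
      sigV4Auth: true
      sigV4AuthType: default
      sigV4Region: $2
EOF
}

# Usage:
#   write_amp_datasource amp-datasource.yaml us-east-1 your-workspace-id
```

Grafana reads files like this from its provisioning/datasources directory at startup, and SigV4 still has to be enabled in the Grafana configuration for the settings to take effect.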

If you use Amazon Managed Grafana (AMG), it can discover AMP workspaces automatically and manage the SigV4 credential configuration for you – no manual credential setup required.

EKS-Specific Gotchas

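  • EKS control plane components are not in-cluster scrape targets: Unlike self-managed Kubernetes, EKS does not run etcd, the controller manager, or the scheduler on nodes you can reach, so scrape jobs aimed at those components typically find no targets. The API server's /metrics endpoint is served through the cluster's Kubernetes API endpoint and can be scraped with suitable RBAC. For broader control plane visibility (API server latency, request error rates), use CloudWatch Container Insights or AMP. EKS control plane logging is a separate feature that sends logs, not metrics, to CloudWatch Logs.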
  • Fargate nodes cannot run node-exporter or mount EBS volumes: If your workload runs on Fargate, node-exporter cannot be scheduled as a DaemonSet (there is no underlying EC2 node to run on), and EBS PersistentVolumes cannot be mounted to Fargate pods. Run Prometheus and Grafana on EC2 nodes even in mixed clusters. You can still monitor Fargate pod metrics via kube-state-metrics and cAdvisor, but node-level OS metrics will not be available for Fargate nodes.
  • The default gp2 StorageClass may still be set as default: Many EKS clusters created before mid-2023 have gp2 as the default StorageClass. gp2 uses burst IOPS credits that can deplete under sustained Prometheus write load, causing write latency spikes. Switching to gp3 (as shown in Step 1) provides consistent 3,000 IOPS baseline without burst credit mechanics.
  • Metrics Server is separate from Prometheus: The Kubernetes Metrics Server (used by kubectl top pods) is not installed by kube-prometheus-stack. If you need kubectl top to work, install it separately. Prometheus does not depend on Metrics Server – it scrapes kubelet and cAdvisor directly.
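
To put numbers on the gp2 point above: gp2 baseline performance scales with volume size at 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000 IOPS. The sketch below computes that baseline so you can compare it with gp3's flat 3,000; the function name is illustrative.

```shell
# Compute the gp2 baseline IOPS for a volume size in GiB:
# 3 IOPS per GiB, never below 100 and never above 16000.
gp2_baseline_iops() {
  awk -v s="$1" 'BEGIN {
    iops = s * 3
    if (iops < 100) iops = 100
    if (iops > 16000) iops = 16000
    print iops
  }'
}

# Usage: the 50Gi Prometheus volume from Step 2
#   gp2_baseline_iops 50    # prints 150
```

A 50Gi gp2 volume therefore sustains 150 IOPS once its burst credits run out, against 3,000 on gp3 at the same size.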

When Prometheus and Grafana Metrics Are Not Enough

Prometheus and Grafana give you strong pod and node visibility: CPU and memory consumption, pod restarts, deployment availability, and node saturation. These metrics are well-suited for alerting on resource exhaustion and capacity planning.

What they do not provide is the path from a resource metric to the application behavior causing it. When node CPU spikes or a pod restart rate increases, Prometheus tells you the cluster symptom. It does not tell you which service endpoint is generating the CPU load, which database query is timing out and triggering retries, or whether the spike is from one specific pod or evenly spread across a deployment.

CubeAPM runs inside your EKS cluster alongside kube-prometheus-stack – no external data egress, no agents to manage separately. It instruments your application services via the OpenTelemetry SDK and captures distributed traces that connect a Grafana metric spike to the specific request path, service, and downstream call that caused it. When a pod CPU alarm fires in Grafana, the trace in CubeAPM shows you which endpoint was being hit, how many times per second, and what it was waiting on. The two tools answer complementary questions from the same incident: Prometheus answers what the cluster is doing, CubeAPM answers why your application is behaving that way.

Summary

| Step | What it does | Why it matters |
|---|---|---|
| Associate OIDC provider | Enables IRSA | Only required if using IRSA – Pod Identity does not need OIDC |
| Install EBS CSI driver | Enables EBS PersistentVolumes | Without this, all metric data is lost on pod restart |
| Create gp3 StorageClass | Consistent IOPS for Prometheus writes | Avoids gp2 burst credit depletion under sustained load |
| Install kube-prometheus-stack | Deploys full monitoring stack | Prometheus, Grafana, exporters, dashboards, alerts in one command |
| Set Grafana password at install | Avoids default prom-operator exposure | Default password is publicly known |

Start with the EBS CSI driver – skip this step and the entire monitoring setup is fragile. After that, kube-prometheus-stack is a single Helm command. Use AMP if you need managed Prometheus storage, cross-cluster metrics, or retention beyond what local EBS storage provides.

Disclaimer: Commands, IAM policies, and configuration examples are for guidance only – verify against current Amazon EKS documentation and kube-prometheus-stack documentation before applying to production. AWS service behavior and IAM requirements change over time. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.
