You monitor EKS pods and nodes with Grafana by deploying a Prometheus stack inside the cluster to scrape metrics from kube-state-metrics, cAdvisor, and node-exporter, then connecting Grafana to Prometheus as a data source. The kube-prometheus-stack Helm chart does all of this in a single command and ships with pre-built dashboards for pods, nodes, deployments, and namespaces.
There are two deployment paths. The first is fully self-hosted: run Prometheus and Grafana inside your EKS cluster with EBS persistent storage. The second uses AWS managed services: Amazon Managed Service for Prometheus (AMP) handles metric collection and storage, and you connect either self-hosted Grafana or Amazon Managed Grafana (AMG) to it as a data source. Both paths use PromQL and the same Grafana dashboards.
Key Takeaways
- EKS requires the EBS CSI driver for persistent storage – without it, Prometheus and Grafana pods use ephemeral storage and lose all data on restart. This is the most commonly missed prerequisite
- The EBS CSI driver now recommends EKS Pod Identity for IAM permissions over the older IRSA method – EKS Pod Identity requires Kubernetes 1.24+ and EC2 nodes; Fargate workloads still require IRSA
- The default Grafana admin password from kube-prometheus-stack is prom-operator – this is publicly known, so set your own at install time or change it immediately after first login
- EKS does not expose the control plane (API server, etcd, controller manager, scheduler) metrics to self-hosted Prometheus by default – you need AMP or CloudWatch Container Insights for control plane metrics
- kube-state-metrics measures object state (replica counts, pod phase), while cAdvisor measures actual resource usage (CPU, memory per container) – you need both for complete pod visibility
- Amazon Managed Service for Prometheus requires Grafana 7.3.5 or later and SigV4 authentication
Prerequisites
Before installing the monitoring stack, verify these are in place:
- EKS cluster running with managed node groups on EC2 (Fargate nodes cannot mount EBS volumes – Prometheus requires EC2 nodes)
- kubectl configured against the cluster
- Helm 3 installed
- AWS CLI configured with credentials for your account
- OIDC provider associated with the cluster (required only if using IRSA – EKS Pod Identity does not require OIDC setup)
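A quick way to confirm the command-line prerequisites is to check that each tool is on your PATH. The sketch below is a minimal example; the function name missing_tools is hypothetical, and it only checks that the binaries exist, not that they are configured against the right cluster or account.

```shell
#!/bin/sh
# Print every tool from the argument list that cannot be found on PATH.
# Prints nothing when all tools are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example (the CLIs this guide uses):
#   missing_tools kubectl helm aws eksctl
```

An empty result means all four binaries were found; anything printed needs to be installed before continuing.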
If you plan to use IRSA rather than EKS Pod Identity, associate the OIDC provider:
eksctl utils associate-iam-oidc-provider \
  --region your-region \
  --cluster your-cluster-name \
  --approve
EKS Pod Identity eliminates the need for OIDC entirely, which is one of its key advantages over IRSA.
Step 1: Install the EBS CSI Driver (Required for Persistent Storage)
Without the EBS CSI driver, Prometheus and Grafana use emptyDir volumes that are wiped every time a pod restarts. This is not acceptable for production. The EBS CSI driver enables dynamic provisioning of EBS volumes as Kubernetes PersistentVolumes.
AWS now recommends EKS Pod Identity over the older IRSA method for attaching IAM permissions to the CSI driver. EKS Pod Identity requires Kubernetes 1.24 or later and EC2 nodes – Fargate workloads must continue using IRSA.
Install via eksctl using EKS Pod Identity (recommended for new clusters):
# Step 1: Install the EKS Pod Identity agent add-on (if not already present)
eksctl create addon \
  --name eks-pod-identity-agent \
  --cluster your-cluster-name \
  --region your-region
# Step 2: Create the IAM role for the EBS CSI driver
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster your-cluster-name \
  --region your-region \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicyV2 \
  --approve
# Step 3: Install the EBS CSI driver add-on referencing the IAM role
eksctl create addon \
  --name aws-ebs-csi-driver \
  --cluster your-cluster-name \
  --region your-region \
  --service-account-role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole \
  --force
Replace YOUR_ACCOUNT_ID with your AWS account ID. The IAM role step is required: installing the add-on without the role results in UnauthorizedOperation errors when PVCs are created.
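If you prefer IRSA (or are on Fargate): replace Step 1 with eksctl utils associate-iam-oidc-provider and skip the Pod Identity agent installation. Steps 2 and 3 remain the same. Note that eksctl create iamserviceaccount builds an OIDC-based trust policy, so Step 2 as written expects the cluster's OIDC provider to be associated; for a role trusted purely through Pod Identity, eksctl create podidentityassociation is the corresponding command.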
Verify the driver is running:
kubectl get pods -n kube-system | grep ebs-csi
You should see the ebs-csi-controller deployment and ebs-csi-node DaemonSet pods in Running state.
Create a gp3 StorageClass (gp3 provides better baseline performance and lower cost than gp2):
# gp3-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp3
  encrypted: "true"

kubectl apply -f gp3-storageclass.yaml
# If gp2 is currently the default, remove that annotation
kubectl patch storageclass gp2 \
  -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
Step 2: Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=gp3 \
  --set grafana.persistence.size=10Gi \
  --set grafana.adminPassword=your-secure-password
Set grafana.adminPassword to a strong password at install time rather than relying on the default prom-operator.
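The same settings can be kept in a values file instead of a long list of --set flags, which makes later upgrades easier to repeat. The sketch below is one way to generate such a file; the function name write_monitoring_values and the GRAFANA_ADMIN_PASSWORD environment variable are assumptions made for this example, and the keys simply mirror the flags in the install command.

```shell
#!/bin/sh
# Write a Helm values file equivalent to the --set flags used in this guide.
# Usage: write_monitoring_values <output-file>
# Assumes GRAFANA_ADMIN_PASSWORD is exported and contains no double quotes.
write_monitoring_values() {
  out_file=$1
  : "${GRAFANA_ADMIN_PASSWORD:?set GRAFANA_ADMIN_PASSWORD first}"
  cat > "$out_file" <<EOF
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
EOF
}

# Example:
#   export GRAFANA_ADMIN_PASSWORD='your-secure-password'
#   write_monitoring_values values.yaml
#   helm install prometheus prometheus-community/kube-prometheus-stack \
#     --namespace monitoring -f values.yaml
```

Because the file contains the admin password in plain text, treat it like any other secret and keep it out of version control.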
What gets installed:
- Prometheus with 15-day retention on a 50Gi EBS gp3 volume
- Grafana with 10Gi persistent storage and pre-configured Prometheus data source
- Alertmanager
- kube-state-metrics
- prometheus-node-exporter DaemonSet on every EC2 node
- Pre-built Kubernetes dashboards and alert rules via the Kubernetes Monitoring Mixin
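Whether 50Gi is enough for 15 days depends on how many samples your cluster produces. The Prometheus documentation gives a rough sizing formula: retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula; the function name is hypothetical, and the sample rates in the examples are illustrative inputs rather than measurements from any cluster.

```shell
#!/bin/sh
# Rough Prometheus disk estimate in GiB, rounded up.
# Usage: estimate_prometheus_gib <retention_days> <samples_per_second> [bytes_per_sample]
# bytes_per_sample defaults to 2 (the upper end of the documented 1-2 byte average).
estimate_prometheus_gib() {
  retention_days=$1
  samples_per_second=$2
  bytes_per_sample=${3:-2}
  total_bytes=$(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
  gib=1073741824
  echo $(( (total_bytes + gib - 1) / gib ))
}

# Examples:
#   estimate_prometheus_gib 15 10000   # 10k samples/s for 15 days
#   estimate_prometheus_gib 15 20000   # 20k samples/s for 15 days
```

At 10,000 samples per second the estimate is 25 GiB; at 20,000 it is 49 GiB, which leaves almost no headroom on a 50Gi volume. Once Prometheus is running, rate(prometheus_tsdb_head_samples_appended_total[5m]) reports the actual ingestion rate to plug in.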
Verify all pods are running:
kubectl get pods -n monitoring
All pods should show Running. If Prometheus pods are stuck in Pending, the EBS CSI driver is usually the cause. Check PVC status with kubectl get pvc -n monitoring.
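When there are several claims, it can be quicker to filter the PVC listing down to the ones that are not Bound. The helper below is a small sketch that reads kubectl get pvc --no-headers output from stdin; the function name unbound_pvcs is hypothetical, and it assumes the default column layout where the second column is STATUS.

```shell
#!/bin/sh
# Read `kubectl get pvc --no-headers` output on stdin and print the name and
# status of every claim whose STATUS column is not "Bound".
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Example:
#   kubectl get pvc -n monitoring --no-headers | unbound_pvcs
```

Any claim printed here is worth inspecting with kubectl describe pvc to see the provisioning events.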
Step 3: Access Grafana
# Port-forward Grafana to your local machine
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Open http://localhost:3000. Log in with the password you set at install, or retrieve the generated one:
kubectl get secret -n monitoring prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo
Change the default password immediately if you did not set one at installation. The default prom-operator is publicly known, so go to Profile > Change Password before exposing Grafana externally.
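To check whether a cluster is still running with the well-known default, you can compare the decoded secret against prom-operator without printing it to the terminal. This is a minimal sketch; the function name is hypothetical, and it assumes a base64 binary that accepts --decode, as the command above does.

```shell
#!/bin/sh
# Succeeds (exit 0) when the base64 value on stdin decodes to the chart's
# well-known default Grafana password, and fails otherwise.
is_default_grafana_password() {
  decoded=$(base64 --decode)
  [ "$decoded" = "prom-operator" ]
}

# Example:
#   kubectl get secret -n monitoring prometheus-grafana \
#     -o jsonpath="{.data.admin-password}" | is_default_grafana_password \
#     && echo "Default password still in use: change it"
```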
Step 4: Expose Grafana (Production Access)
For persistent access without port-forwarding, expose Grafana via a LoadBalancer. For an internal-only endpoint, use the AWS internal load balancer annotation:
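helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set grafana.service.type=LoadBalancer \
  --set-string grafana.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-internal"=true
Remove the annotation if you need a public endpoint, and ensure your security groups restrict access appropriately. The --reuse-values flag keeps the retention, storage, and password settings from the original install, and --set-string keeps the annotation value a string, which Kubernetes requires for annotations.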
What kube-state-metrics, cAdvisor, and node-exporter Each Cover
| Source | What it measures | Key metrics |
| --- | --- | --- |
| kube-state-metrics | Kubernetes object state | kube_pod_status_phase, kube_deployment_spec_replicas, kube_node_status_condition |
| cAdvisor (via kubelet) | Container resource consumption | container_cpu_usage_seconds_total, container_memory_working_set_bytes |
| node-exporter | EC2 node OS and hardware | node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_filesystem_avail_bytes |
All three are installed automatically by kube-prometheus-stack.
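When reading an unfamiliar query, the metric name prefix usually tells you which of the three sources it came from. The sketch below encodes that rule of thumb; the function name is hypothetical, and the prefix mapping is a naming convention rather than a guarantee, so treat "unknown" and edge cases with care.

```shell
#!/bin/sh
# Guess which exporter produced a metric, based on its name prefix.
# Note that kube_node_* metrics come from kube-state-metrics, not node-exporter,
# which is why the kube_ pattern is listed first.
metric_source() {
  case "$1" in
    kube_*)      echo "kube-state-metrics" ;;
    container_*) echo "cAdvisor" ;;
    node_*)      echo "node-exporter" ;;
    *)           echo "unknown" ;;
  esac
}

# Examples:
#   metric_source kube_node_status_condition
#   metric_source node_memory_MemAvailable_bytes
```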
Key PromQL Queries for EKS Pods and Nodes
Pod CPU usage by namespace:
sum(rate(container_cpu_usage_seconds_total{
container!="",
namespace!="kube-system"
}[5m])) by (pod, namespace)
Pod memory usage (working set):
sum(container_memory_working_set_bytes{
container!="",
namespace!="kube-system"
}) by (pod, namespace)
Pods not in Running or Succeeded phase:
kube_pod_status_phase{
phase!="Running",
phase!="Succeeded"
} == 1
Node CPU utilization (percent, per node-exporter instance):
100 - (avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)
Node memory available percentage:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
Pod container restart rate (crashloop detection):
rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0
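The units in the restart query are easy to misread: rate() returns a per-second value, and multiplying by 60 converts it to restarts per minute. The sketch below works through that arithmetic for a plain counter increase; the function name is hypothetical, and it ignores the extrapolation and counter-reset handling that Prometheus applies, so real results will differ slightly.

```shell
#!/bin/sh
# Approximate the value of rate(counter[window]) * 60 for a counter that
# increased by <restarts> over <window_seconds>.
# Usage: restarts_per_minute <restarts> <window_seconds>
restarts_per_minute() {
  awk -v restarts="$1" -v window="$2" \
    'BEGIN { printf "%.2f\n", restarts / window * 60 }'
}

# Example: 3 restarts inside the 15-minute (900 s) window
#   restarts_per_minute 3 900
```

Three restarts in 15 minutes works out to 0.20 restarts per minute, so any threshold above zero in the query fires on even a single restart in the window.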
Deployment replicas available vs desired:
kube_deployment_spec_replicas
  - kube_deployment_status_replicas_available > 0
Grafana Dashboard IDs for EKS
Import directly in Grafana via Dashboards > Import > Enter ID:
| Dashboard | Grafana ID | What it shows |
| --- | --- | --- |
| Kubernetes EKS Cluster (Prometheus) | 17119 | EKS-specific cluster overview |
| Kubernetes Cluster Monitoring | 315 | Cluster-wide CPU, memory, pods, nodes |
| Node Exporter Full | 1860 | Per-node OS and hardware metrics |
| Kubernetes Namespace Resources | 7249 | Per-namespace CPU and memory |
| Kubernetes Pod Resources | 6336 | Per-pod CPU, memory, restarts |
| kube-state-metrics | 13332 | Kubernetes object states |
Dashboards 315 and 1860 require node-exporter to be running, which kube-prometheus-stack installs automatically. Dashboard 17119 is built specifically for EKS and is a good starting point for EKS-focused teams.
Option 2: Amazon Managed Service for Prometheus (AMP) + Grafana
If you prefer not to manage Prometheus infrastructure or need longer metric retention, AMP stores metrics in a fully managed, highly available backend replicated across three Availability Zones in the same region.
Create an AMP workspace:
aws amp create-workspace \
  --alias my-eks-monitoring \
  --region us-east-1
Note the workspaceId and the prometheusEndpoint from the output.
Configure Prometheus to remote_write to AMP. Add the remote write configuration to your kube-prometheus-stack installation:
WORKSPACE_ID=your-workspace-id
REGION=us-east-1
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set "prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.${REGION}.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write" \
  --set "prometheus.prometheusSpec.remoteWrite[0].sigv4.region=${REGION}"
The --set arguments are quoted so the shell does not treat [0] as a glob pattern, and --reuse-values keeps the settings from the original install. The Prometheus service account needs the AmazonPrometheusRemoteWriteAccess IAM policy attached via EKS Pod Identity or IRSA.
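The remote write URL follows a fixed pattern built from the region and workspace ID, so it can be generated rather than typed by hand. The helper below is a small sketch of that pattern as it appears in this guide; the function name and the workspace ID in the example are hypothetical.

```shell
#!/bin/sh
# Build the AMP remote_write URL for a region and workspace ID.
# Usage: amp_remote_write_url <region> <workspace_id>
amp_remote_write_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' "$1" "$2"
}

# Example (ws-example is a placeholder, not a real workspace):
#   amp_remote_write_url us-east-1 ws-example
```

Dropping the /api/v1/remote_write suffix from the result gives the workspace URL that Grafana uses as its query endpoint.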
Connect Grafana to AMP:
AMP requires SigV4 authentication. Self-hosted Grafana 7.3.5 and later includes SigV4 support built in.
- In Grafana, go to Configuration > Data Sources > Add data source > Prometheus
- Set the URL to your AMP query endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/your-workspace-id
- Enable SigV4 Auth
- Set the Default Region to match your AMP workspace region
- Click Save and Test
If you use Amazon Managed Grafana (AMG), it can discover AMP workspaces automatically and manage the SigV4 credential configuration for you – no manual credential setup required.
EKS-Specific Gotchas
- EKS control plane metrics are not exposed to self-hosted Prometheus: Unlike self-managed Kubernetes, EKS does not expose API server, etcd, controller manager, or scheduler endpoints for Prometheus scraping. To monitor EKS control plane behavior (API server latency, request error rates), use CloudWatch Container Insights or AMP with the EKS control plane logging enabled.
- Fargate nodes cannot run node-exporter or mount EBS volumes: If your workload runs on Fargate, node-exporter cannot be scheduled as a DaemonSet (there is no underlying EC2 node to run on), and EBS PersistentVolumes cannot be mounted to Fargate pods. Run Prometheus and Grafana on EC2 nodes even in mixed clusters. You can still monitor Fargate pod metrics via kube-state-metrics and cAdvisor, but node-level OS metrics will not be available for Fargate nodes.
- The default gp2 StorageClass may still be set as default: Many EKS clusters created before mid-2023 have gp2 as the default StorageClass. gp2 uses burst IOPS credits that can deplete under sustained Prometheus write load, causing write latency spikes. Switching to gp3 (as shown in Step 1) provides consistent 3,000 IOPS baseline without burst credit mechanics.
- Metrics Server is separate from Prometheus: The Kubernetes Metrics Server (used by kubectl top pods) is not installed by kube-prometheus-stack. If you need kubectl top to work, install it separately. Prometheus does not depend on Metrics Server – it scrapes kubelet and cAdvisor directly.
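The gp2 point is easier to see with numbers. gp2 volumes get a baseline of 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000, and smaller volumes rely on burst credits to reach 3,000. The sketch below computes that baseline for a given volume size; the function name is hypothetical.

```shell
#!/bin/sh
# Baseline IOPS of a gp2 volume: 3 IOPS per GiB, minimum 100, maximum 16000.
# Usage: gp2_baseline_iops <size_gib>
gp2_baseline_iops() {
  iops=$(( $1 * 3 ))
  if [ "$iops" -lt 100 ]; then
    iops=100
  fi
  if [ "$iops" -gt 16000 ]; then
    iops=16000
  fi
  echo "$iops"
}

# Example: the 50Gi Prometheus volume from Step 2
#   gp2_baseline_iops 50
```

A 50Gi gp2 volume has a baseline of only 150 IOPS once its burst credits run out, compared with the 3,000 IOPS that gp3 provides regardless of size.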
When Prometheus and Grafana Metrics Are Not Enough
Prometheus and Grafana give you strong pod and node visibility: CPU and memory consumption, pod restarts, deployment availability, and node saturation. These metrics are well-suited for alerting on resource exhaustion and capacity planning.
What they do not provide is the path from a resource metric to the application behavior causing it. When node CPU spikes or a pod restart rate increases, Prometheus tells you the cluster symptom. It does not tell you which service endpoint is generating the CPU load, which database query is timing out and triggering retries, or whether the spike is from one specific pod or evenly spread across a deployment.
CubeAPM runs inside your EKS cluster alongside kube-prometheus-stack – no external data egress, no agents to manage separately. It instruments your application services via the OpenTelemetry SDK and captures distributed traces that connect a Grafana metric spike to the specific request path, service, and downstream call that caused it. When a pod CPU alarm fires in Grafana, the trace in CubeAPM shows you which endpoint was being hit, how many times per second, and what it was waiting on. The two tools answer complementary questions from the same incident: Prometheus answers what the cluster is doing, CubeAPM answers why your application is behaving that way.
Summary
| Step | What it does | Why it matters |
| --- | --- | --- |
| Associate OIDC provider | Enables IRSA | Only required if using IRSA – Pod Identity does not need OIDC |
| Install EBS CSI driver | Enables EBS PersistentVolumes | Without this, all metric data is lost on pod restart |
| Create gp3 StorageClass | Consistent IOPS for Prometheus writes | Avoids gp2 burst credit depletion under sustained load |
| Install kube-prometheus-stack | Deploys full monitoring stack | Prometheus, Grafana, exporters, dashboards, alerts in one command |
| Set Grafana password at install | Avoids default prom-operator exposure | Default password is publicly known |
Start with the EBS CSI driver – skip this step and the entire monitoring setup is fragile. After that, kube-prometheus-stack is a single Helm command. Use AMP if you need managed Prometheus storage, cross-cluster metrics, or retention beyond what local EBS storage provides.
Disclaimer: Commands, IAM policies, and configuration examples are for guidance only – verify against current Amazon EKS documentation and kube-prometheus-stack documentation before applying to production. AWS service behavior and IAM requirements change over time. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.