CubeAPM

Kubernetes Cost Monitoring: Resource Requests, Limits, and Right-Sizing Waste

Kubernetes resource requests and limits directly control cluster cost. A service requesting 2 CPU cores but using 400 millicores at peak wastes 80% of its allocated capacity. Multiply that waste across 50 services in a production cluster and the unused capacity could power another environment entirely. According to the CNCF 2024 FinOps for Kubernetes Report, 68% of organizations waste Kubernetes spend due to over-provisioned resources.

This guide walks through monitoring Kubernetes costs by tracking resource requests against actual usage, identifying over-provisioned workloads, and implementing right-sizing strategies that eliminate waste without risking stability. By the end, you will know how to measure utilization gaps, set accurate resource boundaries, and deploy continuous monitoring that keeps costs aligned with real demand.

Prerequisites

Before starting Kubernetes cost monitoring, ensure you have:

  • Access to a running Kubernetes cluster (1.20 or later recommended)
  • kubectl CLI installed and configured
  • Metrics Server deployed in your cluster for resource utilization data
  • Basic understanding of Kubernetes resource requests and limits
  • Cluster admin permissions or namespace-level access to view resource metrics
  • A monitoring tool that supports Kubernetes metrics collection (Prometheus, CubeAPM, or cloud-native monitoring)
  • Access to cloud provider cost dashboards if running on managed Kubernetes (EKS, AKS, GKE)

Step 1: Deploy Metrics Server and Verify Resource Data Collection

Metrics Server collects real-time resource utilization data from kubelets and exposes it via the Kubernetes API. Without Metrics Server, you cannot measure actual CPU and memory usage to compare against requests and limits.

First, check if Metrics Server is already deployed:

kubectl get deployment metrics-server -n kube-system

If the command returns Error from server (NotFound), Metrics Server is not installed. Deploy it with:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Wait 60 seconds for Metrics Server to start collecting data, then verify it is working:

kubectl top nodes

You should see CPU and memory usage for each node. If you see error: Metrics API not available, check Metrics Server logs:

kubectl logs -n kube-system deployment/metrics-server

Common issue: TLS certificate errors on self-signed clusters. Add --kubelet-insecure-tls to the Metrics Server deployment args as a temporary workaround for non-production clusters.

Once kubectl top nodes returns data, Metrics Server is ready. This unlocks kubectl top pods which shows actual resource consumption per pod, the foundation for all cost monitoring.

Step 2: Audit Current Resource Requests and Limits Across Namespaces

Before you can identify waste, you need a complete inventory of what resources are currently requested and limited across your cluster. Most teams discover that requests were set once during initial deployment and never revisited.

List all pods with their CPU and memory requests and limits:

kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | . as $pod | $pod.spec.containers[] |
  "\($pod.metadata.namespace) \($pod.metadata.name) \(.name) \(.resources.requests.cpu // "none") \(.resources.requests.memory // "none") \(.resources.limits.cpu // "none") \(.resources.limits.memory // "none")"'

This command extracts namespace, pod name, container name, CPU request, memory request, CPU limit, and memory limit for every container in every pod, printing none where a value is unset. Save the output to a file for comparison later.

Common findings at this stage:

  • Pods with no requests or limits at all (BestEffort QoS class, high eviction risk)
  • Identical requests across all services regardless of workload type
  • Limits set to arbitrary round numbers (1000m CPU, 1Gi memory) with no usage-based justification
  • CPU limits present on latency-sensitive services causing throttling under load

For each node, check the total CPU and memory already reserved by requests:

kubectl describe nodes | grep -A 5 "Allocated resources" | grep -E "cpu|memory"

This shows how much of each node’s capacity is reserved by requests. If allocated capacity exceeds 70% but kubectl top nodes shows actual usage under 40%, you have a utilization gap worth investigating.
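
To roll the saved audit output up by namespace, a small script can sum the request columns. The following is a minimal sketch, assuming each saved line starts with the namespace and ends with the four fields CPU request, memory request, CPU limit, memory limit, with none for unset values. The function names are hypothetical, and only the common quantity suffixes are handled.

```python
from collections import defaultdict

# Binary and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

def parse_cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if value == "none":
        return 0
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)

def parse_memory_bytes(value: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    if value == "none":
        return 0
    for suffix, factor in _MEMORY_FACTORS.items():
        if value.endswith(suffix):
            return round(float(value[: -len(suffix)]) * factor)
    return int(value)  # no suffix: already a byte count

def totals_by_namespace(lines):
    """Sum CPU and memory requests per namespace from the audit output."""
    totals = defaultdict(lambda: {"cpu_m": 0, "memory_bytes": 0})
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        namespace = fields[0]
        cpu_request, memory_request = fields[-4], fields[-3]
        totals[namespace]["cpu_m"] += parse_cpu_millicores(cpu_request)
        totals[namespace]["memory_bytes"] += parse_memory_bytes(memory_request)
    return dict(totals)

# Illustrative lines only; real input would come from the saved file.
sample = [
    "production api-1 500m 1Gi none 1Gi",
    "production web-1 1 512Mi 2 1Gi",
    "staging job-1 none none none none",
]
for namespace, total in totals_by_namespace(sample).items():
    print(namespace, total["cpu_m"], "m", total["memory_bytes"], "bytes")
```

Comparing these per-namespace totals against measured usage shows which teams or environments hold the largest share of reserved but idle capacity.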

Step 3: Measure Actual Resource Utilization Over Time

Spot checks with kubectl top show current usage, but right-sizing decisions require usage patterns over days or weeks. A service that peaks at 800m CPU during traffic spikes but averages 200m needs a request closer to 800m, not 200m, so that it is guaranteed enough CPU time when the node is busy.

Set up continuous monitoring using Prometheus or CubeAPM to track CPU and memory usage over time. If using Prometheus, the kube-prometheus project bundles Prometheus with kube-state-metrics and node-exporter. Clone the repository and apply its manifests:

git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
kubectl apply --server-side -f manifests/setup
kubectl apply -f manifests/

Wait for all components to reach running state, then access Prometheus at http://localhost:9090 after port-forwarding:

kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090

Query CPU usage by pod, with the graph range set to the past 7 days:

sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m]))

Query memory working set (actual memory in use, excluding cache):

container_memory_working_set_bytes{namespace="production"}

For each deployment, compare p95 actual usage against its current resource request. Export these queries to a dashboard or CSV for analysis.

Example scenario: A pod requests 1000m CPU. Prometheus shows p95 actual usage at 350m over 14 days. The pod is over-provisioned by 65%. If this pod runs 10 replicas, you are wasting 6.5 vCPU of schedulable capacity that could run other workloads or reduce node count.
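
The p95 figure and the over-provisioning percentage in this scenario can be reproduced from exported usage samples. The following is a minimal sketch, assuming the samples are CPU readings in millicores and using the nearest-rank percentile definition. Prometheus interpolates quantiles differently, so results may differ slightly, and the function names are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def overprovisioned_fraction(request_m, p95_m):
    """Share of the request that stays unused even at p95 usage."""
    return (request_m - p95_m) / request_m

# Illustrative samples only: 1..100 millicores, one reading each.
usage_samples_m = list(range(1, 101))
print(percentile(usage_samples_m, 95))

# The scenario above: 1000m requested, 350m p95 usage.
print(overprovisioned_fraction(1000, 350))
```

Using a percentile rather than the mean keeps short traffic spikes in view without letting a single outlier dictate the request.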

CubeAPM tracks Kubernetes resource utilization natively and correlates CPU and memory metrics with pod restarts, throttling events, and OOMKills. This eliminates the need to manually cross-reference multiple Prometheus queries. For teams running infrastructure monitoring across cloud and on-prem environments, CubeAPM provides unified visibility without requiring separate Prometheus and Grafana setup.

Step 4: Identify Over-Provisioned Workloads and Calculate Waste

With usage data collected, the next step is finding workloads where requests significantly exceed actual consumption. This is where cost waste lives.

Create a simple utilization ratio for each deployment:

Utilization ratio = (p95 actual usage) / (resource request)

Flag any deployment where:

  • CPU utilization ratio < 0.5 (using less than half of requested CPU)
  • Memory utilization ratio < 0.6 (using less than 60% of requested memory)

Example output from a production audit:

Deployment      CPU Request  CPU p95 Usage  CPU Ratio  Memory Request  Memory p95 Usage  Memory Ratio
api-gateway     1000m        320m           0.32       2Gi             800Mi             0.39
worker-service  500m         450m           0.90       1Gi             950Mi             0.93
frontend        2000m        600m           0.30       4Gi             1.2Gi             0.30

In this example, api-gateway and frontend are severely over-provisioned. The worker-service is appropriately sized.
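
The flagging rule can be applied mechanically once requests and p95 usage are in one place. The following is a minimal sketch using the three deployments from the audit above, with memory expressed in Mi (1.2Gi is rounded to 1229Mi). The class and function names are hypothetical.

```python
from dataclasses import dataclass

CPU_RATIO_THRESHOLD = 0.5      # flag when using less than half of requested CPU
MEMORY_RATIO_THRESHOLD = 0.6   # flag when using less than 60% of requested memory

@dataclass
class Workload:
    name: str
    cpu_request_m: int
    cpu_p95_m: int
    memory_request_mi: int
    memory_p95_mi: int

    @property
    def cpu_ratio(self) -> float:
        return self.cpu_p95_m / self.cpu_request_m

    @property
    def memory_ratio(self) -> float:
        return self.memory_p95_mi / self.memory_request_mi

def flag_overprovisioned(workloads):
    """Return names of workloads below either utilization threshold."""
    return [
        w.name
        for w in workloads
        if w.cpu_ratio < CPU_RATIO_THRESHOLD
        or w.memory_ratio < MEMORY_RATIO_THRESHOLD
    ]

audit = [
    Workload("api-gateway", 1000, 320, 2048, 800),
    Workload("worker-service", 500, 450, 1024, 950),
    Workload("frontend", 2000, 600, 4096, 1229),
]
print(flag_overprovisioned(audit))
```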

Calculate the wasted capacity:

  • api-gateway: (1000m – 320m) × 5 replicas = 3.4 vCPU wasted
  • frontend: (2000m – 600m) × 3 replicas = 4.2 vCPU wasted

Total: 7.6 vCPU of unused but reserved capacity across just two services. On a cloud provider charging $0.05 per vCPU hour, that is about $277 per month in wasted compute (7.6 × $0.05 × 730 hours).
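
The waste arithmetic is easy to script so it can be rerun as requests change. The following is a minimal sketch under the assumptions stated above ($0.05 per vCPU hour, 5 and 3 replicas). The monthly figure depends on how many hours are counted per month, so both a 30-day month (720 hours) and an average month (730 hours) are shown. The function names are hypothetical.

```python
def wasted_millicores(request_m: int, p95_m: int, replicas: int) -> int:
    """Reserved-but-unused CPU across all replicas, in millicores."""
    return (request_m - p95_m) * replicas

def monthly_cost(vcpu: float, price_per_vcpu_hour: float, hours_per_month: int) -> float:
    """Cost of holding the given vCPU count for one month."""
    return vcpu * price_per_vcpu_hour * hours_per_month

total_m = (
    wasted_millicores(1000, 320, 5)    # api-gateway
    + wasted_millicores(2000, 600, 3)  # frontend
)
total_vcpu = total_m / 1000

print(f"wasted vCPU: {total_vcpu}")
for hours in (720, 730):
    print(f"{hours} h/month: ${monthly_cost(total_vcpu, 0.05, hours):.2f}")
```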

Repeat this analysis for memory. Over-provisioned memory does not throttle like CPU, but it still consumes node capacity and drives unnecessary node scale-out.

Step 5: Right-Size Resource Requests Based on Usage Patterns

Once over-provisioned workloads are identified, adjust resource requests to align with actual usage while maintaining safety buffers for traffic variability.

Recommended right-sizing formula:

New request = p95 actual usage × 1.2

The 1.2 multiplier provides a 20% safety buffer above p95 usage. For workloads with high variability or unpredictable traffic spikes, use 1.3 or 1.4 instead.
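
The formula can be wrapped in a small helper. The following is a minimal sketch. Rounding the result up to a 50m step is an assumption added here for tidier manifests, not part of the formula itself, and the function name is hypothetical. Integer arithmetic is used to avoid floating-point rounding surprises.

```python
def recommend_request_m(p95_m: int, buffer_pct: int = 20, step_m: int = 50) -> int:
    """Return p95 usage plus a percentage buffer, rounded up to step_m millicores."""
    needed_x100 = p95_m * (100 + buffer_pct)   # requested millicores, scaled by 100
    step_x100 = step_m * 100
    steps = -(-needed_x100 // step_x100)       # ceiling division on integers
    return steps * step_m

print(recommend_request_m(320))                 # 20% buffer, 50m steps
print(recommend_request_m(320, step_m=1))       # no rounding beyond 1m
print(recommend_request_m(350, buffer_pct=40))  # spiky workload
```

With a p95 of 320m, the unrounded result is 384m and the rounded result is 400m.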

Update a deployment’s resource requests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  template:
    spec:
      containers:
      - name: api-gateway
        resources:
          requests:
            cpu: "400m"      # Reduced from 1000m based on 320m p95 usage
            memory: "1Gi"    # Reduced from 2Gi based on 800Mi p95 usage
          limits:
            memory: "1Gi"    # Match the memory request so the pod's full memory allowance is reserved

Apply the updated manifest:

kubectl apply -f api-gateway-deployment.yaml

Monitor the deployment closely after applying changes. Watch for:

  • Increased CPU throttling (check container_cpu_cfs_throttled_seconds_total metric)
  • Pod restarts due to OOMKills (kubectl describe pod shows OOMKilled in last termination reason)
  • Latency increases in APM traces or request duration metrics

If throttling or restarts occur, increase the request incrementally (add 10-20% more capacity) and re-evaluate over the next week.

Avoid CPU limits on latency-sensitive services. CPU limits trigger kernel throttling when exceeded, causing request latency spikes even when the node has spare CPU capacity available. Memory limits behave differently. When a container exceeds its memory limit, the kernel terminates it with an OOMKill. For this reason, set memory requests and limits to the same value so that all the memory a pod is allowed to use is reserved for it at scheduling time. Note that the Guaranteed QoS class requires CPU requests and limits to match as well, so a pod with no CPU limit is classed as Burstable.

Step 6: Implement Continuous Monitoring and Alerting on Resource Waste

Right-sizing is not a one-time task. Application behavior changes with new features, traffic patterns shift, and code changes alter resource consumption. Continuous monitoring catches when a workload drifts back into over-provisioning.

Set up alerts for resource waste patterns:

Alert 1: Low CPU utilization over 7 days

(
  sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[7d])) by (pod)
  /
  sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod)
) < 0.4

This alert fires when a pod uses less than 40% of its requested CPU over a 7-day period. The container!="" matcher drops the pod-level aggregate series so usage is not counted twice.

Alert 2: Low memory utilization over 7 days

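(
  sum(avg_over_time(container_memory_working_set_bytes{namespace="production", container!=""}[7d])) by (pod)
  /
  sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod)
) < 0.5

This alert fires when a pod uses less than 50% of its requested memory over a 7-day window. Both sides are aggregated by pod because the two metrics come from different exporters and carry different label sets, so they would not match otherwise.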

Route these alerts to a Slack channel or ticketing system, not PagerDuty. They indicate cost optimization opportunities, not production incidents.

Create a monthly cost review process:

  1. Run the utilization audit from Steps 2 through 4 across all production namespaces
  2. Identify the top 10 over-provisioned workloads by wasted vCPU or memory
  3. Right-size those workloads following the process in Step 5
  4. Measure cost impact by comparing node count or cloud spend before and after
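
Ranking the worst offenders for the monthly review can be scripted. The following is a minimal sketch, assuming each workload is described by a tuple of name, CPU request, p95 usage (both in millicores), and replica count. The worker-service replica count is invented for illustration, and the function names are hypothetical.

```python
import heapq

def wasted_m(workload) -> int:
    """Reserved-but-unused millicores across replicas (never negative)."""
    _name, request_m, p95_m, replicas = workload
    return max(request_m - p95_m, 0) * replicas

def top_wasters(workloads, n: int = 10):
    """Return the n workloads with the most wasted CPU, largest first."""
    return heapq.nlargest(n, workloads, key=wasted_m)

workloads = [
    ("api-gateway", 1000, 320, 5),
    ("worker-service", 500, 450, 4),   # replica count is illustrative
    ("frontend", 2000, 600, 3),
]
for name, *_ in top_wasters(workloads, n=2):
    print(name)
```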

CubeAPM monitors Kubernetes resource utilization at pod, node, and cluster level without requiring separate Prometheus setup. It automatically flags over-provisioned workloads and surfaces cost waste in the same dashboard where you investigate application performance issues. For teams managing Kubernetes monitoring across multiple clusters, CubeAPM provides unified cost visibility and resource right-sizing recommendations without manual query building.

Step 7: Automate Right-Sizing with Vertical Pod Autoscaler

Manual right-sizing works but does not scale across hundreds of deployments. Vertical Pod Autoscaler (VPA) analyzes historical resource usage and automatically adjusts requests to match real consumption.

VPA supports several update modes. The three most relevant here are:

  • Off: VPA calculates recommendations but does not apply them (safe audit mode)
  • Initial: VPA sets requests only when pods are created, not on running pods
  • Auto: VPA updates requests on running pods by evicting and recreating them with new values

Deploy VPA in your cluster:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

Verify VPA components are running:

kubectl get pods -n kube-system | grep vpa

You should see vpa-admission-controller, vpa-recommender, and vpa-updater.

Create a VPA resource for a deployment in recommendation-only mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"

Apply the VPA configuration:

kubectl apply -f api-gateway-vpa.yaml

Wait 24 hours for VPA to analyze usage patterns, then retrieve recommendations:

kubectl describe vpa api-gateway-vpa

The output shows recommended CPU and memory requests under Status.Recommendation.ContainerRecommendations. Compare these to your current requests. If VPA recommends significantly lower values and your monitoring confirms low utilization, update your deployment with the new requests.
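
Comparing recommendations against current requests is easier from JSON than from describe output. The following is a minimal sketch, assuming the object was fetched with kubectl get vpa api-gateway-vpa -o json and that targets appear under status.recommendation.containerRecommendations. The sample values are invented, only CPU is compared, and the function names are hypothetical.

```python
import json

def parse_cpu_millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity such as "350m" or "1" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)

def cpu_targets(vpa_object: dict) -> dict:
    """Map container name to the recommended CPU target in millicores."""
    recommendation = (vpa_object.get("status") or {}).get("recommendation") or {}
    return {
        rec["containerName"]: parse_cpu_millicores(rec["target"]["cpu"])
        for rec in recommendation.get("containerRecommendations", [])
    }

def cpu_savings(current_requests_m: dict, vpa_object: dict) -> dict:
    """Millicores that would be freed per container by adopting the target."""
    return {
        name: current_requests_m[name] - target
        for name, target in cpu_targets(vpa_object).items()
        if name in current_requests_m
    }

# Invented sample standing in for real kubectl output.
SAMPLE_VPA_JSON = """
{"status": {"recommendation": {"containerRecommendations": [
  {"containerName": "api-gateway",
   "target": {"cpu": "350m", "memory": "900Mi"}}
]}}}
"""
sample_vpa = json.loads(SAMPLE_VPA_JSON)
print(cpu_savings({"api-gateway": 1000}, sample_vpa))
```

A negative value in the result would mean VPA recommends more CPU than is currently requested, which is worth checking against throttling and latency data before acting.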

VPA in Auto mode restarts pods to apply new resource values. This works well for stateless services but should be tested carefully on stateful workloads. For production clusters, start with Off mode, manually review recommendations, and apply changes during maintenance windows.

Step 8: Track Cost Impact and Iterate

Right-sizing only matters if it reduces actual infrastructure cost. Measure the financial impact of your optimization efforts to justify continued investment in cost monitoring.

Calculate cost savings by comparing total node capacity before and after right-sizing:

Before optimization:

  • 20 nodes × 8 vCPU per node = 160 vCPU total
  • 60% average node CPU utilization = 96 vCPU actually used
  • 64 vCPU wasted capacity

After right-sizing top 10 over-provisioned workloads:

  • Freed 15 vCPU of wasted requests
  • Reduced node count to 18 nodes (144 vCPU total)
  • 66% average node utilization = 95 vCPU actually used
  • 49 vCPU wasted capacity (improvement)

Cost impact on AWS using m5.2xlarge nodes ($0.384/hour):

  • Removed 2 nodes = 2 × $0.384/hour × 730 hours/month = $561/month saved
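
The before-and-after comparison can be recomputed each month with a few lines. The following is a minimal sketch using the figures above (8 vCPU nodes at $0.384 per hour, 730 hours per month). The function names are hypothetical.

```python
def wasted_capacity(nodes: int, vcpu_per_node: int, utilization_pct: float) -> float:
    """vCPU provisioned but not actually used at the given average utilization."""
    total = nodes * vcpu_per_node
    used = total * utilization_pct / 100
    return total - used

def monthly_node_savings(nodes_removed: int, hourly_price: float,
                         hours_per_month: int = 730) -> float:
    """Monthly spend avoided by removing nodes at a fixed hourly price."""
    return nodes_removed * hourly_price * hours_per_month

before = wasted_capacity(nodes=20, vcpu_per_node=8, utilization_pct=60)
after = wasted_capacity(nodes=18, vcpu_per_node=8, utilization_pct=66)
print(f"wasted before: {before:.0f} vCPU, after: {after:.0f} vCPU")
print(f"saved per month: ${monthly_node_savings(2, 0.384):.2f}")
```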

Track these metrics over time:

  • Total cluster CPU requested vs. total CPU used (utilization gap)
  • Number of nodes required to support current requests
  • Cost per vCPU of actual application usage (total cost ÷ actual vCPU used)
  • Percentage of workloads with utilization ratio below 0.5

Iterate monthly. As applications change, new inefficiencies appear. The teams that eliminate Kubernetes waste long-term treat cost monitoring as a continuous practice, not a one-time audit.

Troubleshooting Common Issues

Pods stuck in Pending after reducing requests

When you lower resource requests, pods may fail to schedule if the new values are still too high for available node capacity or if DaemonSets and system pods consume more capacity than expected.

Check why pods are pending:

kubectl describe pod <pod-name>

Look for FailedScheduling events. If the message says Insufficient cpu or Insufficient memory, no node has enough unreserved capacity. Either increase the request slightly or add another node.
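
The resource check behind an Insufficient cpu or Insufficient memory message compares the pod's requests with each node's allocatable capacity minus the requests already placed on it. The following is a simplified sketch of that comparison only. It ignores taints, affinity rules, and other scheduling constraints, and the node figures and function names are hypothetical.

```python
def free_after_requests(node: dict) -> tuple:
    """Unreserved CPU (millicores) and memory (Mi) left on a node."""
    return (
        node["allocatable_cpu_m"] - node["requested_cpu_m"],
        node["allocatable_memory_mi"] - node["requested_memory_mi"],
    )

def nodes_that_fit(nodes: dict, pod_cpu_m: int, pod_memory_mi: int) -> list:
    """Names of nodes whose unreserved capacity covers the pod's requests."""
    fitting = []
    for name, node in nodes.items():
        free_cpu_m, free_memory_mi = free_after_requests(node)
        if pod_cpu_m <= free_cpu_m and pod_memory_mi <= free_memory_mi:
            fitting.append(name)
    return fitting

# Illustrative figures; real values come from kubectl describe node.
cluster = {
    "node-a": {"allocatable_cpu_m": 4000, "requested_cpu_m": 3800,
               "allocatable_memory_mi": 16000, "requested_memory_mi": 8000},
    "node-b": {"allocatable_cpu_m": 4000, "requested_cpu_m": 3000,
               "allocatable_memory_mi": 16000, "requested_memory_mi": 15500},
}
print(nodes_that_fit(cluster, pod_cpu_m=400, pod_memory_mi=256))
print(nodes_that_fit(cluster, pod_cpu_m=400, pod_memory_mi=1024))
```

In the second call no node fits: node-a lacks CPU and node-b lacks memory, which is the situation a FailedScheduling event reports.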

CPU throttling after right-sizing

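If application latency increases after lowering CPU requests, compare current usage against the pod's request and limit:

kubectl top pod <pod-name>

If actual CPU usage sits at the CPU limit, the container is being throttled by the kernel, and the container_cpu_cfs_throttled_seconds_total metric will confirm it. If usage is at or above the request on a busy node, the pod can also lose CPU time to neighboring pods. In either case, increase the CPU request (and the limit, if one is set) by 20% and re-test. For services where latency matters more than cost, avoid setting CPU limits entirely. Set a realistic request and let the pod burst to node capacity during load spikes.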

Memory OOMKills after reducing requests

If a pod terminates with OOMKilled status after you reduce its memory request and limit, the new limit was too low for peak usage.

Check termination reason:

kubectl describe pod <pod-name> | grep -A 5 "Last State"

If you see OOMKilled, review memory usage over the past 7 days and set the request and limit to p99 memory usage (not p95) plus a 20% buffer. Applications with memory leaks or unbounded memory growth need code fixes, not higher requests.

VPA recommendations seem too low

VPA uses historical usage data to calculate recommendations. If your application recently had a traffic spike or load test, VPA may not have enough long-term data. Let VPA observe for at least 7 days before acting on recommendations. You can also bound VPA's recommendations with a resource policy:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi

This prevents VPA from recommending requests below 100m CPU or above 2000m CPU, and keeps memory recommendations between 128Mi and 4Gi.
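
The effect of these bounds can be pictured as a clamp on the raw recommendation. The following is an illustrative sketch of that idea for CPU, not VPA's actual implementation. The function name is hypothetical and the defaults mirror the policy above.

```python
def clamp_recommendation(raw_m: int, min_allowed_m: int = 100,
                         max_allowed_m: int = 2000) -> int:
    """Keep a raw CPU recommendation (millicores) within the allowed range."""
    return max(min_allowed_m, min(raw_m, max_allowed_m))

for raw in (40, 350, 3500):
    print(raw, "->", clamp_recommendation(raw))
```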

High node CPU but pods showing low utilization

If kubectl top nodes shows 80% CPU usage but kubectl top pods shows most pods using far less than their requests, system overhead or untracked workloads are consuming capacity. Check for:

  • DaemonSets with no resource requests set (they consume capacity invisibly)
  • High system-reserved CPU (compare Capacity against Allocatable in the kubectl describe node output)
  • Non-pod workloads like Docker builds or host-level processes not tracked by Kubernetes

Add resource requests to DaemonSets and non-application pods to account for their real usage in scheduling decisions.

Kubernetes cost monitoring relies on accurate tracking of resource requests, limits, and actual usage over time. This tutorial covered deploying Metrics Server, auditing current resource settings, measuring real utilization, right-sizing workloads, and implementing continuous monitoring to prevent waste from returning. Teams that follow this process typically reduce Kubernetes compute costs by 30-50% without impacting application stability or performance.

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.

Frequently Asked Questions

What is the difference between resource requests and limits in Kubernetes?

Requests reserve capacity for scheduling. Limits cap maximum usage at runtime. Together they determine the pod's Quality of Service class. Requests affect where Kubernetes places your pod. Limits affect how the kernel restricts resource consumption after the pod is running.

How do I calculate the right CPU request for a pod?

Set CPU requests to p95 actual usage multiplied by 1.2 for a safety buffer. Measure p95 usage over at least 7 days to account for traffic variability and load spikes. Use Prometheus or another tool that retains history for this, since Metrics Server only reports current usage.

Should I set CPU limits on my Kubernetes pods?

Avoid CPU limits on latency-sensitive services because they cause throttling even when the node has spare capacity. For batch jobs or non-critical workloads, limits prevent runaway processes from affecting other pods on shared nodes.

What happens if I set memory requests too low?

If memory usage exceeds the request but stays under the limit, the pod continues running but risks eviction under node memory pressure. If usage exceeds the limit, Kubernetes terminates the pod with an OOMKill.

How does Vertical Pod Autoscaler differ from Horizontal Pod Autoscaler?

VPA adjusts resource requests and limits per pod by changing the pod specification. HPA adjusts the number of pod replicas. Use VPA to right-size individual pods. Use HPA to scale replicas based on load.

Can I use resource quotas to control Kubernetes costs?

Resource quotas limit total requests and limits per namespace. They prevent over-provisioning at the namespace level but do not optimize individual pod sizing. Combine quotas with continuous utilization monitoring for effective cost control.

What is the recommended memory limit for a Kubernetes pod?

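Set memory request and limit to the same value whenever possible so that all the memory a pod is allowed to use is reserved for it, which lowers the risk of eviction during node memory pressure. The Guaranteed QoS class additionally requires CPU requests and limits to match. Size both to p95 memory usage plus a 20% buffer based on real measurement, and move to p99 if the workload has been OOMKilled.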
