CubeAPM

Google Cloud Run Monitoring: Cold Starts, Concurrency, and Cost Per Request

Google Cloud Run abstracts infrastructure entirely. You deploy a container, set concurrency limits, and Cloud Run auto scales instances up or down based on incoming requests. But that convenience comes with a tradeoff: you lose direct visibility into why a request was slow, why your bill jumped, or whether your concurrency settings are actually helping or hurting performance.

This guide covers how Cloud Run billing works, what causes cold starts, how concurrency settings affect cost and latency, and which metrics to monitor to keep your serverless workloads fast and predictable. Every pricing figure is linked to Google’s official rate cards and reflects Cloud Run’s current model as of early 2026.

What Is Google Cloud Run and Why Monitoring Matters

Google Cloud Run is a fully managed serverless platform built on Knative that runs stateless containers in response to HTTP requests or events. You package your application as a container image, push it to Artifact Registry, and Cloud Run handles routing, auto scaling, and instance lifecycle without you managing any nodes, clusters, or VMs.

Cloud Run services handle HTTP traffic and scale dynamically based on request volume. Cloud Run jobs execute batch tasks or scheduled workloads and scale based on parallelism settings rather than incoming requests. Both are billed on the CPU and memory they consume, but services expose an HTTPS endpoint and auto scale with traffic while jobs do not.

Because Cloud Run abstracts the compute layer, traditional infrastructure monitoring does not apply. You cannot SSH into an instance, check system logs, or tune kernel parameters. Instead, monitoring focuses on request level metrics like latency, error rate, instance count, CPU and memory utilization per container, and cold start frequency.

Without monitoring, three problems emerge fast:

Unpredictable latency — cold starts add 1 to 10 seconds to request latency depending on container size and language runtime. If you do not track cold start rate, you will not know whether users are experiencing this delay regularly or only during scale from zero.

Runaway costs — Cloud Run bills based on vCPU time, memory allocation, request count, and network egress. A misconfigured concurrency setting or memory limit can triple your bill without improving performance. Monitoring shows you which dimensions are driving cost so you can optimize before the month closes.

Silent failures — if a container crashes, runs out of memory, or hits CPU throttle, Cloud Run may retry or scale up more instances. Without visibility into error rate, retry count, and throttle events, you see only the symptom (slow responses or higher bill) but not the root cause.

Monitoring Cloud Run means tracking metrics across four layers: request metrics (latency, error rate, throughput), instance metrics (count, startup time, utilization), resource metrics (CPU, memory, concurrency), and cost metrics (billable time, request charges, egress).

How Google Cloud Run Pricing Works

Cloud Run charges based on four dimensions: CPU allocation, memory allocation, request count, and network egress. Each dimension bills independently, and the total cost depends on how long your container runs, how much memory it allocates, how many requests it serves, and how much outbound traffic it generates.

CPU and Memory Allocation

Cloud Run bills CPU and memory in 100 millisecond increments while your container is processing a request or running a background task. You configure CPU in vCPUs (0.08 to 8 vCPU per instance) and memory in MiB or GiB (128 MiB to 32 GiB per instance). Higher allocation costs more per second but may reduce total billable time if your workload finishes faster.

CPU allocation mode determines when your container has CPU access. Request based billing (the default) gives your container CPU only while processing a request. Between requests, CPU is disabled or severely throttled. This mode costs less but prevents background activity like cache warming or async cleanup from running. Instance based billing gives your container always on CPU access, which costs more per second but allows background tasks to run and can reduce cold start latency by keeping the container warm.

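Memory is billed on the amount you allocate, not the amount you use, for every billable second of instance time. Under request based billing that means whenever at least one request is in flight; under instance based billing, or for instances kept warm by a min instances setting, it also covers idle time between requests. If you allocate 1 GiB but your container uses only 200 MiB, you still pay for the full 1 GiB. This makes right sizing critical. Over allocating memory wastes money. Under allocating triggers OOMKills and forces Cloud Run to restart your container, adding latency and retry overhead.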

As of early 2026, Cloud Run pricing charges approximately:

  • CPU: ~$0.00002400 per vCPU-second (request based) or ~$0.00004800 per vCPU-second (instance based)
  • Memory: ~$0.00000250 per GiB-second
  • Requests: $0.40 per million requests

Request Pricing and Network Egress

Cloud Run charges $0.40 per million requests. A request is any HTTP call that reaches your service, including health checks, retries, and requests that return errors. If your service receives 10 million requests per month, the request charge alone is $4.

Network egress — outbound traffic leaving Google’s network — costs ~$0.12 per GiB for the first 1 TiB and scales down slightly at higher volumes. If your service returns large JSON payloads, serves images, or calls external APIs, egress adds up fast. A service returning 500 KiB per response serving 1 million requests per month generates ~500 GiB egress, costing ~$60 per month in bandwidth alone.

Egress within the same Google Cloud region (e.g., Cloud Run calling Cloud SQL in us-central1) is free. Egress to other Google Cloud regions costs ~$0.01 per GiB. Egress to the public internet costs the full rate. If you send logs or traces to an external SaaS observability platform, egress fees apply to every byte leaving your VPC.
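To make the egress tiers concrete, here is a minimal Python sketch that prices a month of outbound traffic by destination. The rates are the approximate figures quoted above, the function and dictionary names are illustrative rather than part of any Google API, and the sketch ignores the volume discounts that apply beyond the first 1 TiB.

```python
# Approximate per-GiB egress rates quoted in this article.
# These are assumptions for illustration, not an official rate card.
EGRESS_RATE_PER_GIB = {
    "same_region": 0.00,   # e.g. Cloud Run calling Cloud SQL in the same region
    "cross_region": 0.01,  # traffic to other Google Cloud regions
    "internet": 0.12,      # public internet, first 1 TiB tier
}

KIB_PER_GIB = 1024 * 1024


def monthly_egress_gib(avg_response_kib: float, requests_per_month: int) -> float:
    """Convert an average response size and request volume into GiB per month."""
    return avg_response_kib * requests_per_month / KIB_PER_GIB


def monthly_egress_cost(avg_response_kib: float, requests_per_month: int,
                        destination: str) -> float:
    """Estimate the monthly egress charge in USD for one destination type."""
    gib = monthly_egress_gib(avg_response_kib, requests_per_month)
    return gib * EGRESS_RATE_PER_GIB[destination]


if __name__ == "__main__":
    print(f"{monthly_egress_gib(500, 1_000_000):.0f} GiB per month")
    for destination in EGRESS_RATE_PER_GIB:
        cost = monthly_egress_cost(500, 1_000_000, destination)
        print(f"{destination}: ${cost:.2f}")
```

With exact binary units, the 500 KiB example works out to about 477 GiB and roughly $57 to the public internet, which the text rounds to ~500 GiB and ~$60.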

Cost Scenario: Moderate Traffic Service

A Cloud Run service handling 5 million requests per month with the following profile:

  • 0.5 vCPU, 1 GiB memory per instance
  • Average request duration: 200 ms
  • Concurrency: 80 requests per instance
  • Request based CPU billing
  • Average response size: 100 KiB (500 GiB egress per month)

Monthly cost breakdown:

  • CPU: 5M requests × 0.2 sec × 0.5 vCPU × $0.000024 = $12
  • Memory: 5M requests × 0.2 sec × 1 GiB × $0.0000025 = $2.50
  • Requests: 5M × $0.0000004 = $2
  • Egress: 500 GiB × $0.12 = $60

Total: ~$76.50 per month
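The breakdown above can be reproduced with a short Python sketch. It uses the approximate rates quoted in this article, and the `ServiceProfile` and `monthly_cost` names are hypothetical. Like the arithmetic above, it bills every request for its full duration, so it ignores the fact that concurrent requests on one instance share billable time; treat the result as a rough estimate rather than a billing calculator.

```python
from dataclasses import dataclass

# Approximate rates quoted in this article (assumptions; verify against the official rate card).
CPU_PER_VCPU_SECOND = 0.000024     # request based billing
MEMORY_PER_GIB_SECOND = 0.0000025
PER_REQUEST = 0.40 / 1_000_000
EGRESS_PER_GIB = 0.12


@dataclass
class ServiceProfile:
    requests_per_month: int
    avg_duration_s: float
    vcpu: float
    memory_gib: float
    egress_gib: float


def monthly_cost(profile: ServiceProfile) -> dict[str, float]:
    """Return the monthly cost per billing dimension plus a total, in USD."""
    billable_seconds = profile.requests_per_month * profile.avg_duration_s
    cost = {
        "cpu": billable_seconds * profile.vcpu * CPU_PER_VCPU_SECOND,
        "memory": billable_seconds * profile.memory_gib * MEMORY_PER_GIB_SECOND,
        "requests": profile.requests_per_month * PER_REQUEST,
        "egress": profile.egress_gib * EGRESS_PER_GIB,
    }
    cost["total"] = sum(cost.values())
    return cost


if __name__ == "__main__":
    scenario = ServiceProfile(5_000_000, 0.2, 0.5, 1.0, 500)
    for dimension, usd in monthly_cost(scenario).items():
        print(f"{dimension}: ${usd:.2f}")
```

Keeping each dimension as a separate key makes it easy to see that egress, not compute, dominates this particular scenario.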

If the same service switches to instance based CPU billing to enable background cache warming, CPU cost doubles to $24, raising total cost to ~$88.50. If concurrency drops from 80 to 20, Cloud Run spins up roughly 4× more instances to handle the same load, quadrupling CPU and memory charges from $14.50 to ~$58 per month before egress.

Monitoring these dimensions separately lets you see which setting change caused the bill spike.

Cold Starts: What They Are and How to Reduce Them

A cold start happens when Cloud Run receives a request but no container instance is running or available to handle it. Cloud Run must download the container image, start the container, run the entrypoint command, and wait for the application to listen on the configured port before routing the request. This entire sequence adds latency — typically 1 to 10 seconds depending on image size, language runtime, and region.

Cold starts occur in three scenarios:

  1. Scale from zero — your service has no running instances (min instances = 0) and receives the first request after idle time
  2. Scale up — traffic spikes and existing instances hit max concurrency, forcing Cloud Run to start new instances
  3. Instance replacement — Cloud Run terminates an idle instance after 15 minutes and must restart it when traffic returns

Why Cold Starts Matter

Cold start latency affects the first request to each new instance. If your service scales from 1 instance to 10 during a traffic spike, at least 9 requests experience cold start delay. If your service scales back to zero between traffic bursts, every new burst triggers a cold start.

For user facing services, a 5 second cold start can cause timeouts, abandoned sessions, or bad user experience. For background jobs, cold start delay may not matter if the job runs for minutes or hours. For APIs serving mobile apps or webhooks, cold start latency shows up as tail latency in P99 metrics and can violate SLAs.

Cold start frequency is not visible in Cloud Run’s default metrics. You must monitor instance startup time and correlate it with latency spikes to detect cold start impact.

How to Reduce Cold Start Frequency

Set minimum instances above zero — configuring min instances = 1 or higher keeps at least one container always warm. This eliminates scale from zero cold starts entirely. The tradeoff: you pay for CPU and memory continuously, even during idle time. For a service with 0.5 vCPU and 1 GiB memory, keeping 1 instance always on with instance based billing costs ~$30 per month in idle charges.

Min instances = 3 ensures that during moderate traffic, Cloud Run does not need to start new instances until load exceeds 3 × max concurrency. This reduces scale up cold starts but triples idle cost.

Enable startup CPU boost — Cloud Run’s startup CPU boost temporarily increases CPU allocation during instance startup to speed up container initialization. This can cut cold start latency by 30% to 50% for CPU heavy startup tasks like JVM warmup or dependency loading. Startup CPU boost costs slightly more per cold start but does not affect steady state billing.

Reduce container image size — smaller images download faster. Multi stage Docker builds, distroless base images, and removing unnecessary dependencies can cut image size from 1 GiB to 100 MiB, reducing cold start time by several seconds.

Optimize application startup — language runtimes vary widely in startup speed. Node.js and Go start in under 1 second. Python with large frameworks like Django or TensorFlow can take 3 to 5 seconds. Java and .NET can take 5 to 10 seconds. Lazy loading dependencies, precompiling assets, and avoiding heavy initialization logic in the entrypoint reduce startup time.

Use a keep alive mechanism — some teams schedule a Cloud Scheduler job to ping their service every 5 to 10 minutes, keeping at least one instance warm even if real traffic is sparse. This works but adds request charges and does not prevent scale up cold starts during traffic spikes.

Cold Start Monitoring Strategy

Track instance startup latency, either by logging the time between process start and the first request received or by using the metric Cloud Run exposes in Cloud Monitoring, run.googleapis.com/container/startup_latencies. Plot it alongside request latency. If P99 latency spikes correlate with startup latency events, cold starts are affecting user experience.
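For the logging approach, a minimal Python sketch might look like the following. The `ColdStartTracker` class and its field names are hypothetical, and the measurement starts when the process starts, so it does not include the time Cloud Run spends pulling the image or scheduling the instance.

```python
import threading
import time


class ColdStartTracker:
    """Flags the first request this process serves and how long it took to arrive."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._process_start = clock()      # create the tracker as early as possible
        self._seen_first_request = False
        self._lock = threading.Lock()      # instances may serve requests concurrently

    def on_request(self) -> dict:
        """Call at the start of every request; returns fields to attach to the log entry."""
        now = self._clock()
        with self._lock:
            is_first = not self._seen_first_request
            self._seen_first_request = True
        return {
            "cold_start": is_first,
            "seconds_since_process_start": round(now - self._process_start, 3),
        }


if __name__ == "__main__":
    tracker = ColdStartTracker()
    print(tracker.on_request())  # first call reports cold_start=True
    print(tracker.on_request())  # later calls report cold_start=False
```

Creating the tracker at module import and logging its output as structured fields lets you filter request logs by `cold_start` and compare latency for the two groups.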

Track instance count over time. If instance count frequently drops to zero and spikes back up, you are experiencing scale from zero cold starts regularly. If instance count scales smoothly without dropping to zero, cold starts only happen during scale up.

Track request latency histogram split by new vs. existing instances. If new instance requests consistently show 2× to 10× higher latency, cold starts are a bottleneck.
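As a rough illustration of that comparison, the sketch below computes the ratio of median latency for requests served by new instances to median latency for requests served by warm ones. The input format (a list of latency and cold flag pairs) and the function name are assumptions; in practice the samples would come from your own request logs.

```python
from statistics import median
from typing import Optional


def cold_start_penalty(samples: list[tuple[float, bool]]) -> Optional[float]:
    """Ratio of median cold latency to median warm latency.

    Each sample is (latency_ms, served_by_new_instance). Returns None when
    either group is empty, because no comparison is possible.
    """
    cold = [ms for ms, is_cold in samples if is_cold]
    warm = [ms for ms, is_cold in samples if not is_cold]
    if not cold or not warm:
        return None
    return median(cold) / median(warm)


if __name__ == "__main__":
    samples = [(100, False), (120, False), (110, False), (900, True), (1100, True)]
    print(cold_start_penalty(samples))
```

A ratio in the 2× to 10× range mentioned above suggests cold starts deserve attention; a ratio close to 1 suggests the latency problem lies elsewhere.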

Concurrency Settings: How They Affect Cost and Performance

Concurrency determines how many requests a single Cloud Run instance can handle simultaneously. By default, Cloud Run sets max concurrency to 80 requests per instance. You can increase this to 1,000 or decrease it to 1.

How Concurrency Affects Instance Count

If your service receives 400 requests per second and max concurrency is 80, Cloud Run spins up at least 5 instances (400 / 80 = 5). If you increase concurrency to 200, Cloud Run needs only 2 instances (400 / 200 = 2). Fewer instances means lower CPU and memory billing because you pay per instance second, not per request second.
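The division above appears to treat 400 requests per second as 400 requests in flight at once, which holds when each request takes about one second. A small sketch that makes request duration explicit, using Little's law (in flight requests = arrival rate × average duration), is shown below. The function name is hypothetical, and the real autoscaler also reacts to CPU utilization and keeps headroom, so this is a lower bound rather than a prediction.

```python
import math


def instances_needed(requests_per_second: float, avg_duration_s: float,
                     max_concurrency: int) -> int:
    """Minimum instances required to hold the steady state in flight requests."""
    in_flight = requests_per_second * avg_duration_s
    return math.ceil(in_flight / max_concurrency)


if __name__ == "__main__":
    for concurrency in (80, 200):
        print(concurrency, instances_needed(400, 1.0, concurrency))
```

With one second requests this reproduces the 5 and 2 instance figures above; with shorter requests the same traffic needs fewer instances, which is why duration belongs in any capacity estimate.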

But if your application is CPU bound or memory intensive, higher concurrency can slow down every request. Concurrency = 1 gives each request exclusive access to the instance’s CPU and memory. Concurrency = 200 means 200 requests compete for the same resources, increasing contention and latency.

Concurrency and CPU Throttling

Concurrent requests on an instance share its CPU. If your instance has 1 vCPU and 80 requests are in flight, each request gets ~1/80th of a vCPU on average. If all 80 requests are CPU heavy, every request runs slower due to contention.

Switching to instance based CPU billing gives the container always on CPU access, but does not increase CPU allocation per request. If your workload needs dedicated CPU per request, lower concurrency instead of switching billing modes.

When to Lower Concurrency

CPU bound workloads — if your application does heavy computation, image processing, or encryption, high concurrency causes CPU contention. Concurrency = 10 or lower ensures each request gets enough CPU to finish fast.

Memory intensive workloads — if each request allocates large objects or buffers, high concurrency exhausts memory and triggers OOMKills. Lowering concurrency reduces memory pressure per instance.

Stateful or non thread safe code — if your application relies on global state or is not designed for concurrency, concurrency = 1 is safest. This forces Cloud Run to route each request to a dedicated instance, avoiding race conditions.

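Database connection limits: if your service opens one database connection per request and your database allows only 50 concurrent connections, the total across all instances (concurrency × max instances) must stay at or below 50 to avoid connection exhaustion. With a single instance that means concurrency of 50 or less; with 5 instances it means 10 or less.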
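For the database case, the ceiling can be worked out from the connection limit and the maximum instance count. The sketch below assumes one connection per in flight request, as in the example above; the function name and the `reserved` parameter (connections held back for admin sessions or migrations) are illustrative.

```python
def max_safe_concurrency(db_connection_limit: int, max_instances: int,
                         reserved: int = 0) -> int:
    """Largest per instance concurrency that keeps total connections within the limit."""
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    usable = db_connection_limit - reserved
    # Every instance can open up to `concurrency` connections at the same time.
    return max(0, usable // max_instances)


if __name__ == "__main__":
    print(max_safe_concurrency(50, 1))
    print(max_safe_concurrency(50, 5))
    print(max_safe_concurrency(50, 4, reserved=10))
```

A result of 0 means the database limit cannot support that many instances at all, which points to a connection pooler or a lower max instances setting rather than a concurrency change.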

When to Increase Concurrency

I/O bound workloads — if requests spend most time waiting for external APIs, database queries, or file I/O, higher concurrency improves throughput without increasing CPU load. Concurrency = 200 or higher works well for lightweight proxy services or API gateways.

Cost optimization — higher concurrency reduces instance count, which reduces CPU and memory billing. If your application can handle it without latency degradation, increasing concurrency from 80 to 200 can cut compute costs by 50% or more.

Low memory footprint — if each request uses minimal memory (e.g., returning small JSON responses), high concurrency does not risk OOMKills.

Cost Impact of Concurrency Settings

A service handling 5 million requests per month with 200 ms average latency:

  • Concurrency = 20: Requires ~4× more instances than concurrency = 80. CPU cost: ~$48/month.
  • Concurrency = 80 (default): Baseline. CPU cost: ~$12/month.
  • Concurrency = 200: Requires ~60% fewer instances (80 / 200 = 0.4× the baseline). CPU cost: ~$5/month.

If your application can handle concurrency = 200 without latency degradation, you save ~$7/month per 5M requests. At 50M requests per month, that is ~$70/month saved just by tuning concurrency.
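These comparisons rest on a simple model in which instance count, and therefore CPU cost, is inversely proportional to max concurrency. The sketch below encodes that model; the function name is hypothetical, and real savings are usually smaller because instances are rarely filled to their concurrency limit.

```python
def scaled_cpu_cost(baseline_cost: float, baseline_concurrency: int,
                    new_concurrency: int) -> float:
    """Scale a CPU cost assuming instance count is inversely proportional to concurrency."""
    if new_concurrency < 1:
        raise ValueError("new_concurrency must be at least 1")
    return baseline_cost * baseline_concurrency / new_concurrency


if __name__ == "__main__":
    for concurrency in (20, 80, 200):
        print(concurrency, round(scaled_cpu_cost(12.0, 80, concurrency), 2))
```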

But if increasing concurrency from 80 to 200 doubles request latency due to CPU contention, you trade cost savings for bad user experience and potential SLA violations.
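One way to decide where that tradeoff tips is to compare P99 latency across load test runs at different concurrency settings. The sketch below picks the highest tested setting whose P99 stays within a tolerance of the lowest setting's P99. The function name, the input shape, and the 20% default tolerance are assumptions chosen for illustration.

```python
def highest_safe_concurrency(p99_by_concurrency: dict[int, float],
                             tolerance: float = 1.2) -> int:
    """Highest tested concurrency whose P99 latency stays within tolerance × baseline.

    The baseline is the P99 measured at the lowest tested concurrency.
    """
    if not p99_by_concurrency:
        raise ValueError("at least one load test result is required")
    levels = sorted(p99_by_concurrency)
    baseline = p99_by_concurrency[levels[0]]
    safe = levels[0]
    for level in levels[1:]:
        if p99_by_concurrency[level] > baseline * tolerance:
            break  # latency degraded; stop at the previous level
        safe = level
    return safe


if __name__ == "__main__":
    results_ms = {80: 200.0, 120: 210.0, 160: 230.0, 200: 410.0}
    print(highest_safe_concurrency(results_ms))
```

In this made up data set, latency roughly doubles at 200, so the sketch stops at 160.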

Key Metrics to Monitor for Google Cloud Run

Cloud Run exposes metrics through Google Cloud Monitoring (formerly Stackdriver). These metrics fall into four categories: request metrics, instance metrics, resource utilization, and billing metrics.

Request Metrics

Request count (run.googleapis.com/request_count): total requests received, split by response code class (2xx, 4xx, 5xx). Track this to detect traffic spikes, error rate changes, or unexpected load.

Request latency (run.googleapis.com/request_latencies) — time from request received to response sent, measured in milliseconds. Cloud Monitoring surfaces P50, P95, and P99 latencies. If P99 latency exceeds SLA thresholds, investigate cold starts, CPU throttling, or slow external dependencies.

Error rate — percentage of requests returning 5xx errors. A sudden spike indicates application crashes, OOMKills, or downstream service failures.
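Error rate is derived from request counts rather than reported directly. A minimal sketch of the calculation follows; the dictionary keys mirror response code classes, and the function names and the 1% default threshold are illustrative choices, not Cloud Monitoring defaults.

```python
def error_rate(counts_by_class: dict[str, int]) -> float:
    """Fraction of requests that returned a 5xx response."""
    total = sum(counts_by_class.values())
    if total == 0:
        return 0.0  # no traffic means nothing to alert on
    return counts_by_class.get("5xx", 0) / total


def should_alert(counts_by_class: dict[str, int], threshold: float = 0.01) -> bool:
    """True when the 5xx share of traffic is strictly above the threshold."""
    return error_rate(counts_by_class) > threshold


if __name__ == "__main__":
    window = {"2xx": 9_700, "4xx": 100, "5xx": 200}
    print(f"{error_rate(window):.2%}", should_alert(window))
```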

Instance Metrics

Instance count (run.googleapis.com/container/instance_count) — number of container instances running. If this number frequently drops to zero, you are scaling from zero regularly and experiencing cold starts. If it scales up rapidly during traffic spikes, check whether concurrency limits are set too low.

Startup latency (run.googleapis.com/container/startup_latencies): time taken to start a new container instance, in milliseconds. If this exceeds 5 seconds consistently, optimize your container image or enable startup CPU boost.

Max concurrent requests per instance — if instances consistently hit max concurrency, Cloud Run will scale up. If they never approach max concurrency, you may be over provisioned.

Resource Utilization Metrics

CPU utilization (run.googleapis.com/container/cpu/utilizations) — percentage of allocated CPU used per instance. If this stays below 30%, you allocated too much CPU. If it hits 100% regularly, you need more CPU per instance or lower concurrency.

Memory utilization (run.googleapis.com/container/memory/utilizations) — percentage of allocated memory used. If this exceeds 90%, you risk OOMKills. If it stays below 50%, you are over allocating memory and paying for unused capacity.

Throttled time — if using request based CPU billing, Cloud Run may throttle CPU between requests. High throttle time indicates background tasks are trying to run without CPU access. Switch to instance based billing or refactor to avoid background work.
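The CPU and memory rules of thumb above can be expressed as a small helper. This is a sketch that encodes only the thresholds stated in this section; the function name and message strings are hypothetical, and the inputs are utilization fractions between 0.0 and 1.0.

```python
def sizing_hints(cpu_utilization: float, memory_utilization: float) -> list[str]:
    """Map utilization fractions to the right sizing rules of thumb."""
    hints = []
    if cpu_utilization < 0.30:
        hints.append("cpu: likely over allocated, consider fewer vCPUs")
    elif cpu_utilization >= 1.0:
        hints.append("cpu: saturated, add CPU or lower concurrency")
    if memory_utilization > 0.90:
        hints.append("memory: OOM risk, increase allocation")
    elif memory_utilization < 0.50:
        hints.append("memory: likely over allocated, consider less memory")
    return hints


if __name__ == "__main__":
    print(sizing_hints(0.20, 0.95))
    print(sizing_hints(0.60, 0.70))
```

An empty list means both values sit in the comfortable middle band and no change is suggested.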

Billing Metrics

Billable instance time — total seconds billed for CPU and memory. Compare this to request count and latency. If billable time increases but request count stays flat, you may have idle instances due to high min instances or low concurrency.

Billable request count — total requests billed at $0.40 per million. If this includes health checks or retries you did not expect, investigate unnecessary traffic sources.

Network egress — outbound traffic in GiB. If this spikes, check response payload size or external API call volume.

How to Access Cloud Run Metrics

Cloud Run metrics flow into Google Cloud Monitoring automatically. You can query them via the Metrics Explorer, create custom dashboards, or export them to external observability platforms using the Cloud Monitoring API or OpenTelemetry Collector.
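When querying through the Cloud Monitoring API, each time series request takes a filter string that names the metric and the monitored resource. The sketch below only builds that string; the helper name and the `checkout-api` service name are hypothetical, and the actual API call, authentication, and time range are left out.

```python
def cloud_run_metric_filter(metric_type: str, service_name: str) -> str:
    """Build a Cloud Monitoring filter for one metric on one Cloud Run service."""
    return (
        f'metric.type = "{metric_type}" '
        'AND resource.type = "cloud_run_revision" '
        f'AND resource.labels.service_name = "{service_name}"'
    )


if __name__ == "__main__":
    # "checkout-api" is a placeholder service name.
    print(cloud_run_metric_filter("run.googleapis.com/request_latencies", "checkout-api"))
```

Building filters in one place keeps metric names consistent across dashboards, alerts, and export scripts.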

For teams that need deeper visibility or want to correlate Cloud Run metrics with logs, traces, and infrastructure data from other services, infrastructure monitoring tools help unify telemetry across your stack without building custom integrations for every Google Cloud service.

Best Practices for Monitoring Google Cloud Run

Set latency and error rate alerts — configure alerts when P99 latency exceeds your SLA or when error rate crosses 1%. Cloud Run’s auto scaling can mask problems by spinning up more instances, but if errors are caused by bad code or exhausted resources, scaling up makes the problem worse.

Monitor cold start frequency and impact — track startup latency and correlate it with request latency spikes. If cold starts happen more than once per hour during business hours, consider setting min instances above zero.

Right size CPU and memory allocation — start with Cloud Run’s default (1 vCPU, 512 MiB) and adjust based on utilization metrics. If CPU stays below 30%, drop to 0.5 vCPU. If memory utilization exceeds 80%, increase memory allocation before OOMKills occur.

Test concurrency settings under load — increase max concurrency in staging and run load tests. Measure latency, error rate, and CPU utilization. If latency stays flat as concurrency increases, you can handle higher concurrency. If latency doubles, roll back.

Enable request tracing — integrate Cloud Run with Cloud Trace to capture distributed traces showing how long each request spends in your code vs. external dependencies. This helps isolate whether slow requests are due to your application logic, database queries, or third party API calls.

Track cost per request — divide total monthly Cloud Run cost by total request count. If cost per request increases month over month but traffic stays flat, investigate configuration drift, increased cold starts, or concurrency changes.

Monitor network egress separately — egress charges often exceed compute charges for data heavy services. Track egress per endpoint. If one endpoint drives 80% of egress, consider caching responses or serving static assets from Cloud CDN instead.
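The cost per request practice above is simple to automate once you have a monthly total and a request count. The sketch below is one possible shape for it; the function names are hypothetical and the sample figures reuse the moderate traffic scenario from earlier in this article.

```python
def cost_per_request(total_monthly_cost: float, request_count: int) -> float:
    """Average cost in USD of serving one request."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return total_monthly_cost / request_count


def month_over_month_change(previous: float, current: float) -> float:
    """Relative change between two months, e.g. 0.2 for a 20% increase."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


if __name__ == "__main__":
    last_month = cost_per_request(76.50, 5_000_000)
    this_month = cost_per_request(91.80, 5_000_000)
    print(f"${last_month:.7f} -> ${this_month:.7f}")
    print(f"{month_over_month_change(last_month, this_month):+.0%}")
```

A rising cost per request with flat traffic is the signal to look for configuration drift, more frequent cold starts, or a concurrency change.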

Tools for Monitoring Google Cloud Run

Google Cloud’s native monitoring tools provide baseline visibility, but teams running multiple services or integrating Cloud Run with other platforms often need broader observability coverage.

Google Cloud Monitoring

Google Cloud Monitoring (formerly Stackdriver) is built into Cloud Run and collects metrics, logs, and traces automatically. It provides pre-built dashboards for request latency, instance count, CPU and memory utilization, and error rate. You can create custom dashboards, set up alerts, and query metrics using MQL (Monitoring Query Language).

Strengths: Native integration, no setup required, free tier covers most small to midsize workloads.

Limitations: Query language is less flexible than PromQL or SQL. Cross project or multi cloud visibility requires stitching together multiple dashboards. Exporting data to external tools incurs egress charges.

OpenTelemetry for Cloud Run

OpenTelemetry provides vendor neutral instrumentation for capturing traces, metrics, and logs from Cloud Run services. You can instrument your application using OpenTelemetry SDKs, export telemetry to Google Cloud Monitoring, or send it to any OpenTelemetry compatible backend.

This approach works well for teams running services across Google Cloud, AWS, and on premises infrastructure who want unified observability without vendor lock in. OpenTelemetry Collector can scrape Cloud Run metrics via the Cloud Monitoring API and forward them to any observability platform.

CubeAPM for Full Stack Cloud Run Observability

CubeAPM is an OpenTelemetry native observability platform that deploys inside your own cloud environment and correlates Cloud Run metrics, logs, and traces with infrastructure data from Kubernetes, databases, and other services in your stack.

Unlike cloud native tools that charge per GB ingested and bill separately for logs, metrics, and traces, CubeAPM uses predictable $0.15/GB pricing with unlimited retention. Since it runs on your infrastructure, telemetry data never leaves your VPC, eliminating Google Cloud egress charges that add $0.12/GB when sending logs and traces to external SaaS platforms.

For teams running Cloud Run alongside GKE, Cloud SQL, or Kafka, CubeAPM unifies observability across all services in one platform without needing separate tools for each layer. It includes distributed tracing, log aggregation, real user monitoring, and synthetic monitoring in one deployment.

If you are already evaluating Google Cloud monitoring tools for broader GCP observability, CubeAPM covers Cloud Run monitoring as part of full stack visibility rather than requiring a separate tool for serverless workloads.

Conclusion

Google Cloud Run simplifies deployment by abstracting infrastructure, but effective monitoring requires understanding how cold starts, concurrency, and billing dimensions interact. Without visibility into startup latency, instance count, CPU throttle, and cost per request, teams cannot optimize for both performance and cost.

The metrics that matter most depend on your workload. For user facing services, cold start frequency and request latency determine user experience. For background jobs, CPU and memory utilization determine cost efficiency. For data heavy APIs, network egress often exceeds compute charges.

Monitoring Cloud Run is not just about collecting metrics. It is about correlating those metrics with application behavior, understanding which configuration changes affect cost and latency, and building alerts that catch problems before they affect users or inflate your bill.

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. Features, pricing, and plan limits can change over time. Always verify the latest information directly with the vendor before making purchasing or deployment decisions.

Frequently Asked Questions

What causes cold starts in Google Cloud Run?

Cold starts occur when Cloud Run must start a new container instance because no running instance is available. This happens during scale from zero, scale up under load, or instance replacement after idle timeout. Cold starts add 1 to 10 seconds of latency depending on image size and runtime.

How does concurrency affect Cloud Run pricing?

Higher concurrency reduces the number of instances needed to handle the same load, which lowers CPU and memory billing. Lower concurrency increases instance count and costs more but reduces resource contention per request. Optimal concurrency depends on whether your workload is CPU bound or I/O bound.

What metrics should I monitor for Cloud Run services?

Monitor request latency (P50, P95, P99), error rate, instance count, startup latency, CPU utilization, memory utilization, and network egress. Set alerts on P99 latency and error rate to catch performance degradation early.

How can I reduce Cloud Run costs?

Right size CPU and memory allocation based on utilization metrics. Increase concurrency if your workload is I/O bound. Reduce container image size to shorten cold starts. Monitor network egress and cache responses where possible. Avoid setting min instances unless cold starts are causing real user impact.

Does Cloud Run support distributed tracing?

Yes, Cloud Run integrates with Google Cloud Trace and supports OpenTelemetry for distributed tracing. You can instrument your application using OpenTelemetry SDKs and export traces to Cloud Trace or any OpenTelemetry compatible backend.

What is the difference between request based and instance based CPU billing?

Request based billing gives your container CPU access only while processing requests. Between requests, CPU is throttled. This mode costs less but prevents background tasks from running. Instance based billing gives always on CPU access, which costs more but allows background activity and can reduce cold start latency.

How do I monitor Cloud Run cold starts?

Track startup latency using the `run.googleapis.com/container/startup_latencies` metric in Google Cloud Monitoring. Correlate startup events with request latency spikes to determine whether cold starts are affecting user experience. If startup latency exceeds 5 seconds regularly, optimize your container image or enable startup CPU boost.
