AWS Lambda doesn’t expose CPU or memory usage the way a traditional server does. You can’t SSH in. Lambda manages the infrastructure for you, which means you’re flying partially blind unless you know exactly which metrics to watch and what they’re actually telling you.
This page covers the 9 Lambda metrics that matter, what good and bad values look like, what alert thresholds to set, and the gaps CloudWatch leaves that you’ll need to fill another way.
Key Takeaways
- The 9 metrics that matter are: Errors, Duration, Throttles, ConcurrentExecutions, Invocations, DeadLetterErrors, IteratorAge, Init Duration, and AsyncEventAge
- Throttles and DeadLetterErrors are the most commonly missed; they don’t show up in your error rate but represent dropped and lost requests
- CloudWatch gives you the raw numbers for free but doesn’t correlate them across services or tell you why a function was slow
- Cold start Init Duration is not a standard CloudWatch metric; you need Lambda Insights or an APM tool to monitor it properly
- Set alert thresholds before you need them, not after your first production incident
Quick Reference: Lambda Metrics That Matter
| Metric | What It Tells You | Alert When |
| --- | --- | --- |
| Errors | Function execution failures | Error rate > 1% over 5 min |
| Duration | Execution time per invocation | p99 > 80% of your timeout |
| Throttles | Invocations rejected at concurrency limit | Any throttle in 5 min |
| ConcurrentExecutions | Parallel instances running at once | > 80% of your reserved limit |
| Invocations | Total calls (success + fail, not throttled) | Drops to 0 unexpectedly |
| DeadLetterErrors | Failed async event delivery to DLQ | Any DLQ error |
| IteratorAge | Stream lag (Kinesis/DynamoDB Streams) | > 60 seconds |
| Init Duration (cold start) | First-call initialisation latency | Median init > 1,000ms |
| AsyncEventAge | Time events wait before processing | > 30 seconds |
1. Errors: The Most Critical Metric
What it is: The number of invocations that resulted in a function error. This includes unhandled exceptions, out-of-memory errors, and timeouts. It does not include throttles or invocation errors that occur before your code even starts.
What it doesn’t tell you: Why the error happened. You’ll need traces or logs for that.
What good looks like: Error rate (Errors / Invocations) below 0.1% in steady state. Occasional spikes during deploys are normal; persistent rates above 1% are not.
Alert threshold to set:
- Warning: error rate > 0.5% over a 5-minute window
- Critical: error rate > 2% over a 5-minute window, or any single period with > 10 errors
The practical trap: A high error count is obvious. A low but nonzero error count that’s been stable for weeks is the dangerous one; it means something is silently failing and you’ve normalized it.
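If you manage alarms programmatically, here is a minimal boto3 sketch of the critical threshold above, using CloudWatch metric math to compute the error rate. The function name, alarm name, and SNS topic ARN are placeholders; adjust the threshold and period to your own baseline.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when error rate (Errors / Invocations) exceeds 2% over 5 minutes.
# "checkout-handler" and the SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-handler-error-rate-critical",
    AlarmDescription="Lambda error rate > 2% over 5 minutes",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=2.0,
    EvaluationPeriods=1,
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:lambda-alerts"],
    Metrics=[
        {
            "Id": "errors",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Errors",
                    "Dimensions": [{"Name": "FunctionName", "Value": "checkout-handler"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "invocations",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Invocations",
                    "Dimensions": [{"Name": "FunctionName", "Value": "checkout-handler"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "error_rate",
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ],
)
```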
2. Duration: Execution Time (and Your Billing Clock)
What it is: How long your function code takes to run per invocation, in milliseconds. Lambda bills duration in 1ms increments, rounded up to the nearest millisecond, so this is both a performance metric and a cost metric.
The three duration numbers you need:
- Average duration: Baseline health
- p99 duration: What your slowest 1% of invocations experience; the most important for user-facing functions
- Max duration: Will this ever hit your timeout?
What good looks like: p99 duration consistently below 50% of your configured timeout gives you headroom. If your timeout is 30 seconds and p99 is 28 seconds, you are one slow dependency call away from mass timeouts.
Alert threshold to set:
- Warning: p99 duration > 70% of your configured timeout
- Critical: p99 duration > 85% of your configured timeout
CloudWatch gotcha: CloudWatch’s percentile statistics for Lambda (p95, p99) are only meaningful for functions with enough invocation volume. For low-traffic functions, a handful of invocations per period makes percentiles noisy; use average and max together instead of relying on percentiles.
The timeout blindspot: If a function actually times out, CloudWatch records an error, but it doesn’t clearly flag it as a timeout. You have to search the logs for “Task timed out” after the fact, or set up a custom metric filter and alarm. This is one of the gaps where a proper APM tool pays for itself.
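One way to close that gap without an APM tool is a CloudWatch Logs metric filter that counts “Task timed out” messages and publishes them as a custom metric you can alarm on. A minimal boto3 sketch, assuming the function’s default log group; the log group, filter name, and namespace are placeholders.

```python
import boto3

logs = boto3.client("logs")

# Count "Task timed out" messages in the function's log group and emit them
# as a custom Timeouts metric; alarm on Sum > 0 like any other metric.
logs.put_metric_filter(
    logGroupName="/aws/lambda/checkout-handler",  # placeholder
    filterName="lambda-timeouts",
    filterPattern='"Task timed out"',
    metricTransformations=[
        {
            "metricName": "Timeouts",
            "metricNamespace": "Custom/Lambda",
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)
```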
3. Throttles: Invocations Your Function Refused
What it is: The count of invocations that Lambda rejected because your function hit its concurrency limit. Throttled invocations are not counted in Invocations or Errors; they’re a separate metric. This means your error rate can look fine while you’re silently dropping requests.
Why it matters: Lambda’s default account-level concurrency limit is 1,000 concurrent executions across all functions in a region. If you haven’t set reserved concurrency per function, one spiky function can throttle everything else in your account.
What good looks like: Zero throttles. A sustained throttle rate means either your concurrency limit is too low or you’re seeing a traffic burst you didn’t anticipate.
Alert threshold to set:
- Alert: any throttle count > 0 sustained over a 5-minute window
What to do when you see throttles:
- Check which function is consuming concurrency (ConcurrentExecutions metric per function)
- Increase reserved concurrency for critical functions (see the sketch after this list)
- Request a service limit increase if you’re hitting the account-wide 1,000 limit
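Reserving concurrency is a one-line API call. A minimal boto3 sketch; the function name and the figure of 200 are placeholders, and remember that reserved concurrency is carved out of the shared account pool.

```python
import boto3

lambda_client = boto3.client("lambda")

# Guarantee this function can always reach 200 concurrent executions,
# and cap it there. Function name and limit are placeholders.
lambda_client.put_function_concurrency(
    FunctionName="payment-processor",
    ReservedConcurrentExecutions=200,
)
```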
4. ConcurrentExecutions: How Close You Are to the Wall
What it is: The number of function instances running simultaneously at a given point. This is your headroom metric; it tells you how close you are to throttling before throttling actually happens.
Two levels to watch:
- Account-level ConcurrentExecutions: approaching 1,000 means any function in your region can start throttling
- Function-level ConcurrentExecutions: If you’ve set reserved concurrency on a function, watch this against that limit specifically.
Alert threshold to set:
- Warning: > 80% of your reserved concurrency limit sustained for 5 minutes
- Critical: > 90% of your reserved concurrency limit
The Lambda cold start relationship: Every new concurrent execution that spins up a fresh container incurs a cold start penalty. Watching ConcurrentExecutions spikes alongside Init Duration gives you the full cold start picture.
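A static-threshold alarm covers the function-level case. The sketch below, assuming a reserved concurrency of 200, warns at the 80% mark; the function name and numbers are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Warn when the function sustains more than 80% of its 200-unit reserved
# concurrency for 5 consecutive minutes. Name and threshold are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="payment-processor-concurrency-warning",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Dimensions=[{"Name": "FunctionName", "Value": "payment-processor"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=160,  # 80% of the 200 reserved above
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```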
5. Invocations: Your Request Volume Signal
What it is: The count of times your function was invoked, including both successes and failures. It does not include throttled invocations.
Why a drop matters more than a spike: A sudden spike in invocations can mean upstream services are retrying or an event source is misfiring. A drop to zero when you expect traffic means something upstream broke, and because Lambda is event-driven, that break might be completely silent.
Alert threshold to set:
- Anomaly: invocation count drops > 50% below baseline for your function’s typical traffic pattern
- Alert: invocations = 0 for any 10-minute window where traffic is expected
Cost relationship: Invocations × Duration × allocated memory drives your Lambda bill. A function that’s being invoked 10x more than expected will generate a bill surprise before your next review cycle. Set a billing alarm in AWS Budgets alongside this metric.
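CloudWatch Anomaly Detection handles the baseline for you. A minimal boto3 sketch that trains a model on Invocations and alarms when volume drops below the expected band; the function name and the band width of 2 standard deviations are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

dimensions = [{"Name": "FunctionName", "Value": "checkout-handler"}]  # placeholder

# Train an anomaly detection model on the function's invocation volume.
cloudwatch.put_anomaly_detector(
    SingleMetricAnomalyDetector={
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": dimensions,
        "Stat": "Sum",
    }
)

# Alarm when invocations fall below the lower edge of the expected band.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-handler-invocations-drop",
    ComparisonOperator="LessThanLowerThreshold",
    EvaluationPeriods=2,
    ThresholdMetricId="band",
    TreatMissingData="breaching",  # no data at all should also alert
    Metrics=[
        {
            "Id": "invocations",
            "ReturnData": True,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Invocations",
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(invocations, 2)",
            "Label": "Expected invocation range",
        },
    ],
)
```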
6. DeadLetterErrors: Silent Data Loss
What it is: The number of times Lambda failed to deliver a failed async event to your dead-letter queue (SQS or SNS). This metric is only relevant if you’ve configured a DLQ on your function.
Why it’s critical: DeadLetterErrors represent data loss. If an async event fails and Lambda can’t write it to the DLQ either, that event is gone. Common causes: incorrect IAM permissions on the Lambda execution role, misconfigured SQS queue, or the event payload exceeding the DLQ’s size limit.
Alert threshold to set:
- Alert: any DeadLetterErrors > 0
If you haven’t configured a DLQ yet: You should. Any Lambda function that processes async events from S3, SNS, EventBridge, or SQS should have a DLQ configured so you have a recovery path when invocations fail.
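Attaching a DLQ is a single configuration change. A minimal boto3 sketch; the function name and queue ARN are placeholders, and the function’s execution role needs sqs:SendMessage on that queue, otherwise you trade lost events for DeadLetterErrors.

```python
import boto3

lambda_client = boto3.client("lambda")

# Route failed async events to an SQS dead-letter queue.
# Function name and queue ARN are placeholders.
lambda_client.update_function_configuration(
    FunctionName="order-events-handler",
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:us-east-1:123456789012:order-events-dlq"
    },
)
```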
7. IteratorAge: Stream Processing Lag (Kinesis / DynamoDB Streams)
What it is: For Lambda functions triggered by Kinesis Data Streams or DynamoDB Streams, IteratorAge measures the age of the last record your function processed, i.e., how far behind your function is from the head of the stream.
What it doesn’t apply to: SQS-triggered Lambda functions don’t emit this metric (SQS has its own ApproximateAgeOfOldestMessage metric you should monitor separately).
What good looks like: IteratorAge should be consistently near zero. If it’s growing, your function isn’t processing records fast enough; either the function itself is slow, it’s erroring on records, or you need to increase the shard count on your Kinesis stream.
Alert threshold to set:
- Warning: IteratorAge > 60 seconds (1 minute behind)
- Critical: IteratorAge > 300 seconds (5 minutes behind)
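If the function itself is healthy and IteratorAge keeps growing, adding read and processing capacity to the stream is the usual fix. A minimal boto3 sketch of the shard increase mentioned above; the stream name and target count are placeholders (raising ParallelizationFactor on the event source mapping is another option).

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count so more Lambda invocations can process in parallel.
# Stream name and target count are placeholders.
kinesis.update_shard_count(
    StreamName="orders-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```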
8. Init Duration (Cold Starts): The Hidden Latency Tax
What it is: The time Lambda spends initializing a new execution environment, downloading your code, starting the runtime, and running your initialization code before your handler function actually starts. Cold starts happen when there’s no warm instance available to serve the request.
Where to find it: Init Duration isn’t a CloudWatch metric in the standard metrics view; it’s emitted to CloudWatch Logs in the REPORT line of every cold-start invocation. To monitor it as a metric, you need to do one of the following:
- Enable CloudWatch Lambda Insights (adds init_duration as a proper metric)
- Create a CloudWatch Metric Filter on your log group parsing the Init Duration value
- Use an APM tool that extracts this automatically from Lambda telemetry
What good looks like: For most runtimes, cold start init duration under 500ms is acceptable. Above 1,000ms for user-facing functions starts causing noticeable latency for end users who hit cold instances.
Cold start by runtime (rough benchmarks):
| Runtime | Typical Init Duration |
| --- | --- |
| Node.js 20 | 100–300ms |
| Python 3.12 | 100–400ms |
| Java 21 | 500–2,000ms |
| .NET 8 | 200–800ms |
| Go | 50–200ms |
What to do about frequent cold starts:
- Enable Provisioned Concurrency for latency-sensitive functions
- Move heavy initialisation code outside your handler to the global scope (runs once per container lifecycle, not per invocation; see the sketch after this list)
- Reduce your deployment package size: larger packages = slower cold starts
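To illustrate the global-scope pattern from the list above, here is a minimal Python handler sketch; the table name and key are placeholders. Anything created at module level runs once per execution environment during the init phase and is reused by every warm invocation.

```python
import json
import boto3

# Created once per execution environment (counted in Init Duration),
# then reused across warm invocations. Table name is a placeholder.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")

def handler(event, context):
    # Only per-request work lives here; no client construction per invocation.
    result = table.get_item(Key={"order_id": event["order_id"]})
    return {
        "statusCode": 200,
        "body": json.dumps(result.get("Item", {}), default=str),
    }
```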
9. AsyncEventAge: Queue Buildup Before Processing
What it is: For async invocations, the time between when Lambda successfully queued the event and when it actually invoked your function. If this is growing, events are waiting longer before being processed, usually because the function is erroring and Lambda is retrying with backoff.
Alert threshold to set:
- Warning: AsyncEventAge > 30 seconds
- Critical: AsyncEventAge > 120 seconds
The relationship with Errors: AsyncEventAge spikes almost always accompany an Errors spike on async functions. Watch them together; a growing AsyncEventAge with an errors spike means Lambda is retrying failed events, and your backlog is growing.
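Beyond alerting, you can bound how long Lambda keeps retrying failed async events, which caps AsyncEventAge directly. A minimal boto3 sketch; the function name and limits are placeholders, and a shorter maximum age only makes sense alongside a DLQ or failure destination so capped-out events aren’t lost.

```python
import boto3

lambda_client = boto3.client("lambda")

# Limit async retries to 1 attempt and discard events older than 1 hour
# (defaults are 2 attempts and 6 hours). Function name is a placeholder.
lambda_client.put_function_event_invoke_config(
    FunctionName="order-events-handler",
    MaximumRetryAttempts=1,
    MaximumEventAgeInSeconds=3600,
)
```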
What CloudWatch Doesn’t Give You (and How to Fill the Gap)
CloudWatch gives you the raw metric data above. What it doesn’t give you:
1. Distributed traces across services. Lambda is rarely alone. It calls DynamoDB, RDS, external APIs, and other Lambda functions. CloudWatch tells you your Lambda had a 3-second duration. It doesn’t tell you that 2.8 seconds of that was waiting on a slow RDS query. For that, you need distributed tracing, either AWS X-Ray or an OpenTelemetry-based APM tool.
2. Correlated logs + metrics in one view. Seeing that your error rate spiked at 14:32 is useful. Being able to click into the specific invocation, see the stack trace, and then see the trace of that request across all services is what actually lets you fix the problem in under 10 minutes instead of over an hour.
3. p99 latency per invocation pattern. CloudWatch percentile metrics work better at scale but are harder to slice by trigger type, payload size, or downstream dependency, which is often where the interesting performance questions live.
4. Out-of-the-box alerting without alarm configuration overhead. CloudWatch requires you to create an alarm for every metric, every function, every threshold you want to watch. At 5 functions, that’s manageable. At 50 functions, that’s a configuration management problem.
The Minimum Lambda Monitoring Setup
If you’re starting from scratch, here’s what to implement in order:
- Errors alarm: error rate > 1% for 5 minutes → SNS → Slack/PagerDuty
- Throttles alarm: any throttle count > 0 for 5 minutes → alert
- Duration alarm: p99 > 80% of your configured timeout → warning alert
- Invocations anomaly detection: Set up CloudWatch Anomaly Detection on the Invocations metric so drops are caught automatically
- ConcurrentExecutions alarm: > 80% of your account or reserved concurrency limit
- Enable Lambda Insights: for init duration, memory utilisation, and cold start tracking
- Configure a DLQ + DeadLetterErrors alarm: for every async function
Going Beyond CloudWatch: When to Add an APM Tool
CloudWatch is the right starting point. An APM tool earns its place when:
- You have more than 5–10 Lambda functions in production
- You need to trace a slow request across Lambda + DynamoDB + external APIs to find the root cause
- You want p99 duration per function without building custom dashboards in CloudWatch
- You want correlated logs and traces in one place without switching between CloudWatch Logs Insights and CloudWatch Metrics
CubeAPM instruments Lambda functions via the OpenTelemetry Lambda layer: there is no proprietary agent, no code changes, and no data leaving your AWS account. You get distributed traces, correlated logs, and all the metrics above in a single dashboard, self-hosted inside your VPC.
Summary
The 9 Lambda metrics to monitor, in priority order:
- Errors (error rate): Function failures
- Throttles: Requests being silently dropped
- Duration / p99: Execution time and timeout headroom
- ConcurrentExecutions: Proximity to concurrency wall
- Invocations: Volume anomalies and unexpected zeros
- DeadLetterErrors: Data loss signal for async functions
- IteratorAge: Stream processing lag (Kinesis/DynamoDB Streams)
- Init Duration: Cold start latency
- AsyncEventAge: Event queue buildup
CloudWatch gives you all of these metrics for free. What it doesn’t give you is the correlation between them: the trace showing that a 3-second duration was caused by a slow downstream RDS call, not by your Lambda code. That’s the gap a proper APM layer fills.