AWS Lambda timeout errors happen when your function exceeds its configured maximum execution time – Lambda terminates the invocation and records it as an error. The problem is CloudWatch groups timeouts and application errors together under the same Errors metric, so a spike in timeouts looks identical to a spike in unhandled exceptions unless you know where to look.
Monitoring Lambda timeouts properly means tracking them as a distinct failure type, not as a subset of general errors.
Key Takeaways
- Lambda records timeouts under the Errors metric in CloudWatch, but does not distinguish them from application errors without additional setup
- The clearest timeout signal is in logs: every timed-out invocation writes Task timed out after X.XX seconds to CloudWatch Logs
- Alert on p99 Duration exceeding 80% of your configured timeout – this is your early warning before timeouts actually happen
- A function timing out on async invocations is silently retried up to two more times by default (three attempts in total) before the event is sent to a dead-letter queue, if one is configured; your error count may look unremarkable while the problem compounds
- Timeouts are almost always a symptom: a slow downstream dependency, a missing connection timeout, or a function doing too much in a single invocation
Why CloudWatch’s Error Metric Is Not Enough
The Errors metric in CloudWatch counts every failed invocation – unhandled exceptions, out-of-memory kills, and timeouts – all in one number. There is no Timeouts metric in the standard CloudWatch namespace.
This creates two practical problems:
You can’t alert specifically on timeouts. An alarm on Errors > 5 fires whether your function threw a null pointer exception or timed out waiting on a database connection. The response to each is completely different.
The logs are the only place the distinction exists. Every Lambda invocation writes a structured REPORT line like:
REPORT RequestId: abc-123 Duration: 30003.45 ms Billed Duration: 30000 ms ...
And for timed-out invocations specifically, a line shortly before it reads:
Task timed out after 30.00 seconds
That log line is your primary timeout signal, and everything below is built around surfacing it reliably.
Step 1: Create a CloudWatch Metric Filter for Timeouts
A Metric Filter parses your Lambda log group and emits a custom metric every time the timeout log line appears. This turns an unstructured log event into a proper metric you can alarm on.
In the AWS Console:
- Go to CloudWatch → Log groups → your function’s log group
- Select Metric filters → Create metric filter
- Set the filter pattern to:
"Task timed out"
- Name the metric LambdaTimeouts, namespace Custom/Lambda
- Set metric value to 1, default value to 0
With AWS CLI:
aws logs put-metric-filter \
--log-group-name /aws/lambda/your-function-name \
--filter-name LambdaTimeoutFilter \
--filter-pattern "Task timed out" \
--metric-transformations metricName=LambdaTimeouts,metricNamespace=Custom/Lambda,metricValue=1,defaultValue=0
With Terraform:
resource "aws_cloudwatch_log_metric_filter" "lambda_timeouts" {
name = "LambdaTimeoutFilter"
log_group_name = "/aws/lambda/your-function-name"
pattern = "Task timed out"
metric_transformation {
name = "LambdaTimeouts"
namespace = "Custom/Lambda"
value = "1"
default_value = "0"
}
}Once the filter is in place, every timeout writes a 1 to Custom/Lambda/LambdaTimeouts. Now you have something meaningful to alarm on.
Step 2: Set the Right Alerts
You need two alarms – one that catches timeouts after they happen and one that warns you before they do.
Alarm 1: Timeout Count (Reactive)
Fires when a timeout has already occurred.
aws cloudwatch put-metric-alarm \
--alarm-name "LambdaTimeouts-YourFunction" \
--metric-name LambdaTimeouts \
--namespace Custom/Lambda \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:your-alert-topic

Note that the custom metric created by the Metric Filter has no FunctionName dimension – the filter is already scoped to a single function's log group – so this alarm omits the --dimensions flag.

Threshold to set: Any timeout – Sum >= 1 over a 5-minute window. Timeouts should not be a normal occurrence. A single timeout in production warrants investigation.
Alarm 2: p99 Duration (Proactive)
Fires before timeouts happen when execution time is trending toward your limit.
aws cloudwatch put-metric-alarm \
--alarm-name "LambdaDuration-p99-YourFunction" \
--metric-name Duration \
--namespace AWS/Lambda \
--extended-statistic p99 \
--period 300 \
--evaluation-periods 3 \
--threshold 24000 \
--comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:your-alert-topic \
--dimensions Name=FunctionName,Value=your-function-name

Threshold to set: 80% of your configured timeout. If your timeout is 30 seconds, alarm at 24,000 ms. This gives you a window to investigate before invocations start actually timing out.
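If you'd rather derive the threshold from the function's actual configuration than hard-code it, a small boto3 sketch (the function name is a placeholder):

import boto3

lambda_client = boto3.client("lambda")

# Timeout is returned in seconds; alarm at 80% of it, expressed in milliseconds
config = lambda_client.get_function_configuration(FunctionName="your-function-name")
threshold_ms = int(config["Timeout"] * 1000 * 0.8)
print(f"p99 Duration alarm threshold: {threshold_ms} ms")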
| Alarm | Metric | Threshold | What it tells you |
| --- | --- | --- | --- |
| Timeout count | Custom/Lambda/LambdaTimeouts | Sum ≥ 1 over 5 min | A timeout has occurred – act now |
| p99 Duration | AWS/Lambda Duration p99 | ≥ 80% of timeout limit | Timeouts are likely coming – investigate |
Step 3: Find What Actually Timed Out
An alarm tells you a timeout happened. It doesn’t tell you which request, what was slow, or which downstream call caused it. For that, you need logs.
CloudWatch Logs Insights query to find timed-out invocations:
fields @timestamp, @requestId, @duration, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 50

This gives you the Request IDs of every timed-out invocation in the time window. Use the Request ID to find the full log stream for that invocation and trace what it was doing.
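To pull every log line for one of those invocations programmatically, a boto3 sketch (the log group name and request ID are placeholders for your own values):

import boto3

logs = boto3.client("logs")

# A quoted term in the filter pattern returns every event containing that request ID
events = logs.filter_log_events(
    logGroupName="/aws/lambda/your-function-name",
    filterPattern='"abc-123"',
)
for event in events["events"]:
    print(event["message"], end="")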
The practical gap: This tells you the invocation timed out. It does not tell you whether the timeout was caused by a slow DynamoDB query, a hung HTTP call to an external API, or a database connection that was never released. For that you need distributed traces, which CloudWatch Logs alone cannot provide.
The Async Timeout Trap
Timeouts on synchronously invoked functions (API Gateway, ALB) are immediately visible. The caller gets a 502 or 504 and your users notice straight away. Timeouts on async invocations (S3 events, SNS, EventBridge) are much harder to spot.
When an async invocation times out, Lambda retries it automatically up to two more times by default, with a growing delay between attempts. Each retry is another timeout. Your Errors metric rises slowly, AsyncEventAge grows as the event ages in the queue, and if you don’t have a DLQ configured, the event is silently dropped after the final retry with no record of what was lost.
What to watch alongside your timeout alarm for async functions:
- AsyncEventAge: if this is growing alongside timeouts, Lambda is retrying and your backlog is building (see the alarm sketch after this list)
- DeadLetterErrors: if Lambda can’t write timed-out events to your DLQ, you’re losing data with no recovery path
- Invocations: a drop in successful invocations during a timeout spike means retries are consuming your concurrency headroom
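The first two can be alarmed on the same way as the timeout metric. A boto3 sketch, reusing the SNS topic from the earlier alarms; the AsyncEventAge threshold (5 minutes, in milliseconds) is an assumption to tune for your own traffic:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Any failure to deliver an event to the DLQ means data loss - alert on the first one
cloudwatch.put_metric_alarm(
    AlarmName="LambdaDeadLetterErrors-YourFunction",
    Namespace="AWS/Lambda",
    MetricName="DeadLetterErrors",
    Dimensions=[{"Name": "FunctionName", "Value": "your-function-name"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789:your-alert-topic"],
)

# Growing event age means Lambda is retrying and the async backlog is building
cloudwatch.put_metric_alarm(
    AlarmName="LambdaAsyncEventAge-YourFunction",
    Namespace="AWS/Lambda",
    MetricName="AsyncEventAge",
    Dimensions=[{"Name": "FunctionName", "Value": "your-function-name"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=300_000,  # 5 minutes in milliseconds - an assumption, tune to your workload
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789:your-alert-topic"],
)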
Practical note: If you haven’t configured a DLQ on async Lambda functions, a sustained run of timeouts means permanently lost events. Set one up before you need it, not after.
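Attaching a DLQ is a one-line configuration change. A boto3 sketch, assuming an existing SQS queue (the queue ARN is a placeholder, and the function's execution role also needs sqs:SendMessage on it):

import boto3

lambda_client = boto3.client("lambda")

# Route events that still fail after the final async retry to an SQS queue
lambda_client.update_function_configuration(
    FunctionName="your-function-name",
    DeadLetterConfig={"TargetArn": "arn:aws:sqs:us-east-1:123456789:your-function-dlq"},
)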
Common Causes of Lambda Timeouts
Timeouts are almost always a symptom of something else. The most frequent causes:
Missing connection timeouts on downstream calls. If your function calls an RDS database, external HTTP API, or ElastiCache instance without an explicit connection timeout configured, a hung connection holds the invocation open until Lambda’s timeout fires. Set explicit timeouts on every downstream call — shorter than your Lambda timeout.
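In a Python function, for example, the SDK defaults are far too generous (botocore waits 60 seconds per connect and read attempt). A sketch of explicit timeouts, assuming a 30-second Lambda timeout, a hypothetical DynamoDB table, and a hypothetical external API:

import boto3
import requests
from botocore.config import Config

# Every downstream timeout stays well below the 30-second Lambda timeout
BOTO_CONFIG = Config(connect_timeout=2, read_timeout=5, retries={"max_attempts": 2})
dynamodb = boto3.client("dynamodb", config=BOTO_CONFIG)

def handler(event, context):
    # Both calls fail fast instead of holding the invocation open until Lambda kills it
    item = dynamodb.get_item(
        TableName="orders",
        Key={"id": {"S": event["order_id"]}},
    )
    enrichment = requests.get(
        "https://api.example.com/enrich",  # hypothetical external API
        params={"id": event["order_id"]},
        timeout=(2, 5),  # (connect, read) in seconds
    )
    enrichment.raise_for_status()
    return {"order": item.get("Item"), "enrichment": enrichment.json()}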
Cold start eating into execution time. On a cold start, initialization runs before your handler. If initialization takes 2 seconds and your timeout is 3 seconds, there’s only 1 second left for actual work. Increase the timeout, reduce initialization time, or use Provisioned Concurrency for latency-sensitive functions.
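A pattern that helps on both fronts is to create expensive clients once at module scope, so warm invocations reuse them, and to log how much headroom each invocation had left. A minimal Python sketch; the bucket and key fields and the 5-second warning threshold are arbitrary assumptions:

import json
import boto3

# Created once per execution environment; warm invocations reuse the client
# instead of paying connection setup cost on every request
s3 = boto3.client("s3")

def handler(event, context):
    body = s3.get_object(Bucket=event["bucket"], Key=event["key"])["Body"].read()

    # How much of the configured timeout is left for this invocation
    remaining_ms = context.get_remaining_time_in_millis()
    if remaining_ms < 5_000:  # arbitrary threshold - tune relative to your timeout
        print(json.dumps({
            "warning": "near_timeout",
            "remaining_ms": remaining_ms,
            "request_id": context.aws_request_id,
        }))
    return {"bytes_read": len(body)}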
Function is doing too much in one invocation. A Lambda function that fetches data, transforms it, writes to S3, and sends an SNS notification will time out under load in ways that a function doing a single one of those things won’t. Break large functions into smaller, focused ones with appropriate timeouts per step.
VPC cold starts. Functions running inside a VPC incur additional cold start latency from ENI attachment. If your function is VPC-attached and timeouts only appear on cold start invocations, this is likely a contributing factor.
When CloudWatch Isn’t Enough
CloudWatch gives you the raw ingredients – the log line, the Duration metric, the Errors count. What it doesn’t give you:
- A timeout-specific metric without building a Metric Filter yourself for every function
- The trace that shows which downstream call caused the timeout; you get the total duration, not the breakdown of where time was spent
- Correlated logs and traces in one view; you’re switching between Logs Insights for the log line and a separate tool for any trace data
- Cross-function visibility; if a timeout in Function A caused a cascade into Function B and C, CloudWatch shows three separate error spikes with no connection drawn between them
CubeAPM captures Lambda timeout errors as part of distributed traces, so when a function times out you can see the full span breakdown of what was running at the time, identify which downstream call was slow, and follow the impact across the rest of the request chain. No Metric Filters to configure and maintain per function, no context-switching between tools, and everything self-hosted inside your own AWS account.
Summary
| What to do | Why |
| --- | --- |
| Create a CloudWatch Metric Filter on “Task timed out” | Turns the timeout log line into a metric you can alarm on – CloudWatch has no native timeout metric |
| Alarm on timeout count Sum ≥ 1 over 5 minutes | Any timeout in production warrants investigation – zero is the right threshold |
| Alarm on p99 Duration ≥ 80% of your configured timeout | Catches timeouts before they happen, not after |
| Watch AsyncEventAge and DeadLetterErrors for async functions | Async timeouts silently retry and can drop events permanently without a DLQ |
| Set explicit connection timeouts on all downstream calls | Most Lambda timeouts trace back to hung downstream connections, not slow function code |
Disclaimer: Configurations, thresholds, and code examples are for guidance only. Verify against current AWS and OpenTelemetry documentation before applying to production. AWS service details change frequently. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.