CubeAPM

How to Monitor AWS Lambda Timeout Errors and Set Alerts


AWS Lambda timeout errors happen when your function exceeds its configured maximum execution time – Lambda terminates the invocation and records it as an error. The problem is CloudWatch groups timeouts and application errors together under the same Errors metric, so a spike in timeouts looks identical to a spike in unhandled exceptions unless you know where to look.

Monitoring Lambda timeouts properly means tracking them as a distinct failure type, not as a subset of general errors.

Key Takeaways

  • Lambda records timeouts as Errors in CloudWatch but does not distinguish them from application errors without additional setup
  • The clearest timeout signal is in logs: every timed-out invocation writes Task timed out after X.XX seconds to CloudWatch Logs
  • Alert on p99 Duration exceeding 80% of your configured timeout – this is your early warning before timeouts actually happen
  • A function timing out on async invocations is silently retried twice by default before the event is sent to the DLQ (if one is configured); your error count may look low while the problem compounds
  • Timeouts are almost always a symptom: a slow downstream dependency, a missing connection timeout, or a function doing too much in a single invocation

Why CloudWatch’s Error Metric Is Not Enough

The Errors metric in CloudWatch counts every failed invocation – unhandled exceptions, out-of-memory kills, and timeouts – all in one number. There is no Timeouts metric in the standard CloudWatch namespace.

This creates two practical problems:

You can’t alert specifically on timeouts. An alarm on Errors > 5 fires whether your function threw a null pointer exception or timed out waiting on a database connection. The response to each is completely different.

The REPORT log line is the only place the distinction exists. Every Lambda invocation writes a structured log line like:

REPORT RequestId: abc-123  Duration: 30003.45 ms  Billed Duration: 30000 ms  ...

And for timed-out invocations specifically, the line immediately before it reads:

Task timed out after 30.00 seconds

That log line is your primary timeout signal, and everything below is built around surfacing it reliably.
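If you process these log lines yourself (for example, from a log-forwarding Lambda), distinguishing the two is a small parsing job. A minimal sketch, assuming the line formats shown above; the regexes and field names are illustrative, not an official schema:

```python
import re

# Patterns based on the two log line formats shown above.
TIMEOUT_RE = re.compile(r"Task timed out after (?P<seconds>[\d.]+) seconds")
REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+Duration: (?P<duration_ms>[\d.]+) ms"
)

def classify(line: str):
    """Return a parsed record for timeout or REPORT lines, None for anything else."""
    if m := TIMEOUT_RE.search(line):
        return {"type": "timeout", "seconds": float(m["seconds"])}
    if m := REPORT_RE.search(line):
        return {
            "type": "report",
            "request_id": m["request_id"],
            "duration_ms": float(m["duration_ms"]),
        }
    return None
```

A timeout line classifies as `timeout`; the REPORT line that follows it carries the Request ID you need to correlate the two.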

Step 1: Create a CloudWatch Metric Filter for Timeouts

A Metric Filter parses your Lambda log group and emits a custom metric every time the timeout log line appears. This turns an unstructured log event into a proper metric you can alarm on.

In the AWS Console:

  1. Go to CloudWatch → Log groups → your function’s log group
  2. Select Metric filters → Create metric filter
  3. Set the filter pattern to:

"Task timed out"

  4. Name the metric LambdaTimeouts, namespace Custom/Lambda
  5. Set the metric value to 1 and the default value to 0

With AWS CLI:

aws logs put-metric-filter \
  --log-group-name /aws/lambda/your-function-name \
  --filter-name LambdaTimeoutFilter \
  --filter-pattern "Task timed out" \
  --metric-transformations \
      metricName=LambdaTimeouts,metricNamespace=Custom/Lambda,metricValue=1,defaultValue=0

With Terraform:

resource "aws_cloudwatch_log_metric_filter" "lambda_timeouts" {
  name           = "LambdaTimeoutFilter"
  log_group_name = "/aws/lambda/your-function-name"
  pattern        = "Task timed out"

  metric_transformation {
    name          = "LambdaTimeouts"
    namespace     = "Custom/Lambda"
    value         = "1"
    default_value = "0"
  }
}

Once the filter is in place, every timeout writes a 1 to Custom/Lambda/LambdaTimeouts. Now you have something meaningful to alarm on.

Step 2: Set the Right Alerts

You need two alarms – one that catches timeouts after they happen and one that warns you before they do.

Alarm 1: Timeout Count (Reactive)

Fires when a timeout has already occurred.

aws cloudwatch put-metric-alarm \
  --alarm-name "LambdaTimeouts-YourFunction" \
  --metric-name LambdaTimeouts \
  --namespace Custom/Lambda \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789:your-alert-topic \
  --dimensions Name=FunctionName,Value=your-function-name

Threshold to set: Any timeout — Sum >= 1 over a 5-minute window. Timeouts should not be a normal occurrence. A single timeout in production warrants investigation.

Alarm 2: p99 Duration (Proactive)

Fires before timeouts happen when execution time is trending toward your limit.

aws cloudwatch put-metric-alarm \
  --alarm-name "LambdaDuration-p99-YourFunction" \
  --metric-name Duration \
  --namespace AWS/Lambda \
  --extended-statistic p99 \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 24000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789:your-alert-topic \
  --dimensions Name=FunctionName,Value=your-function-name

Threshold to set: 80% of your configured timeout. If your timeout is 30 seconds, alarm at 24,000ms. This gives you a window to investigate before invocations start actually timing out.

| Alarm | Metric | Threshold | What it tells you |
| --- | --- | --- | --- |
| Timeout count | Custom/Lambda LambdaTimeouts | Sum ≥ 1 over 5 min | A timeout has occurred – act now |
| p99 Duration | AWS/Lambda Duration (p99) | ≥ 80% of timeout limit | Timeouts are likely coming – investigate |
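The 80% rule is simple arithmetic, but it is easy to get the units wrong: Lambda timeouts are configured in seconds while the Duration metric and alarm threshold are in milliseconds. A tiny helper (the function name and default fraction are illustrative) to derive the threshold:

```python
def proactive_threshold_ms(timeout_seconds: float, fraction: float = 0.8) -> int:
    """p99 Duration alarm threshold: a fraction (default 80%) of the configured
    Lambda timeout, converted to milliseconds as CloudWatch expects."""
    return round(timeout_seconds * fraction * 1000)

# A 30-second timeout gives the 24,000 ms threshold used in the alarm above.
```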

Step 3: Find What Actually Timed Out

An alarm tells you a timeout happened. It doesn’t tell you which request, what was slow, or which downstream call caused it. For that, you need logs.

CloudWatch Logs Insights query to find timed-out invocations:

fields @timestamp, @requestId, @duration, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 50

This gives you the Request IDs of every timed-out invocation in the time window. Use the Request ID to find the full log stream for that invocation and trace what it was doing.

The practical gap: This tells you the invocation timed out. It does not tell you whether the timeout was caused by a slow DynamoDB query, a hung HTTP call to an external API, or a database connection that was never released. For that you need distributed traces, which CloudWatch Logs alone cannot provide.

The Async Timeout Trap

Timeouts on synchronously invoked functions (API Gateway, ALB) are immediately visible: the caller gets a 502 or 504 and your users notice straight away. Timeouts on async invocations (S3 events, SNS, EventBridge) are much harder to spot.

When an async invocation times out, Lambda retries it automatically, twice by default, with a delay between attempts. Each retry is another timeout. Your Errors metric rises slowly, AsyncEventAge grows as the event ages in the queue, and if you don’t have a DLQ configured, the event is silently dropped after the final retry with no record of what was lost.

What to watch alongside your timeout alarm for async functions:

  • AsyncEventAge: if this is growing alongside timeouts, Lambda is retrying and your backlog is building
  • DeadLetterErrors: if Lambda can’t write timed-out events to your DLQ, you’re losing data with no recovery path
  • Invocation count: a drop in successful invocations during a timeout spike means retries are consuming your concurrency headroom

Practical note: If you haven’t configured a DLQ on async Lambda functions, a sustained run of timeouts means permanently lost events. Set one up before you need it, not after.

Common Causes of Lambda Timeouts

Timeouts are almost always a symptom of something else. The most frequent causes:

Missing connection timeouts on downstream calls. If your function calls an RDS database, external HTTP API, or ElastiCache instance without an explicit connection timeout configured, a hung connection holds the invocation open until Lambda’s timeout fires. Set explicit timeouts on every downstream call, and keep them shorter than your Lambda timeout.
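A minimal sketch of the pattern in Python, using only the standard library; the helper name, the 50% fraction, and the URL are assumptions, not prescriptions – the point is that every downstream call carries an explicit timeout well below the Lambda limit:

```python
import socket
import urllib.request

# Hypothetical value -- match this to your function's configured timeout.
LAMBDA_TIMEOUT_S = 30

def downstream_timeout(lambda_timeout_s: float, fraction: float = 0.5) -> float:
    """Pick a downstream-call timeout well below the Lambda timeout, so a
    hung connection fails fast instead of running out the clock."""
    return lambda_timeout_s * fraction

def fetch(url: str) -> bytes:
    # Without an explicit timeout, a hung server holds this invocation
    # open until Lambda's own timeout fires -- and that surfaces only as
    # a generic "Task timed out" log line.
    t = downstream_timeout(LAMBDA_TIMEOUT_S)
    try:
        with urllib.request.urlopen(url, timeout=t) as resp:
            return resp.read()
    except socket.timeout:
        # Fail fast with a distinct, actionable error instead of letting
        # the whole invocation time out.
        raise RuntimeError(f"downstream call to {url} exceeded {t}s")
```

The same idea applies to SDK clients: most HTTP and database libraries expose separate connect and read timeouts, and both should be set.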

Cold start eating into execution time. On a cold start, initialization runs before your handler. If initialization takes 2 seconds and your timeout is 3 seconds, there’s only 1 second left for actual work. Increase the timeout, reduce initialization time, or use Provisioned Concurrency for latency-sensitive functions.
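The standard mitigation for repeated init cost is to do expensive setup at module scope, which Lambda runs once per container rather than once per invocation. A stripped-down sketch of the pattern; the config values are placeholders standing in for real client construction:

```python
import time

# Module scope runs once per cold start (Lambda's init phase). Anything
# expensive here -- SDK clients, config parsing, connection setup -- is
# paid once per container, not on every invocation.
_init_start = time.monotonic()
CONFIG = {"table": "orders"}  # placeholder: stands in for real setup work
INIT_SECONDS = time.monotonic() - _init_start

def handler(event, context=None):
    # On a warm container only this body runs; the init cost above is
    # skipped entirely. Logging INIT_SECONDS shows what cold starts cost.
    return {"init_seconds": INIT_SECONDS, "table": CONFIG["table"]}
```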

Function is doing too much in one invocation. A Lambda function that fetches data, transforms it, writes to S3, and sends an SNS notification will time out under load in ways that a function doing a single one of those things won’t. Break large functions into smaller, focused ones with appropriate timeouts per step.

VPC cold starts. Functions running inside a VPC incur additional cold start latency from ENI attachment. If your function is VPC-attached and timeouts only appear on cold start invocations, this is likely a contributing factor.

When CloudWatch Isn’t Enough

CloudWatch gives you the raw ingredients – the log line, the Duration metric, the Errors count. What it doesn’t give you:

  • A timeout-specific metric without building a Metric Filter yourself for every function
  • The trace that shows which downstream call caused the timeout; you get the total duration, not the breakdown of where time was spent
  • Correlated logs and traces in one view; you’re switching between Logs Insights for the log line and a separate tool for any trace data
  • Cross-function visibility; if a timeout in Function A caused a cascade into Function B and C, CloudWatch shows three separate error spikes with no connection drawn between them

CubeAPM captures Lambda timeout errors as part of distributed traces. So when a function times out, you can see the full span breakdown of what was running at the time, identify which downstream call was slow, and follow the impact across the rest of the request chain. No Metric Filters to configure and maintain per function, no context-switching between tools, and everything self-hosted inside your own AWS account.

Summary

| What to do | Why |
| --- | --- |
| Create a CloudWatch Metric Filter on “Task timed out” | Turns the timeout log line into a metric you can alarm on – CloudWatch has no native timeout metric |
| Alarm on timeout count Sum ≥ 1 over 5 minutes | Any timeout in production warrants investigation – zero is the right threshold |
| Alarm on p99 Duration ≥ 80% of your configured timeout | Catches timeouts before they happen, not after |
| Watch AsyncEventAge and DeadLetterErrors for async functions | Async timeouts silently retry and can drop events permanently without a DLQ |
| Set explicit connection timeouts on all downstream calls | Most Lambda timeouts trace back to hung downstream connections, not slow function code |

Disclaimer: Configurations, thresholds, and code examples are for guidance only. Verify against current AWS and OpenTelemetry documentation before applying to production. AWS service details change frequently. CubeAPM references reflect genuine use cases; evaluate all tools against your own requirements.
