CubeAPM

How to Monitor AWS Step Functions Workflow Executions and Failures

AWS Step Functions is a fully managed workflow orchestration service that lets you coordinate distributed applications and microservices using visual state machines. Once a workflow is running in production, knowing whether it is succeeding, failing, or silently stalling is critical. Without solid AWS Step Functions monitoring, a stuck execution or repeated Lambda failure can go unnoticed for hours, cascading into data loss or customer-facing outages.

This guide walks through every layer of Step Functions observability, from the built-in console view all the way to CloudWatch alarms, structured logging, distributed tracing, and EventBridge automation. Each section includes actionable steps you can apply immediately.

Key Takeaways
  • AWS Step Functions publishes execution metrics (ExecutionsFailed, ExecutionsTimedOut, ExecutionTime, and others) to Amazon CloudWatch automatically. No extra setup is required.
  • The Step Functions console provides a colour-coded visual workflow inspector that shows exactly which state failed and why, making it the fastest first stop when debugging.
  • Enabling CloudWatch Logs at the ALL level captures the full execution history, including state input and output, for deep post-incident analysis. The ERROR level captures only the events that resulted in an error.
  • AWS X-Ray tracing gives you an end-to-end service map across Lambda, DynamoDB, and other integrations invoked by your state machine.
  • Amazon EventBridge can react to execution status change events in near real time, enabling automated alerts, ticketing, and remediation workflows.
  • Express Workflows do not store execution history in Step Functions; only Standard Workflows keep it, for 90 days. CloudWatch Logs is mandatory for long-term observability of Express Workflows.
  • Always set explicit timeouts on every Task state. Without them, a stuck Task state can hold an execution in RUNNING until it reaches the one-year Standard Workflow limit.

1. Understanding AWS Step Functions Execution Statuses

Every execution in Step Functions moves through a defined set of statuses. Understanding what each status means is the foundation of effective monitoring.

Status | Meaning | Action Required
RUNNING | Execution is actively in progress | Monitor duration; alert if exceeding expected time
SUCCEEDED | All states completed without error | Track success rate as a baseline metric
FAILED | An unhandled error terminated the execution | Investigate state-level error and cause fields immediately
TIMED_OUT | Execution or a state exceeded its configured timeout | Review timeout settings and upstream service latency
ABORTED | Execution was manually stopped via API or console | Confirm intentional; investigate if unexpected

You can retrieve execution status programmatically using the DescribeExecution or GetExecutionHistory API calls. The GetExecutionHistory API is particularly useful because it returns a chronological list of events, letting you pinpoint exactly which state triggered a failure even while the execution is still RUNNING.
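
As a rough illustration of reading that event list, the sketch below scans a GetExecutionHistory response for the earliest failure-type event. The helper names (first_failure, fetch_events) and the sample events are assumptions made for this example; the event type and detail field names follow the API's JSON shape, and the boto3 call is isolated in a function that is not invoked here.

```python
# Sketch: find the first failure-type event in a GetExecutionHistory response.
# Each event carries a "type" plus a matching "<type>EventDetails" object.

FAILURE_TYPES = {
    "TaskFailed": "taskFailedEventDetails",
    "TaskTimedOut": "taskTimedOutEventDetails",
    "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
    "ExecutionFailed": "executionFailedEventDetails",
    "ExecutionTimedOut": "executionTimedOutEventDetails",
}


def first_failure(events):
    """Return (event id, type, error, cause) for the earliest failure event, or None."""
    for event in sorted(events, key=lambda e: e["id"]):
        details_key = FAILURE_TYPES.get(event["type"])
        if details_key is None:
            continue
        details = event.get(details_key, {})
        return event["id"], event["type"], details.get("error"), details.get("cause")
    return None


def fetch_events(execution_arn):
    """Page through the history with boto3 (needs AWS credentials; not called here)."""
    import boto3  # imported lazily so first_failure runs without the SDK installed

    paginator = boto3.client("stepfunctions").get_paginator("get_execution_history")
    events = []
    for page in paginator.paginate(executionArn=execution_arn):
        events.extend(page["events"])
    return events


# Made-up sample data in the shape the API returns.
sample_events = [
    {"id": 1, "type": "ExecutionStarted"},
    {"id": 2, "type": "TaskStateEntered"},
    {"id": 3, "type": "TaskFailed",
     "taskFailedEventDetails": {"error": "States.TaskFailed", "cause": "Downstream 500"}},
    {"id": 4, "type": "ExecutionFailed",
     "executionFailedEventDetails": {"error": "States.TaskFailed", "cause": "Downstream 500"}},
]

print(first_failure(sample_events))
```

In a real account you would pass fetch_events(<execution ARN>) to first_failure instead of the sample list. The set of failure types here is deliberately small and would need extending for activities or Map runs.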

2. Monitoring Executions in the Step Functions Console

The AWS Management Console is the fastest way to get a real-time view of your state machine. When you open a state machine and click the Executions tab, you see a filterable list of every execution with its status, start time, and duration. You can filter by Running, Succeeded, Failed, Timed Out, or Aborted.

The Visual Workflow Inspector

Click any execution to open the visual workflow view. Each state is colour-coded based on its outcome:

  • Green: state completed successfully
  • Red: state failed with an unhandled error
  • Blue: state is currently executing
  • Orange: state caught an error and recovered via a Catch block
  • Gray: state has not yet been reached

Clicking a state exposes its input, output, error name, and the full cause string. For states configured with Retry, you can see how many attempts were made and when each attempt occurred. This is often enough to resolve common issues without writing a single query.

Execution List Patterns to Watch

Sort the execution list by start time and look for clustering. If a batch of executions all failed within the same five-minute window, the cause is likely an external dependency outage or a bad deployment. Scattered failures over time suggest intermittent upstream issues or data-specific edge cases.
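
The same clustering check can be approximated in code once you have the failure timestamps (for example from the execution list or from logs). The following is a minimal sketch, assuming fixed five-minute buckets and an arbitrary cut-off of three failures per bucket; both the function name and the cut-off are illustrative choices, not AWS defaults.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 300  # five-minute buckets, matching the window discussed above


def failure_clusters(failure_times, min_failures=3):
    """Group failure timestamps into fixed five-minute buckets.

    Returns {bucket_start: count} for buckets holding at least min_failures.
    """
    buckets = Counter()
    for ts in failure_times:
        epoch = int(ts.timestamp())
        bucket_start = epoch - (epoch % WINDOW_SECONDS)
        buckets[datetime.fromtimestamp(bucket_start, tz=timezone.utc)] += 1
    return {start: n for start, n in buckets.items() if n >= min_failures}


# Made-up sample: three failures close together, one isolated failure later.
sample_failures = [
    datetime(2024, 1, 1, 10, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 2, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 4, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc),
]

clusters = failure_clusters(sample_failures)
print(clusters)
```

Fixed buckets are simple but can split a burst that straddles a bucket boundary; a sliding window would catch those cases at the cost of a little more code.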

3. AWS Step Functions Monitoring with Amazon CloudWatch Metrics

Step Functions publishes metrics to the AWS/States CloudWatch namespace automatically. These metrics are the backbone of any serious monitoring setup. According to the AWS Step Functions Developer Guide, CloudWatch metrics are delivered on a best-effort basis, so they are suitable for near-real-time alerting but not as an authoritative audit trail.

Core Execution Metrics

Metric | Namespace | Description | Recommended Statistic
ExecutionsStarted | AWS/States | Number of executions that began | Sum
ExecutionsSucceeded | AWS/States | Number of executions that completed successfully | Sum
ExecutionsFailed | AWS/States | Number of executions that failed | Sum
ExecutionsTimedOut | AWS/States | Executions that exceeded their timeout | Sum
ExecutionsAborted | AWS/States | Executions stopped manually | Sum
ExecutionTime | AWS/States | Duration in milliseconds from start to end | p50, p90, p99
OpenExecutionCount | AWS/States | Approximate number of executions currently running | Average

Step Functions emits two ExecutionsStarted datapoints per execution: a value of 1 when the execution starts and a value of 0 when it closes. The Sum statistic is therefore already the correct execution count, while SampleCount is doubled. Use Sum rather than SampleCount, or rely on ExecutionsSucceeded plus ExecutionsFailed for terminal state counts.

Setting Up a CloudWatch Dashboard

A dashboard that overlays ExecutionsSucceeded, ExecutionsFailed, ExecutionsTimedOut, and ExecutionTime (p90) gives you instant visibility into workflow health. Use the following AWS CLI command as a starting point, replacing the StateMachineArn and region placeholders with your own:

aws cloudwatch put-dashboard \
  --dashboard-name MyWorkflowDashboard \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "properties": {
        "title": "Step Functions Execution Health",
        "region": "<YOUR_REGION>",
        "metrics": [
          ["AWS/States","ExecutionsSucceeded","StateMachineArn","<YOUR_ARN>"],
          ["AWS/States","ExecutionsFailed","StateMachineArn","<YOUR_ARN>"],
          ["AWS/States","ExecutionsTimedOut","StateMachineArn","<YOUR_ARN>"],
          ["AWS/States","ExecutionTime","StateMachineArn","<YOUR_ARN>",{"stat":"p90","yAxis":"right"}]
        ],
        "period": 300,
        "stat": "Sum",
        "view": "timeSeries"
      }
    }]
  }'

Creating CloudWatch Alarms for Failures

Dashboards are passive. Alarms are active. The following command creates an alarm that triggers when more than 5 executions fail within a 5-minute window:

aws cloudwatch put-metric-alarm \
  --alarm-name "StepFunctions-FailureRate" \
  --metric-name ExecutionsFailed \
  --namespace AWS/States \
  --dimensions Name=StateMachineArn,Value=<YOUR_ARN> \
  --statistic Sum \
  --period 300 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions <YOUR_SNS_TOPIC_ARN>

For execution duration anomalies, create a second alarm on the ExecutionTime metric using the p90 statistic. Set the threshold based on your established baseline. A good starting point is 1.5 times the average p90 from the previous 7 days.
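
The baseline arithmetic is simple enough to script. The sketch below assumes you have already collected one p90 ExecutionTime value per day for the previous week; the function name and the sample numbers are made up for illustration.

```python
from statistics import mean


def duration_alarm_threshold(daily_p90_ms, multiplier=1.5):
    """Return an ExecutionTime alarm threshold in ms from recent daily p90 values."""
    if not daily_p90_ms:
        raise ValueError("need at least one daily p90 value")
    return multiplier * mean(daily_p90_ms)


# Made-up sample: p90 ExecutionTime (ms) for each of the previous 7 days.
last_week_p90_ms = [1200, 1100, 1300, 1250, 1150, 1400, 1000]

threshold = duration_alarm_threshold(last_week_p90_ms)
print(threshold)  # 1800.0
```

When you feed the result into put-metric-alarm, note that percentile statistics are passed with --extended-statistic p90 rather than --statistic.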

4. Enabling and Querying CloudWatch Logs for Step Functions

CloudWatch Logs gives you the full execution story: every state entry and exit, all input and output payloads, error names, retry counts, and durations. This is essential for post-incident analysis. It is also the only long-term observability option for Express Workflows, which do not retain execution history in the console.

Log Levels

Step Functions supports four logging levels:

  • OFF – no logging is captured (the default)
  • ERROR – logs states that resulted in an error; recommended minimum for production
  • FATAL – logs only fatal execution errors
  • ALL – logs every state transition including full input and output payloads

Important: Using ALL with includeExecutionData: true will log state input and output in plaintext. If your workflows process personally identifiable information, credentials, or other sensitive data, either use ERROR level logging or redact sensitive fields in your state definitions before they are written to logs.
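
One possible way to keep sensitive values out of logged state data is to mask them inside the task before it returns its output. The sketch below assumes that later states do not need the masked values; the key names in SENSITIVE_KEYS and the redact helper are hypothetical and would need to match your own payloads.

```python
SENSITIVE_KEYS = {"password", "ssn", "card_number", "api_key"}  # hypothetical field names


def redact(payload, sensitive_keys=SENSITIVE_KEYS, mask="***REDACTED***"):
    """Return a copy of payload with sensitive keys masked at any nesting depth."""
    if isinstance(payload, dict):
        return {
            key: mask if key.lower() in sensitive_keys else redact(value, sensitive_keys, mask)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item, sensitive_keys, mask) for item in payload]
    return payload  # scalars pass through unchanged


# Made-up sample payload.
order = {
    "orderId": "o-1",
    "customer": {"email": "a@example.com", "SSN": "123-45-6789"},
    "cards": [{"card_number": "4111111111111111"}],
}

safe_order = redact(order)
print(safe_order)
```

The original dictionary is left untouched, so the task can keep working with the real values and return only the redacted copy as state output.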

Enabling Logging via AWS CLI

aws stepfunctions update-state-machine \
  --state-machine-arn <YOUR_ARN> \
  --logging-configuration '{
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [{
      "cloudWatchLogsLogGroup": {
        "logGroupArn": "arn:aws:logs:<REGION>:<ACCOUNT>:log-group:/aws/states/<NAME>:*"
      }
    }]
  }'

Querying Logs with CloudWatch Logs Insights

CloudWatch Logs Insights lets you run structured queries against execution logs. With the query time range set to the past 24 hours, the following query lists the 50 most recent failure events, newest first:

fields @timestamp, execution_arn, details.name, details.error, details.cause
| filter type = "TaskFailed" or type = "ExecutionFailed"
| sort @timestamp desc
| limit 50

To track average and maximum execution duration per hour and surface performance regressions, use this query:

fields @timestamp, execution_arn, type
| filter type = "ExecutionSucceeded" or type = "ExecutionFailed"
| stats avg(details.billedDurationInMilliseconds) as avg_ms,
        max(details.billedDurationInMilliseconds) as max_ms
  by bin(1h)
| sort @timestamp desc

5. Detecting and Handling Failures with Retry and Catch

Step Functions has built-in error handling at the state level. Every Task, Parallel, and Map state can define Retry and Catch blocks. Understanding how these work is inseparable from monitoring because they determine whether a failure surfaces immediately or is absorbed silently.

Built-in Error Names to Monitor

  • States.TaskFailed – a Task state failed during execution; in Retry and Catch it acts as a wildcard that matches any known error name except States.Timeout
  • States.Timeout – the state exceeded its TimeoutSeconds value
  • States.HeartbeatTimeout – a task using the task token pattern did not send a heartbeat within HeartbeatSeconds
  • States.Runtime – an unrecoverable runtime error; cannot be retried or caught, even with States.ALL
  • States.DataLimitExceeded – a state’s input or output exceeded the 256 KB payload quota; a terminal error that States.ALL does not catch
  • States.Permissions – the execution role lacked sufficient IAM permissions

Retry Configuration Example

A well-configured Retry block handles transient Lambda errors and AWS service throttling while surfacing genuine failures quickly:

"Retry": [
  {
    "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2
  },
  {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 5,
    "MaxAttempts": 2,
    "BackoffRate": 1.5
  }
]
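
To see how long these retriers can delay a genuine failure, it helps to work out the wait before each attempt: IntervalSeconds for the first retry, multiplied by BackoffRate for each retry after that. The sketch below does that arithmetic for the two retriers above. The function name is illustrative, and the calculation ignores the optional MaxDelaySeconds and JitterStrategy fields.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: interval * backoff_rate ** n for n = 0, 1, ..."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# First retrier: Lambda.ServiceException / Lambda.TooManyRequestsException
lambda_delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2)

# Second retrier: States.TaskFailed
task_failed_delays = retry_delays(interval_seconds=5, max_attempts=2, backoff_rate=1.5)

print(lambda_delays, sum(lambda_delays))            # [2, 4, 8] 14
print(task_failed_delays, sum(task_failed_delays))  # [5.0, 7.5] 12.5
```

In other words, a persistent throttling error adds about 14 seconds of waiting before it surfaces, and a persistent States.TaskFailed adds about 12.5 seconds, on top of the time each attempt itself takes.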

Catch Block for Graceful Failure Routing

"Catch": [
  {
    "ErrorEquals": ["States.ALL"],
    "ResultPath": "$.errorInfo",
    "Next": "HandleFailureState"
  }
]

Routing failures to a dedicated HandleFailureState lets you send a structured notification, write to a dead-letter queue, or trigger a compensating transaction before the execution terminates.
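
As a sketch of what HandleFailureState might run, the Lambda handler below reads the Error and Cause fields that the Catch block writes to $.errorInfo and builds a notification payload. The handler wiring, function names, and output shape are assumptions for illustration; the actual delivery step (SNS, queue, ticketing) is left as a print call.

```python
import json


def build_failure_notification(state_input):
    """Turn the state input seen by HandleFailureState into a notification dict."""
    error_info = state_input.get("errorInfo", {})
    cause = error_info.get("Cause", "")
    try:
        # Lambda failures usually put a JSON document in Cause; fall back to raw text.
        cause_detail = json.loads(cause)
    except (TypeError, ValueError):
        cause_detail = cause
    if isinstance(cause_detail, dict):
        message = cause_detail.get("errorMessage", cause)
    else:
        message = cause
    return {
        "subject": f"Workflow step failed: {error_info.get('Error', 'Unknown')}",
        "message": message,
        "original_input": {k: v for k, v in state_input.items() if k != "errorInfo"},
    }


def handler(event, context):
    """Lambda entry point for HandleFailureState (hypothetical wiring)."""
    notification = build_failure_notification(event)
    print(json.dumps(notification))  # replace with an SNS publish or a queue write
    return notification


# Made-up sample of what the state receives after the Catch block fires.
sample_input = {
    "orderId": "o-1",
    "errorInfo": {
        "Error": "States.TaskFailed",
        "Cause": json.dumps({"errorMessage": "card declined", "errorType": "PaymentError"}),
    },
}

notification = build_failure_notification(sample_input)
print(notification["subject"])
```

Because ResultPath is set to $.errorInfo, the original state input is preserved alongside the error, which is why the sketch can pass it through for context.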

6. Distributed Tracing with AWS X-Ray

AWS X-Ray provides end-to-end distributed tracing for Step Functions, capturing latency data across every Lambda function, DynamoDB table, SQS queue, and other integrated service your state machine calls. Enabling X-Ray reveals bottlenecks that CloudWatch metrics alone cannot surface.

Enabling X-Ray Tracing

aws stepfunctions update-state-machine \
  --state-machine-arn <YOUR_ARN> \
  --tracing-configuration enabled=true

Once enabled, X-Ray generates a service map that shows the call graph of your state machine alongside all downstream services it invokes. You can trace individual requests end to end, drill into specific spans, and compare p50, p90, and p99 latencies per service node. This is invaluable when a Task state is slow but not failing outright.

What X-Ray Traces Capture

  • Duration of each state that invokes a supported AWS service
  • Downstream call latency broken out per integration (Lambda, DynamoDB, SQS, SNS, and others)
  • Cold start overhead for Lambda functions called from your state machine
  • Sampling-based traces for Express Workflows at high execution volumes

7. Real-Time Alerting with Amazon EventBridge

Step Functions publishes execution status change events to Amazon EventBridge automatically. You do not need to poll the API or configure anything additional on the Step Functions side. These events fire for every status transition: RUNNING, SUCCEEDED, FAILED, TIMED_OUT, and ABORTED.

Sample EventBridge Rule for Failure Alerts

The following rule matches any execution that enters a FAILED or TIMED_OUT status and routes it to an SNS topic for paging:

{
  "source": ["aws.states"],
  "detail-type": ["Step Functions Execution Status Change"],
  "detail": {
    "status": ["FAILED", "TIMED_OUT"],
    "stateMachineArn": ["<YOUR_ARN>"]
  }
}

EventBridge targets can be SNS (for email or PagerDuty), SQS (for downstream processing), Lambda (for custom remediation logic), or another Step Functions state machine (for automated recovery workflows). This makes EventBridge the right tool for building self-healing architectures around your workflows.
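
For the Lambda target case, a handler might condense the status change event into a short alert line before forwarding it. The sketch below assumes the event carries stateMachineArn, name, status, and epoch-millisecond startDate and stopDate fields in its detail object; the function names and the sample event values are made up for illustration.

```python
def summarize_status_change(event):
    """Build a one-line alert from a 'Step Functions Execution Status Change' event."""
    detail = event["detail"]
    start_ms, stop_ms = detail.get("startDate"), detail.get("stopDate")
    duration_s = (stop_ms - start_ms) / 1000 if start_ms and stop_ms else None
    # The state machine name is the last segment of its ARN.
    state_machine = detail["stateMachineArn"].rsplit(":", 1)[-1]
    text = f"[{detail['status']}] {state_machine} execution {detail['name']}"
    if duration_s is not None:
        text += f" after {duration_s:.1f}s"
    return text


def handler(event, context):
    """Lambda entry point for the EventBridge rule (hypothetical wiring)."""
    alert = summarize_status_change(event)
    print(alert)  # replace with a call to your paging or chat tool
    return alert


# Made-up sample event.
sample_event = {
    "source": "aws.states",
    "detail-type": "Step Functions Execution Status Change",
    "detail": {
        "executionArn": "arn:aws:states:us-east-1:123456789012:execution:OrderFlow:run-42",
        "stateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:OrderFlow",
        "name": "run-42",
        "status": "FAILED",
        "startDate": 1700000000000,
        "stopDate": 1700000012500,
    },
}

print(summarize_status_change(sample_event))
```

Keeping the formatting in a pure function makes it easy to unit test without deploying the rule.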

EventBridge vs. CloudWatch Alarms

Use CloudWatch alarms for threshold-based alerting (e.g., more than 5 failures in 5 minutes). Use EventBridge for event-driven automation on every individual execution status change. Both have their place in a complete observability setup.

8. Diagnosing Stuck or Long-Running Executions

A common production issue is an execution that stays in RUNNING indefinitely. This typically happens for one of the following reasons:

  • Missing task token callback: a state using WaitForTaskToken sent a token to an external system but never received a SendTaskSuccess or SendTaskFailure call back
  • No state-level timeout: by default, states have no timeout. A single downstream service call that hangs will hold the execution open forever
  • Incorrect Next transition: a Next or Choice target routes the execution back into a loop with no exit condition, so it never reaches a terminal state
  • Downstream service unavailability: the service being called is experiencing an outage and the request is waiting in its queue

Setting Timeouts to Prevent Stuck Executions

AWS recommends setting TimeoutSeconds on every Task state in production, and HeartbeatSeconds on tasks that report progress through SendTaskHeartbeat (activities and task token callbacks). Without them, your execution relies entirely on the downstream service to respond. The following example shows both fields:

"ProcessOrder": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:<REGION>:<ACCOUNT>:function:ProcessOrder",
  "TimeoutSeconds": 30,
  "HeartbeatSeconds": 10,
  "Retry": [{"ErrorEquals": ["States.HeartbeatTimeout"], "MaxAttempts": 2}],
  "Next": "NotifyCustomer"
}

Programmatic Detection of Stuck Executions

A lightweight Lambda function scheduled via EventBridge Scheduler can poll for executions in RUNNING status that have exceeded their expected duration:

const { SFNClient, ListExecutionsCommand } = require('@aws-sdk/client-sfn');

const client = new SFNClient({});

exports.handler = async () => {
  const maxMinutes = parseInt(process.env.MAX_RUNTIME_MINUTES || '60', 10);
  const now = Date.now();
  const stuckExecutions = [];
  let nextToken;

  // ListExecutions is paginated, so follow nextToken until every RUNNING execution is checked
  do {
    const result = await client.send(new ListExecutionsCommand({
      stateMachineArn: process.env.STATE_MACHINE_ARN,
      statusFilter: 'RUNNING',
      nextToken,
    }));
    for (const e of result.executions) {
      const ageMinutes = (now - e.startDate.getTime()) / 60000;
      if (ageMinutes > maxMinutes) stuckExecutions.push(e);
    }
    nextToken = result.nextToken;
  } while (nextToken);

  if (stuckExecutions.length > 0) {
    // publish to SNS or write to DynamoDB for downstream alerting
    console.log('Stuck executions:', stuckExecutions.map(e => e.executionArn));
  }
};

S&P Global’s WSO team encountered exactly this challenge when orchestrating long-running reconciliation workflows. They resolved it by using the GetExecutionHistory API within each Task Lambda to check whether a prior step had already completed, preventing duplicate work on loop-back retries. You can read the full case study on the AWS Architecture Blog.

9. Auditing API Calls with AWS CloudTrail

Every Step Functions API call, including StartExecution, StopExecution, UpdateStateMachine, and CreateStateMachine, is recorded in AWS CloudTrail. CloudTrail is the right tool for security and compliance auditing, not day-to-day execution monitoring. Enable it to answer questions like: who started this execution, when was the state machine definition last changed, and which IAM principal aborted a production execution.

For operational monitoring of execution outcomes, CloudWatch metrics and logs are more appropriate. Use CloudTrail alongside them for a complete audit picture.

10. Monitoring Express Workflows

Express Workflows differ from Standard Workflows in two important ways that affect monitoring. First, they are designed for high-volume, short-duration workloads (up to 5 minutes) and can execute more than 100,000 times per second. Second, their execution history is only available in CloudWatch Logs, not in the Step Functions console execution list.

Key Differences for Express Workflow Monitoring

  • CloudWatch Logs is mandatory: the console does not retain Express Workflow execution details beyond the current execution view
  • Higher metric volume: with potentially millions of executions per day, set appropriate CloudWatch metric math expressions to normalise your failure rate as a percentage rather than an absolute count
  • No history API: GetExecutionHistory does not support Express Workflows, so their execution events cannot be queried the way Standard Workflow events are
  • X-Ray sampling: X-Ray automatically samples Express Workflow traces at high volumes; configure sampling rules to ensure representative coverage
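
To make the failure-rate point concrete, the sketch below builds a CloudWatch metric math query list that divides ExecutionsFailed by ExecutionsStarted and returns a percentage. The function name, query IDs, and sample ARN are illustrative; the list is shaped so it could be passed as MetricDataQueries to get_metric_data or as Metrics to put_metric_alarm, but no AWS call is made here.

```python
def failure_rate_queries(state_machine_arn, period=300):
    """Build a metric math query list returning failures as a percentage of starts."""

    def metric(query_id, name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/States",
                    "MetricName": name,
                    "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs to the expression, not plotted or alarmed on
        }

    return [
        metric("failed", "ExecutionsFailed"),
        metric("started", "ExecutionsStarted"),
        {
            "Id": "failure_rate",
            "Expression": "100 * failed / started",
            "Label": "Failure rate (%)",
            "ReturnData": True,
        },
    ]


# Placeholder ARN for illustration only.
queries = failure_rate_queries(
    "arn:aws:states:us-east-1:123456789012:stateMachine:ExpressFlow"
)
print([q["Id"] for q in queries])
```

An alarm on failure_rate with a threshold such as 1 (percent) scales with traffic, whereas a fixed count of 5 failures means something very different at 100 executions per period than at 100,000.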

11. AWS Step Functions Monitoring Best Practices

Drawing from AWS documentation, real-world production architectures, and community patterns, these are the practices that matter most:

  • Always set TimeoutSeconds on every Task state. The default is no timeout. A single hung Lambda invocation will keep your execution in RUNNING indefinitely.
  • Set alarms on ExecutionsFailed and ExecutionsTimedOut. These two metrics are your primary health signals. Alert on them from day one.
  • Use ERROR log level at minimum in production. Off-by-default logging means you lose the context you need most when an incident happens.
  • Avoid non-ASCII characters in state machine names. According to the AWS Developer Guide, names containing non-ASCII characters do not work with CloudWatch, so metrics for those state machines cannot be tracked.
  • Avoid payloads larger than 256 KB. If states need to pass large data, store it in Amazon S3 and pass the object ARN through the state machine instead.
  • Limit parallelism in Map states. Set a MaxConcurrency value to prevent overwhelming downstream services. The S&P Global WSO team limited Map State concurrency to 5 parallel report runs to protect their database layer.
  • Use idempotent Lambda functions. Because Step Functions can retry a state, ensure your Lambda functions can be called multiple times without side effects.
  • Tag your state machines. Consistent tagging enables cost allocation and makes it easier to filter CloudWatch dashboards and alarms by environment, team, or application.
  • Baseline your ExecutionTime metrics weekly. What counts as a slow execution changes as your data volume grows. Revisit your alarm thresholds regularly.
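
The payload guidance above can be enforced with a small size check inside a task before it returns its result. This is a minimal sketch, assuming the payload is measured as compact UTF-8 JSON and that you want to offload to S3 once it passes 80 percent of the 256 KB quota; the function names and the safety margin are illustrative choices.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # 262,144 bytes, the quota quoted above


def payload_size_bytes(payload):
    """Approximate size of the payload as compact UTF-8 encoded JSON."""
    text = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def should_offload(payload, safety_margin=0.8):
    """True when the payload exceeds safety_margin of the quota and belongs in S3."""
    return payload_size_bytes(payload) > MAX_PAYLOAD_BYTES * safety_margin


small = {"orderId": "o-1", "items": [1, 2, 3]}
large = {"report": "x" * 300_000}  # made-up oversized payload

print(should_offload(small), should_offload(large))
```

When should_offload returns True, the task would write the data to S3 and return only a reference to the object, as the list above suggests.
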
AWS Step Functions Observability
Stop Flying Blind on Your AWS Workflows
Built-in CloudWatch metrics and console views are a solid starting point, but production-grade observability requires a dedicated monitoring layer. CubeAPM gives you deep, correlated visibility into AWS Step Functions executions alongside your Lambda functions, databases, and downstream APIs, all in one place.
With CubeAPM you get real-time execution dashboards, automatic failure alerting, distributed tracing across service integrations, anomaly detection on execution duration, and one-click drill-down from a failed execution to the exact Lambda log line that caused it.
No custom instrumentation. No complex setup. Works with Standard and Express workflows out of the box.
Try CubeAPM Free for 14 Days → cubeapm.com/start

Conclusion

Effective AWS Step Functions monitoring is not a single feature you toggle on. It is a layered observability practice that combines the visual console inspector for real-time debugging, CloudWatch metrics and alarms for threshold-based alerting, CloudWatch Logs for deep forensic analysis, X-Ray for distributed tracing, and EventBridge for event-driven automation. Each layer answers a different question: the console tells you what failed, metrics tell you how often, logs tell you why, X-Ray tells you where the latency is, and EventBridge tells the rest of your system that something went wrong.

Start with CloudWatch alarms on ExecutionsFailed and ExecutionsTimedOut, enable ERROR-level logging, and set explicit timeouts on every Task state. From there, layer in X-Ray tracing and EventBridge automation as your workflows become more complex. The investment pays for itself the first time you catch a silent failure before your customers do.

Disclaimer: This article is intended for informational purposes only. AWS service behaviour, API specifications, metric names, and console interfaces may change over time. Always refer to the official AWS Step Functions Developer Guide at docs.aws.amazon.com/step-functions for the most current documentation. The code samples provided are illustrative and should be reviewed and tested thoroughly before use in production environments.

FAQs

1. How do I get notified immediately when an AWS Step Functions execution fails?

Create an Amazon EventBridge rule that matches “aws.states” events with status “FAILED” or “TIMED_OUT” and route it to an SNS topic connected to your on-call alerting tool. For threshold-based alerts (e.g., more than N failures in a time window), use a CloudWatch alarm on the ExecutionsFailed metric instead.

2. Can I monitor AWS Step Functions without CloudWatch?

You can query execution status and history directly through the Step Functions API using DescribeExecution and GetExecutionHistory. You can also use third-party APM and observability platforms. However, for automated alerting and long-term metric retention, CloudWatch remains the most tightly integrated option. For Express Workflows, CloudWatch Logs is effectively mandatory because execution history is not retained in the console.

3. Why is my Step Functions execution stuck in RUNNING status?

The most common causes are: a WaitForTaskToken state that never received its callback, a Task state with no TimeoutSeconds that is waiting on an unresponsive downstream service, or an ASL definition with an incorrect Next transition. Enable CloudWatch Logs at the ALL level and use the GetExecutionHistory API to identify the last event recorded before the execution stalled. Adding explicit TimeoutSeconds and HeartbeatSeconds to every Task state prevents most stuck execution scenarios.

4. What is the difference between Standard and Express Workflow monitoring?

Standard Workflows retain full execution history in the Step Functions console for 90 days, support GetExecutionHistory API queries, and emit CloudWatch metrics per execution. Express Workflows execute at much higher throughput but do not retain execution history in the console. For Express Workflows, you must enable CloudWatch Logs to access execution details. CloudWatch metrics are available for both types, but Express Workflow metrics often need to be interpreted as rates (failures per minute) rather than absolute counts given the higher execution volume.

5. How do I monitor AWS Step Functions costs alongside execution failures?

Step Functions charges per state transition for Standard Workflows and, for Express Workflows, per request plus execution duration and memory used. Use AWS Cost Explorer filtered by the Step Functions service to track spend trends. For per-state-machine cost visibility, tag each state machine with an Environment and Team tag and use Cost Allocation Tags in the billing console. CloudWatch metrics such as ExecutionsStarted and OpenExecutionCount can also serve as indirect cost indicators when combined with your known per-execution pricing.
