AWS Batch Monitoring: Jobs, Queues, Logs, and Compute Health

Running AWS Batch jobs without proper monitoring is like flying blind. Jobs can silently fail, pile up in the RUNNABLE state for hours, or exhaust compute capacity without a single alert firing. Yet AWS Batch does not push detailed job-status metrics to CloudWatch by default, leaving a significant observability gap.

This guide covers every layer of AWS Batch monitoring: native CloudWatch integration, Container Insights, EventBridge-driven custom metrics, alerting, and dashboards. By the end, you will have a complete picture of how to monitor AWS Batch jobs and compute environment health so issues are caught before they become incidents.

Key Takeaways

AWS Batch does not publish job-status counts (RUNNABLE, FAILED, SUCCEEDED) to CloudWatch by default. You must build custom metrics.
Container Insights gives you CPU, memory, network, and storage metrics for compute environments at the container level.
EventBridge + Lambda is the recommended pattern to get real-time queue depth, job age, and job-state counts.
CloudWatch Alarms and dashboards turn raw metrics into actionable alerts.
Jobs stuck in RUNNABLE almost always mean insufficient compute capacity or a low maxvCpus setting.

1. Understanding the AWS Batch Monitoring Landscape

AWS Batch is a fully managed service that dynamically provisions compute capacity and runs batch computing workloads. Jobs progress through a defined lifecycle before completing or failing.

AWS Batch Job Lifecycle States

Understanding job states is the foundation of effective AWS Batch monitoring. Every job moves through a predictable sequence:

State	Meaning	Monitor For
SUBMITTED	Job accepted by Batch scheduler	Queue depth growth
PENDING	Waiting on dependencies	Stuck dependency chains
RUNNABLE	Ready but waiting for compute	Capacity bottleneck signal
STARTING	Container provisioning	Slow start times
RUNNING	Executing	Duration overruns
SUCCEEDED	Completed successfully	Throughput / SLA
FAILED	Exited with non-zero code	Failure rate spikes

Why RUNNABLE matters: Jobs stuck in RUNNABLE are one of the most common and silent problems in AWS Batch. This state means the job is ready to run but compute capacity is unavailable. It can indicate a low maxvCpus setting, exhausted Spot capacity, a service quota limit, or an IAM permission issue preventing scaling. Without custom monitoring, this can go undetected for hours.
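
To make the "undetected for hours" risk concrete, here is a minimal sketch that estimates how long the oldest RUNNABLE job in a queue has been waiting. It assumes a boto3 Batch client (for example `boto3.client("batch")`) is passed in; the function name is hypothetical. Note that `createdAt` is the submission time, so the result is an upper bound on time spent in RUNNABLE rather than an exact figure.

```python
import time


def oldest_runnable_age_seconds(batch_client, job_queue, now_ms=None):
    """Return the age in seconds of the oldest RUNNABLE job, or 0.0 if none.

    batch_client is expected to behave like boto3.client("batch").
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    oldest_created = None
    token = None
    while True:
        params = {"jobQueue": job_queue, "jobStatus": "RUNNABLE"}
        if token:
            params["nextToken"] = token
        resp = batch_client.list_jobs(**params)
        for job in resp.get("jobSummaryList", []):
            created = job["createdAt"]  # epoch milliseconds (submission time)
            if oldest_created is None or created < oldest_created:
                oldest_created = created
        token = resp.get("nextToken")
        if not token:
            break
    if oldest_created is None:
        return 0.0
    return max(0.0, (now_ms - oldest_created) / 1000.0)
```

Publishing this value as a custom metric gives you a "job age" signal that keeps rising while a backlog sits unserved, even when the RUNNABLE count itself is flat.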

What AWS Batch Sends to CloudWatch Natively

Out of the box, AWS Batch publishes a limited set of infrastructure-level metrics. The official AWS documentation identifies the primary monitoring tools as CloudWatch Logs, Container Insights, and EventBridge events. However, critical operational metrics such as job counts by state are not included natively.

Metric	Default CloudWatch	Custom / Container Insights
CPU & Memory Utilization	Yes (Container Insights)	Yes
Job Status Counts (RUNNABLE etc.)	No	Custom Lambda required
Queue Depth	No	EventBridge + Lambda
Job Duration	No	Custom metric
Container Instance Count	Yes	Yes
Log Output	Yes (awslogs driver)	Yes
Job Failure Alerts	No native	CloudWatch Alarm on custom metric

2. Setting Up CloudWatch Logs for AWS Batch Jobs

CloudWatch Logs is the first monitoring layer to configure. Every container that runs in your AWS Batch compute environment can stream its stdout and stderr output to a CloudWatch log group. This is critical for diagnosing job failures.

Step 1: Create the Log Group

Create a dedicated log group before registering job definitions. Setting a retention period avoids unbounded storage costs.

# create-log-group has no retention option; set retention with a separate call.
aws logs create-log-group \
  --log-group-name /aws/batch/jobs

aws logs put-retention-policy \
  --log-group-name /aws/batch/jobs \
  --retention-in-days 30
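
If you provision from Python instead of the CLI, the same step might look like the sketch below. It assumes a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) is passed in; the function name is hypothetical. Creating the group and setting retention are two separate API calls.

```python
def ensure_log_group(logs_client, name="/aws/batch/jobs", retention_days=30):
    """Create the log group if it is missing, then apply a retention policy.

    logs_client is expected to behave like boto3.client("logs").
    """
    try:
        logs_client.create_log_group(logGroupName=name)
    except logs_client.exceptions.ResourceAlreadyExistsException:
        # The group already exists, so only the retention policy needs applying.
        pass
    logs_client.put_retention_policy(
        logGroupName=name,
        retentionInDays=retention_days,
    )
```

Because the function tolerates an existing group, it is safe to call on every deployment.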

Step 2: Configure the awslogs Driver in Your Job Definition

Add a logConfiguration block to your container properties when registering a job definition. For EC2-based compute environments you also need to attach the CloudWatchLogsFullAccess policy (or a scoped-down equivalent) to your ecsInstanceRole.

aws batch register-job-definition \
  --job-definition-name monitored-job \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
    "resourceRequirements": [
      {"type": "VCPU", "value": "2"},
      {"type": "MEMORY", "value": "4096"}
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/aws/batch/jobs",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "my-app"
      }
    }
  }'

Log streams follow the format <prefix>/<container-name>/<ecs-task-id>. For Fargate-based compute environments, CloudWatch Logs is enabled by default.
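
When diagnosing a failure you usually start from a job ID rather than a stream name. The sketch below, which assumes boto3 Batch and CloudWatch Logs clients are passed in and uses a hypothetical function name, looks up the job's log stream with `describe_jobs` and reads its first events.

```python
def fetch_job_log_lines(batch_client, logs_client, job_id,
                        log_group="/aws/batch/jobs", limit=100):
    """Return up to `limit` log messages for one Batch job.

    The clients are expected to behave like boto3.client("batch")
    and boto3.client("logs").
    """
    jobs = batch_client.describe_jobs(jobs=[job_id])["jobs"]
    if not jobs:
        return []
    stream = jobs[0].get("container", {}).get("logStreamName")
    if not stream:
        # No stream yet: the job has not started a container.
        return []
    resp = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream,
        limit=limit,
        startFromHead=True,  # read from the beginning of the stream
    )
    return [event["message"] for event in resp["events"]]
```

This reads a single page of events; a long-running job would need the `nextForwardToken` from the response to continue.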

3. Enabling CloudWatch Container Insights

Container Insights is the second monitoring layer. It collects, aggregates, and summarizes metrics and logs from your AWS Batch compute environments and jobs. These metrics are stored as structured performance log events using a JSON schema, which CloudWatch then aggregates into higher-level metrics at the compute environment and job level.

Metrics Available Through Container Insights

Once enabled, Container Insights provides the following metrics per compute environment:

JobCount: number of jobs running in the compute environment
ContainerInstanceCount: EC2 instances registered with the ECS agent
CpuReserved and CpuUtilized: CPU units reserved vs. used
MemoryReserved and MemoryUtilized: memory reserved vs. used
NetworkRxBytes and NetworkTxBytes: bytes received and transmitted (awsvpc or bridge mode only)
StorageReadBytes and StorageWriteBytes: bytes read from and written to storage

Important: CloudWatch Container Insights metrics are charged as custom metrics. Review the Amazon CloudWatch pricing before enabling this at scale.

How to Enable Container Insights

Container Insights is enabled per compute environment from the AWS Batch console:

Open the AWS Batch console and choose Environments.
Select the compute environment you want to monitor.
Go to the Container insights tab and toggle Container insights on.
Optionally select a default aggregation interval or set a custom one.

4. Filling the Gaps: Custom Metrics with EventBridge and Lambda

Container Insights and CloudWatch Logs tell you about resource utilization, but they do not tell you how many jobs are RUNNABLE, how long jobs have been waiting, or what your current failure rate is. These gaps require a custom solution.

The Problem: Missing Job-Status Metrics

As documented in the AWS re:Post community and several engineering blogs, AWS Batch does not publish job-status counts (RUNNABLE, RUNNING, FAILED, SUCCEEDED) as CloudWatch metrics. These metrics are visible on the AWS Batch console dashboard but are not surfaced to CloudWatch, making it impossible to build alarms or programmatic dashboards without additional work.

A practical example: in one real-world incident, 1,000 Batch jobs were accidentally submitted at once. They piled up in RUNNABLE due to a low maxvCpus setting and went unnoticed for hours because there was no alarm on queue depth. This is the class of problem that custom metrics solve.

Architecture: EventBridge + Lambda + CloudWatch

The recommended pattern uses three AWS services together:

EventBridge captures job state change events in real time
A Lambda function processes events, queries the Batch API for queue depths, and publishes custom CloudWatch metrics
CloudWatch stores the metrics and powers alarms and dashboards

Step 1: Create the EventBridge Rule

aws events put-rule \
  --name batch-job-state-changes \
  --event-pattern '{
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {
      "status": ["FAILED", "SUCCEEDED", "RUNNING", "RUNNABLE"]
    }
  }'
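
Events matched by this rule can also be turned into metrics directly. The following is a minimal sketch, not the article's Lambda: it converts one state-change event into a CloudWatch datum. The metric name `JobStateTransitions` and the function name are assumptions made for illustration.

```python
from datetime import datetime, timezone


def metric_from_event(event):
    """Build one put_metric_data datum from a Batch state-change event.

    Returns None for events that are not Batch job state changes.
    """
    if event.get("source") != "aws.batch":
        return None
    if event.get("detail-type") != "Batch Job State Change":
        return None
    detail = event.get("detail", {})
    status = detail.get("status")
    queue = detail.get("jobQueue")  # the queue ARN in Batch events
    if not status or not queue:
        return None
    return {
        "MetricName": "JobStateTransitions",  # hypothetical metric name
        "Dimensions": [
            {"Name": "JobQueue", "Value": queue},
            {"Name": "Status", "Value": status},
        ],
        "Value": 1,
        "Unit": "Count",
        "Timestamp": datetime.now(timezone.utc),
    }
```

Because each datum represents a single transition, summing this metric over a period yields transitions per period, which is the shape you want for rate-style alarms such as "failures per hour".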

Step 2: Lambda Function to Publish Custom Metrics

The Lambda function queries the AWS Batch API for job counts per status and publishes them as custom CloudWatch metrics. Using paginated list_jobs() calls ensures accurate counts even for large queues.

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')
batch = boto3.client('batch')

STATUSES = ['RUNNABLE', 'RUNNING', 'FAILED', 'SUCCEEDED', 'SUBMITTED']

def get_job_count(queue, status):
    """Count jobs in one queue and state, following list_jobs pagination."""
    count, token = 0, None
    while True:
        params = {'jobQueue': queue, 'jobStatus': status, 'maxResults': 100}
        if token:
            params['nextToken'] = token
        resp = batch.list_jobs(**params)
        count += len(resp['jobSummaryList'])
        token = resp.get('nextToken')
        if not token:
            break
    return count

def handler(event, context):
    # A single describe_job_queues call is enough for a handful of queues;
    # follow nextToken if you have many.
    queues = [q['jobQueueArn'] for q in batch.describe_job_queues()['jobQueues']]
    now = datetime.now(timezone.utc)
    for queue in queues:
        # One put_metric_data call per queue, one datum per job state.
        # The JobQueue dimension carries the queue ARN; alarms must use the same value.
        cloudwatch.put_metric_data(
            Namespace='AWSBatch/JobStatus',
            MetricData=[{
                'MetricName': 'JobCount',
                'Dimensions': [
                    {'Name': 'JobQueue', 'Value': queue},
                    {'Name': 'Status', 'Value': status},
                ],
                'Value': get_job_count(queue, status),
                'Unit': 'Count',
                'Timestamp': now,
            } for status in STATUSES],
        )

Schedule this Lambda function every 5 minutes using an EventBridge scheduled rule. Lambda invocation cost at this frequency is typically negligible, but custom CloudWatch metrics are billed per metric, so the monthly cost grows with the number of queue and status combinations you publish.
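
One way to set up that schedule from Python is sketched below. It assumes boto3 EventBridge and Lambda clients are passed in (for example `boto3.client("events")` and `boto3.client("lambda")`); the rule name, target ID, and function name are hypothetical.

```python
def schedule_poller(events_client, lambda_client, function_arn,
                    rule_name="batch-metrics-poller", minutes=5):
    """Create a rate-based rule, point it at the Lambda, and allow the invoke."""
    # EventBridge rate expressions use the singular unit for a value of 1.
    unit = "minute" if minutes == 1 else "minutes"
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=f"rate({minutes} {unit})",
        State="ENABLED",
    )["RuleArn"]
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "batch-metrics-lambda", "Arn": function_arn}],
    )
    # Without this resource-based permission EventBridge cannot invoke the function.
    # add_permission raises a conflict error if the StatementId already exists.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    return rule_arn
```

The permission step is the one most often forgotten: the rule will show as enabled, but the function never runs.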

5. Creating CloudWatch Alarms for AWS Batch

Metrics are only useful when they trigger action. Create CloudWatch Alarms for the most critical failure modes.

Alarm 1: High Job Failure Rate

# The JobQueue value must match the dimension the Lambda publishes (the queue ARN).
# The ARN below is a placeholder.
aws cloudwatch put-metric-alarm \
  --alarm-name batch-high-failure-rate \
  --alarm-description 'More than 10 FAILED jobs reported for the queue' \
  --namespace AWSBatch/JobStatus \
  --metric-name JobCount \
  --dimensions Name=JobQueue,Value=arn:aws:batch:us-east-1:123456789012:job-queue/my-queue Name=Status,Value=FAILED \
  --statistic Maximum \
  --period 3600 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:batch-alerts \
  --treat-missing-data notBreaching

Two details matter here. CloudWatch alarms match dimensions exactly, so the alarm must name both the JobQueue and Status dimensions that the metric is published with. And because JobCount is a point-in-time count sampled every 5 minutes, use Maximum rather than Sum: Sum would add up every sample in the hour and overstate the number of failures. The count also includes every FAILED job that list_jobs still returns, so it behaves as a running total rather than a strict per-hour rate.

Alarm 2: Jobs Stuck in RUNNABLE

# The JobQueue value must match the dimension the Lambda publishes (the queue ARN).
# The ARN below is a placeholder.
aws cloudwatch put-metric-alarm \
  --alarm-name batch-runnable-backlog \
  --alarm-description 'More than 50 jobs waiting for compute capacity' \
  --namespace AWSBatch/JobStatus \
  --metric-name JobCount \
  --dimensions Name=JobQueue,Value=arn:aws:batch:us-east-1:123456789012:job-queue/my-queue Name=Status,Value=RUNNABLE \
  --statistic Maximum \
  --period 300 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:batch-alerts

Alarm 3: Long-Running Jobs

aws cloudwatch put-metric-alarm \
  --alarm-name batch-long-running-jobs \
  --alarm-description 'Batch job running longer than 2 hours' \
  --namespace AWSBatch/JobStatus \
  --metric-name JobDurationSeconds \
  --statistic Maximum \
  --period 300 \
  --threshold 7200 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:batch-alerts

This alarm assumes a JobDurationSeconds custom metric, which the Lambda function in section 4 does not publish. Emit it separately, and give the alarm the same dimensions you publish the metric with.
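
A duration metric can be derived from the timestamps Batch attaches to each job. The sketch below assumes the `startedAt` and `stoppedAt` fields (epoch milliseconds) found in job state-change event details and in `list_jobs` summaries; both function names are hypothetical.

```python
def finished_duration_seconds(detail):
    """Duration of a completed job, or None if it never started or has not stopped."""
    started = detail.get("startedAt")
    stopped = detail.get("stoppedAt")
    if started is None or stopped is None:
        return None
    return max(0.0, (stopped - started) / 1000.0)


def running_duration_seconds(job_summary, now_ms):
    """Elapsed run time of a job that is still RUNNING, or None otherwise."""
    if job_summary.get("status") != "RUNNING":
        return None
    started = job_summary.get("startedAt")
    if started is None:
        return None
    return max(0.0, (now_ms - started) / 1000.0)
```

The distinction matters for alerting: a metric built only from finished jobs cannot fire until the job ends, so to catch a job that is still running past two hours you would publish the maximum of `running_duration_seconds` across RUNNING jobs on each poll.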

6. Monitoring Compute Environment Health

Monitoring job status alone is not sufficient. Your compute environment can be healthy while jobs fail, or jobs can be healthy while your compute environment is degraded. Monitor both layers independently.

Compute Environment State Checks

AWS Batch compute environments expose a status field: CREATING, UPDATING, DELETING, DELETED, VALID, or INVALID. An INVALID state means the compute environment cannot provision capacity, which will immediately impact all jobs queued to it.

Use the AWS Batch describe-compute-environments API to poll environment state:

aws batch describe-compute-environments \
  --query 'computeEnvironments[*].{name:computeEnvironmentName,status:status,state:state}' \
  --output table
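
To turn that poll into something you can alert on, a scheduled function could filter for INVALID environments. This is a sketch under the assumption that a boto3 Batch client is passed in; the function name and the shape of the returned dictionaries are illustrative.

```python
def find_invalid_environments(batch_client):
    """Return name, status, and reason for every INVALID compute environment.

    batch_client is expected to behave like boto3.client("batch").
    """
    invalid = []
    token = None
    while True:
        kwargs = {"nextToken": token} if token else {}
        resp = batch_client.describe_compute_environments(**kwargs)
        for env in resp["computeEnvironments"]:
            if env.get("status") == "INVALID":
                invalid.append({
                    "name": env["computeEnvironmentName"],
                    "status": env["status"],
                    # statusReason usually explains why provisioning failed
                    "reason": env.get("statusReason", ""),
                })
        token = resp.get("nextToken")
        if not token:
            break
    return invalid
```

Publishing `len(find_invalid_environments(client))` as a custom metric lets you alarm on any value above zero.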

Key Metrics to Watch for Compute Health

ContainerInstanceCount: if this drops to zero unexpectedly, scaling has failed
CpuUtilized vs CpuReserved: a persistent gap can indicate wasted spend or undersizing
MemoryUtilized vs MemoryReserved: high utilization above 90% can cause OOM failures
RUNNABLE job count trending up: the primary leading indicator of capacity exhaustion
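
The reserved-versus-utilized comparisons above reduce to a simple ratio check. The sketch below is illustrative only: the 90% pressure threshold comes from the list above, while the 30% "overprovisioned" threshold and the label names are assumptions you would tune for your workloads.

```python
def assess_utilization(reserved, utilized, high=0.90, low=0.30):
    """Classify a reserved/utilized pair for CPU or memory.

    Returns one of: "no-capacity", "pressure", "overprovisioned", "ok".
    """
    if reserved <= 0:
        return "no-capacity"
    ratio = utilized / reserved
    if ratio > high:
        return "pressure"          # risk of OOM kills or CPU throttling
    if ratio < low:
        return "overprovisioned"   # paying for capacity that sits idle
    return "ok"
```

For example, a job reserving 4096 MiB and using 3900 MiB is classified as "pressure", a prompt to raise the memory reservation before jobs start failing.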

Spot Instance Interruption Handling

If you use Spot compute environments, jobs can be interrupted when Spot capacity is reclaimed. AWS Batch retries an interrupted job only if its retry strategy allows more than one attempt; otherwise the job moves straight to FAILED. To distinguish Spot interruptions from application failures, filter FAILED jobs on their statusReason: interrupted jobs typically carry a reason that starts with 'Host EC2' and states that the instance was terminated.
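
A minimal sketch of that filter follows. It works on job summaries as returned by `list_jobs` and assumes that host terminations are reported with a `statusReason` beginning with "Host EC2"; verify the exact wording against failed jobs in your own account before relying on it. The function names are hypothetical.

```python
SPOT_REASON_PREFIX = "Host EC2"  # assumed prefix; confirm in your environment


def is_host_termination(job_summary):
    """True if a FAILED job looks like a host termination rather than an app error."""
    reason = job_summary.get("statusReason") or ""
    return (
        job_summary.get("status") == "FAILED"
        and reason.startswith(SPOT_REASON_PREFIX)
    )


def split_failures(job_summaries):
    """Partition FAILED jobs into (host_terminations, application_failures)."""
    interrupted, app_failures = [], []
    for job in job_summaries:
        if job.get("status") != "FAILED":
            continue
        if is_host_termination(job):
            interrupted.append(job)
        else:
            app_failures.append(job)
    return interrupted, app_failures
```

Publishing the two counts as separate metrics keeps a wave of Spot reclaims from triggering the same alarm as a genuine application regression.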

7. Building an AWS Batch Monitoring Dashboard

Once you have metrics flowing into CloudWatch, create a dashboard that gives your team a single view of batch pipeline health. A well-structured dashboard for AWS Batch monitoring typically contains four sections.

Job Health Panel

RUNNABLE job count over time
FAILED job count over time
SUCCEEDED job count over time
Job failure rate (FAILED / total)

Compute Capacity Panel

ContainerInstanceCount over time
CpuUtilized vs CpuReserved
MemoryUtilized vs MemoryReserved

Performance Panel

Average and maximum job duration
P99 job duration for SLA tracking

Queue Depth Panel

Jobs per state per queue
Job age (time in RUNNABLE state)
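
Dashboards can be defined in code so they stay in step with your metrics. The sketch below builds a dashboard body for the Job Health panel, assuming the `AWSBatch/JobStatus` namespace and `JobCount` metric with `JobQueue` and `Status` dimensions described in section 4. The function name and layout are illustrative.

```python
import json


def build_job_health_dashboard(queue_arn, region="us-east-1",
                               namespace="AWSBatch/JobStatus"):
    """Return a JSON dashboard body with one widget per job state."""
    widgets = []
    for i, status in enumerate(["RUNNABLE", "FAILED", "SUCCEEDED"]):
        widgets.append({
            "type": "metric",
            # The dashboard grid is 24 columns wide: two widgets per row.
            "x": (i % 2) * 12,
            "y": (i // 2) * 6,
            "width": 12,
            "height": 6,
            "properties": {
                "title": f"{status} jobs",
                "region": region,
                "stat": "Maximum",
                "period": 300,
                "metrics": [
                    [namespace, "JobCount",
                     "JobQueue", queue_arn, "Status", status],
                ],
            },
        })
    return json.dumps({"widgets": widgets})
```

The returned string can be passed as `DashboardBody` to the CloudWatch `put_dashboard` call, together with a `DashboardName` of your choice.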

8. AWS Batch Monitoring Best Practices

Always configure CloudWatch Logs on every job definition. Silent failures are the hardest to debug.
Enable Container Insights for any production compute environment. The cost is small relative to the visibility it provides.
Set a RUNNABLE backlog alarm early. A growing RUNNABLE count is almost always actionable before it becomes a crisis.
Separate Spot and On-Demand compute environments. This makes it easier to distinguish Spot interruptions from application failures in your metrics.
Use job retry strategies for transient failures but cap retries. Unlimited retries on a broken job can exhaust your compute budget silently.
Review maxvCpus on all managed compute environments monthly. Workload growth regularly makes previously sensible limits into bottlenecks.
Tag job queues and job definitions consistently. Dimensions in your custom CloudWatch metrics should match these tags so you can filter dashboards by team, pipeline, or environment.
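
The advice above to cap retries can be expressed as a retry strategy on the job definition. The sketch below builds one that retries host terminations and exits on everything else; the function name is hypothetical, and the "Host EC2*" pattern is an assumption about how host terminations are reported in your account.

```python
def capped_retry_strategy(max_attempts=3):
    """Build a retryStrategy that retries host terminations and nothing else."""
    if not 1 <= max_attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    return {
        "attempts": max_attempts,
        "evaluateOnExit": [
            # Retry when the host went away (for example a Spot reclaim).
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Anything else is treated as an application failure: stop retrying.
            {"onReason": "*", "action": "EXIT"},
        ],
    }
```

Pass the result as the `retryStrategy` argument when registering the job definition. With this in place a broken job fails on its first attempt instead of consuming every retry.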

AWS Batch Observability

Try CubeAPM for AWS Batch Monitoring Today

Setting up EventBridge rules, Lambda functions, custom metrics, and CloudWatch dashboards takes time and ongoing maintenance. CubeAPM gives you out-of-the-box observability for AWS Batch, including real-time job state dashboards, compute environment health checks, and intelligent alerting on failure spikes and RUNNABLE backlogs, without writing a single Lambda function.

Pre-built AWS Batch dashboards in minutes, not days
Correlate batch job failures with infrastructure events and application traces
Unified alerting across Batch, ECS, Lambda, and other AWS services


Conclusion

AWS Batch monitoring requires building on top of what AWS provides natively. CloudWatch Logs and Container Insights give you the infrastructure layer. EventBridge and Lambda close the gap for job-status metrics. CloudWatch Alarms and dashboards turn data into action.

The most important metric to get right first is the RUNNABLE job count. Jobs stuck in RUNNABLE are the leading indicator of nearly every serious Batch incident. Once you have an alarm on that metric, you can build out the rest of your monitoring incrementally.

Disclaimer: This article is intended for informational purposes only. AWS services, pricing, and feature availability change over time. Always refer to the official AWS documentation for the most current and accurate information. Code examples are provided as illustrative references and should be reviewed and tested in your own environment before use in production.

FAQ

1. Does AWS Batch publish job status metrics to CloudWatch by default?

No. AWS Batch does not push job-status counts (RUNNABLE, RUNNING, FAILED, SUCCEEDED) to CloudWatch natively. You need to build a custom solution using EventBridge, Lambda, and the Batch API to publish these as custom metrics.

2. How do I get alerted when AWS Batch jobs are stuck in RUNNABLE?

Create a CloudWatch Alarm on a custom metric tracking RUNNABLE job count. If the count crosses a threshold (for example, 50 jobs) over a 5-minute period, trigger an SNS notification. Jobs stuck in RUNNABLE almost always indicate a low maxvCpus setting or insufficient Spot capacity.

3. What is CloudWatch Container Insights for AWS Batch?

Container Insights is an optional feature that collects CPU, memory, network, and storage metrics at the container level for your Batch compute environments. It is enabled per compute environment from the AWS Batch console and is charged as custom CloudWatch metrics.

4. How do I monitor AWS Batch job logs?

Configure the awslogs log driver in your job definition’s logConfiguration block. This streams container stdout and stderr to a CloudWatch log group. For Fargate-based compute environments, CloudWatch Logs is enabled by default. For EC2-based environments, you also need to attach a CloudWatch Logs IAM policy to your ecsInstanceRole.

5. What is the best way to monitor AWS Batch compute environment health?

Use a combination of Container Insights metrics (ContainerInstanceCount, CpuUtilized, MemoryUtilized) and the describe-compute-environments API to check environment state. An INVALID compute environment state means capacity provisioning has failed and will block all queued jobs. Set a CloudWatch Alarm on ContainerInstanceCount dropping to zero as an early warning signal.
