How to Monitor AWS Glue Jobs for Failures and Duration

AWS Glue is AWS’s serverless ETL service built on Apache Spark. It runs managed data pipelines that move and transform data between sources such as Amazon S3, Amazon Redshift, and JDBC databases. When a Glue job fails silently or runs twice as long as expected, downstream reports break, SLAs slip, and engineers spend hours in CloudWatch logs trying to piece together what went wrong.

This guide covers the practical steps to set up reliable AWS Glue monitoring: detecting job failures immediately, tracking duration anomalies before they become incidents, and using structured logs and job run insights to find root causes fast.

💡

Key Takeaways

AWS Glue sends metrics to CloudWatch every 30 seconds, including elapsed time, bytes read, and memory usage.
Job state changes (SUCCEEDED, FAILED, TIMEOUT, STOPPED) are published as EventBridge events, enabling automated SNS alerts.
glue.driver.aggregate.elapsedTime is the most reliable metric for tracking and alarming on job duration.
Enabling continuous logging gives you real-time access in CloudWatch Logs under /aws-glue/jobs/logs-v2/.
Job Run Insights (Glue 2.0+) surfaces failure root causes, including the exact script line number and recommended fixes.
AWS Glue does not publish a native “job failed” CloudWatch metric. Failure detection requires EventBridge or a custom Lambda-based solution.

1. How AWS Glue Monitoring Works

AWS Glue surfaces observability data through three distinct layers.

CloudWatch Metrics

AWS Glue sends metrics to CloudWatch every 30 seconds while a job is running. Metrics live in the Glue namespace and can be viewed in the AWS Glue console under “Job run monitoring,” or directly in the CloudWatch console. The dashboards aggregate 30-second samples into per-minute values using the SUM statistic.

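Each metric is tagged with three dimensions:

JobName: the name of your AWS Glue job
JobRunId: the specific run ID, or ALL to aggregate across all runs
Type: count for aggregate metrics such as elapsedTime, or gauge for point-in-time metrics such as heap usage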
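As a rough illustration of how these dimensions are used, the sketch below builds the arguments for a CloudWatch get_metric_statistics call against the elapsed-time metric. The function names are hypothetical, and the Type=count dimension is an assumption based on the Glue metrics documentation; if a query returns no datapoints, check the dimensions shown for the metric in the CloudWatch console.

```python
from datetime import datetime, timedelta, timezone


def elapsed_time_query(job_name, run_id="ALL", hours=6, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.elapsedTime",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": run_id},
            # Assumption: aggregate metrics also carry Type=count.
            {"Name": "Type", "Value": "count"},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # one-minute buckets
        "Statistics": ["Sum"],
    }


def fetch_elapsed_time(job_name):
    """Return datapoints for the job, oldest first (needs AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**elapsed_time_query(job_name))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

CloudWatch only matches a metric when every dimension is supplied, so a query that passes JobName alone typically comes back empty.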

CloudWatch Logs

Each job run writes output and error logs to CloudWatch Logs. By default, logs appear only after the job completes. Enabling continuous logging streams them in real time, which is essential for debugging long-running or failing jobs.

Default log groups:

/aws-glue/jobs/output – standard driver output
/aws-glue/jobs/error – error and exception logs
/aws-glue/jobs/logs-v2/ – used when continuous logging is enabled

EventBridge Job State Events

Every time a Glue job changes state, the service publishes an event to the default Amazon EventBridge bus. Possible states are SUCCEEDED, FAILED, TIMEOUT, and STOPPED. These events are the backbone of automated alerting: EventBridge rules match on specific states and route them to SNS topics, Lambda functions, or other targets.
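To make the event shape concrete, here is a minimal sketch of reading a job state change event in Python. The sample event is illustrative (the run ID and message are made up), and summarize_job_event is a hypothetical helper, not part of any AWS SDK.

```python
FAILURE_STATES = {"FAILED", "TIMEOUT"}

# Illustrative event; the values are made up for this example.
SAMPLE_EVENT = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "my-etl-job",
        "jobRunId": "jr_example",
        "state": "FAILED",
        "message": "Example failure message",
    },
}


def summarize_job_event(event):
    """Extract the fields an alert usually needs from a job state event."""
    detail = event.get("detail", {})
    state = detail.get("state", "UNKNOWN")
    return {
        "job_name": detail.get("jobName", "unknown"),
        "run_id": detail.get("jobRunId", "unknown"),
        "state": state,
        "message": detail.get("message", ""),
        "is_failure": state in FAILURE_STATES,
    }


print(summarize_job_event(SAMPLE_EVENT))
```

Reading fields with .get keeps the helper from raising if an event arrives without an optional field such as message.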

2. The Key AWS Glue Metrics to Monitor

The table below shows the metrics that matter most for failure detection and duration tracking.

Metric Name	What It Measures	Primary Use Case
glue.driver.aggregate.elapsedTime	ETL elapsed time in milliseconds (excludes bootstrap)	Duration tracking and SLA alarms
glue.driver.aggregate.numFailedTasks	Count of failed Spark tasks	Catching data-level failures before full job crash
glue.driver.aggregate.bytesRead	Bytes read from all data sources	Job progress and bookmark issues
glue.ALL.jvm.heap.used	JVM heap memory across all executors	Memory pressure detection
glue.driver.BlockManager.disk.diskSpaceUsed_MB	Disk used by Spark block manager	Disk overflow detection
glue.driver.aggregate.numCompletedStages	Number of completed Spark stages	Job progress tracking

⚠️

Important: AWS Glue does not publish a native “job failed” CloudWatch metric. The FAILED state is an EventBridge event, not a metric. This is why many teams build the custom Lambda-based solution covered in Section 5.

3. Monitoring Job Duration with CloudWatch Alarms

The best metric for duration is glue.driver.aggregate.elapsedTime. It measures ETL execution time in milliseconds and excludes the Glue bootstrap period, making it a clean signal of your script’s actual performance.

Step 1: Enable Job Metrics

📝

Note: Job metrics are not enabled by default and must be turned on per job. Enabling CloudWatch custom metrics incurs additional charges. See Amazon CloudWatch Pricing for details.

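In AWS Glue Studio, open your job, go to the Job details tab, and enable “Job metrics.” With the AWS CLI, set the --enable-metrics job argument. The command below shows only that argument; because update-job overwrites the entire job definition, merge it into the job's existing configuration (Role, Command, and any other DefaultArguments) before running it:

aws glue update-job \
  --job-name my-etl-job \
  --job-update '{"DefaultArguments": {"--enable-metrics": ""}}'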

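Because UpdateJob replaces the stored job definition rather than patching it, a safer way to add one argument is to read the current definition first and send it back with the change merged in. The sketch below shows one way to do that with boto3; build_job_update and enable_metrics are hypothetical helpers, and the list of fields to drop is an assumption that may need adjusting for your jobs.

```python
# Fields returned by get_job that update_job does not accept.
# Assumption: this list covers common jobs and may need extending.
READ_ONLY_FIELDS = ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity")


def build_job_update(job, extra_arguments):
    """Return a JobUpdate dict: the existing job plus extra default arguments."""
    update = {key: value for key, value in job.items() if key not in READ_ONLY_FIELDS}
    # MaxCapacity cannot be combined with WorkerType/NumberOfWorkers.
    if "WorkerType" in update:
        update.pop("MaxCapacity", None)
    merged = dict(update.get("DefaultArguments", {}))
    merged.update(extra_arguments)
    update["DefaultArguments"] = merged
    return update


def enable_metrics(job_name):
    """Turn on job metrics without discarding the rest of the job definition."""
    import boto3  # imported here so build_job_update can be tested offline

    glue = boto3.client("glue")
    job = glue.get_job(JobName=job_name)["Job"]
    job_update = build_job_update(job, {"--enable-metrics": ""})
    glue.update_job(JobName=job_name, JobUpdate=job_update)
```

The same helper works for any other default argument, since it only merges keys into the existing DefaultArguments.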

Step 2: Create a Duration Alarm

The command below creates a CloudWatch alarm that fires when a job exceeds 3 hours (10,800,000 milliseconds):

aws cloudwatch put-metric-alarm \
  --alarm-name "GlueJobDurationExceeded-my-etl-job" \
  --namespace Glue \
  --metric-name "glue.driver.aggregate.elapsedTime" \
  --dimensions Name=JobName,Value=my-etl-job Name=JobRunId,Value=ALL Name=Type,Value=count \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 10800000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789:GlueAlerts


💡

Tip: Set your duration alarm threshold at 150% of the average historical run time. This catches meaningful regressions without generating too many false positives.
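The 150% rule is simple arithmetic, sketched below. Both functions are hypothetical helpers; the second reads recent run times from the Glue GetJobRuns API, where ExecutionTime is reported in seconds, and considers only successful runs so that failed or aborted runs do not skew the average.

```python
from statistics import mean


def duration_threshold_ms(run_seconds, factor=1.5):
    """Return an alarm threshold in milliseconds from past run times in seconds."""
    if not run_seconds:
        raise ValueError("need at least one historical run")
    return int(mean(run_seconds) * factor * 1000)


def recent_run_seconds(job_name, max_runs=20):
    """Fetch execution times of recent successful runs (needs AWS credentials)."""
    import boto3  # imported here so the arithmetic above works without boto3

    glue = boto3.client("glue")
    runs = glue.get_job_runs(JobName=job_name, MaxResults=max_runs)["JobRuns"]
    return [run["ExecutionTime"] for run in runs if run["JobRunState"] == "SUCCEEDED"]


# Two past runs of 1 hour and 2 hours average 1.5 hours, so the threshold is 2.25 hours.
print(duration_threshold_ms([3600, 7200]))  # 8100000
```

The result can be passed directly as the --threshold value of the duration alarm.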

4. Detecting Job Failures with EventBridge and SNS

Because Glue has no native failure metric, the standard pattern for failure alerting uses EventBridge rules. When a job enters FAILED or TIMEOUT, EventBridge routes the event to an SNS topic that sends an email, Slack message, or PagerDuty alert.

Step 1: Create an SNS Topic

aws sns create-topic --name GlueJobFailures

aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789:GlueJobFailures \
  --protocol email \
  --notification-endpoint your-team@example.com


Step 2: Create an EventBridge Rule

The pattern below matches both FAILED and TIMEOUT for all Glue jobs. Save it as glue-failure-pattern.json, the file the put-rule command reads. To scope the rule to a specific job, add a “jobName” filter inside the detail block.

{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": { "state": ["FAILED", "TIMEOUT"] }
}


aws events put-rule \
  --name "GlueJobFailureRule" \
  --event-pattern file://glue-failure-pattern.json \
  --state ENABLED

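If you manage several rules, generating the pattern file can be less error-prone than editing JSON by hand. The sketch below builds the same pattern and optionally adds the jobName filter mentioned above; glue_failure_pattern and write_pattern are hypothetical helper names.

```python
import json


def glue_failure_pattern(job_names=None, states=("FAILED", "TIMEOUT")):
    """Build an EventBridge pattern for Glue job state changes."""
    detail = {"state": list(states)}
    if job_names:
        # Restrict the rule to the named jobs only.
        detail["jobName"] = list(job_names)
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": detail,
    }


def write_pattern(path, job_names=None):
    """Write the pattern to a file that put-rule can read via file://."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(glue_failure_pattern(job_names), handle, indent=2)


print(json.dumps(glue_failure_pattern(["my-etl-job"]), indent=2))
```

Calling glue_failure_pattern() with no arguments reproduces the all-jobs pattern shown above.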

Step 3: Add SNS as a Target

aws events put-targets \
  --rule GlueJobFailureRule \
  --targets "Id=1,Arn=arn:aws:sns:us-east-1:123456789:GlueJobFailures"

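When the target is added through the CLI rather than the console, the topic usually also needs an access policy that lets EventBridge publish to it. The sketch below builds such a policy and applies it with boto3; the helper names and the statement ID are illustrative choices, not required values.

```python
import json


def eventbridge_publish_policy(topic_arn):
    """Build an SNS topic policy that lets EventBridge publish to the topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEventBridgePublish",  # illustrative statement ID
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sns:Publish",
                "Resource": topic_arn,
            }
        ],
    }


def apply_policy(topic_arn):
    """Replace the topic's access policy (needs AWS credentials)."""
    import boto3  # imported here so the policy builder works without boto3

    sns = boto3.client("sns")
    sns.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=json.dumps(eventbridge_publish_policy(topic_arn)),
    )
```

Note that set_topic_attributes replaces the whole policy, so if the topic already has statements you rely on, merge this one into the existing document instead.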

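By default, the SNS notification carries the raw event JSON, which includes the job name, run ID, state, and error message, giving the on-call engineer enough to start investigating immediately. If no notifications arrive, check that the topic's access policy allows events.amazonaws.com to publish to it.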

5. Creating Custom Success/Failure Metrics

Built-in Glue metrics do not give you Lambda-style counters (success count, failure count, error rate). You can add these with a small Lambda function that listens on EventBridge and writes custom CloudWatch metrics, covering all jobs automatically including new ones created in the future.

How It Works

A Glue job state change fires an EventBridge event.
An EventBridge rule routes it to a Lambda function.
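Lambda extracts the job name and state, then emits custom metrics under a custom namespace, either by calling cloudwatch.put_metric_data or, as in the example below, through the CloudWatch embedded metric format.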
You create CloudWatch alarms on those custom metrics like any other.
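The steps above can also be sketched with a direct put_metric_data call instead of the embedded metric format. This is a minimal illustration under a few assumptions: the GlueCustomMetrics namespace and metric names mirror the example in this section, and counting TIMEOUT as a failure is a choice you may want to change.

```python
FAILURE_STATES = ("FAILED", "TIMEOUT")  # assumption: timeouts count as failures


def build_metric_data(event):
    """Turn a Glue job state change event into CloudWatch MetricData entries."""
    detail = event["detail"]
    state = detail["state"]
    dimensions = [{"Name": "JobName", "Value": detail["jobName"]}]
    return [
        {
            "MetricName": "Success",
            "Dimensions": dimensions,
            "Value": 1.0 if state == "SUCCEEDED" else 0.0,
            "Unit": "Count",
        },
        {
            "MetricName": "Failure",
            "Dimensions": dimensions,
            "Value": 1.0 if state in FAILURE_STATES else 0.0,
            "Unit": "Count",
        },
    ]


def handler(event, context):
    """Lambda entry point: publish one Success and one Failure datapoint."""
    import boto3  # imported here so build_metric_data can be tested offline

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="GlueCustomMetrics",
        MetricData=build_metric_data(event),
    )
```

Emitting a zero for the metric that did not occur gives both metrics a datapoint per run, which makes an error-rate calculation (Failure divided by Success plus Failure) straightforward.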

Lambda Function (Python)

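from aws_embedded_metrics import metric_scope

# States that should count as a failed run.
FAILURE_STATES = ("FAILED", "TIMEOUT")

@metric_scope
def handler(event, context, metrics):
    # Glue job state change events carry the job name and state in "detail".
    job_name = event["detail"]["jobName"]
    state = event["detail"]["state"]

    metrics.set_namespace("GlueCustomMetrics")
    metrics.set_dimensions({"JobName": job_name})
    # Emit both metrics on every event so each has a 0 or 1 datapoint per run.
    metrics.put_metric("Success", 1 if state == "SUCCEEDED" else 0, "Count")
    metrics.put_metric("Failure", 1 if state in FAILURE_STATES else 0, "Count")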


6. Using Job Run Insights for Root Cause Analysis

Job Run Insights (AWS Glue 2.0+) performs automatic root cause analysis on failed jobs. When a job fails, it creates two log streams in CloudWatch Logs:

<job-run-id>-job-insights-rca-driver: Exception analysis with the script line number, last Spark action before failure, and time-ordered executor events.
<job-run-id>-job-insights-rule-driver: Rule-based insights with root cause analysis and recommended fixes (for example, tuning shuffle partition count).

Enabling Job Run Insights

In AWS Glue Studio, open your job, go to the Job details tab, and check “Generate job insights.” To enable it for a single run from the CLI:

aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--enable-job-insights": "true"}'


📝

Note: Job Run Insights streams are created only when a job fails. A successful run generates no insight streams, which keeps log storage costs low. This feature requires AWS Glue version 2.0 or above.

7. Enabling Continuous Logging

Without continuous logging, CloudWatch only receives Glue output after the job finishes. For long-running jobs, this means you are blind during execution. Enable it with the following job arguments (as with any update-job call, merge them into the job's full definition, because update-job replaces it):

aws glue update-job --job-name my-etl-job \
  --job-update '{
    "DefaultArguments": {
      "--enable-continuous-cloudwatch-log": "true",
      "--enable-continuous-log-filter": "true",
      "--continuous-log-logGroup": "/aws-glue/jobs/my-etl-job"
    }
  }'


To tail logs from a currently running job, follow the continuous log group configured above (requires AWS CLI v2):

aws logs tail /aws-glue/jobs/my-etl-job \
  --follow \
  --since 10m


8. Monitoring AWS Glue Workflows

If you chain jobs and crawlers using AWS Glue Workflows, monitoring individual jobs is not enough. Glue Workflows also publish state changes to EventBridge under the detail-type “Glue Workflow State Change.” You can also poll status programmatically:

import boto3

glue = boto3.client("glue")
response = glue.get_workflow_run(Name="my-workflow", RunId="wr_abc123")
print(response["Run"]["Status"])

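A single status check is rarely enough in practice, so polling usually runs in a loop until the workflow reaches a terminal status. The sketch below separates the loop from the AWS call so it can be exercised without credentials; wait_for_workflow and glue_status_fetcher are hypothetical helpers, and the polling interval and limit are arbitrary defaults.

```python
import time

# Workflow run statuses after which no further change is expected.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def wait_for_workflow(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"workflow still running after {max_polls} polls")


def glue_status_fetcher(name, run_id):
    """Return a zero-argument function that reads the run status from Glue."""
    import boto3  # imported here so the polling loop works without boto3

    glue = boto3.client("glue")
    return lambda: glue.get_workflow_run(Name=name, RunId=run_id)["Run"]["Status"]


# Usage (needs AWS credentials):
# final = wait_for_workflow(glue_status_fetcher("my-workflow", "wr_abc123"))
```

A COMPLETED status does not by itself guarantee that every job in the workflow succeeded, so it is worth also checking the run's Statistics (for example FailedActions) in the same response.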

💡

Tip: Monitor at the workflow level in addition to individual jobs. A single upstream job timing out can silently block all downstream jobs or pass a stale schema to dependent transforms.

9. Common Failure Types and How to Diagnose Them

The table below maps error signatures in CloudWatch Logs to root causes and fixes.

Error in Logs	Root Cause	Fix
Container killed by YARN for exceeding memory limits	Partition too large for available memory	Upgrade to G.2X worker or add repartition()
java.lang.OutOfMemoryError: Java heap space	Executor out of JVM heap	Use G.2X workers or reduce partition size
Job timed out	Exceeded Timeout setting	Increase Timeout or enable Auto Scaling
Access Denied	IAM role missing S3 or target permissions	Update the Glue job IAM execution role
ModuleNotFoundError	Python library not bundled with job	Add to --additional-python-modules argument
AnalysisException: Path does not exist	Input S3 path empty or wrong	Verify path; check upstream job completed
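The table above can be turned into a small lookup for triaging log output automatically, for example inside an alerting Lambda. This is a sketch only: diagnose is a hypothetical helper, and it uses plain case-insensitive substring matching on the signatures from the table.

```python
# (log signature, root cause, fix) taken from the table above.
KNOWN_FAILURES = [
    ("Container killed by YARN for exceeding memory limits",
     "Partition too large for available memory",
     "Upgrade to G.2X worker or add repartition()"),
    ("java.lang.OutOfMemoryError: Java heap space",
     "Executor out of JVM heap",
     "Use G.2X workers or reduce partition size"),
    ("Job timed out",
     "Exceeded Timeout setting",
     "Increase Timeout or enable Auto Scaling"),
    ("Access Denied",
     "IAM role missing S3 or target permissions",
     "Update the Glue job IAM execution role"),
    ("ModuleNotFoundError",
     "Python library not bundled with job",
     "Add to --additional-python-modules argument"),
    ("AnalysisException: Path does not exist",
     "Input S3 path empty or wrong",
     "Verify path; check upstream job completed"),
]


def diagnose(log_text):
    """Return every known failure whose signature appears in the log text."""
    lowered = log_text.lower()
    return [
        {"signature": signature, "root_cause": cause, "fix": fix}
        for signature, cause, fix in KNOWN_FAILURES
        if signature.lower() in lowered
    ]


for match in diagnose("ERROR java.lang.OutOfMemoryError: Java heap space"):
    print(match["root_cause"], "->", match["fix"])
```

An empty result means the error is not in the table, which is a useful signal to add a new row once the cause is understood.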

10. Monitoring Best Practices

Always enable continuous logging on production jobs. The overhead is minimal and the debugging value is substantial.
Tag all Glue jobs consistently so CloudWatch metrics are filterable by environment (dev/staging/prod) or team.
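Set job-level timeout values. Without an explicit timeout, a hung job keeps running and accruing DPU-hours until the service default expires, which can be as long as 48 hours depending on the Glue version.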
Enable Auto Scaling on Glue 3.0 and 4.0 jobs to reduce cost and prevent over-provisioning from masking performance regressions.
Use composite CloudWatch alarms to reduce alert fatigue: one notification when any job in a pipeline fails rather than N separate alerts.
Store job run history in DynamoDB for weekly reporting on failure rates, retry counts, and longest-running jobs. This gives you trend data that CloudWatch metrics alone do not provide.
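As an illustration of the composite alarm practice, the sketch below builds an alarm rule that fires when any of a pipeline's per-job alarms is in the ALARM state. The helper names and the "-any-job-failed" naming convention are hypothetical; the per-job alarms are assumed to exist already.

```python
def composite_alarm_rule(alarm_names):
    """Build a rule expression that is true when any listed alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("need at least one child alarm")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


def create_pipeline_alarm(pipeline, alarm_names, topic_arn):
    """Create one composite alarm for the pipeline (needs AWS credentials)."""
    import boto3  # imported here so the rule builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_composite_alarm(
        AlarmName=f"{pipeline}-any-job-failed",  # hypothetical naming convention
        AlarmRule=composite_alarm_rule(alarm_names),
        AlarmActions=[topic_arn],
    )


print(composite_alarm_rule(["extract-failed", "transform-failed"]))
```

Only the composite alarm needs an SNS action; leaving actions off the child alarms is what reduces N notifications to one.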

AWS Glue Monitoring with CubeAPM

CubeAPM gives data engineering teams a single pane of glass for AWS Glue monitoring: correlate job failures with slow upstream queries, visualize duration trends across runs, set intelligent alert thresholds, and cut MTTR from hours to minutes.

With CubeAPM you get:

Out-of-the-box AWS Glue job failure and duration dashboards
Anomaly detection on job duration to catch slow regressions before they become incidents
Deep integration with CloudWatch Logs for trace-level failure analysis
Alert routing to Slack, PagerDuty, and email with enriched context

Get full AWS Glue observability in under 15 minutes.


FAQs

1. Does AWS Glue have a native “job failed” metric in CloudWatch?

No. FAILED and TIMEOUT are EventBridge events, not CloudWatch metrics. To get failure alerts, create an EventBridge rule that routes to an SNS topic, or use a Lambda function to write custom CloudWatch metrics from those events.

2. How do I track how long an AWS Glue job takes to run?

Use the glue.driver.aggregate.elapsedTime CloudWatch metric. It measures actual ETL runtime in milliseconds, excluding bootstrap time. Enable job metrics first by adding --enable-metrics to your job’s default arguments, then set a CloudWatch alarm on this metric.

3. What is the difference between continuous logging and default logging in AWS Glue?

Default logging sends output to CloudWatch only after a job finishes. Continuous logging streams logs in real time while the job runs. For production jobs, always enable continuous logging so you can inspect what is happening without waiting for the job to complete or crash.

4. What causes “Container killed by YARN for exceeding memory limits”?

A Spark executor ran out of memory, usually due to data skew or reading too much data at once. Quick fix: upgrade to a G.2X worker type. Better fix: repartition your data before heavy transforms using df.repartition(n) to distribute load evenly.

5. How do I get notified when an AWS Glue job fails?

Create an EventBridge rule that matches Glue job state changes for FAILED and TIMEOUT, then route it to an SNS topic. Subscribe your team’s email address to that topic, or forward notifications to Slack through a chat integration or a small Lambda function. No changes to your job scripts are needed, and the rule covers every job in your account automatically.