
How to Monitor AWS Glue Jobs for Failures and Duration


AWS Glue is AWS’s serverless ETL service built on Apache Spark. It runs managed data pipelines that move and transform data between sources such as Amazon S3, Amazon Redshift, and JDBC databases. When a Glue job fails silently or runs twice as long as expected, downstream reports break, SLAs slip, and engineers spend hours in CloudWatch logs trying to piece together what went wrong.

This guide covers the practical steps to set up reliable AWS Glue monitoring: detecting job failures immediately, tracking duration anomalies before they become incidents, and using structured logs and job run insights to find root causes fast.

💡

Key Takeaways

  • AWS Glue sends metrics to CloudWatch every 30 seconds, including elapsed time, bytes read, and memory usage.
  • Job state changes (SUCCEEDED, FAILED, TIMEOUT, STOPPED) are published as EventBridge events, enabling automated SNS alerts.
  • glue.driver.aggregate.elapsedTime is the most reliable metric for tracking and alarming on job duration.
  • Enabling continuous logging gives you real-time access in CloudWatch Logs under /aws-glue/jobs/logs-v2/.
  • Job Run Insights (Glue 2.0+) surfaces failure root causes, including the exact script line number and recommended fixes.
  • AWS Glue does not publish a native “job failed” CloudWatch metric. Failure detection requires EventBridge or a custom Lambda-based solution.

1. How AWS Glue Monitoring Works

AWS Glue surfaces observability data through three distinct layers.

CloudWatch Metrics

AWS Glue sends metrics to CloudWatch every 30 seconds while a job is running. Metrics live in the Glue namespace and can be viewed in the AWS Glue console under “Job run monitoring,” or directly in the CloudWatch console. The dashboards aggregate 30-second samples into per-minute values using the SUM statistic.

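Each metric is tagged with three dimensions:

  • JobName: the name of your AWS Glue job
  • JobRunId: the specific run ID, or ALL to aggregate across all runs
  • Type: count for aggregate metrics such as elapsed time and bytes read, gauge for point-in-time metrics such as JVM heap usage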
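
As a rough sketch of how these dimensions are used when reading a metric back, the helper below builds the keyword arguments for a CloudWatch get_metric_statistics call. The function name glue_metric_query and the one-hour window are illustrative assumptions. The Type dimension (count or gauge) is included because the AWS Glue metric reference lists it alongside JobName and JobRunId.

```python
from datetime import datetime, timedelta, timezone


def glue_metric_query(job_name, metric_name, metric_type="count",
                      job_run_id="ALL", minutes=60, now=None):
    """Build kwargs for CloudWatch get_metric_statistics (hypothetical helper)."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": job_run_id},
            {"Name": "Type", "Value": metric_type},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,           # one-minute buckets, as in the Glue dashboards
        "Statistics": ["Sum"],  # 30-second samples are summed per minute
    }


params = glue_metric_query("my-etl-job", "glue.driver.aggregate.bytesRead")
print(params["Dimensions"])
```

With credentials configured, the result could be passed on as boto3.client("cloudwatch").get_metric_statistics(**params).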

CloudWatch Logs

Each job run writes output and error logs to CloudWatch Logs. By default, logs appear only after the job completes. Enabling continuous logging streams them in real time, which is essential for debugging long-running or failing jobs.

Default log groups:

  • /aws-glue/jobs/output – standard driver output
  • /aws-glue/jobs/error – error and exception logs
  • /aws-glue/jobs/logs-v2/ – used when continuous logging is enabled

EventBridge Job State Events

Every time a Glue job changes state, the service publishes an event to the default Amazon EventBridge bus. Possible states are SUCCEEDED, FAILED, TIMEOUT, and STOPPED. These events are the backbone of automated alerting: EventBridge rules match on specific states and route them to SNS topics, Lambda functions, or other targets.
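
To make the event shape concrete, here is a minimal sketch of a trimmed job state change event and a check for failure states. The sample values and the function name is_glue_failure are made up for illustration; the field names (jobName, state, jobRunId, message) follow the event format AWS documents for Glue.

```python
FAILURE_STATES = {"FAILED", "TIMEOUT"}

# Trimmed example of a "Glue Job State Change" event; all values are made up.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "my-etl-job",
        "state": "FAILED",
        "jobRunId": "jr_example",
        "message": "Example error message",
    },
}


def is_glue_failure(event):
    """Return True when the event reports a failed or timed-out Glue job run."""
    if event.get("source") != "aws.glue":
        return False
    if event.get("detail-type") != "Glue Job State Change":
        return False
    return event.get("detail", {}).get("state") in FAILURE_STATES


print(is_glue_failure(sample_event))
```

An EventBridge rule performs the same matching on the service side; a check like this is mainly useful inside a Lambda target that receives more than one kind of event.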

2. The Key AWS Glue Metrics to Monitor

The table below shows the metrics that matter most for failure detection and duration tracking.

| Metric Name | What It Measures | Primary Use Case |
| --- | --- | --- |
| glue.driver.aggregate.elapsedTime | ETL elapsed time in milliseconds (excludes bootstrap) | Duration tracking and SLA alarms |
| glue.driver.aggregate.numFailedTasks | Count of failed Spark tasks | Catching data-level failures before full job crash |
| glue.driver.aggregate.bytesRead | Bytes read from all data sources | Job progress and bookmark issues |
| glue.ALL.jvm.heap.used | JVM heap memory across all executors | Memory pressure detection |
| glue.driver.BlockManager.disk.diskSpaceUsed_MB | Disk used by Spark block manager | Disk overflow detection |
| glue.driver.aggregate.numCompletedStages | Number of completed Spark stages | Job progress tracking |

⚠️

Important: AWS Glue does not publish a native “job failed” CloudWatch metric. The FAILED state is an EventBridge event, not a metric. This is why many teams build the custom Lambda-based solution covered in Section 5.

3. Monitoring Job Duration with CloudWatch Alarms

The best metric for duration is glue.driver.aggregate.elapsedTime. It measures ETL execution time in milliseconds and excludes the Glue bootstrap period, making it a clean signal of your script’s actual performance.

Step 1: Enable Job Metrics

📝

Note: Job metrics are not enabled by default and must be turned on per job. Enabling CloudWatch custom metrics incurs additional charges. See Amazon CloudWatch Pricing for details.

In AWS Glue Studio, open your job, go to the Job details tab, and enable “Job metrics.” Using the AWS CLI:

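# Caution: update-job overwrites the whole job definition, so include the job's
# existing Role, Command, and other settings in --job-update alongside this argument.
aws glue update-job \
  --job-name my-etl-job \
  --job-update '{"DefaultArguments": {"--enable-metrics": ""}}'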

Step 2: Create a Duration Alarm

The command below creates a CloudWatch alarm that fires when a job exceeds 3 hours (10,800,000 milliseconds):

# Glue metrics carry JobName, JobRunId, and Type dimensions; the alarm only
# matches data when all three are specified.
aws cloudwatch put-metric-alarm \
  --alarm-name "GlueJobDurationExceeded-my-etl-job" \
  --namespace Glue \
  --metric-name "glue.driver.aggregate.elapsedTime" \
  --dimensions Name=JobName,Value=my-etl-job Name=JobRunId,Value=ALL Name=Type,Value=count \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 10800000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789:GlueAlerts

💡

Tip: Set your duration alarm threshold at 150% of the average historical run time. This catches meaningful regressions without generating too many false positives.
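
The 150% rule is simple arithmetic. The sketch below derives a threshold from a list of past run times; the function name duration_threshold_ms and the three sample durations are invented for illustration, and in practice the history would come from your own job run records.

```python
from statistics import mean


def duration_threshold_ms(history_ms, factor=1.5):
    """Return an alarm threshold at `factor` times the average past run time."""
    if not history_ms:
        raise ValueError("need at least one historical duration")
    return int(mean(history_ms) * factor)


# Three past runs of 60, 70, and 80 minutes, in milliseconds (made-up numbers).
history = [3_600_000, 4_200_000, 4_800_000]
threshold = duration_threshold_ms(history)
print(threshold)  # 6300000, i.e. 105 minutes
```

The average here is 70 minutes, so the alarm threshold lands at 105 minutes. A percentile-based threshold may suit jobs with very uneven run times better than a mean.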

4. Detecting Job Failures with EventBridge and SNS

Because Glue has no native failure metric, the standard pattern for failure alerting uses EventBridge rules. When a job enters FAILED or TIMEOUT, EventBridge routes the event to an SNS topic that sends an email, Slack message, or PagerDuty alert.

Step 1: Create an SNS Topic

aws sns create-topic --name GlueJobFailures

# you@example.com is a placeholder; replace it with your team's address.
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789:GlueJobFailures \
  --protocol email \
  --notification-endpoint you@example.com

Step 2: Create an EventBridge Rule

The pattern below matches both FAILED and TIMEOUT for all Glue jobs. Add a “jobName” filter inside the detail block to scope it to a specific job.

{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": { "state": ["FAILED", "TIMEOUT"] }
}

Save the pattern as glue-failure-pattern.json, then create the rule:

aws events put-rule \
  --name "GlueJobFailureRule" \
  --event-pattern file://glue-failure-pattern.json \
  --state ENABLED
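
For the job-scoped variant mentioned above, the following sketch generates the pattern with an optional jobName filter. The helper name glue_failure_pattern is hypothetical; the output is plain JSON that can be saved to the pattern file.

```python
import json


def glue_failure_pattern(job_names=None, states=("FAILED", "TIMEOUT")):
    """Build an EventBridge pattern for Glue job failures, optionally per job."""
    detail = {"state": list(states)}
    if job_names:
        # EventBridge matches when the event's jobName equals any listed value.
        detail["jobName"] = list(job_names)
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": detail,
    }


pattern = glue_failure_pattern(job_names=["my-etl-job"])
print(json.dumps(pattern, indent=2))
```

Omitting job_names reproduces the account-wide pattern shown earlier.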

Step 3: Add SNS as a Target

aws events put-targets \
  --rule GlueJobFailureRule \
  --targets "Id=1,Arn=arn:aws:sns:us-east-1:123456789:GlueJobFailures"

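EventBridge delivers the matched event to SNS as JSON. Its detail block carries the job name, run ID, final state, and error message, which is enough for the on-call engineer to find the run in the Glue console and start investigating. If you wire up the target from the CLI as shown here, the topic's access policy must also allow events.amazonaws.com to call sns:Publish; otherwise the rule matches but no notification arrives.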
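
One step that is easy to miss when wiring this up outside the console: EventBridge can only publish to the topic if the topic's access policy allows it. The sketch below builds such a statement. The helper name and the Sid are illustrative; the principal events.amazonaws.com and the sns:Publish action are the standard ones for this integration.

```python
import json


def eventbridge_publish_policy(topic_arn):
    """Build an SNS access policy letting EventBridge publish to one topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEventBridgeToPublish",  # illustrative name
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sns:Publish",
                "Resource": topic_arn,
            }
        ],
    }


policy = eventbridge_publish_policy(
    "arn:aws:sns:us-east-1:123456789:GlueJobFailures"
)
print(json.dumps(policy, indent=2))
```

Setting a topic's Policy attribute replaces the existing policy, so in practice this statement would be merged into whatever policy the topic already has rather than applied on its own.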

5. Creating Custom Success/Failure Metrics

Built-in Glue metrics do not give you Lambda-style counters (success count, failure count, error rate). You can add these with a small Lambda function that listens on EventBridge and writes custom CloudWatch metrics, covering all jobs automatically including new ones created in the future.

How It Works

  1. A Glue job state change fires an EventBridge event.
  2. An EventBridge rule routes it to a Lambda function.
  3. Lambda extracts job name and state, then calls cloudwatch.put_metric_data with a custom namespace.
  4. You create CloudWatch alarms on those custom metrics like any other.

Lambda Function (Python)

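import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Glue Job State Change events carry the job name and final state in "detail".
    detail = event["detail"]
    state = detail["state"]
    dimensions = [{"Name": "JobName", "Value": detail["jobName"]}]
    # Emit both counters on every event; TIMEOUT is counted as a failure here.
    cloudwatch.put_metric_data(
        Namespace="GlueCustomMetrics",
        MetricData=[
            {"MetricName": "Success", "Dimensions": dimensions, "Unit": "Count",
             "Value": 1 if state == "SUCCEEDED" else 0},
            {"MetricName": "Failure", "Dimensions": dimensions, "Unit": "Count",
             "Value": 1 if state in ("FAILED", "TIMEOUT") else 0},
        ],
    )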
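
Once the custom counters exist, alarming on them works like any other metric. As a sketch, the helper below builds put_metric_alarm arguments for the Failure counter in the GlueCustomMetrics namespace. The function name, alarm name, and five-minute period are assumptions chosen for illustration.

```python
def failure_alarm_params(job_name, topic_arn, period_seconds=300):
    """Build kwargs for CloudWatch put_metric_alarm on the custom Failure metric."""
    return {
        "AlarmName": f"GlueJobFailure-{job_name}",
        "Namespace": "GlueCustomMetrics",
        "MetricName": "Failure",
        "Dimensions": [{"Name": "JobName", "Value": job_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No events means no failures, so missing data should not alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


alarm = failure_alarm_params(
    "my-etl-job", "arn:aws:sns:us-east-1:123456789:GlueJobFailures"
)
print(alarm["AlarmName"])
```

The dictionary could then be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm). Treating missing data as not breaching matters because the metric only receives data points when a job changes state.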

6. Using Job Run Insights for Root Cause Analysis

Job Run Insights (AWS Glue 2.0+) performs automatic root cause analysis on failed jobs. When a job fails, it creates two log streams in CloudWatch Logs:

  • <job-run-id>-job-insights-rca-driver: Exception analysis with the script line number, last Spark action before failure, and time-ordered executor events.
  • <job-run-id>-job-insights-rule-driver: Rule-based insights with root cause analysis and recommended fixes (for example, tuning shuffle partition count).

Enabling Job Run Insights

In AWS Glue Studio, open your job, go to the Job details tab, and check “Generate job insights.” Using the CLI:

aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--enable-job-insights": "true"}'
📝

Note: Job Run Insights streams are created only when a job fails. A successful run generates no insight streams, which keeps log storage costs low. This feature requires AWS Glue version 2.0 or above.
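
To read the insight streams for a failed run, you need the stream names derived from the run ID. The sketch below builds get_log_events arguments for both streams. It assumes the streams live in the default continuous logging group /aws-glue/jobs/logs-v2 and that the rule-based stream uses the suffix job-insights-rule-driver; both are worth confirming against your own log groups, and the helper name is hypothetical.

```python
def insight_log_requests(job_run_id, log_group="/aws-glue/jobs/logs-v2"):
    """Build CloudWatch Logs get_log_events kwargs for both insight streams."""
    suffixes = {
        "exception_analysis": "job-insights-rca-driver",
        "rule_based_insights": "job-insights-rule-driver",  # assumed suffix
    }
    return {
        label: {
            "logGroupName": log_group,  # assumed default group
            "logStreamName": f"{job_run_id}-{suffix}",
            "startFromHead": True,      # read events oldest first
        }
        for label, suffix in suffixes.items()
    }


log_requests = insight_log_requests("jr_example")
print(log_requests["exception_analysis"]["logStreamName"])
```

Each entry could be passed as boto3.client("logs").get_log_events(**kwargs).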

7. Enabling Continuous Logging

Without continuous logging, CloudWatch only receives Glue output after the job finishes. For long-running jobs, this means you are blind during execution. Enable it with the following job update:

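# Caution: update-job overwrites the whole job definition, so include the job's
# existing Role, Command, and other settings in --job-update alongside these arguments.
aws glue update-job --job-name my-etl-job \
  --job-update '{
    "DefaultArguments": {
      "--enable-continuous-cloudwatch-log": "true",
      "--enable-continuous-log-filter": "true",
      "--continuous-log-logGroup": "/aws-glue/jobs/my-etl-job"
    }
  }'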

To tail logs from a currently running job:

# Requires AWS CLI v2. The log group is the one set in --continuous-log-logGroup;
# add --log-stream-name-prefix <job-run-id> to follow a single run.
aws logs tail /aws-glue/jobs/my-etl-job --follow

8. Monitoring AWS Glue Workflows

If you chain jobs and crawlers using AWS Glue Workflows, monitoring individual jobs is not enough. The jobs and crawlers inside a workflow still publish their own state change events to EventBridge, but those events do not show whether the run as a whole finished cleanly. For the workflow-level view, poll the run status programmatically:

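import boto3

glue = boto3.client("glue")

# RunId is the workflow run ID returned by start_workflow_run.
response = glue.get_workflow_run(Name="my-workflow", RunId="wr_abc123")
run = response["Run"]
print(run["Status"])      # RUNNING, COMPLETED, STOPPING, STOPPED, or ERROR
print(run["Statistics"])  # per-action counts such as FailedActions and TimeoutActions
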
💡

Tip: Monitor at the workflow level in addition to individual jobs. A single upstream job timing out can silently block all downstream jobs or pass a stale schema to dependent transforms.
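
A workflow run can report a finished status even when one of its actions failed, so it helps to look at the per-action statistics as well. The sketch below interprets the Run object returned by get_workflow_run. The function name and the sample values are invented; the field names follow the Glue API's workflow run statistics.

```python
def workflow_run_problem(run):
    """Return a short description of what went wrong, or None if nothing did."""
    status = run.get("Status")
    if status in ("ERROR", "STOPPED"):
        return f"workflow status {status}"
    stats = run.get("Statistics", {})
    counters = ("FailedActions", "TimeoutActions", "StoppedActions")
    bad = {name: stats.get(name, 0) for name in counters if stats.get(name, 0)}
    if bad:
        return ", ".join(f"{name}={count}" for name, count in bad.items())
    return None


# Made-up example: the run completed, but one of its five actions failed.
sample_run = {
    "Name": "my-workflow",
    "Status": "COMPLETED",
    "Statistics": {
        "TotalActions": 5,
        "SucceededActions": 4,
        "FailedActions": 1,
        "TimeoutActions": 0,
        "StoppedActions": 0,
    },
}
print(workflow_run_problem(sample_run))  # FailedActions=1
```

A scheduled Lambda function could run this check against recent workflow runs and publish to the same SNS topic used for job failures.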

9. Common Failure Types and How to Diagnose Them

The table below maps error signatures in CloudWatch Logs to root causes and fixes.

| Error in Logs | Root Cause | Fix |
| --- | --- | --- |
| Container killed by YARN for exceeding memory limits | Partition too large for available memory | Upgrade to G.2X worker or add repartition() |
| java.lang.OutOfMemoryError: Java heap space | Executor out of JVM heap | Use G.2X workers or reduce partition size |
| Job timed out | Exceeded Timeout setting | Increase Timeout or enable Auto Scaling |
| Access Denied | IAM role missing S3 or target permissions | Update the Glue job IAM execution role |
| ModuleNotFoundError | Python library not bundled with job | Add to --additional-python-modules argument |
| AnalysisException: Path does not exist | Input S3 path empty or wrong | Verify path; check upstream job completed |
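
The same mapping can be applied automatically to log lines, for example inside a Lambda function that enriches failure alerts. This is a minimal sketch using the signatures from the table; the function name diagnose and the sample log line are illustrative, and simple substring matching will miss errors that are worded differently.

```python
# (signature, root cause) pairs taken from the table above.
ERROR_SIGNATURES = [
    ("Container killed by YARN for exceeding memory limits",
     "Partition too large for available memory"),
    ("java.lang.OutOfMemoryError: Java heap space", "Executor out of JVM heap"),
    ("Job timed out", "Exceeded Timeout setting"),
    ("Access Denied", "IAM role missing S3 or target permissions"),
    ("ModuleNotFoundError", "Python library not bundled with job"),
    ("Path does not exist", "Input S3 path empty or wrong"),
]


def diagnose(log_line):
    """Return the likely root cause for a log line, or None if unrecognized."""
    lowered = log_line.lower()
    for signature, cause in ERROR_SIGNATURES:
        if signature.lower() in lowered:
            return cause
    return None


line = "ModuleNotFoundError: No module named 'pyarrow'"
print(diagnose(line))  # Python library not bundled with job
```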

10. Monitoring Best Practices

  • Always enable continuous logging on production jobs. The overhead is minimal and the debugging value is substantial.
  • Tag all Glue jobs consistently so CloudWatch metrics are filterable by environment (dev/staging/prod) or team.
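  • Set job-level timeout values. Without one, a hung job keeps running and accruing DPU charges until the service default timeout ends it, which has historically been 2,880 minutes (48 hours) for batch jobs.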
  • Enable Auto Scaling on Glue 3.0 and 4.0 jobs to reduce cost and prevent over-provisioning from masking performance regressions.
  • Use composite CloudWatch alarms to reduce alert fatigue: one notification when any job in a pipeline fails rather than N separate alerts.
  • Store job run history in DynamoDB for weekly reporting on failure rates, retry counts, and longest-running jobs. This gives you trend data that CloudWatch metrics alone do not provide.
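
As an illustration of the reporting idea in the last point, the sketch below summarizes a list of stored run records. The record fields (JobName, JobRunState, ExecutionTime in seconds) mirror what the Glue get_job_runs API returns, but the helper name summarize_runs and the sample records are invented, and how the records are read from storage is left out.

```python
def summarize_runs(runs):
    """Compute failure rate and the longest run from stored job run records."""
    total = len(runs)
    failed = [r for r in runs if r["JobRunState"] in ("FAILED", "TIMEOUT")]
    longest = max(runs, key=lambda r: r["ExecutionTime"], default=None)
    return {
        "total_runs": total,
        "failure_rate": len(failed) / total if total else 0.0,
        "longest_job": longest["JobName"] if longest else None,
        "longest_seconds": longest["ExecutionTime"] if longest else 0,
    }


# Made-up records for one week.
runs = [
    {"JobName": "orders-etl", "JobRunState": "SUCCEEDED", "ExecutionTime": 840},
    {"JobName": "orders-etl", "JobRunState": "FAILED", "ExecutionTime": 120},
    {"JobName": "clicks-etl", "JobRunState": "SUCCEEDED", "ExecutionTime": 2700},
    {"JobName": "clicks-etl", "JobRunState": "TIMEOUT", "ExecutionTime": 3600},
]
report = summarize_runs(runs)
print(report)
```

Here two of four runs failed or timed out, giving a failure rate of 0.5, and the longest run belongs to clicks-etl at 3,600 seconds.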

AWS Glue Monitoring with CubeAPM

CubeAPM gives data engineering teams a single pane of glass for AWS Glue monitoring: correlate job failures with slow upstream queries, visualize duration trends across runs, set intelligent alert thresholds, and cut MTTR from hours to minutes.

With CubeAPM you get:

  • Out-of-the-box AWS Glue job failure and duration dashboards
  • Anomaly detection on job duration to catch slow regressions before they become incidents
  • Deep integration with CloudWatch Logs for trace-level failure analysis
  • Alert routing to Slack, PagerDuty, and email with enriched context

Get full AWS Glue observability in under 15 minutes.

Book a demo today →


Disclaimer: This article contains pricing estimates based on publicly available AWS CloudWatch Logs rates as of May 2026. Actual costs may vary by AWS region, account type, and usage patterns. Always verify current pricing before making infrastructure decisions.

FAQs

1. Does AWS Glue have a native “job failed” metric in CloudWatch?

No. FAILED and TIMEOUT are EventBridge events, not CloudWatch metrics. To get failure alerts, create an EventBridge rule that routes to an SNS topic, or use a Lambda function to write custom CloudWatch metrics from those events.

2. How do I track how long an AWS Glue job takes to run?

Use the glue.driver.aggregate.elapsedTime CloudWatch metric. It measures actual ETL runtime in milliseconds, excluding bootstrap time. Enable job metrics first by adding --enable-metrics to your job’s default arguments, then set a CloudWatch alarm on this metric.

3. What is the difference between continuous logging and default logging in AWS Glue?

Default logging sends output to CloudWatch only after a job finishes. Continuous logging streams logs in real time while the job runs. For production jobs, always enable continuous logging so you can inspect what is happening without waiting for the job to complete or crash.

4. What causes “Container killed by YARN for exceeding memory limits”?

A Spark executor ran out of memory, usually due to data skew or reading too much data at once. Quick fix: upgrade to a G.2X worker type. Better fix: repartition your data before heavy transforms using df.repartition(n) to distribute load evenly.

5. How do I get notified when an AWS Glue job fails?

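Create an EventBridge rule that matches Glue job state changes for FAILED and TIMEOUT, then route it to an SNS topic. Subscribe your team’s email address to that topic directly; to reach Slack, forward the notification through a small Lambda function or AWS Chatbot, since Slack webhooks do not accept the SNS message format as is. No changes to your job scripts are needed, and the rule covers every job in your account automatically.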
