AWS Glue is AWS’s serverless ETL service built on Apache Spark. It runs managed data pipelines that move and transform data between sources such as Amazon S3, Amazon Redshift, and JDBC databases. When a Glue job fails silently or runs twice as long as expected, downstream reports break, SLAs slip, and engineers spend hours in CloudWatch logs trying to piece together what went wrong.
This guide covers the practical steps to set up reliable AWS Glue monitoring: detecting job failures immediately, tracking duration anomalies before they become incidents, and using structured logs and job run insights to find root causes fast.
Key Takeaways
- AWS Glue sends metrics to CloudWatch every 30 seconds, including elapsed time, bytes read, and memory usage.
- Job state changes (SUCCEEDED, FAILED, TIMEOUT, STOPPED) are published as EventBridge events, enabling automated SNS alerts.
- glue.driver.aggregate.elapsedTime is the most reliable metric for tracking and alarming on job duration.
- Enabling continuous logging gives you real-time access to job logs in CloudWatch Logs under /aws-glue/jobs/logs-v2/.
- Job Run Insights (Glue 2.0+) surfaces failure root causes, including the exact script line number and recommended fixes.
- AWS Glue does not publish a native “job failed” CloudWatch metric. Failure detection requires EventBridge or a custom Lambda-based solution.
1. How AWS Glue Monitoring Works
AWS Glue surfaces observability data through three distinct layers.
CloudWatch Metrics
AWS Glue sends metrics to CloudWatch every 30 seconds while a job is running. Metrics live in the Glue namespace and can be viewed in the AWS Glue console under “Job run monitoring,” or directly in the CloudWatch console. The dashboards aggregate 30-second samples into per-minute values using the SUM statistic.
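Each metric carries three dimensions:
- JobName: the name of your AWS Glue job
- JobRunId: the specific run ID, or ALL to aggregate across all runs
- Type: count for aggregate values such as elapsedTime, or gauge for point-in-time values such as JVM heap usage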
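To make the namespace and dimensions concrete, here is a minimal sketch of reading glue.driver.aggregate.elapsedTime with boto3. The helper names (elapsed_time_query, fetch_elapsed_time), the 24-hour window, and the 5-minute period are illustrative assumptions, and the Type dimension value of count follows the Glue metrics documentation for aggregate metrics.

```python
from datetime import datetime, timedelta, timezone


def elapsed_time_query(job_name, hours=24, now=None):
    """Build the keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.elapsedTime",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},  # aggregate across all runs
            {"Name": "Type", "Value": "count"},    # aggregate metrics use "count"
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # assumed 5-minute buckets to stay well under the datapoint limit
        "Statistics": ["Sum"],
    }


def fetch_elapsed_time(job_name, hours=24):
    """Return datapoints sorted by time. Requires AWS credentials and boto3."""
    import boto3  # imported lazily so the query builder works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**elapsed_time_query(job_name, hours))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Calling fetch_elapsed_time("my-etl-job") returns nothing until job metrics are enabled on the job, since Glue does not publish them by default.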
CloudWatch Logs
Each job run writes output and error logs to CloudWatch Logs. By default, logs appear only after the job completes. Enabling continuous logging streams them in real time, which is essential for debugging long-running or failing jobs.
Default log groups:
- /aws-glue/jobs/output – standard driver output
- /aws-glue/jobs/error – error and exception logs
- /aws-glue/jobs/logs-v2/ – used when continuous logging is enabled
EventBridge Job State Events
Every time a Glue job changes state, the service publishes an event to the default Amazon EventBridge bus. Possible states are SUCCEEDED, FAILED, TIMEOUT, and STOPPED. These events are the backbone of automated alerting: EventBridge rules match on specific states and route them to SNS topics, Lambda functions, or other targets.
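As a rough illustration of what such a rule evaluates, the sketch below shows an abbreviated job state change event and a Python check equivalent to matching on failure states. The event values (job name, run ID, message text) are made up for illustration, and needs_alert is a hypothetical helper, not part of any AWS SDK.

```python
# Abbreviated, illustrative event; real events also carry id, time, region, and account.
SAMPLE_EVENT = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "my-etl-job",
        "severity": "ERROR",
        "state": "FAILED",
        "jobRunId": "jr_abc123",
        "message": "example error message (illustrative)",
    },
}

# States this guide treats as failures worth alerting on.
FAILURE_STATES = {"FAILED", "TIMEOUT"}


def needs_alert(event):
    """Return True when the event is a Glue job state change into a failure state."""
    if event.get("source") != "aws.glue":
        return False
    if event.get("detail-type") != "Glue Job State Change":
        return False
    return event.get("detail", {}).get("state") in FAILURE_STATES
```

In production the matching is done by the EventBridge rule itself; a check like this is mainly useful inside a Lambda target that receives more states than it alerts on.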
2. The Key AWS Glue Metrics to Monitor
The table below shows the metrics that matter most for failure detection and duration tracking.
| Metric Name | What It Measures | Primary Use Case |
| glue.driver.aggregate.elapsedTime | ETL elapsed time in milliseconds (excludes bootstrap) | Duration tracking and SLA alarms |
| glue.driver.aggregate.numFailedTasks | Count of failed Spark tasks | Catching data-level failures before full job crash |
| glue.driver.aggregate.bytesRead | Bytes read from all data sources | Job progress and bookmark issues |
| glue.ALL.jvm.heap.used | JVM heap memory across all executors | Memory pressure detection |
| glue.driver.BlockManager.disk.diskSpaceUsed_MB | Disk used by Spark block manager | Disk overflow detection |
| glue.driver.aggregate.numCompletedStages | Number of completed Spark stages | Job progress tracking |
Important: AWS Glue does not publish a native “job failed” CloudWatch metric. The FAILED state is an EventBridge event, not a metric. This is why many teams build the custom Lambda-based solution covered in Section 5.
3. Monitoring Job Duration with CloudWatch Alarms
The best metric for duration is glue.driver.aggregate.elapsedTime. It measures ETL execution time in milliseconds and excludes the Glue bootstrap period, making it a clean signal of your script’s actual performance.
Step 1: Enable Job Metrics
Note: Job metrics are not enabled by default and must be turned on per job. Enabling CloudWatch custom metrics incurs additional charges. See Amazon CloudWatch Pricing for details.
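In AWS Glue Studio, open your job, go to the Job details tab, and enable “Job metrics.” With the AWS CLI, set the --enable-metrics default argument. Note that update-job overwrites the existing job definition, so a real --job-update payload must also carry the job's current Role, Command, and any other settings you want to keep. The abbreviated form below shows only the relevant argument:
aws glue update-job \
  --job-name my-etl-job \
  --job-update '{"DefaultArguments": {"--enable-metrics": ""}}'
Step 2: Create a Duration Alarm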
The command below creates a CloudWatch alarm that fires when a job exceeds 3 hours (10,800,000 milliseconds). Glue publishes this metric with JobName, JobRunId, and Type dimensions, and a CloudWatch alarm must specify all of them to match the metric:
aws cloudwatch put-metric-alarm \
  --alarm-name "GlueJobDurationExceeded-my-etl-job" \
  --namespace Glue \
  --metric-name "glue.driver.aggregate.elapsedTime" \
  --dimensions Name=JobName,Value=my-etl-job Name=JobRunId,Value=ALL Name=Type,Value=count \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 10800000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789:GlueAlerts
Tip: Set your duration alarm threshold at 150% of the average historical run time. This catches meaningful regressions without generating too many false positives.
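The 150% rule is simple arithmetic, sketched below under the assumption that you already have a list of past run durations in milliseconds. The function name and the sample durations are hypothetical.

```python
from statistics import mean


def duration_threshold_ms(history_ms, factor=1.5):
    """Return an alarm threshold as a multiple of the average historical duration."""
    if not history_ms:
        raise ValueError("need at least one historical duration")
    return round(mean(history_ms) * factor)


# Three past runs of 60, 70, and 50 minutes average to 60 minutes,
# so the threshold lands at 90 minutes expressed in milliseconds.
threshold = duration_threshold_ms([3_600_000, 4_200_000, 3_000_000])
```

The result can be passed as the --threshold value of the alarm. A mean is sensitive to outliers, so a job with occasional very long runs may be better served by a median or a percentile.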
4. Detecting Job Failures with EventBridge and SNS
Because Glue has no native failure metric, the standard pattern for failure alerting uses EventBridge rules. When a job enters FAILED or TIMEOUT, EventBridge routes the event to an SNS topic that sends an email, Slack message, or PagerDuty alert.
Step 1: Create an SNS Topic
aws sns create-topic --name GlueJobFailures
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789:GlueJobFailures \
  --protocol email \
  --notification-endpoint [email protected]
Step 2: Create an EventBridge Rule
The pattern below matches both FAILED and TIMEOUT for all Glue jobs. Add a “jobName” filter inside the detail block to scope it to a specific job. Save it as glue-failure-pattern.json:
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": { "state": ["FAILED", "TIMEOUT"] }
}
Then create the rule from that file:
aws events put-rule \
  --name "GlueJobFailureRule" \
  --event-pattern file://glue-failure-pattern.json \
  --state ENABLED
Step 3: Add SNS as a Target
aws events put-targets \
  --rule GlueJobFailureRule \
  --targets "Id=1,Arn=arn:aws:sns:us-east-1:123456789:GlueJobFailures"
The SNS topic's access policy must allow events.amazonaws.com to publish to it; otherwise deliveries fail with no error on the Glue side. By default the notification is the raw event JSON, which includes the job name, run ID, state, and error message, enough for the on-call engineer to start investigating.
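When the rule targets SNS directly, subscribers receive the event as JSON. If a more readable message is wanted, one option is to place a small formatter in front of SNS, for example in a Lambda target. The sketch below is one possible shape: format_alert is a hypothetical helper, and the field names assume the standard Glue job state change detail block.

```python
def format_alert(event):
    """Turn a Glue job state change event into an SNS subject and body."""
    detail = event.get("detail", {})
    job_name = detail.get("jobName", "unknown-job")
    state = detail.get("state", "UNKNOWN")
    subject = f"Glue job {job_name} {state}"
    body = "\n".join(
        [
            f"Job name: {job_name}",
            f"Run ID:   {detail.get('jobRunId', 'n/a')}",
            f"State:    {state}",
            f"Message:  {detail.get('message', 'n/a')}",
        ]
    )
    # SNS subjects must be shorter than 100 characters.
    return subject[:99], body
```

The returned pair maps onto the Subject and Message parameters of an SNS publish call. An EventBridge input transformer on the rule target can achieve a similar result without any code.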
5. Creating Custom Success/Failure Metrics
Built-in Glue metrics do not give you Lambda-style counters (success count, failure count, error rate). You can add these with a small Lambda function that listens on EventBridge and writes custom CloudWatch metrics, covering all jobs automatically including new ones created in the future.
How It Works
- A Glue job state change fires an EventBridge event.
- An EventBridge rule routes it to a Lambda function.
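- Lambda extracts the job name and state, then emits custom metrics under a custom namespace, either by calling cloudwatch.put_metric_data or, as in the example below, through the CloudWatch Embedded Metric Format library.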
- You create CloudWatch alarms on those custom metrics like any other.
Lambda Function (Python)
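from aws_embedded_metrics import metric_scope

# Count timeouts as failures, in line with the EventBridge alerting rule.
FAILURE_STATES = ("FAILED", "TIMEOUT")

# A plain (non-async) handler: the Lambda Python runtime does not await coroutines.
@metric_scope
def handler(event, context, metrics):
    # Glue job state change events carry the job name and state in "detail".
    job_name = event["detail"]["jobName"]
    state = event["detail"]["state"]
    metrics.set_namespace("GlueCustomMetrics")
    metrics.set_dimensions({"JobName": job_name})
    # Emit 1 or 0 per outcome so the Sum statistic yields counts per job.
    metrics.put_metric("Success", 1 if state == "SUCCEEDED" else 0, "Count")
    metrics.put_metric("Failure", 1 if state in FAILURE_STATES else 0, "Count")
6. Using Job Run Insights for Root Cause Analysis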
Job Run Insights (AWS Glue 2.0+) performs automatic root cause analysis on failed jobs. When a job fails, it creates two log streams in CloudWatch Logs:
- <job-run-id>-job-insights-rca-driver: Exception analysis with the script line number, last Spark action before failure, and time-ordered executor events.
- <job-run-id>-job-insights-rule-driver: Rule-based insights with root cause analysis and recommended fixes (for example, tuning shuffle partition count).
Enabling Job Run Insights
In AWS Glue Studio, open your job, go to the Job details tab, and check “Generate job insights.” Using the CLI:
aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--enable-job-insights": "true"}'
Note: Job Run Insights streams are created only when a job fails. A successful run generates no insight streams, which keeps log storage costs low. This feature requires AWS Glue version 2.0 or above.
7. Enabling Continuous Logging
Without continuous logging, CloudWatch only receives Glue output after the job finishes. For long-running jobs, this means you are blind during execution. Enable it by adding the following default arguments. As with any update-job call, the payload replaces the existing job definition, so merge these into the job's current settings:
aws glue update-job --job-name my-etl-job \
  --job-update '{
    "DefaultArguments": {
      "--enable-continuous-cloudwatch-log": "true",
      "--enable-continuous-log-filter": "true",
      "--continuous-log-logGroup": "/aws-glue/jobs/my-etl-job"
    }
  }'
To tail logs from a currently running job, follow the log group configured above (aws logs tail requires AWS CLI v2):
aws logs tail /aws-glue/jobs/my-etl-job --follow
8. Monitoring AWS Glue Workflows
If you chain jobs and crawlers using AWS Glue Workflows, monitoring individual jobs is not enough. Glue Workflows also publish state changes to EventBridge under the detail-type “Glue Workflow State Change.” You can also poll status programmatically:
import boto3

glue = boto3.client("glue")
# Look up a single workflow run; "wr_abc123" is a placeholder run ID.
response = glue.get_workflow_run(Name="my-workflow", RunId="wr_abc123")
print(response["Run"]["Status"])
Tip: Monitor at the workflow level in addition to individual jobs. A single upstream job timing out can silently block all downstream jobs or pass a stale schema to dependent transforms.
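Building on the single status lookup, a polling loop can wait for a workflow run to finish and report whether any of its actions failed. This is a sketch under assumptions: wait_for_workflow is a hypothetical helper, the polling interval and attempt cap are arbitrary, and the terminal statuses and Statistics keys follow the Glue GetWorkflowRun response as documented.

```python
import time

# Workflow run statuses after which no further progress is expected.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def wait_for_workflow(glue, name, run_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a workflow run until it ends; return (status, failed_action_count)."""
    for _ in range(max_polls):
        run = glue.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATUSES:
            stats = run.get("Statistics", {})
            # A COMPLETED workflow can still contain failed or timed-out actions.
            failed = stats.get("FailedActions", 0) + stats.get("TimeoutActions", 0)
            return run["Status"], failed
        sleep(poll_seconds)
    raise TimeoutError(f"workflow run {run_id} did not finish after {max_polls} polls")
```

With a real client this would be called as wait_for_workflow(boto3.client("glue"), "my-workflow", "wr_abc123"). Passing the client in keeps the function testable without AWS access.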
9. Common Failure Types and How to Diagnose Them
The table below maps error signatures in CloudWatch Logs to root causes and fixes.
| Error in Logs | Root Cause | Fix |
| Container killed by YARN for exceeding memory limits | Partition too large for available memory | Upgrade to G.2X worker or add repartition() |
| java.lang.OutOfMemoryError: Java heap space | Executor out of JVM heap | Use G.2X workers or reduce partition size |
| Job timed out | Exceeded Timeout setting | Increase Timeout or enable Auto Scaling |
| Access Denied | IAM role missing S3 or target permissions | Update the Glue job IAM execution role |
| ModuleNotFoundError | Python library not bundled with job | Add to --additional-python-modules argument |
| AnalysisException: Path does not exist | Input S3 path empty or wrong | Verify path; check upstream job completed |
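The table lends itself to a simple lookup when triaging log output in bulk. The sketch below encodes the same signatures and root causes; diagnose is a hypothetical helper, and the case-insensitive substring match is a simplifying assumption rather than a robust log parser.

```python
# (signature, root cause) pairs taken from the table above; signatures are lowercase.
SIGNATURES = [
    ("container killed by yarn for exceeding memory limits", "Partition too large for available memory"),
    ("java.lang.outofmemoryerror: java heap space", "Executor out of JVM heap"),
    ("job timed out", "Exceeded Timeout setting"),
    ("access denied", "IAM role missing S3 or target permissions"),
    ("modulenotfounderror", "Python library not bundled with job"),
    ("path does not exist", "Input S3 path empty or wrong"),
]


def diagnose(log_text):
    """Return the root causes whose signature appears in the given log text."""
    lowered = log_text.lower()
    return [cause for signature, cause in SIGNATURES if signature in lowered]
```

An empty list means none of the known signatures matched, in which case the full error log still needs a manual read.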
10. Monitoring Best Practices
- Always enable continuous logging on production jobs. The overhead is minimal and the debugging value is substantial.
- Tag all Glue jobs consistently so CloudWatch metrics are filterable by environment (dev/staging/prod) or team.
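- Set job-level timeout values. Without an explicit value, a hung job keeps running and accruing DPU charges until the default timeout expires, which can be many hours (historically 48 hours for batch jobs, so check the default for your Glue version).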
- Enable Auto Scaling on Glue 3.0 and 4.0 jobs to reduce cost and prevent over-provisioning from masking performance regressions.
- Use composite CloudWatch alarms to reduce alert fatigue: one notification when any job in a pipeline fails rather than N separate alerts.
- Store job run history in DynamoDB for weekly reporting on failure rates, retry counts, and longest-running jobs. This gives you trend data that CloudWatch metrics alone do not provide.
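As an example of the weekly reporting mentioned in the last item, the sketch below summarizes a list of job run records. It assumes records shaped like the JobRuns entries returned by the Glue get_job_runs API (JobRunState, ExecutionTime in seconds, Attempt); summarize_runs is a hypothetical helper, and where the records are stored is left open.

```python
# States counted as unsuccessful for reporting purposes (an assumption of this sketch).
FAILED_STATES = {"FAILED", "TIMEOUT", "ERROR"}


def summarize_runs(job_runs):
    """Compute failure rate, retry count, and the longest run from run records."""
    total = len(job_runs)
    failed = sum(1 for run in job_runs if run.get("JobRunState") in FAILED_STATES)
    retries = sum(1 for run in job_runs if run.get("Attempt", 0) > 0)
    longest = max(job_runs, key=lambda run: run.get("ExecutionTime", 0), default=None)
    return {
        "total_runs": total,
        "failure_rate": failed / total if total else 0.0,
        "retries": retries,
        "longest_run_id": longest.get("Id") if longest else None,
    }
```

Feeding it the records for one job over the past week gives the three figures named above; running it per job makes slow drifts in failure rate visible over time.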
AWS Glue Monitoring with CubeAPM
CubeAPM gives data engineering teams a single pane of glass for AWS Glue monitoring: correlate job failures with slow upstream queries, visualize duration trends across runs, set intelligent alert thresholds, and cut MTTR from hours to minutes.
With CubeAPM you get:
- Out-of-the-box AWS Glue job failure and duration dashboards
- Anomaly detection on job duration to catch slow regressions before they become incidents
- Deep integration with CloudWatch Logs for trace-level failure analysis
- Alert routing to Slack, PagerDuty, and email with enriched context
Get full AWS Glue observability in under 15 minutes.
Disclaimer: This article contains pricing estimates based on publicly available AWS CloudWatch Logs rates as of May 2026. Actual costs may vary by AWS region, account type, and usage patterns. Always verify current pricing before making infrastructure decisions.
FAQs
1. Does AWS Glue have a native “job failed” metric in CloudWatch?
No. FAILED and TIMEOUT are EventBridge events, not CloudWatch metrics. To get failure alerts, create an EventBridge rule that routes to an SNS topic, or use a Lambda function to write custom CloudWatch metrics from those events.
2. How do I track how long an AWS Glue job takes to run?
Use the glue.driver.aggregate.elapsedTime CloudWatch metric. It measures actual ETL runtime in milliseconds, excluding bootstrap time. Enable job metrics first by adding --enable-metrics to your job’s default arguments, then set a CloudWatch alarm on this metric.
3. What is the difference between continuous logging and default logging in AWS Glue?
Default logging sends output to CloudWatch only after a job finishes. Continuous logging streams logs in real time while the job runs. For production jobs, always enable continuous logging so you can inspect what is happening without waiting for the job to complete or crash.
4. What causes “Container killed by YARN for exceeding memory limits”?
A Spark executor ran out of memory, usually due to data skew or reading too much data at once. Quick fix: upgrade to a G.2X worker type. Better fix: repartition your data before heavy transforms using df.repartition(n) to distribute load evenly.
5. How do I get notified when an AWS Glue job fails?
Create an EventBridge rule that matches Glue job state changes for FAILED and TIMEOUT, then route it to an SNS topic. Subscribe your team’s email directly, or relay to Slack through a chat integration or a small Lambda function, since a raw Slack webhook cannot consume SNS messages as-is. No changes to your job scripts are needed, and the rule covers every job in the account and Region where it is created.





