CubeAPM

AWS Glue Monitoring: DPU Consumption and Job Cost Optimization

AWS Glue bills Spark ETL jobs at $0.44 per DPU hour, and without monitoring, teams often overprovision capacity or miss bottlenecks that drive costs up. A Glue job configured with 10 DPUs running for 15 minutes costs $1.10, but the same job scaled to 100 DPUs without justification costs $11.00 for work that may not finish any faster. The difference between right-sized and overprovisioned Glue infrastructure can mean tens of thousands of dollars per year for teams running daily pipelines.

This guide covers how AWS Glue DPU consumption works, how to monitor job metrics in CloudWatch and the Glue console, how to estimate optimal DPU capacity using maximum needed executors, and actionable strategies to reduce Glue costs without sacrificing pipeline reliability.

What Is AWS Glue DPU Consumption

AWS Glue charges based on Data Processing Units (DPUs) consumed during job execution. A DPU is a unit of compute capacity that provides 4 vCPU and 16 GB of memory. Glue ETL jobs bill per second with a 1 minute minimum per job run, calculated as:

Cost = DPUs allocated × job duration (in hours) × DPU rate

For Spark ETL jobs, the rate is $0.44 per DPU hour. For Python Shell jobs, it is $0.44 per DPU hour with smaller DPU sizes (0.0625 or 1 DPU). For Glue Flex execution class jobs, the rate drops to $0.29 per DPU hour but introduces longer startup times, making Flex unsuitable for time-sensitive pipelines.
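The billing formula above can be sketched as a small helper. This is a minimal illustration, assuming the rates quoted in this article and a 1 minute minimum per run; the function and constant names are hypothetical, and current rates should be checked against AWS pricing.

```python
import math

# Rates quoted in this article (USD per DPU hour); verify against current AWS pricing.
STANDARD_RATE = 0.44
FLEX_RATE = 0.29
MIN_BILLED_SECONDS = 60  # 1 minute minimum per job run


def glue_job_cost(dpus: float, duration_seconds: float, rate: float = STANDARD_RATE) -> float:
    """Estimate one job run: DPUs x billed duration (in hours) x DPU rate."""
    # Per-second billing, rounded up, with the 1 minute floor applied.
    billed_seconds = max(math.ceil(duration_seconds), MIN_BILLED_SECONDS)
    return dpus * (billed_seconds / 3600) * rate


if __name__ == "__main__":
    print(f"${glue_job_cost(10, 6 * 60):.2f}")             # 10 DPUs for 6 minutes -> $0.44
    print(f"${glue_job_cost(10, 6 * 60, FLEX_RATE):.2f}")  # same run on Flex -> $0.29
```

Passing the rate as a parameter keeps the same helper usable for Standard, Flex, and Python Shell estimates.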

AWS allocates 10 DPUs by default for each Spark job. One DPU is reserved for management (the application master), leaving 9 DPUs that run two executors each. One of those 18 executor slots hosts the Spark driver, yielding 17 executors for processing. If your job only needs 5 executors to process the data efficiently, you are paying for 12 unused executors every time the job runs.

The challenge: unless auto scaling is enabled, Glue does not scale DPUs down to match the workload. You define capacity upfront, and Glue bills for what you allocate, not what the job actually uses. This creates a constant risk of overprovisioning, where jobs are allocated far more capacity than they need, or underprovisioning, where jobs run slower than necessary because they lack the parallelism to process data efficiently.

Understanding DPU consumption is the first step toward cost optimization. The next step is measuring how many executors your jobs actually need.

How AWS Glue DPU Billing Works

AWS Glue bills DPU hours per job run, calculated by multiplying the number of DPUs allocated to a job by the duration the job runs, measured in hours. Billing starts when the job initializes and stops when it completes or fails. The minimum billable duration is 1 minute, and billing rounds up to the nearest second after the first minute.

For Spark ETL jobs, the default allocation is 10 DPUs. A job running for 6 minutes with 10 DPUs costs:

10 DPUs × 0.1 hour × $0.44 = $0.44

If you scale that job to 55 DPUs and it finishes in 3 minutes, the cost becomes:

55 DPUs × 0.05 hour × $0.44 = $1.21

The job finishes twice as fast but costs nearly three times as much. Whether this tradeoff makes sense depends on how often the job runs, whether faster completion unblocks downstream processes, and whether the job was actually bottlenecked by DPU capacity or by something else like S3 read speed or small file overhead.

AWS Glue offers two execution classes, Standard and Flex, and several worker types, including G.1X and G.2X. Standard is the default class and bills at $0.44 per DPU hour. Flex reduces the rate to $0.29 per DPU hour but adds startup latency, making it unsuitable for jobs that need to run on tight schedules. A G.2X worker provides 8 vCPU and 32 GB of memory, double the 4 vCPU and 16 GB of a G.1X worker, and suits memory intensive jobs. The DPU rate does not change, but each G.2X worker counts as 2 DPUs, so the same number of workers costs twice as much.

Glue also charges separately for Glue Data Catalog storage ($1 per 100,000 objects stored per month after the first million), Glue Crawler runs ($0.44 per DPU hour), and requests to the Data Catalog API ($1 per million requests after the first million each month). These costs are often overlooked in initial estimates but accumulate quickly for teams with large metadata footprints or frequent schema discovery jobs.

The two biggest levers for Glue cost control are choosing the right number of DPUs and reducing job run duration. Both require monitoring.

Monitoring AWS Glue DPU Usage with CloudWatch

AWS Glue publishes job metrics to CloudWatch that show DPU consumption, active executors, completed stages, and maximum needed executors during each job run. These metrics are the foundation for capacity planning because they reveal whether your job is using the DPUs you allocated or sitting idle waiting for data.

The most useful CloudWatch metrics for DPU optimization are the executor allocation metrics, in particular glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors, which the Glue console charts as maximum needed executors. It is calculated as:

Maximum needed executors = (running tasks + pending tasks) / tasks per executor

This metric shows how many executors your job would use if it had unlimited capacity. If maximum needed executors stays below the number of executors you allocated, you are overprovisioned. If it consistently exceeds allocated executors, your job is underprovisioned and would finish faster with more DPUs.
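As a rough sketch of the calculation and the comparison described above, the helper below computes maximum needed executors from task counts and classifies a job against its allocation. The function names are hypothetical, and the default of 4 tasks per executor is an assumption taken from the worked example later in this article; it looks at a single point in time, whereas the console graph shows the whole run.

```python
import math


def max_needed_executors(running_tasks: int, pending_tasks: int, tasks_per_executor: int = 4) -> int:
    """(running tasks + pending tasks) / tasks per executor, rounded up to whole executors."""
    return math.ceil((running_tasks + pending_tasks) / tasks_per_executor)


def provisioning_status(needed: int, allocated: int) -> str:
    """Compare demand with allocated executors at one point in time."""
    if needed > allocated:
        return "underprovisioned"
    if needed < allocated:
        return "overprovisioned"
    return "right-sized"


if __name__ == "__main__":
    # 17 executors busy with 4 tasks each, 360 more tasks waiting.
    needed = max_needed_executors(running_tasks=68, pending_tasks=360)
    print(needed, provisioning_status(needed, allocated=17))  # 107 underprovisioned
```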

To view these metrics in the Glue console, go to AWS Glue → ETL Jobs → Job run monitoring, select the job run, and choose View run metrics. The console displays a time series graph showing active executors, maximum needed executors, and maximum allocated executors. The horizontal red line represents the maximum allocated executors based on your DPU setting.

CloudWatch also publishes these Glue-specific metrics:

  • glue.driver.aggregate.elapsedTime — total job run duration in milliseconds
  • glue.driver.aggregate.numCompletedStages — number of Spark stages completed
  • glue.driver.aggregate.numFailedTasks — task failures indicating data skew or resource pressure
  • glue.ALL.system.cpuSystemLoad — CPU utilization across executors
  • glue.ALL.jvm.heap.usage — memory consumption per executor
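These metrics can also be pulled programmatically. The sketch below builds the arguments for CloudWatch's get_metric_statistics call and runs it with boto3. It assumes job metrics are enabled on the job, that they are published in the Glue namespace with JobName, JobRunId, and Type dimensions, and that boto3 and AWS credentials are available; the function names and the job name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def glue_metric_query(job_name: str, metric_name: str, metric_type: str = "gauge", hours: int = 3) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",  # assumed namespace for Glue job metrics
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": metric_type},  # "gauge" or "count"
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # seconds
        "Statistics": ["Average", "Maximum"],
    }


def fetch_datapoints(query: dict) -> list:
    """Run the query and return datapoints in time order (needs boto3 and credentials)."""
    import boto3  # imported lazily so the query builder works without boto3 installed

    response = boto3.client("cloudwatch").get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    query = glue_metric_query("nightly-etl", "glue.ALL.system.cpuSystemLoad")  # hypothetical job
    print(query["MetricName"], query["Period"])
```

Keeping the query builder separate from the API call makes it easy to reuse the same dimensions for several metrics and to check the request without touching AWS.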

If CPU load stays below 50% throughout the job, your job is likely I/O bound, waiting on S3 reads or writes rather than compute. Scaling DPUs will not improve runtime in that case. If heap usage repeatedly spikes to 90% or higher, the job may be hitting memory pressure and could benefit from G.2X instances or tuning Spark memory configuration.
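The two rules of thumb above can be expressed as a simple check over metric samples. This is a heuristic sketch, not a diagnosis tool: it assumes both metrics are reported as fractions between 0 and 1, treats "repeatedly" as two or more samples, and uses hypothetical names throughout.

```python
from typing import List, Sequence

CPU_IO_BOUND_THRESHOLD = 0.50   # CPU load below this for the whole run suggests I/O bound
HEAP_PRESSURE_THRESHOLD = 0.90  # heap usage at or above this counts as a spike
MIN_HEAP_SPIKES = 2             # assumption: "repeatedly" means at least two spikes


def diagnose(cpu_samples: Sequence[float], heap_samples: Sequence[float]) -> List[str]:
    """Return plain-language findings for one job run."""
    findings = []
    if cpu_samples and max(cpu_samples) < CPU_IO_BOUND_THRESHOLD:
        findings.append("likely I/O bound: more DPUs are unlikely to shorten the run")
    spikes = sum(1 for usage in heap_samples if usage >= HEAP_PRESSURE_THRESHOLD)
    if spikes >= MIN_HEAP_SPIKES:
        findings.append("memory pressure: consider G.2X workers or Spark memory tuning")
    return findings


if __name__ == "__main__":
    print(diagnose(cpu_samples=[0.20, 0.35, 0.40], heap_samples=[0.50, 0.60]))
```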

CloudWatch metrics are published at 1 minute granularity by default. For short running jobs under 5 minutes, this granularity may miss transient bottlenecks. In those cases, Glue Spark UI logs provide finer resolution but require enabling and downloading logs from S3, adding friction to the debugging process.

For teams running dozens or hundreds of Glue jobs, aggregating CloudWatch metrics across jobs into a centralized dashboard helps identify cost outliers. Infrastructure monitoring platforms can pull Glue CloudWatch metrics via AWS APIs and correlate them with other AWS service metrics like S3 request rates, DynamoDB throttles, or Lambda invocations that may affect Glue job performance.

How to Estimate Optimal DPU Capacity

Estimating the right DPU allocation starts with running the job once at default capacity (10 DPUs) and measuring maximum needed executors. If the job shows 107 maximum needed executors but only has 17 allocated executors, the underprovisioning factor is:

(107 + 1 driver) / (17 + 1 driver) = 6×

You can provision 6 times the current capacity to scale the job to maximum parallelism. Since 1 DPU is reserved for management, each remaining DPU runs 2 executors, and 1 executor slot is taken by the driver, the formula to convert executors to DPUs is:

Optimal DPUs = ((target executors + 1) / 2) + 1

For 107 executors:

((107 + 1) / 2) + 1 = 55 DPUs

Running the job again with 55 DPUs shows whether the estimate was correct. If maximum needed executors now stays at or below 107, the job is right-sized. If it still exceeds allocated capacity, there may be data skew or small file overhead creating more tasks than expected.
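The sizing arithmetic in this section can be sketched as follows. It assumes the layout described in this article (2 executors per DPU, 1 DPU reserved for management, 1 executor slot taken by the driver); the function names are hypothetical.

```python
import math

EXECUTORS_PER_DPU = 2  # layout described in this article


def executors_for_dpus(dpus: int) -> int:
    """Executors available for processing at a given DPU setting."""
    return (dpus - 1) * EXECUTORS_PER_DPU - 1


def dpus_for_executors(target_executors: int) -> int:
    """Invert the layout: ((target executors + 1) / 2) + 1, rounded up to whole DPUs."""
    return math.ceil((target_executors + 1) / EXECUTORS_PER_DPU) + 1


def underprovisioning_factor(max_needed: int, allocated: int) -> float:
    """(max needed + 1 driver) / (allocated + 1 driver)."""
    return (max_needed + 1) / (allocated + 1)


if __name__ == "__main__":
    print(executors_for_dpus(10))             # 17
    print(underprovisioning_factor(107, 17))  # 6.0
    print(dpus_for_executors(107))            # 55
```

Rounding up in dpus_for_executors errs toward slightly more capacity when the target executor count is even.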

AWS documents a real example where a Glue Spark job reading 428 gzipped JSON files from S3, applying field mappings, and writing Parquet back to S3 ran for 6 minutes at 10 DPUs. Maximum needed executors peaked at 107, which is the 428 input files divided by the 4 Spark tasks each executor runs in parallel. Scaling to 55 DPUs reduced runtime to under 3 minutes. Scaling further to 100 DPUs did not improve runtime because the job never needed more than 107 executors, confirming that 55 DPUs was optimal.

The rule: scale DPUs until maximum needed executors equals or stays below maximum allocated executors. Beyond that point, additional DPUs sit idle and waste money.

Three factors limit scaling efficiency:

  1. Number of input partitions — If your input is only 50 unsplittable S3 objects (gzipped files, for example), the read stage will never run more than 50 tasks at once no matter how many DPUs you allocate. Splitting the data into more files or repartitioning it after the read can increase parallelism.
  2. Data skew — If 10% of partitions contain 80% of the data, a few executors do most of the work while others sit idle. Salting keys or bucketing heavily skewed columns spreads work more evenly.
  3. I/O bottlenecks — If S3 read throughput is the limiting factor, adding more executors simply means more executors waiting on S3. Enabling S3 Transfer Acceleration, increasing spark.sql.files.maxPartitionBytes, or moving hot data to a faster storage tier can help.

Right-sizing DPUs is not a one time task. As data volumes grow, partition counts change, or transformations become more complex, the optimal DPU count shifts. Teams running production Glue jobs should review capacity quarterly or after significant pipeline changes.

AWS Glue Job Metrics and Performance Bottlenecks

AWS Glue surfaces several job-level metrics beyond DPU utilization that indicate where jobs spend time and what limits performance. These metrics help diagnose whether a slow job needs more DPUs, better S3 configuration, or Spark tuning.

Metrics to track for every Glue job:

  • Job run duration — Total wall clock time from start to finish. Compare across runs to detect regressions.
  • Number of completed stages — Spark breaks jobs into stages based on shuffle boundaries. More stages often means more opportunities for parallelism but also more overhead.
  • Number of failed tasks — Task failures usually indicate data corruption, out of memory errors, or transient S3 failures. A few retries are normal. Dozens suggest a configuration problem.
  • S3 bytes read and written — Shows data movement. If bytes written far exceed bytes read, your transformations may be duplicating data unintentionally.
  • Shuffle read and write bytes — High shuffle indicates expensive operations like joins or group-bys. Broadcast joins can eliminate shuffles for small dimension tables.

If a job spends most of its time in a single stage with high shuffle, that stage is the bottleneck. Investigate whether the operation can be rewritten, whether partition keys are skewed, or whether increasing spark.sql.shuffle.partitions spreads the load better.

If a job shows low CPU utilization and low shuffle but long duration, it is likely I/O bound. Check S3 request metrics in CloudWatch for throttling (HTTP 503 responses) or high latency. Glue jobs share the same S3 request rate limits as other AWS services, and jobs reading thousands of small files can hit these limits. Consolidating small files into larger objects using Glue’s groupFiles option or preprocessing data with a separate compaction job reduces S3 API calls and improves throughput.

Memory pressure shows up as task failures with “OutOfMemoryError: Java heap space” messages in Glue logs. Increasing executor memory by switching to G.2X workers, or increasing spark.sql.shuffle.partitions so that each task handles a smaller partition, can resolve this. But scaling memory alone is expensive. A better approach is profiling which transformation consumes memory using Spark UI and optimizing that step, perhaps by filtering data earlier in the pipeline or switching from a full outer join to a left join.

Monitoring these metrics consistently across all Glue jobs requires centralized observability. Teams running 20 or more Glue jobs often set up dashboards that aggregate job metrics, highlight outliers, and alert on cost spikes or duration regressions. CubeAPM can ingest AWS Glue CloudWatch metrics via OpenTelemetry and correlate them with application traces, allowing teams to see whether a slow Glue job is caused by upstream API delays, database query latency, or Glue configuration issues.

Strategies to Reduce AWS Glue Costs

Optimizing DPU allocation is the first step, but sustained cost reduction requires addressing job frequency, execution class selection, data layout, and overall pipeline architecture. Here are eight strategies that reduce Glue costs without degrading pipeline reliability.

1. Use Glue Flex for non-critical jobs

Flex execution reduces the DPU rate from $0.44 to $0.29 per DPU hour but adds startup latency of several minutes. Jobs that run nightly or on a relaxed schedule can absorb this delay. Jobs that need to complete within 5 minutes or less should stay on Standard execution.

Example: A daily aggregation job running at 10 DPUs for 30 minutes costs $2.20 on Standard, $1.45 on Flex. Over a month, that is $66 vs. $43.50, a 34% reduction.
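The comparison above can be reproduced with a few lines of arithmetic. This is a sketch assuming 30 runs per month and the rates quoted in this article; the function name is hypothetical.

```python
STANDARD_RATE = 0.44  # USD per DPU hour, as quoted in this article
FLEX_RATE = 0.29


def monthly_cost(dpus: float, minutes_per_run: float, rate: float, runs_per_month: int = 30) -> float:
    """Monthly cost of a job that runs once per day."""
    return dpus * (minutes_per_run / 60) * rate * runs_per_month


if __name__ == "__main__":
    standard = monthly_cost(10, 30, STANDARD_RATE)
    flex = monthly_cost(10, 30, FLEX_RATE)
    savings_pct = (standard - flex) / standard * 100
    print(f"${standard:.2f} vs ${flex:.2f}, {savings_pct:.0f}% lower")  # $66.00 vs $43.50, 34% lower
```

Because only the rate changes, the percentage saving is the same for any job size; what varies is whether the job can tolerate the slower start.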

2. Enable auto scaling

Glue auto scaling adjusts the number of workers dynamically based on workload. If a job needs 50 executors at peak but only 10 during the final stages, auto scaling deallocates unused workers, reducing billed DPU hours. Auto scaling works best for jobs with variable workloads, like processing event streams where volume fluctuates by hour.

Auto scaling is enabled per job in the Glue console under Job details → Advanced properties → Auto Scaling. Set the maximum number of workers, and Glue adds and removes workers up to that ceiling based on pending tasks.
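For jobs managed through code rather than the console, the same setting can be expressed as a job definition. The sketch below builds arguments for boto3's create_job call, assuming the --enable-auto-scaling job parameter and a G.1X worker type on Glue 4.0; to my understanding NumberOfWorkers then acts as the upper bound. The function name, job name, role ARN, and script path are hypothetical placeholders.

```python
def autoscaling_job_definition(name: str, role_arn: str, script_location: str, max_workers: int) -> dict:
    """Build keyword arguments for glue.create_job with auto scaling switched on."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",            # assumption: a version that supports auto scaling
        "WorkerType": "G.1X",
        "NumberOfWorkers": max_workers,  # assumed to be the ceiling when auto scaling is on
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


if __name__ == "__main__":
    job = autoscaling_job_definition(
        name="nightly-etl",                                       # hypothetical
        role_arn="arn:aws:iam::123456789012:role/GlueJobRole",    # hypothetical
        script_location="s3://my-bucket/scripts/nightly_etl.py",  # hypothetical
        max_workers=50,
    )
    print(job["DefaultArguments"])
    # To create the job: boto3.client("glue").create_job(**job)
```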

3. Consolidate small files before processing

Glue creates one Spark task per input file by default. If your S3 prefix contains 10,000 small files, Glue launches 10,000 tasks, creating massive scheduling overhead that wastes executor time. Consolidating files into fewer, larger objects (ideally 128 MB to 1 GB each) reduces task count and improves throughput.

Use Glue’s groupFiles option to automatically group small files during read:

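from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Group small files within each partition into read groups of up to 128 MB.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    transformation_ctx="datasource",
    additional_options={"groupFiles": "inPartition", "groupSize": "134217728"},
)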

This tells Glue to group files within each partition up to 128 MB per group.
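To see where the groupSize value comes from and roughly how much grouping reduces the number of read groups, a back-of-the-envelope sketch follows. The helper names are hypothetical, and the estimate ignores partition boundaries, so actual group counts will differ.

```python
import math

BYTES_PER_MB = 1024 * 1024


def group_size_bytes(megabytes: int) -> str:
    """groupSize is passed as a string holding a byte count."""
    return str(megabytes * BYTES_PER_MB)


def estimated_groups(total_bytes: int, group_mb: int = 128) -> int:
    """Rough lower bound on read groups if files pack perfectly into each group."""
    return max(1, math.ceil(total_bytes / (group_mb * BYTES_PER_MB)))


if __name__ == "__main__":
    print(group_size_bytes(128))  # 134217728
    # 10,000 files of 1 MB each (hypothetical sizes) pack into about 79 groups.
    print(estimated_groups(10_000 * BYTES_PER_MB))  # 79
```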

4. Set appropriate job timeouts

By default, a Glue batch job can run for up to 48 hours (a timeout of 2,880 minutes) before Glue stops it. A misconfigured job stuck in a retry loop or waiting on a slow external API can rack up hours of DPU charges in that window. Set a timeout value under Job details → Advanced properties → Timeout to automatically kill jobs that exceed expected runtime.

Example: A job that normally finishes in 10 minutes should have a timeout of 20–30 minutes. If it runs longer, something is wrong, and killing it early prevents waste.
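One way to turn this rule of thumb into a number is to multiply typical runtime by a headroom factor and then check what a run would cost if it hit the timeout. The 2.5× headroom is an assumption chosen to land inside the 20 to 30 minute range above, the $0.44 rate is the one quoted in this article, and the function names are hypothetical.

```python
import math

STANDARD_RATE = 0.44  # USD per DPU hour, as quoted in this article


def suggested_timeout_minutes(typical_runtime_minutes: float, headroom: float = 2.5) -> int:
    """Timeout in whole minutes, the unit the Glue job Timeout setting uses."""
    return max(1, math.ceil(typical_runtime_minutes * headroom))


def worst_case_run_cost(dpus: float, timeout_minutes: float, rate: float = STANDARD_RATE) -> float:
    """Cost of a run that is only stopped by the timeout."""
    return dpus * (timeout_minutes / 60) * rate


if __name__ == "__main__":
    timeout = suggested_timeout_minutes(10)
    print(timeout)  # 25
    print(f"${worst_case_run_cost(10, timeout):.2f}")  # $1.83
```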

5. Use job bookmarks to avoid reprocessing data

Glue job bookmarks track which data has already been processed and skip it on subsequent runs. This is critical for incremental pipelines that should only process new or updated records. Without bookmarks, every run reprocesses the entire dataset, wasting DPU hours on redundant work.

Enable bookmarks under Job details → Advanced properties → Job bookmark and set it to “Enable”. Glue tracks state automatically for supported data sources such as S3 and JDBC, provided the script calls job.init() and job.commit() and sets a transformation_ctx on each source.

6. Optimize Spark configuration for your workload

Glue’s default Spark settings work for general cases but are rarely optimal. Tuning spark.sql.shuffle.partitions, spark.default.parallelism, and spark.executor.memory based on your data size and transformation complexity can reduce runtime significantly.

For jobs processing less than 10 GB, reduce shuffle partitions from the default 200 to 50 or fewer:

spark.conf.set("spark.sql.shuffle.partitions", "50")

For memory intensive transformations, increase executor memory by switching to G.2X workers or setting spark.executor.memoryOverhead.

7. Stop or delete unused Glue development endpoints and notebooks

Glue development endpoints and interactive sessions bill continuously while running. A forgotten development endpoint left on for a month at 5 DPUs costs:

5 DPUs × 730 hours × $0.44 = $1,606

Stop endpoints when not in use or delete them entirely. Use Jupyter magic %stop_session to terminate interactive sessions after debugging.

8. Monitor and alert on cost anomalies

Set up CloudWatch billing alarms on the EstimatedCharges metric in the AWS/Billing namespace, filtered to AWS Glue, or use AWS Cost Explorer to track Glue spending by job. Alert thresholds should trigger when daily or weekly costs exceed expected baselines by 20% or more, indicating a configuration change, data volume spike, or runaway job.
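The 20% rule can be prototyped offline before wiring it into an alerting system. The sketch below flags days whose cost exceeds the average of the preceding days by more than the threshold. It assumes you already export daily Glue cost figures (from Cost Explorer, for example); the 7 day baseline window, the function name, and the sample data are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def cost_anomalies(
    daily_costs: List[Tuple[str, float]],
    baseline_days: int = 7,
    threshold: float = 0.20,
) -> List[Tuple[str, float, float]]:
    """Return (day, cost, fractional increase) for days above baseline by more than threshold."""
    flagged = []
    for i in range(baseline_days, len(daily_costs)):
        # Baseline is the mean of the preceding window, excluding the day under test.
        baseline = mean(cost for _, cost in daily_costs[i - baseline_days:i])
        day, cost = daily_costs[i]
        if baseline > 0 and cost > baseline * (1 + threshold):
            flagged.append((day, cost, round(cost / baseline - 1, 2)))
    return flagged


if __name__ == "__main__":
    history = [(f"day-{n}", 10.0) for n in range(1, 8)]  # hypothetical daily costs in USD
    history += [("day-8", 11.0), ("day-9", 13.0)]
    print(cost_anomalies(history))  # only day-9 exceeds its baseline by more than 20%
```

A rolling baseline adapts to gradual data growth, so only abrupt jumps such as a runaway job or a changed DPU setting trigger the alert.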

For teams managing multiple AWS accounts or complex pipelines, centralized monitoring helps spot cost trends before they escalate. Synthetic monitoring tools can simulate Glue job runs in test environments to validate configuration changes before deploying them to production pipelines.

Monitoring AWS Glue Jobs with CubeAPM

CubeAPM provides unified monitoring for AWS Glue jobs alongside application traces, logs, and infrastructure metrics, giving teams full context when diagnosing pipeline failures or cost spikes. CubeAPM ingests AWS Glue CloudWatch metrics via OpenTelemetry and correlates them with upstream and downstream services in the data pipeline.

Key capabilities for Glue monitoring:

  • DPU utilization dashboards — Track DPU hours consumed per job, per day, and per account. Identify jobs with the highest cost and whether they are right-sized or overprovisioned.
  • Job duration trending — Visualize job run times over weeks or months. Detect regressions caused by data growth, schema changes, or configuration drift.
  • Error tracking — Capture Glue job failures with full stack traces and correlate them with S3 access errors, database connection timeouts, or API rate limits in upstream services.
  • Alerting on cost anomalies — Set alerts when a job’s DPU consumption exceeds historical averages by a defined threshold. Route alerts to Slack, PagerDuty, or email with full job run context.
  • Cross-service correlation — When a Glue job processes data from an API, CubeAPM links the job’s performance metrics to the API’s trace data, showing whether slowdowns originated in the Glue job or the service feeding it data.

CubeAPM deploys on-premises or in your own VPC, ensuring telemetry data including Glue job metadata never leaves your infrastructure. This matters for teams with data residency requirements or compliance constraints that prohibit sending pipeline telemetry to third party SaaS platforms.

CubeAPM pricing is $0.15 per GB of telemetry ingested, covering logs, traces, and metrics with unlimited retention. For a team ingesting 10 TB per month across all services, including Glue job metrics, the total cost is $1,500 per month. Datadog charges separately for infrastructure monitoring ($18 per host), logs ($0.10 per GB ingested plus $1.70 per million events indexed), and APM ($42 per host), making comparable visibility cost $4,894 per month for the same workload.

Frequently Asked Questions

How much do AWS Glue jobs cost per DPU hour?

AWS Glue Standard execution bills $0.44 per DPU hour for Spark ETL jobs. Flex execution reduces the rate to $0.29 per DPU hour but adds startup latency. Python Shell jobs use the same rate but with smaller DPU sizes.

What is a DPU in AWS Glue?

A Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. Glue allocates DPUs to jobs based on the configured capacity. One DPU is reserved for management, and the rest run Spark executors, one of which hosts the driver.

How do I monitor AWS Glue DPU consumption?

Monitor DPU usage in the AWS Glue console under Job run monitoring or via CloudWatch metrics like glue.driver.ExecutorAllocationManager.executors.numberAllExecutors and glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors. Track maximum needed executors to determine if your job is overprovisioned or underprovisioned.

How do I reduce AWS Glue costs?

Reduce costs by right-sizing DPU allocations, enabling auto scaling, using Flex execution for non-critical jobs, consolidating small files, setting job timeouts, enabling job bookmarks, and stopping unused development endpoints.

What is the maximum needed executors metric in Glue?

Maximum needed executors shows how many executors your job would use if it had unlimited capacity. It is calculated as running tasks plus pending tasks divided by tasks per executor. If this metric exceeds allocated executors, your job is underprovisioned.

How do I estimate the right number of DPUs for a Glue job?

Run the job once at default capacity and check maximum needed executors in the Glue console. Use the formula: Optimal DPUs = ((maximum needed executors + 1) / 2) + 1. Test the new allocation and adjust if needed.

Does AWS Glue auto scale DPUs automatically?

Glue does not scale DPUs automatically unless you enable auto scaling. Auto scaling adds and removes workers up to a configured maximum based on pending tasks, reducing billed DPU hours for variable workloads.

Disclaimer: The information in this article reflects the latest details available at the time of publication and may change as technologies and products evolve. AWS Glue features, pricing, and CloudWatch metrics can change over time. Always verify the latest information directly with AWS documentation before making capacity planning or cost optimization decisions.
