Azure Databricks Monitoring: Jobs, Clusters, Metrics, and Alerts

Azure Databricks is a widely used analytics and data engineering platform built on Apache Spark. As workloads grow to dozens or hundreds of scheduled jobs, knowing whether your jobs are running, how long they take, and why they fail becomes critical.

This guide walks you through the practical monitoring tools available for Azure Databricks, from built-in UIs to Azure Monitor, system tables, and the Jobs REST API. Each section answers a specific question: what to monitor, where to find it, and how to act on it.

Key Takeaways

Azure Databricks provides built-in monitoring through Spark UI, Ganglia metrics, and the Jobs UI for real-time job and cluster visibility.
Azure Monitor and Diagnostic Settings (Premium plan only) let you route logs and metrics to Log Analytics, enabling powerful KQL-based queries and alerts.
System tables such as system.lakeflow.job_run_timeline and system.billing.usage provide historical job performance and cost data that you can query with SQL.
The Databricks Jobs REST API lets you programmatically check job status, retrieve run details, and export logs for custom dashboards.
Third-party tools like Datadog and Dynatrace offer deeper observability with anomaly detection and unified monitoring across your Azure stack.
Combining native tools with custom dashboards (Databricks Lakeflow Dashboard or Power BI) provides the most complete monitoring coverage for large workspaces.

What Does Azure Databricks Monitoring Cover?

Azure Databricks monitoring spans two main areas: job monitoring (tracking scheduled or triggered pipeline runs) and cluster monitoring (tracking the compute infrastructure running those jobs).

For jobs, you typically want to know:

Whether a job run succeeded, failed, or is still running
How long it took compared to historical baselines
Which task inside a multi-task job failed and why
What the job cost in terms of DBUs (Databricks Units)

For clusters, you want to know:

Whether the cluster is up and healthy
CPU, memory, and network utilization
Driver and executor logs for debugging
Whether auto-scaling is behaving as expected

Method 1: Use the Databricks Jobs UI for Run-Level Monitoring

The simplest starting point is the Databricks Jobs UI, available at Workflows > Jobs in your workspace sidebar.

From the Jobs list view, you can see:

All scheduled and triggered jobs in your workspace
The last run status (Succeeded, Failed, Running, Skipped)
Next scheduled run time
Duration of the most recent run

Clicking into a specific job opens the Run History page, which shows every run with timestamps, duration, status, and links to Spark UI and logs. For multi-task jobs, you also get a task-level DAG view showing which task in the pipeline failed.

Limitation: The UI works well for diagnosing one job at a time. It becomes impractical when you need a workspace-wide view of dozens of jobs. Use system tables or dashboards for that.

Method 2: Monitor Cluster and Spark Metrics

Spark UI

The Spark UI is available from the cluster detail page (All-Purpose clusters) or the Job Run History page (job clusters). It provides:

Jobs tab: Status and timeline of all Spark jobs in the application
Stages tab: Breakdown of each stage and individual task metrics
Storage tab: Cached RDDs and DataFrames in memory
Executors tab: CPU and memory usage per executor
SQL tab: Query plans for DataFrame and SQL operations

Ganglia Metrics

Ganglia is a built-in metrics dashboard accessible from the Metrics tab on cluster detail pages (All-Purpose clusters) and the Job Runs page (job clusters). It shows live CPU, memory, network, and disk I/O across all nodes.

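Note: Ganglia data is ephemeral. For job clusters, metrics are only available as a static image after the cluster terminates, which makes historical analysis difficult. Ganglia is also limited to Databricks Runtime 12.2 LTS and below; later runtimes show built-in compute metrics on the same Metrics tab instead. For long-term tracking, use Azure Monitor or system tables.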

JVM-Level Debugging

If you need to go deeper than what Spark UI provides, the following JVM tools are available on cluster nodes:

jstack – Captures thread stack traces to diagnose hangs or deadlocks
jmap – Creates heap dumps for OutOfMemory debugging
jstat – Reports garbage collection and memory statistics over time

Method 3: Azure Monitor and Diagnostic Settings

Azure Monitor is the recommended approach for production-grade, long-term monitoring of Azure Databricks. It requires the Azure Databricks Premium plan.

Enable Diagnostic Settings

To enable Diagnostic Settings:

Open your Azure Databricks workspace in the Azure portal
Navigate to Monitoring > Diagnostic settings
Click “Add diagnostic setting”
Select the log categories you want (see below)
Choose “Send to Log Analytics workspace” as the destination
Save the settings

Available Log Tables in Log Analytics

Once configured, the following tables are available in your Log Analytics workspace:

Table Name	What It Contains
DatabricksClusters	Cluster create, start, terminate, edit events
DatabricksJobs	Job create, delete, reset, run now events
DatabricksNotebooks	Notebook attach, detach, create, delete events
DatabricksAccounts	Workspace-level account audit logs
DatabricksSQL	SQL warehouse query execution logs
AzureMetrics	Platform-level health and performance metrics
AzureActivity	Subscription-level events (portal actions, ARM calls)

Example KQL Query: Failed Job Runs

Run this query in Log Analytics to find all failed job runs in the past 24 hours:

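DatabricksJobs
| where TimeGenerated > ago(24h)
| where ActionName == 'runFailed'
// RequestParams and Response are stored as JSON strings, so parse them first
| extend Params = parse_json(RequestParams), Resp = parse_json(Response)
| project TimeGenerated,
          JobId = tostring(Params.jobId),
          RunId = tostring(Params.runId),
          Identity,
          ErrorMessage = tostring(Resp.errorMessage)
| order by TimeGenerated desc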


Set Up Alerts from Azure Monitor

Once logs are flowing into Log Analytics, you can create Azure Monitor Alerts to notify your team when jobs fail. Navigate to Azure Monitor > Alerts > New alert rule, choose your Log Analytics workspace as the scope, and create a custom log search query based on the examples above.

Method 4: Monitor Job Costs and Performance Using System Tables

Databricks system tables are Unity Catalog-managed tables that store historical metadata about your jobs, clusters, and costs. They are queryable via SQL from any notebook or SQL warehouse in your workspace.

Key System Tables for Job Monitoring

system.lakeflow.job_run_timeline – Start and end times of each job run, with run ID, result state, and compute IDs
system.billing.usage – DBU consumption per job run, queryable by job ID for cost attribution
system.compute.clusters – Cluster configuration snapshots over time

Requirements: You must be either a metastore admin and account admin, or have USE and SELECT permissions on the system schemas. These queries only cover jobs run on jobs compute and serverless compute. Jobs on SQL warehouses or all-purpose compute are excluded from cost attribution.

Query: Most Expensive Jobs (Last 30 Days)

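WITH list_cost_per_job AS (
  SELECT
    t1.workspace_id,
    t1.usage_metadata.job_id,
    COUNT(DISTINCT t1.usage_metadata.job_run_id) AS num_runs,
    SUM(t1.usage_quantity * t2.pricing.default) AS total_cost
  FROM system.billing.usage AS t1
  INNER JOIN system.billing.list_prices AS t2
    ON t1.cloud = t2.cloud
    AND t1.sku_name = t2.sku_name
    -- Match each usage record to the list price in effect at that time
    AND t1.usage_start_time >= t2.price_start_time
    AND (t1.usage_end_time <= t2.price_end_time OR t2.price_end_time IS NULL)
  WHERE t1.usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
    AND t1.usage_metadata.job_id IS NOT NULL
  GROUP BY 1, 2
)
SELECT * FROM list_cost_per_job
ORDER BY total_cost DESC
LIMIT 20;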

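To make the arithmetic behind the query concrete, here is a minimal Python sketch of the same aggregation: DBU quantity multiplied by list price, summed per job. The rows, DBU quantities, and prices are made-up values for illustration, and the `list_price` key is a hypothetical stand-in for `pricing.default`.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical rows shaped like the joined usage/list_prices result.
# Quantities and prices are invented for illustration only.
SAMPLE_ROWS = [
    {"job_id": "101", "job_run_id": "r1", "usage_quantity": Decimal("4.0"), "list_price": Decimal("0.30")},
    {"job_id": "101", "job_run_id": "r2", "usage_quantity": Decimal("6.0"), "list_price": Decimal("0.30")},
    {"job_id": "202", "job_run_id": "r3", "usage_quantity": Decimal("2.5"), "list_price": Decimal("0.40")},
]


def list_cost_per_job(rows):
    """Aggregate list cost and distinct run count per job_id."""
    totals = defaultdict(lambda: {"runs": set(), "total_cost": Decimal("0")})
    for row in rows:
        entry = totals[row["job_id"]]
        entry["runs"].add(row["job_run_id"])
        entry["total_cost"] += row["usage_quantity"] * row["list_price"]
    # Most expensive job first, mirroring ORDER BY total_cost DESC.
    return sorted(
        ((job_id, len(e["runs"]), e["total_cost"]) for job_id, e in totals.items()),
        key=lambda item: item[2],
        reverse=True,
    )


for job_id, num_runs, total_cost in list_cost_per_job(SAMPLE_ROWS):
    print(f"job {job_id}: {num_runs} run(s), list cost {total_cost}")
```

Keep in mind that list prices do not reflect negotiated discounts, so treat the result as a relative ranking rather than an invoice figure.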

Query: Slowest Job Runs (Last 7 Days)

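SELECT
  workspace_id,
  job_id,
  run_id,
  MAX(run_name) AS run_name,
  -- result_state is only populated on the final row of a run
  MAX(result_state) AS result_state,
  TIMESTAMPDIFF(MINUTE, MIN(period_start_time), MAX(period_end_time)) AS duration_minutes
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAYS
GROUP BY workspace_id, job_id, run_id
ORDER BY duration_minutes DESC
LIMIT 20;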

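A list of slow runs is more useful when each duration is compared with that job's own history. The sketch below shows one possible rule, assuming you have already exported durations in minutes for a job: flag a run when it exceeds a multiple of the historical median. The function name, the 1.5x factor, and the minimum history size are illustrative assumptions, not Databricks defaults.

```python
from statistics import median


def is_slow(duration_minutes, history_minutes, factor=1.5, min_history=5):
    """Return True when a run exceeds `factor` times the median of past runs."""
    if len(history_minutes) < min_history:
        return False  # too little data for a meaningful baseline
    return duration_minutes > factor * median(history_minutes)


# Hypothetical durations (minutes) of earlier runs of the same job.
history = [30, 32, 29, 31, 35, 30]

print(is_slow(33, history))  # within the normal range
print(is_slow(58, history))  # well above the baseline
```

The median is used instead of the mean so that a single past outlier does not shift the baseline.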

Lakeflow Monitoring Dashboard

Databricks provides a pre-built Lakeflow Monitoring Dashboard that uses system tables to visualize job performance, failure rates, and cost trends across your workspace. Download the dashboard JSON from the Databricks GitHub Repository and import it into your workspace.

Method 5: Use the Databricks Jobs API for Programmatic Monitoring

The Databricks Jobs REST API gives you full programmatic access to job and run metadata. This is ideal for building custom dashboards, integrating with external monitoring systems, or automating workflows.

Key API Endpoints for Monitoring

Endpoint	Method	What It Returns
/api/2.1/jobs/list	GET	All jobs in the workspace
/api/2.1/jobs/runs/list	GET	Runs for a specific job with status
/api/2.1/jobs/runs/get	GET	Details of a single run including state, start time, and duration
/api/2.1/jobs/runs/get-output	GET	Output and error logs from a completed run
/api/2.1/jobs/runs/cancel	POST	Cancel a running job
/api/2.1/jobs/runs/repair	POST	Retry only failed tasks in a run
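
The list endpoints return results one page at a time, so a dashboard that needs a job's full run history has to follow the pagination token. The sketch below assumes the response carries a `next_page_token` field that is passed back as `page_token`; `fetch_page` is a hypothetical callable wrapping the HTTP request, replaced here by a fake so the example runs without a workspace.

```python
def iter_runs(fetch_page, job_id, page_size=25):
    """Yield every run for a job by following next_page_token.

    fetch_page is a hypothetical callable: it takes the query parameters as a
    dict and returns the decoded JSON body of GET /api/2.1/jobs/runs/list.
    """
    params = {"job_id": job_id, "limit": page_size}
    while True:
        page = fetch_page(params)
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        if not token:
            break
        params = {**params, "page_token": token}


# Stand-in for the real HTTP call, keyed by the page_token it receives.
FAKE_PAGES = {
    None: {"runs": [{"run_id": 1}, {"run_id": 2}], "next_page_token": "p2"},
    "p2": {"runs": [{"run_id": 3}]},
}


def fake_fetch(params):
    return FAKE_PAGES[params.get("page_token")]


print([run["run_id"] for run in iter_runs(fake_fetch, job_id=12345)])
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.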

Example: List Failed Runs via Python

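import requests

DATABRICKS_HOST = 'https://<your-workspace>.azuredatabricks.net'
TOKEN = '<your-personal-access-token>'
JOB_ID = 12345

# runs/list has no result_state filter, so request completed runs
# and pick out the failures on the client side.
response = requests.get(
    f'{DATABRICKS_HOST}/api/2.1/jobs/runs/list',
    headers={'Authorization': f'Bearer {TOKEN}'},
    params={'job_id': JOB_ID, 'completed_only': 'true', 'limit': 25},
    timeout=30,
)
response.raise_for_status()  # surface auth or permission errors early

runs = response.json().get('runs', [])
failed_runs = [r for r in runs if r['state'].get('result_state') == 'FAILED']
for run in failed_runs:
    # start_time is reported in epoch milliseconds
    print(f"Run {run['run_id']}: {run['state']['result_state']} at {run['start_time']}")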

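The raw response reports times in epoch milliseconds, which is awkward to read in a dashboard or alert. As a rough sketch of the post-processing step, the helper below turns a list of run objects into readable rows for failed runs only. The sample payload is invented, and the helper name is hypothetical.

```python
from datetime import datetime, timezone


def summarize_failed_runs(runs):
    """Return (run_id, UTC start time, duration in minutes) for failed runs."""
    summary = []
    for run in runs:
        state = run.get("state", {})
        if state.get("result_state") != "FAILED":
            continue
        start_ms = run.get("start_time", 0)
        end_ms = run.get("end_time", 0)
        started = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
        # end_time is 0 while a run is still active, so guard against negatives.
        duration_min = max(end_ms - start_ms, 0) / 60000
        summary.append((run["run_id"], started.isoformat(), round(duration_min, 1)))
    return summary


# Invented sample shaped like the 'runs' array of a runs/list response.
SAMPLE_RUNS = [
    {"run_id": 1, "start_time": 1700000000000, "end_time": 1700000600000,
     "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
    {"run_id": 2, "start_time": 1700003600000, "end_time": 1700003900000,
     "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
]

for row in summarize_failed_runs(SAMPLE_RUNS):
    print(row)
```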

Method 6: Monitor Cluster Up/Down Status

A common requirement is knowing whether a cluster is currently running or has gone down unexpectedly. Azure Databricks does not send native webhooks for cluster state changes, so you have two practical approaches:

Option A: Poll Cluster State via API

Use the Clusters API to check cluster state on a schedule:

GET /api/2.0/clusters/get?cluster_id=<cluster-id>

# Relevant fields in response:
# state: PENDING | RUNNING | RESTARTING | RESIZING |
#        TERMINATING | TERMINATED | ERROR | UNKNOWN
# state_message: Human-readable description of current state


A cluster that unexpectedly enters the ERROR or TERMINATED state indicates a problem. You can wrap this check in a monitoring script that sends an alert to Slack or PagerDuty when the state is not the one you expect.
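
One way to structure such a script is to separate the HTTP call from the decision about whether a state deserves an alert. The sketch below is a minimal version under that assumption. `needs_alert` and its `expected_up` flag are hypothetical names, and the rule itself (treat ERROR and UNKNOWN as always alert-worthy) is a design choice rather than Databricks guidance. `fetch_cluster_state` is defined but not called, since it needs a real workspace and token.

```python
import json
import urllib.request


def needs_alert(state, expected_up=True):
    """Decide whether a cluster state should trigger an alert.

    expected_up is a hypothetical flag: set it to False for clusters that
    are allowed to be terminated, for example outside working hours.
    """
    if state in ("ERROR", "UNKNOWN"):
        return True
    if expected_up and state in ("TERMINATING", "TERMINATED"):
        return True
    return False


def fetch_cluster_state(host, token, cluster_id):
    """Call the Clusters API once and return (state, state_message)."""
    url = f"{host}/api/2.0/clusters/get?cluster_id={cluster_id}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body.get("state", "UNKNOWN"), body.get("state_message", "")


# Check the decision rule against every state the API can report.
for state in ("PENDING", "RUNNING", "RESTARTING", "RESIZING",
              "TERMINATING", "TERMINATED", "ERROR", "UNKNOWN"):
    print(f"{state}: alert={needs_alert(state)}")
```

A scheduler such as cron or an Azure Function timer could call `fetch_cluster_state`, pass the result to `needs_alert`, and include `state_message` in the notification.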

Option B: Use Azure Monitor Alerts on DatabricksClusters Logs

If Diagnostic Settings are enabled, the DatabricksClusters table in Log Analytics records terminate events. Create an alert rule that fires when an unexpected clusterTerminated event appears outside maintenance windows.

Method 7: Third-Party Monitoring Tools

Native tools cover most use cases, but third-party observability platforms add convenience, anomaly detection, and integration with the rest of your monitoring stack.

CubeAPM

CubeAPM gives teams a self-hosted, OpenTelemetry-native way to monitor Databricks workloads alongside the rest of their application stack. It can collect metrics, logs, traces, and infrastructure signals in one place, making it useful when teams want Databricks visibility without sending all observability data to a SaaS-only platform.

Datadog

Datadog offers a Databricks integration that pulls cluster metrics, job run data, and Spark application metrics via the Datadog Agent installed on cluster nodes. It supports anomaly detection, dashboards, and alert routing out of the box.

Dynatrace

Dynatrace provides workspace-level and cluster-level observability for Azure Databricks through its OneAgent and ActiveGate integration. It automatically discovers Databricks entities, traces Spark jobs, and maps dependencies.

IBM Instana

IBM Instana supports Azure Databricks monitoring through Azure Monitor integration, capturing job run metrics and cluster health into its observability pipeline alongside your other Azure services.

Method 8: Build a Custom Monitoring Dashboard

When you manage a large workspace with dozens of teams and hundreds of jobs, a custom dashboard gives you the workspace-wide view the Databricks UI cannot.

Common approaches:

Databricks AI/BI Dashboards (formerly Lakeview Dashboards): Build natively inside your workspace using system table queries. Best for Databricks-centric teams.
Azure Monitor Workbooks: Combine Log Analytics queries with Azure Metrics in a single visualization. Best for Azure-centric ops teams.
Power BI: Connect to system tables via Databricks SQL connector. Best when dashboards are shared with business stakeholders who already use Power BI.
Grafana: Use the Azure Monitor data source or Databricks Prometheus endpoint. Best for engineering teams with an existing Grafana stack.

The approach described by Maksim Pachkouski on Dev Genius combines the REST API (for near-real-time operational data) with system tables (for historical trends), which is the most flexible architecture for complex workspaces.

What to Monitor: A Practical Checklist

What to Monitor	Best Tool	Frequency
Job run success/failure	Jobs UI or system tables	Real-time / on schedule
Job duration trends	System tables (SQL)	Daily
Job cost per run (DBU)	system.billing.usage	Daily / Weekly
Cluster CPU and memory	Ganglia or Datadog Agent	Real-time
Cluster up/down state	Clusters API polling	Every 5 minutes
Driver and executor logs	Spark UI / Cluster log delivery	On failure
Audit trail (who did what)	Azure Monitor (DatabricksAccounts)	On demand / compliance
SQL warehouse query perf	Query History UI / system tables	Daily


Conclusion

Azure Databricks monitoring does not have a single right answer. The best setup depends on your team size, workspace complexity, and existing observability stack. Start with the built-in tools (Spark UI, Jobs UI, Ganglia) for immediate visibility, add Azure Monitor Diagnostic Settings for audit logging and alerts, and layer in system tables for historical cost and performance analysis.

For teams managing large workspaces with many jobs, combining the REST API with a custom or pre-built dashboard gives you the clearest picture. Third-party tools like Datadog or Dynatrace are worth the investment when you need anomaly detection or unified monitoring across Azure services.

Disclaimer: The information in this article is based on publicly available documentation and community resources as of May 2026. Azure Databricks features, API endpoints, and system table schemas may change with platform updates. Always refer to the official Microsoft and Databricks documentation for the most current information. Feature availability may vary by plan tier (Standard, Premium, or Enterprise) and cloud region.

FAQs

1. How do I check if an Azure Databricks job is currently running?

Use the Jobs UI under Workflows > Jobs in your workspace to see current run status. Alternatively, call GET /api/2.1/jobs/runs/list with the job_id parameter and active_only=true, which returns only runs that have not yet finished. System tables can also help: in system.lakeflow.job_run_timeline, a run that is still in progress has no row with a populated result_state, so group by run_id and keep the runs where every result_state is NULL.

2. Do I need the Premium plan for Azure Databricks monitoring?

Not for all monitoring. The built-in Spark UI, Ganglia metrics, and Jobs UI are available on all plan tiers. However, Azure Monitor Diagnostic Settings (which route Databricks logs to Log Analytics) require the Azure Databricks Premium plan. System tables require Unity Catalog, which is also a Premium feature.

3. How do I get alerts when an Azure Databricks job fails?

There are three approaches. First, enable Diagnostic Settings to route DatabricksJobs logs to Azure Monitor, then create an alert rule that fires on runFailed events. Second, use the built-in Databricks notification system: in the job settings under “Notifications,” add an email or webhook destination for failure events. Third, poll the Jobs API on a schedule and trigger your own alerting logic when a run returns FAILED.

4. How do I find out what a Databricks job run cost?

Query the system.billing.usage table in Unity Catalog. Join it with system.billing.list_prices on sku_name, restricted to the price that was in effect when the usage occurred, then filter by usage_metadata.job_id and sum usage_quantity * pricing.default. This gives you the list cost of the DBUs each job consumed in the selected period. Note that this only covers jobs run on jobs compute and serverless compute, not jobs on all-purpose or SQL warehouse compute.

5. Can I monitor Azure Databricks with Prometheus and Grafana?

Yes. You can expose a Prometheus-compatible metrics endpoint by using a cluster init script that installs the Prometheus JMX exporter. You can also use the Datadog Agent with a Prometheus scrape configuration, or pull metrics into Grafana through the Azure Monitor data source plugin. The AnalyticJeremy/Azure-Databricks-Monitoring GitHub project (https://github.com/mspnp/azure-databricks-dev-guide) contains reference architecture and configuration examples for this approach.
