Azure Databricks is a widely used analytics and data engineering platform built on Apache Spark. As workloads grow to dozens or hundreds of scheduled jobs, knowing whether your jobs are running, how long they take, and why they fail becomes critical.
This guide walks you through the practical monitoring tools available for Azure Databricks, from built-in UIs to Azure Monitor, system tables, and the Jobs REST API. Each section answers a specific question: what to monitor, where to find it, and how to act on it.
- Azure Databricks provides built-in monitoring through Spark UI, Ganglia metrics, and the Jobs UI for real-time job and cluster visibility.
- Azure Monitor and Diagnostic Settings (Premium plan only) let you route logs and metrics to Log Analytics, enabling powerful KQL-based queries and alerts.
- System tables such as lakeflow_monitoring.job_runs and billable_usage provide historical job performance and cost data that can be queried using SQL.
- The Databricks Jobs REST API lets you programmatically check job status, retrieve run details, and export logs for custom dashboards.
- Third-party tools like Datadog and Dynatrace offer deeper observability with anomaly detection and unified monitoring across your Azure stack.
- Combining native tools with custom dashboards (Databricks Lakeflow Dashboard or Power BI) provides the most complete monitoring coverage for large workspaces.
What Does Azure Databricks Monitoring Cover?
Azure Databricks monitoring spans two main areas: job monitoring (tracking scheduled or triggered pipeline runs) and cluster monitoring (tracking the compute infrastructure running those jobs).
For jobs, you typically want to know:
- Whether a job run succeeded, failed, or is still running
- How long it took compared to historical baselines
- Which task inside a multi-task job failed and why
- What the job cost in terms of DBUs (Databricks Units)
For clusters, you want to know:
- Whether the cluster is up and healthy
- CPU, memory, and network utilization
- Driver and executor logs for debugging
- Whether auto-scaling is behaving as expected
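The "compared to historical baselines" check in the job list above can be sketched in a few lines of Python. This is a minimal illustration, not a Databricks feature: the function name `is_slow_run`, the 1.5x threshold, and the sample durations are all assumptions chosen for the example.

```python
from statistics import median


def is_slow_run(duration_s: float, history_s: list[float], factor: float = 1.5) -> bool:
    """Return True when a run took more than `factor` times the historical median."""
    if not history_s:
        # No baseline yet, so there is nothing to compare against.
        return False
    return duration_s > factor * median(history_s)


# Hypothetical durations (in seconds) of earlier runs of the same job.
history = [600, 620, 580, 610, 640]

print(is_slow_run(630, history))   # close to the usual duration
print(is_slow_run(1200, history))  # roughly double the usual duration
```

The median is used rather than the mean so that a single unusually long run in the history does not shift the baseline much.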
Method 1: Use the Databricks Jobs UI for Run-Level Monitoring
The simplest starting point is the Databricks Jobs UI, available at Workflows > Jobs in your workspace sidebar.
From the Jobs list view, you can see:
- All scheduled and triggered jobs in your workspace
- The last run status (Succeeded, Failed, Running, Skipped)
- Next scheduled run time
- Duration of the most recent run
Clicking into a specific job opens the Run History page, which shows every run with timestamps, duration, status, and links to Spark UI and logs. For multi-task jobs, you also get a task-level DAG view showing which task in the pipeline failed.
Limitation: The UI works well for diagnosing one job at a time. It becomes impractical when you need a workspace-wide view of dozens of jobs. Use system tables or dashboards for that.
Method 2: Monitor Cluster and Spark Metrics
Spark UI
The Spark UI is available from the cluster detail page (All-Purpose clusters) or the Job Run History page (job clusters). It provides:
- Jobs tab: Status and timeline of all Spark jobs in the application
- Stages tab: Breakdown of each stage and individual task metrics
- Storage tab: Cached RDDs and DataFrames in memory
- Executors tab: CPU and memory usage per executor
- SQL tab: Query plans for DataFrame and SQL operations
Ganglia Metrics
Ganglia is a built-in metrics dashboard accessible from the Metrics tab on cluster detail pages (All-Purpose clusters) and the Job Runs page (job clusters). It shows live CPU, memory, network, and disk I/O across all nodes.
Note: Ganglia data is ephemeral. For job clusters, metrics are only available as a static image after the cluster terminates, making it difficult to do historical analysis. For long-term tracking, use Azure Monitor or system tables instead.
JVM-Level Debugging
If you need to go deeper than what Spark UI provides, the following JVM tools are available on cluster nodes:
- jstack – Captures thread stack traces to diagnose hangs or deadlocks
- jmap – Creates heap dumps for OutOfMemory debugging
- jstat – Reports garbage collection and memory statistics over time
Method 3: Azure Monitor and Diagnostic Settings
Azure Monitor is the recommended approach for production-grade, long-term monitoring of Azure Databricks. It requires the Azure Databricks Premium plan.
Enable Diagnostic Settings
To enable Diagnostic Settings:
- Open your Azure Databricks workspace in the Azure portal
- Navigate to Monitoring > Diagnostic settings
- Click “Add diagnostic setting”
- Select the log categories you want (see below)
- Choose “Send to Log Analytics workspace” as the destination
- Save the settings
Available Log Tables in Log Analytics
Once configured, the following tables are available in your Log Analytics workspace:
| Table Name | What It Contains |
| --- | --- |
| DatabricksClusters | Cluster create, start, terminate, edit events |
| DatabricksJobs | Job create, delete, reset, run now events |
| DatabricksNotebooks | Notebook attach, detach, create, delete events |
| DatabricksAccounts | Workspace-level account audit logs |
| DatabricksSQL | SQL warehouse query execution logs |
| AzureMetrics | Platform-level health and performance metrics |
| AzureActivity | Subscription-level events (portal actions, ARM calls) |
Example KQL Query: Failed Job Runs
Run this query in Log Analytics to find all failed job runs in the past 24 hours:
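    DatabricksJobs
    | where TimeGenerated > ago(24h)
    | where ActionName == 'runFailed'
    // RequestParams and Response are stored as JSON strings, so parse them before reading fields.
    // Field names follow the Databricks audit log schema; check them against your own logs.
    | extend Params = parse_json(RequestParams), Resp = parse_json(Response)
    | project TimeGenerated, JobId = tostring(Params.jobId), RunId = tostring(Params.runId), Identity, ErrorMessage = tostring(Resp.errorMessage)
    | order by TimeGenerated desc

Set Up Alerts from Azure Monitor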
Once logs are flowing into Log Analytics, you can create Azure Monitor Alerts to notify your team when jobs fail. Navigate to Azure Monitor > Alerts > New alert rule, choose your Log Analytics workspace as the scope, and create a custom log search query based on the examples above.
Method 4: Monitor Job Costs and Performance Using System Tables
Databricks system tables are Unity Catalog-managed tables that store historical metadata about your jobs, clusters, and costs. They are queryable via SQL from any notebook or SQL warehouse in your workspace.
Key System Tables for Job Monitoring
- system.lakeflow.job_runs – Each completed or running job run, with run ID, state, start/end time, and cluster ID
- system.billing.usage – DBU consumption per job run, queryable by job ID for cost attribution
- system.compute.clusters – Cluster configuration snapshots over time
Requirements: You must either be both a metastore admin and an account admin, or have USE and SELECT permissions on the system schemas. The cost queries below only cover jobs run on jobs compute and serverless compute. Jobs that run on SQL warehouses or all-purpose compute are excluded from cost attribution.
Query: Most Expensive Jobs (Last 30 Days)
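    WITH list_cost_per_job AS (
      SELECT
        t1.workspace_id,
        t1.usage_metadata.job_id,
        COUNT(DISTINCT t1.usage_metadata.job_run_id) AS num_runs,
        SUM(t1.usage_quantity * t2.pricing.default) AS total_cost
      FROM system.billing.usage AS t1
      INNER JOIN system.billing.list_prices AS t2
        ON t1.sku_name = t2.sku_name
        -- Match each usage record to the price in effect at the time,
        -- otherwise SKUs with several price periods are counted more than once.
        AND t1.cloud = t2.cloud
        AND t1.usage_start_time >= t2.price_start_time
        AND (t1.usage_end_time <= t2.price_end_time OR t2.price_end_time IS NULL)
      WHERE t1.usage_date >= CURRENT_DATE - 30
        AND t1.usage_metadata.job_id IS NOT NULL
      GROUP BY 1, 2
    )
    SELECT * FROM list_cost_per_job
    ORDER BY total_cost DESC
    LIMIT 20;

Query: Slowest Job Runs (Last 7 Days)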
    SELECT
      job_id,
      run_id,
      run_name,
      result_state,
      TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration_minutes
    FROM system.lakeflow.job_runs
    WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
    ORDER BY duration_minutes DESC
    LIMIT 20;

Lakeflow Monitoring Dashboard
Databricks provides a pre-built Lakeflow Monitoring Dashboard that uses system tables to visualize job performance, failure rates, and cost trends across your workspace. Download the dashboard JSON from the Databricks GitHub Repository and import it into your workspace.
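To make the "failure rate" metric concrete, here is a minimal Python sketch of how it could be derived from rows of job ID and result state. It does not reproduce the dashboard's own queries; the function name `failure_rates`, the sample rows, and the choice to count only the `FAILED` state as a failure are assumptions for illustration.

```python
from collections import defaultdict


def failure_rates(rows):
    """Map each job_id to the fraction of its runs whose result_state is FAILED."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for job_id, result_state in rows:
        totals[job_id] += 1
        if result_state == "FAILED":
            failed[job_id] += 1
    return {job_id: failed[job_id] / totals[job_id] for job_id in totals}


# Hypothetical (job_id, result_state) pairs, as a query over job runs might return them.
rows = [
    (101, "SUCCEEDED"), (101, "FAILED"), (101, "SUCCEEDED"), (101, "SUCCEEDED"),
    (202, "FAILED"), (202, "FAILED"),
]

for job_id, rate in failure_rates(rows).items():
    print(f"job {job_id}: {rate:.0%} of runs failed")
```

In practice you may want to treat other terminal states, such as timeouts or cancellations, as failures too; that is a policy decision for your team.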
Method 5: Use the Databricks Jobs API for Programmatic Monitoring
The Databricks Jobs REST API gives you full programmatic access to job and run metadata. This is ideal for building custom dashboards, integrating with external monitoring systems, or automating workflows.
Key API Endpoints for Monitoring
| Endpoint | Method | What It Returns |
| --- | --- | --- |
| /api/2.1/jobs/list | GET | All jobs in the workspace |
| /api/2.1/jobs/runs/list | GET | Runs for a specific job with status |
| /api/2.1/jobs/runs/get | GET | Details of a single run including state, start time, and duration |
| /api/2.1/jobs/runs/get-output | GET | Output and error logs from a completed run |
| /api/2.1/jobs/runs/cancel | POST | Cancel a running job |
| /api/2.1/jobs/runs/repair | POST | Retry only failed tasks in a run |
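For multi-task jobs, the response from /api/2.1/jobs/runs/get can be inspected to find which task failed. The sketch below assumes the response carries a `tasks` array whose entries have `task_key` and a `state` object with `result_state` and `state_message`; the helper name `failed_tasks` and the sample payload are hypothetical, and the real response contains many more fields.

```python
def failed_tasks(run: dict) -> list[tuple[str, str]]:
    """Return (task_key, state_message) for every task whose result_state is FAILED."""
    failures = []
    for task in run.get("tasks", []):
        state = task.get("state", {})
        if state.get("result_state") == "FAILED":
            failures.append((task.get("task_key", "<unknown>"), state.get("state_message", "")))
    return failures


# Hypothetical, trimmed-down response body for a three-task run.
sample_run = {
    "run_id": 987,
    "tasks": [
        {"task_key": "ingest",
         "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
        {"task_key": "transform",
         "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED",
                   "state_message": "Notebook raised an exception"}},
        {"task_key": "publish",
         "state": {"life_cycle_state": "SKIPPED"}},
    ],
}

for task_key, message in failed_tasks(sample_run):
    print(f"Task {task_key} failed: {message}")
```

Keeping the parsing in a pure function like this makes it easy to unit test without calling the API.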
Example: List Failed Runs via Python
    import requests

    DATABRICKS_HOST = 'https://<your-workspace>.azuredatabricks.net'
    TOKEN = '<your-personal-access-token>'
    JOB_ID = 12345

    # runs/list has no result_state filter, so request completed runs
    # and pick out the failures on the client side.
    response = requests.get(
        f'{DATABRICKS_HOST}/api/2.1/jobs/runs/list',
        headers={'Authorization': f'Bearer {TOKEN}'},
        params={'job_id': JOB_ID, 'completed_only': 'true', 'limit': 25},
        timeout=30,
    )
    response.raise_for_status()

    runs = response.json().get('runs', [])
    for run in runs:
        if run['state'].get('result_state') == 'FAILED':
            # start_time is a Unix timestamp in milliseconds
            print(f"Run {run['run_id']}: FAILED at {run['start_time']}")

Method 6: Monitor Cluster Up/Down Status
A common requirement is knowing whether a cluster is currently running or has gone down unexpectedly. Azure Databricks does not send native webhooks for cluster state changes, so you have two practical approaches:
Option A: Poll Cluster State via API
Use the Clusters API to check cluster state on a schedule:
    GET /api/2.0/clusters/get?cluster_id=<cluster-id>

    # Relevant fields in response:
    # state: PENDING | RUNNING | RESTARTING | RESIZING |
    #        TERMINATING | TERMINATED | ERROR | UNKNOWN
    # state_message: Human-readable description of current state

A cluster that is unexpectedly in the ERROR or TERMINATED state indicates a problem. You can wrap this check in a monitoring script that sends an alert to Slack or PagerDuty when the state is not the one you expect.
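A polling check along these lines might look like the following sketch. It uses only the Python standard library; the function names `fetch_cluster_state` and `needs_alert`, and the set of states treated as unhealthy, are assumptions for illustration. Delivering the alert to Slack or PagerDuty is left out.

```python
import json
import urllib.parse
import urllib.request

# States treated as alert-worthy for a cluster that is supposed to be up (an assumption).
UNHEALTHY_STATES = {"ERROR", "UNKNOWN", "TERMINATING", "TERMINATED"}


def fetch_cluster_state(host: str, token: str, cluster_id: str) -> tuple[str, str]:
    """Call /api/2.0/clusters/get and return (state, state_message)."""
    query = urllib.parse.urlencode({"cluster_id": cluster_id})
    request = urllib.request.Request(
        f"{host}/api/2.0/clusters/get?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["state"], body.get("state_message", "")


def needs_alert(state: str, expected_up: bool = True) -> bool:
    """Alert only when the cluster should be running and is in an unhealthy state."""
    return expected_up and state in UNHEALTHY_STATES


# The decision logic can be exercised without touching the API.
print(needs_alert("RUNNING"))
print(needs_alert("TERMINATED"))
```

A scheduler such as cron or an Azure Function would call `fetch_cluster_state` every few minutes and pass the result to `needs_alert`. The `expected_up` flag lets you silence alerts for clusters that are meant to auto-terminate.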
Option B: Use Azure Monitor Alerts on DatabricksClusters Logs
If Diagnostic Settings are enabled, the DatabricksClusters table in Log Analytics records terminate events. Create an alert rule that fires when an unexpected clusterTerminated event appears outside maintenance windows.
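The "outside maintenance windows" condition can also be applied in a script that post-processes alert events. The sketch below assumes a single daily window from 01:00 to 03:00 UTC; the window boundaries and the function name `outside_maintenance_window` are hypothetical.

```python
from datetime import datetime, time, timezone

# Hypothetical daily maintenance window, expressed in UTC.
MAINTENANCE_START = time(1, 0)
MAINTENANCE_END = time(3, 0)


def outside_maintenance_window(event_time: datetime) -> bool:
    """Return True when a timezone-aware event time falls outside the daily window."""
    utc_time = event_time.astimezone(timezone.utc).time()
    return not (MAINTENANCE_START <= utc_time < MAINTENANCE_END)


planned = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
unplanned = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)

print(outside_maintenance_window(planned))    # inside the window, so no alert
print(outside_maintenance_window(unplanned))  # outside the window, so alert
```

Pass timezone-aware datetimes; a naive datetime would be interpreted in the local time zone of the machine running the script.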
Method 7: Third-Party Monitoring Tools
Native tools cover most use cases, but third-party observability platforms add convenience, anomaly detection, and integration with the rest of your monitoring stack.
CubeAPM
CubeAPM gives teams a self-hosted, OpenTelemetry-native way to monitor Databricks workloads alongside the rest of their application stack. It can collect metrics, logs, traces, and infrastructure signals in one place, making it useful when teams want Databricks visibility without sending all observability data to a SaaS-only platform.
Datadog
Datadog offers a Databricks integration that pulls cluster metrics, job run data, and Spark application metrics via the Datadog Agent installed on cluster nodes. It supports anomaly detection, dashboards, and alert routing out of the box.
Dynatrace
Dynatrace provides workspace-level and cluster-level observability for Azure Databricks through its OneAgent and ActiveGate integration. It automatically discovers Databricks entities, traces Spark jobs, and maps dependencies.
IBM Instana
IBM Instana supports Azure Databricks monitoring through Azure Monitor integration, capturing job run metrics and cluster health into its observability pipeline alongside your other Azure services.
Method 8: Build a Custom Monitoring Dashboard
When you manage a large workspace with dozens of teams and hundreds of jobs, a custom dashboard gives you the workspace-wide view the Databricks UI cannot.
Common approaches:
- Databricks AI/BI Dashboards (formerly Lakeview Dashboards): Build natively inside your workspace using system table queries. Best for Databricks-centric teams.
- Azure Monitor Workbooks: Combine Log Analytics queries with Azure Metrics in a single visualization. Best for Azure-centric ops teams.
- Power BI: Connect to system tables via Databricks SQL connector. Best when dashboards are shared with business stakeholders who already use Power BI.
- Grafana: Use the Azure Monitor data source or Databricks Prometheus endpoint. Best for engineering teams with an existing Grafana stack.
The approach described by Maksim Pachkouski on Dev Genius combines the REST API (for near-real-time operational data) with system tables (for historical trends), which is the most flexible architecture for complex workspaces.
What to Monitor: A Practical Checklist
| What to Monitor | Best Tool | Frequency |
| --- | --- | --- |
| Job run success/failure | Jobs UI or system tables | Real-time / on schedule |
| Job duration trends | System tables (SQL) | Daily |
| Job cost per run (DBU) | system.billing.usage | Daily / Weekly |
| Cluster CPU and memory | Ganglia or Datadog Agent | Real-time |
| Cluster up/down state | Clusters API polling | Every 5 minutes |
| Driver and executor logs | Spark UI / Cluster log delivery | On failure |
| Audit trail (who did what) | Azure Monitor (DatabricksAccounts) | On demand / compliance |
| SQL warehouse query perf | Query History UI / system tables | Daily |
Conclusion
Azure Databricks monitoring does not have a single right answer. The best setup depends on your team size, workspace complexity, and existing observability stack. Start with the built-in tools (Spark UI, Jobs UI, Ganglia) for immediate visibility, add Azure Monitor Diagnostic Settings for audit logging and alerts, and layer in system tables for historical cost and performance analysis.
For teams managing large workspaces with many jobs, combining the REST API with a custom or pre-built dashboard gives you the clearest picture. Third-party tools like Datadog or Dynatrace are worth the investment when you need anomaly detection or unified monitoring across Azure services.
Disclaimer: The information in this article is based on publicly available documentation and community resources as of May 2026. Azure Databricks features, API endpoints, and system table schemas may change with platform updates. Always refer to the official Microsoft and Databricks documentation for the most current information. Feature availability may vary by plan tier (Standard, Premium, or Enterprise) and cloud region.
FAQs
1. How do I check if an Azure Databricks job is currently running?
Use the Jobs UI under Workflows > Jobs in your workspace to see current run status. Alternatively, call GET /api/2.1/jobs/runs/list with the job_id parameter and active_only=true, which returns only runs that have not yet finished. System tables can also be queried: SELECT * FROM system.lakeflow.job_runs WHERE result_state IS NULL returns runs that have not yet completed.
2. Do I need the Premium plan for Azure Databricks monitoring?
Not for all monitoring. The built-in Spark UI, Ganglia metrics, and Jobs UI are available on all plan tiers. However, Azure Monitor Diagnostic Settings (which route Databricks logs to Log Analytics) require the Azure Databricks Premium plan. System tables require Unity Catalog, which is also a Premium feature.
3. How do I get alerts when an Azure Databricks job fails?
There are three approaches. First, enable Diagnostic Settings to route DatabricksJobs logs to Azure Monitor, then create an alert rule that fires on runFailed events. Second, use the built-in Databricks notification system: in the job settings under “Notifications,” add an email or webhook destination for failure events. Third, poll the Jobs API on a schedule and trigger your own alerting logic when a run returns FAILED.
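For the third approach, a polling script needs to avoid alerting on the same failed run at every poll. One way to do that, sketched here with hypothetical names (`new_failures`, `already_alerted`) and sample data, is to remember which run IDs have already triggered an alert:

```python
def new_failures(runs: list[dict], already_alerted: set[int]) -> list[int]:
    """Return run IDs that failed and have not been alerted on yet, and remember them."""
    fresh = []
    for run in runs:
        run_id = run["run_id"]
        failed = run.get("state", {}).get("result_state") == "FAILED"
        if failed and run_id not in already_alerted:
            fresh.append(run_id)
            already_alerted.add(run_id)
    return fresh


# Hypothetical runs, shaped like entries in a runs/list response.
polled_runs = [
    {"run_id": 1, "state": {"result_state": "SUCCESS"}},
    {"run_id": 2, "state": {"result_state": "FAILED"}},
]
seen: set[int] = set()

print(new_failures(polled_runs, seen))  # first poll reports the failed run
print(new_failures(polled_runs, seen))  # second poll reports nothing new
```

In a long-running service the `seen` set would normally be persisted, for example in a small table or key-value store, so that a restart does not re-send old alerts.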
4. How do I find out what a Databricks job run cost?
Query the system.billing.usage table in Unity Catalog. Join it with system.billing.list_prices on sku_name, filter by usage_metadata.job_id, and sum usage_quantity * pricing.default. This gives you DBU-based cost per job run for the selected period. Note that this only covers jobs run on jobs compute and serverless compute, not jobs on all-purpose or SQL warehouse compute.
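The arithmetic behind that query is a sum of usage quantity times list price, grouped by job. The following Python sketch shows the calculation on a few made-up rows; the prices, quantities, and the name `cost_per_job` are illustrative assumptions, not real Databricks rates.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical usage rows after joining usage with list prices.
usage_rows = [
    {"job_id": "101", "usage_quantity": Decimal("2.5"), "list_price": Decimal("0.30")},
    {"job_id": "101", "usage_quantity": Decimal("1.5"), "list_price": Decimal("0.30")},
    {"job_id": "202", "usage_quantity": Decimal("4.0"), "list_price": Decimal("0.55")},
]


def cost_per_job(rows):
    """Sum usage_quantity * list_price for each job_id."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row["job_id"]] += row["usage_quantity"] * row["list_price"]
    return dict(totals)


for job_id, cost in cost_per_job(usage_rows).items():
    print(f"job {job_id}: {cost}")
```

`Decimal` is used instead of floating point so that currency amounts add up exactly.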
5. Can I monitor Azure Databricks with Prometheus and Grafana?
Yes. You can expose a Prometheus-compatible metrics endpoint by using a cluster init script that installs the Prometheus JMX exporter. You can also use the Datadog Agent with a Prometheus scrape configuration, or pull metrics into Grafana through the Azure Monitor data source plugin. The AnalyticJeremy/Azure-Databricks-Monitoring GitHub project (https://github.com/mspnp/azure-databricks-dev-guide) contains reference architecture and configuration examples for this approach.





