Azure Databricks Integration
Cluster metrics, job run performance, and notebook execution monitoring for Azure Databricks. Detect data pipeline failures and cluster inefficiencies before they delay business insights.
How It Works
Configure Log Analytics Export
Enable Databricks diagnostic log export to an Azure Log Analytics workspace. TigerOps reads cluster events, job run logs, and Spark metrics from this workspace via the Azure Monitor integration.
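The export can be enabled with the Azure CLI. A minimal sketch, assuming placeholder subscription, resource group, and workspace names that you would replace with your own:

```shell
# Sketch: turn on Databricks diagnostic log export to Log Analytics.
# All resource IDs below are placeholders, not real resources.
LAW_ID="/subscriptions/<sub-id>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-tigerops"
DBW_ID="/subscriptions/<sub-id>/resourceGroups/rg-data/providers/Microsoft.Databricks/workspaces/dbw-prod"

# Export the cluster, job, and notebook diagnostic categories
az monitor diagnostic-settings create \
  --name tigerops-export \
  --resource "$DBW_ID" \
  --workspace "$LAW_ID" \
  --logs '[{"category":"clusters","enabled":true},
           {"category":"jobs","enabled":true},
           {"category":"notebook","enabled":true}]'
```

Diagnostic log export requires a Databricks Premium-tier workspace; additional categories such as dbfs and accounts can be added to the same setting.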
Deploy the Spark Metrics Library
Add the TigerOps Spark metrics init script to your cluster policies. The library forwards driver and executor JVM, GC, and shuffle metrics to TigerOps remote-write without changing your application code.
Create Service Principal
Register an Azure service principal with the Monitoring Reader role on your subscription and the Contributor role on the Databricks workspace. TigerOps uses the Databricks REST API to collect job run status and cluster state.
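A sketch of the registration with the Azure CLI, assuming placeholder subscription and workspace IDs and an illustrative principal name:

```shell
# Sketch: create the service principal and grant the roles TigerOps needs.
# The subscription ID and workspace resource ID are placeholders.
DBW_ID="/subscriptions/<sub-id>/resourceGroups/rg-data/providers/Microsoft.Databricks/workspaces/dbw-prod"

# Monitoring Reader lets TigerOps read metrics and diagnostic settings
az ad sp create-for-rbac \
  --name tigerops-databricks-reader \
  --role "Monitoring Reader" \
  --scopes "/subscriptions/<sub-id>"

# Contributor on the workspace lets TigerOps call the Databricks REST API
az role assignment create \
  --assignee <appId-from-previous-step> \
  --role "Contributor" \
  --scope "$DBW_ID"
```

Save the appId, password, and tenant from the first command's output; these are the credentials you register in TigerOps.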
Set Job SLO Alerts
Define job duration SLOs and failure rate thresholds per job cluster or job name. TigerOps alerts when job runs exceed duration SLOs and correlates failures with cluster configuration drift.
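As an illustration of the shape such a definition might take, here is a hypothetical API call; the endpoint URL, path, and payload fields are assumptions for this sketch, not a documented TigerOps API:

```shell
# Hypothetical sketch only: define a duration SLO and failure-rate
# threshold for one job. Endpoint and schema are illustrative.
curl -s -X POST "https://api.tigerops.example/v1/databricks/job-slos" \
  -H "Authorization: Bearer ${TIGEROPS_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "job_name": "nightly-etl",
        "max_duration_minutes": 45,
        "max_failure_rate_pct": 2,
        "notify": ["#data-oncall"]
      }'
```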
What You Get Out of the Box
Cluster Resource Utilization
CPU, memory, disk I/O, and network per driver and executor node. Identify over-provisioned clusters wasting budget and under-provisioned clusters causing job slowdowns.
Job Run Performance
Job run duration, success and failure rates, queue wait time, and cluster startup latency per job. Track SLO compliance for production pipelines and ML training jobs.
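To make the SLO compliance metric concrete, here is a standalone sketch of the underlying arithmetic: the share of runs that finished within the duration SLO. The durations and the 45-minute SLO are made-up example values.

```shell
# Illustrative only: SLO compliance from a list of job run durations
# (minutes) against a 45-minute duration SLO.
SLO_MINUTES=45
durations="38 42 51 40 47 39 44 43 60 41"

# Count runs at or under the SLO and express as a percentage
compliance=$(echo "$durations" | tr ' ' '\n' | awk -v slo="$SLO_MINUTES" '
  { total++; if ($1 <= slo) ok++ }
  END { printf "%.0f", ok / total * 100 }')

echo "SLO compliance: ${compliance}%"
```

With these example values, 7 of 10 runs meet the SLO, so compliance is 70%.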
Spark Stage & Task Metrics
Stage duration, task retry counts, shuffle read/write bytes, and spill-to-disk volume per job run. Identify data skew and shuffle bottlenecks causing stage slowdowns.
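Data skew in a stage shows up as one task running far longer than its peers. A minimal sketch of the ratio TigerOps-style skew detection relies on, using made-up task durations:

```shell
# Illustrative only: compare the slowest task in a stage against the
# median task duration (seconds). A large ratio suggests data skew.
task_durations="12 11 13 12 95 12 11 14"

# Sort the durations, then emit "median max" on one line
read -r median max <<EOF
$(echo "$task_durations" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1 }
  END {
    m = (NR % 2) ? v[(NR+1)/2] : (v[NR/2] + v[NR/2+1]) / 2
    print m, v[NR]
  }')
EOF

skew=$(awk -v max="$max" -v med="$median" 'BEGIN { printf "%.1f", max / med }')
echo "max/median skew ratio: ${skew}x"
```

Here one 95-second task against a 12-second median yields a 7.9x ratio, a strong hint that one partition holds disproportionate data.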
Notebook Execution Monitoring
Notebook run duration, command execution counts, and error rates from interactive and scheduled notebook runs. Track notebook performance regressions across code changes.
Auto-Scaling Cluster Insights
Scale-out and scale-in event history, node provisioning duration, and spot instance eviction counts for auto-scaling clusters. Understand capacity behavior under variable workload.
AI Job Failure Root Cause
TigerOps AI analyzes Spark exception logs, executor lost events, and OOMKill signals to surface the most likely root cause of job failures, reducing debug time from hours to minutes.
Databricks Cluster Init Script
Add this init script to your cluster policy to forward Spark metrics to TigerOps.
#!/bin/bash
# TigerOps Databricks init script
# Attach via a cluster policy or as a cluster-scoped init script
# Init script path: dbfs:/tigerops/init/tigerops-metrics.sh
set -euo pipefail
TIGEROPS_ENDPOINT="https://ingest.atatus.net/api/v1/write"
# Fail fast with a clear message if the key was not provided
TIGEROPS_API_KEY="${TIGEROPS_API_KEY:?set via cluster environment variable or secret scope}"
CLUSTER_NAME="${DB_CLUSTER_NAME}"
# Install TigerOps Spark metrics library
pip install "tigerops-spark>=1.0,<2.0" --quiet
# Write Spark metrics configuration
cat > /databricks/spark/conf/metrics.properties <<EOF
*.sink.tigerops.class=net.atatus.spark.TigerOpsSink
*.sink.tigerops.endpoint=${TIGEROPS_ENDPOINT}
*.sink.tigerops.token=${TIGEROPS_API_KEY}
*.sink.tigerops.cluster=${CLUSTER_NAME}
*.sink.tigerops.period=15
*.sink.tigerops.unit=SECONDS
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF
# Enable Databricks diagnostic log export to Log Analytics
# (Configure via Azure Portal or ARM template separately)
echo "TigerOps Spark metrics sink configured for cluster: ${CLUSTER_NAME}"
Common Questions
How does TigerOps collect Spark metrics from Databricks?
TigerOps provides a Databricks init script that installs a Spark metrics sink on your clusters. The sink pushes driver and executor metrics to TigerOps remote-write every 15 seconds. Alternatively, Databricks diagnostic logs streamed to Log Analytics are parsed for cluster and job events.
Does TigerOps support Delta Live Tables pipelines?
Yes. TigerOps monitors Delta Live Tables pipeline runs, update duration, flow progress, and data quality expectation pass and fail rates from the Databricks Event Log streamed to Log Analytics.
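Once the event log is in Log Analytics, the same data can be inspected directly. A sketch using the Azure CLI; the workspace GUID is a placeholder, and the table and column names are assumptions to verify against your workspace's actual schema:

```shell
# Sketch: query DLT pipeline events from Log Analytics. The workspace
# GUID, table name, and columns below are placeholders/assumptions.
az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query '
    DatabricksDLTPipelines
    | where TimeGenerated > ago(24h)
    | summarize events = count() by ActionName
    | order by events desc'
```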
Can TigerOps alert on Databricks job SLA breaches?
Yes. You can define a maximum allowed run duration per job or job cluster in TigerOps. When a run exceeds the SLA, TigerOps fires an alert with the current run duration, estimated completion time, and any executor failures observed so far.
Does TigerOps support Unity Catalog and workspace-level metrics?
Yes. TigerOps ingests Databricks account-level audit logs from Log Analytics, which include Unity Catalog access events, metastore operations, and workspace-level user activity for compliance and performance monitoring.
How does TigerOps help reduce Databricks cluster costs?
TigerOps tracks DBU consumption per cluster, job, and user alongside cluster utilization. The AI cost insights feature identifies clusters with low utilization that are good candidates for auto-termination or downsizing, surfacing potential monthly savings.
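The savings arithmetic behind this is straightforward. A standalone sketch with made-up example rates (not real Databricks pricing):

```shell
# Illustrative arithmetic only: estimate monthly DBU spend for a cluster.
# Node count, DBU rate, uptime, and price are made-up example values.
NODES=8
DBU_PER_NODE_HOUR=2      # DBU rate for the example node type
HOURS_PER_MONTH=200      # actual cluster uptime, not wall-clock hours
USD_PER_DBU=0.30

monthly_cost=$(awk -v n="$NODES" -v d="$DBU_PER_NODE_HOUR" \
  -v h="$HOURS_PER_MONTH" -v r="$USD_PER_DBU" \
  'BEGIN { printf "%.2f", n * d * h * r }')

echo "Estimated monthly DBU spend: \$${monthly_cost}"
```

A cluster at 30% utilization doing the same work on half the nodes would roughly halve this figure, which is the kind of downsizing candidate the cost insights feature surfaces.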
Make Every Databricks Job Run Count
Cluster metrics, job SLO tracking, and AI root cause analysis for Azure Databricks. Deploy in 5 minutes.