Cloud · Azure Monitor + Service Principal

Azure Databricks Integration

Cluster metrics, job run performance, and notebook execution monitoring for Azure Databricks. Detect data pipeline failures and cluster inefficiencies before they delay business insights.

Setup

How It Works

01

Configure Log Analytics Export

Enable Databricks diagnostic log export to an Azure Log Analytics workspace. TigerOps reads cluster events, job run logs, and Spark metrics from this workspace via the Azure Monitor integration.
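
If you prefer the CLI, a minimal sketch with the Azure CLI looks like this. The resource IDs are placeholders for your subscription, resource group, and workspaces, and the category list is a subset you would tune to your needs.

enable-diagnostics.sh
#!/bin/bash
# Stream Databricks diagnostic logs to a Log Analytics workspace.
# All resource IDs below are placeholders.
az monitor diagnostic-settings create \
  --name tigerops-databricks-logs \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<databricks-ws>" \
  --workspace "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<log-analytics-ws>" \
  --logs '[{"category":"clusters","enabled":true},{"category":"jobs","enabled":true},{"category":"notebook","enabled":true}]'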

02

Deploy the Spark Metrics Library

Add the TigerOps Spark metrics init script to your cluster policies. The library forwards driver and executor JVM, GC, and shuffle metrics to the TigerOps remote-write endpoint without changing your application code.
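
One way to enforce this, sketched below, is to pin the init script path as a fixed value in a cluster policy via the Databricks REST API. The workspace URL, token, and policy name are placeholders.

create-cluster-policy.sh
#!/bin/bash
# Pin the TigerOps init script in a cluster policy so every cluster
# created under the policy runs it. <databricks-instance> is your
# workspace URL; DATABRICKS_TOKEN is a personal access token.
curl -s -X POST "https://<databricks-instance>/api/2.0/policies/clusters/create" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -d '{
    "name": "tigerops-metrics",
    "definition": "{\"init_scripts.0.dbfs.destination\": {\"type\": \"fixed\", \"value\": \"dbfs:/tigerops/init/tigerops-metrics.sh\"}}"
  }'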

03

Create Service Principal

Register an Azure service principal and grant it the Monitoring Reader role plus the Contributor role on the Databricks workspace. TigerOps uses the Databricks REST API to collect job run status and cluster state.
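
A minimal sketch with the Azure CLI; the scopes and names are placeholders you would adapt to your environment.

create-service-principal.sh
#!/bin/bash
# Create the service principal with Monitoring Reader at subscription scope.
az ad sp create-for-rbac --name tigerops-databricks \
  --role "Monitoring Reader" \
  --scopes "/subscriptions/<sub-id>"

# Grant Contributor on the Databricks workspace so TigerOps can call the
# Databricks REST API for job run status and cluster state.
az role assignment create \
  --assignee "<appId-from-the-output-above>" \
  --role "Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<databricks-ws>"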

04

Set Job SLO Alerts

Define job duration SLOs and failure rate thresholds per job cluster or job name. TigerOps alerts when job runs exceed duration SLOs and correlates failures with cluster configuration drift.
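
As an illustrative sketch only — the endpoint and field names below are hypothetical, not a documented TigerOps API — an SLO definition might look like:

job-slo-alert.sh
#!/bin/bash
# Hypothetical alert definition: the URL and JSON fields here are
# assumptions; adjust to your actual TigerOps alert configuration.
curl -s -X POST "https://app.tigerops.net/api/v1/alerts" \
  -H "Authorization: Bearer ${TIGEROPS_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "databricks_job_slo",
    "job_name": "nightly-etl",
    "max_duration_minutes": 45,
    "failure_rate_threshold": 0.05
  }'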

Capabilities

What You Get Out of the Box

Cluster Resource Utilization

CPU, memory, disk I/O, and network per driver and executor node. Identify over-provisioned clusters wasting budget and under-provisioned clusters causing job slowdowns.

Job Run Performance

Job run duration, success and failure rates, queue wait time, and cluster startup latency per job. Track SLO compliance for production pipelines and ML training jobs.

Spark Stage & Task Metrics

Stage duration, task retry counts, shuffle read/write bytes, and spill-to-disk volume per job run. Identify data skew and shuffle bottlenecks causing stage slowdowns.

Notebook Execution Monitoring

Notebook run duration, command execution counts, and error rates from interactive and scheduled notebook runs. Track notebook performance regressions across code changes.

Auto-Scaling Cluster Insights

Scale-out and scale-in event history, node provisioning duration, and spot instance eviction counts for auto-scaling clusters. Understand capacity behavior under variable workloads.

AI Job Failure Root Cause

TigerOps AI analyzes Spark exception logs, executor lost events, and OOMKill signals to surface the most likely root cause of job failures, reducing debug time from hours to minutes.

Configuration

Databricks Cluster Init Script

Add this init script to your cluster policy to forward Spark metrics to TigerOps.

tigerops-init.sh
#!/bin/bash
# TigerOps Databricks init script
# Attach as a cluster-scoped init script via your cluster policy or the cluster UI
# Init script path: dbfs:/tigerops/init/tigerops-metrics.sh

set -euo pipefail

TIGEROPS_ENDPOINT="https://ingest.tigerops.net/api/v1/write"
TIGEROPS_API_KEY="${TIGEROPS_API_KEY:?TIGEROPS_API_KEY must be set in the cluster environment}"
CLUSTER_NAME="${DB_CLUSTER_NAME:-${DB_CLUSTER_ID}}"  # fall back to the cluster ID if the name variable is unset

# Install the TigerOps Spark metrics library into the cluster's Python environment
/databricks/python/bin/pip install 'tigerops-spark~=1.0' --quiet

# Write Spark metrics configuration
cat > /databricks/spark/conf/metrics.properties <<EOF
*.sink.tigerops.class=net.tigerops.spark.TigerOpsSink
*.sink.tigerops.endpoint=${TIGEROPS_ENDPOINT}
*.sink.tigerops.token=${TIGEROPS_API_KEY}
*.sink.tigerops.cluster=${CLUSTER_NAME}
*.sink.tigerops.period=15
*.sink.tigerops.unit=SECONDS
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF

# Enable Databricks diagnostic log export to Log Analytics
# (Configure via Azure Portal or ARM template separately)
echo "TigerOps Spark metrics sink configured for cluster: ${CLUSTER_NAME}"
FAQ

Common Questions

How does TigerOps collect Spark metrics from Databricks?

TigerOps provides a Databricks init script that installs a Spark metrics sink on your clusters. The sink pushes driver and executor metrics to the TigerOps remote-write endpoint every 15 seconds. Alternatively, TigerOps parses Databricks diagnostic logs streamed to Log Analytics for cluster and job events.
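
To confirm that diagnostic events are arriving, you can query the workspace directly. This sketch assumes the az log-analytics CLI extension and the DatabricksJobs table that the diagnostic export creates.

az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query "DatabricksJobs | where TimeGenerated > ago(1h) | take 10"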

Does TigerOps support Delta Live Tables pipelines?

Yes. TigerOps monitors Delta Live Tables pipeline runs, update duration, flow progress, and data quality expectation pass and fail rates from the Databricks Event Log streamed to Log Analytics.

Can TigerOps alert on Databricks job SLA breaches?

Yes. You can define a maximum allowed run duration per job or job cluster in TigerOps. When a run exceeds the SLA, TigerOps fires an alert with the current run duration, estimated completion time, and any executor failures observed so far.

Does TigerOps support Unity Catalog and workspace-level metrics?

Yes. TigerOps ingests Databricks account-level audit logs from Log Analytics, which include Unity Catalog access events, metastore operations, and workspace-level user activity for compliance and performance monitoring.
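
As a sketch, assuming the DatabricksUnityCatalog table produced by the unityCatalog diagnostic category, a quick audit summary could be pulled with:

az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query "DatabricksUnityCatalog | where TimeGenerated > ago(24h) | summarize Count = count() by ActionName | top 10 by Count"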

How does TigerOps help reduce Databricks cluster costs?

TigerOps tracks DBU consumption per cluster, job, and user alongside cluster utilization. The AI cost insights feature identifies clusters with low utilization that are good candidates for auto-termination or downsizing, surfacing potential monthly savings.

Get Started

Make Every Databricks Job Run Count

Cluster metrics, job SLO tracking, and AI root cause analysis for Azure Databricks. Deploy in 5 minutes.