SUMMARY:

See how XTIVIA built a production-grade, utility-aware CI/CD pipeline on Databricks — embedding NERC, FERC, and EPA compliance checks directly into every deployment gate.

The Goal: Production-Grade CI/CD for Databricks

An integrated U.S. energy company running generation reliability and performance workloads on Databricks needed a proper software delivery pipeline. Data engineers were deploying notebooks manually—zipping files, uploading them through the UI, hoping nothing broke overnight. Tests were minimal, environment separation was unclear, the audit trail was rudimentary, and rollback was a manual scramble when something went wrong.

The objective was straightforward: build a CI/CD pipeline that treats Databricks workloads with the same engineering discipline applied to any production software system. Every code change tested. Every deployment gated. Every production release traceable and reversible.

What made this engagement distinct was going one step further—making the pipeline utility-aware. Energy sector data workloads have operational requirements that generic CI/CD templates don’t account for: concurrency constraints that prevent duplicate report generation, timeout policies aligned with plant data ingestion windows, service principal governance for NERC compliance, and run health thresholds tied to operational SLAs. XTIVIA encoded all of these as first-class quality gates.

Pipeline Architecture: Dev → Staging → Prod

The pipeline uses Databricks Asset Bundles to define all three environments in a single databricks.yml file — each target mapped to its own isolated workspace with separate credentials, separate job registries, and separate notebook paths.
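A minimal sketch of what such a bundle file might look like — workspace hosts and the service principal ID are placeholders, not the customer's actual configuration:

```yaml
# databricks.yml — illustrative sketch; hosts and IDs are placeholders
bundle:
  name: generation-reliability

targets:
  dev:
    mode: development        # jobs get a [dev <user>] prefix automatically
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  staging:
    workspace:
      host: https://staging-workspace.cloud.databricks.com
  prod:
    mode: production         # enforces explicit run_as and clean job naming
    workspace:
      host: https://prod-workspace.cloud.databricks.com
    run_as:
      service_principal_name: "00000000-0000-0000-0000-000000000000"
```

Each target resolves to its own workspace, so `databricks bundle deploy --target staging` can never touch prod resources.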

By the numbers: 3 isolated workspaces · 9 production jobs deployed · 272 automated quality checks · 100% pipeline pass rate

Dev Workspace: auto-deploys on merge to main
Unit tests + 145 static checks + 127 runtime checks. Jobs prefixed [dev mrautroy] / [dev]. No manual step required.

Staging Workspace: GitHub Environment · required team reviewers · OIDC
Named team approval via GitHub UI. OIDC short-lived credentials (no PAT). Branch restriction: main only. Integration tests + data contract tests. GitHub deployment record with approver, commit SHA, and timestamp.

Prod Workspace: manual trigger (type DEPLOY to confirm)
Full gate suite. Release tag created on every successful deploy. One-click rollback to any previous tag. Clean [prod] job naming.

Developer workflow: A Git commit triggers PR CI → static checks → merge to main → auto-deploy to Dev → Staging promotion via GitHub Environment (named team approval, OIDC credentials, branch restriction, deployment record) → Prod promotion with DEPLOY confirmation and release tag. On GitHub Enterprise, Staging uses full environment protection rules instead of the manual keyword gate.
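The workflow above maps naturally onto a GitHub Actions pipeline with a protected environment on the staging job. A condensed sketch — workflow name, gate-script path, and step details are assumptions, not the customer's actual workflow:

```yaml
# .github/workflows/deploy.yml — illustrative sketch; names are assumptions
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy-dev:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: python gates/static_checks.py --target dev   # hypothetical gate script
      - run: databricks bundle deploy --target dev

  deploy-staging:
    needs: deploy-dev
    runs-on: ubuntu-latest
    environment: staging     # named reviewers + branch restriction live here
    permissions:
      id-token: write        # OIDC — short-lived credentials, no PAT
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy --target staging
```

The `environment: staging` line is what makes the approval gate and deployment record automatic: GitHub pauses the job until a named reviewer approves, then records approver, commit SHA, and timestamp.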

The Two-Phase Quality Gate System

At the heart of the pipeline are two quality gate phases that run on every deployment. Together, they execute 272 checks—covering both standard DevOps best practices and utility-sector-specific operational requirements. Critical, High, and Medium-severity findings block deployment. Low-severity findings log as warnings.
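The blocking rule itself reduces to a few lines. A minimal sketch of the severity-gating logic — class and function names are illustrative, not the production gate code:

```python
from dataclasses import dataclass

# Severities that block deployment; Low only warns.
BLOCKING = {"Critical", "High", "Medium"}

@dataclass
class Finding:
    check_id: str
    severity: str
    message: str

def evaluate(findings: list[Finding]) -> bool:
    """Return True if the deployment may proceed."""
    blockers = [f for f in findings if f.severity in BLOCKING]
    for f in blockers:
        print(f"BLOCK {f.check_id} [{f.severity}]: {f.message}")
    for f in findings:
        if f.severity == "Low":
            print(f"WARN  {f.check_id}: {f.message}")
    return not blockers
```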

Phase 1 — Static (pre-deployment)
Reads YAML directly · no API call · works for brand-new jobs · 145 checks per deploy

Job name follows [env] naming convention — Blocks High
Failure notifications configured — Blocks High
run_as is a service principal — Blocks High
max_concurrent_runs ≤ 1 — Blocks High
Task timeouts set — Blocks High
No plaintext secrets in parameters — Blocks High
Task retries configured — Blocks Medium
Libraries pinned to specific versions — Blocks Medium
Notebook paths defined — Blocks Medium
+ 7 more checks

Phase 2 — Runtime (post-deployment)
Databricks API · only current target’s jobs · 127 checks per deploy

Job ACL / permissions (least-privilege) — Blocks High
No broad CAN_MANAGE grants — Blocks High
Historical run health (30-day lookback) — Blocks High
Failure rate below threshold — Blocks High
P95 duration vs operational SLA — Blocks Medium
run_as identity confirmed in workspace — Blocks Medium
DBR version ≥ 13.3 — Warning Low
Target-scoped: only checks [env] jobs — Info
Skips gracefully on first deploy — Info
+ 118 more checks
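To make the static phase concrete, here is a minimal sketch of how a handful of the Phase 1 checks might be implemented against a single job definition parsed from the bundle YAML. The field names follow the Databricks job settings schema; the function itself is illustrative, not the production gate:

```python
def static_checks(job: dict, env: str) -> list[str]:
    """Run a few Phase 1 checks on one job definition parsed from databricks.yml."""
    findings = []
    # Naming convention: every job carries its environment prefix.
    if not job.get("name", "").startswith(f"[{env}]"):
        findings.append("HIGH: job name missing [env] prefix")
    # Concurrency: duplicate runs would produce duplicate reports.
    if job.get("max_concurrent_runs", 1) > 1:
        findings.append("HIGH: max_concurrent_runs must be <= 1")
    # Identity: run_as must be a service principal, not a personal token.
    if "service_principal_name" not in job.get("run_as", {}):
        findings.append("HIGH: run_as is not a service principal")
    for task in job.get("tasks", []):
        key = task.get("task_key", "?")
        if not task.get("timeout_seconds"):
            findings.append(f"HIGH: task {key} has no timeout")
        if task.get("max_retries", 0) < 1:
            findings.append(f"MEDIUM: task {key} has no retries configured")
    return findings
```

Because the checks read the YAML-derived settings directly, they run before any workspace exists to query — which is why Phase 1 works even for brand-new jobs.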

What Makes It Utility-Aware: Insight360 Check Enrichment

Standard CI/CD pipelines validate code correctness. Utility data pipelines need more — they need to know that a generation reliability job cannot run concurrently, that task timeouts must align to plant data ingestion windows, that a personal user token in run_as is a NERC compliance risk, and that run health is measured against operational SLAs rather than generic uptime metrics.

XTIVIA embedded its Insight360 utility check pack into the pipeline gates — translating energy-sector operational knowledge into enforceable YAML and API checks. The result: every deployment is validated against the same standards that Insight360 uses to assess Databricks platform maturity for utility customers.

NERC Reliability & Cyber (UTL.013 · UTL.014 · UTL.015)
GADS outage reporting readiness, CIP asset inventory classification, and CIP access-log auditability enforced as data contract tests before staging UAT.

Generation Operations (UTL.001 · UTL.002 · UTL.003)
Plant/unit master linkage, generation telemetry freshness (<24h threshold), and outage event classification completeness validated pre-deploy to staging.

EPA Environmental (UTL.021 · UTL.022 · UTL.024)
CEMS hourly completeness, GHG Subpart D emissions factor readiness, and emissions-to-generation cross-reference integrity checked before Gold layer promotion.

FERC & EIA Reporting (UTL.016 · UTL.017 · UTL.018)
FERC Form 1 / EIA generation reporting field coverage, PPA contract master completeness, and PPA pricing schedule data quality validated in the staging gate.

Wholesale Commercial (UTL.006 · UTL.007 · UTL.008)
Wholesale contract master completeness, billing determinant MWh and rate fields, and settlement variance detectability required before UAT sign-off.

SOX & Audit Trail (UTL.030 · UTL.031)
Settlement and billing determinant audit columns (created_date, modified_by, approval_status) and generation output lineage checks enforced before prod.

These utility-specific checks run as data contract tests in the staging gate—after static job checks pass and before deployment proceeds. They ensure that UAT teams always receive data that meets the regulatory column requirements for the energy domain.
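As an illustration, the SOX audit-column requirement (UTL.030) largely reduces to a schema assertion. A sketch — the column names come from the check description above, but the function itself is hypothetical:

```python
# Audit columns required on settlement and billing determinant tables (UTL.030).
REQUIRED_AUDIT_COLUMNS = {"created_date", "modified_by", "approval_status"}

def audit_columns_missing(table_columns: set[str]) -> set[str]:
    """Return the required audit columns a table is missing (empty set = pass)."""
    return REQUIRED_AUDIT_COLUMNS - table_columns
```

In the staging gate, a non-empty result for any in-scope table fails the data contract test and halts the promotion.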

Job Governance: What the Gates Enforce on Every Deploy

The nine production jobs — five Gold layer aggregations, two ML models, one validation workflow, and one master orchestrator — are validated across 16 checks per job on every deployment. The checks below reflect the specific operational requirements of energy sector generation reliability workloads:

| Check | Why It Matters for Utility Workloads | Phase | Gate |
| --- | --- | --- | --- |
| max_concurrent_runs ≤ 1 | Concurrent generation availability runs produce duplicate reports consumed by plant operations. | Static | Blocks — High |
| Failure notifications configured | Silent Gold layer failures mean stale data reaches operations teams before anyone knows the job failed. | Static | Blocks — High |
| run_as service principal | Personal user tokens expire and are tied to individuals. Energy workloads require stable, auditable identities for NERC compliance. | Static | Blocks — High |
| Task timeouts set | Timeout windows must align to plant data ingestion schedules. Runaway jobs block downstream generation reporting pipelines. | Static | Blocks — High |
| No plaintext secrets | SCADA integration credentials and PI historian tokens must never appear in job parameters or task configs. | Static | Blocks — High |
| Task retries configured | Generation telemetry ingestion has transient failures. Tasks must be idempotent and retry-safe for reliable data completeness. | Static | Blocks — Med |
| Libraries pinned to versions | Unpinned energy analytics libraries can break derate and heat rate calculations silently across releases. | Static + Runtime | Blocks — Med |
| Job ACL least-privilege | No broad CAN_MANAGE grants on generation reliability jobs — consistent with NERC CIP access control principles. | Runtime | Blocks — High |
| P95 duration vs SLA | Gold jobs must complete within operational windows. P95 exceeding threshold signals performance degradation before it impacts operations. | Runtime | Blocks — Med |
| Run health (30-day lookback) | Historical failure rate tracked per job. Deteriorating health caught before the next deploy promotes bad code to the next environment. | Runtime | Warning — Low |
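The two runtime health checks reduce to simple statistics over a job's recent run history. A minimal sketch — the percentile method and thresholds are illustrative assumptions, not the production gate's exact logic:

```python
import math

def p95(durations_s: list[float]) -> float:
    """Nearest-rank 95th percentile of run durations, in seconds."""
    ordered = sorted(durations_s)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def run_health(results: list[bool], durations_s: list[float],
               sla_s: float, max_failure_rate: float = 0.1) -> list[str]:
    """Check 30-day run history against failure-rate and P95-duration thresholds."""
    findings = []
    failure_rate = results.count(False) / len(results)
    if failure_rate > max_failure_rate:
        findings.append(f"HIGH: failure rate {failure_rate:.0%} over lookback window")
    if p95(durations_s) > sla_s:
        findings.append("MEDIUM: P95 duration exceeds operational SLA")
    return findings
```

On a first deploy there is no run history to evaluate, which is why the runtime phase skips these checks gracefully rather than failing.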

Results

| Outcome | Detail |
| --- | --- |
| All 9 jobs fully governed | Service principal, failure notifications, timeouts, and retries enforced on all 9 Gold and ML jobs across Dev, Staging, and Prod — by the pipeline, not by convention. |
| Three isolated workspaces live | Dev, Staging, and Prod each have separate Databricks URLs, tokens, and job registries. No code reaches prod without clearing 272 automated checks. |
| Full deployment audit trail | On GitHub Enterprise, Staging uses GitHub Environment protection rules: named team reviewer, OIDC credentials (no PAT), and branch restriction. Every staging deploy creates a GitHub deployment record with approver name, commit SHA, and timestamp. Every prod deploy creates a prod-release-YYYYMMDD git tag with one-click rollback. |
| Utility checks embedded permanently | NERC, FERC, EPA, and SOX checks run as data contract tests in the staging gate — regulatory requirements validated on every promotion. |

“The pipeline doesn’t just check that code compiles — it checks that a generation reliability job is operationally safe to run in production. That’s the difference between generic CI/CD and a pipeline built for utility data workloads.”

Ready to build a production-grade, utility-aware Databricks pipeline? Contact us today!