SUMMARY:
See how XTIVIA built a production-grade, utility-aware CI/CD pipeline on Databricks — embedding NERC, FERC, and EPA compliance checks directly into every deployment gate.
The Goal: Production-Grade CI/CD for Databricks
An integrated U.S. energy company running generation reliability and performance workloads on Databricks needed a proper software delivery pipeline. Data engineers were deploying notebooks manually—zipping files, uploading them through the UI, and hoping nothing broke overnight. There were minimal tests, no clear environment separation, only a basic audit trail, and no rollback path short of manual intervention.
The objective was straightforward: build a CI/CD pipeline that treats Databricks workloads with the same engineering discipline applied to any production software system. Every code change tested. Every deployment gated. Every production release traceable and reversible.
What made this engagement distinct was going one step further—making the pipeline utility-aware. Energy sector data workloads have operational requirements that generic CI/CD templates don’t account for: concurrency constraints that prevent duplicate report generation, timeout policies aligned with plant data ingestion windows, service principal governance for NERC compliance, and run health thresholds tied to operational SLAs. XTIVIA encoded all of these as first-class quality gates.
Pipeline Architecture: Dev → Staging → Prod
The pipeline uses Databricks Asset Bundles to define all three environments in a single databricks.yml — each target mapped to its own isolated workspace with separate credentials, separate job registries, and separate notebook paths.
| Isolated workspaces | Production jobs deployed | Automated quality checks | Pipeline pass rate |
|---|---|---|---|
| 3 | 9 | 272 | 100% |
| Dev Workspace | Staging Workspace | Prod Workspace |
|---|---|---|
| Auto-deploys on merge to main · Unit tests + 145 static checks + 127 runtime checks · Jobs prefixed [dev mrautroy] / [dev] · No manual step required | GitHub Environment · required team reviewers · OIDC · Named team approval via the GitHub UI · Short-lived OIDC credentials (no PAT) · Branch restriction: main only · Integration tests + data contract tests · GitHub deployment record with approver, commit SHA, and timestamp | Manual trigger · operator types DEPLOY · Full gate suite · Release tag created on every successful deploy · One-click rollback to any previous tag · Clean [prod] job naming |
Developer workflow: A Git commit triggers PR CI → static checks → merge to main → auto-deploy to Dev → Staging promotion via GitHub Environment (named team approval, OIDC credentials, branch restriction, deployment record) → Prod promotion with DEPLOY confirmation and release tag. On GitHub Enterprise, Staging uses full environment protection rules instead of the manual keyword gate.
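The promotion sequence can be sketched as a small state machine. This is an illustrative sketch only — the stage names follow the workflow above, the DEPLOY keyword is the confirmation described for Prod, and the function names are hypothetical, not part of any Databricks or GitHub API.

```python
# Illustrative sketch of the fixed promotion order and the Prod
# confirmation keyword described in the workflow above.

STAGES = ["pr-ci", "dev", "staging", "prod"]

def next_stage(current: str) -> str:
    """Promotions must follow the fixed Dev -> Staging -> Prod order."""
    i = STAGES.index(current)
    if i == len(STAGES) - 1:
        raise ValueError("prod is the final stage; nothing to promote to")
    return STAGES[i + 1]

def confirm_prod_deploy(typed: str) -> bool:
    """Prod requires the operator to type the literal keyword DEPLOY."""
    return typed.strip() == "DEPLOY"
```

On GitHub Enterprise, the staging hop would be enforced by environment protection rules rather than code like this; the sketch just makes the ordering explicit.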
The Two-Phase Quality Gate System
At the heart of the pipeline are two quality gate phases that run on every deployment. Together, they execute 272 checks—covering both standard DevOps best practices and utility-sector-specific operational requirements. Critical, High, and Medium-severity findings block deployment. Low-severity findings log as warnings.
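The blocking rule stated above — Critical, High, and Medium findings block, Low findings warn — reduces to a few lines. The `Finding` type and `gate()` helper below are a minimal sketch, not the pipeline's actual API.

```python
from dataclasses import dataclass

# Severities that block a deployment, per the gate policy above.
BLOCKING = {"critical", "high", "medium"}

@dataclass
class Finding:
    check_id: str
    severity: str  # "critical" | "high" | "medium" | "low"
    message: str

def gate(findings: list) -> bool:
    """Return True if the deployment may proceed, logging each finding."""
    ok = True
    for f in findings:
        if f.severity in BLOCKING:
            print(f"BLOCK  {f.check_id}: {f.message}")
            ok = False
        else:
            print(f"WARN   {f.check_id}: {f.message}")
    return ok
```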
| Phase 1 — Static (pre-deployment) | Phase 2 — Runtime (post-deployment) |
|---|---|
| Reads YAML directly · no API call · works for brand-new jobs · 145 checks per deploy · Job name follows [env] naming convention — Blocks High · Failure notifications configured — Blocks High · run_as is a service principal — Blocks High · max_concurrent_runs ≤ 1 — Blocks High · Task timeouts set — Blocks High · No plaintext secrets in parameters — Blocks High · Task retries configured — Blocks Medium · Libraries pinned to specific versions — Blocks Medium · Notebook paths defined — Blocks Medium · + 7 more checks | Databricks API · only the current target’s jobs · 127 checks per deploy · Job ACL / permissions (least-privilege) — Blocks High · No broad CAN_MANAGE grants — Blocks High · Historical run health (30-day lookback) — Blocks High · Failure rate below threshold — Blocks High · P95 duration vs operational SLA — Blocks Medium · run_as identity confirmed in workspace — Blocks Medium · DBR version ≥ 13.3 — Warning Low · Target-scoped: only checks [env] jobs — Info · Skips gracefully on first deploy — Info · + 118 more checks |
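A few of the Phase 1 static checks can be sketched against a job definition parsed from databricks.yml, shown here as a plain dict. The dict shape and function names are simplifying assumptions for illustration; only the check semantics come from the table above.

```python
import re

def check_naming(job: dict, env: str) -> bool:
    """Job name must start with the [env] prefix, e.g. '[dev] ...'."""
    return job.get("name", "").startswith(f"[{env}]")

def check_concurrency(job: dict) -> bool:
    """max_concurrent_runs must be at most 1 to prevent duplicate runs."""
    return job.get("max_concurrent_runs", 1) <= 1

def check_no_plaintext_secrets(job: dict) -> bool:
    """Parameter keys/values must not look like embedded credentials."""
    suspicious = re.compile(r"(password|token|secret)\s*[:=]", re.I)
    params = job.get("parameters", {})
    return not any(suspicious.search(f"{k}={v}") for k, v in params.items())
```

Because these checks read the YAML directly, they work for brand-new jobs that do not yet exist in the workspace — which is exactly why Phase 1 runs pre-deployment.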
What Makes It Utility-Aware: Insight360 Check Enrichment
Standard CI/CD pipelines validate code correctness. Utility data pipelines need more — they need to know that a generation reliability job cannot run concurrently, that task timeouts must align to plant data ingestion windows, that a personal user token in run_as is a NERC compliance risk, and that run health is measured against operational SLAs rather than generic uptime metrics.
XTIVIA embedded its Insight360 utility check pack into the pipeline gates — translating energy-sector operational knowledge into enforceable YAML and API checks. The result: every deployment is validated against the same standards that Insight360 uses to assess Databricks platform maturity for utility customers.
| NERC Reliability & Cyber | Generation Operations | EPA Environmental |
|---|---|---|
| UTL.013 · UTL.014 · UTL.015 GADS outage reporting readiness, CIP asset inventory classification, and CIP access-log auditability enforced as data contract tests before staging UAT. | UTL.001 · UTL.002 · UTL.003 Plant/unit master linkage, generation telemetry freshness (<24h threshold), and outage event classification completeness validated pre-deploy to staging. | UTL.021 · UTL.022 · UTL.024 CEMS hourly completeness, GHG Subpart D emissions factor readiness, and emissions-to-generation cross-reference integrity checked before Gold layer promotion. |
| FERC & EIA Reporting UTL.016 · UTL.017 · UTL.018 FERC Form 1 / EIA generation reporting field coverage, PPA contract master completeness, and PPA pricing schedule data quality validated in the staging gate. | Wholesale Commercial UTL.006 · UTL.007 · UTL.008 Wholesale contract master completeness, billing determinant MWh and rate fields, and settlement variance detectability required before UAT sign-off. | SOX & Audit Trail UTL.030 · UTL.031 Settlement and billing determinant audit columns ( created_date, modified_by, approval_status) and generation output lineage checks enforced before prod. |
These utility-specific checks run as data contract tests in the staging gate—after static job checks pass and before deployment proceeds. They ensure that UAT teams always receive data that meets the regulatory column requirements for the energy domain.
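Two of these contract checks can be sketched as follows. The UTL identifiers and the 24-hour freshness threshold come from the article; the column set, function names, and row shapes are illustrative assumptions, not the Insight360 implementation.

```python
from datetime import datetime, timedelta, timezone

# SOX audit-trail columns required by UTL.030 on settlement/billing tables.
REQUIRED_AUDIT_COLUMNS = {"created_date", "modified_by", "approval_status"}

def utl_030_audit_columns(columns: set) -> bool:
    """All required audit columns must be present on the table."""
    return REQUIRED_AUDIT_COLUMNS <= columns

def utl_002_telemetry_freshness(latest_reading, now=None) -> bool:
    """Generation telemetry must be fresher than the 24h threshold."""
    now = now or datetime.now(timezone.utc)
    return now - latest_reading < timedelta(hours=24)
```

Running contract tests like these in the staging gate, after static checks but before promotion, is what guarantees UAT never sees data missing its regulatory columns.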
Job Governance: What the Gates Enforce on Every Deploy
The nine production jobs — five Gold layer aggregations, two ML models, one validation workflow, and one master orchestrator — are validated across 16 checks per job on every deployment. The ten highlighted below reflect the specific operational requirements of energy sector generation reliability workloads:
| Check | Why It Matters for Utility Workloads | Phase | Gate |
|---|---|---|---|
| max_concurrent_runs ≤ 1 | Concurrent generation availability runs produce duplicate reports consumed by plant operations. | Static | Blocks — High |
| Failure notifications configured | Silent Gold layer failures mean stale data reaches operations teams before anyone knows the job failed. | Static | Blocks — High |
| run_as service principal | Personal user tokens expire and are tied to individuals. Energy workloads require stable, auditable identities for NERC compliance. | Static | Blocks — High |
| Task timeouts set | Timeout windows must align to plant data ingestion schedules. Runaway jobs block downstream generation reporting pipelines. | Static | Blocks — High |
| No plaintext secrets | SCADA integration credentials and PI historian tokens must never appear in job parameters or task configs. | Static | Blocks — High |
| Task retries configured | Generation telemetry ingestion has transient failures. Tasks must be idempotent and retry-safe for reliable data completeness. | Static | Blocks — Med |
| Libraries pinned to versions | Unpinned energy analytics libraries can break derate and heatrate calculations silently across releases. | Static + Runtime | Blocks — Med |
| Job ACL least-privilege | No broad CAN_MANAGE grants on generation reliability jobs — consistent with NERC CIP access control principles. | Runtime | Blocks — High |
| P95 duration vs SLA | Gold jobs must complete within operational windows. P95 exceeding threshold signals performance degradation before it impacts operations. | Runtime | Blocks — Med |
| Run health (30-day lookback) | Historical failure rate tracked per job. Deteriorating health caught before the next deploy promotes bad code to the next environment. | Runtime | Blocks — High |
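The two runtime health checks can be sketched over the 30-day run history pulled from the Jobs API, represented here as plain lists. The thresholds and function names are illustrative — the article does not publish the actual SLA or failure-rate limits.

```python
import statistics

def failure_rate(results: list) -> float:
    """Fraction of FAILED runs in the lookback window."""
    return results.count("FAILED") / len(results) if results else 0.0

def p95_duration_ok(durations_s: list, sla_s: float) -> bool:
    """P95 run duration must stay inside the operational SLA window."""
    if not durations_s:
        return True  # first deploy: no history, skip gracefully
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile
    p95 = statistics.quantiles(durations_s, n=20)[-1]
    return p95 <= sla_s
```

Gating on P95 rather than the mean is the point: a job whose average looks healthy can still breach its operational window on the slow tail, and that is what downstream generation reporting actually feels.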
Results
| Outcome | Detail |
|---|---|
| All 9 jobs: fully governed | Service principal, failure notifications, timeouts, and retries enforced on all 9 Gold and ML jobs across Dev, Staging, and Prod — by the pipeline, not by convention. |
| Three isolated workspaces live | Dev, Staging, and Prod each have separate Databricks URLs, tokens, and job registries. No code reaches prod without clearing 272 automated checks. |
| Full deployment audit trail | On GitHub Enterprise, Staging uses GitHub Environment protection rules: named team reviewer, OIDC credentials (no PAT), and branch restriction. Every staging deploy creates a GitHub deployment record with approver name, commit SHA, and timestamp. Every prod deploy creates a prod-release-YYYYMMDD git tag with one-click rollback. |
| Utility checks embedded permanently | NERC, FERC, EPA, and SOX checks run as data contract tests in the staging gate — regulatory requirements validated on every promotion. |
“The pipeline doesn’t just check that code compiles — it checks that a generation reliability job is operationally safe to run in production. That’s the difference between generic CI/CD and a pipeline built for utility data workloads.”
Ready to build a production-grade, utility-aware Databricks pipeline? Contact us today!