Governance¶
For the step-by-step procedure for enabling Databricks Asset Bundle (DAB) integration, see Configure Bundles. This page explains the governance choices the framework supports and how they fit together.
Where governance lives in an LHP project¶
LHP touches three governance surfaces. Unity Catalog (UC) gets table names, comments, and column metadata from the FlowGroup definitions. The generated Python carries audit columns through operational metadata. The Databricks Asset Bundle wraps generated pipelines into deployable resource definitions with environment-specific permissions.
Each surface answers a different question. UC governance answers “what does this table represent and who can read it?”. Operational metadata answers “where did this row come from and when did it land?”. Bundle configuration answers “how does this pipeline get from a Git commit to a running pipeline in a workspace?”.
The pages on this site treat them as separate concerns because they have different release cycles and different review audiences. A change to a table comment is low-risk and platform-team-reviewed. A change to bundle permissions affects which humans can run which pipelines and needs security review.
Unity Catalog naming as a flow of substitutions¶
Every write target in LHP produces a Unity Catalog object —
catalog.schema.table. The three-part name is the most common piece
of configuration that has to differ across environments, and the
recommended pattern is to source every part from a substitution token:
write_target:
type: streaming_table
database: "${SILVER_CATALOG}.${SILVER_SCHEMA}"
table: "%{entity}_silver"
comment: "Silver layer for %{entity}, validated and enriched."
The catalog and schema come from substitution tokens, so the dev
pipeline writes to main_silver_dev.orders.customers_silver and the
prod pipeline writes to main_silver_prod.orders.customers_silver
from the same YAML. The local variable %{entity} keeps the
FlowGroup parameterised by entity name; the comment string carries the
same.
The medallion-token convention pairs with this naturally. A standard
substitution-token set — bronze_catalog, silver_catalog,
gold_catalog — gives every write target a predictable home and
keeps the substitution files small.
Operational metadata as an audit trail¶
Every row in a generated table can carry standard audit columns
through LHP’s operational metadata system. The columns are defined
once at the project level in lhp.yaml:
operational_metadata:
columns:
ingest_timestamp:
expression: "F.current_timestamp()"
description: "When the record was ingested"
applies_to: [streaming_table, materialized_view]
source_file:
expression: "F.input_file_name()"
description: "Source file path (CloudFiles only)"
applies_to: [streaming_table]
pipeline_run_id:
expression: "F.lit(spark.conf.get('pipelines.id'))"
description: "Pipeline run identifier"
The fields are straightforward. expression is the PySpark
expression that produces the column value; LHP inlines it into the
generated code. description becomes column metadata.
applies_to limits which write-target types receive the column —
input_file_name() only makes sense in CloudFiles-backed streaming
tables and produces an error elsewhere. additional_imports adds
import statements the expression needs.
Layer presets bundle the metadata columns appropriate to each
layer. A bronze_standard preset declares ingest_timestamp,
source_file, and pipeline_run_id. A silver_standard preset
declares updated_at (the silver-layer equivalent). Operational
metadata is additive across preset, FlowGroup, and action levels —
LHP deep-merges with deduplication, so a FlowGroup can add a column
without losing the preset’s defaults.
The payoff is post-hoc debugging. When a downstream report shows suspicious numbers, the operational metadata columns let you trace exactly which pipeline run wrote which rows, when, and from which source file. Without them, you reverse-engineer that information from DLT logs and timestamps, which is much slower.
Lineage from generated code¶
Unity Catalog’s automatic lineage tracker watches Spark execution and records read/write relationships between tables. LHP’s generated Python uses standard Spark and DLT decorators, so the lineage tracker sees every read and write without LHP doing anything special.
The implication is that LHP’s job is to generate code that names
tables consistently — the same ${SILVER_CATALOG}.orders.customers_silver
across pipelines — so the lineage graph in UC connects the right
nodes. Inconsistent table naming across pipelines (one FlowGroup
writes to main.orders.customers, another reads from
catalog.orders.customers) produces a lineage graph with
disconnected components.
The substitution layer is the enforcement mechanism. A
project-wide ${SILVER_CATALOG} token means every FlowGroup that
reads or writes silver tables uses the same catalog name. Hardcoded
catalog names are the predictable source of broken lineage.
Bundle integration as deployment governance¶
The Databricks Asset Bundle layer turns FlowGroups into deployable
units. LHP scaffolds the bundle structure by default with lhp init
(opt out with --no-bundle if you manage bundles separately). The
generated structure includes databricks.yml, per-environment
target configuration, and resource files (*.pipeline.yml,
*.job.yml) that point at the generated Python.
lhp deps --format job produces job resource definitions from the
dependency analysis. The graph of FlowGroups and their cross-pipeline
dependencies determines the task graph in the generated job. This
gives you an orchestrated multi-pipeline deployment from the same
declarative source, with the dependency edges deduced from the data
rather than maintained by hand.
The recommendation to keep generated bundle resources in a dedicated
directory (bundle/generated/ or similar) is operational hygiene.
Generated and hand-written resources should not share a directory
because regeneration overwrites everything in its output path. A
clean separation means you can regenerate bundle resources freely
without worrying about losing manual additions.
Per-environment bundle targets pair with the substitution-file
pattern. The bundle’s targets.dev and targets.prod entries
control workspace, permissions, and tags. The substitution file
controls the catalog and schema names the generated code writes to.
Together they specify both where the pipeline deploys and what
data it touches. Keep the two layers separate; do not try to encode
catalog overrides in the bundle target file.
DQE tiering as a governance lever¶
The data-quality tier per layer is itself a governance decision.
Bronze warn-only is permissive on purpose — raw data is precious,
even bad raw data. Silver drop is where you commit to a contract:
bad rows do not propagate, but you save them through a quarantine
table for review. Gold fail is the strict tier reserved for
business-critical invariants (a primary-key uniqueness violation, a
referential-integrity failure on a reference table).
Each tier corresponds to a different governance posture. Bronze says
“we will keep everything and triage later”. Silver says “downstream
consumers can trust this layer’s schema”. Gold says “if this is
wrong, stop the pipeline”. The decision belongs to the data steward
of each domain, not the pipeline author — which is why DQE files
live in expectations/ separately from FlowGroup YAML and benefit
from CODEOWNERS scoping.
Test reporting for audit¶
LHP’s test actions generate validation views. Their results can be published to Azure DevOps test reports or a Delta audit table — turning pipeline tests into a governance signal that surfaces in the same dashboards as application tests. See Test Result Reporting (Publishing) for the publishing mechanics.
Anti-patterns¶
Not using operational metadata. Debugging production data incidents without ingest timestamps and source file paths is much harder than with them. The cost of adding the metadata is one preset declaration; the cost of debugging without it is hours per incident.
Blank or trivial table comments. A UC catalog where every comment is the table name repeated is no better than no catalog. Meaningful descriptions pay for themselves the first time a downstream consumer finds the right table without asking.
Hardcoded catalog names. Beyond breaking environment promotion, they fragment UC lineage. Every catalog name should come from a substitution token.
Mixing generated and hand-written bundle resources in one directory. Regeneration overwrites everything in the output path. Keep them in separate directories.
See also¶
Configure Bundles for the step-by-step Databricks Asset Bundle integration procedure.
Operational Metadata for the full operational metadata reference, including all configuration fields.
Test Result Reporting (Publishing) for publishing test results to external audit systems.
Environments for the substitution patterns that pair with UC naming consistency.
Comments are queryable governance¶
The
commentfield on write targets propagates to the generated@dp.tabledecorator as acomment=...argument. Databricks stores this comment as table metadata; it shows up in the Data Explorer UI and is queryable throughDESCRIBE TABLE EXTENDEDor the UC information schema.This is the single highest-leverage governance field LHP exposes. A team that fills in meaningful comments on every silver and gold table (“Silver layer orders — deduped, validated, enriched with customer data; refresh every 15 minutes; owner: erp-team”) makes the data discoverable to downstream consumers without any separate catalog tool. A team that leaves comments blank — or fills them with the table name repeated — gets a UC catalog full of opaque table names.
The same applies at the column level through schema transforms, where each renamed column can carry a description. The descriptions flow into the table’s column metadata.