Governance

For the step-by-step procedure for enabling Databricks Asset Bundle (DAB) integration, see Configure Bundles. This page explains the governance choices the framework supports and how they fit together.

Where governance lives in an LHP project

LHP touches three governance surfaces. Unity Catalog (UC) gets table names, comments, and column metadata from the FlowGroup definitions. The generated Python carries audit columns through operational metadata. The Databricks Asset Bundle wraps generated pipelines into deployable resource definitions with environment-specific permissions.

Each surface answers a different question. UC governance answers “what does this table represent and who can read it?” Operational metadata answers “where did this row come from and when did it land?” Bundle configuration answers “how does this pipeline get from a Git commit to a running pipeline in a workspace?”

The pages on this site treat them as separate concerns because they have different release cycles and different review audiences. A change to a table comment is low-risk and can be reviewed by the platform team. A change to bundle permissions affects which people can run which pipelines and needs security review.

Unity Catalog naming as a flow of substitutions

Every write target in LHP produces a Unity Catalog object with a three-part name: catalog.schema.table. That name is the most common piece of configuration that has to differ across environments, and the recommended pattern is to source every part from a substitution token:

write_target:
  type: streaming_table
  database: "${SILVER_CATALOG}.${SILVER_SCHEMA}"
  table: "%{entity}_silver"
  comment: "Silver layer for %{entity}, validated and enriched."

The catalog and schema come from substitution tokens, so the dev pipeline writes to main_silver_dev.orders.customers_silver and the prod pipeline writes to main_silver_prod.orders.customers_silver from the same YAML. The local variable %{entity} keeps the FlowGroup parameterised by entity name, and the comment string uses the same variable.

The medallion-token convention pairs with this naturally. A standard substitution-token set — bronze_catalog, silver_catalog, gold_catalog — gives every write target a predictable home and keeps the substitution files small.
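The two-stage substitution described above can be sketched in plain Python. This is an illustration only, not LHP's resolver: the token values, the `resolve` and `full_table_name` helpers, and the order in which local variables and environment tokens are expanded are all assumptions made for the example.

```python
import re

# Hypothetical per-environment substitution values. The token names mirror the
# YAML above; the values are illustrative only.
DEV_TOKENS = {"SILVER_CATALOG": "main_silver_dev", "SILVER_SCHEMA": "orders"}
PROD_TOKENS = {"SILVER_CATALOG": "main_silver_prod", "SILVER_SCHEMA": "orders"}

ENV_TOKEN = re.compile(r"\$\{(\w+)\}")  # ${TOKEN}: environment substitution
LOCAL_VAR = re.compile(r"%\{(\w+)\}")   # %{name}: FlowGroup-local variable


def resolve(text, env_tokens, local_vars):
    """Expand %{var} then ${TOKEN}; an unknown name raises KeyError."""
    text = LOCAL_VAR.sub(lambda m: local_vars[m.group(1)], text)
    return ENV_TOKEN.sub(lambda m: env_tokens[m.group(1)], text)


def full_table_name(write_target, env_tokens, local_vars):
    """Build the three-part catalog.schema.table name for one environment."""
    database = resolve(write_target["database"], env_tokens, local_vars)
    table = resolve(write_target["table"], env_tokens, local_vars)
    return f"{database}.{table}"


write_target = {
    "database": "${SILVER_CATALOG}.${SILVER_SCHEMA}",
    "table": "%{entity}_silver",
}
local_vars = {"entity": "customers"}

dev_name = full_table_name(write_target, DEV_TOKENS, local_vars)
prod_name = full_table_name(write_target, PROD_TOKENS, local_vars)
print(dev_name)   # main_silver_dev.orders.customers_silver
print(prod_name)  # main_silver_prod.orders.customers_silver
```

Raising on an unknown token, rather than leaving the placeholder in place, is a deliberate choice in this sketch: a missing token should fail at generation time, not surface as a literal `${...}` in a table name.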

Comments are queryable governance

The comment field on write targets propagates to the generated @dp.table decorator as a comment=... argument. Databricks stores this comment as table metadata; it shows up in the Catalog Explorer UI (formerly Data Explorer) and is queryable through DESCRIBE TABLE EXTENDED or the UC information schema.

This is the single highest-leverage governance field LHP exposes. A team that fills in meaningful comments on every silver and gold table (“Silver layer orders — deduped, validated, enriched with customer data; refresh every 15 minutes; owner: erp-team”) makes the data discoverable to downstream consumers without any separate catalog tool. A team that leaves comments blank — or fills them with the table name repeated — gets a UC catalog full of opaque table names.

The same applies at the column level through schema transforms, where each renamed column can carry a description. The descriptions flow into the table’s column metadata.
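One way to keep table comments meaningful is a small review-time check over the write targets. The sketch below is a hypothetical lint, not an LHP feature: `comment_problem` and the sample targets are invented for illustration, and it flags only the two failure modes named above (blank comments and comments that repeat the table name).

```python
def _normalise(text):
    """Lowercase and keep only letters and digits, so 'Customers silver' matches 'customers_silver'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def comment_problem(table_name, comment):
    """Return a reason string if the comment is blank or trivial, else None."""
    text = (comment or "").strip()
    if not text:
        return "blank comment"
    if _normalise(text) == _normalise(table_name):
        return "comment repeats the table name"
    return None


# Hypothetical write targets, already parsed from FlowGroup YAML.
targets = [
    {"table": "customers_silver", "comment": "Customers silver"},
    {"table": "orders_silver", "comment": ""},
    {"table": "orders_gold", "comment": "Daily order totals per region; owner: erp-team"},
]

problems = {t["table"]: comment_problem(t["table"], t["comment"]) for t in targets}
for table, problem in problems.items():
    if problem:
        print(f"{table}: {problem}")
```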

Operational metadata as an audit trail

Every row in a generated table can carry standard audit columns through LHP’s operational metadata system. The columns are defined once at the project level in lhp.yaml:

lhp.yaml operational metadata block

operational_metadata:
  columns:
    ingest_timestamp:
      expression: "F.current_timestamp()"
      description: "When the record was ingested"
      applies_to: [streaming_table, materialized_view]
    source_file:
      expression: "F.input_file_name()"
      description: "Source file path (CloudFiles only)"
      applies_to: [streaming_table]
    pipeline_run_id:
      expression: "F.lit(spark.conf.get('pipelines.id'))"
      description: "Pipeline run identifier"

The fields are straightforward. expression is the PySpark expression that produces the column value; LHP inlines it into the generated code. description becomes column metadata. applies_to limits which write-target types receive the column: input_file_name() only makes sense in CloudFiles-backed streaming tables and produces an error elsewhere. An optional additional_imports field, not shown in the example above, adds import statements the expression needs.
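The effect of applies_to can be sketched as a filter over the column definitions. This is a minimal illustration under two assumptions that the text above does not confirm: a column without applies_to is treated as applying to every write-target type, and the inlined code takes the form of withColumn calls. `columns_for` and `render_with_columns` are hypothetical helpers.

```python
# Column definitions mirroring the lhp.yaml block above.
COLUMNS = {
    "ingest_timestamp": {
        "expression": "F.current_timestamp()",
        "applies_to": ["streaming_table", "materialized_view"],
    },
    "source_file": {
        "expression": "F.input_file_name()",
        "applies_to": ["streaming_table"],
    },
    "pipeline_run_id": {
        "expression": "F.lit(spark.conf.get('pipelines.id'))",
        # No applies_to: ASSUMED here to mean "all write-target types".
    },
}


def columns_for(target_type, columns):
    """Select the column expressions that apply to one write-target type."""
    selected = {}
    for name, spec in columns.items():
        applies_to = spec.get("applies_to")
        if applies_to is None or target_type in applies_to:
            selected[name] = spec["expression"]
    return selected


def render_with_columns(df_name, selected):
    """Render one withColumn line per column (an assumed shape for the generated code)."""
    return "\n".join(
        f'{df_name} = {df_name}.withColumn("{name}", {expr})'
        for name, expr in selected.items()
    )


print(render_with_columns("df", columns_for("materialized_view", COLUMNS)))
```

With these definitions, a materialized view receives ingest_timestamp and pipeline_run_id but not source_file, which is the point of the restriction.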

Layer presets bundle the metadata columns appropriate to each layer. A bronze_standard preset declares ingest_timestamp, source_file, and pipeline_run_id. A silver_standard preset declares updated_at (the silver-layer equivalent). Operational metadata is additive across preset, FlowGroup, and action levels — LHP deep-merges with deduplication, so a FlowGroup can add a column without losing the preset’s defaults.
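The additive behaviour can be pictured as an ordered union of column names across the three levels. The sketch below is a simplification, assuming each level contributes a flat list of column names; LHP's actual merge may carry more structure. `merge_metadata_columns` and the `batch_label` column are hypothetical.

```python
def merge_metadata_columns(*levels):
    """Union column names across levels, keeping first-seen order and dropping duplicates."""
    merged = []
    seen = set()
    for level in levels:
        for column in level or []:  # a level may be absent (None)
            if column not in seen:
                seen.add(column)
                merged.append(column)
    return merged


preset = ["ingest_timestamp", "source_file", "pipeline_run_id"]  # bronze_standard
flowgroup = ["source_file", "batch_label"]  # hypothetical FlowGroup-level addition
action = None                               # nothing declared at the action level

merged = merge_metadata_columns(preset, flowgroup, action)
print(merged)
# ['ingest_timestamp', 'source_file', 'pipeline_run_id', 'batch_label']
```

The FlowGroup adds batch_label without restating the preset's columns, and the repeated source_file appears only once.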

The payoff is post-hoc debugging. When a downstream report shows suspicious numbers, the operational metadata columns let you trace exactly which pipeline run wrote which rows, when, and from which source file. Without them, you reverse-engineer that information from DLT logs and timestamps, which is much slower.

Lineage from generated code

Unity Catalog’s automatic lineage tracker watches Spark execution and records read/write relationships between tables. LHP’s generated Python uses standard Spark and DLT decorators, so the lineage tracker sees every read and write without LHP doing anything special.

The implication is that LHP’s job is to generate code that names tables consistently — the same ${SILVER_CATALOG}.orders.customers_silver across pipelines — so the lineage graph in UC connects the right nodes. Inconsistent table naming across pipelines (one FlowGroup writes to main.orders.customers, another reads from catalog.orders.customers) produces a lineage graph with disconnected components.

The substitution layer is the enforcement mechanism. A project-wide ${SILVER_CATALOG} token means every FlowGroup that reads or writes silver tables uses the same catalog name. Hardcoded catalog names are the predictable source of broken lineage.
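A hardcoded catalog is easy to detect mechanically. The following sketch is a hypothetical check, not part of LHP: it assumes the database strings have already been collected per FlowGroup, and it flags any whose catalog part (the text before the first dot) is not a `${TOKEN}` placeholder.

```python
import re

# A catalog part counts as tokenised only if it is exactly one ${TOKEN} placeholder.
TOKEN_ONLY = re.compile(r"^\$\{\w+\}$")


def hardcoded_catalogs(flowgroups):
    """Return (flowgroup, database) pairs whose catalog is not a substitution token."""
    findings = []
    for name, databases in flowgroups.items():
        for database in databases:
            catalog = database.split(".", 1)[0]
            if not TOKEN_ONLY.match(catalog):
                findings.append((name, database))
    return findings


# Hypothetical FlowGroups mapped to the database strings they read or write.
flowgroups = {
    "customers_silver": ["${SILVER_CATALOG}.orders"],
    "customers_report": ["main.orders", "${GOLD_CATALOG}.reporting"],
}

findings = hardcoded_catalogs(flowgroups)
for flowgroup, database in findings:
    print(f"{flowgroup}: hardcoded catalog in '{database}'")
```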

Bundle integration as deployment governance

The Databricks Asset Bundle layer turns FlowGroups into deployable units. LHP scaffolds the bundle structure by default with lhp init (opt out with --no-bundle if you manage bundles separately). The generated structure includes databricks.yml, per-environment target configuration, and resource files (*.pipeline.yml, *.job.yml) that point at the generated Python.

lhp deps --format job produces job resource definitions from the dependency analysis. The graph of FlowGroups and their cross-pipeline dependencies determines the task graph in the generated job. This gives you an orchestrated multi-pipeline deployment from the same declarative source, with the dependency edges derived from the FlowGroup definitions rather than maintained by hand.
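To make the idea concrete, the sketch below turns a dependency mapping into an ordered task list using Python's standard graphlib module. It is not the output of lhp deps: the pipeline names and the `job_tasks` helper are hypothetical, and the task_key / depends_on shape is borrowed from Databricks job task definitions purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines, each mapped to the set of pipelines it reads from.
dependencies = {
    "bronze_ingest": set(),
    "silver_orders": {"bronze_ingest"},
    "silver_customers": {"bronze_ingest"},
    "gold_reporting": {"silver_orders", "silver_customers"},
}


def job_tasks(dependencies):
    """Emit one task per pipeline, upstream first, with explicit depends_on edges."""
    order = TopologicalSorter(dependencies).static_order()
    return [
        {
            "task_key": name,
            "depends_on": [{"task_key": dep} for dep in sorted(dependencies[name])],
        }
        for name in order
    ]


tasks = job_tasks(dependencies)
for task in tasks:
    upstream = [d["task_key"] for d in task["depends_on"]]
    print(task["task_key"], "<-", upstream)
```

TopologicalSorter raises CycleError on a circular dependency, so a cycle between pipelines surfaces when the job is generated rather than at run time.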

Keeping generated bundle resources in a dedicated directory (bundle/generated/ or similar) is operational hygiene. Generated and hand-written resources should not share a directory because regeneration overwrites everything in its output path. With a clean separation, you can regenerate bundle resources freely without losing manual additions.

Per-environment bundle targets pair with the substitution-file pattern. The bundle’s targets.dev and targets.prod entries control workspace, permissions, and tags. The substitution file controls the catalog and schema names the generated code writes to. Together they specify both where the pipeline deploys and what data it touches. Keep the two layers separate; do not try to encode catalog overrides in the bundle target file.

DQE tiering as a governance lever

The data-quality expectation (DQE) tier per layer is itself a governance decision. Bronze warn-only is permissive on purpose: raw data is precious, even bad raw data. Silver drop is where you commit to a contract: bad rows do not propagate, but you keep them in a quarantine table for review. Gold fail is the strict tier reserved for business-critical invariants (a primary-key uniqueness violation, a referential-integrity failure on a reference table).

Each tier corresponds to a different governance posture. Bronze says “we will keep everything and triage later”. Silver says “downstream consumers can trust this layer’s schema”. Gold says “if this is wrong, stop the pipeline”. The decision belongs to the data steward of each domain, not the pipeline author — which is why DQE files live in expectations/ separately from FlowGroup YAML and benefit from CODEOWNERS scoping.
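The three postures can be sketched over plain Python rows. This illustrates the semantics only, not how LHP or the pipeline runtime enforces expectations: `apply_tier`, `ExpectationFailed`, and the sample rows are hypothetical.

```python
class ExpectationFailed(Exception):
    """Raised when a 'fail' tier expectation is violated."""


def apply_tier(rows, is_valid, tier):
    """Return (kept_rows, quarantined_rows) for one expectation at the given tier."""
    bad = [row for row in rows if not is_valid(row)]
    if tier == "warn":
        # Bronze posture: keep everything, triage later.
        return list(rows), []
    if tier == "drop":
        # Silver posture: bad rows do not propagate but are kept for review.
        return [row for row in rows if is_valid(row)], bad
    if tier == "fail":
        # Gold posture: any violation stops the pipeline.
        if bad:
            raise ExpectationFailed(f"{len(bad)} row(s) violated the expectation")
        return list(rows), []
    raise ValueError(f"unknown tier: {tier}")


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]


def is_valid(row):
    return row["amount"] >= 0


bronze_kept, _ = apply_tier(rows, is_valid, "warn")
silver_kept, quarantined = apply_tier(rows, is_valid, "drop")
try:
    apply_tier(rows, is_valid, "fail")
    gold_stopped = False
except ExpectationFailed:
    gold_stopped = True

print(len(bronze_kept), len(silver_kept), len(quarantined), gold_stopped)
```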

Test reporting for audit

LHP’s test actions generate validation views. Their results can be published to Azure DevOps test reports or a Delta audit table — turning pipeline tests into a governance signal that surfaces in the same dashboards as application tests. See Test Result Reporting (Publishing) for the publishing mechanics.

Anti-patterns

Not using operational metadata. Debugging production data incidents without ingest timestamps and source file paths is much harder than with them. The cost of adding the metadata is one preset declaration; the cost of debugging without it is hours per incident.

Blank or trivial table comments. A UC catalog where every comment is the table name repeated is no better than no catalog. Meaningful descriptions pay for themselves the first time a downstream consumer finds the right table without asking.

Hardcoded catalog names. Beyond breaking environment promotion, they fragment UC lineage. Every catalog name should come from a substitution token.

Mixing generated and hand-written bundle resources in one directory. Regeneration overwrites everything in the output path. Keep them in separate directories.