Architecture
Lakehouse Plumber (LHP) is a code generator. You write declarative YAML; LHP produces Databricks Lakeflow Declarative Pipelines Python code. This page explains the model behind that workflow: which objects you compose, how those objects relate, which substitutions resolve and in what order, and how the generation engine turns YAML into Python.
If you want to do something — build a pipeline, configure CI/CD, troubleshoot a failure — see Overview. If you are choosing between two ways of doing something (preset versus template, streaming versus batch), see Decisions.
The composition model
LHP has three nested objects: Pipeline, FlowGroup, Action. They are
authored as YAML and validated by Pydantic models in src/lhp/models/config.py.
A Pipeline is a logical grouping label. Every FlowGroup declares pipeline:
<name>, and all FlowGroups sharing that name produce Python files into the same
output folder. A Pipeline is also the deployment unit: when you build a Databricks Asset
Bundle, one Lakeflow Declarative Pipeline resource is generated per Pipeline name.
A FlowGroup is a logical slice of a Pipeline, often a single source table or
business entity. A FlowGroup file declares its Pipeline, its own name, an optional
job_name for multi-job orchestration, optional local variables, and an ordered
list of Actions. One YAML file can hold one or many FlowGroups (see
Multi-Flowgroup YAML Files).
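The grouping rule above can be sketched with plain dataclasses. This is a simplified illustration, not the Pydantic models from src/lhp/models/config.py: the class fields shown here (including the `flowgroup` name field) and the `group_by_pipeline` helper are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    type: str  # one of the four top-level action types


@dataclass
class FlowGroup:
    pipeline: str   # the Pipeline this FlowGroup declares
    flowgroup: str  # the FlowGroup's own name (field name assumed)
    actions: list[Action] = field(default_factory=list)


def group_by_pipeline(flowgroups: list[FlowGroup]) -> dict[str, list[str]]:
    """Map each Pipeline name to the FlowGroups that declare it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for fg in flowgroups:
        grouped[fg.pipeline].append(fg.flowgroup)
    return dict(grouped)


flowgroups = [
    FlowGroup("bronze", "customers", [Action("load_customers", "load")]),
    FlowGroup("bronze", "orders", [Action("load_orders", "load")]),
    FlowGroup("silver", "customers_clean"),
]
print(group_by_pipeline(flowgroups))
```

Each key of the result corresponds to one output folder and, in a bundle, one pipeline resource.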
An Action is a single step inside a FlowGroup. Actions have four top-level types,
declared by the type: field. Each type has a sub-type that selects the generator
backing it. The full catalogue lives in Actions Reference; the canonical list of
enum values comes from ActionType and the four sub-type enums in
models/config.py.
| Type | Sub-types | Purpose |
|---|---|---|
| load | | Read external data into a temporary view. One per data source. |
| transform | | Reshape or check data already loaded into a view. Zero or many per FlowGroup. |
| write | | Persist the final dataset. One per output table or sink. |
| test | | Assert a property of the data. Run only with … |
The model is a directed acyclic graph. A typical FlowGroup is load → zero or more
transforms → write, with optional tests attached. Actions inside a FlowGroup do not
execute in YAML order — Lakeflow’s declarative engine schedules them at runtime based
on the source/target view references. LHP enforces the DAG at generation time
so cycles fail before deployment.
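A minimal sketch of how ordering and cycle detection can be derived from view references, using the standard library's `graphlib`. The dictionary keys (`name`, `sources`, `target`) and the `order_actions` helper are assumptions for illustration; LHP's own validator may work differently.

```python
from graphlib import CycleError, TopologicalSorter


def order_actions(actions: list[dict]) -> list[str]:
    """Return action names in dependency order; raise ValueError on a cycle."""
    # Which action produces each view?
    producer = {a["target"]: a["name"] for a in actions}
    # Each action depends on the producers of the views it reads.
    graph = {
        a["name"]: {producer[s] for s in a.get("sources", []) if s in producer}
        for a in actions
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"cycle between actions: {exc.args[1]}") from exc


# Declared out of order on purpose: the references decide the order, not the YAML.
actions = [
    {"name": "write_orders", "sources": ["v_clean"], "target": "orders"},
    {"name": "clean_orders", "sources": ["v_raw"], "target": "v_clean"},
    {"name": "load_orders", "target": "v_raw"},
]
print(order_actions(actions))
```

Two actions that read each other's target view would raise `ValueError` here, which mirrors the idea of failing at generation time rather than after deployment.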
Reuse primitives
Three primitives let you factor reuse out of FlowGroups. They layer rather than substitute for each other; the decision matrix lives in Decisions.
A Preset is a YAML file of default values that are deep-merged into the actions it matches by type. Presets are applied before environment substitution and before validation, so the defaults reach every matching action; where a preset and a FlowGroup both set a value, the explicit FlowGroup config wins. See Presets Reference for the resolution rules and merging semantics.
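The "explicit config wins" rule is the usual shape of a recursive dictionary merge. The sketch below is an assumption about the general technique, not LHP's merge code; the keys are made up, and how LHP treats lists is left to the Presets Reference.

```python
from copy import deepcopy


def deep_merge(defaults: dict, explicit: dict) -> dict:
    """Merge explicit values over defaults, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in explicit.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the explicit config replace the default.
            merged[key] = deepcopy(value)
    return merged


preset_defaults = {"mode": "stream", "options": {"format": "json", "header": "true"}}
action_config = {"options": {"format": "csv"}, "path": "/landing/orders"}

print(deep_merge(preset_defaults, action_config))
```

The nested `options` dictionary keeps `header` from the preset while `format` takes the action's value, which is the difference between a deep merge and a shallow `dict.update`.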
A Template is a YAML file of parametrised actions. Where a Preset injects
values, a Template injects whole action blocks. A FlowGroup applies a template via
use_template: and supplies template_parameters:; LHP renders the template with
Jinja2 {{ }} placeholders and appends the rendered actions to the FlowGroup. See
Templates Reference.
A Blueprint is a higher-order template that instantiates multiple FlowGroups at
once. Where templates parameterise actions inside one FlowGroup, blueprints
parameterise the FlowGroups themselves. Blueprint instances supply parameters via
use_blueprint: + parameters: (legacy blueprint: + flat keys is deprecated;
see BlueprintInstance in models/config.py). See Blueprints for full
semantics.
The substitution layer cake
LHP resolves four substitution syntaxes in a fixed order. The order matters because each layer may emit text that the next layer then sees and resolves.
| Phase | Syntax | Source | Resolved by |
|---|---|---|---|
| 1 | | FlowGroup | |
| 2 | {{ }} | Template | Jinja2 in the template engine |
| 3 | ${token} | substitutions/<env>.yaml | |
| 4 | | Databricks secret scopes (resolved at runtime by …) | |
The order is enforced inside FlowgroupProcessor.process_flowgroup (steps 0.5, 1,
3, 5 in core/services/flowgroup_processor.py). A consequence: an env token can
expand to a string that contains a secret reference, but not the other way around. A
template parameter can expand to a string that contains an env token, but a local
variable cannot reference a template parameter that has not yet rendered.
The ${token} form is canonical. The bare {token} form is deprecated —
treat any documentation or example that still uses it as legacy. For full syntax,
including file substitutions and Databricks Connect compatibility shims, see
Substitutions & Secrets.
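To see why the order matters, here is a small sketch covering only two of the layers: template parameters and env tokens. It uses regular expressions as a stand-in for Jinja2 and for LHP's substitution manager, so the function names and matching rules are assumptions for illustration only.

```python
import re

TEMPLATE_PARAM = re.compile(r"\{\{\s*(\w+)\s*\}\}")  # {{ name }}
ENV_TOKEN = re.compile(r"\$\{(\w+)\}")               # ${name}


def render_template_params(text: str, params: dict) -> str:
    return TEMPLATE_PARAM.sub(lambda m: str(params[m.group(1)]), text)


def resolve_env_tokens(text: str, tokens: dict) -> str:
    return ENV_TOKEN.sub(lambda m: str(tokens[m.group(1)]), text)


params = {"table": "${catalog}.bronze.orders"}  # a parameter that contains an env token
tokens = {"catalog": "dev_catalog"}

# Documented order: template parameters first, env tokens second.
in_order = resolve_env_tokens(render_template_params("{{ table }}", params), tokens)

# Reversed order: the env pass runs before the token exists in the text.
reversed_order = render_template_params(resolve_env_tokens("{{ table }}", tokens), params)

print(in_order)        # dev_catalog.bronze.orders
print(reversed_order)  # ${catalog}.bronze.orders
```

In the reversed run the env token only appears after its resolution pass has finished, so it is left unresolved. The same reasoning applies between any two adjacent layers.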
The generation workflow
lhp generate --env <env> runs three phases over every FlowGroup it discovers. The
phase names map to services in src/lhp/core/services/.
graph TD
    subgraph Discovery
        A[Scan pipelines/] --> B[Apply include patterns]
        B --> C[Parse YAML to FlowGroup models]
        C --> C2[Discover and expand Blueprints]
    end
    subgraph Resolution
        C2 --> D[Resolve local variables]
        D --> E[Expand templates]
        E --> F[Apply preset defaults]
        F --> G[Apply env substitutions]
        G --> H[Validate FlowGroup]
        H --> I[Validate secret references]
    end
    subgraph Code generation
        I --> M[Run action generators]
        M --> O[Inject secret calls]
        O --> P[Write Python file]
    end
Discovery is driven by FlowgroupDiscoverer and the project-level include
patterns from lhp.yaml. Blueprint instances are expanded into concrete
FlowGroups by BlueprintExpander before the resolution phase sees them.
Resolution runs each FlowGroup through FlowgroupProcessor, applying the
substitution layer cake described above, deep-merging preset defaults, and running
Pydantic validation. Unresolved tokens raise LHP-CFG-010 with the unresolved
names listed.
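A check for leftover tokens can be sketched as a recursive walk over the resolved configuration. The exception class, message format, and helper names below are hypothetical; only the error code LHP-CFG-010 and the idea of listing the unresolved names come from the text above.

```python
import re

ENV_TOKEN = re.compile(r"\$\{(\w+)\}")


class UnresolvedTokenError(ValueError):
    """Hypothetical error type standing in for LHP-CFG-010."""


def find_unresolved(node) -> set[str]:
    """Collect the names of every ${token} left anywhere in a nested config."""
    if isinstance(node, str):
        return set(ENV_TOKEN.findall(node))
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        found: set[str] = set()
        for item in node:
            found |= find_unresolved(item)
        return found
    return set()  # numbers, booleans, None


def check_resolved(config: dict) -> None:
    names = sorted(find_unresolved(config))
    if names:
        raise UnresolvedTokenError(
            f"LHP-CFG-010: unresolved tokens: {', '.join(names)}"
        )


config = {
    "target": "${catalog}.silver.orders",
    "options": [{"path": "/mnt/${landing}/orders"}],
    "retries": 3,
}
print(sorted(find_unresolved(config)))
```

Reporting every unresolved name at once, rather than stopping at the first, lets a missing substitutions entry be fixed in a single pass.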
Code generation dispatches each action to a generator looked up in
ActionRegistry (one of 7 load, 5 transform, 3 write, or 9 test generators). The
generators emit Jinja2-rendered Python; CodeGenerator injects
dbutils.secrets.get calls last, so the generated source contains runtime lookups rather than secret values.
The final Python is written via SmartFileWriter, which only touches disk when
content actually changes. lhp deps exposes the cross-pipeline dependency DAG for
inspection (see Dependency Analysis & Job Generation).
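The "only touches disk when content changes" behaviour can be approximated in a few lines. This is a sketch of the general idea with a hypothetical `write_if_changed` helper, not the SmartFileWriter implementation, which may compare content differently.

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path only if it differs; return True when a write happened."""
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "generated" / "bronze" / "orders.py"
    first = write_if_changed(target, "print('v1')\n")   # file is created
    second = write_if_changed(target, "print('v1')\n")  # identical content, skipped
    third = write_if_changed(target, "print('v2')\n")   # content changed, rewritten
    print(first, second, third)
```

Skipping identical writes keeps file modification times stable, so version control and bundle deployments only see files whose generated code actually changed.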
Multi-environment generation
LHP is environment-aware. The same FlowGroup YAML generates different Python files
per environment because ${env_token} resolves against substitutions/<env>.yaml.
substitutions/lhp.yaml provides shared defaults that environment files override.
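The override rule can be illustrated with plain dictionaries standing in for the parsed substitution files. The token names, values, and helper functions are assumptions for the example; the real files are YAML and may contain more than flat key-value pairs.

```python
import re

ENV_TOKEN = re.compile(r"\$\{(\w+)\}")


def tokens_for(env_values: dict, shared_defaults: dict) -> dict:
    """Environment-specific values override the shared defaults."""
    return {**shared_defaults, **env_values}


def render(text: str, tokens: dict) -> str:
    return ENV_TOKEN.sub(lambda m: str(tokens[m.group(1)]), text)


shared = {"catalog": "main", "schema": "bronze"}  # stands in for the shared defaults file
envs = {
    "dev": {"catalog": "dev_catalog"},    # stands in for substitutions/dev.yaml
    "prod": {"catalog": "prod_catalog"},  # stands in for substitutions/prod.yaml
}

for env, values in envs.items():
    print(env, render("${catalog}.${schema}.orders", tokens_for(values, shared)))
# dev dev_catalog.bronze.orders
# prod prod_catalog.bronze.orders
```

The same source string yields a different table name per environment, while `schema` falls back to the shared default in both.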
Multi-job orchestration
A FlowGroup can declare job_name: <name> to opt into the multi-job model. When
any FlowGroup sets job_name, every FlowGroup must — the all-or-nothing rule
prevents partial orchestration. LHP then generates one Databricks job per unique
job_name plus a master orchestration job that chains them, and writes per-job
configuration from multi-document job_config.yaml. The orchestration generator
lives in core/services/job_generator.py; full configuration semantics are in
Bundle Configuration Reference.
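The all-or-nothing rule and the grouping by job_name can be sketched as follows. The dictionary shape and the `group_by_job` helper are hypothetical, and the master orchestration job is left out; the actual logic lives in core/services/job_generator.py.

```python
from collections import defaultdict


def group_by_job(flowgroups: list[dict]) -> dict[str, list[str]]:
    """Group FlowGroup names by job_name, enforcing the all-or-nothing rule."""
    missing = [fg["flowgroup"] for fg in flowgroups if not fg.get("job_name")]
    if len(missing) == len(flowgroups):
        return {}  # nobody opted in, so the multi-job model is not used
    if missing:
        raise ValueError(
            f"job_name is set on some FlowGroups but missing on: {', '.join(missing)}"
        )
    jobs: dict[str, list[str]] = defaultdict(list)
    for fg in flowgroups:
        jobs[fg["job_name"]].append(fg["flowgroup"])
    return dict(jobs)


flowgroups = [
    {"flowgroup": "customers", "job_name": "ingest"},
    {"flowgroup": "orders", "job_name": "ingest"},
    {"flowgroup": "daily_sales", "job_name": "reporting"},
]
print(group_by_job(flowgroups))
```

Each key in the result would become one generated job. Removing `job_name` from any single entry makes the call raise instead of silently leaving that FlowGroup outside every job.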
How this maps to the codebase
The objects above correspond directly to Python modules:
- src/lhp/models/config.py: Pydantic models for every YAML object.
- src/lhp/parsers/yaml_parser.py: YAML → model conversion.
- src/lhp/core/services/flowgroup_processor.py: substitution + preset + template pipeline.
- src/lhp/core/action_registry.py: action-type → generator mapping.
- src/lhp/generators/{load,transform,write,test}/: one generator per sub-type.
- src/lhp/core/orchestrator.py: top-level coordination across phases.
- src/lhp/utils/smart_file_writer.py: content-aware file writes.
The model is intentionally narrow. The four action types and their sub-type lists are
closed enums, not extension points: adding a new sub-type means adding a generator,
registering it in ActionRegistry, and adding an enum value. Reuse is
expected to happen through Presets, Templates, and Blueprints, not through new
action types.
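A closed enum paired with a registry looks roughly like the sketch below. The enum members, the `register` decorator, and the `dispatch` function are invented for illustration and do not reflect the real ActionRegistry API or LHP's actual sub-type names.

```python
from enum import Enum
from typing import Callable


class LoadSubType(Enum):
    # Hypothetical members; the real lists live in models/config.py.
    FILES = "files"
    TABLE = "table"


Generator = Callable[[dict], str]
REGISTRY: dict[LoadSubType, Generator] = {}


def register(sub_type: LoadSubType):
    """Decorator that records a generator function for one sub-type."""
    def decorator(func: Generator) -> Generator:
        REGISTRY[sub_type] = func
        return func
    return decorator


@register(LoadSubType.FILES)
def generate_files_load(action: dict) -> str:
    return f"# load files from {action['path']}"


def dispatch(sub_type_value: str, action: dict) -> str:
    sub_type = LoadSubType(sub_type_value)  # ValueError for a value outside the enum
    return REGISTRY[sub_type](action)       # KeyError if no generator is registered


print(dispatch("files", {"path": "/landing/orders"}))
```

The three steps named above show up as three separate touch points: a new enum member, a new generator function, and its registration. A value outside the enum fails immediately instead of reaching a generator.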
See also
Decisions for when to use which primitive, Actions Reference for the full action catalogue, and Substitutions & Secrets for the complete substitution syntax.