Project Structure

For the step-by-step version of laying out a project, see Quickstart. This page explains why the layout looks the way it does and where the rough edges are.

Why directory layout outlives the project

LHP discovers files by walking specific directories. The discovery rules are not symmetric — some directories support recursion and others do not — and the difference shapes every other organisational choice.

pipelines/, sql/, schemas/, expectations/, and python_modules/ are discovered recursively. You can nest as deep as you want; LHP finds files by glob pattern at any depth. presets/, templates/, and substitutions/ are flat: discovery uses glob("*.yaml"), not rglob, so subdirectories under these are silently ignored.

The asymmetry exists because pipelines, SQL, schemas, expectations, and Python modules belong to a specific data domain or layer — putting them under a subdirectory communicates that scope. Presets, templates, and substitutions are project-wide reusable assets; nesting them would suggest scope they do not actually have.

This single fact drives the two organisational rules that matter most:

  • For directories that support subdirectories, use them. Mirror the domain/layer hierarchy across pipelines/, sql/, schemas/, and expectations/.

  • For directories that do not, use a prefix in the filename to encode the scope a folder would have given you.

Domain-first, not action-first

Group pipelines/ by data domain, not by action type. pipelines/orders/ beats pipelines/loads/ for the same reason most code is grouped by feature rather than by language construct: people work on a domain at a time. A CODEOWNERS rule that says “Team A owns pipelines/orders/” maps to reality. A CODEOWNERS rule that says “Team A owns all load actions across the project” does not.

Within each domain, group by medallion layer:

Pipeline directory shape
pipelines/
  erp/
    bronze/
    silver/
    gold/
  crm/
    bronze/
    silver/
  shared/
    gold/                              # cross-system aggregates

The same layout in sql/, schemas/, and expectations/ lets a reviewer who is reading pipelines/erp/silver/orders_enriched.yaml find the referenced SQL at sql/erp/silver/enrich_orders.sql without thinking. Mirrored structures eliminate one cognitive switch.

Single-purpose YAML files

Aim for 50-200 lines per YAML file. The number is not magic — it is the size at which one PR can sensibly review one file. Monolithic files with fifteen or more FlowGroups become unreadable, and lhp validate errors become harder to triage because the line number alone does not tell you which FlowGroup is failing.

LHP supports multi-document (---) and array (flowgroups:) syntax for cases where several FlowGroups share a pipeline. Use that syntax when the FlowGroups genuinely belong together — they share pipeline, presets, operational_metadata, or job_name. Do not use it as a workaround to keep the file count low; the FlowGroup boundary exists for a reason.

See Multi-Flowgroup YAML Files for the array-syntax mechanics.

The flat-directory problem

Templates and presets cannot live in subdirectories. At three templates, this is not a problem. At thirty, a flat alphabetical list becomes the discoverability bottleneck. The fix is a prefix convention that encodes layer and action type in the filename:

Template prefix pattern
templates/
  TMPL001_brz_load_cloudfiles_standard.yaml
  TMPL002_brz_load_kafka_events.yaml
  TMPL004_slv_transform_sql_enrichment.yaml
  TMPL006_slv_write_st_with_dqe.yaml
  TMPL007_gld_write_mv_aggregation.yaml

The TMPLxxx_ numeric prefix sorts templates by creation order in lhp list_templates output. The layer prefix (brz_, slv_, gld_) groups templates that belong together in the same alphabetical neighbourhood. The descriptive suffix tells you what the template does without opening it.

The same idea applies to presets, where the prefix encodes scope:

presets/
  global_defaults.yaml
  brz_standard.yaml
  brz_cloudfiles_json.yaml
  slv_cdc_scd2.yaml
  gld_standard.yaml

This is convention, not enforcement. LHP does not parse the prefix. But the convention pays for itself when someone joins the project and has to find the right template in under a minute.

Why the variant suffix matters in FlowGroup files

A FlowGroup file named erp_bronze_ingest_TMPL001.yaml carries two useful pieces of information in its name: the FlowGroup’s data domain and layer, and the template it uses. When you see a generated file erp_brz_raw_orders.py in a PR diff, you can find the source YAML without grepping. When you see TMPL001 in a filename, you know which template changes would cascade through.

The mirror rule applies to identifiers everywhere. snake_case for pipelines, FlowGroups, actions, templates, presets, and variables — because action names become Python function names, and they have to be valid identifiers. ${SCREAMING_SNAKE_CASE} for environment tokens because they are constants resolved at generation time. %{lower_snake_case} for local variables because they are FlowGroup-scoped. {{ snake_case }} for template parameters because Jinja2 renders them. The case alone tells you which substitution layer applies.

Anti-patterns

The following organisational choices look reasonable on day one and break on day ninety. Each is followed by the recommended alternative.

Generic names without system/layer context. pipeline_1, ingest.yaml, transform.sql mean nothing at five hundred FlowGroups. A name has to survive being read out of context — in a log line, a Git blame, a Databricks UI list. erp_brz_raw_orders does; pipeline_v2 does not.

Subdirectories under ``templates/`` or ``presets/``. They are not discovered. The user files a bug about a template not being found, you investigate, and the answer is that the discovery glob is glob("*.yaml") not rglob. Use prefix-based naming.

Dumping all SQL files in a flat ``sql/`` directory. At one hundred SQL files, finding enrich_orders.sql becomes painful. The recursive discovery means there is no reason not to mirror sql/<system>/<layer>/<description>.sql.

Monolithic YAML files. Fifteen FlowGroups in one file is unreadable and unreviewable. One reviewable unit per file; multi-document syntax only when FlowGroups genuinely share inheritable fields.

Hand-written files under resources/lhp/. The directory is wholly LHP-managed: every lhp generate wipes it and regenerates one <pipeline_name>.pipeline.yml per pipeline. Custom resource YAML files (hand-written jobs, dashboards, secret scopes) belong under resources/ at the top level or in any subdirectory other than resources/lhp/. Files outside resources/lhp/ are never touched by LHP, with one exception: the monitoring job YAML at resources/<name>.job.yml, which LHP identifies by its sentinel header (# Generated by LakehousePlumber - Monitoring Job) and replaces on each run.

Cross-system, multi-layer god-blueprints. A blueprint that spans several systems and several layers couples release cycles that should be independent. The default is one blueprint per (system, layer) pair. The escape hatch — a multi-layer blueprint with its instance files in pipelines/<system>/instances/ — exists for the rare case where the shape really is cross-layer, and signals that the blast radius is wide.

CODEOWNERS as the structural enforcement layer

The directory layout pays off most when paired with a CODEOWNERS file at the repo root. CODEOWNERS is a Git platform feature (GitHub, GitLab, Azure DevOps) that names required reviewers for PRs touching specific paths. The directory shape determines what CODEOWNERS rules you can write.

The natural mapping puts the platform team on shared assets and the domain teams on their pipelines:

Example CODEOWNERS
/presets/                @platform-team
/substitutions/          @platform-team
/templates/              @platform-team
/pipelines/erp/          @erp-team
/pipelines/crm/          @crm-team

A change to a preset — say, defaulting all bronze tables to a different schema evolution mode — affects every pipeline that uses it. Without CODEOWNERS, that PR can merge with no input from someone who understands the blast radius. With CODEOWNERS, the platform team is on the review automatically.

The reason this matters more than the typical CODEOWNERS use case is that LHP intentionally moves shared behaviour into shared assets. The whole point of presets and templates is leverage. The same leverage makes mistakes scale.

See also