Quickstart
Build your first Lakehouse Plumber (LHP) pipeline in about ten minutes. This
walk-through uses the samples.tpch.customer_sample table from the samples catalog that
most Databricks workspaces provide, so you do not need to upload data first. By the end,
you will have a generated Lakeflow Declarative Pipelines Python file that you can run in Databricks.
Prerequisites
Python 3.11 or later (3.12 recommended).
A Databricks workspace with Unity Catalog enabled.
Read access to the samples catalog. Verify with:
SELECT 1 FROM samples.tpch.customer_sample LIMIT 1;
If the samples catalog is unavailable in your tenant (some sovereign-cloud workspaces, for example GovCloud and China, do not provision it), follow Ingest with Auto Loader instead, which walks through Auto Loader against your own landing volume.
A catalog and schema you can write to. Note the names, because the substitutions file in Step 3 needs them.
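To confirm the first prerequisite before installing anything, a short interpreter check is enough. The sketch below is not part of LHP; `meets_minimum` and `MIN_VERSION` are hypothetical names chosen for illustration, and the 3.11 floor is taken from the list above.

```python
import sys

MIN_VERSION = (3, 11)  # floor stated in the prerequisites (assumed unchanged)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old for LHP"
    print(f"Python {current}: {status}")
```

Run it with the same interpreter you plan to use for the virtual environment in Step 1.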
Step 1 — Install LHP
Create a virtual environment and install the CLI:
python -m venv .venv
source .venv/bin/activate
pip install lakehouse-plumber
Confirm the install:
lhp --version
Step 2 — Scaffold a project
Run lhp init from an empty directory:
mkdir my_first_pipeline && cd my_first_pipeline
lhp init my_first_pipeline
The default lhp init enables Databricks Asset Bundle (DAB) integration. To
opt out, pass --no-bundle.
The command creates these directories:
pipelines/: your FlowGroup YAML files.
substitutions/: per-environment token files (dev, tst, prod).
templates/: reusable FlowGroup templates (Jinja2).
presets/: Action defaults applied by type.
schemas/ and expectations/: schema files and data-quality rules.
generated/: destination for generated Python code.
resources/lhp/: DAB resource definitions (bundle mode only).
It also creates lhp.yaml (the project file), databricks.yml (DAB
manifest, bundle mode), and a .vscode/ directory wired for YAML
IntelliSense.
You see this success message:
Initialized Databricks Asset Bundle project: my_first_pipeline
Created directories: presets, templates, pipelines, substitutions, schemas,
expectations, generated, config, resources
Created example files: presets/bronze_layer.yaml,
templates/standard_ingestion.yaml, databricks.yml
The example presets/ and templates/ files ship as .tmpl to prevent
them from running until you adopt them. Leave them in place for now.
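To confirm the scaffold matches the layout described in this step, a small script can report any expected directory that is missing. This is a minimal sketch and not an LHP feature: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the directory list is copied from the description above, so adjust it if your LHP version scaffolds a different layout.

```python
from pathlib import Path

# Directories named in this step (assumption: the list matches your LHP version).
EXPECTED_DIRS = [
    "pipelines", "substitutions", "templates", "presets",
    "schemas", "expectations", "generated",
]


def missing_dirs(project_root, bundle=True):
    """Return the expected directories that do not exist under project_root."""
    root = Path(project_root)
    # resources/lhp is described as bundle mode only.
    expected = EXPECTED_DIRS + (["resources/lhp"] if bundle else [])
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    print(missing_dirs("."))
```

Pass `bundle=False` if you initialized the project with --no-bundle.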
Step 3 — Configure substitutions/dev.yaml
The init step creates substitutions/dev.yaml.tmpl with placeholders. Rename
it and replace the catalog and schema values:
mv substitutions/dev.yaml.tmpl substitutions/dev.yaml
Open substitutions/dev.yaml and set catalog and bronze_schema to the
catalog and schema you can write to:
dev:
  env: dev
  catalog: my_catalog
  bronze_schema: my_bronze_schema
  silver_schema: my_silver_schema
  gold_schema: my_gold_schema
  landing_path: /Volumes/my_catalog/my_bronze_schema/landing
LHP substitutes ${catalog} and ${bronze_schema} into your FlowGroup at
generate time. The substitution syntax in YAML and SQL is ${token}.
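The effect of `${token}` substitution can be illustrated with a few lines of Python. This is only a sketch of the idea, assuming plain string replacement; it is not LHP's implementation, and `resolve_tokens` is a hypothetical helper. The values mirror the dev.yaml example above.

```python
import re

TOKEN = re.compile(r"\$\{(\w+)\}")  # matches ${token}


def resolve_tokens(text, values):
    """Replace each ${token} in text with values[token]; fail on unknown tokens."""
    def lookup(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no substitution defined for ${{{name}}}")
        return str(values[name])
    return TOKEN.sub(lookup, text)


# Values taken from the example substitutions/dev.yaml.
dev = {"catalog": "my_catalog", "bronze_schema": "my_bronze_schema"}
print(resolve_tokens("${catalog}.${bronze_schema}", dev))
# my_catalog.my_bronze_schema
```

Raising on an unknown token is a design choice of this sketch: it surfaces a misspelled token early instead of leaving a literal `${...}` in the output.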
Step 4 — Write your first FlowGroup
Create pipelines/customer_sample.yaml:
pipeline: tpch_sample_ingestion
flowgroup: customer_ingestion

actions:
  - name: customer_sample_load
    type: load
    readMode: stream
    source:
      type: delta
      database: "samples.tpch"
      table: customer_sample
    target: v_customer_sample_raw
    description: "Load customer sample table from the Databricks samples catalog"

  - name: write_customer_sample_bronze
    type: write
    source: v_customer_sample_raw
    write_target:
      type: streaming_table
      database: "${catalog}.${bronze_schema}"
      table: "tpch_sample_customer"
    description: "Write customer sample to bronze"
The file declares one Pipeline (tpch_sample_ingestion) containing one
FlowGroup (customer_ingestion) with two Actions: a streaming load from
the samples catalog and a write into your bronze schema.
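The two Actions are linked by name: the write Action's source must match the load Action's target. The sketch below checks that wiring on a plain dictionary that mirrors the YAML above. It is an illustration only, with a hypothetical helper name (`unresolved_sources`), and it does not reproduce the checks that lhp validate performs.

```python
# Dictionary form of the FlowGroup above, reduced to the fields the check needs.
flowgroup = {
    "pipeline": "tpch_sample_ingestion",
    "flowgroup": "customer_ingestion",
    "actions": [
        {"name": "customer_sample_load", "type": "load",
         "source": {"type": "delta", "table": "customer_sample"},
         "target": "v_customer_sample_raw"},
        {"name": "write_customer_sample_bronze", "type": "write",
         "source": "v_customer_sample_raw"},
    ],
}


def unresolved_sources(group):
    """Return (action name, source) pairs whose source matches no Action target."""
    targets = {a["target"] for a in group["actions"] if "target" in a}
    # Mapping-style sources describe external data, so only string sources
    # are treated as references to views inside this FlowGroup.
    return [
        (a["name"], a["source"])
        for a in group["actions"]
        if isinstance(a.get("source"), str) and a["source"] not in targets
    ]


print(unresolved_sources(flowgroup))  # []
```

An empty list means every string source refers to a view that some Action in the FlowGroup produces.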
Step 5 — Validate
Run validation before generating code:
lhp validate --env dev
A passing run ends with:
✅ All configurations are valid
If validation fails, LHP prints a structured error with the file, action name,
and a suggestion. Fix the reported issue and run lhp validate again.
Step 6 — Generate Python code
Run code generation:
lhp generate --env dev
LHP writes the generated file to
generated/tpch_sample_ingestion/customer_ingestion.py and prints:
✅ tpch_sample_ingestion: Generated 1 file(s)
✅ Code generation completed successfully
Inspect the file. The body looks like this:
# Generated by LakehousePlumber
# Pipeline: tpch_sample_ingestion
# FlowGroup: customer_ingestion

from pyspark import pipelines as dp

PIPELINE_ID = "tpch_sample_ingestion"
FLOWGROUP_ID = "customer_ingestion"


@dp.temporary_view()
def v_customer_sample_raw():
    """Load customer sample table from the Databricks samples catalog"""
    df = spark.readStream.table("samples.tpch.customer_sample")
    return df


dp.create_streaming_table(
    name="my_catalog.my_bronze_schema.tpch_sample_customer",
    comment="Streaming table: tpch_sample_customer",
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.enableChangeDataFeed": "true",
    },
)


@dp.append_flow(
    target="my_catalog.my_bronze_schema.tpch_sample_customer",
    name="f_write_customer_sample_bronze",
    comment="Write customer sample to bronze",
)
def f_write_customer_sample_bronze():
    df = spark.readStream.table("v_customer_sample_raw")
    return df
The generated file is a standard Lakeflow Declarative Pipelines script. The
${catalog} and ${bronze_schema} tokens are resolved against
substitutions/dev.yaml.
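If you want to review the structure of a generated file without running it, Python's standard `ast` module can list each function and its decorators. The sketch below is an illustration, not an LHP tool: `decorated_functions` is a hypothetical helper, and the `GENERATED` string is an abbreviated stand-in for the real file, which you would read from generated/tpch_sample_ingestion/customer_ingestion.py instead.

```python
import ast

# Abbreviated stand-in for the generated file shown above.
GENERATED = '''
from pyspark import pipelines as dp

@dp.temporary_view()
def v_customer_sample_raw():
    df = spark.readStream.table("samples.tpch.customer_sample")
    return df

@dp.append_flow(target="my_catalog.my_bronze_schema.tpch_sample_customer")
def f_write_customer_sample_bronze():
    df = spark.readStream.table("v_customer_sample_raw")
    return df
'''


def decorated_functions(source):
    """Map each top-level function name to the names of its decorators."""
    result = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            names = []
            for decorator in node.decorator_list:
                # @dp.temporary_view() parses as a Call wrapping an Attribute.
                target = decorator.func if isinstance(decorator, ast.Call) else decorator
                names.append(ast.unparse(target))
            result[node.name] = names
    return result


print(decorated_functions(GENERATED))
```

Because the source is parsed and never executed, this works on a laptop without Spark or a Databricks connection.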
Step 7 — Run it in Databricks
You have two ways to run the generated code:
Bundle deploy. Because lhp init enabled DAB by default, your project
already has a databricks.yml and matching resource definitions under
resources/lhp/. From the project root, deploy with the Databricks CLI:
databricks bundle deploy --target dev
databricks bundle run tpch_sample_ingestion --target dev
Manual. Sync the generated/ folder to your workspace (via Repos or the
Databricks CLI), create a Lakeflow Declarative Pipeline in the UI, and point
its source field at generated/tpch_sample_ingestion.
Tip
Re-run lhp generate --env dev whenever you change a FlowGroup to refresh
the generated Python.
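One way to notice that a regeneration is due is to compare file modification times. The helper below is a minimal sketch under that assumption; `is_stale` is a hypothetical name, LHP may track changes differently, and edits to substitutions or presets also affect the output without touching the FlowGroup file.

```python
from pathlib import Path


def is_stale(flowgroup_yaml, generated_py):
    """Return True when the generated file is missing or older than its YAML."""
    source, output = Path(flowgroup_yaml), Path(generated_py)
    if not output.exists():
        return True
    return source.stat().st_mtime > output.stat().st_mtime


if __name__ == "__main__":
    stale = is_stale(
        "pipelines/customer_sample.yaml",
        "generated/tpch_sample_ingestion/customer_ingestion.py",
    )
    print("re-run lhp generate --env dev" if stale else "generated code is current")
```

The paths in the usage block are the ones created in Steps 4 and 6 of this walk-through.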
Next steps
You now have a working first pipeline. To go further, see Overview for task-shaped recipes — adding transforms, ingesting with Auto Loader, attaching data-quality expectations, layering presets, and deploying bundles across environments. To understand how the YAML maps to generated Python and why LHP designs FlowGroups and Actions this way, read Architecture.