Quickstart¶

Build your first Lakehouse Plumber (LHP) pipeline in about ten minutes. This walk-through uses the samples.tpch.customer_sample table that ships with every Databricks workspace, so you do not need to upload data first. By the end, you will have a generated Lakeflow Declarative Pipelines Python file that you can run in Databricks.

Prerequisites¶

Python 3.11 or later (3.12 recommended).
A Databricks workspace with Unity Catalog enabled.
Read access to the samples catalog. Verify with:
```
SELECT 1 FROM samples.tpch.customer_sample LIMIT 1;
```
If the samples catalog is unavailable in your tenant (some sovereign-cloud workspaces, for example GovCloud and China, do not provision it), follow Ingest with Auto Loader instead. That walk-through reads from your own landing volume.
A catalog and schema you can write to. Note the names — the substitutions file in Step 3 needs them.

Step 1 — Install LHP¶

Create a virtual environment and install the CLI:

```
python -m venv .venv
source .venv/bin/activate
pip install lakehouse-plumber
```

Confirm the install:

```
lhp --version
```
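
The prerequisite and install checks can also be scripted. The following is a minimal sketch and not part of LHP: `preflight` is a hypothetical helper that compares the interpreter version with the 3.11 minimum stated in the prerequisites and looks for the `lhp` executable on `PATH`.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the prerequisites


def preflight(version_info=sys.version_info, executable="lhp"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later required"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(executable) is None:
        problems.append(f"'{executable}' not found on PATH; is the venv active?")
    return problems


for problem in preflight():
    print(problem)
```

An empty result means the interpreter is recent enough and the CLI is reachable from the current shell.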

Step 2 — Scaffold a project¶

Run lhp init from an empty directory:

```
mkdir my_first_pipeline && cd my_first_pipeline
lhp init my_first_pipeline
```

By default, lhp init enables Databricks Asset Bundle (DAB) integration. To opt out, pass --no-bundle.

The command creates these directories:

pipelines/ — your FlowGroup YAML files.
substitutions/ — per-environment token files (dev, tst, prod).
templates/ — reusable FlowGroup templates (Jinja2).
presets/ — Action defaults applied by type.
schemas/ and expectations/ — schema files and data-quality rules.
generated/ — destination for generated Python code.
resources/lhp/ — DAB resource definitions (bundle mode only).

It also creates lhp.yaml (the project file), databricks.yml (DAB manifest, bundle mode), and a .vscode/ directory wired for YAML IntelliSense.
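
To confirm that the scaffold matches the layout described above, a short script can compare the project directory with the expected names. This is a sketch under assumptions: `missing_entries` is a hypothetical helper, the names are copied from the list above for bundle mode, and the exact layout may differ between LHP versions.

```python
from pathlib import Path

# Names taken from the directory list above (bundle mode).
EXPECTED_DIRS = [
    "pipelines", "substitutions", "templates", "presets",
    "schemas", "expectations", "generated", "resources/lhp",
]
EXPECTED_FILES = ["lhp.yaml", "databricks.yml"]


def missing_entries(project_root):
    """Return the expected paths that are absent under project_root."""
    root = Path(project_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_entries(".")` from the project root returns an empty list when every listed directory and file exists.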

You see this success message:

```
Initialized Databricks Asset Bundle project: my_first_pipeline
Created directories: presets, templates, pipelines, substitutions, schemas,
expectations, generated, config, resources
Created example files: presets/bronze_layer.yaml,
templates/standard_ingestion.yaml, databricks.yml
```

The example files in presets/ and templates/ ship with a .tmpl suffix so that they stay inactive until you adopt them. Leave them in place for now.

Step 3 — Configure `substitutions/dev.yaml`¶

The init step creates substitutions/dev.yaml.tmpl with placeholder values. Rename it first:

```
mv substitutions/dev.yaml.tmpl substitutions/dev.yaml
```

Open substitutions/dev.yaml and set catalog and bronze_schema to the catalog and schema you can write to:

substitutions/dev.yaml¶

```
dev:
  env: dev
  catalog: my_catalog
  bronze_schema: my_bronze_schema
  silver_schema: my_silver_schema
  gold_schema: my_gold_schema
  landing_path: /Volumes/my_catalog/my_bronze_schema/landing
```

At generate time, LHP replaces the ${catalog} and ${bronze_schema} tokens in your FlowGroup with the values from this file. The substitution syntax in YAML and SQL is ${token}.
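
The effect of `${token}` substitution can be illustrated with a few lines of Python. This is only a sketch of the idea, not LHP's implementation: `resolve_tokens` is a hypothetical helper, and raising an error on an unknown token is an assumption about desirable behavior rather than a statement of what LHP does.

```python
import re

# Matches ${name}, where name is letters, digits, or underscores.
TOKEN = re.compile(r"\$\{(\w+)\}")


def resolve_tokens(text, values):
    """Replace each ${name} in text with values[name]; fail on unknown names."""
    def lookup(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no substitution defined for token '{name}'")
        return str(values[name])
    return TOKEN.sub(lookup, text)


# Values mirror the dev environment shown above.
dev = {"catalog": "my_catalog", "bronze_schema": "my_bronze_schema"}
print(resolve_tokens("${catalog}.${bronze_schema}", dev))
# prints: my_catalog.my_bronze_schema
```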

Step 4 — Write your first FlowGroup¶

Create pipelines/customer_sample.yaml:

pipelines/customer_sample.yaml¶

```
pipeline: tpch_sample_ingestion
flowgroup: customer_ingestion

actions:
  - name: customer_sample_load
    type: load
    readMode: stream
    source:
      type: delta
      database: "samples.tpch"
      table: customer_sample
    target: v_customer_sample_raw
    description: "Load customer sample table from the Databricks samples catalog"

  - name: write_customer_sample_bronze
    type: write
    source: v_customer_sample_raw
    write_target:
      type: streaming_table
      database: "${catalog}.${bronze_schema}"
      table: "tpch_sample_customer"
    description: "Write customer sample to bronze"
```

The file declares one Pipeline (tpch_sample_ingestion) containing one FlowGroup (customer_ingestion) with two Actions: a streaming load from the samples catalog and a write into your bronze schema.
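
The two Actions are linked by name: the write Action's `source` is the view that the load Action declares as its `target`. The sketch below mirrors the FlowGroup as a plain Python dictionary and checks that link. `unresolved_sources` is a hypothetical helper for illustration; LHP's own validation is likely broader and may work differently.

```python
# A trimmed, in-memory mirror of pipelines/customer_sample.yaml.
flowgroup = {
    "pipeline": "tpch_sample_ingestion",
    "flowgroup": "customer_ingestion",
    "actions": [
        {"name": "customer_sample_load", "type": "load",
         "target": "v_customer_sample_raw"},
        {"name": "write_customer_sample_bronze", "type": "write",
         "source": "v_customer_sample_raw"},
    ],
}


def unresolved_sources(fg):
    """Return (action name, source) pairs whose source no action produces."""
    produced = {a["target"] for a in fg["actions"] if "target" in a}
    return [
        (a["name"], a["source"])
        for a in fg["actions"]
        # Load actions read external tables, so only view names are checked.
        if a["type"] != "load"
        and isinstance(a.get("source"), str)
        and a["source"] not in produced
    ]


print(unresolved_sources(flowgroup))  # prints: []
```

A misspelled view name in the write Action would appear in the returned list together with the Action that references it.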

Step 5 — Validate¶

Run validation before generating code:

```
lhp validate --env dev
```

A passing run ends with:

```
✅ All configurations are valid
```

If validation fails, LHP prints a structured error with the file, action name, and a suggestion. Fix the reported issue and run lhp validate again.

Step 6 — Generate Python code¶

Run code generation:

```
lhp generate --env dev
```

LHP writes the generated file to generated/tpch_sample_ingestion/customer_ingestion.py and prints:

```
✅ tpch_sample_ingestion: Generated 1 file(s)
✅ Code generation completed successfully
```
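
The validate-then-generate order from Steps 5 and 6 can be enforced in a small wrapper, for example in a local script or a CI job. This is a sketch: `run_steps` is a hypothetical helper, and it assumes that `lhp` exits with a non-zero status when a command fails, which this page does not state.

```python
import subprocess

# The two commands from Steps 5 and 6, in the order they should run.
STEPS = [
    ["lhp", "validate", "--env", "dev"],
    ["lhp", "generate", "--env", "dev"],
]


def run_steps(steps):
    """Run commands in order; return the index of the first failure, or None."""
    for index, command in enumerate(steps):
        # Assumption: a failing command exits with a non-zero status.
        result = subprocess.run(command)
        if result.returncode != 0:
            return index
    return None
```

Calling `run_steps(STEPS)` from the project root returns `None` when both commands succeed, and otherwise returns the index of the step that failed, so generation never runs after a failed validation.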

Inspect the file. The body looks like this:

generated/tpch_sample_ingestion/customer_ingestion.py¶

```
# Generated by LakehousePlumber
# Pipeline: tpch_sample_ingestion
# FlowGroup: customer_ingestion

from pyspark import pipelines as dp

PIPELINE_ID = "tpch_sample_ingestion"
FLOWGROUP_ID = "customer_ingestion"

@dp.temporary_view()
def v_customer_sample_raw():
    """Load customer sample table from the Databricks samples catalog"""
    df = spark.readStream.table("samples.tpch.customer_sample")
    return df

dp.create_streaming_table(
    name="my_catalog.my_bronze_schema.tpch_sample_customer",
    comment="Streaming table: tpch_sample_customer",
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.enableChangeDataFeed": "true",
    },
)

@dp.append_flow(
    target="my_catalog.my_bronze_schema.tpch_sample_customer",
    name="f_write_customer_sample_bronze",
    comment="Write customer sample to bronze",
)
def f_write_customer_sample_bronze():
    df = spark.readStream.table("v_customer_sample_raw")
    return df
```

The generated file is a standard Lakeflow Declarative Pipelines script. The ${catalog} and ${bronze_schema} tokens are resolved against substitutions/dev.yaml.
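
To make the mapping from YAML to Python more concrete, the sketch below renders the load Action from Step 4 into the view definition shown above, using plain string formatting. It is illustrative only: `render_load` and `LOAD_TEMPLATE` are hypothetical names, LHP's real generator is not shown on this page, and the `spark.read` fallback for non-streaming loads is an assumption.

```python
LOAD_TEMPLATE = '''@dp.temporary_view()
def {target}():
    """{description}"""
    df = spark.{reader}.table("{table}")
    return df
'''


def render_load(action):
    """Render a delta load action as view source code (illustrative only)."""
    source = action["source"]
    # Assumption: any readMode other than "stream" maps to a batch read.
    reader = "readStream" if action.get("readMode") == "stream" else "read"
    return LOAD_TEMPLATE.format(
        target=action["target"],
        description=action.get("description", ""),
        reader=reader,
        table=f'{source["database"]}.{source["table"]}',
    )


action = {
    "name": "customer_sample_load",
    "type": "load",
    "readMode": "stream",
    "source": {"type": "delta", "database": "samples.tpch",
               "table": "customer_sample"},
    "target": "v_customer_sample_raw",
    "description": "Load customer sample table from the Databricks samples catalog",
}
print(render_load(action))
```

The printed text matches the `v_customer_sample_raw` view in the generated file: the Action's `target` becomes the function name, and `database` plus `table` become the argument to `spark.readStream.table`.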

Step 7 — Run it in Databricks¶

You have two ways to run the generated code:

Bundle deploy. Because lhp init enabled DAB by default, your project already has a databricks.yml and matching resource definitions under resources/lhp/. From the project root, deploy with the Databricks CLI:

```
databricks bundle deploy --target dev
databricks bundle run tpch_sample_ingestion --target dev
```

Manual. Sync the generated/ folder to your workspace (via Repos or the Databricks CLI), create a Lakeflow Declarative Pipeline in the UI, and point its source field at generated/tpch_sample_ingestion.

Tip

Re-run lhp generate --env dev whenever you change a FlowGroup to refresh the generated Python.
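
One way to notice a forgotten regeneration is to compare file modification times. The sketch below is a rough heuristic, not an LHP feature: `is_stale` is a hypothetical helper, and it only looks at one FlowGroup file, so changes to substitutions, presets, or templates would go undetected.

```python
from pathlib import Path


def is_stale(yaml_path, generated_path):
    """True when the generated file is missing or older than its FlowGroup YAML."""
    source, output = Path(yaml_path), Path(generated_path)
    if not output.exists():
        return True
    return source.stat().st_mtime > output.stat().st_mtime
```

For this walk-through, the call would be `is_stale("pipelines/customer_sample.yaml", "generated/tpch_sample_ingestion/customer_ingestion.py")`.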

Next steps¶

You now have a working first pipeline. To go further, see Overview for task-shaped recipes — adding transforms, ingesting with Auto Loader, attaching data-quality expectations, layering presets, and deploying bundles across environments. To understand how the YAML maps to generated Python and why LHP designs FlowGroups and Actions this way, read Architecture.
