Decisions

Most LHP design choices reduce to a handful of decisions. Each section gives a one-screen matrix and a pointer to the relevant reference page. For definitions of FlowGroups, Actions, Presets, Templates, and Blueprints, see Architecture.

Preset vs Template vs Blueprint

LHP has six reusability primitives, listed in the table below. They layer rather than compete. Pick the one that factors out the axis of repetition you want to eliminate.

Primitive	Use when you want to factor out…	Where it lives
Action	Nothing — this is the atomic unit.	Inside a FlowGroup
Preset	Default values (table properties, `cloudFiles` options, Spark config) repeated across actions of one type.	`presets/*.yaml`
Template	A parametrised group of actions repeated inside a single FlowGroup.	`templates/*.yaml`
FlowGroup	Nothing — this is the unit of generation.	`pipelines/*/.yaml`
Blueprint	A parametrised list of FlowGroups repeated across many similar deployments.	`blueprints/*.yaml`
Instance	Parameter values supplied to a Blueprint.	`pipelines/*/.yaml` (co-located)

Tip

Factor by the smallest axis that repeats. Three actions repeating with a table name is a Template; twenty FlowGroups repeating across regional sites is a Blueprint.

Template vs Blueprint: the most common confusion

Both reduce repetition, but at different granularities.

Use a Template when…	Use a Blueprint when…
The same three actions ingest a CSV table; you want to parametrise them by table name.	The same bronze/silver shape repeats across ten regional sites; you want to parametrise it by `site_name`.
One Template + one FlowGroup yields one FlowGroup’s actions.	One Blueprint declaring M FlowGroup specs + N instances yields N × M synthetic FlowGroups.
Used via `use_template:` in a FlowGroup file.	Used via `use_blueprint:` in an instance file.
Parameters as Jinja2 `{{ var }}`.	Parameters as `%{var}` local variables.

The two compose: a FlowGroup spec inside a Blueprint can declare `use_template:` in the same way as a FlowGroup file on disk.
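The N × M count can be illustrated with a small Python sketch. This is not LHP's implementation: the `expand` and `substitute` helpers, the spec strings, and the site names are all hypothetical, and only the `%{var}` token syntax comes from the table above.

```python
import re


def substitute(text, params):
    """Replace every %{name} token in text with the matching value in params."""
    return re.sub(r"%\{(\w+)\}", lambda m: str(params[m.group(1)]), text)


def expand(blueprint_specs, instances):
    """Return one synthetic FlowGroup name per (instance, spec) pair."""
    return [
        substitute(spec, params)
        for params in instances
        for spec in blueprint_specs
    ]


# Hypothetical Blueprint with M = 2 FlowGroup specs.
specs = ["%{site_name}_bronze", "%{site_name}_silver"]

# Hypothetical instance files supplying N = 3 parameter sets.
instances = [
    {"site_name": "berlin"},
    {"site_name": "oslo"},
    {"site_name": "lyon"},
]

flowgroups = expand(specs, instances)
print(len(flowgroups))  # 6, that is N x M = 3 x 2
print(flowgroups[:2])   # ['berlin_bronze', 'berlin_silver']
```

Adding a fourth site means adding one instance, not two more FlowGroup files.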

Preset layering: how many to apply

Presets deep-merge in order; explicit action config always wins. Pick depth by how much variation the workload tolerates:

One global preset (for example `bronze_layer`) when every Bronze table shares the same options.
Layered presets (`bronze_layer` + `cdc_overrides`) when a subset of FlowGroups overrides a few keys. Reach for `extends:` only when the second preset is reusable as a named building block.
No preset when the FlowGroup is one-of-a-kind: inlining a single option is clearer than naming a preset for it.
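The merge order described above can be sketched in plain Python. This is an illustration of deep-merge semantics under stated assumptions, not LHP's merge code: `deep_merge` and `resolve` are hypothetical helpers, the property names are made up, and the sketch assumes that non-dictionary values (including lists) are replaced rather than combined.

```python
from functools import reduce


def deep_merge(base, override):
    """Return a new dict where override wins, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists are replaced, not combined.
            merged[key] = value
    return merged


def resolve(presets, action_config):
    """Apply presets left to right, then the explicit action config last."""
    return reduce(deep_merge, [*presets, action_config], {})


# Hypothetical preset contents.
bronze_layer = {"table_properties": {"quality": "bronze", "retention": "30d"}}
cdc_overrides = {"table_properties": {"quality": "bronze_cdc"}}

# Explicit config on the action itself.
action = {"table_properties": {"retention": "7d"}}

resolved = resolve([bronze_layer, cdc_overrides], action)
print(resolved)
# {'table_properties': {'quality': 'bronze_cdc', 'retention': '7d'}}
```

The later preset overrides `quality`, and the action's own `retention` beats both presets, which is the "explicit action config always wins" rule.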

Streaming vs batch

Lakeflow Declarative Pipelines decides execution order at runtime, but you choose streaming or batch yourself: once when you decide how a Load reads its source, and again when you decide which Write target receives the data. The two ends must agree.

Workload shape	Pick at Load	Pick at Write
Files trickling into object storage; exactly-once and schema evolution.	`cloudfiles`	`streaming_table`
Delta source appended by another job; you want the latest rows.	`delta` with `readMode: stream`	`streaming_table`
Delta source you re-aggregate on a schedule.	`delta` with `readMode: batch`	`materialized_view`
Change Data Capture (CDC) events; SCD Type 1 or Type 2.	`delta` (CDF) or `cloudfiles`	`streaming_table` `mode: cdc`
Full snapshots that LHP diffs.	`delta` or `python`	`streaming_table` `mode: snapshot_cdc`
Gold dashboards; freshness measured in hours.	`sql` against Silver tables	`materialized_view`

If you mix the two, for example a batch Load feeding a `streaming_table`, the pipeline parses, but Lakeflow re-reads the source on every refresh. That is usually not what you want.
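A rough way to reason about "the two ends must agree" is a check like the following Python sketch. It is not part of LHP: the function names and the dictionary shapes are hypothetical, it covers only the `cloudfiles` and `delta` rows of the matrix, and it treats `snapshot_cdc` as exempt, which is one reading of the snapshot row rather than a documented rule.

```python
# Which read mode each Write target expects, taken from the matrix above.
EXPECTED_MODE = {"streaming_table": "stream", "materialized_view": "batch"}


def load_mode(load):
    """Classify a Load as 'stream' or 'batch'."""
    if load["type"] == "cloudfiles":
        return "stream"  # cloudfiles is streaming only
    if load["type"] == "delta":
        return load["readMode"]  # this sketch requires it to be set explicitly
    raise ValueError(f"load type {load['type']!r} is not covered by this sketch")


def ends_agree(load, write):
    """Return True when the Load's read mode matches the Write target."""
    if write.get("mode") == "snapshot_cdc":
        # Assumption: snapshot sources are diffed, so the check is skipped.
        return True
    return load_mode(load) == EXPECTED_MODE[write["type"]]


print(ends_agree({"type": "cloudfiles"}, {"type": "streaming_table"}))
# True
print(ends_agree({"type": "delta", "readMode": "batch"}, {"type": "streaming_table"}))
# False
```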

Choosing a load source

Pick the Load sub-type by what the data looks like at rest and how it is delivered:

Sub-type	Use when…
`cloudfiles`	Files arrive in object storage (S3, ADLS, GCS, Unity Catalog volumes); incremental ingestion with checkpoints and schema evolution. Streaming only.
`delta`	Reading an existing Delta table or its Change Data Feed (CDF). Batch or streaming.
`sql`	An arbitrary SQL query materialised as a temporary view (often joins or windowed aggregates across already-loaded sources).
`jdbc`	Pulling from an external RDBMS (Oracle, SQL Server, Postgres, MySQL); credentials via Databricks secrets.
`python`	The format is not covered by a built-in sub-type, or you need custom pre-processing in Python before the flow sees the data.
`custom_datasource`	You have or want a PySpark DataSourceV2 implementation and want LHP to register and invoke it.

For the full sub-type reference see Load Actions; for a walk-through of `cloudfiles` see Ingest with Auto Loader.

Choosing a write target

Pick the Write sub-type by the kind of object you want to produce:

Sub-type	Use when…
`streaming_table`	Persisting incrementally arriving data. Bronze and Silver layers, CDC, fan-in from multiple sources. Supports `standard`, `cdc`, and `snapshot_cdc` modes.
`materialized_view`	Batch-computed analytics — Gold dashboards and aggregations that refresh on a schedule.
`sink`	Pushing data out of the lakehouse — Kafka, Delta to an external catalog, Azure Event Hubs, or a custom REST endpoint.

For the full sub-type reference see Write Actions.

Choosing a write mode (streaming_table)

Within `streaming_table`, the `mode` field decides how rows are applied to the target:

Mode	When to use	Notes
`standard` (default)	Append-only or fan-in workloads where each row is new.	Multiple write actions targeting the same table fan in via `@dp.append_flow`; only the first action sets `create_table: true`.
`cdc`	Source emits change events (insert / update / delete) with a sequence column. SCD Type 1 or Type 2 targets.	Requires `cdc_config.keys` and `sequence_by`. Supports `scd_type: 1` or `scd_type: 2` with optional `track_history_column_list`. Multi-CDC fan-in works when contributors agree on the shared CDC fields.
`snapshot_cdc`	Source delivers full snapshots rather than change events; LHP diffs successive snapshots.	Backed by `create_auto_cdc_from_snapshot_flow()`. Source is a Delta table or a Python function returning `(df, version)`.
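The fan-in note for `standard` mode (several write actions share a table, but only one creates it) can be checked with a sketch like the one below. The flat dictionary shape, the action names, and the `fan_in_problems` helper are hypothetical and do not reflect LHP's real schema or validator. The sketch also requires `create_table` to be set explicitly on every action so that it does not have to guess a default, and it checks only that exactly one action creates each table, not which one.

```python
from collections import defaultdict


def fan_in_problems(write_actions):
    """Return one message per table that does not have exactly one creator."""
    by_table = defaultdict(list)
    for action in write_actions:
        by_table[action["table"]].append(action)

    problems = []
    for table, actions in by_table.items():
        creators = [a["name"] for a in actions if a["create_table"]]
        if len(creators) != 1:
            problems.append(
                f"{table}: expected exactly 1 action with create_table: true, "
                f"found {len(creators)}"
            )
    return problems


# Hypothetical fan-in: two sources append into one table.
actions = [
    {"name": "write_pos", "table": "orders", "create_table": True},
    {"name": "write_web", "table": "orders", "create_table": False},
]
print(fan_in_problems(actions))  # []

# Both actions try to create the table.
actions[1]["create_table"] = True
print(len(fan_in_problems(actions)))  # 1
```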

Single-job vs multi-job orchestration

By default, LHP generates one orchestration job that runs every Lakeflow pipeline in dependency order. Set `job_name` on your FlowGroups to split the work across multiple jobs.

Pick…	When…
Single job (omit `job_name`)	All pipelines share one cadence; you want one place to monitor success; the dependency graph is small enough to run end-to-end on every trigger.
Multi-job (set `job_name` on every FlowGroup)	Pipelines have different schedules (hourly POS, nightly ERP); source systems need isolation (a failing SAP feed must not block NCR ingestion); different teams own different jobs and want separate alerting and permissions.

If you set `job_name` on any FlowGroup, you must set it on every FlowGroup; the validator rejects mixed configurations. In multi-job mode, LHP also emits a master orchestration job that triggers the per-system jobs in dependency order. For the full reference see Bundle Configuration Reference.
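The all-or-nothing rule for `job_name` amounts to a check along these lines. This is a sketch of the rule as stated above, not LHP's validator: the `check_job_names` function, the dictionary shape, and the FlowGroup and job names are hypothetical.

```python
def check_job_names(flowgroups):
    """Return 'single-job' or 'multi-job'; raise if job_name is set on only some."""
    missing = [fg["flowgroup"] for fg in flowgroups if not fg.get("job_name")]
    if missing and len(missing) != len(flowgroups):
        raise ValueError(
            "job_name is set on some FlowGroups but missing on: "
            + ", ".join(missing)
        )
    return "single-job" if missing or not flowgroups else "multi-job"


# No FlowGroup sets job_name: one orchestration job.
print(check_job_names([{"flowgroup": "pos"}, {"flowgroup": "erp"}]))
# single-job

# Every FlowGroup sets job_name: one job per distinct name.
print(check_job_names([
    {"flowgroup": "pos", "job_name": "hourly_pos"},
    {"flowgroup": "erp", "job_name": "nightly_erp"},
]))
# multi-job

# Mixed configuration: rejected.
try:
    check_job_names([
        {"flowgroup": "pos", "job_name": "hourly_pos"},
        {"flowgroup": "erp"},
    ])
except ValueError as err:
    print(err)  # job_name is set on some FlowGroups but missing on: erp
```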
