Decisions

Most LHP design choices reduce to a handful of decisions. Each section gives a one-screen matrix and a pointer to the relevant reference page. For definitions of FlowGroups, Actions, Presets, Templates, and Blueprints, see Architecture.

Preset vs Template vs Blueprint

LHP has six reusability primitives, listed below. They layer rather than compete. Pick the one that factors out the axis of repetition you want to eliminate.

| Primitive | Use when you want to factor out… | Where it lives |
|---|---|---|
| Action | Nothing; this is the atomic unit. | Inside a FlowGroup |
| Preset | Default values (table properties, cloudFiles options, Spark config) repeated across actions of one type. | presets/*.yaml |
| Template | A parametrised group of actions repeated inside a single FlowGroup. | templates/*.yaml |
| FlowGroup | Nothing; this is the unit of generation. | pipelines/**/*.yaml |
| Blueprint | A parametrised list of FlowGroups repeated across many similar deployments. | blueprints/*.yaml |
| Instance | Parameter values supplied to a Blueprint. | pipelines/**/*.yaml (co-located) |

Tip

Factor by the smallest axis that repeats. Three actions that repeat for each table name call for a Template; twenty FlowGroups that repeat across regional sites call for a Blueprint.

Template vs Blueprint — the most common confusion

Both reduce repetition, but at different granularities.

| Use a Template when… | Use a Blueprint when… |
|---|---|
| The same three actions ingest a CSV table; you want to parametrise them by table name. | The same bronze/silver shape repeats across ten regional sites; you want to parametrise it by site_name. |
| One Template + one FlowGroup yields one FlowGroup’s actions. | One Blueprint + N instances yield N × M synthetic FlowGroups, where M is the number of FlowGroups the Blueprint declares. |
| Used via use_template: in a FlowGroup file. | Used via use_blueprint: in an instance file. |
| Parameters as Jinja2 {{ var }}. | Parameters as %{var} local variables. |

The two compose: a Blueprint FlowGroup spec can declare use_template: like a disk-sourced FlowGroup.
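The N × M arithmetic can be sketched in plain Python. This is an illustration only, not LHP's expansion code: the dictionary layout, the expand_blueprint and substitute helpers, and the field names flowgroup and source_path are hypothetical. Only the %{var} placeholder syntax and the site_name parameter come from the comparison above.

```python
import re

# Hypothetical blueprint: M = 2 FlowGroup specs parametrised by %{site_name}.
BLUEPRINT = [
    {"flowgroup": "%{site_name}_bronze", "source_path": "/landing/%{site_name}/"},
    {"flowgroup": "%{site_name}_silver", "source_path": "bronze.%{site_name}_orders"},
]

# Hypothetical instances: N = 3 sets of parameter values.
INSTANCES = [{"site_name": "emea"}, {"site_name": "apac"}, {"site_name": "amer"}]

_PLACEHOLDER = re.compile(r"%\{(\w+)\}")


def substitute(value: str, params: dict) -> str:
    """Replace every %{var} in value; an unknown variable raises KeyError."""
    return _PLACEHOLDER.sub(lambda match: params[match.group(1)], value)


def expand_blueprint(blueprint: list, instances: list) -> list:
    """Return one synthetic FlowGroup per (instance, spec) pair."""
    return [
        {key: substitute(val, params) for key, val in spec.items()}
        for params in instances
        for spec in blueprint
    ]


flowgroups = expand_blueprint(BLUEPRINT, INSTANCES)
print(len(flowgroups))             # 3 instances x 2 specs = 6
print(flowgroups[0]["flowgroup"])  # emea_bronze
```

A Template, by contrast, would expand inside a single one of these FlowGroups, so the two mechanisms multiply rather than overlap.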

Preset layering — how many to apply

Presets deep-merge in order; explicit action config always wins. Pick depth by how much variation the workload tolerates:

  • One global preset (for example bronze_layer) when every Bronze table shares the same options.

  • Layered presets (bronze_layer + cdc_overrides) when a subset of FlowGroups overrides a few keys. Reach for extends: only when the second preset is reusable as a named building block.

  • No preset when the FlowGroup is one-of-a-kind — inlining a single option is clearer than naming a preset for it.
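The merge order described above can be sketched as follows. This is a minimal model of "presets deep-merge in order; explicit action config always wins", not LHP's implementation: deep_merge and resolve are hypothetical helpers, the option names are examples only, and the choice to replace (rather than concatenate) non-dict values such as lists is an assumption.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which override wins, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


def resolve(presets: list, action_config: dict) -> dict:
    """Apply presets in order, then the action's explicit config last."""
    resolved = {}
    for preset in presets:
        resolved = deep_merge(resolved, preset)
    return deep_merge(resolved, action_config)


# Hypothetical preset contents.
bronze_layer = {"table_properties": {"delta.enableChangeDataFeed": "true", "quality": "bronze"}}
cdc_overrides = {"table_properties": {"quality": "bronze_cdc"}}
action = {"table_properties": {"delta.enableChangeDataFeed": "false"}}

print(resolve([bronze_layer, cdc_overrides], action))
# {'table_properties': {'delta.enableChangeDataFeed': 'false', 'quality': 'bronze_cdc'}}
```

The later preset overrides only the key it names, and the action's own value overrides both presets, which is why a second preset is worth adding only when it changes a few keys.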

Streaming vs batch

Lakeflow Declarative Pipelines decides execution order at runtime, but you choose streaming or batch when you pick how a Load reads and which Write target receives the data. The two ends must agree.

| Workload shape | Pick at Load | Pick at Write |
|---|---|---|
| Files trickling into object storage; exactly-once and schema evolution. | cloudfiles | streaming_table |
| Delta source appended by another job; you want the latest rows. | delta with readMode: stream | streaming_table |
| Delta source you re-aggregate on a schedule. | delta with readMode: batch | materialized_view |
| Change Data Capture (CDC) events; SCD Type 1 or Type 2. | delta (CDF) or cloudfiles | streaming_table mode: cdc |
| Full snapshots that LHP diffs. | delta or python | streaming_table mode: snapshot_cdc |
| Gold dashboards; freshness measured in hours. | sql against Silver tables | materialized_view |

If you mix the two (for example, a batch Load feeding a streaming_table), the pipeline still parses, but Lakeflow re-reads the source on every refresh, which is usually not what you want.
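A pre-flight check for this mismatch might look like the sketch below. It is not part of LHP: the dictionary shape, the read_style and ends_agree helpers, and the assumption that a Load without readMode reads in batch are all hypothetical. It also covers only the plain streaming and batch pairings from the matrix, and deliberately ignores the cdc and snapshot_cdc modes.

```python
# Pairings from the matrix above: how the Load reads -> Write targets that agree.
COMPATIBLE = {
    "stream": {"streaming_table"},
    "batch": {"materialized_view"},
}


def read_style(load: dict) -> str:
    """Classify a Load as 'stream' or 'batch' (hypothetical config shape)."""
    if load["type"] == "cloudfiles":
        return "stream"  # cloudfiles is streaming only
    # Assumption: a missing readMode means batch; check the Load reference.
    return load.get("readMode", "batch")


def ends_agree(load: dict, write_type: str) -> bool:
    """True when the Load's read style matches the Write target type."""
    return write_type in COMPATIBLE[read_style(load)]


print(ends_agree({"type": "cloudfiles"}, "streaming_table"))                  # True
print(ends_agree({"type": "delta", "readMode": "batch"}, "streaming_table"))  # False
```

The second call is the mixed case: it would parse, but it is the combination that makes Lakeflow re-read the source on every refresh.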

Choosing a load source

Pick the Load sub-type by what the data looks like at rest and how it is delivered:

| Sub-type | Use when… |
|---|---|
| cloudfiles | Files arrive in object storage (S3, ADLS, GCS, Unity Catalog volumes); incremental ingestion with checkpoints and schema evolution. Streaming only. |
| delta | Reading an existing Delta table or its Change Data Feed (CDF). Batch or streaming. |
| sql | An arbitrary SQL query materialised as a temporary view (often joins or windowed aggregates across already-loaded sources). |
| jdbc | Pulling from an external RDBMS (Oracle, SQL Server, Postgres, MySQL); credentials via Databricks secrets. |
| python | The format is not covered by a built-in sub-type, or you need custom pre-processing in Python before the flow sees the data. |
| custom_datasource | You have or want a PySpark DataSourceV2 implementation and want LHP to register and invoke it. |

For the full sub-type reference see Load Actions; for a walk-through of cloudfiles see Ingest with Auto Loader.

Choosing a write target

Pick the Write sub-type by the kind of object you want to produce:

| Sub-type | Use when… |
|---|---|
| streaming_table | Persisting incrementally arriving data. Bronze and Silver layers, CDC, fan-in from multiple sources. Supports standard, cdc, and snapshot_cdc modes. |
| materialized_view | Batch-computed analytics: Gold dashboards and aggregations that refresh on a schedule. |
| sink | Pushing data out of the lakehouse: Kafka, Delta to an external catalog, Azure Event Hubs, or a custom REST endpoint. |

For the full sub-type reference see Write Actions.

Choosing a write mode (streaming_table)

Within streaming_table, the mode field decides how rows are applied to the target:

| Mode | When to use | Notes |
|---|---|---|
| standard (default) | Append-only or fan-in workloads where each row is new. | Multiple write actions targeting the same table fan in via @dp.append_flow; only the first action sets create_table: true. |
| cdc | Source emits change events (insert / update / delete) with a sequence column. SCD Type 1 or Type 2 targets. | Requires cdc_config.keys and sequence_by. Supports scd_type: 1 or scd_type: 2 with optional track_history_column_list. Multi-CDC fan-in works when contributors agree on the shared CDC fields. |
| snapshot_cdc | Source delivers full snapshots rather than change events; LHP diffs successive snapshots. | Backed by create_auto_cdc_from_snapshot_flow(). Source is a Delta table or a Python function returning (df, version). |
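To make "diffs successive snapshots" concrete, the sketch below compares two snapshots in plain Python. It is a conceptual model only: in a real pipeline the comparison is done by the Lakeflow function named above, and the diff_snapshots helper, the row layout, and the sample data are hypothetical.

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by primary key; each value is a row dict."""
    inserts = {key: row for key, row in current.items() if key not in previous}
    deletes = {key: row for key, row in previous.items() if key not in current}
    updates = {
        key: row
        for key, row in current.items()
        if key in previous and previous[key] != row
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical snapshots of a customer table, keyed by customer id.
snapshot_v1 = {1: {"name": "Ann", "city": "Oslo"}, 2: {"name": "Bo", "city": "Rome"}}
snapshot_v2 = {1: {"name": "Ann", "city": "Bergen"}, 3: {"name": "Cy", "city": "Lima"}}

changes = diff_snapshots(snapshot_v1, snapshot_v2)
print({kind: sorted(rows) for kind, rows in changes.items()})
# {'insert': [3], 'update': [1], 'delete': [2]}
```

The derived inserts, updates, and deletes play the role that explicit change events play in cdc mode, which is why snapshot_cdc needs keys but no change feed from the source.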

Single-job vs multi-job orchestration

By default, LHP generates one orchestration job that runs every Lakeflow pipeline in dependency order. Set job_name on a FlowGroup to split work across multiple jobs.

| Pick… | When… |
|---|---|
| Single job (omit job_name) | All pipelines share one cadence; you want one place to monitor success; the dependency graph is small enough to run end-to-end on every trigger. |
| Multi-job (set job_name on every FlowGroup) | Pipelines have different schedules (hourly POS, nightly ERP); source systems need isolation (a failing SAP feed must not block NCR ingestion); different teams own different jobs and want separate alerting and permissions. |

If you set job_name on any FlowGroup, you must set it on every FlowGroup; the validator rejects mixed configurations. In multi-job mode, LHP also emits a master orchestration job that triggers the per-system jobs in dependency order. For the full reference see Bundle Configuration Reference.
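The all-or-nothing rule for job_name can be expressed as a small check. This is a sketch of the rule as stated above, not LHP's validator: the check_job_names helper, the error message, and the FlowGroup names are hypothetical.

```python
def check_job_names(flowgroups: list) -> None:
    """Raise ValueError when only some FlowGroups set job_name."""
    with_job = [fg["flowgroup"] for fg in flowgroups if fg.get("job_name")]
    without_job = [fg["flowgroup"] for fg in flowgroups if not fg.get("job_name")]
    if with_job and without_job:
        raise ValueError(
            "job_name must be set on every FlowGroup or on none; missing on: "
            + ", ".join(without_job)
        )


# Hypothetical FlowGroups: one sets job_name, one does not.
mixed = [
    {"flowgroup": "pos_hourly", "job_name": "pos_job"},
    {"flowgroup": "erp_nightly"},
]

try:
    check_job_names(mixed)
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
# job_name must be set on every FlowGroup or on none; missing on: erp_nightly
```

Both uniform cases pass silently: no FlowGroup sets job_name (single job), or every FlowGroup sets it (multi-job).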

See also

  • Architecture — definitions for FlowGroups, Actions, Presets, Templates, and Blueprints.

  • Presets Reference — preset schema and deep-merge semantics.

  • Templates Reference — Jinja2 parameter syntax and template composition.

  • Blueprints — Blueprint and instance schema, parameter validation, and expansion.