Decisions ========= .. meta:: :description: Decision matrices for common Lakehouse Plumber design choices: which reuse primitive to pick, streaming vs batch, load source, write target, write mode, and single vs multi-job orchestration. Most LHP design choices reduce to a handful of decisions. Each section gives a one-screen matrix and a pointer to the relevant reference page. For definitions of FlowGroups, Actions, Presets, Templates, and Blueprints, see :doc:`architecture`. Preset vs Template vs Blueprint ------------------------------- LHP has five reusability primitives. They layer rather than compete. Pick the one that factors out the axis of repetition you want to eliminate. .. list-table:: :header-rows: 1 :widths: 20 60 20 * - Primitive - Use when you want to factor out… - Where it lives * - **Action** - Nothing — this is the atomic unit. - Inside a FlowGroup * - **Preset** - Default values (table properties, ``cloudFiles`` options, Spark config) repeated across actions of one type. - ``presets/*.yaml`` * - **Template** - A parametrised group of actions repeated inside a single FlowGroup. - ``templates/*.yaml`` * - **FlowGroup** - Nothing — this is the unit of generation. - ``pipelines/**/*.yaml`` * - **Blueprint** - A parametrised list of FlowGroups repeated across many similar deployments. - ``blueprints/*.yaml`` * - **Instance** - Parameter values supplied to a Blueprint. - ``pipelines/**/*.yaml`` (co-located) .. tip:: Factor by the **smallest axis that repeats**. Three actions repeating with a table name is a Template; twenty FlowGroups repeating across regional sites is a Blueprint. Template vs Blueprint — the most common confusion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Both reduce repetition, but at different granularities. .. list-table:: :header-rows: 1 :widths: 50 50 * - Use a Template when… - Use a Blueprint when… * - The same three actions ingest a CSV table; you want to parametrise them by table name. - The same bronze/silver shape repeats across ten regional sites; you want to parametrise it by ``site_name``. * - One Template + one FlowGroup yields one FlowGroup's actions. - One Blueprint + N instances yield N × M synthetic FlowGroups. * - Used via ``use_template:`` in a FlowGroup file. - Used via ``use_blueprint:`` in an instance file. * - Parameters as Jinja2 ``{{ var }}``. - Parameters as ``%{var}`` local variables. The two compose: a Blueprint FlowGroup spec can declare ``use_template:`` like a disk-sourced FlowGroup. Preset layering — how many to apply ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Presets deep-merge in order; explicit action config always wins. Pick depth by how much variation the workload tolerates: - **One global preset** (for example ``bronze_layer``) when every Bronze table shares the same options. - **Layered presets** (``bronze_layer`` + ``cdc_overrides``) when a subset of FlowGroups overrides a few keys. Reach for ``extends:`` only when the second preset is reusable as a named building block. - **No preset** when the FlowGroup is one-of-a-kind — inlining a single option is clearer than naming a preset for it. Streaming vs batch ------------------ :term:`Lakeflow Declarative Pipeline` decides execution order at runtime, but you pick streaming or batch when you choose how a Load reads and which Write target receives. The two ends must agree. .. list-table:: :header-rows: 1 :widths: 30 35 35 * - Workload shape - Pick at Load - Pick at Write * - Files trickling into object storage; exactly-once and schema evolution. - ``cloudfiles`` - ``streaming_table`` * - Delta source appended by another job; you want the latest rows. - ``delta`` with ``readMode: stream`` - ``streaming_table`` * - Delta source you re-aggregate on a schedule. - ``delta`` with ``readMode: batch`` - ``materialized_view`` * - Change Data Capture (:term:`CDC`) events; :term:`SCD` Type 1 or Type 2. - ``delta`` (CDF) or ``cloudfiles`` - ``streaming_table`` ``mode: cdc`` * - Full snapshots that LHP diffs. - ``delta`` or ``python`` - ``streaming_table`` ``mode: snapshot_cdc`` * - Gold dashboards; freshness measured in hours. - ``sql`` against Silver tables - ``materialized_view`` If you mix the two — a batch Load feeding a ``streaming_table`` — the pipeline parses, but Lakeflow re-reads the source on every refresh. Usually not what you want. Choosing a load source ---------------------- Pick the Load sub-type by what the data looks like at rest and how it is delivered: .. list-table:: :header-rows: 1 :widths: 22 78 * - Sub-type - Use when… * - ``cloudfiles`` - Files arrive in object storage (S3, ADLS, GCS, Unity Catalog volumes); incremental ingestion with checkpoints and schema evolution. Streaming only. * - ``delta`` - Reading an existing Delta table or its Change Data Feed (CDF). Batch or streaming. * - ``sql`` - An arbitrary SQL query materialised as a temporary view (often joins or windowed aggregates across already-loaded sources). * - ``jdbc`` - Pulling from an external RDBMS (Oracle, SQL Server, Postgres, MySQL); credentials via Databricks secrets. * - ``python`` - The format is not covered by a built-in sub-type, or you need custom pre-processing in Python before the flow sees the data. * - ``custom_datasource`` - You have or want a PySpark DataSourceV2 implementation and want LHP to register and invoke it. For the full sub-type reference see :doc:`actions/load_actions`; for a walk-through of ``cloudfiles`` see :doc:`ingest_with_autoloader`. Choosing a write target ----------------------- Pick the Write sub-type by the kind of object you want to produce: .. list-table:: :header-rows: 1 :widths: 25 75 * - Sub-type - Use when… * - ``streaming_table`` - Persisting incrementally arriving data. Bronze and Silver layers, CDC, fan-in from multiple sources. Supports ``standard``, ``cdc``, and ``snapshot_cdc`` modes. * - ``materialized_view`` - Batch-computed analytics — Gold dashboards and aggregations that refresh on a schedule. * - ``sink`` - Pushing data **out** of the lakehouse — Kafka, Delta to an external catalog, Azure Event Hubs, or a custom REST endpoint. For the full sub-type reference see :doc:`actions/write_actions`. Choosing a write mode (streaming_table) --------------------------------------- Within ``streaming_table``, the ``mode`` field decides how rows are applied to the target: .. list-table:: :header-rows: 1 :widths: 20 40 40 * - Mode - When to use - Notes * - ``standard`` *(default)* - Append-only or fan-in workloads where each row is new. - Multiple write actions targeting the same table fan in via ``@dp.append_flow``; only the first action sets ``create_table: true``. * - ``cdc`` - Source emits change events (insert / update / delete) with a sequence column. SCD Type 1 or Type 2 targets. - Requires ``cdc_config.keys`` and ``sequence_by``. Supports ``scd_type: 1`` or ``scd_type: 2`` with optional ``track_history_column_list``. Multi-CDC fan-in works when contributors agree on the shared CDC fields. * - ``snapshot_cdc`` - Source delivers full snapshots rather than change events; LHP diffs successive snapshots. - Backed by ``create_auto_cdc_from_snapshot_flow()``. Source is a Delta table or a Python function returning ``(df, version)``. Single-job vs multi-job orchestration ------------------------------------- By default, LHP generates one orchestration job that runs every Lakeflow pipeline in dependency order. Set ``job_name`` on a FlowGroup to split work across multiple jobs. .. list-table:: :header-rows: 1 :widths: 35 65 * - Pick… - When… * - **Single job** (omit ``job_name``) - All pipelines share one cadence; you want one place to monitor success; the dependency graph is small enough to run end-to-end on every trigger. * - **Multi-job** (set ``job_name`` on every FlowGroup) - Pipelines have different schedules (hourly POS, nightly ERP); source systems need isolation (a failing SAP feed must not block NCR ingestion); different teams own different jobs and want separate alerting and permissions. If you set ``job_name`` on any FlowGroup, you must set it on every FlowGroup — the validator rejects mixed configurations. LHP emits a master orchestration job that triggers the per-system jobs in dependency order. For the full reference see :doc:`bundle_config_reference`. See also -------- - :doc:`architecture` — definitions for FlowGroups, Actions, Presets, Templates, and Blueprints. - :doc:`presets_reference` — preset schema and deep-merge semantics. - :doc:`templates_reference` — Jinja2 parameter syntax and template composition. - :doc:`blueprints` — Blueprint and instance schema, parameter validation, and expansion.