Performance
===========

.. meta::
   :description: Reasoning behind LHP performance choices: streaming versus materialized view trade-offs, dependency resolution, and runtime tuning levers.

For the decision frameworks behind individual generator choices, see
:doc:`../decisions`. This page explains the performance properties LHP inherits
or constructs, and the trade-offs the framework intentionally exposes.

Two performance domains, not one
--------------------------------

LHP has two performance stories that get conflated. *Generation performance* is
how long ``lhp generate`` takes. *Runtime performance* is how long the Databricks
pipeline takes once deployed. They are governed by different mechanisms and tuned
with different levers.

Generation is bounded by file I/O and YAML parsing; a project with one thousand
:term:`FlowGroups` generates in seconds. Runtime is bounded by data volume,
shuffle costs, and the cost of joins; LHP does not control these directly, but
the choice of write target (:term:`streaming table` versus
:term:`materialized view`) and the use of ``cluster_columns`` versus
``partition_columns`` affect them significantly.

Streaming tables versus materialized views
------------------------------------------

The most consequential write-target decision in any LHP project is
``streaming_table`` versus ``materialized_view``. The wrong choice shows up at
the worst time: when a dimension changes and downstream metrics quietly stop
updating, or when a backfill takes ten times longer than expected.

The mental model that holds up: streaming tables are append-optimal and
incremental, materialized views are recompute-correct and may fully recompute on
source changes.

**Streaming tables** generate ``dp.create_streaming_table()`` plus
``@dp.append_flow()`` functions. They process new records as they arrive. They
do not recompute when old data or referenced dimensions change.
The right uses are bronze ingestion (each record is processed once),
change-data-capture targets (``mode: cdc`` with explicit ``cdc_config``), and
any append-only flow where you want lower latency and lower cost.

**Materialized views** generate ``@dp.materialized_view()``. They reprocess when
source data changes, including when joined dimensions change. They are the
correct choice for any logic that depends on multiple tables maintaining a
consistent view. For silver enrichment that joins a fact stream against a
slowly-changing dimension, the materialized view is the only target that gives
correct results without you writing your own change-tracking logic.

The trap that catches most teams is using streaming tables for joins. The
streaming table picks up new fact rows. It does not see dimension updates, so a
customer's enriched record retains whatever ``customer_name`` was current at the
moment the fact arrived. For analytics, this is usually wrong. The fix is the
materialized view.

CDC patterns and snapshot CDC
-----------------------------

LHP supports two CDC modes on streaming tables. The ``cdc`` mode generates
``dp.create_auto_cdc_flow()`` and consumes a change feed with ``keys``,
``sequence_by``, and ``scd_type`` (1 or 2). The ``snapshot_cdc`` mode generates
``dp.create_auto_cdc_from_snapshot_flow()`` and consumes full snapshots,
computing the differences itself.

The choice depends on what the source provides. If you have a change feed
(Debezium output, a Delta change-data feed, a Kafka topic with
insert/update/delete markers), use ``cdc``. If you have periodic full dumps with
no change indicators, use ``snapshot_cdc`` and let Databricks detect changes by
comparing snapshots.

The ``snapshot_cdc`` mode is more expensive at runtime (it has to diff
snapshots) but tolerates messier sources. The trade-off is worth it when the
source genuinely does not emit change events.

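
To make the cost of diffing snapshots concrete, here is a minimal plain-Python
sketch of how change events can be derived from two full snapshots. It is an
illustration only and does not claim to reflect how Databricks implements
snapshot comparison; the function name ``diff_snapshots`` and the sample rows
are hypothetical.

.. code-block:: python

   from typing import Any

   Row = dict[str, Any]


   def diff_snapshots(
       previous: list[Row], current: list[Row], key: str
   ) -> dict[str, list[Row]]:
       """Derive insert/update/delete events by comparing two full snapshots."""
       # Index both snapshots by primary key; this touches every row in each.
       prev_by_key = {row[key]: row for row in previous}
       curr_by_key = {row[key]: row for row in current}

       # Keys that appear only in the newer snapshot are inserts.
       inserts = [row for k, row in curr_by_key.items() if k not in prev_by_key]
       # Keys in both snapshots whose row content differs are updates.
       updates = [
           row
           for k, row in curr_by_key.items()
           if k in prev_by_key and prev_by_key[k] != row
       ]
       # Keys that appear only in the older snapshot are deletes.
       deletes = [row for k, row in prev_by_key.items() if k not in curr_by_key]
       return {"insert": inserts, "update": updates, "delete": deletes}


   # Hypothetical sample data: two consecutive full dumps of the same table.
   monday = [
       {"id": 1, "name": "Ada"},
       {"id": 2, "name": "Grace"},
       {"id": 3, "name": "Edsger"},
   ]
   tuesday = [
       {"id": 1, "name": "Ada"},
       {"id": 2, "name": "Grace H."},
       {"id": 4, "name": "Barbara"},
   ]

   changes = diff_snapshots(monday, tuesday, key="id")

In this sketch every row of both snapshots is indexed and compared, including
the unchanged one. A change feed avoids that work because the source has already
identified which rows changed.
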

Rate limiting and Auto Loader
-----------------------------

CloudFiles loads can flood downstream tables if the landing zone has a large
backlog. The Auto Loader options ``cloudFiles.maxFilesPerTrigger`` and
``cloudFiles.maxBytesPerTrigger`` cap per-microbatch ingestion, so bronze
ingestion runs at a sustainable rate and downstream pipelines keep up.

Set these in a ``brz_standard`` preset, not on individual FlowGroups: the values
are organisational policy, not per-FlowGroup tuning. Pairing them with
``cloudFiles.schemaEvolutionMode: rescue`` and
``cloudFiles.rescuedDataColumn: _rescued_data`` (also in the preset) makes the
bronze layer robust to schema drift: unknown fields land in ``_rescued_data``
rather than failing the load, and the rate limiter keeps the ingestion pace
within capacity.

Clustering and partitioning
---------------------------

LHP supports both ``cluster_columns`` (liquid clustering) and
``partition_columns`` on write targets. The modern recommendation is
``cluster_columns`` for almost every case. Liquid clustering is incremental,
allows the keys to be redefined without rewriting the table, and works well with
high-cardinality columns where partitioning would create too many small files.

Partition columns still have a place for very predictable date-partitioned
access patterns, but the default should be liquid clustering. The cost of
getting it wrong with clustering is lower (you change the keys and continue),
while the cost of getting it wrong with partitioning can mean rewriting the
table.

Dependency resolution as a non-decision
---------------------------------------

LHP's dependency resolver topologically sorts actions within a FlowGroup. The
Load comes first, then Transforms in dependency order, then Write, then Tests.
You can list actions in any order in the YAML; LHP works out the right one.

This is intentional.
Asking the YAML author to maintain action order manually creates a class of bug
(an action referencing a target defined later in the file) that the resolver
eliminates entirely. The recommendation to list actions in
Load/Transform/Write/Test order is for readability, not correctness.

The same principle applies across FlowGroups within a pipeline. ``lhp deps``
analyses cross-FlowGroup dependencies and produces a dependency graph at
pipeline level. The orchestration job that LHP generates (``--format job``)
respects this graph, so a downstream pipeline only runs after its upstream
completes.

Anti-patterns
-------------

**Streaming tables for join-based enrichment.** Stale dimension data is the
predictable outcome. Materialized views are correct here.

**Partition columns by reflex.** Liquid clustering is the modern default, and it
tolerates being wrong much better.

See also
--------

- :doc:`../decisions` for write-target and action-type decision frameworks.
- :doc:`../dependency_analysis` for how ``lhp deps`` works under the hood.
- :doc:`../actions/write_actions` for the full streaming-table and
  materialized-view configuration surface.