Performance

For the decision frameworks behind individual generator choices, see Decisions. This page explains the performance properties LHP inherits or constructs, and the trade-offs the framework intentionally exposes.

Two performance domains, not one

LHP has two performance stories that are often conflated. Generation performance is how long lhp generate takes. Runtime performance is how long the Databricks pipeline takes once deployed. They are governed by different mechanisms and tuned with different levers.

Generation is bounded by file I/O and YAML parsing; a project with one thousand FlowGroups generates in seconds. Runtime is bounded by data volume, shuffle costs, and the cost of joins; LHP does not control these directly, but the choice of write target (streaming table versus materialized view) and the use of cluster_columns versus partition_columns affect them significantly.

Streaming tables versus materialized views

The most consequential write-target decision in any LHP project is streaming_table versus materialized_view. The wrong choice shows up at the worst time — when a dimension changes and downstream metrics quietly stop updating, or when a backfill takes ten times longer than expected.

The mental model that holds up: streaming tables are append-optimal and incremental; materialized views are recompute-correct and may fully recompute when their sources change.

Streaming tables generate dp.create_streaming_table() plus @dp.append_flow() functions. They process new records as they arrive. They do not recompute when old data or referenced dimensions change. The right uses are bronze ingestion (each record is processed once), change-data-capture targets (mode: cdc with explicit cdc_config), and any append-only flow where you want lower latency and lower cost.

Materialized views generate @dp.materialized_view(). They reprocess when source data changes, including when joined dimensions change. They are the correct choice for any logic that needs a consistent view across multiple tables. For silver enrichment that joins a fact stream against a slowly-changing dimension, the materialized view is the only target that gives correct results without you writing your own change-tracking logic.

The trap that catches most teams is using streaming tables for joins. The streaming table picks up new fact rows. It does not see dimension updates, so a customer’s enriched record retains whatever customer_name was current at the moment the fact arrived. For analytics, this is usually wrong. The fix is the materialized view.
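
The stale-dimension trap can be reproduced without Spark. The sketch below is a pure-Python illustration of the two behaviours, not the code LHP generates; the customers and facts data, and the ingest and recompute_view helpers, are hypothetical names chosen for the example.

```python
# Pure-Python illustration; no Spark or Databricks APIs are used.

def enrich(fact, customers):
    """Join one fact row to the customer dimension as it exists right now."""
    return {**fact, "customer_name": customers[fact["customer_id"]]}

customers = {1: "Acme Ltd"}   # hypothetical dimension table
facts = []                    # hypothetical fact source
streaming_target = []         # append-only target: each row is enriched once

def ingest(fact):
    """Streaming-table behaviour: enrich only the new row, then append it."""
    facts.append(fact)
    streaming_target.append(enrich(fact, customers))

def recompute_view():
    """Materialized-view behaviour: re-derive every row from current sources."""
    return [enrich(f, customers) for f in facts]

ingest({"order_id": 100, "customer_id": 1})
customers[1] = "Acme Corporation"   # the dimension changes after the first fact
ingest({"order_id": 101, "customer_id": 1})

streaming_names = [r["customer_name"] for r in streaming_target]
view_names = [r["customer_name"] for r in recompute_view()]
print(streaming_names)  # ['Acme Ltd', 'Acme Corporation']
print(view_names)       # ['Acme Corporation', 'Acme Corporation']
```

The append-only target keeps the name that was current when order 100 arrived, while the recomputed result reflects the renamed customer on every row.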

CDC patterns and snapshot CDC

LHP supports two CDC modes on streaming tables. The cdc mode generates dp.create_auto_cdc_flow() and consumes a change feed with keys, sequence_by, and scd_type (1 or 2). The snapshot_cdc mode generates dp.create_auto_cdc_from_snapshot_flow() and consumes full snapshots, computing the differences itself.
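
To make the roles of keys and sequence_by concrete, here is a minimal pure-Python sketch of SCD type 1 semantics: events are ordered by the sequencing column and only the latest state per key survives. It illustrates the concept only and is not how Databricks implements the flow; the event shape, including the "op" field, is an assumption made for the example.

```python
def apply_scd1(events, key, sequence_by):
    """Keep only the latest state per key, applying events in sequence order."""
    current = {}
    for event in sorted(events, key=lambda e: e[sequence_by]):
        if event["op"] == "delete":
            current.pop(event[key], None)
        else:  # insert or update: overwrite the stored row for this key
            current[event[key]] = {
                k: v for k, v in event.items() if k not in ("op", sequence_by)
            }
    return current

# Hypothetical change feed, deliberately delivered out of order.
events = [
    {"op": "update", "id": 1, "name": "Acme Corporation", "seq": 3},
    {"op": "insert", "id": 1, "name": "Acme Ltd", "seq": 1},
    {"op": "insert", "id": 2, "name": "Globex", "seq": 2},
    {"op": "delete", "id": 2, "seq": 4},
]
scd1 = apply_scd1(events, key="id", sequence_by="seq")
print(scd1)  # {1: {'id': 1, 'name': 'Acme Corporation'}}
```

Because the events are sorted by the sequencing column before they are applied, the late-arriving insert does not overwrite the newer update. An SCD type 2 target would instead keep the superseded rows as history.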

The choice depends on what the source provides. If you have a change feed — Debezium output, a Delta change-data feed, a Kafka topic with insert/update/delete markers — use cdc. If you have periodic full dumps with no change indicators, use snapshot_cdc and let Databricks detect changes by comparing snapshots.

The snapshot_cdc mode is more expensive at runtime (it has to diff snapshots) but tolerates messier sources. The trade-off is worth it when the source genuinely does not emit change events.
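
The extra runtime cost comes from comparing every key in both snapshots. The following pure-Python sketch shows the idea of deriving changes from two full dumps; it is an illustration under simplifying assumptions (snapshots held as dictionaries keyed by primary key), not the Databricks implementation, and diff_snapshots is a hypothetical helper.

```python
def diff_snapshots(previous, current):
    """Compare two full snapshots keyed by primary key and derive the changes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserts, updates, deletes

# Hypothetical daily dumps with no change markers.
monday = {1: {"name": "Acme Ltd"}, 2: {"name": "Globex"}}
tuesday = {1: {"name": "Acme Corporation"}, 3: {"name": "Initech"}}

inserts, updates, deletes = diff_snapshots(monday, tuesday)
print(inserts)  # {3: {'name': 'Initech'}}
print(updates)  # {1: {'name': 'Acme Corporation'}}
print(deletes)  # {2: {'name': 'Globex'}}
```

Every row of both snapshots has to be examined even when only a few rows changed, which is why a real change feed is cheaper when the source can provide one.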

Rate limiting and Auto Loader

CloudFiles loads can flood downstream tables if the landing zone has a large backlog. The Auto Loader options cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger cap per-microbatch ingestion, so bronze ingestion runs at a sustainable rate and downstream pipelines keep up.

Set these in a brz_standard preset, not on individual FlowGroups — the values are organisational policy, not per-FlowGroup tuning. Pairing them with cloudFiles.schemaEvolutionMode: rescue and cloudFiles.rescuedDataColumn: _rescued_data (also in the preset) makes the bronze layer robust to schema drift: unknown fields land in _rescued_data rather than failing the load, and the rate limiter keeps the ingestion pace within capacity.
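
The effect of the two caps on a backlog can be sketched in plain Python. This is a simplified model, not Auto Loader's scheduler: plan_microbatches is a hypothetical helper, the file sizes are invented, and the byte cap is treated as a hard limit here, whereas the real byte limit is documented as a soft maximum.

```python
def plan_microbatches(file_sizes, max_files, max_bytes):
    """Split a backlog into microbatches, closing a batch when either cap is hit."""
    batches, batch, batch_bytes = [], [], 0
    for size in file_sizes:
        # Close the current batch if adding this file would exceed either cap.
        if batch and (len(batch) >= max_files or batch_bytes + size > max_bytes):
            batches.append(batch)
            batch, batch_bytes = [], 0
        batch.append(size)
        batch_bytes += size
    if batch:
        batches.append(batch)
    return batches

backlog = [40, 40, 40, 10, 10, 10, 10, 10]  # hypothetical file sizes in MB
batches = plan_microbatches(backlog, max_files=3, max_bytes=100)
print(batches)  # [[40, 40], [40, 10, 10], [10, 10, 10]]
```

The first batch closes on the byte cap and the later ones on the file-count cap, so the backlog drains over three triggers instead of arriving downstream at once.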

Clustering and partitioning

LHP supports both cluster_columns (liquid clustering) and partition_columns on write targets. The modern recommendation is cluster_columns for almost every case. Liquid clustering is incremental, allows the keys to be redefined without rewriting the table, and works well with high-cardinality columns where partitioning would create too many small files.

Partition columns still have a place for very predictable date-partitioned access patterns, but the default should be liquid clustering. The cost of getting it wrong with clustering is lower — you change the keys and continue — while the cost of getting it wrong with partitioning can mean rewriting the table.
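
The small-files concern with high-cardinality partition columns is easy to see with rough arithmetic. The numbers below are hypothetical and assume rows spread evenly across partition values, which real data rarely does; average_partition_mb is a helper invented for the example.

```python
def average_partition_mb(table_gb, distinct_values):
    """Average data per partition, assuming an even spread across values."""
    return table_gb * 1024 / distinct_values

# Hypothetical 500 GB table partitioned two different ways.
by_date = average_partition_mb(table_gb=500, distinct_values=365)
by_customer = average_partition_mb(table_gb=500, distinct_values=1_000_000)

print(f"by date: {by_date:.0f} MB per partition")
print(f"by customer_id: {by_customer:.3f} MB per partition")
```

A year of daily partitions leaves well over a gigabyte in each partition, while a million customer values leave about half a megabyte each, which is the small-files situation that liquid clustering avoids.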

Dependency resolution as a non-decision

LHP’s dependency resolver topologically sorts actions within a FlowGroup. The Load comes first, then Transforms in dependency order, then Write, then Tests. You can list actions in any order in the YAML; LHP works out the right one.

This is intentional. Asking the YAML author to maintain action order manually creates a class of bug (an action referencing a target defined later in the file) that the resolver eliminates entirely. The recommendation to list actions in Load/Transform/Write/Test order is for readability, not correctness.
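
Topological sorting itself is a standard technique, and the Python standard library shows how little the author's listing order matters. The sketch below is not LHP's resolver; the action names are hypothetical, and each action is mapped to the actions it reads from.

```python
from graphlib import TopologicalSorter

# Hypothetical actions, deliberately listed out of order.
# Each key maps to the set of actions it depends on.
actions = {
    "write_orders": {"clean_orders"},
    "clean_orders": {"load_orders"},
    "test_orders": {"write_orders"},
    "load_orders": set(),
}

# static_order() yields every action after all of its dependencies.
order = list(TopologicalSorter(actions).static_order())
print(order)  # ['load_orders', 'clean_orders', 'write_orders', 'test_orders']
```

The same sorter raises graphlib.CycleError when two actions depend on each other, which is the kind of authoring mistake a resolver can report instead of silently running actions in the wrong order.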

The same principle applies across FlowGroups within a pipeline. lhp deps analyses cross-FlowGroup dependencies and produces a dependency graph at pipeline level. The orchestration job that LHP generates (--format job) respects this graph, so a downstream pipeline only runs after its upstream completes.

Anti-patterns

Streaming tables for join-based enrichment. Stale dimension data is the predictable outcome. Materialized views are correct here.

Partition columns by reflex. Liquid clustering is the modern default, and it tolerates being wrong much better.