Topic
Loading and Operations
ETL versus ELT, change data capture, incremental loading, watermarking, idempotency, and the operational disciplines that keep production warehouses correct.
24 entries
Techniques
11
Technique
Advanced dimensional modeling: bridge tables, inferred members, multi-timezone, and the awkward cases
How to model the dimensional cases the textbook example never quite covers: multivalued dimensions and bridge tables, inferred members for late-arriving dimensions, free-text comments, and facts that span multiple time zones.
Read →Technique
Building a data warehouse: a four-phase practitioner's playbook
How warehouse projects actually get built, organized as discovery, design, development, and deployment, with the Kimball-versus-Inmon design choice treated as a concrete decision rather than an academic debate.
Read →Technique
Change data capture: implementation strategies
How log-based, timestamp-based, and trigger-based change data capture actually work in production, including the initial snapshot handoff, schema evolution failure modes, and the operational disciplines that keep CDC pipelines correct.
Read →Technique
Data cleansing in the warehouse: where it belongs and what it does
Where data cleansing sits in a modern warehouse load: the staging-to-curated boundary, the rule categories that catch real defects, the test-at-the-transform-layer pattern, and the observability that catches the drift the rules miss.
Read →Technique
Data extraction models: full, incremental, log-based, query-based, file-based, API, and streaming
The seven data extraction patterns a warehouse encounters in practice, what each one assumes about the source, where each one fails, and how the modern connector stack (Fivetran, Airbyte, Estuary, Debezium, Kafka) decides between them.
Read →Technique
Data masking in the data warehouse
How static, dynamic, and on-the-fly data masking actually work in a cloud warehouse, including the mask-before-load versus mask-in-warehouse axis, column-level masking policies on Snowflake, BigQuery, and Databricks, and the trade-offs between tokenization, encryption, and hashing under GDPR, CCPA, and HIPAA.
Read →Technique
Data virtualization: federated query in modern warehouse stacks
How data virtualization works as a technique, what it shares with and how it differs from federated query and the logical data warehouse, where it fits in cloud warehouse stacks, and the failure modes that determine when virtualization holds up in production.
Read →Technique
Data warehouse metadata: catalogs, lineage, and the metadata repository in 2026
How technical, business, and operational metadata get organized in a modern warehouse stack, including the shift from monolithic metadata repositories to federated data catalogs, dbt-driven lineage, and OpenLineage as the cross-tool standard.
Read →Technique
Data warehouse testing: validation, regression, and performance
What to test in a production warehouse pipeline, where each kind of test lives, and how dbt tests, Great Expectations, and contract patterns fit together without producing a green dashboard over wrong data.
Read →Technique
Slowly changing dimensions: implementation strategies
How SCD Type 1, 2, 3, and the hybrid types actually work in a production warehouse, including active row identification, fact loading under Type 2, and the edge cases that bite teams in practice.
Read →Technique
Surrogate key management: generation, lookup, and the cases that bite
How to generate and manage surrogate keys in a 2026 cloud warehouse: integer sequences, hash-based deterministic keys, UUID v7, the fact-loading lookup under Type 2 SCD, and the edge cases that produce silent errors.
Read →
Glossary
10
Glossary
Change data capture
A category of techniques for identifying and propagating changes from a source data system to a downstream consumer without copying the full source state on each load cycle.
Read →Glossary
Data catalog
A searchable index over the metadata of the data assets in an analytics platform: tables, columns, dashboards, models, owners, descriptions, and lineage, federated from the upstream tools that produce each piece.
Read →Glossary
Data contract
A declarative specification of the schema, semantics, and operational guarantees a data producer promises to consumers, used to enforce expectations at the source boundary rather than discover violations downstream.
Read →Glossary
Data lineage
The recorded graph of how a data value flows from source to destination across the pipeline: which sources fed which models fed which dashboards, at table or column granularity, derived from build artifacts and runtime events rather than maintained by hand.
Read →Glossary
Data quality
The degree to which data is fit for its intended use, framed across five dimensions in the practitioner literature: validity, completeness, consistency, uniqueness, and conformity.
Read →Glossary
Idempotency
The property that an operation produces the same result whether it runs once or many times against the same input. Central to safe recovery in warehouse load pipelines, REST APIs, and distributed stream processing.
Read →Glossary
Inferred member
A placeholder dimension row inserted at fact-load time when the fact references a business key that does not yet exist in the dimension, with a flag that flips when the real attributes arrive.
Read →Glossary
Referential integrity
The property that every foreign key value in a child table actually exists in the parent table it references. In a data warehouse, the question of where to enforce this property, in the database engine, in the transformation layer, or not at all, is a long-running design debate that the move to cloud platforms has decisively reshaped.
Read →Glossary
Slowly changing dimension
A dimension whose attribute values change over time at a rate slower than fact table growth, requiring explicit strategies to preserve or overwrite history.
Read →Glossary
Watermark
A stored value marking progress through a stream of data. In warehouse loading, the boundary between data already ingested and data still pending; in stream processing systems, the event-time marker indicating which records are complete enough to act on.
Read →
