Skip to article
Data Warehouse Info

A practitioner's reference for analytical data warehousing.

Reference Articles · Technique Deep-Dives · Courses · Glossary

Decision


Referential integrity in a data warehouse

Referential integrity in a data warehouse is a decision, not a default. A framework for choosing between database-enforced foreign keys, informational constraints, ELT-layer assertions, and unenforced declarations on Snowflake, BigQuery, Redshift, Databricks, and lakehouse table formats.

By Farhan Ahmed Khan


Referential integrity in a data warehouse is the property that every foreign key value in a fact or child table actually exists in the dimension or parent table it references. Achieving it is uncontested. Enforcing it inside the warehouse with database-level foreign key constraints is one of the longest-running debates in warehouse design, because every cloud warehouse engine the reader is likely operating today, Snowflake, BigQuery, Redshift, Databricks, accepts foreign key declarations in DDL but does not enforce them at write time. That changes the decision from "should the database check this?" to "where in the pipeline should the check live, and what form should it take?" The warehouse loading and operations pillar covers loading patterns broadly; this page is the structured decision framework for the referential integrity choice specifically.

TL;DR. Don't ask "should we enforce referential integrity in the warehouse?" Ask "where does the check live, and what does it cost when it fails?" Default to informational foreign key constraints in DDL (declared but unenforced) for optimizer hints and lineage clarity, plus transformation-layer assertions (dbt tests, Great Expectations, or equivalent) running after each load to catch real violations. Reserve engine-enforced constraints for the small number of warehouses still running on a strict-RDBMS engine where the workload tolerates write-time overhead. Push violations to a quarantine pattern (the unknown-member surrogate key) rather than failing the load, because the wrong decision in this debate is the one that lets a single late-arriving record halt the entire morning's reporting.

The decision space

The legacy debate framed the choice as binary: either the database enforces every foreign key at write time, or referential integrity lives entirely in the loading pipeline. That framing made sense when warehouses ran on the same RDBMS engines as operational systems (Oracle, SQL Server, DB2, Teradata) and the enforcement question was about turning a feature on or off. It is the wrong framing for a 2026 stack.

Modern warehouses parse foreign key declarations but do not enforce them by default. Snowflake supports FOREIGN KEY syntax in CREATE TABLE and stores the metadata, but the constraint is informational only; the warehouse will accept a fact row that points at a non-existent dimension key without complaint. BigQuery added foreign key and primary key support in 2023, also informational, also not enforced at write time. Redshift's foreign key constraints are informational by default and not enforced unless explicitly required; the optimizer uses them for join planning. Databricks accepts FOREIGN KEY declarations on Unity Catalog tables, again informational. The lakehouse formats (Iceberg, Delta) carry the same pattern at the table-format layer: constraints can be declared in catalog metadata but enforcement depends on the query engine and is generally not done at write time. The cloud era didn't remove the constraint syntax. It changed what the syntax does.

The actual options on the table are four:

  1. Engine-enforced foreign keys. The database checks every insert and rejects rows that violate the constraint. Available by default only on traditional RDBMS warehouses (some on-prem SQL Server / Oracle / Teradata installations) and by explicit configuration on a few cloud engines.
  2. Informational foreign keys. Constraints declared in DDL and stored in the catalog, but not enforced at write time. The optimizer uses them for join elimination, rewrite, and cost estimation. The default on Snowflake, BigQuery, Redshift, Databricks, and lakehouse formats.
  3. Transformation-layer assertions. Tests that run after each load, querying the warehouse to find rows in a child table whose foreign key does not match a parent table row. Implemented as dbt tests (relationships), SQLMesh audits, Great Expectations expectations, or hand-written assertion queries. Violations surface as test failures the team can route to alerts or to a quarantine flow.
  4. Unenforced. No declaration, no test, no assertion. The pipeline trusts the source systems and the transformation logic to produce consistent foreign keys. The historical "we'll catch it in ETL" position, often with the ETL never quite catching it.

The decision is which combination of these four to adopt, on which tables, with what failure handling. The framing that the legacy article got right is that this is a decision, not a default. The framing it got wrong is that the decision is between engine-enforced and unenforced. The realistic choice is between informational plus transformation-layer assertions (the modern default) versus engine-enforced (specialized, narrowing) versus unenforced (almost always wrong, but occasionally defensible).

The framework

Six criteria, applied in order. The first one that disqualifies an option removes it from consideration; the surviving option that scores highest on the remaining criteria is the recommendation.

Criterion 1: What does the engine actually support?

The first filter is mechanical. Engine-enforced foreign keys are only available where the engine enforces them.

Traditional RDBMS warehouses (SQL Server, Oracle, DB2, PostgreSQL-based, on-prem Teradata) enforce foreign keys at write time by default and can be configured to defer enforcement to commit time. Engine enforcement is on the table.

Snowflake, BigQuery, Redshift, Databricks accept the DDL syntax and store the metadata as informational constraints only. The RELY clause on Redshift and Databricks tells the optimizer to trust the constraint for query rewrite, but the engine does not validate it at write time. Engine enforcement is not on the table. The Snowflake documentation is explicit: "FOREIGN KEY constraints are not enforced." BigQuery's documentation says the same: "BigQuery does not enforce these constraints." This is not a configuration choice the team can change; it is how the engine works.

Lakehouse table formats (Iceberg, Delta, Hudi) follow the warehouse engines' pattern: constraints in metadata, no enforcement. Engine enforcement is not on the table.

The practical result: for the majority of teams operating on a cloud warehouse or lakehouse, criterion 1 immediately removes engine-enforced foreign keys from the option set. The remaining choice is between informational + transformation-layer assertions and unenforced.

Criterion 2: What is the cost of a single violation reaching production?

The next filter is about consequences. What happens when a fact row with a foreign key that doesn't resolve makes it into a published mart?

For most analytical workloads, a single orphan fact produces a small, recoverable error: a row missing from a GROUP BY total, a slightly inaccurate cohort count, an unknown-member showing up in a drill-down. The cost is real but bounded, and the fix is upstream once the cause is identified.

For workloads where one wrong number is genuinely catastrophic (regulatory reporting, financial close, audit trails), the cost of a single violation reaching the published layer is higher. These workloads justify more aggressive enforcement, including in the small set of cases where engine-enforced foreign keys remain available. The cost of an engine rejecting a load is a paged engineer at 2am; the cost of an unnoticed orphan in a regulator-filed report is a different category of problem.

Criterion 3: How much write-time overhead can the workload tolerate?

The historical argument against engine-enforced foreign keys was the write-time cost. Every insert checks every foreign key column against the parent table's primary key index. For a fact table loading millions of rows per cycle, the cost is real. Snowflake's and BigQuery's decision to make foreign keys informational was substantially about avoiding this overhead at scale.

On a high-volume incremental load, write-time foreign key enforcement adds latency proportional to the number of foreign keys per row and the cardinality of the parent tables. For a fact table with eight dimensions and 50 million rows per load, that overhead is non-trivial. For a dimension table with 50,000 rows per load and three foreign keys, it is negligible.

This criterion typically doesn't decide the answer (criterion 1 has usually already eliminated engine enforcement), but it is the reason cloud warehouse vendors chose the informational pattern in the first place, and it is the reason teams running engine enforcement on volume fact tables often regret it. If criterion 1 left engine enforcement on the table and the workload is high-volume, criterion 3 starts arguing against it.

Criterion 4: How will violations be handled?

The question is not "do we want violations to be impossible?" The question is "what happens when one occurs?" There are three meaningful options.

Fail the load. Engine-enforced foreign keys do this by default: the offending row is rejected, the transaction aborts, and the load fails. On a streaming pipeline, this can take down ingestion. On a batched nightly load, it leaves the warehouse in whatever partial state the load reached before the failure. Teams that adopt this posture must have alerting and on-call rotation prepared for it.

Quarantine the row. The standard pattern is the unknown-member surrogate key: the dimension carries a well-known row (typically with surrogate key zero or negative one) representing "this foreign key didn't resolve at load time." The fact row is loaded with the unknown-member surrogate, the violation is logged, and downstream reports continue to function. When the actual dimension record arrives later, the fact's foreign key can be updated in place (or, more commonly, the warehouse is configured to repair on next load by rejoining the late-arriving dimension). This is the standard handling for late-arriving dimensions covered in the warehouse loading and operations pillar and is the right default for most fact-loading workloads.

Log and continue. The fact row is loaded with whatever foreign key value it has, including the invalid one. A monitoring query identifies orphans periodically. This is the weakest option but is sometimes correct for tables where the foreign key is informational rather than load-bearing for downstream queries.

The decision here is more important than the choice between options 1, 2, and 3 in the previous section, because it determines the operational cost of the strategy. A team that chooses engine enforcement without an operational plan for the failures will discover that the failure mode is worse than the violation it was preventing.

Criterion 5: Where does the team already operate?

The teams running modern stacks already have transformation-layer tooling: dbt, SQLMesh, Coalesce, or hand-rolled SQL transformation frameworks. All of them have testing patterns. dbt's relationships test is the canonical example: declared in the model's YAML, it runs as a SQL query after each model build, verifying that every value in the foreign key column exists in the referenced parent table. SQLMesh's audits and Great Expectations' expect_column_values_to_be_in_set cover the same ground.

The transformation-layer pattern is where the team already operates. It scales with the warehouse, integrates with the existing CI/CD pipeline, surfaces violations as test failures the team already knows how to triage, and lets violations be routed to severity tiers (block deployment on a violation in a critical mart; warn on a violation in a sandbox). For teams running cloud warehouses, this is structurally the right place for referential integrity checks.

For teams running on a traditional RDBMS where engine enforcement remains available, the question is whether they're also running a transformation framework with testing. If yes, transformation-layer assertions are still the better fit because they integrate with the team's existing observability. If no, engine enforcement may be the simpler answer for that team, with the understanding that the rest of the pipeline observability has gaps.

Criterion 6: What about the optimizer?

Informational foreign keys, declared but unenforced, give the query optimizer information it can use for join elimination, rewrite, and cardinality estimation. Snowflake, BigQuery, Redshift, and Databricks all use informational primary key and foreign key constraints to varying degrees, with Redshift and Databricks using a RELY clause to tell the optimizer to trust the constraint for query rewrite. Declaring the constraints is essentially free (no write-time cost) and can produce measurable query plan improvements on joins that would otherwise generate unnecessary work.

This criterion doesn't decide between options; it argues that even teams whose primary integrity check lives in the transformation layer should still declare the constraints in DDL as informational. The catalog metadata is useful for the optimizer, for documentation, for lineage tools that read schema metadata, and for downstream consumers (Power BI, Looker, Tableau) that read constraint information to auto-generate join paths.

Worked examples

Scenario 1: Cloud-warehouse retail analytics

A retail analytics team runs on Snowflake. Their warehouse has 40 fact tables and 200 dimensions, with daily incremental loads totaling 80 million rows. The team uses dbt for transformation, runs tests in CI, and has Datadog alerting on test failures. The CFO's quarterly board reporting pulls from a curated mart.

Criterion 1 removes engine enforcement immediately: Snowflake doesn't enforce foreign keys. Criterion 2 raises the stakes on the board-reporting mart; that mart's referential integrity matters more than the rest. Criterion 4 points to the unknown-member pattern: a failed load at 4am due to a single source-system glitch would be worse than the violation. Criterion 5 confirms dbt as the right enforcement layer. Criterion 6 argues for declaring the constraints in DDL even though they're unenforced.

The implementation: declare foreign keys in DDL as informational constraints across the model. Run dbt relationships tests after each load. For most tests, severity is warn; the test runs, violations are logged, and the load continues. For the board-reporting mart's fact tables, severity is error; a violation blocks the mart from refreshing until resolved, while the rest of the warehouse continues to update. Unknown-member rows are pre-seeded in each dimension at surrogate key zero, and the fact-load transformation joins to the dimension with COALESCE(dim.surrogate_key, 0) so unmatched foreign keys default to the unknown member.

Scenario 2: Regulated financial reporting on a lakehouse

A financial services firm runs on Databricks with Delta Lake. Some of their pipelines feed regulatory reports that the firm files with a regulator monthly; one wrong number triggers a refile process that costs several days of senior staff time.

Criterion 1 removes engine enforcement: Databricks does not enforce foreign keys at write time on Delta tables, regardless of declarations. Criterion 2 is the dominant factor here: the cost of a single violation reaching the regulatory report is high enough to justify substantial defensive engineering. Criterion 4 argues against the unknown-member pattern for this specific workload, because an unknown-member row in a regulatory aggregation is wrong in a way that needs to be visible, not gracefully degraded. Criterion 5 points to a layered assertion pattern.

The implementation: informational constraints in DDL via Unity Catalog. Three layers of assertion. First, dbt relationships tests run after each model build, with error severity on the regulatory-feeding models. A violation halts the regulatory pipeline (but not the broader warehouse). Second, a separate pre-filing validation job runs Great Expectations expectations against the staged regulatory output before it is submitted; any orphan key fails the job and pages the on-call. Third, a reconciliation query runs against the regulatory output and the source systems' aggregates, flagging any discrepancy above a tight threshold. The pattern accepts higher engineering cost in exchange for catching the violations that matter before they reach a filed document.

Scenario 3: Legacy on-prem warehouse migrating to cloud

A team is running an existing SQL Server warehouse with engine-enforced foreign keys on every dimension-to-fact relationship. They're migrating to BigQuery. The historical pattern has worked, the team is fluent in it, and the question is what to do on the new platform.

Criterion 1 removes engine enforcement on the new platform: BigQuery doesn't enforce foreign keys. Criterion 5 says the team doesn't yet operate transformation-layer assertions; they're adopting dbt as part of the migration. Criterion 4 is open; the team has historically failed loads on violations because the engine forced them to.

The implementation: during migration, declare foreign keys in DDL on BigQuery as informational constraints (for documentation and optimizer use). Build out dbt relationships tests in parallel as the team migrates each subject area, with severity calibrated to the downstream consumer's tolerance. Operationally, the team is moving from "load fails if a violation occurs" to "load succeeds, test fails, team triages." The shift in mental model is the larger change than the technical implementation. Plan for it explicitly: a runbook for what triage looks like, who owns it, and what the escalation path is when a test failure blocks a mart refresh. Teams that migrate the constraints without migrating the operational pattern around them produce silent gaps.

Advertisement
300 × 250

Common mistakes

Treating informational constraints as enforcement. A declared but unenforced foreign key is a documentation and optimizer artifact, not a runtime check. A team that declares the constraints and assumes the warehouse is enforcing them is operating without integrity checks they think they have. Verify the engine's actual behavior; do not assume DDL syntax implies enforcement.

Failing the load on every violation. Engine enforcement defaults to this, and teams sometimes replicate the pattern in transformation tests by setting every relationships test to error severity. The result is that a single anomalous source record halts overnight refresh and the team comes in to a warehouse that didn't update. Triage time on a stale warehouse usually exceeds the cost of the violations the failure was preventing. Reserve error severity for the small number of pipelines where a violation reaching the consumer is genuinely catastrophic; default the rest to warn or to the unknown-member quarantine pattern.

Skipping declarations because they're not enforced. Teams sometimes conclude that since the engine doesn't enforce foreign keys, declaring them is pointless. This misses the optimizer benefit, the catalog metadata benefit, and the downstream-consumer benefit. Declarations are essentially free; declare them.

Running tests only at deployment time. A dbt relationships test that runs in CI but not after each scheduled load catches violations introduced by code changes but misses violations introduced by data drift. Production loads need the same tests running on the same cadence as the loads themselves. The tooling makes this straightforward; the discipline is to wire it up.

No quarantine path for unmatched foreign keys. Teams that adopt the assertion pattern without an unknown-member surrogate key in each dimension end up either failing the load (criterion 4 mistake above) or silently dropping the offending fact rows in transformation logic. Both outcomes are worse than the unknown-member pattern, which is well-understood, recoverable, and visible in downstream reports.

Confusing constraints with primary keys. Primary key constraints have the same enforcement story on cloud warehouses as foreign keys: declared but not enforced. A team relying on the warehouse to reject duplicate inserts will be surprised. Duplicate detection is a separate transformation-layer assertion (dbt's unique test) and needs its own treatment alongside the referential integrity assertions.

Closing

The 2019-era debate about whether to enforce referential integrity in the warehouse has been quietly settled by the warehouse engines themselves: they parse the constraints and do not enforce them, and that is the right default for the workloads they were built for. The question that still requires a decision is where the integrity check lives, what it does when it fires, and how the team operates the resulting alerts. Informational constraints in DDL plus transformation-layer assertions with thoughtful severity is the answer that fits the majority of 2026 stacks. The framework above lets the small set of cases that genuinely deserve a different answer get one.

For the loading patterns that sit alongside this decision, see the warehouse loading and operations pillar. For the choice between transformation patterns, see the ETL vs ELT comparison. For the late-arriving dimension handling that the unknown-member pattern interacts with, see slowly changing dimensions.