Data cleansing in the warehouse: where it belongs

Most warehouse data quality failures are not surprises. They are predictable defects that the load pipeline either checked for and rejected, checked for and let through with a flag, or did not check for at all. The warehouse loading and operations pillar introduces the load lifecycle at overview depth. This article covers where cleansing sits within that lifecycle in a modern stack: the staging-to-curated boundary, the rule categories that catch the defects worth catching, the pattern of expressing those rules as tests at the transform layer, and the observability that catches what the rules miss.

TL;DR. Data cleansing belongs at the staging-to-curated boundary, not at ingest and not as a post-hoc cleanup. Express rules as tests in the transformation layer (dbt tests, Great Expectations, soda-core). Pair the rule layer with observability to catch the silent drift that rule-based checks cannot anticipate.

What data cleansing actually is

Data cleansing is the process of detecting and resolving defects in source data so that downstream analytical models can rely on the data conforming to a known shape. Defects come in five recurring categories:

Validity: values that violate the column's domain. A negative price, a future birthdate, a postal code that does not match any known format.
Completeness: missing values where the analytical model requires presence. A NULL customer_id on an order row.
Consistency: values that conflict between fields that should agree. A city of Chicago paired with a postal code in the 77xxx range.
Uniqueness: duplicate records for what should be a single business entity. The same customer entered twice with different surrogate keys, often with cosmetic variation in name or address.
Conformity: the same underlying value expressed differently across sources. United States, US, USA, and U.S.A. all referring to one country; M/F versus male/female for gender.

These are the categories Wang and Strong codified in the 1996 Journal of Management Information Systems paper that still anchors the academic framing, and the categories the DAMA-DMBOK uses for data quality dimensions. The names have settled across vendor documentation; the work is identifying which categories matter for which columns and where in the pipeline each rule should run.

The category that gets the most attention in vendor marketing is duplicate detection, because it is visible and demonstrable. The category that produces the most analytical errors in practice is conformity, because it is silent: two systems naming the same country differently produce reports that disagree without any system flagging an error.

Where cleansing belongs in the modern stack

The pattern that has settled across cloud-warehouse stacks is a three-layer separation: raw, staging, and curated.

The raw layer holds source data as ingested, with minimal transformation. The shape and content match the source system. Failed-quality records are not rejected here; rejecting at ingest discards evidence the team will need when diagnosing the upstream defect.

The staging layer applies the conformity and validity rules that bring data into the warehouse's expected shape: type casting, string normalization, country code mapping, sentinel-to-NULL conversion. Records that fail validation are flagged but not deleted. The flag indicates which records will be excluded from downstream models and why.

The curated layer holds the cleansed, modeled data that analytical workloads query. Joins to dimensions resolve cleanly here because the conformity rules have already harmonized identifiers. Aggregations are correct because the validity rules have already excluded the records that would produce nonsense totals.

The reason for this layering is operational, not aesthetic. Reprocessing the warehouse after a rule change requires the raw layer to be intact. Diagnosing a downstream defect requires tracing back through the staging layer to see what was flagged. Conflating raw and staging into a single ingest step destroys both capabilities.

The staging layer is also where data contracts get enforced when contracts are part of the architecture. A contract names the columns, types, and validity constraints the producing system agrees to deliver. The staging layer's tests are the contract's runtime check. A contract violation surfaces as a test failure at staging, not as a wrong number in a dashboard three weeks later.

Categories of cleansing rules

Cleansing rules fall into five operational categories, mapped roughly to the defect categories above. The categorization matters because the rules in different categories run against different parts of the pipeline and produce different operational responses to failure.

Type and format rules. Cast strings to dates, numbers to decimals, booleans to a normalized representation. Reject values that cannot be cast, or coerce them to NULL with a flag. Most modern ingestion tools (Fivetran, Airbyte, custom Python) apply some of these automatically; the warehouse-side staging logic backs them up because automatic casting fails silently more often than the documentation suggests.

Domain validity rules. Check that values fall within the column's allowed domain. Numeric ranges, enumerated values, regex patterns for structured strings (postal codes, phone numbers, email addresses). The rules either reject the row to a quarantine table or admit it with a quality flag. The choice depends on downstream tolerance: a financial fact with an invalid amount must be quarantined; a customer record with a malformed phone number can be admitted with the phone marked unverified.

Referential rules. Check that a foreign-key value exists in the referenced table at the time the rule runs. A fact row referencing a missing dimension key is the canonical case. The standard handlings are the inferred-member pattern from Kimball (admit the fact with a placeholder dimension row that gets backfilled when the real dimension arrives) or quarantine to a side table for review. Cloud warehouses generally do not enforce foreign keys at the storage layer, so referential rules are an explicit pipeline concern rather than a free property of the schema.

Conformity rules. Map source-specific representations to a canonical form. Country codes to ISO 3166, currencies to ISO 4217, units of measure to a single base unit, gender or marital status enumerations to the warehouse's canonical set. The rules live in lookup tables maintained alongside the staging models. Lookup-table coverage is where most conformity defects originate; new source values that nobody has mapped pass through and silently corrupt the joined result.

Deduplication and entity resolution rules. Identify multiple records that refer to the same business entity. The full discipline is its own field: probabilistic matching, blocking strategies, supervised learning over training pairs. For most warehouse contexts, the simpler patterns suffice: exact match on a normalized business key (case-folded, whitespace-trimmed, with sentinel-NULL replacement) for the obvious cases, and a survivorship rule that chooses which source's attribute wins when records merge.

Survivorship is worth naming explicitly because teams routinely treat it as an implementation detail and pay for that downstream. A merge that picks attributes arbitrarily produces a customer record where the address comes from one source and the phone from another, with no record of which source contributed which field. A survivorship rule that is explicit about precedence (most recent update timestamp, source system priority, longest non-NULL string) produces a record that is reproducible and auditable. The cost is that the rule must be defined; the alternative is silent inconsistency.

Design-time AI.

Deterministic runtime.

AI helps you build. Production runs deterministic SQL on your warehouse. No LLM calls at runtime.

See a demo

Expressing rules as tests at the transform layer

The traditional pattern, the one the legacy literature describes, treats cleansing as a stage in the ETL pipeline: extract from source, transform with cleansing logic applied imperatively, load the cleansed result. The pattern that has displaced it in cloud-warehouse stacks expresses cleansing rules as declarative tests against the staging models, run by the transformation framework as part of the build.

The three tools that anchor this pattern across the cluster:

Tool	Test model	Where it runs
dbt tests	Declarative tests in YAML alongside model definitions; generic tests (`unique`, `not_null`, `accepted_values`, `relationships`) plus custom SQL tests	After each model materializes, as part of `dbt build` or `dbt test`
Great Expectations	Expectation suites attached to data assets; pandas, Spark, or SQL backends; documented data docs site	Standalone or invoked from an orchestrator (Airflow, Dagster) against the staging tables
soda-core	YAML-defined checks (volume, freshness, schema, custom SQL) with CLI and library entry points	Standalone or as a CI/CD step against the warehouse

The reason these tools have displaced imperative cleansing logic is that tests are declarative artifacts the team can read, version-control, and reason about as a unit. A failed test produces a named failure tied to a specific column and rule. The imperative alternative is buried in transformation code where the rule is implicit and the failure mode is a wrong number rather than a named alert.

The pattern also matches the analytics engineering workflow. The same dbt project that builds the dimensional models declares the tests that validate them. The same Git history that tracks model changes tracks rule changes. The same CI pipeline that gates a model change to production gates a rule change. None of this is possible when cleansing lives in a separate ETL tool whose state is not version-controlled alongside the analytical code.

The trade-off is that test-based cleansing is fundamentally rule-driven. The rules catch what they were written to catch. Defects that fall outside the rule set pass through silently. This is the gap that observability tooling closes.

Observability for the defects rules cannot anticipate

Rule-based checks fail on the defects nobody thought to write a rule for. A column whose value distribution suddenly shifts. A daily row count that drops to half its normal volume. A categorical value that newly appears and is silently mapped to "Other" by the conformity rule's default branch. The signal is the change, and rule-based tests do not see changes; they see violations.

Data observability tools (Monte Carlo, Anomalo, Bigeye, Metaplane, Soda Cloud's observability tier) close this gap by establishing statistical baselines for each tracked table and alerting on anomalies. The baseline tracks volume, freshness, schema, and a column-level distribution profile (cardinality, null rate, numeric percentiles, string length). An incoming load that lands within the baseline passes silently. A load that violates the baseline produces an alert that points at the specific table, column, and metric that deviated.

The relationship between rule-based tests and observability is complementary rather than competing. Tests express the rules the team knows it needs. Observability surfaces the patterns the team did not know to encode. A mature data quality posture runs both: tests as the gating layer at staging, observability as the monitoring layer across the curated tables that downstream consumers query.

The cost dimension is real. Observability tools run profiling queries against the warehouse on a schedule, and the queries are not free on credit-priced platforms (Snowflake, BigQuery, Databricks SQL). Configuring the profiling depth to match how much each table is queried (deep profiling on fact and dimension tables, schema-and-freshness-only on staging tables) keeps the bill bounded. The default settings on most tools profile every table at high frequency, which is wasteful when most tables are intermediate and do not warrant the same scrutiny as the curated outputs.

Single-source versus multi-source defects

The defect distribution looks different when data comes from one source versus many. The distinction matters because the rule categories that dominate each case are not the same.

Single-source defects are dominated by validity and completeness failures. A source system with permissive schema design (free-text fields, no enumerated constraints, optional foreign keys) produces records whose individual fields fail validity in predictable ways. The cleansing rules at staging catch most of them: type casts, regex pattern checks, NULL-rate thresholds, range checks on numeric columns. The work is unglamorous and reasonably automated.

Multi-source defects are dominated by conformity, uniqueness, and survivorship problems. The same business entity exists in multiple sources, each with its own identifier, its own representation of attributes, its own update cadence. The rules that handle these are more involved: a conformity layer that maps each source's representation to a canonical form, a matching layer that identifies which records across sources refer to the same entity, a survivorship layer that decides which source's value wins when they disagree. These are the rules that benefit most from being expressed explicitly rather than buried in transformation logic, because the decisions are policy-bearing and need to be reviewable.

A worked example. Two source systems carry customer records. Source A has customer_id, name, street_address, city_state_zip. Source B has cust_no, last_name, first_name, address_line_1, city, state, postal_code. The conformity layer parses A's name into first and last name, parses A's city_state_zip into separate components, normalizes both sources' address representations to a canonical structure. The matching layer compares normalized records on a blocking key (postal code prefix, say) and uses string similarity on the rest of the attributes to identify probable matches. The survivorship layer applies an explicit precedence: most recent update wins for mutable fields, longest non-NULL string wins for free-text fields, source A wins by default for conflict between sources of equal recency. The output is a canonical customer table with one row per business entity, plus a mapping table recording which source records contributed to which canonical row.

This is one worked example of a class of problem. The full discipline is master data management, which is its own field and the right scope for a dedicated platform when the entity-resolution problem grows past what the warehouse layer can carry. For most analytical warehouses, the simpler patterns documented here cover the defect distribution without requiring an MDM platform.

Operational disciplines

The disciplines that distinguish a cleansing layer that holds up in production from one that produces alert fatigue and silent drift:

Quarantine, do not delete. Records that fail validation go to a quarantine table, not to the trash. The team needs to be able to see what was rejected and why, both to diagnose upstream defects and to recover records that were rejected on a since-corrected rule. Deleting rejections destroys both capabilities.

Track rule provenance. Each cleansing rule has a name, a definition stored in version control, and a record of which release introduced it. A failure alert references the rule by name. A rule change goes through code review like any other code change. Rules that exist only in a runtime configuration file are rules nobody is responsible for.

Make rule failures alertable, not aggregable. A rule that fails on 0.3% of rows every cycle is a rule whose threshold needs adjusting or whose alert is being ignored. The team that sees "147 rule failures today" without breaking that number down by rule is the team that is no longer reading the alerts. Aggregating failures across rules into a single dashboard number is worse than not tracking them, because it produces the appearance of monitoring without the substance.

Budget the observability profiling cost. Profile fact and dimension tables daily; profile staging tables at schema-and-freshness depth only; do not profile intermediate models that nobody queries directly. The default settings on most observability platforms profile everything, and the resulting credit consumption is one of the recurring objections leadership raises when reviewing the platform cost.

Plan for source schema drift before it happens. Source systems change schemas without coordinating with the warehouse team. The staging models that depend on those schemas break when they do. Tests that detect schema drift early (a new column that nobody has mapped, a removed column that the staging model still references) produce a clean failure at staging rather than a propagated wrong number downstream. The mechanism is unglamorous: schema-drift checks running as part of every load cycle, with explicit handling for the additions and removals the team chooses to absorb versus the ones that require a model update.

The change data capture technique article covers the load mechanism that delivers source changes to the staging layer where cleansing applies. The slowly changing dimensions article covers the change-handling that interacts with cleansing at the dimensional layer, including the hash-based change detection that depends on attribute normalization. The warehouse loading and operations pillar covers the broader operational context, including idempotency and watermarks, which constrain how cleansing rules interact with re-runs and partial recovery.

For the definitional version of the term, see the data quality glossary entry.

Closing

Data cleansing is not a phase that happens once and concludes. It is a continuous discipline expressed as rules tested at the staging-to-curated boundary, paired with observability that catches the patterns the rules do not anticipate. The rules belong in version control alongside the models they validate. The observability layer belongs across the curated tables that downstream consumers depend on. The failure mode the discipline exists to prevent is the silent one: a defect that nobody flagged, that nobody noticed, that surfaced as a wrong number in a report that informed a decision.

Reference

The five-category framing of data quality dimensions traces to:

Richard Y. Wang and Diane M. Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers," Journal of Management Information Systems, Vol. 12, No. 4, 1996.
DAMA International, DAMA-DMBOK: Data Management Body of Knowledge, 2nd ed., Technics Publications, 2017. Chapter 13 covers data quality dimensions and management.
Ralph Kimball and Margy Ross, The Data Warehouse Toolkit, 3rd ed., Wiley, 2013. The inferred-member and quarantine patterns at the dimensional layer.

For the modern test-driven cleansing pattern, the dbt project documentation on tests and the Great Expectations documentation on expectation suites are the canonical references; both projects publish maintained guides that supersede earlier vendor whitepapers on the topic.