Slowly changing dimensions: implementation strategies

Most analytical questions are eventually questions about state over time. What did the customer's loyalty tier look like when they placed this order. Which sales region was this representative assigned to last quarter. What product category did this SKU belong to on the day it was discounted. The dimensional modeling pillar introduces slowly changing dimensions as the warehouse mechanism for answering those questions correctly. This article goes one level deeper into the mechanics: what each SCD type looks like in column form, how active row identification interacts with fact loading, and the edge cases that quietly produce wrong answers when the implementation is sloppy.

The framing throughout is attribute-level rather than dimension-level. A single dimension typically applies different SCD strategies to different attributes, and the implementation is cleaner when that variation is treated as the default rather than as an exception.

TL;DR. SCD type selection is an attribute-level design decision. Type 2 is the default for attributes where historical accuracy matters; Type 1 is the default for current-state attributes. Active-row identification by date range is the only mechanism that handles late-arriving facts correctly.

The shape of the problem

A dimension is a denormalized table of descriptive attributes joined to fact tables by a surrogate key. As long as those attributes never change, the warehouse has nothing to manage. The trouble starts when an attribute does change. Now there are three different things the warehouse might mean when it says "the customer's region," and the choice between them determines whether historical analysis is accurate, partially accurate, or quietly wrong.

The first meaning is the value that was true when the fact happened. A sale recorded on a date when the customer lived in the central region should, for region-based analysis, count toward the central region. The customer's later move to the eastern region is irrelevant to that historical fact.

The second meaning is the value that is true now. A customer-care dashboard listing customers and their current regions wants the current value regardless of how many regions the customer has lived in. Old values are noise.

The third meaning is the most recent prior value, which is occasionally useful as a transitional view. "What region did this customer move from when they moved" is sometimes a real analytical question. It is rarely the only question, and almost never the most important one.

SCD strategies are the encoded answers to which meaning the warehouse will produce when an attribute is queried. Type 1 produces the current value. Type 2 produces the value that was true at the time of the fact. Type 3 produces one prior value alongside the current one. The hybrid types combine these for cases where a single answer is not enough.

SCD types in detail

The SCD types in common use:

Type	Row behavior	History retained	Typical use
0	Original value, never updated	Original only	Anchor attributes: signup date, hire date
1	Overwrite in place	None	Data corrections; current-state attributes (email, display name)
2	New row per change, effective dates	Full	Analytical attributes where history matters (region, segment, tier)
3	Prior-value column on the same row	One transition	Limited transitional analysis ("region the customer moved from")
4	Current row in main table, history in companion	Full, separated	Very wide dimensions where history is queried rarely
6	Type 1 + Type 2 + Type 3 on the same attribute (1+2+3=6)	Full + current + transition	Attributes that need both as-of-event and as-of-now joins
7	Dual surrogate keys on the fact (durable + current)	Full	High-volume facts where both as-of-event and as-of-now queries are common

SCD Type 0: retain the original

The dimension row's attribute is set when the row is first created and never changes afterward. The load logic ignores subsequent changes to that attribute on the source system. The column structure is unremarkable: the dimension carries the attribute alongside the others, with no version metadata.

Type 0 is the right choice for attributes whose original value is the historically meaningful one and where downstream analysis treats the value as an anchor rather than a current-state lookup. Customer signup date, account opening date, employee hire date, and product launch date are common examples. A signup date that gets refreshed to today every time a customer record is synced is a defect, not a feature, and Type 0 is the explicit guard against that defect at the warehouse level.

The technique sometimes gets conflated with no change handling at all, which is a different thing. Type 0 is an active design choice: the warehouse acknowledges that the attribute might change upstream and explicitly refuses to propagate the change. That contrasts with Type 1, where the warehouse propagates the change by overwriting.

SCD Type 1: overwrite in place

The dimension row is updated in place. The previous attribute value is gone. Any fact already loaded that referenced this dimension row now joins to the new value because the surrogate key it points at has not changed.

The column structure is unremarkable: the dimension table has its surrogate key, its business key, and one column per attribute. There are no version columns, no effective dates, no current-record flag. The load process detects a change and writes an UPDATE against the affected row.

Type 1 is the right choice for two situations that look superficially different but produce the same decision. The first is data correction: a typo, a miscategorization, or a back-office fix that should retroactively replace the wrong value everywhere. The second is for attributes where current state is always what the analytical use case needs and historical state has no defensible interpretation. The customer's preferred display name is usually Type 1. The customer's email address is usually Type 1, since a typo in last week's email field would corrupt every fact joined to that customer if treated as a tracked change.

SCD Type 2: new row per change

The dimension table grows a row each time a tracked attribute changes. The business key is preserved across all versions; the surrogate key is new on each row. The result is multiple rows for the same business entity, one per period of stable attribute values.

The Type 2 dimension table carries the standard attribute columns plus the metadata columns needed to identify which version of the row is current and which time period each version covers. A typical layout includes:

effective_from: the timestamp this version of the row became active
effective_to: the timestamp this version stopped being current, or a sentinel like 9999-12-31 for the currently active row
is_current: a boolean or single-character flag indicating which row is the active version for this business key
version: an incrementing integer, optional, useful for debugging and rarely used in production queries

The load logic for Type 2 has two passes. When a change is detected for a business key, the existing current row is updated to set effective_to to the change timestamp and is_current to false. A new row is then inserted with the new attribute values, the change timestamp as effective_from, the sentinel value as effective_to, and is_current set to true. The new row gets a freshly minted surrogate key.

Type 2 is the default for any attribute where historical accuracy matters. Region assignments, sales territories, product categorizations, customer segmentation tiers, and most demographic attributes that drive analysis are all Type 2 by default. The cost is straightforward: the dimension grows faster than it would under Type 1, and queries need to be aware of which row to join to.

The same business key flowing through three Type 2 versions looks like this:

Three rows, three surrogate keys, one business key. Each row covers a contiguous time period. Facts loaded across the change boundaries join to whichever surrogate was active at the event timestamp.

SCD Type 3: add a column for the prior value

The dimension row is updated in place, but a parallel column preserves the previous value. A region column might be accompanied by prior_region and region_changed_at. When the region changes, the current region value moves to prior_region, the new value lands in region, and the timestamp updates.

Type 3 captures exactly one historical transition. Earlier transitions are lost. The implementation cost grows linearly with how much history is wanted because each additional historical step needs its own column, and after two or three columns the model has clearly become the wrong choice.

The real use case for Type 3 is rare: situations where exactly one transition is operationally meaningful and the analytical use case asks "before versus after the recent change" rather than "the whole history." Sales territory realignments are a recurring example. A reorganization assigns each representative to a new territory; reports for the next quarter want to compare performance in the prior territory to performance in the new one. After that quarter, the analytical interest in the prior assignment fades. Type 3 captures exactly this and nothing more.

SCD Type 6 and the other hybrids

Type 6 is Type 1 plus Type 2 plus Type 3 on the same attribute. The dimension row carries the current value (Type 1 behavior), full historical rows (Type 2 behavior), and a stable historical reference column (Type 3 behavior). The combination is named by addition: 1 + 2 + 3 = 6.

In practice, Type 6 most often surfaces as Type 2 with an additional column that always carries the current value of the attribute. A customer dimension with Type 2 versioning for the loyalty_tier attribute might also carry a current_loyalty_tier column that is overwritten in place every time the tier changes. The Type 2 history answers "what tier were they at when they placed this order"; the current_loyalty_tier column answers "what tier are they at now" without needing to find the current row of the dimension. Analytical queries that don't care about history can avoid the active-row filtering step.

Type 4 is the less common variant. The current state lives in the main dimension; historical changes live in a separate history table joined when needed. It is occasionally appropriate when historical changes are queried rarely but in volume, and when keeping the main dimension narrow has a real performance benefit. Type 7 places two surrogate keys on the fact table for each dimension reference: the current-row surrogate (which always resolves to the latest dimension state) and a durable identifier that is stable across all versions of the same business entity (which lets the fact also resolve to the historical state in force at the time of the event). Kimball calls these "dual surrogate keys" or sometimes a "durable identifier" pattern. It is rare and worth knowing exists; reaching for it is a sign the underlying question genuinely requires both as-of-event and as-of-now joins on the same query.

Active row identification

The fact load process needs to find the surrogate key for the version of the dimension that was current at the time the fact occurred. Three mechanisms make this lookup work, and each constrains what the warehouse can do downstream.

Date-range identification stores effective_from and effective_to on each row. The fact-load lookup joins on the business key and adds a predicate that the fact's event timestamp falls within the row's effective range. Historical resolution is natural: any past date produces the row that was current then. The downside is that every dimension row carries two timestamp columns and every lookup carries a range condition.

Flag identification stores is_current and nothing else. The fact-load lookup joins on the business key and filters to is_current = true. This is fast and simple but only works for current-state resolution. A late-arriving fact with an event timestamp before the most recent change will be incorrectly joined to the current row rather than the row that was active at the event timestamp. Flag-only implementations are common and frequently wrong for any warehouse that loads facts in non-chronological order.

Version identification stores an incrementing version integer. By itself it provides no temporal information, only ordering. Version-based implementations work for current-state resolution (the row with the highest version per business key) but require an additional date range or change timestamp to do historical resolution. Version is most useful as a debugging and audit field rather than the primary active-row mechanism.

Production implementations frequently combine date-range and flag identification. The flag is the fast path for the common case of current-state queries. The date range is the correct path for historical resolution. Keeping both means the fact loader can choose the right mechanism per fact source: streaming current-state events can use the flag; backfills and late-arriving facts can use the range.

Design-time AI.

Deterministic runtime.

AI helps you build. Production runs deterministic SQL on your warehouse. No LLM calls at runtime.

See a demo

Fact loading under SCD Type 2

Type 2 changes the surrogate-key lookup during fact loading. Under Type 1, the lookup is straightforward: join the incoming fact record to the dimension on the business key and return the single surrogate. Under Type 2, the dimension contains multiple rows per business key, and the load process must resolve which one.

The resolution logic depends on the meaning the fact source is reporting. Event facts, where the row represents something that happened at a known time, resolve to the dimension row that was active at the event time. The lookup joins on the business key and applies the effective-date range condition. This is the standard case and produces the historically accurate joins that Type 2 exists to support.

Snapshot facts, where the row represents state at a periodic boundary (end of day, end of month), resolve to the dimension row that was current at the snapshot time. The same range-based lookup applies, with the snapshot timestamp as the event timestamp.

Late-arriving facts, where the event time precedes the load time by more than the normal latency, present the same lookup against the same date-range mechanism. The fact's event timestamp is the lookup parameter regardless of when the fact arrived. A Type 2 dimension with date-range identification handles late arrivals correctly without any special-case code. This is the property that makes Type 2 worth its cost.

Real-time facts where the timestamp is the load time itself collapse to current-state resolution and can use either the flag or the range. The choice usually follows the architectural pattern of the rest of the loader rather than being made per fact.

Two failure modes are common enough to call out. The first is using is_current filtering against late-arriving facts: every late fact ends up joined to whatever dimension row is current at load time, producing systematically wrong history. The second is failing to handle the case where no dimension row is active at the fact's timestamp, which happens when a fact's business key is not yet in the dimension. The standard responses are either to add a placeholder dimension row that becomes the current row and gets backfilled when the real dimension data arrives, or to use a designated "unknown" dimension row for the join. The first is more accurate; the second is simpler and acceptable when the business key gap is rare.

Lakehouse table formats (Iceberg, Delta) provide time-travel queries that can answer some as-of-event questions without explicit Type 2 history: a query targets a past snapshot timestamp and reads the dimension as it was then. The mechanism is real but bounded. Time travel works within the table's snapshot retention window, typically days to weeks rather than the multi-year history a dimensional warehouse maintains, and it queries the table at a single past point rather than producing the row active at each fact's event time. Type 2 remains the correct technique when historical accuracy needs to exceed retention, when fact loading joins on each fact's event timestamp rather than a fixed past point, or when the dimension lives outside the lakehouse format. Time travel can substitute for ad-hoc historical analysis but not for the fact-loading lookup mechanism.

Edge cases and gotchas

Type designations that vary by attribute are the rule rather than the exception. A single customer dimension routinely has Type 1 attributes for fields like display name, Type 2 attributes for region and tier, and possibly a Type 3 attribute for prior-territory analysis. Implementations that treat SCD type as a dimension-wide setting either over-version everything or fail to track the attributes that matter. The model should carry an SCD type designation per attribute, and the load logic should consult the designation when processing changes. This separation is one of the higher-leverage design decisions in a warehouse and worth getting right at the start.

Retroactive corrections to Type 2 history are operationally awkward. If a dimension attribute is discovered to have been wrong for a past period, fixing it means either updating the affected historical row (which loses the audit trail of the wrong value) or inserting a correction row (which can produce gaps or overlaps in the date range). Most warehouses pick one policy and document it. The right policy depends on whether the warehouse is itself the system of record or whether an upstream system is, and on whether external reporting has already cited the wrong values.

Hashing the tracked attributes is the standard way to detect changes during incremental load. The dimension row carries a hash of the concatenated tracked attribute values, computed over a normalized representation: NULLs replaced with a sentinel string, whitespace trimmed, case folded where the business semantics treat the values as case-insensitive. Without normalization, the hash fires on cosmetic source variation (a trailing space added to a name field, a NULL replaced with an empty string by an upstream system, a case change in an email address) and produces a stream of spurious Type 2 version rows that don't represent meaningful business changes. The load compares the incoming hash to the stored hash. If they match, no change occurred and the load can skip processing the row. If they differ, a change occurred and the Type 2 sequence runs. This avoids row-by-row attribute comparison for dimensions with many tracked attributes.

Surrogate keys regenerate per row, not per business key. Two Type 2 rows for the same customer have different surrogate keys, and fact tables loaded across the change boundary correctly point at different surrogates. The business key is the stable identifier; the surrogate is the row identifier. Loading code that conflates the two produces fact tables that refer to non-existent or wrong dimension rows after the first attribute change.

Hybrid Type 6 columns get out of sync when the load logic for the current-state column and the Type 2 history are written separately. Keeping them aligned requires that the load update both as part of the same transaction or batch. A common failure pattern is to add the current-state column later as an optimization, with separate load logic, and to introduce a window where the current row's Type 2 attributes and the current-state column disagree.

Surrogate keys are the precondition for any Type 2 implementation. The technique for generating and managing them, and for handling the lookup mechanics during fact loading, has its own dedicated treatment in the surrogate key management article. The dimensional modeling pillar covers conformed dimensions and the broader context for why dimensional history matters. Late-arriving facts and late-arriving dimensions are closely related problems that share the same date-range lookup mechanics, and each has implementation patterns worth knowing.

For the definitional version of the term, see the slowly changing dimension and surrogate key glossary entries.

Closing

SCD type selection is an attribute-level design decision driven by what each attribute is for and what historical questions the warehouse needs to answer. Type 2 is the default for attributes where history matters; Type 1 is the default for attributes where it does not; Type 3 and the hybrid types serve narrower cases. Active row identification by date range is the only mechanism that handles late-arriving facts correctly, and any fact loader that depends on a current-record flag for historical resolution will produce wrong joins under load patterns the system was not designed for. The mechanics are not difficult, but they reward being made explicit early rather than rebuilt later.

Reference

The SCD typology originates in Ralph Kimball's dimensional modeling methodology and remains the working framework across analytical warehouses.

Ralph Kimball, Slowly Changing Dimensions, Kimball Group, 2008. The canonical treatment of Types 1, 2, and 3.
Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed., Wiley, 2013. The full reference text for dimensional modeling, including chapter-length treatment of slowly changing dimensions and the hybrid types.