Data modeling phases: conceptual, logical, and physical

Most warehouse design work that goes wrong went wrong at the modeling layer, before any table was created. The three-phase decomposition into conceptual, logical, and physical models is the discipline that keeps that work tractable: it separates what the business cares about from how the warehouse represents it from what the chosen engine actually runs. The data warehouse fundamentals pillar introduces dimensional modeling as the dominant representation for analytical data; this article goes one level beneath that and covers the modeling phases themselves, in the form they take on a 2026 cloud warehouse stack.

TL;DR. The conceptual model defines what entities exist and how they relate, in business language. The logical model adds attributes, keys, normalization decisions, and an explicit modeling style (dimensional, data vault, 3NF). The physical model commits to engine-specific structures: column types, clustering, partitioning, distribution, materialization. Each phase has a different audience and a different failure mode. Skip a phase and the missed decisions resurface later as production incidents.

What each phase is for

The three phases are an organizing convention, not a workflow law. A small warehouse may compress conceptual and logical into a single design pass; a large enterprise rebuild may spend months on the conceptual model before any logical work begins. What matters is that the decisions each phase is responsible for actually get made, in the order that lets later phases depend on earlier ones.

A conceptual data model describes the entities a business cares about and the relationships between them, expressed in language a business stakeholder can read. Customers place orders. Orders contain line items. Products belong to categories. Stores sit in regions. The model is platform-independent, database-independent, and notation-independent in any deep sense: it is the shared vocabulary the rest of the design depends on.

A logical data model takes that vocabulary and gives it structure. Entities become tables. Relationships become foreign keys or junction tables. Attributes are listed with types, with primary keys chosen, with the normalization decisions made (or deliberately rejected, as dimensional modeling does). The logical model is still platform-independent in the strict sense, but it commits to a modeling style. A logical model built around facts and dimensions is committed to dimensional modeling. A logical model built around hubs, links, and satellites is committed to data vault. A logical model in third normal form is committed to operational-style normalization.

A physical data model translates the logical model into the specific representation the warehouse engine will run. Column types are concrete (NUMBER(18,2) rather than "decimal"). Clustering, partitioning, and distribution keys are declared. Materialization choices are made (table, view, materialized view, incremental). On a modern stack, much of the physical model is expressed in dbt as configuration on top of SQL transformations, not in standalone DDL.

The arrows are one-directional in principle and bidirectional in practice. Discoveries during physical implementation regularly surface gaps in the logical model; logical work occasionally reveals that the conceptual model was wrong about a relationship. The discipline is to update the upstream model when that happens, not to silently let the downstream representation drift.

The conceptual model: shared vocabulary, not a diagram

The conceptual model's job is to establish what the business means before any engineering team builds anything. It answers questions that are not technical: is a customer the same thing as an account, or are they distinct entities with a relationship? When a return is processed, does it modify the original order, or does it create a separate return entity? Is a product the SKU, the catalog item, or the manufacturer's part? The answers vary by company, and getting them wrong at this layer poisons every downstream decision.

The output is usually an entity-relationship diagram of some kind, but the diagram is secondary. The substantive artifact is the agreed vocabulary, often captured as a glossary alongside the diagram. Each entity has a definition that a business stakeholder and a data engineer would both accept. Each relationship is named and its cardinality is stated. The point is that two people in different departments reading the model would agree on whether a given real-world thing is or isn't a Customer.

In a data warehouse context, the conceptual model also surfaces the questions that drive the warehouse's existence. Which entities will be analyzed across time? Which relationships need historical accuracy? Which dimensions of the business need to be sliceable in reports? These are not yet implementation questions; they are scope questions, and the conceptual model is where they get pinned down.

Data contracts, in their 2026 incarnation, sit largely at this boundary. A data contract between an upstream producer (a source-system team) and a downstream consumer (the warehouse team) declares which entities, attributes, and semantics are stable enough to depend on. The contract is conceptual: it names entities and meanings, not column names and types. Treating contracts as part of the conceptual modeling work, rather than as a separate platform concern, keeps them tractable.

The conceptual model is platform-independent on purpose. A Customer entity does not change because the warehouse migrated from Redshift to Snowflake, or because the team adopted Iceberg. The entities the business cares about and the relationships between them are stable across infrastructure churn. That stability is the conceptual model's value; protect it.

The logical model: structure, keys, and modeling style

The logical model is where the modeling approach is committed. The same conceptual model can be realized as a star schema, a data vault, or a normalized operational schema. The choice is consequential and rarely reversible without significant rework.

Dimensional modeling is the dominant choice for analytical warehouses and the one this property treats as default. Entities from the conceptual model become either fact tables (representing business events at a declared grain) or dimension tables (representing the descriptive context of those events). The dimensional modeling pillar covers the mechanics in depth, including grain, surrogate keys, slowly changing dimension handling, and conformed dimensions across processes.

Data vault is the alternative when source-schema volatility is the dominant pressure: hubs hold business keys, links hold relationships between hubs, and satellites hold descriptive attributes with full history. The data vault modeling pillar covers when it earns its complexity and when dimensional modeling is the better fit.

Third normal form (3NF) modeling, in the Inmon corporate-information-factory tradition, sometimes shows up as a logical-layer choice for an integration layer beneath dimensional marts. It is less common in greenfield 2026 builds than it was in 2010, but the pattern still appears where data vault feels heavier than the team wants.

Whatever the style, the logical model commits to several decisions the conceptual model deferred:

Attributes are listed with types at the abstract level (integer, varchar, decimal, date, timestamp), but not yet with engine-specific precision and scale. Primary keys are chosen. For dimensional models, this is where the surrogate-versus-natural-key decision lives: the logical model declares that a dimension's primary key is a warehouse-generated surrogate, with the natural key carried as an attribute. For data vault, this is where hash keys are declared.

Cardinalities are made explicit. A one-to-many relationship between Customer and Order in the conceptual model becomes a foreign key from Order to Customer (with the appropriate nullability) in the logical model. A many-to-many between Order and Promotion becomes a junction table with the right composite key.

Grain is declared for every fact table. This is the single most consequential logical-model decision in a dimensional design. The grain of fact_sales is not "sales data"; it is "one row per order line item at the time the shipment record is created." A grain statement specific enough that two designers reading it would independently agree on whether a candidate row belongs in the table is the bar to clear before moving to the physical model.

SCD strategy is declared at the attribute level, not the dimension level. A customer dimension's loyalty tier might need Type 2 history; the same dimension's preferred display name might be fine with Type 1 overwrite; the customer's signup date is Type 0 (never updated after the first load). Making these calls during logical modeling is much cheaper than discovering the gap during analysis.

The logical model is still warehouse-engine-independent. It does not yet care whether the warehouse is Snowflake, BigQuery, Redshift, or Databricks. That independence is intentional: a logical model that bakes in engine assumptions is harder to migrate when the platform decision changes, which it eventually will.

Design-time AI.

Deterministic runtime.

AI helps you build. Production runs deterministic SQL on your warehouse. No LLM calls at runtime.

See a demo

The physical model: engine and operational reality

The physical model is the layer where the warehouse engine actually appears. It commits to concrete column types (NUMBER(38,9) on Snowflake, NUMERIC(38,9) on BigQuery, DECIMAL(38,9) on Redshift), to clustering or partitioning strategies, to distribution styles on Redshift, to micro-partition clustering keys on Snowflake, to materialization choices, and to the indexing or sort-key equivalents the chosen platform supports.

On a modern cloud-warehouse stack, much of the physical model is expressed in dbt rather than in handwritten DDL. A dbt model carries its physical configuration as config blocks: materialization (table, view, incremental, materialized_view), cluster-by or partition-by columns, on-schema-change behavior, and pre/post hooks. The SQL transformation is the logical-to-physical translation; the config is the engine-specific physical decisions. This colocation is useful: it keeps the physical decisions visible alongside the SQL that produced them, version-controlled, code-reviewed, and reproducible across environments.

{{
  config(
    materialized='incremental',
    unique_key='order_line_sk',
    cluster_by=['order_date'],
    on_schema_change='append_new_columns'
  )
}}
select ...

That config block is the physical model for one fact table. The incremental materialization commits to a load pattern; the cluster_by commits to a scan-pruning strategy; the unique_key commits to merge semantics; the on_schema_change commits to a contract about how additive source changes propagate. None of those decisions belonged in the logical model; all of them belong in the physical one.

Different engines force different physical commitments:

On Snowflake, the relevant levers are micro-partition clustering keys (for tables large enough to benefit from cluster maintenance), search optimization (for point-lookup workloads scan-pruning cannot address), and materialized views. Storage is automatic and partition-pruning is implicit; the cluster key is a hint about which columns the engine should bias toward physical co-location.

On BigQuery, the levers are partitioning (almost always by an ingestion-time or event-time date), clustering (up to four columns, ordered by cardinality), and materialized views. Partitioning is hard: it shows up in cost as much as in performance, because BigQuery's on-demand pricing charges for bytes scanned.

On Redshift, the historical levers were distribution style (KEY, EVEN, ALL) and sort keys (compound or interleaved); the newer serverless mode auto-manages most of this. On a provisioned cluster, distribution decisions interact with join performance in ways that have no equivalent on Snowflake or BigQuery, and a physical model designed without considering them will produce a working warehouse that performs badly.

On Databricks SQL with Delta or Iceberg tables, the levers are partitioning (by event date or another high-cardinality dimension), Z-ordering or liquid clustering (the engine-specific scan-pruning structure), and statistics collection. The lakehouse pattern means the physical model also includes the table format choice itself, which dimensional modelers on dedicated warehouses do not have to think about.

Cost dimensions belong in the physical model too. A materialized view that refreshes hourly against a wide source table will dominate the warehouse's monthly bill if the refresh trigger is wrong; an incremental model with a poorly chosen unique key will quietly process the entire history on every run. These are not logical-model concerns; they are physical decisions about how the engine spends credits, and they need to be made deliberately.

Where the phases break down in practice

The textbook progression from conceptual to logical to physical is a useful frame and a misleading one if read too literally. Real warehouse projects iterate. Three failure modes recur often enough to call out.

Skipping the conceptual model. Teams under delivery pressure go straight to logical design (dimensional or data vault) without first agreeing on what a Customer is, whether returns modify orders or create separate entities, or whether the warehouse's grain of analysis matches the business's grain of operation. The work appears to go faster for two weeks and then slows for two months as definitional ambiguity surfaces during loading, during analysis, or worst of all, in production reports that quietly disagree with the operational systems. The fix is more expensive than the original conceptual conversation would have been.

Collapsing logical into physical. This is the dbt-era version of the same mistake. A team writes dbt models that interleave logical structure (grain, surrogate keys, SCD strategy) with physical configuration (clustering, partitioning, materialization), with no separation between the two. The result is a warehouse that is hard to migrate, hard to reason about, and hard to test, because the modeling decisions and the engine decisions are tangled in the same SQL files. Some discipline about what belongs in a staging layer versus a core (logical) layer versus a marts (physical, business-facing) layer is the standard mitigation, and the dbt project structure guide is explicit about it for a reason.

Treating the physical model as an afterthought. The reverse failure: the logical model is well-designed, the dbt project is clean, but no one has thought about how the chosen engine will physically lay out the data until the warehouse is loaded and the queries are slow. Engine-specific optimization is not optional on modern cloud platforms; the cost structure forces the question whether the team wants to think about it or not. A physical model that simply accepts the engine's defaults is a model that will be re-done under operational pressure.

The discipline that addresses all three is to make the phase explicit in the design conversation. Every modeling decision has a phase. The conversation "what should our grain be" is a logical-model conversation; the conversation "should this table be clustered by order_date" is a physical one; the conversation "is a return a modification of an order or its own thing" is a conceptual one. Mixing them in a single design session produces incoherence; separating them keeps the work tractable.

Handoffs and ownership

Different roles tend to own different phases. The conceptual model is owned jointly by data engineers and business stakeholders; one party alone cannot do it. The logical model is owned by data engineers with input from analytics engineers and downstream consumers. The physical model is owned by data engineers and platform engineers, often through dbt configuration in a shared repository.

The handoffs matter more than the ownership distribution. Each phase's output is the next phase's specification, and a sloppy handoff is where most modeling problems originate. A conceptual model with ambiguous definitions produces a logical model with arbitrary structural choices. A logical model with vague grain produces a physical model that loads incorrect aggregates. The discipline is to make each phase's output reviewable as a standalone artifact before the next phase starts building on it.

Tooling has shifted over the past decade. The standalone data-modeling tools (ER/Studio, ERwin, PowerDesigner) that dominated logical modeling are now joined by dbt for the physical layer and by lighter conceptual-modeling notations (often just Mermaid or draw.io diagrams in a docs repo). The choice of tool matters less than the discipline of producing artifacts at each phase that the next phase can build on. A conceptual model in a Notion page and a dbt project for the physical layer is a perfectly defensible 2026 toolchain. A logical model that exists only in someone's head is not.

The dimensional modeling pillar covers the dominant logical-layer choice for analytical warehouses. The data vault modeling pillar covers the alternative for source-volatile environments. The slowly changing dimensions technique and change data capture technique are practice articles that live at the boundary between logical and physical decisions. The warehouse loading and operations pillar covers what happens after the physical model is in place. Building a data warehouse is the end-to-end project context these modeling phases sit inside, and the OLTP vs OLAP comparison covers the transactional-versus-analytical workload distinction the conceptual phase establishes.

Reference

Ralph Kimball and Margy Ross, The Data Warehouse Toolkit, 3rd edition, Wiley, 2013. Chapter 18 covers dimensional design at logical and physical layers; the four-step design process described there is the canonical practitioner anchor for the logical-to-physical transition.
dbt project structure guide. The staging / intermediate / marts layering convention is the explicit articulation of how logical and physical concerns separate in a modern transformation project.