Building a data warehouse: a four-phase practitioner's playbook

Most warehouse projects fail in predictable ways, and almost all of those ways are decided before any tables get created. The data warehouse pillar covers what a warehouse is and where it sits in the data stack. This article covers the other question that pillar leaves open: how a warehouse project actually gets built, organized as the four phases practitioners reliably run through, with the design-phase decision between Kimball and Inmon treated as a concrete choice rather than an academic debate.

TL;DR. Warehouse projects move through discovery, design, development, and deployment. The phases are sequential in the order of first execution and concurrent thereafter; later phases routinely feed back into earlier ones. The Kimball-versus-Inmon decision belongs in the design phase and turns on a small set of practical criteria, not on theology. Cloud-warehouse-by-default has changed the deployment phase more than the others; it has not eliminated any of them.

The four phases, briefly

Discover identifies what the warehouse is for and what it integrates. Design produces the model and the architectural commitments around it. Develop turns the model into running pipelines. Deploy puts the infrastructure, the analytics layer, and the operational disciplines in place. Each phase has a primary deliverable, a set of decisions that should not slip into the next phase, and a feedback loop that experienced teams plan for.

The pattern is older than the cloud and survives the move to it. What changed is mostly which tools sit inside each phase, not the phases themselves.

Discover: what the warehouse is for

The discovery phase produces two artifacts: a written statement of what analytical questions the warehouse exists to answer, and an inventory of the source systems that hold the data needed to answer them. Skipping either is the most common way to start a project that quietly fails three months in.

The analytical-questions artifact is not a list of reports or a list of dashboards. It is the set of decisions the organization wants to make better, expressed as questions the warehouse should be able to answer. "How does revenue retention vary across customer cohorts by acquisition channel" is a question. "Build a retention dashboard" is a deliverable. The dashboard derives from the question; the question does not derive from the dashboard. Teams that work in the other direction routinely ship dashboards that nobody uses, then build a warehouse around a usage pattern that nobody validated.

The source-systems inventory names every system that produces or holds the data referenced by the analytical questions, along with the access path, the data owner, the volume, and the change rate. A CRM, a billing system, a product analytics platform, a marketing automation tool, and a finance ERP each count as one source. The inventory should also flag the systems that are missing data the questions require, because acquiring that data is often the most expensive part of the project and gets discovered too late if the inventory is informal.

Discovery is where business owners and engineering owners need to be in the room together. Engineering left to itself tends to model the source systems faithfully and miss that nobody will use the result. Business owners left to themselves tend to specify reports without understanding what the source data actually supports. The artifact that ends the phase is a shared document signed off by both, naming the in-scope questions, the in-scope sources, and what is explicitly out of scope for the first phase of delivery.

Two failure modes show up routinely. The first is scope inflation: every department wants their data in the warehouse and every analytical question they have today is in scope. The phase ends with a 200-question list and a project that will never finish. The fix is ruthless prioritization to the questions whose answers actually drive decisions, with the rest queued as future phases. The second is scope deflation: the warehouse gets sized for one team's reports without considering the cross-team integrations that justify a warehouse over a few tactical extracts. A warehouse that only serves one team's reporting is rarely worth the operational overhead.

Design: the model and the architecture

The design phase produces the warehouse's data model, the architectural commitments around it, and the technology choices that follow. This is the phase where most of the consequential decisions get made, and where reversing a wrong call costs the most.

The first decision is the modeling approach. The two established options are Ralph Kimball's dimensional modeling and Bill Inmon's normalized enterprise data warehouse with downstream data marts. They reflect different starting points and produce different shapes of warehouse.

Kimball's bottom-up dimensional approach builds the warehouse as a set of star schemas, each anchored on a fact table representing a business process and surrounded by conformed dimensions shared across processes. The model is denormalized for query performance and oriented around how analysts actually ask questions. Time-to-first-value is short because each star schema delivers usable analytics as soon as it loads. The bus matrix, listing business processes against shared dimensions, is the planning artifact that keeps the dimensional model coherent as new processes get added.

Inmon's top-down enterprise data warehouse builds a normalized, atomic-grain integration layer first, then derives departmental data marts (which are often dimensional) on top. The integration layer is the system of record; the marts are consumption-oriented derivatives. Time-to-first-value is longer because the integration layer needs substantial coverage before any mart is useful. The integration layer's value compounds, however: once the model is in place, adding marts is comparatively cheap, and the normalized atomic data supports auditability and source traceability that dimensional models alone do not.

The decision is not theological. Both approaches still ship working warehouses. The practical criteria:

Criterion	Kimball fits	Inmon fits
Time to first useful analytics	Weeks	Months to quarters
Number and diversity of source systems	Few to moderate	Many, heterogeneous
Auditability and source traceability priority	Moderate	High (regulated industries, finance)
Schema volatility in sources	Low to moderate	High (Inmon's normalization absorbs it better)
Team's comfort with dimensional modeling	Required	Required for the mart layer regardless
Storage cost sensitivity (denormalization overhead)	Higher	Lower
Cross-process consistency required	Conformed dimensions handle it	Integration layer enforces it natively

In production, the choice is often not binary. Many warehouses use a normalized integration or staging layer for raw integrated data and dimensional marts for consumption, which is closer to Inmon's pattern than to Kimball's. Others build conformed dimensional models directly off staged source data, which is closer to Kimball's. Data vault is a third option for environments where the integration layer needs to absorb heavy schema change with full auditability; the data vault modeling pillar covers when it earns the additional layer of complexity.

Whichever approach is chosen, dimensional modeling is the substrate for the consumption layer in essentially every case. Fact and dimension tables, declared grain, surrogate keys, slowly changing dimensions, and conformed dimensions are common across all three modeling philosophies once the data reaches the analytics surface.

The second design decision is the platform. Cloud columnar warehouses are the default in 2026: Snowflake, BigQuery, Databricks SQL, and Redshift cover the dominant share of new builds, and the lakehouse table formats (Iceberg, Delta) increasingly factor into the design when the warehouse needs to share storage with other workloads. The modern warehouse platforms pillar covers the platform-level trade-offs in detail. The design-phase output is a platform choice plus the high-level account, storage, and compute layout that follows from it.

The third design decision is the loading strategy. The choice between ETL and ELT interacts with the platform: ELT is the dominant pattern on cloud warehouses with elastic compute, while ETL still appears where governance forbids landing untransformed data inside the warehouse. The strategy also commits the team to a change data capture approach for incremental loads, since full reloads stop being viable above modest data volumes.

The design phase ends with a written design document covering the model (entity-relationship diagrams, conformed dimensions, declared grain for each fact table), the platform and account layout, and the loading architecture. Pipelines should not start being built until this document exists and has been reviewed.

Design-time AI.

Deterministic runtime.

AI helps you build. Production runs deterministic SQL on your warehouse. No LLM calls at runtime.

See a demo

Develop: pipelines, transformations, tests

The development phase turns the design into running data flows. Two components dominate: source-to-target mapping and load configuration.

Source-to-target mapping is the explicit correspondence between fields in source systems and columns in warehouse tables. The mapping carries the data type conversions, the cleansing rules, the joins needed to denormalize source entities into dimensions, and the aggregations needed to produce fact-table measures. In a warehouse with hundreds of source columns and dozens of dimensions and facts, the mapping is itself a substantial artifact. Treating it as code rather than as a one-time spreadsheet exercise is what separates pipelines that survive source-system changes from pipelines that break every quarter.

Modern transformation tools (dbt is the dominant choice for SQL-based transformation; SQLMesh and Coalesce occupy adjacent niches; Spark and Snowpark cover the Python-heavy work) encode the mapping as version-controlled code with tests, lineage, and documentation as first-class concerns. The warehouse loading pattern that has consolidated around dbt is: extract-and-load tools (Fivetran, Airbyte, Estuary, custom Python) land raw source data in the warehouse, dbt transforms it through a series of versioned models, and the final dimensional and fact tables are the published consumption layer. The warehouse loading and operations pillar covers the operational disciplines around this pattern.

Load configuration specifies how each table loads: full versus incremental, the CDC mechanism for incremental tables, the schedule, the dependencies between tables, and the recovery behavior on failure. Full loads are appropriate for small reference tables and for periodic reconciliation of incrementally-loaded tables. Incremental loads are the production default for fact tables and large dimensions; the CDC mechanism, whether log-based, timestamp-based, or trigger-based, is determined per source by what the source exposes. The change data capture article covers the mechanics and the failure modes in depth.

The development phase also commits the team to a testing discipline. Source-data tests catch upstream changes that would corrupt the warehouse silently. Model tests catch schema regressions and grain violations. Reconciliation tests catch the slow drift between incremental loads and the underlying source state. Without these, the warehouse stays correct until it doesn't, and the failure usually shows up first in a report that quietly disagrees with the source system.

Development is also where dimensional behaviors get implemented in code. The slowly changing dimension strategies declared during design (Type 1 for current-state attributes, Type 2 for attributes where history matters, occasionally Type 3 or the hybrids for specific cases) need their load logic written and tested before production data flows through. The cost of getting SCD logic wrong shows up retroactively as analyses that quietly disagree with history.

Deploy: infrastructure, BI, operations

The deployment phase puts the warehouse into production. The phase is rarely a single event; production warehouses get rolled out incrementally, with subject areas going live one at a time and the analytics consumers onboarded in waves.

The first deployment task is the platform itself. On a cloud warehouse, this includes the production account or project, the network and access configuration, the resource boundaries (warehouses, slots, clusters depending on the platform), and the cost monitoring. Credit-based pricing makes the cost dimension immediate in a way it was not on capital-purchased on-premises systems: a runaway query or a poorly designed transformation can produce a five-figure bill in a single day, and the protection against that is monitoring and resource caps, not goodwill.

The second deployment task is the integration of the BI and analytics consumption layer. A warehouse without a consumption surface is, in practical terms, not deployed. Looker, Tableau, Power BI, Metabase, and the warehouse-native semantic layers (dbt's semantic layer, Cube, native semantic views in the warehouse) sit in this layer and are typically chosen alongside the warehouse rather than after it. The consumption layer is also where the semantic-layer-versus-direct-SQL question gets resolved: production-grade analytics increasingly happen through a semantic layer that holds the metric definitions, so that "revenue" and "active users" mean the same thing across every dashboard and every analyst.

The third deployment task is the operational disciplines that keep the warehouse correct after launch: monitoring, alerting, on-call rotation, change management for source-system updates, and the documentation that lets the next engineer pick up the system. Warehouses without these accumulate quiet failures that become loud failures the first time a senior leader notices a number does not match what they expected.

Deployment is incremental for the same reason discovery prioritizes ruthlessly: a warehouse that tries to ship every subject area at once usually ships nothing. The sequence that works is high-value subject area first, validated with real users, then expansion. Each wave of rollout feeds back into the design and development phases: questions that worked in concept turn out to need additional dimensions; sources that looked clean in discovery turn out to need cleansing rules nobody anticipated. Treating these feedback loops as part of the process rather than as setbacks is what makes the difference between a warehouse that gets better with use and one that ossifies.

What the cloud era changed and didn't

The four phases predate the cloud and survive it largely unchanged. What the cloud era changed is mostly the deployment phase: separation of storage and compute, elastic capacity, credit-based costing, and the operational simplification of not running database servers. The design phase still has the same modeling decisions. The development phase still has source-to-target mapping and incremental-load configuration as its dominant work. The discovery phase still depends on the same conversation between business and engineering owners.

What did not survive the move to cloud is the false economy of skipping any of the four phases. Cloud warehouses make it easy to start: spin up an account, load some data, run a query. The phase that gets skipped is almost always discovery or design, and the consequence is the same as it always was. A warehouse without a stated purpose drifts. A warehouse without a stated model accumulates contradictions. The cloud removed many of the infrastructure-management excuses for skipping the phases; the phases themselves remain.

The data warehouse pillar covers the warehouse's place in the broader data stack. For the design phase decisions in depth: dimensional modeling, data vault modeling, and the star schema versus snowflake schema comparison. For the development phase: warehouse loading and operations, the ETL versus ELT comparison, change data capture, and slowly changing dimensions. For platform selection in the design phase: modern warehouse platforms. The warehouse versus mart versus lake comparison clarifies where the warehouse fits among adjacent analytical surfaces.

Reference

Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed., Wiley, 2013. The canonical reference for the dimensional approach, including the bus matrix and conformed dimensions as the planning artifacts that keep a Kimball-style build coherent across business processes.
W. H. Inmon, Building the Data Warehouse, 4th ed., Wiley, 2005. The canonical reference for the top-down enterprise-data-warehouse approach. The current edition predates the cloud-warehouse era but the architectural argument is unchanged.
Kimball Group archive. The collected design tips remain the working reference for dimensional modeling in production warehouses.