Skip to article
Data Warehouse Info

A practitioner's reference for analytical data warehousing.

Reference Articles · Technique Deep-Dives · Courses · Glossary

Data warehouse fundamentals


What is a data warehouse?

A data warehouse is a centralized repository for integrated, historical analytical data. This guide covers architecture, dimensional modeling, ETL vs ELT, cloud platforms, and the trade-offs that determine when a warehouse is the right tool.


A data warehouse is a centralized repository that stores integrated, historical data from operational systems, structured for analytical queries rather than transactional processing. It is not where data originates. It is where data goes once it has been extracted from the systems that generate it, cleaned to a consistent standard, and shaped into structures that make complex analysis across large data volumes practical.

The distinction from an operational database sounds obvious but has real architectural consequences. A database is optimized for high volumes of short, concurrent read-write transactions with minimal latency. A warehouse is optimized for complex queries that scan large volumes of historical records, aggregating across millions of rows that may span years of data. The schema design that makes a transactional system fast makes an analytical system slow. A data warehouse is not simply an operational database that has grown large. It is a database architecture optimized for analytical workloads, built around a fundamentally different set of trade-offs.

This page covers what a data warehouse actually does, how data gets structured inside one, how it gets in and stays current, where warehouses sit in relation to data lakes and lakehouses, what the shift to cloud platforms has changed, and the trade-offs that determine when a warehouse is the right choice and when something else makes more sense.

What a data warehouse actually does

The job of a data warehouse is to make historical, integrated data available for analytical queries. Three words in that sentence do the real work.

Historical means the warehouse accumulates records over time rather than only maintaining current state. An operational CRM updates a customer record when their address changes. A well-designed warehouse typically preserves both the old and new states, using snapshots or slowly changing dimension techniques, so analysts can ask how many customers changed regions last quarter rather than only where customers are located now. A snapshot tells you where things are. A history tells you how they got there, at what rate they are changing, and what conditions held at any point in the past.

Integrated means data from different source systems has been reconciled into a consistent representation. A customer in a CRM is probably an account in an ERP and a user in a web analytics platform. The warehouse resolves these into a unified model so analysts can join across source systems without doing that reconciliation themselves at query time. This integration work is where most of the effort in building and maintaining a warehouse actually lives, not in the querying.

Analytical means the warehouse is designed for the patterns that BI tools and analysts generate: aggregations, group-bys, window functions, time-series analysis, joins across large tables, range scans over months or years of data. These queries scan wide and deep. They read enormous volumes of rows and return summary results. They almost never update individual rows. The physical storage format, indexing strategy, and query engine all follow from this read-heavy, scan-heavy access pattern.

Most warehouse implementations also include a staging layer, where source data lands before it has been integrated and cleaned. Staging is a working surface: it tolerates inconsistency, changes completely with each load cycle, and is not intended for end-user queries. The warehouse proper is where the clean, governed, history-tracking data lives. Maintaining these two layers to different standards is one of the more important operational disciplines in warehouse work.

How data is structured: dimensional modeling

The most widely used design pattern for organizing analytical data in a warehouse is dimensional modeling, developed and codified in the 1990s and still the primary framework most warehouse teams use today, including teams working on cloud platforms that didn't exist when the pattern was formalized. That persistence isn't inertia. The separation of facts from dimensions maps directly onto how people actually ask analytical questions: what happened, and what was the context.

Fact tables represent business events or measurements at a declared grain. They typically contain foreign keys to dimension tables plus numeric measures. Some fact tables are factless, recording the occurrence of an event without any additive measures. Dimensions carry the descriptive context of those events: the customer who made the purchase, the product that was sold, the date it occurred, the channel or location. Dimension tables are typically much smaller in volume than fact tables, though they grow complex when they carry a long history of changes.

These two types of tables join together in a star schema, named for the shape the entity-relationship diagram makes when a central fact table is surrounded by its dimension tables. A sales fact table might reference a customer dimension, a product dimension, a date dimension, and a store dimension. Analysts can aggregate sales while filtering or grouping by any combination of those attributes. The schema is deliberately denormalized: redundancy in dimension tables is acceptable because query performance matters more than storage efficiency at this layer.

dim_customer

fact_sales

dim_product

dim_date

dim_store

Grain is one of the most consequential design decisions in a dimensional model, and one of the most frequently skipped. The grain of a fact table is the precise definition of what one row represents. A sales fact table might be at the order line level, the order level, or a daily summary by product and store. The choice determines which questions can be answered from the table, which require a different table, and which joins are safe to make. Mixing grains inside a single fact table is one of the most common modeling errors, and it is genuinely difficult to fix after loading has started. Getting grain written down and agreed on before schema design begins is worth whatever it costs.

Surrogate keys are warehouse-generated identifiers assigned independently of whatever primary keys exist in source systems. The dimension's surrogate key is assigned at load time and used as the foreign key reference in fact tables. This decoupling from source system keys lets the warehouse manage its own identity across sources and over time, even when source system keys change, get reused, or overlap when multiple systems are being consolidated.

The behavior of dimensions over time is managed through slowly changing dimension (SCD) strategies. When a customer changes their region, there are at least three coherent approaches: overwrite the existing record so only current state is visible (Type 1), create a new version of the row with effective date tracking so both old and new states are available (Type 2), or keep limited history in additional columns on the same row, such as the current value and the immediately prior value (Type 3). Type 2 is the most analytically useful because it lets you join facts to the dimension state that was true at the time of the transaction, not just today. The cost is dimension table complexity and more involved surrogate key lookups during fact loading. Different attributes in the same dimension often warrant different strategies: a customer's loyalty tier needs full Type 2 history, their preferred display name can probably just be overwritten. Many ETL, ELT, and data modeling tools support attribute-level SCD configuration, or allow teams to implement it through custom transformation logic.

Data vault modeling is an alternative to the dimensional approach, designed for environments where auditability, source traceability, and tolerance of schema changes in source systems are higher priorities than query simplicity. Data vault organizes data into hubs (business keys), links (relationships between hubs), and satellites (descriptive attributes with full history). It is more normalized and more resilient to schema changes, but it requires a transformation layer to produce structures that BI tools can consume. Some organizations use data vault for the integration layer and build dimensional marts on top. Others use dimensional modeling end-to-end. Both are defensible depending on the environment.

How data gets in and stays current

Getting data from source systems into a warehouse and keeping it current is a permanent operational concern, not a one-time problem. Most warehouse teams spend more time maintaining load pipelines than they spent on the original design.

The foundational pattern is extract, transform, load (ETL): extract data from source systems, transform it to match the warehouse schema, then load it in. The transformation happens outside the warehouse in a dedicated processing layer. ETL dominated for decades, and for good reason: it kept raw data out of the warehouse until it was clean and governed.

ELT inverts this order. Data lands in the warehouse first; transformation runs inside the warehouse against data already in place. Cloud warehouses with elastic compute have made this practical at scale. Pushing transformation into the warehouse often simplifies the pipeline architecture and takes advantage of the query engine the warehouse was built around. The trade-off is that partially governed data exists inside the warehouse before transformation has run, which raises access control questions that ETL environments handle before data ever arrives. ELT is the pattern that dominates modern cloud warehouse deployments. Whether it is the right choice for a given situation depends on the transformation complexity, governance requirements, and the economics of the specific platform.

Incremental loading is the practice of loading only new or changed records rather than reloading entire tables on each cycle. Full reloads are fine at small volumes. At production scale, they are slow, expensive, and disruptive to concurrent queries during the load window. Change data capture (CDC) tracks which records have been inserted, updated, or deleted since the last load and passes only those changes to the pipeline.

CDC implementations vary by source system. Log-based CDC reads the database transaction log directly. Because the log is the system of record for every committed write, it captures hard deletes (which leave no audit trace) and preserves the exact sequence of changes, which is what makes log-based CDC the more complete option. The cost is that it requires access to the source system's internals and can lag or fail if downstream processing cannot keep up, log retention windows are exceeded, or schema changes are not handled correctly. Timestamp-based CDC relies on audit fields like last-modified dates, which is simpler to implement but misses hard deletes unless the source maintains a deletion log, and is sensitive to clock synchronization issues between systems.

In the standard dimensional loading pattern, dimension tables are populated before the fact tables that reference them, because the surrogate key lookups that populate fact table foreign keys require the dimension rows to already exist. Parent tables must load before children. In a warehouse with many tables, resolving this dependency graph by hand is error-prone. Deriving it from the referential integrity constraints in the model is considerably more reliable.

Late-arriving data is a problem every warehouse in production will eventually face. A dimension record arrives after the facts that reference it. A source system backfills a previous month. How the warehouse handles these cases depends on design decisions made before the first load runs. Late-arriving dimensions are commonly handled with placeholder records: a surrogate key is assigned at fact load time even though the full dimension attributes haven't arrived, and the placeholder is updated or versioned when the actual record comes in. This preserves referential integrity and keeps facts from being dropped while the pipeline waits for stragglers. Late-arriving facts, where the transaction date is in a past period, require reprocessing against the dimension state that was current at the time of the transaction. That is only possible if the dimension history was preserved with Type 2 tracking in the first place.

The shift to cloud warehouse platforms

Warehouse infrastructure has changed substantially since the early 2010s. Cloud warehouse platforms have replaced most on-premises analytical appliances for new builds, and most warehouse teams will spend their careers working primarily on cloud infrastructure. Understanding the architectural differences matters for both designing a new warehouse and operating an existing one.

The central change is the separation of storage from compute. Traditional on-premises systems coupled these: the machines that stored the data also ran the queries. Cloud warehouses store data in object storage independently of the compute clusters that query it. Compute can scale up or down without touching storage, multiple compute clusters can query the same data simultaneously, and compute costs accrue only while queries are running.

Columnar storage is standard across all major cloud warehouse platforms. Data stored by column means a query aggregating one field across millions of rows reads only the bytes for that column, not entire rows. Combined with the compression columnar formats achieve on the repeated value distributions common in analytical data, this makes scan-heavy queries considerably faster than row-oriented storage at equivalent scale.

The major platforms each make different architectural bets. Snowflake separates storage and compute cleanly and handles variable concurrency well. BigQuery is fully serverless, which removes cluster management; under on-demand pricing, cost prediction can be harder when query patterns are unpredictable, though capacity-based pricing can reduce that uncertainty. Redshift integrates deeply with AWS services and has a mature toolset; it now offers both provisioned clusters and a serverless option, making it a less uniform example of the clean storage-compute separation than Snowflake or BigQuery. Databricks is built around the lakehouse pattern rather than the standalone-warehouse pattern, with two components carrying its warehouse-style use cases: Databricks SQL provides ANSI SQL query execution against tables stored in open formats, and Photon is the native vectorized engine that closed most of the historical performance gap to purpose-built warehouse platforms. The platform fits best when relational analytical workloads sit alongside significant semi-structured data, machine learning, or streaming.

The differences between platforms matter most at selection time. Once a platform is established, the warehouse design principles are the same regardless of which cloud hosts them. Dimensional models, surrogate keys, SCD handling, incremental loading, and governed schemas apply to all of them.

One adjustment that catches experienced practitioners on their first cloud project is that major cloud warehouses do not use traditional OLTP-style indexes as the primary performance mechanism. Instead, each platform has its own optimization structures: Snowflake uses micro-partitions with optional clustering keys, plus secondary structures for point-lookup workloads when scan-pruning isn't sufficient; BigQuery uses partitioning and clustering to control which data is scanned; Redshift uses sort keys and distribution styles that affect how data is physically organized and co-located across nodes. Understanding the platform-specific equivalent is necessary before you can tune query performance effectively. Across all platforms, the levers common to good query design still apply: predicate pushdown, avoiding unnecessary cross-joins, materialized views for frequently repeated computations, and result caching where the platform supports it.

Trade-offs and when a warehouse isn't the right tool

A warehouse is not the right choice for every analytical problem. The cases where it fails are worth understanding before the infrastructure is in place.

The clearest case against a warehouse is when the data is predominantly unstructured or semi-structured and the query patterns are exploratory. A warehouse expects data to arrive in a shape that matches the schema. If the schema isn't yet known, or if the data is event streams, log files, or documents that don't map cleanly to a relational model, the preprocessing cost to fit it into the warehouse may not be worth the query performance benefit. A data lake or lakehouse architecture accepts raw data more readily and defers the schema question.

Latency is the second constraint. Batch-oriented warehouses run load cycles on hourly or daily schedules. If the use case requires queries against data that is minutes old, a batch warehouse won't serve it. Near-real-time OLAP systems and streaming architectures exist for this, but they introduce their own operational complexity. Many teams discover they overestimated the real-time requirement when they look closely at the actual analytical workflows their users run. The right question to ask early is what analytical decisions need data fresher than one hour, and whether those decisions genuinely need to be made at that frequency.

Warehouse sprawl is a failure mode that develops over time, not at initial design. As more teams build on the warehouse, fact tables, dimension tables, and derived analytical layers multiply. Without governance, tables get built redundantly, business metric definitions diverge between teams, and the single source of truth the warehouse was supposed to provide erodes quietly over months. The technical countermeasure is a well-maintained dimensional model with clear ownership, documented grain and business keys, and consistent metric definitions enforced at the semantic layer. The harder part is the organizational discipline to maintain it.

Mixed SCD types within a single dimension are a common source of bugs that are subtle enough to get into production. Different attributes on the same dimension can legitimately need different historical tracking. A customer's loyalty tier needs full Type 2 history. Their preferred display name can probably be overwritten with Type 1. Applying one SCD strategy across the entire dimension either loses history that was needed or carries history that isn't, and fixing it after loading has started means rebuilding the dimension from source. Making the attribute-level SCD decision during modeling costs much less than discovering the gap during analysis.

The surrogate key lookup in fact loading is the most operationally sensitive step in a warehouse load cycle. For each dimension relationship in a fact table, the load process must resolve the source system's business key to the correct warehouse surrogate, specifically the surrogate representing the dimension state that was current at the time of the transaction, not the current state at load time. If the dimension uses effective and expiration date tracking, the lookup joins on a date range condition in addition to the business key match. Getting this wrong produces facts joined to the wrong dimension version. The resulting errors are hard to detect because the data looks complete: the joins succeed, just against the wrong rows. Test it against known historical data before the first production run.

Practical guidance

Where warehouse projects actually go wrong, in concrete terms.

Get grain written down before anything else. The grain of each fact table should be stated precisely, in writing, before schema design begins: "one row represents one order line item at the time of shipment." Grain is the hardest decision to change after loading has started. Teams that skip this consistently find the problem at query time when a join produces a fan-out nobody expected. The conversation is much more expensive to have then.

Design the model before building the pipelines. The temptation is to start with connectivity: get the data moving first, sort out the structure later. Reverse the order. The dimensional model is the specification the load pipelines derive from. Pipelines built before the model is settled will need to be rewritten as the model evolves. That rework compounds.

Decide on incremental loading strategy during design, not six months into production. Full reloads that work fine in development on a small dataset reliably become operational problems as data volumes grow. The CDC mechanism, the audit fields it depends on, and the handling of hard deletes all need to be pinned down early.

Make load ordering explicit and derived from the model. The dependency graph between dimensions and facts needs to live somewhere. Teams that manage it by hand tend to discover its importance the first time a dimension fails to finish before the facts that depend on it start loading. Deriving it from referential integrity constraints removes a class of maintenance error that compounds as the warehouse grows.

Plan for the model changing, because it will. Source systems change. Business requirements change. A dimension designed with four attributes will eventually need a fifth. A grain decision that seemed right will miss a use case. Warehouses built assuming the initial design is permanent accumulate technical debt faster than those built with change as a first-class assumption.

Dimensional modeling is covered in depth in the dimensional modeling pillar. The trade-off between ETL and ELT loading approaches is covered in the ETL vs ELT comparison. The distinction between warehouses, marts, lakes, and lakehouses is covered in the data warehouse vs data lake vs data mart vs lakehouse comparison. The database-architecture distinction between transactional and analytical systems is covered in the OLTP vs OLAP comparison. For cloud platform decisions, the modern warehouse platforms pillar covers Snowflake, BigQuery, Redshift, and their architectural trade-offs. The four-phase practitioner playbook for taking a warehouse project from objectives to production, including the Kimball-vs-Inmon decision criteria, is in building a data warehouse. The data modeling phases technique article covers how the conceptual / logical / physical separation translates onto a 2026 cloud-warehouse stack with dbt for the physical layer. The logical data warehouse covers the pattern in which the physical warehouse coexists with lakes, lakehouses, and operational stores behind a unified federated query layer; data virtualization covers the federated-query alternative to materialized integration. The data warehouse metadata article covers the catalog and lineage layer that sits across the rest of the stack. Star schema, surrogate key, slowly changing dimension, change data capture, enterprise data warehouse, abstraction layer, data catalog, data lineage, data lake, data mart, and data lakehouse all have glossary entries.