Data warehouse vs data lake vs data mart vs lakehouse

Practitioners planning an analytical platform routinely encounter the same four terms used interchangeably: data warehouse, data mart, data lake, data lakehouse. The four are not synonyms. They name four distinct architectural commitments, each suited to a different class of workload, and conflating them is a common reason an analytical platform ends up doing none of the four jobs well. The data warehouse pillar covers the warehouse in depth; this page compares the four and sets out the conditions under which each is the right choice.

TL;DR. Use a data warehouse for structured, governed business intelligence on integrated historical data. Use a data lake when the workload is exploratory analytics or machine learning on raw, varied data the warehouse can't economically host. Use a data mart when one department needs a curated slice of the warehouse and the rest of the organization doesn't. Use a lakehouse when both warehouse and lake workloads need to run on the same physical storage and the team is willing to manage table-format metadata to make that work. In most production environments at any meaningful scale, the answer is some combination of these, not a single one.

The short answer

The four belong to two different generations of analytical architecture. The data warehouse and data mart are the established structured-analytics pattern, dating from the early 1990s and still the production default for governed business intelligence. The data lake is the response to the rise of unstructured and semi-structured data and to workloads (machine learning, exploratory science, log analysis) that the warehouse model wasn't designed for. The lakehouse is the architectural compromise that arrived in the late 2010s: an attempt to do warehouse work and lake work on the same physical storage, using an open table format to add the structure and governance the bare lake lacks.

Choosing between them depends less on the data's volume than on the workload's shape. Structured queries, joins across modeled entities, time-series analysis over historical state, and BI tool integration favor the warehouse. Exploratory analysis on heterogeneous data, training datasets that need to be reshaped frequently, and large-volume semi-structured workloads favor the lake. A team that legitimately needs both, on the same data, with consistent governance, is the case the lakehouse was built for.

Mixed architectures are routine. A typical 2026 stack might land raw ingestion in a lake, promote modeled data into a warehouse for BI, expose a few departmental marts on top, and use lakehouse table formats on the lake side to give data scientists ACID and time travel. The right shape is the one that fits the workloads, not the one that fits a single vendor's product story.

What each one is

Data warehouse

A data warehouse is a centralized repository for integrated, historical, structured data, optimized for analytical queries. Data lands in the warehouse after being extracted from source systems, transformed to fit a defined schema (typically dimensional or data vault), and loaded into governed tables. Schema-on-write: the structure is decided before the data is written, and the warehouse enforces it.
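Schema-on-write can be sketched in a few lines. The example below is a minimal illustration, not how any particular warehouse implements enforcement: the `ORDERS_SCHEMA` table definition, the `SchemaViolation` error, and the `load` function are all hypothetical names, and a plain Python list stands in for the governed table.

```python
from datetime import date

# Hypothetical target schema for an orders table: column name -> required type.
ORDERS_SCHEMA = {
    "order_id": int,
    "customer_id": int,
    "order_date": date,
    "amount_cents": int,
}


class SchemaViolation(ValueError):
    """Raised when a row does not match the declared schema."""


def validate_row(row: dict, schema: dict) -> dict:
    """Reject the row unless its columns and types match the schema exactly."""
    if set(row) != set(schema):
        raise SchemaViolation(f"columns {sorted(row)} != expected {sorted(schema)}")
    for column, expected_type in schema.items():
        value = row[column]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise SchemaViolation(f"{column}={value!r} is not {expected_type.__name__}")
    return row


def load(rows: list, table: list) -> None:
    """All-or-nothing load: validate every row before writing any of them."""
    validated = [validate_row(r, ORDERS_SCHEMA) for r in rows]
    table.extend(validated)
```

A row whose `order_date` arrives as the string "2024-01-06" is rejected before anything is written, which is the point: the bad value never becomes something an analyst can query.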

The defining commitment is that the schema and the governance are upstream of any analytical query. Analysts query an integrated, consistent representation of business reality, not raw source data. The cost of that consistency is the engineering effort to build and maintain the integration and modeling layers, which is where most warehouse work actually lives.

Modern cloud warehouses (Snowflake, BigQuery, Redshift, Databricks SQL) share the same architectural pattern: columnar storage, separation of compute from storage, native query engines optimized for analytical scan-heavy workloads. The platform-level decisions are covered in the modern warehouse platforms pillar.

Data mart

A data mart is a curated subset of the warehouse organized around a single department or subject area. Marketing has a mart with marketing dimensions and metrics; finance has a mart with the chart of accounts; sales has a mart aligned to its territory and quota structures. The mart is not a different architectural pattern from the warehouse. It is a deployment style of warehouse content, packaged for one audience.

The reason marts exist as a named pattern is governance and performance, not technology. A small focused set of tables is easier for a department's analysts to understand than the full enterprise model. Departmental marts also let teams iterate without coordinating every schema change with the enterprise team. The cost is metric drift: the same metric defined slightly differently in two marts is a leading cause of cross-departmental reports that quietly disagree with each other.
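The way two marts can drift apart on a single metric is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a warehouse; the `fact_orders` table and the two mart views are hypothetical, chosen only to show how two reasonable definitions of "revenue" produce different numbers from the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, amount REAL, status TEXT);
    INSERT INTO fact_orders VALUES
        (1, 100.0, 'completed'),
        (2,  50.0, 'refunded'),
        (3,  25.0, 'completed');

    -- Hypothetical finance mart: revenue excludes refunded orders.
    CREATE VIEW finance_revenue AS
        SELECT SUM(amount) AS revenue FROM fact_orders WHERE status = 'completed';

    -- Hypothetical sales mart: revenue counts every booked order.
    CREATE VIEW sales_revenue AS
        SELECT SUM(amount) AS revenue FROM fact_orders;
""")

finance = conn.execute("SELECT revenue FROM finance_revenue").fetchone()[0]
sales = conn.execute("SELECT revenue FROM sales_revenue").fetchone()[0]
print(finance, sales)  # 125.0 175.0
```

Neither view is wrong in isolation. The disagreement only shows up when both numbers land in the same meeting, which is why a single shared definition matters more than either team's preference.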

Two patterns are common: marts built first and integrated upward into an enterprise warehouse through shared conformed dimensions (the Kimball approach), and marts derived downward from a central warehouse that already integrates source data (the Inmon approach). The right pattern depends on whether departmental autonomy or enterprise consistency is the higher priority at the start. Both approaches converge to similar end states; the difference is mostly about which constraints are accepted up front.

Data lake

A data lake is object storage holding files in whatever shape the source systems produced them: CSV, JSON, Parquet, Avro, ORC, log files, images, audio, video, weblogs, social data. Schema-on-read: the structure is decided at query time, not at write time. The lake imposes minimal commitments on the data going in. The cost is that the structure has to be supplied by every consumer that reads the data, and consumers reading the same files can resolve them to incompatible schemas without anything alerting either consumer that they disagree.
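The silent disagreement between consumers can be shown with one small file. In this sketch the raw content and both reader functions are hypothetical; the only assumption is that the lake holds newline-delimited JSON with no declared schema, so each consumer picks its own types and its own treatment of blanks.

```python
import json

# A raw file as it might land in the lake: newline-delimited JSON, no declared schema.
RAW = '{"user_id": "0042", "amount": "19.90"}\n{"user_id": "0043", "amount": ""}\n'


def read_as_consumer_a(raw: str) -> list:
    """Consumer A treats user_id as an integer and a blank amount as zero."""
    rows = []
    for line in raw.splitlines():
        rec = json.loads(line)
        rows.append({"user_id": int(rec["user_id"]), "amount": float(rec["amount"] or 0)})
    return rows


def read_as_consumer_b(raw: str) -> list:
    """Consumer B keeps user_id as text and treats a blank amount as missing."""
    rows = []
    for line in raw.splitlines():
        rec = json.loads(line)
        amount = float(rec["amount"]) if rec["amount"] else None
        rows.append({"user_id": rec["user_id"], "amount": amount})
    return rows


a = read_as_consumer_a(RAW)
b = read_as_consumer_b(RAW)

# Average order amount, computed by each consumer from the same bytes.
avg_a = sum(r["amount"] for r in a) / len(a)
known_b = [r["amount"] for r in b if r["amount"] is not None]
avg_b = sum(known_b) / len(known_b)
```

Consumer A sees user 42 and an average diluted by a zero; consumer B sees user "0042" and an average over one known amount. Both reads succeed, and nothing in the lake reports that the two results are incompatible.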

Lakes are typically backed by cloud object storage (S3, Google Cloud Storage, Azure Data Lake Storage) and queried by engines that operate on object storage directly: Spark, Presto, Trino, Dremio, Athena, BigQuery's external table support. The economics differ sharply from the warehouse: storage costs are an order of magnitude lower, but compute costs depend heavily on how often the data is read and how thoroughly the queries can prune the files they scan.

The job a lake does well is hosting data the warehouse cannot economically take: petabyte-scale logs and events, unstructured assets, data of unclear future analytical use that needs to be retained for option value. The job a lake does poorly is anything that requires transactional consistency, fine-grained governance, low-latency analytical queries against modeled data, or the kinds of stable schema contracts that downstream BI depends on. Without additional structure on top, a lake is a storage pattern, not an analytical platform.

Data lakehouse

A data lakehouse adds an open table format on top of object storage so that lake-style data can be queried with warehouse-style guarantees. The table format (Apache Iceberg, Delta Lake, or Apache Hudi) is a metadata layer over Parquet files that provides ACID transactions, schema enforcement and evolution, time-travel queries against past snapshots, and the kind of consistent table abstraction that warehouse engines and BI tools expect.
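The core idea of a table format, a log of snapshots over immutable data files, fits in a short toy model. What follows is not the Iceberg, Delta, or Hudi format: `ToyTable`, `Snapshot`, and the file names are hypothetical, everything lives in memory, and the real formats persist this metadata in object storage and rely on a catalog or atomic file operation for the commit step.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # names of the immutable data files visible in this snapshot


@dataclass
class ToyTable:
    """Toy model of a table format: an append-only log of snapshots."""
    snapshots: list = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1] if self.snapshots else Snapshot(0, ())

    def commit(self, added: Sequence[str], removed: Sequence[str] = (),
               expected_id: Optional[int] = None) -> Snapshot:
        """Publish a new snapshot; fail if another writer committed first."""
        base = self.current()
        if expected_id is not None and expected_id != base.snapshot_id:
            raise RuntimeError("concurrent commit detected; retry on the new snapshot")
        files = tuple(f for f in base.data_files if f not in removed) + tuple(added)
        snap = Snapshot(base.snapshot_id + 1, files)
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id: int) -> tuple:
        """Time travel: list the files that were visible at an earlier snapshot."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


table = ToyTable()
s1 = table.commit(added=["part-000.parquet"])
s2 = table.commit(added=["part-001.parquet"], expected_id=s1.snapshot_id)
```

Readers only ever see a complete snapshot, which is where the atomicity comes from, and `files_as_of(s1.snapshot_id)` still returns the one-file view after the second commit. The same log is also what grows without bound if old snapshots are never expired.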

The architectural commitment is that data physically lives in object storage (the lake's storage layer) but is exposed through a table abstraction (the warehouse's interface). Multiple engines can read and write the same tables consistently: Snowflake reads external Iceberg tables; Databricks SQL writes Delta; BigQuery's BigLake exposes Iceberg; Trino, Spark, Flink, and others speak all three formats. The result is one physical copy of the data serving both warehouse and lake workloads, with table-format metadata mediating the consistency.

What the lakehouse does not yet match consistently is the depth of warehouse-native optimization (clustering, micro-partitioning, query-plan caching, materialized view maintenance) that proprietary warehouse engines have spent two decades developing. Performance parity exists for many workloads, particularly large scan-heavy aggregations; it is less consistent for complex joins, point lookups, and small-result-set queries against large tables. The category is moving fast and 2026 isn't 2024, but the gap should be evaluated against actual workloads rather than vendor benchmark claims.

Comparison along key axes

Axis	Data warehouse	Data mart	Data lake	Data lakehouse
Data shape	Structured, modeled (star, snowflake, or data vault)	Subset of warehouse, modeled for one domain	Raw or semi-structured, schema-on-read	Structured at the table-format layer, on raw storage
Storage layer	Proprietary columnar managed by the warehouse	Same as parent warehouse	Object storage (S3, GCS, ADLS)	Object storage plus table-format metadata
Schema management	Schema-on-write, enforced at load	Schema-on-write, inherited from warehouse	Schema-on-read, defined per consumer	Schema-on-write, enforced by table format
Query engine	Warehouse-native (Snowflake, BigQuery, Redshift, Databricks SQL)	Same as parent warehouse	Spark, Presto, Trino, Dremio, Athena, external tables	Warehouse engines plus lake engines, on the same tables
Transactions	ACID	ACID	Not native; relies on application discipline	ACID via the table format
Time travel and versioning	Limited; proprietary	Limited; inherited	Not without table-format layer	First-class; the table format provides it
Governance	Strong; access control and lineage are warehouse-native	Inherits from parent warehouse	Weak by default; delegated to applications and catalogs	Variable; improving as catalogs (Unity, Polaris, Glue) mature
Typical workloads	Governed BI, dashboards, financial reporting	Departmental BI on one subject area	ML training, exploratory analytics, log analysis, retention	BI and ML on the same storage, when the team can manage table-format metadata
Time to first value	Weeks to months	Hours to days, on top of an existing warehouse	Hours to weeks for ingestion; weeks to months for analytical usability	Variable; depends on table-format and catalog tooling maturity
Cost model	Storage plus compute, often separately metered	Subset of warehouse cost	Object storage cheap; compute variable	Object storage plus engine compute

The axes don't all weigh equally. Three deserve specific attention because they're the ones most often misjudged at platform-selection time.

Schema management is the deepest divide. Schema-on-write costs effort up front and prevents an entire class of analytical errors. Schema-on-read defers the cost to every consumer, which is fine for one or two analysts running ad-hoc queries and breaks down when the data has to support governed reporting. Lakehouses get this right because the table format enforces schema even though the data lives in lake-style storage. Lakes without table formats put the schema burden everywhere, including in places where it doesn't get enforced.

Governance is where lakes fail quietly. A warehouse's governance model (row-level access, column masking, lineage from source to report) is built into the platform. A bare lake delegates this to applications and external catalogs, and the typical failure mode is that some applications enforce policies that others don't. The lakehouse pattern improves this, but the maturity of the surrounding catalog tooling matters more than the table format does.

Cost economics differ in shape, not just magnitude. Warehouse cost is roughly linear in compute usage with relatively predictable storage cost. Lake cost is dominated by object storage (cheap) plus the compute used to query it (highly variable depending on access pattern and engine). Lakehouse cost depends on which engines are accessing the data and whether the table-format metadata is being maintained efficiently. Teams that select a platform on advertised storage-cost-per-TB without modeling the compute pattern routinely discover that their actual bills don't match either the warehouse-vendor or lake-vendor pitch.
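A back-of-the-envelope model makes the point about compute pattern concrete. The function below is a deliberately crude sketch: it assumes compute is billed per terabyte scanned (some engines bill by compute time instead), and every number passed to it is a made-up placeholder rather than a vendor price.

```python
def monthly_cost(stored_tb: float, price_per_tb_month: float,
                 queries_per_month: int, tb_scanned_per_query: float,
                 price_per_tb_scanned: float) -> float:
    """Storage cost plus scan-based compute cost for one month."""
    storage = stored_tb * price_per_tb_month
    compute = queries_per_month * tb_scanned_per_query * price_per_tb_scanned
    return storage + compute


# All numbers below are placeholders for illustration, not real prices.
STORED_TB = 100
unpruned = monthly_cost(STORED_TB, 20.0, 1_000, 10.0, 5.0)  # each query scans 10 TB
pruned = monthly_cost(STORED_TB, 20.0, 1_000, 0.5, 5.0)     # pruning cuts scans to 0.5 TB
print(unpruned, pruned)  # 52000.0 4500.0
```

Storage is the same 2,000 in both cases. The difference of more than tenfold comes entirely from how much data each query has to scan, which is the term a storage-price comparison leaves out.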

[Diagram: the relationships between the four architectures.]

Decision criteria

The four don't compete one-for-one. The honest decision is usually about which combination fits the workload, not which single one wins. A practical set of conditional rules:

Choose a data warehouse when the workload is structured BI on integrated historical data and the analytical questions are reasonably stable. This is the production default for finance, sales analytics, operational reporting, and most governed metric layers. The integration and modeling cost pays back at every subsequent query.

Add data marts when departments need their own focused view and the cost of cross-departmental schema coordination exceeds the cost of duplicating some content downstream. Marts work best when the conformed-dimension discipline is real and enforced; without it, marts produce the cross-departmental reporting inconsistencies they were supposed to prevent.

Add a data lake when the data is too varied, too semi-structured, or too voluminous for the warehouse to host economically. ML training data, log archives, raw event streams, and unstructured content are the canonical cases. The lake is not a cheap replacement for the warehouse; it is a different tool for a different job. Treating it as a cheap warehouse is one of the more expensive mistakes in this space.

Reach for a lakehouse when both warehouse and lake workloads need to run on the same physical data, with consistent governance, and the team has the capacity to manage table-format metadata. The combined-storage proposition is real and increasingly mature, but the operational discipline isn't free. Teams that want lakehouse benefits without owning the metadata layer tend to discover that the catalog, the compaction jobs, the schema-evolution policies, and the multi-engine consistency are real ongoing work.

Avoid choosing one because it's newer. The lakehouse is the most recent architectural pattern but is not strictly an upgrade to the warehouse for every workload. A team running stable governed BI on Snowflake or BigQuery doesn't gain much by moving to a lakehouse unless they have lake-style workloads on the same data. Conversely, a team with significant unstructured workloads next to BI has more reason to consolidate on lakehouse architecture than to operate a warehouse and a lake side by side.

Beware vendor-driven category framing. Each of the major platforms has a commercial interest in framing the comparison around its strengths. Warehouse-first vendors describe the lakehouse as a workaround for not having a real warehouse; lake-first vendors describe the warehouse as an obsolete category that the lakehouse subsumes. Neither framing is honest. The technologies have different cost profiles, different governance models, and different operational disciplines, and the choice depends on the workload rather than on the vendor's positioning.
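The conditional rules above can be compressed into a small function, mostly as a way to check that they compose rather than compete. This is a simplification under stated assumptions: the `Workload` fields and the `recommend` function are hypothetical, each rule is reduced to a yes/no flag, and a real decision would weigh cost, team skills, and existing platforms that no boolean captures.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    governed_bi: bool             # structured BI on integrated historical data
    departmental_views: bool      # departments need their own focused slice
    raw_or_unstructured: bool     # ML training data, logs, events, media
    shared_physical_data: bool    # BI and lake workloads must use the same copy
    can_own_table_metadata: bool  # team can run compaction, catalog, schema policy


def recommend(w: Workload) -> list:
    """Return the combination of architectures the rules suggest."""
    wants_both = w.governed_bi and w.raw_or_unstructured
    if wants_both and w.shared_physical_data and w.can_own_table_metadata:
        parts = ["lakehouse"]
    else:
        parts = []
        if w.governed_bi:
            parts.append("warehouse")
        if w.raw_or_unstructured:
            parts.append("lake")
    # Marts are a packaging of governed content, so they only apply with BI.
    if w.departmental_views and w.governed_bi:
        parts.append("marts")
    return parts
```

A team with both kinds of workload on the same data but no capacity to own the metadata layer gets `["warehouse", "lake"]` rather than `["lakehouse"]`, which mirrors the caution in the lakehouse rule.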

Where each fails quietly

The clearest signals are the failure modes, because vendor marketing tends to be honest about the upside and silent about the downside. Three failure patterns are worth knowing before committing.

Warehouses fail quietly when the cost of getting data in exceeds the value of querying it. A warehouse built around a heavy ETL pipeline that requires every new source to be modeled before it can be analyzed makes some workloads (early-stage product analytics, ad-hoc data science) impractically slow to start. The right response is usually to route those workloads to a lake or a lakehouse alongside the warehouse, not to load every new source into the warehouse anyway.

Lakes fail quietly when governed analytics start running on them without governance. A team that uses a lake for ML training and then starts exposing the same data to BI dashboards typically discovers that the schema, access controls, and metric consistency the warehouse used to enforce are now distributed across whichever applications query the lake. The data warehouse pillar covers some of these governance dynamics; the lake equivalent is harder because the responsibility for governance has no clear home.

Lakehouses fail quietly when the metadata layer is treated as set-and-forget. Iceberg, Delta, and Hudi all require ongoing maintenance: snapshot expiration, compaction of small files, schema-evolution policy, metastore consistency across engines. Teams that adopt the table format without budgeting for that maintenance discover the metadata bloat, the small-file performance degradation, and the cross-engine consistency drift the hard way, typically when query performance degrades or a write from one engine produces results another engine can't read.
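Small-file compaction, one of the maintenance jobs listed above, is at heart a grouping problem. The sketch below is a toy planner, not the algorithm any table format or engine uses: `plan_compaction`, its thresholds, and the file names are hypothetical, and a real job would also rewrite the data and commit a new snapshot.

```python
def plan_compaction(file_sizes_mb: dict, target_mb: int = 128,
                    small_threshold_mb: int = 32) -> list:
    """Group small files into batches whose combined size stays within target_mb."""
    small = sorted((size, name) for name, size in file_sizes_mb.items()
                   if size < small_threshold_mb)
    batches, current, current_size = [], [], 0
    for size, name in small:
        # Start a new batch when adding this file would exceed the target size.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    # Rewriting a single file gains nothing, so keep only batches of two or more.
    return [batch for batch in batches if len(batch) > 1]


files = {"a": 10, "b": 20, "c": 30, "d": 200, "e": 5}
plan = plan_compaction(files, target_mb=40, small_threshold_mb=32)
print(plan)  # [['e', 'a', 'b']]
```

The large file `d` is left alone, and `c` is skipped because it would be rewritten on its own. Running something like this on a schedule is the budgeted maintenance that set-and-forget adoption leaves out.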

The data warehouse pillar covers what warehouses do in depth, including the dimensional modeling and loading patterns this comparison touches on. The modern warehouse platforms pillar covers the cloud columnar warehouse engines (Snowflake, BigQuery, Redshift, Databricks SQL) and the architectural decisions specific to each. The warehouse loading and operations pillar covers the ETL/ELT loading patterns that move data into these architectures and the operational disciplines that keep them current. For definitional anchors, see the glossary entries for data lake, data mart, data lakehouse, Apache Iceberg, and Delta Lake.

Reference

The four-way distinction is recent enough that no single canonical source covers it, but the component frameworks are well established.

Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed., Wiley, 2013. The foundational treatment of the warehouse and mart patterns, including conformed dimensions across marts.
Apache Iceberg specification. The open table format that gives lakes warehouse-style consistency, increasingly the de-facto interoperability layer between engines.
Delta Lake documentation. The table format developed at Databricks and the basis of the original lakehouse architecture publications.
Apache Hudi documentation. The third major open table format, with emphasis on streaming ingestion and record-level updates.