Modern warehouse platforms

Platform selection is treated as a procurement decision more often than it should be. The choice between Snowflake, BigQuery, Redshift, and Databricks has architectural consequences that compound over years: the performance characteristics you tune around, the cost model that shapes your infrastructure decisions, the workload isolation patterns available when demand spikes, and the degree to which the platform constrains or enables the data modeling approaches you want to use. Getting this wrong is recoverable, but migration is expensive enough that it's worth making the initial choice deliberately.

The data warehouse fundamentals pillar covers the shift from on-premises appliances to cloud warehouse architecture, storage-compute separation, and columnar storage. This page takes those as given.

The commonalities between platforms matter because they determine what the choice actually affects.

Every major cloud warehouse platform stores data in columnar format in object storage, separates storage costs from compute costs, and supports standard SQL for analytical queries. Dimensional models, surrogate keys, SCD handling, incremental loading, and governed schemas work the same way regardless of which platform hosts them. A team moving from Redshift to Snowflake carries their warehouse design with them. The SQL may need minor adjustments for dialect differences, but the model is unchanged.

Performance tuning is where platforms diverge most sharply from each other, but the direction of the divergence is consistent: all of them reward the same fundamental query disciplines (predicate pushdown, partition pruning, avoiding full scans on large tables, materializing frequently repeated computations) even though the specific mechanisms differ. Learning to tune on one platform makes learning to tune on another considerably faster.

The platform is not the bottleneck in most analytical workloads. Data quality, modeling decisions, and query design matter more than which cloud database engine is underneath. Teams that change platforms hoping to solve a modeling or data quality problem tend to find the problem followed them.

Snowflake

Snowflake's architecture separates storage and compute more completely than any other major platform. Data lives in Snowflake's managed object storage layer. Queries run on virtual warehouses, which are independent compute clusters that can be sized, started, stopped, and scaled without any relationship to the storage layer. Multiple virtual warehouses can query the same data simultaneously without interfering with each other.

This architecture's most useful property is workload isolation. A reporting virtual warehouse and an ELT loading virtual warehouse can run concurrently against the same tables without resource contention. A BI dashboard query that's running a heavy scan doesn't compete with the load job that's inserting new records. Teams can size each virtual warehouse for its workload independently, and each starts and stops on its own schedule, with costs accruing only while it's running.

Snowflake stores data in micro-partitions, which are small compressed columnar files with automatically maintained metadata about the range of values they contain. When a query includes a predicate that can be evaluated against the micro-partition metadata, Snowflake can skip entire micro-partitions without reading them. This is called micro-partition pruning, and it is the primary performance mechanism for selective queries. Clustering keys allow explicit control over which column or columns Snowflake uses to physically organize micro-partitions, which improves pruning effectiveness on tables that are frequently queried on a particular column that isn't naturally ordered in the data. Clustering is worth configuring on very large tables with predictable query patterns; it adds maintenance overhead and cost and is unnecessary on smaller tables or tables with unpredictable query patterns.

Snowflake's Time Travel feature retains previous versions of table data for a configurable retention period, defaulting to one day on standard tiers and up to 90 days on enterprise tiers. This means a query can ask "what did this table look like yesterday" without maintaining a separate historical copy. Time Travel is useful for debugging load pipelines, recovering from accidental modifications, and auditing data at points in the past. The Fail-safe period extends retention further for disaster recovery purposes, though data in Fail-safe can only be recovered by Snowflake support rather than by user queries.

Snowflake's Data Sharing capability allows data to be shared across accounts, including accounts owned by external organizations, without copying the underlying data. The recipient queries the shared data through their own virtual warehouse; the provider controls access and the data remains in the provider's storage. For organizations that distribute data to partners, customers, or subsidiaries, this removes the overhead of maintaining synchronized copies.

The cost model charges separately for storage and compute. Storage costs are low and predictable. Compute costs scale directly with the size and run duration of virtual warehouses. The cost model rewards turning warehouses off when not in use, which Snowflake supports through auto-suspend and auto-resume configuration. Warehouses that run continuously when they should be idle are the most common source of unexpectedly large Snowflake bills. The other common source is warehouse sizing: running a larger virtual warehouse than the workload requires wastes money, but running a warehouse that's too small causes queries to queue or spill to disk, which can make query times worse than simply using a larger warehouse.

BigQuery

BigQuery's architecture is serverless: there are no clusters to provision, size, or manage. Query execution is handled by Google's shared infrastructure, which allocates resources to queries dynamically based on their requirements and the available capacity. From the user's perspective, queries are submitted and return results; the infrastructure management is invisible.

This architecture makes BigQuery exceptionally easy to operate. There is no cluster configuration, no scaling decision, and no idle cluster consuming costs. It is also the architecture that makes cost behavior the hardest to predict when query patterns are variable or exploratory. Under BigQuery's on-demand pricing model, queries are billed by the volume of data they scan, measured in terabytes. A well-optimized query that scans a small partition costs very little. An exploratory query that scans a large table without partition pruning can generate a significant cost in seconds. Analysts who are accustomed to running ad-hoc queries without cost consequences need to adjust.

Capacity pricing is the alternative. Rather than paying per byte scanned, capacity pricing reserves a fixed number of processing slots for a monthly or annual commitment. Under capacity pricing, query costs become predictable regardless of what gets run against the platform. The trade-off is that capacity reservations require commitment and may be underutilized during periods of low query volume. Organizations with predictable, high-volume analytical workloads generally find capacity pricing more economical than on-demand. Organizations with variable or exploratory workloads may find on-demand pricing more appropriate, provided they implement cost controls such as query cost estimates, maximum bytes billed settings, and BI tool query optimization.

BigQuery partitions tables by a date or timestamp column, or by integer range. Queries that include a predicate on the partition column scan only the relevant partitions rather than the entire table. Clustering, which in BigQuery is distinct from and complementary to partitioning, physically sorts data within each partition by one or more specified columns, improving scan efficiency for queries that filter on those columns. The combination of partitioning and clustering is BigQuery's primary performance and cost optimization mechanism. Tables that aren't partitioned and clustered effectively are the most common source of both slow queries and unexpectedly high costs.

BigQuery's slot-based execution model means concurrency is managed differently than in cluster-based systems. On-demand queries compete for slots from Google's shared pool. Capacity reservations guarantee a slot allocation that is available exclusively to the reserved workload. Within a reservation, workload management rules can allocate slots to different projects or teams, providing isolation between workloads without the explicit cluster management that Snowflake's virtual warehouses require.

BigQuery Omni extends BigQuery's query engine to data stored in AWS and Azure object storage, allowing SQL queries that span data across cloud providers without moving the data. This is architecturally interesting for organizations with multi-cloud data estates, though the cross-cloud query latency and data governance implications require evaluation before treating it as a general-purpose integration mechanism.

Design-time AI.

Deterministic runtime.

AI helps you build. Production runs deterministic SQL on your warehouse. No LLM calls at runtime.

See a demo

Redshift

Redshift's architecture predates the serverless and fully decoupled storage-compute models of Snowflake and BigQuery. In its original form, Redshift is a provisioned cluster: a set of nodes that each store a portion of the data and contribute compute to query execution. The cluster is sized at provisioning time, and resizing it requires a migration operation that takes time and affects availability. This design made Redshift a significant advance over on-premises appliances when it launched, but creates operational overhead that the newer platforms don't impose.

Redshift's distribution styles control how data is physically spread across cluster nodes. Choosing the right distribution style for each table affects join performance substantially: tables joined frequently should be distributed on the join key so that matching rows are co-located on the same nodes, avoiding data redistribution during query execution. Even distribution spreads rows evenly across nodes for scan-heavy workloads that don't need join co-location. Key distribution and sort key selection are Redshift-specific tuning decisions that have no direct equivalent on Snowflake or BigQuery, and they represent significant operational expertise to apply correctly.

RA3 node types, introduced in 2019, were the intermediate step toward storage-compute decoupling on provisioned Redshift. RA3 nodes use managed storage where active data sits on local SSD and cold data lives in a managed Redshift Managed Storage layer backed by S3, with movement between the two handled automatically. This decoupled the cluster's compute sizing from its data volume, which classic dense-compute and dense-storage nodes had bound together. Most provisioned Redshift work in 2020-onward is on RA3.

Redshift Serverless, introduced in 2021 and since matured, takes the decoupling further. Capacity scales automatically based on workload demand, eliminating the cluster provisioning and resizing concern entirely. Redshift Serverless charges by the RPU-hour, where RPU is Redshift Processing Units, a capacity abstraction similar in spirit to BigQuery's slots. The serverless model has made Redshift's operational profile considerably closer to BigQuery and Snowflake for workloads that fit it, though the underlying distribution and sort key concepts still apply to performance tuning.

Redshift's deepest advantage is integration with AWS services. Organizations with data pipelines built on AWS Glue, EMR, Kinesis, and Lambda, and analytical workloads that reference data in S3 through Redshift Spectrum, often find Redshift the lowest-friction option because it fits into existing AWS operational patterns. IAM, VPC, CloudWatch, and cost allocation all integrate with Redshift in ways that require more configuration to replicate on Snowflake or BigQuery. For teams deeply embedded in AWS, that integration is worth weighing against the newer platforms' architectural advantages.

Databricks and the lakehouse pattern

Databricks occupies a different position from the three platforms above. It is not primarily a warehouse platform; it is a unified analytics platform built on Apache Spark, with warehouse-style SQL querying added through Databricks SQL. Understanding what it is and what it's for requires understanding the lakehouse pattern it implements.

A lakehouse stores data in open file formats, primarily Apache Parquet, in object storage, with a table format layer that adds database-like properties: ACID transactions, schema enforcement, versioning, and efficient query planning. The two dominant open table formats are Apache Iceberg and Delta Lake. Delta Lake was developed by Databricks and is the native format on the Databricks platform; it has been an open-source project under the Linux Foundation since 2019, and Databricks-led Delta UniForm now provides Iceberg-format read access to Delta tables, blurring the historical "pick one" framing. Iceberg has the broader cross-platform integration story, with native support in BigQuery, Snowflake, Redshift, and most other major platforms. Both formats are converging on similar capabilities; the choice between them increasingly depends on which other tools and platforms in your stack already support one or the other rather than on a meaningful capability gap.

The lakehouse architecture's core argument is that separating storage from the processing engine allows organizations to avoid proprietary lock-in: the data lives in open formats in object storage that any compliant engine can read. Databricks SQL can run warehouse-style queries against the same tables that Spark ML jobs use for training, that streaming pipelines write to, and that ad-hoc Python notebooks explore. The unification of storage means data doesn't need to be copied between a data lake and a warehouse to serve different analytical use cases.

Databricks SQL provides ANSI SQL query execution with performance competitive with purpose-built warehouse platforms on most workload types, particularly since the introduction of Photon, Databricks' native vectorized query engine. The industry benchmark exchanges of 2021 and 2022, when Databricks and Snowflake each published TPC-DS results claiming wins on the same workload depending on configuration, were the visible evidence that the performance gap had narrowed to "depends on the workload" rather than a clear order. The warehouse platforms still retain advantages in specific workload types, particularly highly concurrent BI query workloads.

The practical case for Databricks over a purpose-built warehouse platform is strongest when the workload mix includes significant machine learning, streaming data processing, and large-scale data transformation alongside relational analytical queries, and when avoiding proprietary data format lock-in is an organizational priority. Organizations doing primarily relational analytics with standard BI tooling typically find the purpose-built warehouse platforms simpler to operate and better optimized for their specific workload.

Unity Catalog, Databricks' unified governance layer, provides a common metadata and access control model across all Databricks workloads: SQL, notebooks, ML experiments, and streaming jobs. It also supports external data sources including data managed in other platforms, positioning it as a cross-platform governance layer rather than just a Databricks-specific catalog.

The four platforms compared

The architectural axes that recur in platform-selection conversations:

Platform	Storage / compute	Concurrency mechanism	Pruning structure	Cost model	Native table format
Snowflake	Fully decoupled; virtual warehouses on shared storage	Per-warehouse workload isolation	Micro-partition metadata + clustering keys	Compute credits + storage GB-month	Snowflake micro-partitions (Iceberg read/write supported)
BigQuery	Serverless; no clusters	Slot-based, shared on-demand or reserved capacity	Partition column + clustering	On-demand bytes scanned or reserved slots	BigQuery native (Iceberg read/write supported)
Redshift	Provisioned cluster (RA3/dense) or Redshift Serverless (RPUs)	Cluster-shared or per-workgroup	Distribution key + sort key	Node-hour or RPU-hour + storage	Redshift-managed (Iceberg via Spectrum)
Databricks	Decoupled; open formats in object storage	SQL warehouses, classic or serverless	File-level statistics + Z-order clustering	DBUs (compute time) + storage	Delta Lake (Iceberg via UniForm)

The differences compound at scale. A workload that fits one platform's pruning structure and concurrency model will fit another's badly without significant rework, even though the SQL surface looks similar.

Open table formats

Apache Iceberg and Delta Lake deserve attention beyond their role in the Databricks discussion, because they have changed the platform question in a meaningful way.

Both formats provide ACID transaction support, schema evolution, time travel, and efficient query planning on data stored in object storage as Parquet files. More importantly, both are readable by multiple query engines. An Iceberg table can be queried by Snowflake, BigQuery, Redshift, Spark, Trino, Flink, and DuckDB, using the same physical data files. The data is decoupled from the query engine that created it, which proprietary formats don't allow.

Organizations adopting open table formats retain the ability to switch query engines or run multiple engines against the same data without ETL or copying. A team that stores their analytical data in Iceberg tables in S3 can run Snowflake for BI workloads, Spark for ML, and Trino for ad-hoc exploration against the same physical data. The table format is the portability layer.

Snowflake, BigQuery, and Redshift all support reading from external Iceberg tables. Snowflake additionally supports Iceberg as a native storage format for tables that live in Snowflake's architecture but are stored as Iceberg files in the customer's own object storage bucket, providing both the query performance and operational simplicity of Snowflake and the portability of open format storage. BigQuery has similar capabilities through BigLake.

For teams making platform decisions in 2026, the open table format question is worth separating from the query engine question. Choosing a query engine doesn't necessarily mean accepting proprietary storage lock-in. Choosing to store data in Iceberg from the start preserves optionality about which engines access it.

Platform selection criteria

The decision framework that actually matters for most teams comes down to five considerations.

Most teams, if they're honest about it, will find that cloud provider alignment decides the question before anything else does. Organizations deeply invested in AWS, Google Cloud, or Azure have infrastructure, security, identity management, and billing patterns that favor the native option or the platform with the deepest integration. The organizational cost of managing cross-cloud dependencies is often underestimated. Teams that are genuinely cloud-agnostic have more flexibility, but they're a minority.

Assuming cloud alignment doesn't eliminate the options, workload character is the next filter. Highly concurrent BI workloads with many simultaneous users favor Snowflake's workload isolation model. Serverless simplicity with variable query volume favors BigQuery. Deep AWS integration with join-heavy transformations may favor Redshift. Mixed workloads combining ML, streaming, and SQL analytics favor Databricks. No platform is universally best; the workload composition determines where each platform's advantages are actually relevant.

Cost model fit matters more than list price, and the two are often very different numbers. Snowflake's compute-while-running model is economical for bursty workloads and expensive for always-on workloads where virtual warehouses run continuously. BigQuery's on-demand pricing is economical for light or variable query volumes and expensive for heavy, unpredictable scanning. Redshift's provisioned model is economical for stable, predictable workloads and inflexible for workloads that vary significantly. Model the expected cost under each platform's pricing structure against your actual query and load patterns before committing.

A platform the team can operate confidently is worth more than one that would be theoretically optimal if staffed by a larger, more specialized team. Snowflake is operationally simpler than provisioned Redshift for most teams. BigQuery is simpler still but requires more cost management discipline. Databricks requires Spark expertise for anything beyond SQL workloads.

Portability requirements are worth evaluating explicitly rather than assuming they won't matter later. Proprietary storage formats create switching costs that accumulate quietly. Open table formats reduce them. Organizations in regulated industries where data portability is a compliance consideration should treat this as a first-class decision criterion rather than a footnote.

Practical guidance

Platform decisions go wrong in predictable ways. These are the ones worth knowing before the contract is signed.

Don't select on benchmark performance. Every major platform has benchmark results that flatter it under the right conditions. Benchmarks don't represent your specific workload, your specific data distributions, or your team's specific query patterns. Run your own queries against a representative data sample on any platform you're seriously evaluating before committing.

Model the cost before the contract. Take three months of your actual query and load patterns and calculate what they would cost under each platform's pricing model. Include idle time in the Snowflake calculation. Include data scan volume in the BigQuery calculation. Include node-hours in the Redshift provisioned calculation. The platform with the lowest list price is often not the one with the lowest actual cost.

Treat the platform as infrastructure, not as a solution. The warehouse platform is the substrate on which dimensional models, load pipelines, and analytical layers run. Switching platforms doesn't fix a bad dimensional model or a poorly governed load pipeline. Before attributing warehouse performance or quality problems to the platform, verify that the underlying data model, query design, and loading discipline are sound.

Establish cost controls before analysts start querying. On platforms with usage-based pricing, particularly BigQuery on-demand, setting up cost alerts, maximum bytes billed limits, and query cost visibility before general analyst access is enabled prevents cost surprises. These controls are much harder to establish retroactively after analysts have developed habits around unrestricted query access.

Plan the migration path before you need it. Even if there's no current intention to migrate, knowing what it would take to move to a different platform, which objects would need to be recreated, which SQL would need to be ported, and how long it would take, informs how tightly to couple the data layer to platform-specific features. Coupling tightly to features that don't have equivalents on other platforms increases switching costs over time.

The data warehouse fundamentals pillar covers the architectural basis for cloud warehouse platforms. The warehouse loading and operations pillar covers the loading patterns that apply on all platforms. Snowflake, BigQuery, Redshift, Apache Iceberg, and Delta Lake all have glossary entries.

Modern warehouse platforms

Snowflake

BigQuery

Redshift

Databricks and the lakehouse pattern

The four platforms compared

Open table formats

Platform selection criteria

Practical guidance

Warehouse Fundamentals

Dimensional Modeling

Data Vault Modeling

Loading and Operations

Warehouse Automation

Analytics Modeling

Agentic Warehouse Engineering

What the major platforms share

Snowflake

BigQuery

Redshift

Databricks and the lakehouse pattern

The four platforms compared

Open table formats

Platform selection criteria

Practical guidance

Related content

Warehouse Fundamentals

Dimensional Modeling

Data Vault Modeling

Loading and Operations

Warehouse Automation

Analytics Modeling

Agentic Warehouse Engineering