Data virtualization: federated query in modern stacks

Data virtualization is the technique of exposing a unified query surface over data that physically lives in multiple separate systems, without copying it into a single store first. A query against the virtualization layer is decomposed into sub-queries that run against the underlying sources, the results are joined and aggregated by the virtualization engine, and the caller sees a single tabular answer. The data warehouse fundamentals pillar introduces the warehouse as a centralized, integrated, governed store; virtualization is the architectural counterweight to that pattern, useful precisely where centralization is impractical, expensive, or premature. This article covers the mechanics, the production-grade engines that have absorbed the technique under the federated-query banner, where virtualization fits in a 2026 stack, and the failure modes that determine whether it works.

TL;DR. Data virtualization in 2026 is mostly federated query, executed by engines like Trino, Presto, Starburst, Dremio, Athena, BigQuery's external tables, or Snowflake's external tables and Iceberg integration. It is a real tool for cross-source analytics, lake exploration, and the logical data warehouse pattern. It is not a replacement for materialized integration when query latency, governance, or cost predictability matter, and the lakehouse table-format era has reduced the cases where virtualization is the only answer.

What data virtualization actually does

A data virtualization layer presents a logical schema, typically SQL-addressable, that maps to physical tables, files, APIs, or services held in separate underlying systems. When a query arrives, the engine performs four steps.

First, it parses the query against the logical schema and produces an execution plan. The plan identifies which underlying sources hold the data the query touches and how their contributions must be joined and aggregated to produce the result.
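The planning step can be sketched roughly as a lookup from logical tables to the sources that hold them. This is a minimal illustration, not any engine's planner: the catalog contents and the names `CATALOG` and `plan_sources` are hypothetical.

```python
from collections import defaultdict

# Hypothetical logical catalog: logical table -> (source system, physical name).
CATALOG = {
    "customers": ("postgres_crm", "public.customers"),
    "orders": ("warehouse", "analytics.orders"),
    "clicks": ("lake_iceberg", "events.clicks"),
}

def plan_sources(logical_tables):
    """Group the logical tables a query touches by the source that holds them."""
    by_source = defaultdict(list)
    for table in logical_tables:
        if table not in CATALOG:
            raise KeyError(f"unknown logical table: {table}")
        source, physical = CATALOG[table]
        by_source[source].append(physical)
    return dict(by_source)

# A query touching two logical tables resolves to two separate sources.
plan = plan_sources(["customers", "orders"])
```

A real planner also decides join order and which operations each source can evaluate; the sketch covers only source resolution.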

Second, it pushes work down. Predicates, projections, and aggregations that can be evaluated by the source system are rewritten into the source's native query language and sent there. A Postgres source receives SQL filtered to the columns and rows the query needs; a REST API receives a parameterized request rather than a bulk extract. Pushdown is the single most consequential design feature in a virtualization engine; without it, every query degenerates into "pull all source data into the engine, then filter," which does not scale beyond demonstrations.
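To make the difference concrete, the sketch below compares the two strategies against an in-memory SQLite database standing in for a remote source. The table, the data, and the function names are made up for illustration; a real connector would generate the source query from the plan rather than hard-code it.

```python
import sqlite3

# In-memory SQLite stands in for a remote source such as Postgres (assumption).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "EU" if i % 4 == 0 else "US", float(i)) for i in range(1000)],
)

def without_pushdown(conn):
    """Pull every row and column, then filter inside the 'engine'."""
    rows = conn.execute("SELECT * FROM orders").fetchall()
    kept = [(r[0], r[2]) for r in rows if r[1] == "EU"]
    return kept, len(rows)  # result, rows transferred from the source

def with_pushdown(conn):
    """Send the predicate and projection to the source; only matches return."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE region = ?", ("EU",)
    ).fetchall()
    return rows, len(rows)  # result, rows transferred from the source

slow_result, slow_transferred = without_pushdown(source)
fast_result, fast_transferred = with_pushdown(source)
```

Both functions produce the same answer, but the pushdown version transfers only the matching rows and the two requested columns.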

Third, it executes whatever remains in the virtualization engine itself. Cross-source joins, final aggregations, and computations the sources cannot evaluate run in the engine's compute layer. Modern engines (Trino, Presto, Starburst, Dremio, Athena, BigQuery, Snowflake) execute this layer with distributed columnar processing similar to a cloud warehouse.
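One way to picture the engine-side work is a hash join over rows that have already come back from two sources. The sketch below is single-process and in-memory, whereas production engines distribute this across workers; the data and the `hash_join` helper are hypothetical.

```python
def hash_join(small_rows, large_rows, small_key, large_key):
    """Inner-join two lists of dicts by hashing the smaller input."""
    index = {}
    for row in small_rows:
        index.setdefault(row[small_key], []).append(row)
    joined = []
    for row in large_rows:
        for match in index.get(row[large_key], []):
            joined.append({**match, **row})
    return joined

# Rows as they might arrive from two different sources (made-up data).
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 40.0},
    {"order_id": 11, "customer_id": 1, "amount": 15.0},
    {"order_id": 12, "customer_id": 3, "amount": 99.0},
]

joined = hash_join(customers, orders, "customer_id", "customer_id")

# Final aggregation also runs in the engine, after the cross-source join.
total_by_name = {}
for row in joined:
    total_by_name[row["name"]] = total_by_name.get(row["name"], 0.0) + row["amount"]
```

Building the hash table on the smaller input is the design choice that matters here: it bounds the memory the join needs by the size of the small side.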

Fourth, it caches selectively. Result caches, materialized views, and intermediate result sets allow repeated queries to skip the source round trip. Caching policy is where the line between "virtualization" and "materialized integration" gets blurry in practice: aggressive caching against slow sources eventually approximates a stale data warehouse, and the operational concerns converge.
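A result cache with a time-to-live is the simplest form of the caching described above. The sketch below assumes results are keyed by query text and that a fixed TTL is acceptable; the class name and the injectable clock are illustrative choices, not a particular engine's design.

```python
import time

class TTLResultCache:
    """Cache query results by query text for a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # query text -> (expires_at, result)

    def get_or_run(self, query, run_query):
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: no source round trip
        result = run_query(query)  # miss or expired: go back to the source
        self._entries[query] = (now + self.ttl, result)
        return result

# Usage with a fake clock and a fake source that counts round trips.
fake_now = [0.0]
calls = []

def fake_source(query):
    calls.append(query)
    return len(calls)

cache = TTLResultCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_run("SELECT 1", fake_source)
second = cache.get_or_run("SELECT 1", fake_source)  # served from cache
fake_now[0] = 61.0
third = cache.get_or_run("SELECT 1", fake_source)   # expired, re-run
```

The TTL is the staleness budget: a longer value saves more source round trips and moves the layer closer to the stale-warehouse behavior the paragraph describes.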

The defining architectural commitment is that the source remains the system of record. The virtualization layer does not own a copy. When the source updates, the next query reads the updated value. When the source is unavailable, the query fails or returns partial results. This property is what makes virtualization useful for real-time access, and it is also the root of every operational concern that follows.

Federated query: how the technique shipped in modern stacks

The vocabulary shifted in the cloud era. The category that called itself "data virtualization" through the 2010s overlaps almost completely with what cloud-native engines call "federated query" or "external tables" today. The mechanics are the same; the packaging changed.

Trino, Presto, and Starburst are the canonical federated-query engines. Trino (originally PrestoSQL, forked from Presto in 2018) is the open-source engine; Starburst is the commercial distribution with additional connectors, security, and operational tooling. Trino and Starburst both expose dozens of connectors (Postgres, MySQL, Kafka, MongoDB, Elasticsearch, S3, Iceberg, Delta, Hive, Snowflake, BigQuery, plus the long tail) and execute distributed queries that span them. A SELECT joining a Kafka topic to a Postgres table and an Iceberg table on S3 is a single query, planned and executed as one.

Dremio is the federated-query engine with a stronger focus on the lake side. Its bet is that Apache Iceberg and Apache Arrow give a virtualization layer most of the performance characteristics of a warehouse while keeping data in open formats on object storage. Reflections (Dremio's materialized-view abstraction) sit between pure virtualization and persistent materialization.

Athena is AWS's managed Trino-derivative for querying data in S3 (plus federated connectors to other AWS services and external sources). The execution model is virtualization against object storage; the developer experience is closer to a serverless warehouse.

BigQuery external tables and Snowflake external tables are the major cloud warehouses' federated-query primitives. They expose object-storage data and, increasingly, tables managed by other engines (BigLake federates Iceberg tables across catalogs; Snowflake reads Iceberg tables managed by an external catalog). The mental model is "a regular warehouse table whose storage happens to live somewhere the warehouse does not own." Pushdown still happens at the storage layer; cross-system joins between, say, a BigQuery table and a federated Postgres source are supported but slower than queries that stay within native warehouse storage.

These engines are what production virtualization looks like in 2026. The standalone "data virtualization platform" market still exists (Denodo is the most visible example), and the dedicated category retains an edge in some enterprise scenarios: many built-in connectors to legacy and packaged-application sources, semantic-layer features, and security models designed for mixed-source governance. Whether that edge justifies a dedicated platform over a federated-query engine depends on the existing stack rather than on a single technical answer.

Where virtualization fits in a 2026 stack

Three patterns account for most production virtualization in current architectures.

Lake exploration and ad-hoc analytics. Object storage holds large volumes of raw and semi-structured data that would be expensive to land in a warehouse and may not need to be there long-term. A federated query engine reads Iceberg or Delta tables, plus Parquet and JSON files, plus the warehouse for the modeled data, in a single query surface. Analysts and data scientists run exploratory queries without an ETL prerequisite. This is the case Athena, Trino, and BigQuery external tables most clearly serve.

The logical data warehouse. Some organizations cannot or will not consolidate analytical data into a single physical warehouse. Regulatory constraints, acquisitions, departmental autonomy, and source data that simply will not move all produce stacks where the analytical surface needs to span multiple physical stores. The logical data warehouse pattern names this: an analytical layer that looks like a single warehouse to query-time consumers but is composed at runtime from multiple physical stores. The pattern is mostly virtualization with judicious caching and a semantic layer on top. Where the consolidation would be possible but expensive, the logical pattern is often a transitional step toward eventual consolidation. Where consolidation is genuinely impossible, it can be the permanent architecture.

Real-time access against operational sources. Warehouse loads run on schedules; even streaming CDC pipelines incur some end-to-end lag. Some queries genuinely require the current state of the operational source, not the warehouse's representation of it. Virtualization gives the BI surface a way to reach the operational source directly for those queries, joining the live state against the modeled history that lives in the warehouse. The operational team's tolerance for analytical query load against their database is the gating constraint here; it is rarely as high as the analytical team would like.

A fourth pattern, prototyping and proof-of-concept analytics, is worth naming because it is the case where virtualization is least controversial. A new analytical question against a new source does not justify building a load pipeline. A federated query against the source answers the question directly. If the answer turns out to matter recurrently, the source then becomes a candidate for materialized integration. Virtualization works as the cheap path that lets you find out which sources warrant the expensive path.
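Finding out which sources warrant the expensive path can start from the federated engine's query log. The sketch below assumes a log of entries that each carry a `source` field and uses a simple run-count threshold; both the log shape and the threshold are hypothetical.

```python
from collections import Counter

def promotion_candidates(query_log, min_runs):
    """Return sources queried at least `min_runs` times in the log window."""
    counts = Counter(entry["source"] for entry in query_log)
    return sorted(source for source, n in counts.items() if n >= min_runs)

# Made-up log entries for illustration.
query_log = [
    {"source": "billing_api", "user": "ana"},
    {"source": "billing_api", "user": "raj"},
    {"source": "billing_api", "user": "ana"},
    {"source": "legacy_erp", "user": "raj"},
]

candidates = promotion_candidates(query_log, min_runs=3)
```

Run count alone is a crude signal; in practice it would likely be combined with query cost and the number of distinct consumers.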

The workload virtualization does not fit cleanly is governed, high-concurrency BI on integrated business data. The warehouse exists because that workload needs predictable query latency, frozen historical state, and governance that survives source-system change. Virtualization can serve parts of it, but the production default for the core analytical surface remains the warehouse.

How the lakehouse changed the math

The arrival of open table formats (Apache Iceberg, Delta Lake, Apache Hudi) in the late 2010s collapsed a category of cases that previously required virtualization. The pattern's reach shrank as a consequence.

Before table formats, querying lake data with warehouse-grade guarantees meant either copying it into the warehouse or virtualizing over it from a federated query engine. The virtualization path retained the storage cost advantage of the lake but accepted weaker consistency and less mature optimization. With table formats, the same files in object storage support ACID transactions, schema evolution, and time travel, and multiple engines (Trino, Spark, Snowflake, BigQuery, Databricks SQL) can read them consistently. The result is that a data lakehouse workload that would have required virtualization in 2018 now runs natively on the lakehouse table from whatever engine the team prefers.

The category that virtualization still uniquely addresses is heterogeneous source data: tables in Postgres, documents in Mongo, events in Kafka, files in S3 that have not been promoted to a table format, SaaS APIs, the long tail. The lakehouse is a homogeneous-storage pattern; once the data is in Iceberg or Delta, an open-format engine reads it directly. Virtualization is the answer for data that has not been so promoted, which in most real organizations is still substantial.

When data mesh enters the conversation

Data mesh became a frequent referent for virtualization-adjacent architectures starting around 2019. The connection is real but narrower than the broader discourse implies. Mesh is an organizational and product pattern: domain teams own their data as products, with discoverability, contracts, and quality guarantees attached. The architectural substrate that lets domain-owned data be consumed federatively across domains is often, but not necessarily, a federated query layer. Some mesh implementations rely heavily on virtualization; others materialize domain outputs into shared storage and use a more conventional warehouse or lakehouse engine.

The honest framing is that mesh is one organizational rationale that can produce a virtualization-heavy stack, alongside the regulatory, acquisition, and exploration rationales already mentioned. It is not a synonym for virtualization, and the term has been stretched enough that "we do data mesh" rarely communicates the technical architecture without follow-up questions. Treat the architectural decisions on their own terms.

Failure modes and operational concerns

Three categories of failure account for most production trouble with virtualization-heavy architectures.

Source load and latency surprises. A query that joins a small Postgres table to a large warehouse table looks innocuous in the planner. If the engine's optimizer picks the wrong join order or pushes the wrong predicates, the small table can drive a query plan that pulls millions of rows from the warehouse to evaluate the join in the federated engine. Source databases get hit with concurrent analytical workloads they were not sized for. The countermeasures are: tight query review for cross-source joins, explicit join hints where the optimizer needs guidance, and rate-limiting or concurrency caps on the connectors to operational sources. Mature engines do well on this within their preferred connector set; the failure rate climbs sharply for connectors to less mainstream sources.
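A concurrency cap on a connector can be approximated with a semaphore around the call that reaches the operational source. This is a sketch under the assumption that the connector is a plain callable; `CappedConnector` and the fake source are hypothetical, and real engines expose their own resource-management settings for this.

```python
import threading
import time

class CappedConnector:
    """Wrap a query function so at most `max_concurrent` calls run at once."""

    def __init__(self, run_query, max_concurrent):
        self._run_query = run_query
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def query(self, sql):
        with self._slots:  # blocks while the cap is reached
            return self._run_query(sql)

# Fake source that records the peak number of simultaneous queries.
state = {"active": 0, "peak": 0}
state_lock = threading.Lock()

def fake_source(sql):
    with state_lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)  # simulate work on the source
    with state_lock:
        state["active"] -= 1
    return sql

connector = CappedConnector(fake_source, max_concurrent=2)
threads = [
    threading.Thread(target=connector.query, args=(f"q{i}",)) for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight analytical queries arrive at once, but the source never sees more than two at a time; the rest wait in the federated layer rather than on the operational database.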

Schema drift. A virtualization layer's contract with consumers is the logical schema. The contract with sources is whatever the source happens to look like today. When a source schema changes (column added, type changed, table renamed), the virtualization layer either breaks or silently produces wrong answers, depending on which kind of change happened and how the connector handles it. The discipline is the same as for warehouse loading pipelines: explicit schema validation, alerts on source schema change, and a defined process for propagating changes through the logical schema. The difference from a warehouse pipeline is that propagation is faster, because only the logical schema mapping needs updating and there is no load step to change.
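Explicit schema validation can be as small as a diff between the schema the logical layer expects and the one the source reports. The sketch below assumes both are available as column-to-type mappings (for example, read from the source's catalog tables); the function names and the breaking-change rule are illustrative.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report drift by kind."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}

def is_breaking(drift):
    """Treat removed or retyped columns as breaking; additions as informational."""
    return bool(drift["removed"] or drift["retyped"])

# Made-up schemas: what the logical layer expects vs. what the source has today.
expected = {"id": "bigint", "email": "text", "created_at": "timestamp"}
observed = {"id": "bigint", "email": "varchar", "signup_source": "text"}

drift = diff_schema(expected, observed)
alert = is_breaking(drift)
```

Running a check like this on a schedule, and before exposing a new logical view, turns silent wrong answers into an alert.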

Cost predictability. Cloud-warehouse credit pricing, federated-engine billing for compute and data scanned, and operational-source query load all stack on each query. A query that runs cleanly in a warehouse for a known cost may, when re-implemented as a federated query, cost more, less, or wildly more depending on data layout, pushdown effectiveness, and how often the result is re-computed versus cached. Teams that adopt virtualization without modeling its cost behavior in their workload routinely discover that "the warehouse was expensive, but the federated query was unpredictable" is a worse position to be in. The countermeasures are workload analysis, monitoring per-query cost, and materializing expensive recurring queries into stable tables when virtualization does not pay back.
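A back-of-the-envelope model helps decide when a recurring federated query should be materialized. The sketch below assumes scan-based pricing and ignores the cost of querying the materialized table itself; the price and workload numbers are placeholders, not any vendor's rates.

```python
def federated_monthly_cost(runs_per_month, tb_scanned_per_run, price_per_tb,
                           cache_hit_rate=0.0):
    """Monthly cost when only cache misses pay for a scan."""
    paid_runs = runs_per_month * (1.0 - cache_hit_rate)
    return paid_runs * tb_scanned_per_run * price_per_tb

def materialized_monthly_cost(refreshes_per_month, tb_scanned_per_refresh,
                              price_per_tb, storage_cost):
    """Monthly cost of refreshing a stable table plus storing it."""
    return refreshes_per_month * tb_scanned_per_refresh * price_per_tb + storage_cost

PRICE_PER_TB = 5.0  # placeholder price per TB scanned (assumption)

federated = federated_monthly_cost(
    runs_per_month=600, tb_scanned_per_run=0.2, price_per_tb=PRICE_PER_TB
)
materialized = materialized_monthly_cost(
    refreshes_per_month=30, tb_scanned_per_refresh=0.2,
    price_per_tb=PRICE_PER_TB, storage_cost=4.0,
)
should_materialize = materialized < federated
```

With these placeholder numbers the query runs 600 times a month but only needs a daily refresh, so materializing it is far cheaper. Raising `cache_hit_rate` narrows the gap, which is why the caching policy belongs in the same analysis.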

A fourth category worth naming briefly is governance and access control. The warehouse pattern centralizes governance at the storage boundary; data lands in the warehouse, governance applies, queries read governed data. Virtualization distributes governance across the sources, because each source enforces its own access controls and the virtualization engine inherits whatever the source provides. Reconciling that across heterogeneous sources, particularly for row-level and column-level security, is real engineering work. Engines with strong governance models (Starburst, Dremio, Denodo) invest heavily in this layer; engines treating it as the source's problem leave it to the integrating team.
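When the engine rather than the source enforces access rules, the core of the work is filtering rows and masking columns after they arrive. The sketch below is a deliberately small illustration; the policy structure, role names, and `apply_policy` helper are hypothetical and not modeled on any product's security API.

```python
def apply_policy(rows, role, policy):
    """Filter rows and mask columns for `role` after rows arrive from a source."""
    rule = policy[role]
    visible = [r for r in rows if rule["row_filter"](r)]
    return [
        {col: ("***" if col in rule["masked"] else val) for col, val in r.items()}
        for r in visible
    ]

# Hypothetical policy: one row-level rule and one column-level rule per role.
POLICY = {
    "eu_analyst": {"row_filter": lambda r: r["region"] == "EU", "masked": {"email"}},
    "admin": {"row_filter": lambda r: True, "masked": set()},
}

rows = [
    {"id": 1, "region": "EU", "email": "a@example.com"},
    {"id": 2, "region": "US", "email": "b@example.com"},
]

analyst_view = apply_policy(rows, "eu_analyst", POLICY)
admin_view = apply_policy(rows, "admin", POLICY)
```

The hard part the paragraph points to is not this function but keeping one policy consistent across sources that each also enforce their own rules.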

When to reach for virtualization

The decision is rarely binary. Most production stacks in 2026 use both materialized integration and federated query, with the split governed by workload characteristics rather than ideology.

Virtualization fits when the workload is exploratory, the source is hard to extract, the data does not need to be historical, the access pattern is occasional rather than concurrent, the source data is heterogeneous, or the materialization cost would not pay back. Materialized integration (warehouse, lakehouse) fits when the workload is BI with predictable latency requirements, the source is extract-friendly, historical state matters, concurrency is high, governance must survive source change, or cost predictability matters.
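The criteria above can be written down as a checklist. The sketch below counts which side more of a workload's signals fall on; the signal names are hypothetical labels for the criteria in the text, and the equal weighting is an assumption, since real decisions weigh these factors unevenly.

```python
# Hypothetical labels for the criteria listed in the text.
VIRTUALIZATION_SIGNALS = {
    "exploratory_workload", "source_hard_to_extract", "no_history_needed",
    "occasional_access", "heterogeneous_sources",
    "materialization_will_not_pay_back",
}
MATERIALIZATION_SIGNALS = {
    "bi_latency_requirements", "extract_friendly_source", "history_matters",
    "high_concurrency", "governance_must_survive_source_change",
    "cost_predictability_matters",
}

def lean(workload_signals):
    """Return which approach more of the workload's signals point toward."""
    signals = set(workload_signals)
    unknown = signals - VIRTUALIZATION_SIGNALS - MATERIALIZATION_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    v = len(signals & VIRTUALIZATION_SIGNALS)
    m = len(signals & MATERIALIZATION_SIGNALS)
    if v == m:
        return "mixed"
    return "virtualization" if v > m else "materialized"

adhoc = lean({"exploratory_workload", "occasional_access"})
dashboard = lean({"high_concurrency", "history_matters", "occasional_access"})
```

A "mixed" result is a common outcome, which matches the point that most stacks use both approaches.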

The mistake to avoid in either direction is treating the choice as architectural fashion. A team that virtualizes everything because federation is "modern" reproduces the operational headaches centralized warehouses were built to solve. A team that warehouses everything because integration is "proper" pays for materialization on data that gets queried twice a quarter. The technique to pick is the one the workload actually calls for.

The data warehouse pillar covers the materialized-integration pattern that virtualization sits opposite. The data warehouse vs data lake vs data mart vs lakehouse comparison covers the four storage architectures virtualization queries across. The ETL vs ELT comparison covers the materialization patterns that handle the cases virtualization does not. For platform-level decisions on the warehouses that participate in federated queries, the modern warehouse platforms pillar covers Snowflake, BigQuery, Redshift, and Databricks. The full implementation depth for the logical data warehouse pattern is covered in its own technique article. Data integration places virtualization among the alternative approaches (ETL, ELT, change data capture, replication, and streaming) and covers the constraints that select between them. Data lake, data lakehouse, federated query, and semantic layer all have glossary entries.