Data integration: approaches and when to use each

Data integration is the work of bringing data from multiple source systems into a single, consistent representation that downstream tools can query as one. It spans three problems: moving the data out of its sources, transforming it to a common schema and standard, and reconciling the same real-world entity as different systems represent it. The established approaches divide along two axes, when transformation happens and whether the data is copied at all, and this article surveys them: ETL, ELT, change data capture, replication, data virtualization, and streaming, with the trade-offs that decide which one fits a given source, latency requirement, and governance constraint. Most of the effort goes into reconciliation, not transport, which is why integration is the part of warehouse construction that consumes the most of it. The warehouse loading and operations pillar covers loading mechanics at overview depth.

TL;DR. Data integration approaches divide along two axes: when transformation happens (ETL before load, ELT after) and whether the data is copied at all (physical integration into a warehouse versus federated query over data left in place). Batch suits most analytical workloads; change data capture and streaming serve the cases that genuinely need fresh data. Choose by source type, latency need, transformation complexity, and where governance has to be enforced, not by default.

What data integration is

Data integration is the discipline of combining data from separate systems into a unified, consistent form. The term covers three distinct problems that are easy to conflate: moving data out of a source and into a target, transforming it to a common schema and a consistent standard, and reconciling the same real-world entity as it is represented differently across systems. A complete integration approach addresses all three. The patterns below differ mostly in where and when each step happens, and whether the data is physically copied at all.

The reason integration is hard is the reconciliation problem, not the transport. A customer in a CRM, an account in a billing system, and a user in a product database are frequently the same person, carried under different keys, with different attributes, updated on different cadences. Integration is what resolves them into a single record an analyst can trust without redoing the reconciliation at query time. This is the same property the warehouse promises when it calls itself integrated, and the same commitment that makes a warehouse an enterprise data warehouse: the hard, durable work lives in producing that consistent representation, and every downstream query benefits from it having been done once.

It helps to separate integration from two adjacent terms. Data ingestion is just the movement step, getting data from a source into a landing area, with no claim about consistency. Data extraction is the source-side half of ingestion. Integration is the larger discipline that ingestion serves: it is only complete when the moved data has also been conformed and reconciled into a representation the rest of the stack can rely on.

Broader treatments of data integration also include data consolidation and migration, application and API integration, and middleware or iPaaS connectivity between operational systems. This article scopes to the approaches that bring source data into an analytical warehouse, where the goal is integrated, queryable history rather than system-to-system connectivity.

Six approaches (the main types, methods, and techniques of data integration) recur within that scope. ETL transforms data before loading it into the warehouse. ELT loads first and transforms inside the warehouse. Replication and change data capture keep an integrated copy current by moving only what changed. Data virtualization queries sources in place without copying them. Batch moves data on a schedule, and streaming moves it continuously as events. The first two are about when transformation runs; the rest are about how and how often data moves, and the sections below take them in turn.

Physical integration: ETL and ELT

Most warehouse integration is physical: data is copied from sources into the warehouse, transformed into the target schema, and stored. The long-standing question is when the transformation runs relative to the load, and the two answers are the most commonly named integration patterns.

Extract, transform, load (ETL) transforms data in a dedicated processing layer before it reaches the warehouse. Raw source data never lands in the warehouse in an ungoverned state; it is cleaned, conformed, and validated in flight, and only the finished result is written. This kept partially governed data out of the warehouse for decades, which mattered when warehouse storage and compute were scarce and expensive.

Extract, load, transform (ELT) inverts the order. Data lands in the warehouse first, in or close to its raw shape, and transformation runs inside the warehouse against data already in place. Cloud warehouses with elastic, separately billed compute made this practical at scale, and it is the dominant pattern in modern cloud deployments. ELT simplifies the pipeline, since the warehouse's own query engine does the transformation work, but it means raw or partially governed data exists inside the warehouse before transformation has run, which raises access-control questions ETL handles before data ever arrives. The choice between them turns on transformation complexity, governance requirements, and platform economics, and is covered in depth in the ETL vs ELT comparison.

The transformation itself, regardless of when it runs, is where most integration logic lives: mapping source fields to target columns, standardizing types and units, deduplicating records, applying business rules, and conforming dimensions so that the same entity carries the same key across every fact that references it. Reverse ETL, which pushes integrated warehouse data back out to operational systems, is a related but distinct movement and is treated as its own pattern rather than a warehouse-loading approach.

Keeping integrated data current: replication and change data capture

A one-time load integrates a snapshot. Keeping the integrated copy current as sources change is a separate, permanent problem, and reloading entire tables on every cycle stops being viable once the reload cost or load window outgrows its budget.

Replication into the warehouse, as distinct from same-engine database read replicas, maintains a copy of source data in or near the warehouse and keeps it synchronized. At small volume this can be a periodic full copy. At production scale it depends on change data capture (CDC): tracking which rows were inserted, updated, or deleted since the last cycle and moving only those changes. Log-based CDC reads the source database's transaction log directly, which captures hard deletes and preserves the exact order of changes; timestamp-based CDC queries an audit field such as a last-modified date, which is simpler but misses deletes unless the source records them. The mechanics, failure modes, and the snapshot-to-streaming handoff that trips up most implementations are covered in the change data capture technique article.

The point for integration is that incremental movement is not a tuning detail to add later. The CDC mechanism, the audit fields it relies on, the handling of hard deletes, and the watermark that tracks progress all have to be decided during design, because retrofitting them after a full-reload pipeline is in production usually means rebuilding it.

From schema to

hundreds of pipelines.

AI-mapped, dependency-aware, CDC-ready. Generated from your model, not built by hand.

See a demo

Virtual integration: querying sources in place

Not all integration copies data. Data virtualization integrates at query time: a federated query engine presents data that physically remains in its source systems as if it were one logical schema, resolving each query into sub-queries against the underlying stores and combining the results. Nothing is materialized in advance.

The appeal is that there is no copy to keep current and no load pipeline to maintain, so newly added sources are available quickly and the data is always as fresh as the source. The cost is paid at query time: every query reaches across systems with different performance characteristics, and the slowest source bounds the response. Source systems also absorb analytical query load they were not built for, which can interfere with their operational duties. Virtualization fits exploratory access, data that is queried rarely enough that maintaining a physical copy is not worth it, and cases where regulatory or practical constraints prevent copying the data at all. It fits poorly for repeated, latency-sensitive analytical queries over large volumes, where a materialized, well-structured warehouse is the clear answer.

In practice the two are not mutually exclusive. The logical data warehouse pattern combines them: a physical warehouse holds the high-value, frequently queried core, while a federation layer extends queries out to lakes, lakehouses, and operational stores that are not worth materializing. The data virtualization technique article covers the federated-query approach in depth.

Batch and streaming

Cutting across the physical-versus-virtual distinction is the question of how continuously data moves. Batch integration runs on a schedule, hourly, nightly, or some other cycle, processing accumulated changes in groups. It is simpler to reason about, easier to make idempotent and to reprocess, and sufficient for the large majority of analytical workloads, where decisions are made on data that is hours rather than seconds old.

Streaming integration moves changes continuously as events, usually through a message bus, so integrated data is available within seconds of the source change. It serves genuinely time-sensitive cases: operational dashboards that drive same-day decisions, fraud and anomaly detection, and analytics embedded in applications where users expect current figures. The cost is operational complexity: streaming systems are harder to make correct under replay, since the discipline is at-least-once delivery plus idempotent, effectively-once processing rather than any true exactly-once guarantee. They are also harder to reprocess after a logic change, and harder to debug than a batch job that can simply be re-run.

The common mistake is reaching for streaming because fresh data sounds better, without an analytical decision that actually depends on sub-hour latency. The question worth asking early is which decisions genuinely need data fresher than the batch cycle, and whether those decisions are made often enough to justify the operational cost. Many teams find, on inspection, that their real-time requirement was overestimated.

Choosing an approach

The approaches are not competitors to rank; they are tools matched to constraints. A few decision axes do most of the work.

Source type. A transactional database with accessible logs is a strong candidate for log-based CDC. A SaaS application reached only through an API constrains you to what the API exposes, often timestamp-based extraction. A file drop or event stream pushes toward batch file processing or streaming respectively. The source frequently decides the approach before preference enters.

Latency requirement. Most analytical workloads are well served by batch. Reserve CDC-fed near-real-time and streaming for the specific decisions that demonstrably need fresher data, and let the rest run on the simplest schedule that meets the requirement.

Transformation complexity and governance. Heavy transformation and strict pre-load governance favor ETL, where data is cleaned before it enters the warehouse. Where the warehouse's compute is the natural place to transform and raw-data access can be controlled, ELT is simpler. Where copying the data is impractical or prohibited, virtualization may be the only option.

Repetition and volume. Data queried often, at scale, with predictable patterns rewards physical materialization into a well-structured warehouse. Data queried rarely, or still being explored, rarely justifies the cost of a maintained pipeline, and virtualization or a light batch copy fits better.

Most real warehouses use several approaches at once: batch ELT for the bulk of sources, CDC for the high-volume transactional systems that cannot afford full reloads, streaming for the few genuinely real-time feeds, and federation for the long tail not worth materializing. The architecture is a portfolio, not a single choice.

Operational realities

Whichever approaches a warehouse uses, integration is a permanent operational concern rather than a one-time build, and a few disciplines decide whether it stays correct.

Schema drift is the most common source of breakage. Source systems add, rename, and retype columns on their own schedules, and an integration pipeline that assumes a fixed source shape fails or, worse, silently mis-maps data when the source changes. Detecting and handling schema changes deliberately, rather than discovering them through corrupted data, is a core operational requirement.

Idempotency is what makes a pipeline safe to re-run. A load that produces the same result whether it runs once or three times can be retried after a failure without creating duplicates, which is the difference between an integration job that recovers cleanly and one that requires manual cleanup after every hiccup. Designing loads to be idempotent, typically through merge-on-key rather than blind insert, is foundational.

Data quality and lineage close the loop. Integrated data is only trustworthy if its quality is validated as part of the pipeline, not assumed, and if its lineage is traceable, so that when a number looks wrong an analyst can follow it back to the source and transformation that produced it. The data warehouse metadata article covers the catalog and lineage layer that makes this possible across the stack, and data warehouse governance covers who owns the quality commitments and definitions the integrated data is held to.

The transform-timing choice is covered in depth in the ETL vs ELT comparison. The incremental-movement mechanism that keeps integrated data current is in the change data capture technique article, and the source-side extraction patterns are in data extraction models. The federated, no-copy approach is covered in data virtualization, and the pattern that combines physical and virtual integration behind one query layer is in logical data warehouse. For where integration sits in the larger picture, the warehouse loading and operations pillar covers loading at pillar depth and what a data warehouse is covers the integrated, historical, analytical foundation integration serves. Federated query, change data capture, watermark, idempotency, and data lineage have glossary entries.