The first design choice in a warehouse load pipeline is how to get data out of the source. Everything downstream, watermarks, idempotency, schema evolution, even cost, is shaped by the extraction model the connector uses. The warehouse loading and operations pillar covers the loading discipline as a whole; this article narrows in on the extraction half: what each pattern assumes about the source, where each one breaks, and how the choice is usually made in 2026 production stacks.
TL;DR. Full extraction is the right default for reference tables and anything small enough to reload cheaply. Incremental extraction is mandatory at scale; log-based change data capture is the production-grade mechanism where the source allows it, with timestamp-based query extraction as the practical fallback. File-based, API-based, and streaming extraction each handle a class of source that the relational connectors do not.
The shape of the decision
An extraction model is a contract between the warehouse loader and the source system. It commits to a way of identifying which records the loader needs to read on each cycle, a way of detecting failure, and a set of assumptions about what the source will accept. The choice is rarely about which model is best in the abstract; it is about which model the source actually supports and which one the team operating it will agree to maintain.
Three properties dominate the decision. The first is the source's capability surface: does it expose a transaction log, audit timestamps, a query interface, a bulk export, an event stream, or only a paginated API. The second is the change rate against the table size: a million-row table with a hundred changes a day rules out full reload, while a thousand-row reference table rules out the operational overhead of a streaming connector. The third is the staleness budget: a warehouse that can tolerate 24-hour-old data has options that a sub-minute analytical surface does not.
Most production warehouses run two or three of the patterns below in combination, picking the appropriate one per source. Treating the choice as global rather than per-source is the most common mistake.
Full extraction
Full extraction copies the source table in its entirety on every cycle. The loader issues a SELECT * against the source, truncates and reloads the warehouse target, and treats the previous load as discarded. There is no watermark, no incremental logic, no dependence on source-side audit columns.
The simplicity is the point. A full extract has exactly one failure mode (the load failed) and exactly one success mode (the target table now reflects the source). There is no possibility of missed changes, no clock-skew edge case, no incremental drift that compounds invisibly over weeks. For source tables small enough that a full reload completes within the load window at acceptable cost, this is almost always the right answer.
The legacy rule of thumb, full extraction for tables under roughly a thousand rows, is a defensible starting point but no longer the operative threshold in 2026. On a cloud warehouse, the cost calculation is about compute credits and source query load rather than raw row count. A million-row reference table that reloads in fifteen seconds against a quiet source is cheaper to operate as a full extract than as an incremental pipeline. A hundred-thousand-row transactional table on a busy OLTP source where the full extract takes a query slot every fifteen minutes is not.
Full extraction is also the default during an initial load or after a baseline reset. Every CDC pipeline starts with a full snapshot of the tables it will subsequently track incrementally; the snapshot-to-streaming handoff is covered in the change data capture article. The same applies after a schema migration that invalidates the incremental state, or after a quality incident that requires the warehouse copy to be rebuilt from authoritative source state.
The pathologies of full extraction at scale are predictable. Load windows extend in proportion to total table size rather than change volume. Source-side query load competes with the application workload during the extract. Egress costs on cloud sources accumulate linearly with table size on every cycle. Once any of those crosses a threshold, the table moves to incremental extraction and stays there.
Incremental extraction: the umbrella
Incremental extraction means reading only what has changed since the last successful cycle. The category covers several distinct mechanisms that share that abstract goal but differ substantially in how they detect change. The three that matter in production warehouses are query-based extraction against an audit timestamp, log-based extraction against the source's transaction log, and trigger-based extraction against a source-side change table.
The next three sections cover each in turn. The full operational depth on CDC specifically (snapshot handoff, schema evolution, retention margin, dead-letter discipline) lives in the change data capture article; the treatment here stays at the extraction-pattern level.
Query-based incremental extraction
Query-based extraction polls the source for records whose audit timestamp, typically updated_at or last_modified, exceeds the watermark from the previous cycle. The loader stores the watermark in its own metadata or in a dedicated table in the warehouse, advances it after each successful cycle, and reissues the query on the next schedule.
The approach requires only read access to the source tables, an audit timestamp column that is reliably updated on every write, and an index on that column. It does not depend on any source-system internals, which is what makes it portable across sources that would refuse log access. Most managed connectors that support arbitrary SQL sources offer query-based extraction as their default mode.
The known failure modes are routine. Hard deletes leave no audit timestamp; the query has no way to know a record is gone. Clock skew between the source and the watermark store creates a gap where records committed near the boundary fall through. Without an index on the audit column, every cycle runs a full table scan against the source, which the source-system team will eventually complain about. A few minutes of safety buffer on the high watermark, combined with idempotent merge semantics in the warehouse, handles the clock-skew case at the cost of reprocessing a small window each cycle.
Query-based extraction is the right answer for SaaS sources accessed via SQL or via a query interface that approximates SQL, for tables where hard deletes do not matter or are handled via a deleted_at flag, and for sources where log access is unavailable or politically expensive to negotiate. It is the wrong answer wherever deletes must be captured, wherever the source clock cannot be trusted, or wherever the polling interval is shorter than the audit column's update granularity.
Log-based extraction
Log-based extraction reads the source database's transaction log directly: the write-ahead log on Postgres, the binary log on MySQL, the redo log on Oracle, the CDC tables derived from the transaction log on SQL Server. Every committed change is recorded in the log before it is applied to the underlying tables, with enough information to reconstruct the operation type and the before-and-after row state. A connector subscribes to the log, parses the change events, and emits them to a message bus or directly to the warehouse loader.
Debezium is the canonical open-source implementation, with connectors for the major relational databases. Most cloud data integration platforms either embed Debezium or implement equivalent logic against the same database mechanisms. The output flows to Kafka, Pulsar, or Kinesis, then to a consumer that applies the changes to the warehouse via idempotent merge on primary key.
Log-based extraction is the most complete approach available against a relational source. It captures every change including hard deletes, preserves the sequence of changes rather than only the final state, and requires no modification to the source schema. The cost is operational. Log access typically requires DBA cooperation that takes longer to arrange than the connector takes to deploy. Log retention windows, configured on the source, constrain how far back the connector can reach if it falls behind: an unconsumed Postgres replication slot also prevents WAL cleanup, which can run the source's disk to zero if the connector stays inactive long enough. Schema changes on tracked tables can break the connector when the new schema arrives without coordinated handling.
These are the disciplines that distinguish a reliable log-based pipeline from one that is mostly correct. The change data capture article covers them at depth. The short version: monitor lag against retention, alert before the slot's window is at risk, and treat schema changes on tracked tables as cross-team coordination rather than as source-team prerogative.
Trigger-based extraction
Trigger-based extraction uses database triggers on the source system to write change events to a staging table as they occur. The warehouse loader reads from the staging table rather than from the source tables directly.
The approach is complete (triggers fire on every commit, including deletes), transactionally consistent with the writes that produced it, and does not require log access. The cost is that every write to a tracked table now incurs at least one additional write to the staging table, which the source-system team will reasonably push back on for any high-throughput table. The maintenance burden is recurring: every schema change to a tracked table requires the trigger definitions to be updated, and trigger definitions accumulate across the schema in ways that complicate migrations and source-system upgrades.
Trigger-based extraction fits a narrow band of cases: log access is unavailable, the source is relational, the audit-timestamp approach is rejected for completeness reasons, and the source-system team will accept the write overhead. Outside that band, log-based is the better default and query-based is the better fallback.
File-based extraction
Some sources do not expose a query interface at all. A partner system delivers a daily CSV. A mainframe writes a fixed-width extract to an SFTP drop. An operational application exports a Parquet snapshot to an object store on a scheduled job. The warehouse's job is to read the file, validate it, and load it.
The file-based pattern has been around longer than most current practitioners have, and it is not going away. Two shifts have changed how it works in 2026. The first is that the destination is now almost always object storage (S3, GCS, ADLS) rather than a local filesystem on the warehouse host. The second is that the dominant interchange format is no longer CSV: Parquet has become the default for bulk exports between systems that can produce columnar files, because the size reduction and the embedded schema both pay for themselves on every read. Avro shows up where row-oriented streaming records need an embedded schema. JSON Lines remains common for semi-structured exports. CSV persists for legacy interchange and for systems that cannot produce anything else.
The operational disciplines around file-based extraction are about contracts rather than mechanics. The drop location is a contract: well-known prefix, predictable file naming, a sentinel file or atomic rename to signal completion. The schema is a contract: either embedded in the file format (Parquet, Avro) or maintained out-of-band and version-tagged. The arrival cadence is a contract: a load that scans the drop and finds nothing is a different failure than a load that finds yesterday's file unchanged. Most file-based pipelines that go wrong in production go wrong on one of these contracts being violated without the violation being noticed.
The bulk-export-as-Parquet pattern deserves explicit mention. For sources that can produce a Parquet snapshot of an entire dataset on a schedule, the loader's job collapses to copying the file and registering it: no row-by-row extraction, no audit-column dependency, no incremental complexity. The cost is the staleness implied by the snapshot cadence, which is usually acceptable for analytical workloads against operational sources that are not strictly real-time. This is the pattern most lakehouse-first architectures default to for cross-system handoffs.
API-based extraction
SaaS sources, third-party data providers, and many operational systems expose only an HTTP API. The warehouse's loader becomes an API client: authenticate, paginate, page through results, retry on transient errors, respect rate limits, and accumulate the response into the warehouse target.
The mechanics are conceptually simple and operationally tedious. Each API has its own pagination scheme (offset, cursor, page token), its own rate-limit model (requests per second, requests per minute, monthly quota, weighted by endpoint), its own authentication flow (static key, OAuth refresh, signed request), its own incremental query semantics (date filter, cursor on updated_at, change-feed endpoint), and its own failure modes (silent truncation at a page boundary, retry-after headers that lie, eventual consistency in the response). Writing a reliable connector for one source is a project; writing one for thirty sources is a maintenance burden most warehouse teams cannot justify.
The managed-connector market exists because of this burden. Fivetran, Airbyte, Stitch, and several others maintain hundreds of source-specific connectors that handle the per-source quirks behind a uniform interface. The price is straightforward: a per-source cost (Fivetran's MAR-based pricing, Airbyte's volume-based pricing) in exchange for not having to write the connector. For SaaS sources where the data is small enough that the connector cost is a rounding error against the time it would take to maintain the connector in-house, this is the dominant pattern.
The rate-limit problem is worth calling out because it determines the staleness floor for API-based extraction. Most SaaS APIs cap throughput somewhere between hundreds and a few thousand requests per minute. A full reload of a large dataset against such a cap takes hours or days, regardless of warehouse-side capacity. Incremental extraction via the API's own change-feed endpoint, where the source offers one, sidesteps the cap by reading only the delta; without one, the only options are to live with the staleness or to negotiate higher-tier API access with the vendor. Some sources (Salesforce, Stripe, HubSpot) expose explicit change-data-capture endpoints that approximate log-based semantics over HTTP; using those wherever they exist is the right default.
Streaming extraction
For sources that already publish events to a message bus (Kafka, Pulsar, Kinesis, Redpanda), the warehouse can subscribe directly. There is no polling and no batch handoff: the loader consumes the stream as records arrive and applies them to the warehouse on its own schedule, which may be near-real-time or micro-batched.
Streaming extraction handles two distinct cases. The first is sources that are natively event-driven: a clickstream pipeline, an IoT telemetry feed, an application that emits domain events to Kafka as part of its normal operation. The warehouse is one of several consumers of the same stream. The second is sources whose changes are forwarded to a stream by a log-based CDC connector, with the warehouse consuming the downstream events rather than reading the log directly. From the warehouse's perspective both look the same: the loader is a consumer subscribing to a topic with a managed offset.
The operational concerns shift relative to the batch patterns. The watermark equivalent is the consumer offset, managed by the message bus rather than the loader. Idempotency on the warehouse side is non-negotiable because replay is built into the consumption model: every event must produce the same warehouse state on its tenth application as on its first, with merge by primary key the standard pattern. Schema enforcement moves from per-load validation to per-event schema registry coordination (Confluent's Schema Registry, Apicurio, AWS Glue Schema Registry). Dead-letter handling becomes a first-class topic rather than an afterthought.
Streaming extraction is the right answer when the source already speaks streaming and when the analytical use case justifies the operational overhead. It is the wrong answer for slowly changing reference data, for sources where the source team will not cooperate on schema registry discipline, and for any case where the warehouse's analytical surface is itself batch. A streaming connector feeding hourly analytical aggregates is mostly the worst of both patterns.
Choosing an approach
| Pattern | Where it fits | Where it fails |
|---|---|---|
| Full extraction | Reference tables, small tables, initial loads, baseline resets | Anywhere the reload cost or load window exceeds budget |
| Query-based incremental | SaaS or relational sources without log access; soft-delete sources; low-rate reference tables | Hard deletes; untrustworthy source clock; missing index on the audit column |
| Log-based CDC | Transactional relational sources where log access is available | DBA pushback on log access; Postgres slot disk pressure; uncoordinated schema changes |
| Trigger-based | Relational sources with no log access and rejected audit-timestamp approach | Source-team pushback on write overhead; recurring trigger maintenance |
| File-based | Partner deliveries, mainframe extracts, scheduled bulk exports, lakehouse handoffs | Broken delivery contracts; schema drift without versioning; arrival-cadence ambiguity |
| API-based | SaaS sources without a stream or bulk export; third-party data providers | Rate-limit ceiling on staleness; per-source quirks; pagination edge cases |
| Streaming | Natively event-driven sources; downstream of log-based CDC; sub-minute analytical surfaces | Source not already streaming; weak schema discipline; batch-only analytical use cases |
Most production warehouses run three or four of these in combination. A common shape: log-based CDC for the operational relational systems, query-based incremental for the SaaS sources without change-feed support, file-based ingestion for partner deliveries, and full extraction for the dozen reference tables that everyone forgets about until they need them. The managed-connector layer (Fivetran, Airbyte, Stitch, Estuary) handles the API-based sources in the same configuration regardless of which extraction mode the connector uses underneath, which is part of why those services are the default for that class of source.
The patterns are stable. The decision is per-source, and the decision changes when the source's capabilities change or when the staleness budget moves. Treating extraction as a single architectural choice across the warehouse is a category mistake that surfaces as either over-engineering on the slow sources or under-engineering on the fast ones.
Related
The warehouse loading and operations pillar covers the loading discipline at the level above extraction: watermarks, idempotency, load ordering, late-arriving data, and the monitoring practices that distinguish reliably-current warehouses from mostly-current ones. The change data capture article covers log-based and timestamp-based CDC at the operational depth this article omits, including the initial snapshot-to-streaming handoff and the schema-evolution failure modes. Slowly changing dimensions covers the dimensional-layer change handling that interacts with extraction at the warehouse end. Change data capture, watermark, and idempotency all have glossary entries.
Reference
- Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly, 2017. Chapters 10 and 11 cover batch and stream processing at the level of architectural primitives, including the change-stream model that underlies log-based CDC.
- Debezium documentation. The reference implementation of log-based CDC; the connector docs for Postgres, MySQL, SQL Server, and Oracle each cover the source-specific mechanics.
- Apache Parquet specification and Apache Iceberg documentation. The current default formats for file-based bulk extraction in lakehouse architectures.
