The ETL process for a data warehouse is not a solved problem the first time the pipeline runs. It is an ongoing operational concern that compounds in complexity as data volumes grow, source systems change, and the number of tables and dependencies increases. Teams that treat loading as a deployment task rather than an operational discipline discover this at the worst possible time: when a load fails silently at 3am on the last day of the month and the Monday morning reports are wrong.
This page covers full load versus incremental loading, change data capture strategies and their failure modes, load ordering, idempotency and recovery, late-arriving data, and the monitoring practices that distinguish warehouses that are reliably current from warehouses that are mostly current.
Full load versus incremental loading
Every warehouse load strategy is either a full load, an incremental load, or a combination of both applied to different tables.
A full load truncates the target table and reloads it from scratch on each cycle. It is simple to implement and produces a predictably clean result: when the load completes, the table reflects exactly what the source contains. For small tables, reference data, and slowly changing lookup structures, full loads are usually the right choice. The simplicity is worth the cost.
At production data volumes, full loads become a problem. The load window extends as tables grow. Query performance degrades during loads because the table is unavailable or inconsistent. Compute and I/O costs accumulate at a rate proportional to total table size rather than the volume of change. A table that grows by ten thousand rows per day but contains fifty million rows total is being loaded fifty million rows worth of work every cycle, not ten thousand.
Incremental loading limits each cycle to new and changed records. Done correctly, the load time and cost scale with the volume of change rather than the total table size. A warehouse that loads twenty million rows on day one loads only the delta on subsequent days, regardless of how large the total table gets. This is the approach production warehouses use for any table where full reload time or cost becomes unacceptable.
The catch is that incremental loading is harder to implement correctly than full loading, and the failure modes are less visible. A failed full load is obvious: the table is empty or partially loaded. A failed incremental load may leave the table populated but stale, missing records, or carrying phantom rows from a source deletion that wasn't captured. Catching these failures requires monitoring that goes beyond checking whether the load process completed without error.
Many warehouses use a hybrid approach: incremental loads for fact tables and large dimensions, full loads for small reference and lookup tables, and periodic full reloads of incrementally-loaded tables as a reconciliation step to catch any drift that incremental loading missed. The reconciliation cadence depends on the tolerance for drift and the cost of the full reload. Weekly or monthly full reloads are common for high-volume tables where daily full loads would be too expensive but silent drift is unacceptable.
Change data capture
Incremental loading requires a mechanism to identify which records in the source system have changed since the last load. Change data capture (CDC) is the category of techniques that solve this problem. The right approach depends primarily on what the source system exposes and what operational overhead the team running it will accept.
Log-based CDC reads the database's transaction log, also called the redo log or write-ahead log depending on the platform. Every insert, update, and delete that commits in the source database is recorded in the log before it is applied. A CDC connector reads the log and extracts the change events, which include the before and after state of each modified row. Log-based CDC is the most complete approach: it captures every change including hard deletes, preserves the sequence of changes rather than just the final state, and requires no modification to the source system schema.
Getting that access isn't always straightforward. Log-based CDC requires read access to the source system's transaction log, which database administrators are sometimes reluctant to grant. Log retention windows, set by the source system's own configuration, determine how far back the CDC connector can reach if it falls behind. If the connector lags and the unconsumed portion of the log is purged, there is a gap in the change stream that cannot be recovered without a fresh baseline load. Schema changes in the source system, adding or removing columns from a tracked table, can break the CDC pipeline if the connector is not updated to handle the new schema. Managing these dependencies requires coordination with the teams who own the source systems.
Timestamp-based CDC queries the source system for records where an audit timestamp column, typically last-modified or updated-at, is greater than the watermark from the previous load. It is simpler to implement, requires only read access to the source tables, and does not create any dependency on source system internals. The cost is completeness. Timestamp-based CDC misses hard deletes entirely: if a record is deleted from the source system, it leaves no updated timestamp and the CDC query has no way to know it is gone. It is also sensitive to clock synchronization: if the source system's clock is ahead of the warehouse's watermark tracking, records updated in the gap between the two clocks will be missed until the discrepancy is resolved. And if the audit timestamp is not indexed in the source system, the CDC query may run a full table scan on every cycle.
Trigger-based CDC uses database triggers on the source system to write change events to a staging table as they occur. The warehouse reads from the staging table rather than the source tables directly. This approach is complete and doesn't require transaction log access, but it imposes write overhead on the source system for every insert, update, and delete. Source system teams are generally reluctant to accept this overhead, and maintaining the trigger definitions as the source schema evolves is a recurring operational burden.
The practical choice for most warehouse environments is log-based CDC where it can be established and maintained, and timestamp-based CDC where log access is unavailable or the source system is not a relational database. Using log-based CDC for systems where deletes matter and timestamp-based for the rest is a reasonable split that most teams can operate without significant overhead.
Watermarks and state management
An incremental load needs to know where the previous load stopped. The mechanism for tracking this is the watermark: a stored value representing the boundary between what has already been loaded and what hasn't been seen yet.
For timestamp-based CDC, the watermark is typically the maximum last-modified timestamp observed in the previous load, minus a small safety margin to account for transactions that were in-flight when the load ran. For log-based CDC, the watermark is a log sequence number or equivalent position identifier that tells the connector where to resume reading.
Watermark management sounds trivial and becomes a source of subtle bugs in production. Watermarks that are stored in the pipeline's own metadata store rather than in the warehouse itself create a single point of failure: if the metadata store is lost or inconsistent, the load position is unknown and a full reload may be required to establish a safe baseline. Storing watermarks in the warehouse itself, as rows in a dedicated load tracking table, makes them durable, queryable, and auditable alongside the data they govern.
The safety margin on timestamp watermarks deserves explicit attention. A load that runs at exactly the watermark timestamp may miss records that were committed to the source system with that exact timestamp but were not visible when the load query ran. The underlying database-systems property is transaction isolation: under read-committed semantics, a query sees only transactions committed before it started, so an in-flight transaction with a timestamp at or below the watermark can commit moments later and never appear in the load. Replication lag compounds the same problem for sources read off a replica. Subtracting a small buffer, typically a few minutes, from the high watermark and reprocessing that window on the next cycle catches these records. The cost is that some records are read and processed twice. The load logic must be idempotent enough to handle this without producing duplicates.
Load ordering and dependency management
Warehouse tables have dependencies on each other that constrain the order in which they can be loaded. Dimension tables must be current before the fact tables that reference them can load correctly. Parent entities load before children. Derived tables and aggregations come last.
In a small warehouse with a handful of tables, load ordering can be managed by convention: load dimensions first, then facts, then aggregates. As the warehouse grows to dozens or hundreds of tables, managing the dependency graph by convention becomes error-prone. The first time a new table is added without correctly placing it in the execution order, the load produces incorrect results that may not surface until analysis reveals a discrepancy.
The more reliable approach is to derive the load order from the dependency relationships themselves. If the warehouse's physical schema encodes referential integrity constraints, those constraints form the dependency graph. A topological sort of the dependency graph produces a valid load order. Any new table added to the schema automatically gets a correct position in the sequence by virtue of the foreign keys that define its dependencies.
Where referential integrity constraints are not enforced in the physical schema, as is common in cloud warehouses where constraint enforcement is optional and often disabled for performance reasons, the dependency graph needs to be maintained explicitly. Some teams maintain it as metadata alongside the table definitions. Others derive it from the transformation logic: if a table's load SQL references another table, that's a dependency. Either approach works; the important thing is that the dependency graph exists somewhere other than in people's heads and is updated when tables are added or changed.
Parallel loading is possible wherever the dependency graph allows it. Tables with no dependencies between them can load simultaneously. In a star schema, all dimension tables are independent of each other and can load in parallel. Parallelizing dimension loads, then loading all fact tables that reference only those dimensions, can substantially reduce total load window duration compared to purely sequential execution.
Idempotency and recovery
Production load pipelines fail. Source systems are unavailable. Network connections drop. Queries time out. Downstream tables are locked by long-running queries. A warehouse that isn't designed for failure will need manual intervention every time something goes wrong, which in production is regularly.
Idempotency is the property that allows a load to be re-run without producing incorrect results. An idempotent load produces the same final state whether it runs once or ten times against the same input data. This is not the default behavior of most load operations. An insert that doesn't check for existing rows produces duplicates on re-run. An accumulating aggregate that applies to a table before checking whether this period has already been processed produces incorrect totals. Designing loads to be idempotent from the start costs effort upfront and saves significant time and anxiety in production.
The most common pattern for achieving idempotency in fact loading is the merge or upsert operation: check whether a record with the given key already exists, update it if it does, insert it if it doesn't. This is more expensive than a straight insert but handles re-runs cleanly. For tables where update semantics are not appropriate, an alternative is to delete all rows for the affected time period before re-inserting: the delete-then-insert pattern produces idempotent behavior at the cost of temporarily removing data from the table during the load window.
For dimension loads using Type 2 SCD handling, idempotency is more involved. The load logic must correctly determine whether a given input row represents a new version of an existing dimension entity or a re-delivery of a version that was already loaded. Hash-based change detection, where the load computes a hash of the incoming attribute values and compares it to the hash stored on the current dimension row, is the standard approach. If the hash matches, the row is already current and nothing needs to happen. If it doesn't, a new version is created. Running the same input through this logic twice produces the same result, which is the property needed for safe re-runs.
Partial failure recovery requires knowing exactly how far a load got before it failed. Loads that process data in identifiable batches, committing each batch and recording its completion, allow recovery to resume from the last successful batch rather than restarting from scratch. This matters significantly for long-running loads: restarting a six-hour load from the beginning because it failed in hour five is unacceptable in most production environments. Designing loads with explicit progress checkpoints makes partial recovery practical.
Late-arriving data
Source systems do not deliver data in strict chronological order. Transactions get processed late. Systems are unavailable during maintenance windows and deliver a backlog when they come back. Manual corrections arrive days or weeks after the events they correct. A warehouse that can only handle data arriving in sequence will be wrong whenever reality doesn't cooperate.
Late-arriving facts are transactions whose event date falls in a past reporting period but which arrive in the current load cycle. If the warehouse has already closed the reporting period for that date, simply inserting the late fact produces incorrect historical results unless the period totals are recalculated. How to handle this depends on how the warehouse stores and serves aggregations. Warehouses that serve queries directly against atomic fact tables automatically include late-arriving facts once they are inserted, because each query recalculates against whatever rows exist. Warehouses that pre-aggregate into summary tables need to identify which summary periods are affected by late-arriving facts and refresh those periods specifically.
Late-arriving dimensions are entities whose descriptive records arrive after the fact records that reference them. The standard handling is what Kimball calls the unknown member, sometimes also called the inferred member: a dimension row with a well-known surrogate key, typically zero or negative one, and attribute values indicating that the actual record hasn't arrived yet. When a fact record arrives referencing a business key that isn't yet in the dimension, the fact is loaded with the inferred-member surrogate. When the actual dimension record arrives later, the inferred member is updated in place with the real attributes (or a Type 2 version transition is recorded, depending on the dimension's SCD strategy). This preserves referential integrity and ensures no facts are dropped while their dimension records are pending.
The date range of late arrivals that needs to be accommodated is a design parameter worth establishing explicitly. Most warehouses see the majority of late-arriving data within a few days of the event date. A small number of corrections arrive weeks or months later. The handling for each range can be different: automated correction for data within a standard late-arrival window, exception handling and review for corrections outside it. Defining these thresholds before building the pipeline avoids ad-hoc decisions in production when an unusual late arrival appears.
Monitoring and observability
A warehouse load that produces no errors is not necessarily a load that produced correct results. Load success and data correctness are different things. A CDC connector that falls behind its log retention window fails silently until the gap becomes too large to ignore. A timestamp watermark that advances past a batch of records with a data quality problem loads the bad records without complaint. Deleted records that weren't captured by the incremental load stay visible in queries indefinitely.
Row count monitoring is the most basic form of load observability: track how many rows were loaded in each cycle and alert when the count deviates significantly from the historical pattern. Zero rows when the typical count is fifty thousand is almost certainly wrong. Five million rows when the typical count is fifty thousand warrants investigation before the data is used. Row counts don't catch everything, but they catch a large proportion of the obvious failures quickly.
Freshness monitoring tracks how recently each table was successfully loaded and alerts when a table hasn't been updated within its expected cadence. A reporting table that should update hourly but hasn't been touched in four hours is either failing silently or blocked by a dependency that hasn't run. Teams that surface this through automated alerting catch it before users notice stale data. Teams that don't surface it get the report at 9am.
Duplicate detection and data quality checks are the two most commonly skipped monitoring categories. Checking for duplicate rows by key on a scheduled basis is cheap insurance against imperfect idempotency in incremental loads. Duplicates don't announce themselves; they surface when an analyst reports sales numbers are double what they should be. Data quality checks that run after loading, verifying value ranges, null rates, referential integrity, and business metric plausibility, catch problems that would otherwise propagate into reports invisibly. Neither substitutes for upstream data quality work, but both catch things that upstream work misses.
Load duration tracking is worth adding once the warehouse has more than a handful of tables. A load that suddenly takes twice as long as usual is telling you something. Catching the signal before the load window starts competing with query workloads is considerably easier than diagnosing the problem under pressure when it does.
Practical guidance
Most operational failures in production warehouses were predictable at design time. These are the countermeasures worth building in before they're needed.
Build idempotency in from the start, not as a retrofit. A load pipeline that isn't idempotent is a liability that will be discovered under pressure. Merges over inserts, delete-then-insert over truncate-reload for incremental periods, hash-based change detection for SCD handling (the hash is computed over only the tracked attributes, not the full row, so unrelated source-system fields don't trigger spurious version creates): these patterns cost more to build initially and pay back substantially when the first re-run is needed.
Test incremental loads against known gaps and late arrivals before production. Create a test scenario where a source record is delayed by two days, a source record is hard-deleted, and a source record is updated multiple times in quick succession. Verify the warehouse handles all three correctly. Most incremental load implementations that haven't been tested against these scenarios fail at least one of them.
Make watermarks durable and visible. Store them in the warehouse itself alongside the data they govern. Include the watermark state in standard monitoring so the team knows immediately if a load advanced the watermark without loading the expected volume of data.
Keep raw staging separate from governed warehouse tables. The staging layer is where source data lands before it has been validated and transformed. Mixing staging and governed data in the same layer makes it difficult to enforce access controls, hard to reason about data quality boundaries, and easy for analysts to accidentally query unvalidated data. Physical separation, even just a separate schema, enforces the boundary.
Plan for source system schema changes before they happen. Source schemas change, and the load pipelines that depend on them break when they do. Having a defined process for propagating source schema changes through the pipeline, rather than discovering the break when the load fails, reduces the time between a source schema change and a corrected pipeline. The process can be as simple as a checklist, but it needs to exist.
Related content
The data warehouse fundamentals pillar covers ETL and ELT at an architectural level. The ETL vs ELT comparison covers the choice between the two approaches in detail. The dimensional modeling pillar covers the load ordering constraints that follow from fact and dimension relationships. The full implementation depth for change data capture, including the initial baseline handoff and schema evolution failure modes, is covered in the dedicated technique article. Slowly changing dimensions covers the change-handling that interacts with CDC at the dimensional layer. The data extraction models article covers the seven extraction patterns a warehouse loader runs against in practice and the per-source decision matrix that determines which fits where. Data warehouse testing covers the validation, regression, and performance disciplines that keep these loads correct as the model and the sources evolve. Data cleansing covers the staging-to-curated rule pattern and the role of declarative tests at the transform layer. Data masking covers the static, dynamic, and on-the-fly architectures and the mask-before-load versus mask-in-warehouse axis that determines which fits a given workload. The referential integrity decision framework covers the cloud-era choice between database-enforced foreign keys, informational constraints, and transformation-layer assertions for the load-ordering dependency graph this pillar describes. Change data capture, watermark, idempotency, slowly changing dimension, data quality, data contract, and referential integrity all have glossary entries.
