Incremental loading at production data volumes is impractical without a way to identify what changed in the source since the last cycle. Change data capture is the category of techniques that answer that question. The warehouse loading and operations pillar introduces the three established approaches, log-based, timestamp-based, and trigger-based, at overview depth. This article covers the mechanics, the operational concerns, and the failure modes that determine whether a CDC pipeline is reliably correct or quietly wrong.
TL;DR. Log-based change data capture is the production default where the source allows it. The initial snapshot-to-streaming handoff is where most implementations go wrong. Lag, retention margin, and schema evolution disciplines determine whether the pipeline is reliably correct or just usually correct.
Log-based CDC: reading the transaction log
Every transactional database maintains a log of committed writes before they are applied to the underlying tables. Postgres calls it the write-ahead log (WAL), MySQL the binary log (binlog), Oracle uses redo logs, and SQL Server exposes a parallel mechanism via CDC tables derived from the transaction log. The concept is the same: every insert, update, and delete is recorded before being applied, with enough information to reconstruct the change.
Log-based CDC reads this log directly. A connector parses the binary format and emits a stream of change events carrying the table, the operation type, the primary key, and the before-and-after state of each row. The output flows to a message bus (Kafka, Pulsar) and then to the warehouse loader.
Debezium is the canonical open-source implementation, with connectors for the major relational databases. Most cloud data integration platforms either embed Debezium or implement equivalent logic against the same database mechanisms.
The mechanics of subscription differ per database. Postgres exposes logical replication slots: the connector creates a slot at a specific LSN and reads forward. The slot prevents WAL cleanup past its position, which is how Postgres guarantees the connector can resume after a restart. MySQL exposes binlog reading via the replication protocol; the connector authenticates as a replica. SQL Server is a slightly more involved case because it exposes two separate mechanisms that get conflated in casual usage: Change Data Capture (CDC) populates change tables on a scheduled job, optimized for downstream extraction; Change Tracking (CT) is a lighter synchronous mechanism that records only that a row changed, not what it changed from. Most warehouse CDC connectors target the CDC tables; CT is the right choice when only "what changed since when" matters and the values themselves come from a follow-up query.
Log-based CDC captures every change including hard deletes, preserves the sequence of changes, and requires no source-schema modification. The cost is operational: log access typically requires DBA cooperation, retention windows constrain how far the connector can fall behind, and schema changes on tracked tables can break the connector if it isn't updated.
Lag accumulates when LSN B falls behind the head of the log. Retention must exceed maximum tolerable lag, or the slot's window into the log is lost and a fresh baseline is required.
Timestamp-based CDC: query-based watermarking
Where log access is unavailable, where the source is not a relational database, or where the source rate is low enough that log-based overhead is not justified, timestamp-based CDC is the alternative. The connector queries for records whose audit timestamp (updated_at, last_modified) is greater than the watermark from the previous cycle, processes the results, and advances the watermark.
Three failure modes are routine.
Hard deletes are invisible. A removed record leaves no updated timestamp; the query has no way to know it is gone. Workarounds: soft deletes in the source (a deleted_at column), periodic reconciliation against source row counts, or accepting the limitation and documenting it.
Clock skew creates gaps. If the source clock is ahead of the warehouse's watermark, records committed in the gap fall through. The safety-buffer pattern stores the watermark a few minutes behind the maximum observed timestamp and reprocesses that window each cycle. The cost is that some records are processed twice, which requires the consumer to be idempotent.
The audit column must be indexed. Without it, every cycle's query runs a full table scan on the source, which at modest table sizes is unacceptable in production. Its absence is a common reason teams switch from timestamp-based to log-based mid-project.
Timestamp-based CDC is the right answer for SaaS sources accessed via polling APIs, for sources where log access cannot be justified, and for tables whose change rate is low enough that polling is acceptable. It is the wrong answer wherever hard deletes matter or the source clock cannot be trusted.
Trigger-based CDC: source-side capture
Trigger-based CDC uses database triggers on the source system to write change events to a staging table as they occur. The warehouse reads from the staging table rather than the source tables directly.
The approach is complete (triggers fire on every commit, including deletes) and does not require log access. The capture is transactionally consistent with the write that produced it. The cost is that every write to a tracked table now requires at least one additional write to the staging table, which source-system teams reasonably push back on for high-throughput tables. The maintenance burden is recurring: every schema change requires trigger updates, and trigger definitions accumulate across the schema in ways that complicate migrations.
Trigger-based CDC fits when log access is unavailable, the source is relational, and the source team accepts the overhead. Outside those conditions, log-based is the better default and timestamp-based the better fallback.
Initial baseline and the streaming handoff
A CDC pipeline that has only streamed changes since it started cannot represent source state from before. Every production deployment needs a way to load the initial state and then transition to streaming, and the transition is where most implementations introduce silent data loss.
Naive approaches fail in predictable ways. Snapshot first, then start the connector: the window between snapshot completion and connector start drops changes. Start the connector first, then snapshot: the warehouse ends up with both the snapshot row and replayed events for the same row, producing duplicates.
The correct mechanic is a coordinated handoff:
The connector opens a snapshot transaction at a specific LSN (Postgres exports a snapshot; SQL standard calls it repeatable read isolation), begins buffering log events from that LSN, and reads the snapshot rows. The snapshot reads use that transaction's view, so they see exactly the state at LSN N regardless of how long the snapshot takes. When the snapshot completes, the connector switches to replaying the log from LSN N. The overlap between LSN N and LSN N+M produces no duplicates because each event is applied via idempotent merge on primary key. From LSN N+M onward, the connector is processing genuinely new changes.
Getting this wrong manifests as silent data loss: records modified or deleted during the snapshot window end up at their snapshot values rather than their final state. Row counts look right; values quietly disagree. Debezium implements this correctly out of the box, including the watermark-based incremental snapshot algorithm (introduced in version 1.6) that handles snapshot resumption and parallelism without the consistency hazards a naive implementation introduces. Custom connectors that get it wrong are rarely caught by load monitoring.
Schema evolution under CDC
Source schemas change. The failure mode per CDC approach varies.
Column additions are usually safe. Log-based connectors emit the new column on subsequent events. Timestamp-based SELECT * queries pick them up. Trigger-based capture needs the trigger updated. Warehouse consumers need to accept new columns without breaking, typically via a nullable target column.
Column removals are subtler. Log-based connectors keep emitting for tables that no longer reference the dropped column, but consumers expecting it will break. The discipline is to remove the column from the consumer first, then the source.
Column renames are the worst case. The log records the rename as a drop followed by an add, which is ambiguous: was the data meant to carry forward, or is the new column unrelated? Most consumers treat it as data loss. The mitigation is a two-step migration: add the new column, dual-write during a migration window, drop the old once consumers switch.
Type changes are the silent-corruption case. Consumers with typed schemas (Parquet, Avro, warehouse table definitions) either reject or coerce. Schema registries (Confluent's Schema Registry, Apicurio, AWS Glue) catch this when the pipeline enforces validation; without enforcement, the corruption ships.
Operational disciplines
Lag monitoring. For log-based CDC, lag is the LSN gap between the head of the log and the connector's position, converted via log write rate to a time estimate. For timestamp-based CDC, it is the gap between source high-water timestamp and watermark. Lag exceeding planned retention or polling cycle is the early warning of an unsurfaced failure.
Retention margin. Log retention must exceed the maximum tolerable connector downtime. Postgres replication slots are particularly demanding: an unconsumed slot prevents WAL cleanup, and an inactive connector can run the source's disk to zero. Slot disk usage is a first-class metric; a runbook for replacing a stuck slot is non-negotiable. The Postgres documentation on logical replication slots and the Debezium Postgres connector docs both surface this failure mode prominently, and the disk-pressure incidents that follow from missing it are common enough to be a recurring topic in practitioner forums.
Dead-letter handling. Some events the consumer cannot process: schema mismatches, corrupted payloads, downstream constraint violations. Route them to a dead-letter topic or table that the on-call rotation reviews regularly, not a silent sink.
Consumer idempotency. Replay is inevitable: the snapshot-to-streaming handoff, connector restarts, recoveries from backup. Every event must produce the same warehouse state on its tenth application as on its first. Merge by primary key is the standard pattern; the SCD Type 2 hash-based change detection is the dimensional analogue.
Schema registry coordination. Where multiple consumers read the same stream, the schema is shared infrastructure. A registry lets consumers either tolerate forward-compatible changes automatically or block on validation. Without one, single-consumer pipelines work and multi-consumer pipelines drift.
Choosing an approach
| Approach | When to use | Key constraint |
|---|---|---|
| Log-based CDC | Transactional relational source with log access | Operational coordination with source DBAs; slot disk pressure on Postgres |
| Timestamp-based CDC | SaaS or polling APIs; sources without log access; low-rate reference tables | Misses hard deletes; requires an index on the audit column |
| Trigger-based CDC | No log access and no reliable audit timestamps | Per-write overhead on the source's hot path |
Default to log-based wherever the source allows it; the deletes-captured property alone justifies it for any workload where source deletes are meaningful.
A mixed approach is common in production: log-based for high-volume operational tables, timestamp-based for slowly changing reference data, occasionally trigger-based for what falls outside both. The cost of running two approaches concurrently is mostly in monitoring; the underlying pipelines do not interact.
CDC is rarely the right answer for tables small enough to fully reload in a tolerable window. A nightly full reload of a thousand-row reference table costs less to operate and reason about than any CDC mechanism would.
Related
The receiving end of CDC is the warehouse, and slowly changing dimension handling of captured changes is its own substantial topic. The warehouse loading and operations pillar covers the broader operational context.
Reference
- Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly, 2017. Chapter 11 covers change streams and event sourcing in depth from the database-systems perspective. The canonical practitioner reference for the underlying mechanics.
- Debezium documentation. The reference implementation of log-based CDC for the major relational databases; the docs cover the snapshot-to-streaming handoff in particular detail.
