Data warehouse metadata: catalogs, lineage, and repositories

A warehouse without metadata is a database with mystery columns in it. Practitioners learn this the second time someone asks what cust_seg_3 means and nobody can answer without reading the load code. The data warehouse pillar treats metadata as a supporting concern; this article is the supporting concern in detail: what counts as warehouse metadata, where it physically lives in a 2026 stack, how the old single-repository pattern gave way to federated data catalogs and machine-readable lineage, and where teams typically lose control of it.

TL;DR. The three-category split (technical, business, operational metadata) still holds, but the storage pattern has changed. A modern stack federates metadata across the warehouse's information schema, the transformation tool's DAG, the orchestrator's run history, the BI layer's semantic model, and a data catalog that stitches them together. Data lineage, once a nice-to-have, is the metadata surface auditors and platform teams now ask for first. OpenLineage is the emerging cross-tool standard for emitting it.

What counts as warehouse metadata

The working definition is the one that has held since Bagley coined the term in 1968: metadata is data about data. The useful refinement, for warehouse work specifically, is the three-category split that practitioners converged on through the 1990s and 2000s. The categories are descriptive rather than physical; the same catalog entry might carry more than one.

Technical metadata describes the shape and location of data. Schema names, table names, column names, data types, primary and foreign key constraints, partition and clustering keys, file formats, storage locations, owning database or warehouse, refresh frequency. This is the metadata the warehouse itself emits as a byproduct of existing; every cloud warehouse exposes most of it through INFORMATION_SCHEMA views or an equivalent system catalog.

Business metadata describes what the data means to the people who use it. Plain-language descriptions of tables and columns, business-glossary definitions of metrics, ownership, data classification (PII, financial, public), regulatory tags (GDPR, HIPAA, SOX-relevant), the team or domain accountable for the data's correctness. This is the metadata the warehouse cannot emit because it does not know it; somebody has to write it down and keep it current.

Operational metadata describes what the warehouse is doing and has done. Load run history with start and end timestamps, row counts in and out per stage, success and failure status, query history, query cost and bytes scanned, refresh latency, the watermark each incremental load advanced to. This is the metadata the orchestrator and the warehouse emit together; the discipline is collecting it somewhere queryable rather than letting it expire from the platform's default retention window.

These categories aren't competing definitions. A column ownership tag is business metadata; the row in the catalog recording who set the tag and when is operational metadata; the column's data type sitting next to both is technical metadata. The split is useful because the three categories are sourced differently, owned differently, and decay at different rates.
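
The column-ownership example can be made concrete with a small record type. This is a minimal sketch, not any catalog's actual data model: the `ColumnMetadata` class, its field names, and the sample values are all hypothetical, chosen only to show the three categories sitting side by side on one entry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ColumnMetadata:
    # Technical metadata: emitted by the warehouse itself.
    table: str
    name: str
    data_type: str
    # Business metadata: authored by a person; the warehouse cannot infer it.
    description: str = ""
    owner: str = ""
    classification: str = ""  # e.g. "PII", "financial", "public"
    # Operational metadata: who changed the business fields, and when.
    audit_log: list[tuple[datetime, str, str]] = field(default_factory=list)

    def set_owner(self, owner: str, changed_by: str) -> None:
        """Update the business field and record the change operationally."""
        self.owner = owner
        self.audit_log.append(
            (datetime.now(timezone.utc), changed_by, f"owner={owner}")
        )


col = ColumnMetadata(table="dim_customer", name="cust_seg_3", data_type="VARCHAR")
col.set_owner("growth-analytics", changed_by="jdoe")
print(col.data_type, col.owner, len(col.audit_log))
```

The three groups of fields are filled by different writers (the warehouse, a person, the system recording the edit), which is why they decay at different rates.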

From metadata repository to federated catalog

The classic model that most data-warehouse textbooks describe, where metadata lives in one repository that "manages" the warehouse, came out of a different stack. In an on-premises warehouse built around a single ETL tool and a single BI tool, the ETL tool's repository was the natural home: it knew the schemas, the transformations, the loads, and exported a metadata bridge to the BI tool. The Common Warehouse Metamodel (CWM) was the OMG standard for exchanging this metadata between tools; in practice, most large warehouses ran on the metadata model of whichever vendor dominated their stack.

That arrangement does not survive contact with the cloud data platform. A 2026 warehouse touches more independent systems than the single-repository model assumes. The warehouse itself is one source of truth (Snowflake, BigQuery, Redshift, Databricks). The transformation layer is another (dbt, SQLMesh, or one of the cloud-native equivalents) and carries its own model graph. The ingestion tier (Fivetran, Airbyte, Debezium, or a hand-rolled CDC pipeline) carries source-to-landing schema mapping. The orchestrator (Airflow, Dagster, Prefect) carries the DAG and the run history. The BI layer (Looker, Tableau, Power BI, Hex) carries the semantic model and the metric definitions. The data quality layer (Great Expectations, dbt tests, Monte Carlo, Soda) carries the expectations and the test history. Each of these systems holds metadata the others need; none of them holds all of it.

The current pattern is to federate. Each system continues to own the metadata it natively produces, and a data catalog sits above the stack ingesting from each, presenting a unified searchable surface, and writing back the human-authored business metadata that none of the operational systems can produce on their own. The major catalogs (Atlan, Alation, Collibra, plus open-source DataHub, Amundsen, and OpenMetadata) differ on UX, governance features, and pricing, but the underlying architecture has converged: connectors against each upstream system, a central graph store, search and discovery on top, and an API that lets downstream systems query the catalog itself.

The old metadata repository did not disappear; it got disaggregated. The warehouse's information schema is the technical-metadata repository. The transformation tool's manifest is the model-metadata repository. The orchestrator's metadata DB is the run-history repository. The catalog is the index that makes them queryable as if they were one.

Where the metadata actually lives

A concrete walk-through helps fix the picture. In a representative 2026 cloud stack:

Layer	What it owns	How to read it
Warehouse (Snowflake / BigQuery / Redshift / Databricks)	Schemas, tables, columns, types, partition and clustering keys, masking policies, table tags	`INFORMATION_SCHEMA` views, `SHOW` commands, or the platform's catalog API
Object storage + table format (Iceberg, Delta, Hudi)	Table schemas, snapshots, partition spec, file-level statistics, schema evolution history	Table-format metadata files (Iceberg's `metadata.json`, Delta's `_delta_log/`)
Transformation tool (dbt, SQLMesh)	Model DAG, column-level lineage, tests, documentation YAML, exposures	`manifest.json`, `catalog.json` generated by the parser
Ingestion (Fivetran, Airbyte, Debezium)	Source-to-target schema map, sync history, change-event metadata	Tool's metadata API; for Debezium, the change event envelope itself
Orchestrator (Airflow, Dagster, Prefect)	Task graph, run history, retry counts, durations, run-level lineage	Tool's metadata DB or REST API
BI / semantic layer (Looker, Tableau, Power BI, dbt Semantic Layer)	Metric definitions, dashboard-to-table lineage, usage history	Tool's API or content-export endpoints
Data quality (dbt tests, Great Expectations, Soda, Monte Carlo)	Test definitions, run results, anomaly detection state	Tool's results store
Catalog (Atlan, Alation, Collibra, DataHub, OpenMetadata)	Federated index of all of the above plus human-authored business metadata	Catalog UI, API, and increasingly the catalog itself as a queryable system

The discipline isn't to consolidate all of this into one store; that battle was lost a decade ago and doesn't need to be re-fought. The discipline is to know which system is the source of truth for which kind of metadata, and to point the catalog and any downstream consumer at that source rather than at a stale copy elsewhere. When the warehouse's column comment disagrees with the catalog's column description, the catalog is wrong; when the dbt model's documentation contradicts the catalog's table description, refresh the catalog from dbt.
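
The source-of-truth rule can be written down as a lookup rather than left as tribal knowledge. The sketch below assumes a hypothetical `SOURCE_OF_TRUTH` mapping and hypothetical field names; the point is only that a disagreement is resolved by asking which system is authoritative, not which copy is newest.

```python
# Hypothetical mapping: which system is authoritative for which metadata field.
SOURCE_OF_TRUTH = {
    "data_type": "warehouse",
    "column_comment": "warehouse",
    "model_description": "dbt",
    "owner": "catalog",
}


def resolve(field_name: str, values_by_system: dict[str, str]) -> tuple[str, list[str]]:
    """Return the authoritative value and the systems holding a stale copy."""
    authority = SOURCE_OF_TRUTH[field_name]
    truth = values_by_system[authority]
    stale = sorted(
        system
        for system, value in values_by_system.items()
        if system != authority and value != truth
    )
    return truth, stale


value, stale = resolve(
    "model_description",
    {"dbt": "One row per active customer.", "catalog": "Customer table."},
)
print(value, stale)
```

Here the catalog's copy is reported as stale and would be refreshed from dbt, which matches the rule stated above.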

Lineage as a first-class metadata concern

Lineage is the second piece of the picture that has shifted since the textbook era. A traditional metadata repository recorded lineage as a set of source-to-target column mappings inside the ETL tool's metadata model. That worked when one ETL tool owned the entire transformation graph. It does not work when transformations are split across ingestion (Fivetran lands raw tables), transformation (dbt builds intermediate and marts models), and ad hoc SQL (analytics engineers create derived tables outside the model).

Two developments changed the lineage picture.

The first is that dbt and its peers made transformation lineage a first-class artifact of the build. The transformation tool parses the SQL, resolves model references, and emits a dependency DAG. Running dbt docs generate produces a manifest.json plus a catalog.json that together give you every model, its sources, its dependents, and its columns; both the dbt documentation UI and the catalog connectors read these artifacts. In dbt Core the artifacts record model-level dependencies; column-level lineage is typically derived by parsing the compiled SQL, which SQLMesh, dbt's hosted tooling, and several catalog connectors do. Either way, lineage from raw landing tables through marts to the dashboard layer is now derivable from the artifacts, not separately recorded as ETL mappings.

The second is OpenLineage, an open standard for emitting lineage events from any tool that produces or consumes datasets. A pipeline component (an Airflow task, a Spark job, a dbt run, a Flink job) emits an event when it starts and completes, naming the input datasets it read and the output datasets it produced. A backend (Marquez is the reference implementation; the major catalogs ingest OpenLineage events directly) assembles the events into a cross-tool lineage graph. The point isn't that any one tool can't produce lineage on its own; the point is that no single tool sees the whole graph, and OpenLineage gives them a shared vocabulary for stitching their fragments together.
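
To make the event shape concrete, here is a minimal sketch of building a start and a complete event as plain dictionaries. The top-level field names follow the OpenLineage run-event structure; the namespace, job name, dataset names, and producer URL are hypothetical, and the version embedded in `schemaURL` is illustrative and should be checked against the spec release in use. Sending the event to a backend is left out.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type: str, run_id: str, job: str,
              inputs: list[str], outputs: list[str]) -> dict:
    """Build an OpenLineage-style run event as a plain dict."""
    namespace = "example-warehouse"  # hypothetical namespace
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",
        # Illustrative spec version; verify against the spec you target.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


# The same runId ties the START and COMPLETE events to one run.
run_id = str(uuid.uuid4())
start = run_event("START", run_id, "build_fct_orders",
                  ["staging.stg_orders"], ["marts.fct_orders"])
complete = run_event("COMPLETE", run_id, "build_fct_orders",
                     ["staging.stg_orders"], ["marts.fct_orders"])
print(json.dumps(complete, indent=2))
```

Because every tool names datasets with the same namespace-and-name pair, a backend can join one tool's outputs to another tool's inputs without either tool knowing about the other.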

For warehouse teams, the practical consequence is that lineage is now expected end-to-end (source system to dashboard), automated from build artifacts and runtime events rather than maintained by hand, and queryable from outside the originating tools. Auditors asking "where did this number on the executive dashboard come from" expect a traceable answer in minutes, not an investigation. Platform teams asking "what breaks if I drop this column" expect an impact analysis from the catalog, not a Slack thread.
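
An impact analysis of this kind is a graph walk over lineage edges. The sketch below works at model level over a dictionary shaped like the `child_map` section of a dbt manifest.json; the project name `shop` and every node ID are hypothetical, and a column-level version would need column-to-column edges from a catalog or SQL parser instead.

```python
from collections import deque

# Hypothetical excerpt shaped like the child_map section of a dbt manifest.json.
child_map = {
    "source.shop.raw.orders": ["model.shop.stg_orders"],
    "model.shop.stg_orders": ["model.shop.fct_orders"],
    "model.shop.fct_orders": ["exposure.shop.exec_dashboard"],
    "exposure.shop.exec_dashboard": [],
}


def downstream(node: str, children: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every node reachable from `node`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(children.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        order.append(current)
        queue.extend(children.get(current, []))
    return order


impacted = downstream("model.shop.stg_orders", child_map)
print(impacted)
```

Reading the same walk in the other direction, over the parent edges, answers the auditor's question of where a dashboard number came from.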

Operational metadata and the modern stack

The third category that has shifted in shape is operational metadata. The traditional metadata repository stored ETL run history as job statuses; the modern stack treats operational metadata as observability data, with the same expectations of granularity, retention, and queryability that platform engineering applies to application telemetry.

A 2026 warehouse pipeline typically emits operational metadata at three levels of detail. The orchestrator records task-level runs (started, completed, failed, duration, retries). The transformation tool records model-level runs (rows produced, tests passed and failed, freshness). The warehouse itself records query-level history (bytes scanned, credits consumed, slot time, queue time, user, warehouse or compute pool). The aggregation of all three is what answers questions like "which model is responsible for half of last month's Snowflake bill" or "which dashboard's queries spiked our BigQuery on-demand cost yesterday."
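
The "which model is responsible for half the bill" question reduces to grouping query-level history by the model that issued each query. The sketch below assumes the transformation tool stamped each query with a tag carrying the model name; the rows, the `query_tag` and `credits` field names, and the figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: query-level history exported from the warehouse, with a
# query tag that the transformation tool set to the model name.
query_history = [
    {"query_tag": "fct_orders", "credits": 12.0},
    {"query_tag": "fct_orders", "credits": 8.0},
    {"query_tag": "dim_customer", "credits": 5.0},
    {"query_tag": "", "credits": 15.0},  # ad hoc query, no model attached
]


def cost_share_by_model(rows: list[dict]) -> dict[str, float]:
    """Return each model's fraction of total credits consumed."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["query_tag"] or "(untagged)"] += row["credits"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {model: credits / grand_total for model, credits in totals.items()}


shares = cost_share_by_model(query_history)
print(shares)
```

The untagged bucket is worth keeping visible: a large share there usually means ad hoc queries or tools that do not propagate the tag, which is itself a metadata gap.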

The shift in how this metadata gets used has been real. Cost observability tools (SELECT for Snowflake, BigQuery's INFORMATION_SCHEMA.JOBS, Databricks' system tables, plus third-party platforms layered on top) treat operational metadata as a primary cost-optimization surface. Data observability tools (Monte Carlo, Bigeye, Soda, Datafold) treat the combination of operational metadata and schema metadata as the signal layer for detecting silent data quality failures. In both cases, the value of the operational metadata depends on how completely the orchestrator, the transformation tool, and the warehouse expose their run histories, and how reliably the catalog ingests them.

Default retention is the trap. Most cloud warehouses retain query history at a useful grain for a relatively short window. In Snowflake, the INFORMATION_SCHEMA QUERY_HISTORY table functions reach back 7 days and the Snowsight query history page 14 days; the ACCOUNT_USAGE.QUERY_HISTORY view keeps 365 days but lags by up to 45 minutes. Teams that want longer or more correlated history copy operational metadata into the warehouse itself on a daily schedule, where it becomes queryable alongside the business data and is no longer subject to platform retention. Treating operational metadata as just another loaded table is a discipline that pays off the first time someone asks a quarterly cost question.
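
The daily copy is an ordinary incremental load with a high-water mark. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse so that it runs anywhere; the table names `platform_query_history` and `ops_query_history`, the columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.executescript("""
    CREATE TABLE platform_query_history (query_id TEXT, end_time TEXT, bytes_scanned INTEGER);
    CREATE TABLE ops_query_history (query_id TEXT PRIMARY KEY, end_time TEXT, bytes_scanned INTEGER);
""")
conn.executemany(
    "INSERT INTO platform_query_history VALUES (?, ?, ?)",
    [("q1", "2026-01-01T10:00:00", 100), ("q2", "2026-01-02T10:00:00", 250)],
)


def archive_new_rows(conn: sqlite3.Connection) -> int:
    """Copy rows newer than the archive's high-water mark; return rows copied."""
    count_sql = "SELECT COUNT(*) FROM ops_query_history"
    (before,) = conn.execute(count_sql).fetchone()
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(end_time), '') FROM ops_query_history"
    ).fetchone()
    # Strict '>' assumes rows never arrive late with an end_time at or below
    # the watermark; a real job would usually re-read an overlap window.
    conn.execute(
        "INSERT INTO ops_query_history "
        "SELECT query_id, end_time, bytes_scanned FROM platform_query_history "
        "WHERE end_time > ?",
        (watermark,),
    )
    conn.commit()
    (after,) = conn.execute(count_sql).fetchone()
    return after - before


first = archive_new_rows(conn)   # copies q1 and q2
conn.execute("INSERT INTO platform_query_history VALUES ('q3', '2026-01-03T10:00:00', 75)")
second = archive_new_rows(conn)  # copies only q3
print(first, second)
```

Once the archive table exists, its retention is a decision the team makes rather than a platform default it inherits.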

The data-mesh framing

Data mesh, as a sociotechnical framing for analytical platforms, has changed how metadata responsibilities are allocated. Two of the core mesh principles, federated governance and domain ownership of data products, imply that metadata is a domain responsibility, not a central one. Each domain owns its data products' descriptions, classifications, quality contracts, and lineage; a central team owns the platform that lets domains publish and lets consumers discover.

The implication for metadata management is concrete. Business metadata, the category least amenable to automation, is the one most often missing in practice. A centralized metadata team trying to document every table in the warehouse is a chronically under-resourced effort that produces stale descriptions. Pushing the responsibility to the domain that produces the data, and tying it to the data-product contract that domain publishes, is the organizational change that actually keeps business metadata current. The data catalog, in this framing, is the discovery layer over domain-owned data products, not the metadata system of record.

This is a governance change more than a technical one; the catalog tooling supports either model. But teams adopting catalogs without thinking about who owns which kind of metadata typically discover, six months in, that ingestion is working and human-authored descriptions are not appearing because nobody's job description includes writing them.

Where teams typically lose control

The recurring failure modes in metadata practice are predictable enough to enumerate.

Description drift. Column descriptions in the catalog, written when a model was first published, no longer match what the column contains. The fix is to make descriptions an artifact of the model definition (in dbt, in the YAML next to the SQL), so updating the model and updating the description happen in the same code change.

Lineage gaps at the seams. End-to-end lineage requires every tool in the pipeline to emit it. A team running dbt-internal lineage cleanly will still have a gap from the raw landing tables to the source systems unless the ingestion tool emits lineage too, and another gap from marts to the dashboard layer unless the BI tool participates. The pragmatic answer is to identify the seams, decide which are worth closing, and accept the rest as documented gaps rather than pretending the graph is complete.

Catalog as the second source of truth. A catalog ingesting from upstream tools that also allows direct editing in the catalog UI creates two copies of the same metadata, and they diverge. The discipline is to push edits back to the source system wherever the source system is the system of record (column descriptions to dbt YAML, table tags to warehouse metadata, ownership to the catalog itself if the catalog is the chosen source for ownership), and to treat the catalog's editable fields as a small, controlled set rather than a free-for-all.

Operational metadata that expires before it's useful. Default platform retention windows are shorter than most analytical needs. A quarterly cost retrospective wanting last quarter's query history is too late if the platform retained 14 days. The fix is to copy operational metadata into the warehouse on a daily schedule and treat it as a managed table.

Untagged sensitive data. PII, financial, or regulated columns that aren't classified are the metadata gap that turns into a compliance incident. Automated classifiers (Snowflake's classification feature, BigQuery's data profiler, third-party scanners) reduce the manual burden, but the policy decisions (what counts as PII for this organization, what masking applies to it) are human and have to be encoded somewhere the warehouse and the catalog both honor.

Glossary as a desert. A business glossary that lists 400 terms with no clear authority, no review cadence, and no integration into the catalog UI becomes a wiki nobody reads. The terms that actually carry weight (revenue, active customer, churn, MRR, the few that are contested) deserve sharp definitions and explicit owners; the rest can wait until they are contested too.
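
Two of the failure modes above, description drift and untagged sensitive data, lend themselves to an automated check that runs in the same pipeline as the model build. The sketch below reads a dictionary shaped like the `nodes` section of a dbt manifest.json, reduced to the fields it needs; the node ID, the columns, the `pii` tag name, and the name-matching heuristic are all assumptions, and a real classification policy would come from the organization rather than from a regular expression.

```python
import re

# Hypothetical excerpt shaped like the `nodes` section of a dbt manifest.json,
# reduced to the fields this check reads.
nodes = {
    "model.shop.dim_customer": {
        "columns": {
            "customer_id": {"description": "Surrogate key.", "tags": []},
            "email": {"description": "", "tags": []},
            "cust_seg_3": {"description": "", "tags": []},
            "phone_number": {"description": "Primary contact number.", "tags": ["pii"]},
        }
    }
}

# Assumed naming heuristic for columns that probably hold personal data.
SENSITIVE_NAME = re.compile(r"(email|phone|ssn|birth|address)", re.IGNORECASE)


def lint(nodes: dict) -> list[tuple[str, str, str]]:
    """Return (node_id, column, problem) for each metadata gap found."""
    findings = []
    for node_id, node in nodes.items():
        for col, meta in node["columns"].items():
            if not meta["description"].strip():
                findings.append((node_id, col, "missing description"))
            if SENSITIVE_NAME.search(col) and "pii" not in meta["tags"]:
                findings.append((node_id, col, "looks sensitive but is untagged"))
    return findings


findings = lint(nodes)
for finding in findings:
    print(finding)
```

Failing the build on a non-empty result is one way to keep descriptions and tags in the same code change as the model; a name heuristic will still miss sensitive columns with innocuous names, so it supplements a classifier rather than replacing one.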

What good metadata practice looks like

The threshold question for any warehouse team is whether someone unfamiliar with the platform can answer four questions in under five minutes:

What does this column mean?
Where did this value come from?
Who owns this table?
When did this data last refresh?

If yes, the metadata layer is working. If any of those takes longer than five minutes, the gap is the next thing to fix. The specific tooling matters less than the property of having a coherent answer to each.

The architectural commitments to support that property are modest. Treat the warehouse's information schema as the source of truth for technical metadata and don't duplicate it elsewhere. Treat the transformation tool's manifest as the source of truth for model documentation and lineage, and treat catalog descriptions as a view over it. Emit lineage events from every tool that participates in the pipeline; adopt OpenLineage where the tools support it. Copy operational metadata into the warehouse on a schedule so retention is yours to set. Assign business metadata to the domain that produces the data; review it on a cadence; tie unfulfilled descriptions to the data-product contract rather than to a central wishlist.

The receiving end of metadata is the warehouse loading and operations pillar, where operational metadata sits alongside scheduling and monitoring. The role of metadata in warehouse automation is its own substantial topic: model-driven generation tools encode much of what would otherwise be hand-authored metadata into the model itself, which is what makes load logic derivable from the model rather than maintained by hand.

References

The Data Warehouse Toolkit, Kimball and Ross, 3rd edition, Wiley, 2013. Chapter 19 on ETL subsystems treats metadata as one of the 34 ETL subsystems and is characteristically specific about what good practice looks like.
OpenLineage specification. The current cross-tool standard for emitting lineage events; the spec, the reference implementations, and the list of integrating systems are all on the project site.
dbt documentation. The canonical reference for treating model documentation and column-level lineage as build artifacts.