Skip to article
Data Warehouse Info

A practitioner's reference for analytical data warehousing.

Reference Articles · Technique Deep-Dives · Courses · Glossary

Data vault modeling


Data vault modeling

Data vault modeling organizes warehouse data into hubs, links, and satellites for auditability, source traceability, and resilience to schema changes. This guide covers the core structures, Data Vault 2.0 additions, the mart layer requirement, and when data vault is the right choice.

By Farhan Ahmed Khan


Data vault modeling is a warehouse design methodology built around three table types: hubs hold business keys, links hold relationships between hubs, and satellites hold descriptive attributes with full history. It was developed by Dan Linstedt and formalized in the early 2000s, with a significant revision in 2013 that added performance structures and broadened the methodology's scope.

The design problem data vault solves is different from the one dimensional modeling solves. Dimensional modeling optimizes for query simplicity and analytical performance. Data vault optimizes for auditability, source traceability, and resilience when source system schemas change. These are not competing solutions to the same problem, which is why many organizations use both in the same warehouse: data vault for the integration layer, dimensional marts on top for BI consumption.

The three core structures

Every data vault model is built from three types of tables. What each type stores, and what it deliberately excludes, is worth understanding precisely before anything else.

Hubs store business keys. A business key is the identifier the business uses to refer to an entity: a customer number, a product code, an order identifier, an employee ID. The hub table for customers contains one row per unique customer business key, a surrogate key for the hub row, the date the key was first loaded, and the name of the source system it came from. Nothing else. No descriptive attributes. No relationships.

Keeping business keys in hubs and nothing else is the methodology's most important structural rule. It separates the question "does this entity exist" from the question "what do we know about it." Source systems can change how they describe a customer without touching the hub. New source systems can be added without restructuring existing tables. Multiple source systems that each use different internal identifiers for the same real-world customer can each contribute their own hub rows. The hub holds the keys. Everything else goes elsewhere.

Links store relationships between hubs. A link table for order line items might connect an order hub, a product hub, a customer hub, and a date hub. It contains the surrogate keys of each participating hub, its own surrogate key for the link row, a load date, and a record source. No descriptive attributes. No measures. Links are insert-only: if a relationship changes, a new row is added. The full history of relationships is always present, which is useful for auditability but means link tables grow large when relationships change frequently.

Satellites store descriptive attributes and their full history, attached to either a hub or a link. A customer satellite might contain address, email, phone number, segment, and acquisition channel. A link satellite might contain the quantity and price associated with a specific order-line-product relationship. Each satellite row carries the surrogate key of the parent hub or link, a load date, optionally a load end date, and a hash of the attribute values used to detect changes.

Satellites are where most of the data lives. A single hub might have several satellites attached to it, each tracking a different set of attributes from different source systems on different load schedules. An order hub might have one satellite populated by the order management system and another by the fulfillment system. They coexist independently and load independently. Adding a third source doesn't require modifying either existing satellite. The insert-only rule applies here as it does everywhere in the vault: rows are added when attributes change, existing rows are never updated, and the full attribute history is preserved.

sat_customer_oms
address, email

hub_customer
business key

sat_customer_crm
segment, tenure

hub_order
business key

link_customer_order
relationship

sat_link_pricing
quantity, unit price

Data Vault 2.0 additions

The 2013 revision of the methodology, commonly called Data Vault 2.0 or DV2, made several practical additions to the original design. The core hub-link-satellite structure is unchanged. DV2 adds performance structures, a layer for business logic, and explicit accommodation for non-relational and streaming data sources.

Point-in-time tables, called PITs, address a real performance problem in raw vault queries. To reconstruct what a hub entity looked like at a specific point in time, a query must join the hub to each of its satellites and apply a date-range filter on each join. With multiple satellites on a hub, this produces complex multi-satellite joins that are expensive to execute repeatedly. A PIT table pre-computes the satellite row keys themselves, keyed on hub surrogate plus snapshot date, so the multi-satellite reconstruction collapses from a date-range join on each satellite to an equi-join into the PIT. PITs are rebuilt periodically rather than maintained in real time; they are derived from the raw vault, not part of it.

Bridge tables serve a similar performance function for link-heavy queries. Traversing multiple links to answer a question about an entity can require several joins. A bridge table pre-joins a set of frequently queried links, creating a wider structure the mart layer can join against more efficiently. Like PITs, bridges are derived and rebuilt rather than part of the raw vault.

The business vault is the layer where business rules, derived attributes, and computed metrics are applied to raw vault data. The raw vault is a direct reflection of source data: no business logic applied, no interpretive decisions made. The business vault sits above it and applies transformations: standardizing units, computing tenure from load dates, deriving segments from raw attributes, resolving conflicts when multiple source systems disagree. This separation means the raw vault can always be used to audit what the source systems actually said, independently of any business logic that has been layered on top.

Same-as links, or SALs, handle entity resolution across source systems. When a customer in the CRM and an account in the ERP represent the same real-world entity, a same-as link records that equivalence without merging the two hub rows. The separate hub rows, each with their original business keys, remain intact. The SAL says "these are the same." In audit environments where the ability to trace data back to its exact source cannot be compromised by merging records, this matters. Data Vault 2.0 also formalizes hierarchical links (capturing parent-child relationships within a single entity type) and non-historized links (used for transactional facts where each event is immutable and history tracking on the link itself is unnecessary). The three link variants cover the structural cases the methodology needs.

The mart layer

Data vault does not query well directly. The normalized hub-link-satellite structure is designed for loading and auditing, not for the aggregations and joins that BI tools generate. Querying a data vault directly to produce an analytical report requires navigating multiple hubs, links, and satellites, applying date-range filters on each satellite join, and assembling a result that might have come from a dozen tables. This is possible but expensive and difficult to abstract for end users.

The mart layer solves this. On top of the raw vault and business vault, dimensional marts or other analytical structures are built for BI consumption. The mart tables look exactly like a dimensional model: fact tables, dimension tables, star schemas optimized for analytical queries. They are populated by transformations that read from the vault and produce the structures that BI tools expect.

This is the tax data vault imposes. A dimensional model can be queried directly by a BI tool without an intermediate layer. A data vault model requires the mart layer before it becomes analytically usable. The mart layer has to be designed, built, and maintained. When business requirements change, changes may need to propagate through both the vault and the mart.

The argument for paying this tax is that the vault layer itself is more durable and more auditable than a pure dimensional warehouse. Adding a new source system is a matter of adding satellites and links. Source schema changes are absorbed in the satellite that tracks the changed source without cascading to other parts of the model. The raw data history is always intact, independent of whatever analytical structures have been built on top of it. For organizations where those properties matter more than the cost of the additional layer, data vault is worth it.

The mart layer also creates a useful organizational boundary. The data engineering team owns the vault. The analytics engineering team owns the mart. Vault loading is about fidelity to source data. Mart design is about analytical utility. These are different skills and different concerns. Separating them at the architectural boundary can simplify team ownership, though it adds coordination overhead when changes need to cross the boundary.

Source systems
OMS, CRM, ERP, ...

Raw Vault
hubs, links, satellites
insert-only, source-faithful

Business Vault
derived attributes,
resolved conflicts, computed metrics

Mart Layer
dimensional stars,
BI-consumable

Advertisement
300 × 250

Loading behavior

Data vault loading has properties that distinguish it from dimensional loading, most of them favorable for large-scale parallel operation.

Hubs, links, and satellites can be loaded independently and in parallel. There are no surrogate key lookups of the kind that create ordering dependencies in dimensional loads. A hub row's key is derived from the business key itself, either through a sequence or more commonly through a hash function applied to the business key. Because the key derivation is deterministic, a satellite can be loaded without waiting for the hub to finish, since both the hub loader and the satellite loader will derive the same hub key from the same business key independently.

Hash keys are the more common approach in modern data vault implementations and deserve some explanation. Rather than assigning an arbitrary sequence number to each hub row, the hub key is the hash (typically MD5 or SHA-1) of the business key value. This means the key is reproducible: any system that sees the same business key will derive the same hub key without needing to look up the sequence assignment. This enables fully parallel, distributed loading across multiple systems or processes without a shared sequence generator as a bottleneck. Hash collisions are theoretically possible. In practice, at warehouse scale (typical business-key cardinality well under 10^12 across a corpus's lifetime), collision probability for MD5 is on the order of 10^-26 per pair, and Linstedt's own framing treats the property as a non-issue once the math is laid out. The cases where collisions become a real concern are adversarial inputs, not natural business keys.

The insert-only behavior means load logic is simpler in one respect: there are no updates. A hub row is either present or it isn't. A satellite row is either the latest version or it isn't. The merge logic that dimensional loads require for SCD handling is replaced by a simpler check: is this business key already in the hub, and has the hash of these satellite attributes changed since the last load. If the key is new, insert a hub row. If the hash has changed, insert a new satellite row. If neither condition holds, do nothing.

Record source tracking is not optional in data vault. Every hub row, link row, and satellite row carries the name of the source system that contributed it. It is a core audit requirement of the methodology, not a nice-to-have. When a discrepancy appears between what the warehouse reports and what a source system recorded, the record source field is how the investigation starts.

When data vault is and isn't the right choice

The conditions under which data vault earns its overhead are specific. Applying it outside those conditions produces an expensive warehouse that is harder to query than a dimensional model without providing the properties that justified the added complexity.

The hybrid architecture, data vault as the integration layer with dimensional marts on top, is the most common production deployment pattern. It captures the integration and auditability properties of data vault while delivering the query performance and BI compatibility of dimensional modeling. The vault absorbs source system complexity. The mart layer exposes the analytical surface. Teams that try to use pure data vault as an analytical layer, querying hubs, links, and satellites directly from BI tools, typically find the performance and usability unacceptable. Teams that try to use a dimensional model to do what data vault does in high-complexity, high-change source environments typically find the maintenance overhead too high. The hybrid acknowledges that the two approaches solve different problems and uses each for what it does well.

Data vault performs well when source system schemas change frequently. The clearest case is an enterprise replacing a core system: a banking platform migration, an ERP replacement, a transition to a new CRM. In these environments, the source schema before the migration and the source schema after are different, and the warehouse needs to absorb both without requiring a rebuild. A dimensional model built tightly against a source schema needs significant rework when that schema changes. A data vault model absorbs the change in a new or modified satellite without touching the hub structure or existing satellites.

Multiple source systems contributing data about the same entities is another strong signal. If five operational systems all have a concept of "customer" and each uses different business keys, different attribute names, and different data quality standards, data vault provides a natural integration architecture. The hub unifies the identity question. The satellites track each source's contribution independently. The same-as link handles the cases where the same real customer appears under different keys in different systems.

Auditability requirements push toward data vault. Regulatory environments where every data transformation must be traceable, where the ability to show exactly what source data said at exactly what point in time is a compliance requirement, are well served by the insert-only, record-source-tracked vault architecture. The raw vault is, by design, a complete audit trail.

Data vault is the wrong choice when the source environment is small and stable. A warehouse with two or three well-understood source systems, schemas that rarely change, and analytical requirements that are reasonably well defined before building starts is a better dimensional modeling candidate. Data vault adds loading complexity, requires a mart layer, and demands organizational discipline that returns less value when the problems it solves aren't present.

It is also harder to staff. Dimensional modeling is taught widely and understood by most data engineers. Data vault has a smaller practitioner community and more specific conventions that teams need to learn and agree on before they can build consistently. A team encountering data vault for the first time will build it wrong in ways that take months to discover. The methodology is rigorous by design, and rigorous methodologies require experienced practitioners to implement correctly.

Practical guidance

Getting data vault wrong is easy. Most of the mistakes happen before a single table is built.

Establish business key definitions before modeling starts. The hub is only as stable as the business key it holds. If the business key for a customer is the CRM account number in one team's understanding and the ERP customer ID in another's, the hub design will be wrong. Getting explicit agreement on which identifier is the canonical business key for each entity, and documenting it, is prerequisite work. This conversation often surfaces data quality problems in source systems that need to be resolved before the vault can be built correctly.

Separate raw vault and business vault deliberately. The temptation to apply business logic directly in satellite loading is real and should be resisted. Raw satellites should reflect source data as received, including data quality problems. Business vault satellites apply transformations. Keeping these in separate objects means the raw source data is always auditable independently of whatever business logic has been applied, which is the property the methodology was designed to preserve.

Use record source consistently and specifically. A record source value of "ETL" or "warehouse load" is useless for audit purposes. The record source should identify the specific source system and ideally the specific feed or extract that contributed each row. When an auditor asks where a specific value came from, the record source field should answer the question unambiguously.

Size the mart layer effort honestly. Teams that build data vault and underestimate the mart layer sometimes deliver a complete vault with no usable analytical surface. The vault without the mart is an audit archive, not a working analytical warehouse. Planning the mart layer as a first-class deliverable from the start, not as a follow-on project after the vault is complete, prevents this outcome.

Decide hash versus sequence keys before loading starts. Switching the key generation strategy after data has been loaded requires rebuilding the entire vault. The choice between hash keys and sequence keys has implications for parallel loading, cross-system integration, and operational complexity that affect everything downstream. Make it explicitly and document it.

The dimensional modeling pillar covers the alternative approach and the trade-offs between the two methodologies. The data warehouse fundamentals pillar covers where both modeling approaches fit in the broader warehouse architecture. Hub, link, satellite, business key, and point-in-time table all have glossary entries.

Advertisement
970 × 90