Data vault modeling

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 194.48.133.8 (talk) at 23:21, 9 January 2011.

Data Vault Modelling is a method of designing a database to provide historical storage of data coming in from multiple operational systems, with complete tracing of where every piece of data in the database came from. The method is designed to be resilient to change in the environment. This is achieved mainly by taking the business organisation as the starting point for the data model, on the assumption that the organisation changes less often than the operational systems that support it.

History and philosophy of Data Vault

In data warehouse modelling there are two well-known competing options for modelling the layer where the data is stored: either you model according to Kimball, with conformed dimensions and an enterprise data bus, or according to Inmon, with the database in third normal form. Both techniques have issues when dealing with changes in the systems feeding the data warehouse. Conformed dimensions also require you to cleanse data (in order to conform it), and this is undesirable in a number of cases. Data Vault is designed to avoid or minimise the impact of these issues.

Dan Linstedt, the creator of the method, describes the resulting database as follows:

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.[1]

An alternative name for the method is "Common Foundational Integration Modelling Architecture."[2]

Data Vault's philosophy is that all data is relevant, even if it is "wrong": data being wrong is a business problem, usually not a technical one, so the warehouse must be able to capture all of it. Data Vault also responds to the growing need for complete auditability and traceability of all the data in the data warehouse. Due to Sarbanes-Oxley in the USA and similar measures in Europe, this is a relevant topic for many business intelligence implementations.

Implementation

Data Vault attempts to solve the problem of dealing with change in the environment by separating the business keys (which do not mutate as often, because they uniquely identify a business entity) from the attributes of those keys. Because attributes change at different rates, they are grouped into small tables called Satellites, which are linked to the business keys held in tables called Hubs. Associations or transactions between business keys (for instance, relating the Customer and Product Hubs through a Purchase transaction) are modelled using Link tables, which in turn have Satellites describing the attributes of the relation. In other words, the Satellites provide the context for the business processes that are captured in the Hubs and Links.
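The Hub/Link/Satellite separation can be sketched with a few table definitions. The following is a minimal illustration only, using hypothetical Customer, Product and Purchase names and illustrative column names (the method does not prescribe these), built with SQLite so it is self-contained:

```python
import sqlite3

# Hypothetical sketch of a tiny Data Vault schema: two Hubs, one Link,
# and Satellites on a Hub and on the Link. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: only the business key plus metadata (source system, load date)
CREATE TABLE hub_customer (
    customer_sid  INTEGER PRIMARY KEY,   -- surrogate key
    customer_bk   TEXT NOT NULL UNIQUE,  -- business key
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_sid   INTEGER PRIMARY KEY,
    product_bk    TEXT NOT NULL UNIQUE,
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);

-- Satellite: descriptive attributes of a Hub, historised by load date.
-- No foreign-key constraint: Data Vault does not enforce referential
-- integrity between tables.
CREATE TABLE sat_customer (
    customer_sid  INTEGER NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    PRIMARY KEY (customer_sid, load_date)
);

-- Link: a many-to-many association between Hubs
CREATE TABLE link_purchase (
    purchase_sid  INTEGER PRIMARY KEY,
    customer_sid  INTEGER NOT NULL,
    product_sid   INTEGER NOT NULL,
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);

-- Link Satellite: context of the association (e.g. quantity)
CREATE TABLE sat_purchase (
    purchase_sid  INTEGER NOT NULL,
    quantity      INTEGER,
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    PRIMARY KEY (purchase_sid, load_date)
);
""")
```

Note that every table carries the same metadata columns (`record_source`, `load_date`), and that the Satellites are historised by including the load date in their primary key.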

Links can link to other Links, to deal with changes in granularity (for instance, adding a new key to a database table changes the grain of that table). A Link that connects two Hubs and another Link (which itself points to a third Hub) is equivalent to a single Link connecting all three Hubs; see the section on loading practices for why the latter construction is preferred.

Links sometimes link to only one Hub, a construct Dan Linstedt calls a 'peg-legged Link'. This occurs when one of the keys associated by the Link is not a real business key. As an example, take an order form with "order number" as its key, and order lines keyed with a semi-random "unique number" to make them unique. The latter is not a real business key, so it gets no Hub of its own; however, it is still needed to guarantee the correct granularity of the Link. In this case no surrogate-keyed Hub is used; instead, the "unique number" business key is added to the Link itself. This is done only when there is no possibility of ever using that key for another Link or as the key for attributes in a Satellite.
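A peg-legged Link for the order-line example might be sketched as follows (a hypothetical illustration; table and column names are not prescribed by the method):

```python
import sqlite3

# Peg-legged Link sketch: "order number" is a real business key and gets
# a Hub; the order line's "unique number" is not, so it is stored
# directly on the Link rather than in a Hub of its own.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE hub_order (
    order_sid     INTEGER PRIMARY KEY,
    order_bk      TEXT NOT NULL UNIQUE,  -- "order number" business key
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);

CREATE TABLE link_order_line (
    line_sid      INTEGER PRIMARY KEY,
    order_sid     INTEGER NOT NULL,      -- surrogate key from hub_order
    line_bk       TEXT NOT NULL,         -- "unique number", kept on the Link
    record_source TEXT NOT NULL,
    load_date     TEXT NOT NULL
);
""")
```

The Link references only one Hub; the `line_bk` column exists solely to give the Link the correct grain (one row per order line).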

All tables contain metadata, minimally describing at least the source system and the date on which the entry became valid, giving a complete historical view of the data as it enters the data warehouse. Data is never deleted, unless a technical error occurred while loading it.

The Data Vault does not maintain referential integrity between tables; instead, all Link tables are many-to-many relationships. This means that missing data is not an issue, and that Hubs, Links and Satellites can be loaded independently of each other, in parallel.

Loading practices

The ETL for updating a Data Vault model is fairly straightforward (see [3]). First, all the Hubs are updated; after that, every business key can be resolved to a surrogate ID. The second step adds the attributes for each key to the Hub Satellites and, at the same time, creates and updates all Links between Hubs. This resolves all the Link surrogate keys, so that in the third step all the Link Satellites can be added.
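The three-step load order can be sketched as below. This is a hypothetical, simplified illustration (one staged source row, illustrative table and column names), not the prescribed ETL of the method:

```python
import sqlite3

# Minimal tables for the load sketch; bk = business key, sid = surrogate key.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE hub_customer (sid INTEGER PRIMARY KEY, bk TEXT UNIQUE,
                           record_source TEXT, load_date TEXT);
CREATE TABLE hub_product  (sid INTEGER PRIMARY KEY, bk TEXT UNIQUE,
                           record_source TEXT, load_date TEXT);
CREATE TABLE sat_customer (sid INTEGER, name TEXT,
                           record_source TEXT, load_date TEXT);
CREATE TABLE link_purchase(sid INTEGER PRIMARY KEY,
                           customer_sid INTEGER, product_sid INTEGER,
                           record_source TEXT, load_date TEXT);
CREATE TABLE sat_purchase (link_sid INTEGER, quantity INTEGER,
                           record_source TEXT, load_date TEXT);
""")

src, today = "crm", "2011-01-09"
staged = [  # rows staged from a source system (hypothetical sample data)
    {"customer": "C42", "name": "Acme", "product": "P7", "qty": 3},
]

# Step 1: insert any new business keys into the Hubs.
for row in staged:
    db.execute("INSERT OR IGNORE INTO hub_customer (bk, record_source, load_date)"
               " VALUES (?, ?, ?)", (row["customer"], src, today))
    db.execute("INSERT OR IGNORE INTO hub_product (bk, record_source, load_date)"
               " VALUES (?, ?, ?)", (row["product"], src, today))

def sid(hub, bk):
    """Resolve a business key to its surrogate key (possible after step 1)."""
    return db.execute(f"SELECT sid FROM {hub} WHERE bk = ?", (bk,)).fetchone()[0]

# Step 2: Hub Satellites and Links, now that all business keys resolve.
for row in staged:
    c, p = sid("hub_customer", row["customer"]), sid("hub_product", row["product"])
    db.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
               (c, row["name"], src, today))
    cur = db.execute("INSERT INTO link_purchase (customer_sid, product_sid,"
                     " record_source, load_date) VALUES (?, ?, ?, ?)",
                     (c, p, src, today))
    row["link_sid"] = cur.lastrowid  # Link surrogate key, resolved in step 2

# Step 3: Link Satellites, using the Link surrogate keys from step 2.
for row in staged:
    db.execute("INSERT INTO sat_purchase VALUES (?, ?, ?, ?)",
               (row["link_sid"], row["qty"], src, today))
```

Within each step the inserts touch disjoint tables and rows, which is why each step can be parallelised.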

It is easy to verify that the updates within each step are unrelated and can be done in parallel. The ETL is quite straightforward and lends itself to easy automation or templating. Problems occur only with Links relating to other Links, because resolving the business keys in such a Link only leads to another Link that has to be resolved as well. Since this situation is equivalent to a single Link to multiple Hubs, the difficulty can be avoided by remodelling such cases.

Data Vault and dimensional modelling

The Data Vault modelled layer is normally used to store all the data, all the time. Many end-user computing tools, however, expect their data in a dimensional model, so a conversion is needed. The Hubs and their related Satellites can be treated as Dimensions, and the Links and their related Satellites as Fact tables, in a dimensional model. This makes it possible to quickly prototype a dimensional model on top of a Data Vault model using views.
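Such a view-based prototype might look as follows. This is a hypothetical sketch over simplified tables (metadata columns mostly omitted, names illustrative): the Hub plus its most recent Satellite row becomes a dimension view, and the Link plus its Satellite becomes a fact view.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE hub_customer  (sid INTEGER PRIMARY KEY, bk TEXT);
CREATE TABLE sat_customer  (sid INTEGER, name TEXT, load_date TEXT);
CREATE TABLE link_purchase (sid INTEGER PRIMARY KEY,
                            customer_sid INTEGER, product_sid INTEGER);
CREATE TABLE sat_purchase  (link_sid INTEGER, quantity INTEGER);

-- Dimension view: Hub joined to its most recent Satellite row
CREATE VIEW dim_customer AS
SELECT h.sid AS customer_key, h.bk AS customer_id, s.name
FROM hub_customer h
JOIN sat_customer s ON s.sid = h.sid
WHERE s.load_date = (SELECT MAX(load_date) FROM sat_customer
                     WHERE sid = h.sid);

-- Fact view: Link joined to its Satellite
CREATE VIEW fact_purchase AS
SELECT l.customer_sid AS customer_key, l.product_sid AS product_key,
       s.quantity
FROM link_purchase l
JOIN sat_purchase s ON s.link_sid = l.sid;

-- A few hypothetical rows so the views return something
INSERT INTO hub_customer  VALUES (1, 'C42');
INSERT INTO sat_customer  VALUES (1, 'Acme', '2011-01-09');
INSERT INTO link_purchase VALUES (1, 1, 2);
INSERT INTO sat_purchase  VALUES (1, 3);
""")

# Query the prototype star schema as if it were a real dimensional model
row = db.execute("""
    SELECT d.name, f.quantity
    FROM fact_purchase f
    JOIN dim_customer d USING (customer_key)
""").fetchone()
```

Because the dimensional layer consists only of views, no data is copied; the Data Vault remains the single stored layer.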

Data Vault scheme as neural network

The data model is patterned on a simplified view of neurons, dendrites and synapses: neurons correspond to Hubs and Hub Satellites, Links to dendrites (vectors of information), and other Links to synapses (vectors in the opposite direction). Using a set of data mining algorithms, links can be scored with confidence and strength ratings, and created or dropped on the fly as relationships that do not yet exist are learned. The model can thus be automatically morphed, adapted and adjusted as it is used and fed new structures.

Origin

Data Vault Modeling was originally conceived by Dan Linstedt in 1990 and was released in 2000. It is in the public domain and has been freely available since 2000.

In a series of five articles in The Data Administration Newsletter, the basic rules of the Data Vault method are expanded and explained. These comprise a general overview[4], an overview of the components[5], a discussion of end dates and joins[6], Link tables[7], and an article on loading practices[8].

References

More information

  • Daniel Linstedt, Kent Graziano, Hans Hultgren (2009). The New Business Supermodel. The Business of Data Vault Modelling, 2nd edition. Lulu.com. ISBN 978-1-4357-1914-9.
  • Thomas C. Hammergren, Alan R. Simon (2009). Data Warehousing for Dummies, 2nd edition. John Wiley & Sons. ISBN 978-0470407479.

Dutch language sources:

  • Ketelaars, M.W.A.M. (November 25, 2005). "Datawarehouse-modelleren met Data Vault". Database Magazine (DB/M) (7). Array Publications B.V.: 36–40.
  • Verhagen, K., Vrijkorte, B. (June 10, 2008). "Relationeel versus Data Vault". Database Magazine (DB/M) (4). Array Publications B.V.: 6–9.