- Data Vault Modeling
Data Vault Modeling is a database modeling method that is designed to provide historical storage of data coming in from multiple operational systems. It is also a method of looking at historical data that, apart from the modeling aspect, deals with issues such as auditing, tracing of data, loading speed and resilience to change.
Data Vault Modeling focuses on several things. First, it emphasizes the need to trace of where all the data in the database came from. Second, it makes no distinction between good and bad data ("bad" meaning not conforming to business rules) , leading to "a single version of the facts" versus "a single version of the Truth" , also expressed by Dan Linstedt as "all the data, all of the time". Third, the modeling method is designed to be resilient to change in the business environment where the data being stored is coming from, by explicitly separating structural information from descriptive attributes . Finally, Data Vault is designed to enable parallel loading as much as possible , so that you can scale out for very large implementations.
An alternative (and seldom used) name for the method is "Common Foundational Integration Modelling Architecture."
- 1 History and philosophy
- 2 Basic notions
- 3 Loading practices
- 4 Data Vault and dimensional modelling
- 5 Tools
- 6 Notes
- 7 References
- 8 External links
History and philosophy
In data warehouse modeling there are two well-known competing options for modeling the layer where the data is stored. Either you model according to Ralph Kimball, with conformed dimensions and an enterprise data bus, or you model according to Bill Inmon with the database in Third normal form. Both techniques have issues when dealing with changes in the systems feeding the data warehouse. For conformed dimensions you also have to cleanse data (to conform it) and this is undesirable in a number of cases since this inevitably will lose information. Data Vault is designed to avoid or minimize the impact of those issues, by moving them to areas of the data warehouse that are outside the historical storage area (cleansing in the data marts) and by separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.
Dan Linstedt, the creator of the method, describes the resulting database as follows:
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise
Data Vault's philosophy is that all data is relevant data, even if it is "wrong". Data being wrong is a business problem and usually not a technical problem. This means you have to be able to capture all the data. Another issue to which Data Vault is a response is that more and more there is a need for complete auditability and traceability of all the data in the datawarehouse. Due to Sarbanes-Oxley in the USA and similar measures in Europe this is a relevant topic for many business intelligence implementations, hence the focus of any Data Vault implementation on complete trace-ability and auditability of all information.
Data Vault Modeling was originally conceived by Dan Linstedt in 1990 and was released in 2000 as a public domain modeling method. In a series of five articles on The Data Administration Newsletter the basic rules of the Data Vault method are expanded and explained. These contain a general overview , an overview of the components, a discussion about end dates and joins, link tables and an article on loading practices .
According to Dan Linstedt, the Data Model is inspired by (or patterned off) a simplistic view of neurons, dendrites, and synapses - where neurons are associated with Hubs and Hub Satellites, Links are dendrites (vectors of information), and other Links are synapses (vectors in the opposite direction). By utilizing a data mining set of algorithms, links can be scored with confidence and strength ratings. They can be created and dropped on the fly in accordance with learning about relationships that currently don't exist. The model can be automatically morphed, adapted, and adjusted as it is used and fed new structures .
Data Vault attempts to solve the problem of dealing with change in the environment by separating the business keys (that do not mutate as often, because they uniquely identify a business entity) and the associations between those business keys, from the descriptive attributes of those keys.
The business keys and their associations are structural attributes, forming the skeleton of the data model. The Data Vault method has as one of its main axioms that real business keys only change when the business changes and are therefore the most stable elements from which to derive the structure of a historical database. If you use these keys as the backbone of a Data Warehouse, you can organize the rest of the data around them. This means that choosing the correct keys for the Hubs is of prime importance for the stability of your model . The keys are stored in tables with a few constraints on the structure. These key-tables are called Hubs.
Hubs contain a list of unique business keys with low propensity to change. Hubs also contain a surrogate key for each Hub item and metadata describing the origin of the business key. The descriptive attributes for the information on the Hub (such as the description for the key, possibly in multiple languages) are stored in structures called Satellite tables which will be discussed below.
The Hub contains at least the following fields :
- a surrogate key, used to connect the other structures to this table.
- a business key, the driver for this hub. The business key can consist of multiple fields.
- the record source, can be used to see where the business keys come from and if the primary loading system has all of the keys available in other systems as well.
- optionally, you can also have metadata fields with information about manual updates (user/time) and the extraction date.
A Hub is not allowed to contain multiple business keys, except when two systems deliver the same business key but with collisions that have different meanings.
Hubs should normally have at least one satellite .
This is an example for a Hub-table containing Cars, surprisingly called "Car" (H_CAR). The driving key is Vehicle Identification Number.
Fieldname Description Mandatory? H_CAR_ID Sequence ID and surrogate key for the Hub Yes VEHICLE_ID_NR The business key that drives this hub. Can be more than one field for a composite business key Yes H_RSRC The recordsource of this key when first loaded Yes LOAD_AUDIT_ID An ID into a table with audit information, such as load time, duration of load, number of lines, etc. No
Associations or transactions between business keys (relating for instance the Hubs for Customer and Product with each other through the Purchase transaction) are modeled using Link tables. These tables are basically many-to-many join tables, with some metadata.
Links can link to other Links, to deal with changes in granularity (for instance, adding a new key to a database table would change the grain of the database table). For instance, if you have an association between Customer and Address, you could add a reference to a Link between the Hubs for Product and Transport Company. This could be a Link called Delivery. Referencing a Link in another Link is considered a bad practice, because it introduces dependencies between Links that make parallel loading more difficult. Since a Link to another Link is the same as a new Link with the Hubs from the other Link, in these cases a new Link is the preferred solution (see the section on Loading Practices for more information).
Links sometimes link Hubs to information that is not by itself enough to construct a Hub. This occurs when one of the business keys associated by the Link is not a real business key. As an example, take an order form with "order number" as key, and order lines that are keyed with a semi-random number to make them unique. Let's say, "unique number". The latter key is not a real business key, so it is no Hub. However, we do need to use it in order to guarantee the correct granularity for the Link. In this case, we do not use a Hub with surrogate key, but add the business key "unique number" itself to the Link. This is done only when there is no possibility of ever using the business key for another Link or as key for attributes in a Satellite. This construct has been called a 'peg-legged Link' by Dan Linstedt on his (now defunct) forum.
Links contain the surrogate keys for the Hubs that are linked, their own surrogate key for the Link and metadata describing the origin of the association. The descriptive attributes for the information on the association (such as the time, price or amount) are stored in structures called Satellite tables which are discussed below.
This is an example for a Link-table between two Hubs for Cars (H_CAR) and Persons (H_PERSON). The Link is called "Driver" (L_DRIVER).
Fieldname Description Mandatory? L_DRIVER_ID Sequence ID and surrogate key for the Link Yes H_CAR_ID surrogate key for the Car Hub, the first anchor of the link Yes H_PERSON_ID surrogate key for the Person Hub, the second anchor of the link Yes L_RSRC The recordsource of this association when first loaded Yes LOAD_AUDIT_ID An ID into a table with audit information, such as load time, duration of load, number of lines, etc. No
The Hubs and Links form the structure of the model, but have no temporal attributes and hold no descriptive attributes. These are stored in separate tables called Satellites. These consist of metadata linking them to their parent Hub or Link, metadata describing the origin of the association and attributes, as well as a timeline with start and end dates for the attribute. Where the Hubs and Links provide the structure of the model, the Satellites provide the "meat" of the model, the context for the business processes that are captured in Hubs and Links. These attributes are stored both with regards to the details of the matter as well as the timeline and can range from quite complex (all of the fields describing a clients complete profile) to quite simple (a satellite on a Link with only a Valid-indicator and a timeline).
Usually the attributes are grouped in Satellites by source system. However, descriptive attributes such as size, cost, speed, amount or color can change at different rates, so you can also split these attributes up in different Satellites based on their rate of change.
All the tables contain metadata, minimally describing at least the source system and the date on which this entry became valid, giving a complete historical view of the data as it enters the data warehouse.
This is an example for a satellite on the Drivers-link between the Hubs for Cars and Persons, called Driver Insurance (S_DRIVER_INSURANCE). This Satellite contains attributes that are specific to the insurance of the relationship between the Car and the Person driving it, for instance an indicator whether this is the primary driver, the name of the insurance company for this car and person (could also be a separate Hub) an a summary of the number of accidents involving this combination of vehicle and driver. Also included is a reference to a lookup- or reference table called R_RISK_CATEGORY containing the codes for the risk category in which this relationship is deemed to fall.
Fieldname Description Mandatory? S_DRIVER_INSURANCE_ID Sequence ID and surrogate key for the Satellite on the Link No L_DRIVER_ID surrogate key for the Driver Link, the parent of the Satellite Yes S_LDTS Load Date (startdate) for the validity of this combination of attribute values for parent key L_DRIVER_ID Yes S_LEDTS Load End Date (enddate) for the validity of this combination of attribute values for parent key L_DRIVER_ID Yes IND_PRIMARY_DRIVER Indicator whether the driver is the primary driver for this car No (*) INSURANCE_COMPANY The name of the insurance company for this vehicle and this driver No (*) NR_OF_ACCIDENTS The number of accidents by this driver in this vehicle No (*) R_RISK_CATEGORY_CD The risk category for the driver. This is a reference to R_RISK_CATEGORY No (*) S_RSRC The recordsource of the information in this satellite when first loaded Yes LOAD_AUDIT_ID An ID into a table with audit information, such as load time, duration of load, number of lines, etc. No
(*) at least one attribute is mandatory.
Reference tables are a normal part of a healthy Data Vault model. They are there to prevent redundant storage of simple reference data that is referenced a lot. More formally, Dan Linstedt defines reference data as follows:
Any information deemed necessary to resolve descriptions from codes, or to translate keys in to (sic) a consistent manner. Many of these fields are "descriptive" in nature and describe a specific state of the other more important information. As such, reference data lives in separate tables from the raw Data Vault tables.
Reference tables are referenced from Satellites, but never bound with physical foreign keys. There is no prescribed structure for reference tables: use what works best in your specific case, ranging from simple lookup tables to small Data Vaults or even Stars. They can be historical or have no history, but it is recommended that you stick to the natural keys and not create surrogate keys in that case . Normally, Data Vaults have a lot of reference tables, just like any other Data Warehouse.
This is an example of a reference table with risk categories for drivers of vehicles. It can be referenced from any Satellite in the Data Vault. For now we reference it from Satellite S_DRIVER_INSURANCE. The reference table is R_RISK_CATEGORY.
Fieldname Description Mandatory? R_RISK_CATEGORY_CD The code for the risk category Yes RISK_CATEGORY_DESC A description of the risk category No (*)
(*) at least one attribute is mandatory.
The ETL for updating a Data Vault model is fairly straightforward (see Data Vault Series 5 - Loading Practices). First you have to load all the Hubs, creating surrogate ID's for any new business keys. Having done that, you can now resolve all business keys to surrogate ID's if you query the Hub. The second step is to resolve the Links between Hubs and create surrogate ID's for any new associations. At the same time, you can also create all satellites that are attached to Hubs, since you can resolve the key to a surrogate ID. Once you have created all the new Links with their surrogate keys, you can add the Satellites to all the Links.
Since the Hubs are not joined to each other except through Links, you can load all the Hubs in parallel. Since Links are not attached directly to each other, you can load all the Links in parallel as well. Since Satellites can be attached only to Hubs and Links, you can also load these in parallel.
The ETL is quite straightforward and lends itself to easy automation or templating. Problems occur only with Links relating to other Links, because resolving the business keys in the Link only leads to another Link that has to be resolved as well. Due to the equivalence of this situation with a Link to multiple Hubs, this difficulty can be avoided by remodeling such cases and this is in fact the recommended practice .
Data is never deleted from the Data Vault, unless you have a technical error while loading data.
Data Vault and dimensional modelling
The Data Vault modelled layer is normally used to store data. It is not optimized for query performance, nor is it easy to query by the well known query-tools such as Cognos, SAP Business Objects, Pentaho et. al. Since these end-user computing tools expect or prefer their data to be contained in a dimensional model, a conversion is usually necessary.
For this purpose, the Hubs and related Satellites on those Hubs can be considered as Dimensions and the Links and related Satellites on those Links can be viewed as Fact tables in a dimensional model. This enables you to quickly prototype a dimensional model out of a Data Vault model using views. For performance reasons the dimensional model will usually be implemented in relational tables, after approval.
Note that while it is relatively straightforward to move data from a Data Vault model to a (cleansed) dimensional model, the reverse is not as easy.
Due to the recurring structure of the DataVault with only three types of tables, as well as the separation of structural and descriptive attributes, automation is relatively straightforward. Nowadays, there are at several tools supporting Data Vault with wildly varying types of support. Some of the more notable tools include:
- ^ Super Charge your data warehouse, page 74
- ^ The next generation EDW
- ^ Super Charge your data warehouse, page 21
- ^ Super Charge your data warehouse, page 76
- ^ Data Warehousing for Dummies, page 83
- ^ The New Business Supermodel, glossary, page 75
- ^ Data Vault Series 1 - Data Vault Overview
- ^ Data Vault Series 2 - Data Vault Components
- ^ Data Vault Series 3 - End Dates and Basic Joins
- ^ Data Vault Series 4 - Link tables, paragraph 2.3
- ^ Data Vault Series 5 - Loading Practices
- ^ Super Charge your Data Warehouse, paragraph 5.20, page 110.
- ^ Super Charge your data warehouse, page 61, why are business keys important
- ^ Data Vault Forum, Standards section, section 3.0 Hub Rules
- ^ Data Vault Forum, Standards section, section 3.0 Hub Rules
- ^ Super Charge your Data Warehouse, paragraph 8.0, page 146.
- ^ Super Charge your Data Warehouse, paragraph 8.0, page 149.
- ^ Data Vault Series 5 - Loading Practices
- Daniel Linstedt, Kent Graziano, Hans Hultgren (August 2009). The New Business Supermodel. The Business of Data Vault modelling, 2nd edition. Lulu.com. ISBN 978-1-4357-1914-9.
- Thomas C. Hammergren, Alan R. Simon (February 2009). Data Warehousing for Dummies, 2nd edition. John Wiley & Sons. ISBN 978-0470407479.
- Dan Linstedt (December 2010). Super Charge your Data Warehouse. Dan Linstedt. ISBN 978-0-9866757-1-3.
- Ronald Damhof, Lidwine van As (August 2008). "The next generation EDW - Letting go of the idea of a single version of the truth". Database Magazine (DB/M) (Array Publications B.V.). http://prudenza.typepad.com/files/damhof_dbm0508_eng-1.pdf.
- Linstedt, Dan. "Data Vault Series 1 - Data Vault Overview". Data Vault Series. The Data Administration Newsletter. http://www.tdan.com/view-articles/5054/. Retrieved 12 september 2011.
- Linstedt, Dan. "Data Vault Series 2 - Data Vault Components". Data Vault Series. The Data Administration Newsletter. http://www.tdan.com/view-articles/5155/. Retrieved 12 september 2011.
- Linstedt, Dan. "Data Vault Series 3 - End Dates and Basic Joins". Data Vault Series. The Data Administration Newsletter. http://www.tdan.com/view-articles/5067/. Retrieved 12 september 2011.
- Linstedt, Dan. "Data Vault Series 4 - Link Tables". Data Vault Series. The Data Administration Newsletter. http://www.tdan.com/view-articles/5172/. Retrieved 12 september 2011.
- Linstedt, Dan. "Data Vault Series 5 - Loading Practices". Data Vault Series. The Data Administration Newsletter. http://www.tdan.com/view-articles/5285/. Retrieved 12 september 2011.
- Linstedt, Dan. "Data Vault Modeling Specification v1.0.8". Data Vault Forum. Dan Linstedt. http://danlinstedt.com/forums/showthread.php?t=9. Retrieved 14 september 2011.
Dutch language sources
- Ketelaars, M.W.A.M. (November 25, 2005). "Datawarehouse-modelleren met Data Vault". Database Magazine (DB/M) (Array Publications B.V.) (7): 36–40.
- Verhagen, K., Vrijkorte, B. (June 10, 2008). "Relationeel versus Data Vault". Database Magazine (DB/M) (Array Publications B.V.) (4): 6–9.
- The homepage of Dan Linstedt, the inventor of Data Vault modeling
- A website dedicated to Data Vault, maintained by Dan Linstedt
- Youtube video on Data Vault Modeling Approach and Methodology
Data warehouseCreating the data warehouseConceptsVariantsElementsFactDimensionFillingUsing the data warehouseConceptsLanguagesToolsRelatedPeopleProducts
- Comparison of OLAP Servers
- Data warehousing products and their producers
Wikimedia Foundation. 2010.