Dani Schnider works as Principal Consultant, course trainer and DWH/BI Lead Architect for Trivadis in Zurich, Switzerland.
In 2011 at the Oracle OpenWorld conference, I attended a presentation by Kent Graziano about Data Vault Modeling. It was the first time I had heard of this special modeling method for Data Warehouses, and after the session I was a bit confused by these data models with hubs, links and satellites. Is it really something new, or is it more or less the same technique we use for master data versioning with head and version tables? The idea of Data Vault Modeling seemed interesting, but a lot of details were not clear to me. I ordered the book The Business of Data Vault Modeling by Dan Linstedt, but it was more confusing than helpful. After some discussions with several colleagues at Trivadis, my opinion was clear: yes, it is more or less the same as the approach described in our DWH book. Problem solved, let's go back to work...
Last week at the Trivadis TechEvent, there was another session about Data Vault Modeling, presented by Hans Hultgren. The presentation was just an introduction to this modeling technique, and most of the information was the same as what I had heard two years ago in San Francisco. But after his presentation we had an interesting discussion about the similarities and differences between Data Vault Modeling and our method of head and version tables. Although Hans had never heard of our approach before, he found it at least "interesting" - whatever that means. After a beer, I decided to do some homework and design my first draft of a Data Vault Model - based on an example from our book. Here it is.
As described in our Trivadis BI Blueprints and in the book "Data Warehousing mit Oracle", we use the following versioning method to store historical master data:
Let's try to make a first draft of a Data Vault Model based on the example above. The head tables can be compared to Hubs, the version tables to Satellites in a Data Vault Model. The relationships are always designed as Links, i.e. many-to-many relationships between Hubs.
Both modeling concepts - head and version tables as well as Data Vault Modeling - are used to design the Core model, i.e. the central data store used for data integration and data versioning. They are definitely not designed for end user queries. For this, data marts using Star Schemas or multidimensional OLAP cubes are typically used.
What both methods have in common is a unique key for each business entity that is independent of a particular version (this is one of the main differences from Slowly Changing Dimensions type 2, where a particular version is referenced). In our approach, we use a head table for this; in a Data Vault Model, it is implemented with a Hub.
Version tables and Satellites are not exactly the same. The version tables contain only the business attributes that can change over time. Whenever at least one of these attributes is changed in the source system, a new version is created in the Data Warehouse, and the previous version is marked as outdated by updating its expiration date. This is different in a Satellite. Here, a new version is created for every change, too. But there is no expiration date, only a load date and possibly an effective date as a business attribute.
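The difference between the two load patterns can be sketched in a few lines of Python. This is a hypothetical in-memory illustration (the function names, the FAR_FUTURE constant and the attribute names are my own, not part of either method's standard): the head/version load must update the old row, while the satellite load is insert-only.

```python
from datetime import date

# Hypothetical sketch: the same change arrives in a head/version model
# (close the old version by updating its expiration date, then insert)
# and in a Data Vault satellite (insert-only, ordered by load date).
FAR_FUTURE = date(9999, 12, 31)

def load_version(versions, head_id, attrs, valid_from):
    """Head/version style: expire the current version, then insert the new one."""
    for v in versions:
        if v["head_id"] == head_id and v["valid_to"] == FAR_FUTURE:
            v["valid_to"] = valid_from          # update: mark old version as outdated
    versions.append({"head_id": head_id, "valid_from": valid_from,
                     "valid_to": FAR_FUTURE, **attrs})

def load_satellite(satellite, hub_key, attrs, load_date):
    """Data Vault style: append only; the load date alone orders the versions."""
    satellite.append({"hub_key": hub_key, "load_date": load_date, **attrs})
```

Note that only the first function ever issues an update; the satellite grows strictly by inserts, which is exactly why it carries no expiration date.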
Another difference is that static attributes, i.e. business attributes that never change in the source system, are stored in the Satellite as well. In Data Vault Modeling, there is no distinction between static and dynamic attributes - or even relationships. A business attribute is a business attribute - that's it.
Relationships in a Data Vault Model are always many-to-many relationships and are stored in Links. This increases the complexity of the data model, but allows more flexibility because new business rules can be implemented without changing the data model. This is one of the main reasons for Data Vault Modeling: in agile environments with many change requests, it can be very useful to have a flexible data model that can be enhanced easily. The same holds for new attributes, which are simply added with additional Satellites.
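To see why the many-to-many Link buys flexibility, here is a minimal Python sketch (my own hypothetical example with invented key names, not an official Data Vault template): a customer/account-manager relationship that is one-to-many in the source today can be reassigned tomorrow without any schema change, because the Link simply collects hub-key pairs.

```python
from datetime import date

# Hypothetical sketch of a Link: the relationship is stored as hub-key
# pairs, so a reassignment just adds a new pair; old pairs remain as history.
def add_link(links, customer_key, manager_key, load_date):
    """Insert the hub-key pair once; repeated loads of the same pair are ignored."""
    known = {(l["customer_key"], l["manager_key"]) for l in links}
    if (customer_key, manager_key) not in known:
        links.append({"customer_key": customer_key,
                      "manager_key": manager_key,
                      "load_date": load_date})
```

If the business later decides a customer may have several managers at once, nothing in this structure has to change - which is the point of modeling every relationship as many-to-many up front.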
What I have never read or heard about yet is the complexity of the ETL processes. From my current point of view, it makes no difference whether you load source data into head and version tables or into a Data Vault. Some kind of delta determination to detect changed attributes must be implemented in both models. But even if it were easier to load data into a Data Vault, it is more complex and expensive to load the Data Marts from a Data Vault, because the queries to determine the correct version of each Satellite are not trivial. But these are only assumptions. We will probably see in the future what the benefits and restrictions of Data Vault Modeling are.
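The delta determination mentioned above can be sketched as follows. This is only an illustration of the general idea, assuming a hash-based comparison (a common but not mandatory technique; the function names are invented): hash the versioned attributes of the incoming record and compare against the latest stored version, and only a difference triggers a new version - regardless of whether that version then lands in a version table or a Satellite.

```python
import hashlib

# Hypothetical sketch of delta determination: concatenate the versioned
# attributes in a fixed order, hash them, and compare incoming vs. stored.
def attribute_hash(record, versioned_attrs):
    """Build a stable hash over the versioned attributes of a record."""
    payload = "|".join(str(record.get(a, "")) for a in sorted(versioned_attrs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def has_changed(incoming, current, versioned_attrs):
    """True if the incoming record differs from the current stored version."""
    if current is None:                      # unknown business key: always load
        return True
    return attribute_hash(incoming, versioned_attrs) != attribute_hash(current, versioned_attrs)
```

Storing the hash with each version avoids recomputing it over the stored attributes on every load, but the comparison logic itself is the same in both models.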
Thank you very much for this excellent article, Dani! For someone like me who never had the opportunity to get an introduction to Data Vault modeling, it seems to be a good one!
Just a thought: you said that neither head/version nor Hub/Satellite is intended to be queried directly, so I was wondering about the utility of these kinds of models compared to an SCD2 model, for example?
Thanks a lot for this article, Dani!
In a Data Vault model, you can choose the number of Satellites by splitting attributes based on their frequency of change (which can be useful for dynamic and static attributes). About the effective date on Satellites, I agree with you... But in my view it can be good to add this end date (and a current flag) at the moment of the load into the Satellite. This simplifies queries across versions a lot.
I'm interested to know more about your master data versioning!
Waiting for the french or english version...
thank you for the introduction to Linstedt's Vault Modeling - the comparison of your method with the Vault model (and its terminology) is very instructive. I never took the time to understand the details of the Vault model, but in relation to your approach it suddenly starts to make some sense.
Great blog regarding this topic!!!
I was wondering how master data versioning would handle a change of the source ERP system. How would your product design be affected? In the Data Vault approach, I suppose there would simply be a new Satellite related to the new source.
Great to meet you at the Trivadis TechEvent. I did go through this example and took the opportunity to create some additionally attributed models comparing the two modeling patterns. These models are available on my blog hanshultgren.wordpress.com/.../head-version-modeling
Your analysis of the data vault in your comparison does describe the fundamentals of the approach, with a couple of small exceptions or considerations. First: good data vault models do make a distinction between static and dynamic attributes. This is actually a primary factor to take into account when designing your satellites. Almost all Hubs will have multiple Satellites. These are designed by rate of change (ranging from static to very dynamic), by type of data (logical groupings), and by source system (typically when sources have different context values for the same hub key instances). Second: concerning the expiration date on satellite records. The data vault modeling approach does include the option of a form of business effective end date. The main point is that the end date is considered context (so not part of the key or of the technical effectivity of a record). These end dates are offered as an option. Those who do not want any updates in the data warehouse (an insert-only environment) will opt not to include them, as they require an update to the end date attribute in the prior record. Those who see a benefit in a definitive tuple with from-to dates will opt to allow these forms of updates to the end dates.
Your last point is a good point for everyone to consider. When sourcing data from a data vault pattern DW there will be decisions as to which satellites contain the context attributes that you need – and there will be date math required to re-align the time slices for these attributes. This is mitigated by performance and tuning techniques (perhaps point-in-time tables) and by the dynamics of the new EDW architecture itself (where, for example, subsets of attributes are pulled for each concept – reducing the need for re-aligning dates across several satellites). Coincidentally, your Head-and-Version approach would likely benefit even more so from this dynamic. If the majority of marts required only the static attributes around the main concepts then you could likely avoid a significant amount of the joins to the version tables.
I look forward to more analysis and discussion around this topic with you and the rest of the team at Trivadis.
I've known this technique for several years; it is used, for example, in BIReady (which is not a true DV generator, as most people think). I did not know it has its own 'methodology' and was actually used. I consider this technique to be a serious optimization/denormalization variety of a weak anchor vault (which is just a name for the most basic of the Anchor Style modeling techniques). I currently classify it as halfway between a 3NF historized data model and an Anchor Style modeling technique (like Data Vault or Anchor Modeling).
Thanks for the article. Regarding your comment about the complexity of the ETL processes: they are actually quite simple, and follow a specific, easy-to-use paradigm/template. I've blogged (as have numerous others) about the ease of the ETL templates to get data into the Data Vault. It's why the auto-generation tools (automation of ETL) have really taken off. The patterns are predictable, consistent, and can be performance tuned.
Regarding your comment about getting data out - into the data marts: "it is more complex and expensive to load the data marts from the Data Vault because the correct version of each Satellite is not trivial." If you look just at the raw DV model, this may be true. However, if you include "query helper tables" (point-in-time and bridge), just as Kimball does with the Star Schema, then you will find it really isn't all that much harder, nor much more complex. In fact, using a PIT or bridge table, you can construct virtual dimensions and virtual facts (at the raw level of data) fairly easily. It is also not as complex as you might imagine.
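The core lookup that a point-in-time helper precomputes can be sketched in a few lines of Python. This is only a hypothetical in-memory illustration of the idea (the function and field names are invented, and a real PIT table would materialize this per snapshot date rather than scan at query time): for a given hub key and snapshot date, pick the newest satellite row loaded on or before that date.

```python
from datetime import date

# Hypothetical sketch of the lookup behind a PIT table: the newest
# satellite row for a hub key with load_date <= the snapshot date.
def satellite_row_as_of(satellite, hub_key, snapshot):
    """Return the current satellite row for hub_key as of the snapshot date."""
    candidates = [r for r in satellite
                  if r["hub_key"] == hub_key and r["load_date"] <= snapshot]
    return max(candidates, key=lambda r: r["load_date"], default=None)
```

A PIT table stores the result of this lookup for every hub key and snapshot date across all satellites of a hub, so the mart load becomes simple equi-joins instead of per-satellite date math.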
The complexity comes in executing the business rules - but this is a problem we have in building star schemas based on ANY data model (be it staging, normalized, head and version, etc.). Somewhere the business rules must be executed, and that is where the complexity lives. "More expensive", on the other hand, is a matter of opinion. It is also a matter of how well your hardware performs with MPP-style queries (divide and conquer). True MPP-enabled systems and well-laid-out (i.e. distributed) data queries handle the Data Vault model better than other types of models because of the DV model's ability to evenly partition the data sets across the resources. I cover this in my book: Super Charge Your Data Warehouse.
One thing I want to mention: I've seen this kind of head/version modeling before - although slightly different. It's wrapped into the implementation of the LDMs sold by Teradata. They apply a notion of "version" and "dates" (the same way it is depicted here), but the difference is that they wrap it directly into third normal form. Their implementation drives the complexity of the queries and the loading system through the roof. In one instance at a client, the Center for Medicare and Medicaid, they required 4 to 5 levels of "staging" preparation just to get the data massaged, coalesced and versioned before they could load it into the LDM EDW. In other words, their complexity was actually caused by two things: 1) the business rules living upstream of the EDW, 2) the 3NF model dependencies with versioning built into the table structures.
The Data Vault is nowhere near as complex as systems such as the Teradata LDMs.
Anyhow, just my two cents.
Hope this helps,
PS: you can learn a lot more about the implementation of Data Vault processes at: http://LearnDataVault.com
In April, Hans Hultgren held a presentation about Data Vault Modeling at the Trivadis TechEvent (see
What is it about?
Data Vault is a modeling method for the data model of a Data Warehouse
Holy sh*t, ^THIS^ is exactly why I like the web ;)
I have used an application before (2007-2010) that used the "classic" valid-time and transaction-time approach in separate columns. The issue is obviously that relationships either always reference a specific (historized) version directly, or use larger valid-time spans, where referential integrity is always a pain (and you always have to include timestamps in foreign keys).
Just 1-2 years ago, to alleviate this problem, I came up with your EXACT idea, but didn't put it to use, since I'm not working on any versioned databases right now - it was just an idle thought at home... What I really like is stumbling upon this article and finding out that someone, somewhere across the world, had the exact same thought and blogged about it!
Thanks for blogging this! I would be interested to find out whether you have experienced problems with this method or whether you are still very satisfied with it!