Multi-Version Load in Data Vault
One of the advantages of Data Vault are the easy load patterns that can be used to load hubs, links and satellites. They are well documented in several books about Data Vault. But what happens if you want to load multiple versions in one batch? In this blog post, I will write about multi-version loads into a satellite table.
One of my few onsite events this year was a Data Vault meet-up in Nuremberg, organized by the German Data Vault User Group (DDVUG). Together with my former Trivadis colleague Daniel Zimmermann, I had the opportunity to have a presentation about multi-version loads in Data Vault.
During my part of the presentation, I mentioned to use the “arrival date” instead of the “load date” in the Satellite’s primary key. Probably my wording was a bit imprecise, but the remark led to an endless discussion about date fields in Data Vault. In a large hall full of Data Vault specialists, it can be dangerous to use the wrong wording. But the good news is: I really enjoyed the discussion! This is something that never happens in presentation on a virtual conference.
DDVUG meetup with “social distancing” in a large meeting hall in Nuremberg (Source: Twitter)
What is a Multi-Version Load?
The typical load patterns in Data Vault assume periodic incremental loads. In every load job, the current state of the source data is compared with the newest version in the target satellite. If there are changes, a new version is loaded into the satellite table.
But there are situations where we need a different loading strategy. Example are:
- Initial load of a new satellite with historical data from a historized source system, a Data Lake or a Persistent Staging Area (PSA)
- Initial load of a complete Data Vault schema with historical data from a historized source system, a Data Lake or a Persistent Staging Area (PSA)
- Reload of derived satellites in a Business Data Vault, e.g. when business rules were changed
- Incremental loads of multiple versions in one batch, e.g. in combination with CDC (change data capture)
In all these cases, it is possible that multiple changes for the same business key must be loaded in one step. To explain the difference, let’s have a look at a very simple example: Once a month, the current beer inventory of a microbrewery is written to a satellite table. In this example, the data is loaded from a PSA table into the satellite table in the Data Vault schema on 2 January, 3 February and 1 March, Because there were no changes in the inventory for the beer “Lazariter” (hub key 10) between January and February, no new version is written to the satellite:
Figure 1: Monthy incremental load into a satellite table
On 30 September we decide to reload the satellite from the PSA table again. We don’t want to do this in a loop of monthly load jobs, but in one batch. The resulting data looks like this:
Figure 2: Multi-version load into a satellite table
Challenges with Multi-Version Loads
Unlike with “normal” incremental loads into Data Vault, there are several challenges with a multi-version load. They require slightly different load patterns.
1. Unique Primary Keys
As we can see in figure 2, all rows in the satellite table have the same load date. This would cause primary key violations, because the primary key of a satellite is the combination of hub key and load date. A common approach to solve this issue is to use distinct load dates, e.g. by adding one second (or microsecond) for each version. In some cases this may be the best (or the only possible) solution, but I don’t like this “hack”. My recommendation is to use the date or timestamp when the row arrived in the first layer of the data warehouse. In the presentation, we called it the “arrival date”, but as already mentioned above, this caused confusions and discussions with the audience.
Let’s assume we have a data warehouse with a PSA (Persistence Staging Area), which is loaded daily. The “arrival date” in this case is the date and time when the data is loaded into the PSA table. The data vault tables in our example are loaded monthly – or once with a multi-version load. When we reuse the load date of the PSA table, instead of creating a new load date for the Data Vault layer, we will have a unique primary key for each version.
2. Delta Detection between Versions
In the normal load pattern for a satellite, we have to compare the new version in the (persistent) stage table with the latest version in the satellite table. If the value of at least one attribute changed, a new version is written to the satellite table. There are several ways to implement such a delta detection. A common approach is Data Vault is to calculate a “hash_diff” column for each row, i.e. a hash value of all concatenated columns in the satellite. If the hash_diff value of the source row is different to the hash_diff value of the latest version in the satellite, this indicates that some data was changed since the last load
With a multi-version load, not only the hash_diff values between source and target tables must be compared, but also the different versions of the source table. Let’s assume, the inventory of the beer “Lazariter” changed several times, but not with every version that is stored in the PSA, a new version for the satellite has to be created whenever the inventory column between two PSA versions changed. Identical versions from the PSA (i.e. without a change in one of the columns that are relevant for the satellite) are deduplicated.
3. Multi-Version Load with Multiple Sources
In this simple example, we have only one PSA table as source for the satellite. But what happens if we have to join several PSA tables to load a particular satellite? In this case, the load dates and change frequencies may be different for each source table. How can we derive the correct “arrival date” and the corresponding values for all attributes?
This challenge is well-known for everybody working with Data Vault. It is exactly the same situation that you have if you query data from multiple satellites, for example to load a dimension table. For this purpose, the typical solution is to create a point-in-time (PIT) table. For details, read my blog post Loading Dimensions from a Data Vault Model. The same approach can be used for multiple source tables. The purpose of the PIT table is to derive all time slices that are needed to create the input versions for the multi-version load.
Implementation with biGENIUS
Back to our Data Vault meet-up in Nuremberg. My part of our presentation was just to explain the concept and challenges of multi-version load. But how can all this been implemented? This was the part of my colleague Daniel Zimmermann. He explained in detail how they implemented multi-version load templates with biGENIUS. I don’t explain all the details here, because Daniel plans to write a separate blog post on this. But what I can already reveal: All the challenges described above can be solved with standard SQL functionality.
As soon as the additional blog post is available, I will add the link here.