About Data Warehouse Design Patterns and Lazy Programmers
This post was originally published on this site
I’m a lazy programmer! That was one important detail I learned from Roelant Vos in his training last week. But there were many other interesting topics. A personal summary of a 3-days class about Data Warehouse Design Patterns.
Last week I had the opportunity to attend the class Data Warehouse Design Patterns of Roelant Vos. I have seen Roelant in presentations on data modeling conferences, and I appreciate his blog with a lot of useful information about Data Warehouse architecture and Data Vault implementation. So, when I got the information from my Trivadis colleagues that he will give a training in Zurich, I took the chance to join this 3-days class.
I will not write a detailed review of the whole course in this blog post. But in the following lines you find some notes about three topics that were – from my point of view – important in this training: Pattern Based Design, Persistent Staging Area and Virtual Data Warehouse.
Pattern Based Design
A typical data warehouse architecture consists of multiple layers for loading, integrating and presenting business information from different source systems. The number and names of the layers may vary in each system, but in most environments the data is copied from one layer to another with ETL tools or pure SQL statements. With a good architecture, the patterns to transform and load the data into a particular layer are always similar. This makes it easier (and faster) to develop the ETL processes, because the repeatable patterns can be generated with data warehouse automation (DWA) tool. On Roelant’s blog, an overview of his Data Integration Framework can be found. Code example are available on GitHub.
Why am I a lazy programmer? Like many other developers, I don’t like to write repetitive code. Roelant is a lazy programmer, too. That’s why he spent a lot of time to develop tools and methods to speed up the development tasks by generating the SQL code to build and load data warehouses. During the course, he explained many design patterns with the SQL generator VDW (Virtual Data Warehouse). It can be downloaded from his website for free and is useful for quick prototypes and regression tests of pattern changes.
Persistent Staging Area
An important layer in the proposed architecture is the PSA (Persistant Staging Area). Although this area of the staging layer is optional, it is very practical and highly recommended, especially when the business requirements are not yet clear at the beginning of the project. The PSA is a historical archive of the data from the source systems and replaces the classical volatile Staging Area. The PSA is loaded either with CDC (Change Data Capture) mechanisms or with a delta detection between source system and current version of the PSA. The Persistent Staging Area is the unique source to load the Data Vault tables (Hubs, Links and Satellites) in the Integration Layer.
The advantage of a Persistent Staging Area is that only the currently required parts of the information have to be loaded into the Data Vault tables. We don’t have to care about future requirements, because the original data is still available in the PSA and can be reloaded when it is needed. I have seen and used this approach in several customer projects, and it is a good insurance against unknown business requirements.
Virtual Data Warehouse
When I saw Roelant Vos for the first time on the Data Modeling Zone 2017 conference in Düsseldorf, I was very impressed about an interesting architecture approach he explained: The only persistant layer of a data warehouse is the PSA, and everything else is implemented on top of the PSA tables with views. This is the consistent implementation of a Virtual Data Warehouse which is supported by Roelant’s VDW software. I have seen this in theory and in some presentations, but never in it’s full form in a real project.
This “NoETL” approach has several advantages: 1. It is very easy to change the transformation logic without reloading all layers of your data warehouse. 2. As soon as new data is loaded in the PSA layer, it is immediately visible in all subsequent layers. 3. It is a kind of a schema-on-read data warehouse where you can decide at query time how the source data should be interpreted. That all sounds very nice, but I still see the challenge of good query performance. I think, in most situations, at least the commonly used data must be persisted. But with the view approach, this can be done without much effort, for example with materialized views.
The 3-day course has more than fulfilled my expectations. Although I already used some of the concepts myself (to be honest, most of what Roelant explained was already familiar to me), I really liked the step-by-step construction of a data warehouse architecture using many code examples and live demos. Or as I said as spontaneous feedback at the end of the course: “It was fun”.