Star Schema Design in Oracle: Fundamentals
This post was originally published on this site
What are the design rules for good performance in a star schema in an Oracle database? This blog post series introduces some recommendations for the physical database design. This first post is about constraints and indexes.
For the last few years, I was supporting one of my customers in solving performance problems on their Oracle databases. I have been involved with all kind of applications and sometimes really complex SQL statements. But the very last performance ticket I had to solve was a classical one: Several queries on a star schema (with a very large fact table) used a lot of database resources. When I looked at the schema design of the star schema, I was not surprised about the bad performance. It seems that the developers didn’t know about some typical design recommendations for a star schema in an Oracle database. This inspired me for this blog post. I think there are more people out there that aren’t aware of the basic rules to design a star schema in an Oracle database.
When I started to write down some of these general recommendations, I realized that there quite a lot of options and additional notes to mention. Instead of one long post, I decided to split it into multiple parts. This first blog post is about the fundamental concepts of a star schema and the basic rules for constraints and indexes on dimension and fact tables in Oracle.
Star Schema in a Nutshell
A star schema is the implementation of a dimensional data model in a relational database. It usually consists of a fact table that references multiple dimension tables. The fact table contains numerical measurements (or “facts”) and dimension keys that refer the corresponding dimensions. The dimension tables contain all information that describes the facts, their attributes are used for filtering and grouping the facts.
If you are not familiar with the concepts of dimensional modeling and star schemas, here some recommendations to read:
- A short overview of star schemas can be found in the Data Warehousing Guide, section “About Star Schemas” of the Oracle documentation.
- If you want to learn the basic concepts of dimensional modeling, read the standard reference The Data Warehouse Toolkit by Ralph Kimball.
- For more implementation details and special use cases, another good book is Star Schema, The Complete Reference by Christopher Adamson.
For this blog post series, I will use an example of a star schema with one fact table and four dimension tables. We want to have an overview of beer deliveries of a craft beer brewery to multiple customers. All customer related descriptions are stored in the dimension table DIM_CUSTOMER. The different beers of the craft beer brewery are defined in dimension table DIM_BEER. Each delivery contains a set of bottles from the bottling of a specific brew batch on a particular day. The bottling information is stored in dimension table DIM_BOTTLING. Finally, we have a dimension table DIM_DATE. This kind of calendar dimension typically exists in each star schema. The fact table FACT_BEER_DELIVERY consists of dimension keys to refer the four dimension table, as well as some measurements, e.g. the number of bottles.
Figure 1: Star schema for a craft beer brewery
A typical query in a star schema joins the fact table with one or more dimension tables and group the data by descriptive columns of the dimensions. The dimension columns are also used to filter the facts. The measurements are aggregated, usually with a SUM function.
With the following query, we want to find out how many bottles of India Pale Ale were delivered in 2020 to private customers. The facts is grouped by beer name, city of the delivery address and calendar month:
FROM fact_beer_delivery f
JOIN dim_beer b ON (b.beer_id = f.beer_id)
JOIN dim_customer c ON (c.customer_id = f.customer_id)
JOIN dim_date d ON (d.date_id = f.delivery_date)
WHERE d.calendar_year = 2020
AND c.private_person = 'Y'
AND b.style = 'India Pale Ale'
For this example star schema, we now want to have a look how the physical database design in an Oracle database should be implemented. First, let’s have a look at constraints and indexes.
Physical Database Design
An appropriate logical design of the data model for the star schema is the prerequisite for a good data mart. However, another important task is the physical design of the database objects. In addition to the dimension and fact tables, we need to take care of constraints and indexes on the tables. Other features like partitioning, materialized views, etc., I will focus on the next blog posts.
Database constraints ensure data integrity within the tables and the validity of the relationships between the tables. Additionally, they are useful for documentation of the data model and can be used by the optimizer to in some cases to improve query performance. However, constraints can have a negative effect in load performance. The different types of constraints and how they are used in a data warehouse environment are described here.
- Primary Key Constraints: Each dimension table in a star schema must contain a primary key. The primary key column is usually a numeric surrogate key. Natural keys or even multi-column primary keys should be avoided. The only exception is the calendar dimension. Here it is common in Oracle databases to use the date as primary key of the dimension table. The benefit is that the date dimension key in the fact table can be used easily as a partition key (see next blog post). On the fact table, we usually omit a primary key constraint.
- Foreign Key Constraints: The relationships between fact and dimension tables are implemented with foreign key constraints. In many data warehouses, these constraints are not required, because data integrity is already guaranteed by the load jobs (e.g. key lookups of the surrogate keys in dimension tables). However, it is recommended to define the foreign key constraints anyway. They are useful for some kind of query transformations, e.g. join elimination or advanced query rewrite. A very powerful feature in the Oracle database are reliable constraints. They document the relationships and can be used by the optimizer, but data integrity is not checked during the loads. For further details, read my blog post Foreign Key Constraints in an Oracle Data Warehouse.
- Unique Constraints: For key lookups in the dimension tables, i.e. the translation from a natural business key to the corresponding surrogate key, it is useful to create an additional unique constraint on each dimension table. The unique constraint contains the business key column(s) of the lowest granularity level of the dimension. This is often one column that identifies a row in the dimension table (e.g. beer name in DIM_BEER). But in some cases, a business key can be a combination of multiple columns (e.g. batch number and bottling date in DIM_BOTTLING). For Slowly Changing Dimensions Type 2, each row in the dimension table is a specific version of the dimension key. Therefore, the validity must be part of the unique constraint (e.g. valid_from in DIM_CUSTOMER).
- NOT NULL Constraints: All dimension keys and measurements in a fact table, and most of the attributes in the dimension tables are usually mandatory. A common approach is to use default values for missing values. If we know that a column is always filled by the load jobs, it is highly recommended to define a NOT NULL constraint for that column. As you can see in the example above, this is the case for almost any columns in the star schema. Especially the dimension keys in the fact table must be mandatory to avoid outer joins in our queries.
- Check Constraints: For additional verification of the data (e.g. value range checks), it is possible to define a check constraint on a column. This kind of quality checks is usually done as part of the load jobs. Therefore, check constraints are rarely used in data warehouses.
For each primary key and unique constraint on the dimension tables, a unique index is created automatically. The purpose of these indexes is to guarantee uniqueness of the constraints. But what additional indexes are required in a star schema?
Indexes are useful for queries with a strong selectivity, i.e. when only a small percentage of the data is selected from a table. This is usually not the case in data warehouses. Typical queries on a star schema read many facts and aggregate them by several dimension columns. So, does it make sense at all to create indexes on a star schema? Here some general rules:
- On each dimension table, we have a unique index on th primary key column and a unique index on the business keys. These indexes are automatically created by the primary key and unique constraints, as described above. Additional indexes are rarely used, because the dimension tables are typically small, and the selectiviy on the filter columns is usually weak. In some cases (for very large dimension tables), it may be helpful to create indexes on filter columns with a high selectivity. My general recommendation for dimension tables: Don’t create any additional indexes.
On the fact table, the rule is simple: Create a bitmap index on each dimension key. These indexes are used for joins on dimensions with a strong selectivity (= small percentage of data) in the WHERE condition of the dimension, but are also useful for a special method to join tables in a star schema, called Star Transformation.
A star transformation will take place when we have WHERE conditions on two or more dimension tables. The selectivity for each of the filter conditions is too weak to use an index, but the combination returns only a small percentage of the data from the fact table. In our example above, let’s say we sold many bottles of India Pale Ale (IPA) in 2020, and also many beers to private customers, but only a small percentage of the IPAs were sold to private customers. With a star transformation, the optimizer will combine the bitmap indexes for the foreign keys on DIM_BEER and DIM_CUSTOMER to find all valid combinations for the query. With the small result set of this BITMAP AND operation, the corresponding rows from the fact table can be selected efficiently. The principle of a Star Transformation is to combine two or more bitmap indexes on the fact table to derive the facts that satisfy the filter conditions on all dimensions. In our example and in figure 2, this is shown for two dimensions, but the same principle works also for three or more dimension tables. A more detailed description of how Star Transformation works, you can find in the Oracle blog post Optimizer Transformations: Star Transformation by Sunil Chakkappen.
Figure 2: Principle of star transformation (with two dimensions)
Note: To allow star transformations in your Oracle database, you have to enable it by setting the configuration parameter star_transformation_enabled to TRUE.
Some developers prefer to use B-tree indexes instead of bitmap indexes for dimension keys with many distinct values. But this is an exceptional case. Even though it is often claimed otherwise, bitmap indexes also work fine for a high number of distinct values. I use them for all dimension keys in fact tables, independent on the number of distinct values.
Another reason for B-tree indexes is when you use Standard Edition instead of Enterprise Edition. In Oracle Database Standard Edition, bitmap indexes are not available, but star transformation is supported anyway. In this case, you can create B-tree indexes on the fact table. The star transformation also works with B-tree indexes.
A short summary of the basic rules for the physical database design of star schemas in Oracle:
- Create a primary key constraint on the surrogate key of the dimension. This automatically creates a unique index.
- Create a unique constraint on the business key column(s). This automatically creates a unique index.
- Additional indexes on the dimension tables are usually not required.
- Create a foreign key constraint and a bitmap index on each dimension key column.
- A primary key constraint on the fact table is usually not required.