A common cause of SQL performance issues is function calls in the WHERE condition. They make it difficult for the query optimizer to determine the cardinality. Fortunately, there are several ways to solve this problem.
Function calls – or any kind of SQL expression in a WHERE condition – can cause performance issues in complex queries. The reason is not the functions themselves, but the optimizer's cardinality calculation: expressions lead to wrong estimates, which can result in inappropriate join orders and suboptimal execution plans.
I will show this behaviour with a simple query on one table. The table ADDRESSES contains 255691 rows; 37153 of them are addresses in Germany, which is around 15% of the rows.
SELECT COUNT(*) FROM addresses
WHERE UPPER(ctr_code) = 'DE';
  COUNT(*)
----------
     37153
The cardinality (column E-Rows) in the execution plan shows the estimated rows calculated by the optimizer. If it differs significantly from the number of actual rows (column A-Rows), the optimizer was not able to estimate the correct cardinality. That is the case here: the optimizer estimates only 2557 rows.
------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ADDRESSES |      1 |   2557 |  37153 |
------------------------------------------------------------------
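The plans in this post show both estimated and actual rows. One way to produce such a plan – an assumption about the setup, not necessarily how these plans were generated – is the GATHER_PLAN_STATISTICS hint combined with DBMS_XPLAN.DISPLAY_CURSOR:

SELECT /*+ gather_plan_statistics */ COUNT(*)
  FROM addresses
 WHERE UPPER(ctr_code) = 'DE';

-- Format ALLSTATS LAST shows E-Rows and A-Rows of the last execution
SELECT * FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));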
The estimate (2557 rows) is 1% of the rows in the table (255691 rows). This is the default selectivity that the Oracle optimizer uses for function calls and expressions. How can we tell the optimizer to make a better estimation? There are several possibilities, as you will see in the following tips.
Tip 1: Avoid Function Calls
The best way is to avoid function calls and expressions in the WHERE condition. This is not always possible, but depending on the expression or the data content, it is often the simplest solution. For this example, we could use a CHECK constraint on the ADDRESSES table that guarantees that the country code is always stored in upper case, so we no longer need the UPPER function in the query.
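A minimal sketch of such a constraint (the constraint name is my choice, not part of the original example; it assumes the existing data is already in upper case):

ALTER TABLE addresses
   ADD CONSTRAINT adr_ctr_code_upper_ck
   CHECK (ctr_code = UPPER(ctr_code));

With the constraint in place, the query can compare the column directly, and as you can see, the optimizer is now able to estimate the correct cardinality in column E-Rows: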
SELECT COUNT(*) FROM addresses WHERE ctr_code = 'DE'
------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ADDRESSES |      1 |  37153 |  37153 |
------------------------------------------------------------------
Tip 2: Function-based Index
If the function call is really required and it is not possible to avoid it, one solution is to create a function-based index, i.e. an index on the expression that is used in the WHERE condition:
CREATE INDEX adr_fbi_upper_ctr_code
ON addresses (UPPER(ctr_code));
To see the effect on the cardinality estimation, we re-gather the statistics after creating the index with the following parameters:
BEGIN
   dbms_stats.gather_table_stats(
      ownname       => USER,
      tabname       => 'ADDRESSES',
      method_opt    => 'FOR ALL COLUMNS SIZE SKEWONLY',
      no_invalidate => FALSE);
END;
/
When we run the test query again and look at the execution plan, we can see several interesting details:
- The estimated cardinality in column E-Rows is now correct. The optimizer is able to make a good estimation.
- The function-based index ADR_FBI_UPPER_CTR_CODE is used for the query. This is not always the case, as described below.
- An access predicate "SYS_NC00015$"='DE' is used. We will see later where exactly this comes from.
SELECT COUNT(*) FROM addresses WHERE UPPER(ctr_code) = 'DE'
-------------------------------------------------------------------------------
| Id  | Operation         | Name                   | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                        |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |                        |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| ADR_FBI_UPPER_CTR_CODE |      1 |  37153 |  37153 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ADDRESSES"."SYS_NC00015$"='DE')
An interesting detail: in this example, the index is used because of the COUNT(*) in the query. If we select any or all columns of the table, e.g. with SELECT *, a full table scan is performed instead. Why? Because reading 15% of the table is faster with a full table scan than with an index range scan. I will explain the details in a future blog post. The important message here is that even if the function-based index is not used for the query, the optimizer is still able to estimate the correct cardinality (E-Rows = 37153).
SELECT * FROM addresses WHERE UPPER(ctr_code) = 'DE'
-----------------------------------------------------------------
| Id  | Operation         | Name      | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |      1 |        |  37153 |
|*  1 |  TABLE ACCESS FULL| ADDRESSES |      1 |  37153 |  37153 |
-----------------------------------------------------------------
Tip 3: Virtual Column
Another solution is to create a virtual column containing the required function call. A virtual column of a table can be accessed like any other column, but it is not stored in the table. Instead, the expression or function defined behind the virtual column is evaluated automatically whenever the column is read. For our example, we can create a virtual column UPPER_CTR_CODE with the following command:
ALTER TABLE addresses
ADD (upper_ctr_code VARCHAR2(2) AS (UPPER(ctr_code)) VIRTUAL);
This new column can either be used directly in the query:
SELECT COUNT(*) FROM addresses WHERE upper_ctr_code = 'DE'
But we can also run the original query with the function call in the WHERE condition. In both cases, the execution plan will be the same:
SELECT COUNT(*) FROM addresses WHERE UPPER(ctr_code) = 'DE'
------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ADDRESSES |      1 |  37153 |  37153 |
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(UPPER("CTR_CODE")='DE')
As in the example with the function-based index, the cardinality is estimated correctly, i.e. the estimated rows match the actual rows. As before, this is only the case when the statistics are re-gathered after creating the virtual column. Statistics are gathered for virtual columns as well and are therefore available to the optimizer for estimating cardinality and costs.
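To verify this, the column statistics of the virtual column can be queried from the data dictionary. A quick check, using the column name from this example:

SELECT column_name, num_distinct, num_nulls, histogram
  FROM user_tab_col_statistics
 WHERE table_name  = 'ADDRESSES'
   AND column_name = 'UPPER_CTR_CODE';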
It is even possible to create an index on a virtual column. In this case, the behaviour would be exactly the same as with a function-based index, as the sketch below shows.
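A minimal sketch (the index name is my choice; it assumes the function-based index from Tip 2 has been dropped, since the same expression cannot be indexed twice):

CREATE INDEX adr_upper_ctr_code_idx
   ON addresses (upper_ctr_code);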
Tip 4: Extended Statistics
The third option to influence the cardinality estimation is extended statistics. Extended statistics are additional statistics on expressions (like a function call) or on column groups of related columns. They can be created with the function dbms_stats.create_extended_stats. This function does not gather the statistics itself; it only prepares the additional metadata for them. So, after the function call, we must call dbms_stats.gather_table_stats again:
SQL> SELECT dbms_stats.create_extended_stats (USER, 'ADDRESSES', '(UPPER(ctr_code))') from dual;
DBMS_STATS.CREATE_EXTENDED_STATS(USER,'ADDRESSES','(UPPER(CTR_CODE))')
----------------------------------------------------------------------
SYS_STU9J40#VS#IMBAGKHVY2Q7U04
SQL> BEGIN
  2     dbms_stats.gather_table_stats(ownname       => USER,
  3                                   tabname       => 'ADDRESSES',
  4                                   method_opt    => 'FOR ALL COLUMNS SIZE SKEWONLY',
  5                                   no_invalidate => FALSE);
  6  END;
  7  /
PL/SQL procedure successfully completed.
The effect is exactly the same as before, as you can see in the cardinality estimation in column E-Rows:
SELECT COUNT(*) FROM addresses WHERE UPPER(ctr_code) = 'DE'
------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ADDRESSES |      1 |  37153 |  37153 |
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(UPPER("CTR_CODE")='DE')
To show that the extended statistics are responsible for this result, let’s delete them and run the same query again:
BEGIN
   dbms_stats.drop_extended_stats(user, 'ADDRESSES', '(UPPER(CTR_CODE))');
END;
/
What we see now is the same execution plan, but with the default estimate of 1% in column E-Rows:
SELECT COUNT(*) FROM addresses WHERE UPPER(ctr_code) = 'DE'
------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ADDRESSES |      1 |   2557 |  37153 |
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(UPPER("CTR_CODE")='DE')
Which is the best solution?
As you can see, there are several ways to help the optimizer correctly estimate the cardinality of conditions with function calls. But which of them is the best solution?
My favourite is still to avoid function calls whenever possible. When columns are used unmodified in the WHERE condition, it is much easier for the optimizer to estimate the right cardinality than with expressions or function calls.
If you must or want to use a function call anyway, there are three options. Technically, they all do the same thing: they create a virtual column on the table. When the table statistics are gathered the next time, additional column statistics are calculated for this virtual column. Let's look at how this was implemented in detail in our examples:
- Function-based index: A hidden virtual column SYS_NC00015$ was added to the table, and an index for this column was created. Although virtual columns were officially released with Oracle 11g, they were already available in Oracle 8i, but only for this particular use case of a function-based index.
- Virtual column: Here, the virtual column was created manually with a proper column name UPPER_CTR_CODE. It is visible in the table description like any other column, and it can also be indexed if required.
- Extended statistics: Here again, a hidden virtual column SYS_STU9J40#VS#IMBAGKHVY2Q7U04 was added for the purpose of additional column statistics. As with the function-based index, the column is not visible in the table description, but it is used automatically when the function call appears in a query (the dictionary query below shows these hidden columns).
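These hidden virtual columns can be inspected in the data dictionary. A quick check (the system-generated column names will differ in your schema):

SELECT column_name, data_default, hidden_column, virtual_column
  FROM user_tab_cols
 WHERE table_name = 'ADDRESSES';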
Which of these options you want to use is a question of personal preference. I prefer to create an explicit virtual column, because it is visible in the table definition and can also be used like a normal column of the table.
How about PL/SQL functions?
In this blog post, I used a very simple example with the standard SQL function UPPER. But all the described solutions can also be used in combination with PL/SQL functions; however, some restrictions and design tips are important in such situations. This will be the topic of my next blog post. Hopefully soon…