10 Tips to Improve ETL Performance – Revised for ADWC
The Autonomous Data Warehouse Cloud (ADWC) is a self-configuring, fast, secure and scalable platform for data warehouses. Does this mean we don’t have to take care anymore about performance of our ETL processes? Which performance tips are still important for us, and where can we hand over the responsibility to ADWC? A revised version of an old blog post, with regard to Oracle’s Data Warehouse Cloud solution.
Last summer, I published a blog post with performance tips for ETL jobs: 10 Tips to Improve ETL Performance. Now, it’s summer again, and I am running several ETL jobs on the Autonomous Data Warehouse Cloud to test its features and limitations.This is a good time for a revised version of my blog post, with a special focus on ADWC environments. What is still the same, what changes with ADWC?
What is the impact of the Autonomous Data Warehouse Cloud on ETL performance? Is the night still too short?
In my original blog post, I wrote about the following performance tips for ETL:
- Use Set-based Operations
- Avoid Nested Loops
- Drop Unnecessary Indexes
- Avoid Functions in WHERE Condition
- Take Care of OR in WHERE Condition
- Reduce Data as Early as Possible
- Use WITH to Split Complex Queries
- Run Statements in Parallel
- Perform Direct-Path INSERT
- Gather Statistics after Loading each Table
Of course, all these tips are still valid, and I recommend them to use in every ETL process. But some of them are more important, some of them not relevant anymore, if you run your data warehose on ADWC. Let’s go through the list step by step.
1. Use Set-based Operations
Architecture and configuration of ADWC are designed for a high throughput of large data sets in parallel mode. If you run your load jobs with row-by-row executions, using cursor loops in a procedural language or a row-based ETL tool, ADWC is the wrong environment for you. Of course, it is possible to load data with such programs and tools into an ADWC database, but don’t expect high performance improvements compared to any other database environment.
Data Warehouse Automation frameworks and modern ELT tools are able to use the benefits of the target database and run set-based operations. If you use any tools that are able to generate or execute SQL statements, you are on the right track with ADWC.
2. Avoid Nested Loops
As I already mentioned in the original blog post, Nested Loop Joins are one of the main causes for ETL performance problems. This join method is usually not feasible for ETL jobs and often the reason for poor load performance. In most situations when the optimizer decides to choose a nested loop, this is in combination with an index scan. Because almost no indexes exist in an ADWC environment (see next section), this problem is not relevant anymore.
3. Drop Unnecessary Indexes
ADWC doesn’t like indexes. If you try to create an index, you will get an error message:
CREATE INDEX s_order_item_delivery_date_idx
ON s_order_item (delivery_date);
ORA-01031: insufficient privileges
Although this seems to be a very hard restriction, it is actually a good approach for a data warehouse environment. There are only a few reasons for indexes in a data warehouse:
- Unique indexes are used to prove the uniqueness of primary key and unique constraints. This is the only case where ADWC still allows to create indexes. If you define such constraints, an index is created as usual. To improve the ETL performance, it is even possible to create the primary keys with RELY DISABLE NOVALIDATE. In this case, no index is created, but you have to guarantee in the load process or with additional quality checks that no duplicates are loaded.
- In a star schema, bitmap indexes on the dimension keys of a fact table are required to perform a Star Transformation. In ADWC, a Vector Transformation is used instead (this transformation was introduced with Oracle Database In-Memory). So, there is no need for these indexes anymore.
- For selective queries that return only a small subset of data, an index range scan may be useful. For these kind of queries, the optimizer decides to use a Bloom filter in ADWC as an alternative to the (missing) index.
So, the only case where indexes are created in ADWC, are primary key and unique constraints. No other indexes are allowed. This solves a lot of performance issues in ETL jobs.
Performance Tips 4 to 7
These are general tips for writing fast SQL statements. Complex queries and expressions in WHERE conditions are hard to be evaluated by the query optimizer and can lead to wrong estimations and poor execution plans. This does not change if you move your data warehouse to ADWC. Of course, performance issues can be “solved” in ADWC by increasing the number of CPUs (see next section), but a more elegant and sustainable approach is to keep your SQL simple and fast. This is the case on all databases on premises and in the Cloud.
8. Run Statements in Parallel
Queries and DML statements are executed in parallel by default in ADWC, if more than 1 CPU core is allocated. Parallel DML (PDML) is enabled by default for all sessions. Normally, PDML has to be enabled per session with an ALTER SESSION ENABLE PARALLEL DML command. This is not necessary in ADWC.
The typical way of performance tuning in ADWC is to increase the number of CPUs and therefore the parallel degree of the executed SQL statements. Some call this KIWI (“kill it with iron”) approach, Oracle calls it “elastic scaling”. The number of CPU cores can be assigned to your data warehouse environment at runtime. This works fine. The number of CPUs can be adjusted any time on the web interface of the Oracle Cloud Infrastructure. After changing the number to the new value, the system is scaled up or down. This takes a few minutes, but no interrupt of services or restart of the database is required. The only detail you have to keep in mind: the number of CPU cores has an impact on the costs of the Cloud infrastructure.
The number of CPU cores can be adjusted any time in the Autonomous Data Warehouse Cloud
The degree of parallelism (DOP) is computed by the optimizer with the Auto DOP mechanism (PARALLEL_DEGREE_POLICY = AUTO). All initialization parameters for parallel execution are configured automatically and cannot be changed, not even on session level.
PARALLEL hints are neighter required nor recommended. By default, they are ignored in ADWC. But if you need them (or you think you need them), it is possible to enable them on session level:
ALTER SESSION SET OPTIMIZER_IGNORE_PARALLEL_HINTS = FALSE;
The parameter OPTIMIZER_IGNORE_PARALLEL_HINTS was introduced with Oracle 18c, but is available in ADWC, too (ADWC is currently a mixture of Oracle 184.108.40.206 and Oracle 18c). It is one of the few initialization parameters that can be modified in ADWC (see documentation). By default, it is TRUE, so all parallel hints are ignored.
9. Perform Direct-Path INSERT
Because Parallel DML is enabled by default in ADWC, most INSERT and MERGE statements are executed as Direct-Path load operations. Only for serial DML statements (which only occur on an ADWC database with one CPU core), the APPEND hint has to be added to the INSERT and MERGE statements. This is the only hint that is not ignored by default in ADWC.
But pay attention: Parallel DML and even an APPEND hint do not guarantee a Direct-Path INSERT. If referential integrity is enabled on the target table, Direct-Path is disabled and a Conventional INSERT is performed. This can be avoided with reliable constraints, as described in blog post Foreign Key Constraints in an Oracle Data Warehouse.
10. Gather Statistics after Loading each Table
Direct-Path load operations are not only much faster than Conventional DML statements, they have another good side effect in ADWC: Online statistics gathering was improved and is now able to gather object statistics automatically after each Direct-Path load operation. I explained this in my last blog post Gathering Statistics in the Autonomous Data Warehouse Cloud. Only after Conventional DML statements, it is required to call DBMS_STATS to gather statistics. Unfortunately, this is not done (yet) automatically.
As you can see from the length of this blog post, the Autonomous Data Warehouse Cloud is not a complete self-configuring environment that solves all performance issues automatically. It is still important to know how the Oracle database works and how efficient ETL processes have to be designed. Set-based operations and reliable constraints are mandatory, and bad SQL statements will still be bad, even in an Autonomous Database.
But there are many simplifications in ADWC. The consistent usage of Parallel DML and Direct-Path load operations, including online statistics gathering, makes it easier to implement fast ETL jobs. And many performance problems of ETL jobs are solved because no indexes are allowed.