Which data warehouse cloud platform is faster: Oracle Autonomous Data Warehouse or Snowflake Cloud Data Warehouse? The short answer: it depends. For a more detailed answer, read this blog post.
Skiing is not recommended at the moment. Because of the current pandemic situation, it’s better to stay at home. So I’m doing a different kind of winter sport this year and playing around with Snowflake. No, I’m not writing about building a snowman or anything like that. I’m currently running some tests with Snowflake, a cloud-based data warehousing platform that runs on Amazon AWS, Microsoft Azure and Google Cloud. I had never worked with Snowflake databases before, and my main focus is to understand how this data warehousing solution works, what is similar to Oracle (which I know a little bit better) and where the differences are. As a first test, I wanted to find out how fast Snowflake’s cloud platform is compared to an Autonomous Data Warehouse in the Oracle Cloud.
Another type of Snowflake table – but not the one I used for my performance tests
Of course, it is not possible to compare the two cloud solutions directly. Although Snowflake looks surprisingly familiar to me as an Oracle developer, it uses a different technical architecture and different performance features than Oracle. The results of any performance test also depend on the type of queries, so my experiments may not be meaningful for all kinds of data warehouses. But it was important for me to get a first impression of Snowflake with my own use case, not only with the demo schemas provided by Snowflake Inc.
Use Case: Craft Beer Data Vault
Example Data Vault schema for my performance tests
In 2017, I created a Data Vault example model for a video tutorial about Data Vault modeling. The use case for the demo was the small microbrewery my wife and I run. One year later, I used the same example schema for ETL tests with the Oracle Autonomous Data Warehouse. For the presentation Craft Beer in the Cloud: A Data Warehouse Case Study, I reused and extended the data model to show how Virtual Data Marts can be built with views on top of the Data Vault tables. For this, a PIT (point-in-time) table and some history views are created for every hub and its satellite tables. On top of the history view layer, I created some dimension and fact views to provide two star schemas. I explained the basic principle of this approach in the presentation and in a previous blog post. For the tests here, it is just important to know that each query on a star schema needs to join many tables in the Data Vault schema.
To show this more clearly, here is an example: in a Snowflake worksheet, I executed a query that joins a fact view with two dimension views (slowly changing dimension type 2).
SQL queries can be executed in a browser-based Snowflake worksheet
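The query in the screenshot is specific to my demo schema; a simplified sketch of this kind of star schema query, with illustrative view and column names that are not the exact names of my schema, looks like this:

-- Fact view joined with two SCD type 2 dimension views (names are illustrative)
SELECT d_cust.city,
       d_beer.beer_style,
       SUM(f.quantity) AS total_quantity
FROM   fact_order_item f
JOIN   dim_customer_scd2 d_cust ON d_cust.customer_dim_id = f.customer_dim_id
JOIN   dim_beer_scd2     d_beer ON d_beer.beer_dim_id     = f.beer_dim_id
GROUP BY d_cust.city, d_beer.beer_style
ORDER BY total_quantity DESC;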
To get the result for this query, Snowflake must scan and join 27 tables of the Data Vault schema (6 hubs, 6 PIT tables, 3 links, 12 satellites).
An excerpt of the execution plan for the query above in Snowflake
The execution plan in Oracle shows that only 13 tables (6 PIT tables, 3 links, 4 satellites) must be scanned and joined because of Join Elimination.
An excerpt of the execution plan in Oracle for the same query
Running the Performance Tests
My performance tests consist of four star schema queries. All of them join a fact view with one or two dimension views, using different types of Slowly Changing Dimensions (SCD). The queries answer the following questions:
- How many bottles were delivered per month and beer name? (SCD type 1)
- Which beer style was ordered how often in 2019 in which city? (SCD type 1)
- Which beer style was ordered how often in 2019 in which city? (SCD type 2)
- Which beer style was ordered how often in which year and city? (SCD type 2)
To run the queries in an Oracle database, I use SQL Developer on an Autonomous Data Warehouse in the Oracle Cloud. As an Oracle ACE Director, I’m lucky to have an account with more resources than the usual “Always Free” account, so I can scale the ADW database up and down with different numbers of CPUs. This is important, as you will see later. I ran the queries on the ADW with 1, 2, 4 and 8 CPUs.
Oracle Autonomous Data Warehouse: Scale Up/Down the number of CPUs
For the Snowflake environment, I use a trial account in the Amazon AWS cloud. To run the queries and check the results, runtime measurements and execution plans, I use the browser-based web interface of Snowflake. This is the most common way to work with the Snowflake platform. Since yesterday, I also know that it is possible to connect to Snowflake with SQL Developer: my Trivadis colleague Philipp Salvisberg provides a JDBC proxy for this, as he describes in his blog post Accessing Snowflake from SQL Developer.
Scaling is also possible in Snowflake by using different Virtual Warehouses to run the SQL statements. The size of a Warehouse defines how many servers are used in the Compute Layer: X-Small (1 server), Small (2 servers), Medium (4 servers), Large (8 servers). It is either possible to resize a Warehouse or to switch to another Warehouse. Both can be done in the web interface or on the command line. I used four different Warehouses and switched between them.
Snowflake: The Warehouse size can be changed to scale the Compute Layer
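On the command line, both operations are simple SQL statements (the Warehouse names here are just examples):

-- Resize an existing Virtual Warehouse
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Switch the current session to another Virtual Warehouse
USE WAREHOUSE load_wh_xs;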
In both environments, I executed the queries multiple times. The test results are the average of all executions except the first one (usually an outlier because no data is cached yet). So the preconditions should be similar on both platforms:
- Physical reads are reduced by the usage of the Database Buffer Cache (Oracle) or the Data Cache (Snowflake).
- Result caching is enabled by default on both platforms. Because all subsequent executions of a SQL statement would then run in milliseconds, I disabled it in both environments: in Oracle with the hint no_result_cache, in Snowflake with the session command alter session set use_cached_result=false; (see the example after this list).
- Both platforms can be scaled. In Oracle by the number of CPUs (1, 2, 4, 8), in Snowflake with the Warehouse size (XS, S, M, L). The behaviour is not exactly the same, but it is the recommended way to scale the platform for the size of a data set.
- Oracle ADW and Snowflake both store the table data in a compressed, hybrid columnar format. Snowflake additionally splits the data automatically into many small micro-partitions, which can be skipped at query time (“micro-partition pruning”). Oracle allows several methods of partitioning tables; for this use case, I did not use the Oracle Partitioning features.
- Indexes are not supported in Snowflake and not recommended in Oracle ADW. The only useful purpose of indexes in ADW is to support primary key and unique constraints. For this test case, I implemented referential integrity with reliable constraints in Oracle, so no indexes are created. The primary/foreign key constraints are required by the Oracle optimizer for Join Elimination. In Snowflake, I also defined all primary and foreign key constraints, but they are neither checked nor used by the optimizer.
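For reference, this is how I disabled result caching on both platforms (a minimal sketch; the SELECT statement is just a placeholder):

-- Oracle: disable the result cache for a single statement with a hint
SELECT /*+ no_result_cache */ COUNT(*)
FROM   orders;

-- Snowflake: disable the result cache for the whole session
ALTER SESSION SET USE_CACHED_RESULT = FALSE;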
Tests with Original Data Set
The original demo schema contains around 360’000 orders with 460’000 order items. This is much more beer than we sell in our real microbrewery, but it is a data volume that is useful for some performance tests. As you can see in the result charts of my performance tests, all queries run within a few seconds. The queries in Oracle run in less than 3 seconds, while Snowflake seems to be slower (between 5 and 15 seconds). Neither the number of CPUs in Oracle nor the Warehouse size in Snowflake has a noticeable impact on the query performance.
Test results with small data set
But why is Oracle generally faster than Snowflake here? One reason is certainly Join Elimination. Because of the foreign key constraints, the Oracle optimizer knows about the relationships between the tables and can eliminate joins to tables that are not required for the result (all hubs and some of the satellites). Another reason is the different implementation of data caching. A high percentage of this small data set fits into the Database Buffer Cache in Oracle, which is stored in memory. In Snowflake, the Data Cache contains all data of the previously executed queries, so after the first execution, 100% of the data can be read from the cache. But this cache is not in memory; it is stored on SSD disks. So even with data caching, disk reads are required in Snowflake.
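For readers who want to reproduce this: “reliable constraints” in Oracle are constraints in the state RELY DISABLE NOVALIDATE, so no index is created and no validation takes place, but the optimizer can still trust them for Join Elimination. A minimal sketch, with table and column names that are only illustrative:

-- Hypothetical example of reliable constraints (RELY DISABLE NOVALIDATE)
ALTER TABLE hub_beer
  ADD CONSTRAINT hub_beer_pk PRIMARY KEY (beer_hk)
  RELY DISABLE NOVALIDATE;

ALTER TABLE sat_beer
  ADD CONSTRAINT sat_beer_hub_fk FOREIGN KEY (beer_hk)
  REFERENCES hub_beer (beer_hk)
  RELY DISABLE NOVALIDATE;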
Blow Up the Data Set by Factor 10
Because Oracle ADW and Snowflake are designed for larger data sets, the architecture of these two database solutions is perhaps overkill for my simple queries on 27 Data Vault tables. I wanted to know more about the scaling capabilities of the two platforms with a larger data set. For this purpose, I installed another schema with my brewery data and increased the number of sales orders and beer deliveries by a factor of 10. The new demo schema now contains 3.6 million orders and 4.6 million order items. Can we still call this a “microbrewery”?
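There are several ways to generate such a data set. Just as an illustration (not necessarily the way I loaded my schema, and with illustrative table names), a row generator can multiply an existing table by a factor of 10 in Oracle:

-- Hypothetical sketch: blow up a table by factor 10 with a row generator.
-- In a real Data Vault schema, the hash keys would have to be recalculated too.
INSERT INTO orders_x10 (order_no, customer_id, order_date)
SELECT o.order_no + g.n * 1000000,   -- shift the key range for each copy
       o.customer_id,
       o.order_date
FROM   orders o
CROSS JOIN (SELECT level - 1 AS n FROM dual CONNECT BY level <= 10) g;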
Even more interesting than the naming of the brewery are the test results for the queries on the new data set. Oracle ADW now begins to scale as expected: the more CPUs are available, the shorter the execution times of the queries. Because of a higher degree of parallelism, more memory can be used in the PGA (Program Global Area) for the hash joins and data aggregations, and therefore less I/O on the TEMP tablespace is required. This is what I expected.
Test results with medium data set
The test results for Snowflake are definitely not what I expected. The response times for the first two queries (SCD type 1) seem to be mostly independent of the Warehouse size (XS, S, M or L). For the SCD type 2 queries, it is even more surprising: these queries are much faster with a Warehouse of size X-Small (i.e. 1 server) than with any bigger Warehouse size.
For every query executed in Snowflake, it is possible to look at the “Query Profile” with runtime statistics. This may help to find the reasons for this strange behaviour.
Query Profiles for query 3 with Warehouse size X-Small and Large
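Besides the graphical Query Profile, basic runtime statistics can also be retrieved with SQL. A sketch using the QUERY_HISTORY table function, reduced to the columns that are relevant here:

-- Compare elapsed times and Warehouse sizes of recent queries
SELECT query_text,
       warehouse_size,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_scanned
FROM   TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE  query_type = 'SELECT'
ORDER BY start_time DESC;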
A high percentage of the execution time on Warehouse size Large is spent on network traffic and synchronisation between the servers. As a Snowflake newbie, I don’t understand the details (yet), but to me it looks similar to running a SQL statement in Oracle with too high a degree of parallelism. Perhaps the amount of data is still too small?
Blow Up the Data Set by Factor 100
Let’s blow up the data again! In a third brewery schema, I installed the same tables, but this time with 36 million orders and 46 million order items. By the way, this affects several tables in the Data Vault schema: 3 hubs, 5 links, 3 satellites and 3 PIT tables contain 100 times more rows than in the original schema. Of course, this has an impact on the execution times of all four queries: some of them run for several minutes.
As you can see in the charts of the test results, both platforms now scale as expected. The more hardware power is available, the faster the queries can be executed.
An Oracle ADW with only one CPU is not a good idea anyway, but with this amount of data, I really recommend configuring at least 4 CPUs. As you can see, the most expensive query 4 runs for about 45 minutes with 1 CPU. With 2 CPUs, it is already significantly faster, and with 4 or 8 CPUs, the response times of the queries are acceptable.
Test results with large data set
Snowflake finally gets going on this data set. Most of the queries are now faster than on ADW, and the network traffic and synchronisation between the servers no longer carry much weight. The number of servers (i.e. the Warehouse size XS, S, M or L) is not as important as expected for this type of query. Even with sizes X-Small and Small, the response times of the queries are acceptable. This could be different for other types of queries with more calculations.
Summary
I hope this blog post shows that there is no simple answer to the question “which data warehouse platform is faster”. Many details have an impact on the query performance: the technical architecture and configuration of the cloud platform, the amount and distribution of the data, and the type and complexity of the queries. For this type of query (almost no calculations, but joins of many tables), Snowflake seems to be slower than Oracle ADW for smaller data sets, but faster with large amounts of data. For other data models or complex queries with many calculations, it might be different.
If somebody asks me: “Which Cloud platform is faster, Oracle or Snowflake?”, my answer will still be: “It depends”.
And if somebody asks me about the number of sales orders of our real microbrewery, I reply: “For this, we don’t need a Cloud Data Warehouse”. We have around 50 orders per year.